CSCE 5200 Information Retrieval and Web Search
Spring 2008

Perl warm-up exercises

Note: This assignment is optional and is meant to help you get started with Perl. It will not be graded. If you want feedback, you can submit it by email to the instructors by 02/04/2008.

1. Write a program that reads a string from the standard input and uses a Perl regular expression to test whether the string looks like a valid IP address. Save the program as validIP.pl.

2. Match the following patterns:
   a) an odd digit followed by an even digit (e.g. 12 or 74)
   b) a letter followed by a non-letter followed by a number
   c) a word that starts with an upper-case letter
   d) the word "yes" in any combination of upper- and lower-case letters
   e) the word "the" repeated one or more times
   f) a date in the form of one or two digits, a dot, one or two digits, a dot, two digits
   g) a punctuation mark
   Save your program as regularExpressions.pl.

3. Consider the file words.dat:

   the,10000
   where,1000
   a,9999
   an,6000

   The columns represent words and their frequencies, as collected from a text collection. Write a program that reads in this data, sorts the list of words in descending order of frequency, and writes them out in the same format to the file new.dat. Save the program as wordOrder.pl.

4. Write a program that takes as input an HTML-formatted Web page, identifies all the URLs that this page links to, and prints them to the standard output. Your program should convert relative URLs into absolute URLs. Save the program as linkExtractor.pl.

5. Write a program that reads two files (whose names are given as command-line arguments) and prints a sorted list of all the words in the two files. For each word, print its frequency in file one, in file two, and in both files combined. Save your program as wordFrequencies.pl.

   Example: assuming the content of the two files is

   file1:
   the corpus is a collection of conversations in British English

   file2:
   the transcripts of the English conversations are also included

   output:
   British 1 0 1
   English 1 1 2
   a 1 0 1
   also 0 1 1
   are 0 1 1
   collection 1 0 1
   conversations 1 1 2
   corpus 1 0 1
   in 1 0 1
   included 0 1 1
   is 1 0 1
   of 1 1 2
   the 1 2 3
   transcripts 0 1 1
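
As a starting point for exercise 5, here is a minimal sketch of one possible approach, not a reference solution. It assumes that words are whitespace-separated tokens and that the sort is Perl's default string order, which places upper-case words before lower-case ones, as in the example output. The helper names count_words, slurp, and report are illustrative choices, not part of the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count whitespace-separated words in a string; returns a hash reference.
# (Assumption: a "word" is any run of non-whitespace characters.)
sub count_words {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split ' ', $text;
    return \%count;
}

# Read an entire file into a single string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                      # undefine the record separator: read all at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Build one report line per word: word, count in file one,
# count in file two, and the combined count.
sub report {
    my ($c1, $c2) = @_;
    my %seen;
    $seen{$_} = 1 for (keys %$c1, keys %$c2);
    my @lines;
    for my $word (sort keys %seen) {   # default sort: plain string comparison
        my $n1 = $c1->{$word} || 0;
        my $n2 = $c2->{$word} || 0;
        push @lines, "$word $n1 $n2 " . ($n1 + $n2);
    }
    return @lines;
}

# Run only when exactly two file names are given on the command line.
if (@ARGV == 2) {
    my ($c1, $c2) = map { count_words(slurp($_)) } @ARGV;
    print "$_\n" for report($c1, $c2);
}
```

Saved as wordFrequencies.pl, the sketch would be run as `perl wordFrequencies.pl file1 file2`. Keeping the counting and reporting in separate subroutines makes it easy to change the definition of a word (for example, stripping punctuation or folding case) without touching the output logic.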