CSCE 5200 Information Retrieval and Web Search
Spring 2008
Issued: 02/05/2008
Due: 02/22/2008

The Cranfield collection is a standard IR text collection, consisting of 1400 documents from the aerodynamics field. It is available from the class web page (check the "Links and resources" section).

1. Write a program that preprocesses the collection. This preprocessing stage should specifically include:
   a. A function that eliminates SGML tags.
   b. A function that tokenizes the text. In doing this, pay particular attention to characters that need special handling, as discussed in class (. , - etc.).

2. Determine the frequency of occurrence for all the words in this collection. Answer the following questions:
   a. What is the vocabulary size (i.e. the number of unique terms)?
   b. What are the top 10 words in the ranking (i.e. the words with the highest frequencies)?
   c. Of these top 10 words, which are "meaningful" (i.e. they are not stopwords), and which would you eliminate as stopwords?
   d. What is the minimum number of unique words accounting for half of the total number of words in the collection?
      Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs:
         the - 30
         of - 10
         a - 10
         clear - 8
         cut - 7
         etc.
      the answer to this question will be 3 (3 unique words account for half of the total 100 words).

3. Integrate the Porter stemmer and a stopword eliminator into your code. Answer questions a-d from the previous point again. (Check the "Links and resources" section for a link to various implementations of the Porter stemmer and to lists of stopwords.)

4. Pick two subsets of this dataset, and determine the size of the vocabulary and the size of each subset you selected. Use this information to derive the K and beta parameters required to apply Heaps' law. Use these values to predict what the vocabulary size would be if the corpus were to increase to 500,000 words. How about 2,000,000 words?

5. Write a Web crawler that collects the URLs of web pages from the UNT domain. Your crawler will have to perform the following tasks:
   a. Start with http://www.unt.edu
   b. Perform a Web traversal using a breadth-first strategy.
   c. Keep track of the traversed URLs, making sure:
      a. they are part of the UNT domain
      b. they were not already traversed (i.e. avoid duplicates, avoid cycles)
   d. Stop when you reach 1000 URLs.

Note: It is highly recommended that your code be as modularized as possible; many of the functions that you implement during this assignment will be needed in future assignments.

Submission instructions:
- Write a README file including:
  * a detailed note about the functionality of each of the above programs
  * complete instructions on how to run them
  * answers to the questions above
  * the list of URLs from question 5
- Make sure you include your name in each program and in the README file.
- Make sure all your programs run correctly on the CSP machines.
- Submit your assignment by the due date using the 'project' program. The class code is 5200s001 and the project is HW1.
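
As an illustration of questions 1 and 2, the following is a minimal Python sketch, not a reference solution. The function names (strip_sgml, tokenize, word_frequencies, min_words_for_half) are hypothetical. It assumes that a token is any run of letters or digits, so hyphens and periods act as separators; the handling discussed in class may differ. It also assumes that "accounting for half" means reaching at least half of the total count.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # an SGML tag such as <TITLE> or </DOC>
TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumption: a token is a run of letters/digits


def strip_sgml(text):
    """Replace every SGML tag with a space so neighbouring words do not fuse."""
    return TAG_RE.sub(" ", text)


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return TOKEN_RE.findall(text.lower())


def word_frequencies(documents):
    """Count token occurrences over an iterable of raw document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(strip_sgml(doc)))
    return counts


def min_words_for_half(counts):
    """Smallest number of distinct words whose counts reach half of the total."""
    total = sum(counts.values())
    running = 0
    for rank, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += count
        if 2 * running >= total:
            return rank
    return 0


# The example from question 2d. The handout lists five words and "etc.";
# the 35 filler words of frequency 1 are an assumption made only so that
# the total comes to 100.
example = Counter({"the": 30, "of": 10, "a": 10, "clear": 8, "cut": 7})
for i in range(35):
    example[f"filler{i}"] = 1
print(min_words_for_half(example))  # 3, matching the handout's answer
```

The vocabulary size is then len(counts), and counts.most_common(10) gives the ten most frequent words.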
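
For question 4, Heaps' law relates the vocabulary size V to the number of words n as V = K * n^beta. Two measurements (n1, V1) and (n2, V2) give two equations; dividing them eliminates K, so beta = log(V2/V1) / log(n2/n1) and then K = V1 / n1^beta. The sketch below applies these formulas. The function names are hypothetical, and the subset sizes in the usage lines are made-up numbers chosen for easy arithmetic, not measurements from the Cranfield collection.

```python
import math


def fit_heaps(n1, v1, n2, v2):
    """Solve V = K * n**beta for (K, beta) from two (size, vocabulary) points."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta


def predict_vocab(k, beta, n):
    """Vocabulary size predicted by Heaps' law for a corpus of n words."""
    return k * n ** beta


# Hypothetical measurements: 1,000 unique terms in a 10,000-word subset and
# 2,000 unique terms in a 40,000-word subset.
k, beta = fit_heaps(10_000, 1_000, 40_000, 2_000)
for n in (500_000, 2_000_000):
    print(n, round(predict_vocab(k, beta, n)))
```

With only two points the fit is exact, so the result is sensitive to which subsets are chosen; subsets of clearly different sizes give a more stable estimate of beta.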
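
The breadth-first traversal in question 5 can be sketched with a FIFO queue and a set of already-seen URLs. This is a minimal sketch using only the Python standard library. The names LinkParser, fetch_html, in_domain and crawl are hypothetical, and it assumes that "1000 URLs" means 1000 visited pages. It does not handle robots.txt, request delays, or URL normalization beyond dropping fragments, all of which a real crawler should consider.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page; return an empty string if the request fails."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""


def in_domain(url, domain="unt.edu"):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def crawl(seed, limit=1000, fetch=fetch_html, domain="unt.edu"):
    """Breadth-first traversal from seed; returns URLs in visiting order."""
    seen = {seed}           # every URL ever queued, to avoid duplicates/cycles
    queue = deque([seed])   # FIFO queue gives breadth-first order
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if (urlparse(link).scheme in ("http", "https")
                    and in_domain(link, domain)
                    and link not in seen):
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://www.unt.edu") performs the live traversal. The fetch parameter lets the traversal logic be exercised offline by passing a function that returns canned HTML, which is one way to keep the code modular as the note above recommends. Note that this sketch treats http://www.unt.edu and http://www.unt.edu/ as different URLs.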