CSCE 5200 Information Retrieval and Web Search
Spring 2007
Assignment 2
Issued: 02/21/2008
Due: 03/14/2008

The Cranfield collection is a standard IR text collection, consisting of 1400 documents from the aerodynamics field. It is available from the class web page. Several queries, and the relevance judgments associated with these queries, are also provided on the class web page (under the Assignments section).

To complete this assignment, you are encouraged to use the pre-processing tools implemented during previous assignments.

1. Implement an indexing scheme based on the vectorial model, as discussed in class. The steps pointed out in class can be used as guidelines for the implementation. For weighting, use (1) the TF/IDF weighting scheme, and (2) an additional weighting scheme selected from the Salton & Buckley paper on "Term weighting" that we discussed in class. Select a scheme that in your opinion should lead to high retrieval efficiency (that is, the criterion for deciding on a weighting scheme should be efficiency, rather than simplicity!). Add this weighting scheme to your indexing program.

2. For each of the ten queries provided on the class web page, determine a ranked list of documents, in descending order of their similarity with the query. The output of your retrieval should be a list of (query_id, document_id) pairs. Determine the average precision and recall for the ten queries, when you use:
- top 10 documents in the ranking
- top 50 documents in the ranking
- top 100 documents in the ranking
- top 500 documents in the ranking
Repeat this experiment for the second weighting scheme you selected from the Salton & Buckley paper. Which weighting scheme provides better results? A list of relevant documents for each query is provided on the class web page, so that you can determine precision and recall.

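The indexing, ranking, and evaluation steps described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the required implementation: it assumes documents and queries are already tokenized by your pre-processing tools, it uses one common TF/IDF variant (raw term frequency times ln(N/df)) with cosine similarity, and the function names (build_index, rank, precision_recall_at_k) as well as the three toy documents are hypothetical. The variant discussed in class may differ.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: dict doc_id -> list of tokens. Returns (idf, doc_vectors).
    Assumed weighting: raw tf * ln(N / df)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))          # count each term once per document
    idf = {t: math.log(n / d) for t, d in df.items()}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: f * idf[t] for t, f in tf.items()}
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_tokens, idf, vectors):
    """Return doc ids in descending order of similarity with the query."""
    tf = Counter(t for t in query_tokens if t in idf)  # ignore unseen terms
    q = {t: f * idf[t] for t, f in tf.items()}
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by doc id
    return [doc_id for _, doc_id in scored]

def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall when only the top k ranked documents are used."""
    hits = sum(1 for d in ranking[:k] if d in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy data (hypothetical, not taken from the Cranfield collection).
docs = {
    1: ["wing", "flow", "pressure"],
    2: ["wing", "wing", "lift"],
    3: ["heat", "transfer", "flow"],
}
idf, vectors = build_index(docs)
ranking = rank(["wing", "lift"], idf, vectors)
pairs = [(1, doc_id) for doc_id in ranking]   # (query_id, document_id) pairs
p_at_2, r_at_2 = precision_recall_at_k(ranking, {2}, 2)
```

To obtain the averages requested in step 2, compute precision and recall for each of the ten queries at a given cutoff (10, 50, 100, 500) and take the mean over the queries. Swapping in the second weighting scheme only requires changing how the weights are computed in build_index and rank.
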
Submission instructions:
- write a README file including:
  * a detailed note about the functionality of the above programs,
  * complete instructions on how to run them,
  * answers to the questions above
- make sure you include your name in each program and in the README file.
- make sure all your programs run correctly on the CSP machines.
- submit your assignment, including programs and README file, by the due date using the 'project' program. Class code is 5200s001, project HW2.