BITS WILP Information Retrieval Mid-Sem Exam 2017-H1 (Regular)



Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017

Mid-Semester Test
(EC-2 Regular)

Course No.                  : SS ZG537 
Course Title                 : INFORMATION RETRIEVAL  
Nature of Exam           : Closed Book
Weightage                    : 30%
Duration                      : 2 Hours 
Date of Exam              : 25/02/2017    (FN)
No of pages: 2
No of questions: 7
Note:
1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.
2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 
3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1.        Discuss in brief the limitations of the Boolean retrieval model.                                        [2]

Q.2.        Give the name of the index we need to use if                                                  [1 + 1 + 2 = 4]
(a)             We want to consider word order in the queries and the documents for an arbitrary number of words?
(b)             What kind of index can we use if we assume that word order is only important for two consecutive terms?
(c)             What is the Soundex code for the following two names, Robert and Rupert? Assume that the letters are mapped to numbers as follows: (B, F, P, V → 1), (C, G, J, K, Q, S, X, Z → 2), (D, T → 3), (L → 4), (M, N → 5) and (R → 6).
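
For part (c), the following is a minimal sketch of the common four-character Soundex procedure, using the letter-to-digit mapping given in the question. The function name `soundex` is hypothetical. Treating the vowels and H, W, Y as 0 (dropped later) and padding to four characters are assumptions taken from the usual textbook variant; the paper itself does not specify them.

```python
def soundex(name: str) -> str:
    """Four-character Soundex code (textbook variant, assumed here)."""
    # Mapping from the question; vowels and H, W, Y map to 0 (an assumption).
    groups = [
        ("AEIOUHWY", "0"),
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    ]
    codes = {ch: digit for letters, digit in groups for ch in letters}

    name = name.upper()
    digits = [codes[ch] for ch in name if ch in codes]

    # Collapse runs of identical consecutive digits into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)

    # Keep the first letter as a letter, drop zeros from the rest,
    # then pad with zeros and truncate to four characters.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (name[0] + "".join(tail) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Under these assumptions both names get the same code, which is the point of a phonetic index: spelling variants that sound alike fall into the same bucket.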

Q.3.        Discuss briefly the index construction algorithm used in Distributed Indexing with a suitable diagram.                                                                                                                [5]
                                                                                                                                                
Q.4 (a)          An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?                                                                                                                                
Q.4 (b)          What is the likely effect of ‘Stemming’ and ‘Lemmatization’ on                                         
                                            i.            Vocabulary size: Increase, Decrease, Unpredictable?
                                          ii.            Precision: Increase, Decrease, Unpredictable?
                                        iii.             Recall: Increase, Decrease, Unpredictable?                                       [2 + 3 = 5]
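
The arithmetic in Q.4 (a) follows from the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A small sketch, with the hypothetical helper name `precision_recall`:

```python
def precision_recall(relevant_retrieved: int,
                     nonrelevant_retrieved: int,
                     total_relevant: int) -> tuple[float, float]:
    """Return (precision, recall) for a single search."""
    retrieved = relevant_retrieved + nonrelevant_retrieved
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall


# Figures from Q.4 (a): 8 relevant and 10 non-relevant documents returned,
# 20 relevant documents in the whole collection.
p, r = precision_recall(8, 10, 20)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.444 recall=0.400
```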

Q.5.        Consider the following documents:                                                                       [1 + 2 = 3]

Doc1: catholic church in brisbane
Doc2: garden city church brisbane
Doc3: brisbane courier garden city
Doc4: where in brisbane catholic church

(a)             Draw a term-document incidence matrix for this document collection.
(b)             Draw the positional inverted index representation for this collection.
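
Both structures asked for in Q.5 can be built mechanically from the four documents. The sketch below assumes whitespace tokenisation, no stop-word removal and positions counted from 1; the function names are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "catholic church in brisbane",
    2: "garden city church brisbane",
    3: "brisbane courier garden city",
    4: "where in brisbane catholic church",
}


def incidence_matrix(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a 0/1 row with one column per document."""
    doc_ids = sorted(docs)
    vocab = sorted({t for text in docs.values() for t in text.split()})
    return {
        term: [int(term in docs[d].split()) for d in doc_ids]
        for term in vocab
    }


def positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]}, positions starting at 1."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


for term, row in incidence_matrix(docs).items():
    print(f"{term:10s} {row}")

for term, postings in sorted(positional_index(docs).items()):
    print(f"{term:10s} {postings}")
```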



Q.6.        Consider the following document: “The universe contains many different universities”                                                                                         
                                                                                                                           [1 + 2 + 3 + 2 =  8]
(a)              How many entries would a bigram index contain?
(b)              If the wildcard query uni* is to be answered using a permuterm index, what terms would you search for in that index?
(c)             How do you process queries such as univ*, uni*rse and uni*e*se using the permuterm index? Show which terms you will search for and how.
(d)            Use the 2-gram index and the 3-gram index to process the wildcard queries tol* and rea*. Is "tool" returned as a result for the wildcard query tol*? If the answer is yes, explain how to solve this problem.
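
For the permuterm parts of Q.6, the usual approach is to append an end marker `$` to the query, rotate it so that the wildcard ends up at the end, and do a prefix lookup; queries with more than one `*` are looked up using only the first and last fragments and then post-filtered. The sketch below illustrates this under those assumptions. The function names are hypothetical, and a linear scan over rotations stands in for the B-tree prefix search a real index would use.

```python
from fnmatch import fnmatchcase


def permuterm_rotations(term: str) -> list[str]:
    """All rotations of term + '$', as stored in a permuterm index."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]


def permuterm_prefix(query: str) -> str:
    """Rotate a wildcard query so the lookup becomes a prefix search."""
    if "*" not in query:
        return query + "$"
    parts = (query + "$").split("*")
    # Text after the last '*' (including '$') followed by text before the first.
    return parts[-1] + parts[0]


def wildcard_lookup(query: str, vocabulary: list[str]) -> list[str]:
    prefix = permuterm_prefix(query)
    candidates = {
        term
        for term in vocabulary
        for rotation in permuterm_rotations(term)
        if rotation.startswith(prefix)
    }
    # Post-filter: needed when the query has more than one '*'.
    return sorted(t for t in candidates if fnmatchcase(t, query))


vocabulary = ["the", "universe", "contains", "many", "different", "universities"]
for q in ["univ*", "uni*rse", "uni*e*se"]:
    print(q, permuterm_prefix(q) + "*", wildcard_lookup(q, vocabulary))
```

The same post-filtering idea applies to part (d): a k-gram lookup returns every term containing all of the query's k-grams, so candidates still have to be checked against the original wildcard pattern.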

Q.7.        Assume that simple term frequency weights are used (with no IDF factor), and that the stop words “is”, “am” and “are” are removed. Compute the cosine similarity of the following two documents:   [Show the term frequency matrix]                                                         [3]
Doc1: “Precision is very very high”
Doc2: “high precision is very very very important”
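
The computation asked for in Q.7 can be checked with a short script. This is a sketch that assumes lowercasing and whitespace tokenisation and uses raw term counts as weights, as the question states; the helper names `tf_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"is", "am", "are"}


def tf_vector(text: str) -> Counter:
    """Raw term-frequency vector after lowercasing and stop-word removal."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


d1 = tf_vector("Precision is very very high")
d2 = tf_vector("high precision is very very very important")

# Term frequency matrix: one row per term, one column per document.
for term in sorted(d1.keys() | d2.keys()):
    print(f"{term:10s} {d1[term]} {d2[term]}")

print(round(cosine(d1, d2), 4))  # 0.9428
```

With these counts the dot product is 8 and the vector lengths are √6 and √12, so the similarity is 8/√72.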

***********
