survival8: BITS WILP Information Retrieval Mid-Sem Exam 2017-H1 (Regular)

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

Second Semester 2016-2017

Mid-Semester Test

(EC-2 Regular)

Course No. : SS ZG537

Course Title : INFORMATION RETRIEVAL

Nature of Exam : Closed Book

Weightage : 30%

Duration : 2 Hours

Date of Exam : 25/02/2017 (FN)

No of pages: 2

No of questions: 7

Note:

1. Please follow all the Instructions to Candidates given on the cover page of the answer book.

2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.

3. Assumptions made, if any, should be stated clearly at the beginning of your answer.

Q.1. Discuss in brief the limitations of the Boolean retrieval model. [2]

Q.2. Give the name of the index we need to use if [1 + 1 + 2 = 4]

(a) We want to consider word order in the queries and the documents for an arbitrary number of words?

(b) What kind of Index can we use if we assume that word order is only important for two consecutive terms?

(c) What is the Soundex code for the following two names, Robert and Rupert? Assume that the letters are mapped to numbers as follows: (B, F, P, V → 1), (C, G, J, K, Q, S, X, Z → 2), (D, T → 3), (L → 4), (M, N → 5) and (R → 6).
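A minimal sketch of how the Soundex procedure could be applied to the two names in (c). It assumes the common textbook variant: keep the first letter, map the remaining letters with the table given in the question, treat vowels and H, W, Y as a placeholder 0, collapse runs of identical digits, drop the zeros, and pad to four characters. The function name `soundex` and the helper tables are illustrative, not part of the exam paper.

```python
# Letter groups taken from the question; everything else (vowels, H, W, Y) maps to "0".
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(name):
    """Return a four-character Soundex code (assumed textbook variant)."""
    name = name.upper()
    # Encode every letter after the first one.
    digits = [CODE.get(ch, "0") for ch in name[1:]]
    # Collapse runs of consecutive identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop the placeholder zeros, then pad with trailing zeros to length four.
    body = "".join(d for d in collapsed if d != "0")
    return (name[0] + body + "000")[:4]


for person in ("Robert", "Rupert"):
    print(person, soundex(person))
```

Under these assumptions both names reduce to the same code, which is the behaviour Soundex is designed to give for similar-sounding names.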

Q.3. Discuss briefly the index construction algorithm used in Distributed Indexing with a suitable diagram. [5]

Q.4 (a) An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
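The arithmetic behind Q.4 (a) can be sketched as below, using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The variable names are illustrative.

```python
relevant_retrieved = 8        # relevant documents returned by the system
nonrelevant_retrieved = 10    # non-relevant documents returned by the system
relevant_in_collection = 20   # all relevant documents in the collection

retrieved = relevant_retrieved + nonrelevant_retrieved

precision = relevant_retrieved / retrieved            # 8 / 18
recall = relevant_retrieved / relevant_in_collection  # 8 / 20

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```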

Q.4 (b) What is the likely effect of ‘Stemming’ and ‘Lemmatization’ on

i. Vocabulary size: Increase, Decrease, Unpredictable?

ii. Precision: Increase, Decrease, Unpredictable?

iii. Recall: Increase, Decrease, Unpredictable? [2 + 3 = 5]

Q.5. Consider the following documents: [1 + 2 = 3]

Doc1: catholic church in brisbane

Doc2: garden city church brisbane

Doc3: brisbane courier garden city

Doc4: where in brisbane catholic church

(a) Draw a term-document incidence matrix for this document collection.

(b) Draw the positional inverted index representation for this collection.
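As a rough illustration of the two structures asked for in Q.5, the sketch below builds a term-document incidence matrix and a positional index for the four documents. It assumes whitespace tokenization, no stop-word removal, and positions counted from 1; the names `incidence` and `positional` are illustrative.

```python
docs = {
    1: "catholic church in brisbane",
    2: "garden city church brisbane",
    3: "brisbane courier garden city",
    4: "where in brisbane catholic church",
}

tokens = {doc_id: text.split() for doc_id, text in docs.items()}
vocabulary = sorted({term for terms in tokens.values() for term in terms})
doc_ids = sorted(docs)

# Term-document incidence matrix: one row per term, one 0/1 column per document.
incidence = {
    term: [int(term in tokens[doc_id]) for doc_id in doc_ids]
    for term in vocabulary
}

# Positional index: term -> {doc_id: [positions]}, positions starting at 1.
positional = {}
for doc_id in doc_ids:
    for position, term in enumerate(tokens[doc_id], start=1):
        positional.setdefault(term, {}).setdefault(doc_id, []).append(position)

for term in vocabulary:
    print(f"{term:<9} {incidence[term]}  {positional[term]}")
```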

Q.6. Consider the following document: “The universe contains many different universities”

[1 + 2 + 3 + 2 = 8]

(a) How many entries would a bigram index contain?

(b) If the initial wildcard query uni* is to be answered using a permuterm index, what terms would you search for in that index?

(c) How do you process queries such as univ*, uni*rse and uni*e*se using the permuterm index? Show which terms you would search for, and how.

(d) Use the 2-gram index and the 3-gram index to process the wildcard queries tol* and rea*. Is "tool" returned as a result for the wildcard query tol*? If the answer is yes, explain how to solve this problem.
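The ideas behind parts (b) to (d) can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: terms are padded with the boundary marker `$`, a permuterm lookup key is formed by rotating the query so that the wildcard falls at the end, and k-gram candidates are post-filtered against the original pattern. All function names and the small vocabulary are hypothetical.

```python
import fnmatch


def kgrams(term, k):
    """All k-grams of a term padded with the boundary marker $."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def query_kgrams(pattern, k):
    """k-grams implied by a wildcard pattern; pieces between * are handled separately."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    return grams


def permuterm_key(pattern):
    """Prefix to look up in a permuterm index for a pattern with at least one *.

    For X*Y the key is Y$X. For X*Z*Y the key is still Y$X, and the
    middle piece Z has to be checked afterwards by post-filtering.
    """
    pieces = pattern.split("*")
    return f"{pieces[-1]}${pieces[0]}"


def wildcard_lookup(pattern, vocabulary, k):
    """Return (k-gram candidates, candidates that really match the pattern)."""
    needed = query_kgrams(pattern, k)
    candidates = [t for t in vocabulary if needed <= kgrams(t, k)]
    matches = [t for t in candidates if fnmatch.fnmatchcase(t, pattern)]
    return candidates, matches


vocabulary = ["tool", "tolerate", "told"]  # hypothetical vocabulary
print(wildcard_lookup("tol*", vocabulary, k=2))
print(wildcard_lookup("tol*", vocabulary, k=3))
print([permuterm_key(q) for q in ("uni*", "univ*", "uni*rse", "uni*e*se")])
```

With the 2-gram index, "tool" contains every bigram implied by tol* ($t, to, ol) and so shows up as a candidate even though it does not match; the post-filtering step removes it. With the 3-gram index the trigram tol is required, so "tool" is not a candidate in the first place.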

Q.7. Assume that simple term frequency weights are used (with no IDF factor), and the stop words “is”, “am” and “are” are removed. Compute the cosine similarity of the following two documents: [Show the term frequency matrix] [3]

Doc1: “Precision is very very high”

Doc2: “high precision is very very very important”
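A short sketch of the computation Q.7 describes, assuming lowercased whitespace tokens, raw term counts as weights, and the three stop words named in the question. The helper names `tf_vector` and `cosine` are illustrative.

```python
import math
from collections import Counter

STOP_WORDS = {"is", "am", "are"}


def tf_vector(text):
    """Raw term-frequency vector with the stop words removed."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


doc1 = tf_vector("Precision is very very high")
doc2 = tf_vector("high precision is very very very important")

# Term frequency matrix: one row per term, columns Doc1 and Doc2.
for term in sorted(doc1.keys() | doc2.keys()):
    print(f"{term:<10} {doc1[term]} {doc2[term]}")

similarity = cosine(doc1, doc2)
print(f"cosine similarity = {similarity:.4f}")
```

Here the dot product is 1·1 + 2·3 + 1·1 = 8 and the vector lengths are √6 and √12, so the similarity is 8/√72 ≈ 0.9428.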

***********
