BITS WILP Information Retrieval End-Sem Exam 2017-H1 (SS ZG537)

Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017
Comprehensive Examination (EC-3 Regular)

Course No.       : SS ZG537
Course Title     : INFORMATION RETRIEVAL
Nature of Exam   : Open Book
Weightage        : 50%
Duration         : 3 Hours
Date of Exam     : 08/04/2017 (FN)
No. of pages     : 4
No. of questions : 8
Note:
1. Please follow all the instructions to candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start on a fresh page.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.

Q.1.A Write out the full set of permuterm indexes for the query: “scorpio”.   
Answer 1.A
Append the end-of-word symbol $ to get scorpio$, then write out every rotation. The permuterm index entries are:
scorpio$, corpio$s, orpio$sc, rpio$sco, pio$scor, io$scorp, o$scorpi, $scorpio
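A minimal Python sketch of the rotation step, assuming '$' as the end-of-word marker (the helper name permuterm is ours):

def permuterm(term, end="$"):
    # Generate every rotation of term + end-of-word marker.
    t = term + end
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm("scorpio"))
# ['scorpio$', 'corpio$s', 'orpio$sc', 'rpio$sco',
#  'pio$scor', 'io$scorp', 'o$scorpi', '$scorpio']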

Q.1.B Compute the Levenshtein edit distance between the terms WHAT and WAIT using Table 1. Fill in the table given below with the distances between all prefixes, as computed by the algorithm.
[1 + 2 = 3]
Table 1 (rows: prefixes of WHAT; columns: prefixes of WAIT; cells to be filled in)

         ""   W   A   I   T
    ""
    W
    H
    A
    T

Answer 1.B
Tracing the minimum-cost path back from the bottom-right cell:
1. Cell [T,T] = 2 came from the upper-left (cost difference 0). Operation: copy, cost 0.
2. Cell [A,I] = 2 came from the upper-left (cost difference 1). Operation: replace, cost 1.
3. Cell [H,A] = 1 came from the upper-left (cost difference 1). Operation: replace, cost 1.
4. Cell [W,W] = 0 came from the upper-left (cost difference 0). Operation: copy, cost 0.
So the edit distance between WHAT and WAIT is 2.
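The filled-in Table 1 can be reproduced with a short dynamic-programming sketch, assuming unit costs for insert, delete and replace (matching the trace above):

def levenshtein_table(src, tgt):
    # Build the full edit-distance matrix between src (rows) and tgt (columns).
    m, n = len(src), len(tgt)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                # deletions from src
    for j in range(n + 1):
        D[0][j] = j                                # insertions into src
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # delete
                          D[i][j - 1] + 1,         # insert
                          D[i - 1][j - 1] + cost)  # copy or replace
    return D

for row in levenshtein_table("WHAT", "WAIT"):
    print(row)
# [0, 1, 2, 3, 4]
# [1, 0, 1, 2, 3]
# [2, 1, 1, 2, 3]
# [3, 2, 1, 2, 3]
# [4, 3, 2, 2, 2]   -> edit distance 2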

Q.2. Given below are two tables: Table 2 gives the tf values and Table 3 gives the idf values for the 4 terms and 3 documents. Compute the tf-idf matrix and, using that, compute the Euclidean-normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms. [3]
                                                                                                                                            
Table 2: tf values

Term     Doc1   Doc2   Doc3
cream      15      5     20
cake        2     22      0
milk        0     22     15
butter      3      0     14

Table 3: idf values

Term     df_t   idf_t
cream    2312    0.64
cake      345    1.46
milk     3030    0.52
butter    178    1.75

Answer 2:
- df_t is the document frequency: the number of documents in which term t occurs.
- df_t is an inverse measure of the informativeness of term t.
- We define the idf weight of term t as idf_t = log10(N / df_t), where N is the number of documents in the collection.
- idf_t is a measure of the informativeness of the term.
- The tf-idf weight of a term is the product of its tf weight and its idf weight.
Log-frequency weighting table (wf = 1 + log10(tf) for tf > 0, else 0):

Term     Doc1                  Doc2                  Doc3
cream    1 + log 15 = 2.176    1 + log 5 = 1.699     1 + log 20 = 2.301
cake     1 + log 2 = 1.301     1 + log 22 = 2.342    0
milk     0                     1 + log 22 = 2.342    1 + log 15 = 2.176
butter   1 + log 3 = 1.477     0                     1 + log 14 = 2.146
Since the idf_t values are already given, multiply each log-frequency weight by the corresponding idf_t to get the tf-idf weights, then divide each document vector by its Euclidean length to obtain the normalized vectors, as in the sketch below.
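A minimal Python sketch of the remaining arithmetic (base-10 logs, as in the table above; variable names are ours). The printed rows are the Euclidean-normalized document vectors:

from math import log10, sqrt

tf = {"cream": [15, 5, 20], "cake": [2, 22, 0],
      "milk": [0, 22, 15], "butter": [3, 0, 14]}
idf = {"cream": 0.64, "cake": 1.46, "milk": 0.52, "butter": 1.75}

# tf-idf weight: (1 + log10 tf) * idf if tf > 0, else 0
w = {t: [(1 + log10(f)) * idf[t] if f > 0 else 0.0 for f in fs]
     for t, fs in tf.items()}

# Euclidean length of each document vector, then normalize
norms = [sqrt(sum(w[t][d] ** 2 for t in w)) for d in range(3)]
for t in w:
    print(t, [round(w[t][d] / norms[d], 3) for d in range(3)])
# cream  [0.398, 0.287, 0.351]
# cake   [0.543, 0.902, 0.0]
# milk   [0.0, 0.321, 0.27]
# butter [0.739, 0.0, 0.896]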

Q.3. Consider the problem of classifying a name as being Food or Beverage. Assume the following training set in Table 4: [6]                                                         
Table 4

Document                Class
D1: "egg stuffing"      Food
D2: "chicken wings"     Food
D3: "cream soda"        Beverage
D4: "lemon soda"        Beverage

Apply the kNN algorithm with k = 3 to classify the test document D5: "egg soda" into its correct class. Use tf without idf, and cosine similarity with length normalization, for the computation. Show the intermediate steps clearly.
Answer 3:
tf table:

Term       D1   D2   D3   D4   D5
egg         1    0    0    0    1
stuffing    1    0    0    0    0
chicken     0    1    0    0    0
wings       0    1    0    0    0
cream       0    0    1    0    0
soda        0    0    1    1    1
lemon       0    0    0    1    0

Normalized tf table (every document contains exactly two distinct terms, so each nonzero entry is 1 / (1^2 + 1^2)^0.5 = 0.707):

Term       D1      D2      D3      D4      D5
egg        0.707   0       0       0       0.707
stuffing   0.707   0       0       0       0
chicken    0       0.707   0       0       0
wings      0       0.707   0       0       0
cream      0       0       0.707   0       0
soda       0       0       0.707   0.707   0.707
lemon      0       0       0       0.707   0

Cosine-similarity(D1, D5) = 0.707*0.707 + 0 = 0.5
Cosine-similarity(D2, D5) = 0
Cosine-similarity(D3, D5) = 0 + 0.707*0.707 = 0.5
Cosine-similarity(D4, D5) = 0 + 0.707*0.707 = 0.5

The three nearest neighbours of D5 are D1, D3 and D4 (Food, Beverage, Beverage).
By majority vote, D5 is classified as Beverage; the sketch below reproduces the whole computation.
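A minimal Python sketch (raw tf vectors, length normalization, cosine similarity, majority vote over the k = 3 nearest neighbours; helper names are ours):

from collections import Counter
from math import sqrt

train = {"D1": ("egg stuffing", "Food"), "D2": ("chicken wings", "Food"),
         "D3": ("cream soda", "Beverage"), "D4": ("lemon soda", "Beverage")}

def vec(text):
    # Length-normalized tf vector of a document.
    c = Counter(text.lower().split())
    norm = sqrt(sum(v * v for v in c.values()))
    return {t: v / norm for t, v in c.items()}

q = vec("egg soda")
sims = {d: sum(q.get(t, 0.0) * w for t, w in vec(txt).items())
        for d, (txt, _) in train.items()}
top3 = sorted(sims, key=sims.get, reverse=True)[:3]
votes = Counter(train[d][1] for d in top3)
print(sims)                  # D1, D3, D4 at 0.5; D2 at 0
print(votes.most_common(1))  # [('Beverage', 2)]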

Q.4. Given the following phrase alignment matrix in Table 5, with English words as rows and Hindi words as columns for the given parallel corpus, answer the questions below:
[3 + 1 + 2 + 1 + 3 = 10]


Q.4.A. Assuming that the given phrase alignment matrix is the intersection of P(f|e) and P(e|f), identify whether the following phrase pairs are consistent with the alignment:   

Q.4.B. Which is the longest phrase pair that is consistent with the alignment?
Q.4.C. Compute the reordering distance between the following two phrase pairs:

Q.4.D. In the mathematical model of phrase-based translation, why is the reordering distance not used directly, but an exponentially decaying cost function used instead?
Q.4.E. If a spurious phrase translation pair occurs only once, how will you compute the phrase translation probabilities? Show with the help of an example.

Q.5.A. State the properties of SVD decomposition.                                                         
Q.5.B. Given the following students' course preferences at BITS Pilani in Table 6, in the form of a rating matrix M where each row of M represents the given student's set of ratings and the columns represent the courses: [3 + 7 = 10]
Table 6
 

From the SVD it is evident that there are two concepts of courses here: the first three are Computer Science courses while the last two are Mechanical Engineering courses. Answer the questions:                                                                                                       
i. After we decompose M using SVD into U, S and VT, a new student named Rachat gives the following ratings: 4 for ML, 5 for PPL, and 2 for Dynamics, so Rachat can be represented as the vector [0 4 5 2 0]. What is the representation of Rachat in the concept space?

ii. Explain in one sentence what these values indicate about Rachat's choice.

iii. Another student named Vinay has the following ratings: 5 for IR, 2 for ML, 4 for Dynamics, and 5 for Mechanics. What is the representation of Vinay in the concept space?

iv. Calculate the cosine similarity between Rachat and Vinay using their concept-space vectors.
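Table 6's rating matrix is not reproduced above, so the sketch below uses a hypothetical matrix M (rows: students; columns: IR, ML, PPL, Dynamics, Mechanics) with the same two-concept structure, purely to illustrate the method: a student's concept-space vector is their rating vector multiplied by the top columns of V from M = U S V^T.

import numpy as np

# Hypothetical ratings (Table 6 itself is not reproduced here); columns:
# IR, ML, PPL, Dynamics, Mechanics.
M = np.array([[5, 5, 4, 0, 0],
              [4, 5, 5, 0, 0],
              [0, 0, 0, 5, 4],
              [0, 1, 0, 4, 5]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
V2 = Vt[:2].T                        # two strongest concepts (SVD signs are arbitrary)

rachat = np.array([0, 4, 5, 2, 0.0])
vinay = np.array([5, 2, 0, 4, 5.0])
r_c, v_c = rachat @ V2, vinay @ V2   # map each student into concept space

cos = (r_c @ v_c) / (np.linalg.norm(r_c) * np.linalg.norm(v_c))
print(r_c, v_c, round(cos, 3))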

Q.6.A. Given a greyscale image of size 5x5 pixels in Table 7, with the intensity range K = 0, 1, 2, ..., 255, sketch the histogram that represents the image. [Note: on the axes you may show just the intensity values present in the image.]
Table 7

  4   100   250     4   200
  3     6    35     6     5
  5     4    30    35    20
  6     3     5    30    10
200     3     4     2   100
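A quick tally of the counts needed for the sketch (each bar's height is the number of pixels with that intensity):

from collections import Counter

pixels = [4, 100, 250, 4, 200,
          3, 6, 35, 6, 5,
          5, 4, 30, 35, 20,
          6, 3, 5, 30, 10,
          200, 3, 4, 2, 100]

hist = Counter(pixels)
for k in sorted(hist):
    print(k, "#" * hist[k])
# e.g. intensity 4 occurs 4 times, intensity 3 occurs 3 times, etc.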


Q.6.B. Given the color histograms for the query and the three images named a, b and c, with each histogram having four colors (red, blue, purple and yellow), where the first bin shows the number of red pixels, the second bin blue, the third bin purple and the fourth bin yellow: compute the Bray-Curtis dissimilarity and the squared chord distance, and rank the images based on both distances. [2 + 7 = 9]
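The histogram values for the query and images a, b and c are not reproduced above; the sketch below shows both measures, with placeholder 4-bin histograms (red, blue, purple, yellow) standing in for the actual values:

from math import sqrt

def bray_curtis(h1, h2):
    # Bray-Curtis dissimilarity: sum |h1 - h2| / sum (h1 + h2)
    return (sum(abs(a - b) for a, b in zip(h1, h2))
            / sum(a + b for a, b in zip(h1, h2)))

def squared_chord(h1, h2):
    # Squared chord distance: sum (sqrt(h1) - sqrt(h2))^2
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(h1, h2))

query = [10, 20, 5, 15]    # placeholder bins, not the exam's values
images = {"a": [8, 22, 6, 14], "b": [20, 5, 10, 15], "c": [10, 18, 5, 17]}
for name, h in sorted(images.items()):
    print(name, round(bray_curtis(query, h), 3), round(squared_chord(query, h), 3))
# Rank the images by sorting on each distance (smaller = more similar).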
     
Q.7. Identify the type of index to be used if the IR system wants to allow the following: [4]
(a) Leading wildcard queries
(b) Trailing wildcard queries
(c) Spelling correction
(d) Wildcard characters in the middle of a word
                                              
Q.8. Consider the following web pages and the set of web pages they link to
Page A points to pages B, C, and D.
Page B points to pages A and D.
Page C points to pages B and D.
Page D points to page A.
Trace the PageRank algorithm for two iterations by calculating each page's probability. What is the order of the pages after the two iterations? Use initial PR values of 1 for all nodes and use d = 0.85.
Remember that one way to describe the algorithm is:
PR(A) = (1-d) + d(PR(T1)/L(T1) + … + PR(Tn)/L(Tn))                                                       [5]
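A minimal sketch tracing this formula for two iterations, assuming synchronous updates (every page recomputed from the previous iteration's values):

links = {"A": ["B", "C", "D"], "B": ["A", "D"],
         "C": ["B", "D"], "D": ["A"]}
d = 0.85
pr = {p: 1.0 for p in links}         # initial PR = 1 for all nodes

for it in range(2):
    # PR(p) = (1 - d) + d * sum of PR(q)/L(q) over pages q linking to p
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                               for q in links if p in links[q])
          for p in links}
    print(it + 1, {p: round(pr[p], 3) for p in sorted(pr)})
# 1 {'A': 1.425, 'B': 0.858, 'C': 0.433, 'D': 1.283}
# 2 {'A': 1.606, 'B': 0.738, 'C': 0.554, 'D': 1.103}
# Order after two iterations: A > D > B > C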



*******
