BITS WILP Information Retrieval End Sem Exam 2017-H1
Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017
Comprehensive Examination (EC-3 Regular)

Course No.: SS ZG537
Course Title: INFORMATION RETRIEVAL
Nature of Exam: Open Book
Weightage: 50%
Duration: 3 Hours
Date of Exam: 08/04/2017 (FN)
No. of pages: 4
No. of questions: 8
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.
Q.1.A. Write out the full set of permuterm indexes for the query term "scorpio".
Answer 1.A:
Append the end-of-word marker $ and take every rotation of the augmented term scorpio$:
scorpio$, corpio$s, orpio$sc, rpio$sco, pio$scor, io$scorp, o$scorpi, $scorpio
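The rotations can be generated mechanically. Below is a minimal Python sketch (not part of the original answer) that appends the $ marker and emits every rotation:

```python
# Minimal sketch: build the permuterm index entries for a single term by
# appending the end-of-word marker '$' and taking every rotation.
def permuterm(term):
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(permuterm("scorpio"))
# ['scorpio$', 'corpio$s', 'orpio$sc', 'rpio$sco',
#  'pio$scor', 'io$scorp', 'o$scorpi', '$scorpio']
```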
Q.1.B. Compute the Levenshtein edit distance between the terms WHAT and WAIT in Table 1. Fill in the table given below with the distances between all prefixes as computed by the algorithm. [1 + 2 = 3]
Table 1 (to be filled in):

|   | W | A | I | T |
|---|---|---|---|---|
| W |   |   |   |   |
| H |   |   |   |   |
| A |   |   |   |   |
| T |   |   |   |   |
Answer 1.B:
1. Cell [T,T] '2' came from upper-left (cost difference '0'). Operation copy, cost '0'. The edit distance is therefore 2.
2. Cell [A,I] '2' came from upper-left (cost difference '1'). Operation replace, cost '1'.
3. Cell [H,A] '1' came from upper-left (cost difference '1'). Operation replace, cost '1'.
4. Cell [W,W] '0' came from upper-left (cost difference '0'). Operation copy, cost '0'.
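For reference, here is a short Python sketch (not part of the original answer) of the standard dynamic-programming computation with unit costs for insert, delete and replace; it reproduces the filled table and the distance of 2:

```python
# Levenshtein distance via the classic DP table.
def levenshtein(source, target):
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # copy (0) or replace (1)
    return d

for row in levenshtein("WHAT", "WAIT"):
    print(row)
# [0, 1, 2, 3, 4]
# [1, 0, 1, 2, 3]   W
# [2, 1, 1, 2, 3]   H
# [3, 2, 1, 2, 3]   A
# [4, 3, 2, 2, 2]   T   -> edit distance = 2
```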
Q.2. Given below are two tables: Table 2 gives the tf values and Table 3 gives the idf values for the 4 terms and 3 documents. Compute the tf-idf matrix and, using it, compute the Euclidean-normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms. [3]
Table 2: tf values

| Term   | Doc1 | Doc2 | Doc3 |
|--------|------|------|------|
| cream  | 15   | 5    | 20   |
| cake   | 2    | 22   | 0    |
| milk   | 0    | 22   | 15   |
| butter | 3    | 0    | 14   |
Table 3: idf values

| Term   | dft  | idft |
|--------|------|------|
| cream  | 2312 | 0.64 |
| cake   | 345  | 1.46 |
| milk   | 3030 | 0.52 |
| butter | 178  | 1.75 |
Answer 2:
• dft is the document frequency of term t: the number of documents that t occurs in.
• dft is an inverse measure of the informativeness of term t.
• We define the idf weight of term t as: idft = log10(N / dft), where N is the number of documents in the collection.
• idft is a measure of the informativeness of the term.
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
'Log-frequency weighting' table (wf = 1 + log10(tf) for tf > 0, else 0):

| Term   | Doc1               | Doc2               | Doc3               |
|--------|--------------------|--------------------|--------------------|
| cream  | 1 + log 15 = 2.176 | 1 + log 5 = 1.699  | 1 + log 20 = 2.301 |
| cake   | 1 + log 2 = 1.301  | 1 + log 22 = 2.342 | 0                  |
| milk   | 0                  | 1 + log 22 = 2.342 | 1 + log 15 = 2.176 |
| butter | 1 + log 3 = 1.477  | 0                  | 1 + log 14 = 2.146 |
Since the idft values are already given, we multiply the 'log-frequency weighting' entries by idft to get the tf-idf weights, and then Euclidean-normalize each document's four-component vector.
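A minimal Python sketch of that remaining arithmetic (term order and values taken from Tables 2 and 3; the rounded output shown is illustrative):

```python
import math

tf = {"cream": [15, 5, 20], "cake": [2, 22, 0],
      "milk": [0, 22, 15], "butter": [3, 0, 14]}
idf = {"cream": 0.64, "cake": 1.46, "milk": 0.52, "butter": 1.75}
terms = ["cream", "cake", "milk", "butter"]

# tf-idf weight: (1 + log10(tf)) * idf, with weight 0 when tf = 0
w = {t: [(1 + math.log10(f)) * idf[t] if f > 0 else 0.0 for f in tf[t]]
     for t in terms}

# Euclidean-normalize each document's four-component vector
for d in range(3):
    vec = [w[t][d] for t in terms]
    norm = math.sqrt(sum(x * x for x in vec))
    print(f"Doc{d + 1}:", [round(x / norm, 3) for x in vec])
# e.g. Doc1 -> [0.398, 0.543, 0.0, 0.739]
```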
Q.3. Consider the problem of classifying a name as being Food or Beverage. Assume the following training set in Table 4: [6]
Table 4

| Document            | Class    |
|---------------------|----------|
| D1: "egg stuffing"  | Food     |
| D2: "chicken wings" | Food     |
| D3: "cream soda"    | Beverage |
| D4: "Lemon soda"    | Beverage |
Apply the kNN algorithm with k = 3 to classify the test document D5: "egg soda" into its correct class. Use tf without idf, and cosine similarity with length normalization, for the computation. Show the intermediate steps clearly.
Answer 3:
'tf' table:

| Term     | D1 | D2 | D3 | D4 | D5 |
|----------|----|----|----|----|----|
| egg      | 1  | 0  | 0  | 0  | 1  |
| stuffing | 1  | 0  | 0  | 0  | 0  |
| chicken  | 0  | 1  | 0  | 0  | 0  |
| wings    | 0  | 1  | 0  | 0  | 0  |
| cream    | 0  | 0  | 1  | 0  | 0  |
| soda     | 0  | 0  | 1  | 1  | 1  |
| lemon    | 0  | 0  | 0  | 1  | 0  |
Normalized 'tf' table (each document contains exactly two distinct terms, so every nonzero entry is 1 / (1^2 + 1^2)^0.5 = 0.707):

| Term     | D1    | D2    | D3    | D4    | D5    |
|----------|-------|-------|-------|-------|-------|
| egg      | 0.707 | 0     | 0     | 0     | 0.707 |
| stuffing | 0.707 | 0     | 0     | 0     | 0     |
| chicken  | 0     | 0.707 | 0     | 0     | 0     |
| wings    | 0     | 0.707 | 0     | 0     | 0     |
| cream    | 0     | 0     | 0.707 | 0     | 0     |
| soda     | 0     | 0     | 0.707 | 0.707 | 0.707 |
| lemon    | 0     | 0     | 0     | 0.707 | 0     |
Cosine-similarity (D1, D5) = 0.707 * 0.707 + 0 = 0.5
Cosine-similarity (D2, D5) = 0
Cosine-similarity (D3, D5) = 0 + 0.707 * 0.707 = 0.5
Cosine-similarity (D4, D5) = 0 + 0.707 * 0.707 = 0.5
The three nearest neighbours of D5 are D1, D3 and D4 (Food, Beverage, Beverage).
So, by majority vote, D5 is classified as 'Beverage'.
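The whole computation can be checked with a short Python sketch (document strings and classes as in Table 4; whitespace tokenization is an assumption):

```python
import math
from collections import Counter

train = {"D1": ("egg stuffing", "Food"), "D2": ("chicken wings", "Food"),
         "D3": ("cream soda", "Beverage"), "D4": ("lemon soda", "Beverage")}
test = "egg soda"

def cosine(a, b):
    # raw tf vectors with cosine (length) normalization
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

sims = sorted(((cosine(doc, test), cls) for doc, cls in train.values()),
              reverse=True)
print(sims)                                  # D1, D3, D4 score 0.5; D2 scores 0.0
print(Counter(cls for _, cls in sims[:3]))   # majority of k=3 -> Beverage
```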
Q.4. Given the following phrase alignment matrix in Table 5, with English words as rows and Hindi words as columns for the given parallel corpus, answer the questions below: [3 + 1 + 2 + 1 + 3 = 10]
Q.4.A. Assuming that the given phrase alignment matrix is the intersection of P(f|e) and P(e|f), identify whether the following phrase pairs are consistent with the alignment:
Q.4.B. Which is the longest phrase pair that is consistent with the alignment?
Q.4.C. Compute the reordering distance between the following 2 phrase pairs:
Q.4.D. In the mathematical model of phrase-based translation, why is the reordering distance not used directly, but an exponentially decaying cost function used instead?
Q.4.E. If a spurious phrase translation pair occurs only once, how will you compute the phrase translation values? Show with the help of an example.
Q.5.A. State the properties of the SVD decomposition.
Q.5.B. Given the following students-to-courses preferences at BITS Pilani in Table 6, in the form of a rating matrix M where each row of M represents the given student's set of ratings and the columns represent the courses: [3 + 7 = 10]

Table 6

From the SVD it is evident that there are two concepts of courses here: the first three are Computer Science courses while the last two are Mechanical Engineering courses. Answer the questions:
i. After we decompose M using SVD into U, S and VT, a new student named Rachat gives the following ratings: 4 for ML, 5 for PPL, and 2 for Dynamics. Rachat can be represented as the vector [0 4 5 2 0]. What is the representation of Rachat in the concept space?
ii. Explain in one sentence what these values indicate about Rachat's choice.
iii. Another student named Vinay has the following ratings: 5 for IR, 2 for ML, 4 for Dynamics, and 5 for Mechanics. What is the representation of Vinay in the concept space?
iv. Calculate the cosine similarity between Rachat and Vinay using their concept-space vectors.
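Table 6 is not reproduced above, so the rating matrix in the Python sketch below is a hypothetical stand-in; the sketch only illustrates the technique the question asks for: decompose M = U S V^T with SVD, then map a ratings vector q into concept space as q·V, keeping the strongest concepts:

```python
import numpy as np

# Hypothetical ratings matrix (columns: IR, ML, PPL, Dynamics, Mechanics);
# the real Table 6 values should be substituted here.
M = np.array([[5, 4, 5, 0, 0],
              [4, 5, 4, 0, 0],
              [0, 0, 0, 5, 4],
              [0, 1, 0, 4, 5]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
V2 = Vt[:2].T                     # keep the two strongest concepts

rachat = np.array([0, 4, 5, 2, 0], dtype=float)  # ratings from the question
vinay = np.array([5, 2, 0, 4, 5], dtype=float)

r_c, v_c = rachat @ V2, vinay @ V2               # concept-space vectors
cos = r_c @ v_c / (np.linalg.norm(r_c) * np.linalg.norm(v_c))
print(r_c, v_c, cos)
```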
Q.6.A. Given a greyscale image of size 5x5 pixels in Table 7, with the intensity range K = 0, 1, 2, ..., 255, sketch the histogram representing the image. [Note: on the Y axis you may just show the values for the intensities present in the image.]
Table 7

| 4   | 100 | 250 | 4  | 200 |
| 3   | 6   | 35  | 6  | 5   |
| 5   | 4   | 30  | 35 | 20  |
| 6   | 3   | 5   | 30 | 10  |
| 200 | 3   | 4   | 2  | 100 |
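A small Python sketch (not part of the original paper) that tallies the histogram for Table 7; only the intensities actually present in the image get a bar:

```python
from collections import Counter

image = [[4, 100, 250, 4, 200],
         [3, 6, 35, 6, 5],
         [5, 4, 30, 35, 20],
         [6, 3, 5, 30, 10],
         [200, 3, 4, 2, 100]]

hist = Counter(pixel for row in image for pixel in row)
for intensity in sorted(hist):
    print(f"{intensity:3d}: {'*' * hist[intensity]}")
# e.g. intensity 4 occurs 4 times, 3 occurs 3 times, 250 occurs once
```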
Q.6.B. Given the color histograms for the query and the three images named a, b and c, with each histogram having four colors: red, blue, purple, and yellow, where the first bin shows the number of red pixels, the second blue, the third purple and the fourth yellow. Compute the Bray-Curtis dissimilarity and the squared chord distance, and rank the images based on both distances. [2 + 7 = 9]
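The histograms themselves are not reproduced above, so the bin counts in the Python sketch below are hypothetical; the sketch shows the two standard distance definitions the question refers to:

```python
import math

def bray_curtis(x, y):
    # sum of absolute bin differences over the sum of all bin values
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def squared_chord(x, y):
    # sum of squared differences of the square roots of the bins
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

query = [10, 5, 3, 2]        # hypothetical [red, blue, purple, yellow] counts
images = {"a": [8, 6, 4, 2], "b": [2, 9, 1, 8], "c": [10, 4, 3, 3]}

# rank images by ascending distance under each measure
for name, h in sorted(images.items(), key=lambda kv: bray_curtis(query, kv[1])):
    print(name, round(bray_curtis(query, h), 3), round(squared_chord(query, h), 3))
```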
Q.7. Identify the type of index to be used if the IR system wants to allow the following: [4]
(a) Leading wildcard queries
(b) Trailing wildcard queries
(c) Spelling correction
(d) Wildcard characters in the middle of a word
Q.8. Consider the following web pages and the set of web pages they link to:
Page A points to pages B, C, and D.
Page B points to pages A and D.
Page C points to pages B and D.
Page D points to page A.
Trace the PageRank algorithm for two iterations, calculating each page's probability. What is the order of the pages after the two iterations? Use initial PR values of 1 for all nodes and d = 0.85.
Remember that one way to describe the algorithm is:
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1, ..., Tn are the pages that link to A and C(T) is the number of links going out of page T.
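A short Python sketch of the two iterations using that formula with synchronous updates (every score recomputed from the previous iteration's values):

```python
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["B", "D"], "D": ["A"]}
d = 0.85
pr = {page: 1.0 for page in links}            # initial PR values of 1

for it in range(2):
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                               for q in links if p in links[q])
          for p in pr}
    print(f"iteration {it + 1}:", {p: round(v, 3) for p, v in pr.items()})
# iteration 1: A=1.425, B=0.858, C=0.433, D=1.283
# iteration 2: A=1.606, B=0.738, C=0.554, D=1.103  -> order: A > D > B > C
```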