BITS WILP Information Retrieval End Sem Exam 2017-H1
Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017
Comprehensive Examination (EC-3 Regular)

Course No.: SS ZG537
Course Title: INFORMATION RETRIEVAL
Nature of Exam: Open Book
Weightage: 50%
Duration: 3 Hours
Date of Exam: 08/04/2017 (FN)
No. of pages: 4
No. of questions: 8
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.
Q.1.A. Write out the full set of permuterm indexes for the query term "scorpio".
Answer 1.A:
Append the end-of-word marker $ and take every rotation of the augmented term scorpio$:
scorpio$, corpio$s, orpio$sc, rpio$sco, pio$scor, io$scorp, o$scorpi, $scorpio
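The rotations can be generated mechanically. Below is a minimal Python sketch (not part of the original answer) that appends the $ marker and emits every rotation:

```python
# Minimal sketch: build the permuterm index entries for a single term by
# appending the end-of-word marker '$' and taking every rotation.
def permuterm(term):
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(permuterm("scorpio"))
# ['scorpio$', 'corpio$s', 'orpio$sc', 'rpio$sco',
#  'pio$scor', 'io$scorp', 'o$scorpi', '$scorpio']
```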
Q.1.B. Compute the Levenshtein edit distance between the terms WHAT and WAIT in Table 1. Fill in the table given below with the distances between all prefixes as computed by the algorithm. [1 + 2 = 3]
Table 1 (to be filled in):

|   | W | A | I | T |
|---|---|---|---|---|
| W |   |   |   |   |
| H |   |   |   |   |
| A |   |   |   |   |
| T |   |   |   |   |
Answer 1.B:
1. Cell [T,T] '2' came from upper-left (cost difference '0'). Operation copy, cost '0'. The edit distance is therefore 2.
2. Cell [A,I] '2' came from upper-left (cost difference '1'). Operation replace, cost '1'.
3. Cell [H,A] '1' came from upper-left (cost difference '1'). Operation replace, cost '1'.
4. Cell [W,W] '0' came from upper-left (cost difference '0'). Operation copy, cost '0'.
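For reference, here is a short Python sketch (not part of the original answer) of the standard dynamic-programming computation with unit costs for insert, delete and replace; it reproduces the filled table and the distance of 2:

```python
# Levenshtein distance via the classic DP table.
def levenshtein(source, target):
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # copy (0) or replace (1)
    return d

for row in levenshtein("WHAT", "WAIT"):
    print(row)
# [0, 1, 2, 3, 4]
# [1, 0, 1, 2, 3]   W
# [2, 1, 1, 2, 3]   H
# [3, 2, 1, 2, 3]   A
# [4, 3, 2, 2, 2]   T   -> edit distance = 2
```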
Q.2. Given below are two tables: Table 2 gives the tf values and Table 3 gives the idf values for the 4 terms and 3 documents. Compute the tf-idf matrix and, using it, compute the Euclidean-normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms. [3]
Table 2: tf values

| Term   | Doc1 | Doc2 | Doc3 |
|--------|------|------|------|
| cream  | 15   | 5    | 20   |
| cake   | 2    | 22   | 0    |
| milk   | 0    | 22   | 15   |
| butter | 3    | 0    | 14   |
Table 3: idf values

| Term   | dft  | idft |
|--------|------|------|
| cream  | 2312 | 0.64 |
| cake   | 345  | 1.46 |
| milk   | 3030 | 0.52 |
| butter | 178  | 1.75 |
Answer 2:
• dft is the document frequency of term t: the number of documents that t occurs in.
• dft is an inverse measure of the informativeness of term t.
• We define the idf weight of term t as: idft = log10(N / dft), where N is the number of documents in the collection.
• idft is a measure of the informativeness of the term.
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
'Log-frequency weighting' table (wf = 1 + log10(tf) for tf > 0, else 0):

| Term   | Doc1               | Doc2               | Doc3               |
|--------|--------------------|--------------------|--------------------|
| cream  | 1 + log 15 = 2.176 | 1 + log 5 = 1.699  | 1 + log 20 = 2.301 |
| cake   | 1 + log 2 = 1.301  | 1 + log 22 = 2.342 | 0                  |
| milk   | 0                  | 1 + log 22 = 2.342 | 1 + log 15 = 2.176 |
| butter | 1 + log 3 = 1.477  | 0                  | 1 + log 14 = 2.146 |
Since the idft values are already given, we multiply the 'log-frequency weighting' entries by idft to get the tf-idf weights, and then Euclidean-normalize each document's four-component vector.
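A minimal Python sketch of that remaining arithmetic (term order and values taken from Tables 2 and 3; the rounded output shown is illustrative):

```python
import math

tf = {"cream": [15, 5, 20], "cake": [2, 22, 0],
      "milk": [0, 22, 15], "butter": [3, 0, 14]}
idf = {"cream": 0.64, "cake": 1.46, "milk": 0.52, "butter": 1.75}
terms = ["cream", "cake", "milk", "butter"]

# tf-idf weight: (1 + log10(tf)) * idf, with weight 0 when tf = 0
w = {t: [(1 + math.log10(f)) * idf[t] if f > 0 else 0.0 for f in tf[t]]
     for t in terms}

# Euclidean-normalize each document's four-component vector
for d in range(3):
    vec = [w[t][d] for t in terms]
    norm = math.sqrt(sum(x * x for x in vec))
    print(f"Doc{d + 1}:", [round(x / norm, 3) for x in vec])
# e.g. Doc1 -> [0.398, 0.543, 0.0, 0.739]
```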
Q.3. Consider the problem of classifying a name as being Food or Beverage. Assume the following training set in Table 4: [6]
Table 4

| Document            | Class    |
|---------------------|----------|
| D1: "egg stuffing"  | Food     |
| D2: "chicken wings" | Food     |
| D3: "cream soda"    | Beverage |
| D4: "Lemon soda"    | Beverage |
Apply the kNN algorithm with k = 3 to classify the test document D5: "egg soda" into its correct class. Use tf without idf, and cosine similarity with length normalization, for the computation. Show the intermediate steps clearly.
Answer 3:
'tf' table:

| Term     | D1 | D2 | D3 | D4 | D5 |
|----------|----|----|----|----|----|
| egg      | 1  | 0  | 0  | 0  | 1  |
| stuffing | 1  | 0  | 0  | 0  | 0  |
| chicken  | 0  | 1  | 0  | 0  | 0  |
| wings    | 0  | 1  | 0  | 0  | 0  |
| cream    | 0  | 0  | 1  | 0  | 0  |
| soda     | 0  | 0  | 1  | 1  | 1  |
| lemon    | 0  | 0  | 0  | 1  | 0  |
Normalized 'tf' table (each document contains exactly two distinct terms, so every nonzero entry is 1 / (1^2 + 1^2)^0.5 = 0.707):

| Term     | D1    | D2    | D3    | D4    | D5    |
|----------|-------|-------|-------|-------|-------|
| egg      | 0.707 | 0     | 0     | 0     | 0.707 |
| stuffing | 0.707 | 0     | 0     | 0     | 0     |
| chicken  | 0     | 0.707 | 0     | 0     | 0     |
| wings    | 0     | 0.707 | 0     | 0     | 0     |
| cream    | 0     | 0     | 0.707 | 0     | 0     |
| soda     | 0     | 0     | 0.707 | 0.707 | 0.707 |
| lemon    | 0     | 0     | 0     | 0.707 | 0     |
Cosine-similarity (D1, D5) = 0.707 * 0.707 + 0 = 0.5
Cosine-similarity (D2, D5) = 0
Cosine-similarity (D3, D5) = 0 + 0.707 * 0.707 = 0.5
Cosine-similarity (D4, D5) = 0 + 0.707 * 0.707 = 0.5
The three nearest neighbours of D5 are D1, D3 and D4 (Food, Beverage, Beverage).
So, by majority vote, D5 is classified as 'Beverage'.
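The whole computation can be checked with a short Python sketch (document strings and classes as in Table 4; whitespace tokenization is an assumption):

```python
import math
from collections import Counter

train = {"D1": ("egg stuffing", "Food"), "D2": ("chicken wings", "Food"),
         "D3": ("cream soda", "Beverage"), "D4": ("lemon soda", "Beverage")}
test = "egg soda"

def cosine(a, b):
    # raw tf vectors with cosine (length) normalization
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

sims = sorted(((cosine(doc, test), cls) for doc, cls in train.values()),
              reverse=True)
print(sims)                                  # D1, D3, D4 score 0.5; D2 scores 0.0
print(Counter(cls for _, cls in sims[:3]))   # majority of k=3 -> Beverage
```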
Q.4. Given the following phrase alignment matrix in Table 5, with English words as rows and Hindi words as columns for the given parallel corpus, answer the questions below: [3 + 1 + 2 + 1 + 3 = 10]
Q.4.A. Assuming that the given phrase alignment matrix is the intersection of P(f|e) and P(e|f), identify whether the following phrase pairs are consistent with the alignment:
Q.4.B. Which is the longest phrase pair that is consistent with the alignment?
Q.4.C. Compute the reordering distance between the following 2 phrase pairs:
Q.4.D. In the mathematical model of phrase-based translation, why is the reordering distance not used directly, but an exponentially decaying cost function used instead?
Q.4.E. If a spurious phrase translation pair occurs only once, how will you compute the phrase translation values? Show with the help of an example.
Q.5.A. State the properties of the SVD decomposition.
Q.5.B. Given the following students-to-courses preferences at BITS Pilani in Table 6, in the form of a rating matrix M where each row of M represents the given student's set of ratings and the columns represent the courses: [3 + 7 = 10]

Table 6

From the SVD it is evident that there are two concepts of courses here: the first three are Computer Science courses while the last two are Mechanical Engineering courses. Answer the questions:
i. After we decompose M using SVD into U, S and VT, a new student named Rachat gives the following ratings: 4 for ML, 5 for PPL, and 2 for Dynamics. Rachat can be represented as the vector [0 4 5 2 0]. What is the representation of Rachat in the concept space?
ii. Explain in one sentence what these values indicate about Rachat's choice.
iii. Another student named Vinay has the following ratings: 5 for IR, 2 for ML, 4 for Dynamics, and 5 for Mechanics. What is the representation of Vinay in the concept space?
iv. Calculate the cosine similarity between Rachat and Vinay using their concept-space vectors.
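Table 6 is not reproduced above, so the rating matrix in the Python sketch below is a hypothetical stand-in; the sketch only illustrates the technique the question asks for: decompose M = U S V^T with SVD, then map a ratings vector q into concept space as q·V, keeping the strongest concepts:

```python
import numpy as np

# Hypothetical ratings matrix (columns: IR, ML, PPL, Dynamics, Mechanics);
# the real Table 6 values should be substituted here.
M = np.array([[5, 4, 5, 0, 0],
              [4, 5, 4, 0, 0],
              [0, 0, 0, 5, 4],
              [0, 1, 0, 4, 5]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
V2 = Vt[:2].T                     # keep the two strongest concepts

rachat = np.array([0, 4, 5, 2, 0], dtype=float)  # ratings from the question
vinay = np.array([5, 2, 0, 4, 5], dtype=float)

r_c, v_c = rachat @ V2, vinay @ V2               # concept-space vectors
cos = r_c @ v_c / (np.linalg.norm(r_c) * np.linalg.norm(v_c))
print(r_c, v_c, cos)
```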
Q.6.A. Given a greyscale image of size 5x5 pixels in Table 7, with the intensity range K = 0, 1, 2, ..., 255, sketch the histogram representing the image. [Note: on the Y axis you may just show the values for the intensities present in the image.]
Table 7

| 4   | 100 | 250 | 4  | 200 |
| 3   | 6   | 35  | 6  | 5   |
| 5   | 4   | 30  | 35 | 20  |
| 6   | 3   | 5   | 30 | 10  |
| 200 | 3   | 4   | 2  | 100 |
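A small Python sketch (not part of the original paper) that tallies the histogram for Table 7; only the intensities actually present in the image get a bar:

```python
from collections import Counter

image = [[4, 100, 250, 4, 200],
         [3, 6, 35, 6, 5],
         [5, 4, 30, 35, 20],
         [6, 3, 5, 30, 10],
         [200, 3, 4, 2, 100]]

hist = Counter(pixel for row in image for pixel in row)
for intensity in sorted(hist):
    print(f"{intensity:3d}: {'*' * hist[intensity]}")
# e.g. intensity 4 occurs 4 times, 3 occurs 3 times, 250 occurs once
```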
Q.6.B. Given the color histograms for the query and the three images named a, b and c, with each histogram having four colors: red, blue, purple, and yellow, where the first bin shows the number of red pixels, the second blue, the third purple and the fourth yellow. Compute the Bray-Curtis dissimilarity and the squared chord distance, and rank the images based on both distances. [2 + 7 = 9]
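The histograms themselves are not reproduced above, so the bin counts in the Python sketch below are hypothetical; the sketch shows the two standard distance definitions the question refers to:

```python
import math

def bray_curtis(x, y):
    # sum of absolute bin differences over the sum of all bin values
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def squared_chord(x, y):
    # sum of squared differences of the square roots of the bins
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

query = [10, 5, 3, 2]        # hypothetical [red, blue, purple, yellow] counts
images = {"a": [8, 6, 4, 2], "b": [2, 9, 1, 8], "c": [10, 4, 3, 3]}

# rank images by ascending distance under each measure
for name, h in sorted(images.items(), key=lambda kv: bray_curtis(query, kv[1])):
    print(name, round(bray_curtis(query, h), 3), round(squared_chord(query, h), 3))
```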
Q.7. Identify the type of index to be used if the IR system wants to allow the following: [4]
(a) Leading wildcard queries
(b) Trailing wildcard queries
(c) Spelling correction
(d) Wildcard characters in the middle of a word
Q.8. Consider the following web pages and the set of web pages they link to:
Page A points to pages B, C, and D.
Page B points to pages A and D.
Page C points to pages B and D.
Page D points to page A.
Trace the PageRank algorithm for two iterations, calculating each page's probability. What is the order of the pages after the two iterations? Use initial PR values of 1 for all nodes and d = 0.85.
Remember that one way to describe the algorithm is:
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1, ..., Tn are the pages that link to A and C(T) is the number of links going out of page T.
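A short Python sketch of the two iterations using that formula with synchronous updates (every score recomputed from the previous iteration's values):

```python
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["B", "D"], "D": ["A"]}
d = 0.85
pr = {page: 1.0 for page in links}            # initial PR values of 1

for it in range(2):
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                               for q in links if p in links[q])
          for p in pr}
    print(f"iteration {it + 1}:", {p: round(v, 3) for p, v in pr.items()})
# iteration 1: A=1.425, B=0.858, C=0.433, D=1.283
# iteration 2: A=1.606, B=0.738, C=0.554, D=1.103  -> order: A > D > B > C
```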