BITS WILP Data Mining Mid-Sem Exam 2017-H1 (Regular)



Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017
Mid-Semester Test (EC-2 Regular)

Course No.                  : IS ZC415  
Course Title                 : DATA MINING  
Nature of Exam           : Closed Book
Weightage                    : 30%
Duration                      : 2 Hours 
Date of Exam              : 25/02/2017    (AN)
No. of pages: 2
No. of questions: 5
Note:
1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.
2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 
3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 (a)          What is the mode of the following data?
10, 2, 30, 14, 50                                                                                                     [1]

Q.1 (b)          Eleven students were asked to measure their pulses for 30 seconds and multiply by two to get their one-minute pulse rates. The results were: 62, 32, 60, 66, 70, 72, 74, 74, 78, 80, 84. Create the five-number summary for the pulse rates and draw a boxplot.   [3]
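
One way to check a hand-computed five-number summary is the short sketch below. It assumes the median-of-halves convention for quartiles (the median itself is left out of both halves when the count is odd); other textbook conventions can give slightly different quartiles. The function name five_number_summary is illustrative, not part of the exam.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # values above it (median skipped when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

pulses = [62, 32, 60, 66, 70, 72, 74, 74, 78, 80, 84]
summary = five_number_summary(pulses)
print(summary)
```

The boxplot is then drawn with the box spanning Q1 to Q3, a line at the median, and whiskers placed according to whichever outlier rule the course uses.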
Q.1 (c)          Students admitted for a certain course have a mean score of 560 and a standard deviation of 60. Calculate the z-score of a student having a score of 500.             [1]

Q.1 (d)         Calculate the cosine similarity between the two phrases below. The feature value of a word that occurs multiple times is greater than 1. Clearly show the steps of your calculations.
            mid term regular exam
            regular exam mid term mid term regular exam           [2]
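
A minimal sketch for checking the cosine similarity, assuming each phrase is represented by raw term counts over whitespace-separated words (one reading of the note that a repeated word has a feature value greater than 1). The helper name cosine_similarity is illustrative.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two phrases."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    vocab = set(counts_a) | set(counts_b)
    # Counter returns 0 for words missing from a phrase.
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(
    "mid term regular exam",
    "regular exam mid term mid term regular exam",
)
print(similarity)
```

Under this representation the second count vector is exactly twice the first, so the two vectors point in the same direction.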

Q.2.     You are given 10 training samples. They are divided into four classes: A, B, C, and D.
One sample belongs to A, two belong to B, three belong to C, and four belong to D. Use the
following log2 table to answer the questions:

p        log2(p)
0.1      -3.32
0.2      -2.32
0.3      -1.74
0.4      -1.32
0.5      -1.0
0.6      -0.74

(a)             What is the total information contained in the samples? [2]

(b)             What is the total Gini index?  [2]         
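
The two impurity measures can be checked with the sketch below. It assumes "information" means the Shannon entropy of the class distribution, in bits per sample; if the question is read as the information in all 10 samples together, that figure is multiplied by 10. Variable names are illustrative.

```python
from math import log2

class_counts = {"A": 1, "B": 2, "C": 3, "D": 4}
total = sum(class_counts.values())
probs = [count / total for count in class_counts.values()]

# Entropy: -sum(p * log2(p)), in bits per sample.
entropy = -sum(p * log2(p) for p in probs)

# Gini index: 1 - sum(p^2).
gini = 1 - sum(p * p for p in probs)

print(round(entropy, 3), round(gini, 2))
print(round(total * entropy, 2))  # bits over all samples, under the alternative reading
```

With the rounded log2 table the entropy works out by hand to 0.332 + 0.464 + 0.522 + 0.528 = 1.846 bits, which agrees with the exact computation to two decimal places.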

Q.3 (a)          Given below is a database of flight delays over a period and under various conditions. We want to create a decision tree classifier with information gain (entropy) as the attribute splitting criterion.
Feature    Value = Yes                    Value = No
Rain       Delayed=30, not Delayed=10     Delayed=10, not Delayed=30
Fog        Delayed=25, not Delayed=15     Delayed=15, not Delayed=25
Summer     Delayed=5,  not Delayed=35     Delayed=35, not Delayed=5
Winter     Delayed=20, not Delayed=10     Delayed=20, not Delayed=30
Day        Delayed=20, not Delayed=20     Delayed=20, not Delayed=20
Night      Delayed=15, not Delayed=10     Delayed=25, not Delayed=30

Which feature should be at the root of decision tree?                                                                   [2]
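
A sketch of the information-gain comparison follows. It assumes the counts above list the "Value = Yes" column for the six features first and the "Value = no" column second; this is the only reading under which every feature totals 40 delayed and 40 not-delayed flights. Function and variable names are illustrative.

```python
from math import log2

def entropy(delayed, not_delayed):
    """Binary entropy in bits for a (delayed, not delayed) count pair."""
    total = delayed + not_delayed
    result = 0.0
    for count in (delayed, not_delayed):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

# feature -> ((delayed, not delayed) when Yes, (delayed, not delayed) when No)
counts = {
    "Rain":   ((30, 10), (10, 30)),
    "Fog":    ((25, 15), (15, 25)),
    "Summer": ((5, 35),  (35, 5)),
    "Winter": ((20, 10), (20, 30)),
    "Day":    ((20, 20), (20, 20)),
    "Night":  ((15, 10), (25, 30)),
}

def information_gain(yes, no):
    """Parent entropy minus the size-weighted entropy of the two branches."""
    n_yes, n_no = sum(yes), sum(no)
    n = n_yes + n_no
    parent = entropy(yes[0] + no[0], yes[1] + no[1])
    children = (n_yes / n) * entropy(*yes) + (n_no / n) * entropy(*no)
    return parent - children

gains = {name: information_gain(yes, no) for name, (yes, no) in counts.items()}
best_feature = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:7s} {gain:.4f}")
```

The feature with the largest gain is the candidate for the root; a feature whose two branches both keep the parent's 50/50 split has a gain of zero.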
Q.3 (b)          Given the following training documents and their classes:

Document#    Content of document    Class
1            good                   Ham
2            very good              Ham
3            bad                    Spam
4            very bad               Spam
5            very bad very bad      Spam

Use Naïve Bayes classifier with Laplace (+1) smoothing to find the class of a document with the following contents:
very good bad very very bad              [5]
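
The classification can be checked with the sketch below. It assumes the multinomial Naive Bayes model (word counts per class, class priors taken from document counts) with add-one smoothing over a vocabulary of the three training words; a Bernoulli model would give different numbers. Helper names are illustrative.

```python
from collections import Counter
from math import log

training = [
    ("good", "Ham"),
    ("very good", "Ham"),
    ("bad", "Spam"),
    ("very bad", "Spam"),
    ("very bad very bad", "Spam"),
]

def train(docs):
    """Collect document counts per class, word counts per class, and the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    for text, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counter in word_counts.values() for word in counter}
    return class_docs, word_counts, vocab

def log_scores(text, class_docs, word_counts, vocab):
    """Log of prior times smoothed word likelihoods, for each class."""
    n_docs = sum(class_docs.values())
    scores = {}
    for label, n in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = log(n / n_docs)
        for word in text.split():
            # Laplace (+1) smoothing: (count + 1) / (class word total + |V|)
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores

model = train(training)
scores = log_scores("very good bad very very bad", *model)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Working in log space avoids multiplying many small probabilities; exponentiating a score recovers the product a hand calculation would produce.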

Q.4.      Suppose you have the following candidate itemsets of length 4:
{1 2 3 5}, {1 2 4 7}, {1 2 5 6}, {1 3 5 9}, {1 4 5 7}, {1 5 6 9}, {2 3 5 9}, {3 4 5 9},
{4 5 6 8}, {5 6 7 9}

(a)    Use hash function k mod 5 to create a hash tree of the itemsets. Assume that each leaf
node can store a maximum of three itemsets.    [4]

(b)   Given transaction {1, 2, 3, 5, 7, 9}, which leaf nodes of the hash tree will be visited
for support-counting? Clearly show the visited leaf nodes in the hash tree.    [2]

Q.5.     Given that the minimum support count is 2 and the minimum confidence is 70%, find all
association rules from the following market basket dataset using Apriori:      [6]

Transaction ID    Items
1                 a, b, c
2                 b, c, d, e
3                 c, d
4                 a, b, d
5                 a, b, c
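
A compact sketch of level-wise Apriori for checking a hand-worked answer. It assumes the support threshold is an absolute count of 2 transactions and that a rule qualifies when its confidence is at least 0.7. It is a small illustrative implementation, not an optimized one, and all names are hypothetical.

```python
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"b", "c", "d", "e"},
    {"c", "d"},
    {"a", "b", "d"},
    {"a", "b", "c"},
]
MIN_SUPPORT = 2        # absolute count (assumption)
MIN_CONFIDENCE = 0.7

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    """Return {frequent itemset: support count}, built level by level."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then count support.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= MIN_SUPPORT
        ]
    return frequent

def rules(frequent):
    """Return (lhs, rhs, confidence) for every rule meeting MIN_CONFIDENCE."""
    found = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= MIN_CONFIDENCE:
                    found.append((lhs, itemset - lhs, confidence))
    return found

frequent_itemsets = apriori()
association_rules = rules(frequent_itemsets)
for lhs, rhs, confidence in association_rules:
    print(sorted(lhs), "->", sorted(rhs), round(confidence, 2))
```

Each rule's confidence is support(lhs and rhs together) divided by support(lhs); the lookup of the left-hand side's support is safe because every subset of a frequent itemset is itself frequent.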
*********
