BITS WILP Data Mining Mid-Sem Exam 2017-H2


Birla Institute of Technology & Science, Pilani
Work Integrated Learning Programmes Division
First Semester 2017-18
Mid Semester Test (EC2 Regular)
Course No: IS ZC415
Course Title: Data Mining
Nature of Exam: Closed Book
Weightage: 30%
Duration: 2 Hours
Date of Exam: 23/Sep/2017 (AN)
No of pages: 2
No of questions: 4


Solutions:


Answer 1(A):
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
*  Partition into (equi-depth) bins:
      - Bin 1: 4, 8, 9, 15
      - Bin 2: 21, 21, 24, 25
      - Bin 3: 26, 28, 29, 34
*  Smoothing by bin means:
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
*  Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34
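
The smoothing steps above can be sketched in Python (the helper names are illustrative, not part of the question):

```python
def smooth_by_bin_means(values, n_bins):
    """Partition sorted values into equi-depth bins and replace each
    value with its bin's (rounded) mean."""
    values = sorted(values)
    depth = len(values) // n_bins
    result = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        mean = round(sum(bin_) / len(bin_))
        result.extend([mean] * len(bin_))
    return result

def smooth_by_bin_boundaries(values, n_bins):
    """Replace each value with the nearer of its bin's min/max boundary."""
    values = sorted(values)
    depth = len(values) // n_bins
    result = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        lo, hi = bin_[0], bin_[-1]
        result.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return result

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))       # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
print(smooth_by_bin_boundaries(prices, 3))  # [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```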


Answer 1(B):
Euclidean distance is widely used in geometry, where the shortest straight-line distance between two points is required, for example the distance between two celestial objects in space.

Manhattan distance is used in navigation systems to calculate the distance between two points along a grid of streets, where obstacles block the direct path. It is also known as 'taxicab' distance.

Cosine distance is used in web search and information retrieval, where two documents are represented as vectors with terms as dimensions, and the similarity between two documents is measured as the cosine of the angle between their vectors.
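
A minimal plain-Python sketch of the three measures (no libraries assumed beyond the standard library):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute axis-wise differences ('taxicab' distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two term vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```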


Answer 1(C):
...

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
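
For a binary classifier, the four cells of the table can be tallied as below (a sketch; the `positive` label name is an assumption):

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1    # predicted positive, actually positive
            else:
                fp += 1    # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1    # predicted negative, actually positive
            else:
                tn += 1    # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
```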

...

In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing:

v' = (v - mean_A) / std_A

            Marks    Z-score
Subject 1   70       (70 - 60)/15 = 0.667
Subject 2   65       (65 - 60)/6  = 0.833

The student did better in subject 2.
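
The two z-scores can be checked with a one-line helper (names are illustrative):

```python
def z_score(v, mean, std):
    """Z-score normalization: how many standard deviations v is above the mean."""
    return (v - mean) / std

print(round(z_score(70, 60, 15), 3))  # 0.667  (subject 1)
print(round(z_score(65, 60, 6), 3))   # 0.833  (subject 2)
```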

Answer 2:
For the least-squares line y = W0 + W1*x:
x-mean = 1.5
y-mean = 3.5
W1 = ((1 - 1.5)*(2 - 3.5) + (2 - 1.5)*(5 - 3.5)) / ((1 - 1.5)^2 + (2 - 1.5)^2) = 1.5 / 0.5 = 3
W0 = 3.5 - 3*(1.5) = -1
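
The same W0/W1 arithmetic as a small least-squares helper (a sketch, not part of the original answer):

```python
def least_squares(xs, ys):
    """Fit y = w0 + w1*x by ordinary least squares over paired samples."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x,y divided by variance of x.
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    w0 = y_mean - w1 * x_mean
    return w0, w1

w0, w1 = least_squares([1, 2], [2, 5])
print(w0, w1)  # -1.0 3.0
```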

Answer 3(A)
Info(D) = -(2/6)*log2(2/6) - (4/6)*log2(4/6) = 0.92


...
...

Intermediate calculation: -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.92
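
Both entropy values above can be reproduced with a short helper (a sketch; `entropy` takes raw class counts):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Info(D) for the 2-vs-4 class split above:
print(round(entropy([2, 4]), 2))  # 0.92
# The intermediate 1-vs-2 split gives the same value:
print(round(entropy([1, 2]), 2))  # 0.92
```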
For “What will be the class label for nodes with no training samples?”, from ML course:
...



Answer 3(B)
...


Answer 4:
Example:
     - Support: usefulness of discovered rules
     - Confidence: certainty of discovered rules

computer => antivirus software [support = 2%, confidence = 60%]
·         A support of 2% means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
·         A confidence of 60% means that 60% of the customers who purchased a computer also bought the antivirus software.
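
A sketch of how support and confidence are computed from raw transactions (the tiny dataset below is hypothetical, chosen only to illustrate the two formulas):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical market-basket data:
transactions = [
    {"computer", "antivirus"},
    {"computer"},
    {"bread", "computer", "antivirus"},
    {"bread"},
]
print(support(transactions, {"computer", "antivirus"}))          # 0.5
print(confidence(transactions, {"computer"}, {"antivirus"}))     # 2/3
```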


...
Answer 4(B)
From Stackoverflow.com
Ques: I want to find out the maximal frequent item sets and the closed frequent item sets.
A frequent itemset X is maximal if it does not have any frequent superset.
A frequent itemset X is closed if it has no superset with the same frequency.

So I counted the occurrence of each item set.
{A} = 4 ;  {B} = 2  ; {C} = 5  ; {D} = 4  ; {E} = 6

{A,B} = 1; {A,C} = 3; {A,D} = 3; {A,E} = 4; {B,C} = 2;
{B,D} = 0; {B,E} = 2; {C,D} = 3; {C,E} = 5; {D,E} = 4
(since {E} = 6, every transaction contains E, so {D,E} must equal {D} = 4)

{A,B,C} = 1; {A,B,D} = 0; {A,B,E} = 1; {A,C,D} = 2; {A,C,E} = 3;
{A,D,E} = 3; {B,C,D} = 0; {B,C,E} = 2; {C,D,E} = 3

{A,B,C,D} = 0; {A,B,C,E} = 1; {B,C,D,E} = 0

Min_Support set to 50%
Does maximal = {A,B,C,E}?
Does closed = {A,B,C,D} and {B,C,D,E}?
Ans:
Note:
  • Did not re-check the support counts
  • Let's say min_support = 0.5. With 6 transactions, this is fulfilled if the support count >= 3
{A} = 4  ; not closed due to {A,E}
{B} = 2  ; not frequent => ignore
{C} = 5  ; not closed due to {C,E}
{D} = 4  ; not closed due to {D,E} (same count of 4)
{E} = 6  ; closed, but not maximal due to e.g. {D,E}

{A,B} = 1; not frequent => ignore
{A,C} = 3; not closed due to {A,C,E}
{A,D} = 3; not closed due to {A,D,E}
{A,E} = 4; closed, but not maximal due to {A,D,E}
{B,C} = 2; not frequent => ignore
{B,D} = 0; not frequent => ignore
{B,E} = 2; not frequent => ignore
{C,D} = 3; not closed due to {C,D,E}
{C,E} = 5; closed, but not maximal due to {C,D,E}
{D,E} = 4; closed, but not maximal due to {A,D,E}

{A,B,C} = 1; not frequent => ignore
{A,B,D} = 0; not frequent => ignore
{A,B,E} = 1; not frequent => ignore
{A,C,D} = 2; not frequent => ignore
{A,C,E} = 3; maximal frequent
{A,D,E} = 3; maximal frequent
{B,C,D} = 0; not frequent => ignore
{B,C,E} = 2; not frequent => ignore
{C,D,E} = 3; maximal frequent

{A,B,C,D} = 0; not frequent => ignore
{A,B,C,E} = 1; not frequent => ignore
{B,C,D,E} = 0; not frequent => ignore
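
The whole listing can be reproduced mechanically; below is a brute-force sketch (the toy transactions are hypothetical, since the thread's actual data is not shown):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Count every itemset by brute force; keep those meeting min_count.
    Exponential in the number of items, so only suitable for toy data."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(set(combo) <= t for t in transactions)
            if count >= min_count:
                freq[frozenset(combo)] = count
    return freq

def closed_and_maximal(freq):
    """Closed: no frequent superset with the same count.
    Maximal: no frequent superset at all."""
    closed, maximal = set(), set()
    for s, c in freq.items():
        supersets = [t for t in freq if s < t]
        if all(freq[t] < c for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal

# Hypothetical toy data: with min_count = 2, {a,b} is both closed and maximal.
transactions = [{"a", "b"}, {"a", "b"}, {"a"}]
closed, maximal = closed_and_maximal(frequent_itemsets(transactions, 2))
print(closed, maximal)
```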
Answer to problem:
A frequent itemset X is maximal if it does not have any frequent superset.
A frequent itemset X is closed if it has no superset with the same frequency.

Closed 2-itemsets: “ab, bc, bd”
Maximal 2-itemsets: “bd”
*****

Tag: BITS WILP Data Mining Mid-Sem Exam 2017-H2
