BITS WILP Data Mining End-Sem Exam 2017-H1 (Regular)



Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2016-2017
Comprehensive Examination (EC-3 Regular)

Course No.                  : IS ZC415 
Course Title                 : DATA MINING  
Nature of Exam           : Open Book
Weightage                    : 50%
Duration                      : 3 Hours 
Date of Exam              : 08/04/2017    (AN)
No. of pages: 3
No. of questions: 5
Note:
1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.
2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 
3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

Q1(a) What is the difference between lift value of 0 vs 1?                                               [1]
Answer 1(a):


 
If a rule has a lift of 1, the antecedent and the consequent occur together exactly as often as would be expected if they were independent of each other; when two events are independent, no useful rule can be drawn involving them. A lift of 0, by contrast, means the antecedent and the consequent never occur together in the data (the support of the combined itemset is 0), so the rule never holds.
URL: https://en.wikipedia.org/wiki/Lift_(data_mining)

Example
Assume the data set being mined is:
Antecedent | Consequent
A          | 0
A          | 0
A          | 1
A          | 0
B          | 1
B          | 0
B          | 1
where the antecedent is the input variable that we can control, and the consequent is the variable we are trying to predict. Real mining problems would typically have more complex antecedents, but usually focus on single-value consequents.
Most mining algorithms would determine the following rules (targeting models):
  • Rule 1: A implies 0
  • Rule 2: B implies 1
because these are simply the most common patterns found in the data. A simple review of the above table should make these rules obvious.
The support for Rule 1 is 3/7 because that is the number of items in the dataset in which the antecedent is A and the consequent 0. The support for Rule 2 is 2/7 because two of the seven records meet the antecedent of B and the consequent of 1. The supports can be written as:

supp(A ⇒ 0) = P(A and 0) = 3/7        supp(B ⇒ 1) = P(B and 1) = 2/7
The confidence for Rule 1 is 3/4 because three of the four records that meet the antecedent of A meet the consequent of 0. The confidence for Rule 2 is 2/3 because two of the three records that meet the antecedent of B meet the consequent of 1. The confidences can be written as:

conf(A ⇒ 0) = P(0 | A) = 3/4        conf(B ⇒ 1) = P(1 | B) = 2/3
Lift can be found by dividing the confidence by the unconditional probability of the consequent, or by dividing the support by the probability of the antecedent times the probability of the consequent, so:
  • The lift for Rule 1 is (3/4)/(4/7) = (3*7)/(4 * 4) = 21/16 ≈ 1.31
  • The lift for Rule 2 is (2/3)/(3/7) = (2*7)/(3 * 3) = 14/9 ≈ 1.56
 


If some rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.
If the lift is > 1, like it is here for Rules 1 and 2, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.
Observe that even though Rule 1 has higher confidence, it has lower lift. Intuitively, it would seem that Rule 1 is more valuable because of its higher confidence—it seems more accurate (better supported). But accuracy of the rule independent of the data set can be misleading. The value of lift is that it considers both the confidence of the rule and the overall data set.
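To make the arithmetic concrete, here is a small illustrative Python sketch (the list of (antecedent, consequent) pairs and the helper rule_metrics are ours, not part of the exam or the article) that reproduces the support, confidence and lift of Rules 1 and 2:

# Support, confidence and lift for the rules "A => 0" and "B => 1" from the table above.
data = [("A", 0), ("A", 0), ("A", 1), ("A", 0),
        ("B", 1), ("B", 0), ("B", 1)]
n = len(data)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    both = sum(1 for a, c in data if a == antecedent and c == consequent)
    ante = sum(1 for a, _ in data if a == antecedent)
    cons = sum(1 for _, c in data if c == consequent)
    support = both / n                # P(antecedent and consequent)
    confidence = both / ante          # P(consequent | antecedent)
    lift = confidence / (cons / n)    # confidence / P(consequent)
    return support, confidence, lift

print(rule_metrics("A", 0))   # (3/7, 3/4, 21/16 ≈ 1.31)
print(rule_metrics("B", 1))   # (2/7, 2/3, 14/9 ≈ 1.56)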



Q1(b) What do you mean by stratified sampling? Give an example.                                [2]
Answer 1B:
Stratification for data mining
Published on October 4, 2007 by Sandro Saitta (categories: class distribution, cross-validation, stratification)

One common issue in data mining is the size of the data set, which is often limited. When this is the case, testing the model becomes an issue. Usually, 2/3 of the data are used for training and validation and 1/3 for final testing. By chance, the training or the test set may not be representative of the overall data set. Consider for example a data set of 200 samples and 10 classes: it can easily happen that one of these 10 classes is not represented in the validation or test set.

To avoid this problem, you should make sure that each class is correctly represented in both the training and the test set. This process is called stratification (stratified sampling): the data set is divided into strata, here the classes, and samples are drawn from each stratum in proportion to its size. For example, if 30% of the samples belong to class A and 70% to class B, a stratified 2/3 - 1/3 split keeps roughly the same 30/70 ratio in both the training and the test set. One way to mitigate the problem in the training phase without stratification is to use k-fold cross-validation: instead of a single validation set with a given class distribution, k different validation sets are used. However, this by itself does not guarantee a correct class distribution in the training and validation sets.

And what about the test set? The test set can only be used once, on the final model, so no method such as cross-validation can be applied to it. There is no guarantee that the test set contains all the classes present in the data set. This situation is more likely to occur when the number of samples is small and the number of classes is large; in that case, stratification may be crucial.

More details about stratification can be found in the book Data Mining: Practical Machine Learning tools and techniques, by Witten and Frank (2005).
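As an illustrative sketch of stratified sampling in practice (assuming scikit-learn is available; the toy data with a 30/70 class split is made up for the example):

# Minimal sketch of a stratified train/test split and stratified k-fold CV.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Toy data: 200 samples, 2 features, imbalanced labels (30% class 1, 70% class 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.array([1] * 60 + [0] * 140)

# stratify=y keeps the 30/70 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())   # both close to 0.30

# Stratified k-fold cross-validation preserves the ratio in every fold.
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    print(y[val_idx].mean())           # ≈ 0.30 in each fold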

Q1(c) What do you mean by bootstrap aggregating and how is it helpful?                      [2]
Answer 1C:
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

Description of the technique

Given a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each Di. If n′ = n, then for large n the set Di is expected to have the fraction (1 − 1/e) (≈ 63.2%) of the unique examples of D, the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
Bagging leads to "improvements for unstable procedures" (Breiman, 1996), which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression (Breiman, 1994). An interesting application of bagging showing improvement in preimage learning is provided in [2][3]. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).

Example: Ozone data

To illustrate the basic principles of bagging, below is an analysis on the relationship between ozone and temperature (data from Rousseeuw and Leroy (1986), analysis done in R).
The relationship between temperature and ozone in this data set is apparently non-linear, based on the scatter plot. To mathematically describe this relationship, LOESS smoothers (with span 0.5) are used. Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit, and predictions from these 100 smoothers were then made across the range of the data. In the original figure (not reproduced here) the first 10 predicted smooth fits appear as grey lines; the lines are clearly very wiggly and overfit the data, a result of the span being too low.
By taking the average of the 100 smoothers, each fitted to a bootstrap sample of the original data set, we arrive at one bagged predictor (the red line in the original figure). Clearly, the mean is more stable and there is less overfitting.
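A minimal sketch of bagging for regression, using bootstrap samples and regression trees in place of the LOESS smoothers of the ozone example (assumes scikit-learn; the synthetic sine data and the tree depth are illustrative only):

# Bagging sketch: fit m regressors on bootstrap samples and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy non-linear data

m, n = 100, len(X)
models = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)          # bootstrap sample: n indices drawn with replacement
    tree = DecisionTreeRegressor(max_depth=6)
    tree.fit(X[idx], y[idx])
    models.append(tree)

# The bagged prediction is the average of the individual predictions.
X_grid = np.linspace(0, 10, 50).reshape(-1, 1)
bagged_pred = np.mean([t.predict(X_grid) for t in models], axis=0)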





Q1(d) Give an example of microaveraged vs macroaveraged Precision.                                      [2]
Answer 1D:

Micro- and Macro-average of Precision, Recall and F-Score

1. Micro-average Method
In the micro-average method, you sum up the individual true positives, false positives, and false negatives of the system over the different sets and then compute the statistics from those sums. For example, for one set of data, the system's
True positive (TP1)= 12
False positive (FP1)=9
False negative (FN1)=3
Then precision (P1) and recall (R1) will be 57.14% and 80%
And for a different set of data, the system's
True positive (TP2)= 50
False positive (FP2)=23
False negative (FN2)=9
Then precision (P2) and recall (R2) will be 68.49% and 84.75%
Now, the average precision and recall of the system using the Micro-average method is
Micro-average of precision = (TP1+TP2)/(TP1+TP2+FP1+FP2) = (12+50)/(12+50+9+23) = 62/94 ≈ 65.96%
Micro-average of recall = (TP1+TP2)/(TP1+TP2+FN1+FN2) = (12+50)/(12+50+3+9) = 62/74 ≈ 83.78%
The Micro-average F-Score will be simply the harmonic mean of these two figures.

2. Macro-average Method
This method is straightforward: just take the average of the precision and recall of the system on the different sets. For example, the macro-average precision and recall of the system for the given example are
Macro-average precision = (P1+P2)/2 = (57.14+68.49)/2 = 62.82
Macro-average recall = (R1+R2)/2 = (80+84.75)/2 = 82.38
The Macro-average F-Score will be simply the harmonic mean of these two figures.

Suitability
The macro-average method can be used when you want to know how the system performs overall across the sets of data; because every set is weighted equally, you should not base decisions about any individual set on this average alone.
On the other hand, the micro-average is a useful measure when the sets (or classes) vary in size, since larger sets contribute proportionally more to the totals.
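A small Python sketch reproducing the numbers above (the two (TP, FP, FN) tuples are the example sets):

# Micro- vs macro-averaged precision and recall for two sets of (TP, FP, FN) counts.
sets = [(12, 9, 3), (50, 23, 9)]          # (TP1, FP1, FN1), (TP2, FP2, FN2)

precisions = [tp / (tp + fp) for tp, fp, fn in sets]
recalls    = [tp / (tp + fn) for tp, fp, fn in sets]

macro_p = sum(precisions) / len(sets)     # (57.14% + 68.49%) / 2 ≈ 62.82%
macro_r = sum(recalls) / len(sets)        # (80% + 84.75%) / 2 ≈ 82.38%

TP = sum(tp for tp, fp, fn in sets)
FP = sum(fp for tp, fp, fn in sets)
FN = sum(fn for tp, fp, fn in sets)
micro_p = TP / (TP + FP)                  # 62/94 ≈ 65.96%
micro_r = TP / (TP + FN)                  # 62/74 ≈ 83.78%

print(macro_p, macro_r, micro_p, micro_r)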

Q1(e) We generally will be more interested in association rules with high confidence. However, often we will not be interested in association rules that have a confidence of 100%. Why? Then specifically explain why association rules with 99% confidence may be interesting (i.e., what might they indicate)?           [2]
Answer 1E:
Confidence is defined as:
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well).

Example database with 5 transactions and 5 items:

transaction ID | milk | bread | butter | beer | diapers
1              | 1    | 1     | 0      | 0    | 0
2              | 0    | 0     | 1      | 0    | 0
3              | 0    | 0     | 0      | 1    | 1
4              | 1    | 1     | 1      | 0    | 0
5              | 0    | 1     | 0      | 0    | 0

Let X be an itemset, X ⇒ Y an association rule and T a set of transactions of a given database.
Support
Support is an indication of how frequently the itemset appears in the dataset.
The support of X with respect to T is defined as the proportion of transactions t in the dataset which contains the itemset X.
supp(X) = |{t ∈ T : X ⊆ t}| / |T|
In the example dataset, the itemset X = { beer, diapers } has a support of 1 / 5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of supp( ) is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).[3]
Confidence
Confidence is an indication of how often the rule has been found to be true.
The confidence value of a rule, X ⇒ Y, with respect to a set of transactions T, is the proportion of the transactions that contain X which also contain Y.
Confidence is defined as:
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well).
Note that supp(X ∪ Y) means the support of the union of the items in X and Y. This is somewhat confusing since we normally think in terms of probabilities of events and not sets of items. We can rewrite supp(X ∪ Y) as the probability P(E_X ∩ E_Y), where E_X and E_Y are the events that a transaction contains itemset X and Y, respectively.[4]
Thus confidence can be interpreted as an estimate of the conditional probability P(E_Y | E_X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.[3][5]
Lift
The lift of a rule is defined as:
lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
or the ratio of the observed support to that expected if X and Y were independent. For example, the rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.
If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.
If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.
The value of lift is that it considers both the confidence of the rule and the overall data set.
Coming back to the question: rules with 100% confidence often correspond to trivial or already-known relationships (for example, constraints that hold by definition or by business policy), so they rarely tell us anything new. A rule with 99% confidence, on the other hand, holds almost always but admits a small number of exceptions, and those exceptions may be the interesting part: they can point to data-entry errors, fraud, or genuinely unusual cases worth investigating.
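To tie the definitions back to the example database, here is an illustrative sketch in plain Python (the transaction list and the helpers supp, conf and lift are ours) that computes the measures for {butter, bread} ⇒ {milk} and {milk, bread} ⇒ {butter}:

# Support, confidence and lift computed directly from the example database above.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]
N = len(transactions)

def supp(itemset):
    return sum(1 for t in transactions if itemset <= t) / N

def conf(X, Y):
    return supp(X | Y) / supp(X)

def lift(X, Y):
    return supp(X | Y) / (supp(X) * supp(Y))

X, Y = {"butter", "bread"}, {"milk"}
print(supp(X | Y), conf(X, Y))              # 0.2, 1.0
print(lift({"milk", "bread"}, {"butter"}))  # 0.2 / (0.4 * 0.4) = 1.25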

Q1(f) Give an example where two different initial selection of centroids result in different clusters in k-means clustering.        [3]
Answer 1F:
K-means Clustering
Given k, the k-means algorithm consists of four steps:
  1. Select initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Recompute each centroid as the mean of the objects assigned to it.
  4. Repeat steps 2 and 3 until there is no change.
Because the algorithm only converges to a local optimum, different initial centroids can produce different final clusters. Example: take the four points x1 = (0,0), x2 = (0,1), x3 = (4,0), x4 = (4,1) and k = 2. Starting with centroids at x1 and x3, the algorithm converges to the clusters {x1, x2} and {x3, x4} (total SSE = 1). Starting with centroids at x1 and x2 instead, x3 is assigned to the centroid at x1 and x4 to the centroid at x2; the centroids move to (2,0) and (2,1), and the algorithm converges to the clusters {x1, x3} and {x2, x4} (total SSE = 16), a worse local optimum.
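A small sketch of Lloyd's algorithm in plain Python/NumPy, run from the two different initializations of the four-point example above (the kmeans helper is illustrative, not a library function):

# k-means (Lloyd's algorithm) run from two different initial centroid choices.
import numpy as np

points = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

def kmeans(X, centroids, iters=100):
    centroids = centroids.astype(float).copy()
    for _ in range(iters):
        # assign each point to the nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

print(kmeans(points, points[[0, 2]]))  # init at x1, x3 -> clusters {x1,x2}, {x3,x4}
print(kmeans(points, points[[0, 1]]))  # init at x1, x2 -> clusters {x1,x3}, {x2,x4}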

Q2(a): You are given the following data of marks obtained by 12 students in midterm exam and final exam. Given that the mean(x) = 72.167 and mean(y) = 74, derive the linear regression line equation to predict the final exam marks, given the midterm marks.  [5]

Midterm exam (x) | Final exam (y)
72               | 84
50               | 63
81               | 77
74               | 78
94               | 90
86               | 75
59               | 49
83               | 79
65               | 77
33               | 52
88               | 74
81               | 90
Answer 2A:
Using least squares, with ŷ = w0 + w1·x:
w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² ≈ 2004 / 3445.7 ≈ 0.5816
w0 = ȳ − w1·x̄ = 74 − 0.5816 × 72.167 ≈ 32.03
So the regression line is approximately ŷ = 32.03 + 0.58·x; for example, a student who scored 86 in the midterm is predicted to score about 32.03 + 0.5816 × 86 ≈ 82 in the final exam.
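A quick NumPy sketch to verify the coefficients (the two arrays are the 12 (x, y) pairs from the table):

# Least-squares line y = w0 + w1*x for the midterm/final marks.
import numpy as np

x = np.array([72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81], dtype=float)
y = np.array([84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90], dtype=float)

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(round(w1, 4), round(w0, 2))     # ≈ 0.5816, ≈ 32.03
print(w0 + w1 * 86)                   # predicted final mark for a midterm mark of 86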
Q2(b) Given the following pairs of credit ranking and fraud outcome in a data set, calculate the information gain (using entropy) when split by Ranking= L. Note: You only need to write the formulation; no need to solve it completely.                                                 [4]

Record  | 1  | 2  | 3  | 4   | 5  | 6  | 7  | 8   | 9  | 10
Fraud   | No | No | No | Yes | No | No | No | Yes | No | Yes
Ranking | L  | L  | M  | H   | H  | L  | L  | H   | M  | M

Answer 2B:
The parent data set D has 3 Yes and 7 No, so
Info(D) = −(3/10) log2(3/10) − (7/10) log2(7/10)  (≈ 0.8813)
The split Ranking = L partitions D into D1 (Ranking = L: 4 records, 0 Yes / 4 No) and D2 (Ranking ≠ L: 6 records, 3 Yes / 3 No), so
Info_Ranking=L(D) = (4/10) × Entropy(0, 4) + (6/10) × Entropy(3, 3) = (4/10) × 0 + (6/10) × [−(3/6) log2(3/6) − (3/6) log2(3/6)]  (= 0.6)
Gain(Ranking = L) = Info(D) − Info_Ranking=L(D)  (≈ 0.8813 − 0.6 = 0.2813)

Q3(a) The following rules identify different video files that customers watched on Youtube in the same session. Along with the rules, certain measures are also given. As a consultant to Youtube, make a recommendation on what video thumbnails to suggest/display to the user based on the association rule.
video1 → video2  with low support, high confidence, and lift = n where n is large. [2]
Answer 3A:
The support of X with respect to T is defined as the proportion of transactions t in the dataset which contains the itemset X.
supp(X) = |{t ∈ T : X ⊆ t}| / |T|
The rule has low support meaning there are few transactions in which {video1, video2} occur together.

The confidence value of a rule, X ⇒ Y, with respect to a set of transactions T, is the proportion of the transactions that contain X which also contain Y.
Confidence is defined as: conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well).
Confidence is high, meaning that in most sessions containing video1 (the LHS), video2 (the RHS) was also watched.

The lift of a rule is defined as:
lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
or the ratio of the observed support to that expected if X and Y were independent. For example, the rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.
If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.
If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.
The value of lift is that it considers both the confidence of the rule and the overall data set.

Lift is high (a large number n), which means video1 and video2 are strongly dependent on one another.

Recommendation:
If a user is watching video1, suggest/display video2's thumbnail to them; even though support is low, the high confidence and large lift make the rule reliable whenever video1 is watched.

Q3(b) The following frequent itemset identifies different video files that customers watched on Youtube over two consecutive sessions. Make a recommendation on what video thumbnails to suggest/display to the user based on this.
<{video1, video2}, {video3}>    with high support                                                     [2]
Answer 3B:
The support of X with respect to T is defined as the proportion of transactions t in the dataset which contain the itemset X:
supp(X) = |{t ∈ T : X ⊆ t}| / |T|
This sequential pattern has high support, meaning that in a large number of cases users watched video1 and video2 in one session and video3 in the next session. Recommendation: when a user has watched video1 and video2 in a session, suggest/display the thumbnail of video3 in their next session (and if they have watched only one of video1/video2, suggest the other one in the current session).

Q3(c) The following contingency table summarizes supermarket transaction data,

                   | diaper | Not(diaper) | Sum over row
air freshener      | 2,000  | 500         | 2,500
Not(air freshener) | 1,000  | 1,500       | 2,500
Sum over column    | 3,000  | 2,000       | 5,000

3C(I) Suppose that association rule “diapers => air freshener” is mined. Given that the minimum support threshold is 30% and the minimum confidence threshold is 60%, is this association rule strong?        [1]
Answer 3C(I):
Given a set of transactions T, the goal of association rule mining is to find all rules having
    support ≥ minsup threshold
    confidence ≥ minconf threshold
These rules are called “strong rules”; the minsup and minconf thresholds are set by a domain expert.
#(transactions with both air freshener and diapers) = 2,000
#(transactions) = 5,000
supp(diaper ∪ air freshener) = 2000/5000 = 0.4 = 40%

The confidence value of a rule, X ⇒ Y, with respect to a set of transactions T, is the proportion of the transactions that contain X which also contain Y.
Confidence is defined as: conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
supp(diaper) = 3000/5000 = 0.6
conf(diaper ⇒ air freshener) = 0.4 / 0.6 ≈ 0.667 = 66.7%
Since support (40%) ≥ 30% and confidence (66.7%) ≥ 60%, the rule “diaper ⇒ air freshener” is strong.

3C(II) Is the purchase of diapers independent of the purchase of air freshener? Calculate the correlation that exists between the two.                             [1]
Answer 3C(II)
URL: https://en.wikipedia.org/wiki/Contingency_table
The degree of association between the two variables can be assessed by a number of coefficients. The simplest, applicable only to the case of 2×2 contingency tables, is the phi coefficient, defined by

φ = ±√(χ² / N)

where χ² is computed as in Pearson's chi-squared test, and N is the grand total of observations. φ varies from 0 (corresponding to no association between the variables) to 1 or −1 (complete association or complete inverse association), provided it is based on frequency data represented in 2 × 2 tables. Its sign equals the sign of the product of the main-diagonal elements of the table minus the product of the off-diagonal elements. φ takes on the minimum value −1.00 or the maximum value of 1.00 if and only if every marginal proportion is equal to .50 (and two diagonal cells are empty).[2]

URL: https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
For example, to test the hypothesis that a random sample of 100 people has been drawn from a population in which men and women are equal in frequency, the observed number of men and women would be compared to the theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then
χ² = Σ (Observed − Expected)² / Expected = (44 − 50)²/50 + (56 − 50)²/50 = 0.72 + 0.72 = 1.44

Applying this to the contingency table above (a = 2000, b = 500, c = 1000, d = 1500):
χ² = N (ad − bc)² / [(a+b)(c+d)(a+c)(b+d)] = 5000 × (2000×1500 − 500×1000)² / (2500 × 2500 × 3000 × 2000) ≈ 833.3
φ = √(χ²/N) = √(833.3/5000) ≈ 0.41 (positive, since ad > bc)
Since φ > 0 (equivalently, lift = 0.4 / (0.6 × 0.5) ≈ 1.33 > 1), the purchase of diapers is not independent of the purchase of air freshener; the two are positively correlated.
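A hedged sketch verifying these numbers (assuming SciPy is available; correction=False disables Yates' continuity correction so that the result matches the hand calculation):

# Chi-squared test and phi coefficient for the 2x2 diaper / air-freshener table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[2000, 500],      # air freshener:      diaper, not diaper
                  [1000, 1500]])    # not air freshener:  diaper, not diaper

chi2, p, dof, expected = chi2_contingency(table, correction=False)
N = table.sum()
phi = np.sqrt(chi2 / N)
print(round(chi2, 1), round(phi, 3))   # ≈ 833.3, ≈ 0.408  -> not independent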

Q4. Given the following transactions and minimum support = 50% and minimum confidence = 80%:

Cust Id | TID  | Item_bought (in the form of brand-item_category)
01      | T100 | Venkys-Chicken, Amul-Milk, Nestle-Cheese, Britannia-Bread
02      | T200 | Britannia-Cheese, Nestle-Milk, Himalaya-Apple, Parle-Biscuit, Modern-Bread
01      | T300 | Fuji-Apple, Nestle-Milk, Modern-Bread, Parle-Biscuit
03      | T400 | Modern-Bread, Amul-Milk, Nestle-Cheese

a) At the granularity of item category (e.g., itemi could be “Milk"), create a FP-tree of the dataset.  [2]
Answer 4A:

Cust Id | TID  | Item_bought (in the form of item_category)
01      | T100 | Chicken, Milk, Cheese, Bread
02      | T200 | Cheese, Milk, Apple, Biscuit, Bread
01      | T300 | Apple, Milk, Bread, Biscuit
03      | T400 | Bread, Milk, Cheese
Support counts => chicken: 1, milk: 4, cheese: 3, bread: 4, apple: 2, biscuit: 2
Ordered: Milk, bread, cheese, apple, biscuit, chicken
L-Order: order the items in each transaction by descending support count; items with equal support counts keep the order in which they appear in the support-count list.
TID  | Items (in L-order)
T100 | Milk, Bread, Cheese, Chicken
T200 | Milk, Bread, Cheese, Apple, Biscuit
T300 | Milk, Bread, Apple, Biscuit
T400 | Milk, Bread, Cheese

The resulting FP-tree (shown in text form; each node is item:count):
root
└─ Milk:4
   └─ Bread:4
      ├─ Cheese:3
      │  ├─ Chicken:1
      │  └─ Apple:1
      │     └─ Biscuit:1
      └─ Apple:1
         └─ Biscuit:1

b) At the granularity of brand-item category (e.g., itemi could be “Amul-Milk"), list all frequent itemsets using FP-growth. [5]
Support counts => Venkys-chicken: 1, amul-milk: 2, nestle-cheese: 2, Britannia-bread: 1, Britannia-cheese: 1, Nestle-milk: 2, Himalaya-apple: 1, Parle-biscuit: 2, Modern-bread: 3, Fuji-apple: 1

L-Order: Modern-Bread: 3, Amul-Milk: 2, Nestle-Cheese: 2, Nestle-Milk: 2, Parle-Biscuit: 2, Britannia-Bread: 1, Britannia-Cheese: 1, Fuji-Apple: 1, Himalaya-Apple: 1, Venkys-Chicken: 1

L-Order: order the items in each transaction by descending support count; items with equal support counts keep the order in which they appear in the support-count list.
Cust Id | TID  | Item_bought (in L-order of brand-item_category)
01      | T100 | Amul-Milk, Nestle-Cheese, Britannia-Bread, Venkys-Chicken
02      | T200 | Modern-Bread, Nestle-Milk, Parle-Biscuit, Britannia-Cheese, Himalaya-Apple
01      | T300 | Modern-Bread, Nestle-Milk, Parle-Biscuit, Fuji-Apple
03      | T400 | Modern-Bread, Amul-Milk, Nestle-Cheese

The FP-tree at brand-item_category granularity (text form; each node is item:count):
root
├─ Amul-Milk:1
│  └─ Nestle-Cheese:1
│     └─ Britannia-Bread:1
│        └─ Venkys-Chicken:1
└─ Modern-Bread:3
   ├─ Nestle-Milk:2
   │  └─ Parle-Biscuit:2
   │     ├─ Britannia-Cheese:1
   │     │  └─ Himalaya-Apple:1
   │     └─ Fuji-Apple:1
   └─ Amul-Milk:1
      └─ Nestle-Cheese:1


...


 
...
 


Minimum-support: 50% = 2
Possible candidates for frequent itemsets come from the items meeting the threshold: { Modern-Bread: 3, Amul-Milk: 2, Nestle-Cheese: 2, Nestle-Milk: 2, Parle-Biscuit: 2 }.
Building the conditional (prefix-path) trees for each of these items and keeping every itemset whose support count meets the threshold, the frequent itemsets are:
1-itemsets: {Modern-Bread}:3, {Amul-Milk}:2, {Nestle-Cheese}:2, {Nestle-Milk}:2, {Parle-Biscuit}:2
2-itemsets: {Amul-Milk, Nestle-Cheese}:2, {Modern-Bread, Nestle-Milk}:2, {Modern-Bread, Parle-Biscuit}:2, {Nestle-Milk, Parle-Biscuit}:2
3-itemsets: {Modern-Bread, Nestle-Milk, Parle-Biscuit}:2

c) List all strong association rules derived from the frequent 3-itemsets you derived in part (b) and calculate their supports, confidences and lifts.                             [4]
d) Give one recommendation (e.g., store layout or promotion) to store management based on the association rules and item sets you discovered.                               [1]
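For parts (b)-(d), the following is a hedged sketch using the mlxtend library (assuming it is installed; TransactionEncoder, fpgrowth and association_rules are mlxtend utilities) that lists the frequent itemsets and the strong association rules at brand granularity:

# FP-growth frequent itemsets and association rules for the brand-level transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["Venkys-Chicken", "Amul-Milk", "Nestle-Cheese", "Britannia-Bread"],
    ["Britannia-Cheese", "Nestle-Milk", "Himalaya-Apple", "Parle-Biscuit", "Modern-Bread"],
    ["Fuji-Apple", "Nestle-Milk", "Modern-Bread", "Parle-Biscuit"],
    ["Modern-Bread", "Amul-Milk", "Nestle-Cheese"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)     # min support 50%
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(itemsets)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])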

Q5(a) Given the eight points below, you want to cluster these using k-means (k=3) and Manhattan distance measure. The initial clusters are C1= {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Use k-means to continue clustering the eight points until convergence.                                                                                                                  [5]                                                         

Point | A1 | A2
x1    | 2  | 10
x2    | 2  | 5
x3    | 8  | 4
x4    | 5  | 8
x5    | 7  | 5
x6    | 6  | 4
x7    | 1  | 2
x8    | 4  | 9
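No worked answer is given in the original post; the following is a minimal sketch (plain Python/NumPy, with the eight points and the stated initial clusters) of the iteration one would carry out by hand: compute each cluster's centroid as the mean of its points, reassign every point to the nearest centroid under Manhattan (L1) distance, and repeat until the assignment stops changing.

# k-means (k=3) with Manhattan distance, starting from the given initial clusters.
import numpy as np

pts = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # x1 .. x8
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # C1={x1,x2,x3}, C2={x4,x5,x6}, C3={x7,x8}

while True:
    centroids = np.array([pts[labels == j].mean(axis=0) for j in range(3)])
    dist = np.abs(pts[:, None, :] - centroids[None, :, :]).sum(axis=2)   # Manhattan distance
    new_labels = dist.argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

for j in range(3):
    print(f"Cluster {j + 1}:", [f"x{i + 1}" for i in np.where(labels == j)[0]],
          "centroid:", centroids[j])

Tracing this sketch by hand, the assignment stabilizes after a few iterations at the clusters {x1, x4, x8}, {x3, x5, x6} and {x2, x7}.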

Q5 (b) Given the 2D points in the figure below, use DBSCAN with Eps= , and MinPoints=3 (including the central point) to cluster these points using Euclidean distance measurement. List the points in each cluster. Clearly show all the steps you are taking for DBSCAN. Which of the points are outliers?                                    [6]
[The figure with the 2D points is not reproduced in this post.]

