BITS WILP Machine Learning (ISZC464) Quiz-1 2017-H2
Q1.
Which of the following statements are true in the context of graphical models?
Select one or more:
a. None of the above.
b. Bayesian Belief networks describe conditional independence among subsets of variables.
c. Bayes network represents the joint probability distribution over a collection of random variables.
d. Each node denotes a random variable.
Answer: (B, C, D)
A Bayesian network, Bayes network, belief
network, Bayes(ian) model or probabilistic directed acyclic
graphical model is a probabilistic graphical
model (a type of statistical model) that represents a set of random
variables and their conditional dependencies via a directed acyclic graph (DAG).
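Concretely, a standard property of Bayes networks (stated here for reference, not quoted from the course slides) is that the graph factorizes the joint distribution as
P(X1, X2, ..., Xn) = P(X1 | Parents(X1)) * P(X2 | Parents(X2)) * ... * P(Xn | Parents(Xn)),
where Parents(Xi) are the parents of node Xi in the DAG; the missing edges are exactly what encode the conditional independence statements in option (b).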
Q2.
Assuming log base 2, the entropy of a binary feature with p(x=1) = 0.5 is
Select one:
a. 0.75
b. 0
c. 0.25
d. 1
e. 0.5
Answer: (D) It is '1'.
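As a quick check with the standard entropy formula (log base 2 as stated):
H(X) = -p(x=1)*log2 p(x=1) - p(x=0)*log2 p(x=0) = -(0.5 * -1) - (0.5 * -1) = 1 bit,
which is the maximum possible entropy for a binary feature.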
Q3.
Which of the following statements are true for the given graphical model?
Select one or more:
a. A is conditionally independent of B given C.
b. B is conditionally independent of A given C.
c. B is not conditionally independent of A given C.
d. A is not conditionally independent of B given C.
Answer: C, D
(For this graph, B is not conditionally independent of A given C, and A is not conditionally independent of B given C.)
Q4.
Let X be a random variable and let Y = aX + b, where a and b are given scalars. Which of the following statements are true? (E[Z] denotes the expected value of Z.)
Select one or more:
a. E[Y]=(a/b)*E[X]
b. E[Y] = E[X]
c. E[Y]=a*E[X]+b
d. E[Y]=a*b*E[X]
Answer: (C)
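This follows from linearity of expectation (a standard property of E[.]):
E[Y] = E[aX + b] = a*E[X] + E[b] = a*E[X] + b,
since the expectation of the constant b is just b.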
Q5.
When can we use the Expectation Maximization (EM) algorithm?
Select one or more:
a. None of these.
b. Unsupervised clustering (target value unobservable).
c. Data is only partially observable.
d. Supervised Learning (some instance attributes unobservable).
Answer: B, C, D
Q6.
Which of the following statements are true?
Select one or more:
a. Maximum a Posteriori estimation seeks the estimate of θ that is most probable, given the observed data, plus background assumptions about its value.
b. Maximum Likelihood estimation seeks the estimate of θ that is most probable, given the observed data, plus background assumptions about its value.
c. Maximum Likelihood estimation seeks an estimate of θ that maximizes the probability of the observed data.
d. Maximum a Posteriori estimation seeks an estimate of θ that maximizes the probability of the observed data.
Answer: A, C
In MLE we have no prior knowledge; for example, in a coin-toss experiment we assume nothing in advance about whether the coin is biased or unbiased, and we arrive at θ purely from the observed data.
In MAP, we additionally incorporate our prior knowledge about θ:
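In symbols (standard definitions, where D is the observed data):
theta_MLE = argmax over theta of P(D | theta)
theta_MAP = argmax over theta of P(D | theta) * P(theta)
Here P(theta) is the prior that encodes the background assumptions about theta; MLE is the special case of MAP with a uniform (uninformative) prior.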
Q7.
If X is a vector of n boolean attributes and Y is a boolean-valued label, how many different functions are possible? (2^n means 2 raised to the power n.)
Answer: 2^(2^n)
If X has two attributes x1 and x2, then the observations one has to take are (x1=0, x2=0, y), (x1=0, x2=1, y), (x1=1, x2=0, y), (x1=1, x2=1, y). With n attributes there are 2^n possible input states, i.e. 2^n rows in the truth table. A function assigns 0 or 1 to Y in each of those rows, so Y can be viewed as a vector of length 2^n, and the number of such vectors (and hence functions) is 2^(2^n).
X1, X2, Y
Function: 1
0, 0, 0
0, 1, 0
1, 0, 0
1, 1, 0
Function: 2
0, 0, 0
0, 1, 0
1, 0, 0
1, 1, 1
Run 1:
Input: X1, X2
0, 0
0, 1
1, 0
1, 1
Output: 0,0,0,0
Run 2:
Input: X1, X2
0, 0
0, 1
1, 0
1, 1
Output: 0,0,0,1
Run 3:
Input: X1, X2
0, 0
0, 1
1, 0
1, 1
Output: 0,0,1,0
Run 4:
Input: X1, X2
0, 0
0, 1
1, 0
1, 1
Output: 0,0,1,1
Run 5: Output: 0,1,0,0. Run 6: Output: 0,1,0,1. Run 7:
Output: 0,1,1,0. Run 8: Output: 0,1,1,1
Run 9: Output: 1,0,0,0. Run 10: Output: 1,0,0,1. Run 11:
Output: 1,0,1,0. Run 12: Output: 1,0,1,1
Run 13: Output: 1,1,0,0. Run 14: Output: 1,1,0,1. Run 15:
Output: 1,1,1,0. Run 16: Output: 1,1,1,1
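A minimal Python sketch (not part of the quiz) that reproduces this count for n = 2:

from itertools import product

n = 2
states = list(product([0, 1], repeat=n))                # the 2^n input states
functions = list(product([0, 1], repeat=len(states)))   # one output bit per state

print(len(states))     # 4 input states for n = 2
print(len(functions))  # 16 = 2^(2^2) distinct boolean functions, matching the 16 runs above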
Q8.
Which of the following statements are true?
Select one or more:
a. To infer posterior probability, Bayesian linear regression uses Naïve Bayes principle.
b. None of these
c. Bayesian linear regression cannot be used for classification.
d. In Bayesian linear regression Prior can be used for regularization.
Answer: A, D
(In Bayesian linear regression the prior can be used for regularization; to infer the posterior probability, Bayesian linear regression uses the Naïve Bayes principle.)
...
...
The course slides show the derivation of the posterior probability using Bayes' theorem, where all the probabilities involved are represented by multivariate Gaussian distributions.
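As a rough sketch of why the prior acts as a regularizer (a standard Bayesian linear regression result, stated from memory rather than taken from the slides): with a Gaussian likelihood and a zero-mean Gaussian prior on the weights w, the log-posterior is
log P(w | D) = -(1/(2*sigma^2)) * ||y - Xw||^2 - (alpha/2) * ||w||^2 + constant,
so maximizing the posterior (the MAP estimate) is equivalent to minimizing squared error plus an L2 penalty on w, i.e. ridge regression.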
Q9.
Which of the following statements are true for the given graphical model?
Select one or more:
a. B is not conditionally independent of A given C.
b. A is conditionally independent of B given C.
c. B is conditionally independent of A given C.
d. A is not conditionally independent of B given C.
Answer: B, C
Proof (assuming the chain structure A -> C -> B implied by this factorization):
P(A,B | C) = P(A,B,C)/P(C) = P(A)*P(C|A)*P(B|C)/P(C) = P(A|C) * P(B|C)
Hence P(A,B | C) = P(A|C) * P(B|C), i.e. A and B are conditionally independent given C.
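A small numeric check in Python, assuming the chain structure A -> C -> B used in the proof above; the probability values are made up purely for illustration:

# joint P(A,B,C) for the chain A -> C -> B is P(A) * P(C|A) * P(B|C)
pA = {0: 0.6, 1: 0.4}
pC_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # pC_given_A[a][c]
pB_given_C = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # pB_given_C[c][b]

def joint(a, b, c):
    return pA[a] * pC_given_A[a][c] * pB_given_C[c][b]

for c in (0, 1):
    pC = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1))
    for a in (0, 1):
        for b in (0, 1):
            lhs = joint(a, b, c) / pC                              # P(A,B | C)
            p_a_c = sum(joint(a, bb, c) for bb in (0, 1)) / pC     # P(A | C)
            p_b_c = sum(joint(aa, b, c) for aa in (0, 1)) / pC     # P(B | C)
            assert abs(lhs - p_a_c * p_b_c) < 1e-12
print("P(A,B|C) = P(A|C) * P(B|C) holds for every combination")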
Q10.
Smoothing can be used in which of the following cases:
Select one or more:
a. When likelihood estimates zero probability
b. When test error and training error are very different
c. None of the above
d. When learning algorithm result in very rough function
Answer: A
(When likelihood estimates zero probability.)
Could someone explain Laplacian smoothing (also called add-one smoothing)?
Ans: Suppose you are looking at outcomes of a die. Let us say
you get the following outcomes of each number, in 10 throws:
One : 1
Two : 3
Three : 1
Four : 0
Five : 3
Six : 2
Now, the probabilities without the smoothing are
One : 1/10
Two : 3/10
Three : 1/10
Four : 0/10
Five : 3/10
Six : 2/10
The sums of probabilities is (of course) 1.
To smooth these out, we add 1 to each numerator. Now we need to add something to the denominator so that the probabilities still sum to 1.
So,
(1+1+3+1+1+1+0+1+3+1+2+1) / (10+K) = 1
This gives K=6. Now note that if you had zero throws, the probabilities are all 1/6. These are called the "prior probabilities" - our prior assumption of the outcomes. We initially believe all of them are equally likely.
And K=6 is essentially the number of classes! The smoothed probabilities are then 2/16, 4/16, 2/16, 1/16, 4/16 and 3/16, so no outcome is left with zero probability.
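The same add-one smoothing as a short Python sketch (illustrative only):

counts = {"One": 1, "Two": 3, "Three": 1, "Four": 0, "Five": 3, "Six": 2}
total = sum(counts.values())   # 10 throws
k = len(counts)                # 6 classes

raw = {face: c / total for face, c in counts.items()}
smoothed = {face: (c + 1) / (total + k) for face, c in counts.items()}

print(raw)                      # "Four" gets probability 0, which breaks likelihood products
print(smoothed)                 # 2/16, 4/16, 2/16, 1/16, 4/16, 3/16 -- no zeros left
print(sum(smoothed.values()))   # still 1.0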
Q11.
Which of the following statements are true in the context of decision trees?
Select one or more:
a. Capable of classifying non-linearly separable data.
b. None of these.
c. Capable of classifying linearly separable data.
d. It is always possible to get zero training error.
Answer: A, C, D
“Zero training error” means the decision tree can produce a model that gives the correct output for every training example (provided no two identical examples carry conflicting labels).
An attribute is a discrete-valued variable. While traversing a decision tree downwards based on attribute values, it is always possible to arrive at a leaf with a label (a decision such as yes or no).
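A small sketch, assuming scikit-learn is available, showing a decision tree reaching zero training error on XOR data, which no linear separator can classify:

from sklearn.tree import DecisionTreeClassifier

# XOR: not linearly separable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier()
clf.fit(X, y)
print(clf.score(X, y))   # 1.0, i.e. zero training error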
Q12.
Let the probability of a disease be 1 in 10,000 and the test accuracy for the disease be 99%. Let event A be the event that you have this disease, and event B be the event that you test positive. Given that the test is positive, what is the probability that the disease is actually present? Precisely, you need to calculate the probability P(A|B).
Select one:
a. 0.0990
b. 0.0988
c. 0.9902
d. 0.0098
Answer: (D = 0.0098)
...
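The working as a short Python check, reading "99% accuracy" as both P(B|A) = 0.99 and P(B|not A) = 0.01, which is what the answer key assumes:

p_disease = 1 / 10000           # P(A)
p_pos_given_disease = 0.99      # P(B|A)
p_pos_given_healthy = 0.01      # P(B|not A), the false positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))               # P(B), total probability

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos   # Bayes theorem
print(round(p_disease_given_pos, 4))   # 0.0098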
Q13.
In the context of bias-variance decomposition, which of the following statements are true?
Select one or more:
a. High bias implies high variance in the out-of-sample error.
b. High variance implies less bias in the out-of-sample error.
c. Less bias implies less variance in the out-of-sample error.
d. Bias-variance analysis helps us to quantify the out-of-sample error.
Answer: B, D
(From L6-Part2 last slide)
(URL: http://www.stat.cmu.edu/~ryantibs/advmethods/notes/errval.pdf)
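For reference, the standard textbook decomposition of the expected out-of-sample squared error (not copied from the slides):
E[(y - f_hat(x))^2] = Bias^2 + Variance + irreducible noise,
where Bias = E[f_hat(x)] - f(x) and Variance = E[(f_hat(x) - E[f_hat(x)])^2]. Complex models tend to trade lower bias for higher variance and vice versa, which is why options (b) and (d) are marked true.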
Q14.
In the context of linear regression, which of the following statements are true?
Select one or more:
a. You can use linear regression for classification.
b. It is not possible to get zero training error, if there are few samples used in training.
c. It is not possible to get zero test error, if there are few samples used in training.
d. You cannot use linear regression for classification.
Answer: A, C
Training error is the error you get when you run the trained model back on the training data. Remember that although this data has already been used to train the model, that does not necessarily mean the trained model will perform accurately when applied back to the training data itself.
Test error is the error when you get when you run the trained model on a set of data that it has previously never been exposed to. This data is often used to measure the accuracy of the model before it is shipped to production.
………………………………………………
URL: https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression
QUESTION: Some material I've seen on machine learning said
that it's a bad idea to approach a classification problem through regression.
But I think it's always possible to do a continuous regression to fit the data
and truncate the continuous prediction to yield discrete classifications. So
why is it a bad idea?
ANSWER:
"..approach
classification problem through regression.." by
"regression" I will assume you mean linear regression, and I will
compare this approach to the "classification" approach of fitting a
logistic regression model.
Before we do this, it is important to clarify the
distinction between regression and classification models. Regression models
predict a continuous variable, such as rainfall amount or sunlight intensity.
They can also predict probabilities, such as the probability that an image
contains a cat. A probability-predicting regression model can be used as part
of a classifier by imposing a decision rule - for example, if the probability
is 50% or more, decide it's a cat.
Logistic regression predicts probabilities, and is
therefore a regression algorithm. However, it is commonly described as a
classification method in the machine learning literature, because it can be
(and is often) used to make classifiers. There are also "true"
classification algorithms, such as SVM, which only predict an outcome and do
not provide a probability. We won't discuss this kind of algorithm here.
Linear vs. Logistic Regression on Classification Problems
As Andrew Ng explains it, with linear regression you fit a polynomial through the data; in his example, a straight line is fitted through a {tumor size, tumor type} sample set:
In that example, malignant tumors get the label 1 and non-malignant ones get 0, and the fitted line is our hypothesis h(x). To make predictions we may say that for any given tumor size x, if h(x) is greater than 0.5 we predict a malignant tumor, otherwise we predict benign.
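A minimal Python sketch of that idea with made-up tumor sizes (illustrative only, not Andrew Ng's data):

import numpy as np

sizes = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])   # tumor sizes (made up)
labels = np.array([0, 0, 0, 1, 1, 1])              # 0 = benign, 1 = malignant

# fit a straight line h(x) = w1*x + w0 by least squares
w1, w0 = np.polyfit(sizes, labels, 1)

def predict(x):
    return 1 if w1 * x + w0 >= 0.5 else 0   # threshold the regression output at 0.5

print([predict(x) for x in sizes])   # [0, 0, 0, 1, 1, 1] on this toy data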
Q15.
Which of the following statements are true?
Select one or more:
a. None of these.
b. In multi-class classification, the one-versus-the-rest method results in ambiguous regions.
c. The goal in classification is to take an input vector x and assign it to one of K discrete classes.
d. A discriminant function maps each input x directly onto a class label.
Answer: B, C, D
...
...
...