Tuesday, August 19, 2025

20 Interview Questions - From the 'Hundred-Page Machine Learning Book' - Ch.2

To See All Interview Preparation Articles: Index For Interviews Preparation

Prompt: Extract 10 simple and 10 complex questions from this pdf with detailed answers

Here’s a set of 10 simple and 10 complex questions with detailed answers from the chapter you provided (pp. 29–42 of The Hundred-Page Machine Learning Book).
I’ve made sure the simple ones check basic understanding, while the complex ones require reasoning, synthesis, or application.


Simple Questions (with Answers)

1. What is the main difference between Linear Regression and SVM in terms of their objective?
Answer:
In Linear Regression, the model’s hyperplane is chosen to be as close as possible to all training examples, minimizing the average squared error. In SVM, the hyperplane serves as a decision boundary that separates classes and is placed to maximize the margin between them.


2. What is the loss function used in Linear Regression?
Answer:
Linear Regression uses squared error loss, which is $(f(x_i) - y_i)^2$ for each example, averaged across the dataset.
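
A minimal NumPy sketch of this criterion (the function name and the toy values are illustrative, not from the book):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors (f(x_i) - y_i)^2 over the dataset."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Example: predictions vs. targets for three points
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.3]))  # ~0.0367
```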


3. Why is squared loss preferred over absolute loss in Linear Regression?
Answer:
Squared loss has a continuous derivative, making it smoother and easier to optimize using linear algebra for closed-form solutions, unlike absolute loss, which is not smooth.


4. Is Logistic Regression used for regression tasks?
Answer:
No. Despite the name, Logistic Regression is a classification algorithm, typically for binary classification, not regression.


5. What function is used in Logistic Regression to map outputs between 0 and 1?
Answer:
The sigmoid (logistic) function:

$f(x) = \frac{1}{1 + e^{-x}}$
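
A minimal sketch of this function in NumPy:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))     # 0.5
print(sigmoid(10))    # close to 1
print(sigmoid(-10))   # close to 0
```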

6. What is the main optimization criterion for Logistic Regression?
Answer:
It uses Maximum Likelihood Estimation (MLE), maximizing the likelihood of the observed data under the model.
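
A minimal sketch of the quantity being maximized (the helper name log_likelihood and the toy data are illustrative, not from the book):

```python
import numpy as np

def log_likelihood(w, b, X, y):
    """Log-likelihood of labels y in {0, 1} under a logistic model.
    Maximum Likelihood Estimation picks w, b that maximize this value."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y=1 | x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: one feature, two examples per class (illustrative values)
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([1.0]), 0.0, X, y))   # higher is better
print(log_likelihood(np.array([0.1]), 0.0, X, y))   # a worse fit scores lower
```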


7. What does the ID3 algorithm use to determine the best split?
Answer:
ID3 uses entropy to measure uncertainty and selects the feature/threshold that minimizes the weighted average entropy after the split.
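
A minimal sketch of the entropy and weighted-split-entropy computations behind this criterion (the toy labels are illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels (in bits)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_split_entropy(left_labels, right_labels):
    """Weighted average entropy of the two subsets created by a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * entropy(left_labels) + \
           (len(right_labels) / n) * entropy(right_labels)

print(entropy([0, 0, 1, 1]))                   # 1.0 (maximum uncertainty)
print(weighted_split_entropy([0, 0], [1, 1]))  # 0.0 (a perfect split)
```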


8. What is the difference between hard-margin and soft-margin SVM?
Answer:
Hard-margin SVM requires perfect separation of the data without errors. Soft-margin SVM allows some misclassifications using hinge loss and a regularization parameter $C$ to balance margin size and classification errors.


9. What is the kernel trick in SVM?
Answer:
It’s a method to compute dot products in a higher-dimensional feature space without explicitly transforming the data, using a kernel function.


10. What does the parameter $k$ represent in k-Nearest Neighbors (kNN)?
Answer:
It represents the number of nearest neighbors considered when predicting the label for a new example.
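
A minimal kNN sketch using Euclidean distance and majority voting (the helper name knn_predict and the toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest
    training examples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # 0
```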


Complex Questions (with Answers)

1. Explain why overfitting can occur with high-degree polynomial regression, using the concepts from the text.
Answer:
High-degree polynomial regression can create a curve that fits the training data almost perfectly, capturing noise and outliers rather than the underlying pattern. This leads to poor generalization on unseen data, as shown in Fig. 2 of the text. The curve follows training points too closely, increasing variance and overfitting.
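
A small numerical illustration of this effect (the synthetic data and the chosen degrees are illustrative, not the book's example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = 2 * x + rng.normal(scale=0.2, size=x.size)   # noisy linear data

# A degree-1 fit captures the trend; a degree-8 polynomial bends to follow
# the noise, so its *training* error is much smaller while its behaviour
# between and beyond the training points degrades (overfitting).
for degree in (1, 8):
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_err:.5f}")
```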


2. Why does Logistic Regression use log-likelihood instead of raw likelihood for optimization?
Answer:
Log-likelihood simplifies the product of probabilities into a sum (via logarithms), making it easier to compute and differentiate. Since the log function is monotonically increasing, maximizing log-likelihood yields the same result as maximizing likelihood but is more numerically stable and computationally convenient.
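
A quick numerical illustration of why the log matters in practice:

```python
import numpy as np

# The product of many small probabilities underflows to 0.0 in floating
# point, while the equivalent sum of log-probabilities stays well-behaved.
probs = np.full(1000, 0.01)
print(np.prod(probs))         # 0.0  (underflow)
print(np.sum(np.log(probs)))  # about -4605.17  (finite and usable)
```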


3. How does the choice of the hyperparameter $C$ in SVM affect bias and variance?
Answer:

  • High $C$: Focuses on minimizing classification errors, leading to low bias but high variance (risk of overfitting).

  • Low $C$: Allows more misclassifications for a larger margin, increasing bias but reducing variance (better generalization).
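
To illustrate the two regimes above, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the toy data and the two C values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping classes in 2-D (illustrative data)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
# A small C tolerates more margin violations (wider margin, more support
# vectors); a large C tries harder to classify every training point correctly.
```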


4. Describe the process of finding the best split in the ID3 decision tree algorithm.
Answer:
For each feature and possible threshold, ID3:

  1. Splits the dataset into two subsets.

  2. Computes the weighted average entropy of the subsets.

  3. Selects the split (feature + threshold) with the lowest weighted entropy.
    This process is repeated recursively until stopping criteria (e.g., pure nodes, max depth) are met.
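
A minimal sketch of this split-search loop (the helper names and toy data are illustrative, not the book's implementation):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Greedy search over (feature, threshold) pairs, ID3-style:
    pick the split with the lowest weighted average entropy of the subsets."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            h = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if h < best[2]:
                best = (j, t, h)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # feature 0, threshold 3.0, weighted entropy 0.0
```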


5. Why is the kernel trick computationally advantageous compared to explicit transformation?
Answer:
Explicit transformation to high-dimensional space is computationally expensive and memory-intensive. The kernel trick computes the dot product in that space directly from the original feature vectors, avoiding the explicit mapping and thus saving time and resources.
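
A small illustration of this equivalence using a degree-2 polynomial kernel (the feature map phi and the example vectors are illustrative):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (x1, x2)."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel: computed entirely in the original space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])
print(np.dot(phi(a), phi(b)))   # 16.0 -- dot product after explicit mapping
print(poly_kernel(a, b))        # 16.0 -- same value, no mapping needed
```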


6. Explain the role of hinge loss in soft-margin SVM.
Answer:
Hinge loss, defined as $\max(0, 1 - y_i(wx_i - b))$, penalizes points on the wrong side of the margin or decision boundary. It allows the SVM to tolerate some violations (misclassifications) while still aiming to maximize the margin, balancing between training accuracy and generalization.
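
A minimal sketch of the hinge loss computation (labels in {-1, +1}; the toy points are illustrative):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y_i (w.x_i - b)), labels y in {-1, +1}."""
    margins = y * (X @ w - b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.1, 0.0]])
y = np.array([1, -1, 1])
w, b = np.array([1.0, 0.0]), 0.0
print(hinge_loss(w, b, X, y))
# The first two points lie beyond the margin (loss 0); the third lies inside
# it, contributing 1 - 0.1 = 0.9, so the average is 0.3.
```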


7. In what situations would the RBF kernel be preferred over a polynomial kernel?
Answer:
RBF kernel is preferred when:

  • The relationship between features is highly nonlinear.

  • You expect the decision boundary to be smooth rather than oscillatory.

  • The data is not well-represented by polynomial combinations of features.
    RBF maps to an infinite-dimensional space, allowing flexible, smooth boundaries.


8. How does cosine similarity differ from Euclidean distance in kNN, and when might it be preferred?
Answer:

  • Cosine similarity measures the angle between vectors, focusing on direction rather than magnitude.

  • Euclidean distance measures absolute geometric distance.
    Cosine similarity is preferred when the magnitude of features is irrelevant (e.g., text data with TF-IDF vectors) and direction (pattern) matters more.
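
A quick comparison of the two measures on toy vectors (values are illustrative):

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two "documents" with the same word proportions but different lengths,
# and a third with a different pattern (illustrative term counts).
a = np.array([1.0, 2.0, 0.0])
b = np.array([10.0, 20.0, 0.0])   # same direction as a, larger magnitude
c = np.array([0.0, 1.0, 2.0])

print(euclidean(a, b), euclidean(a, c))                   # a looks farther from b
print(cosine_similarity(a, b), cosine_similarity(a, c))   # yet a and b point the same way
```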


9. Why does the ID3 algorithm not guarantee an optimal decision tree?
Answer:
ID3 makes greedy, local decisions at each split without considering future splits. It doesn’t backtrack to evaluate whether earlier splits could lead to better overall trees, so it may get stuck in local optima.


10. Explain how kNN can be interpreted as a locally linear classifier under certain assumptions.
Answer:
With normalized feature vectors and cosine similarity, kNN predicts based on a weighted sum of feature vectors from neighbors with label 1. This sum acts like a local linear model, where the classification is based on the dot product (cosine similarity) between the summed vector and the input vector.
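
A tiny numerical check of this interpretation (the vectors are illustrative):

```python
import numpy as np

# With unit-length feature vectors and cosine similarity, summing the
# neighbors that carry label 1 gives a vector w; the score for a new input
# x is then simply the dot product w.x, i.e. a (locally) linear model.
def normalize(v):
    return v / np.linalg.norm(v)

neighbors_label1 = [normalize(np.array([1.0, 0.2])),
                    normalize(np.array([0.9, 0.4]))]
x = normalize(np.array([1.0, 0.3]))

w = np.sum(neighbors_label1, axis=0)
score_linear = np.dot(w, x)
score_cosine = sum(np.dot(n, x) for n in neighbors_label1)
print(score_linear, score_cosine)   # identical: the sum of cosine similarities
```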



Tags: Technology,Machine Learning,Interview Preparation,
