
Tuesday, August 19, 2025

20 Interview Questions - From the 'Hundred-Page Machine Learning Book' - Ch.2

To See All Interview Preparation Articles: Index For Interviews Preparation

Prompt: Extract 10 simple and 10 complex questions from this pdf with detailed answers

Here’s a set of 10 simple and 10 complex questions with detailed answers from the chapter you provided (pp. 29–42 of The Hundred-Page Machine Learning Book).
I’ve made sure the simple ones check basic understanding, while the complex ones require reasoning, synthesis, or application.


Simple Questions (with Answers)

1. What is the main difference between Linear Regression and SVM in terms of their objective?
Answer:
In Linear Regression, the model’s hyperplane is chosen to be as close as possible to all training examples, minimizing the average squared error. In SVM, the hyperplane serves as a decision boundary that separates classes and is placed to maximize the margin between them.


2. What is the loss function used in Linear Regression?
Answer:
Linear Regression uses squared error loss, which is (f(x_i) - y_i)^2 for each example, averaged across the dataset.


3. Why is squared loss preferred over absolute loss in Linear Regression?
Answer:
Squared loss has a continuous derivative, which makes the objective smooth and easy to optimize; with linear algebra it even admits a closed-form solution. Absolute loss is not smooth (its derivative is discontinuous at zero), so it is harder to work with.
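
To make the closed-form point concrete, here is a minimal sketch (assuming NumPy; the data is synthetic, not from the book) that fits a linear model by least squares and reports the averaged squared-error loss:

```python
# Closed-form least squares: w = (X^T X)^{-1} X^T y, solved via lstsq.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)    # targets with a little noise

X_b = np.hstack([X, np.ones((100, 1))])             # add a bias column
w_hat = np.linalg.lstsq(X_b, y, rcond=None)[0]      # solves the normal equations

mse = np.mean((X_b @ w_hat - y) ** 2)               # averaged squared-error loss
print(w_hat, mse)
```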


4. Is Logistic Regression used for regression tasks?
Answer:
No. Despite the name, Logistic Regression is a classification algorithm, typically for binary classification, not regression.


5. What function is used in Logistic Regression to map outputs between 0 and 1?
Answer:
The sigmoid (logistic) function:

f(x) = \frac{1}{1 + e^{-x}}
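
As a quick illustration, a minimal NumPy sketch of this function (not the book's code):

```python
# The sigmoid squashes any real input into the interval (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx. [0.0067, 0.5, 0.9933]
```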

6. What is the main optimization criterion for Logistic Regression?
Answer:
It uses Maximum Likelihood Estimation (MLE), maximizing the likelihood of the observed data under the model.


7. What does the ID3 algorithm use to determine the best split?
Answer:
ID3 uses entropy to measure uncertainty and selects the feature/threshold that minimizes the weighted average entropy after the split.


8. What is the difference between hard-margin and soft-margin SVM?
Answer:
Hard-margin SVM requires perfect separation of data without errors. Soft-margin SVM allows some misclassifications using hinge loss and a regularization parameter C to balance margin size and classification errors.


9. What is the kernel trick in SVM?
Answer:
It’s a method to compute dot products in a higher-dimensional feature space without explicitly transforming the data, using a kernel function.


10. What does the parameter k represent in k-Nearest Neighbors (kNN)?
Answer:
It represents the number of nearest neighbors considered when predicting the label for a new example.


Complex Questions (with Answers)

1. Explain why overfitting can occur with high-degree polynomial regression, using the concepts from the text.
Answer:
High-degree polynomial regression can create a curve that fits the training data almost perfectly, capturing noise and outliers rather than the underlying pattern. This leads to poor generalization on unseen data, as shown in Fig. 2 of the text. The curve follows training points too closely, increasing variance and overfitting.


2. Why does Logistic Regression use log-likelihood instead of raw likelihood for optimization?
Answer:
Log-likelihood simplifies the product of probabilities into a sum (via logarithms), making it easier to compute and differentiate. Since the log function is monotonically increasing, maximizing log-likelihood yields the same result as maximizing likelihood but is more numerically stable and computationally convenient.
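
A small sketch (assuming NumPy, with made-up probabilities) illustrating the numerical-stability point: the raw product of thousands of probabilities underflows to zero, while the sum of their logs stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.6, 0.9, size=5000)   # predicted probabilities of the observed labels

likelihood = np.prod(p)                # underflows to 0.0 for long products
log_likelihood = np.sum(np.log(p))     # stays finite and easy to differentiate

print(likelihood, log_likelihood)      # 0.0  vs. a finite negative number
```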


3. How does the choice of the hyperparameter C in SVM affect bias and variance?
Answer:

  • High C: Focuses on minimizing classification errors, leading to low bias but high variance (risk of overfitting).

  • Low C: Allows more misclassifications in exchange for a larger margin, increasing bias but reducing variance (better generalization). See the sketch after this list.
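
A minimal sketch (assuming scikit-learn; synthetic, slightly overlapping data, so the exact scores are only illustrative) comparing cross-validated accuracy at a few values of C:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)          # small C = wider margin, more bias
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C:>6}: mean accuracy {scores.mean():.3f}")
```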


4. Describe the process of finding the best split in the ID3 decision tree algorithm.
Answer:
For each feature and possible threshold, ID3:

  1. Splits the dataset into two subsets.

  2. Computes the weighted average entropy of the subsets.

  3. Selects the split (feature + threshold) with the lowest weighted entropy.
    This process is repeated recursively until stopping criteria (e.g., pure nodes, max depth) are met.
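
A minimal sketch (plain NumPy, with a toy dataset of my own, not the book's code) of one such split search:

```python
# For every feature/threshold pair, compute the weighted average entropy
# of the two resulting subsets and keep the lowest.
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    best = (None, None, np.inf)          # (feature, threshold, weighted entropy)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            w_ent = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if w_ent < best[2]:
                best = (j, t, w_ent)
    return best

X = np.array([[2.0, 1.0], [3.0, 1.5], [6.0, 0.5], [7.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # expect a split on feature 0 with zero weighted entropy
```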


5. Why is the kernel trick computationally advantageous compared to explicit transformation?
Answer:
Explicit transformation to high-dimensional space is computationally expensive and memory-intensive. The kernel trick computes the dot product in that space directly from the original feature vectors, avoiding the explicit mapping and thus saving time and resources.


6. Explain the role of hinge loss in soft-margin SVM.
Answer:
Hinge loss, defined as \max(0, 1 - y_i(w \cdot x_i - b)), penalizes points on the wrong side of the margin or decision boundary. It allows the SVM to tolerate some violations (misclassifications) while still aiming to maximize the margin, balancing between training accuracy and generalization.
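
A minimal NumPy sketch of hinge loss on a toy example (the weight vector and points are made up):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    margins = y * (X @ w - b)                 # y_i * (w·x_i − b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

X = np.array([[2.0, 1.0], [-1.0, -1.5], [0.2, 0.1]])
y = np.array([1, -1, 1])
w, b = np.array([1.0, 1.0]), 0.0
print(hinge_loss(w, b, X, y))   # only the point close to the boundary contributes
```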


7. In what situations would the RBF kernel be preferred over a polynomial kernel?
Answer:
RBF kernel is preferred when:

  • The relationship between features is highly nonlinear.

  • You expect the decision boundary to be smooth rather than oscillatory.

  • The data is not well-represented by polynomial combinations of features.
    RBF maps to an infinite-dimensional space, allowing flexible, smooth boundaries.


8. How does cosine similarity differ from Euclidean distance in kNN, and when might it be preferred?
Answer:

  • Cosine similarity measures the angle between vectors, focusing on direction rather than magnitude.

  • Euclidean distance measures absolute geometric distance.
    Cosine similarity is preferred when the magnitude of features is irrelevant (e.g., text data with TF-IDF vectors) and direction (pattern) matters more.
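
A small NumPy sketch (toy vectors) contrasting the two notions of closeness:

```python
# Two documents with the same word proportions but different lengths.
import numpy as np

a = np.array([1.0, 2.0, 0.0])     # short document
b = np.array([10.0, 20.0, 0.0])   # same pattern, 10x longer
c = np.array([2.0, 1.0, 3.0])     # different pattern

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # Euclidean: b looks far from a
print(cosine(a, b), cosine(a, c))                    # Cosine: b points the same way as a
```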


9. Why does the ID3 algorithm not guarantee an optimal decision tree?
Answer:
ID3 makes greedy, local decisions at each split without considering future splits. It doesn’t backtrack to evaluate whether earlier splits could lead to better overall trees, so it may get stuck in local optima.


10. Explain how kNN can be interpreted as a locally linear classifier under certain assumptions.
Answer:
With normalized feature vectors and cosine similarity, kNN predicts based on a weighted sum of feature vectors from neighbors with label 1. This sum acts like a local linear model, where the classification is based on the dot product (cosine similarity) between the summed vector and the input vector.



Tags: Technology,Machine Learning,Interview Preparation,

Wednesday, August 13, 2025

Q8: What is a margin in SVM?

To See All Interview Preparation Articles: Index For Interviews Preparation
Other Ques From: Ch.1 from the 'Hundred-Page Machine Learning Book'

In Support Vector Machines (SVM), the margin is the distance between the decision boundary (the separating hyperplane) and the closest data points from each class.


📌 Key Points:

  1. Support Vectors

    • The data points that lie closest to the decision boundary.

    • They "support" or define the position and orientation of the boundary.

  2. Margin Definition

    • The margin is measured as the perpendicular distance from the decision boundary to the nearest support vector from either class.

    • In a maximum-margin classifier, SVM chooses the boundary that maximizes this distance.

  3. Why Maximize the Margin?

    • A larger margin means the decision boundary is farther from any training data point, which usually improves generalization on unseen data.

    • This aligns with the structural risk minimization principle: simpler decision boundaries generalize better.


📐 Formula for Margin

For a linear SVM, the margin M is:

M = \frac{2}{\|w\|}

Where:

  • w = weight vector defining the hyperplane w \cdot x + b = 0


🖼 Intuition Example:

Imagine two parallel lines enclosing the training points of both classes:

  • The middle line between them is the decision boundary.

  • The distance from this middle line to either parallel line is half the margin.

  • SVM’s goal is to push these lines as far apart as possible while still separating the classes.
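
To see the formula in code, here is a minimal sketch (assuming scikit-learn; the tiny dataset is made up) that reads M = 2 / \|w\| off a fitted linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ≈ hard margin
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "margin =", margin)
print("support vectors:", clf.support_vectors_)
```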



Tags: Technology,Machine Learning,Interview Preparation,

Q10: How does feature dimensionality affect SVM performance?

To See All Interview Preparation Articles: Index For Interviews Preparation
Other Ques From: Ch.1 from the 'Hundred-Page Machine Learning Book'

Feature dimensionality affects SVM performance in both positive and negative ways, depending on the balance between number of features and number of training samples.


🔹 1. Positive Effect — High Dimensions Can Help

  • In higher-dimensional spaces, data points are more likely to become linearly separable.

  • This is why SVMs often perform better with more features, especially when the original space is not separable.

  • The kernel trick can implicitly create such high-dimensional feature spaces.

Example:
In 2D, two intertwined spirals may be inseparable, but in 3D after a transformation, a plane can separate them.


🔹 2. Negative Effect — Curse of Dimensionality

When the number of features greatly exceeds the number of training samples:

  • SVM might overfit — the model finds boundaries that perfectly classify the training set but fail on new data.

  • The margin can become unstable because distances in high dimensions tend to lose meaning (all points look equally far apart).

  • More dimensions = higher computation cost.


🔹 3. Impact on Margin

  • SVM tries to maximize the margin between classes.

  • In very high dimensions, the margin can shrink or become noisy if irrelevant features are present.

  • This makes feature selection and regularization (via C, gamma) crucial.


🔹 4. Practical Tips

  • If features are high-dimensional: Use regularization (C in SVM) to avoid overfitting.

  • If data is sparse: Linear SVM (with a linear kernel) often works better.

  • If many irrelevant features exist: Apply feature selection (e.g., PCA, mutual information).


Summary Table:

Dimensionality Change        | Effect on SVM     | Risk
Moderate increase            | Easier separation | Higher computation
Large increase (> samples)   | Overfitting risk  | Poor generalization
Many irrelevant features     | Noisy margin      | Reduced accuracy
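
A minimal sketch (assuming scikit-learn; synthetic data, so exact scores vary) of the "many irrelevant features" row: burying a few informative features among noise dimensions hurts an RBF SVM, and simple univariate feature selection usually recovers much of the lost accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 5 informative features buried among 200 mostly irrelevant ones, only 100 samples.
X, y = make_classification(n_samples=100, n_features=200, n_informative=5,
                           n_redundant=0, random_state=0)

plain = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
reduced = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), SVC(kernel="rbf"))

print("all 200 features:", cross_val_score(plain, X, y, cv=5).mean())
print("best 10 features:", cross_val_score(reduced, X, y, cv=5).mean())
```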


Tags: Technology,Machine Learning,Interview Preparation,

Q8: What assumptions are made about training data to ensure a model generalizes well?

To See All Interview Preparation Articles: Index For Interviews Preparation
Other Ques From: Ch.1 from the 'Hundred-Page Machine Learning Book'

What assumptions are made about training data to ensure a model generalizes well?

To ensure a model generalizes well — meaning it performs well on unseen data — machine learning makes some key assumptions about the training data:


1️⃣ Independence of Examples

  • Training samples are assumed to be independent of each other (no autocorrelation unless explicitly modeled, e.g., in time series).

  • This avoids misleading patterns caused by dependencies between observations.


2️⃣ Identically Distributed (i.i.d.)

  • All training and test data are drawn from the same probability distribution.

  • This ensures the patterns the model learns are relevant for future predictions.


3️⃣ Representative Sampling

  • The training data should be representative of the real-world population.

  • Missing subgroups or skewed sampling can cause bias and poor performance on certain cases.


4️⃣ Sufficient Size

  • The dataset should be large enough to cover the natural variability in the data.

  • Small datasets increase the risk of overfitting to noise.


5️⃣ Correct and Consistent Labels (in supervised learning)

  • Labels should be accurate; mislabeled data acts like noise and can degrade model accuracy.


💡 If any of these assumptions is violated — for example, if the training and test sets come from different distributions (dataset shift) — the model’s generalization ability can drop significantly.






Here are 5 warning signs that the generalization assumptions might be breaking down during training:


1️⃣ Large Gap Between Training and Validation Performance

  • Symptom: High accuracy (or low loss) on the training set but much worse results on validation/test sets.

  • Possible Cause: Overfitting due to non-representative training data or too much noise.


2️⃣ Sudden Drop in Performance on New Data

  • Symptom: Model works well on historical data but fails when deployed.

  • Possible Cause: Data drift — the real-world data distribution has shifted away from the training distribution.


3️⃣ High Variance Across Cross-Validation Folds

  • Symptom: Performance varies significantly from fold to fold.

  • Possible Cause: Training data might not be independent or is not evenly representative of all cases.


4️⃣ Very High Model Complexity Without Performance Gain

  • Symptom: Adding more parameters or layers increases training accuracy but validation accuracy stays the same or drops.

  • Possible Cause: Model is memorizing training data rather than learning general patterns.


5️⃣ Poor Performance on Minority Subgroups

  • Symptom: Model works well overall but fails for specific segments of the population.

  • Possible Cause: Training data underrepresents those subgroups, breaking the "representative sampling" assumption.
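
Two of these checks are easy to automate. A minimal sketch (assuming scikit-learn; synthetic data) that reports the train/validation gap (sign 1) and the fold-to-fold spread (sign 3):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# An unconstrained decision tree tends to memorize its training folds.
res = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                     cv=5, return_train_score=True)

gap = res["train_score"].mean() - res["test_score"].mean()
spread = res["test_score"].std()
print(f"train/val gap: {gap:.3f}   fold-to-fold std: {spread:.3f}")
```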



Tags: Technology,Machine Learning,Interview Preparation,

Q5: Why might perfect separation be impossible in some datasets?

To See All Interview Preparation Articles: Index For Interviews Preparation
Other ques from: Ch.1 from the 'Hundred-Page Machine Learning Book'

Why might perfect separation be impossible in some datasets?

Perfect separation might be impossible in some datasets because:


1️⃣ Overlapping Classes

  • The features for different classes may overlap in the feature space.

  • Example: Two groups of points mixed together without a clear boundary.


2️⃣ Noise in Data

  • Random variations in measurements can cause points to appear in unexpected places.

  • Example: Sensor errors or random fluctuations.


3️⃣ Labeling Errors

  • Some data points may be assigned the wrong label during data collection.

  • This creates contradictions that no decision boundary can perfectly separate.


4️⃣ Outliers

  • Extreme values can disrupt the separation, even if the majority of the data is separable.

  • Example: One mislabeled point far away from its class cluster.


5️⃣ Insufficient Features

  • The chosen features might not fully capture the differences between classes.

  • Without the right information, no model can separate them perfectly.



Tags: Technology,Machine Learning,Interview Preparation,

Q4: How can SVM handle non-linear boundaries?

To See All Interview Preparation Articles: Index For Interviews Preparation
Other ques from Ch.1 from the 'Hundred-Page Machine Learning Book'

How can SVM handle non-linear boundaries?

SVM can handle non-linear boundaries using the kernel trick.

Here’s how it works:


1️⃣ The Problem

  • In the original feature space, the data might not be linearly separable.

  • A straight line (or hyperplane in higher dimensions) can’t divide the classes without misclassifications.


2️⃣ The Idea: Map Data to a Higher-Dimensional Space

  • SVM applies a feature transformation \phi(\mathbf{x}) that maps the original data into a higher-dimensional space.

  • In this new space, the data may become linearly separable.

Example:

  • In 2D, circles and spirals can be hard to separate with a line.

  • If we map the data to 3D (adding a new dimension like x_1^2 + x_2^2), a plane might separate them perfectly.


3️⃣ The Kernel Trick

  • Instead of computing \phi(\mathbf{x}) explicitly (which can be computationally expensive), SVM uses a kernel function K(\mathbf{x}_i, \mathbf{x}_j) that computes the inner product in the transformed space directly.

  • This avoids the need to calculate the high-dimensional mapping explicitly.


4️⃣ Common Kernels

  • Polynomial Kernel: K(x, z) = (x \cdot z + c)^d

  • Radial Basis Function (RBF): K(x, z) = \exp(-\gamma \|x - z\|^2)

  • Sigmoid Kernel: K(x, z) = \tanh(\alpha x \cdot z + c)


5️⃣ Intuition

Think of kernels as a way to add non-linear “features” automatically, allowing SVM to find complex boundaries in the original space while still solving a linear problem in a hidden space.
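
A minimal sketch (assuming scikit-learn) of this idea: concentric circles are not linearly separable in 2D, but an RBF kernel separates them almost perfectly:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:>6} kernel: accuracy {score:.2f}")  # roughly 0.5 vs. close to 1.0
```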



Tags: Machine Learning,Interview Preparation,Technology

20 Interview Questions - From the 'Hundred-Page Machine Learning Book' - Ch.1

To See All Interview Preparation Articles: Index For Interviews Preparation

Prompt: Extract 10 simple and 10 complex questions from this pdf with detailed answers

Got it — I’ll pull 10 simple and 10 complex questions from the provided section (pages 7–13) and give detailed answers for each.


10 Simple Questions

1. What is Machine Learning according to the book?

Answer:
Machine learning is a subfield of computer science concerned with building algorithms that rely on examples of some phenomenon. It can also be defined as solving a practical problem by:

  1. Gathering a dataset, and

  2. Algorithmically building a statistical model based on that dataset.
    The model is then used to solve the practical problem.


2. What are the main types of learning in machine learning?

Answer:

  1. Supervised learning – learns from labeled data.

  2. Unsupervised learning – learns from unlabeled data.

  3. Semi-supervised learning – uses both labeled and unlabeled data.

  4. Reinforcement learning – learns through interactions with an environment, aiming to maximize rewards.


3. What is a feature vector?

Answer:
A feature vector is an ordered list of values (features) describing an example in the dataset. Each position corresponds to a specific attribute. Example: for a person, x^{(1)} could be height, x^{(2)} weight, and x^{(3)} gender.


4. In supervised learning, what is the goal of the model?

Answer:
The goal is to take a feature vector x as input and output information (class label, probability, etc.) that allows deducing the correct label y.


5. What is clustering?

Answer:
Clustering is an unsupervised learning task where the model assigns each data point to a group (cluster) so that points in the same cluster are similar.


6. What is a decision boundary?

Answer:
It’s the surface (line, plane, or hyperplane) in the feature space that separates different classes.


7. What is a policy in reinforcement learning?

Answer:
A policy is a function that maps the current state (feature vector) of the environment to the optimal action to take.


8. What is a margin in SVM?

Answer:
In SVM, the margin is the distance between the closest positive and negative examples, measured across the decision boundary. Maximizing the margin helps generalization.


9. What is the bag-of-words model in NLP?

Answer:
It’s a way to convert text into a feature vector by checking for the presence/absence (or frequency) of each word from a dictionary in the text.
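
A minimal sketch (plain Python, with a toy five-word dictionary) of the presence/absence variant:

```python
# Each position in the vector answers: "does this dictionary word occur in the text?"
dictionary = ["buy", "cheap", "hello", "meeting", "now"]

def bag_of_words(text):
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in dictionary]

print(bag_of_words("Buy cheap meds now"))            # [1, 1, 0, 0, 1]
print(bag_of_words("hello are we still meeting"))    # [0, 0, 1, 1, 0]
```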


10. What does the sign function do in SVM?

Answer:
It outputs +1 if its input is positive and –1 if the input is negative, allowing binary classification.




10 Complex Questions

1. How does semi-supervised learning leverage unlabeled data to improve performance?

Answer:
Even though unlabeled data lacks target labels, it provides additional information about the underlying data distribution. A larger dataset (labeled + unlabeled) better approximates the probability distribution of the phenomenon, allowing the model to make more informed decisions.


2. How does SVM find the optimal decision boundary?

Answer:
SVM formulates an optimization problem:

\min \|w\| \quad \text{s.t.} \quad y_i(w x_i - b) \ge 1

Minimizing \|w\| maximizes the margin between the two classes, ensuring better generalization. The optimal w^* and b^* define the decision boundary.


3. Why does maximizing the margin in SVM improve generalization?

Answer:
A larger margin places the decision boundary farther from the nearest training points, reducing the risk of misclassification for new, unseen examples. This is supported by the principle that simpler boundaries generalize better.


4. How can SVM handle non-linear boundaries?

Answer:
SVM can use kernels to implicitly map the data into a higher-dimensional space where a linear separation becomes possible, even if the data is not linearly separable in the original space.


5. Why might perfect separation be impossible in some datasets?

Answer:
Due to noise, labeling errors, or outliers, the classes may overlap in feature space, making a perfect decision boundary impossible without misclassifying some points.


6. How is a real-world object transformed into a machine-readable feature vector in supervised learning?

Answer:
A data analyst chooses a representation (feature engineering). Example: In spam detection, an email can be transformed into a binary vector using the bag-of-words approach, where each position represents the presence or absence of a specific word.


7. What is the role of optimization in training machine learning models?

Answer:
Optimization algorithms search for parameters (e.g., w, b) that minimize a loss function subject to constraints. In SVM, the objective is to maximize the margin while classifying all training examples correctly.


8. What assumptions are made about training data to ensure a model generalizes well?

Answer:

  • Training examples are randomly sampled.

  • They are independent of each other.

  • They follow the same distribution as the test data.


9. What is PAC learning and why is it important?

Answer:
PAC (Probably Approximately Correct) learning is a framework that studies under what conditions an algorithm will produce a model that is probably close to the correct classifier, given finite data. It relates model complexity, dataset size, and expected error.


10. How does feature dimensionality affect SVM performance?

Answer:
High-dimensional spaces can make linear separation easier, but when the number of features greatly exceeds the number of examples the curse of dimensionality sets in and the risk of overfitting grows. Regularization and feature selection help control this.



Tags: Interview Preparation,Machine Learning,Technology,