Showing posts with label Regression. Show all posts

Friday, March 27, 2026

Supervised Machine Learning - Regression and Classification (at DeepLearning.ai)

View Course on DeepLearning.AI    View Other Courses Audited By Us


Download Lecture Slides




Quizzes

Week 1

1.1: Supervised vs unsupervised learning

1.2: Regression

1.3: Train the model with gradient descent


Week 2

2.1: Multiple linear regression

2.2: Gradient descent in practice


Week 3

3.1: Classification with logistic regression

3.2: Cost function for logistic regression

3.3: Gradient descent for logistic regression

3.4: The problem of overfitting

Tags: Machine Learning,Classification,Regression

Wednesday, May 22, 2024

Four Practice Problems on Linear Regression (Taken From Interviews For Data Scientist Role)

To See All Interview Preparation Articles: Index For Interviews Preparation
To watch our related video on: YouTube

Previous Videos

Question (1): Asked At Ericsson

  • You are given the data generated for the following equation:
  • y = (x^9)*3
  • Can you apply linear regression to learn from this data?

Solution (1)

Equation of a line: y = mx + c

The equation we are given is of the form y = (x^m)c (here m = 9, c = 3).

Taking the log of both sides:

log(y) = log((x^m)c)

Applying the product rule of logarithms:

log(y) = log(x^m) + log(c)

Applying the power rule of logarithms:

log(y) = m·log(x) + log(c)

Substituting Y = log(y), X = log(x), and C = log(c):

Y = mX + C

So the answer is 'yes': after a log-log transformation the relationship is linear, and ordinary linear regression can learn m and c.
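The derivation can be sanity-checked numerically. Here is a minimal sketch (using NumPy's polyfit rather than scikit-learn) that recovers m = 9 and c = 3 from data generated by the given equation:

```python
import numpy as np

# Synthetic data from y = (x^9) * 3, the equation in the question
x = np.linspace(1, 10, 50)
y = 3 * x**9

# Log-transform both sides: log(y) = m*log(x) + log(c)
X = np.log(x)
Y = np.log(y)

# Ordinary least squares (degree-1 fit) on the transformed data
m, C = np.polyfit(X, Y, 1)
c = np.exp(C)

print(f"Recovered exponent m = {m:.4f}, coefficient c = {c:.4f}")
```

The fit in log-log space is exact here because the data is noise-free, so the recovered parameters match m = 9 and c = 3 to numerical precision.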

Question (2): Infosys – Digital Solution Specialist

  • If you do linear regression in 3D, what do you get?

Solution (2)

When you perform linear regression on 3D data, you are essentially fitting a plane to a set of data points in three-dimensional space. The general form of the equation for a plane in three dimensions is:

z=ax+by+c

Here:

z is the dependent variable you are trying to predict.

x and y are the independent variables.

a and b are the coefficients that determine the orientation of the plane.

c is the intercept.

Example:

Suppose you have data points (1,2,3), (2,3,5), (3,4,7), and you fit a linear regression model to this data. These points satisfy z = x + y exactly, so one plane that fits them perfectly is z = 1·x + 1·y + 0. The plane's equation tells you how z changes as x and y change.

In summary, performing linear regression on 3D data gives you a plane in three-dimensional space that best fits your data points in the least squares sense. This plane can then be used to predict new z values given new x and y values.

Generalizing a bit further

  • If you do linear regression in N dimensions, you get a hyperplane of dimension N−1: a line in 2D, a plane in 3D, and so on.
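The 3D case can be checked numerically. A minimal sketch using NumPy's least squares on the three example points (note that x and y happen to be collinear here, y = x + 1, so the individual coefficients are not unique, though the fitted values are exact):

```python
import numpy as np

# The three example points (x, y, z) from the solution above
pts = np.array([[1, 2, 3],
                [2, 3, 5],
                [3, 4, 7]], dtype=float)

# Design matrix with a column of ones for the intercept: z ≈ a*x + b*y + c
A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
z = pts[:, 2]

# Least-squares solution; with collinear columns, lstsq returns the
# minimum-norm solution, whose predictions still reproduce z exactly
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

print("Fitted predictions:", A @ coef)
```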

Question (3): Infosys – Digital Solution Specialist

  • How do you tell if there is linearity between two variables?

Solution (3)

Determining if there is linearity between two variables involves several steps, including visual inspection, statistical tests, and fitting a linear model to evaluate the relationship. Here are the main methods you can use:

1. Scatter Plot

Create a scatter plot of the two variables. This is the most straightforward way to visually inspect the relationship.

Linearity: If the points roughly form a straight line (either increasing or decreasing), there is likely a linear relationship.

Non-linearity: If the points form a curve, cluster in a non-linear pattern, or are randomly scattered without any apparent trend, there is likely no linear relationship.

2. Correlation Coefficient

Calculate the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables.

Pearson Correlation Coefficient (r): Ranges from -1 to 1.

r≈1 or r≈−1: Strong linear relationship (positive or negative).

r≈0: Weak or no linear relationship.

3. Fitting a Linear Model

Fit a simple linear regression model to the data.

Model Equation: y = β₀ + β₁·x + ε

where y is the dependent variable, x is the independent variable, β₀ is the intercept, β₁ is the slope, and ε is the error term.

4. Residual Analysis

Examine the residuals (differences between observed and predicted values) from the fitted linear model.

Residual Plot: Plot residuals against the independent variable or the predicted values.

Linearity: Residuals are randomly scattered around zero.

Non-linearity: Residuals show a systematic pattern (e.g., curve, trend).

5. Statistical Tests

Perform statistical tests to evaluate the significance of the linear relationship.

t-test for Slope: Test if the slope (β₁) is significantly different from zero.

Null Hypothesis (H₀): β₁ = 0 (no linear relationship).

Alternative Hypothesis (H₁): β₁ ≠ 0 (linear relationship exists).

p-value: If the p-value is less than the chosen significance level (e.g., 0.05), reject H₀ and conclude that a significant linear relationship exists.

6. Coefficient of Determination (R²)

Calculate the R² value, which indicates the proportion of variance in the dependent variable explained by the independent variable.

R² Value: Ranges from 0 to 1.

Closer to 1: Indicates a strong linear relationship.

Closer to 0: Indicates a weak or no linear relationship.

Example:

Suppose you have two variables, x and y.

Scatter Plot: You plot x vs. y and observe a straight-line pattern.

Correlation Coefficient: You calculate the Pearson correlation coefficient and find r=0.85, indicating a strong positive linear relationship.

Fitting a Linear Model: You fit a linear regression model y=2+3x.

Residual Analysis: You plot the residuals and observe they are randomly scattered around zero, indicating no pattern.

Statistical Tests: The t-test for the slope gives a p-value of 0.001, indicating the slope is significantly different from zero.

R² Value: You calculate R² = 0.72, meaning 72% of the variance in y is explained by x.

Based on these steps, you would conclude there is a strong linear relationship between x and y.
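The checks above can be sketched on synthetic data (NumPy only; the data-generating equation y = 2 + 3x plus noise is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with a known linear relationship plus noise: y = 2 + 3x + ε
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Fit a simple linear model y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)

# Residuals and coefficient of determination R^2
residuals = y - (b0 + b1 * x)
r_squared = 1 - residuals.var() / y.var()

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}, R^2 = {r_squared:.3f}")
```

On this data, r and R² come out close to 1 and the recovered slope is close to 3, confirming a strong linear relationship; a residual plot of `residuals` against `x` would show random scatter around zero.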

Question (4): TCS and Infosys (DSS)

  • What is the difference between Lasso regression and Ridge regression?

Solution (4)

Lasso and Ridge regression are both techniques used to improve the performance of linear regression models, especially when dealing with multicollinearity or when the number of predictors is large compared to the number of observations. They achieve this by adding a regularization term to the loss function, which penalizes large coefficients. However, they differ in the type of penalty applied:

Ridge Regression:

  • Penalty Type: L2 norm (squared magnitude of coefficients)
  • Objective Function: Minimizes the sum of squared residuals plus the sum of squared coefficients multiplied by a penalty term λ:
    min( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ βⱼ² )
    Here, λ is the regularization parameter, yᵢ are the observed values, ŷᵢ are the predicted values, and βⱼ are the coefficients.
  • Effect on Coefficients: Shrinks coefficients towards zero but does not set any of them exactly to zero. As a result, all predictors are retained in the model.
  • Use Cases: Useful when you have many predictors that are all potentially relevant to the model, and you want to keep all of them but shrink their influence.

Lasso Regression:

  • Penalty Type: L1 norm (absolute magnitude of coefficients)
  • Objective Function: Minimizes the sum of squared residuals plus the sum of absolute values of coefficients multiplied by a penalty term λ:
    min( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ |βⱼ| )
    Here, λ is the regularization parameter, yᵢ are the observed values, ŷᵢ are the predicted values, and βⱼ are the coefficients.
  • Effect on Coefficients: Can shrink some coefficients exactly to zero, effectively performing variable selection. This means that it can produce a sparse model where some predictors are excluded.
  • Use Cases: Useful when you have many predictors but you suspect that only a subset of them are actually important for the model. Lasso helps in feature selection by removing irrelevant predictors.

Key Differences:

  1. Type of Regularization:

    • Ridge: L2 regularization (squared magnitude of coefficients)
    • Lasso: L1 regularization (absolute magnitude of coefficients)
  2. Effect on Coefficients:

    • Ridge: Tends to shrink coefficients uniformly, but none are set exactly to zero.
    • Lasso: Can shrink some coefficients to exactly zero, leading to a sparse model.
  3. Use Cases:

    • Ridge: Better when you want to retain all predictors and control their magnitude.
    • Lasso: Better when you want to perform feature selection and eliminate some predictors.
  4. Computational Complexity:

    • Ridge: Generally simpler to compute because the penalty term is differentiable everywhere.
    • Lasso: Can be more computationally intensive because the penalty term is not differentiable at zero, requiring more sophisticated optimization techniques.

Elastic Net:

As a side note, there is also the Elastic Net method, which combines both L1 and L2 penalties. It is useful when you want the benefits of both Ridge and Lasso regression:

Objective Function: min( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ₁ Σⱼ₌₁ᵖ |βⱼ| + λ₂ Σⱼ₌₁ᵖ βⱼ² )

Here, λ₁ and λ₂ control the L1 and L2 penalties, respectively. This method can select variables like Lasso and shrink coefficients like Ridge.

In summary, Ridge regression is ideal when you want to shrink coefficients without eliminating any, while Lasso regression is useful for creating simpler, more interpretable models by removing some predictors entirely.
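A short sketch contrasting the two penalties on synthetic data (the feature setup and alpha values are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# 100 samples, 5 features; only the first two actually influence y
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # all shrunk, none exactly zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant features driven to zero
```

Ridge keeps all five coefficients nonzero while shrinking them, whereas Lasso sets the coefficients of the three irrelevant features exactly to zero, performing feature selection as described above.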

Tags: Interview Preparation,Machine Learning,Regression

Tuesday, April 30, 2024

MCQs and Complex Scenario Based Questions on Decision Tree Regressor

Index For Job Interviews Preparation
Decision Tree to Model: Years-of-Experience vs. Salary 

The prediction surface from a DecisionTreeRegressor:

Share some MCQs on DecisionTreeRegressor

Multiple Choice Questions on DecisionTreeRegressor

Here are some multiple choice questions (MCQs) to test your understanding of DecisionTreeRegressor in scikit-learn:

  1. Which of the following is NOT a splitting criterion used by DecisionTreeRegressor? a) Mean squared error (MSE) b) Information gain c) Gini impurity d) Root mean squared error (RMSE)

  2. What parameter in DecisionTreeRegressor controls the maximum depth of the tree? a) min_samples_split b) max_depth c) min_samples_leaf d) criterion

  3. What is the purpose of the min_samples_split parameter in DecisionTreeRegressor? a) It determines the minimum number of samples required to split a node. b) It specifies the minimum number of samples required at a leaf node. c) It sets the maximum depth of the tree. d) It defines the splitting criterion used for decision making.

  4. What is the advantage of using DecisionTreeRegressor compared to linear regression models for certain tasks? a) Decision trees can capture non-linear relationships between features and the target variable. b) Decision trees are always more accurate than linear regression models. c) Decision trees require less data to train compared to linear regression models. d) Decision trees are inherently more interpretable than linear regression models.

  5. What is a potential drawback of decision trees? a) Decision trees are not suitable for handling categorical features. b) Decision trees can suffer from overfitting, especially with high-dimensional data. c) Decision trees are computationally expensive to train compared to linear regression models. d) Decision trees cannot be used for regression tasks.

Answer Key:

  1. d) Root mean squared error (RMSE) (RMSE is an evaluation metric, not a splitting criterion; note that Gini impurity and information gain are classification criteria, while DecisionTreeRegressor splits on error-based criteria such as squared error)
  2. b) max_depth
  3. a) It determines the minimum number of samples required to split a node.
  4. a) Decision trees can capture non-linear relationships between features and the target variable.
  5. b) Decision trees can suffer from overfitting, especially with high-dimensional data.

What is the advantage of setting "min_samples_split"?

The main advantage of setting min_samples_split in a DecisionTreeRegressor is to prevent overfitting and improve the model's generalization ability. Here's how it works:

  • Decision Trees and Overfitting: Decision trees can potentially overfit the training data, especially with high-dimensional datasets. This means the tree becomes too specific to the training data and performs poorly on unseen data.
  • Role of min_samples_split: This parameter sets a minimum number of samples required to split a node (decision point) in the tree. By requiring a minimum number of samples at each split, you avoid creating overly specific branches based on noise or small fluctuations in the data. This helps to:
    • Reduce Variance: By ensuring a minimum number of samples at each split, the tree is less likely to make decisions based on random fluctuations in the training data. This leads to a more stable and generalizable model.
    • Prevent Leaf Nodes with Few Samples: With min_samples_split, you prevent creating leaf nodes (terminal nodes) with very few data points. These leaf nodes might not represent the broader trends in the data and can lead to poor predictions for unseen data.

In essence, setting min_samples_split helps the decision tree focus on learning robust patterns from the data instead of memorizing specific details that might not generalize well to unseen examples.

Here are some additional points to consider:

  • The optimal value for min_samples_split depends on the size and complexity of your dataset. Experiment with different values and evaluate the model's performance on a held-out test set to find the best setting.
  • There's a trade-off between preventing overfitting and reducing training efficiency. A very high min_samples_split could lead to underfitting (failing to capture important patterns), while a very low value might not provide enough protection against overfitting.
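One way to see this effect is to compare tree sizes under different settings; a small sketch on synthetic noisy data (the dataset and the values 2 vs. 50 are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Noisy sine wave: plenty of small fluctuations for an unconstrained tree to memorize
X = rng.uniform(0, 10, 300).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 300)

loose = DecisionTreeRegressor(min_samples_split=2, random_state=0).fit(X, y)
strict = DecisionTreeRegressor(min_samples_split=50, random_state=0).fit(X, y)

print("Leaves with min_samples_split=2: ", loose.get_n_leaves())
print("Leaves with min_samples_split=50:", strict.get_n_leaves())
```

The unconstrained tree grows a leaf for nearly every training point (memorizing noise), while the constrained tree has far fewer leaves and captures only the broader sine-shaped trend.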

What other parameters are there for tuning a DecisionTreeRegressor?

DecisionTreeRegressor in scikit-learn offers several parameters for tuning the model's behavior and performance. Here are some key parameters beyond min_samples_split that you can adjust to optimize your decision tree:

Stopping Criteria:

  • max_depth: This parameter controls the maximum depth of the tree. A deeper tree can capture more complex relationships but is also more prone to overfitting. Experiment with different values to find the sweet spot between complexity and generalization.
  • min_samples_leaf: This sets the minimum number of samples required at a leaf node. Similar to min_samples_split, it helps prevent overfitting by avoiding overly specific leaf nodes based on few data points.

Splitting Criteria:

  • criterion: This defines the function used to measure the quality of a split at each node. For DecisionTreeRegressor the options include "squared_error" (called "mse" in older scikit-learn versions), "friedman_mse", "absolute_error", and "poisson"; "gini" (Gini impurity) applies to classification trees, not regressors. You can experiment with different criteria to see which one improves performance for your specific problem.

Regularization:

  • max_features: This parameter controls the number of features considered at each split. A lower value can help reduce overfitting by preventing the tree from focusing on irrelevant features. However, setting it too low might prevent the tree from capturing important relationships.

Other Parameters:

  • random_state: Setting a random seed for random_state ensures reproducibility when training the decision tree multiple times with different hyperparameter configurations.

Tuning Techniques:

Once you've identified relevant parameters, consider techniques like grid search or randomized search to efficiently explore different parameter combinations and find the optimal configuration for your dataset.

By understanding and adjusting these parameters, you can effectively tune your DecisionTreeRegressor to achieve better performance and prevent overfitting. Remember to evaluate the model's performance on a held-out test set to avoid overfitting to the training data.
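A minimal grid-search sketch over two of these parameters (the parameter grid and synthetic dataset are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic regression problem: noisy sine wave
X = rng.uniform(0, 10, 300).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 300)

# Candidate values for two key hyperparameters
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validated grid search; scoring defaults to R^2 for regressors
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV score (R^2): {search.best_score_:.3f}")
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of parameter combinations, which scales better when the grid is large.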

Share some scenario based questions for DecisionTreeRegressor.

Scenario-Based Questions for DecisionTreeRegressor:

Here are some questions that explore using DecisionTreeRegressor in various scenarios:

  1. Customer Churn Prediction: You're building a model to predict customer churn for a telecommunications company. The features include tenure with the company, monthly call volume, service plan type, and customer satisfaction score. How can you use DecisionTreeRegressor to identify the key factors influencing customer churn?

    • Solution:
      • Train a DecisionTreeRegressor with the target variable being a binary indicator of churn (0 for not churned, 1 for churned).
      • Analyze the feature importance scores provided by the model to understand which features (tenure, call volume, etc.) have the most significant influence on churn predictions.
      • Visualize the decision tree to gain insights into the decision-making process for churn prediction.
  2. Housing Price Estimation: You're developing a model to estimate housing prices based on features like square footage, number of bedrooms, location, and year built. How can you handle potential non-linear relationships between features and price using DecisionTreeRegressor?

    • Solution:
      • Decision trees can inherently capture non-linear relationships by splitting the data based on feature values. Train the DecisionTreeRegressor to predict the continuous target variable (housing price).
      • Evaluate the model's performance on a held-out test set using metrics like mean squared error or R-squared.
      • Consider comparing the performance of DecisionTreeRegressor with other models like Random Forest or Support Vector Machines with suitable kernels, which might be better suited for non-linear problems with larger datasets.
  3. Anomaly Detection in Sensor Data: You're analyzing sensor data from a factory to detect anomalous readings that might indicate equipment malfunctions. The sensor readings are continuous numerical values. How can you use DecisionTreeRegressor for anomaly detection?

    • Solution:
      • Train a DecisionTreeRegressor to predict the typical sensor readings based on historical data.
      • Identify data points that fall outside the decision tree's predictions (deviations from the learned patterns) as potential anomalies.
      • Fine-tune the model parameters (min_samples_split, max_depth) to balance capturing normal variations and identifying true anomalies.

Remember, these are just a few examples. DecisionTreeRegressor can be applied to various scenarios where understanding the decision-making process and capturing non-linear relationships add value to the modeling task.

Can you show through code the use case of Anomaly Detection in Sensor Data using DecisionTreeRegressor?

Certainly, here's Python code demonstrating anomaly detection in sensor data using DecisionTreeRegressor:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load sensor data (replace with your actual data)
data = pd.read_csv("sensor_data.csv")
sensor_readings = data["sensor_reading"]  # Feature (sensor readings)

# Split data into training and testing sets
X_train, X_test = train_test_split(sensor_readings, test_size=0.2, random_state=42)

# Train DecisionTreeRegressor to predict typical sensor readings
# (a limited max_depth keeps the tree from simply memorizing every reading)
model = DecisionTreeRegressor(max_depth=5)
model.fit(X_train.values.reshape(-1, 1), X_train)  # Reshape for single feature

# Predict sensor readings on the test set
predicted_readings = model.predict(X_test.values.reshape(-1, 1))

# Define threshold for anomaly detection (adjust based on data)
threshold = 0.1  # Example threshold; could be a standard deviation or a percentile

# Detect anomalies based on prediction errors
anomalies = []
for i, (reading, prediction) in enumerate(zip(X_test, predicted_readings)):
    error = abs(reading - prediction)
    if error > threshold:
        anomalies.append(i)

# Print or process anomaly indices
print(f"Anomaly indices in test set: {anomalies}")

# You can further analyze the sensor data points at these anomaly indices

Explanation:

  1. Import Libraries: We import pandas for data manipulation, train_test_split from sklearn.model_selection to split the data, and DecisionTreeRegressor from sklearn.tree.
  2. Load Data: Replace "sensor_data.csv" with your actual CSV containing a column named "sensor_reading".
  3. Split Data: Split the data into training and testing sets using train_test_split. The test set will be used for anomaly detection.
  4. Train Model: Train the DecisionTreeRegressor model using model.fit(). Reshape the training data (X_train) to a 2D array for compatibility with the model (even though it's a single feature).
  5. Predict Sensor Readings: Predict sensor readings on the test set using model.predict().
  6. Define Anomaly Threshold: Set a threshold for anomaly detection. This could be a fixed value (like 0.1 here), a standard deviation from the predicted value, or a specific percentile.
  7. Detect Anomalies: Iterate through the test set and predicted readings. Calculate the absolute error (difference) between the actual and predicted values. If the error exceeds the threshold, consider the data point an anomaly and append its index to the anomalies list.
  8. Process Anomalies: The code prints the anomaly indices in the test set. You can use these indices to further analyze the corresponding sensor data points to investigate potential equipment malfunctions or unusual readings.

Note:

  • Adjust the anomaly threshold based on the characteristics of your sensor data and the desired sensitivity for anomaly detection.
  • This is a basic example. Consider techniques like cross-validation or hyperparameter tuning for a more robust model. You might also explore other anomaly detection algorithms depending on your specific needs.
Tags: Technology,Machine Learning,Regression,Interview Preparation