For this article we went through the following research paper:
Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection
Lorenzo Perini, Paul-Christian Bürkner, Arto Klami
All of the code and data is available to download from this link: Download Code and Data
Here are some highlights from the paper:
1. Introduction
... Therefore, we are the first to study the estimation of the contamination factor from a Bayesian perspective. We propose γGMM, the first algorithm for estimating the contamination factor's (posterior) distribution in unlabeled anomaly detection setups. First, we use a set of unsupervised anomaly detectors to assign anomaly scores for all samples and use these scores as a new representation of the data. Second, we fit a Bayesian Gaussian Mixture model with a Dirichlet Process prior (DPGMM) (Ferguson, 1973; Rasmussen, 1999) in this new space. If we knew which components contain the anomalies, we could derive the contamination factor's posterior distribution as the distribution of the sum of such components' weights. Because we do not know this, as a third step γGMM estimates the probability that the k most extreme components are jointly anomalous, and uses this information to construct the desired posterior. The method is explained in detail in Section 3. ...
3. Methodology
We tackle the problem: Given an unlabeled dataset D and a set of M unsupervised anomaly detectors; Estimate a (posterior) distribution of the contamination factor γ.
Learning from an unlabeled dataset has three key challenges. First, the absence of labels forces us to make relatively strong assumptions. Second, the anomaly detectors rely on different heuristics that may or may not hold, and their performance can hence vary significantly across datasets. Third, we need to be careful in introducing user-specified hyperparameters, because setting them properly may be as hard as directly specifying the contamination factor.
In this paper, we propose γGMM, a novel Bayesian approach that estimates the contamination factor's posterior distribution in four steps, which are illustrated in Figure 1:
Step 1. Because anomalies may not follow any particular pattern in covariate space, γGMM maps the covariates X ∈ R^d into an M-dimensional anomaly space, where the dimensions correspond to the anomaly scores assigned by the M unsupervised anomaly detectors. Within each dimension of such a space, the evident pattern is that “the higher the more anomalous”.
Step 2. We model the data points in the new space R^M using a Dirichlet Process Gaussian Mixture Model (DPGMM) (Neal, 1992; Rasmussen, 1999). We assume that each of the (potentially many) mixture components contains either only normals or only anomalies. If we knew which components contained anomalies, we could then easily derive γ's posterior as the sum of the mixing proportions π of the anomalous components. However, such information is not available in our setting.
Step 3. Thus, we order the components in decreasing order, and we estimate the probability of the largest k components being anomalous. This poses three challenges: (a) how to represent each M-dimensional component by a single value to sort them from the most to the least anomalous, (b) how to compute the probability that the kth component is anomalous given that the (k − 1)th is such, (c) how to derive the target probability that k components are jointly anomalous.
Step 4. γGMM estimates the contamination factor's posterior by exploiting such a joint probability and the components' mixing proportions posterior.
A Simplified Implementation of The Above Algorithm
1. We have our dataset consisting of page views for our blog on Blogger. We load this dataset using Pandas.
2. We initialize two unsupervised anomaly detection models, namely:
- IsolationForest
- LocalOutlierFactor
Both of them are available in Scikit-Learn.
3. To begin with, we initialize them with the default values for hyperparameters as in the code below:
    clf_if = IsolationForest(random_state=0).fit(X)
    clf_lof = LocalOutlierFactor().fit(X)
That means at this point each model's contamination factor is set to 'auto'.
4. Since we have two models here, M = 2 for us. If there were three models, then M would be 3.
5. We get the anomaly scores from each fitted model. In Scikit-Learn, lower values mean more anomalous for both of these scores:
    anomalyscores_if = clf_if.decision_function(X)
    anomalyscores_lof = clf_lof.negative_outlier_factor_
6. For a simplified view, we plot this 2D data in a scatter plot.
    import matplotlib.pyplot as plt
    x = anomalyscores_if
    y = anomalyscores_lof
    plt.scatter(x, y)
    plt.show()
7. Next, we use a Bayesian Gaussian Mixture model to cluster the anomaly scores into two groups (one being anomalous, the other being normal).
8. Next, we find the percentage of anomalous points (Class: 1). This percentage is our contamination factor.
9. Using the above contamination factor for the IsolationForest model, we find the anomalies (highlighted in red in the accompanying plot).
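The nine steps above can be sketched end to end as follows. This is only a rough illustration of the simplified two-cluster version described in this post, not the paper's γGMM. The synthetic data stands in for the page-view dataset, and the score standardization, the rule "the component with the larger mean score is the anomalous one", and the clipping of the estimate are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import BayesianGaussianMixture
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the page-view data (assumption): 190 ordinary
# days and 10 days with unusually high traffic.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(100, 10, 190),
                    rng.normal(300, 20, 10)]).reshape(-1, 1)

# Steps 2-3: two detectors with default hyperparameters (M = 2).
clf_if = IsolationForest(random_state=0).fit(X)
clf_lof = LocalOutlierFactor().fit(X)

# Step 5: both scikit-learn scores are "lower = more anomalous",
# so negate them to get "higher = more anomalous".
scores = np.column_stack([-clf_if.decision_function(X),
                          -clf_lof.negative_outlier_factor_])
scores = StandardScaler().fit_transform(scores)  # put both scores on one scale

# Step 7: cluster the 2D score space into two groups.
bgm = BayesianGaussianMixture(n_components=2, max_iter=500, random_state=0)
labels = bgm.fit_predict(scores)

# Step 8: treat the component with the larger mean score as anomalous
# (assumption) and use its share of the points as the contamination factor.
anomalous_component = int(np.argmax(bgm.means_.sum(axis=1)))
contamination = float(np.mean(labels == anomalous_component))
# IsolationForest accepts a contamination in (0, 0.5], so clip the estimate.
contamination = min(max(contamination, 1e-3), 0.5)

# Step 9: refit IsolationForest with the estimated contamination factor.
final_if = IsolationForest(contamination=contamination, random_state=0).fit(X)
is_anomaly = final_if.predict(X) == -1  # predict() returns -1 for anomalies

print("estimated contamination factor:", contamination)
print("points flagged as anomalous:", int(is_anomaly.sum()))
```

Unlike the paper's method, this sketch yields a single point estimate of the contamination factor rather than a posterior distribution.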
Tuesday, May 28, 2024
Estimating the Contamination Factor For Unsupervised Anomaly Detection
Thursday, May 23, 2024
Four Practice Problems on Linear Regression (Taken From Interviews For Data Scientist Role)
To watch our related video on: YouTube
Previous Videos
- Linear Regression Theory (2022-02-15)
- https://www.youtube.com/watch?v=qS3HhMV8YG0
- Interview Question 1: What is linear regression and what is its primary purpose?
- https://www.youtube.com/watch?v=9S2FM9EGcdc
- Use Khan Academy to get started with the basic concepts of linear regression (Motivational video)
- https://www.youtube.com/watch?v=glXMN1VIttA
- Unit 5: Exploring bivariate numerical data
- https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data
Question (1): Asked At Ericsson
- You are given the data generated for the following equation:
- y = (x^9)*3
- Can you apply linear regression to learn from this data?
Solution (1)
Equation of line: y = mx + c
The equation we are given is of the form y = (x^m)c, with m = 9 and c = 3.
Taking log on both the sides:
log(y) = log((x^m)c)
Applying multiplication rule of logarithms:
log(y) = log(x^m) + log(c)
Applying power rule of logarithms:
log(y) = m.log(x) + log(c)
Y = log(y)
X = log(x)
C = log(c)
Y = mX + C
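So the answer is 'yes': the relationship is linear in log(x) and log(y), provided x > 0 so that the logarithms are defined. Regressing log(y) on log(x) gives a slope of m = 9 and an intercept of log(3).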
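As a quick check of this log transform, here is a minimal sketch using NumPy. The x values are an arbitrary positive grid chosen for illustration; nothing in the question fixes them.

```python
import numpy as np

# Data generated from y = 3 * x^9; x must be positive so log(x) is defined.
x = np.linspace(1.0, 5.0, 50)
y = 3.0 * x ** 9

# Transform both sides: log(y) = m * log(x) + log(c)
X_log = np.log(x)
Y_log = np.log(y)

# Ordinary least-squares line through the transformed points.
m, C = np.polyfit(X_log, Y_log, deg=1)
c = np.exp(C)  # undo the log to recover the multiplicative constant

print("recovered exponent m:", m)
print("recovered constant c:", c)
```

On noise-free data the fit recovers the exponent and the constant up to floating-point error; with noisy data the estimates would only be approximate.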
Question (2): Infosys – Digital Solution Specialist
- If you do linear regression in 3D, what do you get?
Solution (2)
When you perform linear regression on 3D data, you are essentially fitting a plane to a set of data points in three-dimensional space. The general form of the equation for a plane in three dimensions is:
z=ax+by+c
Here:
z is the dependent variable you are trying to predict.
x and y are the independent variables.
a and b are the coefficients that determine the orientation of the plane.
c is the intercept.
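Suppose you have data points (1,2,3), (2,3,5), (3,4,7), and you fit a linear regression model to this data. These three points lie exactly on the plane z=x+y, so that plane fits them with zero error. (Because y=x+1 for all three points, other planes such as z=2x+1 fit equally well; more varied data is needed to pin down a unique plane.) With a larger, noisier dataset the resulting plane might have an equation like z=0.8x+1.2y+0.5. This equation tells you how z changes as x and y change.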
In summary, performing linear regression on 3D data gives you a plane in three-dimensional space that best fits your data points in the least squares sense. This plane can then be used to predict new z values given new x and y values.
Generalizing a bit further
- If you do linear regression in N dimensions (N-1 independent variables and one dependent variable), you get a hyperplane of dimension N-1: a line in 2D, a plane in 3D.
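Fitting a plane can be sketched with NumPy's least-squares solver. The data below is synthetic and noise-free, generated from the illustrative plane z = 0.8x + 1.2y + 0.5; the sample size and value ranges are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic points on the illustrative plane z = 0.8x + 1.2y + 0.5.
x = rng.uniform(0, 10, 100)
y = rng.uniform(0, 10, 100)
z = 0.8 * x + 1.2 * y + 0.5

# Design matrix with a column of ones for the intercept c.
A = np.column_stack([x, y, np.ones_like(x)])

# Least-squares solution of A @ [a, b, c] = z.
coef, _, rank, _ = np.linalg.lstsq(A, z, rcond=None)
a, b, c = coef

print("a, b, c =", a, b, c)
print("rank of design matrix:", rank)  # 3 means the plane is uniquely determined
```

Adding more independent variables only adds columns to the design matrix; the same call then fits a hyperplane in higher dimensions.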
Question (3): Infosys – Digital Solution Specialist
- How do you tell if there is linearity between two variables?
Solution (3)
Determining if there is linearity between two variables involves several steps, including visual inspection, statistical tests, and fitting a linear model to evaluate the relationship. Here are the main methods you can use:
1. Scatter Plot
Create a scatter plot of the two variables. This is the most straightforward way to visually inspect the relationship.
Linearity: If the points roughly form a straight line (either increasing or decreasing), there is likely a linear relationship.
Non-linearity: If the points form a curve, cluster in a non-linear pattern, or are randomly scattered without any apparent trend, there is likely no linear relationship.
2. Correlation Coefficient
Calculate the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables.
Pearson Correlation Coefficient (r): Ranges from -1 to 1.
r≈1 or r≈−1: Strong linear relationship (positive or negative).
r≈0: Weak or no linear relationship.
3. Fitting a Linear Model
Fit a simple linear regression model to the data.
Model Equation: y = β0 + β1.x + ϵ
y: Dependent variable. / x: Independent variable. / β0: Intercept. / β1: Slope. / ϵ: Error term.
4. Residual Analysis
Examine the residuals (differences between observed and predicted values) from the fitted linear model.
Residual Plot: Plot residuals against the independent variable or the predicted values.
Linearity: Residuals are randomly scattered around zero.
Non-linearity: Residuals show a systematic pattern (e.g., curve, trend).
5. Statistical Tests
Perform statistical tests to evaluate the significance of the linear relationship.
t-test for Slope: Test if the slope (β1) is significantly different from zero.
Null Hypothesis (H0): β1=0 (no linear relationship).
Alternative Hypothesis (H1): β1≠0 (linear relationship exists).
p-value: If the p-value is less than the chosen significance level (e.g., 0.05), reject H0 and conclude that a significant linear relationship exists.
6. Coefficient of Determination (R²)
Calculate the R² value, which indicates the proportion of variance in the dependent variable explained by the independent variable.
R² Value: Ranges from 0 to 1.
Closer to 1: Indicates a strong linear relationship.
Closer to 0: Indicates a weak or no linear relationship.
Example:
Suppose you have two variables, x and y.
Scatter Plot: You plot x vs. y and observe a straight-line pattern.
Correlation Coefficient: You calculate the Pearson correlation coefficient and find r=0.85, indicating a strong positive linear relationship.
Fitting a Linear Model: You fit a linear regression model y=2+3x.
Residual Analysis: You plot the residuals and observe they are randomly scattered around zero, indicating no pattern.
Statistical Tests: The t-test for the slope gives a p-value of 0.001, indicating the slope is significantly different from zero.
R² Value: You calculate R^2=0.72, meaning 72% of the variance in y is explained by x.
Based on these steps, you would conclude there is a strong linear relationship between x and y.
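The numeric checks above (correlation, model fit, residuals, slope test statistic and R²) can be sketched with NumPy alone. The data is synthetic, generated around the line y = 2 + 3x with an assumed noise level. Turning the t statistic into a p-value needs a t-distribution CDF, which NumPy does not provide; scipy.stats.linregress reports that p-value directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data around y = 2 + 3x (the noise level is an assumption).
x = np.linspace(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 2, size=x.size)

# 2. Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# 3. Fit a simple linear model y = b0 + b1 * x
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# 4. Residuals: plotting these against x should show no pattern
residuals = y - y_hat

# 6. Coefficient of determination
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# 5. t statistic for H0: slope = 0, with n - 2 degrees of freedom
n = x.size
se_slope = np.sqrt(ss_res / (n - 2) / np.sum((x - x.mean()) ** 2))
t_stat = slope / se_slope

print("r =", r, "R^2 =", r_squared)
print("slope =", slope, "intercept =", intercept)
print("t statistic for slope =", t_stat)
```

In simple linear regression with one predictor, R² equals the square of Pearson's r, which is why r = 0.85 and R² = 0.72 in the example above agree.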
Question (4): TCS and Infosys (DSS)
- What is the difference between Lasso regression and Ridge regression?
Solution (4)
Lasso and Ridge regression are both techniques used to improve the performance of linear regression models, especially when dealing with multicollinearity or when the number of predictors is large compared to the number of observations. They achieve this by adding a regularization term to the loss function, which penalizes large coefficients. However, they differ in the type of penalty applied:
Ridge Regression:
- Penalty Type: L2 norm (squared magnitude of coefficients)
- Objective Function: Minimizes the sum of squared residuals plus the sum of squared coefficients multiplied by a penalty term λ:
$$\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$
Here, λ is the regularization parameter, y_i are the observed values, ŷ_i are the predicted values, and β_j are the coefficients.
- Effect on Coefficients: Shrinks coefficients towards zero but does not set any of them exactly to zero. As a result, all predictors are retained in the model.
- Use Cases: Useful when you have many predictors that are all potentially relevant to the model, and you want to keep all of them but shrink their influence.
Lasso Regression:
- Penalty Type: L1 norm (absolute magnitude of coefficients)
- Objective Function: Minimizes the sum of squared residuals plus the sum of absolute values of coefficients multiplied by a penalty term λ:
$$\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)$$
Here, λ is the regularization parameter, y_i are the observed values, ŷ_i are the predicted values, and β_j are the coefficients.
- Effect on Coefficients: Can shrink some coefficients exactly to zero, effectively performing variable selection. This means that it can produce a sparse model where some predictors are excluded.
- Use Cases: Useful when you have many predictors but you suspect that only a subset of them are actually important for the model. Lasso helps in feature selection by removing irrelevant predictors.
Key Differences:
Type of Regularization:
- Ridge: L2 regularization (squared magnitude of coefficients)
- Lasso: L1 regularization (absolute magnitude of coefficients)
Effect on Coefficients:
- Ridge: Shrinks all coefficients towards zero, but none are set exactly to zero.
- Lasso: Can shrink some coefficients to exactly zero, leading to a sparse model.
Use Cases:
- Ridge: Better when you want to retain all predictors and control their magnitude.
- Lasso: Better when you want to perform feature selection and eliminate some predictors.
Computational Complexity:
- Ridge: Generally simpler to compute because the penalty term is differentiable everywhere.
- Lasso: Can be more computationally intensive because the penalty term is not differentiable at zero, requiring more sophisticated optimization techniques.
Elastic Net:
As a side note, there is also the Elastic Net method, which combines both L1 and L2 penalties. It is useful when you want the benefits of both Ridge and Lasso regression:
$$\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \right)$$
Here, λ1 and λ2 control the L1 and L2 penalties, respectively. This method can select variables like Lasso and shrink coefficients like Ridge.
In summary, Ridge regression is ideal when you want to shrink coefficients without eliminating any, while Lasso regression is useful for creating simpler, more interpretable models by removing some predictors entirely.
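To see the difference in the effect on coefficients, here is a small NumPy sketch that minimizes the two objective functions above directly. Ridge uses its closed-form solution and Lasso uses coordinate descent with soft-thresholding. The synthetic data, the value λ = 50, the omission of an intercept and the helper names ridge_fit and lasso_fit are all assumptions for illustration; in practice you would more likely use a library implementation such as Scikit-Learn's Ridge and Lasso, which scale the penalty differently.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of sum((y - X b)^2) + lam * sum(b^2)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink z towards zero by t; return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for sum((y - X b)^2) + lam * sum(|b|)."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data: only features 0 and 3 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

lam = 50.0
ridge_beta = ridge_fit(X, y, lam)
lasso_beta = lasso_fit(X, y, lam)

print("ridge:", np.round(ridge_beta, 3))  # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso_beta, 3))  # irrelevant coefficients set exactly to zero
```

The soft-thresholding step is what produces exact zeros: whenever a feature's correlation with the residual falls below λ/2, its coefficient is set to zero rather than merely made small.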


