Friday, July 24, 2020

Anomaly Detection using Scikit-Learn and "eif" PyPI package (for Extended Isolation Forest)



Definition 

Anomaly detection is the process of identifying unexpected items or events in data sets that differ from the norm. It is often applied to unlabeled data, in which case it is known as unsupervised anomaly detection. Anomaly detection rests on two basic assumptions:

• Anomalies only occur very rarely in the data.

• Their features differ significantly from those of the normal instances.

Novelty and Outlier Detection 

• Outlier detection (unsupervised anomaly detection)

The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

• Novelty detection (semi-supervised anomaly detection)

The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
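In scikit-learn this distinction shows up directly in how the estimators are called. A minimal sketch of the two call patterns follows; the synthetic data and parameter values are made up purely for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train = rng.normal(size=(200, 2))            # assumed "clean" training data
X_new = np.array([[0.1, 0.2], [4.0, 4.0]])     # new observations to score

# Outlier detection: fit and label the (possibly contaminated) data itself
# (-1 = outlier, +1 = inlier)
labels = IsolationForest(random_state=42).fit_predict(X_train)

# Novelty detection: fit on clean data, then score previously unseen observations
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(lof.predict(X_new))                      # e.g. [ 1 -1 ]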

One-Class SVM 

SVM, or “Support Vector Machine”, is a supervised classification model. One-class SVM is its unsupervised sibling, used for outlier analysis.

According to the skLearn documentation: 
The sklearn.svm.OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. 
This estimator is best suited for novelty detection when the training set is not contaminated by outliers. That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM might give useful results in these situations depending on the value of its hyperparameters.
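A minimal sketch of that novelty-detection use case, assuming a clean training set; the synthetic data and the nu/gamma values below are illustrative, not tuned.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)                           # inliers clustered around the origin
X_test = np.r_[0.3 * rng.randn(20, 2),                      # more inliers
               rng.uniform(low=-4, high=4, size=(20, 2))]   # scattered candidate novelties

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
pred = clf.predict(X_test)                                  # +1 = inlier, -1 = novelty
print("flagged as novelties:", np.sum(pred == -1))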

Robust Covariance. Fitting an elliptic envelope. 

One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.

scikit-learn provides the object “covariance.EllipticEnvelope”, which fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.

For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness. 
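A minimal sketch of covariance.EllipticEnvelope on roughly Gaussian data; the contamination value and the appended outliers are assumptions made for illustration.

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]], size=300)
X = np.vstack([X, [[6, 6], [-7, 5]]])        # append two obvious outliers

env = EllipticEnvelope(contamination=0.01).fit(X)
labels = env.predict(X)                      # -1 = outlier, +1 = inlier
distances = env.mahalanobis(X)               # robust Mahalanobis distances
print("outliers found at indices:", np.where(labels == -1)[0])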

Isolation Forest 

One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies. The implementation of ensemble.IsolationForest is based on an ensemble of tree.ExtraTreeRegressor.

[Figure: Idea Behind Isolation Forest]

[Figure: Decision Boundary With Isolation Forest]
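A short sketch of ensemble.IsolationForest along the same lines; n_estimators, contamination and the synthetic data are illustrative choices, not tuned values.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(size=(300, 2)),            # dense "normal" cluster
               rng.uniform(-6, 6, size=(10, 2))])    # sparse candidate anomalies

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42).fit(X)
labels = iso.predict(X)                              # -1 = anomaly, +1 = normal
scores = iso.decision_function(X)                    # lower score -> more anomalous
print("anomalies flagged:", np.sum(labels == -1))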
Local Outlier Factor 

Another efficient way to perform outlier detection on moderately high-dimensional datasets is to use the Local Outlier Factor (LOF) algorithm. The neighbors.LocalOutlierFactor (LOF) algorithm computes a score (called the local outlier factor) reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors. In practice the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors to its own local density: a normal instance is expected to have a local density similar to that of its neighbors, while abnormal data are expected to have a much smaller local density.

The number k of neighbors considered (parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close-by objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors=20 appears to work well in general. When the proportion of outliers is high (i.e. greater than 10%), n_neighbors should be greater (e.g. n_neighbors=35).

The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration: it can perform well even in datasets where abnormal samples have different underlying densities. The question is not how isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.

[Figure: Local Outlier Factor: detect the samples that have a substantially lower density than their neighbors.]
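A short sketch of neighbors.LocalOutlierFactor in its default (unsupervised outlier detection) mode, using the n_neighbors=20 suggestion above; the data is synthetic and only for illustration.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(size=(200, 2)),            # dense cluster of normal points
               rng.uniform(-6, 6, size=(10, 2))])    # sparse candidate outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                          # -1 = outlier, +1 = inlier
lof_scores = -lof.negative_outlier_factor_           # larger value -> more abnormal
print("outliers flagged:", np.sum(labels == -1))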
Outside of skLearn, but an important development, is the “Extended Isolation Forest” algorithm. To see why it is needed, recall how the original algorithm works. The Isolation Forest algorithm utilizes the fact that anomalous observations are few and significantly different from ‘normal’ observations. The forest is built from decision trees, each of the trees having access to a sub-sample of the training data. In order to create a branch in the tree, first a random feature is selected. Afterward, a random split value (between the min and max value) is chosen for that feature. If the given observation has a lower value of this feature than the one selected, the observation follows the left branch, otherwise the right one. This process is continued until a single point is isolated or the specified maximum depth is reached.

[Figure: Isolation Forest in Action]
In principle, outliers are less frequent than regular observations and are different from them in terms of values (they lie further away from the regular observations in the feature space). That is why, by using such random partitioning, they should be isolated closer to the root of the tree (a shorter average path length, i.e., the number of edges an observation must pass going from the root to the terminal node), with fewer splits necessary. The anomaly score is created based on all trees in the forest and the depth the point reaches in these trees.

Isolation Forest’s Problem 

In the left picture below, we can see data sampled from a multivariate normal distribution. Intuitively, we would assume that the anomaly score assigned to the observations would increase radially from the central point of the distribution [0, 0]. However, this is clearly not the case, as seen in the right image. What is more, there are also rectangular artifacts of a lower score, such as the vertical one between points 0 and 1 on the x-axis.
Let’s move on to the second example. Here we see two blobs centered at points [0, 10] and [10, 0]. By inspecting the right figure we see not only the artifacts that were present before, but also two ghost clusters (approximately at [0, 0] and [10, 10]).
The reason for this peculiar behavior originates from the fact that the decision boundaries of the Isolation Forest are either vertical or horizontal (a random value of a random feature), as seen in the picture below, where the authors plot the branch cuts generated by the IF during the training phase. We see that the branches tend to cluster where the majority of the points are located. But as the lines can only be parallel to the axes, there are regions that contain many branch cuts and only a few or single observations, which results in improper anomaly scores for some of the observations. An example can be seen in the single-blob figure, where there are points around [3, 0] (many branch cuts) and [3, 3] (few cuts).
Extended Isolation Forest 

Analysis of the Isolation Forest’s drawback led to the conclusion that the problem is caused by the purely horizontal/vertical branch cuts. Extended Isolation Forest addresses that issue by approaching the problem a bit differently. Instead of selecting a random feature and then a random value within the range of the data, it selects:

• a random slope for the branch cut

• a random intercept chosen from the range of available values in the training data

These are the terms (slope/intercept) you most likely recall from simple linear regression (y = ax + b). Let’s look at a visual example! From the image below we can see the main difference from the original IF algorithm: cuts that are not parallel to the axes.
Extended Isolation Forest generalizes well into higher dimensions, where instead of straight lines we are dealing with hyperplanes. For a deeper dive into the N-dimensional generalization, refer to the original paper. Let’s wrap up the theoretical explanation by looking at the difference in the anomaly score maps generated by IF/EIF. In the images below we see that the anomaly score spreads out radially from the data clusters and there are no artifacts/ghost clusters visible.
An extra feature captured by the EIF is the region of higher anomaly score directly in between the two clusters (where they kind of link). This region can be considered close to ‘normal’ given the proximity to both clusters, but it receives a higher score, as it is far from the concentrated groups.

Caveat: the “eif” implementation is much slower than the skLearn implementation of Isolation Forest. For a test task, the skLearn Isolation Forest took 14 s to train, while “eif” took roughly 10 minutes.
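A sketch of how the “eif” package is typically used, under the assumption that its iForest/compute_paths interface matches the project’s README; the ntrees and sample_size values are illustrative, so check the current eif documentation before relying on them.

import numpy as np
import eif

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(loc=[0, 10], size=(300, 2)),
               rng.normal(loc=[10, 0], size=(300, 2))]).astype(np.float64)

# ExtensionLevel=0 reproduces the standard Isolation Forest (axis-parallel cuts);
# ExtensionLevel=1 (i.e. n_features - 1) gives the fully extended version for 2-D data.
forest = eif.iForest(X, ntrees=200, sample_size=256, ExtensionLevel=1)
scores = forest.compute_paths(X_in=X)        # higher score -> more anomalous
print("top-5 anomaly scores:", np.sort(scores)[-5:])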
References 

• towardsdatascience

• Extended Isolation Forest

• Novelty and Outlier Detection (scikit-learn.org/stable)

• Comparing anomaly detection algorithms for outlier detection on toy datasets (scikit-learn.org/0.20)

• Outlier detection with Local Outlier Factor (LOF) (scikit-learn.org/stable)

• skLearn v0.23
