Anomaly detection with machine learning

Anomaly detection is the process of identifying data that do not conform to an expected pattern or to the rest of a dataset. Such inconsistent data may indicate a system failure. Anomaly detection is one of the standard tasks of machine learning.

Support vector machine

Advantages: one of the faster methods for finding decision functions. The solution is unique, because training reduces to a quadratic programming problem over a convex domain. Classification is more confident, because the algorithm finds the separating margin of maximum width. Disadvantages: sensitivity to noise and to how the data are standardized. When the classes are not linearly separable, there is no clear approach to choosing the kernel automatically (that is, to constructing the feature space in which the classes become separable).
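The sensitivity to standardization mentioned above can be illustrated without an SVM library. The sketch below is only an illustration under stated assumptions: the two features (a transfer amount and a time-of-day fraction) and all values are hypothetical, and plain Euclidean distance stands in for the distance computations that margin-based methods rely on.

```python
import math
import statistics

def standardize(column):
    """Rescale one feature to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

# Hypothetical toy features on very different scales.
amounts = [100.0, 200.0, 300.0, 400.0]   # currency units
fractions = [0.1, 0.9, 0.1, 0.9]         # fraction of the day

raw = list(zip(amounts, fractions))
scaled = list(zip(standardize(amounts), standardize(fractions)))

# Distance between the first two points, before and after standardization.
raw_dist = math.dist(raw[0], raw[1])
scaled_dist = math.dist(scaled[0], scaled[1])

print(f"raw distance:    {raw_dist:.4f}")
print(f"scaled distance: {scaled_dist:.4f}")
```

In raw units the distance is almost entirely the difference in amounts, so the second feature is effectively ignored. After standardization both features contribute on a comparable scale.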

Neural networks

Advantages: the ability to map inputs to outputs without relying on a probabilistic model of the data distribution. Nonlinear artificial neurons offer many ways to optimize the model. The network can be retrained and adapted to a changing, non-stationary environment. The algorithm is universal: one design can be reused across several subject areas. High speed of implementation. Disadvantages: it is difficult to choose a suitable network structure for a specific task; trained neural networks are "black boxes" rather than interpretable models, so a logical interpretation of the learned patterns is almost impossible; only numeric variables can be processed.

Advantages and disadvantages of machine learning methods for anomaly detection

Classification of methods for detecting anomalies in data

Novelty detection

A novelty is an object whose properties differ fundamentally from those of the objects in the earlier sample and which shows completely new behavior under unchanged conditions. Novelty detection finds new objects that differ from the previous ones but are not necessarily outliers. That is, the algorithm estimates how similar a new value is to the existing sample.
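One simple way to estimate how similar a new value is to the existing sample is a z-score against that sample. The following is a minimal sketch, not a recommended detector: the readings, the function names, and the cutoff of 3 standard deviations are all assumptions made for illustration.

```python
import statistics

def fit_reference(sample):
    """Summarize the existing (assumed clean) sample by its mean and standard deviation."""
    return statistics.fmean(sample), statistics.stdev(sample)

def novelty_score(value, reference):
    """Number of standard deviations between the new value and the reference mean."""
    mean, std = reference
    return abs(value - mean) / std

# Hypothetical sensor readings collected under normal conditions.
existing = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
reference = fit_reference(existing)

THRESHOLD = 3.0  # assumed cutoff; in practice it would be tuned per application

def is_novel(value):
    """Flag a new value that lies far from the existing sample."""
    return novelty_score(value, reference) > THRESHOLD

for reading in (20.15, 23.5):
    print(reading, "novel" if is_novel(reading) else "similar to existing sample")
```

The reference statistics are computed once from the existing sample and are not updated by the new values being scored.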

Outlier detection

An outlier is an object that results from data errors such as rounding, measurement inaccuracies, incorrect entries, or typos; a noise object that arises from misclassification; or an object that belongs to a different sample but was included in the one under consideration. Outlier detection aims to find objects that distort the sample as a whole, such as abnormally high, abnormally low, or overly volatile values, within an existing sample.
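Abnormally high or low values in an existing sample can be found with the interquartile range rule (Tukey's fences). This is a sketch of one common statistical technique, not necessarily the one the text has in mind; the sample values are invented, with 130.0 standing in for a typo and -5.0 for an incorrect entry.

```python
import statistics

def iqr_outliers(sample, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in sample if v < low or v > high]

# Hypothetical measurements; 130.0 and -5.0 play the role of data-entry errors.
amounts = [12.5, 13.0, 12.8, 13.1, 12.9, 130.0, 12.7, 13.2, 12.6, -5.0]

print(iqr_outliers(amounts))
```

The multiplier k=1.5 is the conventional default; a larger value flags only more extreme points.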

Supervised anomaly detection

Supervised anomaly detection is a method in which the model is trained on data that have already been labeled, with each record marked in advance as normal or as an outlier.
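With labeled data, anomaly detection becomes an ordinary classification problem. The sketch below uses a k-nearest-neighbors vote as a stand-in classifier. The choice of classifier, the feature pairs (amount, transfers per hour), and the labels are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest labeled neighbors."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (amount, transfers per hour).
# Features are left unscaled to keep the toy example short.
train_points = [(20, 1), (35, 2), (50, 1), (40, 3), (900, 15), (1200, 20), (1000, 18)]
train_labels = ["normal", "normal", "normal", "normal", "anomaly", "anomaly", "anomaly"]

print(knn_predict(train_points, train_labels, (45, 2)))
print(knn_predict(train_points, train_labels, (1100, 19)))
```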

Semi-supervised anomaly detection

In semi-supervised anomaly detection, the training input is a sample that consists only of normal values, without any deviations. The main idea is that anomalies are detected at later stages as deviations from the values in the original sample.
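A minimal sketch of this idea: learn from the normal-only sample how far apart normal points typically are, then flag later points that lie much farther from every training point. The nearest-neighbor rule, the safety margin of 1.5, and the data are assumptions chosen for illustration.

```python
import math

def nearest_distance(point, others):
    """Distance from a point to the closest point in a collection."""
    return min(math.dist(point, o) for o in others)

def fit_radius(normal_points):
    """Largest nearest-neighbor distance observed inside the normal-only sample."""
    return max(
        nearest_distance(p, normal_points[:i] + normal_points[i + 1:])
        for i, p in enumerate(normal_points)
    )

# Hypothetical training sample containing only normal observations.
normal_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.0, 0.8)]
radius = fit_radius(normal_points)

def is_anomaly(point, margin=1.5):
    """Flag a later point lying farther than margin * radius from every normal point."""
    return nearest_distance(point, normal_points) > margin * radius

print(is_anomaly((1.05, 1.0)))
print(is_anomaly((3.0, 3.0)))
```

No anomalous examples are needed for training; the detector only describes what normal looks like.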

Unsupervised anomaly detection

Unsupervised anomaly detection covers the case in which there are no data labels and the algorithm must determine on its own which data are anomalies. In this setting there is no particular distinction between training and test data. The idea is to detect anomalies from the intrinsic properties of the data. Typically, distance or density measures are used to decide whether a value is an outlier.
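A distance-based score of the kind mentioned above can be sketched as follows: each point is scored by the distance to its k-th nearest neighbor, so isolated points receive high scores. The data and the choice k=2 are hypothetical, and the same unlabeled set is used both to compute and to apply the scores.

```python
import math

def knn_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor (larger = more isolated)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical unlabeled data: a tight cluster plus one distant point.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0), (1.0, 1.2), (5.0, 5.0)]

scores = knn_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)

print("most isolated point:", points[most_anomalous])
```

Turning scores into decisions still requires a cutoff, for example flagging a fixed fraction of the highest-scoring points.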

Detecting anomalies is one of the key tasks in preparing data for further analysis and modeling. The quality of the chosen approach is usually measured by the accuracy of the result.

The most relevant application areas for such methods are medicine and payment systems. In medicine, for example, the possible side effects of a particular drug must be described as thoroughly as possible in order to avoid undesirable effects in future patients.

An example from payment systems is the detection of anomalies in transactions on customers' bank cards. Here the cost of inaccuracy is expressed directly in money, as well as in growing customer distrust of a bank that allowed third parties to interfere with customer accounts. Insufficiently accurate anomaly detection can leave important anomalies in the data stream undetected.

This can create loopholes for fraudsters, who can adapt to the weaknesses of the control algorithm. For example, setting too lenient a cutoff threshold for anomalies (on parameters such as the frequency of authorization attempts or the frequency of external money transfers across a subset of accounts) may mean that small fraudulent transactions go unnoticed, while the algorithm still correctly identifies large ones as threats.
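The effect of the cutoff threshold can be shown on a toy example. In the sketch below the transfer amounts, the fraud labels, and the two thresholds are invented, and a simple "flag everything above the threshold" rule stands in for a real control algorithm.

```python
def detection_report(amounts, is_fraud, threshold):
    """Count caught fraud, missed fraud, and false alarms for a given cutoff."""
    flagged = [a > threshold for a in amounts]
    caught = sum(f and y for f, y in zip(flagged, is_fraud))
    missed = sum((not f) and y for f, y in zip(flagged, is_fraud))
    false_alarms = sum(f and (not y) for f, y in zip(flagged, is_fraud))
    return {"caught": caught, "missed": missed, "false_alarms": false_alarms}

# Hypothetical transfers: amount and whether it was later confirmed as fraud.
amounts = [15, 40, 45, 60, 80, 120, 950, 1200]
is_fraud = [False, True, False, True, False, False, True, True]

strict = detection_report(amounts, is_fraud, threshold=30)
lenient = detection_report(amounts, is_fraud, threshold=500)

print("strict: ", strict)
print("lenient:", lenient)
```

With the lenient cutoff the two large fraudulent transfers are caught but the two small ones pass unnoticed. The strict cutoff catches all four at the price of three false alarms.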

For this reason, anomaly detection today is more a practical task than a research one. Such algorithms have a number of applications:

detection of suspicious banking transactions;
intrusion detection;
detection of atypical participants on an exchange (insiders);
detection of malfunctions in machinery from sensor readings;
medical diagnostics;
seismology;
detection of abnormal interest in a certain group of goods on a website.

Each of these areas involves the use of different methods to identify anomalies.

The choice of models depends on a number of factors:

the task at hand;
quality of the sample;
total amount of data;
labeling of the sample (labeled/unlabeled);
required speed of the algorithm;
number of potential anomalies;
nature of the anomalies (pronounced / subtle).

When choosing a model, analysts are guided by their own criteria. Each anomaly detection method is more or less effective depending on the specific task. Typically, several methods are combined to improve the result: statistical methods are paired with machine learning methods, and graphical methods are used to simplify the analysis and to coarsely filter out anomalies.
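Combining detectors can be as simple as requiring agreement between them. The sketch below is an assumption-heavy illustration: both detectors are statistical (a z-score rule and a median-absolute-deviation rule) and stand in for whatever statistical and machine learning methods an analyst would actually pair; the data and cutoffs are hypothetical.

```python
import statistics

def zscore_flags(sample, cutoff=2.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mean = statistics.fmean(sample)
    std = statistics.pstdev(sample)
    return [abs(v - mean) / std > cutoff for v in sample]

def mad_flags(sample, cutoff=3.5):
    """Flag values more than `cutoff` median absolute deviations from the median."""
    med = statistics.median(sample)
    mad = statistics.median([abs(v - med) for v in sample])
    return [abs(v - med) / mad > cutoff for v in sample]

def agreed_anomalies(sample):
    """Indices flagged by both detectors."""
    votes = zip(zscore_flags(sample), mad_flags(sample))
    return [i for i, (a, b) in enumerate(votes) if a and b]

# Hypothetical readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 48, 9, 10]

print(agreed_anomalies(readings))
```

Requiring agreement reduces false alarms at the cost of possibly missing anomalies that only one detector sees; taking the union of the flags makes the opposite trade.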