以下我为这篇《Rapid Deployment of Anomaly Detection Models for Large Number of Emerging KPI Streams》做的阅读笔记 - Jeanva
Abstract
Rapid deployment of anomaly detection models for large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI streams. We propose the first framework ADS that tackles the above problem, via clustering and semi-supervised learning. with the labels of only the 5 cluster centroids of 70 historical KPI streams, ADS achieves an averaged best F-score of 0.92 on 81 new KPI streams KPI流太多,没有办法去一个个手工异常标记,也没有办法去为每一个KPI流手动去做模型调优。这篇文章提出的一个方法是可以自动地为每一个新来的KPI流生成一个异常检测模型,用到了聚类和半监督学习。
KEYWORDS
ADS (Anomaly Detection through Self-training)
INTRODUCTION
Specifically, when large number of KPI streams emerge continuously and frequently, operators need to deploy accurate anomaly detection models for these new KPI streams as quickly as possible (e.g., within 3 weeks at most) 只需要3个星期就可以为一个KPI stream训练好检测模型
For example, in a top gaming company G studied in this paper, on average over ten new games are launched per quarter, which results in more than 6000 new KPI streams per 10 days on average 每10天就有6000个新的KPI流需要分析,数量是非常巨大的
For traditional statistical algorithms, to achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream, which is infeasible for the large number of emerging KPI streams 传统的方法,需要对每个KPI流挑选模型,调优参数
Supervised learning based methods [2], [10] require manually labeling anomalies for each new KPI stream, which is not feasible for the large number of emerging KPI streams either. 有监督的算法,又需要对KPI流进行手工标记
Unsupervised learning based methods [11], [12] do not require algorithm selection, parameter tuning, or manual labels, but they either suffer from low accuracy [14] or require large amounts of training data for each new KPI stream (e.g., six months worth of data) [12], which do not satisfy the requirement of rapid deployment (e.g., within 3 weeks) of accurate anomaly detection. 无监督的方法,准确度低,需要的数据量大,一般需要半年的数据。
In this paper, we propose ADS, the first framework that enables the rapid deployment of anomaly detection models (say at most 3 weeks) for large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI streams 我们只需要3个星期的数据就可以搞定,还不需要挑选算法,不需要手工标记
Our idea of ADS is based on the following two observations. (1) In practice, many KPI streams (e.g., the number of queries per server in a well load balanced server cluster) are similar due to their implicit associations and similarities, thus potentially we can use the similar anomaly detection algorithms and parameters for these similar KPI streams. (2) Clustering methods such as ROCKA [15] can be used to cluster many KPI streams into clusters according to their similarities. The number of clusters are largely determined by the nature of the service (e.g., shopping, gaming, social network, search) and the type of KPIs (e.g., number of queries, CPU usage, memory usage), but not by the scale of the entire system. 这个想法是基于KPI流的相似性比较高,聚成几个类就大概抓住了所有KPI的特点
Thus for a given service, the number of clusters can be orders of magnitude smaller than the number of KPI streams, and there is a good chance that a newly emerging KPI stream falls into one of the existing clusters resulted from historical KPI streams.
Utilizing the above two observations, ADS proposes to (1) cluster all existing/historical KPI streams into clusters, (2) manually label the anomalies of all cluster centroids, (3) assign each newly emerging KPI stream into one of the existing clusters, and (4) combine the data of the new KPI stream (unlabeled) and it’s cluster centroid (labeled) and use semisupervised learning [16] to train a new model for each new KPI stream. 大致过程:1)先聚类 2)类中心的KPI流需要label 3)对新的KPI流进行分类 4)使用类中心和新的KPI流进行半监督学习 结果:新的KPI流的检测模型就出来了
Semi-supervised learning can train a new model for a new KPI stream using an existing labeled KPI stream as long as these two KPI streams have similar data distribution, which is the case since they are in the same cluster. 半监督算法就是训练数据中有label的和没有label的,类中心的KPI流就是需要label的
BACKGROUND
Anomaly Detection for KPI Streams
Most anomaly detection algorithms, including traditional statistical algorithms, supervised learning based methods and unsupervised learning based methods, compute an anomaly score for a data point to denote how likely this data point is anomalous. Operators then set a threshold to determine whether each data point is anomalous or not. That is, only if the anomaly score at time t exceeds this threshold, \(x_{t}\) will be regarded as an anomalous data point. 时间序列的异常检测是逐点的检测
From the above definition, we can see that anomaly detection for a KPI stream is essentially a two-class classification problem – classifying a data point into an anomalous data point or a normal one. Consequently, we can use the intuitive classification metrics of two-class classification methods, including precision, recall, and F-score, to evaluate the performance of anomaly detection algorithms. 异常检测是一个二分类问题
Anomaly Detection Methods for KPI Streams
Over the years, diverse traditional statistical algorithms have been applied for KPI anomaly detection, including SVD [6], Wavelet [7], ARIMA [8], Time Series Decomposition [1], Holt-Winters [9], etc 传统的一些异常检测的算法
To achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream. 问题在于需要人去做算法的选择和参数的调优
To address the problem posed by algorithm selection and parameter tuning, several supervised learning based methods such as EGADS [10] and Opprentice [2] have been proposed. For each KPI stream, these methods learn (traditional statistical) algorithm selection and parameter tuning from operators’ manual labels of KPI anomalies. 有监督学习的问题虽然可以挑选算法和参数,但是他们需要手动地去标记异常
Unsupervised learning has emerged as a promising field in KPI anomaly detection. For example, isolation based methods [11] and variational autoencoders (VAE) [12] are applied in detecting anomalies in (KPI) streams. These methods are trained without manual labels, and thus they can be applied for large volume of KPI streams.
However, isolation based methods suffer from low accuracy [14] (see Section IV-C for more details). In addition, Donut [12], which is based on VAE, requires a long period (say six months) of training data for newly emerging KPI streams. 无监督算法的问题是准确率低,需要的时间长,长达半年的数据
Semi-supervised learning [16] is halfway between supervised and unsupervised learning. It uses unlabeled data to modify either parameters or models obtained from labeled data alone to maximize the learning performance. [20]–[22] use semi-supervised learning for anomaly detection in other domains, but are not designed for KPI streams (time series). 半监督的方法则是中庸的,使用没有标记的数据去调整模型的参数,最大化模型的效果。使用CPLE半监督学习方法是这篇文章的主要亮点,具体学习方法可以参照CPLE。
Clustering of KPI Streams
If we can identify similar KPI streams, and group the huge volume of KPI streams into a few clusters, we can reduce the overhead in anomaly detection. In this work, we adopt ROCKA [15], a rapid clustering algorithm for KPI streams based on their shapes. It applies moving average to extract baselines which successfully reduce the biases of noises and anomalies. In addition, it uses shapebased distance as the distance measure, and reduces its algorithm complexity to O(mlog(m)) using Fast Fourier Transform. Finally, it uses DBSCAN to cluster KPI streams and chooses the centroid for every cluster. KPI聚类,这个可以参照ROCKA算法。
FRAMEWORK OF ADS
For historical KPI streams, ADS preprocesses them with filling missing points and standardization, clusters the preprocessed KPI streams using ROCKA (Section III-A), and extracts features (namely, output results of different anomaly detection algorithms and parameters) for the KPI streams on cluster centroids (Section III-B). 只对类中心的KPI流才做这些操作,比如特征提取
Similarly, when a new KPI stream is generated, ADS preprocesses it, and classifies it into one of the above clusters (Section III-A), after which the features of this new KPI stream are extracted (Section III-B). 对新的KPI流做同样的工作。
Based on the features of the new KPI stream, and the features and labels of the new KPI stream’s cluster centroid, ADS trains a semi-supervised learning model using the CPLE algorithm (Section III-C). Finally, ADS detects anomalies in the new KPI stream based on the above model and the severity threshold (aThld henceforth).
Preprocessing and Clustering
In addition, to make KPI steams of different amplitudes and/or length scales comparable, we also standardize KPI streams. This way, different KPI streams are clustered based on their shapes, rather than the absolute values of amplitudes and/or length scales. As discussed in section II-C, ADS adopts ROCKA [15] to group KPI streams into a few clusters, and obtains a centroid KPI stream for each cluster.
The features of the new KPI stream, the features and the anomaly labels of the new KPI stream’s cluster centroid, together form the training set.
we use the output results of different anomaly detectors (namely, the anomaly severities measured by anomaly detectors) as the features of KPI streams. This way, each detector serves as a feature extractor.
When an anomaly detector receives an incoming data point of a KPI stream, it internally produces a non-negative value, called severity, to measure how anomalous that data point is. 每个基础算法计算出一个非负值来为每个点计算一个严重程度,作为一个feature
For example, historical average [24] applies how many times of standard deviations the point is away from the mean as the severity, based on the assumption that the KPI stream data follows Gaussian distribution;
Holt-Winters [9] uses a residual error (namely, the absolute difference between the actual value and the forecast value of each data point) to measure the severity.
The feature extraction, training, and classification (detection) in ADS are all designed to work with individual data points, not anomaly windows, so that the semi-supervised learning algorithm can have enough data for training. 每个KPI的点会对应到很多feature
Semi-Supervised Learning
we try to learn a model which is based on both labeled and unlabeled data, namely a semi-supervised learning model.
self-training based methods [20] apply an existing model to “label” unlabeled data, and employ the newly labeled data together with the actual labeled data to retrain the model until the prediction result no longer changes or iteration ends.
In this work, we adopt CPLE [28], an extension model of self-training. CPLE is a resilient semi-supervised learning framework for any supervised learning classifier (base-model henceforth) including random forest, SVM, decision tree, etc. It takes the prediction probabilities of base-model as input to fully utilize the unlabeled data.
CPLE有点类似EM算法,先给没有标记的点随机标记,然后优化参数
1、先利用有标签的样本 X 建模,得到估算的 \(\theta_{sup}\) ; 2、初始化 q ; 3、在这个 q 下计算 得到 CPL ; 4、最大化 CPL 得到估算的 \(\theta\) ; 5、将 \(\theta\) 代入 CL ,并最小化 CL ,得到估算的 q ; 6、重复3-5,直到 CPL 收敛,最终得到估算的 \(\theta_{semi}\) 参考https://zhuanlan.zhihu.com/p/53584825
EVALUATION
we compare the performance of ADS with that of supervised learning based method such as Opprentice, and unsupervised learning based method including iForest and Donut (Section IV-C).
Finally, to highlight the importance of semi-supervised learning, we compare ADS with the combination of ROCKA and Opprentice
Data Set
We randomly pick 70 historical KPI streams for clustering and 81 new ones for anomaly detection from a top global online game service.
KPI |
# KPI streams |
Interval (minute) |
Length (month) |
#/percentage of anomalous data points per KPI stream |
Success rate |
4 |
5 |
1 |
84/0.97% |
Latency |
19 |
5 |
1 |
150/1.7% |
# online players |
58 |
5 |
1 |
72/0.83% |
70个历史KPI stream聚成了5个类,这5个类的类中心是有label的,这里的label是对每个点进行label 然后我们对这个5个KPI以及81个新的KPI stream进行特征提取
Therefore, we randomly selected 151 KPI streams in our evaluation. We believe that the 151 KPI streams are sufficient to evaluate ADS’s performance.
Evaluation Metrics
we use a simple strategy following [12]: if any point in an anomaly segment in the ground truth can be detected by a chosen threshold, we say this segment is detected correctly, and all points in this segment are treated as if they can be detected by this threshold. Meanwhile, the points outside the anomaly segments are treated as usual. The precision, recall, F-score and best F-score are then computed accordingly 预测的时候用预测一个段来表示是否预测成功,段可以认为是一段连续的异常点,避免了要求对每个点的精确预测
For each of the 81 new KPI streams, we train ADS using the features extracted from the front 60% of it, as well as the features and manual labels of its cluster centroid (KPI stream). Then we use this ADS model to detect anomalies on the back 40% of this new KPI stream. 对每个KPI stream训练一个模型,用前60%来训练,后40%来detect
Figure 4 shows the cumulative distribution functions (CDFs) of the best F-scores of each KPI stream (of the new 81 KPI streams) using the above four methods. Note that in this CDF figure, being closer to the upper left means worse performance, and being closer to the bottom right means better performance 对每种方法看累计分布函数,低F-score的数目越少越好
We can see that both ADS and Opprentice perform superior in detecting anomalies for KPI streams, with 80% of best Fscores over 0.8.
(1) iForest is sensitive to noises in the training set because it does not utilize any labels [11]. (2) Donut is a deep Bayesian model, which requires a lot of training data to get good results (say six months worth of KPI streams) [12]. However, as Table II shows, we have to train the model using 60% of one month, namely 18 days, worth of newly emerging KPI streams, which is a too small amount of training data for Donut.
ADS and Opprentice perform well across all the five clusters and much better than iForst and Donut, demonstrating that it is important to utilize anomaly labels in anomaly detection for newly emerging KPI streams Opprentice的结果也还可以,它不需要聚类,直接用随机森林,使用前60%的label,预测后40%的label
Opprentice, performs similarly to ADS, it needs much more labeling works. For example, in this setting, ADS is trained based on only five labeled KPI streams on cluster centroids, while Opprentice is trained using all the 81 labeled KPI streams. ADS只需要label5个,Opprentice需要label 81个新的KPI stream,所以Opprentice不太可以用
Evaluation of CPLE
We adopt a robust semi-supervised learning model, CPLE, which is suitable for KPI anomaly detection and only requires similar (not necessarily the same) data distribution between the existing labeled KPI stream and the new KPI stream.
We set up ROCKA + Opprentice as follows. We first apply ROCKA to group the 70 historical KPI streams into five clusters, and classify the 81 new KPI streams into these clusters. For each cluster, we train Opprentice using the features and manual labels of its centroid KPI streams 这里没有使用81个KPI的label,假设81个KPI没有label,ROCKA+Opprentice反而显得比Opprentice更差,当然也比ADS差
Here we explain why ADS performs better than ROCKA + Opprentice. KPI stream clustering methods such as ROCKA usually extract baselines (namely underlying shapes) from KPI streams and ignore fluctuations. However, the fluctuations of KPI streams can impact anomaly detection.
ADS addresses the above problem effectively using semisupervised learning. In other words, it learns not only from the labels of the centroid KPI stream, but also from the fluctuation degree of the new KPI stream. This is consistent with the observation that the model trained based on both labeled and unlabeled data should not be worse than the one trained based only on the labeled data [28]
Reference
CPLE [28] Detectors [2] Evaluation [12] |