luminol
Luminol is a lightweight Python library for time series data analysis. The two major functionalities it supports are anomaly detection and correlation, and it can be used to investigate the possible causes of an anomaly. Given collected time series data, Luminol can:
1. detect whether the series contains any anomaly, and return the time window in which the anomaly happened, the timestamp at which it reached its peak severity, and a score indicating how severe it is compared to the rest of the series;
2. compute the correlation coefficient between two time series.
Luminol is configurable in the sense that you can choose which specific algorithm to use for anomaly detection or correlation. In addition, the library does not rely on any predefined threshold on the values of a time series. Instead, it assigns each data point an anomaly score and identifies anomalies using those scores.
By using the library, we can establish a logic flow for root cause analysis. For example, suppose there is a spike in network latency: Luminol can first find the anomaly period of the spike, then correlate other system metrics with the latency within that time window. Metrics with high correlation coefficients become candidate causes.
Investigating possible ways to automate root cause analysis is one of the main reasons we developed this library, and it will be a fundamental part of future work.
Make sure you have Python, pip, and NumPy installed, then install Luminol directly through pip:
pip install luminol
The most up-to-date version of the library is 0.4.
This is a quick start guide for using luminol for time series analysis.
import luminol

# ts and ts2 are time series, e.g. dicts mapping timestamp -> value (see the API section).
detector = luminol.anomaly_detector.AnomalyDetector(ts)
anomalies = detector.get_anomalies()
if anomalies:
    time_period = anomalies[0].get_time_window()
    correlator = luminol.correlator.Correlator(ts, ts2, time_period)
    print(correlator.get_correlation_result().coefficient)
This is a very simple use of Luminol. For information about parameter types, return types, and optional parameters, please refer to the API section below.
Modules in Luminol refer to customized classes developed for better data representation: Anomaly, CorrelationResult, and TimeSeries.
class luminol.modules.anomaly.Anomaly
It contains these attributes:
self.start_timestamp: # epoch seconds representing the start of the anomaly period.
self.end_timestamp: # epoch seconds representing the end of the anomaly period.
self.anomaly_score: # a score indicating how severe this anomaly is.
self.exact_timestamp: # epoch seconds indicating when the anomaly reaches its peak severity.
It has these public methods:
get_time_window()
: returns a tuple (start_timestamp, end_timestamp).
class luminol.modules.correlation_result.CorrelationResult
It contains these attributes:
self.coefficient: # correlation coefficient.
self.shift: # the amount of shift needed to get the above coefficient.
self.shifted_coefficient: # a correlation coefficient with shift taken into account.
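For illustration, here is a minimal sketch (assuming detector and correlator objects built as in the quick start above) that reads these attributes:

# assumes `detector` and `correlator` from the quick start
anomaly = detector.get_anomalies()[0]
print(anomaly.anomaly_score, anomaly.exact_timestamp)
start, end = anomaly.get_time_window()

result = correlator.get_correlation_result()
print(result.coefficient, result.shift, result.shifted_coefficient)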
class luminol.modules.time_series.TimeSeries
__init__(self, series)
series(dict)
: timestamp -> value
It has various handy methods for manipulating time series, including the generators iterkeys, itervalues, and iteritems. It also supports binary operations such as add and subtract. Please refer to the code and inline comments for more information.
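As a minimal sketch of constructing and iterating a TimeSeries (whether add/subtract are exposed as operators is an assumption; refer to the code for exact semantics):

from luminol.modules.time_series import TimeSeries

ts = TimeSeries({0: 1.5, 1: 2.0, 2: 3.5})  # timestamp -> value
for timestamp, value in ts.iteritems():
    print(timestamp, value)
total = ts + ts  # binary operations such as add and subtract are supported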
The library contains two classes, AnomalyDetector and Correlator, and there are two sets of APIs, one corresponding to each class. There are also customized modules for better data representation. The Modules section in this documentation may provide useful information as you walk through the APIs.
class luminol.anomaly_detector.AnomalyDetector
__init__(self, time_series, baseline_time_series=None, score_only=False, score_threshold=None,
score_percentile_threshold=None, algorithm_name=None, algorithm_params=None,
refine_algorithm_name=None, refine_algorithm_params=None)
time_series
: the metric you want to conduct anomaly detection on. It can have the following three types:
1. string: # path to a csv file
2. dict: # timestamp -> value
3. luminol.modules.time_series.TimeSeries
baseline_time_series
: an optional baseline time series of one of the types mentioned above.
score_only(bool)
: if asserted, anomaly scores for the time series will be available, while anomaly periods will not be identified.
score_threshold
: if passed, anomaly scores above this value will be identified as anomalies. It can override score_percentile_threshold.
score_percentile_threshold
: if passed, anomaly scores above this percentile will be identified as anomalies. It cannot override score_threshold.
algorithm_name(string)
: if passed, the specified algorithm will be used to compute anomaly scores.
algorithm_params(dict)
: additional parameters for the algorithm specified by algorithm_name.
refine_algorithm_name(string)
: if passed, the specified algorithm will be used to compute the timestamp of peak severity within each anomaly period.
refine_algorithm_params(dict)
: additional parameters for the algorithm specified by refine_algorithm_name.
Available algorithms and their additional parameters are:
1. 'bitmap_detector': # behaves well for huge data sets, and it is the default detector.
{
'precision'(4): # how many sections to categorize values,
'lag_window_size'(2% of the series length): # lagging window size,
'future_window_size'(2% of the series length): # future window size,
'chunk_size'(2): # chunk size.
}
2. 'default_detector': # used when other algorithms fail; not meant to be used explicitly.
3. 'derivative_detector': # meant to be used when abrupt changes of value are of main interest.
{
'smoothing factor'(0.2): # smoothing factor used to compute exponential moving averages
# of derivatives.
}
4. 'exp_avg_detector': # meant to be used when values are in a roughly stationary range,
# and it is the default refine algorithm.
{
'smoothing factor'(0.2): # smoothing factor used to compute exponential moving averages.
'lag_window_size'(20% of the series length): # lagging window size.
'use_lag_window'(False): # if asserted, a lagging window of size lag_window_size will be used.
}
Some of the parameters above may seem vague. As a rough intuition, the window-size parameters control how much surrounding data an algorithm compares against when scoring each point; for precise semantics, refer to the code and inline comments. A short sketch of passing these parameters follows.
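Here is a minimal sketch of selecting an algorithm explicitly (the parameter values are illustrative, not tuned recommendations):

from luminol.anomaly_detector import AnomalyDetector

ts = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 0}
detector = AnomalyDetector(ts, algorithm_name='exp_avg_detector',
                           algorithm_params={'smoothing factor': 0.3})
scores = detector.get_all_scores()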
The AnomalyDetector class has the following public methods:
get_all_scores()
: returns an anomaly score time series of type TimeSeries.
get_anomalies()
: returns a list of Anomaly objects.
class luminol.correlator.Correlator
__init__(self, time_series_a, time_series_b, time_period=None, use_anomaly_score=False,
algorithm_name=None, algorithm_params=None)
time_series_a
: a time series; for its type, please refer to time_series for AnomalyDetector above.
time_series_b
: a time series; for its type, please refer to time_series for AnomalyDetector above.
time_period(tuple)
: a time period within which to correlate the two time series.
use_anomaly_score(bool)
: if asserted, the anomaly scores of the time series, rather than the original data, will be used to compute the correlation coefficient.
algorithm_name
: if passed, the specified algorithm will be used to calculate the correlation coefficient.
algorithm_params
: any additional parameters for the algorithm specified by algorithm_name.
Available algorithms and their additional parameters are:
1. 'cross_correlator': # when correlating two time series, it shifts the series around so that it
# can catch spikes that are slightly apart in time.
{
'max_shift_seconds'(60): # maximal allowed shift room in seconds,
'shift_impact'(0.05): # weight of shift in the shifted coefficient.
}
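Here is a minimal sketch of tuning the cross correlator (the timestamps and parameter values are illustrative only):

from luminol.correlator import Correlator

ts1 = {0: 0, 60: 1, 120: 0, 180: 0}
ts2 = {0: 0, 60: 0, 120: 1, 180: 0}  # spike shifted by 60 seconds
correlator = Correlator(ts1, ts2, algorithm_name='cross_correlator',
                        algorithm_params={'max_shift_seconds': 120})
print(correlator.get_correlation_result().shifted_coefficient)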
The Correlator class has the following public methods:
get_correlation_result()
: returns a CorrelationResult object.
is_correlated(threshold=0.7)
: if the coefficient is above the passed-in threshold, returns a CorrelationResult object; otherwise, returns false.
The following example computes anomaly scores for a simple time series:
from luminol.anomaly_detector import AnomalyDetector
ts = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}
my_detector = AnomalyDetector(ts)
score = my_detector.get_all_scores()
for timestamp, value in score.iteritems():
    print(timestamp, value)
""" Output:
0 0.0
1 0.873128250131
2 1.57163085024
3 2.13633686334
4 1.70906949067
5 2.90541813415
6 1.17154110935
7 0.937232887479
8 0.749786309983
"""
The next example finds anomaly periods in ts1 and checks whether ts2 correlates with ts1 during each period:
from luminol.anomaly_detector import AnomalyDetector
from luminol.correlator import Correlator
ts1 = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}
ts2 = {0: 0, 1: 0.5, 2: 1, 3: 0.5, 4: 1, 5: 0, 6: 1, 7: 1, 8: 1}
my_detector = AnomalyDetector(ts1, score_threshold=1.5)
score = my_detector.get_all_scores()
anomalies = my_detector.get_anomalies()
for a in anomalies:
    time_period = a.get_time_window()
    my_correlator = Correlator(ts1, ts2, time_period)
    if my_correlator.is_correlated(threshold=0.8):
        print("ts2 correlates with ts1 at time period (%d, %d)" % time_period)
""" Output:
ts2 correlates with ts1 at time period (2, 5)
"""
Clone the source and install the package and dev requirements:
pip install -r requirements.txt
pip install pytest pytest-cov pylama
Tests and linting run with:
python -m pytest --cov=src/luminol/ src/luminol/tests/
python -m pylama -i E501 src/luminol/
Author: LinkedIn
Source Code: https://github.com/linkedin/luminol
License: Apache-2.0 License
In this article, you will learn about a couple of machine-learning-based approaches for anomaly detection; part two then shows how to apply one of these approaches to a specific use case, credit card fraud detection.
A common need when analyzing real-world datasets is determining which data points stand out as being different from all other data points. Such data points are known as anomalies, and the goal of anomaly detection (also known as outlier detection) is to find all such data points in a data-driven fashion. Anomalies can be caused by errors in the data, but sometimes they are indicative of a new, previously unknown underlying process.
This is the second and last part of my series on anomaly detection using machine learning. If you haven't already, I recommend you read my first article, which introduces anomaly detection and its applications in the business world.
In this article, I will take you through a case study focused on credit card fraud detection. It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase, so the main task is to identify fraudulent transactions using machine learning. We are going to use a Python library called PyOD, which is specifically developed for anomaly detection purposes.
An anomaly by definition is something that deviates from what is standard, normal, or expected.
When dealing with datasets on a binary classification problem, we usually deal with a balanced dataset. This ensures that the model picks up the right features to learn. Now, what happens if you have very little data belonging to one class, and almost all data points belong to another class?
In such a case, we consider one classification to be the ‘normal’, and the sparse data points as a deviation from the ‘normal’ classification points.
For example, suppose you lock your house twice every day: at 11 AM before going to the office and at 10 PM before sleeping. If a lock is opened at 2 AM, that would be considered abnormal behavior. Anomaly detection means predicting such instances, and it is used for intrusion detection, fraud detection, health monitoring, etc.
In this article, I show you how to use pycaret on a dataset for anomaly detection.
So, simply put, pycaret makes it super easy for you to visualize and train a model on your datasets within 3 lines of code!
So let’s dive in!
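As a rough sketch of what those three lines can look like (assuming a pandas DataFrame df; function names follow pycaret's anomaly module and may vary across versions):

from pycaret.anomaly import setup, create_model, assign_model

setup(data=df)                   # initialize the experiment on your DataFrame
model = create_model('iforest')  # train an isolation forest, for example
results = assign_model(model)    # append anomaly labels and scores to the data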
Anomaly and fraud detection is a multi-billion-dollar industry. According to a Nilson Report, global credit card fraud alone amounted to USD 7.6 billion in 2010, and in the UK, fraudulent credit card transaction losses were estimated at more than USD 1 billion in 2018. To counter these kinds of financial losses, a huge amount of resources is employed to identify frauds and anomalies in every single industry.
In data science, "outlier", "anomaly", and "fraud" are often used synonymously, but there are subtle differences. An "outlier" generally refers to a data point that somehow stands out from the rest of the crowd. However, when this outlier is completely unexpected and unexplained, it becomes an anomaly. That is to say, all anomalies are outliers, but not all outliers are necessarily anomalies. In this article, however, I am using these terms interchangeably.
There are numerous reasons why understanding and detecting outliers is important. As data scientists, during data preparation we take great care to understand whether any data point is unexplained, since it may have been entered erroneously. Sometimes we also filter out completely legitimate outlier data points and remove them to ensure better model performance.
There are also huge industrial applications of anomaly detection. Credit card fraud detection is the most cited one, but in numerous other cases anomaly detection is an essential part of doing business, such as detecting network intrusions, identifying instrument failures, and detecting tumor cells.
A range of tools and techniques is used to detect outliers and anomalies, from simple statistical techniques to complex machine learning algorithms, depending on the complexity of the data and the sophistication needed. The purpose of this article is to summarize some simple yet powerful statistical techniques that can readily be used for initial screening of outliers. While complex algorithms can sometimes be unavoidable, simple techniques are often more than enough to serve the purpose.
Below is a primer on five statistical techniques.
Today's article is my 5th in a series of "bite-size" articles I am writing on different techniques used for anomaly detection. If you are interested, see the previous four articles in the series.
Today I am going beyond statistical techniques and stepping into machine learning algorithms for anomaly detection.