Anomaly Detection Using Tidy and Anomalize
Written by Matt Dancho
We recently had an awesome opportunity to work with a great client that asked Business Science to build an open source anomaly detection algorithm that suited their needs. The business goal was to accurately detect anomalies in various marketing data consisting of website actions and marketing feedback, spanning thousands of time series across multiple customers and web sources. Enter anomalize: a tidy anomaly detection algorithm that's time-based (built on top of tibbletime) and scalable from one to many time series! We are really excited to present this open source R package for others to benefit from. In this post, we'll go through an overview of what anomalize does and how it works.
Case Study: When Open Source Interests Align
We work with many clients, teaching data science and using our expertise to accelerate their business. However, it's rare for a client's needs, and their willingness to let others benefit, to align with our interest in pushing the boundaries of data science. This was an exception.
Our client had a challenging problem: detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events, which could be increased web traffic in the marketing domain or a malfunctioning server in the IT domain. Regardless, it’s important to flag these unusual occurrences to ensure the business is running smoothly. One of the challenges was that the client deals with not one time series but thousands that need to be analyzed for these extreme events.
An opportunity presented itself to develop an open source package that aligned with our interest in building a scalable adaptation of the Twitter AnomalyDetection package and with our client's desire for a package that would benefit from the open source data science community's ability to improve it over time. The result is anomalize!
2 Minutes To Anomalize
We’ve made a short introductory video that’s part of our new Business Science Software Intro Series on YouTube. This will get you up and running in under 2 minutes.
For those of us who prefer to read, here's the gist of how anomalize works in four simple steps.
Step 1: Install Anomalize
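Anomalize can be installed from CRAN, or you can grab the development version from GitHub (the GitHub route assumes you have devtools installed):

```r
# Install the released version from CRAN
install.packages("anomalize")

# Or the development version from GitHub
# devtools::install_github("business-science/anomalize")
```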
Step 2: Load Tidyverse and Anomalize
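Loading the tidyverse gives you the dplyr and ggplot2 verbs, and anomalize supplies the anomaly detection functions:

```r
library(tidyverse)
library(anomalize)
```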
Step 3: Collect Time Series Data
We've provided a dataset, tidyverse_cran_downloads, to get you up and running. The dataset consists of daily download counts of 15 "tidyverse" packages.
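The dataset ships with anomalize, so you can inspect it directly (it's a time-based tibble grouped by package):

```r
# Daily download counts for 15 tidyverse packages, grouped by package
tidyverse_cran_downloads %>%
  glimpse()
```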
Step 4: Anomalize
Use the three tidy functions, time_decompose(), anomalize(), and time_recompose(), to detect anomalies. Tack on a fourth, plot_anomalies(), to visualize.
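Something like the following pipeline does all four at once; the faceting argument is cosmetic and optional:

```r
tidyverse_cran_downloads %>%
  time_decompose(count) %>%   # decompose into observed, season, trend, remainder
  anomalize(remainder) %>%    # flag anomalies in the remainder
  time_recompose() %>%        # build lower/upper bounds around the observed values
  plot_anomalies(time_recomposed = TRUE, ncol = 3)
```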
Well that was easy… but what did we just do???
Anomalize Workflow
You just implemented the “anomalize” (anomaly detection) workflow, which consists of:
- Time series decomposition with time_decompose()
- Anomaly detection of remainder with anomalize()
- Anomaly lower and upper bound transformation with time_recompose()
Time Series Decomposition
The first step is time series decomposition using time_decompose(). The "count" column is decomposed into "observed", "season", "trend", and "remainder" columns. The default decomposition method is method = "stl", which is seasonal decomposition using a Loess smoother (refer to stats::stl()). The frequency and trend parameters are automatically set based on the time scale (or periodicity) of the time series, using tibbletime-based functions under the hood.
A nice aspect is that the frequency and trend are automatically selected for you. If you want to see what was selected, set message = TRUE. Also, you can change the selection by inputting a time-based period such as "1 week" or "2 quarters", which is typically more intuitive than figuring out how many observations fall into a time span. Under the hood, time_frequency() and time_trend() convert these time-based periods to numeric values using tibbletime!
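For example, something like this prints the selected spans and overrides them with time-based periods (the span values below are purely illustrative):

```r
tidyverse_cran_downloads %>%
  time_decompose(
    count,
    method    = "stl",
    frequency = "1 week",    # weekly seasonality
    trend     = "3 months",  # smoothness of the Loess trend
    message   = TRUE         # report which spans were used
  )
```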
Anomaly Detection Of Remainder
The next step is to perform anomaly detection on the decomposed data, specifically the "remainder" column. We did this using anomalize(), which produces three new columns: "remainder_l1" (lower limit), "remainder_l2" (upper limit), and "anomaly" (Yes/No flag). The default method is method = "iqr", which is fast and relatively accurate at detecting anomalies. The alpha parameter is set to alpha = 0.05 by default, but can be adjusted to increase or decrease the height of the anomaly bands, making it more or less difficult for data points to be flagged as anomalous. The max_anoms parameter is set to max_anoms = 0.2 by default, allowing at most 20% of the data to be anomalous. This is the second parameter that can be adjusted. Finally, verbose = FALSE by default, which returns a data frame. Try setting verbose = TRUE to get an outlier report as a list.
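Here's a sketch of the knobs in one place (the values shown are just the defaults):

```r
tidyverse_cran_downloads %>%
  time_decompose(count) %>%
  anomalize(
    remainder,
    method    = "iqr",   # or "gesd"
    alpha     = 0.05,    # controls band height; smaller alpha = wider bands
    max_anoms = 0.20,    # cap on the fraction of points flagged as anomalies
    verbose   = FALSE    # TRUE returns a list with an outlier report
  )
```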
If you want to visualize what's happening, now's a good point to try out another plotting function, plot_anomaly_decomposition(). It only works on a single time series, so we'll need to select just one to review. The "season" component captures and removes the weekly cyclic seasonality. The trend is smooth, which is desirable for removing the central tendency without overfitting. Finally, the remainder is analyzed for anomalies, detecting the most significant outliers.
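For instance, picking one package out of the grouped data (column names assume the tidyverse_cran_downloads dataset):

```r
tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%  # a single series only
  ungroup() %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  plot_anomaly_decomposition()
```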
Anomaly Lower and Upper Bounds
The last step is to create the lower and upper bounds around the "observed" values. This is the work of time_recompose(), which recomposes the anomaly bounds from the decomposition around the observed values. Two new columns were created: "recomposed_l1" (lower limit) and "recomposed_l2" (upper limit).
Let's visualize just the "lubridate" data. We can do so using plot_anomalies() and setting time_recomposed = TRUE. This function works on both single and grouped data.
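Something along these lines:

```r
tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE)  # bands drawn around the observed values
```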
That’s it. Once you have the “anomalize workflow” down, you’re ready to detect anomalies!
Packages That Helped In Development
The first thing we did after getting this request was to investigate what methods are currently available. The last thing we wanted to do was solve a problem that’s old news. We were aware of three excellent open source tools:
- Twitter's AnomalyDetection package, available on GitHub.
- Rob Hyndman's forecast::tsoutliers() function, available through the forecast package on CRAN.
- Javier Lopez-de-Lacalle's tsoutliers package, available on CRAN.
We have worked with all of these R packages and functions before, and each presented learning opportunities that could be integrated into a scalable workflow.
What we liked about Twitter's AnomalyDetection was that it used two methods in tandem that work extremely well for time series. The "Twitter" method uses time series decomposition (i.e. stats::stl()), but instead of subtracting the Loess trend it uses the piece-wise median of the data (one or several medians split at specified intervals). The other method that AnomalyDetection employs is Generalized Extreme Studentized Deviate (GESD) testing as a way of detecting outliers. GESD is nice because it is resistant to the high-leverage points that tend to pull a mean, or even a median, in the direction of the most significant outliers. The package works very well with stationary data, or even data with trend. However, the package was not built with a tidy interface, making it difficult to scale.
Forecast tsoutliers() Function
The tsoutliers() function from the forecast package is a great way to efficiently collect outliers for cleaning prior to performing forecasts. It uses an outlier detection method based on STL with a 3X interquartile range around the remainder from time series decomposition. It's very fast because there are at most two iterations to determine the outlier bands. However, it's not set up for a tidy workflow, nor does it allow adjustment of the 3X multiplier. Some time series may need more or less, depending on the magnitude of the variance of the remainders relative to the magnitude of the outliers.
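To our knowledge the usage is roughly as follows, returning the flagged positions and suggested replacement values (a sketch on a built-in series; see the forecast documentation for details):

```r
library(forecast)

out <- tsoutliers(AirPassengers)  # STL + IQR-based outlier detection
out$index         # positions of the flagged observations
out$replacements  # suggested replacement values for cleaning
```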
tsoutliers Package
The tsoutliers package works very effectively on a number of traditional forecasting time series for detecting anomalies. However, speed was an issue, especially when attempting to scale to multiple time series or to data with minute or second timestamps.
Anomalize: Incorporating The Best Of All
In reviewing the available packages, we learned from them all and incorporated the best of each (a quick sketch follows the list):
- Decomposition Methods: We include two time series decomposition methods: "stl" (using traditional seasonal decomposition by Loess) and "twitter" (using seasonal decomposition with median spans).
- Anomaly Detection Methods: We include two anomaly detection methods: "iqr" (using an approach similar to the 3X IQR of forecast::tsoutliers()) and "gesd" (using the GESD method employed by Twitter's AnomalyDetection).
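The methods can be mixed and matched through the method arguments, e.g.:

```r
tidyverse_cran_downloads %>%
  time_decompose(count, method = "twitter") %>%  # or "stl"
  anomalize(remainder, method = "gesd") %>%      # or "iqr"
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE)
```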
In addition, we’ve made some improvements of our own:
- Anomalize Scales Well: The workflow is tidy and scales with dplyr groups. The functions operate as expected on grouped time series, meaning you can just as easily anomalize 500 time series data sets as a single data set.
- Visuals For Analyzing Anomalies:
  - We include a way to get bands around the "normal" data, separating out the outliers. People are visual, and bands are really useful in determining how the methods are working or whether we need to make adjustments.
  - We include two plotting functions, making it easy to see what's going on during the "anomalize workflow" and providing a way to assess the effect of "adjusting the knobs" that drive time_decompose() and anomalize().
- Time Based:
  - The entire workflow works with tibbletime data set up with a time-based index. This is good because, in our experience, almost all time data comes with a date or datetime timestamp that's really important to the characteristics of the data.
  - There's no need to calculate how many observations fall within a frequency span or trend span. We set up time_decompose() to handle frequency and trend using time-based spans such as "1 week" or "2 quarters" (powered by tibbletime).
Conclusions
We hope that the open source community can benefit from anomalize. Our client is very happy with it, and it's exciting to see that we can continue to build in new features and functionality that everyone can enjoy.
About Business Science
Business Science specializes in "ROI-driven data science". We offer training, education, coding expertise, and data science consulting related to business and finance. Our latest creation is Business Science University, which is coming soon! In addition, we put about 80% of our effort into the open source data science community in the form of software and our Business Science blog. Visit Business Science on the web or contact us to learn more!