timekit: Time Series Forecast Applications Using Data Mining
Written by Matt Dancho
The timekit package contains a collection of tools for working with time series in R. There are a number of benefits. One of the biggest is the ability to use a time series signature to predict future values (forecast) through data mining techniques. While this post is geared toward exposing the user to the timekit package, there are examples showing the power of data mining a time series as well as how to work with time series in general. A number of timekit functions will be discussed and implemented in the post. The first group of functions works with the time series index, and these include tk_index(), tk_get_timeseries_signature(), tk_augment_timeseries_signature() and tk_get_timeseries_summary(). We’ll spend the bulk of this post introducing you to these. The next function, tk_make_future_timeseries(), deals with creating a future time series from an existing index. The last set of functions deals with coercion to and from the major time series classes in R: tk_tbl(), tk_xts(), tk_zoo() (and tk_zooreg()), and tk_ts().
Benefits
So, why another time series package? The short answer is that it helps with data mining, with communication between time series objects, and with generating accurate future time series. The long answer is slightly more complicated, and I will attempt to explain.
Time Series Signature
The first reason, and arguably the most important, is that a large amount of information is stored inside a seemingly simple time index, and that information is very useful for modeling and data mining. The time index is the collection of time-based values that define when each observation occurred. Consider the timestamp “2016-01-01 00:00:00”. This contains a wealth of information related to the observation including year, month, day, hour, minute and second. We can extract even more information including half, quarter, week of year, day of year, and so on with little effort. Next is the concept of the frequency (or periodicity or scale), which is the amount of time between observations. From this time difference we can get even more information such as the periodicity of the data, whether the observations are regularly or irregularly spaced, and even which observations are frequently missing. By my count, there are at least 20 features that can be retrieved from a simple timestamp. The important concept is that these features can be exploded (or broken out) into what I’m calling the time series signature, which is nothing more than a decomposition of the unique features related to time index values. This data is very useful as it can be summarized, modeled, mined, sliced and diced, etc. As an example of the power of the signature, we can generate a prediction using data mining techniques (see the alcohol sales example later).
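To make the idea concrete, here is a minimal base R sketch of breaking a date index into signature-style columns. It is not timekit's implementation; the column names and the handful of features chosen are assumptions for illustration, and tk_get_timeseries_signature() returns a richer set.

```r
# Base R sketch of a "time series signature" (illustrative, not timekit's code)
idx <- as.Date(c("2016-01-01", "2016-01-04", "2016-01-05", "2016-01-06"))
lt  <- as.POSIXlt(idx)

signature <- data.frame(
  index   = idx,
  year    = lt$year + 1900L,             # POSIXlt stores years since 1900
  half    = ifelse(lt$mon < 6, 1L, 2L),  # first or second half of the year
  quarter = lt$mon %/% 3L + 1L,          # POSIXlt months are 0-based
  month   = lt$mon + 1L,
  day     = lt$mday,
  wday    = lt$wday + 1L,                # 1 = Sunday, ..., 7 = Saturday
  yday    = lt$yday + 1L                 # POSIXlt day of year is 0-based
)

# Frequency information comes from the differences between observations
diffs <- as.numeric(diff(idx))           # days between consecutive observations

print(signature)
print(diffs)
```

The gap of three days between the first two observations is the kind of irregularity (a weekend plus a holiday) that the differences expose.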
Prediction and Forecast Accuracy
The second reason is that often we want to make predictions into the future. There are a number of packages such as forecast and prophet that already specialize in this. For forecast, the future dates can be incorrect, especially for daily data. A regular numeric system doesn’t contain true dates, and a sequential system results in inaccuracy with respect to irregular dates. For prophet, the mechanism to compute holidays and missing days is internal to the predict() method, and therefore a method specific to creating future dates is needed. Two types of days cause problems: those that are regularly skipped and those that are irregularly skipped. The regularly skipped days (such as weekends, or every other Friday at companies that give that day off) need to be factored into the future date sequence. The irregularly skipped days (think holidays) cause issues as well, with the additional problem that they can be difficult (but not impossible) to predict.
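The two kinds of skipped days can be sketched in base R as follows. The helper make_future_dates() is hypothetical and written only for illustration; it is not how tk_make_future_timeseries() is implemented, and it assumes weekends are the only regularly skipped days.

```r
# Hypothetical helper (base R): future dates with weekends and holidays removed
make_future_dates <- function(last_date, n_days,
                              skip_dates = as.Date(character(0))) {
  # Candidate calendar days following the last observed date
  candidates <- seq(last_date + 1, by = "day", length.out = n_days)
  wday <- as.POSIXlt(candidates)$wday          # 0 = Sunday, 6 = Saturday
  # Drop regularly skipped days (weekends) and irregular ones (holidays)
  keep <- !(wday %in% c(0, 6)) & !(candidates %in% skip_dates)
  candidates[keep]
}

future <- make_future_dates(
  last_date  = as.Date("2015-12-31"),
  n_days     = 7,
  skip_dates = as.Date("2016-01-01")           # New Year's Day
)
print(future)
```

A plain seq() over calendar days would have included January 1 through 3, none of which are trading days in this example.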
Communication and Coercion Between Time-Based Object Classes
The third reason is that the R object structures that contain time-based information are difficult to use together. My first attempt was born in tidyquant, where I created the as_tibble() and as_xts() functions to coerce (convert) back and forth. I was naive in this attempt because the problem is larger: we have zoo, ts and many other packages that work with time-based information. The xts and zoo packages solved part of the problem, but there are two issues that persist. First, the time-based tibble (“tidy” data frame with class tbl) does not communicate well with the rest of the group. Coercing to xts, zoo and ts objects can result in a lot of issues, especially when coercion rules for homogeneous object classes take over. Weird things can happen such as turning numeric data into character or converting date to numeric without warning. Further, each coercion method (as_tibble, as.xts, as.zoo, as.ts) has its own nuances that are inconsistent. Second, some classes like ts do not use a time-based index, but rather use a regularized numeric-based index. Without maintaining the time-based index, we can never go back to the original data structure, whether it is tbl, xts, zoo, etc.
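Both issues can be reproduced with base R alone. The sketch below uses a small data frame with made-up volume values; it shows the homogeneous-type coercion and the loss of dates in a ts object, not anything specific to timekit.

```r
# Small data frame with a date column and a numeric column (made-up values)
df <- data.frame(
  date   = as.Date(c("2016-01-04", "2016-01-05", "2016-01-06")),
  volume = c(100, 120, 90)
)

# A matrix holds a single type, so mixing dates and numbers
# silently turns every value into character
m <- as.matrix(df)
print(typeof(m))

# Keeping only the numeric column avoids the problem
m_num <- as.matrix(df["volume"])
print(typeof(m_num))

# A ts object replaces the dates with a regularized numeric index
x <- ts(df$volume, start = 1, frequency = 1)
print(as.numeric(time(x)))
```

Once the data is in the ts object, nothing in it records that the observations fell on January 4 through 6.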
Enter timekit
The timekit package solves each of these issues. It includes functions to create a time series signature and a time series summary from a sequence of dates. It includes methods to accurately generate future time series index values, which is especially important for daily data. It provides consistent coercion methods that prevent inadvertent class coercion issues resulting from homogeneous object structures and that maximize time-based index retention for regularized data structures.
Test Driving timekit
Let’s take timekit for a test drive. We’ll be using a few other packages in the process to help with the examples. First, install timekit.
Next, load these packages:
Example 1: Predicting Daily Volume for FB
This example is intended to expose potential users to several functions in timekit. We’ll develop a prediction algorithm to predict daily volume using the time series signature. First, start with the FANG data set from the tidyquant package. Filter to get just the FB stock prices, and select the “date” and “volume” columns. This is a typical time series data structure: a time-based tibble with a “date” column and a features column (“volume” in this case).
First, split the data into two sets, one for training and one for comparing the actual output to our predictions.
Next, augment the time series signature to the training set using tk_augment_timeseries_signature(). This function adds the time series signature as additional columns to the data frame. The signature will be used next for the data mining process.
Next, model the data using a regression. We are going to use the lm() function to model volume using the time series signature.
Now we need to build the future data to model. We already have the index in the actual_future data. However, in practice we don’t normally have the future index. Let’s build it using the existing index following three steps:
- Extract the index from the training set with tk_index()
- Make a future index factoring in holidays and weekly inspection using tk_make_future_timeseries()
- Create a time series signature from the future index using tk_get_timeseries_signature()
Now use the predict() function to run a regression prediction on the new data.
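The modeling and prediction steps can be sketched end to end in base R. Everything here is an assumption for illustration: the data is synthetic (not FB volume), signature_of() is a hypothetical helper that builds only three signature-style columns, and the future index is a plain run of calendar days rather than the output of tk_make_future_timeseries().

```r
# Synthetic daily series with a Monday effect (illustrative, not FB volume)
set.seed(1)
train_idx <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")

# Hypothetical helper: a small subset of signature-style features
signature_of <- function(idx) {
  lt <- as.POSIXlt(idx)
  data.frame(
    index.num = as.numeric(idx),                     # trend term
    month     = factor(lt$mon + 1L, levels = 1:12),  # seasonal term
    wday      = factor(lt$wday + 1L, levels = 1:7)   # weekday term, 1 = Sunday
  )
}

train <- signature_of(train_idx)
train$volume <- 100 + 20 * (train$wday == "2") + rnorm(nrow(train), sd = 5)

# Regress the value on all signature columns
fit <- lm(volume ~ ., data = train)

# Build the signature of the future index and predict on it
future_idx <- seq(as.Date("2016-01-01"), by = "day", length.out = 28)
new_data   <- signature_of(future_idx)
pred       <- predict(fit, newdata = new_data)

print(head(data.frame(date = future_idx, pred = pred), 7))
```

Because the factor levels are fixed in the helper, the training and future signatures share the same columns and levels, which is what lets predict() work on dates the model has never seen.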
Let’s compare the prediction to the actual daily FB volume in 2016. Using the add_column() function, we can add the predictions to the actual data, actual_future. We can then plot the prediction using ggplot().
The predictions are a bit off as compared to the actuals, and in some months the values are actually negative, which is impossible. While the result is not necessarily earth shattering, let’s see how a regression algorithm performs on data with a more prevalent pattern. Note that we did a performance comparison and the prophet package with default settings did a much better job at identifying the volume pattern. With different modeling methods and tuning, the data mining approach can be significantly improved, but it’s difficult to tell if the performance would be better than prophet.
Example 2: Forecasting Alcohol Sales
In this example, we’ll evaluate a time series with a more prevalent pattern. The beauty of this example is that you will see the power of data mining the time series signature with just a simple linear regression. We’ll be using a linear regression model again to model the time series signature, but you should be thinking about what other better modeling methods could be implemented. The example is truncated for brevity since the major steps are the same as Example 1.
When a pattern is present, data mining using the time series signature can provide exceptional results. Further, the analyst has the flexibility to implement other data mining techniques and methods. We implemented a linear regression, but possibly other regression methods would work better.
Example 3: Simplified and Extensible Coercion
In the final example, we’ll briefly examine the various coercion functions that enable simplified coercion back and forth. We’ll start with the FB_tbl data.
We use the various timekit coercion methods to go back and forth without data loss. See how the original tibble is returned. Note the argument silent = TRUE removes the warning that the “date” column is being dropped. This is desirable since xts and the other matrix-based time classes should only use numeric data. No need to specify “order.by” arguments or worry about non-numeric data types being passed inadvertently. In addition, the ts object maintains a time-based index in addition to a regularized index.
One caveat is going from ts to tbl. The default argument, timekit_idx = FALSE, returns a regularized index. If the time-based index is needed, just set timekit_idx = TRUE.
Recap
Hopefully you can now see how timekit benefits time series analysis. We reviewed several of the functions related to extracting an index, adding a time series signature to an index or augmenting to a data frame, making a future time series that accounts for weekends and holidays, and coercing between various time series object classes. We also saw how the time series signature can be used in predictive analytics and data mining. The goal was to introduce you to timekit. Hopefully you now have a baseline to assist with future time series analysis.
Further Reading
I find the R Data Mining Website and Reference Card to be an invaluable tool when researching (and trying to remember) the various data mining techniques. Many of these techniques can be implemented in time series analysis with timekit.