Demo Week: Time Series Machine Learning with h2o and timetk
Written by Matt Dancho
We’re at the final day of Business Science Demo Week. Today we are demoing the h2o package for machine learning on time series data. What’s Demo Week? Every day this week we are demoing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today you’ll see how we can use timetk + h2o to get really accurate time series forecasts. Here we go!
h2o: What’s It Used For?
The h2o package is a product offered by H2O.ai that contains a number of cutting edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy. One of the main benefits of H2O is that it can be deployed on a cluster (this will not be discussed today). From the R perspective, there are four main uses:
- Data Manipulation: Merging, grouping, pivoting, imputing, splitting into training/test/validation sets, etc.
- Machine Learning Algorithms: Very sophisticated algorithms in both supervised and unsupervised categories. Supervised algorithms include deep learning (neural networks), random forest, generalized linear model, gradient boosting machine, naive Bayes, stacked ensembles, and xgboost. Unsupervised algorithms include generalized low rank models, k-means and PCA. There’s also Word2vec for text analysis. The latest stable release also has AutoML: automatic machine learning, which is really cool as we’ll see in this post!
- Auxiliary ML Functionality: Performance analysis and grid hyperparameter search.
- Production, Map/Reduce and Cloud: Capabilities for productionizing into Java environments, cluster deployment with Hadoop / Spark (Sparkling Water), and deploying in cloud environments (Azure, AWS, Databricks, etc).
Sticking with the theme for the week, we’ll go over how h2o can be used for time series machine learning as an advanced algorithm. We’ll use h2o locally to develop a high accuracy time series model on the same data set (beer_sales_tbl) from the timetk and sweep posts. This is a supervised regression problem.
Load Libraries
We’ll need three libraries today:
- h2o: Awesome machine learning library
- tidyquant: For getting data and loading the tidyverse behind the scenes
- timetk: Toolkit for working with time series in R
IMPORTANT FOR INSTALLING H2O
For h2o, you must install the latest stable release. Select H2O » Latest Stable Release » Install in R. Then follow the instructions exactly.
Installing Other Packages
If you haven’t done so already, install the timetk and tidyquant packages from CRAN with install.packages().
Loading Libraries
Load the libraries.
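A minimal sketch of the loading step. The explicit dplyr line is an assumption on my part: depending on the tidyquant version you have installed, the tidyverse packages may not be attached automatically, so loading dplyr directly is a harmless precaution.

```r
# Load the three packages used throughout the demo
library(h2o)        # machine learning
library(tidyquant)  # data retrieval via tq_get()
library(timetk)     # time series toolkit

# Assumption: attach dplyr explicitly in case tidyquant does not do it for us
library(dplyr)
```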
Data
We’ll get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
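Here is a sketch of what the download might look like. The FRED series ID and the date range are assumptions for illustration (the text above only names the series), and the call needs an internet connection.

```r
library(tidyquant)

# Assumed FRED series ID for beer, wine, and distilled alcoholic beverages sales.
# The from/to dates are also assumptions chosen to cover 2010 through 2017.
beer_sales_tbl <- tq_get(
  "S4248SM144NCEN",
  get  = "economic.data",
  from = "2010-01-01",
  to   = "2017-10-27"
)

# Quick look at the first rows (expect at least a date and a price column)
head(beer_sales_tbl)
```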
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, and it’s a good idea to identify spots where we will split the data into training, test and validation sets.
Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!
DEMO: h2o + timetk, Time Series Machine Learning
We’ll follow a similar workflow for time series machine learning from the timetk + linear regression post on Tuesday. However, this time we’ll swap out the lm() function for h2o.automl() to get superior accuracy!
Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple pointers for this demo:
- Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.
- Objective: We’ll predict the next 8 months of data for 2017 using the time series signature. We’ll then compare the results to the two prior demos that predicted the same data using different methods: timetk + lm() (linear regression) and sweep + auto.arima() (ARIMA).
We’ll go through a workflow that can be used to perform time series machine learning.
Step 0: Review data
Just to show our starting point, let’s print out our beer_sales_tbl. We use glimpse() to take a quick peek at the data.
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use glimpse() for quick inspection. See how there are now 30 features. Not all will be important, but some will.
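To make the augmentation step concrete, here is a small sketch on a made-up monthly table. The name toy_tbl and its values are hypothetical stand-ins for beer_sales_tbl, so the example runs without downloading anything.

```r
library(dplyr)
library(timetk)

# Hypothetical stand-in for beer_sales_tbl: 24 months of made-up values
toy_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 24),
  price = seq(6000, by = 50, length.out = 24)
)

# Expand the date column into time series signature features
toy_aug <- toy_tbl %>%
  tk_augment_timeseries_signature()

# Inspect the new columns (year, month, month.lbl, and so on)
glimpse(toy_aug)
```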
Step 2: Prep the Data for H2O
We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps.
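The three cleaning operations can be sketched with dplyr as follows. The table toy_aug is a hypothetical miniature of the augmented data (one date column, one column containing an NA, one ordered factor), and the select(where(...)) style assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical miniature of the augmented table
toy_aug <- tibble(
  date      = seq(as.Date("2010-01-01"), by = "month", length.out = 6),
  price     = c(100, 110, 120, 130, 140, 150),
  diff      = c(NA, 31, 28, 31, 30, 31),   # contains a missing value
  month.lbl = factor(month.name[1:6], levels = month.name, ordered = TRUE)
)

toy_clean <- toy_aug %>%
  # 1. Drop date columns
  select(where(~ !inherits(.x, "Date"))) %>%
  # 2. Drop columns with any missing values
  select(where(~ !anyNA(.x))) %>%
  # 3. Convert ordered factors to plain (unordered) factors
  mutate(across(where(is.ordered), ~ factor(as.character(.x))))

glimpse(toy_clean)
```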
Let’s split the data into training, validation, and test sets following the time ranges in the visualization above.
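A sketch of a year-based split, assuming the cleaned table has a year column from the time series signature. The table toy_clean and its placeholder values are hypothetical; the cut points follow the ranges described in Step 3 (training before 2016, validation in 2016, test in 2017).

```r
library(dplyr)

# Hypothetical cleaned table: one row per month, Jan 2010 through Aug 2017
dates <- seq(as.Date("2010-01-01"), as.Date("2017-08-01"), by = "month")
toy_clean <- tibble(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  price = seq_along(dates)   # placeholder target values
)

# Split by year so that no future information leaks into training
train_tbl <- toy_clean %>% filter(year < 2016)
valid_tbl <- toy_clean %>% filter(year == 2016)
test_tbl  <- toy_clean %>% filter(year == 2017)
```

Splitting on time rather than at random matters here: the test set should only contain observations that come after everything the model was trained on.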
Step 3: Model with H2O
First, fire up h2o. This will initialize the Java Virtual Machine (JVM) that H2O uses locally.
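A minimal sketch of the startup step. It assumes Java is installed and on your path; the h2o.no_progress() call is optional and only quiets the progress bars.

```r
library(h2o)

# Start (or connect to) a local H2O cluster; requires a working Java install
h2o.init()

# Optional: silence progress bars in scripts and rendered documents
h2o.no_progress()
```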
We change our data to an H2OFrame object that can be interpreted by the h2o package.
Set the names that h2o will use as the target and predictor variables.
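The conversion and naming steps might look like the sketch below. The small train_tbl is a hypothetical stand-in for the real training table; in the full workflow the same as.h2o() call would be repeated for the validation and test tables.

```r
library(h2o)
h2o.init()
h2o.no_progress()

# Hypothetical stand-in for the cleaned training table
train_tbl <- data.frame(
  price = c(100, 120, 140, 110, 130, 150),
  year  = c(2014, 2014, 2014, 2015, 2015, 2015),
  month = c(1, 2, 3, 1, 2, 3)
)

# Convert the R data frame to an H2OFrame
train_h2o <- as.h2o(train_tbl)

# Target is the price column; predictors are every other column
y <- "price"
x <- setdiff(names(train_h2o), y)
```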
Apply any regression model to the data. We’ll use h2o.automl() with the following arguments:
- x = x: The names of our feature columns.
- y = y: The name of our target column.
- training_frame = train_h2o: Our training set, consisting of data from 2010 to the start of 2016.
- validation_frame = valid_h2o: Our validation set, consisting of data in the year 2016. H2O uses this to ensure the model does not overfit the data.
- leaderboard_frame = test_h2o: The models get ranked based on MAE performance against this set.
- max_runtime_secs = 60: We supply this to speed up H2O’s modeling. The algorithm has a large number of complex models, so we want to keep things moving at the expense of some accuracy.
- stopping_metric = "deviance": Use deviance as the stopping metric, which provides very good results for MAPE.
Next we extract the leader model.
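Putting the arguments above together, a run could be sketched like this. The toy series (a trend plus a seasonal wave) is hypothetical and only there so the example is self-contained; with the real data you would pass the H2OFrames built from the beer sales tables.

```r
library(h2o)
h2o.init()
h2o.no_progress()

# Hypothetical monthly series (trend plus seasonality), Jan 2010 to Aug 2017
dates <- seq(as.Date("2010-01-01"), as.Date("2017-08-01"), by = "month")
toy <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  index = seq_along(dates)
)
toy$price <- 100 + 0.5 * toy$index + 10 * sin(2 * pi * toy$month / 12)

# Time-based split, converted to H2OFrames
train_h2o <- as.h2o(toy[toy$year <  2016, ])
valid_h2o <- as.h2o(toy[toy$year == 2016, ])
test_h2o  <- as.h2o(toy[toy$year == 2017, ])

y <- "price"
x <- setdiff(names(train_h2o), y)

# Run AutoML with the arguments described in the text
automl_models_h2o <- h2o.automl(
  x                 = x,
  y                 = y,
  training_frame    = train_h2o,
  validation_frame  = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs  = 60,
  stopping_metric   = "deviance"
)

# Extract the top-ranked model and view the leaderboard
automl_leader <- automl_models_h2o@leader
automl_models_h2o@leaderboard
```

Two caveats, as far as I can tell from the H2O documentation: newer releases ignore validation_frame unless nfolds = 0 (cross-validation is used instead), and the regression leaderboard is sorted by deviance by default unless you pass sort_metric. Check the documentation for your installed version.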
Step 4: Predict
Generate predictions using h2o.predict() on the test data.
There are a few ways to evaluate performance. We’ll go through the easy way, which is h2o.performance(). This yields a preset group of values that are commonly used to compare regression models, including root mean squared error (RMSE) and mean absolute error (MAE).
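Here is a sketch of the predict and performance calls. To keep it quick and self-contained it fits a plain GLM with h2o.glm() on hypothetical data as a stand-in; this is not the AutoML leader from the demo, but h2o.predict() and h2o.performance() are called the same way on any H2O model.

```r
library(h2o)
h2o.init()
h2o.no_progress()

# Hypothetical data with a simple linear trend
toy <- data.frame(t = 1:40, price = 100 + 2 * (1:40))
train_h2o <- as.h2o(toy[1:32, ])
test_h2o  <- as.h2o(toy[33:40, ])

# Stand-in model (a GLM) so the sketch runs quickly; not the AutoML leader
model <- h2o.glm(
  x = "t", y = "price",
  training_frame = train_h2o,
  family = "gaussian"
)

# Predictions come back as an H2OFrame with a "predict" column
pred_h2o <- h2o.predict(model, newdata = test_h2o)

# Preset regression metrics on the held-out data
perf <- h2o.performance(model, newdata = test_h2o)
h2o.rmse(perf)
h2o.mae(perf)
```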
Our preferred metric for this assessment is mean absolute percentage error (MAPE), which is not included above. However, we can easily calculate it by investigating the error on our test set (actuals vs predictions).
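MAPE is just the mean of the absolute percentage errors, so it takes only a few lines of base R. The numbers below are hypothetical and chosen to make the arithmetic easy to follow.

```r
# Hypothetical actual and predicted values
actual <- c(100, 200, 400)
pred   <- c(110, 190, 400)

# Error and percentage error for each observation
error     <- actual - pred      # -10, 10, 0
error_pct <- error / actual     # -0.10, 0.05, 0.00

# Mean absolute percentage error
mape <- mean(abs(error_pct))
mape  # 0.05, i.e. 5%
```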
For comparison’s sake, we can calculate a few residual metrics.
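One way to sketch this is a single summarise() call over an error table. The table error_tbl and its values are hypothetical; the metrics are mean error (ME), RMSE, MAE, MAPE, and mean percentage error (MPE).

```r
library(dplyr)

# Hypothetical actuals and predictions with their errors
error_tbl <- tibble(
  actual = c(100, 200, 400),
  pred   = c(110, 190, 400)
) %>%
  mutate(
    error     = actual - pred,
    error_pct = error / actual
  )

# Collapse the residuals into summary metrics
metrics_tbl <- error_tbl %>%
  summarise(
    me   = mean(error),
    rmse = sqrt(mean(error^2)),
    mae  = mean(abs(error)),
    mape = mean(abs(error_pct)),
    mpe  = mean(error_pct)
  )

metrics_tbl
```

ME and MPE keep the sign of the error, so they indicate bias (consistent over- or under-prediction), while RMSE, MAE, and MAPE measure the size of the error.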
And The Winner of Demo Week Is…
The MAPE for the combination of h2o + timetk is superior to the two prior demos:
- timetk + h2o: MAPE = 3.9% (This demo)
- timetk + linear regression: MAPE = 4.3% (timetk demo)
- sweep + ARIMA: MAPE = 4.3% (sweep demo)
A question for the interested reader to figure out: What happens to the accuracy when you average the predictions of all three different methods? Try it to find out.
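If you want a starting point for that experiment, the mechanics of averaging can be sketched as below. All numbers and the vector names are hypothetical placeholders, so the result says nothing about what happens with the real forecasts; swap in the three sets of predictions from the demos.

```r
# Hypothetical actuals and three sets of predictions (placeholders only)
actual     <- c(100, 200, 400)
pred_h2o   <- c(110, 190, 400)
pred_lm    <- c( 90, 210, 380)
pred_arima <- c(100, 215, 420)

# Simple ensemble: average the three predictions for each period
pred_avg <- rowMeans(cbind(pred_h2o, pred_lm, pred_arima))

# MAPE of the averaged forecast
mape_avg <- mean(abs((actual - pred_avg) / actual))
mape_avg
```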
Note that the accuracy of time series machine learning may not always be superior to ARIMA and other forecast techniques, including those implemented by prophet and GARCH methods. The data scientist has a responsibility to test different methods and to select the right tool for the job.
HaLLowEen TRick oR TrEat BoNuS!
We are going to visualize the forecast compared to the actual values, but this time taking a cue from @lenkiefer’s theme_spooky described in one of his recent posts, Mortgage Rates are Low!
We’re going to need to load a few libraries to get set up. The biggest challenge is the fonts, but there’s a really cool package called extrafont that we can use. We’ll use extrafont to load the Chiller font set. Load the bonus library.
Next, you’ll need to set up the Chiller font. Revolution Analytics has a great article, How to Use Your Favorite Fonts in R Charts, which will get you up and running with extrafont. IMPORTANT: Make sure you go through the process of loading your system fonts with font_import().
Once fonts are imported, you can load them into your R session with loadfonts() from extrafont.
We’ll use Len’s script for theme_spooky(). I highly encourage you to use theme_spooky() all month of October around the office. Very spooky, and surprisingly engaging. :)
Now let’s create the final visualization so we can see our spooky forecast… Conclusion from the plot: It’s scary how accurate h2o is.
Next Steps
We’ve only scratched the surface of h2o. There’s more to learn, including working with classifiers and unsupervised learning. Here are a few resources to help you along the way:
Announcements
We have a busy couple of weeks. In addition to Demo Week, we have:
Facebook LIVE DataTalk
Matt was recently hosted on Experian DataLabs live webcast, #DataTalk, where he spoke about Machine Learning in Human Resources. The talk already has 80K+ views and is growing!! Check it out if you are interested in #rstats, #hranalytics and #MachineLearning.
EARL
On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.
Courses
Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.
About Business Science
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!