In this article, I will explain some basic functional programming for fitting multiple time series in R, particularly using the purrr interface.
TL;DR: you can find the distraction-free script here, and read some of my concluding remarks for a quick summary :grin:
Preface
When it comes to time series analysis and forecasting, R users are blessed with an invaluable tool that helps us conveniently fit univariate time series models, from basic to advanced: the forecast package. This package is authored by Prof. Rob J. Hyndman from Monash University. If you are interested in time series and forecasting, but new to the forecast package, I really recommend checking out his Forecasting: Principles and Practice book to get you started; its online version is totally free!
Back when I was working on some econometrics cases, this package saved me a lot of work. Well, who can resist using auto.arima()? (I actually agree with you on this Iffa, just too “Bagas” to admit :v:). I also found its source code very helpful when I needed to edit the algorithms to suit my needs.
Yet, this package still has some limitations when you need to handle multiple time series objects simultaneously; it doesn’t have that kind of built-in support (yet!). I was very curious back then. First, I found a solution proposed by Earo Wang. She released the hts (hierarchical time series) package, which gives a workaround for this case using an aggregation approach. But somehow I found the package less versatile than forecast for fitting complex univariate models; well, it wasn’t built for that purpose in the first place.
Then I finally found an article that inspired me to write this post: Pur(r)rify Your Carets by Rahul Sangole. He cleverly hacks his model selection routines using purrr. Basically, he explains a workflow that uses a functional programming approach to fit combinations of multiple data transformations and multiple models.
This kind of approach is actually already implemented in the sweep package for time series and forecasting. Unfortunately, seeing that the package is now “orphaned” (it doesn’t have an official maintainer), I decided not to include it in this tutorial; don’t worry, it doesn’t affect the workflow that much.
In this article, I will explain how to a-purrr-opriately fit multiple time series using functional programming. If you happen to be new to purrr, I really recommend reading this purrr-fect tutorial by Jennifer Bryan; and if you are confused by my piping flow, I suggest you read my Data Wars series first :grin:
Libraries used
For this tutorial, we will use some time series, data wrangling, and statistical modeling packages. Among them, the core packages we will use are:
- forecast: for time series modeling and forecasting
- yardstick: for measuring forecast performance
- recipes: for data preprocessing
- purrr: for functional programming
- dplyr: for general data wrangling
- lubridate: for working with dates
In addition to those, there are some packages that I recommend for an easier workflow:
- magrittr: for various pipe operators
- timetk: for creating the future time index for our forecast results
- tidyquant: for some ggplot aesthetics
Note that I don’t import timetk and tidyquant in the following chunk, since we only use one or two of their functions. Also, some packages are already included in bigger packages; e.g., dplyr and purrr are already included in tidyverse, and recipes and yardstick are already included in tidymodels, so we only need to import the bigger packages:
# import libs
library(forecast)
library(lubridate)
library(magrittr)
library(tidymodels)
library(tidyverse)
Hourly Energy Consumption dataset from PJM
For this tutorial, we will use the Hourly Energy Consumption dataset provided by PJM, which can be accessed from this Kaggle dataset, kindly shared by Rob Mulla. The data contains the hourly power consumption estimated by each electricity provider.
The dataset actually needs more cleaning before we can use it for time series analysis and forecasting. The version I use in this tutorial has already gone through some cleaning and adjustment. If you are interested in the cleaning process, or simply want to reproduce the results using the same dataset, you can check the script in this post’s folder in my blog repository (to be exact, in the file data.R inside the R folder).
Let’s start by importing the dataset:
# import dataset
pjm <- read_csv("data_input/pjm.csv")
pjm
#> # A tibble: 362,688 x 3
#> datetime provider cons
#> <dttm> <chr> <dbl>
#> 1 2013-06-01 00:00:00 AEP 13477
#> 2 2013-06-01 01:00:00 AEP 12699
#> 3 2013-06-01 02:00:00 AEP 12274
#> 4 2013-06-01 03:00:00 AEP 11904
#> 5 2013-06-01 04:00:00 AEP 11862
#> 6 2013-06-01 05:00:00 AEP 11976
#> 7 2013-06-01 06:00:00 AEP 12447
#> 8 2013-06-01 07:00:00 AEP 12713
#> 9 2013-06-01 08:00:00 AEP 13730
#> 10 2013-06-01 09:00:00 AEP 14609
#> # ... with 362,678 more rows
The data contains datetime, provider, and cons columns. The dates range from 2013-06-01 to 2018-08-02 23:00:00. Here’s a quick plot of the last four weeks of data:
# quick plot
pjm %>%
filter(datetime >= max(datetime) - hours(24 * 7 * 4)) %>%
ggplot(aes(x = datetime, y = cons)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq()
The scenario
To help us benchmark multiple models, we will need to split off some portion of our data for validation. The strategy here is to cut the last 4 weeks (approximately 1 month) as our test dataset. Then, we will cut another, bigger portion as the train dataset, say, 4 weeks times 3 (approximately 3 months). This strategy is just a simplified version of the Rolling Forecasting Origin, with only one pair of train and test samples. Of course, you can experiment with the lengths; I will definitely cover more on this in one of my future posts, stay tuned!
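For the curious, here is a minimal sketch of how a fuller rolling forecasting origin scheme could be generated with rsample (attached via tidymodels) for a single provider. This is an illustration only; we won’t use it in the rest of the post:
# a sketch only: multiple rolling train/test splits for one provider
pjm %>%
  filter(provider == "AEP") %>%
  arrange(datetime) %>%
  rsample::rolling_origin(
    initial = 24 * 7 * 4 * 3, # train size
    assess = 24 * 7 * 4,      # test size
    skip = 24 * 7 * 4,        # move the origin roughly 4 weeks per split
    cumulative = FALSE        # keep the train window length fixed
  )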
So the first step here is to get the start and end dates of the train and test samples. The most straightforward way is to define the train and test sizes, then work backwards from the last timestamp to get the start and end index of each:
# train-val-test size
test_size <- 24 * 7 * 4
train_size <- 24 * 7 * 4 * 3
# get the min-max of the time index for each sample
test_end <- max(pjm$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
To make them handier, we can combine the start and end dates of each sample into a date interval:
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
intrain
#> [1] 2018-04-13 UTC--2018-07-05 23:00:00 UTC
intest
#> [1] 2018-07-06 UTC--2018-08-02 23:00:00 UTC
I really recommend this splitting approach, since you can use the date intervals for many things while keeping the full data intact.
For example, we can use the intervals to visualize the train and test series:
# plot the train and test
pjm %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = cons, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Data preprocessing using recipes
Data preprocessing is a very crucial step in time series model fitting. In this tutorial, we will use the recipes package for it.
Since the recipes package works columnwise, we need to convert our data into a wide format first:
# convert to wide format
pjm %<>%
spread(provider, cons)
pjm
#> # A tibble: 45,336 x 9
#> datetime AEP COMED DAYTON DEOK DOM DUQ EKPC FE
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-06-01 00:00:00 13477 11378 1682 2773 10150 1739 1166 7099
#> 2 2013-06-01 01:00:00 12699 10584 1568 2613 9406 1641 1098 6617
#> 3 2013-06-01 02:00:00 12274 9973 1514 2499 8910 1569 1036 6351
#> 4 2013-06-01 03:00:00 11904 9451 1467 2409 8551 1518 1023 6201
#> 5 2013-06-01 04:00:00 11862 9184 1456 2392 8413 1503 949 6157
#> 6 2013-06-01 05:00:00 11976 9043 1485 2425 8409 1515 1017 6187
#> 7 2013-06-01 06:00:00 12447 9054 1511 2469 8582 1531 1036 6257
#> 8 2013-06-01 07:00:00 12713 9290 1596 2606 9600 1586 1096 6606
#> 9 2013-06-01 08:00:00 13730 9980 1711 2823 10816 1745 1200 7182
#> 10 2013-06-01 09:00:00 14609 10590 1823 3013 12017 1903 1285 7777
#> # ... with 45,326 more rows
Then we can define the preprocessing recipe(), and bake() our data based on it:
# recipes: square root, center, scale
rec <- recipe(~ ., filter(pjm, datetime %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
pjm <- bake(rec, pjm)
pjm
#> # A tibble: 45,336 x 9
#> datetime AEP COMED DAYTON DEOK DOM DUQ EKPC
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-06-01 00:00:00 -0.393 0.0676 -0.784 -0.510 -0.373 0.481 -0.837
#> 2 2013-06-01 01:00:00 -0.722 -0.247 -1.10 -0.741 -0.693 0.191 -1.08
#> 3 2013-06-01 02:00:00 -0.907 -0.496 -1.25 -0.911 -0.913 -0.0282 -1.31
#> 4 2013-06-01 03:00:00 -1.07 -0.716 -1.39 -1.05 -1.08 -0.186 -1.35
#> 5 2013-06-01 04:00:00 -1.09 -0.831 -1.42 -1.07 -1.14 -0.233 -1.63
#> 6 2013-06-01 05:00:00 -1.04 -0.892 -1.34 -1.02 -1.14 -0.196 -1.38
#> 7 2013-06-01 06:00:00 -0.831 -0.887 -1.26 -0.956 -1.06 -0.146 -1.31
#> 8 2013-06-01 07:00:00 -0.716 -0.785 -1.02 -0.752 -0.608 0.0240 -1.09
#> 9 2013-06-01 08:00:00 -0.288 -0.493 -0.706 -0.439 -0.0967 0.499 -0.719
#> 10 2013-06-01 09:00:00 0.0699 -0.244 -0.408 -0.175 0.381 0.949 -0.431
#> # ... with 45,326 more rows, and 1 more variable: FE <dbl>
Note: don’t forget to estimate the preprocessing using the data in the train sample only, since this mimics the actual scenario where we only have the “train” data.
If we use the recipes package, the next step is to create a function that reverts the transformed values back:
# function to revert the preprocessing back to the original scale
rec_revert <- function(vector, rec, varname) {
  # get the stored center (means) and scale (sds) values for this variable
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]
  # unscale and uncenter, then square to undo step_sqrt()
  results <- (vector * rec_scale + rec_center) ^ 2
  # round back to whole consumption values
  results <- round(results)
  # return the results
  results
}
This revert function will be very handy whenever we want to convert the data back to its original scale.
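For example, here’s a quick sanity check (pjm is still in wide format at this point) that reverting the transformed AEP series recovers the original consumption values:
# sanity check: the first values should match the original data
# (13477, 12699, 12274, ...)
head(rec_revert(pjm$AEP, rec, "AEP"))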
Now we can convert our data into the long format again:
# convert back to long format
pjm %<>%
gather(provider, cons, -datetime)
pjm
#> # A tibble: 362,688 x 3
#> datetime provider cons
#> <dttm> <chr> <dbl>
#> 1 2013-06-01 00:00:00 AEP -0.393
#> 2 2013-06-01 01:00:00 AEP -0.722
#> 3 2013-06-01 02:00:00 AEP -0.907
#> 4 2013-06-01 03:00:00 AEP -1.07
#> 5 2013-06-01 04:00:00 AEP -1.09
#> 6 2013-06-01 05:00:00 AEP -1.04
#> 7 2013-06-01 06:00:00 AEP -0.831
#> 8 2013-06-01 07:00:00 AEP -0.716
#> 9 2013-06-01 08:00:00 AEP -0.288
#> 10 2013-06-01 09:00:00 AEP 0.0699
#> # ... with 362,678 more rows
Nested model fitting and forecasting
For functional programming with purrr, we need to convert our tbl into a nested tbl. You can think of nested data as a table inside a table, controlled through a key indicator; in other words, we can have a tbl for each provider and sample. Using this format, the fitting and forecasting process becomes very versatile, yet we can still unnest the results as long as we have a proper key like provider.
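If nested data is new to you, here’s a tiny toy sketch of the idea using mtcars (not part of our workflow); the result has one row per cyl value, with each group’s remaining columns packed as a tibble in a list-column:
# toy example: nest() packs each group's rows into a list-column of tibbles
mtcars %>%
  group_by(cyl) %>%
  nest()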
Let’s start by converting our tbl into a nested tbl. First, we need to add a sample indicator so it can be recognized as a key when we nest() the data:
# adjust by sample
pjm %<>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
pjm
#> # A tibble: 21,504 x 4
#> datetime provider cons sample
#> <dttm> <chr> <dbl> <chr>
#> 1 2018-04-13 00:00:00 AEP -1.26 train
#> 2 2018-04-13 01:00:00 AEP -1.43 train
#> 3 2018-04-13 02:00:00 AEP -1.52 train
#> 4 2018-04-13 03:00:00 AEP -1.49 train
#> 5 2018-04-13 04:00:00 AEP -1.39 train
#> 6 2018-04-13 05:00:00 AEP -1.07 train
#> 7 2018-04-13 06:00:00 AEP -0.548 train
#> 8 2018-04-13 07:00:00 AEP -0.308 train
#> 9 2018-04-13 08:00:00 AEP -0.228 train
#> 10 2018-04-13 09:00:00 AEP -0.209 train
#> # ... with 21,494 more rows
Then, we can nest() the data by provider and sample, and spread() the tbl based on the sample key:
# nest the train data
pjm %<>%
group_by(provider, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
pjm
#> # A tibble: 8 x 3
#> provider test train
#> <chr> <list> <list>
#> 1 AEP <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 2 COMED <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 3 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 4 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 5 DOM <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 6 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 7 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 8 FE <tibble [672 x 2]> <tibble [2,016 x 2]>
Preparing the data model list
For the data and model combinations, we can start by defining some options for the data representation. Recall that our series have a relatively high frequency, so we will consider two options here: a ts object with daily seasonality, and an msts object with daily and weekly seasonality.
To incorporate them into our nested data, we need to create another nested data frame containing the data representation names, and the accompanying functions that convert a series into the specified representation.
Let’s start with making a named list containing the transformation functions:
# data funs list
data_funs <- list(
ts = function(x) ts(x$cons, frequency = 24),
msts = function(x) msts(x$cons, seasonal.periods = c(24, 24 * 7))
)
data_funs
#> $ts
#> function (x)
#> ts(x$cons, frequency = 24)
#>
#> $msts
#> function (x)
#> msts(x$cons, seasonal.periods = c(24, 24 * 7))
Then we can convert the list into a tbl using enframe(). Note that we should also add a key (the provider in our case) so we can left_join() it later. The trick here is to use the rep() function:
# convert to nested
data_funs %<>%
rep(length(unique(pjm$provider))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(provider =
sort(rep(unique(pjm$provider), length(unique(.$data_fun_name))))
)
data_funs
#> # A tibble: 16 x 3
#> data_fun_name data_fun provider
#> <chr> <list> <chr>
#> 1 ts <fn> AEP
#> 2 msts <fn> AEP
#> 3 ts <fn> COMED
#> 4 msts <fn> COMED
#> 5 ts <fn> DAYTON
#> 6 msts <fn> DAYTON
#> 7 ts <fn> DEOK
#> 8 msts <fn> DEOK
#> 9 ts <fn> DOM
#> 10 msts <fn> DOM
#> 11 ts <fn> DUQ
#> 12 msts <fn> DUQ
#> 13 ts <fn> EKPC
#> 14 msts <fn> EKPC
#> 15 ts <fn> FE
#> 16 msts <fn> FE
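Since we will repeat this exact rep()-and-enframe() trick for the model list later, you could optionally wrap it in a small helper; here’s a sketch (a hypothetical helper, not used in the rest of this post):
# hypothetical helper: repeat a named list of functions for every provider
cross_with_providers <- function(fun_list, providers, name, value) {
  fun_list %>%
    rep(length(providers)) %>%
    enframe(name, value) %>%
    mutate(provider = sort(rep(providers, length(fun_list))))
}
# data_funs could then be built with:
# cross_with_providers(data_funs, unique(pjm$provider), "data_fun_name", "data_fun")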
Then the last step here is to join the nested functions with our nested data:
# combine with models
pjm %<>%
left_join(data_funs)
pjm
#> # A tibble: 16 x 5
#> provider test train data_fun_name data_fun
#> <chr> <list> <list> <chr> <list>
#> 1 AEP <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 2 AEP <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 3 COMED <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 4 COMED <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 5 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 6 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 7 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 8 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 9 DOM <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 10 DOM <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 11 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 12 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 13 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 14 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 15 FE <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 16 FE <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
Preparing the time series model list
Similar to how we created the data representation list, we can also store some time series models as a nested list.
Again, let’s start by making a list of models. For this tutorial, let’s use auto.arima(), ets(), stlm(), and tbats(). We need to make some functions that call those models, and store them inside a list:
# models list
models <- list(
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
models
#> $auto.arima
#> function (x)
#> auto.arima(x)
#>
#> $ets
#> function (x)
#> ets(x)
#>
#> $stlm
#> function (x)
#> stlm(x)
#>
#> $tbats
#> function (x)
#> tbats(x, use.box.cox = FALSE)
Then we can convert it into a nested format, as in the previous example:
# convert to nested
models %<>%
rep(length(unique(pjm$provider))) %>%
enframe("model_name", "model") %>%
mutate(provider =
sort(rep(unique(pjm$provider), length(unique(.$model_name))))
)
models
#> # A tibble: 32 x 3
#> model_name model provider
#> <chr> <list> <chr>
#> 1 auto.arima <fn> AEP
#> 2 ets <fn> AEP
#> 3 stlm <fn> AEP
#> 4 tbats <fn> AEP
#> 5 auto.arima <fn> COMED
#> 6 ets <fn> COMED
#> 7 stlm <fn> COMED
#> 8 tbats <fn> COMED
#> 9 auto.arima <fn> DAYTON
#> 10 ets <fn> DAYTON
#> # ... with 22 more rows
And finally, we can join the result into our nested data. Note that we can also apply some rules here. For example, since ets() and auto.arima() are not suitable for multiple seasonality time series, we can filter() them out for data with the msts class:
# combine with models
pjm %<>%
left_join(models) %>%
filter(
!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts")
)
Here’s our data with the full combination:
pjm
#> # A tibble: 48 x 7
#> provider test train data_fun_name data_fun model_name model
#> <chr> <list> <list> <chr> <list> <chr> <lis>
#> 1 AEP <tibble [6~ <tibble [2~ ts <fn> auto.arima <fn>
#> 2 AEP <tibble [6~ <tibble [2~ ts <fn> ets <fn>
#> 3 AEP <tibble [6~ <tibble [2~ ts <fn> stlm <fn>
#> 4 AEP <tibble [6~ <tibble [2~ ts <fn> tbats <fn>
#> 5 AEP <tibble [6~ <tibble [2~ msts <fn> stlm <fn>
#> 6 AEP <tibble [6~ <tibble [2~ msts <fn> tbats <fn>
#> 7 COMED <tibble [6~ <tibble [2~ ts <fn> auto.arima <fn>
#> 8 COMED <tibble [6~ <tibble [2~ ts <fn> ets <fn>
#> 9 COMED <tibble [6~ <tibble [2~ ts <fn> stlm <fn>
#> 10 COMED <tibble [6~ <tibble [2~ ts <fn> tbats <fn>
#> # ... with 38 more rows
Execute the nested fitting
To execute the model fitting, we need to wrap the needed arguments into a list using the map() function, then call each function on its arguments using invoke_map(). We need to do this twice: first for the data transformation, using the functions inside data_fun; then for fitting the models, using the functions inside model.
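If invoke_map() is new to you, here’s a toy sketch (with made-up functions, not part of our workflow) of how it pairs a list of functions with a list of argument lists:
# toy example: invoke_map() calls each function with its matching arguments
fns <- list(add_one = function(x) x + 1, double = function(x) x * 2)
args <- list(list(x = 1), list(x = 10))
invoke_map(fns, args)
#> $add_one
#> [1] 2
#>
#> $double
#> [1] 20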
See the code in the chunk below for the implementation of the process:
# invoke nested fitting
pjm %<>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
pjm
#> # A tibble: 48 x 8
#> provider test train data_fun_name data_fun model_name model fitted
#> <chr> <list> <list> <chr> <list> <chr> <lis> <list>
#> 1 AEP <tibbl~ <tibble~ ts <fn> auto.arima <fn> <fr_A~
#> 2 AEP <tibbl~ <tibble~ ts <fn> ets <fn> <ets>
#> 3 AEP <tibbl~ <tibble~ ts <fn> stlm <fn> <stlm>
#> 4 AEP <tibbl~ <tibble~ ts <fn> tbats <fn> <tbat~
#> 5 AEP <tibbl~ <tibble~ msts <fn> stlm <fn> <stlm>
#> 6 AEP <tibbl~ <tibble~ msts <fn> tbats <fn> <tbat~
#> 7 COMED <tibbl~ <tibble~ ts <fn> auto.arima <fn> <fr_A~
#> 8 COMED <tibbl~ <tibble~ ts <fn> ets <fn> <ets>
#> 9 COMED <tibbl~ <tibble~ ts <fn> stlm <fn> <stlm>
#> 10 COMED <tibbl~ <tibble~ ts <fn> tbats <fn> <tbat~
#> # ... with 38 more rows
Now the next step is to measure the test error. First, we forecast() over the test horizon, then pipe the result into an error measurement using one of the functions provided by yardstick. For this example, let’s use the Root Mean Squared Error as the error measurement (rmse_vec() is the variant for simple vector calculations):
# calculate test errors
pjm %<>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2_dbl(test, ~ rmse_vec(truth = .y$cons, estimate = .x$mean))
) %>%
arrange(provider, error)
pjm %>%
select(provider, ends_with("_name"), error)
#> # A tibble: 48 x 4
#> provider data_fun_name model_name error
#> <chr> <chr> <chr> <dbl>
#> 1 AEP msts stlm 0.733
#> 2 AEP msts tbats 0.745
#> 3 AEP ts stlm 0.761
#> 4 AEP ts auto.arima 0.820
#> 5 AEP ts tbats 0.951
#> 6 AEP ts ets 4.28
#> 7 COMED msts tbats 0.805
#> 8 COMED msts stlm 0.813
#> 9 COMED ts stlm 0.837
#> 10 COMED ts tbats 0.935
#> # ... with 38 more rows
Unnesting the result
Besides measuring the error, we can also compare the forecast results with the real test series through graphical analysis. To do that, we need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to create a key that is unique for each combination of data representation and model, plus one for the actual series. Once we get to that format, we can conveniently unnest() the data into a proper long format.
Let’s start by creating the forecasts:
pjm_test <- pjm %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
cons = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
pjm_test
#> # A tibble: 48 x 11
#> provider test train data_fun_name data_fun model_name model fitted
#> <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
#> 1 AEP <tib~ <tib~ msts <fn> stlm <fn> <stlm>
#> 2 AEP <tib~ <tib~ msts <fn> tbats <fn> <tbat~
#> 3 AEP <tib~ <tib~ ts <fn> stlm <fn> <stlm>
#> 4 AEP <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~
#> 5 AEP <tib~ <tib~ ts <fn> tbats <fn> <tbat~
#> 6 AEP <tib~ <tib~ ts <fn> ets <fn> <ets>
#> 7 COMED <tib~ <tib~ msts <fn> tbats <fn> <tbat~
#> 8 COMED <tib~ <tib~ msts <fn> stlm <fn> <stlm>
#> 9 COMED <tib~ <tib~ ts <fn> stlm <fn> <stlm>
#> 10 COMED <tib~ <tib~ ts <fn> tbats <fn> <tbat~
#> # ... with 38 more rows, and 3 more variables: error <dbl>,
#> # forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
pjm_test %<>%
select(provider, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -provider)
pjm_test
#> # A tibble: 56 x 3
#> provider key value
#> <chr> <chr> <list>
#> 1 AEP actual <tibble [672 x 2]>
#> 2 COMED actual <tibble [672 x 2]>
#> 3 DAYTON actual <tibble [672 x 2]>
#> 4 DEOK actual <tibble [672 x 2]>
#> 5 DOM actual <tibble [672 x 2]>
#> 6 DUQ actual <tibble [672 x 2]>
#> 7 EKPC actual <tibble [672 x 2]>
#> 8 FE actual <tibble [672 x 2]>
#> 9 AEP msts-stlm <tibble [672 x 2]>
#> 10 COMED msts-stlm <tibble [672 x 2]>
#> # ... with 46 more rows
Last but not least, unnest() the series data and apply the revert function:
pjm_test %<>%
unnest(value) %>%
mutate(cons = rec_revert(cons, rec, provider))
pjm_test
#> # A tibble: 37,632 x 4
#> provider key datetime cons
#> <chr> <chr> <dttm> <dbl>
#> 1 AEP actual 2018-07-06 00:00:00 15139
#> 2 AEP actual 2018-07-06 01:00:00 14325
#> 3 AEP actual 2018-07-06 02:00:00 13685
#> 4 AEP actual 2018-07-06 03:00:00 13350
#> 5 AEP actual 2018-07-06 04:00:00 13357
#> 6 AEP actual 2018-07-06 05:00:00 13671
#> 7 AEP actual 2018-07-06 06:00:00 14124
#> 8 AEP actual 2018-07-06 07:00:00 14646
#> 9 AEP actual 2018-07-06 08:00:00 15294
#> 10 AEP actual 2018-07-06 09:00:00 15804
#> # ... with 37,622 more rows
With the resulting tbl, we can compare the forecast and actual data on the test sample like this:
# plot forecast on test
pjm_test %>%
ggplot(aes(x = datetime, y = cons, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Automate the model selection
As you can see from the resulting plot, it is hard to decide which model to use based on graphical comparison alone. The most straightforward solution, then, is to use the model with the lowest error.
The model selection itself is very simple: we only need some basic dplyr verbs to filter() the model with the lowest error for each provider:
# filter by lowest test error
pjm %<>%
select(-fitted) %>% # remove unused
group_by(provider) %>%
filter(error == min(error)) %>%
ungroup()
pjm
#> # A tibble: 8 x 8
#> provider test train data_fun_name data_fun model_name model error
#> <chr> <list> <list> <chr> <list> <chr> <lis> <dbl>
#> 1 AEP <tibble ~ <tibble~ msts <fn> stlm <fn> 0.733
#> 2 COMED <tibble ~ <tibble~ msts <fn> tbats <fn> 0.805
#> 3 DAYTON <tibble ~ <tibble~ msts <fn> stlm <fn> 0.714
#> 4 DEOK <tibble ~ <tibble~ msts <fn> tbats <fn> 0.687
#> 5 DOM <tibble ~ <tibble~ msts <fn> tbats <fn> 0.865
#> 6 DUQ <tibble ~ <tibble~ ts <fn> stlm <fn> 0.619
#> 7 EKPC <tibble ~ <tibble~ ts <fn> auto.arima <fn> 0.721
#> 8 FE <tibble ~ <tibble~ msts <fn> stlm <fn> 0.713
Perform the final forecast
After we have the final model for each provider, we can finally proceed to the final forecast. We follow the same process as in the model fitting, but this time we use the combined train and test data as our new “full data”.
Now let’s start by recombining the train and test datasets:
# recombine samples
pjm %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(provider, fulldata, everything(), -train, -test)
pjm
#> # A tibble: 8 x 7
#> provider fulldata data_fun_name data_fun model_name model error
#> <chr> <list> <chr> <list> <chr> <lis> <dbl>
#> 1 AEP <tibble [2,688 x ~ msts <fn> stlm <fn> 0.733
#> 2 COMED <tibble [2,688 x ~ msts <fn> tbats <fn> 0.805
#> 3 DAYTON <tibble [2,688 x ~ msts <fn> stlm <fn> 0.714
#> 4 DEOK <tibble [2,688 x ~ msts <fn> tbats <fn> 0.687
#> 5 DOM <tibble [2,688 x ~ msts <fn> tbats <fn> 0.865
#> 6 DUQ <tibble [2,688 x ~ ts <fn> stlm <fn> 0.619
#> 7 EKPC <tibble [2,688 x ~ ts <fn> auto.arima <fn> 0.721
#> 8 FE <tibble [2,688 x ~ msts <fn> stlm <fn> 0.713
Then do the same nested fitting as in the previous example:
# invoke nested fitting for full data
pjm %<>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
Next, let’s make a tbl containing each of our forecast results, and convert our nested data into a proper long format:
# get forecast
pjm %<>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
cons = as.vector(.x$mean)
))
)
pjm
#> # A tibble: 8 x 9
#> provider fulldata data_fun_name data_fun model_name model error fitted
#> <chr> <list> <chr> <list> <chr> <lis> <dbl> <list>
#> 1 AEP <tibble~ msts <fn> stlm <fn> 0.733 <stlm>
#> 2 COMED <tibble~ msts <fn> tbats <fn> 0.805 <tbat~
#> 3 DAYTON <tibble~ msts <fn> stlm <fn> 0.714 <stlm>
#> 4 DEOK <tibble~ msts <fn> tbats <fn> 0.687 <tbat~
#> 5 DOM <tibble~ msts <fn> tbats <fn> 0.865 <tbat~
#> 6 DUQ <tibble~ ts <fn> stlm <fn> 0.619 <stlm>
#> 7 EKPC <tibble~ ts <fn> auto.arima <fn> 0.721 <fr_A~
#> 8 FE <tibble~ msts <fn> stlm <fn> 0.713 <stlm>
#> # ... with 1 more variable: forecast <list>
Finally, we can unnest() the data to get the result:
# unnest actual and forecast
pjm %<>%
select(provider, actual = fulldata, forecast) %>%
gather(key, value, -provider) %>%
unnest(value) %>%
mutate(cons = rec_revert(cons, rec, provider))
pjm
#> # A tibble: 26,880 x 4
#> provider key datetime cons
#> <chr> <chr> <dttm> <dbl>
#> 1 AEP actual 2018-04-13 00:00:00 11488
#> 2 AEP actual 2018-04-13 01:00:00 11100
#> 3 AEP actual 2018-04-13 02:00:00 10915
#> 4 AEP actual 2018-04-13 03:00:00 10972
#> 5 AEP actual 2018-04-13 04:00:00 11204
#> 6 AEP actual 2018-04-13 05:00:00 11895
#> 7 AEP actual 2018-04-13 06:00:00 13107
#> 8 AEP actual 2018-04-13 07:00:00 13681
#> 9 AEP actual 2018-04-13 08:00:00 13874
#> 10 AEP actual 2018-04-13 09:00:00 13921
#> # ... with 26,870 more rows
There you go: the forecast results for each provider, in a proper long format :grimacing:
You can proceed to plot the forecast:
pjm %>%
ggplot(aes(x = datetime, y = cons, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Concluding remarks
Fitting multiple time series models using purrr is admittedly a little bit complicated but, on the other hand, very flexible. It can also easily be extended to more complex scenarios, such as incorporating multiple preprocessing recipes as another combination dimension, or applying more complex rules when automating the fitting. That said, I do understand that for someone new to purrr, the workflow could be very hard to follow; to be honest, I spent around 80% of the time writing this post thinking about the easiest way to explain the process narratively. But I really hope that the code above is enough to explain the basic workflow for functional programming in time series :smile:
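As a teaser, here’s a rough sketch (hypothetical code, following the same rep()-and-enframe() pattern we used for data_funs and models) of how multiple preprocessing recipes could become another combination dimension:
# hypothetical sketch: preprocessing recipes as another nested combination
rec_funs <- list(
  sqrt = function(x) recipe(~ ., x) %>% step_sqrt(all_numeric()) %>% prep(),
  boxcox = function(x) recipe(~ ., x) %>% step_BoxCox(all_numeric()) %>% prep()
)
# then rep() over the providers, enframe(), and left_join() into the
# nested data, exactly as we did before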
If you want to try it out yourself, you can check out this post’s folder in my blog repository, and use the distraction-free version of the code above in the fit.R script, which is located inside the R folder. I really recommend using your own dataset, so you can understand the workflow better! :grin:
If you find any difficulties when trying this example, please let me know in the comments so I can help you :ok_hand:
Ok, as always, here’s a tidyverse logo for you:
#> * __ _ __ . o * .
#> / /_(_)__/ /_ ___ _____ _______ ___
#> / __/ / _ / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
#> * . /___/ o . *
Session:
# session info
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Indonesia.1252 LC_CTYPE=English_Indonesia.1252
#> [3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Indonesia.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.4.0 stringr_1.4.0 readr_1.3.1 tidyverse_1.2.1
#> [5] yardstick_0.0.4 tibble_2.1.3 rsample_0.0.5 tidyr_0.8.3
#> [9] recipes_0.1.6 purrr_0.3.2 parsnip_0.0.3.1 infer_0.4.0.1
#> [13] ggplot2_3.2.1 dplyr_0.8.3 dials_0.0.2 scales_1.0.0
#> [17] broom_0.5.2 tidymodels_0.0.2 magrittr_1.5 lubridate_1.7.4
#> [21] forecast_8.9
#>
#> loaded via a namespace (and not attached):
#> [1] readxl_1.3.1 backports_1.1.4
#> [3] tidytext_0.2.2 plyr_1.8.4
#> [5] igraph_1.2.4.1 lazyeval_0.2.2
#> [7] splines_3.5.2 crosstalk_1.0.0
#> [9] listenv_0.7.0 SnowballC_0.6.0
#> [11] rstantools_1.5.1 inline_0.3.15
#> [13] digest_0.6.20 htmltools_0.3.6
#> [15] rsconnect_0.8.15 fansi_0.4.0
#> [17] globals_0.12.4 modelr_0.1.5
#> [19] gower_0.2.1 matrixStats_0.54.0
#> [21] tidyquant_0.5.6 xts_0.11-2
#> [23] tseries_0.10-47 prettyunits_1.0.2
#> [25] colorspace_1.4-1 rvest_0.3.4
#> [27] haven_2.1.1 xfun_0.9
#> [29] jsonlite_1.6 callr_3.3.1
#> [31] crayon_1.3.4 lme4_1.1-21
#> [33] zeallot_0.1.0 survival_2.44-1.1
#> [35] zoo_1.8-6 glue_1.3.1
#> [37] gtable_0.3.0 ipred_0.9-9
#> [39] pkgbuild_1.0.3 Quandl_2.10.0
#> [41] rstan_2.19.2 timetk_0.1.1.1
#> [43] quantmod_0.4-15 miniUI_0.1.1.1
#> [45] Rcpp_1.0.2 xtable_1.8-4
#> [47] stats4_3.5.2 lava_1.6.6
#> [49] StanHeaders_2.18.1-10 prodlim_2018.04.18
#> [51] DT_0.8 httr_1.4.1
#> [53] htmlwidgets_1.3 threejs_0.3.1
#> [55] pkgconfig_2.0.2 loo_2.1.0
#> [57] nnet_7.3-12 utf8_1.1.4
#> [59] labeling_0.3 tidyselect_0.2.5
#> [61] rlang_0.4.0 reshape2_1.4.3
#> [63] later_0.8.0 cellranger_1.1.0
#> [65] munsell_0.5.0 tools_3.5.2
#> [67] cli_1.1.0 generics_0.0.2
#> [69] ggridges_0.5.1 evaluate_0.14
#> [71] yaml_2.2.0 processx_3.4.1
#> [73] knitr_1.25 future_1.14.0
#> [75] nlme_3.1-141 mime_0.7
#> [77] rstanarm_2.18.2 xml2_1.2.2
#> [79] tokenizers_0.2.1 compiler_3.5.2
#> [81] bayesplot_1.7.0 shinythemes_1.1.2
#> [83] rstudioapi_0.10 curl_4.0
#> [85] tidyposterior_0.0.2 stringi_1.4.3
#> [87] ps_1.3.0 blogdown_0.13
#> [89] lattice_0.20-38 Matrix_1.2-17
#> [91] nloptr_1.2.1 markdown_1.1
#> [93] shinyjs_1.0.1.9004 urca_1.3-0
#> [95] vctrs_0.2.0 pillar_1.4.2
#> [97] furrr_0.1.0 lmtest_0.9-37
#> [99] httpuv_1.5.2 R6_2.4.0
#> [101] bookdown_0.13 promises_1.0.1
#> [103] gridExtra_2.3 janeaustenr_0.1.5
#> [105] codetools_0.2-16 boot_1.3-23
#> [107] colourpicker_1.0 MASS_7.3-51.4
#> [109] gtools_3.8.1 assertthat_0.2.1
#> [111] withr_2.1.2 shinystan_2.5.0
#> [113] fracdiff_1.4-2 PerformanceAnalytics_1.5.3
#> [115] hms_0.5.1 parallel_3.5.2
#> [117] quadprog_1.5-7 grid_3.5.2
#> [119] rpart_4.1-15 timeDate_3043.102
#> [121] class_7.3-15 minqa_1.2.4
#> [123] rmarkdown_1.15 TTR_0.23-4
#> [125] pROC_1.15.3 tidypredict_0.4.2
#> [127] shiny_1.3.2 base64enc_0.1-3
#> [129] dygraphs_1.1.1.6