In this article, I will explain some basic functional programming for fitting multiple time series in R, particularly using the purrr interface.
TL;DR: you can find the distraction-free script here, and read some of my concluding remarks for a quick summary :grin:
Preface
When it comes to time series analysis and forecasting, R users are blessed with an invaluable tool that helps us conveniently fit univariate time series models, from basic to advanced: the forecast package. This package is authored by Prof. Rob J. Hyndman from Monash University. If you are interested in time series and forecasting, but new to the forecast package, I really recommend checking out his Forecasting: Principles and Practice book to get you started; its online version is totally free!
Back when I was working on some econometrics cases, this package saved me a lot of work. Well, who can resist using auto.arima()? (I actually agree with you on this Iffa, just too “Bagas” to admit :v:). I also found its source code very helpful when I needed to edit the algorithms to suit my needs.
Yet, this package still has some limitations when you need to handle multiple time series objects simultaneously; it doesn’t have that kind of built-in support (yet!). I was very curious back then. First, I found a solution proposed by Earo Wang. She released the hts (hierarchical time series) package, which gives a workaround for this case using an aggregation approach. But somehow I found the package less versatile than forecast for fitting complex univariate models; well, it wasn’t built for that purpose in the first place.
Then I finally found an article that inspired me to write this post: Pur(r)rify Your Carets by Rahul Sangole. He cleverly hacks his model selection routines using purrr. Basically, he explains a workflow that uses a functional programming approach to fit combinations of multiple data transformations and multiple models.
This kind of approach is actually already implemented in the sweep package for time series and forecasting. Unfortunately, seeing that the package is now “orphaned” (it doesn’t have an official maintainer), I decided not to include it in this tutorial; don’t worry, it doesn’t affect the workflow that much.
In this article, I will explain how to a-purrr-opriately fit multiple time series using functional programming. If you happen to be new to purrr, I really recommend reading this purrr-fect tutorial by Jennifer Bryan; and if you are confused by my piping flow, I suggest you read my Data Wars series first :grin:
Libraries used
For this tutorial, we will use some time series, data wrangling, and statistical modeling packages. Among them, the core packages we will use are:
- forecast: for time series modeling and forecasting
- yardstick: for measuring forecast performance
- recipes: for data preprocessing
- purrr: for functional programming
- dplyr: for general data wrangling
- lubridate: for working with dates
In addition to those, there are some packages that I recommend for an easier workflow:
- magrittr: for various pipe operators
- timetk: for creating the future time index for our forecast results
- tidyquant: for some ggplot aesthetics
Note that I don’t import timetk and tidyquant in the following chunk, since we only use one or two of their functions. Also, some packages are already included in bigger packages; e.g., dplyr and purrr are already included in tidyverse, and recipes and yardstick are already included in tidymodels, so we only need to import the bigger packages:
# import libs
library(forecast)
library(lubridate)
library(magrittr)
library(tidymodels)
library(tidyverse)
Hourly Energy Consumption dataset from PJM
For this tutorial, we will use the Hourly Energy Consumption dataset provided by PJM, which can be accessed from this Kaggle dataset, kindly shared by Rob Mulla. The data contains the hourly power consumption estimated by each electricity provider.
The dataset actually needs more cleaning before we can use it for time series analysis and forecasting. The version I use in this tutorial has already gone through some cleaning and adjustment. If you are interested in the cleaning process, or simply want to reproduce the results using the same dataset, you can check the script in this post’s folder in my blog repository (to be exact, in the file data.R inside the R folder).
Let’s start by importing the dataset:
# import dataset
pjm <- read_csv("data_input/pjm.csv")
pjm
#> # A tibble: 362,688 x 3
#> datetime provider cons
#> <dttm> <chr> <dbl>
#> 1 2013-06-01 00:00:00 AEP 13477
#> 2 2013-06-01 01:00:00 AEP 12699
#> 3 2013-06-01 02:00:00 AEP 12274
#> 4 2013-06-01 03:00:00 AEP 11904
#> 5 2013-06-01 04:00:00 AEP 11862
#> 6 2013-06-01 05:00:00 AEP 11976
#> 7 2013-06-01 06:00:00 AEP 12447
#> 8 2013-06-01 07:00:00 AEP 12713
#> 9 2013-06-01 08:00:00 AEP 13730
#> 10 2013-06-01 09:00:00 AEP 14609
#> # ... with 362,678 more rows
The data contains datetime, provider, and cons columns. The dates range from 2013-06-01 to 2018-08-02 23:00:00. Here’s a quick plot of the last four weeks of data:
# quick plot
pjm %>%
filter(datetime >= max(datetime) - hours(24 * 7 * 4)) %>%
ggplot(aes(x = datetime, y = cons)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq()
The scenario
To help us benchmark multiple models, we will need to split off some portion of our data for validation. The strategy here is to cut the last 4 weeks (approximately 1 month) as our test dataset. Then, we will cut another, bigger portion as the train dataset, say, 4 weeks times 3 (approximately 3 months). This strategy is just a simplified version of the Rolling Forecasting Origin, with only one pair of train and test samples. Of course, you can experiment with the lengths; I will definitely cover more on this in one of my future posts, stay tuned!
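For the curious, here is a minimal sketch of how a fuller rolling forecasting origin scheme could be generated with rsample (attached via tidymodels) for a single provider. This is an illustration only; we won’t use it in the rest of the post:
# a sketch only: multiple rolling train/test splits for one provider
pjm %>%
  filter(provider == "AEP") %>%
  arrange(datetime) %>%
  rsample::rolling_origin(
    initial = 24 * 7 * 4 * 3, # train size
    assess = 24 * 7 * 4,      # test size
    skip = 24 * 7 * 4,        # move the origin roughly 4 weeks per split
    cumulative = FALSE        # keep the train window length fixed
  )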
So the first step here is to get the start and end dates of the train and test samples. The most straightforward way is to define the train and test sizes, then work backwards from the last timestamp to get the start and end index of each:
# train-val-test size
test_size <- 24 * 7 * 4
train_size <- 24 * 7 * 4 * 3
# get the min-max of the time index for each sample
test_end <- max(pjm$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
To make them handier, we can combine the start and end dates of each sample into a date interval:
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
intrain
#> [1] 2018-04-13 UTC--2018-07-05 23:00:00 UTC
intest
#> [1] 2018-07-06 UTC--2018-08-02 23:00:00 UTC
I really recommend this splitting approach, since you can use the date intervals for many things while keeping the full data intact.
For example, we can use the intervals to visualize the train and test series:
# plot the train and test
pjm %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = cons, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Data preprocessing using recipes
Data preprocessing is a very crucial step in time series model fitting. In this tutorial, we will use the recipes package for it.
Since the recipes package works columnwise, we need to convert our data into a wide format first:
# convert to wide format
pjm %<>%
spread(provider, cons)
pjm
#> # A tibble: 45,336 x 9
#> datetime AEP COMED DAYTON DEOK DOM DUQ EKPC FE
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-06-01 00:00:00 13477 11378 1682 2773 10150 1739 1166 7099
#> 2 2013-06-01 01:00:00 12699 10584 1568 2613 9406 1641 1098 6617
#> 3 2013-06-01 02:00:00 12274 9973 1514 2499 8910 1569 1036 6351
#> 4 2013-06-01 03:00:00 11904 9451 1467 2409 8551 1518 1023 6201
#> 5 2013-06-01 04:00:00 11862 9184 1456 2392 8413 1503 949 6157
#> 6 2013-06-01 05:00:00 11976 9043 1485 2425 8409 1515 1017 6187
#> 7 2013-06-01 06:00:00 12447 9054 1511 2469 8582 1531 1036 6257
#> 8 2013-06-01 07:00:00 12713 9290 1596 2606 9600 1586 1096 6606
#> 9 2013-06-01 08:00:00 13730 9980 1711 2823 10816 1745 1200 7182
#> 10 2013-06-01 09:00:00 14609 10590 1823 3013 12017 1903 1285 7777
#> # ... with 45,326 more rows
Then we can define the preprocessing recipe(), and bake() our data based on it:
# recipes: square root, center, scale
rec <- recipe(~ ., filter(pjm, datetime %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
pjm <- bake(rec, pjm)
pjm
#> # A tibble: 45,336 x 9
#> datetime AEP COMED DAYTON DEOK DOM DUQ EKPC
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-06-01 00:00:00 -0.393 0.0676 -0.784 -0.510 -0.373 0.481 -0.837
#> 2 2013-06-01 01:00:00 -0.722 -0.247 -1.10 -0.741 -0.693 0.191 -1.08
#> 3 2013-06-01 02:00:00 -0.907 -0.496 -1.25 -0.911 -0.913 -0.0282 -1.31
#> 4 2013-06-01 03:00:00 -1.07 -0.716 -1.39 -1.05 -1.08 -0.186 -1.35
#> 5 2013-06-01 04:00:00 -1.09 -0.831 -1.42 -1.07 -1.14 -0.233 -1.63
#> 6 2013-06-01 05:00:00 -1.04 -0.892 -1.34 -1.02 -1.14 -0.196 -1.38
#> 7 2013-06-01 06:00:00 -0.831 -0.887 -1.26 -0.956 -1.06 -0.146 -1.31
#> 8 2013-06-01 07:00:00 -0.716 -0.785 -1.02 -0.752 -0.608 0.0240 -1.09
#> 9 2013-06-01 08:00:00 -0.288 -0.493 -0.706 -0.439 -0.0967 0.499 -0.719
#> 10 2013-06-01 09:00:00 0.0699 -0.244 -0.408 -0.175 0.381 0.949 -0.431
#> # ... with 45,326 more rows, and 1 more variable: FE <dbl>
Note: don’t forget to estimate the preprocessing using the data in the train sample only, since this mimics the actual scenario where we only have the “train” data.
If we use the recipes package, the next step is to create a function that reverts the transformed values back:
# function to revert the preprocessing back to the original scale
rec_revert <- function(vector, rec, varname) {
  # get the stored center (means) and scale (sds) values for this variable
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]
  # unscale and uncenter, then square to undo step_sqrt()
  results <- (vector * rec_scale + rec_center) ^ 2
  # round back to whole consumption values
  results <- round(results)
  # return the results
  results
}
This revert function will be very handy whenever we want to convert the data back to its original scale.
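For example, here’s a quick sanity check (pjm is still in wide format at this point) that reverting the transformed AEP series recovers the original consumption values:
# sanity check: the first values should match the original data
# (13477, 12699, 12274, ...)
head(rec_revert(pjm$AEP, rec, "AEP"))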
Now we can convert our data into the long format again:
# convert back to long format
pjm %<>%
gather(provider, cons, -datetime)
pjm
#> # A tibble: 362,688 x 3
#> datetime provider cons
#> <dttm> <chr> <dbl>
#> 1 2013-06-01 00:00:00 AEP -0.393
#> 2 2013-06-01 01:00:00 AEP -0.722
#> 3 2013-06-01 02:00:00 AEP -0.907
#> 4 2013-06-01 03:00:00 AEP -1.07
#> 5 2013-06-01 04:00:00 AEP -1.09
#> 6 2013-06-01 05:00:00 AEP -1.04
#> 7 2013-06-01 06:00:00 AEP -0.831
#> 8 2013-06-01 07:00:00 AEP -0.716
#> 9 2013-06-01 08:00:00 AEP -0.288
#> 10 2013-06-01 09:00:00 AEP 0.0699
#> # ... with 362,678 more rows
Nested model fitting and forecasting
For functional programming with purrr, we need to convert our tbl into a nested tbl. You can think of nested data as a table inside a table, controlled through a key indicator; in other words, we can have a tbl for each provider and sample. Using this format, the fitting and forecasting process becomes very versatile, yet we can still unnest the results as long as we have a proper key like provider.
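If nested data is new to you, here’s a tiny toy sketch of the idea using mtcars (not part of our workflow); the result has one row per cyl value, with each group’s remaining columns packed as a tibble in a list-column:
# toy example: nest() packs each group's rows into a list-column of tibbles
mtcars %>%
  group_by(cyl) %>%
  nest()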
Let’s start by converting our tbl into a nested tbl. First, we need to add a sample indicator so it can be recognized as a key when we nest() the data:
# adjust by sample
pjm %<>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
pjm
#> # A tibble: 21,504 x 4
#> datetime provider cons sample
#> <dttm> <chr> <dbl> <chr>
#> 1 2018-04-13 00:00:00 AEP -1.26 train
#> 2 2018-04-13 01:00:00 AEP -1.43 train
#> 3 2018-04-13 02:00:00 AEP -1.52 train
#> 4 2018-04-13 03:00:00 AEP -1.49 train
#> 5 2018-04-13 04:00:00 AEP -1.39 train
#> 6 2018-04-13 05:00:00 AEP -1.07 train
#> 7 2018-04-13 06:00:00 AEP -0.548 train
#> 8 2018-04-13 07:00:00 AEP -0.308 train
#> 9 2018-04-13 08:00:00 AEP -0.228 train
#> 10 2018-04-13 09:00:00 AEP -0.209 train
#> # ... with 21,494 more rows
Then, we can nest() the data by provider and sample, and spread() the tbl based on the sample key:
# nest the train data
pjm %<>%
group_by(provider, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
pjm
#> # A tibble: 8 x 3
#> provider test train
#> <chr> <list> <list>
#> 1 AEP <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 2 COMED <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 3 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 4 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 5 DOM <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 6 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 7 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]>
#> 8 FE <tibble [672 x 2]> <tibble [2,016 x 2]>
Preparing the data model list
For the data and model combinations, we can start by defining some options for the data representation. Recall that our series have a relatively high frequency, so we will consider two options here: a ts object with daily seasonality, and an msts object with daily and weekly seasonality.
To incorporate them into our nested data, we need to create another nested data frame containing the data representation names, and the accompanying functions that convert a series into the specified representation.
Let’s start with making a named list containing the transformation functions:
# data funs list
data_funs <- list(
ts = function(x) ts(x$cons, frequency = 24),
msts = function(x) msts(x$cons, seasonal.periods = c(24, 24 * 7))
)
data_funs
#> $ts
#> function (x)
#> ts(x$cons, frequency = 24)
#>
#> $msts
#> function (x)
#> msts(x$cons, seasonal.periods = c(24, 24 * 7))
Then we can convert the list into a tbl using enframe(). Note that we should also add a key (the provider in our case) so we can left_join() it later. The trick here is to use the rep() function:
# convert to nested
data_funs %<>%
rep(length(unique(pjm$provider))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(provider =
sort(rep(unique(pjm$provider), length(unique(.$data_fun_name))))
)
data_funs
#> # A tibble: 16 x 3
#> data_fun_name data_fun provider
#> <chr> <list> <chr>
#> 1 ts <fn> AEP
#> 2 msts <fn> AEP
#> 3 ts <fn> COMED
#> 4 msts <fn> COMED
#> 5 ts <fn> DAYTON
#> 6 msts <fn> DAYTON
#> 7 ts <fn> DEOK
#> 8 msts <fn> DEOK
#> 9 ts <fn> DOM
#> 10 msts <fn> DOM
#> 11 ts <fn> DUQ
#> 12 msts <fn> DUQ
#> 13 ts <fn> EKPC
#> 14 msts <fn> EKPC
#> 15 ts <fn> FE
#> 16 msts <fn> FE
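Since we will repeat this exact rep()-and-enframe() trick for the model list later, you could optionally wrap it in a small helper; here’s a sketch (a hypothetical helper, not used in the rest of this post):
# hypothetical helper: repeat a named list of functions for every provider
cross_with_providers <- function(fun_list, providers, name, value) {
  fun_list %>%
    rep(length(providers)) %>%
    enframe(name, value) %>%
    mutate(provider = sort(rep(providers, length(fun_list))))
}
# data_funs could then be built with:
# cross_with_providers(data_funs, unique(pjm$provider), "data_fun_name", "data_fun")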
Then the last step here is to join the nested functions with our nested data:
# combine with models
pjm %<>%
left_join(data_funs)
pjm
#> # A tibble: 16 x 5
#> provider test train data_fun_name data_fun
#> <chr> <list> <list> <chr> <list>
#> 1 AEP <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 2 AEP <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 3 COMED <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 4 COMED <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 5 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 6 DAYTON <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 7 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 8 DEOK <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 9 DOM <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 10 DOM <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 11 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 12 DUQ <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 13 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 14 EKPC <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
#> 15 FE <tibble [672 x 2]> <tibble [2,016 x 2]> ts <fn>
#> 16 FE <tibble [672 x 2]> <tibble [2,016 x 2]> msts <fn>
Preparing the time series model list
Similar to how we created the data representation list, we can also store some time series models as a nested list.
Again, let’s start by making a list of models. For this tutorial, let’s use auto.arima(), ets(), stlm(), and tbats(). We need to make some functions that call those models, and store them inside a list:
# models list
models <- list(
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
models
#> $auto.arima
#> function (x)
#> auto.arima(x)
#>
#> $ets
#> function (x)
#> ets(x)
#>
#> $stlm
#> function (x)
#> stlm(x)
#>
#> $tbats
#> function (x)
#> tbats(x, use.box.cox = FALSE)
Then we can convert it into a nested format, as in the previous example:
# convert to nested
models %<>%
rep(length(unique(pjm$provider))) %>%
enframe("model_name", "model") %>%
mutate(provider =
sort(rep(unique(pjm$provider), length(unique(.$model_name))))
)
models
#> # A tibble: 32 x 3
#> model_name model provider
#> <chr> <list> <chr>
#> 1 auto.arima <fn> AEP
#> 2 ets <fn> AEP
#> 3 stlm <fn> AEP
#> 4 tbats <fn> AEP
#> 5 auto.arima <fn> COMED
#> 6 ets <fn> COMED
#> 7 stlm <fn> COMED
#> 8 tbats <fn> COMED
#> 9 auto.arima <fn> DAYTON
#> 10 ets <fn> DAYTON
#> # ... with 22 more rows
And finally, we can join the result into our nested data. Note that we can also apply some rules here. For example, since ets() and auto.arima() are not suitable for multiple seasonality time series, we can filter() them out for data with the msts class:
# combine with models
pjm %<>%
left_join(models) %>%
filter(
!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts")
)
Here’s our data with the full combination:
pjm
#> # A tibble: 48 x 7
#> provider test train data_fun_name data_fun model_name model
#> <chr> <list> <list> <chr> <list> <chr> <lis>
#> 1 AEP <tibble [6~ <tibble [2~ ts <fn> auto.arima <fn>
#> 2 AEP <tibble [6~ <tibble [2~ ts <fn> ets <fn>
#> 3 AEP <tibble [6~ <tibble [2~ ts <fn> stlm <fn>
#> 4 AEP <tibble [6~ <tibble [2~ ts <fn> tbats <fn>
#> 5 AEP <tibble [6~ <tibble [2~ msts <fn> stlm <fn>
#> 6 AEP <tibble [6~ <tibble [2~ msts <fn> tbats <fn>
#> 7 COMED <tibble [6~ <tibble [2~ ts <fn> auto.arima <fn>
#> 8 COMED <tibble [6~ <tibble [2~ ts <fn> ets <fn>
#> 9 COMED <tibble [6~ <tibble [2~ ts <fn> stlm <fn>
#> 10 COMED <tibble [6~ <tibble [2~ ts <fn> tbats <fn>
#> # ... with 38 more rows
Execute the nested fitting
To execute the model fitting, we need to wrap the needed arguments into a list using the map() function, then call each function on its arguments using invoke_map(). We need to do this twice: first for the data transformation, using the functions inside data_fun; then for fitting the models, using the functions inside model.
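If invoke_map() is new to you, here’s a toy sketch (with made-up functions, not part of our workflow) of how it pairs a list of functions with a list of argument lists:
# toy example: invoke_map() calls each function with its matching arguments
fns <- list(add_one = function(x) x + 1, double = function(x) x * 2)
args <- list(list(x = 1), list(x = 10))
invoke_map(fns, args)
#> $add_one
#> [1] 2
#>
#> $double
#> [1] 20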
See the code in the chunk below for the implementation of the process:
# invoke nested fitting
pjm %<>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
pjm
#> # A tibble: 48 x 8
#> provider test train data_fun_name data_fun model_name model fitted
#> <chr> <list> <list> <chr> <list> <chr> <lis> <list>
#> 1 AEP <tibbl~ <tibble~ ts <fn> auto.arima <fn> <fr_A~
#> 2 AEP <tibbl~ <tibble~ ts <fn> ets <fn> <ets>
#> 3 AEP <tibbl~ <tibble~ ts <fn> stlm <fn> <stlm>
#> 4 AEP <tibbl~ <tibble~ ts <fn> tbats <fn> <tbat~
#> 5 AEP <tibbl~ <tibble~ msts <fn> stlm <fn> <stlm>
#> 6 AEP <tibbl~ <tibble~ msts <fn> tbats <fn> <tbat~
#> 7 COMED <tibbl~ <tibble~ ts <fn> auto.arima <fn> <fr_A~
#> 8 COMED <tibbl~ <tibble~ ts <fn> ets <fn> <ets>
#> 9 COMED <tibbl~ <tibble~ ts <fn> stlm <fn> <stlm>
#> 10 COMED <tibbl~ <tibble~ ts <fn> tbats <fn> <tbat~
#> # ... with 38 more rows
Now the next step is to measure the test error. First, we forecast() over the test horizon, then pipe the result into an error measurement using one of the functions provided by yardstick. For this example, let’s use the Root Mean Squared Error as the error measurement (rmse_vec() is the variant for simple vector calculations):
# calculate test errors
pjm %<>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2_dbl(test, ~ rmse_vec(truth = .y$cons, estimate = .x$mean))
) %>%
arrange(provider, error)
pjm %>%
select(provider, ends_with("_name"), error)
#> # A tibble: 48 x 4
#> provider data_fun_name model_name error
#> <chr> <chr> <chr> <dbl>
#> 1 AEP msts stlm 0.733
#> 2 AEP msts tbats 0.745
#> 3 AEP ts stlm 0.761
#> 4 AEP ts auto.arima 0.820
#> 5 AEP ts tbats 0.951
#> 6 AEP ts ets 4.28
#> 7 COMED msts tbats 0.805
#> 8 COMED msts stlm 0.813
#> 9 COMED ts stlm 0.837
#> 10 COMED ts tbats 0.935
#> # ... with 38 more rows
Unnesting the result
Besides measuring the error, we can also compare the forecast results with the real test series through graphical analysis. To do that, we need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to create a key that is unique for each combination of data representation and model, plus one for the actual series. Once we get to that format, we can conveniently unnest() the data into a proper long format.
Let’s start by creating the forecasts:
pjm_test <- pjm %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
cons = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
pjm_test
#> # A tibble: 48 x 11
#> provider test train data_fun_name data_fun model_name model fitted
#> <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
#> 1 AEP <tib~ <tib~ msts <fn> stlm <fn> <stlm>
#> 2 AEP <tib~ <tib~ msts <fn> tbats <fn> <tbat~
#> 3 AEP <tib~ <tib~ ts <fn> stlm <fn> <stlm>
#> 4 AEP <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~
#> 5 AEP <tib~ <tib~ ts <fn> tbats <fn> <tbat~
#> 6 AEP <tib~ <tib~ ts <fn> ets <fn> <ets>
#> 7 COMED <tib~ <tib~ msts <fn> tbats <fn> <tbat~
#> 8 COMED <tib~ <tib~ msts <fn> stlm <fn> <stlm>
#> 9 COMED <tib~ <tib~ ts <fn> stlm <fn> <stlm>
#> 10 COMED <tib~ <tib~ ts <fn> tbats <fn> <tbat~
#> # ... with 38 more rows, and 3 more variables: error <dbl>,
#> # forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
pjm_test %<>%
select(provider, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -provider)
pjm_test
#> # A tibble: 56 x 3
#> provider key value
#> <chr> <chr> <list>
#> 1 AEP actual <tibble [672 x 2]>
#> 2 COMED actual <tibble [672 x 2]>
#> 3 DAYTON actual <tibble [672 x 2]>
#> 4 DEOK actual <tibble [672 x 2]>
#> 5 DOM actual <tibble [672 x 2]>
#> 6 DUQ actual <tibble [672 x 2]>
#> 7 EKPC actual <tibble [672 x 2]>
#> 8 FE actual <tibble [672 x 2]>
#> 9 AEP msts-stlm <tibble [672 x 2]>
#> 10 COMED msts-stlm <tibble [672 x 2]>
#> # ... with 46 more rows
Last but not least, unnest() the series data and apply the revert function:
pjm_test %<>%
unnest(value) %>%
mutate(cons = rec_revert(cons, rec, provider))
pjm_test
#> # A tibble: 37,632 x 4
#> provider key datetime cons
#> <chr> <chr> <dttm> <dbl>
#> 1 AEP actual 2018-07-06 00:00:00 15139
#> 2 AEP actual 2018-07-06 01:00:00 14325
#> 3 AEP actual 2018-07-06 02:00:00 13685
#> 4 AEP actual 2018-07-06 03:00:00 13350
#> 5 AEP actual 2018-07-06 04:00:00 13357
#> 6 AEP actual 2018-07-06 05:00:00 13671
#> 7 AEP actual 2018-07-06 06:00:00 14124
#> 8 AEP actual 2018-07-06 07:00:00 14646
#> 9 AEP actual 2018-07-06 08:00:00 15294
#> 10 AEP actual 2018-07-06 09:00:00 15804
#> # ... with 37,622 more rows
With the resulting tbl, we can compare the forecast and actual data on the test sample like this:
# plot forecast on test
pjm_test %>%
ggplot(aes(x = datetime, y = cons, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Automate the model selection
As you can see from the resulting plot, it is hard to decide which model to use based on graphical comparison alone. The most straightforward solution, then, is to use the model with the lowest error.
The model selection itself is very simple: we only need some basic dplyr verbs to filter() the model with the lowest error for each provider:
# filter by lowest test error
pjm %<>%
select(-fitted) %>% # remove unused
group_by(provider) %>%
filter(error == min(error)) %>%
ungroup()
pjm
#> # A tibble: 8 x 8
#> provider test train data_fun_name data_fun model_name model error
#> <chr> <list> <list> <chr> <list> <chr> <lis> <dbl>
#> 1 AEP <tibble ~ <tibble~ msts <fn> stlm <fn> 0.733
#> 2 COMED <tibble ~ <tibble~ msts <fn> tbats <fn> 0.805
#> 3 DAYTON <tibble ~ <tibble~ msts <fn> stlm <fn> 0.714
#> 4 DEOK <tibble ~ <tibble~ msts <fn> tbats <fn> 0.687
#> 5 DOM <tibble ~ <tibble~ msts <fn> tbats <fn> 0.865
#> 6 DUQ <tibble ~ <tibble~ ts <fn> stlm <fn> 0.619
#> 7 EKPC <tibble ~ <tibble~ ts <fn> auto.arima <fn> 0.721
#> 8 FE <tibble ~ <tibble~ msts <fn> stlm <fn> 0.713
Perform the final forecast
After we have the final model for each provider, we can finally proceed to the final forecast. We follow the same process as in the model fitting, but this time we use the combined train and test data as our new “full data”.
Now let’s start by recombining the train and test datasets:
# recombine samples
pjm %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(provider, fulldata, everything(), -train, -test)
pjm
#> # A tibble: 8 x 7
#> provider fulldata data_fun_name data_fun model_name model error
#> <chr> <list> <chr> <list> <chr> <lis> <dbl>
#> 1 AEP <tibble [2,688 x ~ msts <fn> stlm <fn> 0.733
#> 2 COMED <tibble [2,688 x ~ msts <fn> tbats <fn> 0.805
#> 3 DAYTON <tibble [2,688 x ~ msts <fn> stlm <fn> 0.714
#> 4 DEOK <tibble [2,688 x ~ msts <fn> tbats <fn> 0.687
#> 5 DOM <tibble [2,688 x ~ msts <fn> tbats <fn> 0.865
#> 6 DUQ <tibble [2,688 x ~ ts <fn> stlm <fn> 0.619
#> 7 EKPC <tibble [2,688 x ~ ts <fn> auto.arima <fn> 0.721
#> 8 FE <tibble [2,688 x ~ msts <fn> stlm <fn> 0.713
Then do the same nested fitting as in the previous example:
# invoke nested fitting for full data
pjm %<>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
Next, let’s make a tbl containing each of our forecast results, and convert our nested data into a proper long format:
# get forecast
pjm %<>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
cons = as.vector(.x$mean)
))
)
pjm
#> # A tibble: 8 x 9
#> provider fulldata data_fun_name data_fun model_name model error fitted
#> <chr> <list> <chr> <list> <chr> <lis> <dbl> <list>
#> 1 AEP <tibble~ msts <fn> stlm <fn> 0.733 <stlm>
#> 2 COMED <tibble~ msts <fn> tbats <fn> 0.805 <tbat~
#> 3 DAYTON <tibble~ msts <fn> stlm <fn> 0.714 <stlm>
#> 4 DEOK <tibble~ msts <fn> tbats <fn> 0.687 <tbat~
#> 5 DOM <tibble~ msts <fn> tbats <fn> 0.865 <tbat~
#> 6 DUQ <tibble~ ts <fn> stlm <fn> 0.619 <stlm>
#> 7 EKPC <tibble~ ts <fn> auto.arima <fn> 0.721 <fr_A~
#> 8 FE <tibble~ msts <fn> stlm <fn> 0.713 <stlm>
#> # ... with 1 more variable: forecast <list>
Finally, we can unnest() the data to get the result:
# unnest actual and forecast
pjm %<>%
select(provider, actual = fulldata, forecast) %>%
gather(key, value, -provider) %>%
unnest(value) %>%
mutate(cons = rec_revert(cons, rec, provider))
pjm
#> # A tibble: 26,880 x 4
#> provider key datetime cons
#> <chr> <chr> <dttm> <dbl>
#> 1 AEP actual 2018-04-13 00:00:00 11488
#> 2 AEP actual 2018-04-13 01:00:00 11100
#> 3 AEP actual 2018-04-13 02:00:00 10915
#> 4 AEP actual 2018-04-13 03:00:00 10972
#> 5 AEP actual 2018-04-13 04:00:00 11204
#> 6 AEP actual 2018-04-13 05:00:00 11895
#> 7 AEP actual 2018-04-13 06:00:00 13107
#> 8 AEP actual 2018-04-13 07:00:00 13681
#> 9 AEP actual 2018-04-13 08:00:00 13874
#> 10 AEP actual 2018-04-13 09:00:00 13921
#> # ... with 26,870 more rows
There you go: the forecast results for each provider, in a proper long format :grimacing:
You can proceed to plot the forecast:
pjm %>%
ggplot(aes(x = datetime, y = cons, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ provider, scales = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
Concluding remarks
Fitting multiple time series models using purrr is admittedly a little bit complicated but, on the other hand, very flexible. It can also easily be extended to more complex scenarios, such as incorporating multiple preprocessing recipes as another combination dimension, or applying more complex rules when automating the fitting. That said, I do understand that for someone new to purrr, the workflow could be very hard to follow; to be honest, I spent around 80% of the time writing this post thinking about the easiest way to explain the process narratively. But I really hope that the code above is enough to explain the basic workflow for functional programming in time series :smile:
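As a teaser, here’s a rough sketch (hypothetical code, following the same rep()-and-enframe() pattern we used for data_funs and models) of how multiple preprocessing recipes could become another combination dimension:
# hypothetical sketch: preprocessing recipes as another nested combination
rec_funs <- list(
  sqrt = function(x) recipe(~ ., x) %>% step_sqrt(all_numeric()) %>% prep(),
  boxcox = function(x) recipe(~ ., x) %>% step_BoxCox(all_numeric()) %>% prep()
)
# then rep() over the providers, enframe(), and left_join() into the
# nested data, exactly as we did before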
If you want to try it out yourself, you can check out this post’s folder in my blog repository, and use the distraction-free version of the code above in the fit.R script, which is located inside the R folder. I really recommend using your own dataset, so you can understand the workflow better! :grin:
If you find any difficulties when trying this example, please let me know in the comments so I can help you :ok_hand:
Ok, as always, here’s a tidyverse logo for you:
#> * __ _ __ . o * .
#> / /_(_)__/ /_ ___ _____ _______ ___
#> / __/ / _ / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
#> * . /___/ o . *
Session:
# session info
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Indonesia.1252 LC_CTYPE=English_Indonesia.1252
#> [3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Indonesia.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.4.0 stringr_1.4.0 readr_1.3.1 tidyverse_1.2.1
#> [5] yardstick_0.0.4 tibble_2.1.3 rsample_0.0.5 tidyr_0.8.3
#> [9] recipes_0.1.6 purrr_0.3.2 parsnip_0.0.3.1 infer_0.4.0.1
#> [13] ggplot2_3.2.1 dplyr_0.8.3 dials_0.0.2 scales_1.0.0
#> [17] broom_0.5.2 tidymodels_0.0.2 magrittr_1.5 lubridate_1.7.4
#> [21] forecast_8.9
#>
#> loaded via a namespace (and not attached):
#> [1] readxl_1.3.1 backports_1.1.4
#> [3] tidytext_0.2.2 plyr_1.8.4
#> [5] igraph_1.2.4.1 lazyeval_0.2.2
#> [7] splines_3.5.2 crosstalk_1.0.0
#> [9] listenv_0.7.0 SnowballC_0.6.0
#> [11] rstantools_1.5.1 inline_0.3.15
#> [13] digest_0.6.20 htmltools_0.3.6
#> [15] rsconnect_0.8.15 fansi_0.4.0
#> [17] globals_0.12.4 modelr_0.1.5
#> [19] gower_0.2.1 matrixStats_0.54.0
#> [21] tidyquant_0.5.6 xts_0.11-2
#> [23] tseries_0.10-47 prettyunits_1.0.2
#> [25] colorspace_1.4-1 rvest_0.3.4
#> [27] haven_2.1.1 xfun_0.9
#> [29] jsonlite_1.6 callr_3.3.1
#> [31] crayon_1.3.4 lme4_1.1-21
#> [33] zeallot_0.1.0 survival_2.44-1.1
#> [35] zoo_1.8-6 glue_1.3.1
#> [37] gtable_0.3.0 ipred_0.9-9
#> [39] pkgbuild_1.0.3 Quandl_2.10.0
#> [41] rstan_2.19.2 timetk_0.1.1.1
#> [43] quantmod_0.4-15 miniUI_0.1.1.1
#> [45] Rcpp_1.0.2 xtable_1.8-4
#> [47] stats4_3.5.2 lava_1.6.6
#> [49] StanHeaders_2.18.1-10 prodlim_2018.04.18
#> [51] DT_0.8 httr_1.4.1
#> [53] htmlwidgets_1.3 threejs_0.3.1
#> [55] pkgconfig_2.0.2 loo_2.1.0
#> [57] nnet_7.3-12 utf8_1.1.4
#> [59] labeling_0.3 tidyselect_0.2.5
#> [61] rlang_0.4.0 reshape2_1.4.3
#> [63] later_0.8.0 cellranger_1.1.0
#> [65] munsell_0.5.0 tools_3.5.2
#> [67] cli_1.1.0 generics_0.0.2
#> [69] ggridges_0.5.1 evaluate_0.14
#> [71] yaml_2.2.0 processx_3.4.1
#> [73] knitr_1.25 future_1.14.0
#> [75] nlme_3.1-141 mime_0.7
#> [77] rstanarm_2.18.2 xml2_1.2.2
#> [79] tokenizers_0.2.1 compiler_3.5.2
#> [81] bayesplot_1.7.0 shinythemes_1.1.2
#> [83] rstudioapi_0.10 curl_4.0
#> [85] tidyposterior_0.0.2 stringi_1.4.3
#> [87] ps_1.3.0 blogdown_0.13
#> [89] lattice_0.20-38 Matrix_1.2-17
#> [91] nloptr_1.2.1 markdown_1.1
#> [93] shinyjs_1.0.1.9004 urca_1.3-0
#> [95] vctrs_0.2.0 pillar_1.4.2
#> [97] furrr_0.1.0 lmtest_0.9-37
#> [99] httpuv_1.5.2 R6_2.4.0
#> [101] bookdown_0.13 promises_1.0.1
#> [103] gridExtra_2.3 janeaustenr_0.1.5
#> [105] codetools_0.2-16 boot_1.3-23
#> [107] colourpicker_1.0 MASS_7.3-51.4
#> [109] gtools_3.8.1 assertthat_0.2.1
#> [111] withr_2.1.2 shinystan_2.5.0
#> [113] fracdiff_1.4-2 PerformanceAnalytics_1.5.3
#> [115] hms_0.5.1 parallel_3.5.2
#> [117] quadprog_1.5-7 grid_3.5.2
#> [119] rpart_4.1-15 timeDate_3043.102
#> [121] class_7.3-15 minqa_1.2.4
#> [123] rmarkdown_1.15 TTR_0.23-4
#> [125] pROC_1.15.3 tidypredict_0.4.2
#> [127] shiny_1.3.2 base64enc_0.1-3
#> [129] dygraphs_1.1.1.6