Generic time-series backtesting/cross-validation with R

I want to do some time-series evaluation in R. The usual process is to define a time lag and the evaluation frequency/periods, and then, for each evaluation period, train a model with the defined time lag and compute metrics for that period.
For example, we have:
Evaluation period size and interval n
Evaluation start at b
Time lag l
We train a model on points 1:(b-l) and evaluate it on b:(b+n). Then we train a model on points 1:(b+n-l) and evaluate it on (b+n):(b+2n), and so on, for k periods. It could vary a bit, but that's the general spirit. So this is basically a sliding window for the evaluation data, but an expanding window for the training data.
This is illustrated in the answer to this question (the expanding window solution).
How could I do this, preferably without loops and using the tidyverse and/or packages specific for time-series analysis?

So this is how I'm doing it at the moment, but I'm really not satisfied with it: too much custom code, and it's not very modular.
time_series_cv <- function(dates_lim, df) {
  # evaluation window: [date_beg, date_end)
  eval_data <- df %>%
    filter(
      date >= dates_lim[['date_beg']],
      date < dates_lim[['date_end']]
    )
  # train on everything strictly before the evaluation window
  eval_data$prediction <- predict(
    lm(
      log(y) ~ .,
      data = df %>% filter(date < dates_lim[['date_beg']]) %>% select(-date)
    ),
    eval_data
  )
  eval_data %>%
    select(date, y, prediction)
}

predictions <-
  lapply(dates, time_series_cv, df = df) %>%
  bind_rows()
Here, dates is a list of named lists with the start and end of each evaluation period, and the lag is 1 sample.
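One option: rsample (part of tidymodels) expresses exactly this expanding-window scheme with rolling_origin(). A minimal sketch, assuming df is sorted by date with one row per period and that b, n, and l are the question's parameters; the window arithmetic is approximate, not a definitive implementation:

library(rsample)
library(purrr)
library(dplyr)

# Expanding training window, fixed evaluation window of size n.
# initial = b - 1: rows before the first evaluation period;
# skip = n - 1: advance the origin by n rows between resamples.
folds <- rolling_origin(df, initial = b - 1, assess = n,
                        cumulative = TRUE, skip = n - 1)

predictions <- map_dfr(folds$splits, function(split) {
  train <- head(analysis(split), -l)  # drop the last l rows: the time lag
  eval_data <- assessment(split)
  fit <- lm(log(y) ~ ., data = select(train, -date))
  eval_data %>%
    select(date, y) %>%
    mutate(prediction = predict(fit, eval_data))
})

The resampling scheme (window sizes, lag, spacing) then lives in a single rolling_origin() call rather than in hand-built date lists.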


Print iterations when loop is inside a function R

I am trying to get a loop to work inside a function.
The main idea is that you have a VAR model, the number of periods to forecast, the training set, and the real data for the forecasted period.
So I wrote this function, which reorganizes the data and computes the MSFE. The loop works fine on its own, but as soon as it is inside the function, it won't print the results to the console.
Here I asked about the loop, so I know it is working fine.
Since the desired output is text, I don't want to store the sentences, just print them to the console, but I don't know how to do that.
Thanks in advance!
This is the function:
library(forecast)   # forecast()
library(dplyr)      # %>%, select(), mutate(), arrange()
library(tidyr)      # spread()
library(lubridate)  # dmy()

MSFE <- function(var, periods, train, test) {
  # For each variable compute the MSFE
  ## Recall that MSFE is mean((real - predicted)^2)
  # var     is the model
  # periods is the forecasted period
  # train   is the training data (1971m1-2003m12)
  # test    is the real data (2004m1-2004m12)
  fcast_data <- forecast(var, h = periods)
  test <- head(test, periods)
  fcast_data <- fcast_data %>%
    as.data.frame() %>%
    select(Time, Series, `Point Forecast`) %>%
    spread(key = "Series", value = `Point Forecast`) %>%
    mutate(
      Time = toupper(Time),
      Time = gsub(" ", "", Time),
      Time = paste0("01", Time),
      Time = dmy(Time)
    ) %>%
    arrange(Time) %>%
    select(-Time)
  for (i in names(train)) {
    msfe <- paste("The Mean Squared Forecast Error for", i, "is:",
                  mean((test[, i] - fcast_data[, i])^2))
    print(msfe)
  }
}
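As an aside, since the desired output is plain console text, the loop body could use cat() instead of print(); cat() writes directly to the console without print()'s [1] "..." quoting. A minimal sketch of just that loop, using the same variables as inside the function above:

# cat() writes straight to the console, without print()'s quoting
for (i in names(train)) {
  cat("The Mean Squared Forecast Error for", i, "is:",
      mean((test[, i] - fcast_data[, i])^2), "\n")
}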

Hierarchical forecast with xreg late in series using fable

I am using the great fable package and am trying to create a hierarchical forecast using ARIMA and ETS models, reconciling with top-down, middle-out, bottom-up, and min trace to compare and see which approach is best. My series has some effects late in the series that need to be regressed away, so I am trying to create a binary regressor to deal with that. I have read link1 and link2 about using the new_data argument to add a regressor to a hierarchical forecast, instead of the xreg argument which I've used for non-hierarchical forecasts. I've had success with this approach by splitting the data into train and test sets and passing the test set to new_data, as Rob Hyndman describes in link1. The problem I am having with this current task is that the effects that need to be modeled away are all late in the series, so they all fall in the test set.
First here is my reproducible example data:
library(tidyverse)
library(forecast)
library(fable)
library(tsibble)
library(tsibbledata)
library(lubridate)

data <- aus_livestock %>%
  filter(State %in% c("Tasmania", "New South Wales", "Queensland"),
         as.Date(Month) > as.Date("2000-01-01")) %>%
  aggregate_key(State, Count = sum(Count)) %>%
  mutate(xreg = as.factor(if_else(as.Date(Month) > as.Date("2018-01-01") &
                                    as.Date(Month) < as.Date("2018-10-01"),
                                  1, 0)))
I have had success in the past doing something like this:
train <- data %>%
  filter(as.Date(Month) < as.Date("2017-10-01"))
test <- data %>%
  filter(as.Date(Month) >= as.Date("2017-10-01"))

mod_data <- train %>%
  model(ets = ETS(Count),
        arima = ARIMA(Count ~ xreg)) %>%
  reconcile(bu_ets = bottom_up(ets),
            td_ets = top_down(ets),
            mint_ets = min_trace(ets),
            bu_arima = bottom_up(arima),
            td_arima = top_down(arima),
            mint_arima = min_trace(arima))

forc_data <- mod_data %>%
  forecast(new_data = test)

autoplot(forc_data, data, level = NULL)
But since in this case the regressor is all zeros in the train set, this expectedly produces the error "Provided exogenous regressors are rank deficient, removing regressors: xreg1". I think what I need to do is feed all the data I have to the model rather than splitting it into train and test sets, but I am unsure how to forecast that model using fable when there is no data left for the new_data argument. The closest I've gotten is something like this:
dates <- sort(rep(seq(as.Date("2019-01-01"), as.Date("2020-12-01"), "months"), 3))

future_data <- tibble(
  Month = dates,
  State = rep(c("Tasmania", "New South Wales", "Queensland"), 24),
  Count = 0
) %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month, key = State) %>%
  aggregate_key(State, Count = sum(Count)) %>%
  mutate(xreg = factor(0, levels = c(0, 1))) %>%
  select(-Count)

mod_data <- data %>%
  model(ets = ETS(Count),
        arima = ARIMA(Count ~ xreg)) %>%
  reconcile(bu_ets = bottom_up(ets),
            td_ets = top_down(ets),
            mint_ets = min_trace(ets),
            bu_arima = bottom_up(arima),
            td_arima = top_down(arima),
            mint_arima = min_trace(arima))

forc_data <- mod_data %>%
  forecast(new_data = future_data)

autoplot(forc_data, data, level = NULL)
Oddly, this code causes my RStudio to crash when I try to run the forecast piece, with "R session aborted: R has encountered a fatal error". I think this may be unrelated to the code, because I actually got this to work on my real data, but the forecasts don't look like I would expect.
So, in summary, I would like to know how I can use fable to forecast a hierarchical series with an exogenous regressor when all the regression effects happen in the test-set period.
Thanks in advance for any help I can get!
I think it's not possible to do it only in the test set, because then the model has nothing to learn from in the train set. That is, you can only include an exogenous variable in the training process if it is present in both the train and test sets.
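Following that logic, the asker's second approach (fit on the full series so the dummy varies within the data the model sees, then supply future regressor values via new_data) seems like the right shape. A sketch of building the future data with tsibble's new_data() helper instead of assembling it by hand; this is an assumption on my part and untested on aggregated keys:

# Build 24 future months for every key, then attach the assumed
# future value of the regressor (all zeros here)
future_data <- new_data(data, 24) %>%
  mutate(xreg = factor(0, levels = c(0, 1)))

mod_data %>%
  forecast(new_data = future_data) %>%
  autoplot(data, level = NULL)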

R Tidymodels: What objects to save for use in production after fitting a recipe-based workflow utilizing pre-processing?

After designing a tidymodels recipe-based workflow, which is tuned and then fitted to some training data, I'm not clear on which objects (the fitted "workflow", the "recipe", etc.) should be saved to disk for use in predicting new data in production. I understand I can use saveRDS()/readRDS(), write_rds()/read_rds(), or other options to actually do the saving/loading of these objects, but which ones?
In a clean R environment, I will have incoming new raw data that needs to be pre-processed using the same "recipe" I used in training the model. I then want to make predictions on that data after it has been pre-processed. If I intend to use the prep() and bake() functions to pre-process the new data as I did the training data, then it seems I minimally need the recipe and the original training data to get prep() to work. Plus, I also need the fitted model/workflow to make predictions. So three objects, it seems. If I save the workflow object to disk in SESSION 1, then in SESSION 2 I can extract the recipe and model from it with pull_workflow_prepped_recipe() and pull_workflow_fit() respectively. But prep() seems to require the original training data, which I can keep in the workflow with an earlier use of retain = TRUE... but then that gets stripped out of the workflow after a call to fit(). Hear my cries for help! :)
So, imagine two different R sessions, where in the first session I am doing all the training and model building, and the second session is some running production app that uses what was learned in the first session. I need help at the arrows at the bottom of SESSION 1, and in multiple places in SESSION 2. I used the tidymodels Get Started guide as the base for this example.
SESSION 1
library(tidymodels)
library(nycflights13)
library(readr)

set.seed(123)
flight_data <-
  head(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)

set.seed(555)
data_split <- initial_split(flight_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)

flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())

lr_mod <-
  logistic_reg() %>%
  set_engine("glm")

flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)

flights_fit <-
  flights_wflow %>%
  fit(data = train_data)

predict(flights_fit, test_data)

### SAVE ONE OR MORE OBJECTS HERE FOR NEXT SESSION <------------
# What to save? workflow (pre or post fit()?), recipe, training data...etc.
write_rds(flights_wflow, "flights_wflow.rds") # Not fitted workflow
write_rds(flights_fit, "flights_fit.rds")     # Fitted workflow
SESSION 2
### READ ONE OR MORE OBJECTS HERE FROM PRIOR SESSION <------------
flights_wflow <- read_rds("flights_wflow.rds")
flights_fit <- read_rds("flights_fit.rds")

# Acquire new data, do some basic transforms as before
new_flight_data <-
  tail(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)

# Something here to preprocess the data with recipe as in SESSION 1 <----------
# new_flight_data_prep <- prep(??)
# new_flight_data_preprocessed <- bake(??)

# Predict new data
predict(flights_fit, new_data = new_flight_data_preprocessed)
You have some flexibility in how you approach this, depending on your constraints, but generally I would recommend saving/serializing the fitted workflow, perhaps after using butcher to reduce its size. You can see an example model fitting script in this repo that shows at the end how I save the fitted workflow.
When you go to predict with this workflow, there are some things to keep in mind. I have an example Plumber API in the same repo that demonstrates what is needed to predict with that particular workflow. Notice how the packages needed for prediction are loaded/attached for this API. I didn't use all of tidymodels, but only the specific packages I need, for better performance and a smaller container.
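For illustration, a minimal sketch of that approach using the flights example above. The key point is that predict() on a fitted workflow applies the preprocessing recipe automatically, so no separate prep()/bake() step is needed in SESSION 2; butcher() trims components of the fitted workflow that aren't needed at prediction time:

library(butcher)

# SESSION 1: trim and serialize the fitted workflow
write_rds(butcher(flights_fit), "flights_fit.rds")

# SESSION 2: load and predict; the workflow bakes the recipe itself
flights_fit <- read_rds("flights_fit.rds")
predict(flights_fit, new_data = new_flight_data)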
Saving the fitted workflow did not work for me. When trying to predict with new data, it asks for the target variable (a churn model):
predict(churn_model, the_data)
Error: Problem with `mutate()` column `churn`.
i `churn = dplyr::if_else(churn == 1, "yes", "no")`.
x object 'churn' not found
I still don't get why it is asking for a column that should not be present in the data, as it is the variable I am trying to predict...
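A likely cause, assuming the churn recipe (not shown here) contains a step that transforms the outcome: by default, recipe steps run again when new data is baked at predict() time, when the outcome column is absent. The skip argument of recipes steps exists for exactly this case. A hypothetical sketch of such a step:

# Hypothetical recipe step: skip = TRUE means the step is applied when
# the recipe is prepped during fitting, but skipped when baking new
# data at predict() time, where no churn column exists
recipe(churn ~ ., data = train_data) %>%
  step_mutate(churn = dplyr::if_else(churn == 1, "yes", "no"),
              skip = TRUE)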

How do I avoid using a for-loop to get elements of a nested list to the top level?

I am trying to extract the coefficients of a set of linear models into a data frame. How do I extract these values without using a for-loop?
The data in my example is dummy data for clarity. The actual project builds models of air temperature for each day of the year, and then tries to model the parameters of those models. Currently I can only accumulate each coefficient in a separate variable, and then apply each one individually to my data set:
require(tidyverse)

# making different mpg models from displacement, distinguished by cylinder count
models <- mtcars %>%
  nest(-cyl, .key = "cardata") %>%
  mutate(mod = map(cardata, ~ lm(mpg ~ disp, data = .))) %>%
  mutate(coefficients = map(mod, coefficients)) # this only extracts a list of coefficients

# currently using one for-loop to extract each coefficient, looking for a more elegant way...
coef.intercept <- c()
for (i in models$coefficients) {
  coef.intercept <- c(coef.intercept, i[1])
}
coef.disp <- c()
for (i in models$coefficients) {
  coef.disp <- c(coef.disp, i[2])
}

# putting together the final data frame
models <- models %>%
  mutate(coef.intercept) %>%
  mutate(coef.disp) %>%
  select(cyl, coef.intercept, coef.disp) %>%
  as.data.frame()
Using map I can extract a list of coefficients, but I cannot use the [ operator to get specific elements of the individual lists. Something like
mutate(models, coef.intercept = map(models, coefficients[1]))
does not work; I get "Error: Index 1 must have length 1, not 2".
I'm not able at the moment to replicate your example, but I think you can start from here and adapt this solution to your needs.
A <- list(a = list(1, "j"), b = list(2, "k"), d = list(3, "m"))
sapply(A, `[[`, 1)
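Adapted to the models data frame above, the same positional extraction can be done inside mutate() with purrr's map_dbl(), which accepts a position just like [[. A sketch, assuming the pipeline from the question has been run:

library(purrr)
library(dplyr)

# map_dbl(x, 1) pulls the first element of each coefficient vector
models %>%
  mutate(coef.intercept = map_dbl(coefficients, 1),
         coef.disp      = map_dbl(coefficients, 2)) %>%
  select(cyl, coef.intercept, coef.disp)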

Regression of a Data Frame with multiple factor groupings

I am working on a regression script.
I have a data.frame with roughly 130 columns, and I need to regress one column (let's call it the X column) against all the other ~100 numeric columns.
Before the regression is calculated, I need to group the data by four factors, myDat$Recipe, myDat$Step, myDat$Stage, and myDat$Prod, while still keeping the other ~100 columns and row data attached for the regression. Then I need to regress each column against the X column and print out the R^2 value along with the column name. This is what I've tried so far, but it is getting overly complicated and I know there's got to be a better way.
rm(list = ls())
myDat <- read.csv(file = "C:/Users/Documents/myDat.csv", header = TRUE, sep = ",")

for (j in myDat$Recipe) {
  myDatj <- subset(myDat, myDat$Recipe == j)
  for (k in myDatj$Step) {
    myDatk <- subset(myDatj, myDatj$Step == k)
    for (i in myDatk$Stage) {
      myDati <- subset(myDatk, myDatk$Stage == i)
      for (m in myDati$Prod) {
        myDatm <- subset(myDati, myDati$Prod == m)
        if (is.numeric(myDatm[3, i])) {
          fit <- lm(myDatk[, i] ~ X, data = myDatm)
          rsq <- summary(fit)$r.squared
          writeLines(paste(rsq, i, "\n"))
        }
      }
    }
  }
}
You can do this by combining dplyr, tidyr, and my broom package (you can install them with install.packages()). First you need to gather all the numeric columns into a single column:
library(dplyr)
library(tidyr)
tidied <- myDat %>%
  gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)
To understand what this does, you can read up on tidyr's gather operation. (This assumes that all columns besides X, Recipe, Step, Stage, and Prod are numeric and therefore should be predicted in your regression. If that's not the case, you need to remove them beforehand. You'll need to produce a reproducible example of the problem if you need a more customized solution).
Then perform each regression, while grouping by the column and the four grouping variables.
library(broom)
regressions <- tidied %>%
  group_by(column, Recipe, Step, Stage, Prod) %>%
  do(mod = lm(value ~ X, data = .))

glances <- regressions %>% glance(mod)
The resulting glances data frame will have one row for each combination of column, Recipe, Step, Stage, and Prod, along with an r.squared column containing the R-squared from each model. (It will also contain adj.r.squared, along with other columns such as F-test p-value: see here for more). Running coefs <- regressions %>% tidy(mod) will probably also be useful for you, as it will get the coefficient estimates and p-values from each regression.
A similar use case is described in the "broom and dplyr" vignette, and in Section 3.1 of the broom manuscript.
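As an aside, do() has since been superseded; a rough modern equivalent using nested data frames, sketched under current dplyr/tidyr/broom but not run on this data:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# One row per group; fit each model on the nested data, then unnest
# the one-row glance() summaries (r.squared, adj.r.squared, p.value, ...)
glances <- tidied %>%
  group_by(column, Recipe, Step, Stage, Prod) %>%
  nest() %>%
  mutate(mod = map(data, ~ lm(value ~ X, data = .x)),
         glanced = map(mod, glance)) %>%
  unnest(glanced)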
