I am using the great fable package and am trying to create a hierarchical forecast using arima and ets models, and reconciling with td, mo, bu, and min trace to compare and see what is the best approach. My series has some effects late in the series that need to be regressed away and so I am trying to create a binary regressor to deal with that. I have read link1 and link2 about using the new_data argument to add a regressor with a hierarchical forecast, instead of the xreg argument which I've used for non-hierarchical forecast. I've had success with this approach by splitting the data into train and test sets and passing the test to new_data as Rob Hyndman describes in link1. The problem I am having with this current task is that the effects that need to be modeled away are all late in the series and so they are all in the test set.
First here is my reproducible example data:
library(tidyverse)
library(forecast)
library(fable)
library(tsibble)
library(tsibbledata)
library(lubridate)
data <- aus_livestock %>%
filter(State %in% c("Tasmania", "New South Wales", "Queensland"),
as.Date(Month) > as.Date("2000-01-01")) %>%
aggregate_key(State, Count=sum(Count)) %>%
mutate(xreg=as.factor(if_else(as.Date(Month) > as.Date("2018-01-01") &
as.Date(Month) < as.Date("2018-10-01"), 1, 0)))
I have had success in the past doing something like this:
train <- data %>%
filter(as.Date(Month) < as.Date("2017-10-01"))
test <- data %>%
filter(as.Date(Month) >= as.Date("2017-10-01"))
mod_data <- train %>%
model(ets=ETS(Count),
arima=ARIMA(Count ~ xreg)
) %>%
reconcile(bu_ets=bottom_up(ets),
td_ets=top_down(ets),
mint_ets=min_trace(ets),
bu_arima=bottom_up(arima),
td_arima=top_down(arima),
mint_arima=min_trace(arima)
)
forc_data <- mod_data %>%
forecast(new_data=test)
autoplot(forc_data,
data,
level=NULL)
But since in this case the regressor is all zeros in the train set this expectedly provides the error Provided exogenous regressors are rank deficient, removing regressors: xreg1. I think what I need to do is feed all the data I have to the model, not split the data into train and test, but I am unsure how to forecast that model using fable when there is no data for the new_data file. The closest I've gotten is something like this:
dates <- sort(rep(seq(as.Date("2019-01-01"), as.Date("2020-12-01"), "months"), 3))
future_data <- tibble(
Month=dates,
State=rep(c("Tasmania", "New South Wales", "Queensland"), 24),
Count=0
) %>%
mutate(Month=yearmonth(Month)) %>%
as_tsibble(index=Month, key=State) %>%
aggregate_key(State, Count=sum(Count)) %>%
mutate(xreg=factor(0, levels=c(0, 1))) %>%
select(-Count)
mod_data <- data %>%
model(ets=ETS(Count),
arima=ARIMA(Count ~ xreg)
) %>%
reconcile(bu_ets=bottom_up(ets),
td_ets=top_down(ets),
mint_ets=min_trace(ets),
bu_arima=bottom_up(arima),
td_arima=top_down(arima),
mint_arima=min_trace(arima)
)
forc_data <- mod_data %>%
forecast(new_data=future_data)
autoplot(forc_data,
data,
level=NULL)
Oddly this code causes my R Studio to crash when I try to run the forecast piece saying R session aborted R has encountered a fatal error. I think this may be unrelated to the code because I actually got this to work on my real data but the forecasts dont look like I would expect.
So, in summary I would like to know how I can use fable to forecast a hierarchical series with an exogenous regressor when all the regression effects need to happen in the period of the test set.
Thanks in advance for any help I can get!
I think it's not possible to do it only in the test set because then the model has nothing to learn from in the train set. I.e. you can only include an exogenous variable in the training process if it is both present in the train and test set.
Related
I developed a stacked ensemble model using the tidymodels workflow and I want to monitor the performance of this model from time to time using vetiver. However, it seems the stacked model object isn't supported yet.
Please see the code snippet below
library(tidymodels)
library(vetiver)
library(pins)
library(arrow)
library(tidyverse)
library(bonsai)
library(stacks)
library(lubridate)
library(magrittr)
b <- board_folder(path = "pins-r/")
model <- vetiver_pin_read(board = b, name = "dcp_ibese_truck_arrival",
version = "20230110T094207Z-69661")
trips <- read_parquet("../IbeseLivePosition/ml_data/data_to_monitor_model.parquet")
trips %<>%
mutate(Date = as.Date(DateTimeReceived))
original_metrics <-
vetiver::augment(model, new_data = trips)
Error: No augment method for objects of class butchered_linear_stack
I am learning about Neural Networks on my own for a business analytics class I am taking. We don't have to learn it, but we were asked to discuss other machine learning algorithms we could use R to process and I went down the rabbit hole of trying it out as I think its pretty useful to know long term. I found a few resources that talk through the process but my outputs aren't making sense and it's causing me to pull hair out I don't have.
My Code:
library(neuralnet)
### Prepare the Dataframe for use. These items not needed for the network
nn1 <- df %>% ungroup() %>%
select(store, week, high_med_rev, high_med_gp, high_med_gpm)
### Prepare the Dataframe for use. These items are needed for the network
nn2 %<>% ungroup() %>%
select(high_med_units, size, region,
promo_units_per,
altbev_units_per, confect_units_per, salty_units_per,
velocityA_units_per, velocityB_units_per, velocityC_units_per, velocityD_units_per, velocityNEW_units_per)
### The region column needs to be a number so the neural network is able to work properly
nn2$region %<>% as.numeric()
# Check that "positive" is last for `my_confusion_matrix()` to work
# contrasts(nn2$high_med_units) ## NOT APPLICABLE FOR THIS MODEL
set.seed(77)
## Create a training and testing data partition
partition_nn <- caret::createDataPartition(y=nn2$high_med_units, p=.75, list=FALSE)
data_train_nn <- nn2[partition, ]
data_test_nn <- nn2[-partition, ]
## Create the Neural Network where high_med_units it the dependent outcome, the rest is independent
nn <- neuralnet(high_med_units ~ ., data=data_train_nn, hidden=3, threshold = 0.01,
linear.output = FALSE)
##Plot the Neural Network
plot(nn)
##predict the network outcomes on the testing data
predict_nn <- compute(nn, data_test_nn)
# Put the prediction back into the test data
data_test_nn$nn <- predict_nn$net.result
# Create a variable that shows if the prediction was correct
data_test_nn %<>%
mutate(correct_nn = if_else(nn == high_med_units, 'correct', 'WRONG!'))
# Add back the variables we took out at the beginning to the test data
temp1 <- nn1[-partition, ]
full_nn <- bind_cols(temp1, data_test_nn)
# For viewing in class
full_nn %>%
select(store, week, high_med_units, tree, correct_tree, correct_nn, size, region, promo_units_per, salty_units_per)
slice_sample(full_test_tree, n=10)
What I find is when I display the output of predict_nn$net.result it gives me a columns of data per row similar to this when I output it to the console:
https://www.datacamp.com/tutorial/neural-network-models-r
But when I add it to the data_test_nn dataframe above, the column name ends up being nn[,1] so the rest of the code fails.
The network does generate here:
Another interesting thing is that all the prediction values are the same:
I'm probably doing something very wrong since this is all new to me, any pointers/help would be awesome.
I am trying to get a loop inside a function.
So the main idea is that you get a var model, periods for forecasting, the training set, and the real data (for the forecasted period).
So I am making up this function, that reorganizes the data and computes the MSFE, the loop works fine! But as soon as it is inside a function, it won't print the results in the console.
Here I asked about the loop, so I know it is working fine.
Since the desired output is text, I don't want to store the sentences. Just print them in the console, I don't know how to make it.
Thanks in advance!
This is the function:
MSFE <- function(var, periods, train, test){
# Now for each variable compute the MSFE
## Recall that MSFE is mean((real-predicted)^2)
# var is the model,
# periods is the forecasted period
# train is the training data (1971m1-2003m12)
# test is the real data (2004m1-2004m12)
fcast_data <- forecast(var, h=periods)
test <- head(test, periods)
fcast_data <- fcast_data %>%
as.data.frame() %>% select(Time, Series, `Point Forecast`) %>%
spread(key="Series", value = `Point Forecast`) %>%
mutate(Time=toupper(Time),
Time=gsub(" ","",Time),
Time=paste0("01",Time),
Time=dmy(Time)) %>% arrange(Time) %>% select(-Time)
for(i in names(train)){
msfe <- paste("The Mean Squared Forecast Error for",i,"is:",mean((test[,i]-fcast_data[,i])^2))
print(msfe)
}
}
After designing a Tidymodels recipe-based workflow, which is tuned then fitted to some training data, I'm not clear what objects (fitted "workflow", "recipe", ..etc) should be saved to disk for use in predicting new data in production. I understand I can use saveRDS()/readRDS(), write_rds()/read_rds(), or other options to actually do the saving/loading of these objects, but which ones?
In a clean R environment I will have incoming new raw data which will need pre-processed using the "recipe" I used in training the model. I then want to make predictions based on that data after it has been pre-processed. If I intend to use the prep() and bake() functions to pre-process the new data as I did the training data, then I will minimally need the recipe and original training data it seems to get prep() to work. Plus, I also need the fitted model/workflow to make predictions. So three objects it seems. If I save to disk the workflow object in SESSION 1 then I have the ability to extract the recipe and model from it in SESSION 2 with pull_workflow_prepped_recipe() and pull_workflow_fit() respectively. But prep() seems to require the original training data, which I can keep in the workflow with an earlier use of retain = TRUE...but then that gets stripped out of the workflow after a call to fit(). Hear my cries for help! :)
So, imagine two different R sessions, where the first session I am doing all the training and model building, and the second session is some running production app that uses what was learned from the first session. I need help at the arrows in the bottom of SESSION1, and in multiple places in SESSION 2. I used the Tidymodels Get Started as the base for this example.
SESSION 1
library(tidymodels)
library(nycflights13)
library(readr)
set.seed(123)
flight_data <-
head(flights, 500) %>%
mutate(
arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
arr_delay = factor(arr_delay),
date = as.Date(time_hour)
) %>%
inner_join(weather, by = c("origin", "time_hour")) %>%
select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
na.omit() %>%
mutate_if(is.character, as.factor)
set.seed(555)
data_split <- initial_split(flight_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
flights_rec <-
recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID") %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors())
lr_mod <-
logistic_reg() %>%
set_engine("glm")
flights_wflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(flights_rec)
flights_fit <-
flights_wflow %>%
fit(data = train_data)
predict(flights_fit, test_data)
### SAVE ONE OR MORE OBJECTS HERE FOR NEXT SESSION <------------
# What to save? workflow (pre or post fit()?), recipe, training data...etc.
write_rds(flights_wflow, "flights_wflow.rds") # Not fitted workflow
write_rds(flights_fit, "flights_fit.rds") # Fitted workflow
SESSION 2
### READ ONE OR MORE OBJECTS HERE FROM PRIOR SESSION <------------
flights_wflow <- read_rds("flights_wflow.rds")
flights_fit <- read_rds("flights_fit.rds")
# Acquire new data, do some basic transforms as before
new_flight_data <-
tail(flights, 500) %>%
mutate(
arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
arr_delay = factor(arr_delay),
date = as.Date(time_hour)
) %>%
inner_join(weather, by = c("origin", "time_hour")) %>%
select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
na.omit() %>%
mutate_if(is.character, as.factor)
# Something here to preprocess the data with recipe as in SESSION 1 <----------
# new_flight_data_prep <- prep(??)
# new_flight_data_preprocessed <- bake(??)
# Predict new data
predict(flights_fit, new_data = new_flight_data_preprocessed)
You have some flexibility in how you approach this, depending on your constraints, but generally I would recommend saving/serializing the fitted workflow, perhaps after using butcher to reduce its size. You can see an example model fitting script in this repo that shows at the end how I save the fitted workflow.
When you go to predict with this workflow, there are some things to keep in mind. I have an example Plumber API in the same repo that demonstrates what is needed to predict for that particular workflow. Notice that the packages how the package needed for prediction are loaded/attached for this API. I didn't use all of tidymodels, but instead only the specific packages I need, for better performance and a smaller container.
Saving the fitted workflow did not work for me. When trying to predict with new data is asking for the target variable (a churn model)
predict(churn_model, the_data)
Error: Problem with `mutate()` column `churn`.
i `churn = dplyr::if_else(churn == 1, "yes", "no")`.
x object 'churn' not found
I still don't get why is asking for a column that should not be present in the data as it is the variable I try to predict...
I want to make some time-series evaluation in R. The process is usually to define a time lag and the evaluation frequency/periods, and for each evaluation period, train a model with the defined time lag and compute metrics for that period.
For example, we have:
Evaluation period size and interval n
Evaluation start at b
Time lag l
We train a model with points 1:b-l, evaluate it on b:b+n. After that we train a model with points 1:b+n-l and evaluate it on b+n:b+2n and etc, for k periods. It could vary a bit but that's the general spirit. So this is basically a sliding window for the evaluation data, but an increasing window for training data.
This is illustrated in the answer to this question (the expanding window solution).
How could I do this, preferably without loops and using the tidyverse and/or packages specific for time-series analysis?
So this is how I'm doing at the moment, but I'm really not satisfied with it. Too much custom code and not very modular.
time_series_cv <- function(dates_lim, df) {
eval_data <-
df %>%
filter(
date >= dates_lim[['date_beg']],
date < dates_lim[['date_end']]
)
eval_data$prediction <-
predict(
lm(
log(y) ~ .,
df %>% filter(date < dates_lim[['date_beg']]) %>% select(-c(date))
),
eval_data
)
eval_data %>%
select(date, y, prediction)
}
predictions <-
lapply(dates, time_series_cv, df = df) %>%
bind_rows()
dates is a list of named lists with the start and end of the evaluation period. Lag is 1 sample here.