How to forecast an ARIMA with dynamic regression for grouped data in R?

I'm trying to forecast several time series at once with an ARIMA model that includes a regressor (regression with ARIMA errors), using grouped data.
I'm new to tidy data. Basically, I'm reproducing this example (https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html), but with a multivariate ts and a multivariate model.
Here is a reproducible example:
library(tidyverse); library(tidyquant)
library(timetk); library(sweep)
library(forecast)
library(tsibble)
library(fpp3)
# using package data
bike_sales
# grouping data
monthly_qty_by_cat2 <- bike_sales %>%
mutate(order.month = as_date(as.yearmon(order.date))) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity), price.m = mean(price))
# using nest
monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
nest()
monthly_qty_by_cat2_nest
# Forecasting Workflow
# Step 1: Coerce to a ts object class
monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
mutate(data.ts = map(.x = data,
.f = tk_ts,
select = -order.month, # take off date
start = 2011,
freq = 12))
# Step 2: modeling an ARIMA(y ~ x)
# make a function to map
modeloARIMA_reg <- function(y,x) {
result <- ARIMA(y ~ x)
return(list(result))}
# map the function
monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
mutate(fit.arima = map(data.ts, modeloARIMA_reg))
monthly_qty_by_cat2_fit
At this point I don't know whether map() is using the right variable as y (the dependent variable), but I carry on and try the forecast, and an error appears:
# Step 3: Forecasting the model
monthly_qty_by_cat2_fcast <- monthly_qty_by_cat2_fit %>%
mutate(fcast.arima = map(fit.arima, forecast))
# this gives me the following error
# Error: Problem with `mutate()` input `fcast.arima`.
# x non-numeric argument to binary operator
# i Input `fcast.arima` is `map(fit.arima, forecast)`.
# i The error occured in group 1: category.secondary = "Cross Country Race".
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# In mean.default(x, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
Two questions emerge:
I don't know how to supply the mean of the independent variable (x) for each group as its future value;
and I don't know how to pass this new data as an argument to forecast().
PS: The result doesn't need to be a tibble or nested. I just need the point forecast and the confidence interval (total.qty, lo.95, hi.95).

Well, this code solved the problem for me.
It produces one forecast for each time series (a grouped tsibble) and uses each series' own mean value of the regressor as the future data in the forecast.
Any comments are welcome.
# MY FLOW
monthly_qty_by_cat2 <-
sweep::bike_sales %>%
mutate(order.month = yearmonth(order.date)) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity), price.m = mean(price)) %>%
as_tsibble(index=order.month, key=category.secondary) # coerce to a tsibble
# mean for the future
futuro <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
mutate(fut_x = mean(price.m)) %>%
do(price.m = head(.$fut_x,1))
# as.numeric
futuro$price.m <- as.numeric(futuro$price.m)
futuro
# make values in the future
future_x <- new_data(monthly_qty_by_cat2, 3) %>%
left_join(futuro, by = "category.secondary")
future_x
# model and forecast
fc <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
model(ARIMA(total.qty ~ price.m)) %>%
forecast(new_data=future_x) %>%
hilo(level = 95) %>%
unpack_hilo("95%")
fc
# Tidy the forecast
fc_tibble <- fc %>% as_tibble() %>% select(-total.qty)
fc_tibble
# the end
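The per-group future regressor can probably be built more compactly. The sketch below assumes a small toy tsibble (`toy`, with made-up values) standing in for `monthly_qty_by_cat2`, and replaces the mutate()/do()/as.numeric() steps with a single summarise(). It is one possible simplification, not the only way to do it.

```r
library(dplyr)
library(tsibble)

# Toy stand-in for monthly_qty_by_cat2 (hypothetical values, two groups)
toy <- tibble(
  category.secondary = rep(c("A", "B"), each = 4),
  order.month = rep(yearmonth("2011 Jan") + 0:3, times = 2),
  total.qty = c(10, 12, 11, 13, 20, 22, 21, 23),
  price.m = c(100, 110, 90, 100, 200, 210, 190, 200)
) %>%
  as_tsibble(index = order.month, key = category.secondary)

# One row per group holding the historical mean of the regressor.
# as_tibble() drops the time index so summarise() collapses over all months.
futuro <- toy %>%
  as_tibble() %>%
  group_by(category.secondary) %>%
  summarise(price.m = mean(price.m), .groups = "drop")

# Three future months per group, each carrying that group's mean price
future_x <- new_data(toy, 3) %>%
  left_join(futuro, by = "category.secondary")

future_x
```

The resulting `future_x` has the same shape as the one built above, so it could be passed to forecast(new_data = future_x) in the same way.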

Related

Forecasting time series groups based on trends using nested data in R

Reminiscent of this question: Forecasting Time Series Groups with tslm() & tidyverse
except I want to use Matt Dancho’s code at https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html
How can I use forecast::tslm() to produce grouped time series predictions from a nested data frame? The solution Rob Hyndman kindly provided used the tsibble and fable packages without nesting.
The map() of tslm() at the bottom of this code generates:
Caused by error in `formula.default()`:
! invalid formula
starter_time <- Sys.time()
library(dplyr)
library(timetk)
library(tidyr)
library(purrr)
library(lubridate)
library(forecast)
library(broom)
library(sweep)
library(zoo)
monthly_qty_by_cat2 <- bike_sales %>%
mutate(order.month = as_date(as.yearmon(order.date))) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity))
monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
nest()
monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
mutate(data.ts = map(.x = data,
.f = tk_ts,
select = -order.month,
start = 2011,
freq = 12))
## invalid formula ERROR
monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
mutate(fit.ets = map(data.ts, tslm, total.qty ~ trend))

R - Looping linear regression results for time series

I want to run linear regressions on the NZD vs a number of securities
I have some code that runs the regression, but rather than applying it to each security separately I would prefer to loop through the list of securities and get a file with the r^2 result from each linear regression
my dep variable is called: nzdusd
independent variables I would like to loop through are spx, adxy, vix
Code as it currently stands, with spx (I'd like to use the same code to loop through adxy and vix as well):
library(tseries)
library(lmtest)
library(dplyr)
library(lubridate)
# 3 month regression, change variable here to get number of days
# e.g. 3 months sd = 60
# inputs
# 3 month regression
sd <- 60
# loading my market data from a saved location (variables nzdusd,spx, adxy, vix)
my_path <- file.path("K:", "X", "bbg_daily.Rdata")
load(file = my_path)
# Transform NZD into percentage change
pct.nzdusd <- nzdusd %>%
select(date, PX_LAST) %>%
mutate(lag = lag(PX_LAST),
pct_chg = (PX_LAST - lag) * 100 / lag) %>%
select(date, pct_chg)
# SPX(S&P 500)
myfun <- function(x) {
deparse(substitute(x))
}
# ^=^=^=^=^=^=^=^=^=^=^=^=^=^=
mysec_str <- myfun(spx)
mysec <- spx
z <- 5 # Series ID
# ^=^=^=^=^=^=^=^=^=^=^=^=^=^=
# Transform into percentage change
mypct <- mysec %>%
select(date, PX_LAST) %>%
mutate(lag = lag(PX_LAST),
pct_chg = (PX_LAST - lag) * 100 / lag) %>%
select(date, pct_chg)
assign(paste("pct.", mysec_str, sep = ""),mypct)
# join times series
ts <- paste("ts_", z, sep ="")
ts <- (inner_join(x = pct.nzdusd, y = mypct, by = "date"))
# get last row
last_row <- ts %>% slice(n())
end_dt <- last_row [1,1]
# start date declared above depending on regression
start_dt <- ts[((nrow (ts))-sd),1]
# getting subset of time series
ts_sub <- subset(ts,
date >= as.POSIXct(start_dt) &
date <= as.POSIXct(end_dt))
# regression
reg.ts = lm(pct_chg.x~pct_chg.y, ts_sub)
r2 <- summary(reg.ts)$r.squared
assign(paste(mysec_str, ".r2", sep = ""),r2)
stderr <- sqrt(deviance(reg.ts)/df.residual(reg.ts))
assign(paste(mysec_str, ".stderr", sep = ""),stderr)
#===================================================
r2 <- c(spx.r2, adxy.r2, vix.r2)
my_path2 <- file.path ("K:","x")
save (r2, file = my_path2 )
So far I've done this by simply copying and pasting the code and replacing spx with the other variable names. But I know the code could be a lot slicker with a loop, particularly if I want to add many more independent variables.
It's hard to know without reprex data, but to run multiple models, I've found that pivoting longer, nesting by independent variable and then mutating over those variables works well. If your data contains just your dependent and independent variables, you can:
library(tidyverse)
ts_sub %>%
# Keep the dependent variable (nzdusd) outside the nested data
pivot_longer(-nzdusd, names_to = "independent_var", values_to = "values") %>%
nest_by(independent_var) %>%
mutate(model = list(lm(nzdusd ~ values, data = data)))
See: https://dplyr.tidyverse.org/reference/nest_by.html
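Since the question asks for the r^2 of each regression, the same pattern can return that value directly instead of the model object. This is a minimal sketch on simulated data: the `toy` data frame and its random values are assumptions standing in for the joined percentage-change series, and the column name `security` is made up for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Hypothetical stand-in: one dependent column and three independent columns
toy <- tibble(
  nzdusd = rnorm(60),
  spx = rnorm(60),
  adxy = rnorm(60),
  vix = rnorm(60)
)

r2_tbl <- toy %>%
  # long format: one row per (observation, security)
  pivot_longer(-nzdusd, names_to = "security", values_to = "values") %>%
  # one row per security, with its observations in the list-column `data`
  nest_by(security) %>%
  # fit the regression for that security and keep only its R-squared
  mutate(r2 = summary(lm(nzdusd ~ values, data = data))$r.squared) %>%
  ungroup() %>%
  select(security, r2)

r2_tbl
```

Adding another security then only means adding another column to the input; `r2_tbl` can be written out with save() or write.csv() as needed.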

Future dataset is incomplete when using Fable Prophet

I'm trying to view the out of sample performance scores after running fable prophet. Please note, the forecast is grouped based on type and the forecast is looking 5 observations ahead.
Here is the code:
library(tibble)
library(tsibble)
library(fable.prophet)
lax_passengers <- read.csv("https://raw.githubusercontent.com/mitchelloharawild/fable.prophet/master/data-raw/lax_passengers.csv")
library(dplyr)
library(lubridate)
lax_passengers <- lax_passengers %>%
mutate(datetime = mdy_hms(ReportPeriod)) %>%
group_by(month = yearmonth(datetime), type = Domestic_International) %>%
summarise(passengers = sum(Passenger_Count)) %>%
ungroup()
lax_passengers <- as_tsibble(lax_passengers, index = month, key = type)
fit <- lax_passengers %>%
model(
mdl = prophet(passengers ~ growth("linear") + season("year", type = "multiplicative")),
)
fit
test_tr <- lax_passengers %>%
slice(1:(n()-5)) %>%
stretch_tsibble(.init = 12, .step = 1)
fc <- test_tr %>%
model(
mdl = prophet(passengers ~ growth("linear") + season("year", type = "multiplicative")),
) %>%
forecast(h = 5)
fc %>% accuracy(lax_passengers)
When I run fc %>% accuracy(lax_passengers), I get the following warning:
Warning message:
The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
5 observations are missing between 2019 Apr and 2019 Aug
How do I make the future dataset complete? I believe the performance score isn't accurate because of the 5 missing observations.
It seems like when I try to stretch the tsibble, it doesn't slice correctly as it doesn't remove the last 5 observations from each type.
The slice() function removes rows from the entire dataset, so it is only removing the last 5 rows from your last key (type=="International"). To remove the last 5 rows from all keys, you'll need to group by keys and slice.
test_tr <- lax_passengers %>%
group_by_key() %>%
slice(1:(n()-5)) %>%
ungroup() %>%
stretch_tsibble(.init = 12, .step = 1)
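The difference between the two slice() calls can be seen on a small example. The sketch below uses a plain tibble with made-up rows, and group_by(type) stands in for group_by_key(), which is an assumption made to keep the example free of tsibble.

```r
library(dplyr)

# Hypothetical data: 8 months for each of two keys
toy <- tibble(
  type = rep(c("Domestic", "International"), each = 8),
  month = rep(1:8, times = 2)
)

# Ungrouped: drops the last 5 rows of the whole table,
# which all belong to the last key
ungrouped <- toy %>%
  slice(1:(n() - 5))

# Grouped: drops the last 5 rows within each key
grouped <- toy %>%
  group_by(type) %>%
  slice(1:(n() - 5)) %>%
  ungroup()

count(ungrouped, type)
count(grouped, type)
```

In the ungrouped result "Domestic" keeps all 8 rows while "International" keeps 3; in the grouped result both keys keep 3 rows.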

How to use gtsummary::tbl_svysummary() to display confidence intervals for levels of a factor variable?

I am using survey data from the National Electronic Injury Surveillance System (https://www.cpsc.gov/Research--Statistics/NEISS-Injury-Data) to research trends in consumer product injuries.
Using gtsummary and tbl_svysummary(), my goal is to create a descriptive table of summary measures of injuries. Since this is survey data, I want to display the 95% confidence interval associated with each summary measure.
This previous post provides a solution to generating confidence intervals for two level factor variables (Using (gtsummary) tbl_svysummaary function to display confidence intervals for survey.design object?), however, I am looking for a solution to produce confidence intervals for factor variables with >=2 levels.
I am borrowing a reproducible example from the previous post:
library(gtsummary)
library(survey)
svy_trial <-
svydesign(~1, data = trial %>% select(trt, response, death), weights = ~1)
ci <- function(variable, by, data, ...) {
svyby(as.formula( paste0( "~" , variable)) , by = as.formula( paste0( "~" , by)), data, svyciprop, vartype="ci") %>%
tibble::as_tibble() %>%
dplyr::mutate_at(vars(ci_l, ci_u), ~style_number(., scale = 100) %>% paste0("%")) %>%
dplyr::mutate(ci = stringr::str_glue("{ci_l}, {ci_u}")) %>%
dplyr::select(all_of(c(by, "ci"))) %>%
tidyr::pivot_wider(names_from = all_of(by), values_from = ci) %>%
set_names(paste0("add_stat_", seq_len(ncol(.))))
}
ci("response", "trt", svy_trial)
#> # A tibble: 1 x 2
#> add_stat_1 add_stat_2
#> <glue> <glue>
#> 1 21%, 40% 25%, 44%
svy_trial %>%
tbl_svysummary(by = "trt", missing = "no") %>%
add_stat(everything() ~ "ci") %>%
modify_table_body(
dplyr::relocate, add_stat_1, .after = stat_1
) %>%
modify_header(starts_with("add_stat_") ~ "**95% CI**") %>%
modify_footnote(everything() ~ NA)
[Image: table screenshot from the previous post]
In the above example, the factor variables have two levels and summary data from 1 level is shown.
How can I tweak the above approach so that both levels of factor variables are displayed with their respective confidence intervals?
How can this solution be generalized to factor variables with >2 levels (e.g., an age variable binned as follows: <18 years, 18-25 years, 26-50 years, etc)?
Lastly, how could this desired solution also accommodate generating confidence intervals for continuous variables in the same column as the confidence intervals for factor variables?
Here is an example of the table I am trying to produce:
[Image: screenshot of the desired table output]
Apologies if this request for help doesn't follow good stack overflow etiquette (I'm fairly new to this community) and your time and assistance is much appreciated!
I have a prepared example for factors with >=2 levels, but not with a by= variable (although the approach is similar). FYI, we have an open issue to support survey objects more thoroughly with a new function add_ci.tbl_svysummary() that will calculate CIs for both categorical and continuous variables. You can click the "subscribe" link here to be alerted when this feature is implemented https://github.com/ddsjoberg/gtsummary/issues/965
In the meantime, here is a code example:
library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.5.0'
svy <- survey::svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq)
# put the CI in a tibble with the variable name
# first create a data frame with each variable and its values
df_result <-
tibble(variable = c("Class", "Sex", "Age", "Survived")) %>%
# get the levels of each variable in a new column
# adding them as a list to allow for different variable classes
rowwise() %>%
mutate(
# level to be used to construct call
level = unique(svy$variables[[variable]]) %>% as.list() %>% list(),
# character version to be merged into table
label = unique(svy$variables[[variable]]) %>% as.character() %>% as.list() %>% list()
) %>%
unnest(c(level, label)) %>%
mutate(
label = unlist(label)
)
# construct call to svyciprop
df_result$svyciprop <-
map2(
df_result$variable, df_result$label,
function(variable, level) rlang::inject(survey::svyciprop(~I(!!rlang::sym(variable) == !!level), svy))
)
# round/format the 95% CI
df_result <-
df_result %>%
rowwise() %>%
mutate(
ci =
svyciprop %>%
attr("ci") %>%
style_sigfig(scale = 100) %>%
paste0("%", collapse = ", ")
) %>%
ungroup() %>%
# keep variables needed in tbl
select(variable, label, ci)
# construct gtsummary table with CI
tbl <-
svy %>%
tbl_svysummary() %>%
# merge in CI
modify_table_body(
~.x %>%
left_join(
df_result,
by = c("variable", "label")
)
) %>%
# add a header
modify_header(ci = "**95% CI**")
Created on 2021-12-04 by the reprex package (v2.0.1)
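To see what each row of `df_result` holds, it may help to look at a single svyciprop() call in isolation. The sketch below uses the same Titanic design as above and computes the interval for one level (Sex == "Male"); the choice of that level is arbitrary and only for illustration.

```r
library(survey)

# Same design as in the example: one row per cell, weighted by Freq
svy <- svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq)

# Weighted proportion of one factor level, with its confidence interval
p <- svyciprop(~I(Sex == "Male"), svy)

# The interval is stored as an attribute of the result (lower, upper)
ci <- attr(p, "ci")

as.numeric(p)
ci
```

The map2() step in the example repeats this call once per variable/level pair, which is why the approach extends to factors with any number of levels.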

Fit loess smoothers for multiple groups across multiple numeric variables

I need to fit many loess splines by the grouping variable (Animal) across multiple numeric columns (Var1, Var2), and extract these values.
I found code to do this task one variable at a time:
# Create dataframe 1
OneVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))))
library(dplyr)
library(tidyr)
library(purrr)
# Get fitted values from each model
Models <- OneVarDF %>%
tidyr::nest(-Animal) %>%
dplyr::mutate(m = purrr::map(data, loess, formula = Var1 ~ Day, span = 0.30),
fitted = purrr::map(m, `[[`, "fitted")
)
# Create prediction column
Results <- Models %>%
dplyr::select(-m) %>%
tidyr::unnest()
This "Results" dataframe is essential for downstream tasks (detrending many non-parametric distributions).
How can we achieve this with a dataframe with multiple numeric columns (code below), and extract a "Results" dataframe? Thank you.
# Create dataframe 2
TwoVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))),
Var2 = c(c(replicate(1,sample(22:27,100,rep=TRUE))), c(replicate(1,sample(29:35,100,rep=TRUE)))))
We can get the data into long format using pivot_longer, group_by Animal and column name, and apply loess to each combination.
library(dplyr)
library(tidyr)
TwoVarDF %>%
pivot_longer(cols = starts_with('Var')) %>%
group_by(Animal, name) %>%
mutate(model = loess(value~Day, span = 0.3)$fitted)
Alternatively, add a gather() step and then proceed as in your original code.
Models2 <- TwoVarDF %>%
gather(varName, varVal, 3:4) %>%
tidyr::nest(-Animal, -varName) %>%
dplyr::mutate(m = purrr::map(data, loess, formula = varVal ~ Day, span = 0.30),
fitted = purrr::map(m, `[[`, "fitted")
)
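To get from the nested models to a flat "Results" data frame, the list-columns can be unnested together. This is a sketch under some assumptions: the data are regenerated with a fixed seed, pivot_longer() is used in place of gather(), and the name `Results2` is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
# Same structure as TwoVarDF in the question, regenerated here
TwoVarDF <- data.frame(
  Day = sample(1:50, 200, replace = TRUE),
  Animal = rep(c("Greyhound", "Horse"), each = 100),
  Var1 = c(sample(2:10, 100, replace = TRUE), sample(15:20, 100, replace = TRUE)),
  Var2 = c(sample(22:27, 100, replace = TRUE), sample(29:35, 100, replace = TRUE))
)

Results2 <- TwoVarDF %>%
  # long format: one row per (observation, variable)
  pivot_longer(starts_with("Var"), names_to = "varName", values_to = "varVal") %>%
  # one nested data frame per Animal/variable combination
  nest(data = -c(Animal, varName)) %>%
  mutate(
    m = map(data, ~ loess(varVal ~ Day, data = .x, span = 0.30)),
    fitted = map(m, ~ .x$fitted)
  ) %>%
  select(-m) %>%
  # expand the data and the fitted values side by side
  unnest(c(data, fitted))

head(Results2)
```

Each row of `Results2` carries Animal, varName, Day, the observed varVal and its loess fitted value, so detrending is a matter of subtracting one column from the other.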

Resources