Print iterations when loop is inside a function R - r

I am trying to get a loop inside a function.
So the main idea is that you get a var model, periods for forecasting, the training set, and the real data (for the forecasted period).
So I am making up this function, that reorganizes the data and computes the MSFE, the loop works fine! But as soon as it is inside a function, it won't print the results in the console.
Here I asked about the loop, so I know it is working fine.
Since the desired output is text, I don't want to store the sentences. Just print them in the console, I don't know how to make it.
Thanks in advance!
This is the function:
MSFE <- function(var, periods, train, test){
# Now for each variable compute the MSFE
## Recall that MSFE is mean((real-predicted)^2)
# var is the model,
# periods is the forecasted period
# train is the training data (1971m1-2003m12)
# test is the real data (2004m1-2004m12)
fcast_data <- forecast(var, h=periods)
test <- head(test, periods)
fcast_data <- fcast_data %>%
as.data.frame() %>% select(Time, Series, `Point Forecast`) %>%
spread(key="Series", value = `Point Forecast`) %>%
mutate(Time=toupper(Time),
Time=gsub(" ","",Time),
Time=paste0("01",Time),
Time=dmy(Time)) %>% arrange(Time) %>% select(-Time)
for(i in names(train)){
msfe <- paste("The Mean Squared Forecast Error for",i,"is:",mean((test[,i]-fcast_data[,i])^2))
print(msfe)
}
}

Related

how to use R package `caret` to run `pls::plsr( )` with multiple responses

the caret::train() does not seem to accept y if y is a matrix of multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)
get_var_explained <- function(recipe, ...) {
# Extract the predictors and outcomes into their own matrices
y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())
# The pls package prefers the data in a data frame where the outcome
# and predictors are in _matrices_. To make sure this is formatted
# properly, use the `I()` function to inhibit `data.frame()` from making
# all the individual columns. `pls_format` should have two columns.
pls_format <- data.frame(
endpoints = I(y_mat),
measurements = I(x_mat)
)
# Fit the model
mod <- plsr(endpoints ~ measurements, data = pls_format)
# Get the proportion of the predictor variance that is explained
# by the model for different number of components.
xve <- explvar(mod)/100
# To do the same for the outcome, it is more complex. This code
# was extracted from pls:::summary.mvr.
explained <-
drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
# transpose so that components are in rows
t() %>%
as_tibble() %>%
# Add the predictor proportions
mutate(predictors = cumsum(xve) %>% as.vector(),
components = seq_along(xve)) %>%
# Put into a tidy format that is tall
pivot_longer(
cols = c(-components),
names_to = "source",
values_to = "proportion"
)
}
#We compute this data frame for each resample and save the results in the different columns.
folds <-
folds %>%
mutate(var = map(recipes, get_var_explained),
var = unname(var))
#To extract and aggregate these data, simple row binding can be used to stack the data vertically. Most of the action happens in the first 15 components so let’s filter the data and compute the average proportion.
variance_data <-
bind_rows(folds[["var"]]) %>%
filter(components <= 15) %>%
group_by(components, source) %>%
summarize(proportion = mean(proportion))
This might not be a reproducible code block. May need additional data or packages.

Hierarchical forecast with xreg late in series using fable

I am using the great fable package and am trying to create a hierarchical forecast using arima and ets models, and reconciling with td, mo, bu, and min trace to compare and see what is the best approach. My series has some effects late in the series that need to be regressed away and so I am trying to create a binary regressor to deal with that. I have read link1 and link2 about using the new_data argument to add a regressor with a hierarchical forecast, instead of the xreg argument which I've used for non-hierarchical forecast. I've had success with this approach by splitting the data into train and test sets and passing the test to new_data as Rob Hyndman describes in link1. The problem I am having with this current task is that the effects that need to be modeled away are all late in the series and so they are all in the test set.
First here is my reproducible example data:
library(tidyverse)
library(forecast)
library(fable)
library(tsibble)
library(tsibbledata)
library(lubridate)
data <- aus_livestock %>%
filter(State %in% c("Tasmania", "New South Wales", "Queensland"),
as.Date(Month) > as.Date("2000-01-01")) %>%
aggregate_key(State, Count=sum(Count)) %>%
mutate(xreg=as.factor(if_else(as.Date(Month) > as.Date("2018-01-01") &
as.Date(Month) < as.Date("2018-10-01"), 1, 0)))
I have had success in the past doing something like this:
train <- data %>%
filter(as.Date(Month) < as.Date("2017-10-01"))
test <- data %>%
filter(as.Date(Month) >= as.Date("2017-10-01"))
mod_data <- train %>%
model(ets=ETS(Count),
arima=ARIMA(Count ~ xreg)
) %>%
reconcile(bu_ets=bottom_up(ets),
td_ets=top_down(ets),
mint_ets=min_trace(ets),
bu_arima=bottom_up(arima),
td_arima=top_down(arima),
mint_arima=min_trace(arima)
)
forc_data <- mod_data %>%
forecast(new_data=test)
autoplot(forc_data,
data,
level=NULL)
But since in this case the regressor is all zeros in the train set this expectedly provides the error Provided exogenous regressors are rank deficient, removing regressors: xreg1. I think what I need to do is feed all the data I have to the model, not split the data into train and test, but I am unsure how to forecast that model using fable when there is no data for the new_data file. The closest I've gotten is something like this:
dates <- sort(rep(seq(as.Date("2019-01-01"), as.Date("2020-12-01"), "months"), 3))
future_data <- tibble(
Month=dates,
State=rep(c("Tasmania", "New South Wales", "Queensland"), 24),
Count=0
) %>%
mutate(Month=yearmonth(Month)) %>%
as_tsibble(index=Month, key=State) %>%
aggregate_key(State, Count=sum(Count)) %>%
mutate(xreg=factor(0, levels=c(0, 1))) %>%
select(-Count)
mod_data <- data %>%
model(ets=ETS(Count),
arima=ARIMA(Count ~ xreg)
) %>%
reconcile(bu_ets=bottom_up(ets),
td_ets=top_down(ets),
mint_ets=min_trace(ets),
bu_arima=bottom_up(arima),
td_arima=top_down(arima),
mint_arima=min_trace(arima)
)
forc_data <- mod_data %>%
forecast(new_data=future_data)
autoplot(forc_data,
data,
level=NULL)
Oddly this code causes my R Studio to crash when I try to run the forecast piece saying R session aborted R has encountered a fatal error. I think this may be unrelated to the code because I actually got this to work on my real data but the forecasts dont look like I would expect.
So, in summary I would like to know how I can use fable to forecast a hierarchical series with an exogenous regressor when all the regression effects need to happen in the period of the test set.
Thanks in advance for any help I can get!
I think it's not possible to do it only in the test set because then the model has nothing to learn from in the train set. I.e. you can only include an exogenous variable in the training process if it is both present in the train and test set.

Iterating Effect Size Calculations Through Columns

I am currently comparing the size of 159 regions (ROI) in the brain between an at-risk and normal population on R. I originally calculated lm model p-values using this loop:
storage <- list()
for(i in names(ThalPC)[-c(1:8)]){
storage[[i]] <- lm(get(i) ~ Status, ThalPC)
}
table <- storage %>% tibble(
dvsub = names(.),
untidied = .
) %>%
mutate(tidy = map(untidied, broom::tidy)) %>%
unnest(tidy)
tab <- as.data.frame(table)
to <- subset(tab, select = -c(2))
newtable <- filter(to, term == "StatusControl")
ThalPC= my data frame
Status = Their status as Control or at-risk population
Now, I have around 59 regions with significant p-values and I am hoping to calculate the effect sizes for them. Currently I am trying to use this loop:
stor <- list()
for(i in names(ThalPC)[-c(1:9)]) {
stor[[i]] <- lm(get(i) ~ Status, ThalPC)
try <- effectsize(stor[[i]], type="eta")
}
However, I get the following error:
Error in get(i) : object 'Left_LGN' not found
(Left_LGN being a region that I am studying, all the 159 regions are set up as columns through the data frame)
Perhaps I am overthinking it, does anyone know any simple solution/ better approach to getting the effect sizes for them?
I am still a beginner in R and statistics so I really appreciate your input!!
Thank you!
I would guess you used attach(ThalPC) before running your first script to add columns of ThalPC to the search path. Instead, try constructing your call to lm as:
stor[[i]] <- lm(as.formula(paste(i, "~ Status")),
data = ThalPC)
It looks like you might want to collect the output of effectsize as elements of a list too, otherwise you're overwriting it each time.

How do I avoid using a for-loop to get elements of a nested list to the top level?

I am trying to extract the coefficients of a set of linear models into a data frame. How do I extract these values without using a for-loop?
The data in my example is dummy data for clarity. The actual project makes models for air temperature for each day of the year, and then tries to model the parameters of these models. Currently I can only accumulate each coefficient in a separate variable, and then apply individually it to my data set:
require(tidyverse)
# making different mpg models from displacement, distinguished by cylinder count
models <- mtcars %>%
nest(-cyl, .key = "cardata") %>%
mutate(mod = map(cardata, ~lm(mpg ~ disp, data = .))) %>%
mutate(coefficients = map(mod, coefficients)) #this only extracts a list of coefficients
# currently using one for-loop to extract each coefficient, looking for a more elegant way...
coef.intercept <- c()
for (i in models$coefficients) {
coef.intercept <- c(coef.intercept,i[1])
}
coef.disp <- c()
for (i in models$coefficients) {
coef.disp <- c(coef.disp,i[2])
}
# putting together the final data frame
models <- models %>%
mutate(coef.intercept) %>%
mutate(coef.disp) %>%
select(cyl, coef.intercept, coef.disp) %>%
as.data.frame()
Using 'map' I can extract a list of coefficients, but I cannot use the '[' operator in order to get specific elements of the individual lists. Something like
mutate(models, coef.intercept = map(models, coefficients[1]))
does not work, I get "Error: Index 1 must have length 1, not 2".
I'm not able at the moment to replicate your example, but I think you can start from here and adapt this solution to your needs.
A <- list(a = list(1,"j"), b = list(2, "k") , d = list(3, "m" ) )
sapply(A, `[[`, 1)

Generic time-series backtesing/cross-validation with R

I want to make some time-series evaluation in R. The process is usually to define a time lag and the evaluation frequency/periods, and for each evaluation period, train a model with the defined time lag and compute metrics for that period.
For example, we have:
Evaluation period size and interval n
Evaluation start at b
Time lag l
We train a model with points 1:b-l, evaluate it on b:b+n. After that we train a model with points 1:b+n-l and evaluate it on b+n:b+2n and etc, for k periods. It could vary a bit but that's the general spirit. So this is basically a sliding window for the evaluation data, but an increasing window for training data.
This is illustrated in the answer to this question (the expanding window solution).
How could I do this, preferably without loops and using the tidyverse and/or packages specific for time-series analysis?
So this is how I'm doing at the moment, but I'm really not satisfied with it. Too much custom code and not very modular.
time_series_cv <- function(dates_lim, df) {
eval_data <-
df %>%
filter(
date >= dates_lim[['date_beg']],
date < dates_lim[['date_end']]
)
eval_data$prediction <-
predict(
lm(
log(y) ~ .,
df %>% filter(date < dates_lim[['date_beg']]) %>% select(-c(date))
),
eval_data
)
eval_data %>%
select(date, y, prediction)
}
predictions <-
lapply(dates, time_series_cv, df = df) %>%
bind_rows()
dates is a list of named lists with the start and end of the evaluation period. Lag is 1 sample here.

Resources