Fit a Mean forecasting model using tsibble, fable in R - r

(Using Orange dataset from library(Ecdat) for reproducibility.)
I am trying to fit a mean forecasting model in R using tsibble, fable package in R. The code below is pretty simple, however I get the error Error in NCOL(x) : object 'value' not found when I try to run the last model part (even though value is a column name in o_ts), not sure why would that be. I am following RJH tutorials from here (https://robjhyndman.com/hyndsight/fable/).
I would also appreciate any help whether arima & mean forecasting model are same, if not what is the function that I should be using instead of Arima.
library(Ecdat)
library(tsibble)
library(feasts)
library(tidyverse)
library(fable)
o<- Orange
o_ts <- o %>% as_tsibble()
o_ts %>%
filter(key=="priceoj") %>%
model(
arima=arima(value))

arima is from the stats package. I believe you want ARIMA from fable.
o_ts %>%
filter(key == "priceoj") %>%
model(
arima = ARIMA(value)
)
#> # A mable: 1 x 2
#> # Key: key [1]
#> key arima
#> <chr> <model>
#> 1 priceoj <ARIMA(1,1,0)(0,0,1)[12]>

If you by mean forecasting model are referring to taking the mean of the last X observation (Moving Average), then you should be using MEAN.
While ARIMA does refer to Moving Average (Auto Regressive Integrated Moving Average), however this refers to a weighted moving average of the forecast errors - you can read more here: 9.4 Moving average models in Forecasting: Principles and Practice
o <- Orange
o_ts <- o %>% as_tsibble()
o_ts %>%
filter(key == "priceoj") %>%
model(mean = MEAN(value))
If you want to specify the amount of observations to take the mean of, then you need to add the special ~window(size = X). Otherwise all observations are used.
o_ts %>%
filter(key == "priceoj") %>%
model(mean = MEAN(value ~ window(size = 3)))

Related

Extracting estimates with ranger decision trees

I am getting the error message Error: No tidy method for objects of class ranger when trying to extract the estimates for a regression model built with the ranger package in R.
Here is my code:
# libraries
library(tidymodels)
library(textrecipes)
library(LiblineaR)
library(ranger)
library(tidytext)
# create the recipe
comments.rec <- recipe(year ~ comments, data = oa.comments) %>%
step_tokenize(comments, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
step_tokenfilter(comments, max_tokens = 1e3) %>%
step_stopwords(comments, stopword_source = "stopwords-iso") %>%
step_tfidf(comments) %>%
step_normalize(all_predictors())
# workflow with recipe
comments.wf <- workflow() %>%
add_recipe(comments.rec)
# create the regression model using support vector machine
svm.spec <- svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("regression")
svm.fit <- comments.wf %>%
add_model(svm.spec) %>%
fit(data = oa.comments)
# extract the estimates for the support vector machine model
svm.fit %>%
pull_workflow_fit() %>%
tidy() %>%
arrange(-estimate)
Below is the table of estimates for each tokenized term in the data set (this is a dirty data set for demo purposes)
term estimate
<chr> <dbl>
1 Bias 2015.
2 tfidf_comments_2021 0.877
3 tfidf_comments_2019 0.851
4 tfidf_comments_2020 0.712
5 tfidf_comments_2018 0.641
6 tfidf_comments_https 0.596
7 tfidf_comments_plan s 0.462
8 tfidf_comments_plan 0.417
9 tfidf_comments_2017 0.399
10 tfidf_comments_libraries 0.286
However, when using the ranger engine to create a regression model from random forests, I have no such luck and get the error message above
# create the regression model using random forests
rf.spec <- rand_forest(trees = 50) %>%
set_engine("ranger") %>%
set_mode("regression")
rf.fit <- comments.wf %>%
add_model(rf.spec) %>%
fit(data = oa.comments)
# extract the estimates for the random forests model
rf.fit %>%
pull_workflow_fit() %>%
tidy() %>%
arrange(-estimate)
To put this back to you in a simpler form that I think highlights the issue - if you had a decision tree model, how would you produce coefficients on the data in the dataset? What would those mean?
I think what you are looking for here is some form a attribution to each column. There are tools to do this built into tidymodels, but you should read on what it's actually reporting.
For you, you can get a basic idea of what those numbers would look like by using the vip package, though the produced numbers are definitely not comparable directly to your svm ones.
install.packages('vip')
library(vip)
rf.fit %>%
pull_workflow_fit() %>%
vip(geom = "point") +
labs(title = "Random forest variable importance")
You'll produce a plot with relative importance scores. To get the numbers
rf.fit %>%
pull_workflow_fit() %>%
vi()
tidymodels has a decent walkthrough doing this here but, given you have a model that can estimate importance scores you should be good to go.
Tidymodels tutorial page - 'a case study'
edit: if you HAVEN'T done this you may need to rerun your initial model with a new parameter passed during the 'set_engine' step of your code that gives ranger an idea of what kind of importance scores you are looking for/how they should be computed.

How to get multiple-steps ahead forecast with STL model in fable-r?

My purpose is forecast multiple-step without re-estimation. And I will update new observation to next forecast.
I did not using fit and apply forecast(h=7) because this function using fitted value to forecast next observation.
I used following codes to get 1-step ahead forecast with stretch_tsibble to do it.
library(fable)
library(dplyr)
library(tsibble)
library(feasts)
us_accidental_deaths <- as_tsibble(USAccDeaths)
stretch_dt <- us_accidental_deaths %>%
stretch_tsibble(.init = 60, .step = 1)
fit_train <- stretch_dt %>%
# keep same estimate period with each .id
filter_index(. ~ '1977 Dec') %>%
model(stl_ets_mod = decomposition_model(
STL(value, ~ season(window = 12)),
ETS(season_adjust ~ season("N")),
SNAIVE(season_year)
),
arima_mod = ARIMA(value))
It's ok when I refit ARIMA model
fit_train %>%
select(arima_mod) %>%
refit(stretch_dt) %>%
forecast(h = 1)
But I met error when I refit STL model.
fit_train %>%
select(stl_ets_mod) %>%
refit(stretch_dt) %>%
forecast(h = 1)
Many thanks !!!
The error you are getting is
! no applicable method for 'refit' applied to an object of class "c('decomposition_model', 'model_combination')"
refit() is not available for all models.
It is not clear how a refit should work for an STL decomposition. The STL components are specific to the data set used for training. If the model is applied to a different data set, potentially of a different length, what should the components be?

Forecasting future observations based off of lowest RMSE models

I asked this question over at RStudio community and received no answer so I figured I'd give it a go here. My question pertains to what budugulo asked here Select models with lowest RMSE but I'm wondering how I can go further and use the models with the best predictive capability against the test data and apply it across the entire original hierarchical dataset to get future observations.
I understand how to forecast into the future with one individual time series, but I'm trying to forecast a hierarchical dataset that would require too much time to forecast the best models onto all of the original time series individually to forecast future observations. Is there a way to fit the best models (using lowest RMSE) onto the original time series in a hierarchical dataset to forecast future observations 3 years into the future (2020)? I tried using refit() but to no avail.
Hopefully the code below will help towards answering my question.
library(tidyverse)
library(tsibble)
library(fable)
library(fpp3)
fit <- tourism %>%
filter(Quarter <= yearquarter("2015 Q1")) %>%
model(
ets = ETS(Trips),
arima = ARIMA(Trips)
)
fc <- fit %>%
forecast(new_data = filter(tourism, Quarter > yearquarter("2015 Q1")))
bestrmse <- accuracy(fc, tourism) %>%
group_by(Region, State, Purpose) %>%
filter(RMSE == min(RMSE)) %>%
select(.model:Region)
bestfits <- fit %>%
pivot_longer(cols=ets:arima, names_to = ".model", values_to = "fit") %>%
right_join(bestrmse) %>%
mutate(.model = "best") %>%
pivot_wider(Region:Purpose, names_from = ".model", values_from = "fit") %>%
as_mable(key = c(Region, State, Purpose), model = best)
#Apply 'best' models from bestfits onto original non-trained/non-tested time series and
#forecast future observations into 2020.

Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis textbook at the moment. It is being explained that the coefficients from a linear model, where the independent variables are two categorical variables with 2 and 3 factors respectively, and the dependent variable is a continuous variable should be interpreted as; the difference between the overall mean of the dependent variable (mean across all categorical variables and factors) and the mean of the dependent variable based on the values of the dependent variable from a given factorized categorical variable. I hope it's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
as_tibble() %>%
mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
group_by(supp) %>%
summarise(Mean = mean(len))) %>%
filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is the lm actually calculating regarding the coefficients?
Is there any functions in R that can calculate what I'm interested in, with me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!
The lm() function uses dummy coding, so all the coefficients in your model are compared to the reference group's mean. The reference group here is the first levels of your factors, so supp=OJ and dose=0.5
You can then do this verification like so:
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] == mean_table %>% filter(supp=='VC' & dose==0.5) %>% pull(M)
(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) == mean_table %>% filter(supp=='VC' & dose==1) %>% pull(M)
You can read into the differences here

Hundreds of linear regressions that run by group in R [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variables as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. To do each regression individually and record the summary variables would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
I achieved a list sample below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however. I still need the p-value and r-squared statistic. If there is a code to do it all in one go for all group values, or a code to just pull the remaining values, it would be helpful.
Is there are way also to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I thinks the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression. I think you're able to apply the same principle to models of the lme4 package.
Lets create a toy data set, where we've measured the IQ score for three different groups at two different points of time.
library(tidyverse)
library(broom)
df <- tibble(
id = seq_len(90),
IQ = rnorm(90, 100, 15),
group = rep(c("A", "B", "C"), each = 30),
time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point of time, we only need five lines of code.
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results = map(fit, glance)) %>%
unnest(results) %>%
select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables of id, IQ and time. Then you add a new column fit with mutate() where you apply a regression model for each group and a new column containing the results, which we unnest() shortly after to access the values glance() returned properly. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results1 = map(fit, glance),
results2 = map(fit, tidy)) %>%
unnest(results1) %>%
unnest(results2) %>%
select(group, term, estimate, r.squared, p.value) %>%
mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 which return p-values (at least for mixed models, with which I already worked with).
A word of warning towards using glance() for lme4 models should be spoken, because the maintainers of the broom package, will try a new concept where they outsource the summary statistics to the particular package developer responsible for the model.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
summarise_at(vars(disp:wt), funs(
r.sqr = summary(lm(mpg~.))$r.squared,
intercept = summary(lm(mpg~.))$coefficients[[1]],
slope = summary(lm(mpg~.))$coefficients[[2]],
p.value = summary(lm(mpg~.))$coefficients[[8]]
))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
summarise(
r.sqr = summary(lm(mpg~wt))$r.squared,
intercept = summary(lm(mpg~wt))$coefficients[[1]],
slope = summary(lm(mpg~wt))$coefficients[[2]],
p.value = summary(lm(mpg~wt))$coefficients[[8]]
)
This is actually running the regression 4 times(once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extract out the info you want. The problem is that extracting values this way can be a pain
df$model[[1]]$coefficients[[1]]
[1] 39.5712
While the code given by AndS will work, it will run lm function 4 times for each group which makes it a bit inefficient. You can use the following. I am trying to break it into simpler steps:
Assuming your data frame(df) has three variables: "Group", "Dep", "Indep":
#Getting the unique list of groups
groups <- unique(df$Group)
#Creating a model summary list to combine the model summary of each model
model_summaries = list()
#Running the models
for(i in 1:length(groups)){
model <- lm(Dep ~ Indep, df[df$Group == Groups[i], c("Dep", "Indep")])
model_summaries[i] <- summary(model)
}
In each model summary you have following elements RSQ, coefficients(contains p-values and intercept too)
Let me know if this helps.

Resources