step_pca() arguments are not being applied - r

I'm new to tidymodels but apparently the step_pca() arguments such as nom_comp or threshold are not being implemented when being trained. as in example below, I'm still getting 4 component despite setting nom_comp = 2.
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>% tidy(number = 2, type = "coef") %>%
pivot_wider(names_from = component, values_from = value, id_cols = terms)
#> # A tibble: 4 x 5
#> terms PC1 PC2 PC3 PC4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Murder -0.536 0.418 -0.341 0.649
#> 2 Assault -0.583 0.188 -0.268 -0.743
#> 3 UrbanPop -0.278 -0.873 -0.378 0.134
#> 4 Rape -0.543 -0.167 0.818 0.0890

The full PCA is determined (so you can still compute the variances of each term) and num_comp only specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))
prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#> terms value component id
#> <chr> <dbl> <chr> <chr>
#> 1 Murder -0.536 PC1 pca_AoFOm
#> 2 Assault -0.583 PC1 pca_AoFOm
#> 3 UrbanPop -0.278 PC1 pca_AoFOm
#> 4 Rape -0.543 PC1 pca_AoFOm
#> 5 Murder 0.418 PC2 pca_AoFOm
#> 6 Assault 0.188 PC2 pca_AoFOm
#> 7 UrbanPop -0.873 PC2 pca_AoFOm
#> 8 Rape -0.167 PC2 pca_AoFOm
Created on 2022-01-12 by the reprex package (v2.0.1)
You could also control this via the tol argument from stats::prcomp(), also passed in as an option.

If you bake the recipe it seems to work as intended but I don't know what you aim to achieve afterward.
library(tidyverse)
library(tidymodels)
USArrests <- USArrests %>%
rownames_to_column("Countries")
rec <-
recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>%
bake(new_data = NULL)
#> # A tibble: 50 x 3
#> Countries PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 Alabama -0.976 1.12
#> 2 Alaska -1.93 1.06
#> 3 Arizona -1.75 -0.738
#> 4 Arkansas 0.140 1.11
#> 5 California -2.50 -1.53
#> 6 Colorado -1.50 -0.978
#> 7 Connecticut 1.34 -1.08
#> 8 Delaware -0.0472 -0.322
#> 9 Florida -2.98 0.0388
#> 10 Georgia -1.62 1.27
#> # ... with 40 more rows
Created on 2022-01-11 by the reprex package (v2.0.1)

Related

What is the reason for the error reported for the ETS and ARIMA accuracy checks?

The code is as follows.
library(fable)
library(tsibble)
library(dplyr)
tourism_melb <- tourism %>%
filter(Region == "Melbourne")
tourism_melb %>%
group_by(Purpose) %>%
slice(1)
tourism_melb %>%
autoplot(Trips)
fit <- tourism_melb %>%
model(
ets = ETS(Trips ~ trend("A")),
arima = ARIMA(Trips)
)
fit %>%
accuracy() %>%
arrange(MASE)
Error in accuracy.default(.) :
No accuracy method found for an object of class mdl_dfNo accuracy method found for an object of class tbl_dfNo accuracy method found for an object of class tblNo accuracy method found for an object of class data.frame
What is the reason for the error in the last step?
This might be some configuration issue at your computer. I get some output instead of errors when I run your code.I am pasting the output from my console below, when I run your code.
> library(fable)
Loading required package: fabletools
> library(tsibble)
Attaching package: ‘tsibble’
The following objects are masked from ‘package:base’:
intersect, setdiff, union
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
>
> tourism_melb <- tourism %>%
+ filter(Region == "Melbourne")
> tourism_melb %>%
+ group_by(Purpose) %>%
+ slice(1)
# A tsibble: 4 x 5 [1Q]
# Key: Region, State, Purpose [4]
# Groups: Purpose [4]
Quarter Region State Purpose Trips
<qtr> <chr> <chr> <chr> <dbl>
1 1998 Q1 Melbourne Victoria Business 405.
2 1998 Q1 Melbourne Victoria Holiday 428.
3 1998 Q1 Melbourne Victoria Other 79.9
4 1998 Q1 Melbourne Victoria Visiting 666.
>
> tourism_melb %>%
+ autoplot(Trips)
>
> fit <- tourism_melb %>%
+ model(
+ ets = ETS(Trips ~ trend("A")),
+ arima = ARIMA(Trips)
+ )
>
> fit %>%
+ accuracy() %>%
+ arrange(MASE)
# A tibble: 8 × 13
Region State Purpose .model .type ME RMSE MAE
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Melbourne Victo… Holiday ets Trai… 4.67 50.5 37.2
2 Melbourne Victo… Busine… ets Trai… 3.31 56.4 42.9
3 Melbourne Victo… Busine… arima Trai… 2.54 58.2 46.0
4 Melbourne Victo… Holiday arima Trai… -4.64 54.3 41.4
5 Melbourne Victo… Other arima Trai… -0.344 21.7 17.0
6 Melbourne Victo… Other ets Trai… -0.142 21.7 17.0
7 Melbourne Victo… Visiti… ets Trai… 8.17 60.9 51.4
8 Melbourne Victo… Visiti… arima Trai… 6.89 63.1 51.7
# … with 5 more variables: MPE <dbl>, MAPE <dbl>,
# MASE <dbl>, RMSSE <dbl>, ACF1 <dbl>
# ℹ Use `colnames()` to see all variable names

How to forecast with lagged external regressors using fable::VAR

I'd like to use lagged external regressors in my VAR forecast. Using the VAR() function from the fable package, I am able to fit a model, but I can't use it to forecast, as I return NAs for the dependent variables. My reprex follows examples from Forecasting: Principles and Practice v3.
Thanks in advance for any guidance.
require(fpp3)
#> Loading required package: fpp3
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✔ tibble 3.1.7 ✔ tsibble 1.0.1
#> ✔ dplyr 1.0.9 ✔ tsibbledata 0.3.0
#> ✔ tidyr 1.1.3 ✔ feasts 0.2.2
#> ✔ lubridate 1.7.10 ✔ fable 0.3.1
#> ✔ ggplot2 3.3.5
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
us_change <- fpp3::us_change
fit <- us_change %>%
model(
xregs_lag1 = VAR(vars(Consumption, Income) ~ xreg(Unemployment, lag(Unemployment, 1)))
)
fit
#> # A mable: 1 x 1
#> xregs_lag1
#> <model>
#> 1 <VAR(5) w/ mean>
new_data_ex <- new_data(us_change, 4) %>%
mutate(Unemployment = mean(us_change$Unemployment))
#############
# Here I tried creating a new_data frame that included one lag of Unemployment, and pass that to the new_data argument of forecast, but it doesn't work either
#
# new_data_ex_lags <- us_change %>%
# tail(1) %>%
# bind_rows(new_data_ex) %>%
# select(colnames(new_data_ex))
#############
fit %>%
select(xregs_lag1) %>%
forecast(new_data = new_data_ex)
#> # A fable: 4 x 6 [1Q]
#> # Key: .model [1]
#> .model Quarter .distribution .mean_Consumption .mean_Income Unemployment
#> <chr> <qtr> <dist> <dbl> <dbl> <dbl>
#> 1 xregs_lag1 2019 Q3 MVN[2] NA NA 0.00101
#> 2 xregs_lag1 2019 Q4 MVN[2] NA NA 0.00101
#> 3 xregs_lag1 2020 Q1 MVN[2] NA NA 0.00101
#> 4 xregs_lag1 2020 Q2 MVN[2] NA NA 0.00101
fit %>%
select(xregs_lag1) %>%
report()
#> Series: Consumption, Income
#> Model: VAR(5) w/ mean
#>
#> Coefficients for Consumption:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.1156 0.1062 0.1479 0.0079
#> s.e. 0.0772 0.0483 0.0753 0.0509
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.2248 -0.0207 -0.0729 -0.0544
#> s.e. 0.0730 0.0499 0.0746 0.0500
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.0217 0.0327 0.3923 -0.8602
#> s.e. 0.0708 0.0491 0.0923 0.1331
#> lag(Unemployment, 1)
#> 0.4563
#> s.e. 0.1402
#>
#> Coefficients for Income:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.3715 -0.2991 0.0836 -0.0410
#> s.e. 0.1212 0.0758 0.1182 0.0799
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.4531 -0.1445 0.2481 -0.2475
#> s.e. 0.1145 0.0783 0.1170 0.0785
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.1270 -0.1878 0.6142 -0.1100
#> s.e. 0.1111 0.0771 0.1449 0.2089
#> lag(Unemployment, 1)
#> -0.0401
#> s.e. 0.2201
#>
#> Residual covariance matrix:
#> Consumption Income
#> Consumption 0.2602 0.1341
#> Income 0.1341 0.6410
#>
#> log likelihood = -350.43
#> AIC = 760.86 AICc = 772.34 BIC = 858.74
Created on 2022-07-22 by the reprex package (v2.0.0)
Using lag() with VAR() models was not fully implemented, but I have added support for this in the development version of the fable package (https://github.com/tidyverts/fable/commit/bb15c9462b80850565aee13d8f9b33e49dfd0f33).
There are some other changes not yet pushed to CRAN such as how forecast means are represented in the fable, but the code is otherwise the same.
require(fpp3)
#> Loading required package: fpp3
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✔ tibble 3.1.7 ✔ tsibble 1.1.1
#> ✔ dplyr 1.0.9 ✔ tsibbledata 0.4.0
#> ✔ tidyr 1.2.0 ✔ feasts 0.2.2
#> ✔ lubridate 1.8.0 ✔ fable 0.3.1.9000
#> ✔ ggplot2 3.3.6
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
us_change <- fpp3::us_change
fit <- us_change %>%
model(
xregs_lag1 = VAR(vars(Consumption, Income) ~ xreg(Unemployment, lag(Unemployment, 1)))
)
fit
#> # A mable: 1 x 1
#> xregs_lag1
#> <model>
#> 1 <VAR(5) w/ mean>
new_data_ex <- new_data(us_change, 4) %>%
mutate(Unemployment = mean(us_change$Unemployment))
#############
# Here I tried creating a new_data frame that included one lag of Unemployment, and pass that to the new_data argument of forecast, but it doesn't work either
#
# new_data_ex_lags <- us_change %>%
# tail(1) %>%
# bind_rows(new_data_ex) %>%
# select(colnames(new_data_ex))
#############
fit %>%
select(xregs_lag1) %>%
forecast(new_data = new_data_ex)
#> Warning in if (is_transformed) {: the condition has length > 1 and only the
#> first element will be used
#> # A fable: 4 x 5 [1Q]
#> # Key: .model [1]
#> .model Quarter .distribution .mean[,"Consumption… [,"Income"] Unemployment
#> <chr> <qtr> <dist> <dbl> <dbl> <dbl>
#> 1 xregs_lag1 2019 Q3 MVN[2] 0.548 0.657 0.00101
#> 2 xregs_lag1 2019 Q4 MVN[2] 0.679 0.316 0.00101
#> 3 xregs_lag1 2020 Q1 MVN[2] 0.763 0.832 0.00101
#> 4 xregs_lag1 2020 Q2 MVN[2] 0.697 0.733 0.00101
fit %>%
select(xregs_lag1) %>%
report()
#> Series: Consumption, Income
#> Model: VAR(5) w/ mean
#>
#> Coefficients for Consumption:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.1156 0.1062 0.1479 0.0079
#> s.e. 0.0772 0.0483 0.0753 0.0509
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.2248 -0.0207 -0.0729 -0.0544
#> s.e. 0.0730 0.0499 0.0746 0.0500
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.0217 0.0327 0.3923 -0.8602
#> s.e. 0.0708 0.0491 0.0923 0.1331
#> lag(Unemployment, 1)
#> 0.4563
#> s.e. 0.1402
#>
#> Coefficients for Income:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.3715 -0.2991 0.0836 -0.0410
#> s.e. 0.1212 0.0758 0.1182 0.0799
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.4531 -0.1445 0.2481 -0.2475
#> s.e. 0.1145 0.0783 0.1170 0.0785
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.1270 -0.1878 0.6142 -0.1100
#> s.e. 0.1111 0.0771 0.1449 0.2089
#> lag(Unemployment, 1)
#> -0.0401
#> s.e. 0.2201
#>
#> Residual covariance matrix:
#> Consumption Income
#> Consumption 0.2602 0.1341
#> Income 0.1341 0.6410
#>
#> log likelihood = -350.43
#> AIC = 760.86 AICc = 772.34 BIC = 858.74
Created on 2022-07-23 by the reprex package (v2.0.1)

R: using a lmer model in fit_resamples() fails with "Error: Assigned data `factor(lvl[1], levels = lvl)` must be compatible with existing data."

I am trying to use the tidymodels package to build a linear mixed model. It looks like I'm specifying the formula in the correct way, as I can run a fit() on the workflow.
However, when I try to run it on resamples of my data, using the function fit_resamples(), I get an error relative to missing factor levels.
It is not clear to me if I'm doing something wrong, or if the packages "multilevelmod" and "tune" are not compatible in this way, and so I would greatly appreciate any advice.
I have included a reprex using the "mpg" dataset.
EDIT: After looking more into the problem, and especially at this question: Prediction with lme4 on new levels I found out how to make lmer models predict on new value combinations, by using the allow.new.levels = TRUE argument in the predict() function.
My question is then, how can I specify this inside workflow() or fit_resamples()?
I guessed that adding step_novel() in the recipe and default_recipe_blueprint(allow_novel_levels = TRUE) in the add_recipe() call would be sufficient (as in tidymodels Novel levels found in column), but it still doesn't seem to work.
library(tidyverse)
library(tidymodels)
library(multilevelmod)
set.seed(1243)
data(mpg, package = "ggplot2")
training = mpg %>%
initial_split() %>%
training() %>%
mutate(manufacturer = manufacturer %>% as_factor(),
model = model %>% as_factor())
training_folds = training %>%
validation_split()
lmm_model = linear_reg() %>%
set_engine("lmer")
lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
step_novel(manufacturer, model)
lmm_formula = cty ~ year + (1|manufacturer/model)
lmm_workflow = workflow() %>%
# see: https://stackoverflow.com/questions/68183077/tidymodels-novel-levels-found-in-column
add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
add_model(lmm_model, formula = lmm_formula)
# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#>
#> * step_novel()
#>
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#> Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#> Groups Name Std.Dev.
#> model:manufacturer (Intercept) 2.881
#> manufacturer (Intercept) 2.675
#> Residual 2.181
#> Number of obs: 175, groups: model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept) year
#> -8.06202 0.01228
# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)
#> # A tibble: 1 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [131/44]> validation <NULL> <tibble [1 x 3]>
#>
#> There were issues with some computations:
#>
#> - Error(s) x1: ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#>
#> Use `collect_notes(object)` for more information.
# It seems that the problem is that the combinations of factor levels differ
# between analysis and assessment set
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])
identical(
analysis_set %>% distinct(manufacturer, model),
assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE
# directly fitting the model on the analysis set
analysis_fit = lmer(formula = lmm_formula, data = analysis_set)
# predicting the values for the missing combinations of levels is actually possible
# see: https://stackoverflow.com/questions/29259750/prediction-with-lme4-on-new-levels
assessment_predict = analysis_fit %>%
predict(training_folds$splits[[1]] %>% assessment(),
allow.new.levels = TRUE)
Created on 2022-05-24 by the reprex package (v2.0.1)
I believe the issue is that you are ending up with different factor levels in the training vs. testing sets (which are called analysis and assessment for resamples, in tidymodels):
library(tidymodels)
data(mpg, package = "ggplot2")
training <- mpg %>%
initial_split() %>%
training()
training_folds <- training %>%
validation_split()
training %>% count(manufacturer, model)
#> # A tibble: 38 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 5
#> 2 audi a4 quattro 5
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 4
#> 6 chevrolet k1500 tahoe 4wd 3
#> 7 chevrolet malibu 5
#> 8 dodge caravan 2wd 6
#> 9 dodge dakota pickup 4wd 8
#> 10 dodge durango 4wd 5
#> # … with 28 more rows
analysis(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 37 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 1
#> 2 audi a4 quattro 4
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 3
#> 6 chevrolet k1500 tahoe 4wd 2
#> 7 chevrolet malibu 4
#> 8 dodge caravan 2wd 5
#> 9 dodge dakota pickup 4wd 5
#> 10 dodge durango 4wd 4
#> # … with 27 more rows
assessment(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 28 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 4
#> 2 audi a4 quattro 1
#> 3 chevrolet corvette 1
#> 4 chevrolet k1500 tahoe 4wd 1
#> 5 chevrolet malibu 1
#> 6 dodge caravan 2wd 1
#> 7 dodge dakota pickup 4wd 3
#> 8 dodge durango 4wd 1
#> 9 dodge ram 1500 pickup 4wd 1
#> 10 ford expedition 2wd 3
#> # … with 18 more rows
Created on 2022-05-23 by the reprex package (v2.0.1)
When you fit the model one time with all the training data, you have 38 combinations of manufacturer and model. When you fit using your validation split, you have 37 combinations in the training/analysis set, and 28 combinations in the testing/assessment set. Some of these don't overlap, so the model can't predict for the observation that is in assessment but analysis.

how can I make a new data frame where the columns are the unique values with corresponding observations from an old data frame? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
My data frame has different dates as rows. Every unique date occurs appr. 500 times. I want to make a new data frame where every column is a unique date and where the rows are all the observations of that date from my old dataset. So for every column dat represents a certain date, I should have appr. 500 rows that each represent a rel_spread from that day.
You can use pivot_wider from tidyr:
library(tidyr)
pivot_wider(df, names_from = date, values_from = rel_spread, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 17
#> `20000103` `20000104` `20000105` `20000106` `20000107` `20000108` `20000109`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.0234 -0.0128 0.00729 0.0408 -0.0298 0.0398 0.0445
#> 2 0.0492 -0.0120 0.0277 0.0435 -0.0288 0.0152 -0.0374
#> # ... with 10 more variables: `20000110` <dbl>, `20000111` <dbl>,
#> # `20000112` <dbl>, `20000113` <dbl>, `20000114` <dbl>, `20000115` <dbl>,
#> # `20000116` <dbl>, `20000117` <dbl>, `20000118` <dbl>, `20000119` <dbl>
Note that we don't have your data (and I wasn't about to transcribe a picture of your data), but I created a little reproducible data set which should match the structure of your data set, except it only has two values per date for demo purposes:
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
rel_spread = runif(34, -0.05, 0.05))
df
#> date rel_spread
#> 1 20000103 -0.0234491337
#> 2 20000104 -0.0127876100
#> 3 20000105 0.0072853363
#> 4 20000106 0.0408207790
#> 5 20000107 -0.0298318069
#> 6 20000108 0.0398389685
#> 7 20000109 0.0444675269
#> 8 20000110 0.0160797792
#> 9 20000111 0.0129114044
#> 10 20000112 -0.0438213730
#> 11 20000113 -0.0294025425
#> 12 20000114 -0.0323443247
#> 13 20000115 0.0187022847
#> 14 20000116 -0.0115896282
#> 15 20000117 0.0269841420
#> 16 20000118 -0.0002300758
#> 17 20000119 0.0217618508
#> 18 20000103 0.0491906095
#> 19 20000104 -0.0119964821
#> 20 20000105 0.0277445221
#> 21 20000106 0.0434705231
#> 22 20000107 -0.0287857479
#> 23 20000108 0.0151673766
#> 24 20000109 -0.0374444904
#> 25 20000110 -0.0232779331
#> 26 20000111 -0.0113885907
#> 27 20000112 -0.0486609667
#> 28 20000113 -0.0117612043
#> 29 20000114 0.0369690846
#> 30 20000115 -0.0159651003
#> 31 20000116 -0.0017919885
#> 32 20000117 0.0099565825
#> 33 20000118 -0.0006458693
#> 34 20000119 -0.0313782399
Allan’s answer is perfect if you have the same number of rows for each date. If this isn’t the case, the following should work:
library(tidyr)
library(dplyr)
data_wide <- data_long %>%
group_by(date) %>%
mutate(daterow = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = date, values_from = rel_spread) %>%
select(!daterow)
data_wide
Output:
# A tibble: 6 x 4
`20000103` `20000104` `20000105` `20000106`
<dbl> <dbl> <dbl> <dbl>
1 -0.626 0.184 -0.836 -0.621
2 1.60 0.330 -0.820 -2.21
3 0.487 0.738 0.576 1.12
4 -0.305 1.51 0.390 -0.0449
5 NA NA NA -0.0162
6 NA NA NA 0.944
Example data:
set.seed(1)
data_long <- data.frame(
date = c(rep(20000103:20000105, 4), rep(20000106, 6)),
rel_spread = rnorm(18)
)

Time series forecasting in R; plotting "events" and generating new forecasting plots with specified date range after initial forecast

I have created a function which allows me to carry out time series forecasting using the fable package. The idea of the function was to analyse observed vs predicted values after a particular date/event. Here is a mock data frame which generates a column of dates:-
set.seed(1)
df <- data.frame(Date = sort(sample(seq(as.Date('2018/01/01'), as.Date('2020/09/17'), by="day"),1368883, replace = T)))
And here is the function I created. You specify the data, then the date of the event, then the forecast period in days and lastly a title for your graph.
event_analysis<-function(data,eventdate,period,title){
require(dplyr)
require(tsibble)
require(fable)
require(fabletools)
require(imputeTS)
require(ggplot2)
data_count<-data%>%
group_by(Date)%>%
summarise(Count=n())
data_count<-as_tsibble(data_count)
data_count<-na_mean(data_count)
train <- data_count %>%
#sample_frac(0.8)
filter(Date<=as.Date(eventdate))
fit <- train %>%
model(
ets = ETS(Count),
arima = ARIMA(Count),
snaive = SNAIVE(Count)
) %>%
mutate(mixed = (ets + arima + snaive) / 3)
fc <- fit %>% forecast(h = period)
forecastplot<-fc %>%
autoplot(data_count, level = NULL)+ggtitle(title)+
geom_vline(xintercept = as.Date(eventdate),linetype="dashed",color="red")+
labs(caption = "Red dashed line = Event occurrence")
fc_accuracy<-accuracy(fc,data_count)
#obs<-data_count
#colnames(obs)[2]<-"Observed"
#obs_pred<-merge(data_count,fc_accuracy, by="Date")
return(list(forecastplot,fc_accuracy,fc))
}
And in one run, I specify the df, the date of the event, the number of days that I want to forecast (3 weeks), then the title:-
event_analysis(df, "2020-01-01",21,"Event forecast")
Which will print this outcome and plot:-
I concede that the mock data I made isn't totally ideal but the function works well on my real-world data.
Here is what I want to achieve. I would like this output that has been made to come out of the function, but in addition, I would like an additional graph which "zooms in" on the period that has been forecasted, for 2 reasons:-
for ease of interpretation
I want to be able to see the N number of days before and N number of days after the event date (N representing the forecast period i.e. 21).
So, an additional graph (along with the original full forecast) that would look like this, perhaps in the one output, "multiplot" style:-
The other thing would be to print another output which shows the observed values in the test set against the predicted values from the models used in the forecasting.
These are basically the two additional things I want to add to my function but I am not sure how to go about this. Any help is massively appreciated :) .
I suppose you could rewrite it this way. I made a couple of adjustments to help you out.
set.seed(1)
df <- data.frame(Date = sort(sample(seq(as.Date('2018/01/01'), as.Date('2020/09/17'), by="day"),1368883, replace = T)))
event_analysis <- function(data, eventdate, period, title){
# in the future, you may think to move them out
library(dplyr)
library(tsibble)
library(fable)
library(fabletools)
library(imputeTS)
library(ggplot2)
# convert at the beginning
eventdate <- as.Date(eventdate)
# more compact sintax
data_count <- count(data, Date, name = "Count")
# better specify the date variable to avoid the message
data_count <- as_tsibble(data_count, index = Date)
# you need to complete missing dates, just in case
data_count <- tsibble::fill_gaps(data_count)
data_count <- na_mean(data_count)
train <- data_count %>%
filter(Date <= eventdate)
test <- data_count %>%
filter(Date > eventdate, Date <= (eventdate+period))
fit <- train %>%
model(
ets = ETS(Count),
arima = ARIMA(Count),
snaive = SNAIVE(Count)
) %>%
mutate(mixed = (ets + arima + snaive) / 3)
fc <- fit %>% forecast(h = period)
# your plot
forecastplot <- fc %>%
autoplot(data_count, level = NULL) +
ggtitle(title) +
geom_vline(xintercept = as.Date(eventdate), linetype = "dashed", color = "red") +
labs(caption = "Red dashed line = Event occurrence")
# plot just forecast and test
zoomfcstplot <- autoplot(fc) + autolayer(test, .vars = Count)
fc_accuracy <- accuracy(fc,data_count)
### EDIT: ###
# results vs test
res <- fc %>%
as_tibble() %>%
select(-Count) %>%
tidyr::pivot_wider(names_from = .model, values_from = .mean) %>%
inner_join(test, by = "Date")
##############
return(list(forecastplot = forecastplot,
zoomplot = zoomfcstplot,
accuracy = fc_accuracy,
forecast = fc,
results = res))
}
event_analysis(df,
eventdate = "2020-01-01",
period = 21,
title = "Event forecast")
Output:
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Carico il pacchetto richiesto: fabletools
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
#> $forecastplot
#>
#> $zoomplot
#>
#> $accuracy
#> # A tibble: 4 x 9
#> .model .type ME RMSE MAE MPE MAPE MASE ACF1
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 arima Test -16.8 41.8 35.2 -1.31 2.61 0.791 0.164
#> 2 ets Test -16.8 41.8 35.2 -1.31 2.61 0.791 0.164
#> 3 mixed Test -21.9 44.7 36.8 -1.68 2.73 0.825 -0.0682
#> 4 snaive Test -32.1 57.3 46.6 -2.43 3.45 1.05 -0.377
#>
#> $forecast
#> # A fable: 84 x 4 [1D]
#> # Key: .model [4]
#> .model Date Count .mean
#> <chr> <date> <dist> <dbl>
#> 1 ets 2020-01-02 N(1383, 1505) 1383.
#> 2 ets 2020-01-03 N(1383, 1505) 1383.
#> 3 ets 2020-01-04 N(1383, 1505) 1383.
#> 4 ets 2020-01-05 N(1383, 1505) 1383.
#> 5 ets 2020-01-06 N(1383, 1505) 1383.
#> 6 ets 2020-01-07 N(1383, 1505) 1383.
#> 7 ets 2020-01-08 N(1383, 1505) 1383.
#> 8 ets 2020-01-09 N(1383, 1505) 1383.
#> 9 ets 2020-01-10 N(1383, 1505) 1383.
#> 10 ets 2020-01-11 N(1383, 1505) 1383.
#> # ... with 74 more rows
#>
#> $results
#> # A tibble: 21 x 6
#> Date ets arima snaive mixed Count
#> <date> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 2020-01-02 1383. 1383. 1386 1384. 1350
#> 2 2020-01-03 1383. 1383. 1366 1377. 1398
#> 3 2020-01-04 1383. 1383. 1426 1397. 1357
#> 4 2020-01-05 1383. 1383. 1398 1388. 1415
#> 5 2020-01-06 1383. 1383. 1431 1399. 1399
#> 6 2020-01-07 1383. 1383. 1431 1399. 1346
#> 7 2020-01-08 1383. 1383. 1350 1372. 1299
#> 8 2020-01-09 1383. 1383. 1386 1384. 1303
#> 9 2020-01-10 1383. 1383. 1366 1377. 1365
#> 10 2020-01-11 1383. 1383. 1426 1397. 1328
#> # ... with 11 more rows

Resources