I'm trying to view the out-of-sample performance scores after running fable.prophet. Note that the forecast is grouped by type and looks 5 observations ahead.
Here is the code:
library(tibble)
library(tsibble)
library(fable.prophet)
lax_passengers <- read.csv("https://raw.githubusercontent.com/mitchelloharawild/fable.prophet/master/data-raw/lax_passengers.csv")
library(dplyr)
library(lubridate)
lax_passengers <- lax_passengers %>%
mutate(datetime = mdy_hms(ReportPeriod)) %>%
group_by(month = yearmonth(datetime), type = Domestic_International) %>%
summarise(passengers = sum(Passenger_Count)) %>%
ungroup()
lax_passengers <- as_tsibble(lax_passengers, index = month, key = type)
fit <- lax_passengers %>%
model(
mdl = prophet(passengers ~ growth("linear") + season("year", type = "multiplicative")),
)
fit
test_tr <- lax_passengers %>%
slice(1:(n()-5)) %>%
stretch_tsibble(.init = 12, .step = 1)
fc <- test_tr %>%
model(
mdl = prophet(passengers ~ growth("linear") + season("year", type = "multiplicative")),
) %>%
forecast(h = 5)
fc %>% accuracy(lax_passengers)
When I run fc %>% accuracy(lax_passengers), I get the following warning:
Warning message:
The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
5 observations are missing between 2019 Apr and 2019 Aug
How do I make the future dataset complete? I believe the performance score isn't accurate because of the 5 missing observations.
It seems that when I stretch the tsibble, the slice is not applied correctly: it doesn't remove the last 5 observations from each type.
The slice() function removes rows from the entire dataset, so it is only removing the last 5 rows from your last key (type=="International"). To remove the last 5 rows from all keys, you'll need to group by keys and slice.
test_tr <- lax_passengers %>%
group_by_key() %>%
slice(1:(n()-5)) %>%
ungroup() %>%
stretch_tsibble(.init = 12, .step = 1)
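To see why the ungrouped slice() only trims the final key, here is a minimal sketch on toy data. It assumes plain dplyr, with group_by(type) standing in for tsibble's group_by_key(); the toy tibble and its values are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical): two keys with 10 observations each
toy <- tibble(
  type  = rep(c("Domestic", "International"), each = 10),
  month = rep(1:10, times = 2)
)

# Ungrouped: n() is the total row count, so only the tail of the last key is cut
ungrouped <- toy %>% slice(1:(n() - 5))

# Grouped: n() is evaluated per key, so every key loses its last 5 rows
grouped <- toy %>%
  group_by(type) %>%
  slice(1:(n() - 5)) %>%
  ungroup()

count(ungrouped, type)  # Domestic keeps 10 rows, International keeps 5
count(grouped, type)    # both keys keep 5 rows
```

With the grouped version, the last 5 months of every key stay available as out-of-sample data for accuracy().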
Related
Reminiscent of this question: Forecasting Time Series Groups with tslm() & tidyverse
except I want to use Matt Dancho’s code at https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html
How can I use forecast::tslm() to produce grouped time series predictions from a nested data frame? The solution Rob Hyndman kindly provided used the tsibble and fable packages without nesting.
The map() of tslm() at the bottom of this code generates:
Caused by error in `formula.default()`:
! invalid formula
starter_time <- Sys.time()
library(dplyr)
library(timetk)
library(tidyr)
library(purrr)
library(lubridate)
library(forecast)
library(broom)
library(sweep)
library(zoo)
monthly_qty_by_cat2 <- bike_sales %>%
mutate(order.month = as_date(as.yearmon(order.date))) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity))
monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
nest()
monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
mutate(data.ts = map(.x = data,
.f = tk_ts,
select = -order.month,
start = 2011,
freq = 12))
## invalid formula ERROR
monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
mutate(fit.ets = map(data.ts, tslm, total.qty ~ trend))
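One likely cause, sketched on toy data rather than bike_sales: map() passes each element as the first argument of .f, and the first argument of tslm() is the formula, so each ts object ends up where the formula is expected. Wrapping the call in a function keeps the formula first. The series below are made up, and the sketch has not been run against the nested data above.

```r
library(purrr)
library(forecast)

set.seed(42)
# Hypothetical stand-in for the data.ts list column: two monthly series
ts_list <- list(
  a = ts(1:24 + rnorm(24), start = 2011, frequency = 12),
  b = ts(50 - 1:24 + rnorm(24), start = 2011, frequency = 12)
)

# Fails: each series lands in tslm()'s `formula` argument
# fits <- map(ts_list, tslm, y ~ trend)

# Works: the formula stays first and the series is referenced by name
fits <- map(ts_list, function(y) tslm(y ~ trend))

map(fits, coef)
```

Applied to the nested data, the same idea would be to replace the bare tslm in the map() call with a small wrapper function of this kind.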
I want to make a rolling mean on the last X number of days. rollmean() does that using rows. Since I am using loggers that sometimes fail, and also the data were cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great. Except my data need to be grouped (in the example by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
library(dplyr) # provides %>% and full_join()
# making some example data
# vector with days since the beginning of experiment
days <- 0:30
# random values
df1 <- tibble::tibble(
days_since_beginning = days,
value_to_used = rnorm(length(days)),
treatment = sample(letters[1],31,replace = TRUE) )
df2 <- tibble::tibble(
days_since_beginning = days,
value_to_used = rnorm(length(days)),
treatment = sample(letters[2],31,replace = TRUE) )
df <- full_join(df1, df2)
# how long should be the period for mean
time_period <- 10 # calculate for last 10 days
df_mean <- df %>% dplyr::mutate(
# calculate rolling mean
roll_mean = purrr::map_dbl(
.x = days_since_beginning,
.f = ~ df %>%
# select only data for the last `time_period`
dplyr::filter(days_since_beginning >= .x - time_period &
days_since_beginning <= .x) %>%
purrr::pluck("value_to_used") %>%
mean() %>%
return()
) )
This takes the mean over the last 10 days by treatment. The width argument includes a computation of how many rows back to use so that it corresponds to 10 days rather than 10 rows. This uses the fact that width can be a vector.
library(dplyr)
library(zoo)
df %>%
group_by(treatment) %>%
mutate(roll = rollapplyr(value_to_used,
seq_along(days_since_beginning) - findInterval(days_since_beginning - 10, days_since_beginning),
mean)) %>%
ungroup
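As a small illustration of how that width vector behaves when days are missing, the sketch below computes it in base R for a made-up, irregular vector of days (the values are assumptions chosen only to show gaps).

```r
# Hypothetical irregular sampling days (gaps where a logger failed)
days <- c(0, 1, 2, 5, 6, 12, 13, 20)

# findInterval(days - 10, days) counts the rows that are 10 or more days old,
# so subtracting it from the row position leaves the rows inside the window
width <- seq_along(days) - findInterval(days - 10, days)

data.frame(days, width)
# width is 1 2 3 4 5 3 4 3: at day 12 only days 5, 6 and 12 are used
```

Each width counts the rows whose day lies in the half-open interval (day - 10, day], so the window follows elapsed days rather than a fixed number of rows.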
The same colleague came up with his own solution:
df_mean <-
df %>%
dplyr::group_by(treatment) %>%
tidyr::nest() %>%
dplyr::mutate(
data_with_mean = purrr::map(
.x = data,
.f = ~ {
dataset <- .x
dataset %>%
dplyr::mutate(
# calculate rolling mean
roll_mean = purrr::map_dbl(
.x = days_since_beginning,
.f = ~ dataset %>%
# select only data for the last `time_period`
dplyr::filter(days_since_beginning >= .x - time_period &
days_since_beginning <= .x) %>%
purrr::pluck("value_to_used") %>%
mean() %>%
return()
)) %>%
return()
}
)) %>%
dplyr::select(-data) %>%
tidyr::unnest(data_with_mean) %>%
dplyr::ungroup()
I compared the results with G. Grothendieck's idea, and it only matches if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how the time_period is used, and I am confused about why it happens.
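A possible explanation is an off-by-one in the window boundaries: the filter() condition keeps days in the closed interval [x - time_period, x], which spans time_period + 1 days, while the findInterval() width only counts days in (x - time_period, x]. The base R sketch below checks this on made-up data; roll_right() is a hypothetical helper that emulates a right-aligned rolling mean with a per-row width and is not part of zoo.

```r
days <- 0:30
set.seed(1)
value <- rnorm(length(days))
time_period <- 10

# Closed window [d - time_period, d], as in the filter() approach
mean_closed <- sapply(days, function(d) {
  mean(value[days >= d - time_period & days <= d])
})

# Width vector of the findInterval() approach for a look-back of k days
width_for <- function(k) seq_along(days) - findInterval(days - k, days)

# Hypothetical helper: mean over the last w[i] rows ending at row i
roll_right <- function(x, w) {
  sapply(seq_along(x), function(i) mean(x[(i - w[i] + 1):i]))
}

mean_half_open <- roll_right(value, width_for(time_period))
mean_shifted   <- roll_right(value, width_for(time_period + 1))

max(width_for(time_period))                    # 10 rows, not 11
isTRUE(all.equal(mean_closed, mean_shifted))   # TRUE
```

Under this reading, either time_period + 1 in the width computation or a strict > in the filter() condition should make the two approaches agree.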
I have a panel dataset in which the time and group variables have already been converted to dummies. I want to reverse that transformation and get back simple id and time variables.
Let's create some comparable data:
library(plm)
library(tidyverse)
library(fastDummies)
data(EmplUK)
EmplUK %>%
select(-sector) %>%
dummy_cols(.data = .,select_columns = c("firm","year"),remove_selected_columns = TRUE,remove_first_dummy = TRUE) -> paneldata
head(paneldata)
So basically now all my dummy variables are firm_X and year_X and I would like to have a Year and Firm variable again.
This is slightly complicated by the fact that Firm 1 and the first year do not exist as dummies (they are not needed in a regression model).
I'm fine with this exact information being missing (I can simply infer that the first firm is Firm 1 and the first year is 1976, one less than the smallest year dummy).
Any ideas how to do this nicely? Ideally using tidyverse?
After some thinking, I figured it out and created a small function:
getfactorback <- function(data,
groupdummyprefix,
timedummyprefix,
grouplabel,
timelabel,
firstgroup,
firsttime) {
data %>%
# a row with no dummy set belongs to the dropped first level
mutate(newgroup = ifelse(rowSums(cur_data() %>% select(starts_with(groupdummyprefix)))==1,0,1),
newtime = ifelse(rowSums(cur_data() %>% select(starts_with(timedummyprefix)))==1,0,1)) %>%
rename(!!paste0(groupdummyprefix,firstgroup):=newgroup,
!!paste0(timedummyprefix,firsttime):=newtime) %>%
pivot_longer(cols = starts_with(groupdummyprefix),names_to = grouplabel,names_prefix = groupdummyprefix) %>%
filter(value == 1) %>%
select(-value) %>%
pivot_longer(cols = starts_with(timedummyprefix),names_to = timelabel,names_prefix = timedummyprefix) %>%
filter(value == 1) %>%
select(-value) %>%
mutate(across(.cols = c(all_of(grouplabel),all_of(timelabel)),factor)) %>%
relocate(all_of(c(grouplabel,timelabel))) -> output
return(output)
}
getfactorback(data = paneldata,
groupdummyprefix = "firm_",
grouplabel = "firm",
timedummyprefix = "year_",
timelabel = "year",
firstgroup = "1",
firsttime = 1976)
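The core idea of the function can be sketched on a tiny example: a row whose dummies are all zero must belong to the dropped baseline level, so that level is rebuilt first and the dummies are then pivoted back into one column. The toy tibble and its column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): firm_1 was dropped as the baseline dummy
dummies <- tibble(
  obs    = 1:4,
  firm_2 = c(0, 1, 0, 0),
  firm_3 = c(0, 0, 1, 0)
)

# Rows with no dummy set belong to the dropped baseline level
firm_cols <- startsWith(names(dummies), "firm_")
dummies$firm_1 <- as.numeric(rowSums(dummies[firm_cols]) == 0)

# Pivot the dummies to long form and keep only the level that is switched on
restored <- dummies %>%
  pivot_longer(starts_with("firm_"), names_to = "firm", names_prefix = "firm_") %>%
  filter(value == 1) %>%
  select(-value)

restored
```

Each observation ends up with exactly one firm label, with observations 1 and 4 assigned to the rebuilt baseline.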
I'm trying to forecast several time series at once with a regression with ARIMA errors, using grouped data.
I'm new to tidy data, so I'm basically reproducing this example (https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html) with a multivariate ts and a multivariate model.
Here is a reproducible example:
library(tidyverse); library(tidyquant)
library(timetk); library(sweep)
library(forecast)
library(tsibble)
library(fpp3)
# using package data
bike_sales
# grouping data
monthly_qty_by_cat2 <- bike_sales %>%
mutate(order.month = as_date(as.yearmon(order.date))) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity), price.m = mean(price))
# using nest
monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
nest()
monthly_qty_by_cat2_nest
# Forecasting Workflow
# Step 1: Coerce to a ts object class
monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
mutate(data.ts = map(.x = data,
.f = tk_ts,
select = -order.month, # take off date
start = 2011,
freq = 12))
# Step 2: modeling an ARIMA(y ~ x)
# make a function to map
modeloARIMA_reg <- function(y,x) {
result <- ARIMA(y ~ x)
return(list(result))}
# map the function
monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
mutate(fit.arima = map(data.ts, modeloARIMA_reg))
monthly_qty_by_cat2_fit
At this point I don't know whether map() is using the right variable for y (the dependent variable), but when I carry on and try the forecast, an error appears:
# Step 3: Forecasting the model
monthly_qty_by_cat2_fcast <- monthly_qty_by_cat2_fit %>%
mutate(fcast.arima = map(fit.arima, forecast))
# this gives me the following error
# Error: Problem with `mutate()` input `fcast.arima`.
# x non-numeric argument to binary operator
# i Input `fcast.arima` is `map(fit.arima, forecast)`.
# i The error occured in group 1: category.secondary = "Cross Country Race".
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# In mean.default(x, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
Two questions emerge:
I don't know how to supply the mean of the independent variable (x) for each group;
and I don't know how to pass this new data as an argument to forecast().
PS: The result doesn't need to be a tibble or nested; I just need the point forecast and the CI (total.qty, lo.95, hi.95).
Well, this code solved the problem for me.
It makes one forecast for each time series (grouped tsibble) and uses each series' own mean value as the future data in the forecast.
Any comments are welcome.
# MY FLOW
monthly_qty_by_cat2 <-
sweep::bike_sales %>%
mutate(order.month = yearmonth(order.date)) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity), price.m = mean(price)) %>%
as_tsibble(index=order.month, key=category.secondary) # coerce to a tsibble
# mean price per group, used as the future value of the regressor
futuro <- monthly_qty_by_cat2 %>%
as_tibble() %>%
group_by(category.secondary) %>%
summarise(price.m = mean(price.m))
futuro
# make values in the future
future_x <- new_data(monthly_qty_by_cat2, 3) %>%
left_join(futuro, by = "category.secondary")
future_x
# model and forecast
fc <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
model(ARIMA(total.qty ~ price.m)) %>%
forecast(new_data=future_x) %>%
hilo(level = 95) %>%
unpack_hilo("95%")
fc
# Tidy the forecast
fc_tibble <- fc %>% as_tibble() %>% select(-total.qty)
fc_tibble
# the end
I've written a routine that extracts information from lmer models to compute the ICC and get the LRT from lmerTest's ranova function. What I have below works but I suspect it could be improved by (a) combining the two functions into one and returning a list, but I can't seem to access the list elements with purrr's map function, and (b) using multiple mutate/purrr lines to get all the needed data in one place rather than having to join later. My code follows using the "Peet" dataset provided in Hox (2002) and available at the UCLA IDRE site:
library(foreign)
library(lme4)
library(tidyverse)
library(purrr)
#Peet family data described and used in Hox
peet.dat<-read.dta("https://stats.idre.ucla.edu/stat/stata/examples/mlm_ma_hox/peetmis.dta")
names(peet.dat)
#convert to long format
peet.long.dat <- peet.dat %>%
tidyr::gather(type, score, -family,-sex,-person) %>%
arrange(type)
names(peet.long.dat)
#need two functions, one for the MLM estimates and the other for
#ranova p-test for variance--merge later by type
aov_model <- function(df) {
# return the fitted model explicitly
lmerTest::lmer(score ~ 1 + (1|family), data = df)
}
aov_test <- function(df) {
lmr.model <- lmerTest::lmer(score ~ 1 + (1|family), data = df)
# return the likelihood ratio test of the random effect
lmerTest::ranova(lmr.model)
}
#get the model estimates
models <- peet.long.dat %>%
nest(data = -type) %>%
mutate(aov_obj = map(data, aov_model),
summaries = map(aov_obj, broom.mixed::tidy)) %>%
unnest(summaries) %>%
select(type, effect, estimate, term) %>%
filter(effect != "fixed") %>%
mutate(variance = estimate^2) %>%
select(-estimate, -effect) %>%
spread(term, variance) %>%
rename(group.var = `sd__(Intercept)`, residual = `sd__Observation`) %>%
mutate(ICC = group.var/(group.var+residual))
models
#get the ranova LRTs
tests <- peet.long.dat %>%
nest(data = -type) %>%
mutate(test_obj = map(data, aov_test),
test_summaries = map(test_obj, broom.mixed::tidy)) %>%
unnest(test_summaries) %>%
filter(!is.na(LRT))
#join estimates with LRT p values
models %>% left_join(tests[c("type","p.value")], by = "type")
Any help greatly appreciated.
I think the key here is to split() your data.frame based on the variable type:
# convert to list by type
peet.ls <- peet.dat %>%
tidyr::gather(type, score, -family,-sex,-person) %>%
split(.$type)
# map to fit models on subsets and return summaries
peet.ls %>%
map(function(df.x) {
# fit the model
lmr_model <- lmerTest::lmer(score~ 1 + (1|family), data = df.x)
#get the model estimates
mlm_est <- lmr_model %>%
broom.mixed::tidy() %>%
select(effect, estimate, term) %>%
filter(effect != "fixed") %>%
mutate(variance = estimate^2) %>%
select(-estimate, -effect) %>%
spread(term, variance) %>%
rename(group.var = `sd__(Intercept)`,
residual = `sd__Observation`) %>%
mutate(ICC = group.var/(group.var+residual))
# get the ranova LRTs & add to other estimates
mlm_est$p.value <- lmr_model %>%
lmerTest::ranova() %>%
broom.mixed::tidy() %>%
filter(!is.na(LRT)) %>%
pull(p.value)
# return summaries
mlm_est
}) %>%
# combine data.frames and add the variable 'type'
bind_rows(.id = "type") %>%
select(type, everything())
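On point (a) of the question, returning a list from a single function and then pulling out its elements, purrr's map functions accept an element name in place of a function. A minimal sketch follows; it uses lm() on made-up data as a stand-in for the lmer() and ranova() calls, and fit_and_summarise() is a hypothetical helper.

```r
library(purrr)

# Hypothetical helper: fit once and return several results in one list
fit_and_summarise <- function(df) {
  model <- lm(score ~ 1, data = df)
  list(
    model      = model,
    mean_score = unname(coef(model)[1]),
    n          = nrow(df)
  )
}

# Toy data (made up): two types with three scores each
toy <- data.frame(
  type  = rep(c("a", "b"), each = 3),
  score = c(1, 2, 3, 10, 20, 30)
)

results <- map(split(toy, toy$type), fit_and_summarise)

# A string in place of a function extracts the element with that name
means <- map_dbl(results, "mean_score")
sizes <- map_int(results, "n")
means
```

The same pattern would let one function return both the fitted model and the ranova() table, each extracted later by name.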