Hierarchical forecasting problem generating the hts object in R

I want to do hierarchical forecasting as described in chapter 10 of Hyndman's book Forecasting: Principles and Practice: https://otexts.com/fpp2/
My problem is that generating this type of forecast (specifically the bottom-up approach) requires an hts object, which is built from a matrix. For example:
If I have a data frame like this:
[Image: example of a data frame before conversion to an hts object]
I need to convert it to a matrix like this:
[Image: the matrix I need]
For this matrix, every row is a unit of time (it could be days, months, etc.).
My problem is that my data frame looks like this:
[Image: the problem with my dataframe]
One column is the date and the others are the categories for which I need to forecast sales. The problem is this: for supermarket = 4, id_product = 187, and id_label = "a", the system registers movements on days 21 and 23 but nothing on day 22, which means I need sales = 0 on that day, in other words a row like this:
[Image: the missing row]
How can I generate the matrix needed to create the hts object? Do I need to create the missing rows with 0? (I have thousands of missing rows, so it would be a nightmare to do it by hand)
Here is a reproducible example:
date=c("2019-03-22","2019-03-23","2019-04-24","2019-03-25")
id_supermarket=c(4,4,2,2)
id_product=c(187,187,189,190)
id_label=c("a","a","c","d")
sales=c(21,22,23,24)
df=as.data.frame(cbind(date,id_supermarket,id_product,id_label,sales))
Thanks in advance.

I recommend you use the fable package instead of hts. It is more recent and much easier to use. Here is an example with your data.
library(dplyr)   # for tibble() and %>%, in case they are not re-exported
library(tsibble)
library(fable)
# Create tsibble
df <- tibble(
  date = lubridate::ymd(c("2019-03-22", "2019-03-23", "2019-03-24", "2019-03-25")),
  id_supermarket = as.character(c(4, 4, 2, 2)),
  id_product = c(187, 187, 189, 190),
  id_label = c("a", "a", "c", "d"),
  sales = c(21, 22, 23, 24)
) %>%
  as_tsibble(index = date, key = c(id_supermarket, id_product, id_label)) %>%
  fill_gaps(.full = TRUE)
# Forecast with reconciliation
fc <- df %>%
  aggregate_key(id_supermarket * id_label, sales = sum(sales, na.rm = TRUE)) %>%
  model(
    arima = ARIMA(sales)
  ) %>%
  reconcile(
    arima = min_trace(arima)
  ) %>%
  forecast(h = "5 days")
fc
#> # A fable: 45 x 6 [1D]
#> # Key:     id_supermarket, id_label, .model [9]
#>    id_supermarket id_label .model date       sales .distribution
#>    <chr>          <chr>    <chr>  <date>     <dbl> <dist>
#>  1 2              c        arima  2019-03-26  5.82 N(5.8, 44)
#>  2 2              c        arima  2019-03-27  5.82 N(5.8, 44)
#>  3 2              c        arima  2019-03-28  5.82 N(5.8, 44)
#>  4 2              c        arima  2019-03-29  5.82 N(5.8, 44)
#>  5 2              c        arima  2019-03-30  5.82 N(5.8, 44)
#>  6 2              d        arima  2019-03-26  6.34 N(6.3, 46)
#>  7 2              d        arima  2019-03-27  6.34 N(6.3, 46)
#>  8 2              d        arima  2019-03-28  6.34 N(6.3, 46)
#>  9 2              d        arima  2019-03-29  6.34 N(6.3, 46)
#> 10 2              d        arima  2019-03-30  6.34 N(6.3, 46)
#> # … with 35 more rows
Created on 2020-02-01 by the reprex package (v0.3.0)
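If you do want to stay with the hts package, you don't need to create the missing zero rows by hand: tidyr can generate every missing date/key combination and fill it with 0, and the result can then be reshaped into the matrix hts expects. A minimal sketch, assuming the hierarchy runs supermarket > product > label and using the df from the reproducible example (sales_filled and sales_mat are illustrative names, not part of the question):
library(dplyr)
library(tidyr)
# Generate every date/key combination; the new rows get sales = 0
sales_filled <- df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(min(date), max(date), by = "day"),
           nesting(id_supermarket, id_product, id_label),
           fill = list(sales = 0)) %>%
  unite("series", id_supermarket, id_product, id_label)
# One row per day, one column per bottom-level series
sales_mat <- sales_filled %>%
  pivot_wider(names_from = series, values_from = sales) %>%
  arrange(date) %>%
  select(-date) %>%
  as.matrix()
# sales_mat can then be wrapped in stats::ts() and passed to hts::hts()
# to define the hierarchy (e.g. via its characters or nodes arguments)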

Related

Create date of "X" column, when I have age in days at "X" column and birth date column in R

I'm having some trouble finding out how to do a specific thing in R.
In my dataset, I have a column with the date of birth of participants. I also have a column giving me the age in days at which a disease was diagnosed.
What I want to do is to create a new column showing the date of diagnosis. I'm guessing it's a pretty easy thing to do since I have all the information needed, basically it's birth date + X number of days = Date of diagnosis, but I'm unable to figure out how to do it.
All of my searches give me information on the opposite, going from date to age. So if you're able to help me, it would be much appreciated!
library(tidyverse)
library(lubridate)
df <- tibble(
  birth = sample(seq(as.Date("1950-01-01"), today(), by = "day"), 10, replace = TRUE),
  age = sample(3650:15000, 10, replace = TRUE)
)
df %>%
  mutate(diagnosis_date = birth %m+% days(age))
#> # A tibble: 10 x 3
#>    birth        age diagnosis_date
#>    <date>     <int> <date>
#>  1 1955-01-16  6684 1973-05-05
#>  2 1958-11-03  6322 1976-02-24
#>  3 2007-02-23  4312 2018-12-14
#>  4 2002-07-11  8681 2026-04-17
#>  5 2021-12-28 11892 2054-07-20
#>  6 2017-07-31  3872 2028-03-07
#>  7 1995-06-30 14549 2035-04-30
#>  8 1955-09-02 12633 1990-04-04
#>  9 1958-10-10  4534 1971-03-10
#> 10 1980-12-05  6893 1999-10-20
Created on 2022-06-30 by the reprex package (v2.0.1)
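As a side note, Date objects support whole-day arithmetic directly, so a base R equivalent needs no lubridate at all:
# Adding an integer number of days to a Date shifts it by that many days
df$diagnosis_date <- df$birth + df$age
lubridate's %m+% matters mainly for month arithmetic, where end-of-month rollover needs special handling; for whole days, plain + gives the same result.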

Time series forecasting in R; plotting "events" and generating new forecasting plots with specified date range after initial forecast

I have created a function which allows me to carry out time series forecasting using the fable package. The idea of the function was to analyse observed vs predicted values after a particular date/event. Here is a mock data frame which generates a column of dates:-
set.seed(1)
df <- data.frame(Date = sort(sample(seq(as.Date('2018/01/01'), as.Date('2020/09/17'), by="day"),1368883, replace = T)))
And here is the function I created. You specify the data, then the date of the event, then the forecast period in days and lastly a title for your graph.
event_analysis <- function(data, eventdate, period, title){
  require(dplyr)
  require(tsibble)
  require(fable)
  require(fabletools)
  require(imputeTS)
  require(ggplot2)
  data_count <- data %>%
    group_by(Date) %>%
    summarise(Count = n())
  data_count <- as_tsibble(data_count)
  data_count <- na_mean(data_count)
  train <- data_count %>%
    #sample_frac(0.8)
    filter(Date <= as.Date(eventdate))
  fit <- train %>%
    model(
      ets = ETS(Count),
      arima = ARIMA(Count),
      snaive = SNAIVE(Count)
    ) %>%
    mutate(mixed = (ets + arima + snaive) / 3)
  fc <- fit %>% forecast(h = period)
  forecastplot <- fc %>%
    autoplot(data_count, level = NULL) + ggtitle(title) +
    geom_vline(xintercept = as.Date(eventdate), linetype = "dashed", color = "red") +
    labs(caption = "Red dashed line = Event occurrence")
  fc_accuracy <- accuracy(fc, data_count)
  #obs <- data_count
  #colnames(obs)[2] <- "Observed"
  #obs_pred <- merge(data_count, fc_accuracy, by = "Date")
  return(list(forecastplot, fc_accuracy, fc))
}
And in one run, I specify the df, the date of the event, the number of days that I want to forecast (3 weeks), then the title:-
event_analysis(df, "2020-01-01",21,"Event forecast")
This prints the following outcome and plot (image omitted):
I concede that the mock data I made isn't totally ideal but the function works well on my real-world data.
Here is what I want to achieve. I would like the function to keep producing this output, but in addition I would like a graph that "zooms in" on the forecasted period, for 2 reasons:
for ease of interpretation
I want to be able to see the N days before and N days after the event date (N representing the forecast period, i.e. 21).
So, an additional graph (along with the original full forecast) that would look like this, perhaps in the one output, "multiplot" style (image omitted):
The other thing would be to print another output which shows the observed values in the test set against the predicted values from the models used in the forecasting.
These are basically the two additional things I want to add to my function but I am not sure how to go about this. Any help is massively appreciated :) .
I suppose you could rewrite it this way. I made a couple of adjustments to help you out.
set.seed(1)
df <- data.frame(Date = sort(sample(seq(as.Date('2018/01/01'), as.Date('2020/09/17'), by = "day"), 1368883, replace = TRUE)))
event_analysis <- function(data, eventdate, period, title){
  # in the future, you may want to move these out of the function
  library(dplyr)
  library(tsibble)
  library(fable)
  library(fabletools)
  library(imputeTS)
  library(ggplot2)
  # convert at the beginning
  eventdate <- as.Date(eventdate)
  # more compact syntax
  data_count <- count(data, Date, name = "Count")
  # better to specify the date variable to avoid the message
  data_count <- as_tsibble(data_count, index = Date)
  # you need to complete missing dates, just in case
  data_count <- tsibble::fill_gaps(data_count)
  data_count <- na_mean(data_count)
  train <- data_count %>%
    filter(Date <= eventdate)
  test <- data_count %>%
    filter(Date > eventdate, Date <= (eventdate + period))
  fit <- train %>%
    model(
      ets = ETS(Count),
      arima = ARIMA(Count),
      snaive = SNAIVE(Count)
    ) %>%
    mutate(mixed = (ets + arima + snaive) / 3)
  fc <- fit %>% forecast(h = period)
  # your plot
  forecastplot <- fc %>%
    autoplot(data_count, level = NULL) +
    ggtitle(title) +
    geom_vline(xintercept = eventdate, linetype = "dashed", color = "red") +
    labs(caption = "Red dashed line = Event occurrence")
  # plot just the forecast and the test set
  zoomfcstplot <- autoplot(fc) + autolayer(test, .vars = Count)
  fc_accuracy <- accuracy(fc, data_count)
  # forecast results vs the test set
  res <- fc %>%
    as_tibble() %>%
    select(-Count) %>%
    tidyr::pivot_wider(names_from = .model, values_from = .mean) %>%
    inner_join(test, by = "Date")
  return(list(forecastplot = forecastplot,
              zoomplot = zoomfcstplot,
              accuracy = fc_accuracy,
              forecast = fc,
              results = res))
}
event_analysis(df,
               eventdate = "2020-01-01",
               period = 21,
               title = "Event forecast")
Output:
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#> Loading required package: fabletools
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo
#> $forecastplot
#>
#> $zoomplot
#>
#> $accuracy
#> # A tibble: 4 x 9
#>   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
#>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#> 1 arima  Test  -16.8  41.8  35.2 -1.31  2.61 0.791  0.164
#> 2 ets    Test  -16.8  41.8  35.2 -1.31  2.61 0.791  0.164
#> 3 mixed  Test  -21.9  44.7  36.8 -1.68  2.73 0.825 -0.0682
#> 4 snaive Test  -32.1  57.3  46.6 -2.43  3.45 1.05  -0.377
#>
#> $forecast
#> # A fable: 84 x 4 [1D]
#> # Key:     .model [4]
#>    .model Date       Count         .mean
#>    <chr>  <date>     <dist>        <dbl>
#>  1 ets    2020-01-02 N(1383, 1505) 1383.
#>  2 ets    2020-01-03 N(1383, 1505) 1383.
#>  3 ets    2020-01-04 N(1383, 1505) 1383.
#>  4 ets    2020-01-05 N(1383, 1505) 1383.
#>  5 ets    2020-01-06 N(1383, 1505) 1383.
#>  6 ets    2020-01-07 N(1383, 1505) 1383.
#>  7 ets    2020-01-08 N(1383, 1505) 1383.
#>  8 ets    2020-01-09 N(1383, 1505) 1383.
#>  9 ets    2020-01-10 N(1383, 1505) 1383.
#> 10 ets    2020-01-11 N(1383, 1505) 1383.
#> # ... with 74 more rows
#>
#> $results
#> # A tibble: 21 x 6
#>    Date         ets arima snaive mixed Count
#>    <date>     <dbl> <dbl>  <dbl> <dbl> <int>
#>  1 2020-01-02 1383. 1383.   1386 1384.  1350
#>  2 2020-01-03 1383. 1383.   1366 1377.  1398
#>  3 2020-01-04 1383. 1383.   1426 1397.  1357
#>  4 2020-01-05 1383. 1383.   1398 1388.  1415
#>  5 2020-01-06 1383. 1383.   1431 1399.  1399
#>  6 2020-01-07 1383. 1383.   1431 1399.  1346
#>  7 2020-01-08 1383. 1383.   1350 1372.  1299
#>  8 2020-01-09 1383. 1383.   1386 1384.  1303
#>  9 2020-01-10 1383. 1383.   1366 1377.  1365
#> 10 2020-01-11 1383. 1383.   1426 1397.  1328
#> # ... with 11 more rows
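If you specifically want the zoomed panel to show the N days before and after the event (point 2 in the question), one option, sketched here as a suggestion rather than taken from the answer above, is to hand autoplot() a windowed copy of the history inside the function:
# Inside event_analysis(), after fc has been computed:
# keep `period` days either side of the event and plot the forecast over that window
window_data <- data_count %>%
  filter(Date >= eventdate - period, Date <= eventdate + period)
zoomfcstplot <- fc %>%
  autoplot(window_data, level = NULL) +
  ggtitle(paste(title, "(zoom)")) +
  geom_vline(xintercept = eventdate, linetype = "dashed", color = "red")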

How to compare technical duplicates on separate rows in R?

I would like to compare the mean, sd, and percentage CV of two technical duplicates in R.
Currently my data frame looks like this:
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1,        1232,    24030,
  2,        1232,    11040,
  3,        4321,    7266,
  4,        4321,    4096,
  5,        5631,    7383,
  6,        5631,    21507
)
Created on 2019-10-22 by the reprex package (v0.3.0)
So I would want to compare the values from rows 1 and 2 together, 3 and 4 and so on.
With ideally this being stored in a new data frame just with the average score and stats if that makes sense.
Sorry, I'm quite new to R so apologies if this is really straightforward.
Thanks! Mari
summarize() can give you exactly this, especially if all the stats you want are computed within groups defined by one variable, i.e. Sample:
library(raster)
#> Loading required package: sp
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1,        1232,    24030,
  2,        1232,    11040,
  3,        4321,    7266,
  4,        4321,    4096,
  5,        5631,    7383,
  6,        5631,    21507
)
data %>%
  group_by(Sample) %>%
  summarize(
    mean = mean(Phagocytic_Score),
    sd = sd(Phagocytic_Score),
    pct_cv = cv(Phagocytic_Score)
  )
#> # A tibble: 3 x 4
#>   Sample   mean    sd pct_cv
#>    <dbl>  <dbl> <dbl>  <dbl>
#> 1   1232  17535 9185.   52.4
#> 2   4321   5681 2242.   39.5
#> 3   5631  14445 9987.   69.1
We've got some repeating going on, though, don't we? Each variable is defined as a function call with the same input variable. summarize_at() is more appropriate, then:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               list(mean = mean, sd = sd, cv = cv))
#> # A tibble: 3 x 4
#>   Sample   mean    sd    cv
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1   1232  17535 9185.  52.4
#> 2   4321   5681 2242.  39.5
#> 3   5631  14445 9987.  69.1
Ah, but there's still some more room for improvement. Why are we repeating the names of the functions as the names of the variables, since they're the same? Well, mget() will take a single vector of the function names we want, and return a named list of those functions, with the names as those function names:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               mget(c("mean", "sd", "cv"), inherits = TRUE))
#> # A tibble: 3 x 4
#>   Sample   mean    sd    cv
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1   1232  17535 9185.  52.4
#> 2   4321   5681 2242.  39.5
#> 3   5631  14445 9987.  69.1
Note we need inherits = TRUE for the reason explained here.
Created on 2019-10-22 by the reprex package (v0.3.0)
If I'm understanding your question, you are looking to summarize your dataframe by grouping based on one of the columns. I assume that in your real data you don't always have exactly two observations of each of your samples.
This approach uses the tidyverse packages; there are other ways to accomplish the same thing.
library(tidyverse)
df %>%                   # name of your data frame
  group_by(Sample) %>%   # group observations that share a value of Sample for subsequent analysis
  summarize(Mean = mean(Phagocytic_Score),
            SD = sd(Phagocytic_Score),
            PercentCV = 100 * SD / Mean  # CV as a percentage, using the SD and Mean just calculated for each group
  )
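A small modernization note: on dplyr 1.0 and later, summarize_at() is superseded by across(). An equivalent of the grouped summary, with the CV expressed as a percentage, might look like this (a sketch, not part of the original answers; `data` is the tribble from above):
library(dplyr)
data %>%
  group_by(Sample) %>%
  summarize(across(Phagocytic_Score,
                   list(mean = mean,
                        sd = sd,
                        pct_cv = ~ 100 * sd(.x) / mean(.x)),
                   .names = "{.fn}"))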

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
  month year day mean.temp mean.temp.year.month
      1 1961   1      4.85             4.090323
      1 1961   2      4.90             4.090323
      1 1961   3      2.95             4.090323
      1 1961   4      3.40             4.090323
      1 1961   5      2.90             4.090323
A dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow = 672, ncol = 4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
  for (j in 1:length(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    cr <- as.vector(coef(mn))
    nv <- length(coef(mn))
    table[i, 1:nv] <- cr
    table[i, nv + 1] <- YEARsel[i]
    table[i, nv + 2] <- MONTHsel[j]
  }
}
I've tried several options (e.g. without using a nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong). Incidentally, your loop only returns month 12 because every result is written to table[i, ], so each pass of the inner loop overwrites the same row and only the last month survives.
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>%
  nest(-month, -year) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = nlc,  # 'nlc' is the asker's nls control object, defined outside the question
                                 start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~ coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#>   month  year  alpha    phi
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1  1961  0.561 -10.8
#> 2     2  1961 -1.50  -10.5
#> 3     3  1961 -2.06   -9.77
#> 4     1  1962 -3.35   -5.48
#> 5     2  1962 -2.27   -9.97
#> 6     3  1962  0.959 -10.8
First we nest the data based on your groups (in this case year and month); then we map the model and the coefficients for each group; lastly, we unnest the coefficients and spread the data from long to wide.
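For completeness, the original loop can also be repaired directly with a single running row index; a sketch (keeping the asker's nlc control object, which is defined outside the question):
row <- 0
for (i in seq_along(YEARsel)) {
  for (j in seq_along(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] &
                          df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    row <- row + 1                                  # advance one row per year/month combination
    table[row, 1:length(coef(mn))] <- as.vector(coef(mn))
    table[row, 3] <- YEARsel[i]
    table[row, 4] <- MONTHsel[j]
  }
}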

use model object, e.g. panelmodel, to flag data used

Is it possible somehow to use a fit object, specifically the regression object I get from a plm() model, to flag the observations in the data that were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second, some data; this example draws heavily on Baltagi (2013), table 3.1, found in ?plm:
data("Grunfeld", package = "plm")
dta <- Grunfeld
Now I create some semi-random missing values in my data object, dta:
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
The final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using the tidyverse:
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
The results, using stargazer:
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#>                      Dependent variable:
#>                  ---------------------------
#>                              inv
#> --------------------------------------------
#> value                      0.114***
#>                            (0.008)
#>
#> capital                    0.237***
#>                            (0.028)
#>
#> Constant                  -47.962***
#>                            (9.252)
#>
#> --------------------------------------------
#> Observations                 178
#> R2                          0.799
#> Adjusted R2                 0.797
#> F Statistic          348.176*** (df = 2; 175)
#> ============================================
#> Note:            *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am wondering if there's some vector in plm.reg that I can (easily) use to create a flag in my original data, dta, indicating whether each observation was used or not, i.e. marking the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this; the new element is the column plm.reg at the end:
dta %>% as_tibble()
#> # A tibble: 200 x 6
#>     firm  year   inv value capital plm.reg
#>  * <int> <int> <dbl> <dbl>   <dbl> <lgl>
#>  1     1  1935   318  3078    2.80 T
#>  2     1  1936   392  4662   52.6  T
#>  3     1  1937    NA  5387  157    F
#>  4     1  1938    NA  2792  209    F
#>  5     1  1939    NA  4313  203    F
#>  6     1  1940    NA  4644  207    F
#>  7     1  1941    NA  4551  255    F
#>  8     1  1942    NA  3244  304    F
#>  9     1  1943    NA  4054  264    F
#> 10     1  1944    NA  4379  202    F
#> # ... with 190 more rows
Update: I tried to use broom's augment(), which I had hoped would create the flag, but unfortunately it gave me this error message:
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
  as.data.frame %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(plm.reg$residuals))
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code; this is from another Baltagi example in ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>%
  mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
  as_tibble() %>%
  select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#>    state       year  region reg.re
#>    <fct>       <fct> <fct>  <lgl>
#>  1 CONNECTICUT 1970  1      T
#>  2 CONNECTICUT 1971  1      T
#>  3 CONNECTICUT 1972  1      T
#>  4 CONNECTICUT 1973  1      T
#>  5 CONNECTICUT 1974  1      T
#>  6 CONNECTICUT 1975  1      T
#>  7 CONNECTICUT 1976  1      T
#>  8 CONNECTICUT 1977  1      T
#>  9 CONNECTICUT 1978  1      T
#> 10 CONNECTICUT 1979  1      T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file, the code should be (here p is the plm model object from that example):
Grunfeld %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(p$residuals)) %>%
  as_tibble()
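And, as the question itself anticipates, you can build the same flag without touching the fit object at all, using base R's complete.cases() on the variables that entered the model; a one-line alternative:
library(dplyr)
dta %>%
  mutate(plm.reg = complete.cases(inv, value, capital)) %>%  # TRUE where no regressor is NA
  as_tibble()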
