I am trying to forecast three variables using R, but I am not sure how to handle the correlation between them.
The three variables I am trying to forecast are Revenue, Subscriptions and Price.
My initial approach was to produce two independent time series forecasts, one for subscriptions and one for price, and multiply the outcomes to generate the revenue forecast.
I would like to know whether this approach makes sense, since price and subscriptions are inherently correlated, and this is the part I do not know how to deal with.
# Load packages.
library(forecast)
# Read data
data <- read.csv("data.csv")
data.train <- data[1:57,]
data.test <- data[58:72,]
# Create time series for variables of interest
data.subs <- ts(data.train$subs, start=c(2014,1), frequency = 12)
data.price <- ts(data.train$price, start=c(2014,1), frequency = 12)
#Create model
subs.stlm <- stlm(data.subs)
price.stlm <- stlm(data.price)
#Forecast
subs.pred <- forecast(subs.stlm, h = 15, level = c(0.6, 0.75, 0.9))
price.pred <- forecast(price.stlm, h = 15, level = c(0.6, 0.75, 0.9))
Any help is greatly appreciated!
Looks like you can use the vector autoregression (VAR) model. Take a look at the description and the code provided here:
https://otexts.org/fpp2/VAR.html
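To give a rough idea of what a VAR does with the two series, here is a minimal sketch in base R. It is not the code from the linked chapter: the data are simulated (the numbers in the simulation are made up for illustration), and the VAR(1) is fitted by plain OLS with lm() instead of a dedicated package. Each series is regressed on last month's values of both series, so the forecast of subscriptions can react to price and vice versa.

```r
set.seed(1)

# Hypothetical data: 72 months where subscriptions react to last month's price
n <- 72
price <- numeric(n)
subs <- numeric(n)
price[1] <- 10
subs[1] <- 1000
for (t in 2:n) {
  price[t] <- 2 + 0.8 * price[t - 1] + rnorm(1, sd = 0.2)
  subs[t] <- 300 + 0.8 * subs[t - 1] - 10 * price[t - 1] + rnorm(1, sd = 10)
}
y <- cbind(subs = subs, price = price)

# Same split as in the question: 57 months for training, 15 held out
train <- y[1:57, ]

# VAR(1) by OLS: regress each series on the lagged values of both series
y_now <- train[-1, ]
y_lag <- train[-nrow(train), ]
fit <- lm(y_now ~ y_lag)  # matrix response: one coefficient column per series
B <- coef(fit)            # 3 x 2 matrix: intercept row plus two lag rows

# Iterate the fitted equations forward to get joint forecasts
h <- 15
fc <- matrix(NA_real_, nrow = h, ncol = 2, dimnames = list(NULL, colnames(y)))
last <- train[nrow(train), ]
for (i in seq_len(h)) {
  last <- as.numeric(c(1, last) %*% B)
  fc[i, ] <- last
}

# Revenue forecast as the product of the jointly forecast series
revenue_fc <- fc[, "subs"] * fc[, "price"]
head(cbind(fc, revenue = revenue_fc))
```

Keep in mind that multiplying two point forecasts ignores the covariance of the forecast errors, because E[XY] = E[X]E[Y] + Cov(X, Y), so the product is only an approximation of expected revenue.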
I would like to do a pseudo-out-of-sample exercise with a dynamic factor model (DFM) from the nowcasting package in R.
Let me first provide you with a replicable example using the data from the Nowcasting-package.
library(nowcasting)
data(NYFED)
NYFED$legend$SeriesName
base <- NYFED$base
blocks <- NYFED$blocks$blocks
trans <- NYFED$legend$Transformation
frequency <- NYFED$legend$Frequency
delay <- NYFED$legend$delay
vintage <- PRTDB(mts = BRGDP$base, delay = BRGDP$delay, vintage = "2015-06-01")
base <- window(vintage, start = c(2005,06), frequency = 12)
x <- Bpanel(base = base, trans = BRGDP$trans)
GDP <- base[,which(colnames(base) == "PIB")]
GDP_qtr <- month2qtr(x = GDP, reference_month = 3)
y <- diff(diff(GDP_qtr,4))
y <- qtr2month(y)
data <- cbind(y,x)
frequency <- c(4,rep(12,ncol(x)))
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
summary(nowca$reg)
nowca$yfcst
nowcast.plot(nowca, type = "fcst")
This code runs fine and creates forecasts and a plot with GDP, in-sample fit and three steps of out-of-sample forecasts.
However, I would like to do a full pseudo-out-of-sample forecasting exercise with this package. In other words, I would like to create multiple point forecasts using forecasts generated by this nowcast-function.
I have already written replicable code to do this. It uses the same data as before, but now the data is fed to the model gradually.
nowcasts_dfm <- rep(NA,nrow(data))
for (i in 12:nrow(data)){
data <- ts(data[1:i,], start=c(2005,06), frequency=12)
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
nowcasts_dfm[i] <- nowca$yfcst[,3][!is.na(nowca$yfcst[,3])][1]
}
So, this pseudo-out-of-sample exercise uses an expanding window starting with the first 12 observations. It then expands to cover the whole sample. However, I am getting an error message.
Error in eigen(cov(x)) : infinite or missing values in 'x'
Could someone help me with this, please? How do you code an expanding window pseudo-out-of-sample forecasting exercise with this package?
Or is there a better way to code an expanding window dynamic factor model (DFM) in R?
Thanks!
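One thing that stands out in the loop above, independent of the reported error, is that data is reassigned inside the loop, so every later iteration indexes an object that has already been truncated. A generic expanding-window pattern keeps the full data set untouched and builds a fresh window on each pass. The sketch below shows only that pattern in base R: the series is simulated, and a plain AR(1) fitted with arima() is a stand-in assumption for nowcast(), which is not called here.

```r
set.seed(42)

# Simulated stand-in series; the full data object is never overwritten
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 80))

first_window <- 30                   # size of the initial estimation window
preds <- rep(NA_real_, length(y))    # one-step-ahead forecasts, aligned with y

for (i in first_window:(length(y) - 1)) {
  window_i <- y[1:i]                              # fresh window each iteration
  fit_i <- arima(window_i, order = c(1, 0, 0))    # placeholder model
  preds[i + 1] <- as.numeric(predict(fit_i, n.ahead = 1)$pred[1])
}

# Out-of-sample accuracy over the periods that were actually forecast
rmse <- sqrt(mean((y - preds)^2, na.rm = TRUE))
rmse
```

With the nowcasting package the same structure would apply: subset into a new object such as data_i inside the loop and pass that to the model, leaving data as it was.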
Hello, I am learning about survival analysis and I was curious whether I could use the survival package on survival data of this form:
Here is some code to generate data in this form
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
From my use of the survival package so far, it seems to expect one survival time per individual, but I might be misreading the documentation of the Surv function. If survival will not work, what other packages are out there for this type of data?
If there is no package or function that estimates the survival function easily, I can calculate the survival times myself with the following equation.
Since the survival package needs one observation per survival time, we need to transform the data first, using the simulated data.
Simulated Data:
library(survival)
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
Transforming the data by duplicating each row by the number of deaths in its interval
duptimes <- df$dead_in_interval
rid <- rep(1:nrow(df), duptimes)
df.t <- df[rid,]
Using the Surv Function
test <- Surv(time = df.t$start_interval,
time2 = df.t$end_interval,
event = rep(1, nrow(df.t)), #Every Observation is a death
type = "interval")
Fitting the survival curve
summary(survfit(test ~ 1))
Comparing with by hand calculation from original data
df$living_at_start/max(df$living_at_start)
They match.
Questions
When using the survfit function, why is the number at risk 1001 at time 0 when there are only 1000 people in the data?
length(test)
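As a sanity check on the data preparation, the sketch below uses base R only (no survival package) to confirm that the row-duplication step yields exactly 1000 rows and that the by-hand survival proportions can be recovered from the expanded rows. It does not explain the 1001 reported by survfit; it only shows that the extra count does not come from the duplication step.

```r
# Rebuild the simulated life table from the question
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval = 0:13, end_interval = 1:14,
                 living_at_start, dead_in_interval)

# One row per death, as in the answer above
df.t <- df[rep(seq_len(nrow(df)), df$dead_in_interval), ]
nrow(df.t)  # 1000

# Survival by hand from the life table
surv_by_hand <- df$living_at_start / df$living_at_start[1]

# Same quantity from the expanded rows: share of individuals whose
# death interval starts at or after time t
surv_from_rows <- sapply(df$start_interval,
                         function(t) mean(df.t$start_interval >= t))

all.equal(surv_by_hand, surv_from_rows)  # TRUE
```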
I want to do batch forecasting across multiple series. For example, if I want to forecast the time series whose IDs end with 1 (1, 11, 21, 31, ...), how can I do that?
Since you did not provide detailed information, I was not sure which forecasting method you want to use, so here is an example with a univariate time series model:
Load required packages:
library(forecast)
library(dplyr)
We use example data from Rob Hyndman:
dta <- read.csv("https://robjhyndman.com/data/ausretail.csv")
Now change the column names:
colnames(dta) <- c("date", paste0("tsname_", seq_len(ncol(dta[,-1]))))
Select timeseries which end with 1:
dta_ends_with1 <- dplyr::select(dta, dplyr::ends_with("1"))
Create a ts object:
dta_ends_with1 <- ts(dta_ends_with1, start = c(1982,5), frequency = 12)
Specify how many steps ahead you want to forecast. Here I set it to 6 steps ahead:
h <- 6
Now we prepare a matrix to save the forecast:
fc <- matrix(NA, ncol = ncol(dta_ends_with1), nrow = h)
Forecasting loop.
for (i in seq_len(ncol(dta_ends_with1))) {
fc[,i] <- forecast::forecast(forecast::auto.arima(dta_ends_with1[,i]),
h = h)$mean
}
Set the column names:
colnames(fc) <- colnames(dta_ends_with1)
head(fc)
This is my first question on stack overflow.
Situation: I have two time series. Both have the same values, but the second has 5 NAs at the start, so the first series has 105 observations and the second has 110. I fitted an ARIMA(0,1,0) to each series separately using the Arima function, and then used the forecast package to predict 10 steps ahead.
Issue: Even though the ARIMA coefficients for both series are the same, the 10-step projections differ. I am not sure why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. None of these steps reconciled the difference.
I have added a picture to show you what I see. Please note I have hidden the middle part of the series so that you can see its start and end. The yellow highlighted cells are the projection outputs from the 'forecast' package. I manually added the index as years after extracting the results from R.
Time series projected and base in excel
library(forecast)
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
#ARIMA with NA
Simple_Arima <- Arima(
ts(Rates$Rates1),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
#ARIMA Without NA
Rates2 <- as.data.frame(Rates$Rates2)
##Remove the final spaces from the CSV
Rates2 <- Rates2[-c(106,107,108,109,110),]
Simple_Arima2 <- Arima(
ts(Rates2),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to data is here, CSV format
Could you share your data and code such that others can see if there is any issue with it?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
library(xts)
set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 105)
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE
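Note that the models in the question use include.drift = TRUE, while the example above does not. As a cross-check that needs no packages, the textbook random-walk-with-drift point forecast can be computed by hand: last observed value plus h times the average first difference. The sketch below is only that hand calculation on simulated data (rw_drift_forecast is a helper made up for this illustration, not a forecast package function), and it shows that leading NAs do not change the result.

```r
set.seed(123)

# Simulated random walk with drift, plus a copy with 5 leading NAs
x <- cumsum(rnorm(105, mean = 0.2))
x_na <- c(rep(NA, 5), x)

# Hypothetical helper: hand-computed random-walk-with-drift forecast.
# Dropping NAs is only valid here because they are all at the start.
rw_drift_forecast <- function(y, h) {
  y <- y[!is.na(y)]
  drift <- mean(diff(y))             # average step size
  y[length(y)] + drift * seq_len(h)  # last value plus h steps of drift
}

f1 <- rw_drift_forecast(x, 10)
f2 <- rw_drift_forecast(x_na, 10)
all.equal(f1, f2)  # TRUE
```

If the forecasts from the real data still differ, comparing them against this hand calculation for each series may help narrow down where the two inputs actually diverge.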
I have an existing time series (1000 samples) and calculated its rolling mean using the filter() function in R, averaging across 30 samples each. The goal was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, and that would result in the same rolling mean if I applied the same filter() function to them. In short, I would like to simulate a time series with the same overall course as an existing time series but not the exact same values. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend are not the same.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){stats::filter(x, rep(1 / n, n), sides = 2, circular = TRUE)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problem is that a) the variance of the simulated values is higher than the variance of the original values, b) the rolling average, although quite similar to the original, sometimes deviates quite a bit from it, and c) there is no autocorrelation. An autocorrelation structure in the data would be good, since the simulated data are supposed to resemble the original.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
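One possible direction, sketched below under stated assumptions, is to keep the rolling mean as the trend and add autocorrelated noise from which its own rolling mean has been subtracted, so the noise contributes little to the smoothed curve. This only matches the rolling mean approximately, not exactly. The AR coefficient of 0.5 for the noise is an arbitrary choice for illustration, and the names roll, raw_noise and noise_detr are made up here.

```r
set.seed(576)
n <- 1000
x <- as.numeric(arima.sim(model = list(order = c(1, 0, 0), ar = 0.9), n = n)) + 900

# Same circular 30-point moving average as in the question
roll <- function(v, k = 30) {
  as.numeric(stats::filter(v, rep(1 / k, k), sides = 2, circular = TRUE))
}

trend <- roll(x)
resid_sd <- sd(x - trend)    # spread of the original around its trend

# Autocorrelated noise (AR(1) with an assumed coefficient of 0.5)
raw_noise <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))

# Remove the noise's own slow component so it barely moves the rolling mean
noise_detr <- raw_noise - roll(raw_noise)

# Rescale so the simulated fluctuations have the same spread as the original
sim <- trend + noise_detr * resid_sd / sd(noise_detr)

# How much each kind of noise leaks into the rolling mean
c(raw = sd(roll(raw_noise)), detrended = sd(roll(noise_detr)))

plot(x, type = "l")
lines(sim, col = "blue")
lines(roll(sim), col = "green")
lines(trend, col = "red")
```

The remaining gap between roll(sim) and trend comes from smoothing the trend a second time plus the small leftover in the noise, so it is worth checking whether that deviation is acceptable for the similarity methods being tested.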