R: Holt-Winters with daily data (forecast package) - r

In the following example, I am trying to use Holt-Winters smoothing on daily data, but I run into a couple of issues:
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# convert to a zoo object
zooData = as.zoo(mData[, 2, drop = FALSE],
order.by = as.Date(mData[, 1, drop = FALSE], format = '%Y-%m-%d'),
frequency = 7)
# attempt Holt-Winters smoothing
hw(x = zooData, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
# no missing values in the data
sum(is.na(zooData))
This leads to the following error:
Error in ets(x, "AAA", alpha = alpha, beta = beta, gamma = gamma,
damped = damped, : You've got to be joking. I need more data! In
addition: Warning message: In ets(x, "AAA", alpha = alpha, beta =
beta, gamma = gamma, damped = damped, : Missing values encountered.
Using longest contiguous portion of time series
Emphasis mine.
Couple of questions:
1. Where are the missing values coming from?
2. I am assuming that the "need more data" arises from attempting to estimate 365 seasonal parameters?
Update 1:
Based on Gabor's suggestion, I have recreated a fractional index for the data where whole numbers are weeks.
I have a couple of questions.
1. Is this is an appropriate way of handling daily data when the periodicity is assumed to be weekly?
2. Is there is a more elegant way of handling the dates when working with daily data?
library(zoo)
library(forecast)
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# conver to a zoo object with weekly frequency
zooDataWeekly = as.zoo(mData[, 2, drop = FALSE],
order.by = seq(from = 0, by = 1/7, length.out = 731))
# attempt Holt-Winters smoothing
hwData = hw(x = zooDataWeekly, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
plot(zooDataWeekly, col = 'red')
lines(fitted(hwData))

hw requires a ts object not a zoo object. Use
zooDataWeekly <- ts(mData[,2], frequency=7)
Unless there is a good reason for specifying the model exactly, it is usually better to let R select the best model for you:
fit <- ets(zooDataWeekly)
fc <- forecast(fit)
plot(fc)

Related

Output of keras in R can not be used to predict

i use keras and tensorflow to run an lstm in R to predict some stock market prices.
Here I am providing the code where instead of stock market prices, I just use one randomly generated vector VECTOR of length 100. Then I consider a training period of 80 first values and try to predict the 20 test values...
What am I doing wrong?
I am getting an error:Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "keras_training_history"
Thank you
library(tensorflow)
library(keras)
set.seed(12345)
VECTOR=rnorm(100,2,5)
VECTOR_training=VECTOR[1:80]
VECTOR_test=VECTOR[81:100]
training_rescaled=scale(VECTOR_training)
#I also calculate the scale factors because I will need them when I will be coming
#back to the original data
scale_factors=matrix(NA,nrow=1,ncol=2)
scale_factors=c(mean(VECTOR_training), sd(VECTOR_training))
#We want to predict 20 days, so we need to base each prediction on 20 data points.
prediction_stocks=20
lag_stocks=prediction_stocks
test_rescaled =training_rescaled[(length(VECTOR_training)- prediction_stocks + 1):length(VECTOR_training)]
#We lag the data 20times, so that each prediction is based on 20 values, and arrange lagged values into columns. Then we transform it into the desired 3D form.
x_train_data_stocks=t(sapply(1:(length(VECTOR_training)-lag_stocks-prediction_stocks+1),
function(x) training_rescaled[x:(x+lag_stocks-1),1]
))
# now we transform it into 3D form
x_train_arr_stocks=array(
data=as.numeric(unlist(x_train_data_stocks)),
dim=c(
nrow(x_train_data_stocks),
lag_stocks,
1
)
)
#Now we apply similar transformation for the Y values.
y_train_data_stocks=t(sapply(
(1 + lag_stocks):(length(training_rescaled) - prediction_stocks + 1),
function(x) training_rescaled[x:(x + prediction_stocks - 1)]
))
y_train_arr_stocks= array(
data = as.numeric(unlist(y_train_data_stocks)),
dim = c(
nrow(y_train_data_stocks),
prediction_stocks,
1
)
)
#In the same manner we need to prepare input data for the prediction
#list_test_rescaled
# this time our array just has one sample, as we intend to perform one 20-days prediction
x_pred_arr_stocks=array(
data = test_rescaled,
dim = c(
1,
lag_stocks,
1
)
)
###lstm forecast prova
set.seed(12345)
lstm_model <- keras_model_sequential()
lstm_model_prova=
layer_lstm(lstm_model,units = 70, # size of the layer
batch_input_shape = c(1, 20, 1), # batch size, timesteps, features
return_sequences = TRUE,
stateful = TRUE) %>%
# fraction of the units to drop for the linear transformation of the inputs
layer_dropout(rate = 0.5) %>%
layer_lstm(units = 50,
return_sequences = TRUE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
time_distributed(keras::layer_dense(units = 1))
lstm_model_compile=compile(lstm_model_prova,loss = 'mae', optimizer = 'adam', metrics = 'accuracy')
lstm_fit_prova=fit(lstm_model_compile,
x = x_train_arr_stocks[[1]],
y = y_train_arr_stocks[[1]],
batch_size = 1,
epochs = 20,
verbose = 0,
shuffle = FALSE
)
lstm_forecast_prova=predict(lstm_fit_prova,x_pred_arr_stocks, batch_size = 1)
It works if I use
lstm_forecast_prova=predict(lstm_model_compile,x_pred_arr_stocks, batch_size = 1)
But shouldn't I use the fitted model in order to make the predictions?
Also, if I plot the fitted model, the accuracy is 0. And actually on my real data the predictions do not make any sense. So what does it mean that the accuracy is 0? Maybe something is wrong with the lstm parameters?
Thank you in advance!!

R LightGBM ignores init_score when continuing training with init_model

General description of my problem
I am performing a Poisson regression using LightGBM in R.
I am using an "offset" for the training, similar to using log(time) in a GLM as the offset when modelling insurance claims because we want to ensure that expected value of the response is proportional to time. I do this using the init_score parameter within lab.train().
I am using the "continue training" option in lgb.train (where you specify a value for init_model). This is because I want to build a "stumps" model first, and then continue training with a more complex model. This is to help me identify potential interaction terms in the data. This is just for background why I am doing this - not relevant to the specific issue described below.
However, when I continue training, the offset originally specified in the first model I build is no longer used by the fitting process. I think init_model overrides any value of init_score, but init_model does NOT itself contain or allow for init_score. So, as far as I can see, the init_score is totally lost from the fitting process once you continue training using init_model.
This means that the "starting point" when continuing to train a model is not the "finishing point" from the original model build. e.g. in my example below, I want the poisson log-likelihood error metric for models 2 and 3 to "start" from where model 1 finished. This isn't the case - but surely that is what "continue training" should deliver?
I have entered comments into the code below to explain the issue more clearly.
Reproducible example
library(lightgbm)
library(data.table)
# simulate some data
# z follows a Poisson distribution
# the mean of z is given by t * exp(x+y), where t is the "time exposed to risk"
# t is uniform(0,10)
# x and y are uniform(0,1)
# I want to specify log(t) using init_score in the lightGBM
# i.e. just like Poisson regression in insurance where log(t) is the offset in a GLM or GBM
n <- 10000 # number of rows
set.seed(42)
d <- data.table(t = runif(n,0,10), x = runif(n,0,1), y = runif(n,0,1))
d[, z := rpois(n, t * exp(x+y))]
# check weighted mean looks about right
# should get actual = 2.957188 and
# underlying = 2.939975
d[, list(actual = sum(z)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# build a lightGBM using 100 rounds and specify log(t) as init_score
feature_cols <- c('x','y')
dm <- as.matrix(d[, ..feature_cols])
l_train <- lgb.Dataset(dm, label=d[,z], free_raw_data = FALSE)
setinfo(l_train, "init_score", log(d$t))
params <- list(objective='poisson', metric = 'poisson')
lgbm_1 <- lgb.train(params = params,
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_1 <- lgb.get.eval.result(lgbm_1, "train", 'poisson')
# get the model predictions and check that they are close to expected
# remember that we need to manually apply the init_score to get the prediction
# i.e. we need to add log(t) onto the raw score, or multiply the scaled prediction by t
# the predictions are all very close
d[, lgbm_predicted_1 := t*predict(lgbm_1, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# save the model
lgb.save(lgbm_1, 'lgbm_1.txt')
# ATTEMPT A - CONTINUE TRAINING FROM MODEL 1
# don't change the init_score
# note iterations in console start at 101 because we are continuing training
# however, the error metric (poisson log likelihood)
# start from a totally different value to where the first model ended
lgbm_2 <- lgb.train(params = params,
init_model = 'lgbm_1.txt',
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_2 <- lgb.get.eval.result(lgbm_2, "train", 'poisson')
# check predictions - predicted_2 are WAY TOO HIGH now!
# I think this is because lightGBM uses the predictions from the first model
# as the starting point for training
# but the predictions from model 1 DO NOT ALLOW FOR THE log(t) being the offset to the original model!
d[, lgbm_predicted_2 := t*predict(lgbm_2, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# ATTEMPT B - try init_score = 0?
# doesn't seem to make any difference
# so my hypothesis is that init_score is being ignored
# and over-written by the init_model
# but... how does the original init_score ever get back into the fitting process?
# init_score + init_model is a good stating point
# init_model on it's own is not
setinfo(l_train, "init_score", rep(0, nrow(d)))
lgbm_3 <- lgb.train(params = params,
valids = list(train = l_train),
init_model = 'lgbm_1.txt',
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_3 <- lgb.get.eval.result(lgbm_3, "train", 'poisson')
# check predictions - models 2 and 3 are identical, the init_score made no difference
d[, lgbm_predicted_3 := t*predict(lgbm_3, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
predicted_3 = sum(lgbm_predicted_3)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# compare training logs
# question - why do V2 and V3 not start from the "finishing" point of V1?
# it's because the init_model is wrong, because it doesn't allow for the init_score
logs <- data.table(v1 = train_log_1, v2 = train_log_2, v3 = train_log_3)

How to use cvts on an hybridModel with XREG?

I am trying to do a cross validation using the cvts function from the forecastHybrid package using an "an" model (ARIMA + NNETAR) with external regressors.
I have two variables with 100 observations : Y and X
Note that:
length(Y) == length(X)
TRUE
I did this:
crossv =cvts(Y,
FUN=hybridModel, models="an", a.args=list(xreg=X,n.args=list(xreg=X),
rolling = TRUE, windowSize = 84, maxHorizon = 1, horizonAverage = FALSE)
and got this error
Error in { :
task 1 failed - "variable lengths differ (found for 'xregg')"
if I try to pass it as a function.
CUSTOM=function(x){hybridModel(x, models="an", a.args=list(xreg=X),n.args=list(xreg=X))}
crossv2 = cvts(Y,
FUN=CUSTOM,
rolling = TRUE,
windowSize = 84,
maxHorizon = 1,
horizonAverage = FALSE)
I get:
Error in { : task 1 failed - "object 'X' not found"
I can, of course do cross validation separately for nnetar and arima and then averaging the cross validation forecasting but any ideas why it’s not working via cvts + hybrid model ?
Thanks a lot.
I finally found the answer. We should add an other xreg argument in the FUN function. Then, a second xreg argument (from the cvts function) should be written outside the function. The value of the second xreg are then passed to the first xreg.
y = ts(rnorm(100), start = c(1999,1), frequency = 12)
x = ts(rnorm(100), start = c(1999,1), frequency = 12)
cv = cvts(y, FUN = function(z, xreg = xreg),
forecastHybrid::hybridModel(z,models = "an")},
xreg=as.matrix(x),rolling = TRUE, windowSize = 84,
maxHorizon = 1, horizonAverage = FALSE)
It worked!
accuracy(cv)
ME RMSE MAE
Forecast Horizon 1 -0.2173456 0.9328878 0.7368945

How to cluster standard error in clubSandwich's vcovCR()?

I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.

Predicting with bsts model and updated olddata

I've built a bsts model using 2 years of weekly historical data. I'm able to predict using the model with the existing training data. In order to mimic the process that would occur with the model in production, I've created an xts object that moves the 2 years of data forward by one week. When I try to predict using this dataset (populating the olddata parameter in predict.bsts), I receive the following error:
Error in terms.default(object) : no terms component nor attribute
I realize I'm probably doing something dumb here, but haven't been able to find any examples of usage of values of olddata when predicting. Appreciate any help you can provide.
Thanks
dat = xts(fcastdat$SumScan_units,order.by = fcastdat$enddate)
traindat = window(dat, start=as.Date("2015-01-03"), end=as.Date("2016-12-26"))
ss = AddLocalLevel(list(), traindat)
ss = AddSeasonal(ss, traindat, nseasons = 52)
holidays = c("EasterSunday", "USMothersDay", "IndependenceDay", "MemorialDay", "LaborDay", "Thanksgiving", "Christmas")
ss = AddNamedHolidays(ss, named.holidays = holidays, traindat)
model_loclev_seas_hol = bsts(traindat, state.specification = ss, niter = 500, ping=50, seed=1289)
burn = SuggestBurn(0.1, model_loclev_seas_hol)
pred_len = 5
pred = predict.bsts(model_ll_seas_hol, horizon = pred_len, burn = burn, quantiles = c(.025, .975))
begdt = index(traindat[1]) + 7
enddt = index(traindat[length(traindat)]) + 7
predseries = window(dat, start=as.Date(begdt), end=as.Date(enddt))
pred2 = predict.bsts(model_ll_seas_hol, horizon=pred_len, burn=burn, olddata = predseries,quantiles = c(.025, .975))

Resources