How to predict future values of time series using h2o.predict - r

I am going through the book "Hands-On Time Series Analysis with R" and I am stuck at the example using the machine learning h2o package. I don't understand how to use the h2o.predict function. In the example it requires a newdata argument, which is the test data in this case. But how do you predict future values of a time series if you in fact don't know those values?
If I just omit the newdata argument I get: "predictions with a missing newdata argument is not implemented yet".
library(h2o)
h2o.init(max_mem_size = "16G")
train_h <- as.h2o(train_df)
test_h <- as.h2o(test_df)
forecast_h <- as.h2o(forecast_df)
x <- c("month", "lag12", "trend", "trend_sqr")
y <- "y"
rf_md <- h2o.randomForest(training_frame = train_h,
                          nfolds = 5,
                          x = x,
                          y = y,
                          ntrees = 500,
                          stopping_rounds = 10,
                          stopping_metric = "RMSE",
                          score_each_iteration = TRUE,
                          stopping_tolerance = 0.0001,
                          seed = 1234)
h2o.varimp_plot(rf_md)
rf_md@model$model_summary
library(plotly)
tree_score <- rf_md@model$scoring_history$training_rmse
plot_ly(x = seq_along(tree_score), y = tree_score,
        type = "scatter", mode = "lines") %>%
  layout(title = "Random Forest Model - Trained Score History",
         yaxis = list(title = "RMSE"),
         xaxis = list(title = "Num. of Trees"))
test_h$pred_rf <- h2o.predict(rf_md, test_h)
test_1 <- as.data.frame(test_h)
mape_rf <- mean(abs(test_1$y - test_1$pred_rf) / test_1$y)
mape_rf

H2O-3 does not support traditional time series algorithms (e.g., ARIMA). Instead, the recommendation is to treat the time series use case as a supervised learning problem and perform time-series specific pre-processing.
For example, if your goal is to predict a store's sales for tomorrow, you can treat this as a regression problem where the target is Sales. If you try to train a supervised learning model on the raw data, however, chances are the performance will be pretty poor. The trick is to add historical attributes, such as lags, as a pre-processing step.
If we train a model on the unaltered dataset, the mean absolute percentage error is around 35%.
If we start adding historical features, like the sales from the previous day for that store, we can reduce it to about 15%.
While H2O-3 does not support lagging, you can leverage Sparkling Water to perform this pre-processing. You can use Spark to generate the lags per group and then use H2O-3 to train the regression model. Here is an example of this process: https://github.com/h2oai/h2o-tutorials/tree/master/best-practices/forecasting
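For illustration, here is a minimal sketch of that lagging step in plain R with dplyr (the sales_df, store_id, date and sales names are hypothetical placeholders for your own data):
library(dplyr)
# Add the previous day's sales per store as a feature; rows with no history
# get NA and are typically dropped or imputed before training.
sales_lagged <- sales_df %>%
  group_by(store_id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(sales_lag1 = lag(sales, 1)) %>%
  ungroup()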

The training data, train_df, has to have all the columns listed in both x (c("month", "lag12", "trend", "trend_sqr")) and y ("y"), whereas the data you give to h2o.predict() only has to have the columns in x; the y column is what will be returned as the prediction.
Because your features (in x) are things like the lag, trend, and so on, the fact that it is a time series does not matter. (But you do have to be very careful when preparing those features to make sure you do not use any information that was not known at that point in time; I would imagine the book has already been emphasizing that.)
Normally with a time series, for a given row in the training data, your x data is the data known at time t, and the value in the y column is the value of interest at time t+1. So when doing a prediction, you give the x values as the values at the moment, and the prediction returned is what will happen next.
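A minimal sketch of that workflow, assuming the forecast_df / forecast_h frame from the question already holds the future rows with their month, lag12, trend and trend_sqr columns filled in (lag12 is known up to 12 months ahead because it is just last year's observation; month, trend and trend_sqr extend deterministically):
# forecast_h only needs the predictor columns in x; there is no y column,
# because that is exactly what we are asking the model for.
fc <- h2o.predict(rf_md, newdata = forecast_h)
forecast_df$pred_rf <- as.data.frame(fc)$predict
head(forecast_df)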

Related

Auto.Arima incorrectly predicts first point

I'm trying to complete a time series analysis of some reservoir data and am using auto.arima with a Fourier component to account for seasonality, as described at https://otexts.com/fpp2/dhr.html#dhr. The code I have used is shown below, and the dataset I used can be found here: https://www.dropbox.com/sh/563nu3daeid0agb/AAB6NSddVUKgBCCbQtuqXPsZa?dl=0
library(forecast)   # auto.arima(), fourier(), forecast(), checkresiduals()
library(ggplot2)    # autoplot()
Reservoir = read.csv("Reservoir1.csv", TRUE, ",")
#impute missing data from data set
Reservoir = imputeTS::na_interpolation(Reservoir)
#Create Time Series
Reservoir = ts(Reservoir[,2],frequency = (365.25),start = c(2013,116))
plots = list()
for (i in seq(10)) {
  fit = auto.arima(Reservoir, xreg = fourier(Reservoir, K = i), seasonal = FALSE)
  plots[[i]] = autoplot(forecast(fit, xreg = fourier(Reservoir, K = i, h = 10))) +
    xlab(paste("K=", i, "AICC=", round(fit[["aicc"]], 2))) + ylab("")
}
gridExtra::grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]],
                        plots[[6]], plots[[7]], plots[[8]], plots[[9]], plots[[10]],
                        nrow = 5)
bestfit = auto.arima(Reservoir, xreg=fourier(Reservoir, K=9), seasonal=FALSE)
summary(bestfit)
checkresiduals(bestfit)
plot(Reservoir,col="red")
lines(fitted(bestfit),col="blue")
The model fits well, except for the incorrect first prediction. I'm lost as to why only this value would be so far off. Or, is this an acceptable error?
The residuals are the one-step forecast errors using all previous observations. At time 1, the residual is the forecast error with no previous observations, so it is simply based on the fitted model. In fact, it is an artificially "good" forecast because the differencing means there is no way for the model to know the location of the data until there is an observation. But the way ARIMA models are implemented in R makes the first prediction use a little more information than it should.
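If the stray first point only matters for the plot, one option (a sketch based on the code above) is simply to leave the first fitted value out, since it is not a genuine one-step forecast:
# Drop the first fitted value before overlaying the fit on the series.
fitted_vals <- fitted(bestfit)
plot(Reservoir, col = "red")
lines(window(fitted_vals, start = time(Reservoir)[2]), col = "blue")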

LSTM time series forecasting, predictions stabilize

My code is in R using the Keras and Tensorflow libraries. I'm creating an LSTM model to forecast 100 future values. My input shape is (100,200,1).
Let's say my input data is X. I make a prediction at time step t = 201 and get a column Y of predictions. Then I create a new variable Xnew = c(X[2:200], Y), concatenating X (minus its first column) with Y, and use this Xnew to predict the next time step.
What's happening is that, after a certain number of predicted future time steps (around 15), the predictions become constant for each time step afterwards. Does anyone know why this happens?
library(keras)  # load_model_hdf5(), predict()
prdvec = function(dat, modname, numpreds, cnt, scl){
  model = load_model_hdf5(modname)
  inpt = dat
  pred = list()
  for(i in 1:numpreds){
    # reshape_X_3d() is a user-defined helper (not shown) that reshapes the
    # scaled input into the 3-D array the LSTM expects.
    pred[[i]] <- predict(model, reshape_X_3d((inpt[, 1:ncol(inpt)] - cnt) / scl), batch_size = 1)
    inpt = cbind(inpt[, 2:ncol(inpt)], (pred[[i]] * scl + cnt))
    print(i)
    flush.console()
  }
  pred
}
I encountered a similar problem. When the LSTM takes its own predictions as input, it tends to stabilize toward a constant value.

How to ensemble forecasts in R using weights

I have a couple of forecasts and was trying to figure out how to merge the two according to some widely used criterion.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt the whole model on the entire data and forecast out by ten values. How can one merge the two forecasts according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason that I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that perhaps that function could be modified to perform the desired ensembling.
Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted_1 <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_1
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted_2 <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_2
# By far the most common way to combine/weight is simply:
fitted <- fitted_2*.5+fitted_1*.5
fitted
One might ask whether you should use equal weights, or how to choose the weights. This is usually determined by:
(a) naive, equal weighting, if that's all you have time for and it seems to work fine, or
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit.
Some people take fancier approaches. It's easy to mess that up; however, if you get it right, it can lead to a better forecast.
The model-based and other fancier approaches amount to adding another stage to the modeling process, in which your predictions on a holdout sample form the X matrix and the outcome variable is the actual y for that sample.
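For illustration, here is a minimal sketch of the holdout-based weighting in (b), reusing forc1, forc2 and test from the code above (with more than two models you would optimise a weight vector rather than a single scalar):
# Grid-search one mixing weight on the holdout set.
weights <- seq(0, 1, by = 0.01)
rmse <- sapply(weights, function(w) {
  combo <- w * forc1 + (1 - w) * forc2
  sqrt(mean((test - combo)^2))
})
best_w <- weights[which.min(rmse)]
best_w
# Combined forecast with the selected weight:
forc_comb <- best_w * forc1 + (1 - best_w) * forc2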
Also, check out Erin LeDell's approach in h2oEnsemble.

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like, in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model with 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the test set using the calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
                 height = rpois(56, 10),
                 fbm = rpois(56, 10),
                 nitrogen = rpois(56, 10),
                 carbon = rpois(56, 10),
                 chl = rpois(56, 10),
                 ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which marks it as AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
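A quick sanity check (using the objects from the sketch above): genuine out-of-sample predictions should have one row per held-out observation (36 here), not one per training row.
dim(res)         # 36 x 1 x 1: held-out rows x response x chosen ncomp
head(drop(res))  # predicted height for the first few held-out observations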

Time series modelling: "train" function with method "nnet" is not giving satisfactory result

I was trying to use the train function in R with nnet as the method on monthly consumption data. But the outputs (the predicted values) all come out roughly equal to some mean value.
I have data for 24 time points (each representing a month's data) and I have used the first 20 for training and the remaining 4 for testing the model. Here is my code:
a<-read.csv("...",header=TRUE)
tem<-a[,5]
hum<-a[,4]
con<- a[,3]
require(quantmod)
require(nnet)
require(caret)
y<-con
plot(con,type="l")
dat <- data.frame( y, x1=tem, x2=hum)
names(dat) <- c('y','x1','x2')
#Fit model
model <- train(y ~ x1 + x2,
               dat[1:20,],
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, dat[21:24,])
plot(1:24,y,type="l",col = 2)
lines(1:24,c(y[1:20],ps), col=3,type="o")
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
Any suggestion on how can I approach this problem alternatively? Is there any way to use Neural Network more efficiently? Or is there any other better method for this?
The problem is likely not enough data. 24 data points is very low for any machine learning problem. If the curve/shape/surface of the data is, e.g., a simple sine wave, then 24 would be enough.
But for any more complex function, the more data the better. Can you accurately model, e.g., sin^2(x) * cos^0.3(x) / sinh(x) with only 6 data points? No, because the available data does not capture enough detail.
If you can acquire daily data, use that instead.
