arima model for multiple seasonalities in R
I'm learning to create a forecasting model for a time series that has multiple seasonalities. The dataset consists of hourly data points, and I wish to include both daily and weekly seasonalities in my ARIMA model. Following is the subset of the dataset that I'm referring to:
data= c(4,4,1,2,6,21,105,257,291,172,72,10,35,42,77,72,133,192,122,59,29,25,24,5,7,3,3,0,7,15,91,230,284,147,67,53,54,55,63,73,114,154,137,57,27,31,25,11,4,4,4,2,7,18,68,218,251,131,71,43,55,62,63,80,120,144,107,42,27,11,10,16,8,10,7,1,4,3,12,17,58,59,68,76,91,95,89,115,107,107,41,40,25,18,14,15,6,12,2,4,1,6,9,14,43,67,67,94,100,129,126,122,132,118,68,26,19,12,9,5,4,2,5,1,3,16,89,233,304,174,53,55,53,52,59,92,117,214,139,73,37,28,15,11,8,1,2,5,4,22,103,258,317,163,58,29,37,46,54,62,95,197,152,58,32,30,17,9,8,1,3,1,3,16,109,245,302,156,53,34,47,46,54,65,102,155,116,51,30,24,17,10,7,4,8,0,11,0,2,225,282,141,4,87,44,60,52,74,135,157,113,57,44,26,29,17,8,7,4,4,2,10,57,125,182,100,33,27,41,39,35,50,69,92,66,30,11,10,11,9,6,5,10,4,1,7,9,17,24,21,29,28,48,38,30,21,26,25,35,10,9,4,4,4,3,5,4,4,4,3,5,10,16,28,47,63,40,49,28,22,18,27,18,10,5,8,7,3,2,2,4,1,4,19,59,167,235,130,57,45,46,42,40,49,64,96,54,27,17,18,15,7,6,2,3,1,2,21,88,187,253,130,77,47,49,48,53,77,109,147,109,45,41,35,16,13)
The code I'm trying to use is the following:
library(forecast)
tsdata <- ts(data, frequency = 24)
aicvalstemp <- NULL
aicvals <- NULL
for (i in 1:5) {
  for (j in 1:5) {
    xreg1 <- fourier(tsdata, i, 24)
    xreg2 <- fourier(tsdata, j, 168)
    xregs <- cbind(xreg1, xreg2)
    armodel <- auto.arima(tsdata, xreg = xregs)
    aicvalstemp <- cbind(i, j, armodel$aic)
    aicvals <- rbind(aicvals, aicvalstemp)
  }
}
The cbind call in the code above fails because the numbers of rows in xreg1 and xreg2 are different. I even tried passing 1:length(data) as an argument to the fourier function, but that also gave me an error. If someone could rectify the mistakes in the above code so that it produces a forecast of the next 24 hours from the ARIMA model with the minimum AIC, it would be really helpful. It would also be totally awesome if you could include data splitting in your code by creating training and testing data sets. Thanks for your help.
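For reference, the cbind fails because the third positional argument of the forecast package's fourier() is the horizon h, so fourier(tsdata, i, 24) returns 24 rows of future regressors while fourier(tsdata, j, 168) returns 168. A minimal sketch of the usual workaround, assuming hourly data with daily (24) and weekly (168) periods and purely illustrative K values, is to declare both periods with msts() and pass a vector K to a single fourier() call:

library(forecast)
## declare both seasonal periods up front
tsdata_m <- msts(data, seasonal.periods = c(24, 168))
## one fourier() call builds matching in-sample regressors for both periods
xregs <- fourier(tsdata_m, K = c(2, 2))
fit <- auto.arima(tsdata_m, xreg = xregs, seasonal = FALSE)
## future regressors for the next 24 hours
fc <- forecast(fit, xreg = fourier(tsdata_m, K = c(2, 2), h = 24))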
I don't understand the desire to fit a weekly "season" to these data as there is no evidence for one in the data subset you provided. Also, you should really log-transform the data because they do not reflect a Gaussian process as is.
So, here's how you could fit models with some form of hourly signal.
library(forecast)
## the data are not normal, so log-transform to meet the assumption of Gaussian errors
## (the counts include zeros, so add 1 before taking logs to avoid -Inf)
ln_dat <- log(tsdata + 1)
## number of hours to forecast
hrs_out <- 24
## max number of Fourier terms
max_F <- 5
## empty list for model fits
mod_res <- vector("list", max_F)
## fit models to the log series with increasing numbers of Fourier terms
for (i in 1:max_F) {
  xreg <- fourier(ln_dat, K = i)
  mod_res[[i]] <- auto.arima(ln_dat, xreg = xreg)
}
## table of AIC results
aic_tbl <- data.frame(F = seq(max_F), AIC = sapply(mod_res, AIC))
## number of Fourier terms in best model
F_best <- which.min(aic_tbl$AIC)
## forecast from best model (fourier() with h supplied replaces the deprecated fourierf())
fore <- forecast(mod_res[[F_best]], xreg = fourier(ln_dat, K = F_best, h = hrs_out))
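Because the models above are fit to the log series, the forecast comes back on the log scale. A short follow-up sketch, assuming the objects from the code above (and the +1 offset used there):

## back-transform point forecasts to the original count scale
fc_counts <- exp(fore$mean) - 1
plot(fore)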
Related
How to decide the frequency while using the forecast function in R?
I have a series of daily data from 01-01-2014 to 31-01-2022. I want to predict the next 30 days. I am using auto.arima and it has some exogenous variables attached. Here's the code:

datax$NMD1 <- (datax$NMD1/1000000000) # make an ARIMA series out of NMD1; exogenous variables below
ts1 <- ts(datax, frequency = 1)
class(ts1)
colnames(ts1)
autoplot(ts1[,"NMD1"])
# defining the set of exogenous variables
xset <- as.matrix(ts1[,"1Y TD INTEREST RATE"], ts1[,"BSE"], ts1[,"Repo Rate"],
                  ts1[,"MIBOR Rate"], ts1[,"1Y OIS Rate"], ts1[,"3M CD rate(PSU)"],
                  ts1[,"2 Y GSec Rate"])
# fitting the model
model1 <- auto.arima(ts1[,"NMD1"], xreg = xset, approximation = FALSE,
                     allowmean = FALSE, allowdrift = FALSE)
summary(model1)
checkresiduals(model1)
fcast <- forecast(model1, xreg = xset, h = 1)
print(summary(fcast))
autoplot(fcast)

My problems: while my model seems to work fine, I am not able to understand what value of h I should use when forecasting. I also don't understand what frequency really is when we define a time series. Please help.
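As a rough sketch of the distinction (reusing the question's names; future_xset is hypothetical): frequency is the number of observations per seasonal cycle, while h is the number of steps ahead to forecast. When xreg is supplied, forecast() takes the horizon from the number of rows of xreg, so you need future regressor values:

## frequency = observations per seasonal cycle, e.g. 7 for a weekly pattern in daily data
ts1 <- ts(datax$NMD1, frequency = 7)
## h = steps (days) ahead; future_xset must hold 30 FUTURE rows of the regressors
fcast <- forecast(model1, xreg = future_xset, h = 30)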
Auto.Arima incorrectly predicts first point
I'm trying to complete a time series analysis of some reservoir data and am using auto.arima with a Fourier component to account for seasonality, as described here: https://otexts.com/fpp2/dhr.html#dhr The code I have used is shown below, and the dataset I used can be found here: https://www.dropbox.com/sh/563nu3daeid0agb/AAB6NSddVUKgBCCbQtuqXPsZa?dl=0

Reservoir = read.csv("Reservoir1.csv", TRUE, ",")
# impute missing data from data set
Reservoir = imputeTS::na_interpolation(Reservoir)
# create time series
Reservoir = ts(Reservoir[,2], frequency = 365.25, start = c(2013, 116))
plots = list()
for (i in seq(10)) {
  fit = auto.arima(Reservoir, xreg = fourier(Reservoir, K = i), seasonal = FALSE)
  plots[[i]] = autoplot(forecast(fit, xreg = fourier(Reservoir, K = i, h = 10))) +
    xlab(paste("K=", i, "AICC=", round(fit[["aicc"]], 2))) + ylab("")
}
gridExtra::grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]],
                        plots[[6]], plots[[7]], plots[[8]], plots[[9]], plots[[10]],
                        nrow = 5)
bestfit = auto.arima(Reservoir, xreg = fourier(Reservoir, K = 9), seasonal = FALSE)
summary(bestfit)
checkresiduals(bestfit)
plot(Reservoir, col = "red")
lines(fitted(bestfit), col = "blue")

The model fits well, except for the incorrect first prediction. I'm lost as to why only this value would be so far off. Or is this an acceptable error?
The residuals are the one-step forecast errors using all previous observations. At time 1, the residual is the forecast error with no previous observations, so it is simply based on the fitted model. In fact, it is an artificially "good" forecast because the differencing means there is no way for the model to know the location of the data until there is an observation. But the way ARIMA models are implemented in R makes the first prediction use a little more information than it should.
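If that first point distorts error summaries, one small sketch (continuing from bestfit in the question) is to drop the first residual before computing metrics:

res1 <- residuals(bestfit)[-1]    # drop the artificial first residual
sqrt(mean(res1^2, na.rm = TRUE))  # RMSE without the first point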
ARFIMA model and accuracy function
I am forecasting with data sets from the fpp2 package and the forecast package. My intention is to make automatic forecasts for several time series, so for that reason I am forecasting with a function. You can see the code below:

# CODE
library(fpp2)
library(dplyr)
library(forecast)
df <- qauselec
# forecasting function
fct_fun <- function(Z, hrz = forecast_horizon) {
  timeseries <- msts(Z, start = 1956, seasonal.periods = 4)
  forecast <- arfima(timeseries)
}
acc_list <- lapply(X = df, fct_fun)

The next step is to check the accuracy of the model, so I am trying the line of code below:

accurancy_arfima <- lapply(acc_list, accuracy)

Until now this line of code, i.e. the accuracy function, worked perfectly with other models like snaive, ets, etc., but with arfima it doesn't work properly. Can anybody help me resolve this problem with the accuracy function?
Following the R documentation, accuracy returns a range of summary measures of the forecast accuracy. If x is provided, the function measures test-set forecast accuracy based on x-f. If x is not provided, the function only produces training-set accuracy measures of the forecasts based on f["x"]-fitted(f). The usage is:

accuracy(f, x, test = NULL, d = NULL, D = NULL, ...)

So:

accuracy(acc_list[[1]]$fitted, df)

If you want to evaluate accuracy separately, this will work:

a <- c()
for (i in 1:4) {
  b <- accuracy(df[i], acc_list[[1]]$fitted[i])
  a <- rbind(a, b)
}
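A hedged alternative sketch: build a forecast object from the arfima fit and pass that to accuracy(), which is the interface it is documented for (qauselec as in the question):

library(fpp2)
library(forecast)
fit <- arfima(qauselec)
fc <- forecast(fit, h = 8)
accuracy(fc)  # training-set measures from the forecast object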
PLS in R: Predicting new observations returns Fitted values instead
In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.

# first simulate some data
set.seed(123)
bands = 101
data <- data.frame(matrix(runif(56*bands), ncol = bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56, 10)
data$fbm <- rpois(56, 10)
data$nitrogen <- rpois(56, 10)
data$carbon <- rpois(56, 10)
data$chl <- rpois(56, 10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,]  # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation = "LOO",
                 jackknife = TRUE)

It was then identified that a model comprising 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:

predict(refl.pls, ncomp = 3, newdata = valdata)

Sensible as the output may seem, I soon discovered that all this piece of code generates are the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata = is omitted, yields identical results.

predict(refl.pls, ncomp = 3)

Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):

library("pls")
set.seed(123)
bands = 101
spectra = matrix(runif(56*bands), ncol = bands)
DF <- data.frame(spectra = I(spectra), height = rpois(56, 10), fbm = rpois(56, 10),
                 nitrogen = rpois(56, 10), carbon = rpois(56, 10), chl = rpois(56, 10),
                 ID = 1:56)
class(DF$spectra) <- "matrix"  # just to be certain; it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation = "LOO",
                 jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])

Note that I got the spectral data into the data frame as a matrix by protecting it with I, which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok. As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
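As a quick follow-up check on the hold-out predictions (using the objects from the code above):

obs <- DF$height[!DF$train]
pred <- drop(res)  # drop the singleton dimensions of the prediction array
cor(obs, pred)
plot(obs, pred); abline(0, 1)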
Time series modelling: "train" function with method "nnet" is not giving satisfactory result
I was trying to use the train function in R with nnet as the method on monthly consumption data, but the output (the predicted values) all come out equal to some mean value. I have data for 24 time points (each representing a month's data) and I have used the first 20 for training and the remaining 4 for testing the model. Here is my code:

a <- read.csv("...", header = TRUE)
tem <- a[,5]
hum <- a[,4]
con <- a[,3]
require(quantmod)
require(nnet)
require(caret)
y <- con
plot(con, type = "l")
dat <- data.frame(y, x1 = tem, x2 = hum)
names(dat) <- c('y', 'x1', 'x2')
# fit model
model <- train(y ~ x1 + x2, dat[1:20,], method = 'nnet', linout = TRUE, trace = FALSE)
ps <- predict(model, dat[21:24,])
plot(1:24, y, type = "l", col = 2)
lines(1:24, c(y[1:20], ps), col = 3, type = "o")
legend(5, 70, c("y", "pred"), cex = 1.5, fill = 2:3)

Any suggestion on how I can approach this problem differently? Is there any way to use a neural network more effectively? Or is there another method better suited to this?
The problem is likely not enough data. 24 data points is quite low for any machine learning problem. If the curve/shape/surface of the data is, e.g., a simple sine wave, then 24 would be enough. But for any more complex function, the more data the better. Can you accurately model, e.g., sin^2(x) * cos^0.3(x) / sinh(x) with only 6 data points? No, because the available data does not capture enough detail. If you can acquire daily data, use that instead.
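Independent of sample size, two caret settings that often help nnet are input scaling and an explicit tuning grid; a sketch using the question's objects (grid values are illustrative):

model <- train(y ~ x1 + x2, dat[1:20,], method = 'nnet',
               linout = TRUE, trace = FALSE,
               preProcess = c("center", "scale"),
               tuneGrid = expand.grid(size = 1:5, decay = c(0.01, 0.1)))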