Forecasting with ARIMA and xreg in R

I'm having trouble with my first forecasting implementation in R. What I'd like to achieve is to predict the variable Y with 2 exogenous variables X1 and X2. The 3 datasets are each represented as a single column with 12 rows.
I followed an approach similar to one from another Stack Overflow post:
DataSample <- data.frame(Y=Y[,1],Month=rep(1:12,1),
X1=X1[,1],X2=X2[,1])
predictor_matrix <- cbind(Month=model.matrix(~as.factor(DataSample$Month)),
X1=DataSample$X1,
X2=DataSample$X2)
# Remove intercept
predictor_matrix <- predictor_matrix[,-1]
# Rename columns
colnames(predictor_matrix) <- c("January","February","March","April","May","June","July","August","September","October","November","X1","X2")
# Variable to be modeled
var <- ts(DataSample$Y, frequency=12)
#Find ARIMA
modArima <- auto.arima(var, xreg = predictor_matrix)
At this line I get the following error:
Error in optim(init[mask], armaCSS, method = optim.method, hessian =
FALSE, : non-finite value supplied by optim
I presume that my predictor_matrix is not in the correct format but I can't find the error.
Any help would be appreciated,

You say each dataset has 12 rows, yet your predictor matrix has 13 columns (11 monthly dummy variables plus X1 and X2). With more columns than observations, the columns are necessarily linearly dependent, so the optimization procedure fails.
You need (ideally much) more data to support the number of predictor variables and/or a sparser set of predictors.
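To make the dependence concrete, here is a minimal base R sketch. The values of X1 and X2 are simulated stand-ins (the real data are not shown in the question), so only the shape of the matrix matches the original problem:

```r
set.seed(123)
n <- 12                                   # one observation per month
month <- factor(1:n)

# 11 dummy columns (the first level is absorbed by the intercept, which is dropped)
dummies <- model.matrix(~ month)[, -1]

# Simulated placeholders for the two exogenous variables
X <- cbind(dummies, X1 = rnorm(n), X2 = rnorm(n))

dim(X)                                    # 12 rows, 13 columns
rank_full <- qr(X)$rank                   # can never exceed the 12 rows

# A sparser alternative: keep only the two exogenous variables
X_small <- X[, c("X1", "X2")]
rank_small <- qr(X_small)$rank            # equals the number of columns
```

Because the rank cannot exceed the number of rows, at least one of the 13 columns is redundant. Dropping the month dummies, or collecting several years of data, removes the dependence.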

Related

Plot how the estimated survival depends upon the value of a covariate of interest. Problems with relevel

I want to plot how the estimated survival from a Cox model depends upon the value of a covariate of interest, while the rest of the variables are fixed to their average values (if they are continuous) or to their lowest values (if they are dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates held fixed. Two of these covariates are factors. The new data frame is then passed to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the following error message: Error in relevel.default(occupation) : 'relevel' only for factors. What is the source of the problem? If it is related to the factor variables, how can I solve it? Example code is below. Unfortunately, I cannot share the data or find a dataset that produces the same error message. What I have tried so far:
I transformed the factor variables into integer vectors in both the Cox model and the new dataset. It did not work.
I deleted all the factor variables, and then it works.
I tried to implement the strategy from this question, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor, so please post at least str(data) as an edit to the body of your question. You should also realize that a single value given for a column in a data.frame call is recycled to the correct length, so all the means could be entered as single items rather than `rep`-ing them.
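The recycling point, and one plausible reading of the error, can be sketched in base R as follows. The data frame is a simulated stand-in (the asker's data were not shared), and treating occupation as a factor with levels 1 to 6 is an assumption:

```r
set.seed(1)
# Hypothetical stand-in for the asker's data
data <- data.frame(
  age = rnorm(20, mean = 40, sd = 10),
  occupation = factor(sample(1:6, 20, replace = TRUE), levels = 1:6)
)

# Single values are recycled to the length of the longest column,
# so rep() is not needed for the fixed covariates
data_rank <- data.frame(
  status_plot = c(1, 2, 3),
  age = mean(data$age, na.rm = TRUE),
  occupation = factor(5, levels = levels(data$occupation))
)

# relevel() rejects a plain numeric vector such as c(5, 5, 5) ...
res <- tryCatch(relevel(c(5, 5, 5), 5),
                error = function(e) conditionMessage(e))

# ... but accepts a factor that carries the same levels as the model data
ref_level <- levels(relevel(data_rank$occupation, "5"))[1]
```

If the formula calls relevel(occupation, 5), supplying occupation in newdata as a factor with the original levels, rather than as numbers, is therefore worth trying.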

arima model for multiple seasonalities in R

I'm learning to create a forecasting model for a time series that has multiple seasonalities. The dataset contains hourly data points, and I wish to include daily as well as weekly seasonality in my ARIMA model. Following is the subset of the dataset that I'm referring to:
data= c(4,4,1,2,6,21,105,257,291,172,72,10,35,42,77,72,133,192,122,59,29,25,24,5,7,3,3,0,7,15,91,230,284,147,67,53,54,55,63,73,114,154,137,57,27,31,25,11,4,4,4,2,7,18,68,218,251,131,71,43,55,62,63,80,120,144,107,42,27,11,10,16,8,10,7,1,4,3,12,17,58,59,68,76,91,95,89,115,107,107,41,40,25,18,14,15,6,12,2,4,1,6,9,14,43,67,67,94,100,129,126,122,132,118,68,26,19,12,9,5,4,2,5,1,3,16,89,233,304,174,53,55,53,52,59,92,117,214,139,73,37,28,15,11,8,1,2,5,4,22,103,258,317,163,58,29,37,46,54,62,95,197,152,58,32,30,17,9,8,1,3,1,3,16,109,245,302,156,53,34,47,46,54,65,102,155,116,51,30,24,17,10,7,4,8,0,11,0,2,225,282,141,4,87,44,60,52,74,135,157,113,57,44,26,29,17,8,7,4,4,2,10,57,125,182,100,33,27,41,39,35,50,69,92,66,30,11,10,11,9,6,5,10,4,1,7,9,17,24,21,29,28,48,38,30,21,26,25,35,10,9,4,4,4,3,5,4,4,4,3,5,10,16,28,47,63,40,49,28,22,18,27,18,10,5,8,7,3,2,2,4,1,4,19,59,167,235,130,57,45,46,42,40,49,64,96,54,27,17,18,15,7,6,2,3,1,2,21,88,187,253,130,77,47,49,48,53,77,109,147,109,45,41,35,16,13)
The code I'm trying to use is the following:
tsdata = ts (data, frequency = 24)
aicvalstemp = NULL
aicvals= NULL
for (i in 1:5) {
for (j in 1:5) {
xreg1 = fourier(tsdata,i,24)
xreg2 = fourier(tsdata,j,168)
xregs = cbind(xreg1,xreg2)
armodel = auto.arima(bike_TS_west, xreg = xregs)
aicvalstemp = cbind(i,j,armodel$aic)
aicvals = rbind(aicvals,aicvalstemp)
}
}
The cbind call in the above code fails because xreg1 and xreg2 have different numbers of rows. I even tried passing 1:length(data) to the fourier function, but that also gave me an error. If someone can correct the mistakes in the above code to produce a forecast of the next 24 hours using the ARIMA model with the minimum AIC value, it would be really helpful. It would also be great if you could include data splitting, creating training and testing sets. Thanks for your help.
I don't understand the desire to fit a weekly "season" to these data, as there is no evidence for one in the data subset you provided. Also, you should really log-transform the data because, as is, they do not reflect a Gaussian process.
So, here's how you could fit models with some form of hourly signal.
## load the forecast package (auto.arima, fourier, forecast)
library(forecast)
## the data are not normal, so log transform to meet assumption of Gaussian errors
## (log1p = log(1 + x) is used because the series contains zeros)
ln_dat <- log1p(tsdata)
## number of hours to forecast
hrs_out <- 24
## max number of Fourier terms
max_F <- 5
## empty list for model fits
mod_res <- vector("list", max_F)
## fit models with increasing Fourier terms (to the transformed series)
for (i in 1:max_F) {
  xreg <- fourier(ln_dat, K = i)
  mod_res[[i]] <- auto.arima(ln_dat, xreg = xreg)
}
## table of AIC results
aic_tbl <- data.frame(F = seq(max_F), AIC = sapply(mod_res, AIC))
## number of Fourier terms in best model
F_best <- which.min(aic_tbl$AIC)
## forecast from best model (on the log1p scale; back-transform with expm1)
fore <- forecast(mod_res[[F_best]], xreg = fourier(ln_dat, K = F_best, h = hrs_out))
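If both daily and weekly terms are wanted anyway, the regressors have to be built on the same time index so that they have the same number of rows. The sketch below does this by hand in base R. Note that fourier_terms() is a hypothetical helper written for illustration, not a function from the forecast package; the series length n and the choices K = 2 and K = 3 are placeholders. It also shows one way to hold out the final 24 hours as a test set:

```r
# Hypothetical helper: K sine/cosine pairs for one seasonal period,
# evaluated at the integer time points in t
fourier_terms <- function(t, period, K) {
  cols <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))
  })
  out <- do.call(cbind, cols)
  colnames(out) <- paste0(rep(c("S", "C"), times = K),
                          rep(seq_len(K), each = 2), "_", period)
  out
}

n <- 360        # placeholder length; use length(data) with the real series
h <- 24         # hours held out for testing
t_all <- seq_len(n)

# Daily (24) and weekly (168) terms share the same index, so cbind() works
xregs <- cbind(fourier_terms(t_all, 24, K = 2),
               fourier_terms(t_all, 168, K = 3))

# Split into training rows and the final h rows
train_idx <- seq_len(n - h)
test_idx <- (n - h + 1):n
xreg_train <- xregs[train_idx, ]
xreg_test <- xregs[test_idx, ]
```

xreg_train could then be passed as xreg when fitting on the training part of the series, and xreg_test as xreg when forecasting. As for the original failure: in recent versions of forecast the third positional argument of fourier() appears to be the forecast horizon h rather than a period, which would be consistent with the differing row counts described in the question.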

MICE imputation even when same data in column

Is it possible to get an imputation using the package MICE even when all the values in a column are the same? In that case it should simply impute that value.
Example:
test<-data.frame(var1=c(2.3,2.3,2.3,2.3,2.3,NA),var2=c(5.3,5.6,5.9,6.4,4.5,NA))
miceImp<-mice(test)
testImp<-complete(miceImp)
This only imputes var2. I would like it to replace the NA in var1 with 2.3 as well.
You can use passive imputation for this. For a full explanation, see section 3.4 on page 25 of this article. As applied to constant variables, the objective here would be to set the imputation method for any constant variable x to the constant value of x. If the constant value of x is y, then the imputation method for x should be "~I(y)".
library(mice)
test = data.frame(
var1=c(2.3,2.3,2.3,2.3,2.3,NA,2.3),
var2=c(5.3,5.6,5.9,6.4,4.5,5.1,NA),
var3=c(NA,1:6))
cVars = which(sapply(test,sd,na.rm=TRUE)==0) #determine which vars are constant (props to SimonG)
allMeans = colMeans(test,na.rm=TRUE) #get the column means
miceImp.ini = mice(test,maxit=0,printFlag=FALSE) #initial mids object with no imputations
meth = miceImp.ini$method #extract the imputation method vector
meth[cVars] = paste0("~I(",allMeans[cVars],")") #set the imputation method to be a constant (the current column mean)
miceImp = mice(test,method=meth) #run the imputation with the user defined imputation methods
testImp = complete(miceImp) #extract a completed dataset
View(testImp) #take a look at it
All that being said, constant values tend not to be of great use in statistics, so it might be more efficient to drop any constant variables before imputation (since imputation is such a costly process).
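If a constant column has to be kept, one alternative is to fill its missing entries directly, so that the imputation model never has to deal with it. This is a base R sketch and not part of the mice API; it assumes that "constant" means exactly one distinct non-missing value:

```r
test <- data.frame(
  var1 = c(2.3, 2.3, 2.3, 2.3, 2.3, NA),
  var2 = c(5.3, 5.6, 5.9, 6.4, 4.5, NA)
)

# TRUE for columns with exactly one distinct non-missing value
is_const <- vapply(test, function(x) length(unique(x[!is.na(x)])) == 1,
                   logical(1))

# Fill the NAs in each constant column with that single value
for (v in names(test)[is_const]) {
  const_value <- unique(test[[v]][!is.na(test[[v]])])
  test[[v]][is.na(test[[v]])] <- const_value
}
```

After this step only var2 still contains a missing value, so it is the only column left for mice to impute.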

Multiple regression predicting using R, predicting a data.frame

I have been given data in a data.frame called petrol which has 125 rows and the following columns:
hydrcarb, tanktemp, disptemp, tankpres, disppres, sqrtankpres, sqrdisppres
I have been asked to delete the last 25 rows from petrol, fit the model where hydrcarb is the response variable and the rest are the explanatory variables, and to do this for the first 100 rows. Then use the fitted model to predict for the remaining 25.
This is what I have done so far:
#make a new table that only contains first 100
petrold <- petrol[-101:-125,]
petrold
#FITTING THE MODEL
petrol.lmB <- lm(hydrcarb~ tanktemp + disptemp + tankpres + disppres + sqrtankpres + sqrdisppres, data=petrol)
#SELECT LAST 25 ROWS FROM PETROL
last25rows <-petrol[101:125,c('tanktemp','disptemp','tankpres','disppres','sqrtankpres','sqrdisppres')]
#PREDICT LAST 25 ROWS
predict(petrold,last25rows[101,c('tanktemp','disptemp','tankpres','disppres','sqrtankpres','sqrdisppres')])
I know I have done something wrong for my predict command since R gives me the error message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "data.frame"
So I am not sure how to get predicted values for hydrcarb for 25 different sets of data.
Alex A. already pointed out that predict expects a model as its first argument. In addition, you should pass predict all the rows you want to predict at once. Besides, I recommend that you subset your data frame "on the fly" instead of creating unnecessary copies. Lastly, there's a shorter way to write the formula you pass to lm:
# data for example
data(Seatbelts)
petrol <- as.data.frame(Seatbelts[1:125, 1:7])
colnames(petrol) <- c("hydrcarb", "tanktemp", "disptemp", "tankpres", "disppres", "sqrtankpres", "sqrdisppres")
# fit model using observations 1:100
petrol.lmB <- lm(hydrcarb ~ ., data = petrol[1:100,])
#predict last 25 rows
predict(petrol.lmB, newdata = petrol[101:125,])
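Because the last 25 rows also contain the observed response, the predictions can be compared with the actual values. The following sketch reuses the Seatbelts stand-in data from the example above (not the real petrol data) and computes a root-mean-square error; asking for prediction intervals is optional:

```r
# Stand-in data, renamed to match the column names in the question
data(Seatbelts)
petrol <- as.data.frame(Seatbelts[1:125, 1:7])
colnames(petrol) <- c("hydrcarb", "tanktemp", "disptemp", "tankpres",
                      "disppres", "sqrtankpres", "sqrdisppres")

train <- petrol[1:100, ]
test <- petrol[101:125, ]

# Fit on the first 100 rows only
fit <- lm(hydrcarb ~ ., data = train)

# Point predictions plus 95% prediction intervals for the held-out rows
pred <- predict(fit, newdata = test, interval = "prediction")

# Root-mean-square error against the observed response
rmse <- sqrt(mean((test$hydrcarb - pred[, "fit"])^2))
```

pred is a matrix with the columns fit, lwr and upr, one row per held-out observation.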

Forecasting with `tslm` returning dimension error

I'm having a problem similar to the one the questioners in the posts below had with the linear model predict function, but I am trying to use the "time series linear model" function (tslm) from Rob Hyndman's forecast package.
Predict.lm in R fails to recognize newdata
predict.lm with newdata
totalConv <- ts(varData[,43])
metaSearch <- ts(varData[,45])
PPCBrand <- ts(varData[,38])
PPCGeneric <- ts(varData[,34])
PPCLocation <- ts(varData[,35])
brandDisplay <- ts(varData[,29])
standardDisplay <- ts(varData[,3])
TV <- ts(varData[,2])
richMedia <- ts(varData[,46])
df.HA <- data.frame(totalConv, metaSearch,
PPCBrand, PPCGeneric, PPCLocation,
brandDisplay, standardDisplay,
TV, richMedia)
As you can see I've tried to avoid the names issues by creating a data frame of the time series objects.
However, I then fit a tslm object (time series linear model) as follows -
fit1 <- tslm(totalConv ~ metaSearch
+ PPCBrand + PPCGeneric + PPCLocation
+ brandDisplay + standardDisplay
+ TV + richMedia, data = df.HA
)
Despite having created a data frame and named all the objects properly I get the same dimension error as these other users have experienced.
Error in forecast.lm(fit1) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 696 rows
2: 'newdata' had 10 rows but variables found have 696 rows
The model frame seems to give sensible names to all of the variables, so I don't know what is wrong with the forecast function:
names(model.frame(fit1))
[1] "totalConv" "metaSearch" "PPCBrand" "PPCGeneric" "PPCLocation" "brandDisplay"
[7] "standardDisplay" "TV" "richMedia"
Can anyone suggest any other improvements to my model specification that might help the forecast function to run?
EDIT 1: Ok, just so there's a working example, I've used the data given in Irsal's answer to this question (converting to time series objects) and then fitted the tslm. I get the same error (different dimensions obviously):-
Is there an easy way to revert a forecast back into a time series for plotting?
I'm really confused about what I'm doing wrong, my code looks identical to that used in all of the examples on this....
data <- c(11,53,50,53,57,69,70,65,64,66,66,64,61,65,69,61,67,71,74,71,77,75,85,88,95,
93,96,89,95,98,110,134,127,132,107,94,79,72,68,72,70,66,62,62,60,59,61,67,
74,87,112,134,51,50,38,40,44,54,52,51,48,50,49,49,48,57,52,53,50,50,55,50,
55,60,65,67,75,66,65,65,69,72,93,137,125,110,93,72,61,55,51,52,50,46,46,45,
48,44,45,53,55,65,89,112,38,7,39,35,37,41,51,53,57,52,57,51,52,49,48,48,51,
54,48,50,50,53,56,64,71,74,66,69,71,75,84,93,107,111,112,90,75,62,53,51,52,
51,49,48,49,52,50,50,59,58,69,95,148,49,83,40,40,40,53,57,54,52,56,53,55,
55,51,54,45,49,46,52,49,50,57,58,63,73,66,63,72,72,71,77,105,97,104,85,73,
66,55,52,50,52,48,48,46,48,53,49,58,56,72,84,124,76,4,40,39,36,38,48,55,49,
51,48,46,46,47,44,44,45,43,48,46,45,50,50,56,62,53,62,63)
data2 <- c(rnorm(237))
library(forecast)
nData <- ts(data)
nData2 <- ts(data2)
dat.ts <- tslm(nData~nData2)
forecast(dat.ts)
Error in forecast.lm(dat.ts) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 237 rows
2: 'newdata' had 10 rows but variables found have 237 rows
EDIT 2: Same error even if I combine both series into a data frame.
nData.df <- data.frame(nData, nData2)
dat.ts <- tslm(nData~nData2, data = nData.df)
forecast(dat.ts)
tslm fits a linear regression model. You need to provide the future values of the explanatory variables if you want to forecast. These should be provided via the newdata argument of forecast.lm.
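A minimal sketch of what that looks like, using simulated series rather than the data from the question. The future values of the predictor are random placeholders here; in practice they have to be known in advance or forecast separately:

```r
library(forecast)

set.seed(1)
y <- ts(rnorm(100, mean = 60, sd = 10))   # simulated response
x <- ts(rnorm(100))                       # simulated explanatory variable

fit <- tslm(y ~ x)

# Future values of x, in a data frame whose column name matches the
# variable name used in the model formula
future_x <- data.frame(x = rnorm(10))

# One forecast per row of newdata
fc <- forecast(fit, newdata = future_x)
```

The column names in newdata must match the variable names in the formula; otherwise forecast cannot find the predictors and reports that the variables were not found in newdata.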

Resources