How to use previous observations to forecast the next period using for loops in R?

I have simulated 1000 observations from the AR(2) model x_t = γ_1 x_{t-1} + γ_2 x_{t-2} + ε_t.
What I would like to do is to use the first 900 observations to estimate the model, and use the remaining 100 observations to predict one-step ahead.
This is what I have done so far:
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7))) #1000 observations simulated from an AR(2)
arima(data2, order=c(2,0,0), method="ML") #parameters of the model estimated by ML
fit2 <- arima(data2[1:900], order=c(2,0,0), method="ML") #first 900 observations used to estimate the model
predict(fit2, n.ahead=100)
But the problem with my code right now is that it uses n.ahead=100, whereas I would like to use n.ahead=1 and make 100 predictions in total.
I think I need to use a for loop for this, but since I am a very new RStudio user I haven't been able to figure out how to use a for loop to make the predictions. Can anyone help me with this?

If I've understood you correctly, you want one-step predictions on the test set. This should do what you want without loops:
library(forecast)
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))
fit2 <- Arima(data2[1:900], c(2,0,0), method="ML")
fit2a <- Arima(data2[901:1000], model=fit2)
fc <- fitted(fit2a)
The Arima command allows a model to be applied to a new data set without the parameters being re-estimated. Then fitted gives one-step in-sample forecasts.
If you want multi-step forecasts on the test data, you will need to use a loop. Here is an example for two-step ahead forecasts:
fcloop <- numeric(100)
h <- 2
for(i in 1:100)
{
  fit2a <- Arima(data2[1:(899+i)], model=fit2)
  fcloop[i] <- forecast(fit2a, h=h)$mean[h]
}
If you set h <- 1 above you will get almost the same results as using fitted in the previous block of code. The first two values will be different because the approach using fitted does not take account of the data at the end of the training set, while the approach using the loop uses the end of the training set when making the forecasts.
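As a quick check (a sketch assuming the objects data2, fit2, fc and fcloop created above, after re-running the loop with h <- 1), you can line the two sets of one-step forecasts up side by side:
h <- 1
for(i in 1:100)
{
  fit2a <- Arima(data2[1:(899+i)], model=fit2)
  fcloop[i] <- forecast(fit2a, h=h)$mean[h]
}
# only the first couple of values should differ noticeably
head(cbind(via_fitted = as.numeric(fc), via_loop = fcloop))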

Related

How to apply a long list of functions automatically in 10 imputed datasets in R

I have 10 datasets that are the result of multiple imputation, which I named data1, data2, ..., data10. For each of them, I want to:
Create a logistic regression model
Carry out several further steps: create a LASSO model, resample 200 times from the imputed dataset, recreate the LASSO model in each resample, and evaluate measures of performance.
I'm able to do this separately for each dataset, but I was wondering if there is a way to do all of the steps automatically for each imputed dataset. Below, I have included an example of all the steps I perform to get results separately for each imputation.
To do it automatically, I first thought about using lapply to create regressions for every imputation:
log01.1 <- lapply(paste0("data",1:10), function(x){lrm(y ~ x1 + x2 + x3, data=eval(parse(text = x)), x=T, y=T)})
Then I wanted to use lapply again on the whole block of code below with something like :
lapply(log01.1, function(x){ ...all the steps following the regression... })
But I realized this doesn't work, since lapply can only apply one function at a time (as I understand it). Also, at model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1, lambda=cv.glmmod$lambda.1se, family="binomial")
it wouldn't work, because my lambda would come from a list, and I can't use lapply on both log01.1 and cv.glmmod at the same time. Add to that the resampling with the 200 repetitions, and I'm sure I would run into other problems I can't even think of right now.
And that's about the extent of my knowledge of lapply and other functions that could do similar things. Is there a way to take the chunk of code I wrote below and tell R to repeat it for each of my 10 imputations, then store the objects that are created in separate lists? Or maybe not in lists, but so that I would get, for example, App1, App2, App3, etc.?
Or am I better off just repeating it 10 times and storing the results?
log01.1 <- lrm(y ~ x1 + x2 + x3, data=data1, x=T, y=T)
reps <- 200; App <- numeric(reps); Test <- numeric(reps)
for(i in 1:reps){
#1.Construct LASSO model in sample i
cv.glmmod <- cv.glmnet(x=log01.1$x, y=log01.1$y, alpha=1, family="binomial")
model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1,
lambda=cv.glmmod$lambda.1se, family="binomial") #use optimum penalty
lp1 <- log01.1$x %*% model.L1$beta #for apparent performance
#2. Draw bootstrap sample with replacement from sample i
j <- sample(nrow(data1), replace=T) #for sample Bi
#3. Construct a model in sample Bi replaying every step that was done in the imputed sample
#I, especially model specification steps such as selection of predictors.
#Determine the bootstrap performance as the apparent performance in sample Bi.
#3 Construct LASSO model in sample i replaying every step done in imputed sample i
cv.j <- cv.glmnet (x=log01.1$x[j,], y=log01.1$y[j,], alpha = 1, family="binomial")
model.L1j <- glmnet (x=log01.1$x[j,], y=log01.1$y[j,], alpha=1,
lambda=cv.j$lambda.1se, family="binomial") #use optimum penalty for Bi
lp1j <- log01.1$x[j,] %*% model.L1j$beta #apparent performance in Bi
App[i] <- lrm.fit(y=log01.1$y[j,], x=lp1j)$stats[6] #apparent c for Bi
#4. Apply model from Bi to the original sample i without any modification to determine the test performance
lp1 <- log01.1$x %*% model.L1j$beta #Validated performance in I
Test[i] <- lrm.fit(y=log01.1$y, x=lp1)$stats[6]} #Test c in I
That is the code I would like to repeat automatically for every imputed set.
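One possible direction (a sketch, not a full solution): wrap the whole block in a function that takes a single data frame and returns the objects you need, then lapply that function over a list of the imputed data sets. The names run_validation and imputed_list below are made up, the code is lightly adapted from the block above, and it assumes the response stored by lrm (log01$y) is a plain vector.
library(rms)
library(glmnet)
# hypothetical wrapper: everything done for one imputed data set goes in here
run_validation <- function(dat, reps = 200){
  log01 <- lrm(y ~ x1 + x2 + x3, data = dat, x = TRUE, y = TRUE)
  App <- Test <- numeric(reps)
  for(i in 1:reps){
    # LASSO in the imputed sample, with the optimum penalty from cross-validation
    cv.glmmod <- cv.glmnet(x = log01$x, y = log01$y, alpha = 1, family = "binomial")
    model.L1  <- glmnet(x = log01$x, y = log01$y, alpha = 1,
                        lambda = cv.glmmod$lambda.1se, family = "binomial")
    # bootstrap sample Bi, replaying the same modelling steps
    j <- sample(nrow(dat), replace = TRUE)
    cv.j      <- cv.glmnet(x = log01$x[j, ], y = log01$y[j], alpha = 1, family = "binomial")
    model.L1j <- glmnet(x = log01$x[j, ], y = log01$y[j], alpha = 1,
                        lambda = cv.j$lambda.1se, family = "binomial")
    lp1j   <- log01$x[j, ] %*% model.L1j$beta
    App[i] <- lrm.fit(y = log01$y[j], x = lp1j)$stats[6]   # apparent c in Bi
    # apply the Bi model to the full imputed sample without modification
    lp1     <- log01$x %*% model.L1j$beta
    Test[i] <- lrm.fit(y = log01$y, x = lp1)$stats[6]      # test c in the imputed sample
  }
  list(fit = log01, App = App, Test = Test)
}
# data1 ... data10 are the imputed data frames named in the question
imputed_list <- mget(paste0("data", 1:10))
results <- lapply(imputed_list, run_validation)
# e.g. results[[3]]$App holds the 200 apparent c-statistics for imputation 3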

Bootstrapping regression coefficients from random subsets of data

I’m attempting to perform a regression calibration on two variables using the york() function in the IsoplotR package. I would like to estimate the confidence interval of the bootstrapped slope coefficient from this model; however, instead of using the typical bootstrap method below, I’d like to perform the iterations on only 75% of the data (randomly selected) at a time. So far, using the following sample data, I managed to bootstrap the slope coefficient returned by the york() function:
library(boot)
library(IsoplotR)
X <- c(9.105,8.987,8.974,8.994,8.996,8.966,9.035,9.215,9.239,
9.307,9.227,9.17, 9.102)
Y <- c(28.1,28.9,29.6,29.5,29.0,28.8,28.5,27.3,27.1,26.5,
27.0,27.5,28.4)
n <- length(X)
sX <- X*0.02
sY <- Y*0.05
rXY <- rep(0.8,n)
dat <- cbind(X,sX,Y,sY,rXY)
fit <- york(dat)
boot.test <- function(data, indices){
  sample <- data[indices,]
  mod <- york(sample)
  return(mod$b)
}
result <- boot(data=dat, statistic = boot.test, R=1000)
boot.ci(result, type = 'bca')
...but I'm not really sure where to go from here. Any help to move me in the right direction would be greatly appreciated. I'm new to R, so I apologize if the question is ambiguous. Thanks.
Based on the package documentation, you should be able to use the ran.gen argument, together with sim="parametric", to draw each sample with a custom function; here, the sample is a certain percentage of the total observations, chosen at random. Note that with sim="parametric", boot passes the value of mle as the second argument to ran.gen and calls the statistic on the generated data alone (no index vector), so the statistic needs a one-argument form. Something like the following should accomplish what you want:
boot.test.p <- function(data){ #parametric version: receives the resampled data only
  york(data)$b
}
result <- boot(
  data = dat,
  statistic = boot.test.p,
  R = 1000,
  sim = "parametric",
  ran.gen = function(data, percent){
    n <- nrow(data)
    indic <- runif(n)
    #keep a random subset of round(n*percent) rows
    data[rank(indic, ties.method="random") <= round(n*percent, 0), ]
  },
  mle = 0.75) #passed to ran.gen as its second argument (the sampling fraction)
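One caveat: with sim="parametric" the BCa interval requested earlier may not be available, because boot.ci needs empirical influence values that cannot be derived from a parametric resampling scheme; a percentile interval is a safer fallback here:
boot.ci(result, type = "perc")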

How to create a sliding window in R to divide data into test and train samples to test accuracy of forecasts?

We are using the forecast package in R to read 3 weeks worth of hourly data (3*7*24 data points) and make predictions for the next 24 hours. It's a time-series with multiple seasonality.
We have the forecast model running just fine and it seems to be doing well. Now we wish to quantify the accuracy of our approach / forecasting algorithm for our data. We wish to use the accuracy function in the forecast package for this purpose: if f is the forecast and x is the vector of actual observations, then accuracy(f, x) gives us several accuracy measures for that forecast.
We have data from the past several months and we wish to write a sliding-window algorithm that picks 3*7*24 hourly values, predicts the next 24 hours, compares those predictions against the actual data for that day, reports the accuracy, then slides the window forward by 24 hours (one day) and repeats.
The sample data is generated as follows:
library("forecast")
time <- 1:(12*168)
set.seed(1)
ds <- msts(sin(2*pi*time/24)+c(1,1,1.2,0.8,1,0,0)[((time-1)%/%24)%%7+1]+ time/400+rnorm(length(time),0,0.2),seasonal.periods=c(24,168))
plot(ds)
head(ds)
tail(ds)
length(ds)
length(time)
Forecasting procedure is as follows:
model <- tbats(ds[1:504])
fcst <- forecast(model,h=24,level=90)
accuracy(fcst,ds[505:528]) ##Test accuracy of forecast against next/actual 24 values
Now, we wish to slide the "window" by 24 and repeat the same procedure, that is, the next set of values used to build the model will be ds[25:528] and their accuracy will be tested against ds[529:552] ... and so on. How can we implement this?
Also, is there a better way to test the overall accuracy of this forecasting algorithm for our scenario?
I would do this by creating a vector of times representing the front edge of the sliding windows, then using lapply to iterate the forecasting and scoring process over the windows those edges imply. Like...
# set a couple of parameters we'll use to slice the series into chunks:
# window width (w) and the time step at which you want to end the first
# training set
w = 24 ; start = 504
# now use those parameters to make a vector of the time steps at which each
# window will end
steps <- seq(start + w, length(ds), by = w)
# using lapply, iterate the forecasting-and-scoring process over the
# windows that vector defines
cv_list <- lapply(steps, function(x) {
  train <- ds[1:(x - w)]
  test <- ds[(x - w + 1):x]
  model <- tbats(train)
  fcst <- forecast(model, h = w, level = 90)
  accuracy(fcst, test)
})
Example output for the first window:
> cv_list[[1]]
ME RMSE MAE MPE MAPE MASE
Training set 0.0001587681 0.3442898 0.2689754 34.3957362 84.30841 0.9560206
Test set 0.2619029897 0.8961109 0.7868256 -0.6832273 36.64301 2.7966186
ACF1
Training set 0.02588145
Test set NA
If you want summaries of the scores for the whole list, you can do something like...
rmse <- mean(unlist(lapply(cv_list, '[[', "Test set","RMSE")))
...which produces this:
[1] 1.011177
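If you want more of the accuracy measures than just the RMSE, one option (a small sketch reusing cv_list from above) is to stack the test-set row of each accuracy matrix and average every column at once:
# one row of test-set accuracy measures per window
test_scores <- do.call(rbind, lapply(cv_list, function(a) a["Test set", , drop = FALSE]))
colMeans(test_scores, na.rm = TRUE)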

Feature selection + cross-validation, but how to make ROC-curves in R

I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then I build an SVM model with the training set and the selected features.
So at the end, I end up with 10 different models (because of the feature selection). But now I want to make an ROC curve in R for this filter method in general. How can I do this?
Silke
You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])
all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test = aSAH[indices==i,]
  learn = aSAH[indices!=i,]
  model <- glm(as.numeric(outcome)-1 ~ s100b + ndka + as.numeric(wfns), data = learn, family=binomial(link = "logit"))
  model.pred <- predict(model, newdata=test)
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The ROC curve is built from all.response and all.predictor, which are updated at each step. The code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large; however, the small samples within the cross-validation may lead to an underestimated AUC, as the ROC curve built on all the data will tend to be smoother and less underestimated by the trapezoidal rule.
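If you also want to plot the pooled curve and get a confidence interval for its AUC (a small addition on top of the code above, using the same pROC functions), something like this works:
pooled <- roc(all.response, all.predictor)
plot(pooled, print.auc = TRUE)  # cross-validated ROC curve from the pooled predictions
ci.auc(pooled)                  # confidence interval for the pooled AUC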

plot multiple fit and predictions for logistic regression

I am running a logistic regression multiple times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example, I want to fit and predict the model 5 times; each time I take one sample for the prediction and another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create a vector replacing the League (A or N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
  sp <- sample(nrow(dataset), 30, replace=FALSE)
  fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
                   family="binomial", data=dataset[sp[1:15],])
  preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
  truths[[k]] <- dataset$League[sp[16:30]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
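If you also want a single summary number alongside the plot (a sketch reusing the truths and preds lists built above), the pooled held-out predictions can be scored with pROC:
library(pROC)
# AUC over the pooled held-out predictions from the five repetitions
roc(unlist(truths), unlist(preds))$auc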
