Random Forest - Caret - Time Series - r

I have a time series (Apple stock closing prices) turned into a data frame to fit a random forest using caret. I lagged it by 1 day, 2 days and 6 days, and I want to predict the next 2 days: a two-step-ahead forecast. But caret uses the predict function, which does not accept the arguments that the forecast function has. I have seen that some people try to pass the argument n.ahead, but that is not working for me. Any advice? See the code:
df <- data.frame(APPL)
df$f1 <- lag(df$APPL, 1)
df$f2 <- lag(df$APPL, 2)
df$f3 <- lag(df$APPL, 6)
# change column names
colnames(df)<-c("price", "price_1", "price_2", "price_6")
# remove rows (days) with NA.
df<-df[complete.cases(df),]
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = FALSE,
  verboseIter = TRUE,
  preProcOptions = list(thresh = 0.95, na.remove = TRUE, verbose = TRUE))
set.seed(1234)
rf_grid <- expand.grid(mtry = 1:3)
fit <- train(price ~ .,
             data = df,
             method = "rf",
             preProcess = c("center", "scale"),
             tuneGrid = rf_grid,
             trControl = fitControl,
             ntree = 200,
             metric = "RMSE")
nextday <- predict(fit,`WHAT GOES HERE?`)
If I just call predict(fit), it uses the whole dataset as newdata, which I think is wrong. The other thing I was thinking about is a loop: predict one step ahead, since I have the data from 1, 2 and 6 days ago, and then, for the two-step-ahead forecast, fill the "1 day ago" cell with the forecast I made before.

Right now, you can't pass other options to the underlying predict method. There is a proposed change that might enable this though.
In your case, you should give the predict function a data frame that has the appropriate predictors for the next few observations.
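For example, a minimal sketch of the iterative two-step approach from the question (the column names follow the renaming above, and the indexing assumes the rows of df are consecutive days in chronological order):
# most recent observed prices (last row of df is the latest day)
last <- tail(df, 1)
# step 1: one day ahead, built from observed lags
new1 <- data.frame(price_1 = last$price,
                   price_2 = last$price_1,
                   price_6 = df$price[nrow(df) - 5])
pred1 <- predict(fit, newdata = new1)
# step 2: two days ahead, reusing the step-1 forecast as the new lag-1 value
new2 <- data.frame(price_1 = pred1,
                   price_2 = last$price,
                   price_6 = df$price[nrow(df) - 4])
pred2 <- predict(fit, newdata = new2)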

#1: colnames(df) <- c("price", "price_1", "price_2", "price_6"), i.e. rename the columns after creating the lag variables.
#2: predict {stats} is a generic function for predictions from the results of various model-fitting functions:
predict(model object, dataframe)
There are three cases for the dataframe argument:
case 1: the train data on which the model was fitted: in-sample prediction
case 2: test data: out-of-sample prediction
case 3: forecasted data, i.e. forecasted values of the independent variables: the model then returns the forecasted values of the dependent variable
The column names in cases 2 and 3 must match the column names of the train data.
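As a quick illustration of cases 2 and 3 with the model above, newdata just needs the three lag columns under the same names (the numeric values here are made up for illustration only):
# hypothetical lag values, only to show the required column names
new_obs <- data.frame(price_1 = 171.2, price_2 = 170.8, price_6 = 168.5)
predict(fit, newdata = new_obs)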

Related

Implementation of time series cross-validation

I am working with time series 551 of the monthly data of the M3 competition.
So, my data is :
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which is the time series I must evaluate?
I'm quite confused; any help to clear this up would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is applied successively to the time series y[1:t], for t = 1, ..., T-h, making predictions f[t+h]. The errors are given by e[t+h] = y[t+h] - f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on, for T-h steps.
The tsCV function returns the forecast errors.
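In other words, tsCV is conceptually a loop like the following simplified sketch (window, xreg and initial handling omitted; this is only meant to make the mechanism concrete, not to mirror the actual implementation):
manual_tsCV <- function(y, forecastfunction, h = 1) {
  n <- length(y)
  e <- rep(NA_real_, n)
  for (t in seq_len(n - h)) {
    # fit on y[1:t] and forecast h steps ahead
    fc <- try(forecastfunction(window(y, end = time(y)[t]), h = h), silent = TRUE)
    if (!inherits(fc, "try-error"))
      e[t + h] <- y[t + h] - fc$mean[h]
  }
  e
}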
Applying this to the training data of the ts.data
# function to fit a model and forecast
fmodel <- function(x, h){
  forecast(Arima(x, order=c(1,1,1), seasonal = c(0, 0, 2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to (a hand-rolled sketch of this loop follows below):
fit a model to ts.data$x and then forecast ts.data$xx[1],
fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on.
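A minimal hand-rolled sketch of that rolling-origin loop over the 18 hold-out values (ts.data$xx), reusing the ARIMA specification from the example above; whether you roll over ts.data$xx or over the last 18 points of ts.data$x depends on how the exercise defines the in-sample interval:
y_all <- ts(c(ts.data$x, ts.data$xx),
            start = start(ts.data$x), frequency = frequency(ts.data$x))
n_in  <- length(ts.data$x)
errs  <- rep(NA_real_, length(ts.data$xx))
for (i in seq_along(ts.data$xx)) {
  # expand the training window by one observation each step
  fit_i <- Arima(window(y_all, end = time(y_all)[n_in + i - 1]),
                 order = c(1, 1, 1), seasonal = c(0, 0, 2))
  errs[i] <- y_all[n_in + i] - forecast(fit_i, h = 1)$mean[1]
}
sqrt(mean(errs^2))  # RMSE of the rolling one-step forecasts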

weird svm behavior in R (e1071)

I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
Given randomly generated data (X) and response (Y), this code performs leave-group-out cross-validation 1000 times. Each entry of the final prediction vector is therefore the mean of the predictions for that observation across CV iterations.
Computing the area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. The area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
  # get train
  train=sample(seq(length(Y)),0.5*length(Y))
  if(min(table(Y[train]))==0)
    next
  # test from train
  test=seq(length(Y))[-train]
  # train model
  XX=X[train,]
  YY=Y[train]
  mod=svm(XX,YY,probability=FALSE)
  XXX=X[test,]
  predVec=predict(mod,XXX)
  RFans=attr(predVec,'decision.values')
  ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
Similarly, when I implement the same thing in Python I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration. And then take the mean of the AUCs.
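A minimal sketch of that per-iteration approach, reusing the objects from the R example above (roc() is from pROC; quiet = TRUE just suppresses its messages):
aucs <- rep(NA_real_, iter)
for (i in seq(iter)) {
  train <- sample(seq(length(Y)), 0.5 * length(Y))
  if (min(table(Y[train])) == 0) next
  test <- seq(length(Y))[-train]
  mod  <- svm(X[train, ], Y[train], probability = FALSE)
  pred <- as.numeric(predict(mod, X[test, ]))
  # AUC for this iteration only
  aucs[i] <- as.numeric(roc(Y[test], pred, quiet = TRUE)$auc)
}
mean(aucs, na.rm = TRUE)  # hovers around 0.5 for random data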

Lambda Issue, or cross validation

I am doing double cross-validation with the LASSO of the glmnet package. However, when I plot the results I am getting lambda values from 0 to 150000, which is unrealistic in my case. I am not sure what I am doing wrong; can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),1:(n.in%%k.in))
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit
I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically, the data are "dirty", and that is a possible reason why the algorithm does not converge properly. E.g. there are 771-year-old patients, and besides 1 and 2 for male and female there is a 22 in the sex encoding, etc.
In your case you removed only the NAs.
You also need to check the column types the data.frame was imported with. E.g. SEX, Lab and Age group could be imported as integers instead of factors, which will affect the model.
I think you need to:
1) cleanse the data (see the sketch below);
2) if that does not work, submit the *.csv file.
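A minimal sketch of the kind of checks meant here, assuming columns named AGE and SEX (adjust the names and cut-offs to the actual headers in calciumgood.csv):
str(calcium)                      # coded variables may have been imported as integers
summary(calcium$AGE)              # look for impossible values such as 771
calcium <- subset(calcium, AGE <= 110 & SEX %in% c(1, 2))
calcium$SEX <- factor(calcium$SEX, labels = c("male", "female"))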

Storing arima predictions into empty vector in R

I am having some trouble storing ARIMA predictions in an empty vector. The problem is that arima predictions give you both predictions and standard errors, so there are two columns of values, and I cannot seem to store them in an empty vector. I tried to create two empty vectors and bind them together, but it did not solve the problem.
My intention is to simulate 1000 observations, use the first 900 observations to make 100 predictions, and update the list of values as I go. For example, use 900 observations to predict the value of the 901st observation; then use 901 observations, including the predicted 901st observation, to predict the 902nd observation; and repeat until 999 observations are used to predict the 1000th observation. I hope to figure out how to store multiple values in a vector.
The empty vector that I hope will contain the 100 predictions is called Predictions1.
# Create Arima Series #
ArimaSeries1 = arima.sim(n=1000, list(ar=c(0.99), ma=c(0.1)))+50
ts.plot(ArimaSeries1)
acf(ArimaSeries1)
ArimaSeries2 = arima.sim(n=1000, list(ar=c(0.7,0.2), ma=c(0.1,0.1)))+50
ts.plot(ArimaSeries2)
acf(ArimaSeries2)
ArimaSeries3 = arima.sim(n=1000, list(ar=c(0.6,0.2,0.1), ma=c(0.1,0.1,0.1)))+50
ts.plot(ArimaSeries3)
acf(ArimaSeries3)
# Estimate Arima Coefficients using maximum likelihood #
ARC1 = arima(ArimaSeries1, order = c(1,0,1))
ARC2 = arima(ArimaSeries2, order = c(2,0,2))
ARC3 = arima(ArimaSeries3, order = c(3,0,3))
# Estimate Arima Coefficients with 900 observations #
AR1 = arima(ArimaSeries1[1:900], order = c(1,0,1))
AR2 = arima(ArimaSeries2[1:900], order = c(2,0,2))
AR3 = arima(ArimaSeries3[1:900], order = c(3,0,3))
# Create for-loop to make one prediction ahead for 100 times #
PredictionsA = rep(0,100)
PredictionsB = rep(0,100)
Predictions1 = cbind(PredictionsA,PredictionsB)
for(a in 1:100){ Forcasting1 = predict(arima(ArimaSeries1[1:900+a], order=c(1,0,1)), n.ahead=1)}
Predictions1[a] = Forcasting1
R would give me this error message:
Warning message: In Predictions1[a] = Forcasting1 : number of items
to replace is not a multiple of replacement length
I would be grateful for any suggestions. Any explanations on where I went wrong is also appreciated. Thank you for your time.
Maybe something like this:
Predictions1 <- array(NA, c(100,2))
for(a in 1:100){
  Forcasting1 = predict(arima(ArimaSeries1[1:900+a], order=c(1,0,1)), n.ahead=1)
  Predictions1[a,] = unlist(Forcasting1)
}
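A hedged variant of the same idea with labelled columns; note also that 1:900+a in R parses as (1+a):(900+a), a rolling 900-point window, so something like 1:(900+a-1) is needed if the expanding window described in the question is intended:
Predictions1 <- matrix(NA, nrow = 100, ncol = 2,
                       dimnames = list(NULL, c("pred", "se")))
for (a in 1:100) {
  # expanding window: 900 observations for the first forecast, 999 for the last
  Forcasting1 <- predict(arima(ArimaSeries1[1:(900 + a - 1)], order = c(1, 0, 1)),
                         n.ahead = 1)
  Predictions1[a, "pred"] <- Forcasting1$pred
  Predictions1[a, "se"]   <- Forcasting1$se
}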

How to specify newxreg in prediction model of ARIMA?

I have fit the model below to my time series data. The xreg consists of a time vector that goes from 1 through 1000 and of 12 indicator variables (1 or 0) that represent the month. The data that I'm dealing with has some strong weekly and monthly seasonal patterns.
fit <- arima(x, order = c(3, 0, 0),
             seasonal = list(order = c(1, 0, 1), period = 7),
             xreg = cbind(t, M1, M2, M3, M4, M5, M6,
                          M7, M8, M9, M10, M11, M12),
             include.mean = FALSE,
             transform.pars = TRUE,
             fixed = NULL, init = NULL,
             method = c("CSS-ML", "ML", "CSS"),
             optim.method = "BFGS",
             optim.control = list(), kappa = 1e6)
At this time I'm trying to figure out how I can predict 14 values for the month of January (M1=1).
So when I use the predict function in R, I think I need to specify in the newxreg portion that I want M1=1 and M2,...,M12=0 for my prediction - correct?
I've played around with the code, but I couldn't get it to work, and I was not able to find very detailed information about the newxreg argument of the predict function online.
Can anyone explain to me how I can get predictions for one particular month, say January?
And how do I need to specify that in the newxreg part of the predict function?
Many thanks in advance!
I have finally found a way out and wanted to post it - in case it helps someone else.
So basically, newxreg should be a matrix that contains values of the regressors that you want predictions for.
So in my case, my regressors were all 1 or 0 (coded variables) to specify a particular month.
So what I did is I created a matrix of 0's and 1's to be used as my newxreg.
What I did is I defined a matrix mx, and then in the predict function I set newxreg=mx. I made sure that the number of rows of mx is at least n.ahead.
pred <- predict(fit,n.ahead=n, newxreg=mx)
Hope this is helpful for others as well!
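For instance, a minimal sketch of such an mx for n = 14 January steps, assuming the regressor t in the original xreg simply ran from 1 to length(x):
n  <- 14
mx <- cbind(t  = (length(x) + 1):(length(x) + n),  # continue the time trend
            M1 = 1,                                # January indicator on
            M2 = 0, M3 = 0, M4 = 0, M5 = 0, M6 = 0,
            M7 = 0, M8 = 0, M9 = 0, M10 = 0, M11 = 0, M12 = 0)
pred <- predict(fit, n.ahead = n, newxreg = mx)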
