Prediction of 'mlm' linear model object from `lm()` - r

I have three datasets:
response - matrix of 5(samples) x 10(dependent variables)
predictors - matrix of 5(samples) x 2(independent variables)
test_set - matrix of 10 (samples) x 2 (independent variables defined in predictors)
response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10)
colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV")
predictors <- matrix(sample.int(15, size = 5*2, replace = TRUE), nrow = 5, ncol = 2)
colnames(predictors) <- c("1_IV","2_IV")
test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow = 10, ncol = 2)
colnames(test_set) <- c("1_IV","2_IV")
I'm doing a multivariate linear model using a training set defined as the combination of response and predictor sets, and I would like to use this model to make predictions for the test set:
training_dataframe <- data.frame(predictors, response)
fit <- lm(response ~ predictors, data = training_dataframe)
predictions <- predict(fit, data.frame(test_set))
However, the results for predictions are really odd:
predictions
First off the matrix dimensions are 5 x 10, which is the number of samples in the response variable by the number of DVs.
I'm not very skilled with this type of analysis in R, but shouldn't I be getting a 10 x 10 matrix, so that I have predictions for each row in my test_set?
Any help with this issue would be greatly appreciated,
Martin

You are stepping into a poorly supported corner of R. The model class you have is "mlm", i.e., "multiple linear models", which is not the standard "lm" class. You get it when you have several (independent) response variables fitted against a common set of covariates / predictors. Although lm() can fit such a model, the predict method is poor for the "mlm" class. If you look at methods(predict), you will see a predict.mlm*. Normally for a linear model of class "lm", predict.lm is called when you call predict; for an "mlm" object, predict.mlm* is called instead.
predict.mlm* is too primitive. It does not allow se.fit, i.e., it cannot produce prediction errors, confidence / prediction intervals, etc., although this is possible in theory. It can only compute the prediction mean. If so, why would we want to use predict.mlm* at all? The prediction mean can be obtained by a trivial matrix-matrix multiplication (in the standard "lm" class it is a matrix-vector multiplication), so we can do it on our own.
Consider this small, reproducible example.
set.seed(0)
## 2 responses of 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)
class(fit)
# [1] "mlm" "lm"
beta <- coef(fit)
# [,1] [,2]
#(Intercept) 0.5773235 -0.4752326
#predictors1 -0.9942677 0.6759778
#predictors2 -1.3306272 0.8322564
#predictors3 -0.5533336 0.6218942
When you have a prediction data set:
# 2 new observations for 3 covariates
test_set <- matrix(rnorm(6), 2, 3)
we first need to prepend an intercept column:
Xp <- cbind(1, test_set)
and then do the matrix multiplication:
pred <- Xp %*% beta
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Perhaps you have noticed that I did not even use a data frame here. That is unnecessary, since everything is already in matrix form; for R wizards, lm.fit or even qr.solve may be more straightforward.
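Just as a side sketch (mine, not part of the original workflow), the same coefficients can be reproduced through that matrix interface, using the response and predictors matrices defined above; lm.fit expects the design matrix to already contain the intercept column:
## sketch: the matrix-interface route mentioned above
X <- cbind(1, predictors)                        ## design matrix with intercept
beta_lmfit <- lm.fit(X, response)$coefficients   ## same values as coef(fit)
beta_qr <- qr.solve(X, response)                 ## least-squares solve via QR
all.equal(unname(beta_lmfit), unname(beta_qr))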
But as a complete answer, it is worth demonstrating how to use predict.mlm* to get the desired result.
## still using previous matrices
training_dataframe <- data.frame(response = I(response), predictors = I(predictors))
fit <- lm(response ~ predictors, data = training_dataframe)
newdat <- data.frame(predictors = I(test_set))
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Note the I() when I use data.frame(). It is required when we want a data frame whose columns are matrices. You can compare the difference between:
str(data.frame(response = I(response), predictors = I(predictors)))
#'data.frame': 10 obs. of 2 variables:
# $ response : AsIs [1:10, 1:2] 1.262954.... -0.32623.... 1.329799.... 1.272429.... 0.414641.... ...
# $ predictors: AsIs [1:10, 1:3] -0.22426.... 0.377395.... 0.133336.... 0.804189.... -0.05710.... ...
str(data.frame(response = response, predictors = predictors))
#'data.frame': 10 obs. of 5 variables:
# $ response.1 : num 1.263 -0.326 1.33 1.272 0.415 ...
# $ response.2 : num 0.764 -0.799 -1.148 -0.289 -0.299 ...
# $ predictors.1: num -0.2243 0.3774 0.1333 0.8042 -0.0571 ...
# $ predictors.2: num -0.236 -0.543 -0.433 -0.649 0.727 ...
# $ predictors.3: num 1.758 0.561 -0.453 -0.832 -1.167 ...
Without I() to protect the matrix input, the data become messy. Surprisingly, this does not cause a problem for lm, but predict.mlm* will have a hard time reconstructing the correct matrix for prediction if you don't use I().
That said, I would recommend using a "list" instead of a "data frame" here. The data argument of lm, as well as the newdata argument of predict, accepts list input. A "list" is a more general structure than a data frame and can hold any data structure without difficulty. We can do:
## still using previous matrices
training_list <- list(response = response, predictors = predictors)
fit <- lm(response ~ predictors, data = training_list)
newdat <- list(predictors = test_set)
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Finally, I should stress that it is always safer to use the formula interface rather than the matrix interface. I will use the built-in R dataset trees as a reproducible example.
fit <- lm(cbind(Girth, Height) ~ Volume, data = trees)
## use the first two rows as prediction dataset
predict(fit, newdata = trees[1:2, ])
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
Perhaps you still remember my saying that predict.mlm* is too primitive to support se.fit. This is the chance to test it.
predict(fit, newdata = trees[1:2, ], se.fit = TRUE)
#Error in predict.mlm(fit, newdata = trees[1:2, ], se.fit = TRUE) :
# the 'se.fit' argument is not yet implemented for "mlm" objects
Oops... How about confidence / prediction intervals? (Without the ability to compute standard errors, producing those intervals is impossible anyway.) Well, predict.mlm* will just ignore the argument.
predict(fit, newdata = trees[1:2, ], interval = "confidence")
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
So this behaviour is quite different from that of predict.lm.
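For contrast, here is a small sketch of my own on the same trees data: with a single response the fit has class "lm", predict.lm is dispatched, and se.fit and intervals work as usual.
## single-response fit: predict.lm handles se.fit and intervals
fit1 <- lm(Girth ~ Volume, data = trees)
predict(fit1, newdata = trees[1:2, ], interval = "confidence", se.fit = TRUE)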

Related

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104 x 182, where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of this dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98, where i refers to the number of PCs I will use for my kNN model. For each i, I pull out the highest accuracy and the k at which it occurs, and store them in a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i = as.numeric(),            # number of PCs used,
                         pca_all_k = as.numeric(),    # k value,
                         pca_all_acc = as.numeric(),  # accuracy value,
                         pca_all_kapp = as.numeric()) # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
  a111 <- train(V1 ~ .,
                method = "knn",
                tuneGrid = expand.grid(k = 1:25),
                trControl = train_control,
                metric = "Accuracy",
                data = kNN_data[, 1:i])
  # store the best accuracy rate, along with its k and kappa value
  b111 <- a111$results[as.integer(a111$bestTune), ]
  kdf_pca_all_cc <- kdf_pca_all_cc %>%
    add_row(i = i - 1,
            pca_all_k = b111[, 1],
            pca_all_acc = b111[, 2],
            pca_all_kapp = b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I get the following error and warning:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.
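A minimal sketch of what the warning itself suggests, assuming the culprit is the cbind() call above (it coerces the factor class label to numeric, so caret treats the task as regression): build the modelling frame with data.frame() so the outcome stays a two-level factor.
## sketch: keep the outcome as a factor so caret runs classification
kNN_data <- data.frame(tumour = factor(st_data[, 1]), pca_all$x)
a111 <- train(tumour ~ ., data = kNN_data[, 1:6],   ## e.g. the first 5 PCs
              method = "knn",
              tuneGrid = expand.grid(k = 1:25),
              trControl = train_control,
              metric = "Accuracy")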

What is the meaning of component err.rate of class randomForest?

I'm using the function randomForest from package randomForest. One of the objects of class randomForest is err.rate which is
(classification only) vector error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.
Could you please explain what is the meaning of this component? Thank you so much for your help!
I take the dataset Sonar, Mines vs. Rocks, as a code example.
library(mlbench)
data(Sonar)
library(boot)
library(randomForest)
n <- 208
ntrain <- 100
ntest <- 108
train.idx <- sample(1:n, ntrain, replace = FALSE)
train.set <- Sonar[train.idx, ]
test.set <- Sonar[-train.idx, ]
rf <- randomForest(Class ~ ., data = train.set, keep.inbag = TRUE, importance = TRUE)
head(rf$err.rate)
Here is the result of the code
OOB M R
[1,] 0.1891892 0.1500000 0.2352941
[2,] 0.2931034 0.2307692 0.3437500
[3,] 0.2739726 0.2647059 0.2820513
[4,] 0.2911392 0.2894737 0.2926829
[5,] 0.2413793 0.2682927 0.2173913
[6,] 0.2555556 0.2142857 0.2916667
[7,] 0.2553191 0.2444444 0.2653061
[8,] 0.2268041 0.1956522 0.2549020
[9,] 0.2783505 0.2608696 0.2941176
One component of randomForest is bagging, where you get a consensus prediction across trees.
As you grow more trees, the OOB error is computed at each step. The i-th OOB error is not obtained by comparing a single tree's prediction against the samples that are out of bag for that tree; rather, for each sample you aggregate the predictions from all trees (among the first i) for which that sample was not used in fitting. I recommend checking this for an overview.
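To make that concrete, here is a small sketch of my own (relying on keep.inbag = TRUE from the fit above) that re-derives the first entry of err.rate by hand: with only one tree grown, the OOB error is just that tree's error on the samples it never saw.
## sketch: reproduce rf$err.rate[1, "OOB"] manually
tree_preds <- predict(rf, train.set, predict.all = TRUE)$individual
oob1 <- rf$inbag[, 1] == 0                        ## samples out of bag for tree 1
mean(tree_preds[oob1, 1] != train.set$Class)      ## should match rf$err.rate[1, "OOB"]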
So in the example you have, we can visualize this:
library(ggplot2)
library(tidyr)
plotdf <- pivot_longer(data.frame(ntrees = 1:nrow(rf$err.rate), rf$err.rate), -ntrees)
ggplot(plotdf, aes(x = ntrees, y = value, col = name)) +
  geom_line() + theme_bw()
M and R are the error rates for predictions of that specific label, and OOB (the first column) is the overall error rate across all samples, i.e., a weighted average of the two. As the number of trees increases, the OOB error generally drops because the aggregated prediction improves.
The nice thing about randomForest is that you don't need cross-validation, because the OOB estimate is usually quite indicative. Below we can try to show that we get a similar result:
set.seed(12)
# split in 5 parts
trn = split(1:nrow(Sonar), sample(1:nrow(Sonar) %% 5))
sim = vector("list", 5)
# the number of trees we incrementally grow
ntrees = c(1, 20*(1:50) + 1)
for (CV in 1:5) {
  idx = trn[[CV]]
  train.set <- Sonar[-idx, ]
  test.set <- Sonar[idx, ]
  # first forest, ntree = 1, but it works
  mdl <- randomForest(Class ~ ., data = train.set, ntree = 1,
                      keep.inbag = TRUE, importance = TRUE, keep.forest = TRUE)
  err_rate <- vector("numeric", 51)
  err_rate[1] <- mean(predict(mdl, test.set) != test.set$Class)
  # growing the forest, 10 trees at a time
  for (i in 1:50) {
    mdl <- grow(mdl, 10)
    err_rate[i + 1] <- mean(predict(mdl, test.set) != test.set$Class)
  }
  sim[[CV]] <- data.frame(ntrees = ntrees, err_rate = err_rate, CV = CV)
}
sim = do.call(rbind, sim)
# plot
ggplot(sim, aes(x = ntrees, y = err_rate)) +
  geom_line(aes(group = CV), alpha = 0.2) +
  stat_summary(fun = mean, geom = "line", col = "blue") + theme_bw()

Loop linear regression different predictor and outcome variables

I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B <- matrix(1:160, nrow = 20, ncol = 8)
df <- as.data.frame(B)
Using the car package, to perform a simple linear regression, display summary and confidence intervals is:
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (a loop or an apply call) so that R regresses each outcome on each predictor individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk, but any help would be really appreciated.
You could wrap your lines in an lapply call and fit a linear model for each of your predictors (excluding the target column, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target]   # note the parentheses: 1:8[-my.target] would not drop the target
lapply(my.predictors, function(i) {
  fit <- lm(df[, my.target] ~ df[, i])
  list(summary = summary(fit), confint = confint(fit))
})
You obtain a list of lists.
So, the code with my own data that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18:23)]
lapply(my.predictors, function(i) {
  fit <- lm(metabdata[, my.target] ~ metabdata[, i])
  list(summary = summary(fit), confint = confint(fit))
})
Returns:
Error: Unsupported index type: tbl_df
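The error most likely arises because my.target and my.predictors are themselves tibbles rather than column positions, so metabdata[, my.target] ends up indexing with a tbl_df. A sketch using plain integer indices instead (column numbers taken from the post):
## sketch: index columns by position (or name), not by a tibble of columns
my.target <- 34
my.predictors <- 18:23
lapply(my.predictors, function(i) {
  fit <- lm(metabdata[[my.target]] ~ metabdata[[i]])
  list(summary = summary(fit), confint = confint(fit))
})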

Multinomial logit in R: mlogit versus nnet

I want to run a multinomial logit in R and have used two libraries, nnet and mlogit, which produce different results and report different types of statistics. My questions are:
What is the source of the discrepancy between the coefficients and standard errors reported by nnet and those reported by mlogit?
I would like to report my results to a LaTeX file using stargazer. When doing so, there is a problematic tradeoff:
If I use the results from mlogit then I get the statistics I wish for, such as pseudo R-squared; however, the output is in long format (see example below).
If I use the results from nnet then the format is as expected, but it reports statistics I am not interested in, such as AIC, and does not include, for example, pseudo R-squared.
I would like to have the statistics reported by mlogit in the formatting of nnet when I use stargazer.
Here is a reproducible example, with three choice alternatives:
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice="y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
The compiled LaTeX output is in what I refer to as "long format", which I consider undesirable:
Now, using nnet:
library(nnet)
mlogit.model2 = multinom(y ~ 1 + col1+col2, data=mydata)
stargazer(mlogit.model2)
Gives the LaTeX output:
which is of the "wide" format that I desire. Note the different coefficients and standard errors.
To my knowledge, there are three R packages that allow estimation of the multinomial logistic regression model: mlogit, nnet and globaltest (from Bioconductor). I do not consider here the mnlogit package, a faster and more efficient implementation of mlogit.
All of the above packages use different algorithms that, for small samples, give different results. These differences vanish for moderate sample sizes (try with n <- 100).
Consider the following data-generating process taken from James Keirstead's blog:
n <- 40
set.seed(4321)
df1 <- data.frame(x1=runif(n,0,100), x2=runif(n,0,100))
df1 <- transform(df1, y=1+ifelse(100 - x1 - x2 + rnorm(n,sd=10) < 0, 0,
ifelse(100 - 2*x2 + rnorm(n,sd=10) < 0, 1, 2)))
str(df1)
'data.frame': 40 obs. of 3 variables:
$ x1: num 33.48 90.91 41.15 4.38 76.35 ...
$ x2: num 68.6 42.6 49.9 36.1 49.6 ...
$ y : num 1 1 3 3 1 1 1 1 3 3 ...
table(df1$y)
1 2 3
19 8 13
The model parameters estimated by the three packages are respectively:
library(mlogit)
df2 <- mlogit.data(df1, choice="y", shape="wide")
mlogit.mod <- mlogit(y ~ 1 | x1+x2, data=df2)
(mlogit.cf <- coef(mlogit.mod))
2:(intercept) 3:(intercept) 2:x1 3:x1 2:x2 3:x2
42.7874653 80.9453734 -0.5158189 -0.6412020 -0.3972774 -1.0666809
#######
library(nnet)
nnet.mod <- multinom(y ~ x1 + x2, df1)
(nnet.cf <- coef(nnet.mod))
(Intercept) x1 x2
2 41.51697 -0.5005992 -0.3854199
3 77.57715 -0.6144179 -1.0213375
#######
library(globaltest)
glbtest.mod <- globaltest::mlogit(y ~ x1+x2, data=df1)
(cf <- glbtest.mod@coefficients)
1 2 3
(Intercept) -41.2442934 1.5431814 39.7011119
x1 0.3856738 -0.1301452 -0.2555285
x2 0.4879862 0.0907088 -0.5786950
The mlogit command of globaltest fits the model without using a reference outcome category, hence the usual parameters can be calculated as follows:
(glbtest.cf <- rbind(cf[,2]-cf[,1],cf[,3]-cf[,1]))
(Intercept) x1 x2
[1,] 42.78747 -0.5158190 -0.3972774
[2,] 80.94541 -0.6412023 -1.0666813
Concerning the estimation of the parameters in the three packages, the method used in mlogit::mlogit is explained in detail here.
In nnet::multinom the model is a neural network with no hidden layers, no bias nodes and a softmax output layer; in our case there are 3 input units and 3 output units:
nnet:::summary.nnet(nnet.mod)
a 3-0-3 network with 12 weights
options were - skip-layer connections softmax modelling
b->o1 i1->o1 i2->o1 i3->o1
0.00 0.00 0.00 0.00
b->o2 i1->o2 i2->o2 i3->o2
0.00 41.52 -0.50 -0.39
b->o3 i1->o3 i2->o3 i3->o3
0.00 77.58 -0.61 -1.02
Maximum conditional likelihood is the method used in multinom for model fitting.
The parameters of multinomial logit models are estimated in globaltest::mlogit using maximum likelihood and working with an equivalent log-linear model and the Poisson likelihood. The method is described here.
For models estimated by multinom the McFadden's pseudo R-squared can be easily calculated as follows:
nnet.mod.loglik <- nnet:::logLik.multinom(nnet.mod)
nnet.mod0 <- multinom(y ~ 1, df1)
nnet.mod0.loglik <- nnet:::logLik.multinom(nnet.mod0)
(nnet.mod.mfr2 <- as.numeric(1 - nnet.mod.loglik/nnet.mod0.loglik))
[1] 0.8483931
At this point, using stargazer, I generate a report for the model estimated by mlogit::mlogit which is as similar as possible to the report of multinom. The basic idea is to substitute the estimated coefficients and probabilities in the object created by multinom with the corresponding estimates of mlogit.
# Substitution of coefficients
nnet.mod2 <- nnet.mod
cf <- matrix(nnet.mod2$wts, nrow=4)
cf[2:nrow(cf), 2:ncol(cf)] <- t(matrix(mlogit.cf,nrow=2))
# Substitution of probabilities
nnet.mod2$wts <- c(cf)
nnet.mod2$fitted.values <- mlogit.mod$probabilities
Here is the result:
library(stargazer)
stargazer(nnet.mod2, type="text")
==============================================
Dependent variable:
----------------------------
2 3
(1) (2)
----------------------------------------------
x1 -0.516** -0.641**
(0.212) (0.305)
x2 -0.397** -1.067**
(0.176) (0.519)
Constant 42.787** 80.945**
(18.282) (38.161)
----------------------------------------------
Akaike Inf. Crit. 24.623 24.623
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Now I am working on the last issue: how to display the log-likelihood, pseudo R-squared and other statistics in the above stargazer output.
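One possible route for that (a sketch of my own, relying on stargazer's add.lines argument and the nnet.mod.loglik / nnet.mod.mfr2 values computed earlier):
## sketch: append log-likelihood and McFadden pseudo R-squared as extra rows
stargazer(nnet.mod2, type = "text",
          add.lines = list(
            c("Log Likelihood", rep(sprintf("%.3f", as.numeric(nnet.mod.loglik)), 2)),
            c("McFadden pseudo R2", rep(sprintf("%.3f", nnet.mod.mfr2), 2))))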
If you are using stargazer you can use omit to remove unwanted rows or references. Here is a quick example; hopefully it will point you in the right direction.
NB: my assumption is that you are using RStudio and R Markdown with knitr.
```{r, echo=FALSE}
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice = "y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
mlogit.col1 <- mlogit(y ~ 1 | col1, data = mldata)
mlogit.col2 <- mlogit(y ~ 1 | col2, data = mldata)
```
# MLOGIT
```{r echo = FALSE, message = TRUE, error = TRUE, warning = FALSE, results = 'asis'}
library(stargazer)
stargazer(mlogit.model1, type = "html")
stargazer(mlogit.col1,
          mlogit.col2,
          type = "html",
          omit = c("1:col1", "2:col1", "1:col2", "2:col2"))
```
Result:
Note that the second table omits the 1:col1, 2:col1, 1:col2 and 2:col2 rows.

Adding lagged variables to an lm model?

I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
x y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
x y y_1
1 1 2 0
2 2 1 2
3 3 4 1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Have a look at e.g. the dynlm package, which gives you lag operators. More generally, the CRAN Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept) L(uk, 1) L(uk, 12)
0.183 0.431 0.511
R>
Following Dirk's suggestion on dynlm, I couldn't quite figure out how to predict, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then, after several hours of experimentation, I came up with the following function to handle the prediction. There were quite a few gotchas along the way, e.g. you can't seem to rbind time series, and the result of predict is offset by start, and a whole bunch of things like that, so I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
use the dyn package
use the following method for prediction
predictDyn method:
# pass in training data and test data;
# it will step through the test rows one by one.
# The dependent variable name is needed so that it can be turned into a time series.
predictDyn <- function(model, train, test, dependentvarname) {
  Ntrain <- nrow(train)
  Ntest <- nrow(test)
  # can't rbind ts's apparently, so convert to numeric first
  train[, dependentvarname] <- as.numeric(train[, dependentvarname])
  test[, dependentvarname] <- as.numeric(test[, dependentvarname])
  testtraindata <- rbind(train, test)
  testtraindata[, dependentvarname] <- ts(as.numeric(testtraindata[, dependentvarname]))
  for (i in 1:Ntest) {
    result <- predict(model, newdata = testtraindata, subset = 1:(Ntrain + i - 1))
    testtraindata[Ntrain + i, dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
  }
  return(testtraindata[(Ntrain + 1):(Ntrain + Ntest), ])
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept) lag(y, -1) lag(y, -2)
0.5 1.7 1.3
> test
y foo
7 143.2054 1
8 325.6810 1
9 740.3247 1
10 1682.4373 1
11 3823.0656 1
12 8686.8801 1
13 19738.1816 1
14 44848.3528 1
15 101902.3358 1
16 231537.3296 1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
Try the arima() function. The AR order is for auto-regression, i.e. lagged y. The xreg argument allows you to add other X variables. You can get predictions with predict() (the predict.Arima method).
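A small sketch of that suggestion, on made-up data rather than the question's toy frame: an AR(1) model for y with x supplied through xreg, and forecasts obtained via predict().
## sketch: AR(1) with an external regressor on simulated data
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.8 * x + arima.sim(list(ar = 0.5), n = 50)
fit <- arima(y, order = c(1, 0, 0), xreg = x)
## future values of the regressor must be supplied for the forecast
predict(fit, n.ahead = 3, newxreg = rnorm(3))$pred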
Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all the lags of any variable you want, and then you get to use your functions exactly as you would for a cross-section type of regression.
Because you will not have to operate on your data every time you call the fitting and prediction functions, but will have transformed the data once, it will be considerably faster. I know that EViews and Stata provide lag operators. It is true that there is some convenience to them, but it is also inefficient if you do not need everything that functions like lm compute. If you have a few hundred thousand iterations to perform and you just need the forecast, or the forecast plus the value of information criteria like BIC or AIC, you can beat lm in speed by avoiding computations you will not use: just write an OLS estimator in a function and you're good to go.
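A sketch of that idea (lag_frame below is a hypothetical helper and the series is made up): precompute the lag columns once, after which the ordinary lm()/predict() workflow applies unchanged.
## sketch: build lag columns L1..Lp once, then fit as usual
lag_frame <- function(y, p) {
  out <- sapply(seq_len(p), function(k) c(rep(NA, k), head(y, -k)))
  colnames(out) <- paste0("L", seq_len(p))
  as.data.frame(out)
}
dat <- data.frame(x = 1:20, y = cumsum(rnorm(20)))   ## made-up series
dat2 <- cbind(dat, lag_frame(dat$y, 2))
fit <- lm(y ~ x + L1 + L2, data = dat2)              ## rows with NA lags are dropped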
