Adding lagged variables to an lm model? - r

I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
  x        y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c( 0, head( train$y, -1 ) )  # lag y by one step, padding with 0; clearer than indexing with 1:nrow(train)-1
> train
  x y y_1
1 1 2   0
2 2 1   2
3 3 4   1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?

Have a look at e.g. the dynlm package, which gives you lag operators. More generally, the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one- and twelve-month lag:
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept)     L(uk, 1)    L(uk, 12)
      0.183        0.431        0.511
R>

Following Dirk's suggestion of dynlm, I couldn't quite figure out how to predict with it, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then, after several hours of experimentation, I came up with the following function to handle the prediction. There were quite a few gotchas along the way, e.g. you can't seem to rbind time series, the result of predict is offset by start, and a whole bunch of things like that, so I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
use the dyn package
use the following method for prediction
predictDyn method:
# Pass in training data and test data; it will step through the test rows one by one.
# The dependent variable name is needed so that it can be turned into a time series.
predictDyn <- function( model, train, test, dependentvarname ) {
    Ntrain <- nrow(train)
    Ntest <- nrow(test)
    # can't rbind ts's apparently, so convert to numeric first
    train[, dependentvarname] <- as.numeric(train[, dependentvarname])
    test[, dependentvarname] <- as.numeric(test[, dependentvarname])
    testtraindata <- rbind( train, test )
    testtraindata[, dependentvarname] <- ts( as.numeric( testtraindata[, dependentvarname] ) )
    for( i in 1:Ntest ) {
        result <- predict( model, newdata = testtraindata, subset = 1:(Ntrain + i - 1) )
        # predict() returns a series whose indexing is offset by start(result),
        # so translate the index to pick out the one-step-ahead value
        testtraindata[Ntrain + i, dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
    }
    return( testtraindata[(Ntrain + 1):(Ntrain + Ntest), ] )
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept)   lag(y, -1)   lag(y, -2)
        0.5          1.7          1.3
> test
             y foo
7     143.2054   1
8     325.6810   1
9     740.3247   1
10   1682.4373   1
11   3823.0656   1
12   8686.8801   1
13  19738.1816   1
14  44848.3528   1
15 101902.3358   1
16 231537.3296   1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
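A hedged workaround for the slowness: since this particular model is just an AR(2) with an intercept, one can iterate the fitted coefficients directly instead of calling predict() repeatedly. A minimal sketch, reusing the variables from the example above:
# coefficient order matches the printed model: (Intercept), lag(y, -1), lag(y, -2)
cf <- coef(model)
y_all <- c( as.numeric(train$y), rep(NA, predictN) )
for( i in (N + 1):(N + predictN) ) {
    y_all[i] <- cf[1] + cf[2] * y_all[i - 1] + cf[3] * y_all[i - 2]
}
y_all[(N + 1):(N + predictN)]  # the forecasts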

Try the arima() function in the stats package. The AR part of its order argument is for auto-regression, which means lagged y. xreg = allows you to add other X variables. You can get predictions with predict() (the predict.Arima method).
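A minimal sketch of this approach, assuming train/test data frames shaped like those in the question (and a series long enough to estimate from; the 3-row toy set above is too short):
# AR(1) in y, with x as an external regressor
fit <- arima( train$y, order = c(1, 0, 0), xreg = train$x )
fc <- predict( fit, n.ahead = nrow(test), newxreg = test$x )
fc$pred  # point forecasts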

Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all the lags of any variable you want, and then you get to use your functions exactly as you would for a cross-section type of regression.
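A small sketch of building those lag columns once, up front, with base R's embed() (names are illustrative, and y and x are assumed to be vectors longer than the toy example above):
p <- 2                         # number of lags
lags <- embed( y, p + 1 )      # columns: y_t, y_{t-1}, y_{t-2}
df <- data.frame( y  = lags[, 1],
                  L1 = lags[, 2],
                  L2 = lags[, 3],
                  x  = x[(p + 1):length(x)] )
model <- lm( y ~ x + L1 + L2, data = df )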
Because you will not have to operate on your data every time you call the fitting and prediction functions, but will have transformed the data once, it will be considerably faster. I know that EViews and Stata provide lag operators, and it is true that there is some convenience to them. But it is also inefficient if you do not need everything that functions like lm compute. If you have a few hundred thousand iterations to perform and you just need the forecast, or the forecast plus the value of an information criterion like BIC or AIC, you can beat lm on speed by avoiding computations that you will not use: just write an OLS estimator in a function and you're good to go.
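A minimal sketch of such a stripped-down OLS estimator (coefficients only, none of lm's inference machinery; df is the lagged data frame from the sketch above):
fast_ols <- function( X, y ) {
    # solve the normal equations (X'X) b = X'y
    solve( crossprod(X), crossprod(X, y) )
}
# usage: X must already include an intercept column and the lag columns
X <- cbind( 1, df$x, df$L1, df$L2 )
b <- fast_ols( X, df$y )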

Related

Creating a for loop to calculate AIC scores for different models using lm

I'm trying to create AIC scores for several different models in a for loop.
I have created a for loop with the log likelihood for each model. However, I am stuck on constructing the lm call so that it calculates a model for each combination of my column LOGABUNDANCE with columns 4 to 11 of my dataframe.
This is the code I have used so far. But it gives me the same AIC score for every model.
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
for (i in 1:10) {
    mod <- lm(LOGABUNDANCE ~ . , data = butterfly)
    sigma <- as.numeric(summary(mod)[6])
    LL[i] <- sum(log(dnorm(butterfly$LOGABUNDANCE, predict(mod), sigma)))
    AIC[i] <- -2*LL[i] + 2*(2)
}
You get the same AIC for every model, because you create 10 equal models.
To make the code work, you need some way of changing the model in each iteration.
I can see two options:
Either subset the data at the start of each iteration so it only contains LOGABUNDANCE and one other variable (as suggested by @yacine-hajji in the comments), or
Create a vector of the variables you want to create models with, and use as.formula() together with paste0() to create a new formula for each iteration.
I think solution 2 is easier. Here is a working example of solution 2, using mtcars:
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
# Say I want to model all variables against `mpg`:
# Create a vector of all variable names except mpg
variables <- names(mtcars)[-1]
for (i in 1:10) {
    # Note how the formula is different in each iteration
    mod <- lm(
        as.formula(paste0("mpg ~ ", variables[i])),
        data = mtcars
    )
    sigma <- as.numeric(summary(mod)[6])
    LL[i] <- sum(log(dnorm(mtcars$mpg, predict(mod), sigma)))
    AIC[i] <- -2*LL[i] + 2*(2)
}
Output:
AIC
#> [1] 167.3716 168.2746 179.3039 188.8652 164.0947 202.6534 190.2124 194.5496
#> [9] 200.4291 197.2459

Running random error model with mgcv gam takes too much memory

I am working on a model that includes several random effects (REs) and a spline for one of the variables, so I am trying to use gam(). However, I hit a memory exhaustion error, even when I run it on a cluster with 128 GB, and even when I run the simplest of models with just one RE. The same models (minus the spline) run smoothly, in just a few seconds (or minutes for the full model), when I use lmer() instead.
I was wondering if anyone had any idea why this discrepancy between gam() and lmer() exists, and about any potential solutions.
Here's some code with simulated data and the simplest of models:
library(mgcv)
library(lme4)
set.seed(1234)
person_n <- 38000 # number of people (grouping variable)
n_j <- 15 # number of data points per person
B1 <- 3 # beta for the main predictor
n <- person_n * n_j
person_id <- gl(person_n, k = n_j) # creating the grouping variable
person_RE <- rep(rnorm(person_n), each = n_j) # creating the random intercepts, one per person
x <- rnorm(n) # creating x as a normal dist centered at 0 and sd = 1
error <- rnorm(n)
#putting it all together
y <- B1 * x + person_RE + error
dat <- data.frame(y, person_id, x)
m1 <- lmer(y ~ x + (1 | person_id), data = dat)
g1 <- gam(y ~ x + s(person_id, bs = "re"), method = "REML", data = dat)
m1 runs in just a couple seconds on my computer, whereas g1 hits the error:
Error: vector memory exhausted (limit reached?)
From ?mgcv::random.effects:
gam can be slow for fitting models with large numbers of random effects, because it does not exploit the sparsity that is often a feature of parametric random effects ... However 'gam' is often faster and more reliable than 'gamm' or 'gamm4', when the number of random effects is modest. [emphasis added]
What this means is that in the course of setting up the model, s(., bs = "re") tries to generate a dense model matrix equivalent to model.matrix( ~ person_id - 1); this takes (nrows x nlevels x 8 bytes/double) = (5.7e5 * 3.8e4 * 8) / 2^30 = 161.4 GB (which is exactly the object size that my machine reports it can't allocate).
Check out mgcv::gamm and gamm4::gamm4 for more memory-efficient (and faster, in this case) methods ...
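For instance, a hedged sketch of an equivalent fit via gamm4, which hands the random intercept to lme4's sparse machinery (assuming the dat object simulated above; the s(x) term stands in for the spline mentioned in the question):
library(gamm4)
g2 <- gamm4( y ~ s(x), random = ~ (1 | person_id), data = dat )
summary(g2$gam)  # the smooth terms
summary(g2$mer)  # the random effects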

R: one regression model for 2 different data sets to prepare for waldtest

I have two different data sets, each representing one of my two portfolios:
y(p) as the dependent variable and x1(p), x2(p), x3(p), x4(p) as independent variables.
(p) indicates a portfolio-specific value. Column 1 of each variable represents portfolio 1 and column 2 represents portfolio 2.
The regression equation is:
y(p) = α(p) + β1(p)*x1(p) + β2(p)*x2(p) + β3(p)*x3(p) + β4(p)*x4(p)
What I have done so far is to implement a separate regression model for each portfolio in R:
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
My objective is to compare the two intercepts of both regression models. Within the scope of this comparison, I need to test the joint significance of these intercepts. As far as I can tell, a Wald test should be appropriate.
If I use the waldtest-function from the lmtest-package it does not work.
Obviously, because the response variable is not the same for both models.
library(lmtest)
waldtest(lm1,lm2)
In waldtest.default(object, ..., test = match.arg(test)) :
models with response "y[, 2]" removed because response differs from model 1
None of the workarounds I have tried so far worked either, e.g. R: Waldtest: "Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml"
My guess is that the regression needs to be done in a different way to fix the problems regarding the waldtest.
So that leads to my question:
Is it possible to run the regression as one model that still generates portfolio-specific intercepts and coefficients? (I assume this would fix the problems with the waldtest function.)
Any advice or suggestion will be appreciated.
The following data can be used for a reproducible example:
y=matrix(rnorm(10),ncol=2)
x1=matrix(rnorm(10),ncol=2)
x2=matrix(rnorm(10),ncol=2)
x3=matrix(rnorm(10),ncol=2)
x4=matrix(rnorm(10),ncol=2)
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
library(lmtest)
waldtest(lm1,lm2)
Here are three ways to test the equality of the intercepts. The second one is an implementation of the accepted answer to this question, while the other two are implementations of the second answer to the aforementioned question under different assumptions.
Let
n <- 5
y <- matrix(rnorm(10), ncol = 2)
x <- matrix(rnorm(10), ncol = 2)
First, we may indeed perform the test with only a single model. For that purpose we create a new vector Y that concatenates y[, 1] and y[, 2]. As for the independent variables, we create a block-diagonal matrix with the regressors of one model in the upper-left block and those of the other model in the lower-right block. Lastly, we create a group factor indicating the hidden model. Hence,
library(Matrix)
Y <- c(y)
X <- as.matrix(bdiag(x[, 1], x[, 2]))
G <- factor(rep(0:1, each = n))
Now the unrestricted model is
m1 <- lm(Y ~ G + X - 1)
while the restricted one is
m2 <- lm(Y ~ X)
Testing for intercepts equality gives
library(lmtest)
waldtest(m1, m2)
# Wald test
#
# Model 1: Y ~ G + X - 1
# Model 2: Y ~ X
#   Res.Df Df      F Pr(>F)
# 1      6
# 2      7 -1 0.5473 0.4873
so that, as expected, we cannot reject their equality. A problem with this solution, however, is that it is like estimating the two models separately while assuming that the errors have the same variance in both. Also, it does not allow for cross-correlation between the errors.
Second, we can relax the assumption of identical error variances by estimating two separate models and employing a Z-test, as follows.
M1 <- lm(y[, 1] ~ x[, 1])
M2 <- lm(y[, 2] ~ x[, 2])
Z <- unname((coef(M1)[1] - coef(M2)[1]) /
            sqrt(coef(summary(M1))[1, 2]^2 + coef(summary(M2))[1, 2]^2))
2 * pnorm(-abs(Z))
# [1] 0.5425736
leading to the same conclusion.
Lastly, we can employ SUR (seemingly unrelated regressions), allowing for model-dependent error variances as well as contemporaneous cross-dependence between the errors (which may not be necessary in your case; it depends on what kind of data you are using). For that we can use the systemfit package as follows:
library(systemfit)
eq1 <- y[, 1] ~ x[, 1]
eq2 <- y[, 2] ~ x[, 2]
m <- systemfit(list(eq1, eq2), method = "SUR")
In this case we are also able to perform the Wald test (linearHypothesis() comes from the car package):
library(car)
R <- matrix(c(1, 0, -1, 0), nrow = 1) # restriction matrix
linearHypothesis(m, R, test = "Chisq")
# Linear hypothesis test (Chi^2 statistic of a Wald test)
#
# Hypothesis:
# eq1_(Intercept) - eq2_(Intercept) = 0
#
# Model 1: restricted model
# Model 2: m
#
#   Res.Df Df  Chisq Pr(>Chisq)
# 1      7
# 2      6  1 0.3037     0.5816

R: Dynamic linear regression with dynlm package, how to predict()?

I am trying to build a dynamic regression model, and so far I have done it with the dynlm package. Basically the model looks like this:
y_t = a*x1_t + b*x2_t + ... + c*y_(t-1)
y_t is to be predicted; x1_t and x2_t are given, and so is y_(t-1).
Building the model with the dynlm package worked fine, but when it came to predict y_t I got confused...
I found this, which seems to be a very similar problem, but it did not help me solve my own problem.
Here is the problem I am facing (basically, what predict() does seems weird; see the comments in the code):
library(dynlm)
# Create Data
set.seed(1)
y <- arima.sim(model = list(ar = c(.9)), n = 11) # Create AR(1) dependent variable
A <- rnorm(11) #Create independent variables
B <- rnorm(11)
y <- y + .5 * A + .2 * B #Add relationship to independent variables
data = cbind(y, A, B)
# subset used for the fitting of the model
reg <- data[1:10, ]
# Fit dynamic linear model
model <- dynlm(y ~ A + B + L(y, k = 1), data = reg) # dynlm
model
# Time series regression with "zooreg" data:
# Start = 2, End = 11
#
# Call:
# dynlm(formula = y ~ A + B + L(y, k = 1), data = reg)
# Coefficients:
# (Intercept)            A            B  L(y, k = 1)
#      0.8930      -0.2175       0.2892       0.5176
# subset the last two rows:
# - the last row (r11), for which y_t shall be predicted, with A and B from that same time point as inputs,
# - and the second-to-last row (r10), so that y_(t-1) can be fed into the model as well
pred <- as.data.frame(data[10:11, ])
# prediction using predict()
predict(model, newdata = pred)
#        1        2
# 1.833134 1.483809
# manual calculation of prediction of y in r11 (how I thought it should be...), taking y_(t-1) as input
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[1, 1]
predicted_value
# (Intercept)
# 1.743334
# and here is what reproduces the value from predict() above: it feeds y_t itself into the model
# (i.e. the very value that should be predicted, rather than y_(t-1))
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[2, 1]
predicted_value
# (Intercept)
# 1.483809
Of course I could just use my own prediction function, but the problem is that my real model will have far more variables (which can even vary, as I use the step function to optimize the model according to AIC), and that is why I want to use the predict() function.
Any ideas, how to solve this?
Unfortunately, the dynlm package does not provide a predict() method. At the moment the package completely separates the data pre-processing (which knows about functions like d(), L(), trend(), season(), etc.) from the model fitting (which itself is not aware of these functions). A predict() method has been on my wishlist, but so far I have not got round to writing one, because the flexibility of the interface allows so many models that it is not quite straightforward what to do. In the meantime, I should probably add a method that throws a warning before the lm method is found by inheritance.
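Until then, a hedged manual workaround for the example above: compute the one-step-ahead forecast directly from the fitted coefficients, addressing them by the names printed in the model output:
cf <- coef(model)
y_hat <- cf["(Intercept)"] + cf["A"] * pred[2, "A"] +
    cf["B"] * pred[2, "B"] + cf["L(y, k = 1)"] * pred[1, "y"]
y_hat  # one-step-ahead forecast for row 11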

Fitting (multlple) linear models by group in R [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to (somewhat) elegantly fit 3 models (linear, exponential and quadratic) to a dataset with classes/factors, and to save p-values and R2 for each model and class/factor. It's a simple dataset with 3 variables: x, y, and class. What I can't figure out is how to force each of the 3 models to fit to each of the 3 classes; what I have now fits each model to the complete dataset. The next question is how to then output the p-values and R2 to a table, for each model + class combination.
My code looks like:
set.seed(100)
library(plyr)
#create datast
nit <- within(data.frame(x = 3:32), {
    class <- rep(1:3, each = 10)
    y <- 0.5 * x * (1:10) + rnorm(30)
    class <- factor(class) # convert to a factor
})
x2 <- nit$x * nit$x # for quadratic model
forms<- paste(c("y ~ x", "y ~ x+x2", "log(y) ~ x"), sep = "") # create 3 models
names(forms) <- paste("Model", LETTERS[1:length(forms)])
models <- llply(forms, lm, data = nit)
models # shows coefficients for each of the 3 models
There are a variety of ways to do this, but I liked how the names came out of nested lapply calls better than from my mapply or do (from package dplyr) solutions, even though the code looks a bit complicated. The names made it easier to tell the models apart (i.e. which combination of forms and class each list element represents).
In this solution, it is important to actually add x2 to the nit dataset.
nit$x2 = nit$x*nit$x
models = lapply(forms, function(x) {
    lapply(levels(nit$class), function(y) {
        lm(x, data = nit[nit$class == y, ])
    })
})
The output is a lists of lists, though, so I had to flatten this into a single list using unlist with recursive = FALSE.
models2 = unlist(models, recursive = FALSE)
Now you can easily pull out elements you want from the summary of each model. For example, here is how you might pull at the R-squared for each model:
lapply(models2, function(x) summary(x)$r.squared)
Or if you want a vector instead of a list:
unlist(lapply(models2, function(x) summary(x)$r.squared))
You could also consider linear and quadratic discriminant analysis (LDA and QDA).
This guide provides an easy introduction:
http://tgmstat.wordpress.com/2014/01/15/computing-and-visualizing-lda-in-r/
Maybe like this? You can probably adapt it to do exactly what you want.
modsumm <- llply(models, summary)
ldply(modsumm, function(x) data.frame(term = row.names(x$coefficients),
                                      x$coefficients,
                                      R.sq = x$r.squared))
.id term Estimate Std..Error t.value Pr...t.. R.sq
1 Model A (Intercept) -12.60545292 11.37539598 -1.1081331 2.772327e-01 0.5912020
2 Model A x 3.70767525 0.58265177 6.3634498 6.921738e-07 0.5912020
3 Model B (Intercept) 16.74908684 20.10241672 0.8331877 4.120490e-01 0.6325661
4 Model B x -0.73357356 2.60879262 -0.2811927 7.807063e-01 0.6325661
5 Model B x2 0.12689282 0.07278352 1.7434279 9.263740e-02 0.6325661
6 Model C (Intercept) 1.79394266 0.32323588 5.5499490 6.184167e-06 0.5541830
7 Model C x 0.09767635 0.01655626 5.8996644 2.398030e-06 0.5541830
Or, if you just want the p-value from the F statistic and the R squared
ldply(modsumm, function(x) data.frame(F.p.val = pf(x$fstatistic[1],
                                                   x$fstatistic[2],
                                                   x$fstatistic[3],
                                                   lower.tail = FALSE),
                                      R.sq = x$r.squared))
.id F.p.val R.sq
1 Model A 6.921738e-07 0.5912020
2 Model B 1.348711e-06 0.6325661
3 Model C 2.398030e-06 0.5541830
