Is there a utility to run regressions using xts objects of the following type:
lm(y ~ lag(x, 1) + lag(x, 2) + lag(x, 3), data = as.data.frame(coredata(my_xts)))
where my_xts is an xts object that contains an x and a y. The point of the question is: is there a way to avoid doing a bunch of lags and merges just to build a data.frame with all the lags? I think the dyn package works for zoo objects, so I would expect it to work the same way with xts, but I thought there might be something more up to date.
The dyn and dynlm packages can do that with zoo objects. In the case of dyn just write dyn$lm instead of lm and pass it a zoo object instead of a data frame.
Note that lag in xts works the opposite of the usual R convention: if x is of class xts, then lag(x, 1) is what lag(x, -1) would be if x were of class zoo or ts.
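A minimal sketch of the sign difference (my own illustration, not from the original answer; the variable names are arbitrary):
library(xts)  # loads zoo as well
z <- zoo(1:5, as.Date(1:5))
x <- as.xts(z)
coredata(lag(z, -1))  # zoo: the previous value, z[t-1]
coredata(lag(x, 1))   # xts: also the previous value, requested with the opposite sign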
> library(xts)
> library(dyn)
> x <- xts(anscombe[c("y1", "x1")], as.Date(1:11)) # test data
> dyn$lm(y1 ~ lag(x1, -(1:3)), as.zoo(x))
Call:
lm(formula = dyn(y1 ~ lag(x1, -(1:3))), data = as.zoo(x))
Coefficients:
(Intercept) lag(x1, -(1:3))1 lag(x1, -(1:3))2 lag(x1, -(1:3))3
3.80530 0.04995 -0.12042 0.46631
Since you are already pulling the data out of the xts object, I'm not using any xts features here. There is an embed function that will construct a "lagged" matrix to any desired degree. (I never understood the time-series lag function, and the order of the embed-lagged columns is reversed from what I would have expected.)
embed(1:6, 3)
#--------
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 4 3 2
[3,] 5 4 3
[4,] 6 5 4
#Worked example ... need to shorten the y variable
y <- rnorm(20)
x <- rnorm(20)
lm( tail(y, 18) ~ embed(x, 3) )
#-------------------
Call:
lm(formula = tail(y, 18) ~ embed(x, 3))
Coefficients:
(Intercept) embed(x, 3)1 embed(x, 3)2 embed(x, 3)3
-0.12452 -0.34919 0.01571 0.01715
It was a relief to note that after changing the lags to match those used by @GGrothendieck, we get identical results:
xx <- anscombe  # assumed: the same anscombe data used above
lm( tail(xx[,"y1"], NROW(xx)-3) ~ embed(xx[,"x1"], 4)[,2:4] )
Call:
lm(formula = tail(xx[, "y1"], NROW(xx) - 3) ~ embed(xx[, "x1"],
4)[, 2:4])
Coefficients:
(Intercept) embed(xx[, "x1"], 4)[, 2:4]1 embed(xx[, "x1"], 4)[, 2:4]2
3.80530 0.04995 -0.12042
embed(xx[, "x1"], 4)[, 2:4]3
0.46631
I want to estimate an equation such as:
$Y = \beta_0 + \beta_1 (X_1 - \bar{X}_1) + \beta_2 (X_2 - \bar{X}_2) + \delta Z + \gamma_1 (X_1 - \bar{X}_1) Z + \gamma_2 (X_2 - \bar{X}_2) Z + \epsilon$
where the bar denotes the mean of a variable. Meaning, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand, because I would also like to include the means of more complex terms, such as $X_1 X_2 - \overline{X_1 X_2}$.
Edit:
As requested, a working code sample. Note that in the actual application I have large (and varying) numbers of X variables, so I don't want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(Z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, with the prior demeaning, a term in there such as Z:x1:x2 would correspond to Z*(x1 - mean(x1))*(x2 - mean(x2)), while what I want is Z*(x1*x2 - mean(x1*x2)).
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
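As an extra sanity check (my own sketch, not part of the original answer), the term demeaned inside the formula should itself average to zero:
# mean of the demeaned interaction is zero up to floating-point error
with(mtcars, mean(scale(disp * hp, scale = FALSE)))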
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This gives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but it should do the job with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier, if that is a concern. Note that you can also create the model matrix directly without fitting the model; I took it from the fitted model only because I had already fitted it for the first approach.
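For instance, a sketch of building the model matrix directly, without a prior fit (assuming sdata contains Y, Z, x1 and x2 as above):
# design matrix with main effects and the interaction, as in (.)^2 without Z
mm <- model.matrix(~ (x1 + x2)^2, data = sdata)
mm <- scale(mm[, -1], scale = FALSE)  # drop the intercept column, demean the rest
lm(sdata$Y ~ mm * sdata$Z)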
As a sidenote, it may be that this is not implemented in a more straightforward fashion because it is hard to imagine situations in which demeaning the interaction is more desirable than interacting the demeaned variables.
Comparing both approaches:
Here is the output of both approaches for comparison. As you can see, apart from the coefficient names, everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292
I need to store an lm fit object in a data frame for further processing (this is needed because I will have around 200+ regressions to store in the data frame). I am not able to store the fit object in a data frame. The following code produces the error message:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df = data.frame()
df = rbind(df, c(id="xx1", fitObj=fit))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
I would like to get a data frame like the one returned by dplyr's do() call, for example:
> tacrSECOutput
Source: local data frame [24 x 5]
Groups: <by row>
sector control id1 fit count
1 Chemicals and Chemical Products S tSector <S3:lm> 2515
2 Construction and Real Estate S tSector <S3:lm> 985
Please note that this is sample output only. I would like to create a data frame in the above format (with a fit column holding the lm objects) so that the rest of my code can work on the stored models.
What am I doing wrong? I appreciate the help very much.
The list approach:
Clearly based on @Pascal's idea. I am not a fan of lists, but in some cases they are extremely helpful.
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
# create a list based on models names provided
list_models = lapply(model_names, get)
# set names
names(list_models) = model_names
# check the output
list_models
# $fit1
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5368 1.9678
#
#
# $fit2
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5545 1.9192
Given that you have lots of models in your workspace, the only "manual" thing you have to do is provide a vector of your model names (the names under which they are stored); then, using the get function, you can obtain the actual model objects and save them in a list.
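Once the models sit in a named list, downstream processing is one lapply() away; for example (a sketch):
# pull the coefficients from every stored model
lapply(list_models, coef)
# or collect them into a single matrix, one row per model
do.call(rbind, lapply(list_models, coef))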
Store model objects in a dataset when you create them:
The data frame can be created using dplyr and do if you are planning to store the model objects when they are created.
library(dplyr)
set.seed(42)
x1 = runif(100)
y1 = 2*x1+runif(100)
set.seed(123)
x2 <- runif(100)
y2 <- 2*x2+runif(100)
model_formulas = c("y1~x1", "y2~x2")
data.frame(model_formulas, stringsAsFactors = F) %>%
group_by(model_formulas) %>%
do(model = lm(.$model_formulas))
# model_formulas model
# (chr) (chr)
# 1 y1~x1 <S3:lm>
# 2 y2~x2 <S3:lm>
It REALLY depends on how "organised" the process that builds your 200+ models is. This approach works if the models are built from columns of one specific dataset. It will not work if you want to build models based on various columns of different datasets, perhaps from different workspaces, or of different model types (linear/logistic regression).
Store existing model objects in a dataset:
Actually, I think you can still use dplyr, following the same philosophy as in the list approach. If the models are already built, you can use their names like this:
library(dplyr)
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
data.frame(model_names, stringsAsFactors = F) %>%
group_by(model_names) %>%
do(model = get(.$model_names))
# model_names model
# (chr) (chr)
# 1 fit1 <S3:lm>
# 2 fit2 <S3:lm>
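A sketch of pulling a stored model back out for later use: do() creates a list column, so index into it with [[:
res <- data.frame(model_names, stringsAsFactors = F) %>%
  group_by(model_names) %>%
  do(model = get(.$model_names))
summary(res$model[[1]])  # full summary of the model stored in the first row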
This seems to work:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df <- data.frame()
fitvec <- serialize(fit,NULL)
df <- rbind(df, data.frame(id="xx1", fitObj=fitvec))
fit1 <- unserialize( df$fitObj )
print(fit1)
yields:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.529 1.936
Update: Okay, now something more complex, so as to get one row per fit.
vdf <- data.frame()
fitlist <- list()
niter <- 5
for (i in 1:niter){
# Create a new model each time
a <- runif(1)
b <- runif(1)
n <- 50*runif(1) + 50
x <- runif(n)
y <- a*x + b + rnorm(n,0.1)
fit <- lm(x~y)
fitlist[[length(fitlist)+1]] <- serialize(fit,NULL)
}
vdf <- data.frame(id=1:niter)
vdf$fitlist <- fitlist
for (i in 1:niter){
print(unserialize(vdf$fitlist[[i]]))
}
yields:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.45689 0.07766
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.44922 0.00658
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.41036 0.04522
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40823 0.07189
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40818 0.08141
I'd like to run 10 regressions against the same regressor, then pull all the standard errors without using a loop.
depVars <- as.matrix(data[,1:10]) # multiple dependent variables
regressor <- as.matrix(data[,11]) # independent variable
allModels <- lm(depVars ~ regressor) # multiple, single variable regressions
summary(allModels)[1] # Can "view" the standard error for 1st regression, but can't extract...
allModels is stored as an "mlm" object, which is really tough to work with. It'd be great if I could store a list of lm objects or a matrix with statistics of interest.
Again, the objective is to NOT use a loop. Here is a loop equivalent:
regressor <- as.matrix(data[,11]) # independent variable
table1Data <- matrix(NA, 10, 1) # container for the standard errors
for(i in 1:10) {
tempObject <- lm(data[,i] ~ regressor) # single regressions
table1Data[i,1] <- summary(tempObject)$coefficients[2,2] # assign std error
rm(tempObject)
}
If you put your data in long format it's very easy to get a bunch of regression results using lmList from the nlme or lme4 packages. The output is a list of regression results and the summary can give you a matrix of coefficients, just like you wanted.
library(lme4)
m <- lmList( y ~ x | group, data = dat)
summary(m)$coefficients
Those coefficients are in a simple 3 dimensional array so the standard errors are at [,2,2].
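A sketch of getting into that long format, with toy data standing in for the question's ten response columns (all names here are hypothetical):
library(reshape2)
library(lme4)
set.seed(1)
# toy wide data: ten responses y1..y10 and one regressor x
wide <- setNames(as.data.frame(matrix(rnorm(200), ncol = 10)), paste0("y", 1:10))
wide$x <- rnorm(20)
dat <- melt(wide, id.vars = "x", variable.name = "group", value.name = "y")
m <- lmList(y ~ x | group, data = dat)
summary(m)$coefficients[, 2, 2]  # std. error of the slope, one per response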
Given an "mlm" model object model, you can use the below function written by me to get standard errors of coefficients. This is very efficient: no loop, and no access to summary.mlm().
std_mlm <- function (model) {
Rinv <- with(model$qr, backsolve(qr, diag(rank)))
## unscaled standard error
std_unscaled <- sqrt(rowSums(Rinv ^ 2)[order(model$qr$pivot)])
## residual standard error
sigma <- sqrt(colSums(model$residuals ^ 2) / model$df.residual)
## return final standard error
## each column corresponds to a model
"dimnames<-"(outer(std_unscaled, sigma), list = dimnames(model$coefficients))
}
A simple, reproducible example
set.seed(0)
Y <- matrix(rnorm(50 * 5), 50) ## assume there are 5 responses
X <- rnorm(50) ## covariate
fit <- lm(Y ~ X)
We all know that it is simple to extract estimated coefficients via:
fit$coefficients ## or `coef(fit)`
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) -0.21013925 0.1162145 0.04470235 0.08785647 0.02146662
#X 0.04110489 -0.1954611 -0.07979964 -0.02325163 -0.17854525
Now let's apply our std_mlm:
std_mlm(fit)
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) 0.1297150 0.1400600 0.1558927 0.1456127 0.1186233
#X 0.1259283 0.1359712 0.1513418 0.1413618 0.1151603
We can, of course, call summary.mlm to check that our result is correct:
coef(summary(fit))
#Response Y1 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.21013925 0.1297150 -1.6200072 0.1117830
#X 0.04110489 0.1259283 0.3264151 0.7455293
#
#Response Y2 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1162145 0.1400600 0.8297485 0.4107887
#X -0.1954611 0.1359712 -1.4375183 0.1570583
#
#Response Y3 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.04470235 0.1558927 0.2867508 0.7755373
#X -0.07979964 0.1513418 -0.5272811 0.6004272
#
#Response Y4 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.08785647 0.1456127 0.6033574 0.5491116
#X -0.02325163 0.1413618 -0.1644831 0.8700415
#
#Response Y5 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.02146662 0.1186233 0.1809646 0.8571573
#X -0.17854525 0.1151603 -1.5504057 0.1276132
Yes, all correct!
Here is an option: put your data in long format using the regressor as the id key, then run the regression of value on the regressor, by group of variable.
For example, using the mtcars data set:
library(reshape2)
dat.m <- melt(mtcars,id.vars='mpg') ## mpg is my regressor
library(plyr)
ddply(dat.m, .(variable), function(x) coef(lm(value ~ mpg, data = x)))
This returns a data frame with one row per dependent variable, holding the intercept and the mpg coefficient of that variable's regression.
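The same split-apply pattern also extracts the standard errors the question asked for (a sketch):
# std. error of the mpg coefficient, one row per dependent variable
ddply(dat.m, .(variable),
      function(x) summary(lm(value ~ mpg, data = x))$coefficients[2, 2])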
I am running a simple multivariate regression on a panel/time-series dataset, using both lm() and the underlying formula $(X'X)^{-1} X'Y$, expecting to get the same coefficient values from the two methods. However, I get completely different estimates.
Here is the R code:
return = matrix(ret.ff.zoo, ncol = 50) # Y matrix: 50 return series
data = cbind(df$EQ, df$EFF, df$SIZE, df$MOM, df$MSCR, df$SY, df$UMP) # X matrix of regressors
#First method
BETA = solve(crossprod(data)) %*% crossprod(data, return)
#Second method
OLS <- lm(return ~ data)
I am not sure why the estimates differ between the two methods. Any help is appreciated! Thank you.
Your example isn't reproducible, but if you try it with some dummy data, the matrix formula and lm produce the same results when you take out the intercept:
set.seed(1)
x <- matrix(rnorm(1000),ncol=5)
y <- rnorm(200)
solve(t(x) %*% x) %*% t(x) %*% y
[,1]
[1,] -0.0826496646
[2,] -0.0165735273
[3,] -0.0009412659
[4,] 0.0070475728
[5,] -0.0642452777
> lm(y ~ x + 0)
Call:
lm(formula = y ~ x + 0)
Coefficients:
x1 x2 x3 x4 x5
-0.0826497 -0.0165735 -0.0009413 0.0070476 -0.0642453
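Equivalently (my addition, not part of the original answer), appending a column of ones to x reproduces lm's default intercept:
X1 <- cbind(1, x)                    # add the intercept column
solve(t(X1) %*% X1) %*% t(X1) %*% y  # matches coef(lm(y ~ x))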
Is there a function that can extract two or more columns from a coeftest object? This is easy to do one coeftest object at a time, but can I do the same to a whole list (other than with a for() loop)?
> # meaningless data
> temp <- data.frame(a = rnorm(100, mean = 5), b = rnorm(100, mean = 1),
+ c = 1:100)
> formulas <- list(a ~ b, a ~ c)
> models <- lapply(formulas, lm, data = temp)
> library(lmtest)
> cts <- lapply(models, coeftest)
> # easy to extract columns one object at a time
> cts[[1]][, 1:2]
Estimate Std. Error
(Intercept) 5.0314196 0.1333705
b -0.1039264 0.0987044
> # but more difficult algorithmically
> # either one column
> lapply(cts, "[[", 1)
[[1]]
[1] 5.03142
[[2]]
[1] 5.312007
> # or two
> lapply(cts, "[[", 1:2)
Error in FUN(X[[1L]], ...) : attempt to select more than one element
Maybe the more fundamental question is whether there is a way to turn the meat of the coeftest object into a data frame, which would allow me to extract columns singly and then use mapply(). Thanks!
Edit: I would like to end up with matrices (or data frames) containing the first and second columns.
[[1]]
Estimate Std. Error
(Intercept) 5.0314196 0.1333705
b -0.1039264 0.0987044
[[2]]
Estimate Std. Error
(Intercept) 5.312007153 0.199485363
c -0.007378529 0.003429477
[[ is the wrong subset function in this case. Note that when you lapply() over a list, what you are operating on are the components of the list, the bits you would get with list[[i]] where i is the ith component.
As such, you only need the [, 1:2] bit of cts[[1]][, 1:2] in the lapply() call. It is a little bit trickier because of the arguments for [, but easily doable with lapply():
> lapply(cts, `[`, , 1:2)
[[1]]
Estimate Std. Error
(Intercept) 4.926679544 0.1549482
b -0.001967657 0.1062437
[[2]]
Estimate Std. Error
(Intercept) 4.849041327 0.204342067
c 0.001494454 0.003512972
Note the empty argument (the extra comma) before 1:2; this is the equivalent of [ , 1:2].
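If you would rather end up with data frames than matrices (as the edit in the question mentions), the same idea works with a small wrapper (a sketch):
# coeftest objects are matrices underneath, so this converts cleanly
lapply(cts, function(ct) as.data.frame(ct[, 1:2]))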
I'm not sure if this is what you want, but how about:
> do.call("rbind", cts)[, 1:2]
Estimate Std. Error
(Intercept) 4.8200993881 0.142381642
b -0.0421189130 0.092620363
(Intercept) 4.7459340076 0.206372906
c 0.0005770324 0.003547885