Get the p-values from the lm function for grouped data - r

I am trying to fit a model for each segment in my data using the lm() function in conjunction with the plyr package because my data is grouped by a key.
I've managed to run the model and get the coefficients along with the R^2 & adj r-squared but I am struggling with the p-values.
library("plyr")
#Sample data
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
#model
model_1 <- dlply(test_data, .(key),
function(test_data) lm(y ~ x1 + x2,data = test_data))
#coefficients
ldply(model_1, coef)
#r-squared (summary(x)$adj.r.squared would give the adjusted value)
ldply(model_1, function(x) summary(x)$r.squared)
I've tried the following, which gets me the key and the p-value, but it doesn't include the names of the variables; I need those so I can merge the output with the coefficients from the model later.
#p-values but missing the variable names
ldply(model_1, function(x) summary(x)$coefficients)[,c(1,5)]
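A sketch of one direction I have been considering, though I am not sure it is the cleanest way: build a small data frame inside the function so the term names are kept as a column alongside the p-values.
#p-values with the variable names kept as a column (sketch)
ldply(model_1, function(x) {
  cm <- summary(x)$coefficients              # matrix with terms as row names
  data.frame(term = rownames(cm), p.value = cm[, 4])
})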
I've also tried fitting the models with dplyr::do() and then tidy() (from the broom package). That works fine on a small data set because it returns everything I need, but my actual data contains over 1,000 different segments and RStudio ends up crashing.

I use the dplyr package to format the output. In the function you apply to each group (here via by(), equivalently inside dlply()), wrap the lm() call in summary(); that way the coefficient table you extract also includes the p-values.
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
model<-by(test_data,test_data$key,function(x)summary(lm(y~x1+x2,x)))
R2 <- t(data.frame(lapply(model, function(x) x$adj.r.squared)))
colnames(R2) <- "R2_adj"
R2
R2_adj
a -0.8939647
b 0.4292186
Co<-as.data.frame(t(data.frame(lapply(model,function(x)x$coef))))
colnames(Co)<-c("intercept","x1","x2")
library(dplyr)
Co %>%
  mutate(key = substr(rownames(Co), 1, 1),
         variable = substr(rownames(Co), 3, 12)) %>%
  select(key, variable, intercept, x1, x2)
key variable intercept x1 x2
1 a Estimate 162.1822438 -0.6037364 0.07628315
2 a Std..Error 141.3436897 1.8054132 2.29385395
3 a t.value 1.1474318 -0.3344035 0.03325545
4 a Pr...t.. 0.3699423 0.7698867 0.97649134
5 b Estimate 271.0532276 0.3624009 -3.62853907
6 b Std..Error 196.2769562 0.9166979 3.25911570
7 b t.value 1.3809733 0.3953330 -1.11335080
8 b Pr...t.. 0.3013515 0.7307786 0.38142882
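If you only want the p-value rows, for example to merge them back onto the coefficient estimates as asked, you can filter on the variable column; a quick sketch using the same Co object:
#keep only the p-value rows, one row per key (sketch)
Co %>%
  mutate(key = substr(rownames(Co), 1, 1),
         variable = substr(rownames(Co), 3, 12)) %>%
  filter(variable == "Pr...t..") %>%          # the p-value rows
  select(key, intercept, x1, x2)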

No need for plyr I think, sapply will do just fine.
sapply(model_1, function(x) summary(x)$coefficients[, 4])
a b
(Intercept) 0.3699423 0.3013515
x1 0.7698867 0.7307786
x2 0.9764913 0.3814288
And t() will get those in the same configuration as your estimates.
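For example (the values are simply those from the output above, transposed):
t(sapply(model_1, function(x) summary(x)$coefficients[, 4]))
  (Intercept)        x1        x2
a   0.3699423 0.7698867 0.9764913
b   0.3013515 0.7307786 0.3814288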
By the way, you may want to look at the multidplyr package, so you can use the tidy and dplyr::do approach after all.

Related

Getting a subset of variables in R summary

When using the summary function in R, is there an option I can pass to present only a subset of the variables?
In my example I ran a panel regression; I have several explanatory variables and many dummy variables whose coefficients I do not want to present. I suppose there is a simple way to do this, but I couldn't find it in the function documentation. Thanks
It is in the documentation, but you have to look at the associated print method for summary.plm. The argument is subset. Use it as in the following example:
library(plm)
data("Grunfeld", package = "plm")
mod <- plm(inv ~ value + capital, data = Grunfeld)
print(summary(mod), subset = c("capital"))
Assuming the regression you ran behaves similarly to the summary() of a basic lm() model:
# set up data
x <- 1:100 * runif(100, .01, .02)
y <- 1:100 * runif(100, .01, .03)
# run a very basic linear model
mylm <- lm(x ~ y)
summary(mylm)
# we can save summary of our linear model as a variable
mylm_summary <- summary(mylm)
# we can then isolate coefficients from this summary (summary is just a list)
mylm_summary$coefficients
#output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2007199 0.04352267 4.611846 1.206905e-05
y 0.5715838 0.03742379 15.273273 1.149594e-27
# note that the class of this "coefficients" object is a matrix
class(mylm_summary$coefficients)
# output
[1] "matrix"
# we can convert that matrix into a data frame so it is easier to work with and subset
mylm_df_coefficients <- data.frame(mylm_summary$coefficients)
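From there, presenting only a subset of the coefficients is ordinary data frame subsetting; a small sketch (the row and column names are the ones data.frame() produced above):
# keep only the row for the term you want to show
mylm_df_coefficients["y", ]
# or keep only terms below some p-value threshold
mylm_df_coefficients[mylm_df_coefficients$Pr...t.. < 0.05, ]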

passing model parameters to R's predict() function robustly

I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to reach back into the environment where b was defined. Thus, if subsequent code changes the value of b, the predicted value will change too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.
You can use eval(substitute(...)) to substitute the value of b into the quoted expression before the model is fit:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
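A quick check, reusing the question's example, that the substituted model no longer depends on the global b:
y1 <- predict(mdl, newdata = data.frame(x = 3))  # 9, as before
b <- 5
y2 <- predict(mdl, newdata = data.frame(x = 3))  # still 9: the literal 1 is baked into the call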
We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept) I(1 * x)
# 9.533e-15 3.000e+00
According to the ?bquote description:
‘bquote’ quotes its argument except that terms wrapped in ‘.()’ are evaluated in the specified ‘where’ environment.
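A tiny illustration of that, with the same b as above:
b <- 1
bquote(lm(y ~ I(.(b) * x), data = dat))
# lm(y ~ I(1 * x), data = dat)
# only the .(b) part was evaluated up front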

Fitting (multlple) linear models by group in R [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to (somewhat) elegantly fit 3 models (linear, exponential and quadratic) to a dataset with classes/factors, and save the p-values and R2 for each model and class/factor. It's a simple dataset with 3 variables: x, y, and class. What I can't figure out is how to force each of the 3 models to be fit to each of the 3 classes; what I have now fits each model to the complete dataset. The next question is how to output the p-values & R2 to a table, one row per model + class.
My code looks like:
set.seed(100)
library(plyr)
#create datast
nit <- within(data.frame(x = 3:32),
{
class <- rep(1:3, each = 10)
y <- 0.5 * x* (1:10) + rnorm(30)
class <- factor(class) # convert to a factor
}
)
x2<-nit$x*nit$x #for quadratic model
forms<- paste(c("y ~ x", "y ~ x+x2", "log(y) ~ x"), sep = "") # create 3 models
names(forms) <- paste("Model", LETTERS[1:length(forms)])
models <- llply(forms, lm, data = nit)
models # shows coefficients for each of the 3 models
There are a variety of ways to do this, but I liked how the names came out of nested lapply calls better than my mapply or do (from package dplyr) solutions even though the code looks a bit complicated. The names made it easier to tell the models apart (which forms and class combination each list element represented).
In this solution, it is important to actually add x2 to the nit dataset.
nit$x2 = nit$x*nit$x
models = lapply(forms,
function(x) {
lapply(levels(nit$class),
function(y) {lm(x, data = nit[nit$class == y,])} )
})
The output is a lists of lists, though, so I had to flatten this into a single list using unlist with recursive = FALSE.
models2 = unlist(models, recursive = FALSE)
Now you can easily pull out elements you want from the summary of each model. For example, here is how you might pull at the R-squared for each model:
lapply(models2, function(x) summary(x)$r.squared)
Or if you want a vector instead of a list:
unlist(lapply(models2, function(x) summary(x)$r.squared))
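The same pattern pulls out the coefficient p-values; because the term names are kept, the results can later be matched back to the estimates:
# p-values for every term in every model (sketch)
lapply(models2, function(x) summary(x)$coefficients[, "Pr(>|t|)"])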
You could also consider linear and quadratic discriminant analysis, LDA and QDA.
This guide provides an easy introduction:
http://tgmstat.wordpress.com/2014/01/15/computing-and-visualizing-lda-in-r/
Maybe like this? You can probably adapt it to do exactly what you want.
modsumm <- llply(models, summary)
ldply(modsumm, function(x) data.frame(term = row.names(x$coefficients),
                                      x$coefficients,
                                      R.sq = x$r.squared))
.id term Estimate Std..Error t.value Pr...t.. R.sq
1 Model A (Intercept) -12.60545292 11.37539598 -1.1081331 2.772327e-01 0.5912020
2 Model A x 3.70767525 0.58265177 6.3634498 6.921738e-07 0.5912020
3 Model B (Intercept) 16.74908684 20.10241672 0.8331877 4.120490e-01 0.6325661
4 Model B x -0.73357356 2.60879262 -0.2811927 7.807063e-01 0.6325661
5 Model B x2 0.12689282 0.07278352 1.7434279 9.263740e-02 0.6325661
6 Model C (Intercept) 1.79394266 0.32323588 5.5499490 6.184167e-06 0.5541830
7 Model C x 0.09767635 0.01655626 5.8996644 2.398030e-06 0.5541830
Or, if you just want the p-value from the F statistic and the R squared
ldply(modsumm, function(x) data.frame(F.p.val = pf(x$fstatistic[1],
                                                   x$fstatistic[2],
                                                   x$fstatistic[3],
                                                   lower.tail = F),
                                      R.sq = x$r.squared))
.id F.p.val R.sq
1 Model A 6.921738e-07 0.5912020
2 Model B 1.348711e-06 0.6325661
3 Model C 2.398030e-06 0.5541830

convert dredge function outputs to data.frame in R

I am using the MuMIn package in R to select the best model for my data. Here, I use an example using the Cement data set provided with the code.
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11,13)
avgmod.95p <- list()
for (i in 1:length(idx)){
d2 <- d[1:idx[i],]
fm1 <- lm(y ~ ., data = d2)
dd <- dredge(fm1, extra = c("R^2", F = function(x)
summary(x)$fstatistic[[1]]))
# 95% confidence set:
confset.95p <- get.models(dd, cumsum(weight) <= .95)
avgmod.95p[[i]] <- model.avg(confset.95p)
}
As you can see, I'm running an iteration loop to construct the model-averaged estimates for the dataset (whose length I alter here, for illustration). The variable avgmod.95p returns:
> avgmod.95p[[1]][3]
$avg.model
Estimate Std. Error Adjusted SE Lower CI Upper CI
(Intercept) 56.1637849 15.06079485 15.15303057 26.4643908 85.8631791
X1 1.4810616 0.14016773 0.16302190 1.1615446 1.8005787
X2 0.6850913 0.05397343 0.06358329 0.5604704 0.8097123
X4 -0.6063184 0.05919637 0.06964775 -0.7428255 -0.4698113
X3 0.2126228 0.19480789 0.23502854 -0.2480246 0.6732703
which includes the estimated parameter and the lower and upper confidence intervals.
How do I combine all of the outputs from the iteration loop into one data.frame, for example:
Variable Estimate Lower CI Upper CI
X1 1.4810616 1.1615446 1.8005787
X1
X1
X2
i.e. I would have three values for X1, X2 and X3 where three is the number of iterations in the loop.
How can this be done? I have tried:
do.call(rbind.data.frame, avgmod.95p)
but it doesn't work; it just produces an error.
You are assigning it to a list, so let's use lapply
#get number of rows for each model
no.of.rows <- unlist(lapply(avgmod.95p, function(x) nrow(x$avg.model)))
#use lapply again to rbind the models
foo <- do.call(rbind, lapply(avgmod.95p, function(x) x$avg.model))
Now make it into a nice data.frame using no.of rows to indicate which model it came from:
result.df <- data.frame(Model.No = rep(seq_along(no.of.rows), no.of.rows),
                        Coefs = rownames(foo),
                        foo)
If you modify your index in the for loop, you can give the list elements fancier names as well; your avgmod.95p will then carry those names, and you can use them instead of the numeric Model.No.
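For example, a hypothetical sketch of that naming step (the labels are simply made up from idx):
names(avgmod.95p) <- paste0("n", idx)                  # e.g. "n11", "n12", "n13"
result.df$Model <- rep(names(avgmod.95p), no.of.rows)  # label rows by source model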

Adding lagged variables to an lm model?

I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
x y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
x y y_1
1 1 2 0
2 2 1 2
3 3 4 1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept) L(uk, 1) L(uk, 12)
0.183 0.431 0.511
R>
Following Dirk's suggestion of dynlm, I couldn't quite figure out how to predict with it, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then, after several hours of experimentation, I came up with the following function to handle the prediction. There were quite a few gotchas along the way, e.g. you can't seem to rbind time series, and the result of predict is offset by start, and a whole bunch of things like that. So I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
use the dyn package
use the following method for prediction
predictDyn method:
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name, so that it can make this into a timeseries
predictDyn <- function(model, train, test, dependentvarname) {
  Ntrain <- nrow(train)
  Ntest <- nrow(test)
  # can't rbind ts's apparently, so convert to numeric first
  train[, dependentvarname] <- as.numeric(train[, dependentvarname])
  test[, dependentvarname] <- as.numeric(test[, dependentvarname])
  testtraindata <- rbind(train, test)
  testtraindata[, dependentvarname] <- ts(as.numeric(testtraindata[, dependentvarname]))
  for (i in 1:Ntest) {
    result <- predict(model, newdata = testtraindata, subset = 1:(Ntrain + i - 1))
    testtraindata[Ntrain + i, dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
  }
  return(testtraindata[(Ntrain + 1):(Ntrain + Ntest), ])
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept) lag(y, -1) lag(y, -2)
0.5 1.7 1.3
> test
y foo
7 143.2054 1
8 325.6810 1
9 740.3247 1
10 1682.4373 1
11 3823.0656 1
12 8686.8801 1
13 19738.1816 1
14 44848.3528 1
15 101902.3358 1
16 231537.3296 1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
Try the arima() function. The AR part of the order is the auto-regressive term, i.e. the lagged y, and the xreg argument lets you add other X variables. You can get predictions with predict().
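A minimal sketch of that suggestion, reusing the UKDriverDeaths series from the dynlm example above (the AR order here is chosen purely for illustration):
data("UKDriverDeaths", package = "datasets")
uk  <- log10(UKDriverDeaths)
fit <- arima(uk, order = c(2, 0, 0))   # AR(2): two lags of the dependent variable
predict(fit, n.ahead = 12)             # 12-step-ahead forecasts plus standard errors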
Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all lags of any variable you want and, then, you get to use your functions exactly like you would for a cross-section type of regression.
Because you will not have to operate on your data every time you call the fitting and prediction functions, but will have transformed the data once, it will be considerably faster. I know that EViews and Stata provide lagging operators, and there is some convenience to that, but it is also inefficient if you do not need everything that functions like lm() compute. If you have a few hundred thousand iterations to perform and you just need the forecast, or the forecast plus the value of an information criterion like BIC or AIC, you can beat lm() on speed by avoiding computations you will not use: just write an OLS estimator in a function and you're good to go.
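A minimal sketch of that idea (the toy series and the column names are made up for illustration): build the lag columns once, then call lm.fit(), which skips the formula machinery, when you only need the coefficients.
set.seed(1)
y <- as.numeric(arima.sim(list(ar = c(0.5, 0.3)), n = 200))  # toy AR(2) series
d <- data.frame(y    = y,
                y_L1 = c(NA, head(y, -1)),                   # lag 1
                y_L2 = c(NA, NA, head(y, -2)))               # lag 2
d <- na.omit(d)                                              # drop rows with missing lags
X   <- cbind(1, d$y_L1, d$y_L2)   # design matrix with an intercept column
fit <- lm.fit(X, d$y)             # bare-bones OLS, far less overhead than lm()
coef(fit)                         # intercept and the two lag coefficients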
