Fitting (multiple) linear models by group in R [duplicate] - r

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to (somewhat) elegantly fit 3 models (linear, exponential and quadratic) to a dataset with classes/factors and save the p-values and R2 for each model and class/factor. It is a simple dataset with 3 variables: x, y, and class. What I can't figure out is how to force each of the 3 models to be fit to each of the 3 classes separately; what I have now fits each model to the complete dataset. The next question is how to then output the p-values and R2 to a table, with one row per model and class combination.
My code looks like:
set.seed(100)
library(plyr)
# create dataset
nit <- within(data.frame(x = 3:32),
{
class <- rep(1:3, each = 10)
y <- 0.5 * x* (1:10) + rnorm(30)
class <- factor(class) # convert to a factor
}
)
x2<-nit$x*nit$x #for quadratic model
forms<- paste(c("y ~ x", "y ~ x+x2", "log(y) ~ x"), sep = "") # create 3 models
names(forms) <- paste("Model", LETTERS[1:length(forms)])
models <- llply(forms, lm, data = nit)
models # shows coefficients for each of the 3 models

There are a variety of ways to do this, but I liked how the names came out of nested lapply calls better than my mapply or do (from package dplyr) solutions even though the code looks a bit complicated. The names made it easier to tell the models apart (which forms and class combination each list element represented).
In this solution, it is important to actually add x2 to the nit dataset.
nit$x2 = nit$x*nit$x
models = lapply(forms,
function(x) {
lapply(levels(nit$class),
function(y) {lm(x, data = nit[nit$class == y,])} )
})
The output is a list of lists, though, so I had to flatten this into a single list using unlist with recursive = FALSE.
models2 = unlist(models, recursive = FALSE)
Now you can easily pull out the elements you want from the summary of each model. For example, here is how you might pull out the R-squared for each model:
lapply(models2, function(x) summary(x)$r.squared)
Or if you want a vector instead of a list:
unlist(lapply(models2, function(x) summary(x)$r.squared))
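For the second half of the question (a table of p-values and R-squared per model and class), one possible sketch in base R is below. It rebuilds the example data so that it runs on its own. The names fit_one, grid and results are made up for illustration, and the p-value reported is the overall F-test p-value computed from summary()$fstatistic, which is an assumption about which p-value is wanted.

```r
set.seed(100)
nit <- data.frame(x = 3:32)
nit$class <- factor(rep(1:3, each = 10))
nit$y <- 0.5 * nit$x * (1:10) + rnorm(30)
nit$x2 <- nit$x^2                      # quadratic term stored in the data frame

forms <- c("Model A" = "y ~ x", "Model B" = "y ~ x + x2", "Model C" = "log(y) ~ x")

# every model/class combination, one row each
grid <- expand.grid(class = levels(nit$class), model = names(forms),
                    stringsAsFactors = FALSE)

# fit_one is a hypothetical helper: fit one model to one class and
# return a one-row data frame with R-squared and the F-test p-value
fit_one <- function(model, class) {
  s <- summary(lm(as.formula(forms[[model]]), data = nit[nit$class == class, ]))
  f <- s$fstatistic
  data.frame(model = model, class = class,
             r.squared = s$r.squared,
             p.value = pf(f[["value"]], f[["numdf"]], f[["dendf"]],
                          lower.tail = FALSE))
}

rows <- lapply(seq_len(nrow(grid)),
               function(i) fit_one(grid$model[i], grid$class[i]))
results <- do.call(rbind, rows)
results
```

The result has one row per model and class, so it can be sorted or merged like any other data frame.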

You could also consider linear and quadratic discriminant analysis (LDA and QDA).
This guide provides an easy introduction:
http://tgmstat.wordpress.com/2014/01/15/computing-and-visualizing-lda-in-r/

Maybe like this? You can probably adapt it to do exactly what you want.
modsumm <- llply(models, summary)
ldply(modsumm, function(x) data.frame(term = row.names(x$coefficients),
x$coefficients,
R.sq = x$r.squared))
      .id        term     Estimate  Std..Error    t.value     Pr...t..      R.sq
1 Model A (Intercept) -12.60545292 11.37539598 -1.1081331 2.772327e-01 0.5912020
2 Model A           x   3.70767525  0.58265177  6.3634498 6.921738e-07 0.5912020
3 Model B (Intercept)  16.74908684 20.10241672  0.8331877 4.120490e-01 0.6325661
4 Model B           x  -0.73357356  2.60879262 -0.2811927 7.807063e-01 0.6325661
5 Model B          x2   0.12689282  0.07278352  1.7434279 9.263740e-02 0.6325661
6 Model C (Intercept)   1.79394266  0.32323588  5.5499490 6.184167e-06 0.5541830
7 Model C           x   0.09767635  0.01655626  5.8996644 2.398030e-06 0.5541830
Or, if you just want the p-value from the F statistic and the R-squared:
ldply(modsumm, function(x) data.frame(F.p.val = pf(x$fstatistic[1],
x$fstatistic[2],
x$fstatistic[3],
lower.tail = FALSE),
R.sq = x$r.squared))
.id F.p.val R.sq
1 Model A 6.921738e-07 0.5912020
2 Model B 1.348711e-06 0.6325661
3 Model C 2.398030e-06 0.5541830

Related

Get the p-values from the lm function for grouped data

I am trying to fit a model for each segment in my data using the lm() function in conjunction with the plyr package because my data is grouped by a key.
I've managed to run the model and get the coefficients along with the R^2 & adj r-squared but I am struggling with the p-values.
library("plyr")
#Sample data
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
#model
model_1 <- dlply(test_data, .(key),
function(test_data) lm(y ~ x1 + x2,data = test_data))
#coefficients
ldply(model_1, coef)
#r-squared (use summary(x)$adj.r.squared for the adjusted value)
ldply(model_1, function(x) summary(x)$r.squared)
I've tried this which gets me the key and the p-value but it doesn't have the names of the variables which I need to be able to merge the output with the coefficients from the model later.
#p-values but missing the variable names
ldply(model_1, function(x) summary(x)$coefficients)[,c(1,5)]
I've tried to fit the models using do() from the dplyr package and then tidy() (which comes from the broom package). This works fine with a small data set because it returns everything I need, but my actual data contains over 1,000 different segments and RStudio ends up crashing.
I'm using the "dplyr" package to format the output. In the function that you apply to each group (here with by(), but the same goes for "dlply"), call summary() on the lm() fit, so that the coefficient table you extract also includes the p-values.
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
model <- by(test_data, test_data$key, function(x) summary(lm(y ~ x1 + x2, x)))
R2 <- t(data.frame(lapply(model, function(x) x$adj.r.squared)))
colnames(R2) <- "R2_adj"
R2
R2_adj
a -0.8939647
b 0.4292186
Co<-as.data.frame(t(data.frame(lapply(model,function(x)x$coef))))
colnames(Co)<-c("intercept","x1","x2")
library(dplyr)
Co%>%
mutate(key=substr(rownames(Co),1,1),
variable=substr(rownames(Co),3,12))%>%
select(key,variable,intercept,x1,x2)
key variable intercept x1 x2
1 a Estimate 162.1822438 -0.6037364 0.07628315
2 a Std..Error 141.3436897 1.8054132 2.29385395
3 a t.value 1.1474318 -0.3344035 0.03325545
4 a Pr...t.. 0.3699423 0.7698867 0.97649134
5 b Estimate 271.0532276 0.3624009 -3.62853907
6 b Std..Error 196.2769562 0.9166979 3.25911570
7 b t.value 1.3809733 0.3953330 -1.11335080
8 b Pr...t.. 0.3013515 0.7307786 0.38142882
No need for plyr I think, sapply will do just fine.
sapply(model_1, function(x) summary(x)$coefficients[, 4])
a b
(Intercept) 0.3699423 0.3013515
x1 0.7698867 0.7307786
x2 0.9764913 0.3814288
And t() will get those in the same configuration as your estimates.
By the way, you may want to look at the multidplyr package, which might let you use tidy and dplyr::do on your full data after all.
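If the goal is a long table with the key, the variable name, the estimate and the p-value side by side (so nothing has to be merged afterwards), a base R sketch could look like the following. It uses split() and lapply() instead of plyr, which is my substitution rather than anything from the answers above, and coef_table is just an illustrative name.

```r
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
                        y = c(100,180,120,60,140,200,220,240,260,280),
                        x1 = c(50,60,79,85,90,133,140,120,160,170),
                        x2 = c(20,18,47,16,15,25,30,25,20,15))

# one lm per key
models <- lapply(split(test_data, test_data$key),
                 function(d) lm(y ~ x1 + x2, data = d))

# stack the coefficient tables, keeping the key and the term names as columns
coef_table <- do.call(rbind, lapply(names(models), function(k) {
  cf <- summary(models[[k]])$coefficients
  data.frame(key = k,
             term = rownames(cf),
             estimate = cf[, "Estimate"],
             p.value = cf[, "Pr(>|t|)"],
             row.names = NULL)
}))
coef_table
```

Because each group is handled independently and only a small data frame is kept per model, this should scale to many segments more gracefully than holding every tidy() result in memory, though I have not benchmarked it on 1,000+ groups.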

How to Loop/Repeat a Linear Regression in R

I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
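Since the question asks for the coefficients in a table, here is one way that might be sketched: build a small data frame per gene so the coefficient names stay stable, then keep only the Blood estimate and its p-value. The helper fit_gene and the column names lung and blood are invented for this example, and the dummy data mirrors the block above.

```r
set.seed(1)
n <- 5       # number of genes (columns); 22,000 in the real data
obs <- 50    # observations
Lung <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, 0.5))

# fit_gene is a hypothetical helper: regress column j of Lung on column j of Blood
fit_gene <- function(j) {
  d <- data.frame(lung = Lung[[j]], blood = Blood[[j]], Age = Age, Gender = Gender)
  cf <- summary(lm(lung ~ blood + Age + Gender, data = d))$coefficients
  data.frame(gene = names(Lung)[j],
             blood_estimate = cf["blood", "Estimate"],
             blood_p = cf["blood", "Pr(>|t|)"])
}

gene_table <- do.call(rbind, lapply(seq_len(n), fit_gene))
gene_table
```

Keeping only the numbers you need per gene, rather than the full lm objects, should also keep memory use modest when n is in the tens of thousands.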
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it in R (using the diamonds dataset from ggplot2):
diamonds <- ggplot2::diamonds # avoid attach(); pass the data explicitly instead
strCols = names(diamonds)
formula <- list(); model <- list()
for (i in 1:1) {
formula[[i]] = as.formula(paste0(strCols[7], " ~ ", strCols[7+i])) # e.g. price ~ x
model[[i]] = glm(formula[[i]], data = diamonds)
#then you can plot or do anything else with the result ...
png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
par(mfrow = c(2, 2))
plot(model[[i]])
dev.off()
}
Sensible or not, to make the loop work at all you need something like:
y<- c(1,5,6,2,5,10) # response
x1<- c(2,12,8,1,16,17) # predictors
x2<- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2) # lm() needs the data frame that data= refers to
predictorlist<- list("x1","x2")
for (i in predictorlist){
model <- lm(paste("y ~", i[[1]]), data=df)
print(summary(model))
}
The paste function will solve the problem.
A tidyverse addition - with map()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
function(i,j){
m <- lm(i ~j + x4 , data = anscombe)
coef(m)
})
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more than two variables are changing, this can be done using the pmap() family of functions from purrr.

Different number of predictions than expecting in linear regression [duplicate]

This question already has an answer here:
r predict function returning too many values [closed]
(1 answer)
Closed 6 years ago.
I'm anticipating that I'm missing something glaringly obvious here.
I'm trying to build a demonstration of overfitting. I've got a quadratic generating function from which I've drawn 20 samples, and I now want to fit polynomial linear models of increasing degree to the sampled data.
For some reason, regardless which model I use, every time I run predict I get N predictions back, where N is the number of records used to train my model.
set.seed(123)
N=20
xv = seq(1,5,length.out=1e4)
x=sample(xv,N)
gen=function(v){v^2 + 2*rnorm(length(v))}
y=gen(x)
df = data.frame(x,y)
# convenience function for building formulas for polynomial regression
build_formula = function(N){
fpart = paste(lapply(2:N, function(i) {paste('+ poly(x,',i,',raw=T)')} ), collapse="")
paste('y~x',fpart)
}
## Example:
## build_formula(4)="y~x + poly(x, 2 ,raw=T)+ poly(x, 3 ,raw=T)+ poly(x, 4 ,raw=T)"
model = lm(build_formula(10), data=df)
predict(model, data=xv) # returns 20 values instead of 10000
predict(model, data=1) # even *this* spits out 20 results. WTF?
This behavior is present regardless of the degree of polynomial in the formula, including the trivial case 'y~x':
formulas = sapply(c(2,10,20), build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
,function(f){
predict(
lm(f, data=df)
,data=xv)
})
lapply(pred, length) # 4 x 20 predictions, expecting 4 x 10000
# unsuccessful sanity check
m1 = lm('y~x', data=df)
predict(m1,data=xv)
This is driving me insane. What am I doing wrong?
The second argument to predict is newdata, not data.
Also, you don't need multiple calls to poly in your model formula; poly(x, N) already contains every lower-degree term, so it will be collinear with poly(x, N-1) and all the others.
Also^2, to generate a sequence of predictions using xv, you have to put it in a data frame with the appropriate name: data.frame(x=xv).
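Putting those three points together, a corrected version might look like the sketch below. The degree (4) is an arbitrary choice for illustration, and I use the default orthogonal poly() rather than raw = TRUE to sidestep numerical trouble at high degrees; neither choice comes from the question.

```r
set.seed(123)
N <- 20
xv <- seq(1, 5, length.out = 1e4)
x <- sample(xv, N)
y <- x^2 + 2 * rnorm(N)
df <- data.frame(x, y)

# a single poly() term covers all degrees up to 4
model <- lm(y ~ poly(x, 4), data = df)

# newdata must be a data frame whose column name matches the predictor
pred <- predict(model, newdata = data.frame(x = xv))
length(pred)            # one prediction per element of xv

# without newdata, predict() returns the fitted values for the training rows
length(predict(model))
```

Note that a misspelled argument such as data= is silently swallowed by the ... of predict(), which is why the original call returned the 20 fitted values without any error.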

Regression Summaries in R

I've been using the glm function to do regression analysis, and it's treating me quite well. I'm wondering though, some of the things I want to regress involve a large amount of regression factors. I have two main questions:
Is it possible to give a text vector for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")
For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x,collapse=" + ")
form <- as.formula(paste("Dog ~", regressors)) # build one string; a length-2 character vector is deprecated here
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
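As a rough illustration of the "some interactions, but not all" case described above: the split of the question's column names into an interacting group and a main-effects-only group is purely hypothetical, as are the names inter, main and rhs.

```r
# hypothetical split of the regressors
inter <- c("Limbs", "Tail")      # terms that should interact with each other
main  <- c("Height", "Weight")   # terms that enter as main effects only

# first the interaction part (collapse = " * "), then add the rest (collapse = " + ")
rhs <- paste(c(paste(inter, collapse = " * "), main), collapse = " + ")
form <- as.formula(paste("Dog ~", rhs))
form

# the expanded terms R will actually fit
attr(terms(form), "term.labels")
```

The formula can then be passed to glm(form, data = A, family = "binomial") exactly as before.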
For your second question, you need to convert the output of summary to a data structure that can be sorted. For instance, a data frame. Let's say that the name of your glm model is model:
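library(plyr)
coefs <- summary(model)$coefficients # extract by name rather than by position
coef.sort <- data.frame(Term = rownames(coefs), coefs, row.names = NULL) # keep the regressor names; arrange() drops row names
names(coef.sort) <- c("Term","Estimate","SE","Zval","Pval") # z values for a binomial glm
arrange(coef.sort,Pval)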
Assign the result of arrange() to a variable, and continue with it as you like.
An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
Firstly, you can use the character vector a to create a model formula with reformulate:
# glm(Dog ~ a, data = A, family = "binomial") # fails: a is a character vector, not a column of A
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673

Adding lagged variables to an lm model?

I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
x y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
x y y_1
1 1 2 0
2 2 1 2
3 3 4 1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> library("dynlm")
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept) L(uk, 1) L(uk, 12)
0.183 0.431 0.511
R>
Following Dirk's suggestion on dynlm, I couldn't quite figure out how to predict, but searching for that led me to dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then after several hours of experimentation I came up with the following function to handle the prediction. There were quite a few 'gotcha's on the way, eg you can't seem to rbind time series, and the result of predict is offset by start and a whole bunch of things like that, so I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
- use the dyn package
- use the following method for prediction
predictDyn method:
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name, so that it can make this into a timeseries
predictDyn <- function( model, train, test, dependentvarname ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
# can't rbind ts's apparently, so convert to numeric first
train[,dependentvarname] <- as.numeric(train[,dependentvarname])
test[,dependentvarname] <- as.numeric(test[,dependentvarname])
testtraindata <- rbind( train, test )
testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) )
for( i in 1:Ntest ) {
result <- predict(model,newdata=testtraindata,subset=1:(Ntrain+i-1))
testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] )
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept) lag(y, -1) lag(y, -2)
0.5 1.7 1.3
> test
y foo
7 143.2054 1
8 325.6810 1
9 740.3247 1
10 1682.4373 1
11 3823.0656 1
12 8686.8801 1
13 19738.1816 1
14 44848.3528 1
15 101902.3358 1
16 231537.3296 1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
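Try the arima() function. The AR order is the auto-regressive part, which means lagged y. The xreg = argument allows you to add other X variables. You can get predictions with predict() (the method for these fits is predict.Arima), passing future regressor values as newxreg.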
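A minimal sketch of that idea, on simulated data that I made up for the purpose (an AR(1) series plus a linear effect of x), might be:

```r
set.seed(1)
n <- 60
x <- rnorm(n)
# simulated response: AR(1) noise plus 2 * x (illustrative only)
y <- as.numeric(arima.sim(list(ar = 0.6), n = n)) + 2 * x

# fit on the first 50 points: one lag of y plus the external regressor
fit <- arima(y[1:50], order = c(1, 0, 0), xreg = cbind(x = x[1:50]))

# forecast the next 10 points, supplying the future values of x
fc <- predict(fit, n.ahead = 10, newxreg = cbind(x = x[51:60]))
fc$pred   # point forecasts
fc$se     # their standard errors
```

Note that arima() with xreg fits a regression with AR errors, which is close to, but not exactly the same thing as, putting lag(y) on the right-hand side of lm.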
Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all lags of any variable you want and, then, you get to use your functions exactly like you would for a cross-section type of regression.
Because you will not have to operate on your data every time you call fitting and prediction functions, but will have transformed the data once, it will be considerably faster. I know that Eviews and Stata provide lagging operators. It is true that there is some convenience to it. But it also is inefficient if you do not need everything functions like 'lm' compute. If you have a few hundreds of thousands of iterations to perform and you just need the forecast, or the forecast and the value of information criteria like BIC or AIC, you can beat 'lm' in speed by avoiding to make computations that you will not use -- just write an OLS estimator in a function and you're good to go.
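To make the "build the lag columns once" idea concrete, here is a small sketch using plain lm. It reuses the recursion from the dyn example above as training data; make_lags and the column names L1 and L2 are hypothetical names of my own.

```r
# training series with a known recursion, as in the dyn example
y <- c(1, 2)
for (i in 3:6) y[i] <- 0.5 + 1.3 * y[i - 2] + 1.7 * y[i - 1]

# make_lags is a hypothetical helper: data frame with y and its first p lags
make_lags <- function(y, p) {
  n <- length(y)
  lags <- sapply(seq_len(p), function(k) c(rep(NA, k), y[seq_len(n - k)]))
  colnames(lags) <- paste0("L", seq_len(p))
  data.frame(y = y, lags)
}

train <- make_lags(y, 2)
model <- lm(y ~ L1 + L2, data = train)   # rows with NA lags are dropped by lm

# step ahead one observation at a time, feeding each forecast back in
history <- y
for (h in 1:10) {
  n <- length(history)
  newrow <- data.frame(L1 = history[n], L2 = history[n - 1])
  history[n + 1] <- predict(model, newdata = newrow)
}
forecast <- history[-seq_along(y)]
forecast
```

Multi-step forecasts still have to be produced one step at a time, because each step needs the previous prediction as its lag, but each step is only a single-row predict() call on an ordinary lm fit.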
