I'm running a regression of the form
reg <- lm(y ~ x1 + x2 + x3 + z1, data = mydata)
In place of the last term, z1, I want to loop through a set of different variables, z1 through z10, running a regression with each as the last term. E.g., in the second run I want to use
reg <- lm(y ~ x1 + x2 + x3 + z2, data = mydata)
and in the third run:
reg <- lm(y ~ x1 + x2 + x3 + z3, data = mydata)
How can I automate this by looping through the list of z-variables?
While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.
dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))

lapply(colnames(dat1)[5:6],
       function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
       d = dat1)
Rather than looping over the actual columns of the data frame, this loops only over the character vector of names. This gives a small speed improvement, since fewer objects are copied between iterations.
library(microbenchmark)
microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
# min lq mean median uq max neval
# 4.014237 4.148117 4.323387 4.220189 4.281995 5.898811 100
microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
# min lq mean median uq max neval
# 4.391494 4.505056 5.186972 4.598301 4.698818 51.573 100
The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.
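One small convenience on top of this (my own note, not part of the benchmark): sapply() with simplify = FALSE keeps the character vector as the names of the result, so each fit can be retrieved by z-variable name:
fits <- sapply(colnames(dat1)[5:6],
               function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
               d = dat1, simplify = FALSE)
names(fits)
# [1] "z1" "z2"
fits$z1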
Depending on what your final goal is, it can be much faster to fit a base model once and then use add1 to evaluate each candidate term, extracting the F-test/AIC you want:
> basemodel <- lm(y~x1+x2+x3, dat1)
>
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions
Model:
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 477.34 164.31
z1 1 0.0768 477.26 166.29 0.0153 0.9019
z2 1 5.1937 472.15 165.21 1.0450 0.3093
See also ?update for refitting the model.
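For instance, a minimal sketch of update(), refitting the base model with one extra term:
# refit basemodel with z1 added to the right-hand side
m_z1 <- update(basemodel, . ~ . + z1)
coef(m_z1)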
With this dummy data:
dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))
You could get your list of two lm objects this way:
lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))
This iterates through those two columns, substituting each one in turn as the final term in the lm call.
As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.
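For reference, a sketch of that formula-based variant using reformulate(), which builds each formula from the variable names without string pasting:
# build y ~ x1 + x2 + x3 + <z> from names, then fit as usual
lapply(names(dat1)[5:6], function(nm)
  lm(reformulate(c("x1", "x2", "x3", nm), response = "y"), data = dat1))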
Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by() from the dplyr package instead of lapply():
library(dplyr)
library(tidyr)
library(magrittr)  # for use_series()

dat1 %>%
  gather(varname, z, z1:z2) %>%  # convert data to long form
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  use_series(model)
This converts the data to long format using gather(), so that the z-values occupy a single column. use_series() from the magrittr package returns the list of lm objects instead of a data.frame. If you load the broom package, you can extract the model coefficients within the same pipeline:
library(broom)

dat1 %>%
  gather(varname, z, z1:z2) %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model)  # or tidy(model)
Source: local data frame [2 x 12]
Groups: varname
varname r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 z1 0.06606736 0.02674388 2.075924 1.680099 0.1609905 5 -212.3698 436.7396 452.3707 409.3987 95
2 z2 0.06518852 0.02582804 2.076900 1.656192 0.1666479 5 -212.4168 436.8337 452.4647 409.7840 95
Data:
dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
                   x2 = runif(100), x3 = runif(100),
                   z1 = runif(100), z2 = runif(100))
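A side note: in tidyr 1.0 and later, gather() is superseded by pivot_longer(). A sketch of the equivalent pipeline (same assumptions as above about broom working on the grouped result of do()):
library(dplyr)
library(tidyr)
library(broom)

dat1 %>%
  pivot_longer(z1:z2, names_to = "varname", values_to = "z") %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model)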
Related
Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s = rep(1:100, times = 7),
                  y = c(y1, y2, y3, y4, y5, y6, y7),
                  x = c(x1, x2, x3, x4, x5, x6, x7),
                  i = rep(1:7, each = n))
Here y and x are individual AR1 time series measured over 100 seconds (I use 's' instead of 't' for the time variable), divided equally into groups (i). I wish to model these as:
y_t = b_0 + b_1(y_{t-1}) + b_2(x_t) + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1(y_{i,t-1}) + b_2(x_{it}) + e_{it}
I wish to know if b_2 (the coef on x) is a good predictor of y and how that coef varies with group. I also want to know the R2 and RMSE by group and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression but it is not a term that is familiar to me. Is using plm in R a good approach to investigate this problem?
Based on the comment below, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .))
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
Since you don't include fixed or random effects in the model, we are dealing with pooled OLS (POLS), which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
                 group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
# x y group ly
# 1 1.7893855 1.2694873 1 NA
# 2 0.8671304 -0.9538848 1 1.2694873
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
# x:group1 x:group2 x:group3 x:group4 x:group5
# -0.02057244 0.06779381 0.04628942 -0.11384630 0.06377069
This allows you to plot them, etc. I suppose one thing that you will want to do is to test whether those coefficients are equal. That can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 103 -14093
# 2 4 -14148 -99 110.48 0.2024
Hence, we cannot reject that the effects of x are the same, as expected.
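For completeness, the same pooled fit can be reproduced with the plm package (a sketch; "pooling" is plm's pooled-OLS estimator, and ly is the lag column built above):
library(plm)
pdf <- pdata.frame(df, index = "group")  # time index inferred from row order within groups
m_pooled <- plm(y ~ ly + x, data = pdf, model = "pooling")
summary(m_pooled)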
I tried to perform independent t-tests for many columns of a data frame. For example, I created a data frame:
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)
To run the tests, I used with(df, t.test(y ~ group)):
with(test_data, t.test(a ~ grp))
with(test_data, t.test(b ~ grp))
with(test_data, t.test(c ~ grp))
I would like the output to look like this:
mean in group m mean in group y p-value
9.747412 9.878820 0.6944
15.12936 16.49533 0.07798
20.39531 20.20168 0.9027
I wonder how I can achieve these results using
1. a for loop
2. apply()
3. perhaps dplyr
This link, R: t-test over all columns, is related, but it is six years old. Perhaps there are better ways to do the same thing.
Use select_if to keep only the numeric columns, then use purrr::map_df to apply t.test against grp. Finally, use broom::tidy to get the results in tidy format:
library(tidyverse)
res <- test_data %>%
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
#> # A tibble: 3 x 11
#> var estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a -0.259 9.78 10.0 -0.587 0.565 16.2 -1.19
#> 2 b 0.154 15.0 14.8 0.169 0.868 15.4 -1.78
#> 3 c -0.359 20.4 20.7 -0.287 0.778 16.5 -3.00
#> # ... with 3 more variables: conf.high <dbl>, method <chr>,
#> # alternative <chr>
Created on 2019-03-15 by the reprex package (v0.2.1.9000)
Simply extract the estimate and p-value results from each t.test call while iterating through all needed columns with sapply. Build the formulas from a character vector and transpose with t() for the output:
formulas <- paste(names(test_data)[1:(ncol(test_data) - 1)], "~ grp")

output <- t(sapply(formulas, function(f) {
  res <- t.test(as.formula(f))
  c(res$estimate, p.value = res$p.value)
}))
Input data (seeded for reproducibility)
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
Output result
# mean in group m mean in group y p.value
# a ~ grp 9.775477 10.03419 0.5654353
# b ~ grp 14.972888 14.81895 0.8678149
# c ~ grp 20.383679 20.74238 0.7776188
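A side note on scoping: t.test(as.formula(f)) only works here because a, b, c, and grp also exist as free variables in the workspace. If they lived only inside test_data, you would pass the data frame explicitly, e.g.:
output <- t(sapply(formulas, function(f) {
  res <- t.test(as.formula(f), data = test_data)  # evaluate inside test_data
  c(res$estimate, p.value = res$p.value)
}))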
As you asked for a for loop:
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)

meanM <- NULL
meanY <- NULL
p.value <- NULL
for (i in 1:(ncol(test_data) - 1)) {
  meanM <- as.data.frame(rbind(meanM, t.test(test_data[, i] ~ grp)$estimate[1]))
  meanY <- as.data.frame(rbind(meanY, t.test(test_data[, i] ~ grp)$estimate[2]))
  p.value <- as.data.frame(rbind(p.value, t.test(test_data[, i] ~ grp)$p.value))
}
cbind(meanM, meanY, p.value)
It works, but I am a beginner in R, so maybe there is a more efficient solution.
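One possible refinement of the loop above (a sketch): call t.test() once per column and fill a pre-allocated matrix, avoiding both the repeated tests and the growing data frames:
res <- matrix(NA, nrow = ncol(test_data) - 1, ncol = 3,
              dimnames = list(names(test_data)[-ncol(test_data)],
                              c("mean.m", "mean.y", "p.value")))
for (i in seq_len(ncol(test_data) - 1)) {
  tt <- t.test(test_data[, i] ~ grp)  # one test per column
  res[i, ] <- c(tt$estimate, tt$p.value)
}
res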
Using lapply this is rather easy.
I have tested the code with set.seed(7060) before creating the dataset, in order to make the results reproducible.
tests_list <- lapply(letters[1:3], function(x)
  t.test(as.formula(paste0(x, " ~ grp")), data = test_data))
result <- do.call(rbind, lapply(tests_list, `[[`, "estimate"))
pval <- sapply(tests_list, `[[`, "p.value")
result <- cbind(result, p.value = pval)
result
# mean in group m mean in group y p.value
#[1,] 9.909818 9.658813 0.6167742
#[2,] 14.578926 14.168816 0.6462151
#[3,] 20.682587 19.299133 0.2735725
Note that a real-life application would use names(test_data)[1:3], not letters[1:3], in the first lapply instruction.
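That is, something like this sketch:
# derive the response columns from the data instead of hard-coding letters
num_cols <- names(test_data)[sapply(test_data, is.numeric)]
tests_list <- lapply(num_cols, function(x)
  t.test(as.formula(paste0(x, " ~ grp")), data = test_data))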
This should be a comment rather than an answer, but I'll make it an answer. The reason is that the accepted answer is awesome, but it comes with one caveat that may cost others hours, as it did for me.
The original data posted by the OP:
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
The answer provided by @Tung:
library(tidyverse)
res <- test_data %>%
select_if(is.numeric) %>%
map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
The problem, or more accurately the caveat, with this answer is that the grp variable has to be defined separately, outside of the data frame. Keeping the grouping variable outside of the data frame is not common practice as far as I know. So, even though the answer is neat, it is worth pointing out this requirement explicitly. I'm posting this comment-like answer in the hope of saving some time for latecomers.
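One way around the caveat (my own sketch, assuming dplyr >= 1.0 for where()) is to reference the grouping column through the data frame itself, so nothing has to live in the workspace:
library(tidyverse)
res <- test_data %>%
  select(where(is.numeric)) %>%
  map_df(~ broom::tidy(t.test(.x ~ test_data$grp)), .id = 'var')  # grp taken from test_data
res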
Consider the two data.frames below. In each case I want to extract the intercept and the slopes for the three variables from the associated models.
set.seed(911)
df1 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=rnorm(10), x3 = rnorm(10))
model1 <- lm(y ~ x + x2 + x3, data = df1)
summary(model1)
summary(model1)$coefficients[1]
summary(model1)$coefficients[2]
summary(model1)$coefficients[3]
summary(model1)$coefficients[4]
set.seed(911)
df2 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=1, x3 = rnorm(10))
model2 <- lm(y ~ x + x2 + x3, data = df2)
summary(model2)
summary(model2)$coefficients[1]
summary(model2)$coefficients[2]
summary(model2)$coefficients[3]
summary(model2)$coefficients[4]
However, in the second example there is no variation in x2, so its coefficient estimate is NA. Importantly, summary(model2) prints the NA, but summary(model2)$coefficients[3] does not return the NA; it skips it and moves on to the next parameter.
But instead I would want:
0.9309032
0.8736204
NA
0.5494
If I do not know in advance which coefficients will be NA, i.e. it could be x1, x2, or x2 & x3, or even something like x1 & x2 & x3, how can I return the result I want?
Grab them directly from the model. There's no need to use summary():
> model2$coefficients
(Intercept) x x2 x3
0.9309032 0.8736204 NA 0.5493671
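Equivalently, the coef() extractor keeps the NA as well and lets you index by name rather than position:
coef(model2)        # named vector, NA retained for x2
coef(model2)["x2"]  # NA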
Is it possible to replace coefficients in an lm object?
I thought the following would work:
# sample data
seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x + rnorm(length(x1))
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# replace coefficients with new values
fit$coef[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
Any assistance would be greatly appreciated.
Your code is not reproducible, as there are a few errors in it. Here's a corrected version, which also shows your mistake:
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
(Intercept) x1 x2
260.55645444 -0.04276353 2.91272272
# replace coefficients with new values, use whole name which is coefficients:
fit$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
(Intercept) x1 x2
260.5565 5.0000 1.0000
So the problem was that you were using fit$coef, although the name of the component in the lm output is really coefficients. The abbreviated version works for getting the values, but not for setting: the assignment created a new component named coef, while the coef() function kept extracting the values of fit$coefficients.
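To see the asymmetry directly, here is a small sketch reusing the corrected data above:
fit2 <- lm(y ~ x1 + x2)
fit2$coef           # read: "$" partial-matches fit2$coefficients
fit2$coef[2] <- 99  # write: stores the modified copy under a NEW component "coef"
names(fit2)         # now lists both "coefficients" and "coef"
coef(fit2)          # unchanged -- coef() reads fit2$coefficients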
I'm looking for suggestions on how to deal with NAs in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient (either zero or NA) for the explanatory variable in question, reflecting its lack of explanatory power?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA <- function(x) {
  if (all(is.na(x))) {
    rep(0, length(x))
  } else x
}

lm(y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables that contain only NAs with variables that contain only 0s. You get the coefficient value NA, but all the relevant parts of the model fit are the same (except the QR decomposition, but if information about that is needed, it can easily be modified). Note that the component summary(fit)$aliased (see ?alias) might be useful.
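For example (a sketch using the fit above), the aliased component flags exactly which coefficient dropped out:
fit <- lm(y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
summary(fit)$aliased
# (Intercept)          x1          x2          x3
#       FALSE       FALSE       FALSE        TRUE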
This seems to relate to your other question: Replace lm coefficients in [r].
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data. Something like...
make_formula <- function(variables, data, response = 'y') {
  if (missing(data)) { stop('data not specified') }
  using <- Filter(variables, f = function(i) !all(is.na(data[[i]])))
  deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm can happily take a character string that is a valid formula.
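Usage would then look like this sketch:
f <- make_formula(variables, data = df)
fit <- lm(f, data = df)  # lm() coerces the formula string
coef(fit)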