R - Using apply only on specific columns - r

I am using a package in R that fits a specific form of a regression model. However, unlike the base lm() function that permits the x and y to be separate objects, the function that I'm using requires them to be in the same dataframe.
My problem arises because I have a lot of variables that I want to regress on y independently. Therefore, I have a dataframe with 10 predictor variables (x1, x2... x10) and one criterion variable (y), 11 columns in total. I could use a for loop to run ten separate regressions, but I want to avoid it and use the apply function instead. However, if I call apply on my dataframe, in the last step it will regress y on y itself and I want to avoid this. Is there a function similar to apply which I could run and specify thiat I only want it to run 10 times and not 11, or is there another workaround to this problem?

Here's a tidyverse solution:
library( tidyverse )
xx <- c("disp", "hp", "drat", "wt") # Names of predictor variables
y <- "mpg" # Name of response
str_c( y, xx, sep="~" ) %>%
map( as.formula ) %>% # Optional (see below)
map( lm, data = mtcars )
str_c simply builds up formulas as strings (e.g., "mpg~disp"). While lm accepts strings directly, your particular regression model might not. If it requires an actual formula, you can convert strings to formulas using as.formula (Thanks for the suggestion, #J.Doe!). Other than that, simply replace lm with your particular model and mtcars with your data frame.
Here's the same solution using base R without any additional packages:
strs <- paste( y, xx, sep="~" )
strs <- lapply( strs, as.formula ) # Optional
lapply( strs, lm, data=mtcars )

Using the builtin anscombe data frame having columns x1, x2, x3, x4, y1, y2, y3, y4 suppose we want to regress y1 on each of x1, x2, x3, x4 separately.
First create a character vector of the names of the independent variables, xnames, and the use lapply to run the indicated run_lm over it. That function pastes together the required formula and performs the lm returning an "lm" class object. L, the result, is a list of such objects, one for each regression.
No packages are used.
xnames <- names(anscombe)[1:4]
run_lm <- function(nm) lm(paste("y1 ~", nm), anscombe)
L <- lapply(xnames, run_lm)
Alternately, this shorter version of run_lm would also work with the above lapply but the Call: output line is not as nice:
run_lm <- function(nm) lm(anscombe[c("y1", nm)])

Related

using lm function in R with a variable name in a loop

I am trying to create a simple linear model in R a for loop where one of the variables will be specified as a parameter and thus looped through, creating a different model for each pass of the loop. The following does NOT work:
model <- lm(test_par[i] ~ weeks, data=all_data_plant)
If I tried the same model with the "test_par[i]" replaced with the variable's explicit name, it works just as expected:
model <- lm(weight_dry ~ weeks, data=all_data_plant)
I tried reformulate and paste ineffectively. Any thoughts?
Maybe try something like this:
n <- #add the column position of first variable
m <- #add the column position of last variable
lm_models <- lapply(n:m, function(x) lm(all_data_plant[,x] ~ weeks, data=all_data_plant))
You can pass the argument "formula" in lm() as character using paste(). Here a working example:
data("trees")
test_par <- names(trees)
model <- lm(Girth ~ Height, data = trees)
model <- lm("Girth ~ Height", data = trees) # character formula works
model <- lm(paste(test_par[1], "~ Height"), data=trees)

How to make regression based on grouped rows and loop over columns?

What I want to do is to perform a regression loop that always has the same predictor but loops over responses (here: y1, y2 and y3). The problem is that I want it also to be done for each category of a grouping variable. In the example data below, I want to make the regression y_i=x for all three y variables, which would result in three regressions. But I want this to be done separately for group=a, group=b and group=c, resulting in 9 different regressions (preferably stored as lists). Cant figure out how to do it! Anyone who has an idea on how to do this?
My idea so far was to maybe do a for-loop or lapply combined with dplyr::group_by, but can't get it to work.
Example data (I have a much larger data set for the actual analysis).
set.seed(123)
dat <- data.frame(group=c(rep("a",10), rep("b",10), rep("c",10)),
x=rnorm(30), y1=rnorm(30), y2=rnorm(30), y3=rnorm(30))
1) Use lmList in nlme (which comes with R so you don't have to install it).
library(nlme)
regs <- lmList(cbind(y1, y2, y3) ~ x | group, dat)
giving an lmList object having a component for each group. We show the component for group a and the other groups are similar.
> regs$a
Call:
lm(formula = object, data = dat, na.action = na.action)
Coefficients:
y1 y2 y3
(Intercept) 0.2943 0.1395 0.4539
x 0.3721 -0.2206 -0.2255
2) Another approach is to perform one overall lm giving an lm object having the same coefficients as above.
lm(cbind(y1, y2, y3) ~ group + x:group + 0, dat)
3) We could also use one of several list comprehension packages. This gives a list of 9 components. The names of the components identify the combination used as does the call component (shown in the Call: line of the output) within each main component. Note t hat the current CRAN version is 0.1.0 but the code below relies on listcompr 0.1.1 which can be obtained from github until it is put on CRAN.
# install.github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
regs <- gen.named.list("{y}.{g}",
do.call("lm",
list(reformulate("x", y), quote(dat), subset = bquote(dat$group == .(g)))
), y = c("y1", "y2", "y3"), g = unique(dat$group)
)
If you don't mind that the Call: line in the output is less descriptive then it can be simplified to:
gen.named.list("{y}.{g}", lm(reformulate("x", y), dat, subset = group == g),
y = c("y1", "y2", "y3"), g = unique(dat$group))
Note
The input corrected from question which had two y2's.
set.seed(123)
dat <- data.frame(group=c(rep("a",10), rep("b",10), rep("c",10)),
x=rnorm(30), y1=rnorm(30), y2=rnorm(30), y3=rnorm(30))

lm function gives estimate for the y-variable also

I am trying to run a simple lm model. I am using the following
dt <- data.table(
y=rnorm(100,0,1),
x1=rnorm(100,0,1),
x2=rnorm(100,0,1),
x3=rnorm(100,0,1))
y_var2 <- names(dt)[names(dt)%like%"y"]
x_var2 <- names(dt)[names(dt)%like%"x"]
tmp2 <- summary(a <- lm(get(y_var2)~.,dt[,c(x_var2,y_var2),with=F]))
coefs2 <- as.data.table(tmp2$coefficients,keep.rownames = T)
So in the end, coefs2 should contain the estimates, p-values etc. But in the last row of the coefs2 i also see the y-variable.
But if I use
tmp2 <- summary(a <- lm(y~.,dt[,c(x_var2,y_var2),with=F]))
Then this does not happen. Why is that ?
This has to do with how R stores variables. y_var2 is a character "y" and you fill it into the formula as a character variable which you wish to model with all variables in your data.table dt. However, you have to tell R that you wish to evaluate the formula y~. and not "y"~. which are two different expressions for R.
lm( formula(paste(y_var2,"~.")),dt[,c(x_var2,y_var2),with=F])
will do the trick. formula constructs a formula out of the string variable with which a contructed the expression.
Actually it would probably be cleaner just to make the formula with reformulate() and the data= parameter of lm
tmp2 <- summary(a <- lm(reformulate(x_var2, y_var2), dt))

Using list of LM estimates as stargazer input

I'm trying to use stargazer over a several LM estimates at once, say "OLS1",...,"OLS5".
I would usually insert them as separate arguments at the beginning of the stargazer input. What I'm looking for is a way to input them all with a list that contains them all, being one argument. Something like
stargazer(list,...)
stargazer arguments explanation states that
one or more model objects (for regression analysis tables) or data frames/vectors/matrices (for summary statistics, or direct output of content). They can also be included as lists (or even lists within lists).
I was wondering what is the correct way to gather LM estimates in a list so that this would work. When I just save the results in a list I get the following error
Error in list.of.objects[[i]] : subscript out of bounds
I will mention that I create the elements storing the estimate using assign. E.G:
assign(some_string,lm(...))
So what I have is a string, called some_string, and I want to put the LM result names some_string inside a list. Using get doesn't help with that.
EDIT: I think you want mget
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
assign("string_1", lm(Y ~ X))
assign("string_2", lm(Y ~ X))
my_list <- mget(x = c("string_1", "string_2"))
stargazer(my_list)
works for me?
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
fit_1 <- lm(Y ~ X)
fit_2 <- lm(Y ~ X)
stargazer(list(fit_1, fit_2))
did you name your list list? maybe it's grabbing the function?

Extracting p-value from lapply list of glm fits

I am using lapply to perform several glm regressions on one dependent variable by one independent variable at a time. Right now I am specifically interested in the Pr(>|z|) of each independent variable. However, I am unsure on how to report just Pr(>|z|) using the list from lapply.
If I was just running one model at a time:
coef(summary(fit))[,"Pr(>|z|)"]
or
summary(fit)$coefficients[,4]
Would work (as described here), but trying something similar with lapply does not seem to work. Can I get just the p-values using lapply and glm with an accessor method or from directly calling from the models?
#mtcars dataset
vars <- names(mtcars)[2:8]
fits <- lapply(vars, function(x) {glm(substitute(mpg ~ i, list(i = as.name(x))), family=binomial, data = mtcars)})
lapply(fits,summary) # this works
lapply(fits, coefficients) # this works
#lapply(fits, summary(fits)$coefficients[,4])# this for example does not work
You want to do:
lapply(fits, function(f) summary(f)$coefficients[,4])
However, if each item is just a p-value, you would probably rather have a vector than a list, so you could use sapply instead of lapply:
sapply(fits, function(f) summary(f)$coefficients[,4])
When you run lapply(fits, summary) it creates a list of summary.glm objects each of which is printed using print.summary.glm
If you save this
summaries <- lapply(fits, summary)
You can then go through and extract the coefficient matrix
coefmat <- lapply(summaries, '[[', 'coefficients')
and then the 4th column
lapply(coefmat, '[', , 4)

Resources