Subset with list in linear model - r

I am having trouble using the subset argument of lm() together with a list. One of two things happens: when I use lapply I get "object 'x' not found", and when I use mapply I get "object 'y' not found". For some reason neither variable can be found inside subset.
Here are my two approaches:
library(faraway)
library(magrittr)  # for the %>% pipe
data(savings)
form <- "sr ~ pop75 + dpi + ddpi + pop15"
#approach one
lapply(label_res, function(x) {
  lm(form, savings, subset = (!row.names(savings) %in% x))
})
#output does not change for all linear models
#approach two
form_list <- rep(form, 4) %>% as.list()
mapply(function(x, y) {
  lm(x, savings, subset = (!row.names(savings) %in% y))
}, form_list, label_res)
Error in row.names(savings) %in% y : object 'y' not found
However this does work:
library(faraway)
library(magrittr)  # for the %>% pipe
data(savings)
form <- "sr ~ pop75 + dpi + ddpi + pop15"
#approach one
lapply(label_res, function(x) {
  subset(savings, !row.names(savings) %in% x) %>%
    lm(form, .)
})
Here's label_res:
list(pop75 = c("Ireland", "Japan"), dpi = c("Ireland", "Sweden",
"United States"), ddpi = "Libya", pop15 = c("Japan", "Libya"))

This is due to the way that lm internally calls model.frame, which tries to evaluate its arguments in the parent frame. That means it looks for an object called x in the global environment, where no such object exists. There are various ways around this. The easiest is to subset the data frame you pass to lm in the first place, which actually requires less typing than using subset anyway:
lapply(label_res, function(x) {
  lm(form, savings[!row.names(savings) %in% x, ])
})
Another, less satisfactory way, if you really want to use the subset argument, is to write x into the calling environment on each iteration:
lapply(label_res, function(x) {
  x <<- x
  lm(form, savings, subset = !row.names(savings) %in% x)
})
I prefer the first option since it doesn't have any side effects in the calling frame, but they both produce the same output:
#> $pop75
#>
#> Call:
#> lm(formula = form, data = savings[!row.names(savings) %in% x,
#> ])
#>
#> Coefficients:
#> (Intercept) pop75 dpi ddpi pop15
#> 26.0381140 -1.4257571 -0.0002792 0.3547081 -0.4078521
#>
#>
#> $dpi
#>
#> Call:
#> lm(formula = form, data = savings[!row.names(savings) %in% x,
#> ])
#>
#> Coefficients:
#> (Intercept) pop75 dpi ddpi pop15
#> 29.009000 -2.216356 0.000589 0.443409 -0.469873
#>
#>
#> $ddpi
#>
#> Call:
#> lm(formula = form, data = savings[!row.names(savings) %in% x,
#> ])
#>
#> Coefficients:
#> (Intercept) pop75 dpi ddpi pop15
#> 24.5240460 -1.2808669 -0.0003189 0.6102790 -0.3914401
#>
#>
#> $pop15
#>
#> Call:
#> lm(formula = form, data = savings[!row.names(savings) %in% x,
#> ])
#>
#> Coefficients:
#> (Intercept) pop75 dpi ddpi pop15
#> 20.580204 -0.644662 -0.000448 0.516628 -0.310588
Created on 2022-06-12 by the reprex package (v2.0.1)
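As a further aside (my own sketch, not part of the answer above): if you want to keep using the subset argument without writing into the calling environment, the logical vector can be computed up front and spliced into the call with do.call(), so model.frame() never has to look up x at all:
lapply(label_res, function(x) {
  # the logical vector is evaluated here, before lm() ever sees it
  do.call("lm", list(formula = form, data = quote(savings),
                     subset = !row.names(savings) %in% x))
})
The printed call then contains the literal TRUE/FALSE vector rather than the expression, but the fits are the same.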


Use lapply with formula to estimate lm with different weights

This is related to, but slightly different from, "How to use lapply with a formula?" and "Calling update within a lapply within a function, why isn't it working?":
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model with each version of the replicate weights. Since I need to estimate many different models and do not want to write a separate loop every time, I tried writing a function that takes the regression data, the regression formula, and the data with the replicate weights as inputs. The function works fine when the formula is specified explicitly inside the lapply() call rather than as a function input (function trythis below), but as soon as I pass the regression formula as a function input (function trythis2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id = 1:500,
                     x1 = sample(1:100, replace = T, size = 500),
                     x2 = runif(n = 500, min = 0, max = 20)) %>%
  mutate(y = 0.2 * x1 + 1.5 * x2 + rnorm(n = 500, mean = 0, sd = 5))
repweights <- data.frame(id=1:500)
set.seed(123)
for (i in 1:200) {
  repweights[, i + 1] <- runif(n = 500, min = 0, max = 10)
  names(repweights)[i + 1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) lm(data = data, formula = y ~ x1 + x2, weights = weightsdata[, x]))
  return(res)
}
results1 <- trythis(data=lm.dat, weightsdata=repweights[-1], weightsN=200)
trythis2 <- function(LMformula, data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) lm(data = data, formula = LMformula, weights = weightsdata[, x]))
  return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found
Formulas have an associated environment in which the referenced variables are looked up. In your case, the formula you are passing carries the environment of the calling frame (the global environment), so lm cannot see the objects that exist only inside the function. To make it work, reassign the formula's environment to the local frame so that the correct variables can be found:
trythis3 <- function(LMformula, data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) {
    environment(LMformula) <- sys.frames()[[length(sys.frames())]]
    lm(data = data, formula = LMformula, weights = weightsdata[, x])
  })
  return(res)
}
trythis3(LMformula = y ~ x1 + x2, data = lm.dat, weightsN = 200,
         weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
...etc
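A slightly simpler variant of the same idea (my own sketch, not part of the answer above) is to point the formula at the current execution environment; its enclosing frame is the function body, so both weightsdata and x are visible from there:
trythis4 <- function(LMformula, data, weightsdata, weightsN) {
  lapply(seq_len(weightsN), function(x) {
    # the execution environment sees x directly, and weightsdata/data
    # through its enclosing (function body) frame
    environment(LMformula) <- environment()
    lm(formula = LMformula, data = data, weights = weightsdata[, x])
  })
}
trythis4(LMformula = y ~ x1 + x2, data = lm.dat, weightsN = 200,
         weightsdata = repweights[-1])
This should give the same fits as trythis3 above.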

Run all possible combinations of linear regression with 2 independent variables

I want to run an OLS regression for every possible combination of 2 independent variables. I have a csv with my data (just one dependent variable and 23 independent variables). I tried renaming the independent variables in my data from a to z and naming my dependent variable column "y" so that it would be recognized by the following code:
#all the combinations
all_comb <- combn(letters, 2)
#create the formulas from the combinations above and paste
text_form <- apply(all_comb, 2, function(x) paste('Y ~', paste0(x, collapse = '+')))
lapply(text_form, function(i) lm(i, data= KOFS05.12))
but this error is shown:
Error in eval(predvars, data, env) : object 'y' not found
I also need to know the R squared. Any idea how to make this work and run every possible regression?
As mentioned in the comments under the question, check whether you need y or Y. Having addressed that, we can use any of the approaches below. There is no need to rename the columns. We use the built-in mtcars data set as an example since no test data was provided in the question. (Please always provide that in the future.)
1) ExhaustiveSearch. This runs quite fast, so you might be able to try combinations higher than 2 as well.
library(ExhaustiveSearch)
ExhaustiveSearch(mpg ~., mtcars, combsUpTo = 2)
2) combn. Use the lmfun function defined below with combn.
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lmfun <- function(x, dep) do.call("lm", list(reformulate(x, dep), quote(mtcars)))
lms <- combn(nms, 2, lmfun, dep = dep, simplify = FALSE)
names(lms) <- lapply(lms, formula)
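Since the question also asks for the R squared, it can be pulled out of each fit with summary() (a small addition of mine, assuming the lms list from approach 2):
r2 <- sapply(lms, function(m) summary(m)$r.squared)
head(sort(r2, decreasing = TRUE))  # best two-variable models first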
3) listcompr. Using lmfun from above and listcompr we can do the following. Note that this needs version 0.1.1 or later of listcompr, which is not yet on CRAN, so we install it from GitHub.
# remotes::install_github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lms2 <- gen.named.list("{nm1}.{nm2}", lmfun(c(nm1, nm2), dep),
                       nm1 = nms, nm2 = nms, nm1 < nm2)
You should specify your text_form as formulas:
KOFS05.12 <- data.frame(y = rnorm(10),
                        a = rnorm(10),
                        b = rnorm(10),
                        c = rnorm(10))
all_comb <- combn(letters[1:3], 2)
fmla_form <- apply(all_comb, 2, function(x) as.formula(sprintf("y ~ %s", paste(x, collapse = "+"))))
lapply(fmla_form, function(i) lm(i, KOFS05.12))
#> [[1]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a b
#> 0.19763 -0.15873 0.02854
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a c
#> 0.21395 -0.15967 0.05737
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) b c
#> 0.157140 0.002523 0.028088
Created on 2021-02-17 by the reprex package (v1.0.0)

update regression objects by passing an expression to "subset"

I have many regression objects created by lm(). Each one has been built from a different data frame, and these different data frames have different dimensions. But each data frame contains the logical variables x, y, and z. In some cases, I want to update each regression object so that the subset argument is x; in other cases so that it is y; and in still other cases so that it is z. What is an efficient way to do this?
This is the inefficient way:
# Set only one of these three variables to be TRUE
subsetX <- TRUE
subsetY <- FALSE
subsetZ <- FALSE
# Now update the regressions.
if (subsetX) {
  update(lm1, subset = x)
  update(lm2, subset = x)
  [...]
} else if (subsetY) {
  update(lm1, subset = y)
  update(lm2, subset = y)
  [...]
} else if (subsetZ) {
  update(lm1, subset = z)
  update(lm2, subset = z)
  [...]
}
This approach is inefficient because there is a lot of duplication across the three code blocks that update the regressions. I would rather do something like
subsetVar <- dplyr::case_when(
  subsetX ~ expression(x),
  subsetY ~ expression(y),
  subsetZ ~ expression(z))
update(lm1, subset = substitute(subsetVar))
update(lm2, subset = substitute(subsetVar))
[...]
That is, I would like to write at most one update() command for each regression object, while still varying the subset argument on the basis of logical (boolean) variables like subsetX and subsetY. Is this possible?
The code above doesn't work; when I try it, I get an Error in xj[i] : invalid subscript type 'symbol' error message.
I've searched other Stack Overflow questions, but I haven't found anything that speaks directly to this problem.
Here's a way to make your workflow a little easier, with a simple helper function and purrr::map2 to feed it the list of models and the subsets you want:
library(purrr)
set.seed(2020)
mtcars$x <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$y <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$z <- sample(c(TRUE, FALSE), 32, replace = TRUE)
lm1 <- lm(mpg ~ hp, mtcars)
subset_lm_by <- function(model, subset = NULL) {
  if (subset == "x") {
    update(model, subset = x)
  } else if (subset == "y") {
    update(model, subset = y)
  } else if (subset == "z") {
    update(model, subset = z)
  } else {
    # cat('I only accept x, y or z!')
  }
}
models <- list(lm1, lm1, lm1, lm1)
subsets <- list("x", "y", "z", "nonsense")
purrr::map2(.x = models,
            .y = subsets,
            ~ subset_lm_by(model = .x, subset = .y))
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = x)
#>
#> Coefficients:
#> (Intercept) hp
#> 31.21178 -0.08098
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = y)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.83501 -0.07294
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = z)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.53554 -0.08688
#>
#>
#> [[4]]
#> NULL
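For completeness, the substitute()/expression() idea in the question can be made to work by splicing a bare symbol into the update() call with bquote() and then evaluating it (my own sketch, not part of the answer above):
subsetVar <- quote(x)  # or quote(y) / quote(z), picked from the logical flags
eval(bquote(update(lm1, subset = .(subsetVar))))  # likewise for lm2, lm3, ...
Each update() line then stays the same regardless of which flag is TRUE.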

How does R lm choose contrasts with an interaction between categorical and continuous variables?

If I run lm with a formula like Y ~ X1 + X2:X1 + X3:X1, where X1 is continuous and X2, X3 are categorical, I get a contrast for both levels of X2 but not of X3.
The pattern is that the first categorical interaction gets both levels but the second does not.
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
df = data.frame(Frivolousness = sample(1:100, 50, replace = T))
df %<>% mutate(
  Personality = sample(c("Bad", "Good"), 50, replace = T),
  Timing = ifelse(Frivolousness %% 2 == 0 & runif(50) > 0.2, "Early", "Late")
)
df %<>% mutate(
  Enchantedness = 11 +
    ifelse(Personality == "Good", 0.23, -0.052) * Frivolousness -
    1.3 * ifelse(Personality == "Good", 1, 0) +
    10 * rnorm(50)
)
df %<>% mutate(
  Personality = factor(Personality, levels = c("Bad", "Good")),
  Timing = factor(Timing, levels = c("Early", "Late"))
)
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Timing:Frivolousness +
#> Personality:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate TimingEarly:Frivolousness
#> -1.41757 -0.05796
#> TimingLate:Frivolousness PersonalityGood:Frivolousness
#> -0.07433 0.33410
lm(Enchantedness ~ Personality + Timing + Personality:Frivolousness+ Timing:Frivolousness , df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Personality:Frivolousness +
#> Timing:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate PersonalityBad:Frivolousness
#> -1.41757 -0.05796
#> PersonalityGood:Frivolousness TimingLate:Frivolousness
#> 0.27614 -0.01636
Created on 2020-02-15 by the reprex package (v0.3.0)
I think the reason it is dropped is that there would be perfect collinearity if it were included. You really should have Frivolousness as a regressor on its own as well. Then you will see that R provides the result for just one level of each interaction.
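A quick way to see that collinearity (my own check, not taken from the answers) is to build the fully dummy-coded interaction columns by hand and look at the rank of the resulting matrix:
X <- cbind(Intercept = 1,
           BadF   = (df$Personality == "Bad")  * df$Frivolousness,
           GoodF  = (df$Personality == "Good") * df$Frivolousness,
           EarlyF = (df$Timing == "Early") * df$Frivolousness,
           LateF  = (df$Timing == "Late")  * df$Frivolousness)
qr(X)$rank  # 4, not 5: BadF + GoodF and EarlyF + LateF both equal Frivolousness
A design containing all four interaction columns would therefore be rank deficient, which is what the coding avoids.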
You get this kind of weird behavior because you are missing the main effect term, Frivolousness. If you do:
set.seed(111)
## run your data frame stuff
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.74223 5.31189
TimingLate TimingEarly:Frivolousness
12.47243 0.19090
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.09496 0.17383
lm(Enchantedness ~ Personality + Timing + Frivolousness+Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.7422 5.3119
TimingLate Frivolousness
12.4724 0.1909
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.2859 0.1738
In your model, the interaction term TimingLate:Frivolousness is the change in the slope of Frivolousness when Timing is Late. Since the Frivolousness main effect is not estimated, lm has to estimate a slope for TimingEarly (the reference level) instead. Hence you can see that the coefficient for TimingEarly:Frivolousness in the first model and the Frivolousness coefficient in the second are the same.
As you can see, the TimingLate:Frivolousness coefficients are very different between the two models. In your case I don't think it makes sense to fit only the interaction term without the main effect, because it is hard to interpret or model.
You can roughly check the slope within each Timing group; the model with all terms gives a good estimate:
library(broom)  # for tidy()
df %>% group_by(Timing) %>% do(tidy(lm(Enchantedness ~ Frivolousness, data = .)))
# A tibble: 4 x 6
# Groups: Timing [2]
Timing term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 Early (Intercept) 6.13 6.29 0.975 0.341
2 Early Frivolousness 0.208 0.0932 2.23 0.0366
3 Late (Intercept) 11.5 5.35 2.14 0.0419
4 Late Frivolousness -0.00944 0.107 -0.0882 0.930
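The coding that lm chose can also be inspected directly with model.matrix(), which lists both Timing dummies for the first interaction but only PersonalityGood for the second (again a quick check of mine, not from the answers):
colnames(model.matrix(~ Personality + Timing + Timing:Frivolousness +
                        Personality:Frivolousness, df))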

map a vector of characters to lm formula in r

I'm trying to make a list of lm objects using purrr::map.
Using mtcars as an example:
vars <- c('hp', 'wt', 'disp')
map(vars, ~lm(mpg~.x, data=mtcars))
error:
Error in model.frame.default(formula = mpg ~ .x, data = mtcars, drop.unused.levels = TRUE) : variable lengths differ (found for '.x')
I also tried:
map(vars, function(x) {x=sym(x); lm(mpg~!!x, data=mtcars)})
I got this error message:
Error in !x : invalid argument type
Can anyone tell me what I did wrong? Thanks in advance.
The usual way is to paste formulas together as strings, convert them by mapping as.formula over them (you can't make a vector of formulas; it has to be a list), and then map lm. You can combine it all into a single call if you like, but I've come to prefer mapping single functions, which makes the code easier to read:
library(purrr)
c('hp', 'wt', 'disp') %>%
  paste('mpg ~', .) %>%
  map(as.formula) %>%
  map(lm, data = mtcars)
#> [[1]]
#>
#> Call:
#> .f(formula = .x[[i]], data = ..1)
#>
#> Coefficients:
#> (Intercept) hp
#> 30.09886 -0.06823
#>
#>
#> [[2]]
#>
#> Call:
#> .f(formula = .x[[i]], data = ..1)
#>
#> Coefficients:
#> (Intercept) wt
#> 37.285 -5.344
#>
#>
#> [[3]]
#>
#> Call:
#> .f(formula = .x[[i]], data = ..1)
#>
#> Coefficients:
#> (Intercept) disp
#> 29.59985 -0.04122
It's actually unnecessary to call map(as.formula) as lm will coerce it into a formula, but not all models are so generous (e.g. mgcv::gam).
A downside of this approach is that the call listed in the object looks funky, but the coefficients tell you which model is which easily enough anyway. A useful alternative is to keep the formula as a string in one column of a data.frame and the model in a list column, e.g.
library(tidyverse)
data_frame(formula = paste('mpg ~', c('hp', 'wt', 'disp')),
           model = map(formula, lm, data = mtcars))
#> # A tibble: 3 x 2
#> formula model
#> <chr> <list>
#> 1 mpg ~ hp <S3: lm>
#> 2 mpg ~ wt <S3: lm>
#> 3 mpg ~ disp <S3: lm>
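As an aside (my own sketch, not part of the answers here): the !! attempt in the question fails because !! only works inside functions that support tidy evaluation, which lm() does not. Base R's reformulate() builds the same formulas without string pasting or tidy eval:
library(purrr)
vars <- c('hp', 'wt', 'disp')
map(vars, ~ lm(reformulate(.x, response = "mpg"), data = mtcars))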
The elegant tidyverse approach demonstrated by @alistaire worked well for me until I tried to pass the list column to the stargazer package and received "% Error: Unrecognized object type."
In case it is helpful for anyone else trying to use purrr map and stargazer, this slight modification solved the issue:
models_out <- data_frame(
  formula = paste('mpg ~', c('hp', 'wt', 'disp')),
  model = map(
    .x = formula,
    .f = function(x) lm(x, data = mtcars))
)
stargazer(models_out$model, type = 'text')
===========================================================
                                    Dependent variable:
                              -----------------------------
                                           mpg
                                 (1)       (2)       (3)
-----------------------------------------------------------
hp                            -0.068***
                               (0.010)
wt                                      -5.344***
                                         (0.559)
disp                                              -0.041***
                                                   (0.005)
Constant                      30.099*** 37.285*** 29.600***
                               (1.634)   (1.878)   (1.230)
-----------------------------------------------------------
Observations                      32        32        32
R2                              0.602     0.753     0.718
Adjusted R2                     0.589     0.745     0.709
Residual Std. Error (df = 30)   3.863     3.046     3.251
F Statistic (df = 1; 30)      45.460*** 91.375*** 76.513***
===========================================================
Note:                         *p<0.1; **p<0.05; ***p<0.01
