I have many regression objects created by lm(). Each one has been built from a different data
frame, and these different data frames have different dimensions. But each data frame contains the logical variables x, y, and z. In some cases, I want to update each regression object so that the subset argument is x. In other cases, I want to update each regression object so that the subset argument is y. And in still other cases, I want to update each regression object so that the subset argument is z. What is an efficient way to do this?
This is the inefficient way:
# Set only one of these three variables to be TRUE
subsetX <- TRUE
subsetY <- FALSE
subsetZ <- FALSE
# Now update the regressions.
if (subsetX) {
update(lm1, subset = x)
update(lm2, subset = x)
[...]
} else if (subsetY) {
update(lm1, subset = y)
update(lm2, subset = y)
[...]
} else if (subsetZ) {
update(lm1, subset = z)
update(lm2, subset = z)
[...]
}
This approach is inefficient because there is a lot of duplication across the three code blocks that update the regressions. I would rather do something like
subsetVar <- dplyr::case_when(
subsetX ~ expression(x),
subsetY ~ expression(y),
subsetZ ~ expression(z))
update(lm1, subset = substitute(subsetVar))
update(lm2, subset = substitute(subsetVar))
[...]
That is, I would like to write at most one update() command for each
regression object, while still varying the subset argument on the basis of
logical (boolean) variables like subsetX and subsetY. Is this possible?
The code above doesn't work; when I try it, I get an Error in xj[i] : invalid subscript type 'symbol' error message.
I've searched other Stack Overflow questions, but I haven't found anything that speaks directly to this problem.
Here's a way to make your workflow a little easier with a simple function and using purrr::map2 to feed it the list of models and subsets you want
library(purrr)
set.seed(2020)
mtcars$x <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$y <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$z <- sample(c(TRUE, FALSE), 32, replace = TRUE)
lm1 <- lm(mpg ~ hp, mtcars)
subset_lm_by <- function(model, subset = NULL) {
if (subset == "x") {
update(model, subset = x)
} else if (subset == "y") {
update(model, subset = y)
} else if (subset == "z") {
update(model, subset = z)
} else {
# cat('I only accept x, y or z!')
}
}
models <- list(lm1, lm1, lm1, lm1)
subsets <- list("x", "y", "z", "nonsense")
purrr::map2(.x = models,
.y = subsets,
~ subset_lm_by(model = .x, subset = .y))
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = x)
#>
#> Coefficients:
#> (Intercept) hp
#> 31.21178 -0.08098
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = y)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.83501 -0.07294
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = z)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.53554 -0.08688
#>
#>
#> [[4]]
#> NULL
Related
Related but slightly different to How to use lapply with a formula? and Calling update within a lapply within a function, why isn't it working?:
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model with each version of the replicate weights. Since I need to estimate many different models and do not want to always write a seperate loop, I tried writing a function where I specify both data for the regression, regression formula and the data with the replicate weights. While the function works fine when specifying the formula explicitly inside the lapply() command in the function and not as a function input (function tryout below), as soon as I specify the regression formula as a function input (function tryout2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id=1:500,
x1=sample(1:100, replace=T, size=500),
x2=runif(n=500, min=0, max=20)) %>%
mutate(y=0.2*x1+1.5*x2+rnorm(n=500, mean=0, sd=5))
repweights <- data.frame(id=1:500)
set.seed(123)
for (i in 1:200) {
repweights[,i+1] <- runif(n=500, min=0, max=10)
names(repweights)[i+1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=y~x1+x2, weights=weightsdata[,x]))
return(res)
}
results1 <- trythis(data=lm.dat, weightsdata=repweights[-1], weightsN=200)
trythis2 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=LMformula, weights=weightsdata[,x]))
return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found
Formulas have an associated environment in which the referenced variables can be found. In your case, the formula you are passing has the environment of the calling frame. To access the variables within the function, you need to assign the formula to the local frame so it can find the correct variables:
trythis3 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) {
environment(LMformula) <- sys.frames()[[length(sys.frames())]]
lm(data = data, formula = LMformula, weights = weightsdata[,x])
})
return(res)
}
trythis3(LMformula = y~x1+x2, data = lm.dat, weightsN = 200,
weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
...etc
I want to run every combination possible for every 2 independent variables (OLS regression). I have a csv where I have my data (just one dependent variable and 23 independent variables), and I've tried renaming the variables inside my database from a to z, and I called 'y' to my dependent variable (a column with name "y" which is my dependent variable) to be recognized by the following code:
#all the combinations
all_comb <- combn(letters, 2)
#create the formulas from the combinations above and paste
text_form <- apply(all_comb, 2, function(x) paste('Y ~', paste0(x, collapse = '+')))
lapply(text_form, function(i) lm(i, data= KOFS05.12))
but this error is shown:
Error in eval(predvars, data, env) : object 'y' not found
I need to know the R squared
Any idea to make it work and run every possible regression?
As mentioned in the comments under the question check whether you need y or Y. Having addressed that we can use any of these. There is no need to rename the columns. We use the built in mtcars data set as an example since no test data was provided in the question. (Please always provide that in the future.)
1) ExhaustiveSearch This runs quite fast so you might be able to try combinations higher than 2 as well.
library(ExhaustiveSearch)
ExhaustiveSearch(mpg ~., mtcars, combsUpTo = 2)
2) combn Use the lmfun function defined below with combn.
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lmfun <- function(x, dep) do.call("lm", list(reformulate(x, dep), quote(mtcars)))
lms <- combn(nms, 2, lmfun, dep = dep, simplify = FALSE)
names(lms) <- lapply(lms, formula)
3) listcompr Using lmfun from above and listcompr we can use the following. Note that we need version 0.1.1 or later of listcompr which is not yet on CRAN so we get it from github.
# install.github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lms2 <- gen.named.list("{nm1}.{nm2}", lmfun(c(nm1, nm2), dep),
nm1 = nms, nm2 = nms, nm1 < nm2)
You should specify your text_form as formulas:
KOFS05.12 <- data.frame(y = rnorm(10),
a = rnorm(10),
b = rnorm(10),
c = rnorm(10))
all_comb <- combn(letters[1:3], 2)
fmla_form <- apply(all_comb, 2, function(x) as.formula(sprintf("y ~ %s", paste(x, collapse = "+"))))
lapply(fmla_form, function(i) lm(i, KOFS05.12))
#> [[1]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a b
#> 0.19763 -0.15873 0.02854
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a c
#> 0.21395 -0.15967 0.05737
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) b c
#> 0.157140 0.002523 0.028088
Created on 2021-02-17 by the reprex package (v1.0.0)
When I run this code:
# Create example data
df <- tibble(age=rnorm(10),
income=rnorm(10))
make_model <- function(response_var, df){
# Create formula
form <- as.formula(response_var ~ .)
# Create model
model <- lm(form , data=df)
# Return coefficients
return(coef(model))
}
make_model(income, df)
I obtain the following error
Error in eval(predvars, data, env) : object 'income' not found
How can I make this function work using quasiquotation? I assume the logic is the same as how we can call library(dplyr) instead of library("dplyr").
Use blast() (to be included in rlang 0.5.0)
blast <- function(expr, env = caller_env()) {
eval_bare(enexpr(expr), env)
}
make_model <- function(data, column) {
f <- blast(!!enexpr(column) ~ .)
model <- lm(f, data = data)
coef(model)
}
df <- data.frame(
age = rnorm(10),
income = rnorm(10)
)
make_model(df, income)
#> (Intercept) age
#> -0.3563103 -0.2200773
Works flexibly:
blast(list(!!!1:3))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
The following should work:
library(tidyverse)
# Your original function, modified
make_model <- function(df, column) {
column <- enexpr(column)
form <- as.formula(paste0(quo_text(column), " ~ ."))
model <- lm(form, data = df)
return(coef(model))
}
# Your original data and call
tibble(
age = rnorm(10),
income = rnorm(10)
) %>%
make_model(income)
I'm writing a function for my (working) R script in order to clean up my code. I do not have experience with writing functions, but decided I should invest some time into this. The goal of my function is to perform multiple statistical tests while only passing the required dataframe, quantitative variable and grouping variable once. However, I cannot get this to work. For your reference, I'll use the ToothGrowth data frame to illustrate my problem.
Say I want to run a Kruskal-Wallis test and one-way ANOVA on len, to compare different groups named supp, for whatever reason. I can do this separately with
kruskal.test(len ~ supp, data = ToothGrowth)
aov(len ~ supp, data = ToothGrowth)
Now I want to write a function that performs both tests. This is what I had thought should work:
stat_test <- function(mydata, quantvar, groupvar) {
kruskal.test(quantvar ~ groupvar, data = mydata)
aov(quantvar ~ groupvar, data = mydata)
}
But if I then run stat_test(ToothGrowth, "len", "sup"), I get the error
Error in kruskal.test.default("len", "supp") :
all observations are in the same group
What am I doing wrong? Any help would be much appreciated!
You can use deparse(substitute(quantvar)) to get the quoted name of the column you are passing to the function, and this will allow you to build a formula using paste. This is a more idiomatic way of operating in R.
Here's a reproducible example:
stat_test <- function(mydata, quantvar, groupvar) {
A <- as.formula(paste(deparse(substitute(quantvar)), "~",
deparse(substitute(groupvar))))
print(kruskal.test(A, data = mydata))
cat("\n--------------------------------------\n\n")
aov(A, data = mydata)
}
stat_test(ToothGrowth, len, supp)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: len by supp
#> Kruskal-Wallis chi-squared = 3.4454, df = 1, p-value = 0.06343
#>
#>
#> --------------------------------------
#> Call:
#> aov(formula = A, data = mydata)
#>
#> Terms:
#> supp Residuals
#> Sum of Squares 205.350 3246.859
#> Deg. of Freedom 1 58
#>
#> Residual standard error: 7.482001
#> Estimated effects may be unbalanced
Created on 2020-03-30 by the reprex package (v0.3.0)
It looks like you need to convert your variable arguments, given as text strings, into a formula. You can do this by concatenating the strings with paste(). Also, you will need to wrap print() around both of your statistical tests within the function, otherwise only the last one will display.
Here is the modified function:
stat_test <- function(mydata, quantvar, groupvar) {
model_formula <- formula(paste(quantvar, '~', groupvar))
print(kruskal.test(model_formula, data = mydata))
print(aov(model_formula, data = mydata))
}
For reference, if using rstatix (tidy version of statistical functions), you need to use sym and !!, while using formula() when needed.
make_kruskal_test <- function(data, quantvar, groupvar) {
library(rstatix, quietly = TRUE)
library(rlang, quietly = TRUE)
formula_expression <- formula(paste(quantvar, "~", groupvar))
quantvar_sym <- sym(quantvar)
shapiro <- shapiro_test(data, !!quantvar_sym) %>% print()
}
sample_data <- tibble::tibble(sample = letters[1:5], mean = 1:5)
make_kruskal_test(sample_data, "mean", "sample")
#> # A tibble: 1 x 3
#> variable statistic p
#> <chr> <dbl> <dbl>
#> 1 mean 0.987 0.967
I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=rep(c("A", "B"), 2))
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609