Dynamic variable names in R regressions - r

Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models where different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y= runif(1000)
x1 = runif(1000)
x2 = runif(1000)
df2= data.frame(y,x1,x2)
summary(lm(y ~ x1+x2, data=df2)) ## works
var = "x1"
summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)) # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely "x1") and inserts it into the code in a way that R treats it as a variable (not a character string). But I seem to be wrong. Can anyone enlighten me?

Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-0.49298 -0.26248 -0.00046  0.24111  0.51988 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  0.50244    0.02480  20.258   <2e-16 ***
#x1          -0.01468    0.03161  -0.464    0.643    
#x2          -0.01635    0.03227  -0.507    0.612    
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared:  0.0004708, Adjusted R-squared:  -0.001534 
#F-statistic: 0.2348 on 2 and 997 DF,  p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
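To loop over several candidate variables with this approach, here is a minimal sketch (the vars vector and loop structure are illustrative additions, not from the question):
vars <- c("x1", "x2")  # candidate predictors
fits <- lapply(vars, function(v) {
  v <- as.symbol(v)
  # .(v) is spliced into the formula before the call is evaluated
  eval(bquote(lm(y ~ .(v), data = df2)))
})
lapply(fits, summary)
Each element of fits then shows its own substituted formula in the Call.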

The bang-bang operator !! only works with "tidy" functions. It's not a part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap the call in a function that can do the expansion; rlang::expr is one such example:
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.49178 -0.25482  0.00027  0.24566  0.50730 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.4953683  0.0242949  20.390   <2e-16 ***
# x1          -0.0006298  0.0314389  -0.020    0.984    
# x2          -0.0052848  0.0318073  -0.166    0.868    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared:  2.796e-05, Adjusted R-squared:  -0.001978 
# F-statistic: 0.01394 on 2 and 997 DF,  p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
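The same pattern extends to the loop from the question; a sketch, with an illustrative vars vector:
vars <- c("x1", "x2")  # illustrative candidate predictors
fits <- lapply(vars, function(v) {
  # !! splices the symbol into the captured call before evaluation
  rlang::eval_tidy(rlang::expr(lm(y ~ !!rlang::sym(v), data = df2)))
})
lapply(fits, summary)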

1) Just use lm(df2), or, if df2 has additional columns beyond what is shown in the question but we just want to regress on x1 and x2, then
df3 <- df2[c("y", var, "x2")]
lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the intended call (with the expanded formula) into it as in the second line:
fm <- lm(df3)
# store a call whose deparsed form shows the expanded formula
fm$call <- as.call(list(quote(lm), formula = formula(model.frame(df3)), data = quote(df3)))
Either one gives this:
> fm

Call:
lm(formula = y ~ x1 + x2, data = df3)

Coefficients:
(Intercept)           x1           x2  
    0.44752      0.04278      0.05011  
2) Character string: lm accepts a character string for the formula, so this also works. The fn$ prefix from the gsubfn package causes substitution to occur in character arguments.
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)
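Applied to the looping use case from the question, a sketch along the same lines (the vars vector is illustrative):
vars <- c("x1", "x2")  # illustrative candidate predictors
fits <- lapply(vars, function(v) {
  # the stored call shows the substituted formula string, e.g. lm("y ~ x1", df2)
  do.call("lm", list(sprintf("y ~ %s", v), quote(df2)))
})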

Related

How do I pass a string inside an R function

I have a data frame and I would like to write a function that takes the column names and returns linear models.
I'm trying this:
a <- function(j, k){
  reg1 <- lm(data$j ~ data$k)
  summary(reg1)
}
a(j="hour",k="score")
This fails because data$j is NULL.
You cannot use $ when passing a column name as a variable. Here are a couple of ways in which you can do this.
Use reformulate to create a formula object
a <- function(data, j, k){
  reg1 <- lm(reformulate(k, j), data = data)
  summary(reg1)
}
lm also accepts the formula as a string, so you don't necessarily need to convert it into a formula object.
a <- function(data, j, k){
  reg1 <- lm(sprintf('%s~%s', j, k), data = data)
  summary(reg1)
}
You can call this as :
a(mtcars, 'mpg', 'cyl')
#Call:
#lm(formula = sprintf("%s~%s", j, k), data = data)
#
#Residuals:
#    Min      1Q  Median      3Q     Max 
#-4.9814 -2.1185  0.2217  1.0717  7.5186 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
#cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 3.206 on 30 degrees of freedom
#Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
#F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
Note that I added data as an additional argument to the function. It is generally better practice to pass the data object into the function rather than relying on it being found in the global environment.
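As an aside, reformulate also accepts a vector of term labels, so the same helper extends to several predictors; a small sketch using mtcars columns:
# build mpg ~ cyl + wt from character vectors
f <- reformulate(c("cyl", "wt"), response = "mpg")
lm(f, data = mtcars)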
Try this. If you are going to use strings that refer to variables in a data frame, it is better to access them inside a function using [[ ]]. Here is the code of your function with slight changes:
a <- function(j, k){
  reg1 <- lm(data[[j]] ~ data[[k]])
  summary(reg1)
}
a(j="hour",k="score")
And a small example using the iris dataset:
#Example
data=iris
#Code
a(j="Sepal.Length",k="Petal.Length")
You can further tune your function a if needed.
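For instance, combining it with the earlier advice to pass the data explicitly gives a variant like this (my adaptation, not from the original answer):
a <- function(data, j, k){
  reg1 <- lm(data[[j]] ~ data[[k]])
  summary(reg1)
}
a(iris, j = "Sepal.Length", k = "Petal.Length")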

Running a regression

Background: my data set has 52 rows and 12 columns (assume column names are A - L) and the name of my data set is foo
I am told to run a regression where foo$L is the dependent variable, and all other variables are independent except for foo$K.
The way I was doing it is:
fit <- lm(foo$L ~ foo$A + ... + foo$J)
then calling
summary(fit)
Is my way a good way to run a regression and find the intercept and coefficients?
Use the data argument to lm so you don't have to use the foo$ syntax for each predictor. Use dependent ~ . as the formula to have the dependent variable predicted by all other variables. Then you can use - K to exclude K:
data_mat = matrix(rnorm(52 * 12), nrow = 52)
df = as.data.frame(data_mat)
colnames(df) = LETTERS[1:12]
lm(L ~ . - K, data = df)
You can first remove the column K, and then do fit <- lm(L ~ ., data = foo). This will treat the L column as the dependent variable and all the other columns as the independent variables. You don't have to specify each column names in the formula.
Here is an example using the mtcars dataset, fitting a multiple regression model to mpg with all the other variables except carb.
mtcars2 <- mtcars[, !names(mtcars) %in% "carb"]
fit <- lm(mpg ~ ., data = mtcars2)
summary(fit)
# Call:
# lm(formula = mpg ~ ., data = mtcars2)
#
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.3038 -1.6964 -0.1796  1.1802  4.7245 
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
# (Intercept) 12.83084   18.18671   0.706  0.48790   
# cyl         -0.16881    0.99544  -0.170  0.86689   
# disp         0.01623    0.01290   1.259  0.22137   
# hp          -0.02424    0.01811  -1.339  0.19428   
# drat         0.70590    1.56553   0.451  0.65647   
# wt          -4.03214    1.33252  -3.026  0.00621 **
# qsec         0.86829    0.68874   1.261  0.22063   
# vs           0.36470    2.05009   0.178  0.86043   
# am           2.55093    2.00826   1.270  0.21728   
# gear         0.50294    1.32287   0.380  0.70745   
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.593 on 22 degrees of freedom
# Multiple R-squared:  0.8687, Adjusted R-squared:  0.8149 
# F-statistic: 16.17 on 9 and 22 DF,  p-value: 9.244e-08

Need help modeling data with a log() function

I have some data that Excel will fit pretty nicely with a logarithmic trendline. I want to pass the same data into R and have it tell me the coefficients and the intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the value of x but working backwards it seems you were using 1, 2, 3, ... so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept)      log(x) 
     0.7521     -0.0990 
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
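Since the stated goal is to project into the future, the fitted model can be applied to new x values with predict; a short sketch (the horizon 12:30 is an arbitrary choice):
# predicted retention for future days 12 through 30
predict(fm, newdata = data.frame(x = 12:30))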
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))

Residuals:
       Min         1Q     Median         3Q        Max 
-3.894e-10 -2.288e-10 -2.891e-11  1.620e-10  4.609e-10 

Coefficients:
              Estimate Std. Error     t value Pr(>|t|)    
(Intercept)  7.521e-01  2.198e-10  3421942411   <2e-16 ***
log(t)      -9.900e-02  1.261e-10  -784892428   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.161e+17 on 1 and 9 DF,  p-value: < 2.2e-16
For large projects I recommend encapsulating the data in a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)

Successively removing a predictor variable from a formula

I have a model formula in the form of
model.all <- lme(Response ~ A + B + C)
I would like to update this model by successively removing a predictor variable from the model, so I would end up with 3 models, specifically:
mod.1 <- lme(Response ~ B + C) ; mod.2 <- lme(Response ~ A + C) ; mod.3 <- lme(Response ~ A + B)
I am thinking of a loop. I am aware of the update function, but I have too many predictor variables to change the code manually.
Any suggestions would be appreciated.
I would use combn on this occasion; see the example below:
Example Data
Response <- runif(100)
A <- runif(100)
B <- runif(100)
C <- runif(100)
Solution
a <- c('A','B','C') #the names of your variables
b <- as.data.frame(combn(a,2)) #two-way combinations of those using combn
#create the formula for each model
my_forms <- sapply(b, function(x) paste('Response ~ ', paste(x,collapse=' + ')))
> my_forms #the formulas that will be used in the model
                 V1                 V2                 V3 
 "Response ~ A + B"  "Response ~ A + C"  "Response ~ B + C" 
#run each model
my_models <- lapply(my_forms, function(x) lm(as.formula(x)))
Output
> summary(my_models[[1]])

Call:
lm(formula = as.formula(x))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48146 -0.20745 -0.00247  0.24263  0.58341 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.32415    0.08232   3.938 0.000155 ***
A            0.25404    0.09890   2.569 0.011733 *  
B            0.07955    0.10129   0.785 0.434141    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2828 on 97 degrees of freedom
Multiple R-squared:  0.06507, Adjusted R-squared:  0.04579 
F-statistic: 3.375 on 2 and 97 DF,  p-value: 0.03827
As you can see, each model is saved as a list element in my_models. I find this quite easy to build and run.
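Since the question mentions update, the leave-one-out models can also be built directly from the full fit; a hedged sketch using lm in place of lme (the same formula update applies to the fixed effects):
full <- lm(Response ~ A + B + C)
# drop each predictor in turn from the full model; a holds the variable names
drop_one <- lapply(a, function(v) update(full, as.formula(paste(". ~ . -", v))))
names(drop_one) <- paste("without", a)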

Repeating univariable GLM with several different variables

I have an outcome variable, say Y and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (ie Y~X1,...,Y~X20) and then do a likelihood ratio test for each model. Finally I would like to create a table the has the resulting P value from the likelihood test for each model.
From what I have seen the lapply function and split function could be useful for this but I don't really understand how they work in the examples I've seen.
This is what I tried at first:
> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) :
variable lengths differ (found for 'VarNames')
I'm not sure if that was a good approach though.
It is easier to answer your question if you provide a minimal example.
One way to go - but certainly not the most beautiful - is to use paste to create the formulas as a vector of strings and then use lapply on them. The code for this could look like this:
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')
formulas <- paste('intBT ~', var.names)
fitted.models <- lapply(formulas, glm, data=example.data)
This gives a list of fitted models. You can then use the apply functions on fitted.models to run further tests.
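For the likelihood ratio tests, each fitted model can be compared against an intercept-only null model with anova; a sketch building on the example above (refitting from the formula strings to keep the stored calls clean):
null.model <- glm(intBT ~ 1, data = example.data)
lr_pvalues <- sapply(formulas, function(f) {
  m <- glm(f, data = example.data)
  # row 2 of the anova table holds the test of the added predictor
  anova(null.model, m, test = "Chisq")[2, "Pr(>Chi)"]
})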
Like Paul said it really helps if you provide a minimal example, but I think this does what you want.
set.seed(123)
N <- 100
num_vars <- 5
df <- data.frame(lapply(1:num_vars, function(i) rnorm(N)))  # num_vars columns of noise
names(df) <- paste0("X", 1:num_vars)
e <- rnorm(N)
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)
singlevar <- function(var, y, df){
  model <- as.formula(paste0("y ~ ", var))
  # return the coefficient's p-value; sapply below collects one per variable
  coef(summary(glm(model, family = "binomial", data = df)))[var, 4]
}
sapply(colnames(df), singlevar, y, df)
          X1           X2           X3           X4           X5 
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01 
For comparison:
Call:
glm(formula = y ~ X2, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0674  -0.8211  -0.5296   0.9218   2.5463  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5591     0.2375  -2.354   0.0186 *  
X2            1.2871     0.3142   4.097 4.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.68  on 99  degrees of freedom
Residual deviance: 106.24  on 98  degrees of freedom
AIC: 110.24

Number of Fisher Scoring iterations: 4
