How to use a variable in the lm() function in R?

Let us say I have a dataframe (df) with two columns called "height" and "weight".
Let's say I define:
x = "height"
How do I use x within my lm() function? Neither df[x] nor just using x works.

Two ways :
Create a formula with paste
x = "height"
lm(paste0(x, '~', 'weight'), df)
Or use reformulate
lm(reformulate("weight", x), df)
Using a reproducible example with the mtcars dataset (note that the column name is lowercase "cyl"):
x = "cyl"
lm(paste0(x, '~', 'mpg'), data = mtcars)
#Call:
#lm(formula = paste0(x, "~", "mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
and same with
lm(reformulate("mpg", x), mtcars)
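The same reformulate() idea extends to several predictors held in a character vector. This is only a minimal sketch: the names response, predictors, f and fit are made up for illustration, and mtcars stands in for your own data.

```r
# Build a formula from strings: one response, several predictors.
response   <- "mpg"
predictors <- c("wt", "hp")

# reformulate(termlabels, response) returns a real formula object.
f <- reformulate(predictors, response = response)
print(f)

fit <- lm(f, data = mtcars)
print(coef(fit))
```

Because f is a formula object rather than a string, it can also be passed to other modelling functions that expect a formula.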

We can use glue to create the formula
x <- "height"
lm(glue::glue('{x} ~ weight'), data = df)
Using a reproducible example with mtcars
x <- 'cyl'
lm(glue::glue('{x} ~ mpg'), data = mtcars)
#Call:
#lm(formula = glue::glue("{x} ~ mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
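If you would rather not depend on glue, a base R equivalent is to interpolate with sprintf() and convert explicitly with as.formula(). A rough sketch, assuming the same mtcars example (f and fit are illustrative names):

```r
x <- "cyl"

# Interpolate the column name into a string, then convert it to a formula.
f <- as.formula(sprintf("%s ~ mpg", x))

fit <- lm(f, data = mtcars)
print(coef(fit))
```

Converting up front makes it explicit that lm() receives a formula and not a character string.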

When you run x = "height" you are assigning a string of characters to the variable x.
Consider this data frame:
df <- data.frame(
height = c(176, 188, 165),
weight = c(75, 80, 66)
)
If you want a regression using height and weight you can either do this:
lm(height ~ weight, data = df)
# Call:
# lm(formula = height ~ weight, data = df)
#
# Coefficients:
# (Intercept) weight
# 59.003 1.593
or this:
lm(df$height ~ df$weight)
# Call:
# lm(formula = df$height ~ df$weight)
#
# Coefficients:
# (Intercept) df$weight
# 59.003 1.593
If you really want to use x instead of height, you must have a variable called x (in your df or in your environment). You can do that by creating a new variable:
x <- df$height
y <- df$weight
lm(x ~ y)
# Call:
# lm(formula = x ~ y)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
Or by changing the names of existing variables:
names(df) <- c("x", "y")
lm(x ~ y, data = df)
# Call:
# lm(formula = x ~ y, data = df)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
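If the column name has to stay in a string, another option in the same spirit is to extract the column as a vector with df[[x]]. Note that df[x] returns a one-column data frame rather than a vector, which is not what a formula term needs. A small sketch (fit_by_name and fit_direct are just illustrative names):

```r
df <- data.frame(
  height = c(176, 188, 165),
  weight = c(75, 80, 66)
)
x <- "height"

# df[[x]] pulls the column out as a numeric vector.
fit_by_name <- lm(df[[x]] ~ df$weight)

# Reference fit using the column names directly.
fit_direct <- lm(height ~ weight, data = df)

print(coef(fit_by_name))
print(coef(fit_direct))
```

The estimates agree; only the labels in the printed call and coefficient names differ.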

Related

How to use the dplyr package to do group-separated linear regressions in R?

I have a dataset of x and y separated by categories (a and b). I want to do 2 linear regressions, one for category a data and one for category b data. For this purpose, I used the dplyr package following this answer. I'm a little confused because my code is simpler, but I'm not able to do the regressions. Any tips?
library(dplyr)
Factor <- c("a", "b")
x <- seq(0,3,1)
df <- expand.grid(x = x, Factor = Factor)
df$y <- rnorm(8)
df %>%
group_by(Factor) %>%
do(lm(formula = y ~ x,
data = .))
Error: Results 1, 2 must be data frames, not lm
This creates a list column whose components are lm objects:
df2 <- df %>%
group_by(Factor) %>%
summarize(lm = list(lm(formula = y ~ x, data = cur_data())), .groups = "drop")
giving:
> df2
# A tibble: 2 x 2
Factor lm
<fct> <list>
1 a <lm>
2 b <lm>
> with(df2, setNames(lm, Factor))
$a
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
-0.3906 0.2947
$b
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
0.2684 -0.3403
Here is my approach:
df %>%
split(~ Factor) %>%
purrr::map(\(d) lm(formula = y ~ x, data = d))
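For comparison, the split-then-fit idea can also be written in base R only. This is a sketch under the assumption that the data look like the question's df; set.seed(1) and the names fits and coefs are added here for illustration.

```r
set.seed(1)
df <- expand.grid(x = 0:3, Factor = c("a", "b"))
df$y <- rnorm(nrow(df))

# One data frame per level of Factor, then one lm per data frame.
fits <- lapply(split(df, df$Factor), function(d) lm(y ~ x, data = d))

# Collect the coefficients into a matrix with one row per group.
coefs <- t(sapply(fits, coef))
print(coefs)
```

The result is a named list of lm objects (fits$a, fits$b), so the usual summary() or predict() calls work on each element.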

Using dataframe name as a column in a model table

I'm confused as to why the following doesn't work. I'm trying to use the name of a data frame/tibble as a column in a multiple models data frame, but keep running up against the following error. Here's an example:
library(tidyverse)
library(rlang)
set.seed(666)
df1 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 2*x + 3*y
)
df2 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 4*x + 5*y
)
results <- tibble(dataset = c('df1','df2'))
Notice that the following all work:
lm(z ~ x + y, data=df1)
lm(z ~ x + y, data=df2)
lm(z ~ x + y, data=eval(sym('df1')))
But when I try the following:
results <- results %>% mutate(model = lm(z ~ x + y, data = eval(sym(dataset))))
I get the error
Error in mutate_impl(.data, dots) :
Evaluation error: Only strings can be converted to symbols.
Can someone figure out how to make this work?
We can use the map function and specify the lm function as the following.
library(tidyverse)
library(rlang)
results2 <- results %>%
mutate(model = map(dataset, ~lm(z ~ x + y, data = eval(sym(.)))))
results2
# # A tibble: 2 x 2
# dataset model
# <chr> <list>
# 1 df1 <S3: lm>
# 2 df2 <S3: lm>
results2$model[[1]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 6.741e-14 2.000e+00 3.000e+00
results2$model[[2]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 9.662e-14 4.000e+00 5.000e+00
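A base R variant of the same lookup-by-name idea uses mget(), which takes a character vector of object names and returns a named list. This is a sketch only: make_data() is a hypothetical helper that mimics the question's data with plain data frames, and datasets and models are illustrative names.

```r
set.seed(666)

# Hypothetical helper: z is an exact linear combination of x and y.
make_data <- function(a, b) {
  x <- 1:10 + rnorm(10)
  y <- seq(20, 38, by = 2) + rnorm(10)
  data.frame(x = x, y = y, z = a * x + b * y)
}
df1 <- make_data(2, 3)
df2 <- make_data(4, 5)

# Look the data frames up by name; the list keeps the names.
datasets <- mget(c("df1", "df2"))
models <- lapply(datasets, function(d) lm(z ~ x + y, data = d))

print(round(sapply(models, coef), 3))
```

The list names carry the dataset labels, so no separate dataset column is needed to keep track of which model belongs to which data frame.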
I'd recommend a slightly different route where you bind all the data and skip the eval and sym calls. This follows the "Many Models" chapter of R for Data Science.
purrr::lst creates a list of the data frames with the names of those variables as the list's names, and the .id argument to bind_rows uses those names to create a column marking data as coming from df1 or df2. Nesting creates a column data which is a list-column of data frames. Then you can build the models of each dataset. I used the tilde shortcut notation to build the anonymous function.
The result: you have a column model that is a list of models.
library(tidyverse)
library(rlang)
results <- lst(df1, df2) %>%
bind_rows(.id = "dataset") %>%
group_by(dataset) %>%
nest() %>%
mutate(model = map(data, ~lm(z ~ x + y, data = .)))
results$model[[1]]
#>
#> Call:
#> lm(formula = z ~ x + y, data = .)
#>
#> Coefficients:
#> (Intercept) x y
#> 6.741e-14 2.000e+00 3.000e+00
You also still have a column of that nested data. If you don't want it, you can drop it:
select(results, -data)
#> # A tibble: 2 x 2
#> dataset model
#> <chr> <list>
#> 1 df1 <lm>
#> 2 df2 <lm>

Showing string in formula and not as variable in lm fit

When lm(sformula) is executed, the printed call does not show the string assigned to sformula, and I have not been able to resolve this. I suspect it comes from the general way R handles function arguments rather than anything specific to linear regression.
The examples below illustrate the issue. Example 1 gives the undesired output lm(formula = sformula). Example 2 gives the output I would like, i.e., lm(formula = "y~x").
x <- 1:10
y <- x * runif(10)
sformula <- "y~x"
## Example: 1
lm(sformula)
## Call:
## lm(formula = sformula)
## Example: 2
lm("y~x")
## Call:
## lm(formula = "y~x")
How about eval(call("lm", sformula))?
lm(sformula)
#Call:
#lm(formula = sformula)
eval(call("lm", sformula))
#Call:
#lm(formula = "y~x")
Generally speaking there is a data argument for lm. Let's do:
mydata <- data.frame(y = y, x = x)
eval(call("lm", sformula, quote(mydata)))
#Call:
#lm(formula = "y~x", data = mydata)
The above call() + eval() combination can be replaced by do.call():
do.call("lm", list(formula = sformula))
#Call:
#lm(formula = "y~x")
do.call("lm", list(formula = sformula, data = quote(mydata)))
#Call:
#lm(formula = "y~x", data = mydata)
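A related idiom, if you would like the stored call to contain the formula itself rather than the string, is to splice a formula object into the call with bquote() before evaluating it. This is a sketch using the question's variable names; set.seed(1) and fit are additions for illustration.

```r
set.seed(1)
x <- 1:10
y <- x * runif(10)
mydata <- data.frame(y = y, x = x)
sformula <- "y~x"

# .( ) is evaluated inside bquote(), so the formula object is inserted
# into the call before lm() ever sees it.
fit <- eval(bquote(lm(.(as.formula(sformula)), data = mydata)))

# The stored call now holds the formula, not the symbol sformula.
print(fit$call)
```

This keeps data = mydata as a symbol in the call, in the same way as quote(mydata) in the do.call() version.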

Using lapply on a list of models

I have generated a list of models, and would like to create a summary table.
As an example, here are two models:
x <- seq(1:10)
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))
and two functions, the first generating the equation from the components of the formula
get.model.equation <- function(x) {
x <- as.character((x$call)$formula)
x <- paste(x[2],x[1],x[3])
}
and the second generating the name of model as a string
get.model.name <- function(x) {
x <- deparse(substitute(x))
}
With these, I create a summary table
model.list <- list(model1, model2)
AIC.data <- lapply(X = model.list, FUN = AIC)
AIC.data <- as.numeric(AIC.data)
model.models <- lapply(X = model.list, FUN = get.model.equation)
model.summary <- cbind(model.models, AIC.data)
model.summary <- as.data.frame(model.summary)
names(model.summary) <- c("Model", "AIC")
model.summary$AIC <- unlist(model.summary$AIC)
rm(AIC.data)
model.summary[order(model.summary$AIC),]
Which all works fine.
I'd like to add the model name to the table using get.model.name
x <- get.model.name(model1)
Which gives me "model1" as I want.
So now I apply the function to the list of models
model.names <- lapply(X = model.list, FUN = get.model.name)
but now instead of model1 I get X[[1L]]
How do I get model1 rather than X[[1L]]?
I'm after a table that looks like this:
Model Formula AIC
model1 y ~ x 11.89136
model2 y ~ x + I(x^2) + I(x^3) 15.03888
Do you want something like this?
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
I'd do something like this:
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
# changed Reduce('rbind', ...) to do.call(rbind, ...) (Hadley's comment)
do.call(rbind,
lapply(names(model.list), function(x)
data.frame(model = x,
formula = get.model.equation(model.list[[x]]),
AIC = AIC(model.list[[x]])
)
)
)
# model formula AIC
# 1 model1 y ~ x 11.89136
# 2 model2 y ~ x + I(x^2) + I(x^3) 15.03888
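The reason deparse(substitute(x)) gives X[[1L]] is that lapply() calls the function on X[[i]], so that is the expression the function sees. Naming the list elements and reading names() sidesteps the problem. A compact sketch of the whole table, assuming the two models from the question (vapply is used here only to fix the return types):

```r
x <- 1:10
y <- sin(x)^2

# Name the list elements so the model names travel with the models.
model.list <- list(
  model1 = lm(y ~ x),
  model2 = lm(y ~ x + I(x^2) + I(x^3))
)

model.summary <- data.frame(
  Model   = names(model.list),
  Formula = vapply(model.list,
                   function(m) paste(deparse(formula(m)), collapse = " "),
                   character(1)),
  AIC     = vapply(model.list, AIC, numeric(1)),
  row.names = NULL
)

# Sort by AIC, lowest first.
print(model.summary[order(model.summary$AIC), ])
```

formula(m) is used instead of picking apart m$call, which avoids reassembling the left and right sides by hand.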
Another option, with ldply, but see hadley's comment below for a more efficient use of ldply:
# prepare data
x <- seq(1:10)
y <- sin(x)^2
dat <- data.frame(x,y)
# create list of named models obviously these are not suited to the data here, just to make the workflow work...
models <- list(model1=lm(y~x, data = dat),
model2=lm(y~I(1/x), data=dat),
model3=lm(y ~ log(x), data = dat),
model4=nls(y ~ I(1/x*a) + b*x, data = dat, start = list(a = 1, b = 1)),
model5=nls(y ~ (a + b*log(x)), data=dat, start = setNames(coef(lm(y ~ log(x), data=dat)), c("a", "b"))),
model6=nls(y ~ I(exp(1)^(a + b * x)), data=dat, start = list(a=0,b=0)),
model7=nls(y ~ I(1/x*a)+b, data=dat, start = list(a=1,b=1))
)
library(plyr)
library(AICcmodavg) # for small sample sizes
# build table with model names, function, AIC and AICc
data.frame(cbind(ldply(models, function(x) cbind(AICc = AICc(x), AIC = AIC(x))),
model = sapply(1:length(models), function(x) deparse(formula(models[[x]])))
))
.id AICc AIC model
1 model1 15.89136 11.89136 y ~ x
2 model2 15.78480 11.78480 y ~ I(1/x)
3 model3 15.80406 11.80406 y ~ log(x)
4 model4 16.62157 12.62157 y ~ I(1/x * a) + b * x
5 model5 15.80406 11.80406 y ~ (a + b * log(x))
6 model6 15.88937 11.88937 y ~ I(exp(1)^(a + b * x))
7 model7 15.78480 11.78480 y ~ I(1/x * a) + b
It's not immediately obvious to me how to replace the .id with a column name in the ldply function, any tips?
