Using dataframe name as a column in a model table - r

I'm confused as to why the following doesn't work. I'm trying to use the name of a data frame/tibble as a column in a multiple models data frame, but keep running up against the following error. Here's an example:
library(tidyverse)
library(rlang)
set.seed(666)
df1 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 2*x + 3*y
)
df2 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 4*x + 5*y
)
results <- tibble(dataset = c('df1','df2'))
Notice that the following all work:
lm(z ~ x + y, data=df1)
lm(z ~ x + y, data=df2)
lm(z ~ x + y, data=eval(sym('df1')))
But when I try the following:
results <- results %>% mutate(model = lm(z ~ x + y, data = eval(sym(dataset))))
I get the error
Error in mutate_impl(.data, dots) :
Evaluation error: Only strings can be converted to symbols.
Can someone figure out how to make this work?

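Inside mutate(), dataset is the whole column (a character vector of length 2), and sym() accepts only a single string, hence the error. We can use map() to call lm() once per name, as follows.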
library(tidyverse)
library(rlang)
results2 <- results %>%
mutate(model = map(dataset, ~lm(z ~ x + y, data = eval(sym(.)))))
results2
# # A tibble: 2 x 2
# dataset model
# <chr> <list>
# 1 df1 <S3: lm>
# 2 df2 <S3: lm>
results2$model[[1]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 6.741e-14 2.000e+00 3.000e+00
results2$model[[2]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 9.662e-14 4.000e+00 5.000e+00
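For comparison, here is a minimal base R sketch of the same lookup-by-name idea. It assumes the data frames live in the calling environment and uses mget() instead of eval(sym(.)); the names dataset_names, datasets, and coef_table are illustrative and not from the original post.

```r
# Recreate two small data sets along the lines of the question
# (plain data frames here, so no packages are required).
set.seed(666)
df1 <- data.frame(x = 1:10 + rnorm(10), y = seq(20, 38, by = 2) + rnorm(10))
df1$z <- 2 * df1$x + 3 * df1$y
df2 <- data.frame(x = 1:10 + rnorm(10), y = seq(20, 38, by = 2) + rnorm(10))
df2$z <- 4 * df2$x + 5 * df2$y

# mget() looks up several objects by name and returns them as a named list.
dataset_names <- c("df1", "df2")
datasets <- mget(dataset_names)

# Fit one model per data set; the list keeps the data set names.
models <- lapply(datasets, function(d) lm(z ~ x + y, data = d))

# One row of coefficients per data set.
coef_table <- t(sapply(models, coef))
coef_table
```

Because the list is named, models$df1 and models[["df2"]] retrieve individual fits without any symbol manipulation.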

I'd recommend a slightly different route where you bind all the data and skip the eval and sym calls. This follows the "Many Models" chapter of R for Data Science.
purrr::lst creates a list of the data frames with the names of those variables as the list's names, and the .id argument to bind_rows uses those names to create a column marking data as coming from df1 or df2. Nesting creates a column data which is a list-column of data frames. Then you can build the models of each dataset. I used the tilde shortcut notation to build the anonymous function.
The result: you have a column model that is a list of models.
library(tidyverse)
library(rlang)
results <- lst(df1, df2) %>%
bind_rows(.id = "dataset") %>%
group_by(dataset) %>%
nest() %>%
mutate(model = map(data, ~lm(z ~ x + y, data = .)))
results$model[[1]]
#>
#> Call:
#> lm(formula = z ~ x + y, data = .)
#>
#> Coefficients:
#> (Intercept) x y
#> 6.741e-14 2.000e+00 3.000e+00
You also still have a column of that nested data. If you don't want it, you can drop it:
select(results, -data)
#> # A tibble: 2 x 2
#> dataset model
#> <chr> <list>
#> 1 df1 <lm>
#> 2 df2 <lm>

Related

R Different Prediction Result for Formula Containing "%>%"

In R, I want to use "predict" to get a confidence interval at a certain x (x=42) under the model y = (centered x) + (centered x)^2. I found two possible ways to specify it:
model1 = lm(y ~ scale(x, center=T, scale=F) + I( (scale(x, center=T, scale=F))^2 ), data=data)
model2 = lm(y ~ (x %>% scale(center=T, scale=F)) + I( (x %>% scale(center=T, scale=F))^2 ), data=data)
The summary results for the two models are the same. But when I ran:
predict(model1, data.frame(x=42), interval="confidence", level=0.95)
predict(model2, data.frame(x=42), interval="confidence", level=0.95)
The results are different. I am wondering why. Does R treat the above two formulas differently because of the usage of "%>%"?
The dataset is the practice dataset SENIC.txt from Kutner's textbook; y is the 11th column and x is the 12th column.
The issue here is with scale and not %>%. scale returns a matrix which seems to affect the outcome.
One solution is to write a vector-to-vector equivalent of scale and use it:
library("magrittr") # for `%>%`
set.seed(1)
dataset = data.frame(x = rnorm(30))
dataset[["y"]] = 1 + (3 * dataset[["x"]]) + rnorm(30, mean = 0, sd = 0.1)
scale_vector = function(x, ...){
stopifnot(inherits(x, "numeric"))
scale(x, ...)[, 1]
}
lm(y ~ scale_vector(x, center=T, scale=F) + I( (scale_vector(x, center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ scale_vector(x, center = T, scale = F) + I((scale_vector(x,
#> center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> scale_vector(x, center = T, scale = F)
#> 2.99296
#> I((scale_vector(x, center = T, scale = F))^2)
#> -0.01645
lm(y ~ (x %>% scale_vector(center=T, scale=F)) + I( (x %>% scale_vector(center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ (x %>% scale_vector(center = T, scale = F)) +
#> I((x %>% scale_vector(center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> x %>% scale_vector(center = T, scale = F)
#> 2.99296
#> I((x %>% scale_vector(center = T, scale = F))^2)
#> -0.01645
Alternatively, if you do not mind using the tidyverse, this might be cleaner:
library("magrittr") # for `%>%`
set.seed(1)
dataset = tibble::tibble(x = rnorm(30),
y = 1 + (3 * x) + rnorm(30, mean = 0, sd = 0.1)
)
dataset %>%
dplyr::mutate(x_scaled = scale(x, center = TRUE, scale = FALSE)[, 1]) %>%
lm(y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Call:
#> lm(formula = y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Coefficients:
#> (Intercept) x_scaled I(x_scaled^2)
#> 1.27423 2.99296 -0.01645
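When the goal is a prediction at a specific x, it may help to keep the centering constant explicit, because predict() re-evaluates the formula's transformations on newdata. The sketch below is one way to do that, under the assumption that the training mean should be reused for new points. The simulated data and the names train, x_mean, x_c, and new_point are illustrative and are not taken from the SENIC dataset.

```r
# Simulated training data (an assumption; stands in for the real dataset).
set.seed(1)
train <- data.frame(x = rnorm(30, mean = 40, sd = 5))
train$y <- 1 + 3 * train$x + rnorm(30, mean = 0, sd = 0.1)

# Center once, outside the formula, and remember the training mean.
x_mean <- mean(train$x)
train$x_c <- train$x - x_mean

fit <- lm(y ~ x_c + I(x_c^2), data = train)

# Center the new point with the same training mean before predicting.
new_point <- data.frame(x_c = 42 - x_mean)
pred <- predict(fit, new_point, interval = "confidence", level = 0.95)
pred
```

With this layout the formula contains only plain columns, so the prediction does not depend on how scale() or a pipe is handled inside the model terms.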

How to use the dplyr package to do group-separated linear regressions in R?

I have a dataset of x and y separated by categories (a and b). I want to do 2 linear regressions, one for category a data and one for category b data. For this purpose, I used the dplyr package following this answer. I'm a little confused because my code is simpler, but I'm not able to do the regressions. Any tips?
library(dplyr)
Factor <- c("a", "b")
x <- seq(0,3,1)
df <- expand.grid(x = x, Factor = Factor)
df$y <- rnorm(8)
df %>%
group_by(Factor) %>%
do(lm(formula = y ~ x,
data = .))
Error: Results 1, 2 must be data frames, not lm
This creates a list column whose components are lm objects. (Note that as of dplyr 1.1.0, cur_data() is deprecated in favor of pick().)
df2 <- df %>%
group_by(Factor) %>%
summarize(lm = list(lm(formula = y ~ x, data = cur_data())), .groups = "drop")
giving:
> df2
# A tibble: 2 x 2
Factor lm
<fct> <list>
1 a <lm>
2 b <lm>
> with(df2, setNames(lm, Factor))
$a
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
-0.3906 0.2947
$b
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
0.2684 -0.3403
Here is my approach:
df %>%
split(~ Factor) %>%
purrr::map(\(d) lm(formula = y ~ x, data = d)) # `d` avoids a name clash with the column x
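The same split-then-fit idea can also be sketched in base R alone, which may help when dplyr or purrr are not available. The seed and the names models and coefs are assumptions added for illustration.

```r
# Rebuild the example data; the seed is an assumption (the question set none).
set.seed(123)
df <- expand.grid(x = 0:3, Factor = c("a", "b"))
df$y <- rnorm(nrow(df))

# split() yields one data frame per level of Factor; lapply() fits each one.
models <- lapply(split(df, df$Factor), function(d) lm(y ~ x, data = d))

# Collect intercepts and slopes in one matrix, one row per group.
coefs <- t(sapply(models, coef))
coefs
```

The result is a plain named list, so models$a and models$b are ordinary lm objects that work with summary() and predict().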

How to use a variable in lm() function in R?

Let us say I have a dataframe (df) with two columns called "height" and "weight".
Let's say I define:
x = "height"
How do I use x within my lm() function? Neither df[x] nor just using x works.
There are two ways.
Create a formula with paste0:
x = "height"
lm(paste0(x, '~', 'weight'), df)
Or use reformulate
lm(reformulate("weight", x), df)
Using a reproducible example with the mtcars dataset:
x = "cyl"
lm(paste0(x, '~', 'mpg'), data = mtcars)
#Call:
#lm(formula = paste0(x, "~", "mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
and same with
lm(reformulate("mpg", x), mtcars)
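As a small extension of the reformulate approach, the sketch below builds a formula from several predictor names and checks that it gives the same fit as the hand-written formula. The variable names response, predictors, fml, fit_dynamic, and fit_literal are illustrative.

```r
# Column names held as strings, as in the question.
response <- "cyl"
predictors <- c("mpg", "wt")

# reformulate() joins the predictors with "+" and attaches the response.
fml <- reformulate(predictors, response = response)
print(fml)

fit_dynamic <- lm(fml, data = mtcars)
fit_literal <- lm(cyl ~ mpg + wt, data = mtcars)

# Both fits should have identical coefficients.
all.equal(coef(fit_dynamic), coef(fit_literal))
```

Passing a formula object rather than a pasted string also keeps the stored call shorter, since lm() records whatever expression it was given.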
We can use glue to create the formula
x <- "height"
lm(glue::glue('{x} ~ weight'), data = df)
Using a reproducible example with mtcars
x <- 'cyl'
lm(glue::glue('{x} ~ mpg'), data = mtcars)
#Call:
#lm(formula = glue::glue("{x} ~ mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
When you run x = "height" you are assigning a character string to the variable x.
Consider this data frame:
df <- data.frame(
height = c(176, 188, 165),
weight = c(75, 80, 66)
)
If you want a regression using height and weight you can either do this:
lm(height ~ weight, data = df)
# Call:
# lm(formula = height ~ weight, data = df)
#
# Coefficients:
# (Intercept) weight
# 59.003 1.593
or this:
lm(df$height ~ df$weight)
# Call:
# lm(formula = df$height ~ df$weight)
#
# Coefficients:
# (Intercept) df$weight
# 59.003 1.593
If you really want to use x instead of height, you must have a variable called x (in your df or in your environment). You can do that by creating a new variable:
x <- df$height
y <- df$weight
lm(x ~ y)
# Call:
# lm(formula = x ~ y)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
Or by changing the names of existing variables:
names(df) <- c("x", "y")
lm(x ~ y, data = df)
# Call:
# lm(formula = x ~ y, data = df)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593

Writing a function to enclose another function with regression models [duplicate]

Goal: Run three regression models with three different outcome variables, ideally in a more efficient way than the model1, model2, model3 version in the last three lines of the code below.
Specific question: How can I write a function that iterates over the set of dv's, creates each model as a numbered object (e.g. model1, model2, etc.), AND switches the dv (e.g. dv1, dv2, etc.)? I assume there is a for loop and function solution to this, but I am not getting it...
mydf <- data.frame(dv1 = rnorm(100),
dv2 = rnorm(100),
dv3 = rnorm(100),
iv1 = rnorm(100),
iv2 = rnorm(100),
iv3 = rnorm(100))
mymodel <- function(dv, df) {
lm(dv ~ iv1 + iv2 + iv3, data = df)
}
model1 <- mymodel(dv = mydf$dv1, df = mydf)
model2 <- mymodel(dv = mydf$dv2, df = mydf)
model3 <- mymodel(dv = mydf$dv3, df = mydf)
Here's another approach using the tidyverse packages, since dplyr has more or less supplanted plyr.
library(tidyverse)
mydf <- data.frame(dv1 = rnorm(100),
dv2 = rnorm(100),
dv3 = rnorm(100),
iv1 = rnorm(100),
iv2 = rnorm(100),
iv3 = rnorm(100))
mymodel <- function(df) {
lm(value ~ iv1 + iv2 + iv3, data = df)
}
mydf %>%
gather("variable","value", contains("dv")) %>%
split(.$variable) %>%
map(mymodel)
#> $dv1
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.04516 -0.04657 0.08045 0.02518
#>
#>
#> $dv2
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.03906 0.16730 0.10324 0.02500
#>
#>
#> $dv3
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> 0.018492 -0.162563 0.002738 0.179366
Created on 2018-11-26 by the reprex package (v0.2.1)
You could convert your data.frame to long form, with all the dv values in one column, and then use plyr's dlply to create the lms. This splits the data.frame on the specified column ("dvN"), applies the function to each piece, and returns a list of lms. I have changed the function slightly so that it takes just a data.frame rather than the column separately.
Hope this gives what you need.
library(plyr)
library(tidyr)
mydf_l <- gather(mydf, dvN, Value, 1:3)
mymodel2 <- function(df) {
lm(Value ~ iv1 + iv2 + iv3, data = df)
}
allmodels <- dlply(mydf_l, .(dvN), mymodel2)
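If reshaping to long form is not wanted, a base R sketch of the loop the question asks for could look like the following. It assumes the outcome columns are named dv1 to dv3 as in the question; the seed and the names dvs, ivs, and models are illustrative additions.

```r
# Example data as in the question; the seed is an assumption.
set.seed(42)
mydf <- data.frame(dv1 = rnorm(100), dv2 = rnorm(100), dv3 = rnorm(100),
                   iv1 = rnorm(100), iv2 = rnorm(100), iv3 = rnorm(100))

dvs <- c("dv1", "dv2", "dv3")
ivs <- c("iv1", "iv2", "iv3")

# Build one formula per outcome (dv1 ~ iv1 + iv2 + iv3, ...) and fit it.
models <- lapply(dvs, function(dv) {
  lm(reformulate(ivs, response = dv), data = mydf)
})

# Name the list elements model1, model2, model3.
names(models) <- paste0("model", seq_along(dvs))

formula(models$model2)
```

Keeping the fits in one named list, rather than creating separate model1, model2, model3 objects, makes it easy to apply summary() or coef() to all of them with lapply().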

Non-standard evaluation (NSE) for dplyr do()

I would like to implement something like
mtcars %>% group_by(cyl) %>% do(mod = lm(mpg ~ disp, data = .))
inside a function like this
myfun <- function(d, groupvar, x, y) {
d %>% group_by(groupvar) %>% do(mod = lm(y ~ x, data = .))
}
myfun(mtcars, cyl, disp, mpg)
but I do not understand NSE well enough to do it. I know, for example, that dplyr NSE functions like group_by or summarize have the associated SE functions group_by_ and summarize_, but it seems that do has no associated do_.
Try
library(dplyr)
library(lazyeval)
f <- function(d, groupvar, x , y) {
groupvar <- lazy(groupvar)
x <- lazy(x)
y <- lazy(y)
d %>% group_by_(groupvar) %>%
do(mod = lm(interp(quote(y ~ x), y = y, x = x), data = .))
}
f(mtcars, cyl, disp, mpg)
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# cyl mod
# 1 4 <S3:lm>
# 2 6 <S3:lm>
# 3 8 <S3:lm>
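The underscore verbs such as group_by_ have since been deprecated in dplyr. A possible modern equivalent is sketched below, assuming dplyr 1.0.0 or later (for nest_by and the {{ }} operator) and rlang. The function name fit_by_group and the use of reformulate() to build the formula are choices made here for illustration, not part of the original answer.

```r
library(dplyr)

fit_by_group <- function(d, groupvar, x, y) {
  # Capture the bare column names as strings and build e.g. mpg ~ disp.
  fml <- reformulate(rlang::as_name(rlang::ensym(x)),
                     response = rlang::as_name(rlang::ensym(y)))
  d %>%
    nest_by({{ groupvar }}) %>%               # one row per group, data in a list-column
    mutate(mod = list(lm(fml, data = data)))  # fit the model to each group's data
}

res <- fit_by_group(mtcars, cyl, disp, mpg)
res
```

The result has one row per value of cyl, with the fitted lm objects in the list-column mod, so res$mod[[1]] is the model for the first group.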
