Non standard evaluation (NSE) for dplyr do() - r

I would like to implement something like
mtcars %>% group_by(cyl) %>% do(mod = lm(mpg ~ disp, data = .))
inside a function like this
myfun <- function(d, groupvar, x, y) {
d %>% group_by(groupvar) %>% do(mod = lm(y ~ x, data = .))
}
myfun(mtcars, cyl, disp, mpg)
but I cannot understand well enough NSE to do it. I know, for example, that dplyr NSE functions like group_by or summarize have the associated SE functions group_by_ and summarize_ but it seems that do has not an associated do_.

Try
library(dplyr)
library(lazyeval)
f <- function(d, groupvar, x , y) {
groupvar <- lazy(groupvar)
x <- lazy(x)
y <- lazy(y)
d %>% group_by_(groupvar) %>%
do(mod = lm(interp(quote(y ~ x), y = y, x = x), data = .))
}
f(mtcars, cyl, disp, mpg)
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# cyl mod
# 1 4 <S3:lm>
# 2 6 <S3:lm>
# 3 8 <S3:lm>

Related

Extracting the T Statistic from a function in R

I have this function that I got from a textbook that runs a couple of linear regressions and then saves the P-Value for each regression.
I would also like to save the T-Statistic as well but I am having a hard time finding the right syntax to enter for the select function.
Here is the current function.
models <- lapply(paste(factors, ' ~ a + b + c + d + e + f + g + h+ j -',factors),
function(f){ lm(as.formula(f), data = df) %>% # Call lm(.)
summary() %>% # Gather the output
"$"(coef) %>% # Keep only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept
dplyr::select(Estimate,`Pr...t..`)}) # Keep the coef & p-value
I know that I have to change the very last part of the function: dplyr::select(Estimate,`Pr...t..`) but after all my research and trial and error I am still stuck.
Here is a reproducible example using the mtcars data.
library(dplyr)
df <- mtcars
df <- df %>%
select(1,2,3,4,5,6,7)
factors <- c("mpg", "cyl", "disp", "hp", "drat", "wt")
models <- lapply(paste(factors, ' ~ mpg + cyl + disp + hp + drat + wt -',factors),
function(f){ lm(as.formula(f), data = df) %>% # Call lm(.)
summary() %>% # Gather the output
"$"(coef) %>% # Keep only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept
dplyr::select(Estimate,`Pr...t..`)} # Keep the coef & p-value
)
final <- matrix(unlist(models), ncol = 2, byrow = T) %>% # Switch from list to dataframe
data.frame(row.names = factors
Your example works for me. You can make this a little bit more "tidy" as follows:
library(broom)
sumfun <- function(f) {
lm(as.formula(f), data = df) %>%
tidy() %>%
filter(term == "(Intercept)") %>%
dplyr::select(estimate, p.value)
}
pp <- paste(factors, ' ~ mpg + cyl + disp + hp + drat + wt -',factors)
names(pp) <- factors
final <- purrr::map_dfr(pp, sumfun, .id = "factor")

How to use the dplyr package to do group-separated linear regressions in R?

I have a dataset of x and y separated by categories (a and b). I want to do 2 linear regressions, one for category a data and one for category b data. For this purpose, I used the dplyr package following this answer. I'm a little confused because my code is simpler, but I'm not able to do the regressions. Any tips?
library(dplyr)
Factor <- c("a", "b")
x <- seq(0,3,1)
df <- expand.grid(x = x, Factor = Factor)
df$y <- rnorm(8)
df %>%
group_by(Factor) %>%
do(lm(formula = y ~ x,
data = .))
Error: Results 1, 2 must be data frames, not lm
This creates a list column whose components are lm objects
df2 <- df %>%
group_by(Factor) %>%
summarize(lm = list(lm(formula = y ~ x, data = cur_data())), .groups = "drop")
giving:
> df2
# A tibble: 2 x 2
Factor lm
<fct> <list>
1 a <lm>
2 b <lm>
> with(df2, setNames(lm, Factor))
$a
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
-0.3906 0.2947
$b
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
0.2684 -0.3403
Here is my approach:
df %>%
split(~ Factor) %>%
purrr::map(\(x) lm(formula = y ~ x, data = x))

Extract slope and r squared from grouped linear models using broom

I have a dataframe that I want to run linear models on by group, then use the broom package to extract the slope and r squared for each model. So far I am trying this:
library(tidyverse)
library(broom)
#read in the dataset
data(mtcars)
#add a group variable
mtcars <- mtcars %>% as_tibble() %>% mutate(LC = 1)
#create a second group
mtcars2 <- mtcars
mtcars2 <- mtcars2 %>% mutate(LC = 2)
#bind together
mtcars <- rbind(mtcars, mtcars2)
#groupby and run regressions
all_regress <- mtcars %>% group_by(LC) %>%
do(mod1 = lm(mpg ~ disp, data = .),
mod2 = lm(mpg ~ wt, data = .))
#use broom the extract the slope and rsq per group
glance <-all_regress %>% mutate(tidy = map(mod1, broom::tidy),
glance = map(mod1, broom::glance),
augment = map(mod1, broom::augment),
rsq = glance %>% map_dbl('r.squared'),
slope = tidy %>% map_dbl(function(x) x$estimate[2]))
but this fails with:
Error: Problem with `mutate()` input `tidy`.
x No tidy method for objects of class qr
ℹ Input `tidy` is `map(mod1, broom::tidy)`.
ℹ The error occurred in row 1.
If I do this without groups such as:
#read in the dataset
data(mtcars)
mtcars <- mtcars %>% as_tibble()
#run regressions
all_regress <- mtcars %>%
do(mod1 = lm(mpg ~ disp, data = .),
mod2 = lm(mpg ~ wt, data = .))
#use broom the extract the slope and rsq per group
glance <- all_regress %>% mutate(tidy = map(mod1, broom::tidy),
glance = map(mod1, broom::glance),
augment = map(mod1, broom::augment),
rsq = glance %>% map_dbl('r.squared'),
slope = tidy %>% map_dbl(function(x) x$estimate[2]))
there is no error.
I think simply adding ungroup() achieves what you need:
all_regress <- mtcars %>% group_by(LC) %>%
do(mod1 = lm(mpg ~ disp, data = .),
mod2 = lm(mpg ~ wt, data = .)) %>% ungroup()
#use broom the extract the slope and rsq per group
glance <-all_regress %>% mutate(tidy = map(mod1, broom::tidy),
glance = map(mod1, broom::glance),
augment = map(mod1, broom::augment),
rsq = glance %>% map_dbl('r.squared'),
slope = tidy %>% map_dbl(function(x) x$estimate[2]))
I used this approach, its longer but i think theres more control in the individual steps. Finally i created a tibble with lists columns containing each model.
library(tidyverse)
library(broom)
#read in the dataset
data(mtcars)
#add a group variable
mtcars <- mtcars %>% as_tibble() %>% dplyr::select(-c(vs, am, gear, carb, cyl)) %>% mutate(LC = 1)
#create a second group
mtcars2 <- mtcars
mtcars2 <- mtcars2 %>% mutate(LC = 2)
#bind together
mtcars <- bind_rows(mtcars2, mtcars)
#group_split and run regressions
all_regress <- mtcars %>% group_split(LC) %>%
map(~ list(mod1 = lm(mpg ~ disp, data = .),
mod2 = lm(mpg ~ wt, data = .)))
# example <- all_regress[[2]][[1]] %>% glance()
#the list has 2 levels with 2 models each
data <- all_regress %>%
map(~
map(.x, function(model){
#column lists are needed because each function output different objects
tibble(mod = list(model),
tidy = list(broom::tidy(model)),
glance = list(broom::glance(model)),
augment = list(broom::augment(model))) %>%
mutate(
rsq = list(glance[[1]]$r.squared),
slope = list(tidy[[1]]$estimate[2]))
} ))
data_final <-
data %>% map2(unique(mtcars$LC), ~
map2(.x, .y, function(each_model, lc){
mutate(each_model, LC = lc)
}))
final_format <- #because of the list structure i need to bind the two datasets in each level and then bind them again.
map(data_final, ~reduce(.x, rbind)) %>% reduce(rbind)
#acces the data
final_format[1, 1][[1]]

Using dataframe name as a column in a model table

I'm confused as to why the following doesn't work. I'm trying to use the name of a data frame/tibble as a column in a multiple models data frame, but keep running up against the following error. Here's an example:
library(tidyverse)
library(rlang)
set.seed(666)
df1 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 2*x + 3*y
)
df2 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 4*x + 5*y
)
results <- tibble(dataset = c('df1','df2'))
Notice that the following all work:
lm(z ~ x + y, data=df1)
lm(z ~ x + y, data=df2)
lm(z ~ x + y, data=eval(sym('df1')))
But when I try the following:
results <- results %>% mutate(model = lm(z ~ x + y, data = eval(sym(dataset))))
I get the error
Error in mutate_impl(.data, dots) :
Evaluation error: Only strings can be converted to symbols.
Can someone figure out how to make this work?
We can use the map function and specify the lm function as the following.
library(tidyverse)
library(rlang)
results2 <- results %>%
mutate(model = map(dataset, ~lm(z ~ x + y, data = eval(sym(.)))))
results2
# # A tibble: 2 x 2
# dataset model
# <chr> <list>
# 1 df1 <S3: lm>
# 2 df2 <S3: lm>
results2$model[[1]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 6.741e-14 2.000e+00 3.000e+00
results2$model[[2]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 9.662e-14 4.000e+00 5.000e+00
I'd recommend a slightly different route where you bind all the data and skip the eval and sym calls. This follows the "Many Models" chapter of R for Data Science.
purrr::lst creates a list of the data frames with the names of those variables as the list's names, and the .id argument to bind_rows uses those names to create a column marking data as coming from df1 or df2. Nesting creates a column data which is a list-column of data frames. Then you can build the models of each dataset. I used the tilde shortcut notation to build the anonymous function.
The result: you have a column model that is a list of models.
library(tidyverse)
library(rlang)
results <- lst(df1, df2) %>%
bind_rows(.id = "dataset") %>%
group_by(dataset) %>%
nest() %>%
mutate(model = map(data, ~lm(z ~ x + y, data = .)))
results$model[[1]]
#>
#> Call:
#> lm(formula = z ~ x + y, data = .)
#>
#> Coefficients:
#> (Intercept) x y
#> 6.741e-14 2.000e+00 3.000e+00
You also still have a column of that nested data. If you don't want it, you can drop it:
select(results, -data)
#> # A tibble: 2 x 2
#> dataset model
#> <chr> <list>
#> 1 df1 <lm>
#> 2 df2 <lm>

dplyr, do(), extracting parameters from model without losing grouping variable

A slightly changed example from the R help for do():
by_cyl <- group_by(mtcars, cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
coefficients<-models %>% do(data.frame(coef = coef(.$mod)[[1]]))
In the dataframe coefficients, there is the first coefficient of the linear model for each cyl group. My question is how can I produce a dataframe that contains not only a column with the coefficients, but also a column with the grouping variable.
===== Edit: I extend the example to try to make more clear my problem
Let's suppose that I want to extract the coefficients of the model and some prediction. I can do this:
by_cyl <- group_by(mtcars, cyl)
getpars <- function(df){
fit <- lm(mpg ~ disp, data = df)
data.frame(intercept=coef(fit)[1],slope=coef(fit)[2])
}
getprediction <- function(df){
fit <- lm(mpg ~ disp, data = df)
x <- df$disp
y <- predict(fit, data.frame(disp= x), type = "response")
data.frame(x,y)
}
pars <- by_cyl %>% do(getpars(.))
prediction <- by_cyl %>% do(getprediction(.))
The problem is that the code is redundant because I am fitting the model two times. My idea was to build a function that returns a list with all the information:
getAll <- function(df){
results<-list()
fit <- lm(mpg ~ disp, data = df)
x <- df$disp
y <- predict(fit, data.frame(disp= x), type = "response")
results$pars <- data.frame(intercept=coef(fit)[1],slope=coef(fit)[2])
results$prediction <- data.frame(x,y)
results
}
The problem is that I don't know how to use do() with the function getAll to obtain for example just a dataframe with the parameters (like the dataframe pars).
Like this?
coefficients <-models %>% do(data.frame(coef = coef(.$mod)[[1]], group = .[[1]]))
yielding
coef group
1 40.87196 4
2 19.08199 6
3 22.03280 8
Using the approach of Hadley Wickham in this video:
library(dplyr)
library(purrr)
library(broom)
fitmodel <- function(d) lm(mpg ~ disp, data = d)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(mod = map(data, fitmodel),
pars = map(mod, tidy),
pred = map(mod, augment))
pars <- by_cyl %>% unnest(pars)
prediction <- by_cyl %>% unnest(pred)

Resources