Purrr map over multiple models to store results in dataframe - r

I have got the following example:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_dbl(~.x$coefficients[[2]])
[1] -5.647025 -2.780106 -2.192438
I also want to store the intercept, so I thought this might work:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients)
Error: Argument 1 must have names
However I get this error. What am I doing wrong and how can I store both coefficients in a dataframe?

The coefficients return a numeric vector, we can change it to dataframe and then use map_df.
library(tidyverse)
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients %>% t %>% as.data.frame)
# (Intercept) wt
#1 39.571 -5.6470
#2 28.409 -2.7801
#3 23.868 -2.1924

Related

Extracting the coefficients of each model by using a series of map functions

I am creating the following model:
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
Based on the results you get from that, I am trying to extract the coefficients by using a series of map functions.
The results should look like this:
4 6 8
-5.647025 -2.780106 -2.192438
I am pulling my hair out trying to figure this out. Any help is appreciated.
You can use map_dbl with the coef function to pick out the "wt" coefficients:
coefs <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df)) %>%
map_dbl(~coef(.)[["wt"]])
It looks like
coefs <- (mtcars
%>% split(.$cyl)
%>% map(lm, formula = mpg~wt)
%>% map_dbl(~coef(.)[["wt"]])
)
should do what you want? If you want to get more information, ending with map_dfr(broom::tidy) instead of the map_dbl will be helpful (you can use the .id= argument too, although this is less useful when the list doesn't have named arguments).
This is very similar to #henryn's answer, although the map syntax (using the named formula argument means that the data get substituted as the next argument implicitly, so you don't have to use an anonymous function function(df) lm(mpg ~ wt, data = df) or (with R >= 4.1.0) \(df) lm(mpg ~ wt, data = df): I think the usual way of doing this, ~ lm(mpg ~ wt, data = .) might get messed up by the tilde in the formula, but I'm nto sure ...
Does this work:
mtcars %>% split(.$cyl) %>% map(function(x) {
c = lm(mpg ~ wt, data = x)
c$coefficients[2]
}) %>% unlist
4.wt 6.wt 8.wt
-5.647025 -2.780106 -2.192438
1) This could be done in straight dplyr:
mtcars %>%
group_by(cyl) %>%
summarize(wt = coef(lm(mpg ~ wt))[[2]], .groups = "drop")
giving:
# A tibble: 3 x 2
cyl wt
<dbl> <dbl>
1 4 -5.65
2 6 -2.78
3 8 -2.19
2) This variation also works:
mtcars %>%
group_by(cyl) %>%
summarize(wt = cov(mpg, wt) / var(wt), .groups = "drop")
3) Also consider this -- omit the [2] to get both coefficients.
library(nlme)
coef(lmList(mpg ~ wt | cyl, mtcars))[2]
giving:
wt
4 -5.647025
6 -2.780106
8 -2.192438

Using nest and purrr::map outside of mutate

Lets say I want to split out mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
is there a way to do this on the by_cyl object without calling mutate?
Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update2
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))
Here's a solution with do and group_by, so if your data is already grouped as it should, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame, so it's a little hack.

Extract model summaries and store them as a new column

I'm new to the purrr paradigm and am struggling with it.
Following a few sources I have managed to get so far as to nest a data frame, run a linear model on the nested data, extract some coefficients from each lm, and generate a summary for each lm. The last thing I want to do is extract the "r.squared" from the summary (which I would have thought would be the simplest part of what I'm trying to achieve), but for whatever reason I can't get the syntax right.
Here's a MWE of what I have that works:
library(purrr)
library(dplyr)
library(tidyr)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary))
and here's my attempt to extract the r.squared which fails:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary),
rsq = map_dbl(sum, "r.squared"))
Error in eval(substitute(expr), envir, enclos) :
`x` must be a vector (not a closure)
This is superficially similar to the example given on the RStudio site:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
This works however I would like the r.squared values to sit in a new column (hence the mutate statement) and I'd like to understand why my code isn't working instead of working-around the problem.
EDIT:
Here's a working solution that I came to using the solutions below:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))
EDIT 2:
So, it actually turns out that the bug is from the inclusion of the tilde key in the summary = map(fit, ~summary) line. My guess is that the makes the object a function which is nest and not the object returned by the summary itself. Would love an authoritative answer on this if someone wants to chime in.
To be clear, this version of the original code works fine:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, summary),
r_sq = map_dbl(summary, "r.squared"))
To fit in your current pipe, you'd want to use unnest along with map and glance from the broom package.
library(tidyr)
library(dplyr)
library(broom)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .))) %>%
unnest(map(fit, glance))
You'll get more than just the r-squared, and from there you can use select to drop what you don't need.
If you want to keep the model summaries nested in list-columns:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance))
If you want to just extract a single value from a nested frame you just need to use map to the actual value (and not [[ or extract2 as I originally suggested, many thanks for finding that out).
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))
I think for what you'd like to achieve, you are better off using the glance() function from the broom package:
library(broom)
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(glance(lm(mpg ~ wt, data = .))) %>%
select(cyl, r.squared)
# cyl r.squared
# <dbl> <dbl>
#1 4 0.5086326
#2 6 0.4645102
#3 8 0.4229655
There must be a better way, here is my try with pipes:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared") %>%
list() %>%
as.data.frame(col.names = "r.squared") %>%
add_rownames(var = "cyl")
# # A tibble: 3 × 2
# cyl r.squared
# <chr> <dbl>
# 1 4 0.5086326
# 2 6 0.4645102
# 3 8 0.4229655
Note: You might get below a warning.
Warning message: Deprecated, use tibble::rownames_to_column() instead.

Residualize an observation after fitting a model in group_by

I'd like to find the residual of observations after fitting a model per group. I would have thought the code looks something like
library(dplyr)
df %>%
group_by(group) %>%
do(residual=resid(lm(y~x, data=.))) %>%
ungroup()
but this collapses df and leaves no trace of the x variable. What I want is a data frame return that is something like
group |y| x| residual
1) dplyr For purposes of example, this uses the iris data frame that comes with R. I noticed that the code below chokes on the formula if we remove the double quotes but it works OK if the formula is passed as a character string as shown:
iris %>%
group_by(Species) %>%
do(mutate(., resid = resid(lm("Sepal.Length ~ Sepal.Width", .)))) %>%
ungroup()
1a) This variation also works even without a character string formula:
iris %>%
group_by(Species) %>%
do(cbind(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
1b) and this variation also works:
iris %>%
group_by(Species) %>%
do(transform(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
2) Base R We could also consider not using dplyr and just base R like this:
f <- function(ix) resid(lm(Sepal.Length ~ Sepal.Width, iris, subset = ix))
transform(iris, resid = ave(seq_along(Species), Species, FUN = f))
3) data.table If speed is of concern you might want to try data.table which is often the fastest approach and is also quite compact here:
library(data.table)
dt <- as.data.table(iris)
dt[, resid := resid(lm(Sepal.Length ~ Sepal.Width, .SD)), by = Species]
3a) Interestingly this variation of (1) works with data.table input and an actual formula (not character string). Also, do() is not needed:
data.table(iris) %>%
group_by(Species) %>%
mutate(resid = resid(lm(Sepal.Length ~ Sepal.Width, .))) %>%
ungroup()
Note: I have added dplyr issue 1648.

regression output in dplyr

I would like to define similar functions as in the 'broom' package
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
works fine. But how do I defne custom functions like
myglance <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'
glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))
Here is my take on how it would work, basically the approach would be:
Extract the appropriate column from the dataframe (My solution is based on this answer, there must be a better way, and I hope someone will correct me!
run lapply on the result and construct the variables that you wanted in the myglance function you have above.
run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
# step 1
s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
# step 2
lapply(s, function(x) {
data.frame(r2 = summary(x)$adj.r.squared,
a = summary(x)$coefficients[1],
b = summary(x)$coefficients[2])
}) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
r2 a b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268

Resources