Extract model summaries and store them as a new column - r

I'm new to the purrr paradigm and am struggling with it.
Following a few sources I have managed to get so far as to nest a data frame, run a linear model on the nested data, extract some coefficients from each lm, and generate a summary for each lm. The last thing I want to do is extract the "r.squared" from the summary (which I would have thought would be the simplest part of what I'm trying to achieve), but for whatever reason I can't get the syntax right.
Here's a MWE of what I have that works:
library(purrr)
library(dplyr)
library(tidyr)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary))
and here's my attempt to extract the r.squared which fails:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary),
rsq = map_dbl(sum, "r.squared"))
Error in eval(substitute(expr), envir, enclos) :
`x` must be a vector (not a closure)
This is superficially similar to the example given on the RStudio site:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
This works however I would like the r.squared values to sit in a new column (hence the mutate statement) and I'd like to understand why my code isn't working instead of working-around the problem.
EDIT:
Here's a working solution that I came to using the solutions below:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))
EDIT 2:
So, it actually turns out that the bug is from the inclusion of the tilde key in the summary = map(fit, ~summary) line. My guess is that the makes the object a function which is nest and not the object returned by the summary itself. Would love an authoritative answer on this if someone wants to chime in.
To be clear, this version of the original code works fine:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, summary),
r_sq = map_dbl(summary, "r.squared"))

To fit in your current pipe, you'd want to use unnest along with map and glance from the broom package.
library(tidyr)
library(dplyr)
library(broom)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .))) %>%
unnest(map(fit, glance))
You'll get more than just the r-squared, and from there you can use select to drop what you don't need.
If you want to keep the model summaries nested in list-columns:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance))
If you want to just extract a single value from a nested frame you just need to use map to the actual value (and not [[ or extract2 as I originally suggested, many thanks for finding that out).
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))

I think for what you'd like to achieve, you are better off using the glance() function from the broom package:
library(broom)
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(glance(lm(mpg ~ wt, data = .))) %>%
select(cyl, r.squared)
# cyl r.squared
# <dbl> <dbl>
#1 4 0.5086326
#2 6 0.4645102
#3 8 0.4229655

There must be a better way, here is my try with pipes:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared") %>%
list() %>%
as.data.frame(col.names = "r.squared") %>%
add_rownames(var = "cyl")
# # A tibble: 3 × 2
# cyl r.squared
# <chr> <dbl>
# 1 4 0.5086326
# 2 6 0.4645102
# 3 8 0.4229655
Note: You might get below a warning.
Warning message: Deprecated, use tibble::rownames_to_column() instead.

Related

Extracting the coefficients of each model by using a series of map functions

I am creating the following model:
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
Based on the results you get from that, I am trying to extract the coefficients by using a series of map functions.
The results should look like this:
4 6 8
-5.647025 -2.780106 -2.192438
I am pulling my hair out trying to figure this out. Any help is appreciated.
You can use map_dbl with the coef function to pick out the "wt" coefficients:
coefs <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df)) %>%
map_dbl(~coef(.)[["wt"]])
It looks like
coefs <- (mtcars
%>% split(.$cyl)
%>% map(lm, formula = mpg~wt)
%>% map_dbl(~coef(.)[["wt"]])
)
should do what you want? If you want to get more information, ending with map_dfr(broom::tidy) instead of the map_dbl will be helpful (you can use the .id= argument too, although this is less useful when the list doesn't have named arguments).
This is very similar to #henryn's answer, although the map syntax (using the named formula argument means that the data get substituted as the next argument implicitly, so you don't have to use an anonymous function function(df) lm(mpg ~ wt, data = df) or (with R >= 4.1.0) \(df) lm(mpg ~ wt, data = df): I think the usual way of doing this, ~ lm(mpg ~ wt, data = .) might get messed up by the tilde in the formula, but I'm nto sure ...
Does this work:
mtcars %>% split(.$cyl) %>% map(function(x) {
c = lm(mpg ~ wt, data = x)
c$coefficients[2]
}) %>% unlist
4.wt 6.wt 8.wt
-5.647025 -2.780106 -2.192438
1) This could be done in straight dplyr:
mtcars %>%
group_by(cyl) %>%
summarize(wt = coef(lm(mpg ~ wt))[[2]], .groups = "drop")
giving:
# A tibble: 3 x 2
cyl wt
<dbl> <dbl>
1 4 -5.65
2 6 -2.78
3 8 -2.19
2) This variation also works:
mtcars %>%
group_by(cyl) %>%
summarize(wt = cov(mpg, wt) / var(wt), .groups = "drop")
3) Also consider this -- omit the [2] to get both coefficients.
library(nlme)
coef(lmList(mpg ~ wt | cyl, mtcars))[2]
giving:
wt
4 -5.647025
6 -2.780106
8 -2.192438

Purrr map over multiple models to store results in dataframe

I have got the following example:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_dbl(~.x$coefficients[[2]])
[1] -5.647025 -2.780106 -2.192438
I also want to store the intercept, so I thought this might work:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients)
Error: Argument 1 must have names
However I get this error. What am I doing wrong and how can I store both coefficients in a dataframe?
The coefficients return a numeric vector, we can change it to dataframe and then use map_df.
library(tidyverse)
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients %>% t %>% as.data.frame)
# (Intercept) wt
#1 39.571 -5.6470
#2 28.409 -2.7801
#3 23.868 -2.1924

grouped statistical test tidyverse

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Or another option is group_split into a list, then map through the list, elements create the 'pval' column after applying the model
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE

Residualize an observation after fitting a model in group_by

I'd like to find the residual of observations after fitting a model per group. I would have thought the code looks something like
library(dplyr)
df %>%
group_by(group) %>%
do(residual=resid(lm(y~x, data=.))) %>%
ungroup()
but this collapses df and leaves no trace of the x variable. What I want is a data frame return that is something like
group |y| x| residual
1) dplyr For purposes of example, this uses the iris data frame that comes with R. I noticed that the code below chokes on the formula if we remove the double quotes but it works OK if the formula is passed as a character string as shown:
iris %>%
group_by(Species) %>%
do(mutate(., resid = resid(lm("Sepal.Length ~ Sepal.Width", .)))) %>%
ungroup()
1a) This variation also works even without a character string formula:
iris %>%
group_by(Species) %>%
do(cbind(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
1b) and this variation also works:
iris %>%
group_by(Species) %>%
do(transform(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
2) Base R We could also consider not using dplyr and just base R like this:
f <- function(ix) resid(lm(Sepal.Length ~ Sepal.Width, iris, subset = ix))
transform(iris, resid = ave(seq_along(Species), Species, FUN = f))
3) data.table If speed is of concern you might want to try data.table which is often the fastest approach and is also quite compact here:
library(data.table)
dt <- as.data.table(iris)
dt[, resid := resid(lm(Sepal.Length ~ Sepal.Width, .SD)), by = Species]
3a) Interestingly this variation of (1) works with data.table input and an actual formula (not character string). Also, do() is not needed:
data.table(iris) %>%
group_by(Species) %>%
mutate(resid = resid(lm(Sepal.Length ~ Sepal.Width, .))) %>%
ungroup()
Note: I have added dplyr issue 1648.

regression output in dplyr

I would like to define similar functions as in the 'broom' package
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
works fine. But how do I defne custom functions like
myglance <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'
glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))
Here is my take on how it would work, basically the approach would be:
Extract the appropriate column from the dataframe (My solution is based on this answer, there must be a better way, and I hope someone will correct me!
run lapply on the result and construct the variables that you wanted in the myglance function you have above.
run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
# step 1
s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
# step 2
lapply(s, function(x) {
data.frame(r2 = summary(x)$adj.r.squared,
a = summary(x)$coefficients[1],
b = summary(x)$coefficients[2])
}) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
r2 a b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268

Resources