regression output in dplyr - r

I would like to define similar functions as in the 'broom' package
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
works fine. But how do I defne custom functions like
myglance <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'

glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))

Here is my take on how it would work, basically the approach would be:
Extract the appropriate column from the dataframe (My solution is based on this answer, there must be a better way, and I hope someone will correct me!
run lapply on the result and construct the variables that you wanted in the myglance function you have above.
run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
# step 1
s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
# step 2
lapply(s, function(x) {
data.frame(r2 = summary(x)$adj.r.squared,
a = summary(x)$coefficients[1],
b = summary(x)$coefficients[2])
}) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
r2 a b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268

Related

Purrr map over multiple models to store results in dataframe

I have got the following example:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_dbl(~.x$coefficients[[2]])
[1] -5.647025 -2.780106 -2.192438
I also want to store the intercept, so I thought this might work:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients)
Error: Argument 1 must have names
However I get this error. What am I doing wrong and how can I store both coefficients in a dataframe?
The coefficients return a numeric vector, we can change it to dataframe and then use map_df.
library(tidyverse)
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients %>% t %>% as.data.frame)
# (Intercept) wt
#1 39.571 -5.6470
#2 28.409 -2.7801
#3 23.868 -2.1924

Using a quasiquoted argument to a function within a purrr:map call

I'm having a little trouble figuring out quasiquotation, specifically I have a function which takes an argument which specifies which variable should go into a model which is then run within a purrr::map call.
I've been working from: https://dplyr.tidyverse.org/articles/programming.html
# libs
library(tidyverse)
library(broom)
# dummy data
df <- data.frame(
"a"=rep(c("alpha","beta"),50),
"b"=rnorm(100),
"value1"=rnorm(100),
"value2"=rnorm(100)
)
model <- function(var) {
var <- enquo(var)
df %>%
group_by(a) %>%
nest() %>%
mutate(model=map(data, ~ lm(b ~ (!! var),data=.)))
}
model(value1)
> Error in mutate_impl(.data, dots) : Evaluation error: invalid model formula.
putting the name in directly works as expected:
df %>%
group_by(a) %>%
nest() %>%
mutate(model=map(data, ~ lm(b ~ value1,data=.))) %>%
unnest(model %>% map(glance))
I can use !! var within a function:
modelX <- function(var,df=df) {
var <- enquo(var)
df %>%
select(!! var)
}
modelX(value1,df)
I'm assuming that this has something to do with the fact the !! var is referring to a value in the nested tibble data, I've been poking around with rlang::qq_show() but haven't been able to figure it out so far'
The enquo() will attempt to track the enviroment of the symbol you pass in, but you don't really want that included in the formula you are passing to lm. It would be better to capture that as a symbol rather than a quosure. Try this
model <- function(var) {
var <- ensym(var)
df %>%
group_by(a) %>%
nest() %>%
mutate(model=map(data, ~ lm(b ~ !!var, data=.)))
}
Worked for me with dplyr_0.7.6 and purrr_0.2.5

Extract model summaries and store them as a new column

I'm new to the purrr paradigm and am struggling with it.
Following a few sources I have managed to get so far as to nest a data frame, run a linear model on the nested data, extract some coefficients from each lm, and generate a summary for each lm. The last thing I want to do is extract the "r.squared" from the summary (which I would have thought would be the simplest part of what I'm trying to achieve), but for whatever reason I can't get the syntax right.
Here's a MWE of what I have that works:
library(purrr)
library(dplyr)
library(tidyr)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary))
and here's my attempt to extract the r.squared which fails:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
sum = map(fit, ~summary),
rsq = map_dbl(sum, "r.squared"))
Error in eval(substitute(expr), envir, enclos) :
`x` must be a vector (not a closure)
This is superficially similar to the example given on the RStudio site:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
This works however I would like the r.squared values to sit in a new column (hence the mutate statement) and I'd like to understand why my code isn't working instead of working-around the problem.
EDIT:
Here's a working solution that I came to using the solutions below:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))
EDIT 2:
So, it actually turns out that the bug is from the inclusion of the tilde key in the summary = map(fit, ~summary) line. My guess is that the makes the object a function which is nest and not the object returned by the summary itself. Would love an authoritative answer on this if someone wants to chime in.
To be clear, this version of the original code works fine:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, summary),
r_sq = map_dbl(summary, "r.squared"))
To fit in your current pipe, you'd want to use unnest along with map and glance from the broom package.
library(tidyr)
library(dplyr)
library(broom)
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .))) %>%
unnest(map(fit, glance))
You'll get more than just the r-squared, and from there you can use select to drop what you don't need.
If you want to keep the model summaries nested in list-columns:
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance))
If you want to just extract a single value from a nested frame you just need to use map to the actual value (and not [[ or extract2 as I originally suggested, many thanks for finding that out).
mtcars %>%
nest(-cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
summary = map(fit, glance),
r_sq = map_dbl(summary, "r.squared"))
I think for what you'd like to achieve, you are better off using the glance() function from the broom package:
library(broom)
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(glance(lm(mpg ~ wt, data = .))) %>%
select(cyl, r.squared)
# cyl r.squared
# <dbl> <dbl>
#1 4 0.5086326
#2 6 0.4645102
#3 8 0.4229655
There must be a better way, here is my try with pipes:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared") %>%
list() %>%
as.data.frame(col.names = "r.squared") %>%
add_rownames(var = "cyl")
# # A tibble: 3 × 2
# cyl r.squared
# <chr> <dbl>
# 1 4 0.5086326
# 2 6 0.4645102
# 3 8 0.4229655
Note: You might get below a warning.
Warning message: Deprecated, use tibble::rownames_to_column() instead.

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to #aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists all together:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
#aosmith answered my question but here is a simpler solution overall
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})

Residualize an observation after fitting a model in group_by

I'd like to find the residual of observations after fitting a model per group. I would have thought the code looks something like
library(dplyr)
df %>%
group_by(group) %>%
do(residual=resid(lm(y~x, data=.))) %>%
ungroup()
but this collapses df and leaves no trace of the x variable. What I want is a data frame return that is something like
group |y| x| residual
1) dplyr For purposes of example, this uses the iris data frame that comes with R. I noticed that the code below chokes on the formula if we remove the double quotes but it works OK if the formula is passed as a character string as shown:
iris %>%
group_by(Species) %>%
do(mutate(., resid = resid(lm("Sepal.Length ~ Sepal.Width", .)))) %>%
ungroup()
1a) This variation also works even without a character string formula:
iris %>%
group_by(Species) %>%
do(cbind(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
1b) and this variation also works:
iris %>%
group_by(Species) %>%
do(transform(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
2) Base R We could also consider not using dplyr and just base R like this:
f <- function(ix) resid(lm(Sepal.Length ~ Sepal.Width, iris, subset = ix))
transform(iris, resid = ave(seq_along(Species), Species, FUN = f))
3) data.table If speed is of concern you might want to try data.table which is often the fastest approach and is also quite compact here:
library(data.table)
dt <- as.data.table(iris)
dt[, resid := resid(lm(Sepal.Length ~ Sepal.Width, .SD)), by = Species]
3a) Interestingly this variation of (1) works with data.table input and an actual formula (not character string). Also, do() is not needed:
data.table(iris) %>%
group_by(Species) %>%
mutate(resid = resid(lm(Sepal.Length ~ Sepal.Width, .))) %>%
ungroup()
Note: I have added dplyr issue 1648.

Resources