I grouped my data and fit a model to each group and I would like to have the residuals for each group. I can see the residuals for each model with RStudio's viewer, but I cannot figure out how to extract them. Extracting one set of residuals can be done like diamond_mods[[3]][[1]][["residuals"]], but how do I use purrr to extract the set from every group (along with broom to end up with a nice tibble)?
Below is how far I've gotten:
library(tidyverse)
library(purrr)
library(broom)
fit_mod <- function(df) {
lm(price ~ poly(carat, 2, raw = TRUE), data = df)
}
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy)
#resid = map_dbl(model, "residuals") #this was my best try, it doesn't work
) %>%
unnest(tidied)
You were close - but you should use map() instead of map_dbl() as you need to return a list not a vector.
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy),
resid = map(model, residuals)
)
With the devel version of dplyr, we can do this in condense after grouping by 'cut'
library(dplyr)
library(ggplot2)
library(broom)
diamonds %>%
group_by(cut) %>%
condense(model = fit_mod(cur_data()),
tidied = tidy(model),
resid = model[["residuals"]])
# A tibble: 5 x 4
# Rowwise: cut
# cut model tidied resid
# <ord> <list> <list> <list>
#1 Fair <lm> <tibble [3 × 5]> <dbl [1,610]>
#2 Good <lm> <tibble [3 × 5]> <dbl [4,906]>
#3 Very Good <lm> <tibble [3 × 5]> <dbl [12,082]>
#4 Premium <lm> <tibble [3 × 5]> <dbl [13,791]>
#5 Ideal <lm> <tibble [3 × 5]> <dbl [21,551]>
Related
Applying functions in a loop fashion with purrr::map is super handy, but extrating objects by its indices seems "too magical" to me, for example, the r.squared from lm summary method. How does it work internally?
library(tidyverse)
data("mtcars")
mtcars %>%
nest(data = -c(vs)) %>%
mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
summary = map(model, summary),
r2 = map_dbl(summary, "r.squared"))
# # A tibble: 2 x 5
# vs data model summary r2
# <dbl> <list> <list> <list> <dbl>
# 1 0 <tibble [18 x 10]> <lm> <smmry.lm> 0.672
# 2 1 <tibble [14 x 10]> <lm> <smmry.lm> 0.726
I'm simply trying to calculate the prediction (fitted values) on a different dataset the regression model was built on using dplyr and the augment function. However I keep getting errors. Even without using dplyr, the augment function seems to only accept the dataset the model was built on. Any solution to resolve that? Below is one of my attempt. Thank you.
data1 <- head(mtcars,20)
model <- mtcars %>%
group_by(cyl) %>%
do(fit = lm(wt ~ mpg, .),
data = (.)) %>%
augment(fit, data1)
Use augment in mutate and use map to pass one model at a time in augment.
library(broom)
library(dplyr)
library(purrr)
mtcars %>%
group_by(cyl) %>%
do(fit = lm(wt ~ mpg, .),
data = (.)) %>%
ungroup() %>%
mutate(col = map(fit, augment, newdata = data1))
Also since do has been superseded you can fit the model in summarise.
mtcars %>%
group_by(cyl) %>%
summarise(fit = list(lm(wt ~ mpg)),
data = list(cur_data())) %>%
mutate(col = map(fit, augment, newdata = data1))
# cyl fit data col
# <dbl> <list> <list> <list>
#1 4 <lm> <tibble [11 × 11]> <tibble [20 × 14]>
#2 6 <lm> <tibble [7 × 11]> <tibble [20 × 14]>
#3 8 <lm> <tibble [14 × 11]> <tibble [20 × 14]>
I have mapped lineal models, and i would like to check how sussesful are they, so I would like to split my data into train and test, but I am not sure about where to add that.
At the moment I have something like this:
library(tidyverse)
library(broom)
df<-iris %>% group_by(Species) %>% nest() %>%
mutate(model=map(data,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
final<-vector()
for(i in 1:length(df$model)){
the_model<-df$model[[i]]
value<-broom::glance(the_model)%>% select(adj.r.squared)
final[i]<-value[[1]]
}
I know I should split my data before the "for", and then I should check its sussess inside it.
But I dont know how to split them inside the nested groups, in order to have 75% of each group and not risk to have something like 80% of one group, 60% of another and 75% of the last one.
What should I do?
You can do :
library(tidyverse)
iris %>%
group_by(Species) %>%
mutate(class = sample(rep(c('train', 'test'), c(n() * 0.8, n() * 0.2)))) %>%
group_by(class, .add = TRUE) %>%
nest() %>%
pivot_wider(names_from = class, values_from = data) %>%
mutate(model= map(train,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
# Species train test model
# <fct> <list> <list> <list>
#1 setosa <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#2 versicolor <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#3 virginica <tibble [40 × 4]> <tibble [10 × 4]> <lm>
I see from the purrr documentation that it should be possible to map a list of functions onto arguments using the map(list(fn1, fn2, fn3), exec, !!!args) syntax or something similar. How would this work for the broom functions tidy, glance, and augment, which usually must be supplemented with do? These are three functions I almost always like to execute at the same time on the same data and model. Of course I can do this explicitly:
# works but is repetitive
MY_MODEL <- hp ~ cyl
my_glance <- mtcars %>% do(glance(lm(data = ., formula = MY_MODEL)))
my_tidy <- mtcars %>% do(tidy(lm(data = ., formula = MY_MODEL)))
my_augment <- mtcars %>% do(augment(lm(data = ., formula = MY_MODEL)))
I suspect there is a better, more compact way to do this without having to retype ...lm(data = ., formula = MY_MODEL... every time, but I couldn't figure it out. I tried
# doesn't work
omnibroom <- function(df, model){
map(list(glance, tidy, augment),
exec,
~{(do(.x(lm(data = df, formula = model))))}
)
}
omnibroom(mtcars, MY_MODEL)
but I think I don't understand the !!! syntax appropriately.
Is there a compact idiom for calling these three broom functions on the same model and data?
It's possible to do this in two lines with simple re-factoring. No do or !!! necessary.
mdl <- mtcars %>% lm(data=., formula=MY_MODEL)
res1 <- map( list(glance, tidy, augment), exec, mdl )
If you really want to squish it down into a single line, use { to help guide pipe input to the correct place in lm:
res2 <- mtcars %>%
{map( list(glance, tidy, augment), exec, lm(data=., formula=MY_MODEL) )}
Verification:
identical( res1, list(my_glance, my_tidy, my_augment) ) # TRUE
identical( res1, res2 ) # TRUE
EDIT to address grouping
Arbitrary functions like lm don't respect data frame groups. While do is a popular approach to handle grouping in this case, I personally think that tidyr::nest() is more intuitive because it places all intermediates and results alongside the data:
## "Listify" broom functions: f -> map( ..., f )
omnibroom <- map( list(glance, tidy, augment), ~function(l) map(l, .x) ) %>%
set_names( c("glance","tidy","augment") )
result <- mtcars %>% nest( data = -gear ) %>%
mutate( model = map(data, lm, formula=MY_MODEL) ) %>%
mutate_at( "model", omnibroom )
# # A tibble: 3 x 6
# gear data model glance tidy augment
# <dbl> <list> <list> <list> <list> <list>
# 1 4 <tibble [12 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [12 × …
# 2 3 <tibble [15 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [15 × …
# 3 5 <tibble [5 × 10]> <lm> <tibble [1 × 11… <tibble [2 × … <tibble [5 × 9…
This format also naturally lends itself to unnesting, since broom functions produce data frames:
result %>% select( gear, tidy ) %>% unnest( tidy )
# # A tibble: 6 x 6
# gear term estimate std.error statistic p.value
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 4 (Intercept) -5.00 25.3 -0.198 0.847
# 2 4 cyl 20.2 5.30 3.82 0.00339
# 3 3 (Intercept) -47.5 56.1 -0.847 0.412
# 4 3 cyl 30.0 7.42 4.04 0.00142
# 5 5 (Intercept) -101. 51.9 -1.94 0.148
# 6 5 cyl 49.4 8.28 5.96 0.00944
Let's say I have two datasets for the same group of irises over two years:
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4] # let's make the 2008 data different
I would like to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the results, they make sense as a nicely-organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing to the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
Again, we end up with a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now what I would like to do is use the linear models from the 2008 data to predict results using the 2007 data. Thinking that the best way to do that would be to combine the two datasets (retaining the group structure), here is what happens when I try to merge the two nested tibbles:
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
As you can see below, the tibble no longer seems to have the same format as the individual tibbles above. Specifically, the organization is hard to discern (note that I am not including the full output in this chunk, but you get the idea).
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
And although I can still apparently use the models fitted to the 2008 data (as models.y) to the data from 2007 (as data.x):
iris.both.nested.pred <- iris.both.nested %>%
mutate( pred = map2(models.y,
data.x, predict))
The result is again not a nicely-organized tibble: (again not showing full output)
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is -- is this process working even though the tibbles become strangely organized after the merge? Or am I missing something? Thanks!
install.packages("pacman")
pacman::p_load(tidyverse)
iris_2007 <- iris %>% mutate(year = 2007)
iris_2008 <- iris %>% mutate(year = 2008)
iris_2008[1:4] <- 2 * iris_2008[1:4]
# combine data
iris_all_data <- iris_2007 %>%
bind_rows(iris_2008) %>%
group_by(Species) %>%
nest()
# model and predict
iris_predict <- iris_all_data %>%
mutate(
modelData = data %>% map(., ~ filter(., year == 2007)),
validationData = data %>% map(., ~ filter(., year == 2008)),
model = modelData %>% map(., ~ lm(Petal.Length ~ Petal.Width, data = .)),
prediction = map2(
.x = model, .y = validationData, ~ predict(object = .x, newdata = .y)
)
) %>%
select(Species, prediction) %>%
unnest(cols = c(prediction))
print(iris_predict)
I would double nest it first and apply the models later
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
joined<-bind_rows(
cbind(dset=rep("iris.2007",length(iris.2007$Species)),iris.2007)
,cbind(dset=rep("iris.2008",length(iris.2008$Species)),iris.2008)
)
# Double nesting
joined_nested<-
joined %>% group_by(dset) %>% nest(.key=data1) %>%
mutate(data1 = map(data1, ~.x %>% group_by(Species) %>% nest))
# Now apply the linear model call by group using the data.
joined_nested_models<-
joined_nested %>% mutate(data1 = map(data1, ~.x %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
))
joined_nested_models %>% unnest
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
Which is a Tidier version of what you get with inner_join