Linear Regression fit (augment) on a different dataset - r

I'm simply trying to calculate the prediction (fitted values) on a different dataset the regression model was built on using dplyr and the augment function. However I keep getting errors. Even without using dplyr, the augment function seems to only accept the dataset the model was built on. Any solution to resolve that? Below is one of my attempt. Thank you.
data1 <- head(mtcars,20)
model <- mtcars %>%
group_by(cyl) %>%
do(fit = lm(wt ~ mpg, .),
data = (.)) %>%
augment(fit, data1)

Use augment in mutate and use map to pass one model at a time in augment.
library(broom)
library(dplyr)
library(purrr)
mtcars %>%
group_by(cyl) %>%
do(fit = lm(wt ~ mpg, .),
data = (.)) %>%
ungroup() %>%
mutate(col = map(fit, augment, newdata = data1))
Also since do has been superseded you can fit the model in summarise.
mtcars %>%
group_by(cyl) %>%
summarise(fit = list(lm(wt ~ mpg)),
data = list(cur_data())) %>%
mutate(col = map(fit, augment, newdata = data1))
# cyl fit data col
# <dbl> <list> <list> <list>
#1 4 <lm> <tibble [11 × 11]> <tibble [20 × 14]>
#2 6 <lm> <tibble [7 × 11]> <tibble [20 × 14]>
#3 8 <lm> <tibble [14 × 11]> <tibble [20 × 14]>

Related

Is purrr::map too magical?

Applying functions in a loop fashion with purrr::map is super handy, but extrating objects by its indices seems "too magical" to me, for example, the r.squared from lm summary method. How does it work internally?
library(tidyverse)
data("mtcars")
mtcars %>%
nest(data = -c(vs)) %>%
mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
summary = map(model, summary),
r2 = map_dbl(summary, "r.squared"))
# # A tibble: 2 x 5
# vs data model summary r2
# <dbl> <list> <list> <list> <dbl>
# 1 0 <tibble [18 x 10]> <lm> <smmry.lm> 0.672
# 2 1 <tibble [14 x 10]> <lm> <smmry.lm> 0.726

Split into train/test a sample equally in all three nested lineal models

I have mapped lineal models, and i would like to check how sussesful are they, so I would like to split my data into train and test, but I am not sure about where to add that.
At the moment I have something like this:
library(tidyverse)
library(broom)
df<-iris %>% group_by(Species) %>% nest() %>%
mutate(model=map(data,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
final<-vector()
for(i in 1:length(df$model)){
the_model<-df$model[[i]]
value<-broom::glance(the_model)%>% select(adj.r.squared)
final[i]<-value[[1]]
}
I know I should split my data before the "for", and then I should check its sussess inside it.
But I dont know how to split them inside the nested groups, in order to have 75% of each group and not risk to have something like 80% of one group, 60% of another and 75% of the last one.
What should I do?
You can do :
library(tidyverse)
iris %>%
group_by(Species) %>%
mutate(class = sample(rep(c('train', 'test'), c(n() * 0.8, n() * 0.2)))) %>%
group_by(class, .add = TRUE) %>%
nest() %>%
pivot_wider(names_from = class, values_from = data) %>%
mutate(model= map(train,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
# Species train test model
# <fct> <list> <list> <list>
#1 setosa <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#2 versicolor <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#3 virginica <tibble [40 × 4]> <tibble [10 × 4]> <lm>

How to map a nested dataframe, and store multiple columns as output

I have a data structure as follows:
test <- data.frame(
id= rep(1:3, each=20),
count = rnorm(60, mean=5, sd=1),
covar1 = rnorm(60, mean=10, sd=3),
covar2 = rnorm(60, mean=95, sd=5),
covar3 = rnorm(60, mean=30, sd=5)
)
Then I nest it by id:
test <- test %>% nest(-id)
I want to apply a model to each data covar column, for a given id. Then I want to store the result in a separate column. I can do this as follows:
test <- test %>% mutate(covar1_lm = map(data, ~lm(count ~ covar1, data=.x)),
covar2_lm = map(data, ~lm(count ~ covar2, data=.x)),
covar3_lm = map(data, ~lm(count ~ covar3, data=.x)))
Which gives the output I want:
> test
# A tibble: 3 x 5
id data covar1_lm covar2_lm covar3_lm
<int> <list> <list> <list> <list>
1 1 <tibble [20 × 4]> <lm> <lm> <lm>
2 2 <tibble [20 × 4]> <lm> <lm> <lm>
3 3 <tibble [20 × 4]> <lm> <lm> <lm>
The problem is my real data has a large number of covar columns, and so I'd like to reduce the boilerplate code. So I'm guessing I need some concept of dynamic variable names, but I cant figure out how to map over a dynamic set of column names??
You can pivot_longer() the dataset first, so that there is one observation (row) for each covariate for each dataset. Then you perform the model within each covariate.
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
summarize(model = list(lm(count ~ value)))
You now have one observation for each combination of ID and covariate.
# A tibble: 9 x 3
# Groups: id [3]
id covariate model
<int> <chr> <list>
1 1 covar1 <lm>
2 1 covar2 <lm>
3 1 covar3 <lm>
4 2 covar1 <lm>
5 2 covar2 <lm>
6 2 covar3 <lm>
7 3 covar1 <lm>
8 3 covar2 <lm>
9 3 covar3 <lm>
If you want to turn that into the same kind of result, you could pipe this to pivot_wider(names_from = covariate, values_from = model). (But note that having one row for each model could make it easier to explore and visualize the models, especially if you tidy each with broom::tidy() and unnested them).
An alternative to the group_by()/summarize() above would be to nest them :
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
nest() %>%
mutate(model = map(data, ~ lm(count ~ value, data = .x)))

Extract residuals from models fit in purrr

I grouped my data and fit a model to each group and I would like to have the residuals for each group. I can see the residuals for each model with RStudio's viewer, but I cannot figure out how to extract them. Extracting one set of residuals can be done like diamond_mods[[3]][[1]][["residuals"]], but how do I use purrr to extract the set from every group (along with broom to end up with a nice tibble)?
Below is how far I've gotten:
library(tidyverse)
library(purrr)
library(broom)
fit_mod <- function(df) {
lm(price ~ poly(carat, 2, raw = TRUE), data = df)
}
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy)
#resid = map_dbl(model, "residuals") #this was my best try, it doesn't work
) %>%
unnest(tidied)
You were close - but you should use map() instead of map_dbl() as you need to return a list not a vector.
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy),
resid = map(model, residuals)
)
With the devel version of dplyr, we can do this in condense after grouping by 'cut'
library(dplyr)
library(ggplot2)
library(broom)
diamonds %>%
group_by(cut) %>%
condense(model = fit_mod(cur_data()),
tidied = tidy(model),
resid = model[["residuals"]])
# A tibble: 5 x 4
# Rowwise: cut
# cut model tidied resid
# <ord> <list> <list> <list>
#1 Fair <lm> <tibble [3 × 5]> <dbl [1,610]>
#2 Good <lm> <tibble [3 × 5]> <dbl [4,906]>
#3 Very Good <lm> <tibble [3 × 5]> <dbl [12,082]>
#4 Premium <lm> <tibble [3 × 5]> <dbl [13,791]>
#5 Ideal <lm> <tibble [3 × 5]> <dbl [21,551]>

Function to extract elements from a list column into a new column using purrr:::map

I want to extract elements from a list column and store them as a new column. I can do this outside of a function, but I can't get this to work within a function.
In the example code below, I want the line mutate(!!F_name := map(!!sum_name, ~.$statistic[[1]])) to extract the test statistic from the model summary column and store it in a new column. This gives an Evaluation error $ operator is invalid for atomic vectors.
aov_f1 <- function(df) {aov(value~ carb, data = df)}
aov_f2 <- function(df) {aov(value~ carb + gear, data = df)}
aov_sum_plus <- function(df, mod) {
mod <- enquo(mod)
sum_name <- paste0(quo_name(mod), "_sum")
F_name <-paste0(quo_name(mod), "_F")
df <- df %>%
mutate(!!sum_name := map(!! mod, broom::tidy)) %>%
mutate(!!F_name := map(!!sum_name, ~.$statistic[[1]]))
df
}
mtcars_n <- gather(mtcars, obs, value, mpg:qsec) %>%
group_by(obs) %>%
nest() %>%
mutate(aov1 = map(data, aov_f1)) %>%
mutate(aov2 = map(data, aov_f2)) %>%
aov_sum_plus(aov1) %>%
aov_sum_plus(aov2)
The equivalent code below gives the desired result.
aov_f1 <- function(df) {aov(value~ carb, data = df)}
aov_f2 <- function(df) {aov(value~ carb + gear, data = df)}
mtcars_n <- gather(mtcars, obs, value, mpg:qsec) %>%
group_by(obs) %>%
nest() %>%
mutate(aov1 = map(data, aov_f1)) %>%
mutate(aov2 = map(data, aov_f2)) %>%
mutate(aov1_sum = map(aov1, broom::tidy)) %>%
mutate(aov2_sum = map(aov2, broom::tidy)) %>%
mutate(aov1_sum_f = map_dbl(aov1_sum, ~.$statistic[[1]])) %>%
mutate(aov1_sum_p = map_dbl(aov1_sum, ~.$p.value[[1]])) %>%
mutate(aov2_sum_f = map_dbl(aov2_sum, ~.$statistic[[1]])) %>%
mutate(aov2_sum_p = map_dbl(aov2_sum, ~.$p.value[[1]]))
You are unquoting sum_name into a string. This won't work in map. You can check this by running:
debugfun <- function(df, mod) {
mod <- enquo(mod)
sum_name <- paste0(quo_name(mod), "_sum")
F_name <-paste0(quo_name(mod), "_F")
quo(df <- df %>%
mutate(!!sum_name := map(!! mod, broom::tidy),
!!F_name := map(!!sum_name, ~.$statistic[[1]])
)
)
}
gather(mtcars, obs, value, mpg:qsec) %>%
group_by(obs) %>%
nest() %>%
mutate(aov1 = map(data, aov_f1)) %>%
debugfun(aov1)
Giving:
<quosure>
expr: ^df <- df %>% mutate("aov1_sum" := map(^aov1, broom::tidy), "aov1_F" := map("aov1_sum", ~.$statistic[[1]]))
env: 0000015EF2AD5C88
This is a need trick! Using quo on your entire expression will translate it for you. Looking at the second map we see the problem with the strings.
You need to create a symbol (or name) from your strings. You can add them to your paste0 lines:
aov_sum_plus <- function(df, mod) {
mod <- enquo(mod)
sum_name <- sym(paste0(quo_name(mod), "_sum"))
F_name <- sym(paste0(quo_name(mod), "_F"))
mutate(
df,
!!sum_name := map(!! mod, broom::tidy),
!!F_name := map_dbl(!!sum_name, ~.$statistic[[1]])
)
}
gather(mtcars, obs, value, mpg:qsec) %>%
group_by(obs) %>%
nest() %>%
mutate(aov1 = map(data, aov_f1)) %>%
aov_sum_plus(aov1)
# A tibble: 7 x 5
obs data aov1 aov1_sum aov1_F
<chr> <list> <list> <list> <dbl>
1 mpg <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 13.1
2 cyl <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 11.5
3 disp <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 5.55
4 hp <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 38.5
5 drat <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 0.249
6 wt <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 6.71
7 qsec <tibble [32 x 5]> <S3: aov> <tibble [2 x 6]> 22.7

Resources