passing columns into purrr::walk2() - r

I am trying to save a series of models using purrr's walk() functions, and am getting the following error:
"Error in map2(.x, .y, .f, ...) : object 'model' not found"
library(dplyr)
library(tidyr)
library(purrr)
mt_models <-
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(
model = map(.x = data, .f = ~lm(mpg ~ wt, data = .x)),
file_name = paste("model", cyl, "cyl.rda", sep = "_")
)
mt_models %>% walk2(.x = model, .y = file_name, .f = ~save(.x, file = .y))
I can successfully save the models using this code below:
walk2(.x = mt_models$model, .y = mt_models$file_name, .f = ~save(.x, file = .y))
But I am trying to understand why model is not passing into walk2() in the first example.

You can use with to provide an environment in which to search for variables
mt_models %>%
with(walk2(.x = model, .y = file_name, .f = ~save(.x, file = .y)))

Outside the mutate/summarise and other tidyverse function, we need to do .$ to extract the column
library(dplyr)
library(purrr)
mt_models %>% {
walk2(.x = .$model, .y = .$file_name, .f = ~save(.x, file = .y))
}

Related

How to use purrr to pluck/keep some elements from a list of linear regression fit objects?

I have a list of linear regression fit objects. Let's create it in this example by:
c('hp', 'wt', 'disp') %>%
paste('mpg ~', .) %>%
map(as.formula) %>%
map(lm, data = mtcars)
What I would like is to keep just the residuals and fitted.values from each of the regression fit objects, within this same pipeline. I was trying to use the keep function, but it doesn't work:
c('hp', 'wt', 'disp') %>%
paste('mpg ~', .) %>%
map(as.formula) %>%
map(lm, data = mtcars) %>%
map(keep, names(.) %in% c("residuals", "fitted.values"))
Error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
How can I perform this action?
If a data frame is wanted as output then use the code below or if a list is wanted omit the bind_rows line.
library(dplyr)
library(purrr)
nms <- c('hp', 'wt', 'disp')
out <- nms %>%
set_names(x = map(paste('mpg ~', .), as.formula)) %>%
map(lm, data = mtcars) %>%
map(~ data.frame(fit = fitted(.), resid = resid(.))) %>%
bind_rows(.id = "id")
We can simplify this slightly using sapply as it will add names and use reformulate to generate the formula.
out <- nms %>%
sapply(reformulate, response = "mpg") %>%
map(lm, data = mtcars) %>%
map(~ data.frame(fit = fitted(.), resid = resid(.))) %>%
bind_rows(.id = "id")

Using map() in a split-apply-combine to run multiple regressions with regression weights

Say I have some data that looks like this:
N <- 200
X <- sample(letters[1:5],N, replace = T)
Y <- rnorm(N)
W <- abs(rnorm(N))
my_data <- tibble(X, Y, W)
I want to run an intercept-only regression on each subset of my data defined by X. To do so, I use nest(), mutate() and map() like so:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x)))
While this code works, when I try to incorporate regression weights, like this:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = W)))
I get the following error:
Error: Problem with `mutate()` input `fit`.
x missing or negative weights not allowed
ℹ Input `fit` is `map(data, ~lm(Y ~ 1, data = .x, weights = W))`.
ℹ The error occurred in group 1: X = "a".
Run `rlang::last_error()` to see where the error occurred.
Where am I going wrong?
(Disclaimer: I am new to the tidyverse so am likely doing something dumb)
Write it this way:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = .x$W)))
The point is that at that point W is not an existing column anymore.
It exists only inside data. With map you are looping over data which at that point is a list of dataframes.
Therefore to call W, you need to call it through $
Also, define W as:
W <- abs(rnorm(N))
Because you can't have negative weights.
Alternatively, you can do it as follow:
my_data %>%
group_by(X) %>%
summarise(fit = list(lm(Y ~ 1, weights = W)))
It will give you the same result. [almost: because X will be sorted]
Check this out:
fit1 <- my_data %>%
arrange(X) %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = .x$W))) %>%
pull(fit)
fit2 <- my_data %>%
group_by(X) %>%
summarise(fit = list(lm(Y ~ 1, weights = W))) %>%
pull(fit)
identical(map(fit1, coef), map(fit2, coef))
#> TRUE
If you just need the coefficients, you can do it this way:
my_data %>%
group_by(X) %>%
summarise(fit = coef(lm(Y ~ 1, weights = W)))

Writing dplyr function to pass to group_map

The problem in question would be to apply the function f to each group of a tibble. It is a simpler way to do this, but I would like to solve the problem using the group_map() function.
Data used: starwars of the dplyr package.
What I want is to get an average of the height variable for a grouped tibble considering the variables gender and species. I know the problem could be easily solved by doing:
starwars %>% group_by(gender, species) %>%
summarise(mean = mean(height, na.rm = TRUE))
However, my desire is to implement summarise(mean = mean(height, na.rm = TRUE)) in a function and send to group_map().
I tried to create the f() function that gets the data argument which is a tibble object with the previously defined groups. The second argument of the f() function would be ... so that I could pass the variables of interest from data to f().
f <- function(dados, ...){
dados %>% summarise(mean = mean(..., na.rm = TRUE))
}
starwars %>% group_by(gender, species) %>%
group_map(.tbl = ., .f = ~f(dados = .x), height)
Solutions:
func_1 <- function(dados, var, ...){
var_interesse <- enquo(var)
dots <- enquos(...)
# Could be attributed direct reference ...
dados %>% group_by(!!!dots) %>%
summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
starwars %>% func_1(var = height, gender, species)
or
func_2 <- function(dados, var){
var_interesse <- enquo(var)
#dots <- enquos(...)
dados %>% summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
agrupamento <- starwars %>% group_by(gender, species)
agrupamento %>%
group_map(.tbl = ., .f = ~func_2(dados = .x, var = height))

using quosures within formula inside an anonymous function

I am trying to use quosures to pass along variable names within a custom function for data processing and use in a formula, but my use of quosures in the formula is not correct. Is there a better way to unquote arguments within a formula?
library(dplyr)
library(broom)
library(purrr)
library(tidyr)
foo <- function(mydata, dv, iv, group_var) {
dv = enquo(dv)
iv = enquo(iv)
group_var = enquo(group_var)
mydata <- mydata %>%
group_by(!!group_var) %>%
nest()
mydata %>%
mutate(model = map(data,
~summary(lm(formula(substitute(dv ~ iv)), data = .))
)) %>%
unnest(model %>% map(tidy))
}
foo(mydata=mtcars, dv=mpg, iv=wt, group_var=cyl)
My code produces "Error in mutate_impl(.data, dots) : Evaluation error: object is not a matrix."
This is a working version of code I am trying to make into a function:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(model = map(data, ~summary(lm(mpg ~ wt, data = .)))) %>%
unnest(model %>% map(tidy))
You need to use base R nonstandard evaluation with functions like lm which are not "in the tidyverse" so to speak.
So you could change things to:
foo <- function(mydata, dv, iv, group_var) {
flma <- as.formula(paste(substitute(dv), "~", substitute(iv)))
group_var = enquo(group_var)
mydata <- mydata %>%
group_by(!!group_var) %>%
nest()
mydata %>%
mutate(model = map(data, ~summary(lm(flma, data = .)))) %>%
unnest(model %>% map(tidy))
}
foo(mtcars, mpg, wt, cyl)
That's fine if you know you are only doing simple regression. For more flexibility just pass the formula directly, as in:
foo2 <- function(mydata, flma, group_var) {
group_var = enquo(group_var)
mydata <- mydata %>%
group_by(!!group_var) %>%
nest()
mydata %>%
mutate(model = map(data, ~summary(lm(flma, data = .)))) %>%
unnest(model %>% map(tidy))
}
foo(mtcars, mpg ~ wt, cyl)

modelr: cross-validated model fitting for all variables in a dataset that match criterion

I have a dataset that contains a number of (factor) variables with the prefix "cat_".
library(tidyverse)
library(modelr)
library(lazyeval)
library(purrr)
# create the dataset
df_foo = wakefield::r_data_frame(
n = 100,
wakefield::r_series(wakefield::r_sample, j = 5, name = "cat"),
Y = wakefield::normal()
)
I want to be able to compute the pairwise, k-fold cross-validated regression R2 of each of these factor variables with the response variable using the tidy framework.
It is easy to compute this across folds for a few variables as below.
df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate_(
.dots = setNames(
list(
interp(
quote(
purrr::map_dbl(train, .f = function(train_data) {
summary(stats::lm(Y ~ cat_1, data = train_data))$r.squared
}))),
interp(
quote(
purrr::map_dbl(train, .f = function(train_data) {
summary(stats::lm(Y ~ cat_2, data = train_data))$r.squared
})))
),
nm = c("cat_1", "cat_2")
)
)
Questions:
How to generalize this to an arbitrary number of variables?
Why do I explicitly have to use the namespace accessors for the functions purrr::map_dbl and stats::lm (the logic above will not work if I remove the namespace accessors)?
Edit:
The following code gets the R2 for each of the variables, but it cannot be flattened out to a number of columns equal to the number of variables in the dataset.
make_r2_variable = function(var_name, train_data) {
summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}
make_r2 = function(train_data) {
summarise_at(
.tbl = data.frame(train_data),
.cols = vars(starts_with("cat_")),
.funs = funs(make_r2_variable(., train_data = train_data))
)
}
df_foo = df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate(
R2 = map(.x = train, .f = make_r2)
)
The solution that I think is as compact as it can get is this:
make_r2_variable = function(var_name, train_data) {
summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}
make_r2 = function(train_data) {
summarise_at(
.tbl = data.frame(train_data),
.cols = vars(starts_with("cat_")),
.funs = funs(make_r2_variable(., train_data = train_data))
)
}
df_foo = df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate(
R2 = map(.x = train, .f = make_r2)
) %>%
unnest(R2)
This is basically the solution that I had in the edit plus unnest. This basically mutates the S3: resample column using map and within that it uses mutate_at to cycle over the columns that match the criterion. Since that returns a list/1D data.frame, a call to unnest is required.

Resources