run model for each line of model parameters (meta) data.frame - r

In the spirit of purrr, broom, and modelr, I am trying to create a "meta" data.frame in which each row specifies the dataset (d) and the model parameters (yvar, xvars, FEvars). For instance:
library(dplyr)
library(purrr)
iris2 <- iris %>% mutate(Sepal.Length = Sepal.Length^2)
meta <- data.frame(n = 1:4,
                   yvar = c('Sepal.Length', 'Sepal.Length', 'Sepal.Length', 'Sepal.Length'),
                   xvars = I(list(c('Sepal.Width'),
                                  c('Sepal.Width', 'Petal.Length'),
                                  c('Sepal.Width'),
                                  c('Sepal.Width', 'Petal.Length'))),
                   data = I(list(iris, iris, iris2, iris2)))
Now, I would like to run a model for each row of "meta", and then add a list column "model" holding the model output object. To run the model I use an auxiliary function that takes a dataset, a y variable, and a vector of x variables:
OLS_help <- function(d, y, xvars) {
  paste(y, paste(xvars, collapse = " + "), sep = " ~ ") %>%
    as.formula() %>%
    lm(d)
}
y <- 'Sepal.Length'
xvars <- c('Sepal.Width','Petal.Length')
OLS_help(iris,y,xvars)
How can I execute OLS_help for every row of meta and add its output as a list column in meta? I tried the following code, but it did not work:
meta %>% mutate(model = map2(d,yvar,xvars,OLS_help) )
Error: Can't convert a `AsIs` object to function
Call `rlang::last_error()` to see a backtrace
Note: the solution for the case with only a nested "data" list column (covered in Hadley's book) is:
by_country <- gapminder %>% group_by(country, continent) %>% nest()
country_model <- function(df) { lm(lifeExp ~ year, data = df) }
by_country <- by_country %>% mutate(model = map(data, country_model))

We can use pmap in the following way:
df <- meta %>%
  as_tibble() %>%
  mutate_if(is.factor, as.character) %>%
  mutate(fit = pmap(
    list(yvar, xvars, data),
    function(y, x, df) lm(reformulate(x, response = y), data = df)))
## A tibble: 4 x 5
# n yvar xvars data fit
# <int> <chr> <I<list>> <I<list>> <list>
#1 1 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#2 2 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
#3 3 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#4 4 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
Explanation: pmap iterates over multiple arguments simultaneously (similar to base R's Map); here we loop through the entries of columns yvar, xvars, and data in parallel, then use reformulate to construct the formula passed to lm. We store each lm fit object in column fit.
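As a minimal sketch of the two pieces described above, reformulate can be checked on its own and pmap compared with base R's Map. The lists yvars, xvars_list, and datasets and the helper fit_one are hypothetical names used only for illustration; they are not part of the original answer.

```r
library(purrr)

# reformulate() builds a formula from character vectors
f <- reformulate(c("Sepal.Width", "Petal.Length"), response = "Sepal.Length")
f_text <- deparse(f)

# Hypothetical parallel inputs, standing in for the columns of meta
yvars      <- c("Sepal.Length", "Sepal.Length")
xvars_list <- list("Sepal.Width", c("Sepal.Width", "Petal.Length"))
datasets   <- list(iris, iris)

# Illustrative helper: one fit per (y, x, data) triple
fit_one <- function(y, x, df) lm(reformulate(x, response = y), data = df)

# pmap() walks the three inputs in parallel; Map() is the base R counterpart
fits_pmap <- pmap(list(yvars, xvars_list, datasets), fit_one)
fits_Map  <- Map(fit_one, yvars, xvars_list, datasets)
```

Both calls should yield one lm object per triple, with the i-th fit built from the i-th element of each input.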

Related

R purrr:map on a grouped/nested tibble

I would like to apply a function across columns of a nested grouped tibble as in the example below.
library(tidyverse)
df <- swiss %>%
  group_by(Catholic > 20) %>%
  nest()
Which results in a tibble that looks like:
> df
# A tibble: 2 x 2
# Groups: Catholic > 20 [2]
`Catholic > 20` data
<lgl> <list>
1 FALSE <tibble [26 × 6]>
2 TRUE <tibble [21 × 6]>
Now I make some function to build a model
fit <- function(df, modL = NA){
  if (modL == 1) {fit <- lm(Fertility ~ Education, data = df)}
  if (modL == 2) {fit <- lm(Fertility ~ Education + Examination, data = df)}
  fit
}
Now I map that model to columns of the grouped data and make two new variables to store the model fits.
df <- df %>%
  mutate(model1 = map(data, fit, modL = 1)) %>%
  mutate(model2 = map(data, fit, modL = 2))
Which produces a tibble with two new columns that contain the model fits
> df
# A tibble: 2 x 4
# Groups: Catholic > 20 [2]
`Catholic > 20` data model1 model2
<lgl> <list> <list> <list>
1 FALSE <tibble [26 × 6]> <lm> <lm>
2 TRUE <tibble [21 × 6]> <lm> <lm>
What I want to achieve is a purrr-type map function that does the same thing as the following code.
anova(df$model1[[1]], df$model2[[1]])
anova(df$model1[[2]], df$model2[[2]])
I thought the following code would work, but it does not.
map(df[,3:4], anova)
Gurus, how do I map a function across columns of a nested and grouped dataset to give one result per row using the columns of that row as input?
Brant
df %>%
  mutate(anova = map2(model1, model2, ~ anova(.x, .y))) %>%
  mutate(pvalue = map_dbl(anova, ~ .$`Pr(>F)`[2]))
I think this is what you want? Can you clarify please! Second mutate will pull out the p-value of the anova for each pairwise comparison.
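For reference, a self-contained sketch of the same row-wise comparison, under a few assumptions: the grouping column is named high_catholic and the result column comparison (both hypothetical names), and the two models are fitted inline rather than through the fit helper from the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- swiss %>%
  mutate(high_catholic = Catholic > 20) %>%
  nest(data = -high_catholic) %>%
  mutate(
    model1 = map(data, ~ lm(Fertility ~ Education, data = .x)),
    model2 = map(data, ~ lm(Fertility ~ Education + Examination, data = .x)),
    # map2() pairs the i-th element of model1 with the i-th element of model2
    comparison = map2(model1, model2, ~ anova(.x, .y)),
    # second row of the anova table holds the test of model2 against model1
    pvalue = map_dbl(comparison, ~ .x$`Pr(>F)`[2])
  )
```

Each row of models then carries its own anova table, which is the per-row result the question asks for; map(df[,3:4], anova) instead iterates over whole columns.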

Running multiple Cox-PH models with tidyr

I have a regular Surv object from the survival package;
s <- Surv(sample(100:150, 5), sample(c(T, F), 5, replace = T))
And a data frame of multiple variables:
df <- data.frame(var1 = rnorm(5),
var2 = rnorm(5),
var3 = rnorm(5))
I need to fit a Cox-PH model for each variable separately. My code currently uses a loop as follows:
for (v in colnames(df)) {
coxph(s ~ df[[v]])
}
Of course, in reality there are thousands of variables and this process takes a bit. I wanted to follow the answer given here to try and do it all with tidyr but I'm kinda stumped because the predictand isn't a factor, it's a survival object, so I don't quite know how to handle it as part of a tibble.
Assuming your response is s for the survival model, you can use a nested dataframe similar to the answer you link to, then map the model to the different variables:
library(tidyverse)
library(survival)
df_nested <- df %>% pivot_longer(cols = var1:var3) %>% group_by(name) %>% nest()
surv_model <- function(df) {
  coxph(s ~ df$value)
}
df_nested <- df_nested %>% mutate(model = map(data, surv_model))
df_nested
# A tibble: 3 x 3
# Groups: name [3]
name data model
<chr> <list> <list>
1 var1 <tibble [5 x 1]> <coxph>
2 var2 <tibble [5 x 1]> <coxph>
3 var3 <tibble [5 x 1]> <coxph>
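The surv_model helper above reads s from the global environment and relies on each nested tibble keeping the original row order. A possible variation, sketched here with simulated data (the names surv_df, time, and status are assumptions, not from the question), keeps the survival time and event indicator in the same data frame so they are nested together with each predictor:

```r
library(survival)
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
n <- 50
# Simulated example data: response columns sit next to the predictors
surv_df <- tibble(
  time   = rexp(n, rate = 0.1),
  status = rbinom(n, 1, 0.7),
  var1   = rnorm(n),
  var2   = rnorm(n),
  var3   = rnorm(n)
)

cox_fits <- surv_df %>%
  # time and status are repeated for every predictor in long format
  pivot_longer(cols = var1:var3) %>%
  nest(data = -name) %>%
  mutate(model = map(data, ~ coxph(Surv(time, status) ~ value, data = .x)))
```

With thousands of predictors the long format repeats the response columns once per predictor, so this trades memory for not depending on a global object.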

rsample vfold_cv function not accepting .y parameter from purrr::map2

I'm trying to create nested cross-validations using the rsample package, and I use purrr::map2 to create them multiple times, with differing amount of folds as dictated by the v parameter. However, the vfold_cv function does not accept the v parameter, and instead I get this error: Error: v must be a single integer.
In the reprex below, I'm simulating the situation using the mtcars data, by creating a cross validation for each cylinder. Replacing .y with a number works, but I need the parameter to vary with each cylinder by using the n column.
library(dplyr)
library(purrr)
library(parsnip)
library(rsample)
library(tidyr)
data("mtcars")
nested <- mtcars %>%
  select(cyl, disp:gear) %>%
  group_by(cyl) %>%
  nest(data = disp:gear) %>%
  cbind(n = 2:4)
nested %>%
  group_by(cyl) %>%
  mutate(cv = map2(data, n,
                   ~nested_cv(.x,
                              inside = vfold_cv(v = 10, repeats = 3),
                              outside = vfold_cv(v = .y))))
Error: `v` must be a single integer.
The problem is the vfold_cv call inside nested_cv; you can reproduce it like this:
createNested = function(x, y) {
  nested_cv(x, inside = vfold_cv(v = 10, repeats = 3), outside = vfold_cv(v = y))
}
createNested(nested$data[[1]],3)
Error in vfold_splits(data = data, v = v, strata = strata, breaks = breaks) :
object 'y' not found
So it cannot see the y variable (like your .y) inside the function. So I wrote a function to explicitly pass the results of vfold_cv() for outside into nested_cv(), a few more lines of code but it's ok:
createNested = function(x, y) {
  outside_cv = vfold_cv(x, v = y)
  nested_cv(x, inside = vfold_cv(v = 10, repeats = 3), outside = outside_cv)
}
nested <- mtcars %>%
  select(cyl, disp:gear) %>%
  nest(data = disp:gear) %>%
  mutate(n = 2:4)
nested %>% mutate(cv = map2(data, n, .f = createNested))
# A tibble: 3 x 4
cyl data n cv
<dbl> <list> <int> <list>
1 6 <tibble [7 × 8]> 2 <tibble [2 × 3]>
2 4 <tibble [11 × 8]> 3 <tibble [3 × 3]>
3 8 <tibble [14 × 8]> 4 <tibble [4 × 3]>
Note: once you have nested the data, you don't need group_by().
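The "object 'y' not found" error is typical of a function that captures an argument as an unevaluated call and evaluates it away from the caller's frame. The toy below illustrates that general pattern in base R only; eval_elsewhere, broken, and fixed are hypothetical names, and whether nested_cv works this way internally is an assumption, not something confirmed here.

```r
# Toy function: a call passed as `expr` is captured and evaluated in baseenv(),
# where the caller's local variables do not exist; a plain object is used as is.
eval_elsewhere <- function(expr) {
  captured <- substitute(expr)
  if (is.call(captured)) {
    eval(captured, envir = baseenv())
  } else {
    expr
  }
}

# Passing a call that mentions a local variable fails
broken <- function(local_v) eval_elsewhere(local_v + 1)
msg <- tryCatch(broken(3), error = function(e) conditionMessage(e))

# Computing the value first and passing the finished object works
fixed <- function(local_v) {
  value <- local_v + 1
  eval_elsewhere(value)
}
result <- fixed(3)
```

This mirrors the workaround above: build the outside resampling object first, then hand the finished object to nested_cv.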

fit an `lm` model for every level of a factor

I am trying to write a function that iterates (or uses purrr::map()) through every level of a factor, and fits an lm() model for the subset of the data where the factor is equal to that level.
To make a simple reproducible example with mtcars, say that I'd like a different lm model for each value of mtcars$gear. I'll start by making it a factor, because my real problem involves iteration through a factor:
library(tidyverse)
mtcars <- mtcars %>%
mutate(factor_gear = factor(gear))
I'd like the function to fit every level of factor_gear. The levels are given by:
levels(mtcars$factor_gear)
i.e.
[1] "3" "4" "5"
So the output I would be looking for would be:
fit1 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="3"))
fit2 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="4"))
fit3 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="5"))
fits <- list(fit1, fit2, fit3)
I've made a start on the function, but wasn't able to get it to work.
I thought that the function should:
1. get every level of the factor into a vector
2. run an lm model for each level.
fit_each_level <- function(factor_variable) {
# trying to: 1. get every level of the factor into a vector
factor_levels <- levels(df_cars$factor_variable)
# trying to: 2. run an lm model for each level.
for i in factor_levels {
fit <- mtcars %>% filter(factor_variable==i [# every value of segment_levels]) %>%
lm(mpg ~ cyl, data = . )
}
}
fit_each_level(factor_gear)
If the function worked well, I'd ultimately be able to do it on another factor, e.g.:
mtcars <- mtcars %>%
mutate(factor_carb = factor(carb))
fit_each_level(factor_carb)
You can nest the dataframe and use map to apply lm for each factor_gear.
library(dplyr)
library(purrr)
mtcars %>%
  group_by(factor_gear) %>%
  tidyr::nest() %>%
  mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))
# factor_gear data model
# <fct> <list> <list>
#1 4 <tibble [12 × 11]> <lm>
#2 3 <tibble [15 × 11]> <lm>
#3 5 <tibble [5 × 11]> <lm>
In dplyr 1.0.0 and later you can use cur_data() to refer to the current group's data, which avoids the need for nest and map (dplyr 1.1.0 deprecates cur_data() in favor of pick()).
mtcars %>%
  group_by(factor_gear) %>%
  summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
Make sure you have dplyr 1.0.0 or later. Then you can use:
model_coefs <- function(formula, data) {
  coefs <- lm(formula, data)$coefficients
  data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
  dplyr::mutate(factor_gear = factor(gear)) %>%
  dplyr::nest_by(factor_gear) %>%
  dplyr::summarise(model_coefs(mpg ~ cyl, data)) %>%
  tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 3
# Groups: factor_gear [3]
factor_gear `(Intercept)` cyl
<fct> <dbl> <dbl>
1 3 29.8 -1.83
2 4 41.3 -3.59
3 5 40.6 -3.2
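The question asks for a reusable function that works on any factor. One possible sketch, assuming the factor is passed as a column name string and the data frame is passed explicitly (both choices differ from the asker's attempt), splits the data by level and maps lm over the pieces:

```r
library(purrr)

# Sketch: one lm fit per level of the column named by `factor_name`
fit_each_level <- function(data, factor_name) {
  # split() returns a list of data frames named by factor level
  pieces <- split(data, data[[factor_name]], drop = TRUE)
  map(pieces, ~ lm(mpg ~ cyl, data = .x))
}

cars <- transform(mtcars, factor_gear = factor(gear))
fits <- fit_each_level(cars, "factor_gear")
```

The result is a list named by level, so fits[["3"]] plays the role of fit1 in the question. The formula mpg ~ cyl is hard-coded here, as in the question.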

R Use map2 to iterate over columns within a list of data frames to fit statistical models

I'm trying to figure out a purrr approach to iteratively map over columns within a list of data frames to fit univariate GLMs. Using map2, the first element, .x, would be the three pred columns, and the second element, .y, would be the list of data frames (or vice-versa). map2 seems to be able to do this, but I recognize that I need to cross the .x and .y elements first, so I use tidyr::crossing first to do this. From here, I am unsure how to properly reference the columns to select within the data frames. Example code is below:
#Sample data
set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
pred2 = sample(40:80, size = 1000, replace = TRUE),
pred3 = sample(40:80, size = 1000, replace = TRUE),
resp = sample(100:200, size = 1000, replace = TRUE),
group = sample(c('a','b','c'), size = 1000, replace = TRUE))
#Split into list
test_ls <- test_df %>%
group_by(group) %>%
{df_groups <<- .} %>%
group_split()
#Obtain keys and name list elements
group_keys <- df_groups %>%
group_keys() %>%
pull()
test_ls <- test_ls %>% setNames(nm = group_keys)
#Cross all combinations of pred columns and list element names
preds <- c('pred1','pred2','pred3')
map_keys <- crossing(preds, group_keys)
#.y = list of data frames; iterate over data frames
#.x = three pred columns; iterate over columns
#Use purrr to fit glm of each .x columns within each of .y dfs
#Example structure - does not work
map2(.x, .y, .f = ~glm(resp ~ .x, data = .y))
#Workaround that does work
lapply(test_ls, function(x) {
x %>%
select(pred1, pred2, pred3) %>%
map(.f = ~glm(resp ~ .x, data = x))
})
There's something I'm missing, and I can't seem to figure it out. I've gotten a variety of errors with a few approaches, but I think it's coming down to not properly referencing the .x columns within the .y data frames. My approaches don't seem to recognize that .x is a column within .y. The workaround does the trick, but I'd prefer to avoid using both lapply and map.
My suggestion would be to NOT split the data before fitting models, since you are considering all possible combinations of variables that are already available directly in your original dataset. Instead, consider converting the original data frame to the "long" format, and then grouping by the necessary variables:
test_df %>%
  gather(pred, value, pred1:pred3) %>%
  nest(-c(group, pred)) %>%
  mutate(models = map(data, ~glm(resp ~ value, data = .x)))
# # A tibble: 9 x 4
# group pred data models
# <chr> <chr> <list> <list>
# 1 b pred1 <tibble [340 x 2]> <glm>
# 2 a pred1 <tibble [317 x 2]> <glm>
# 3 c pred1 <tibble [343 x 2]> <glm>
# 4 b pred2 <tibble [340 x 2]> <glm>
# 5 a pred2 <tibble [317 x 2]> <glm>
# 6 c pred2 <tibble [343 x 2]> <glm>
# 7 b pred3 <tibble [340 x 2]> <glm>
# 8 a pred3 <tibble [317 x 2]> <glm>
# 9 c pred3 <tibble [343 x 2]> <glm>
This substantially simplifies your code, and you can now split the result, if you still need those models in a list.
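In current tidyr, gather is superseded by pivot_longer, and nest expects the nested columns to be named explicitly. A sketch of the same approach in that syntax, assuming the test_df definition from the question (the names fits and model are illustrative):

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
                  pred2 = sample(40:80, size = 1000, replace = TRUE),
                  pred3 = sample(40:80, size = 1000, replace = TRUE),
                  resp  = sample(100:200, size = 1000, replace = TRUE),
                  group = sample(c('a', 'b', 'c'), size = 1000, replace = TRUE))

fits <- test_df %>%
  # one row per observation and predictor
  pivot_longer(pred1:pred3, names_to = "pred", values_to = "value") %>%
  # one nested tibble per (group, pred) combination
  nest(data = c(resp, value)) %>%
  mutate(model = map(data, ~ glm(resp ~ value, data = .x)))

# Optional: recover a list of models per group
fits_by_group <- split(fits$model, fits$group)
```

Each of the nine rows holds the univariate GLM for one predictor within one group, which is the crossing the question set up with tidyr::crossing.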
