I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently compute each one separately with the following, but is it possible to calculate both at the same time?
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to write a function that works like summary (which returns quartiles and the mean) but calculates the mean and sd instead. Then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- tibble(date = seq(as.Date('1970-01-01'), by = 1, length.out = 4),
data = map(date, ~tibble(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
sd = sd(.x$metric)))) %>%
unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
mutate(stats = map(data,
~map_dfc(lst(mean, sd),
function(.fun) .fun(.x$metric)))) %>%
unnest(stats)
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map (note that the invoke family has been deprecated in recent purrr releases, so the approaches above are the safer choice for new code). If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
mutate(stats = map(data,
~invoke_map_dfc(lst(mean, sd),
list(list(.x$metric))))) %>%
unnest(stats)
All approaches return the same thing.
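The question's idea of a small helper that returns both statistics also fits this pattern. A minimal sketch, assuming a hypothetical helper named mean_sd (not part of any package) that returns a one-row tibble so that the result can be unnested into columns:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical helper: mean and sd of a numeric vector as a one-row tibble
mean_sd <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# Same toy data as in the answer above
set.seed(47)
df <- tibble(date = seq(as.Date("1970-01-01"), by = 1, length.out = 4),
             data = map(date, ~ tibble(metric = rnorm(10))))

# One list column of one-row tibbles, then unnest into mean and sd columns
out <- df %>%
  mutate(stats = map(data, ~ mean_sd(.x$metric))) %>%
  unnest(stats)

out
```

Because the helper returns named columns, adding another statistic only means adding one more column to the tibble inside mean_sd.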
I have a dataframe of 4 columns: Dataset, X, Y, Group.
The task is to fit a linear model to each of the five groups in the dataframe test_1 (the Group column contains 5 groups: a, b, c, d, e) and then compare the slopes with the dataframe test_2. For test_2 I have already fitted a model, as it has no group separation like test_1 does. For test_1 it was suggested that we use the function nest_by to compute group-wise linear models.
I have tried to fit a model with the function nest_by
Input:
model <- test_1 %>%
nest_by(Group) %>%
mutate(model = list(lm(y ~ x, data = test_1)))
model
Output:
A tibble: 5 x 3
# Rowwise: Group
Group data model
<fct> <list<tibble[,3]>> <list>
1 a [58 x 3] <lm>
2 b [35 x 3] <lm>
3 c [47 x 3] <lm>
4 d [44 x 3] <lm>
5 e [38 x 3] <lm>
I do not know how to proceed from here. I thought I could ungroup them and call summary(), but that would be similar to fitting a model separately for each group with filter() and creating 5 separate models.
Yes, you can proceed further by using tidy from the broom package, which is a better option than summary, and then calling unnest.
For example, for mtcars, for each cyl group, we can do the following,
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
mtcars_model <- mtcars %>%
nest(data = -cyl) %>%
mutate(
model = map(data, ~ lm(mpg ~ wt, data = .))
)
# now simply for each cyl, tidy the model output and unnest it
mtcars_model %>%
mutate(
tidy_summary = map(model, tidy)
) %>%
unnest(tidy_summary)
#> # A tibble: 6 × 8
#> cyl data model term estimate std.error statistic p.value
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Interce… 28.4 4.18 6.79 1.05e-3
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 -2.08 9.18e-2
#> 3 4 <tibble [11 × 10]> <lm> (Interce… 39.6 4.35 9.10 7.77e-6
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 -3.05 1.37e-2
#> 5 8 <tibble [14 × 10]> <lm> (Interce… 23.9 3.01 7.94 4.05e-6
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 -2.97 1.18e-2
Created on 2022-07-09 by the reprex package (v2.0.1)
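Since the question used nest_by rather than nest plus map, here is a hedged sketch of the same idea in that style, again on mtcars because test_1 is not available. The point it illustrates is that inside the rowwise data frame returned by nest_by, data refers to the current group's rows, so the model should be fitted with data = data rather than with the full data frame; the object name slopes is only for illustration.

```r
library(dplyr)
library(tidyr)
library(broom)

slopes <- mtcars %>%
  nest_by(cyl) %>%
  # rowwise context: `data` is the tibble of the current cyl group only
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%
  # tidy each group's model into a small coefficient table
  mutate(tidied = list(tidy(model))) %>%
  ungroup() %>%
  select(cyl, tidied) %>%
  unnest(tidied) %>%
  # keep only the slope term for the comparison
  filter(term == "wt")

slopes
```

The resulting table has one slope per group, which can then be compared directly with the slope of a model fitted on another data set.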
I am trying to transfer the problem from this post to a setting where different formulas are used in the lm() function in R.
Here a basic setup to reproduce the problem:
library(dplyr)
library(broom)
library(purrr)
library(tidyr)
# Generate data
set.seed(324)
dt <- data.frame(
t = sort(rep(c(1,2), 50)),
w1 = rnorm(100),
w2 = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)
# Generate formulas
fm <- map(1:2, ~as.formula(paste0("w", .x, "~ x", .x)))
Now I try to run different regressions for each group t with models specified in formulas object fm :
# Approach 1:
dt %>% group_by(t) %>%
do(fit = tidy(map(fm, ~lm(.x, data = .)))) %>%
unnest(fit)
# Approach 2
dt %>% nest(-t) %>%
mutate(
fit = map(fm, ~lm(.x, data = .)),
tfit = tidy(fit)
)
This produces an error indicating that the formula cannot be converted to a data.frame . What am I doing wrong?
This needs map2 instead of map: the data column produced by nest is itself a list of data frames, so we need to loop over the corresponding elements of the 'fm' list and of data in parallel, which is what map2 does.
library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
nest(data = -t) %>%
mutate(
fit = map2(fm, data, ~lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
> out
# A tibble: 2 × 4
t data fit tfit
<dbl> <list> <list> <list>
1 1 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
2 2 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
> bind_rows(out$tfit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
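If the group label should stay attached to the coefficients, one option is to unnest the tidied column instead of calling bind_rows on it. This is only a sketch that reuses the setup from the question; the object name coefs is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Data and formulas as in the question
set.seed(324)
dt <- data.frame(
  t  = sort(rep(c(1, 2), 50)),
  w1 = rnorm(100),
  w2 = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)
fm <- map(1:2, ~ as.formula(paste0("w", .x, " ~ x", .x)))

coefs <- dt %>%
  nest(data = -t) %>%
  # pair the i-th formula with the i-th nested data frame
  mutate(fit  = map2(fm, data, ~ lm(.x, data = .y)),
         tfit = map(fit, tidy)) %>%
  select(t, tfit) %>%
  # one row per coefficient, with t kept as an identifier
  unnest(tfit)

coefs
```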
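Alternatively, use imap_dfr, which passes each formula's position in 'fm' as .y (this works here only because the positions 1 and 2 coincide with the values of 't'):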
> imap_dfr(fm, ~ lm(.x, data = dt %>%
filter(t == .y)) %>%
tidy)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
If we want to have all the combinations of 'fm' for each level of 't', then use crossing
dt %>%
nest(data = -t) %>%
crossing(fm) %>%
mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
# A tibble: 4 × 5
t data fm fit tfit
<dbl> <list> <list> <list> <list>
1 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
2 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
3 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
4 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
I'm trying to run a data wrangling procedure inside a tibble using tools from {purrr} package. My method is to organize everything I need inside a tibble:
the input data inside a column
the function to apply upon the input data gets its own column too
My problem: how can I use purrr's mapping functions to say "take the function stored in column x and apply it over the data in column y"?
Below is a minimal example, based on mtcars and iris. I want to summarise each data set, in the same workflow: first subset columns, then do some aggregation. For the aggregation part, I preemptively set up 2 functions, one for each data.
summarise_iris()
summarise_mtcars()
Then I organize all I need inside a tibble (see trb object below).
The first part, the subsetting, works well. As can be seen in trb_1 below, dat_selected is a new column I mutated, which stores the output of the subset step.
However, the second part is not working. I want to take the function in column summarise_func and apply it over the data stored in column dat_selected. But it's not working. Why not? I purposely used map() because it maps only 1 input to the function.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
summarise_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
# to test: iris %>% summarise_iris()
summarise_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
# to test: mtcars %>% summarise_mtcars()
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), ~summarise_mtcars(.),
iris, c("Species", "Sepal.Length", "Sepal.Width"), ~summarise_iris(.)
)
trb_1 <-
trb %>%
mutate(dat_selected = map2(.x = original_data, .y = cols_to_select, .f = ~select(.x, all_of(.y))))
trb_1
#> # A tibble: 2 x 4
#> original_data cols_to_select summarise_func dat_selected
#> <list> <list> <list> <list>
#> 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]>
#> 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]>
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
#> Error: Problem with `mutate()` column `dat_summarised`.
#> i `dat_summarised = map(.x = dat_selected, .f = summarise_func)`.
#> x Index 1 must have length 1, not 2
Created on 2021-12-02 by the reprex package (v2.0.1.9000)
How can I achieve the desired output (see below) using the in-table method I'm trying to incorporate? I.e.:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
## to give the desired output that's equivalent to what we get if we run:
summar_mtcars <- mtcars %>% summarise_mtcars()
summar_iris <- iris %>% summarise_iris()
trb_1 %>%
tibble::add_column(dat_summarised = list(summar_mtcars, summar_iris))
## # A tibble: 2 x 5
## original_data cols_to_select summarise_func dat_selected dat_summarised
## <list> <list> <list> <list> <list>
## 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]> <tibble [2 x 2]>
## 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]> <tibble [3 x 3]>
UPDATE
I don't know if the following is in the right direction, but based on this answer, I thought to utilize rlang::as_function() such that:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = ~rlang::as_function(summarise_func)))
But it gives a different error now:
x Can't convert a list to function
I think you can take a simpler approach. First, we don't need to select columns, since summarize only keeps the grouping and summarized columns anyway. Let's create columns that define the columns to group by, the columns to summarize, and the functions to use.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
trb <-
tribble(~original_data, ~cols_to_group, ~cols_to_summarize, ~summarise_func,
mtcars, "am", "mpg", \(x) mean(x, na.rm = TRUE),
iris, "Species", ~starts_with("Sepal"), median
)
The \(x) mean(x, na.rm = TRUE) syntax is the new anonymous function syntax in R 4.1. If using an earlier version, just change to function(x) mean(...)
Now we can define a function (to eventually use in pmap) that accepts the data, the grouping columns, the columns to analyse, and the summarizing functions.
summarize_fun <- function(
.dat, .group_cols, .summ_cols, .funs
) {
.dat %>%
group_by(across(!!.group_cols)) %>%
summarize(across(!!.summ_cols, .funs))
}
And now we can just use these within mutate(pmap(...)) to get the result we want. I rely on !! for unquoting expressions because that works for passing in things like ~starts_with("Sepal"), which don't work with {{ }} to my knowledge.
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
.dat=original_data,
.group_cols=cols_to_group,
.summ_cols=cols_to_summarize,
.funs=summarise_func
),
summarize_fun
))
trb_final
#> # A tibble: 2 × 5
#> original_data cols_to_group cols_to_summarize summarise_func dat_summarized
#> <list> <chr> <list> <list> <list>
#> 1 <df [32 × 11]> am <chr [1]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> Species <formula> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg
#> <dbl> <dbl>
#> 1 0 17.1
#> 2 1 24.4
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5 3.4
#> 2 versicolor 5.9 2.8
#> 3 virginica 6.5 3
General functions
If instead, as in the comments, we just want to apply generic functions to summarize, then we can rely on pmap with 2 arguments: the data and the summarizing function.
summarize_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
summarize_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
Now we can just define our data frame to analyze using the original data and the two summarize_... functions we defined for the datasets.
trb <-
tribble(~original_data, ~summarize_func,
mtcars, summarize_mtcars,
iris, summarize_iris
)
And then just use pmap as before (can also use map2 of course).
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
original_data,
summarize_func
),
\(.d, .f) .f(.d)
))
trb_final
#> # A tibble: 2 × 3
#> original_data summarize_func dat_summarized
#> <list> <list> <list>
#> 1 <df [32 × 11]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg_median
#> <dbl> <dbl>
#> 1 0 17.3
#> 2 1 22.8
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 3.43
#> 2 versicolor 5.94 2.77
#> 3 virginica 6.59 2.97
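As mentioned, map2 works equally well here because there are exactly two inputs per row. A minimal sketch of that variant, with the same two summarizing functions redefined so the snippet runs on its own (trb_map2 is just an illustrative name):

```r
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)

summarize_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}

summarize_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}

trb <- tribble(
  ~original_data, ~summarize_func,
  mtcars,         summarize_mtcars,
  iris,           summarize_iris
)

# map2 walks the two list columns in parallel: .d is the data, .f the function
trb_map2 <- trb %>%
  mutate(dat_summarized = map2(original_data, summarize_func,
                               function(.d, .f) .f(.d)))

trb_map2$dat_summarized
```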
I would store the functions as strings:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), "summarise_mtcars",
iris, c("Species", "Sepal.Length", "Sepal.Width"), "summarise_iris"
)
Then you can simply use do.call in your map call. Or you convert your functions to strings on the fly with mutate:
trb_2 <- trb_1 %>%
mutate(summarise_func = as.character(summarise_func)) %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x, args = list(.dat = .y))))
trb_2
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <chr> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> summarise_mtcars <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> summarise_iris <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)
Update: Storing functions or rather function names as strings can be problematic if the underlying function changes (I get that now). The problem is getting the function into the tibble in the first place. What you do in the question is storing it as a formula. A better way is (imo) to store it in a list column:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), list(fun = summarise_mtcars),
iris, c("Species", "Sepal.Length", "Sepal.Width"), list(fun = summarise_iris)
)
With a slight adaptation, the original answer then works like this:
trb_3 <- trb_1 %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x$fun, args = list(.dat = .y))))
trb_3
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <list> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> <named list [1]> <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> <named list [1]> <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)
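Regarding the two errors in the question: map() received the whole summarise_func list column as .f, and purrr treats a list passed as .f as an extractor rather than as a function to call. In the UPDATE, rlang::as_function() was likewise handed the whole list instead of one formula. If the formulas are to be kept as they are, one possible approach (a sketch, not the only way) is to iterate over data and formulas in parallel and convert each formula individually:

```r
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)

summarise_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}

summarise_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}

# Simplified version of the question's table: data already subset, functions as formulas
trb <- tribble(
  ~dat_selected,                                     ~summarise_func,
  mtcars[c("am", "disp", "mpg")],                    ~summarise_mtcars(.),
  iris[c("Species", "Sepal.Length", "Sepal.Width")], ~summarise_iris(.)
)

res <- trb %>%
  mutate(dat_summarised = map2(
    dat_selected, summarise_func,
    # f is a single formula here, so it can be turned into a function and called on d
    function(d, f) rlang::as_function(f)(d)
  ))

res$dat_summarised
```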
I am trying to translate this basic for loop using the purrr package. The idea is to apply a function using data frame elements as parameters.
Creating the data frame to loop on using the mpg dataset from ggplot2:
param <- mpg %>% select(manufacturer, year) %>% distinct() %>% rename(man = manufacturer, y = year)
The function to apply:
fcn <- function(man, y) {
df <- mpg %>% filter(manufacturer == man & year == y)
mod <- lm(data = df, cty ~ hwy)
out <- summary(mod)
return(out)
}
And the loop to apply fcn for each man and y combination :
for (i in 1:nrow(param)) {
fcn(man = param$man[i],
y = param$y[i])
}
I am very new to purrr and struggle to understand how the general specification of purrr::map works.
Thanks a lot.
EDIT :
I used a very basic example here with fcn and param to understand how to include function parameters (from param) inside the map specification. As a result, I was not particularly interested in nesting beforehand, only in a plain translation of the loop using map that could work for any kind of function with multiple parameters.
If I have understood correctly, you want to model cty based on hwy for each combination of year and manufacturer.
library(tidyverse)
library(ggplot2)
library(purrr)
I have changed the definition of your function to fit to the map function settings.
fcn <- function(df){
mod <- lm(data = df, cty ~ hwy)
return(summary(mod))
}
The code below should produce the summary of the model for each year and manufacturer
mpg %>% group_by(manufacturer, year) %>%
nest() %>% mutate(model = map(data, fcn))
You can nest the data first within manufacturer and year, then map over it with a function. Below I use .x directly, which stands for each element of the data column being mapped over. You can also use tidy() from broom to put the summary() result into a data.frame:
library(purrr)
library(tidyr)
library(dplyr)
library(broom)
mpg = ggplot2::mpg
result = mpg %>%
select(manufacturer, year,cty,hwy) %>%
nest(data=c(cty, hwy)) %>%
mutate(
model=map(data,~lm(cty ~ hwy,data=.x)),
summary=map(model,~tidy(summary(.x)))
)
# A tibble: 30 x 5
manufacturer year data model summary
<chr> <int> <list> <list> <list>
1 audi 1999 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
2 audi 2008 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
3 chevrolet 2008 <tibble [12 × 2]> <lm> <tibble [2 × 5]>
4 chevrolet 1999 <tibble [7 × 2]> <lm> <tibble [2 × 5]>
5 dodge 1999 <tibble [16 × 2]> <lm> <tibble [2 × 5]>
6 dodge 2008 <tibble [21 × 2]> <lm> <tibble [2 × 5]>
If you want to look at the results of summary:
result %>% unnest(summary)
# A tibble: 55 x 9
manufacturer year data model term estimate std.error statistic p.value
<chr> <int> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 <tibbl… <lm> (Inte… -5.85 6.15 -0.951 3.73e-1
2 audi 1999 <tibbl… <lm> hwy 0.879 0.235 3.74 7.27e-3
3 audi 2008 <tibbl… <lm> (Inte… -0.5 3.68 -0.136 8.96e-1
4 audi 2008 <tibbl… <lm> hwy 0.695 0.137 5.08 1.43e-3
The following post helped me to achieve the desired outcome, general enough to be applied in many situations and ignoring nesting: https://stackoverflow.com/a/52309113/10580543.
Using pmap:
output <- param %>% pmap(~fcn(.x, .y))
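Because the columns of param (man and y) have the same names as the arguments of fcn, pmap can also match them by name when given the plain function. A small self-contained sketch of that variant, with fcn slightly condensed from the question:

```r
library(dplyr)
library(purrr)

mpg <- ggplot2::mpg

# One row per manufacturer/year combination, named after fcn's arguments
param <- mpg %>%
  distinct(manufacturer, year) %>%
  rename(man = manufacturer, y = year)

fcn <- function(man, y) {
  df <- mpg %>% filter(manufacturer == man, year == y)
  summary(lm(cty ~ hwy, data = df))
}

# Each row of param becomes one call fcn(man = ..., y = ...)
output <- pmap(param, fcn)

length(output)
```

Unlike the for loop in the question, which discards each result, pmap returns all the summaries in a list, one element per row of param.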
I'm trying to extract model info from model in a list column.
Using mtcars to illustrate my problem:
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic=AIC(model))
what I got is error message:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'logLik' applied to an object of class "list".
But when I do it this way, it works.
mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.)) %>%
mutate(aic=AIC(model))
Can anyone explain why? Why the second way works? I could not figure it out. In both cases, the list column 'model' contains model info . But there might be some differences... Thanks a lot.
Let's compare the differences between these two approaches. We can run your entire code except for the last AIC call and save the results to a and b.
a <- mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.)))
b <- mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.))
Now we can print the results in the console.
a
# A tibble: 3 x 3
cyl data model
<dbl> <list> <list>
1 6 <tibble [7 x 10]> <S3: lm>
2 4 <tibble [11 x 10]> <S3: lm>
3 8 <tibble [14 x 10]> <S3: lm>
b
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
cyl model
* <dbl> <list>
1 4 <S3: lm>
2 6 <S3: lm>
3 8 <S3: lm>
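Now we can see that dataframe b is grouped by row, while dataframe a is not. This is the key: in a, mutate passes the whole model list column to AIC at once, whereas in b each row is handled separately, so AIC receives a single lm object.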
To extract AIC in dataframe a, we can use the rowwise function to group dataframe by each row.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
rowwise() %>%
mutate(aic=AIC(model))
Source: local data frame [3 x 4]
Groups: <by row>
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
Or we can use the map_dbl function because we know each AIC is numeric.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic = map_dbl(model, AIC))
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
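For reference, here is the map_dbl variant written with the newer nest(data = -cyl) syntax, which avoids the warning that recent tidyr versions give for nest(-cyl). This is only a restatement of the answer's second option under that assumption; the object name res is illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

res <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(
    # one lm per cyl group, stored in a list column
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    # AIC is applied to each model in turn and collected as a numeric column
    aic = map_dbl(model, AIC)
  )

res
```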