I am trying to move away from rowwise() for list columns as I have heard that the tidyverse team is in the process of axing it. However, I am not used to using the purrr functions so I feel like there must be a better way of doing the following:
I create a list-column containing a tibble for each species. I then want to go into each tibble and take the mean of certain variables. The first example uses map; the second is the rowwise solution, which I personally find cleaner.
Does anyone know a better way to use map in this situation?
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(mean_slength = map_dbl(data, ~mean(.$Sepal.Length, na.rm = TRUE)),
mean_swidth = map_dbl(data, ~mean(.$Sepal.Width, na.rm = TRUE))
)
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
iris %>%
group_by(Species) %>%
nest() %>%
rowwise() %>%
mutate(mean_slength = mean(data$Sepal.Length, na.rm = TRUE),
mean_swidth = mean(data$Sepal.Width, na.rm = TRUE))
#> Source: local data frame [3 x 4]
#> Groups: <by row>
#>
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
Created on 2018-12-26 by the reprex package (v0.2.1)
Instead of two map() calls, use a single one with summarise_at():
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(out = map(data, ~
.x %>%
summarise_at(vars(matches('Sepal')),
# funs() is deprecated since dplyr 0.8.0; a named list of lambdas does the same job
list(mean_s = ~ mean(., na.rm = TRUE))))) %>%
unnest(out)
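On a current dplyr (1.0.0 or later, which is an assumption about your setup), funs() is deprecated and summarise_at() is superseded by across(). A minimal sketch of the same single-map idea follows; the mean_ prefix set through .names is my own naming choice, not something from the answer above.

```r
library(dplyr)
library(tidyr)
library(purrr)

res <- iris %>%
  nest(data = -Species) %>%
  mutate(out = map(data, function(d) {
    # one summarise() per nested tibble, covering every Sepal column at once
    summarise(d, across(starts_with("Sepal"),
                        ~ mean(.x, na.rm = TRUE),
                        .names = "mean_{.col}"))
  })) %>%
  unnest(out)

res
```

Writing the outer function as function(d) rather than a second ~ lambda keeps the two levels of .x from being confused with each other.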
Related
I have a list of nested tibbles (which includes lists themselves etc, sort of like inception).
library(tidyverse)
# example function I want to apply
fun1 <- function(data) {
minimum <- min(data$Sepal.Length)
return(minimum)
}
# example of nested list
list_nested_tibbles <- list(
list1 = iris %>% group_by(Species) %>% nest(),
list2 = iris %>% group_by(Species) %>% nest()
)
# applying function to one of the nested tibbles within the list
list_nested_tibbles$list1 %>% mutate(minimum = map_dbl(.x = data, ~ fun1(.x)))
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species data minimum
#> <fct> <list> <dbl>
#> 1 setosa <tibble [50 x 4]> 4.3
#> 2 versicolor <tibble [50 x 4]> 4.9
#> 3 virginica <tibble [50 x 4]> 4.9
# function fails if I apply across whole list
list_nested_tibbles %>% mutate(minimum = map_dbl(.x = data, ~ fun1(.x)))
#> Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "list"
However, I want to apply the function to both list items at once. I'm guessing I need some sort of nested map statement?
Any help appreciated.
Cheers.
Here's how to do it with purrr functions:
map(list_nested_tibbles, ~ .x %>% mutate(minimum = map_dbl(data, ~ fun1(.x))))
$list1
# A tibble: 3 × 3
# Groups: Species [3]
Species data minimum
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 4.3
2 versicolor <tibble [50 × 4]> 4.9
3 virginica <tibble [50 × 4]> 4.9
$list2
# A tibble: 3 × 3
# Groups: Species [3]
Species data minimum
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 4.3
2 versicolor <tibble [50 × 4]> 4.9
3 virginica <tibble [50 × 4]> 4.9
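If a single tibble is acceptable instead of a list of tibbles, another option is to stack the list first and then map once. This is only a sketch: the source column name is made up for illustration, and the list is rebuilt here with nest(data = -Species) so that the example is self-contained and ungrouped.

```r
library(dplyr)
library(tidyr)
library(purrr)

fun1 <- function(data) min(data$Sepal.Length)

list_nested_tibbles <- list(
  list1 = iris %>% nest(data = -Species),
  list2 = iris %>% nest(data = -Species)
)

# .id records which list element each row came from ("source" is a hypothetical name)
combined <- bind_rows(list_nested_tibbles, .id = "source") %>%
  mutate(minimum = map_dbl(data, fun1))

combined
```

The nested map() above keeps the list structure; this variant trades it for one flat table with six rows.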
I'm trying to run a data wrangling procedure inside a tibble using tools from the {purrr} package. My method is to organize everything I need inside a tibble:
the input data inside a column
the function to apply upon the input data gets its own column too
My problem: how can I use purrr's mapping functions to say "take the function stored in column x and apply it over the data in column y"?
Below is a minimal example, based on mtcars and iris. I want to summarise each data set, in the same workflow: first subset columns, then do some aggregation. For the aggregation part, I preemptively set up 2 functions, one for each data.
summarise_iris()
summarise_mtcars()
Then I organize all I need inside a tibble (see trb object below).
The first part, the subsetting, works well. As can be seen in trb_1 below, dat_selected is a new column I mutated, which stores the output of the subset step.
However, the second part is not working. I want to take the function in column summarise_func and apply it over the data stored in column dat_selected. But it's not working. Why not? I purposely used map() because it maps only 1 input to the function.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
summarise_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
# to test: iris %>% summarise_iris()
summarise_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
# to test: mtcars %>% summarise_mtcars()
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), ~summarise_mtcars(.),
iris, c("Species", "Sepal.Length", "Sepal.Width"), ~summarise_iris(.)
)
trb_1 <-
trb %>%
mutate(dat_selected = map2(.x = original_data, .y = cols_to_select, .f = ~select(.x, all_of(.y))))
trb_1
#> # A tibble: 2 x 4
#> original_data cols_to_select summarise_func dat_selected
#> <list> <list> <list> <list>
#> 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]>
#> 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]>
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
#> Error: Problem with `mutate()` column `dat_summarised`.
#> i `dat_summarised = map(.x = dat_selected, .f = summarise_func)`.
#> x Index 1 must have length 1, not 2
Created on 2021-12-02 by the reprex package (v2.0.1.9000)
How can I achieve the desired output (see below) using the in-table method I'm trying to incorporate? I.e.:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
## to give the desired output that's equivalent to what we get if we run:
summar_mtcars <- mtcars %>% summarise_mtcars()
summar_iris <- iris %>% summarise_iris()
trb_1 %>%
tibble::add_column(dat_summarised = list(summar_mtcars, summar_iris))
## # A tibble: 2 x 5
## original_data cols_to_select summarise_func dat_selected dat_summarised
## <list> <list> <list> <list> <list>
## 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]> <tibble [2 x 2]>
## 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]> <tibble [3 x 3]>
UPDATE
I don't know if the following is in the right direction, but based on this answer, I thought to utilize rlang::as_function() such that:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = ~rlang::as_function(summarise_func)))
But it gives a different error now:
x Can't convert a list to function
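As far as I can tell, both errors have the same cause: .f (and then as_function()) is handed the whole summarise_func column, which is a list of formulas, and purrr treats a list passed as .f as an extractor rather than as a function. A sketch that converts one formula per row with map2() instead; the column name dat and the reduced two-column tibble are simplifications for illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tibble)

summarise_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}

summarise_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}

trb <- tribble(
  ~dat,   ~summarise_func,
  mtcars, ~ summarise_mtcars(.),
  iris,   ~ summarise_iris(.)
)

# map2() walks both columns in parallel, so as_function() sees one formula at a time
res <- trb %>%
  mutate(dat_summarised = map2(dat, summarise_func,
                               function(d, f) rlang::as_function(f)(d)))

res$dat_summarised
```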
I think you can take a simpler approach. First, we don't need to select columns beforehand; summarize only keeps the grouping and summarized columns anyway. Let's create columns that define the columns to group by, the columns to summarize, and the functions to use.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
trb <-
tribble(~original_data, ~cols_to_group, ~cols_to_summarize, ~summarise_func,
mtcars, "am", "mpg", \(x) mean(x, na.rm = TRUE),
iris, "Species", ~starts_with("Sepal"), median
)
The \(x) mean(x, na.rm = TRUE) form is the anonymous function shorthand introduced in R 4.1. On an earlier version, write function(x) mean(x, na.rm = TRUE) instead.
Now we can define a function (to eventually use in pmap) that accepts the data, the grouping columns, the columns to analyse, and the summarize functions.
summarize_fun <- function(
.dat, .group_cols, .summ_cols, .funs
) {
.dat %>%
group_by(across(!!.group_cols)) %>%
summarize(across(!!.summ_cols, .funs))
}
And now we can just use these within mutate(pmap(...)) to get the result we want. I rely on !! for unquoting expressions because that works for passing in things like ~starts_with("Sepal"), which don't work with {{ }} to my knowledge.
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
.dat=original_data,
.group_cols=cols_to_group,
.summ_cols=cols_to_summarize,
.funs=summarise_func
),
summarize_fun
))
trb_final
#> # A tibble: 2 × 5
#> original_data cols_to_group cols_to_summarize summarise_func dat_summarized
#> <list> <chr> <list> <list> <list>
#> 1 <df [32 × 11]> am <chr [1]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> Species <formula> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg
#> <dbl> <dbl>
#> 1 0 17.1
#> 2 1 24.4
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5 3.4
#> 2 versicolor 5.9 2.8
#> 3 virginica 6.5 3
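When the column specifications are always plain character vectors (no tidyselect helpers such as starts_with()), all_of() is an alternative to !!. The following is a sketch under that assumption; summarize_by is a hypothetical name, not part of the answer above.

```r
library(dplyr, warn.conflicts = FALSE)

summarize_by <- function(.dat, .group_cols, .summ_cols, .fun) {
  .dat %>%
    # all_of() turns a character vector into a strict column selection
    group_by(across(all_of(.group_cols))) %>%
    summarize(across(all_of(.summ_cols), .fun), .groups = "drop")
}

res <- summarize_by(mtcars, "am", c("mpg", "disp"),
                    function(x) mean(x, na.rm = TRUE))
res
```

This variant cannot take a formula like ~starts_with("Sepal"), which is the case the !! version is designed to handle.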
General functions
If instead, as in the comments, we just want to apply arbitrary summary functions, we can rely on pmap with 2 arguments: the data and the summarizing function.
summarize_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
summarize_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
Now we can just define our data frame to analyze using the original data and the two summarize_... functions we defined for the datasets.
trb <-
tribble(~original_data, ~summarize_func,
mtcars, summarize_mtcars,
iris, summarize_iris
)
And then just use pmap as before (can also use map2 of course).
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
original_data,
summarize_func
),
\(.d, .f) .f(.d)
))
trb_final
#> # A tibble: 2 × 3
#> original_data summarize_func dat_summarized
#> <list> <list> <list>
#> 1 <df [32 × 11]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg_median
#> <dbl> <dbl>
#> 1 0 17.3
#> 2 1 22.8
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 3.43
#> 2 versicolor 5.94 2.77
#> 3 virginica 6.59 2.97
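For reference, the map2 variant mentioned above might look like the sketch below. It repeats the two summarize functions so that it runs on its own, and it uses function(d, f) rather than the \(d, f) shorthand so it also works before R 4.1.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tibble)

summarize_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}

summarize_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}

trb <- tribble(
  ~original_data, ~summarize_func,
  mtcars,         summarize_mtcars,
  iris,           summarize_iris
)

# each row: call the function stored in summarize_func on the data stored next to it
trb_final <- trb %>%
  mutate(dat_summarized = map2(original_data, summarize_func,
                               function(d, f) f(d)))

trb_final$dat_summarized
```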
I would store the functions as strings:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), "summarise_mtcars",
iris, c("Species", "Sepal.Length", "Sepal.Width"), "summarise_iris"
)
Then you can simply use do.call in your map call. Alternatively, you can convert the function column to strings on the fly with mutate:
trb_2 <- trb_1 %>%
mutate(summarise_func = as.character(summarise_func)) %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x, args = list(.dat = .y))))
trb_2
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <chr> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> summarise_mtcars <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> summarise_iris <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)
Update: Storing functions or rather function names as strings can be problematic if the underlying function changes (I get that now). The problem is getting the function into the tibble in the first place. What you do in the question is storing it as a formula. A better way is (imo) to store it in a list column:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), list(fun = summarise_mtcars),
iris, c("Species", "Sepal.Length", "Sepal.Width"), list(fun = summarise_iris)
)
With a slight adaptation, this original answer then works like this:
trb_3 <- trb_1 %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x$fun, args = list(.dat = .y))))
trb_3
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <list> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> <named list [1]> <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> <named list [1]> <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)
I want to create a data frame where I summarize values like number of observations, mean and median, and I want to nest its ggplot histograms. For this, I will use the iris dataset.
This is my first attempt:
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2))
So it gives me:
# A tibble: 4 x 4
Vars obs mean median
<chr> <int> <dbl> <dbl>
1 Petal.Length 150 3.76 4.35
2 Petal.Width 150 1.2 1.3
3 Sepal.Length 150 5.84 5.8
4 Sepal.Width 150 3.06 3
This is the expected table:
# A tibble: 4 x 5
Vars obs mean median plot
<chr> <int> <dbl> <dbl> <list>
1 Petal.Length 150 3.76 4.35 <gg>
2 Petal.Width 150 1.2 1.3 <gg>
3 Sepal.Length 150 5.84 5.8 <gg>
4 Sepal.Width 150 3.06 3 <gg>
This is what I have tried:
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
nest() %>%
mutate(metrics = lapply(data, function(df) df %>% summarise(obs = n(), mean = mean(Values), median = median(Values))),
plots = lapply(data, function(df) df %>% ggplot(aes(Values)) + geom_histogram()))
Almost there, I see this:
# A tibble: 4 x 4
# Groups: Vars [4]
Vars data metrics plots
<chr> <list> <list> <list>
1 Sepal.Length <tibble [150 × 2]> <tibble [1 × 3]> <gg>
2 Sepal.Width <tibble [150 × 2]> <tibble [1 × 3]> <gg>
3 Petal.Length <tibble [150 × 2]> <tibble [1 × 3]> <gg>
4 Petal.Width <tibble [150 × 2]> <tibble [1 × 3]> <gg>
But I don't know how to see the expected tibble with the obs, mean, median and plots columns without the data and metrics columns. Any help will be greatly appreciated.
We may use cur_data() in summarise and wrap the plot in list() so that it is stored in a list column. (cur_data() is deprecated as of dplyr 1.1.0 in favour of pick().)
library(dplyr)
library(ggplot2)
library(tidyr)
out <- iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2),
plots = list(ggplot(cur_data(), aes(Values)) + geom_histogram()))
Output:
out
# A tibble: 4 × 5
Vars obs mean median plots
<chr> <int> <dbl> <dbl> <list>
1 Petal.Length 150 3.76 4.35 <gg>
2 Petal.Width 150 1.2 1.3 <gg>
3 Sepal.Length 150 5.84 5.8 <gg>
4 Sepal.Width 150 3.06 3 <gg>
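The nest-based attempt from the question can also be finished off, which avoids cur_data() entirely. This is a sketch only: it swaps lapply for purrr::map, sets bins = 30 explicitly (my choice), then unnests the one-row metrics tibbles and drops the data column.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

out <- iris %>%
  pivot_longer(-Species, names_to = "Vars", values_to = "Values") %>%
  nest(data = -Vars) %>%
  mutate(
    metrics = map(data, function(df) {
      summarise(df,
                obs = n(),
                mean = round(mean(Values), 2),
                median = round(median(Values), 2))
    }),
    plots = map(data, function(df) {
      ggplot(df, aes(Values)) + geom_histogram(bins = 30)
    })
  ) %>%
  unnest(metrics) %>%  # spreads obs, mean, median into regular columns
  select(-data)

out
```

Rows come out in the order the variables first appear after pivoting, not alphabetically as with group_by().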
I'm trying to mutate a new variable in a nested dataframe with an ifelse-condition. But the problem is that after implementing the ifelse-condition the nested dataframe turns into a list.
I want to show this problem with the iris dataset:
Here you can see the original nested format:
iris %>% nest(data = -Species)
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
And now I want to mutate a new variable in the nested dataframes:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 5]>
2 versicolor <tibble [50 x 5]>
3 virginica <tibble [50 x 5]>
This code works. The data-column is as desired in the tibble-format.
But if I insert the ifelse-condition now, the tibble-format is lost:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
ifelse(!is.na(x), x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), NA)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <list [200]>
2 versicolor <list [200]>
3 virginica <list [200]>
I want to keep the tibble-format even with the ifelse-condition.
Can anyone help me?
In the first step of the map() computation, i.e. data in setosa, the input x of your custom function is actually
x <- iris[1:50, 1:4]
Then you put x into ifelse()
ifelse(!is.na(x), # part 1
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), # part 2
NA) # part 3
The first part is !is.na(x), which returns 50x4=200 logical values. Hence, the second and third parts will be recycled to length 200. However, the second part, i.e.
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
is a tibble with 5 variables, which is also a list with length 5, so each variable in this tibble will be recycled 40 times and subsequently a list with length 200 will be created. That is why you get 3 lists of length 200.
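The recycling can be reproduced outside the pipeline. A small base R sketch, where transform() stands in for the mutate() call and the column name Sepal.Length.sq is illustrative:

```r
x <- iris[1:50, 1:4]                                  # the setosa block
y <- transform(x, Sepal.Length.sq = Sepal.Length^2)   # 5 columns, like the mutate() result

test <- !is.na(x)
dim(test)      # 50 rows by 4 columns
length(test)   # 200 logical values

# ifelse() shapes its result after `test`, so the 5-column data frame is
# recycled as a plain list until it has 200 elements
res <- ifelse(test, y, NA)
is.list(res)
length(res)
```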
In your case, ifelse() may not be applicable. You can adjust it to
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map(data, function(x) {
if(is.data.frame(x))
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
else
NULL
}))
# # A tibble: 4 x 2
# Species data
# <chr> <list>
# 1 setosa <tibble [50 × 5]>
# 2 versicolor <tibble [50 × 5]>
# 3 virginica <tibble [50 × 5]>
# 4 example <NULL>
Make sure that the condition in if() is a single logical value.
Thanks to @27ϕ9 for a neater version with map_if():
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map_if(data, is_tibble,
~ mutate(.x, `Sepal.Length^2` = Sepal.Length^2),
.else = NULL))
Let's say I have two datasets for the same group of irises over two years:
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4] # let's make the 2008 data different
I would like to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the results, they make sense as a nicely-organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing to the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
Again, we end up with a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now what I would like to do is use the linear models from the 2008 data to predict results using the 2007 data. Thinking that the best way to do that would be to combine the two datasets (retaining the group structure), here is what happens when I try to merge the two nested tibbles:
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
As you can see below, the tibble no longer seems to have the same format as the individual tibbles above. Specifically, the organization is hard to discern (note that I am not including the full output in this chunk, but you get the idea).
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
I can still apparently apply the models fitted to the 2008 data (as models.y) to the data from 2007 (as data.x):
iris.both.nested.pred <- iris.both.nested %>%
mutate( pred = map2(models.y,
data.x, predict))
The result is again not a nicely-organized tibble: (again not showing full output)
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is -- is this process working even though the tibbles become strangely organized after the merge? Or am I missing something? Thanks!
install.packages("pacman")
pacman::p_load(tidyverse)
iris_2007 <- iris %>% mutate(year = 2007)
iris_2008 <- iris %>% mutate(year = 2008)
iris_2008[1:4] <- 2 * iris_2008[1:4]
# combine data
iris_all_data <- iris_2007 %>%
bind_rows(iris_2008) %>%
group_by(Species) %>%
nest()
# model and predict
iris_predict <- iris_all_data %>%
mutate(
modelData = data %>% map(., ~ filter(., year == 2007)),
validationData = data %>% map(., ~ filter(., year == 2008)),
model = modelData %>% map(., ~ lm(Petal.Length ~ Petal.Width, data = .)),
prediction = map2(
.x = model, .y = validationData, ~ predict(object = .x, newdata = .y)
)
) %>%
select(Species, prediction) %>%
unnest(cols = c(prediction))
print(iris_predict)
I would double nest it first and apply the models later
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
joined<-bind_rows(
cbind(dset=rep("iris.2007",length(iris.2007$Species)),iris.2007)
,cbind(dset=rep("iris.2008",length(iris.2008$Species)),iris.2008)
)
# Double nesting
joined_nested<-
joined %>% group_by(dset) %>% nest(.key=data1) %>%
mutate(data1 = map(data1, ~.x %>% group_by(Species) %>% nest))
# Now apply the linear model call by group using the data.
joined_nested_models<-
joined_nested %>% mutate(data1 = map(data1, ~.x %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
))
joined_nested_models %>% unnest
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
This is a tidier version of what you get with inner_join.
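For comparison, a sketch of the inner_join route applied to the question's goal (2008 models predicting on 2007 data). Unlike base merge(), dplyr joins return a tibble, so the list columns keep printing compactly. The helper fit_by_species and the _2007/_2008 suffixes are names chosen here for illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

iris_2007 <- iris
iris_2008 <- iris
iris_2008[1:4] <- 2 * iris_2008[1:4]

# nest by species and fit one linear model per nested tibble
fit_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(models = map(data, function(d) lm(Petal.Length ~ Petal.Width, data = d)))
}

both <- inner_join(fit_by_species(iris_2007), fit_by_species(iris_2008),
                   by = "Species", suffix = c("_2007", "_2008"))

# predict 2007 observations with the models fitted on 2008 data
pred <- both %>%
  mutate(pred = map2(models_2008, data_2007,
                     function(m, d) predict(m, newdata = d)))

pred
```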