tidyr unnest, prefix column names with nested name during unnesting - r

When running unnest on a data.frame, is there a way to add the name of the nested item to the individual columns it contains (either as a suffix or a prefix)? Or does the renaming have to be done manually via rename?
This is particularly relevant when unnesting multiple groups that contain columns with the same names.
In the example below the base aggregate command does this well (e.g. Petal.Length.mn), but I couldn't find an option to get unnest to do the same thing.
I'm using nest with purrr::map as I want the flexibility to mix functions, eg. calculate means and sd on a couple of variables and also run a t test to look at differences between them.
library(dplyr, warn.conflicts = FALSE)
msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))
aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
data = iris, FUN = msd_c)
#> Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#> 1 setosa 1.4620000 0.1736640 0.2460000 0.1053856
#> 2 versicolor 4.2600000 0.4699110 1.3260000 0.1977527
#> 3 virginica 5.5520000 0.5518947 2.0260000 0.2746501
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_repair = tidyr::tidyr_legacy)
#> # A tibble: 3 x 13
#> # Groups: Species [3]
#> Species mn sd mn1 sd1 estimate statistic p.value parameter conf.low
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 setosa 1.46 0.174 0.246 0.105 0.332 2.44 1.86e- 2 48 0.0587
#> 2 versic~ 4.26 0.470 1.33 0.198 0.787 8.83 1.27e-11 48 0.651
#> 3 virgin~ 5.55 0.552 2.03 0.275 0.322 2.36 2.25e- 2 48 0.0481
#> # ... with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Created on 2020-05-20 by the reprex package (v0.3.0)

The answer to this was somewhat obvious: use the names_sep option rather than the names_repair option. As quoted from the nest help page under names_sep:
If a string, the inner and outer names will be used together. In
nest(), the names of the new outer columns will be formed by pasting
together the outer and the inner column names, separated by names_sep.
In unnest(), the new inner names will have the outer names (+
names_sep) automatically stripped. This makes names_sep roughly
symmetric between nesting and unnesting.
library(dplyr, warn.conflicts = FALSE)
msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_sep = ".")
#> # A tibble: 3 x 13
#> # Groups: Species [3]
#> Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.46 0.174 0.246 0.105
#> 2 versic~ 4.26 0.470 1.33 0.198
#> 3 virgin~ 5.55 0.552 2.03 0.275
#> # ... with 8 more variables: Correlation.estimate <dbl>,
#> # Correlation.statistic <dbl>, Correlation.p.value <dbl>,
#> # Correlation.parameter <int>, Correlation.conf.low <dbl>,
#> # Correlation.conf.high <dbl>, Correlation.method <chr>,
#> # Correlation.alternative <chr>
Created on 2020-06-10 by the reprex package (v0.3.0)
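To see the effect of names_sep in isolation, here is a minimal sketch on a hand-built nested tibble. The column names (group, length, width) and the values are made up for illustration; only unnest() and its names_sep argument come from the answer above.

```r
library(tibble)
library(tidyr)

# Toy nested data (hypothetical values): two list-columns whose inner
# tibbles share the column names mn and sd.
nested <- tibble(
  group  = c("g1", "g2"),
  length = list(tibble(mn = 1.5, sd = 0.2), tibble(mn = 4.3, sd = 0.5)),
  width  = list(tibble(mn = 0.2, sd = 0.1), tibble(mn = 1.3, sd = 0.2))
)

# names_sep pastes the outer (list-column) name onto each inner name,
# so the shared inner names no longer collide.
flat <- unnest(nested, c(length, width), names_sep = ".")
names(flat)
```

The resulting names should be group, length.mn, length.sd, width.mn and width.sd, mirroring the Petal.Length.mn pattern in the output above.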

To apply multiple functions to multiple columns I would use summarise_at/mutate_at instead of nesting and unnesting data.
For example, in this case we can do:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise_at(vars(Petal.Length:Petal.Width), list(mn = mean, sd = sd))
# Species Petal.Length_mn Petal.Width_mn Petal.Length_sd Petal.Width_sd
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 1.46 0.246 0.174 0.105
#2 versicolor 4.26 1.33 0.470 0.198
#3 virginica 5.55 2.03 0.552 0.275
This automatically appends the function name as a suffix to the name of each column the function is applied to. It is also the dplyr equivalent of the aggregate call you tried.
Also note that summarise_at is superseded by across() as of dplyr 1.0.0.
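A rough across() equivalent of the summarise_at call above might look like the following. This is a sketch assuming dplyr 1.0.0 or later; the .names glue spec is my own choice, picked so the column names match the aggregate output (Petal.Length.mn and so on) rather than the default underscore form.

```r
library(dplyr)

# Same summary as the summarise_at() call, written with across().
# .names controls the output names: {.col} is the input column,
# {.fn} is the name given to the function in the list.
res <- iris %>%
  group_by(Species) %>%
  summarise(across(Petal.Length:Petal.Width,
                   list(mn = mean, sd = sd),
                   .names = "{.col}.{.fn}"))

names(res)
```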

You can use setNames like below. It is a little bit wordy, but since it seems you plan to specify a function for each column individually, this may be of interest.
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.x$Petal.Length) %>%
setNames(paste0("Petal.Length.", names(.)))),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width) %>%
setNames(paste0("Petal.Width.", names(.)))),
Ratio = purrr::map(data, ~ msd_df(.$Petal.Length/.$Petal.Width) %>%
setNames(paste0("Ratio.", names(.))))
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))
# A tibble: 3 x 7
# Groups: Species [3]
Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd Ratio.mn Ratio.sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 1.46 0.174 0.246 0.105 6.91 2.85
2 versicolor 4.26 0.470 1.33 0.198 3.24 0.312
3 virginica 5.55 0.552 2.03 0.275 2.78 0.407
Or modify your function so that it takes the column name as an argument and prefixes the output names itself, like this.
msd_df_name <- function(x, name){
bind_rows(c(mn = mean(x), sd = sd(x))) %>%
setNames(paste0(name, ".", names(.)))
}
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df_name(.x$Petal.Length, "Petal.Length")),
Petal.Width = purrr::map(data, ~ msd_df_name(.$Petal.Width, "Petal.Width")),
Ratio = purrr::map(data, ~ msd_df_name(.$Petal.Length/.$Petal.Width, "Ratio"))
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))

Related

Compute descriptive statistics (mean, sd, n) by a group column across multiple columns using lapply and dplyr resulting in NA values

I have a Group column indicating group membership as well as many other columns containing numerical values. For each column containing numerical values, I want to get the Mean, Standard Deviation, and sample size for each subgroup.
Using the inbuilt iris dataset as an example, I have come up with the following code:
lapply(names(iris)[1:4], function(x) {
iris %>%
group_by(Species) %>%
summarise(mean = mean(noquote(x), na.rm = T),
sd = sd(noquote(x), na.rm = T),
n = n())
})
However, the mean and standard deviation for each numerical column by group are NA. R produces many warning messages such as:
In mean.default(noquote(x), na.rm = T) : argument is not numeric or logical: returning NA
In is.data.frame(x) : NAs introduced by coercion
However, I have ensured that my numerical columns have a numeric data type already.
I have also attempted using the across function, but the results are clearly wrong:
iris %>%
group_by(Species) %>%
summarise(across(1:4, ~ mean(., na.rm = T),
sd(., na.rm = T),
n()))
The number and position of NAs differ across the numerical columns in my actual dataset. The solution has to compute the mean/SD/sample size for each group using all the non-NA values for that particular column. Thanks!
Using across is the correct approach; you just need to fix the syntax.
library(dplyr)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise(across(1:4, list(mean = ~mean(., na.rm = T),
sd = ~sd(., na.rm = T),
Count = ~n())))
# A tibble: 3 x 13
# Species Sepal.Length_mean Sepal.Length_sd Sepal.Length_Count Sepal.Width_mean
# <fct> <dbl> <dbl> <int> <dbl>
#1 setosa 5.01 0.352 50 3.43
#2 versicolor 5.94 0.516 50 2.77
#3 virginica 6.59 0.636 50 2.97
# … with 8 more variables: Sepal.Width_sd <dbl>, Sepal.Width_Count <int>,
# Petal.Length_mean <dbl>, Petal.Length_sd <dbl>, Petal.Length_Count <int>,
# Petal.Width_mean <dbl>, Petal.Width_sd <dbl>, Petal.Width_Count <int>
Maybe adding pivot_longer would make the output better -
iris %>%
group_by(Species) %>%
summarise(across(1:4, list(mean = ~mean(., na.rm = T),
sd = ~sd(., na.rm = T),
Count = ~n()))) %>%
pivot_longer(cols = -Species,
names_to = c('name', '.value'),
names_sep = '_')
# Species name mean sd Count
# <fct> <chr> <dbl> <dbl> <int>
# 1 setosa Sepal.Length 5.01 0.352 50
# 2 setosa Sepal.Width 3.43 0.379 50
# 3 setosa Petal.Length 1.46 0.174 50
# 4 setosa Petal.Width 0.246 0.105 50
# 5 versicolor Sepal.Length 5.94 0.516 50
# 6 versicolor Sepal.Width 2.77 0.314 50
# 7 versicolor Petal.Length 4.26 0.470 50
# 8 versicolor Petal.Width 1.33 0.198 50
# 9 virginica Sepal.Length 6.59 0.636 50
#10 virginica Sepal.Width 2.97 0.322 50
#11 virginica Petal.Length 5.55 0.552 50
#12 virginica Petal.Width 2.03 0.275 50
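One caveat: n() counts every row in the group, including rows where a given column is NA. If the sample size should reflect only the non-NA values of each column, as the question asks, a count such as sum(!is.na(.x)) may be closer to what is wanted. The sketch below assumes that reading; iris_na is a hypothetical copy of iris with a few values blanked out so the difference is visible.

```r
library(dplyr)

# Hypothetical test data: blank out three setosa values in one column.
iris_na <- iris
iris_na$Sepal.Length[1:3] <- NA

res <- iris_na %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE),
                        # per-column count of non-missing values
                        n    = ~ sum(!is.na(.x)))))

res$Sepal.Length_n  # setosa loses the three NA rows
res$Sepal.Width_n   # unaffected column keeps the full group size
```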

Adding group size to summary table no longer works

I am aggregating some data and I want to add group sizes N to the output table. Until recently, the code below worked fine. Now, N is equal to the rowcount of my table.
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 150 5.01 3.43 1.46 0.246
2 versicolor 150 5.94 2.77 4.26 1.33
3 virginica 150 6.59 2.97 5.55 2.03
This looks like a recently introduced bug. Can be reproduced on dplyr 1.0.3 but not on 1.0.2.
You could, however, avoid the second group_by completely in this case.
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(.fns = mean, na.rm = TRUE),
N = n())
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width N
#* <fct> <dbl> <dbl> <dbl> <dbl> <int>
#1 setosa 5.01 3.43 1.46 0.246 50
#2 versicolor 5.94 2.77 4.26 1.33 50
#3 virginica 6.59 2.97 5.55 2.03 50
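As far as I know, later dplyr releases (1.1.0 onward) deprecate both calling across() without column selection and passing extra arguments such as na.rm through it. A sketch of the same summary with the columns and the lambda spelled out explicitly, assuming that newer behaviour:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # everything() selects all non-grouping columns; na.rm goes in the lambda
    across(everything(), ~ mean(.x, na.rm = TRUE)),
    # N is created after across(), so it is not averaged itself
    N = n()
  )

res$N
```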
Try this:
rm(list = ls())
library(dplyr)
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))

Apply a summarise condition to a range of columns when using dplyr group_by?

Suppose we want to group_by() and summarise a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean etc)
Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?
Example
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
but note that 3 consecutive columns have the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).
Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the range of columns, and hence avoid having to type out the summarise condition for every column in between?
Note
The iris example above is a small and manageable example that has a range of 3 consecutive columns, but actual use case has ~hundreds.
Version 1.0.0 of dplyr (upcoming at the time of writing) has an across() function that does what you wish for.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on.
It uses tidy selection (like select()) so you can pick variables by
position, name, and type.
The second argument, .fns, is a function or list of functions to apply to
each column. This can also be a purrr style formula (or list of formulas)
like ~ .x / 2. (This argument is optional, and you can omit it if you just want
to get the underlying data; you'll see that technique used in
vignette("rowwise").)
### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
Since summarise collapses the rows, we cannot apply further functions to the original columns afterwards. Instead, we can use mutate_at: select a range of columns to apply each function to, and then keep the first row from every group.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate_at(vars(Sepal.Width:Petal.Width), mean) %>%
mutate_at(vars(Sepal.Length), max) %>%
slice(1L)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
#1 5.8 3.43 1.46 0.246 setosa
#2 7 2.77 4.26 1.33 versicolor
#3 7.9 2.97 5.55 2.03 virginica
We can use pmap from purrr to apply various functions to various columns and then join the results back together at the end. Note the use of lst (from tibble, attached with the tidyverse), which lets us refer to previously named objects in the list construction, as in c = names(b). This allows us to analyze the same column with multiple functions, such as Sepal.Length below.
library(tidyverse)
lst(a = list("Sepal.Length", names(select(iris, Sepal.Length:Petal.Width))),
b = list("max" = max, "mean" = mean),
c = names(b)) %>%
pmap(function(a, b, c) {
iris %>%
group_by(Species) %>%
summarize_at(a, b) %>%
rename_at(a, paste0, "_", c)
}) %>%
reduce(inner_join, by = "Species")
#> # A tibble: 3 x 6
#> Species Sepal.Length_max Sepal.Length_me~ Sepal.Width_mean Petal.Length_me~
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.8 5.01 3.43 1.46
#> 2 versic~ 7 5.94 2.77 4.26
#> 3 virgin~ 7.9 6.59 2.97 5.55
#> # ... with 1 more variable: Petal.Width_mean <dbl>

plyr::ddply equivalent in dplyr

I personally learned plyr prior to dplyr, and I'm trying to normalize my code into the dplyr syntax wherever possible, but I get stuck with the following use-case:
ddply(
.data = somedataframe,
.variables = c('var1', 'var2'),
.function =
function(thisdf){
...
}
)
Where the ... inside the function call is some arbitrarily complex modification of the dataframe. Note that the choice of ddply versus dlply (or anyother dxply) is purely for illustration. Does a function within dplyr exists (call it dplyr::f for the moment), that could also take an arbitrary modification function? For example:
somedataframe %>%
group_by(var1, var2) %>%
dplyr::f(.function = function(thisdf){ ... })
In my investigation of this functionality, all the examples that I could find were extremely simple summarise implementations of ddply.
Probably the simplest way is to use dplyr::do(), but one can also use group_modify(). Complete example:
library(tidyverse)
#some complex function
func = function(x) {
mod = lm(Sepal.Length ~ Petal.Width, data = x)
mod_coefs = broom::tidy(mod)
tibble(
mean_sepal_length = mean(x$Sepal.Length),
mean_petal_width = mean(x$Petal.Width),
slope = mod_coefs[[2, 2]],
slope_p = mod_coefs[[2, 5]]
)
}
#plyr version
plyr::ddply(iris, "Species", func)
#dplyr with do()
iris %>%
group_by(Species) %>%
do(func(.))
#dplyr with group_modify()
#have to rewrite the function to take a second argument, which receives the group key (a one-row tibble of the grouping variables)
func2 = function(x, y) {
mod = lm(Sepal.Length ~ Petal.Width, data = x)
mod_coefs = broom::tidy(mod)
tibble(
mean_sepal_length = mean(x$Sepal.Length),
mean_petal_width = mean(x$Petal.Width),
slope = mod_coefs[[2, 2]],
slope_p = mod_coefs[[2, 5]]
)
}
iris %>%
group_by(Species) %>%
group_modify(func2)
These produce:
Species mean_sepal_length mean_petal_width slope slope_p
1 setosa 5.006 0.246 0.9301727 5.052644e-02
2 versicolor 5.936 1.326 1.4263647 4.035422e-05
3 virginica 6.588 2.026 0.6508306 4.798149e-02
# A tibble: 3 x 5
# Groups: Species [3]
Species mean_sepal_length mean_petal_width slope slope_p
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.246 0.930 0.0505
2 versicolor 5.94 1.33 1.43 0.0000404
3 virginica 6.59 2.03 0.651 0.0480
# A tibble: 3 x 5
# Groups: Species [3]
Species mean_sepal_length mean_petal_width slope slope_p
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.246 0.930 0.0505
2 versicolor 5.94 1.33 1.43 0.0000404
3 virginica 6.59 2.03 0.651 0.0480
There are 2 differences. The ddply() output is a standard data frame, even though the function returned a tibble. The dplyr outputs are grouped tibbles, even though the grouping has already been 'used'.
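If the leftover grouping is unwanted, one option is to finish the pipeline with ungroup(). A minimal sketch, using a simplified per-group function (a single mean, not the func2 defined above) to keep it short:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  # simplified stand-in for the per-group function
  group_modify(~ tibble(mean_sepal_length = mean(.x$Sepal.Length))) %>%
  # drop the grouping that group_modify() carries through
  ungroup()

is_grouped_df(res)
```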

R dplyr: Write list output to dataframe

Suppose I have the following function
SlowFunction = function(vector){
return(list(
mean =mean(vector),
sd = sd(vector)
))
}
And I would like to use dplyr::summarise to write the results to a dataframe:
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarise(
mean = SlowFunction(Sepal.Length)$mean,
sd = SlowFunction(Sepal.Length)$sd
)
Does anyone have a suggestion how I can do this by calling "SlowFunction" once instead of twice? (In my code "SlowFunction" is a slow function that I have to call many times.) Without splitting "SlowFunction" in two parts of course. So actually I would like to somehow fill multiple columns of a dataframe in one statement.
Without changing your current SlowFunction, one way is to use do:
library(dplyr)
iris %>%
group_by(Species) %>%
do(data.frame(SlowFunction(.$Sepal.Length)))
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
Or with group_split + purrr::map_dfr
bind_cols(Species = unique(iris$Species), iris %>%
group_split(Species) %>%
purrr::map_dfr(~SlowFunction(.$Sepal.Length)))
Another option is to store the output of SlowFunction in a list column of data.frames and then use unnest
iris %>%
group_by(Species) %>%
summarise(res = list(as.data.frame(SlowFunction(Sepal.Length)))) %>%
tidyr::unnest(res)
## A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
We can use group_modify if you are using dplyr 0.8.1 or later (dplyr 0.8.0 called this group_map, which now returns a list instead). The output from SlowFunction needs to be converted to a data frame.
library(dplyr)
iris %>%
group_by(Species) %>%
group_modify(~SlowFunction(.x$Sepal.Length) %>% as.data.frame())
# # A tibble: 3 x 3
# # Groups: Species [3]
# Species mean sd
# <fct> <dbl> <dbl>
# 1 setosa 5.01 0.352
# 2 versicolor 5.94 0.516
# 3 virginica 6.59 0.636
We can change the SlowFunction to return a tibble and
SlowFunction = function(vector){
tibble(
mean =mean(vector),
sd = sd(vector)
)
}
and then unnest the summarise output in a list
iris %>%
group_by(Species) %>%
summarise(out = list(SlowFunction(Sepal.Length))) %>%
tidyr::unnest(out)
# A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
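With dplyr 1.0.0 or later there may be a shorter route: summarise() unpacks an unnamed data frame result into separate columns, so the list can be converted and returned directly. A sketch assuming the original list-returning SlowFunction from the question:

```r
library(dplyr)

# Original definition from the question: returns a named list.
SlowFunction <- function(vector) {
  list(mean = mean(vector), sd = sd(vector))
}

res <- iris %>%
  group_by(Species) %>%
  # Unnamed data-frame result: its columns become output columns,
  # and SlowFunction appears only once in the summarise() call.
  summarise(tibble::as_tibble(SlowFunction(Sepal.Length)))

names(res)
```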
