plyr::ddply equivalent in dplyr - r

I personally learned plyr prior to dplyr, and I'm trying to normalize my code into the dplyr syntax wherever possible, but I get stuck with the following use-case:
ddply(
.data = somedataframe,
.variables = c('var1', 'var2'),
.function =
function(thisdf){
...
}
)
Here the ... inside the function call is some arbitrarily complex modification of the data frame. Note that the choice of ddply versus dlply (or any other dxply) is purely for illustration. Does a function exist within dplyr (call it dplyr::f for the moment) that could also take an arbitrary modification function? For example:
somedataframe %>%
group_by(var1, var2) %>%
dplyr::f(.function = function(thisdf){ ... })
In my investigation of this functionality, all the examples that I could find were extremely simple summarise implementations of ddply.

Probably the simplest way is to use dplyr::do(), but one can also use group_modify(). Complete example:
library(tidyverse)
#some complex function
func = function(x) {
mod = lm(Sepal.Length ~ Petal.Width, data = x)
mod_coefs = broom::tidy(mod)
tibble(
mean_sepal_length = mean(x$Sepal.Length),
mean_petal_width = mean(x$Petal.Width),
slope = mod_coefs[[2, 2]],
slope_p = mod_coefs[[2, 5]]
)
}
#plyr version
plyr::ddply(iris, "Species", func)
#dplyr with do()
iris %>%
group_by(Species) %>%
do(func(.))
#dplyr with group_modify()
#have to rewrite the function to take a second argument, which receives the grouping key (a one-row tibble)
func2 = function(x, y) {
mod = lm(Sepal.Length ~ Petal.Width, data = x)
mod_coefs = broom::tidy(mod)
tibble(
mean_sepal_length = mean(x$Sepal.Length),
mean_petal_width = mean(x$Petal.Width),
slope = mod_coefs[[2, 2]],
slope_p = mod_coefs[[2, 5]]
)
}
iris %>%
group_by(Species) %>%
group_modify(func2)
These produce:
Species mean_sepal_length mean_petal_width slope slope_p
1 setosa 5.006 0.246 0.9301727 5.052644e-02
2 versicolor 5.936 1.326 1.4263647 4.035422e-05
3 virginica 6.588 2.026 0.6508306 4.798149e-02
# A tibble: 3 x 5
# Groups: Species [3]
Species mean_sepal_length mean_petal_width slope slope_p
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.246 0.930 0.0505
2 versicolor 5.94 1.33 1.43 0.0000404
3 virginica 6.59 2.03 0.651 0.0480
# A tibble: 3 x 5
# Groups: Species [3]
Species mean_sepal_length mean_petal_width slope slope_p
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.246 0.930 0.0505
2 versicolor 5.94 1.33 1.43 0.0000404
3 virginica 6.59 2.03 0.651 0.0480
There are two differences. The ddply() output is a standard data frame, even though the function returned a tibble. The dplyr outputs are grouped tibbles, even though the grouping has already been 'used'.
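If the leftover grouping or the tibble class gets in the way, both can be undone after the fact. The following is only a minimal sketch: `summarise_group` is a hypothetical stand-in for the more complex function above, and it assumes a dplyr version that provides group_modify().

```r
library(dplyr)

# Hypothetical per-group function, used only for this sketch.
# The second argument receives the group key and is ignored here.
summarise_group <- function(x, key) {
  tibble(mean_sepal_length = mean(x$Sepal.Length))
}

res <- iris %>%
  group_by(Species) %>%
  group_modify(summarise_group) %>%
  ungroup() %>%      # drop the leftover grouping
  as.data.frame()    # plain data frame, like the ddply() result
```

After this, `res` should behave like the ddply() output: one row per species and no grouping attached.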

Related

use for loop for unique element in r

I have a question about for loop in r. I have used the following for loop
for (i in 1:length(unique(iris$Species))) {
datu <- data.frame(ID = unique(i),
Sl = mean(iris$Sepal.Length),
Sw = mean(iris$Sepal.Width))
}
to get the mean of each unique species in iris. But my final data only has one observation, whereas my desired output has a separate row for setosa, versicolor, and virginica. What should I change in this code? Thanks
We don't need a loop. It can be done with a group-by approach:
setNames(aggregate(.~ Species, iris[c(1, 2, 5)], mean), c("ID", "Sl", "Sw"))
-output
ID Sl Sw
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
Or with tidyverse
library(dplyr)
library(stringr)
iris %>%
group_by(ID = Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE),
.names = "{str_to_title(str_remove_all(.col, '[a-z.]+'))}"))
-output
# A tibble: 3 × 3
ID Sl Sw
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
In the loop, unique(i) is just i; what was meant is unique(iris$Species)[i]. In addition, datu is overwritten in each iteration, so only the output of the last iteration is kept, and the means are computed over the whole dataset rather than over the current species. Instead, the results can be stored in a list and combined with rbind afterwards, or we can use
datu <- data.frame()
for (i in 1:length(unique(iris$Species))) {
unqSp <- unique(iris$Species)[i]
i1 <- iris$Species == unqSp
datu <- rbind(datu, data.frame(ID = unqSp,
Sl = mean(iris$Sepal.Length[i1]),
Sw = mean(iris$Sepal.Width[i1])))
}
-output
> datu
ID Sl Sw
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
A tidyverse approach using dplyr.
dplyr::summarize
‘summarise()’ creates a new data frame. It will have one (or more)
rows for each combination of grouping variables; if there are no
grouping variables, the output will have a single row summarising
all observations in the input.
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(Sl = mean(Sepal.Length), Sw = mean(Sepal.Width))
# A tibble: 3 × 3
Species Sl Sw
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97

Perform a different simple custom function based on group

I have data with three groups and would like to perform a different custom function on each of the three groups. Rather than write three separate functions, and calling them all separately, I'm wondering whether I can easily wrap all three into one function with a 'group' parameter.
For example, say I want the mean for group A:
library(tidyverse)
data(iris)
iris$Group <- c(rep("A", 50), rep("B", 50), rep("C", 50))
f_a <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length))
return(out)
}
The median for group B
f_b <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(median = median(Sepal.Length))
return(out)
}
And the standard deviation for group C
f_c <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(sd= sd(Sepal.Length))
return(out)
}
Is there any way I can combine the above functions and run them according to a group parameter?? Like:
fx(df, group = "A")
Which would produce the results of the above f_a function??
Keeping in mind that in my actual use context, I can't simply group_by(group) in the original function, since the actual functions are more complex. Thanks!!
We create a switch inside the function to select the appropriate function based on the value passed as group. The selected function is then applied inside summarise after grouping by 'Species'
fx <- function(df, group) {
fn_selector <- switch(group,
A = "mean",
B = "median",
C = "sd")
df %>%
group_by(Species) %>%
summarise(!! fn_selector :=
match.fun(fn_selector)(Sepal.Length), .groups = 'drop')
}
-testing
fx(iris, "A")
# A tibble: 3 x 2
# Species mean
# <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
fx(iris, "B")
# A tibble: 3 x 2
# Species median
# <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
fx(iris, "C")
# A tibble: 3 x 2
# Species sd
# <fct> <dbl>
#1 setosa 0.352
#2 versicolor 0.516
#3 virginica 0.636
I don't understand the point of having a Group column in the dataset. When we pass group = "A" to the function, this has nothing to do with the Group column that was created.
Instead of passing group = "A" to the function and then mapping A to some function, you can directly pass the function that you want to apply.
library(dplyr)
f_a <- function(df, fn){
out <- df %>%
group_by(Species) %>%
summarise(out = fn(Sepal.Length))
return(out)
}
f_a(iris, mean)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
f_a(iris, median)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5

tidyr unnest, prefix column names with nested name during unnesting

When running unnest on a data.frame, is there a way to add the name of the nested column to the individual columns it contains (either as a suffix or prefix)? Or does renaming have to be done manually via rename?
This is particularly relevant with 'unnesting' multiple groups that contain columns with the same names.
In the example below the base aggregate command does this well (eg. Petal.Length.mn), but I couldn't find an option to get unnest to do the same thing?
I'm using nest with purrr::map as I want the flexibility to mix functions, eg. calculate means and sd on a couple of variables and also run a t test to look at differences between them.
library(dplyr, warn.conflicts = FALSE)
msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))
aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
data = iris, FUN = msd_c)
#> Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#> 1 setosa 1.4620000 0.1736640 0.2460000 0.1053856
#> 2 versicolor 4.2600000 0.4699110 1.3260000 0.1977527
#> 3 virginica 5.5520000 0.5518947 2.0260000 0.2746501
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_repair = tidyr::tidyr_legacy)
#> # A tibble: 3 x 13
#> # Groups: Species [3]
#> Species mn sd mn1 sd1 estimate statistic p.value parameter conf.low
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 setosa 1.46 0.174 0.246 0.105 0.332 2.44 1.86e- 2 48 0.0587
#> 2 versic~ 4.26 0.470 1.33 0.198 0.787 8.83 1.27e-11 48 0.651
#> 3 virgin~ 5.55 0.552 2.03 0.275 0.322 2.36 2.25e- 2 48 0.0481
#> # ... with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Created on 2020-05-20 by the reprex package (v0.3.0)
The answer to this turned out to be somewhat obvious: use the names_sep option rather than the names_repair option. As quoted from the nest help page under names_sep:
If a string, the inner and outer names will be used together. In
nest(), the names of the new outer columns will be formed by pasting
together the outer and the inner column names, separated by names_sep.
In unnest(), the new inner names will have the outer names (+
names_sep) automatically stripped. This makes names_sep roughly
symmetric between nesting and unnesting.
library(dplyr, warn.conflicts = FALSE)
msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_sep = ".")
#> # A tibble: 3 x 13
#> # Groups: Species [3]
#> Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.46 0.174 0.246 0.105
#> 2 versic~ 4.26 0.470 1.33 0.198
#> 3 virgin~ 5.55 0.552 2.03 0.275
#> # ... with 8 more variables: Correlation.estimate <dbl>,
#> # Correlation.statistic <dbl>, Correlation.p.value <dbl>,
#> # Correlation.parameter <int>, Correlation.conf.low <dbl>,
#> # Correlation.conf.high <dbl>, Correlation.method <chr>,
#> # Correlation.alternative <chr>
Created on 2020-06-10 by the reprex package (v0.3.0)
To apply multiple functions to multiple columns I would use summarise_at/mutate_at instead of nesting and unnesting data.
For example, in this case we can do :
library(dplyr)
iris %>%
group_by(Species) %>%
summarise_at(vars(Petal.Length:Petal.Width), list(mn = mean, sd = sd))
# Species Petal.Length_mn Petal.Width_mn Petal.Length_sd Petal.Width_sd
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 1.46 0.246 0.174 0.105
#2 versicolor 4.26 1.33 0.470 0.198
#3 virginica 5.55 2.03 0.552 0.275
This automatically appends the function name as a suffix to the name of each column the function is applied to. It is also the dplyr equivalent of the aggregate call you tried.
Also note that summarise_at will soon be replaced by across in the upcoming version of dplyr.
You can use setNames as shown below. It is a little wordy, but since it seems you plan to specify the function for each column individually, this may be of interest.
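For reference, a sketch of what the across() form might look like, assuming dplyr 1.0.0 or later is installed. The .names glue spec is chosen here to mimic the aggregate-style names (Petal.Length.mn and so on); it is not something the original answer specified.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    across(Petal.Length:Petal.Width,
           list(mn = mean, sd = sd),
           .names = "{.col}.{.fn}"),   # e.g. Petal.Length.mn
    .groups = "drop"
  )
```

Note that across() orders the output column by column (both statistics for Petal.Length, then both for Petal.Width), whereas summarise_at orders it function by function.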
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df(.x$Petal.Length) %>%
setNames(paste0("Petal.Length.", names(.)))),
Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width) %>%
setNames(paste0("Petal.Width.", names(.)))),
Ratio = purrr::map(data, ~ msd_df(.$Petal.Length/.$Petal.Width) %>%
setNames(paste0("Ratio.", names(.))))
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))
# A tibble: 3 x 7
# Groups: Species [3]
Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd Ratio.mn Ratio.sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 1.46 0.174 0.246 0.105 6.91 2.85
2 versicolor 4.26 0.470 1.33 0.198 3.24 0.312
3 virginica 5.55 0.552 2.03 0.275 2.78 0.407
Or modify your function so that it sets the column names itself, like this.
msd_df_name <- function(x, name){
bind_rows(c(mn = mean(x), sd = sd(x))) %>%
setNames(paste0(name, ".", names(.)))
}
iris %>%
select(Petal.Length:Species) %>%
group_by(Species) %>%
tidyr::nest() %>%
mutate(
Petal.Length = purrr::map(data, ~ msd_df_name(.x$Petal.Length, "Petal.Length")),
Petal.Width = purrr::map(data, ~ msd_df_name(.$Petal.Width, "Petal.Width")),
Ratio = purrr::map(data, ~ msd_df_name(.$Petal.Length/.$Petal.Width, "Ratio"))
) %>%
select(-data) %>%
tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))

Apply a summarise condition to a range of columns when using dplyr group_by?

Suppose we want to group_by() and summarise a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean etc)
Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?
Example
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
but note that 3 consecutive columns have the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).
Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the range of columns, and hence avoid having to type out the summarise condition for every column in between?
Note
The iris example above is a small and manageable example with a range of 3 consecutive columns, but the actual use case has hundreds.
The upcoming version 1.0.0 of dplyr will have an across() function that does what you are asking for.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on.
It uses tidy selection (like select()) so you can pick variables by
position, name, and type.
The second argument, .fns, is a function or list of functions to apply to
each column. This can also be a purrr style formula (or list of formulas)
like ~ .x / 2. (This argument is optional, and you can omit it if you just want
to get the underlying data; you'll see that technique used in
vignette("rowwise").)
### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
Since summarise collapses the rows, we cannot apply further functions to the original columns afterwards. We can use mutate_at instead: select the range of columns to apply each function to, and then take the first row of every group.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate_at(vars(Sepal.Width:Petal.Width), mean) %>%
mutate_at(vars(Sepal.Length), max) %>%
slice(1L)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
#1 5.8 3.43 1.46 0.246 setosa
#2 7 2.77 4.26 1.33 versicolor
#3 7.9 2.97 5.55 2.03 virginica
We can use pmap from purrr to apply various functions to various columns and then join the results back together at the end. Note the use of lst from tibble (attached by the tidyverse), which lets us refer to previously named objects during list construction. This allows us to analyze the same column with multiple functions, such as Sepal.Length below.
library(tidyverse)
lst(a = list("Sepal.Length", names(select(iris, Sepal.Length:Petal.Width))),
b = list("max" = max, "mean" = mean),
c = names(b)) %>%
pmap(function(a, b, c) {
iris %>%
group_by(Species) %>%
summarize_at(a, b) %>%
rename_at(a, paste0, "_", c)
}) %>%
reduce(inner_join, by = "Species")
#> # A tibble: 3 x 6
#> Species Sepal.Length_max Sepal.Length_me~ Sepal.Width_mean Petal.Length_me~
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.8 5.01 3.43 1.46
#> 2 versic~ 7 5.94 2.77 4.26
#> 3 virgin~ 7.9 6.59 2.97 5.55
#> # ... with 1 more variable: Petal.Width_mean <dbl>

R dplyr: Write list output to dataframe

Suppose I have the following function
SlowFunction = function(vector){
return(list(
mean =mean(vector),
sd = sd(vector)
))
}
And I would like to use dplyr::summarise to write the results to a dataframe:
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarise(
mean = SlowFunction(Sepal.Length)$mean,
sd = SlowFunction(Sepal.Length)$sd
)
Does anyone have a suggestion how I can do this by calling "SlowFunction" once instead of twice? (In my code "SlowFunction" is a slow function that I have to call many times.) Without splitting "SlowFunction" in two parts of course. So actually I would like to somehow fill multiple columns of a dataframe in one statement.
Without changing your current SlowFunction, one way is to use do
library(dplyr)
iris %>%
group_by(Species) %>%
do(data.frame(SlowFunction(.$Sepal.Length)))
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
Or with group_split + purrr::map_dfr
bind_cols(Species = unique(iris$Species), iris %>%
  group_split(Species) %>%
  purrr::map_dfr(~SlowFunction(.$Sepal.Length)))
Another option is to store the output of SlowFunction in a list column of data.frames and then use unnest
iris %>%
  group_by(Species) %>%
  summarise(res = list(as.data.frame(SlowFunction(Sepal.Length)))) %>%
  tidyr::unnest(res)
## A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
We can use group_modify if you are using dplyr 0.8.1 or later (in dplyr 0.8.0 this function was called group_map; since 0.8.1, group_map returns a list instead). The output from SlowFunction needs to be converted to a data frame.
library(dplyr)
iris %>%
  group_by(Species) %>%
  group_modify(~SlowFunction(.x$Sepal.Length) %>% as.data.frame())
# # A tibble: 3 x 3
# # Groups: Species [3]
# Species mean sd
# <fct> <dbl> <dbl>
# 1 setosa 5.01 0.352
# 2 versicolor 5.94 0.516
# 3 virginica 6.59 0.636
We can change the SlowFunction to return a tibble and
SlowFunction = function(vector){
tibble(
mean =mean(vector),
sd = sd(vector)
)
}
and then unnest the list column created by summarise
iris %>%
  group_by(Species) %>%
  summarise(out = list(SlowFunction(Sepal.Length))) %>%
  tidyr::unnest(out)
# A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
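With dplyr 1.0.0 or later there may be a shorter route: summarise() accepts an unnamed expression that returns a data frame and unpacks it into separate columns. The sketch below assumes that behaviour and keeps the original list-returning SlowFunction, so it is called once per group.

```r
library(dplyr)

# Original list-returning version of the function
SlowFunction <- function(vector) {
  list(mean = mean(vector), sd = sd(vector))
}

res <- iris %>%
  group_by(Species) %>%
  # The unnamed one-row tibble is unpacked into the columns mean and sd
  summarise(as_tibble(SlowFunction(Sepal.Length)), .groups = "drop")
```

The key point is that the summary expression is left unnamed; naming it (for example out = ...) would instead create a single data-frame column.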
