Let's say we want to calculate the means of sepal length based on tercile groups of sepal width.
We can use the split_quantile function from the fabricatr package and do the following:
iris %>%
group_by(split_quantile(Sepal.Width, 3)) %>%
summarise(Sepal.Length = mean(Sepal.Length))
So far so good. Now, let's say we want to group_by(Species, split_quantile(Sepal.Width, 3)) instead of just group_by(split_quantile(Sepal.Width, 3)).
However, what if we want the terciles to be calculated within each species rather than across the whole dataset?
Basically, what I'm looking for could be achieved by splitting iris into several dataframes based on Species, using split_quantile on those dataframes to calculate terciles and then joining the dataframes back together. However, I'm looking for a way to do this without splitting the dataframe.
You have more or less written the answer in your question: create a new variable for the tercile after grouping by Species, then regroup by both Species and Tercile.
library(tidyverse)
library(fabricatr)
iris %>%
group_by(Species) %>%
mutate(Tercile = split_quantile(Sepal.Width, 3)) %>%
group_by(Species, Tercile) %>%
summarise(Sepal.Length = mean(Sepal.Length))
#> # A tibble: 9 x 3
#> # Groups: Species [3]
#> Species Tercile Sepal.Length
#> <fct> <fct> <dbl>
#> 1 setosa 1 4.69
#> 2 setosa 2 5.08
#> 3 setosa 3 5.27
#> 4 versicolor 1 5.61
#> 5 versicolor 2 6.12
#> 6 versicolor 3 6.22
#> 7 virginica 1 6.29
#> 8 virginica 2 6.73
#> 9 virginica 3 6.81
Created on 2020-05-27 by the reprex package (v0.3.0)
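As a quick check that the bucketing really happens inside each species, here is a minimal sketch that counts rows per species and tercile. It swaps in dplyr's ntile() for split_quantile() so that it runs without fabricatr. This is an assumption made for illustration only: ntile() buckets by rank, so tied values can land in different buckets and the grouping may differ from what split_quantile() produces.

```r
library(dplyr)

# Count rows per species/tercile; ntile() is a stand-in for split_quantile()
tercile_counts <- iris %>%
  group_by(Species) %>%
  mutate(Tercile = ntile(Sepal.Width, 3)) %>%  # ranks are computed within each species
  count(Species, Tercile)                       # one row per species/tercile pair

tercile_counts
```

If the terciles were computed over the whole dataset instead, the counts per species would be uneven, since the species differ in sepal width.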
I need to mutate a column in a dataframe, with the seq of another column.
For example with iris, I would like to add a new column for each Species, with
seq(min(Sepal.Length),max(Sepal.Length),length=100)
I tried (with no success):
iris %>%
group_by(Species) %>%
mutate(seqq = seq(min(Sepal.Length),max(Sepal.Length), 100))
Any ideas?
thank you!
mutate() must return either a single value or as many values as there are rows in each group, so a sequence of 100 values per species does not fit. We may use summarise() instead:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(seq = seq(min(Sepal.Length), max(Sepal.Length),
length.out = 100), .groups = 'drop')
-output
# A tibble: 300 x 2
Species seq
<fct> <dbl>
1 setosa 4.3
2 setosa 4.32
3 setosa 4.33
4 setosa 4.35
5 setosa 4.36
6 setosa 4.38
7 setosa 4.39
8 setosa 4.41
9 setosa 4.42
10 setosa 4.44
# … with 290 more rows
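Note that dplyr 1.1.0 and later warn when summarise() returns more than one row per group, and offer reframe() for that purpose. A sketch that should behave the same across versions is to store each sequence in a list column and then expand it with tidyr::unnest(). The column name seq_val is made up for this example.

```r
library(dplyr)
library(tidyr)

seqs <- iris %>%
  group_by(Species) %>%
  # one list element per species, each holding 100 evenly spaced values
  summarise(
    seq_val = list(seq(min(Sepal.Length), max(Sepal.Length), length.out = 100)),
    .groups = "drop"
  ) %>%
  unnest(seq_val)  # expand the list column to 100 rows per species

seqs
```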
I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups. But the output comes out in long format, with one row per combination of groups. See an example using the iris dataset...
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
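If the goal is one row per Species with one column of means per Petal.Length value, one possible sketch is to summarise first and then reshape with tidyr::pivot_wider(). The names mean_sepal and the "PL_" prefix are made up for this example; with real data, the behaviour column would take the place of Petal.Length.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal = mean(Sepal.Length), .groups = "drop") %>%
  # spread the Petal.Length values into columns, one row per Species
  pivot_wider(
    names_from = Petal.Length,
    values_from = mean_sepal,
    names_prefix = "PL_"
  )

wide
```

Combinations that do not occur in the data (for example a Petal.Length that only appears in one species) come out as NA.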
I have 5 columns that I'd like to group by one column and then summarise as the mean of each column. However, in the process, I'd like to calculate the mean only for values within a certain range, for all the columns. Is this possible? I don't want to exclude the rows themselves, only the values being aggregated.
Current code:
a <- b %>% group_by(c) %>% summarise_all(funs(mean(., na.rm=T)))
If you want to compute the mean on only a subset of the data, you can use a lambda function inside summarise().
However, if the subset is based on only one variable, you should simply use filter().
Also, note that summarise_all() is superseded and we should use summarise(across()) instead.
Here is an example: first the plain mean per group, then the mean computed only from values strictly between 2 and 3.
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~mean(.x, na.rm=TRUE)))
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 3.43 1.46 0.246
#> 2 versicolor 5.94 2.77 4.26 1.33
#> 3 virginica 6.59 2.97 5.55 2.03
my_range = c(inf=2, sup=3)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~.x[.x>my_range["inf"] & .x<my_range["sup"]] %>% mean(na.rm=TRUE)))
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NaN 2.60 NaN NaN
#> 2 versicolor NaN 2.63 NaN NaN
#> 3 virginica NaN 2.69 NaN 2.27
Created on 2021-05-12 by the reprex package (v2.0.0)
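The NaN entries above come from taking the mean of an empty vector. If that is unwanted, the filtering can be moved into a small helper that returns NA when nothing falls in the range. This is only a sketch; mean_in_range() and its arguments are hypothetical names, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: mean of the values strictly between lower and upper,
# or NA if no value falls in that range
mean_in_range <- function(x, lower, upper) {
  kept <- x[!is.na(x) & x > lower & x < upper]
  if (length(kept) == 0) NA_real_ else mean(kept)
}

res <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean_in_range(.x, lower = 2, upper = 3)))

res
```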
Suppose we want to group_by() and summarise a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean etc)
Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?
Example
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
but note that 3 consecutive columns have the same summarise condition, mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width)
Is there a way to use some method like mean(Sepal.Width:Petal.Width) to specify the condition for the range of columns, and hence avoid having to type out the summarise condition for every column in between?
Note
The iris example above is a small and manageable example with a range of 3 consecutive columns, but the actual use case has hundreds.
dplyr 1.0.0 and later provide the across() function, which does what you are asking for.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on.
It uses tidy selection (like select()) so you can pick variables by
position, name, and type.
The second argument, .fns, is a function or list of functions to apply to
each column. This can also be a purrr style formula (or list of formulas)
like ~ .x / 2. (This argument is optional, and you can omit it if you just want
to get the underlying data; you'll see that technique used in
vignette("rowwise").)
### across() requires dplyr >= 1.0.0; update from CRAN if needed
# install.packages("dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
Since summarise() collapses each group to a single row, we cannot apply further per-column functions afterwards. Instead, we can use mutate_at() to apply a function to a range of columns and then keep the first row of every group.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate_at(vars(Sepal.Width:Petal.Width), mean) %>%
mutate_at(vars(Sepal.Length), max) %>%
slice(1L)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
#1 5.8 3.43 1.46 0.246 setosa
#2 7 2.77 4.26 1.33 versicolor
#3 7.9 2.97 5.55 2.03 virginica
We can use pmap from purrr to apply various functions to various columns and then join the results back together at the end. Note the use of lst (from tibble, loaded with the tidyverse) so we can refer to previously named objects in the list construction. This allows us to analyze the same column with multiple functions, such as Sepal.Length below.
library(tidyverse)
lst(a = list("Sepal.Length", names(select(iris, Sepal.Length:Petal.Width))),
b = list("max" = max, "mean" = mean),
c = names(b)) %>%
pmap(function(a, b, c) {
iris %>%
group_by(Species) %>%
summarize_at(a, b) %>%
rename_at(a, paste0, "_", c)
}) %>%
reduce(inner_join, by = "Species")
#> # A tibble: 3 x 6
#> Species Sepal.Length_max Sepal.Length_me~ Sepal.Width_mean Petal.Length_me~
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.8 5.01 3.43 1.46
#> 2 versic~ 7 5.94 2.77 4.26
#> 3 virgin~ 7.9 6.59 2.97 5.55
#> # ... with 1 more variable: Petal.Width_mean <dbl>
I can select and arrange a single column:
iris %>%
select(Petal.Width, Species) %>%
arrange(desc(Petal.Width))
But I want to do this for the whole dataframe. I'm approaching this with a for loop:
features <- colnames(iris)
top <- data.frame()
for (i in 1:length(features)) {
label <- features[[i]]
iris %>%
select(label, Species) %>%
arrange(desc(label)) %>%
top_n(3) %>%
rbind(top)
}
# Error in arrange_impl(.data, dots) :
# incorrect size (1) at position 1, expecting : 150
Which gives me an error.
Apparently the arrange(desc(label)) doesn't work. I searched around and tried things like UQ and substitute to unquote the label, but with no result.
The rbind(top) and the top_n at the end might also not be exactly what I want, but the main problem I have now is how to use the label so that the for loop will accept it.
And maybe someone knows a better approach altogether than my for loop...
The desired output is a dataframe, with the top 3 of every column.
If you want to use something on all columns, there are multiple ways. I like to gather (or melt) the data first and then use dplyr again.
For example, in your case, this would result in
library(tidyr)
library(dplyr)
iris %>%
gather("var", "val", -Species) %>%
group_by(var) %>%
arrange(desc(val)) %>%
top_n(3)
#> Selecting by val
#> # A tibble: 14 x 3
#> # Groups: var [4]
#> Species var val
#> <fctr> <chr> <dbl>
#> 1 virginica Sepal.Length 7.9
#> 2 virginica Sepal.Length 7.7
#> 3 virginica Sepal.Length 7.7
#> 4 virginica Sepal.Length 7.7
#> 5 virginica Sepal.Length 7.7
#> 6 virginica Petal.Length 6.9
#> 7 virginica Petal.Length 6.7
#> 8 virginica Petal.Length 6.7
#> 9 setosa Sepal.Width 4.4
#> 10 setosa Sepal.Width 4.2
#> 11 setosa Sepal.Width 4.1
#> 12 virginica Petal.Width 2.5
#> 13 virginica Petal.Width 2.5
#> 14 virginica Petal.Width 2.5
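What you see is that top_n keeps every row tied for the top n values rather than exactly n rows, which is why some groups have more than 3 entries. If you want exactly 3 rows per group, you can replace top_n(3) with slice(1:3).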
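In current tidyr and dplyr, gather() and top_n() are superseded by pivot_longer() and slice_max(). A sketch of the same idea with those functions, assuming dplyr 1.0.0 or later, could look like this. Setting with_ties = FALSE returns exactly 3 rows per variable.

```r
library(dplyr)
library(tidyr)

top3 <- iris %>%
  pivot_longer(-Species, names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  # keep the 3 largest values per variable, dropping extra ties
  slice_max(val, n = 3, with_ties = FALSE) %>%
  ungroup()

top3
```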
Does that give you what you were looking for?
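As for the loop itself: arrange(desc(label)) sorts by the string stored in label, not by the column it names. One way around this, sketched below, is the .data pronoun together with all_of(). The sketch uses lapply() and bind_rows() instead of growing a data frame inside a for loop, and the output columns value and feature are names made up for this example.

```r
library(dplyr)

# Every column except the grouping label
features <- setdiff(colnames(iris), "Species")

top <- lapply(features, function(label) {
  iris %>%
    select(all_of(label), Species) %>%   # select the column named by the string
    arrange(desc(.data[[label]])) %>%    # sort by that column, largest first
    head(3) %>%                          # keep exactly the top 3 rows
    rename(value = all_of(label)) %>%    # common column name so rows can be stacked
    mutate(feature = label)
}) %>%
  bind_rows()

top
```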