Group by a factor and then summarise a different variable [duplicate] - r

I have data in this format, where samples are in groups (in this example A or B), have a numerical quantity and a quality score (which is a factor).
I would like to summarise the qual_score by each group_name.
Example Data:
group_name <- rep(c("A","B"),5)
qual_score <- c(rep("POOR",4),rep("FAIR",1),rep("GOOD",5))
quantity <- 5:14
df <- data.frame(group_name, qual_score, quantity, stringsAsFactors = TRUE)
> df
group_name qual_score quantity
1 A POOR 5
2 B POOR 6
3 A POOR 7
4 B POOR 8
5 A FAIR 9
6 B GOOD 10
7 A GOOD 11
8 B GOOD 12
9 A GOOD 13
10 B GOOD 14
Desired Output:
desired_output <- data.frame(c("2","2"),c("1","0"),c("2","3"))
colnames(desired_output) <- c("POOR", "FAIR", "GOOD")
rownames(desired_output) <- c("A", "B")
desired_output
POOR FAIR GOOD
A 2 1 2
B 2 0 3
I can do summary() of qual_score for the entire dataframe:
> summary(df$qual_score)
FAIR GOOD POOR
1 5 4
And can group_by() to summarise mean(quantity) according to each group:
> df %>%
+ group_by(group_name) %>%
+ summarise(mean(quantity))
# A tibble: 2 x 2
group_name `mean(quantity)`
<fct> <dbl>
1 A 9
2 B 10
But when I try to use group_by() with summary() I get a warning and the following output:
> df %>%
+ group_by(group_name) %>%
+ summary(qual_score)
group_name qual_score quantity
A:5 FAIR:1 Min. : 5.00
B:5 GOOD:5 1st Qu.: 7.25
POOR:4 Median : 9.50
Mean : 9.50
3rd Qu.:11.75
Max. :14.00
Warning messages:
1: In if (length(ll) > maxsum) { :
the condition has length > 1 and only the first element will be used
2: In if (length(ll) > maxsum) { :
the condition has length > 1 and only the first element will be used

library(dplyr)
df %>%
group_by(group_name) %>%
select(-quantity) %>%
table()
#> qual_score
#> group_name FAIR GOOD POOR
#> A 1 2 2
#> B 0 3 2
If you want a solution that stays entirely within the tidyverse:
library(dplyr)
library(tidyr)
df %>%
group_by(group_name, qual_score) %>%
tally() %>%
spread(qual_score, n, fill=0)
#> # A tibble: 2 x 4
#> # Groups: group_name [2]
#> group_name FAIR GOOD POOR
#> <fct> <dbl> <dbl> <dbl>
#> 1 A 1 2 2
#> 2 B 0 3 2
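In current versions of tidyr, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same count-and-reshape step, assuming the df from the question and a reasonably recent tidyr (1.1 or later, where values_fill accepts a single value). The name wide_counts is only illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  group_name = rep(c("A", "B"), 5),
  qual_score = c(rep("POOR", 4), rep("FAIR", 1), rep("GOOD", 5)),
  quantity = 5:14
)

# Count each (group_name, qual_score) pair, then spread the scores into columns.
# values_fill = 0 fills in combinations that never occur (here: B / FAIR).
wide_counts <- df %>%
  count(group_name, qual_score) %>%
  pivot_wider(names_from = qual_score, values_from = n, values_fill = 0)

wide_counts
```

count() returns an ungrouped result, so no ungroup() is needed before reshaping.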

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3
It's easier if you do it in one go. Your approach is not 'wrong'; seq_len() expects a single integer, but you are passing it a vector (the column n), so it falls back to using only the first value and warns about it.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
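As a quick check, the sketch below computes both variants side by side on the dat from the question so they can be compared. The column names by_seq_len and by_row_number are made up for illustration.

```r
library(dplyr)
library(tibble)

# Example data from the question
dat <- tribble(
  ~x,       ~group,
  1,        "A",
  2,        "A",
  NA_real_, "A",
  NA_real_, "A",
  1,        "B",
  NA_real_, "B",
  3,        "B"
)

res <- dat %>%
  group_by(group) %>%
  mutate(
    by_seq_len    = seq_len(n()),  # n() is a single number per group
    by_row_number = row_number()   # position of each row within its group
  ) %>%
  ungroup()

# Both columns hold the same within-group sequence
identical(res$by_seq_len, res$by_row_number)
```

Both variants restart the counter for each group rather than continuing from the last non-missing x, which happens to give the desired result for this data.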
We could use rowid() from data.table directly if the intention is only to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using the column after it is created is that it is no longer a single value but a vector, as shown in @Maël's post. If we need to do that, wrap it in first(), because seq_len() is not vectorized; the extra column is not really needed here, though.
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
A base R option using ave (which works in a similar way to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B

Average across rows, but leaving out own group [duplicate]

Using dplyr (preferably), I am trying to calculate the group mean for each observation while excluding that observation from the group.
It seems that this should be doable with a combination of rowwise() and group_by(), but both functions cannot be used simultaneously.
Given this data frame:
df <- data_frame(grouping = rep(LETTERS[1:5], 3),
value = 1:15) %>%
arrange(grouping)
df
#> Source: local data frame [15 x 2]
#>
#> grouping value
#> (chr) (int)
#> 1 A 1
#> 2 A 6
#> 3 A 11
#> 4 B 2
#> 5 B 7
#> 6 B 12
#> 7 C 3
#> 8 C 8
#> 9 C 13
#> 10 D 4
#> 11 D 9
#> 12 D 14
#> 13 E 5
#> 14 E 10
#> 15 E 15
I'd like to get the group mean for each observation with that observation excluded from the group, resulting in:
#> grouping value special_mean
#> (chr) (int)
#> 1 A 1 8.5 # i.e. (6 + 11) / 2
#> 2 A 6 6 # i.e. (1 + 11) / 2
#> 3 A 11 3.5 # i.e. (1 + 6) / 2
#> 4 B 2 9.5
#> 5 B 7 7
#> 6 B 12 4.5
#> 7 C 3 ...
I've attempted nesting rowwise() inside a function called by do(), but haven't gotten it to work, along these lines:
special_avg <- function(chunk) {
chunk %>%
rowwise() #%>%
# filter or something...?
}
df %>%
group_by(grouping) %>%
do(special_avg(.))
There is no need to define a custom function. Instead, we can simply sum all elements of the group, subtract the current value, and divide by the number of elements in the group minus 1.
df %>% group_by(grouping) %>%
mutate(special_mean = (sum(value) - value)/(n()-1))
# grouping value special_mean
# (chr) (int) (dbl)
#1 A 1 8.5
#2 A 6 6.0
#3 A 11 3.5
#4 B 2 9.5
#5 B 7 7.0
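The shortcut works because removing one value from a group of n leaves a total of sum(value) - value spread over n - 1 elements. The sketch below, assuming the df from the question (rebuilt here with tibble()), checks this against a direct leave-one-out computation. The column names loo_fast and loo_loop are illustrative only.

```r
library(dplyr)

# Example data from the question
df <- tibble(grouping = rep(LETTERS[1:5], 3), value = 1:15) %>%
  arrange(grouping)

res <- df %>%
  group_by(grouping) %>%
  mutate(
    # closed form: group total minus own value, over the remaining count
    loo_fast = (sum(value) - value) / (n() - 1),
    # direct form: drop row i, then take the mean of what is left
    loo_loop = sapply(seq_len(n()), function(i) mean(value[-i]))
  ) %>%
  ungroup()

all.equal(res$loo_fast, res$loo_loop)
```

A group with a single row yields NaN in both versions, because there is nothing left to average.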
I came across this old question just by chance and wondered whether there is a general solution that also works for aggregation functions other than mean(), e.g., max() (as requested by jlesuffleur) or median().
The idea is to omit the current row when computing the aggregate, by looping over the rows within the current group:
library(dplyr)
df %>%
group_by(grouping) %>%
mutate(special_mean = sapply(1:n(), function(i) mean(value[-i])))
grouping value special_mean
<chr> <int> <dbl>
1 A 1 8.5
2 A 6 6
3 A 11 3.5
4 B 2 9.5
5 B 7 7
...
This works for max() as well:
df %>%
group_by(grouping) %>%
mutate(special_max = sapply(1:n(), \(i) max(value[-i])))
grouping value special_max
<chr> <int> <int>
1 A 1 11
2 A 6 11
3 A 11 6
4 B 2 12
5 B 7 12
6 B 12 7
...
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[, special_mean := sapply(1:.N, function(i) mean(value[-i])), by = grouping][]
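The looping idea can be wrapped in a small helper so that the aggregation function becomes a parameter. This is only a sketch: leave_one_out is a hypothetical helper name, not a dplyr or data.table function, and it assumes the aggregation returns a single number.

```r
library(dplyr)

# Hypothetical helper: for every position i, apply f to x without its i-th element.
leave_one_out <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[-i]), numeric(1))
}

# Example data from the question
df <- tibble(grouping = rep(LETTERS[1:5], 3), value = 1:15) %>%
  arrange(grouping)

res <- df %>%
  group_by(grouping) %>%
  mutate(
    special_median = leave_one_out(value, median),
    special_max    = leave_one_out(value, max)
  ) %>%
  ungroup()

res
```

Because each group is re-aggregated once per row, the cost grows roughly quadratically with group size, so for mean() the arithmetic shortcut above is likely the better choice on large groups.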

dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
group_by(b) %>%
summarise(count_a=length(a), .drop=FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
group_by(b, .drop=FALSE) %>%
summarise(count_a=length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
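For plain counts, count() accepts the same .drop argument, so the group_by()/summarise() pair can be shortened. A minimal sketch, assuming dplyr 0.8 or later and the df from the question (the name counts is illustrative):

```r
library(dplyr)

# Example data from the question, with an unused factor level 3 in b
df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

# .drop = FALSE keeps factor levels that have no rows
counts <- df %>% count(b, .drop = FALSE)

counts
```

As with group_by(), this only keeps empty groups for variables that are factors.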
One additional note to go with @Moody_Mudskipper's answer: using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See the examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
dplyr solution:
First make a grouped data frame:
by_b <- as_tibble(df) %>% group_by(b)
Then summarise the levels that occur by counting with n():
res <- by_b %>% summarise(count_a = n())
Then merge the results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)), res, by = "b")
Finally, since we are looking at counts here, change the NA values to 0:
expanded_res[is.na(expanded_res)] <- 0
final_counts <- expanded_res
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly -- but I am hoping this helps me learn -- this is the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
define an "out-of-bounds" value that cannot exist in dataset.
oob_val <- nrow(by_b)+1
modify attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
do the summary:
res <- by_b %>% summarise(count_a = n())
index and replace all occurrences of oob_val
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example you could get the same result using xtabs, for example:
Using dplyr:
df %>%
xtabs(formula = ~ b) %>%
as.data.frame()
or shorter:
as.data.frame(xtabs( ~ b, df))
result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0
