R dplyr: summarise complete cases by group for all variables

I want to summarise variables by group for every variable in a dataset using dplyr. The summarised variables should be stored under a new name.
An example:
df <- data.frame(
  group = c("A", "B", "A", "B"),
  a = c(1, 1, NA, 2),
  b = c(1, NA, 1, 1),
  c = c(1, 1, 2, NA),
  d = c(1, 2, 1, 1)
)
df %>%
  group_by(group) %>%
  mutate(complete_a = sum(complete.cases(a))) %>%
  mutate(complete_b = sum(complete.cases(b))) %>%
  mutate(complete_c = sum(complete.cases(c))) %>%
  mutate(complete_d = sum(complete.cases(d))) %>%
  group_by(group, complete_a, complete_b, complete_c, complete_d) %>%
  summarise()
results in my expected output:
# # A tibble: 2 x 5
# # Groups: group, complete_a, complete_b, complete_c [?]
# group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
# A 1 2 2 2
# B 2 1 1 2
How can I generate the same output without duplicating the mutate statements per variable?
I tried:
df %>% group_by(group) %>% summarise_all(funs(sum(complete.cases(.))))
which works but does not rename the variables.

You are almost there. You just have to use rename_all:
library(dplyr)
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", colnames(df)))
# A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
Edit
Or, as pointed out by @symbolrush, more directly without colnames:
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", .))
# A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
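Note that funs() has been deprecated since dplyr 0.8.0. In current dplyr (1.0.0 and later) the same summary can be written with across() and its .names argument, which also leaves the grouping column untouched, so the result matches the expected output exactly; a sketch, assuming a recent dplyr:
library(dplyr)
df %>%
  group_by(group) %>%
  # complete.cases() on a single column counts its non-missing values
  summarise(across(everything(), ~ sum(complete.cases(.x)),
                   .names = "complete_{.col}"))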

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
  A = c("A", "A", "B", "B"),
  x = c(NA, NA, NA, 1),
  y = c(1, 2, 3, 4)
) %>% group_by(A) -> df
Desired output:
tibble(
  A = c("B", "B"),
  x = c(NA, 1),
  y = c(3, 4)
)
I want to find all groups for which all elements of x (and only x) are NA, then remove those groups. Group "B" is kept because it has at least one non-NA element in x.
I tried:
df %>%
filter(all(!is.na(x)))
but that filters a group out as soon as it finds even one NA; I need the right condition, which is not all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later, we can use the .by argument in filter(), so we don't need group_by()/ungroup():
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
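Both answers hinge on the same logic: by De Morgan's law, dropping the groups in which all of x is NA is the same as keeping the groups in which at least one x is not NA. A minimal sketch of that equivalence, using the df from the question:
library(dplyr)
df %>%
  group_by(A) %>%
  # any(!is.na(x)) is equivalent to !all(is.na(x))
  filter(any(!is.na(x))) %>%
  ungroup()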

Find mean of counts within groups

I have a dataframe that looks like this:
library(tidyverse)
x <- tibble(
  batch = rep(c(1, 2), each = 10),
  exp_id = c(rep('a', 3), rep('b', 2), rep('c', 5), rep('d', 6), rep('e', 4))
)
I can run the code below to get the count per exp_id:
x %>% group_by(batch,exp_id) %>%
summarise(count=n())
which generates:
batch exp_id count
<dbl> <chr> <dbl>
1 1 a 3
2 1 b 2
3 1 c 5
4 2 d 6
5 2 e 4
A really ugly way to generate the mean of these counts is:
x %>% group_by(batch,exp_id) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(batch) %>%
summarise(avg_exp = mean(count))
which generates:
batch avg_exp
<dbl> <dbl>
1 1 3.33
2 2 5
Is there a more succinct and "tidy" way to generate this?
library(dplyr)
group_by(x, batch) %>%
summarize(avg_exp = mean(table(exp_id)))
# # A tibble: 2 x 2
# batch avg_exp
# <dbl> <dbl>
# 1 1 3.33
# 2 2 5
Here's another way -
library(dplyr)
x %>%
count(batch, exp_id, name = "count") %>%
group_by(batch) %>%
summarise(count = mean(count))
# batch count
# <dbl> <dbl>
#1 1 3.33
#2 2 5
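Since the mean of the per-exp_id counts is just the number of rows divided by the number of distinct exp_id values in each batch, the whole thing can also be collapsed into a single summarise(); a sketch:
library(dplyr)
x %>%
  group_by(batch) %>%
  summarise(avg_exp = n() / n_distinct(exp_id))
# batch 1: 10 rows / 3 ids = 3.33; batch 2: 10 rows / 2 ids = 5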

Spread data with non-unique keys with R

I have the following data frame:
ID  Group
 1  A
 1  B
 2  C
 2  D
And I want to reshape the data frame into a wider version in terms of ID. Thus, the new data frame looks like this:
ID  Group1  Group2
 1  A       B
 2  C       D
You can do this by adding a helper column and then using tidyr::pivot_wider():
library(dplyr)
library(tidyr)
data <- tibble(
  id = c(1, 1, 2, 2),
  group = letters[1:4]
)
# Add a helper column to use when pivoting. This uses the row number
# over each subgroup, i.e. over each value of `id`
transformed_data <- data %>%
group_by(id) %>%
mutate(helper = paste0("Group", row_number())) %>%
ungroup()
# Here's what the helper column looks like
transformed_data
#> # A tibble: 4 x 3
#> id group helper
#> <dbl> <chr> <chr>
#> 1 1 a Group1
#> 2 1 b Group2
#> 3 2 c Group1
#> 4 2 d Group2
# Pivot the data using the helper column
transformed_data %>%
pivot_wider(names_from = helper, values_from = group)
#> # A tibble: 2 x 3
#> id Group1 Group2
#> <dbl> <chr> <chr>
#> 1 1 a b
#> 2 2 c d
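For reference, the two steps can be collapsed into a single pipeline, and with dplyr 1.1.0 or later the per-group row numbering can use mutate()'s .by argument instead of group_by()/ungroup(); a sketch under that assumption:
library(dplyr)
library(tidyr)
data %>%
  # number the rows within each id to build the Group1/Group2 names
  mutate(helper = paste0("Group", row_number()), .by = id) %>%
  pivot_wider(names_from = helper, values_from = group)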

group by and conditional summarize in R

My code is messy: if a group's summed value is smaller than two, I want its name replaced with "unpopular".
df <- data.frame(
  vote = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "D"),
  val = c(rep(1, 11))
)
df %>% group_by(vote) %>% summarise(val = sum(val))
Output:
vote val
<fct> <dbl>
1 A 3
2 B 6
3 C 1
4 D 1
but I need
vote val
<fct> <dbl>
1 A 3
2 B 6
3 unpopular 2
my idea is
df2 <- df %>% group_by(vote) %>% summarise(val=sum(val))
df2$vote[df2$val < 2] <- "unpop"
df2 %>% group_by....
It works, but it isn't elegant. Is there a cleaner, more idiomatic way to do this?
We can do a double grouping
library(dplyr)
df %>%
group_by(vote) %>%
summarise(val=sum(val)) %>%
group_by(vote = replace(vote, val <2, 'unpop')) %>%
summarise(val = sum(val))
Output:
# A tibble: 3 x 2
# vote val
# <chr> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or another option with rowsum
df %>%
group_by(vote = replace(vote, vote %in%
names(which((rowsum(val, vote) < 2)[,1])), 'unpopular')) %>%
summarise(val = sum(val))
Or using fct_lump_n from forcats
library(forcats)
df %>%
group_by(vote = fct_lump_n(vote, 2, other_level = "unpop")) %>%
summarise(val = sum(val))
# A tibble: 3 x 2
# vote val
# <fct> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or using table
df %>%
group_by(vote = replace(vote,
vote %in% names(which(table(vote) < 2)), 'unpop')) %>%
summarise(val = sum(val))
If you want to lump votes based on the sum of val in base R, you can do this:
aggregate(val~vote, transform(aggregate(val~vote, df, sum),
vote = replace(vote, val < 2, 'unpop')), sum)
# vote val
#1 A 3
#2 B 6
#3 unpop 2
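If you would rather avoid the double aggregation entirely, forcats also provides fct_lump_min(), which lumps levels whose (optionally weighted) frequency falls below a minimum. A sketch, assuming a forcats version that has the fct_lump_*() helpers, with val used as the weight:
library(dplyr)
library(forcats)
df %>%
  # lump every vote whose weighted total is below 2 into "unpop"
  group_by(vote = fct_lump_min(vote, min = 2, w = val, other_level = "unpop")) %>%
  summarise(val = sum(val))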

R dplyr group_by summarise keep last non missing

Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 name = c('michael c.', 'mike', 'michael', '', 'John', NA),
                 var = 1:6)
Using group_by(id) yields the correct computation, but I lose the name column:
df %>% group_by(id) %>% summarise(newvar = sum(var)) %>% ungroup()
A tibble: 2 x 2
id newvar
<dbl> <int>
1 1 6
2 2 15
Using group_by(id,name) yields both name and id but obviously the "wrong" sums.
I would like to keep the last non-missing observation of name within each group. I basically need a dplyr equivalent of Stata's lastnm() function:
df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
id sum Name
1 1 6 michael
2 2 15 John
Is there a "keep last non missing"-option?
1) Use mutate like this:
df %>%
group_by(id) %>%
mutate(sum = sum(var)) %>%
ungroup
giving:
# A tibble: 6 x 4
id name var sum
<dbl> <fct> <int> <int>
1 1 michael c. 1 6
2 1 mike 2 6
3 1 michael 3 6
4 2 john 4 15
5 2 john 5 15
6 2 john 6 15
2) Another possibility is:
df %>%
group_by(id) %>%
summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <chr> <int>
1 1 michael c., mike, michael 6
2 2 john 15
3) Another variation is to only report the first name in each group:
df %>%
group_by(id) %>%
summarize(name = first(name), sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <fct> <int>
1 1 michael c. 6
2 2 john 15
I posted a feature request on dplyr's GitHub issue tracker, and the response there is actually the best answer. For the sake of completeness, I repost it here:
df %>%
group_by(id) %>%
summarise(sum=sum(var), Name=last(name[!is.na(name)]))
#> # A tibble: 2 x 3
#> id sum Name
#> <dbl> <int> <chr>
#> 1 1 6 michael
#> 2 2 15 John
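Side note: in dplyr 1.1.0 and later, first(), last() and nth() gained an na_rm argument, so the subsetting can be dropped; a sketch, assuming that version:
library(dplyr)
df %>%
  group_by(id) %>%
  # drop NAs before taking the last value (the empty string '' still counts as a value)
  summarise(sum = sum(var), Name = last(name, na_rm = TRUE))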
