`dplyr`'s `summarize_all()` by some condition

I have a data frame with an ID column and multiple columns that I want to summarize. In each of the columns (which are mutually exclusive), I want to count rows that match "a", "b", or either.
> df
# A tibble: 5 x 3
id col1 col2 col3
<dbl> <chr> <chr> <chr>
1 1 NA b NA
2 2 NA b NA
3 3 NA NA a
4 4 b NA NA
5 5 a NA NA
This is as far as I've gotten:
> df %>%
group_by(id) %>%
summarize_all(a = nrow(. %>% filter(. == "a"),
b = nrow(. %>% filter(. == "b"),
x = nrow(!is.na(.))
Error: Can't create call to non-callable object
Call `rlang::last_error()` to see a backtrace
Am I taking the right approach? I'm trying to get something that looks like this:
var a b x
-------------
col1 1 1 2
col2 0 2 2
col3 1 0 1

You can try:
library(tidyverse)
df %>%
gather(key, value, -id) %>%
group_by(key, value) %>%
count %>%
filter(!is.na(value))
# A tibble: 4 x 3
# Groups: key, value [4]
key value n
<chr> <chr> <int>
1 col1 a 1
2 col1 b 1
3 col2 b 2
4 col3 a 1
If you want the tabular result edited into your question you can do:
df %>%
gather(key, value, -id) %>%
group_by(key, value) %>%
count %>%
filter(!is.na(value)) %>%
group_by(key) %>%
mutate(x = sum(n)) %>%
spread(value, n, fill = 0)
# A tibble: 3 x 4
# Groups: key [3]
key x a b
<chr> <int> <dbl> <dbl>
1 col1 2 1 1
2 col2 2 0 2
3 col3 1 1 0

One tidyverse possibility could be:
df %>%
gather(var, letters, -id, na.rm = TRUE) %>%
add_count(var, letters, name = "n_letters") %>%
add_count(var, name = "n_all") %>%
select(-id) %>%
distinct()
var letters n_letters n_all
<chr> <chr> <int> <int>
1 col1 b 1 2
2 col1 a 1 2
3 col2 b 2 2
4 col3 a 1 1
Or:
df %>%
gather(var, letters, -id, na.rm = TRUE) %>%
add_count(var, letters, name = "n_letters") %>%
add_count(var, name = "all") %>%
select(-id) %>%
distinct() %>%
spread(letters, n_letters, fill = 0)
var all a b
<chr> <int> <dbl> <dbl>
1 col1 2 1 1
2 col2 2 0 2
3 col3 1 1 0
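For comparison, the same counts can be computed without reshaping. The following is only a base R sketch, not a replacement for the answers above; it rebuilds the question's `df` by hand and assumes every column except `id` should be summarized. The names `cols` and `res` are made up for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  id   = 1:5,
  col1 = c(NA, NA, NA, "b", "a"),
  col2 = c("b", "b", NA, NA, NA),
  col3 = c(NA, NA, "a", NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: every column other than `id` is summarized
cols <- setdiff(names(df), "id")

# One row per column: count of "a", count of "b", count of non-missing values
res <- data.frame(
  var = cols,
  a = vapply(cols, function(v) sum(df[[v]] == "a", na.rm = TRUE), integer(1), USE.NAMES = FALSE),
  b = vapply(cols, function(v) sum(df[[v]] == "b", na.rm = TRUE), integer(1), USE.NAMES = FALSE),
  x = vapply(cols, function(v) sum(!is.na(df[[v]])), integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
res
```

Each count is a sum over a logical vector, so `na.rm = TRUE` is needed for the comparisons against "a" and "b", which yield NA for missing cells.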

Related

Which groups have exactly the same rows

If I have a data frame like the following
group1  group2  col1  col2
A       1       ABC   5
A       1       DEF   2
B       1       AB    1
C       1       ABC   5
C       1       DEF   2
A       2       BC    8
B       2       AB    1
We can see that the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within these groups). The same is true for (B, 1) and (B, 2).
So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:
category  group1  group2
1         A       1
1         C       1
2         B       1
2         B       2
3         A       2
And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.
I have tried a few very long and inefficient ways of doing this in dplyr, but I'm sure there must be a more efficient way. Thanks.
You can use pivot_wider first to handle identical groups over multiple rows.
library(tidyverse)
df %>%
group_by(group1, group2) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = c(col1, col2)) %>%
group_by(across(-c(group1, group2))) %>%
mutate(category = cur_group_id()) %>%
ungroup() %>%
select(category, group1, group2) %>%
arrange(category)
category group1 group2
<int> <chr> <int>
1 1 B 1
2 1 B 2
3 2 A 1
4 2 C 1
5 3 A 2
You could first group_by "col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id like this:
library(dplyr)
library(tidyr)
df %>%
group_by(col1, col2) %>%
filter(n() != 1) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-starts_with("col"))
#> # A tibble: 6 × 3
#> group1 group2 ID
#> <chr> <int> <int>
#> 1 A 1 2
#> 2 A 1 3
#> 3 B 1 1
#> 4 C 1 2
#> 5 C 1 3
#> 6 B 2 1
Created on 2022-08-12 by the reprex package (v2.0.1)
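Another way to think about the problem is to give each (group1, group2) pair a "signature" built from its sorted rows, then number the distinct signatures. The following base R sketch illustrates that idea; it is an assumption-laden illustration, not one of the answers above. It rebuilds the question's data by hand, and it assumes the separators "|" and ";" never occur in col1 or col2. The names `key`, `sig`, and `result` are hypothetical.

```r
# Rebuild the example data from the question
df <- data.frame(
  group1 = c("A", "A", "B", "C", "C", "A", "B"),
  group2 = c(1, 1, 1, 1, 1, 2, 2),
  col1   = c("ABC", "DEF", "AB", "ABC", "DEF", "BC", "AB"),
  col2   = c(5, 2, 1, 5, 2, 8, 1),
  stringsAsFactors = FALSE
)

# Encode each row as one string (assumes "|" does not appear in the data)
df$key <- paste(df$col1, df$col2, sep = "|")

# Signature of a group: its row keys, sorted and pasted together,
# so that row order within a group does not matter
sig <- aggregate(key ~ group1 + group2, data = df,
                 FUN = function(k) paste(sort(k), collapse = ";"))

# Groups with identical signatures get the same category index
sig$category <- match(sig$key, unique(sig$key))

result <- sig[order(sig$category), c("category", "group1", "group2")]
result
```

Because the keys are sorted before pasting, two groups match only if they contain the same rows with the same multiplicities.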

count frequency by year with dplyr (conditional count)

I want to count the use of Tool A by year and keep zeros.
ID <- c(1,1,2,2,2,3,4,5,5,5)
Tool <- c("A","B","A","B","A","A","B","A","A","A")
Year <- c(2000,2001,2001,2001,2002,2002,2001,2000,2001,2002)
df <- data.frame(ID,Tool,Year)
library(tidyverse)
df %>% group_by(ID) %>% summarise(toolA = sum(Tool == "A")) %>% count(toolA)
# A tibble: 4 x 2
toolA n
<int> <int>
1 0 1
2 1 2
3 2 1
4 3 1
I want to add year columns, so that I can have a table as below
tool A  Count  2000  2001  2002
0       1      0     0     0
1       2      1     0     1
2       1      0     1     1
3       1      1     1     1
The numbers under the year columns are the number of uses in that year, not the number of people.
How would you do this?
Here is another tidyverse method. Put simply, we pivot the data frame from long to wide and then summarize twice. The first summarization drops the Tool column, so only the per-year counts of "A" remain for each ID. The second summarization condenses the result into one row per toolA value and adds a count.
library(dplyr)
library(tidyr)
df %>%
mutate(value = +(Tool == "A")) %>%
pivot_wider(names_from = Year, values_fill = 0L) %>%
group_by(ID) %>%
summarize(across(-Tool, sum)) %>%
group_by(toolA = rowSums(across(-ID))) %>%
summarize(count = n(), across(-c(ID, count), sum))
Output
# A tibble: 4 x 5
toolA count `2000` `2001` `2002`
<dbl> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 2 1 0 1
3 2 1 0 1 1
4 3 1 1 1 1
Maybe this is too convoluted and a better/easier solution exists.
library(dplyr)
library(tidyr)
dataA <- df %>%
group_by(ID) %>%
summarise(toolA = sum(Tool == "A")) %>%
count(toolA)
df %>%
group_by(ID, Year) %>%
summarise(toolA = sum(Tool == "A"), .groups = 'drop') %>%
pivot_wider(names_from = Year, values_from = toolA, values_fill = 0) %>%
select(-ID) %>%
mutate(toolA = rowSums(.)) %>%
right_join(dataA, by = 'toolA') %>%
select(toolA, n, everything()) %>%
arrange(toolA) %>%
group_by(toolA, n) %>%
summarise(across(.fns = sum), .groups = 'drop')
# toolA n `2000` `2001` `2002`
# <dbl> <int> <int> <int> <int>
#1 0 1 0 0 0
#2 1 2 1 0 1
#3 2 1 0 1 1
#4 3 1 1 1 1
I might try this approach with tidyverse. Create a list column with the Year when grouping by ID. After including the count n as you have done, use unnest_longer to recover the years. I added an extra column for situations where count is zero called "None". A final pivot_wider would put the data into wide form again.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(toolA = sum(Tool == "A"),
Years = list(Year[Tool == "A"])) %>%
add_count(toolA) %>%
unnest_longer(Years) %>%
replace_na(list(Years = "None")) %>%
mutate(value = 1) %>%
pivot_wider(id_cols = c(toolA, n), names_from = Years, names_prefix = "Year_", values_from = value, values_fill = 0)%>%
arrange(toolA)
Output
toolA n Year_2000 Year_2001 Year_2002 Year_None
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 0 1 0 0 0 1
2 1 2 1 0 1 0
3 2 1 0 1 1 0
4 3 1 1 1 1 0

tidyverse do a pivot_wider with two different reshaping strategies (creating categorical and binary columns)

Using the following data:
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
value = c(1, 2, 3, 4, 5, 6))
I want to pivot_wider this data so that the reshaping creates two different sets of columns:
One set where I create a bunch of binary columns that take the column names from the value columns (e.g. bin_1, bin_2 and so on) and that are coded as 0/1.
An additional set where I create as many necessary columns to store the values in a "categorical" way. Here, id "A" has three values, so I want to create three columns cat_1, cat_2, cat_3 and for IDs B and C I want to fill them up with NAs if there's no value.
Now, I know how to create these two things separately from each other and merge them afterwards via a left_join.
However, my question is: can it be done in one pipeline, where I do two subsequent pivot_widers? I tried, but it doesn't work (obviously because my way of copying the value column and then trying to use one copy for the binary reshape and one for the categorical reshape is wrong).
Any ideas?
Code so far that works:
df1 <- df %>%
group_by(id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
df2 <- df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length))
df <- df1 %>%
left_join(., df2, by = "id")
Expected output:
# A tibble: 3 x 10
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
With the addition of purrr, you could do:
map(.x = reduce(range(df$value), `:`),
~ df %>%
group_by(id) %>%
mutate(!!paste0("bin_", .x) := as.numeric(.x %in% value))) %>%
reduce(full_join) %>%
mutate(cats = paste0("cat_", row_number())) %>%
pivot_wider(names_from = "cats",
values_from = "value")
id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 1 0 1 1 4 6
2 B 0 1 0 0 1 0 2 5 NA
3 C 0 0 1 0 0 0 3 NA NA
In base you can try:
tt <- unstack(df[2:1])
x <- cbind(t(sapply(tt, "[", seq_len(max(lengths(tt))))),
t(+sapply(names(tt), "%in%", x=df$id)))
colnames(x) <- c(paste0("cat_", seq_len(max(lengths(tt)))),
paste0("bin_", seq_len(nrow(df))))
x
# cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
#A 1 4 6 1 0 0 1 0 1
#B 2 5 NA 0 1 0 0 1 0
#C 3 NA NA 0 0 1 0 0 0
This slightly modifies your approach by shortening the df2 code and putting everything in one pipe, taking advantage of the list and . trick, which lets you work on two versions of df in the same call.
It's not much of an improvement on what you have done, but it is now all in one call. I can't think of a way to do it without a merge/join.
library(tidyverse)
df %>%
list(
pivot_wider(., id_cols = id,
names_from = value,
names_prefix = "bin_") %>%
mutate_if(is.numeric, ~ +(!is.na(.))), #convert to binary
group_by(., id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
) %>%
.[c(2:3)] %>%
reduce(left_join)
# id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
# <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 1 0 1 1 4 6
# 2 B 0 1 0 0 1 0 2 5 NA
# 3 C 0 0 1 0 0 0 3 NA NA
You can even combine both of your pipelines into one without creating any intermediate objects:
df %>%
group_by(id) %>%
mutate(group_id = row_number()) %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value) %>%
left_join(df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length)),
by = "id")
# A tibble: 3 x 10
# Groups: id [3]
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0

How to get all combinations of 2 from a grouped column in a data frame

I could write a loop to do this, but I was wondering how this might be done in R with dplyr. I have a data frame with two columns. Column 1 is the group, Column 2 is the value. I would like a data frame that has every combination of two values from each group in two separate columns. For example:
input = data.frame(col1 = c(1,1,1,2,2), col2 = c("A","B","C","E","F"))
input
#> col1 col2
#> 1 1 A
#> 2 1 B
#> 3 1 C
#> 4 2 E
#> 5 2 F
and have it return
output = data.frame(col1 = c(1,1,1,2), col2 = c("A","B","C","E"), col3 = c("B","C","A","F"))
output
#> col1 col2 col3
#> 1 1 A B
#> 2 1 B C
#> 3 1 C A
#> 4 2 E F
I'd like to be able to include it within dplyr syntax:
input %>%
group_by(col1) %>%
???
I tried writing my own function that produces a data frame of combinations like what I need from a vector and sent it into the group_map function, but didn't have success:
combos = function(x, ...) {
x = t(combn(x, 2))
return(as.data.frame(x))
}
input %>%
group_by(col1) %>%
group_map(.f = combos)
Produced an error.
Any suggestions?
You can do :
library(dplyr)
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
# col1 X1 X2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 A C
#3 1 B C
#4 2 E F
input %>%
group_by(col1) %>%
nest(data=-col1) %>%
mutate(out= map(data, ~ t(combn(unlist(.x), 2)))) %>%
unnest(out) %>% select(-data)
# A tibble: 4 x 2
# Groups: col1 [2]
col1 out[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Or :
combos = function(x, ...) {
return(tibble(col1=x[[1,1]],col2=t(combn(unlist(x[[2]], use.names=F), 2))))
}
input %>%
group_by(col1) %>%
group_map(.f = combos, .keep=T) %>% invoke(rbind,.) %>% tibble
# A tibble: 4 x 2
col1 col2[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Thank you! In terms of parsimony, I like both the answer from Ben
input %>%
group_by(col1) %>%
do(data.frame(t(combn(.$col2, 2))))
and Ronak
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
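The split-apply-combine pattern behind these answers can also be written in base R. This is a hedged sketch, not any answerer's code: it assumes every group has at least two values (`combn(x, 2)` errors otherwise), and the output column name `col3` is chosen only to mirror the question's desired output.

```r
input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"),
                    stringsAsFactors = FALSE)

# Split by group, build all pairs within each group, then stack the pieces.
# Assumption: every group has at least two values.
pieces <- lapply(split(input, input$col1), function(g) {
  m <- t(combn(g$col2, 2))            # one row per pair
  data.frame(col1 = g$col1[1],
             col2 = m[, 1],
             col3 = m[, 2],
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, pieces)
rownames(out) <- NULL
out
```

Like the answers above, this returns the pairs in `combn()` order (A-B, A-C, B-C), which covers the same combinations as the question's example but lists them differently.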

R dplyr's group_by consider empty groups as well

Let's consider the following data frame:
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
val1 = 1:12,
val2 = rnorm(12, 10, 15))
The contingency table is as follows:
cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))
cont_tab
col2
col1 A B C
A 4 0 0
B 1 3 0
C 1 0 3
As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (in this case 9) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation. Namely, dplyr::group_by() considers only existing pairs (pairs that occurred at least once):
data %>%
group_by(col1, col2) %>%
summarize(stat = sum(val2) - sum(val1))
# A tibble: 5 x 3
# Groups: col1 [?]
col1 col2 stat
<fct> <fct> <dbl>
1 A A 58.1
2 B A -16.4
3 B B 17.0
4 C A -12.9
5 C C -41.9
The output I have in mind has 9 rows (4 of which have stat equal to 0). Is it doable in dplyr?
EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting the number of times a particular pair occurs. I added the new data in order to make the real problem more visible.
It is much easier to add spread from tidyr to get the same result as with table
library(dplyr)
library(tidyr)
count(data, col1, col2) %>%
spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups: col1 [3]
# col1 A B C
# <fct> <dbl> <dbl> <dbl>
#1 A 4 0 0
#2 B 1 3 0
#3 C 1 0 3
NOTE: The group_by/summarise step is changed to count here
As @divibisan suggested, if the OP wants long format, add gather at the end:
data %>%
group_by(col1, col2) %>%
summarize(stat = n()) %>%
spread(col2, stat, fill = 0) %>%
gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups: col1 [3]
# col1 col2 stat
# <fct> <chr> <dbl>
#1 A A 4
#2 B A 1
#3 C A 1
#4 A B 0
#5 B B 3
#6 C B 0
#7 A C 0
#8 B C 0
#9 C C 3
Update
With the updated data in OP's post
data %>%
group_by(col1, col2) %>%
summarize(stat = sum(val2) - sum(val1)) %>%
spread(col2, stat, fill = 0) %>%
gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups: col1 [3]
# col1 col2 stat
# <fct> <chr> <dbl>
#1 A A 7.76
#2 B A -20.8
#3 C A 6.97
#4 A B 0
#5 B B 28.8
#6 C B 0
#7 A C 0
#8 B C 0
#9 C C 9.56
This is doable even without dplyr
as.data.frame(table(data$col1, data$col2, dnn = c("col1", "col2")))
# col1 col2 Freq
#1 A A 4
#2 B A 1
#3 C A 1
#4 A B 0
#5 B B 3
#6 C B 0
#7 A C 0
#8 B C 0
#9 C C 3
You can use tidyr::complete
library(tidyverse)
data %>%
group_by(col1, col2) %>%
summarize(stat = n()) %>%
# additions below
ungroup %>%
complete(col1, col2, fill = list(stat = 0))
# # A tibble: 9 x 3
# col1 col2 stat
# <chr> <chr> <dbl>
# 1 A A 4
# 2 A B 0
# 3 A C 0
# 4 B A 1
# 5 B B 3
# 6 B C 0
# 7 C A 1
# 8 C B 0
# 9 C C 3
You can also use count for the first part. The code below gives the same output as the code above
data %>%
count(col1, col2) %>%
complete(col1, col2, fill = list(n = 0))
Also a tidyverse possibility using tidyr::complete():
data %>%
group_by_all() %>%
add_count() %>%
complete(col1, col2, fill = list(n = 0)) %>%
distinct()
col1 col2 n
<fct> <fct> <dbl>
1 A A 4
2 A B 0
3 A C 0
4 B A 1
5 B B 3
6 B C 0
7 C A 1
8 C B 0
9 C C 3
Or using tidyr::expand():
data %>%
count(col1, col2) %>%
right_join(data %>%
expand(col1, col2), by = c("col1" = "col1",
"col2" = "col2")) %>%
replace_na(list(n = 0))
Or using tidyr::crossing():
data %>%
count(col1, col2) %>%
right_join(crossing(col1 = unique(data$col1),
col2 = unique(data$col2)), by = c("col1" = "col1",
"col2" = "col2")) %>%
replace_na(list(n = 0))
Here is a little workaround; I hope it works for you. Merge your table with a table of all combinations and replace the NAs with 0.
data %>%
group_by(col1, col2) %>%
summarize(stat = n()) %>%
merge(unique(expand.grid(data)), by=c("col1","col2"), all=T) %>%
replace_na(list(stat=0))
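Since both grouping columns are factors, another option worth trying is the `.drop` argument of `group_by()`. The sketch below assumes dplyr 1.0.0 or later (`.drop` itself was added in 0.8.0; the `.groups` argument of `summarize()` needs 1.0.0). With `.drop = FALSE`, combinations of factor levels that never occur are kept as empty groups, and the sums over those empty groups evaluate to 0.

```r
library(dplyr)

# Same data as in the question
set.seed(123)
data <- data.frame(
  col1 = factor(rep(c("A", "B", "C"), 4)),
  col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
  val1 = 1:12,
  val2 = rnorm(12, 10, 15)
)

# .drop = FALSE keeps empty factor-level combinations as groups;
# sum() over an empty group is 0, so stat is 0 for those pairs
res <- data %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarize(stat = sum(val2) - sum(val1), .groups = "drop")
res
```

This only works for grouping columns that are factors; for character columns, `tidyr::complete()` as shown above remains the more general route.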
