How to use group_by variable as an exclusion value with dplyr?

Let's say I have the following data frame:
(dat = data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10))
# A tibble: 10 × 2
# v1 v2
# <chr> <int>
# 1 a 1
# 2 a 2
# 3 a 3
# 4 b 4
# 5 b 5
# 6 b 6
# 7 c 7
# 8 c 8
# 9 c 9
# 10 c 10
What I want to be able to do is compute a sum for each group (i.e. "a", "b", and "c") that is equal to the sum of v2 where v1 is not equal to the grouping value. So it should look like this:
# A tibble: 3 × 2
# v1 sum
# <chr> <int>
# 1 a 49
# 2 b 40
# 3 c 21
Based on what I've been seeing online, this looks like a job for do, but I can't wrap my head around how to achieve this. I thought it would look something like this:
dat %>%
  group_by(v1) %>%
  do(data.frame(sum = sum(.$v2[dat$v1 != unique(.$v1)])))
But this just gives me a dataframe with sum equal to NA for all three groups. How would I go about doing this?

Maybe it is easier with an intermediate column; note that the mutate() runs before group_by(), so total holds the grand total of v2:
dat %>% mutate(total = sum(v2)) %>% group_by(v1) %>% summarize(sum = max(total) - sum(v2))

You can nest and then index the list column negatively:
library(tidyverse)
dat %>% nest(v2) %>% mutate(sum = map_int(seq(n()), ~sum(unlist(data[-.x]))))
## # A tibble: 3 × 3
## v1 data sum
## <chr> <list> <int>
## 1 a <tibble [3 × 1]> 49
## 2 b <tibble [3 × 1]> 40
## 3 c <tibble [4 × 1]> 21
The advantage of this approach is that it's really easy to save the original data and align the computed values with them.
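For instance, a sketch of that round trip (res is just a name chosen here, and nest(data = v2) is the tidyr >= 1.0 spelling of the call above): unnesting puts each group's leave-one-out sum next to every original row.
res <- dat %>%
  nest(data = v2) %>%
  mutate(sum = map_int(seq(n()), ~ sum(unlist(data[-.x]))))
res %>% unnest(data)
# one row per original observation, with columns v1, v2, and sum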

A small function without using dplyr:
dat <- data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10)
test_func <- function(df) {
  a <- sum(df[df$v1 != "a", ][, 2])
  b <- sum(df[df$v1 != "b", ][, 2])
  c <- sum(df[df$v1 != "c", ][, 2])
  out <- rbind(a, b, c)
  return(out)
}
test_func(dat)
[,1]
a 49
b 40
c 21
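The same idea can be written without hard-coding the group labels; a small sketch (leave_one_out_sums is a name invented here):
leave_one_out_sums <- function(df) {
  # for each observed group, sum v2 over all rows *not* in that group
  sapply(unique(df$v1), function(g) sum(df$v2[df$v1 != g]))
}
leave_one_out_sums(dat)
#  a  b  c
# 49 40 21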

#67342343's solution seems like the way to go here. If you have more complex overlapping/excluded groups, then maybe something like the following would be helpful:
library(tidyverse)
dat = data_frame(v1 = rep(letters[1:5], 3), v2 = 1:(5*3))
c(combn(unique(dat$v1), 2, simplify = FALSE),
  combn(unique(dat$v1), 3, simplify = FALSE)) %>%
  map_df(~ dat %>%
           group_by(v1) %>%
           summarise(v2 = sum(v2)) %>%
           filter(v1 %in% .x) %>%
           ungroup() %>%
           summarise(groups = paste(.x, collapse = ","),
                     sum = sum(v2)))
groups sum
1 a,b 39
2 a,c 42
3 a,d 45
4 a,e 48
5 b,c 45
...
18 b,c,e 75
19 b,d,e 78
20 c,d,e 81

Keeping it simple:
dat %>% group_by(v1) %>% summarize(foo = sum(dat$v2) - sum(v2))
This is crass if you are in the middle of a long dplyr chain and have modified dat. (But then, why not relax and just store your data?)
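In that spirit, one possible sketch is to store the grand total before the chain, so nothing inside it has to reach back to dat:
total <- sum(dat$v2)  # grand total, computed once up front
dat %>%
  group_by(v1) %>%
  summarize(sum = total - sum(v2))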

How to filter out groups empty for 1 column in Tidyverse

tibble(
  A = c("A", "A", "B", "B"),
  x = c(NA, NA, NA, 1),
  y = c(1, 2, 3, 4)
) %>% group_by(A) -> df
desired output:
tibble(
  A = c("B", "B"),
  x = c(NA, 1),
  y = c(3, 4)
)
I want to find all groups for which the elements of x (and x only) are all NA, then remove those groups. "B" is kept because it has at least one non-NA element in x.
I tried:
df %>%
  filter(all(!is.na(x)))
but that seems to filter a group out as soon as it finds a single NA; I need the correct condition, and all is not it.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
  group_by(A) %>%
  filter(!all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
We can use any with complete.cases
library(dplyr)
df %>%
  group_by(A) %>%
  filter(any(complete.cases(x))) %>%
  ungroup()
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In the development version of dplyr (this landed in dplyr 1.1.0), we can use .by in filter, so we don't need the group_by/ungroup pair:
df %>%
  filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
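As an aside, for a single column complete.cases(x) is just !is.na(x), so the condition can be written either way. A minimal sketch (note that the df from the question is already grouped, and .by cannot be combined with an existing grouping):
library(dplyr)
df %>%
  ungroup() %>%                        # .by errors on a grouped data frame
  filter(any(!is.na(x)), .by = A)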

tidy way to remove duplicates per row

I've seen different base R solutions for removing row-wise duplicates, e.g. R - find all duplicates in row and replace.
However, I'm wondering if there is a more tidy way. I tried several approaches using across or a combination of rowwise with c_across, but can't get it to work.
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(1, 3, 4, 5),
                 z = c(2, 3, 5, 6))
Expected output:
x y z
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
My ideas so far (not working):
df |>
  mutate(apply(across(everything()), 1, function(x) replace(x, duplicated(x), NA)))
df |>
  mutate(apply(across(everything()), 1, function(x) {x[duplicated(x)] <- NA}))
I got partway there by creating a list column that contains the column positions of the duplicates (though it also triggers the usual ugly "new names" warning). I'm unsure how to proceed from there (if that's a promising route at all); I guess it requires some form of purrr magic?
df |>
  rowwise() |>
  mutate(test = list(duplicated(c_across(everything())))) |>
  unnest_wider(test)
# A tibble: 4 × 6
x y z ...1 ...2 ...3
<dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 1 1 2 FALSE TRUE FALSE
2 2 3 3 FALSE FALSE TRUE
3 3 4 5 FALSE FALSE FALSE
4 4 5 6 FALSE FALSE FALSE
Maybe you want something like this:
library(dplyr)
df %>%
  rowwise() %>%
  do(data.frame(replace(., duplicated(unlist(.)), NA)))
Output:
# A tibble: 4 × 3
# Rowwise:
x y z
<dbl> <dbl> <dbl>
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
I wouldn't say it's tidy, but it is a solution using map:
library(tidyverse)
df %>%
  group_nest(row_number()) %>%
  pull(data) %>%
  map(function(x) as.numeric(x) %>% replace(., duplicated(.), NA) %>% setNames(names(df))) %>%
  bind_rows()
# # A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
# 1 1 NA 2
# 2 2 3 NA
# 3 3 4 5
# 4 4 5 6
Just for completeness: after some trial and error, I also got the same result as #Quinten, just in a much, much uglier way!
df |>
  rowwise() |>
  mutate(pos = list(which(duplicated(c_across(everything()))))) |>
  mutate(across(-pos, ~ ifelse(which(names(df) == cur_column()) %in% unlist(pos), NA, .))) |>
  select(-pos)
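For reference, a purrr-based sketch of the same row-wise idea (assuming all columns are numeric, and purrr >= 1.0.0 for list_rbind()):
library(dplyr)
library(purrr)
library(tibble)
df %>%
  pmap(function(...) {
    v <- c(...)  # this row as a named vector
    as_tibble_row(replace(v, duplicated(v), NA))
  }) %>%
  list_rbind()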

R: How to summarize and group by variables as column names

I have a wide data frame with about 200 columns and want to summarize it over various columns. I can't figure out the syntax for this; I think it should work with .data$ and .env$, but I don't get it. Here's an example:
> library(dplyr)
> df = data.frame('A'= c('X','X','X','Y','Y'), 'B'= 1:5, 'C' = 6:10)
> df
A B C
1 X 1 6
2 X 2 7
3 X 3 8
4 Y 4 9
5 Y 5 10
> df %>% group_by(A) %>% summarise(sum(B), sum(C))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
A `sum(B)` `sum(C)`
<chr> <int> <int>
1 X 6 21
2 Y 9 19
But I want to be able to do something like this:
columns_to_sum = c('B','C')
columns_to_group = c('A')
df %>% group_by(columns_to_group) %>% summarise(sum(columns_to_sum))
We can use across from the new version of dplyr
library(dplyr)
df %>%
  group_by(across(columns_to_group)) %>%
  summarise(across(all_of(columns_to_sum), sum, na.rm = TRUE), .groups = 'drop')
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19
In the previous version, we could use group_by_at along with summarise_at
df %>%
  group_by_at(columns_to_group) %>%
  summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)
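With current dplyr (1.1.0 and later), passing extra arguments such as na.rm = TRUE through across() is deprecated, so a present-day sketch of the same summary would use an anonymous function together with all_of():
df %>%
  group_by(across(all_of(columns_to_group))) %>%
  summarise(across(all_of(columns_to_sum), \(x) sum(x, na.rm = TRUE)),  # \(x) needs R >= 4.1
            .groups = "drop")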

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a data frame/tibble that results from a grouping, with the index following a defined order. If I use row_number() it does enumerate, but within each group. I want it to enumerate without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2], each = 2)
                 , x2 = rep(letters[1:2], 2)
                 , y = floor(abs(rnorm(4) * 10))
                 )
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping, which was not my intention. This can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: isn't it dangerous that earlier processing steps are carried along inside the tibble? Which kinds of processing are carried along?
For now I will convert tibbles to data frames to avoid this kind of unexpected result.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which gives you row numbers within each x1 group (summarise removed a layer of grouping, i.e. it dropped the x2 grouping). Ungrouping then lets you get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y)) %>%
  mutate(group_row = row_number()) %>%
  ungroup() %>%
  mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, that group is x1) with mutate. Here I ungroup again at the end to also compute shares of the entire data frame.
df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y)) %>%
  mutate(share_in_group = y / sum(y)) %>%
  ungroup() %>%
  mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
  x1 = rep(LETTERS[1:2], each = 2),
  x2 = rep(letters[1:2], 2),
  y = floor(abs(rnorm(4) * 10))
)
df0 %>%
  group_by(x1, x2) %>%
  summarize(y = sum(y), .groups = "drop") %>%
  arrange(desc(y)) %>%
  mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

tidyr::expand() for a single column across groups

tidyr::expand() returns all possible combinations of values from multiple columns. I'm looking for a slightly different behavior, where all the values are in a single column and the combinations are to be taken across groups.
For example, let the data be defined as follows:
library( tidyverse )
X <- bind_rows( data_frame(Group = "Group1", Value = LETTERS[1:3]),
                data_frame(Group = "Group2", Value = letters[4:5]) )
We want all combinations of values from Group1 with values from Group2. My current clunky solution is to separate the values across multiple columns
Y <- X %>% group_by(Group) %>% do(vals = .$Value) %>% spread(Group, vals)
# # A tibble: 1 x 2
# Group1 Group2
# <list> <list>
# 1 <chr [3]> <chr [2]>
followed by a double unnest operation
Y %>% unnest( .preserve = Group2 ) %>% unnest
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
This is the desired output, but as you can imagine, this solution doesn't generalize well: as the number of groups increases, so does the number of unnest operations that we have to perform.
Is there a more elegant solution?
Because OP seems happy to use base, I upgrade my comment to an answer:
expand.grid(split(X$Value, X$Group))
# Group1 Group2
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
As noted by OP, expand.grid converts character vectors to factors. To prevent that, use stringsAsFactors = FALSE.
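For example, the same call with the factor conversion switched off:
expand.grid(split(X$Value, X$Group), stringsAsFactors = FALSE)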
The tidyverse equivalent is purrr::cross_df, which doesn't coerce to factor:
cross_df(split(X$Value, X$Group))
# A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
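Note that purrr::cross_df() was deprecated in purrr 1.0.0; a sketch of the same result with tidyr::expand_grid() (row order may differ) is:
library(tidyr)
do.call(expand_grid, split(X$Value, X$Group))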
Here is one option. It works for cases with more than two groups, although complete_ is deprecated.
library( tidyverse )
X2 <- X %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  spread(Group, Value) %>%
  select(-ID) %>%
  complete_(names(.)) %>%
  na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
Update
!!!syms(names(.)) works well with the regular complete function, so it is better than using complete_ as in my original solution.
library( tidyverse )
X2 <- X %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  spread(Group, Value) %>%
  select(-ID) %>%
  complete(!!!syms(names(.))) %>%
  na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
I often use tidyr::crossing() to join all values from group2 to group.
data_frame(group = c(LETTERS[1:3])) %>%
  crossing(group2 = letters[4:5])
I might do something like this:
data %>%
  distinct(group) %>%
  crossing(group2)
A more specific example:
dates <- lubridate::make_date(2000:2018)
data_frame(group = letters[1:5]) %>%
  crossing(dates)
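Applied to X from the question, that might look like the following sketch (the names Group1/Group2 are chosen for the output, not taken from X):
crossing(Group1 = X$Value[X$Group == "Group1"],
         Group2 = X$Value[X$Group == "Group2"])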
This still works with expand after spread.
X %>%
  mutate(id = row_number()) %>%
  spread(Group, Value) %>%
  expand(Group1, Group2) %>%
  na.omit()
