tidyr::expand() for a single column across groups

tidyr::expand() returns all possible combinations of values from multiple columns. I'm looking for a slightly different behavior, where all the values are in a single column and the combinations are to be taken across groups.
For example, let the data be defined as follows:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
X <- bind_rows(tibble(Group = "Group1", Value = LETTERS[1:3]),
               tibble(Group = "Group2", Value = letters[4:5]))
We want all combinations of values from Group1 with values from Group2. My current clunky solution is to spread the values across multiple list columns:
Y <- X %>% group_by(Group) %>% do(vals = .$Value) %>% spread(Group, vals)
# # A tibble: 1 x 2
# Group1 Group2
# <list> <list>
# 1 <chr [3]> <chr [2]>
followed by a double unnest operation
Y %>% unnest( .preserve = Group2 ) %>% unnest
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
This is the desired output, but as you can imagine, this solution doesn't generalize well: as the number of groups increases, so does the number of unnest operations that we have to perform.
Is there a more elegant solution?

Because OP seems happy to use base, I upgrade my comment to an answer:
expand.grid(split(X$Value, X$Group))
# Group1 Group2
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
As noted by OP, expand.grid converts character vectors to factors. To prevent that, use stringsAsFactors = FALSE.
The tidyverse equivalent is purrr::cross_df, which doesn't coerce to factor:
cross_df(split(X$Value, X$Group))
# A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
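Note that purrr::cross_df() was deprecated in purrr 1.0.0 in favour of tidyr::expand_grid(). A minimal sketch of the same idea with expand_grid(), assuming a reasonably recent tidyr; the data is rebuilt with base data.frame() only to keep the example self-contained, and the name combos is just for illustration:

```r
library(tidyr)

# Same example data as in the question
X <- data.frame(
  Group = c("Group1", "Group1", "Group1", "Group2", "Group2"),
  Value = c("A", "B", "C", "d", "e"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one character vector per group;
# !!! splices that list into expand_grid() as named arguments
combos <- expand_grid(!!!split(X$Value, X$Group))
combos
```

Unlike expand.grid(), expand_grid() varies the first column slowest, so the row order matches the desired output in the question, and it works for any number of groups.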

Here is one option. It also works with more than two groups, although complete_ is deprecated.
library( tidyverse )
X2 <- X %>%
group_by(Group) %>%
mutate(ID = 1:n()) %>%
spread(Group, Value) %>%
select(-ID) %>%
complete_(names(.)) %>%
na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
Update
!!!syms(names(.)) works with the regular complete function, so it is preferable to the deprecated complete_ used in my original solution.
library( tidyverse )
X2 <- X %>%
group_by(Group) %>%
mutate(ID = 1:n()) %>%
spread(Group, Value) %>%
select(-ID) %>%
complete(!!!syms(names(.))) %>%
na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e

I often use tidyr::crossing() to join all values from group2 to group.
tibble(group = LETTERS[1:3]) %>%
  crossing(group2 = letters[4:5])
I might do something like this:
data %>%
distinct(group) %>%
crossing(group2)
A more specific example:
dates <- lubridate::make_date(2000:2018)
tibble(group = letters[1:5]) %>%
  crossing(dates)

This still works with expand after spread.
X %>%
mutate(id = row_number()) %>%
spread(Group, Value) %>%
expand(Group1, Group2) %>%
na.omit()

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4),
)
I want to find the groups in which every element of x (and only x) is NA, and remove those groups. Group "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that drops a group as soon as it contains a single NA. I need a different condition than this use of all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
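To see why the position of the negation matters, here is a small sketch on the x values of group "B" alone (the variable names are only for illustration):

```r
x <- c(NA, 1)  # the x values of group "B"

strict  <- all(!is.na(x))   # TRUE only if no element is NA
lenient <- !all(is.na(x))   # TRUE if at least one element is not NA
same    <- any(!is.na(x))   # equivalent way to write the lenient test

c(strict = strict, lenient = lenient, same = same)
```

The first form is what the question tried and it rejects group "B"; the other two keep it.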
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later, filter() accepts a .by argument, so we don't need group_by()/ungroup():
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

R: how to combine the value of a column in two rows together, if these two rows share same character strings in another two columns

I am constructing an edge list for a network. I would like to combine the value of the third column together if the first two columns are the same. The data I have is like this.
ego alter weight
A B 12
B A 10
C D 5
D C 2
E F 7
F E 6
The dataset I expect is like this:
ego alter weight
A B 22
C D 7
E F 13
Please enlighten me if you have some great ideas to achieve the expected result.
A base R option using pmin/pmax + aggregate
aggregate(
weight ~ .,
transform(
df,
ego = pmin(ego,alter),
alter = pmax(ego,alter)
),
sum
)
gives
ego alter weight
1 A B 22
2 C D 7
3 E F 13
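The pmin/pmax step is what makes the two directions of an edge look identical. A short sketch on plain character vectors, assuming ego and alter are character (not factor) columns; key_lo and key_hi are hypothetical names:

```r
ego   <- c("A", "B", "C", "D")
alter <- c("B", "A", "D", "C")

# Element-wise minimum and maximum put each pair in a canonical order,
# so A-B and B-A both become the key (A, B)
key_lo <- pmin(ego, alter)
key_hi <- pmax(ego, alter)

data.frame(ego = key_lo, alter = key_hi)
```

Once every row carries the same ordered key, any grouped sum (aggregate, dplyr, data.table) collapses the two directions into one edge.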
Or, we can use igraph
library(igraph)
df %>%
graph_from_data_frame(directed = FALSE) %>%
simplify() %>%
get.data.frame()
which gives
from to weight
1 A B 22
2 C D 7
3 E F 13
You could do the following:
f <- function(e,a) sapply(seq_along(e), \(i) paste0(sort(c(e[i],a[i])), collapse=""))
group_by(dt, grp = f(ego,alter)) %>%
summarize(weight=sum(weight),.groups="drop") %>%
separate(grp,c("ego","alter"),1)
Output:
ego alter weight
<chr> <chr> <int>
1 A B 22
2 C D 7
3 E F 13
A possible solution:
library(tidyverse)
df %>%
rowwise() %>%
mutate(aux = sort(c(ego, alter)) %>% str_c(collapse = "")) %>%
group_by(aux) %>%
summarise(ego, alter, weight = sum(weight), .groups = "drop") %>%
filter(!duplicated(aux)) %>%
select(-aux)
#> # A tibble: 3 × 3
#> ego alter weight
#> <chr> <chr> <int>
#> 1 A B 22
#> 2 C D 7
#> 3 E F 13
Or avoiding rowwise:
library(tidyverse)
df %>%
mutate(aux = apply(df[1:2], 1, \(x) sort(x) %>% paste0(collapse = ""))) %>%
group_by(aux) %>%
summarise(ego, alter, weight = sum(weight), .groups = "drop") %>%
filter(!duplicated(aux)) %>%
select(-aux)
#> # A tibble: 3 × 3
#> ego alter weight
#> <chr> <chr> <int>
#> 1 A B 22
#> 2 C D 7
#> 3 E F 13
And yet another solution, a bit more succinct:
library(tidyverse)
df %>%
group_by(aux = map2_chr(ego, alter, ~ sort(c(.x, .y)) %>% str_c(collapse = ""))) %>%
summarise(weight = sum(weight)) %>%
extract(aux, c("ego", "alter"), "([[:upper:]])([[:upper:]])")
#> # A tibble: 3 × 3
#> ego alter weight
#> <chr> <chr> <int>
#> 1 A B 22
#> 2 C D 7
#> 3 E F 13

R tibble: Group by column A, keep only distinct values in column B and C and sum values in column C

I want to group by column A and then sum the values in column C over the distinct combinations of columns B and C. Is it possible to do this inside the summarise call?
I know it's possible with the distinct() function before aggregation. What about something like this:
Data:
df <- tibble(A = c(1,1,1,2,2), B = c('a','b','b','a','a'), C=c(5,10,10,15,15))
My try that doesn't work:
df %>%
group_by(A) %>%
summarise(sumC=sum(distinct(B,C) %>% select(C)))
Desired output:
A sumC
1 15
2 15
You could use duplicated
df %>%
group_by(A) %>%
summarise(sumC = sum(C[!duplicated(B)]))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
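Note that !duplicated(B) de-duplicates on B alone, which gives the right answer here only because each B has a single C within a group. A sketch that de-duplicates on the (B, C) pair inside summarise; df2 is the question's data plus one hypothetical extra row (A = 1, B = 'a', C = 7) added to make the difference visible:

```r
library(dplyr)

# Question's data plus one made-up row where the same B has a different C
df2 <- tibble(A = c(1, 1, 1, 2, 2, 1),
              B = c("a", "b", "b", "a", "a", "a"),
              C = c(5, 10, 10, 15, 15, 7))

result <- df2 %>%
  group_by(A) %>%
  # duplicated() on a data frame flags repeated (B, C) rows within the group
  summarise(sumC = sum(C[!duplicated(data.frame(B, C))]))

result
```

For group 1 the distinct pairs are (a, 5), (b, 10) and (a, 7), so the pair-based sum is 22, whereas de-duplicating on B alone would give 15.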
Or with distinct
df %>%
group_by(A) %>%
distinct(B, C) %>%
summarise(sumC = sum(C))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
A different possibility could be:
df %>%
group_by(A, B, C) %>%
slice(1) %>%
group_by(A) %>%
summarise(sumC = sum(C))
A sumC
<dbl> <dbl>
1 1 15
2 2 15
Or a twist on @Maurits Evers's answer:
df %>%
distinct(A, B, C) %>%
group_by(A) %>%
summarise(sumC = sum(C))

dplyr mutate with null value

I have a data frame and want to use mutate to populate an "e_value" column holding the value of the "e" metric within each group. I group_by the group and then mutate using value[metric == "e"], but this returns an error when a group has no metric == "e", as in group C below. Is there a way to return the f metric instead when there is no e metric?
library(dplyr)
# this code does not work because there is no e metric in group C
data <- data.frame(group = c("A","A","B","B","C"), metric = c("e","f","e","f","f"), value = c(1,2,3,4,5))
data %>% group_by(group) %>% mutate( e_value = value[metric == "e"] )
## the code below works because every group has an e metric
data <- data.frame(group = c("A","A","B","B"), metric = c("e","f","e","f"), value = c(1,2,3,4))
data %>% group_by(group) %>% mutate( e_value = value[metric == "e"] )
You can insert an ifelse to make it conditional.
data %>%
group_by(group) %>%
mutate(
# value[metric == "e"] is a zero-length vector (never NULL) when a group has no "e"
e_value = ifelse(length(value[metric == "e"]) == 0, NA, value[metric == "e"])
)
# # A tibble: 5 x 4
# # Groups: group [3]
# group metric value e_value
# <fct> <fct> <dbl> <dbl>
# 1 A e 1.00 1.00
# 2 A f 2.00 1.00
# 3 B e 3.00 3.00
# 4 B f 4.00 3.00
# 5 C f 5.00 NA
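Or like this using %in% (indexing value by metric == "e" so the result does not depend on the row order within a group):
data %>% group_by(group) %>% mutate(e_value = ifelse("e" %in% metric, value[metric == "e"], NA))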
## A tibble: 5 x 4
## Groups: group [3]
# group metric value e_value
# <fctr> <fctr> <dbl> <dbl>
#1 A e 1 1
#2 A f 2 1
#3 B e 3 3
#4 B f 4 3
#5 C f 5 NA
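Both answers above fill in NA, while the question asked for the f metric as the fallback. A minimal sketch of that variant, assuming each group has at most one "e" row and at least one "f" row (the [1] guards against accidental duplicates); result is just an illustrative name:

```r
library(dplyr)

data <- data.frame(group  = c("A", "A", "B", "B", "C"),
                   metric = c("e", "f", "e", "f", "f"),
                   value  = c(1, 2, 3, 4, 5))

result <- data %>%
  group_by(group) %>%
  # Use the "e" value when the group has one, otherwise fall back to "f"
  mutate(e_value = if (any(metric == "e")) value[metric == "e"][1]
                   else value[metric == "f"][1]) %>%
  ungroup()

result
```

A plain if/else is enough here because the condition is a single TRUE/FALSE per group, so the vectorised ifelse is not needed.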

dplyr: Difference between unique and distinct

Seems the number of resulting rows is different when using distinct vs unique. The data set I am working with is huge. Hope the code is OK to understand.
dt2a <- select(dt, mutation.genome.position,
mutation.cds, primary.site, sample.name, mutation.id) %>%
group_by(mutation.genome.position, mutation.cds, primary.site) %>%
mutate(occ = nrow(.)) %>%
select(-sample.name) %>% distinct()
dim(dt2a)
[1] 2316382 5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds,
primary.site, sample.name, mutation.id) %>%
group_by(mutation.genome.position, mutation.cds, primary.site) %>%
mutate(occ = nrow(.)) %>%
select(-sample.name) %>% unique()
dim(dt2b)
[1] 2837982 5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt = fread(fl)
This appears to be a result of the group_by. Consider this case:
dt<-data.frame(g=rep(c("a","b"), each=3),
v=c(2,2,5,2,7,7))
dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 b 2
dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
When you use distinct() without indicating which variables to make distinct, it appears to use the grouping variable.
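The output above comes from an old dplyr release. As far as I know, current versions document that distinct() with no arguments uses all columns, grouped or not, so the discrepancy with unique() should no longer appear. A quick check you can run against your installed version (n_all and n_g are illustrative names):

```r
library(dplyr)

dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))

# Distinct rows over all columns of the grouped data
n_all <- dt %>% group_by(g) %>% distinct() %>% nrow()

# Distinct values of the grouping column only
n_g <- dt %>% distinct(g) %>% nrow()

c(all_columns = n_all, g_only = n_g)
```

If n_all equals the number of rows returned by unique(), your dplyr version treats the two calls the same way on this data.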
