It seems the number of resulting rows differs when using distinct() vs. unique(). The data set I am working with is huge, so I hope the code is clear enough to follow.
dt2a <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  distinct()
dim(dt2a)
[1] 2316382 5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  unique()
dim(dt2b)
[1] 2837982 5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt <- fread(fl)  # fl: local path to the downloaded .tsv.gz
This appears to be a result of the group_by(). Consider this case:
dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))
dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 b 2
dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
When you use distinct() without indicating which variables to make distinct, it appears to use only the grouping variables.
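If you want unique()-style de-duplication over every column, one workaround (a sketch against the toy data above) is to name all the columns in distinct(), or to ungroup() first:
# Sketch: make distinct() consider every column, like unique() does
dt %>% group_by(g) %>% distinct(g, v)           # 4 rows, matches unique()
dt %>% group_by(g) %>% ungroup() %>% distinct() # 4 rows, grouping dropped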
Related
Is it possible to use rbind within a pipe so that I don't have to define and store a variable to use it?
library(tidyverse)
## works fine
df <- iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup()
df %>%
  rbind(df)
## anyway to make this work?
iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup() %>%
  rbind(.)
Just to elaborate on @MichaelDewar's answer, note the following section of ?magrittr::`%>%`:
Placing lhs elsewhere in rhs call
Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).
My understanding is that when . appears as an argument in the right hand side call, the left hand side is not inserted in the first position. The call is evaluated "as is", with . evaluating to the left hand side. Hence:
library("dplyr")
x <- data.frame(a = 1:2, b = 3:4)
x %>% rbind() # rbind(x)
## a b
## 1 1 3
## 2 2 4
x %>% rbind(.) # same
## a b
## 1 1 3
## 2 2 4
x %>% rbind(x) # rbind(x, x)
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
x %>% rbind(x, .) # same
x %>% rbind(., x) # same
x %>% rbind(., .) # same
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
You can devise clever tricks if you know the rules:
x %>% rbind((.)) # rbind(x, (x))
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
(.) isn't parsed like ., so the left-hand side is inserted in the first position of the right-hand side call. Compare:
as.list(quote(.))
## [[1]]
## .
as.list(quote((.)))
## [[1]]
## `(`
##
## [[2]]
## .
I don't know why you would want to rbind something with itself, but here you go:
iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup() %>%
  rbind(., .)
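If you prefer to stay with dplyr verbs, bind_rows() follows the same dot-placeholder rules, so a sketch of the equivalent would be:
iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup() %>%
  bind_rows(., .)  # bind_rows(df, df), by the placeholder rule above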
tidyr::expand() returns all possible combinations of values from multiple columns. I'm looking for a slightly different behavior, where all the values are in a single column and the combinations are to be taken across groups.
For example, let the data be defined as follows:
library( tidyverse )
X <- bind_rows(data_frame(Group = "Group1", Value = LETTERS[1:3]),
               data_frame(Group = "Group2", Value = letters[4:5]))
We want all combinations of values from Group1 with values from Group2. My current clunky solution is to separate the values across multiple columns
Y <- X %>% group_by(Group) %>% do(vals = .$Value) %>% spread(Group, vals)
# # A tibble: 1 x 2
# Group1 Group2
# <list> <list>
# 1 <chr [3]> <chr [2]>
followed by a double unnest operation
Y %>% unnest(.preserve = Group2) %>% unnest()
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
This is the desired output, but as you can imagine, this solution doesn't generalize well: as the number of groups increases, so does the number of unnest operations that we have to perform.
Is there a more elegant solution?
Because OP seems happy to use base, I upgrade my comment to an answer:
expand.grid(split(X$Value, X$Group))
# Group1 Group2
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
As noted by OP, expand.grid converts character vectors to factors. To prevent that, use stringsAsFactors = FALSE.
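For example:
# Base R sketch: keep the character columns as characters
expand.grid(split(X$Value, X$Group), stringsAsFactors = FALSE)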
The tidyverse equivalent is purrr::cross_df, which doesn't coerce to factor:
cross_df(split(X$Value, X$Group))
# A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 B d
# 3 C d
# 4 A e
# 5 B e
# 6 C e
Here is one option. It will work on cases with more than two groups, although complete_() is deprecated.
library( tidyverse )
X2 <- X %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  spread(Group, Value) %>%
  select(-ID) %>%
  complete_(names(.)) %>%
  na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
Update
!!!syms(names(.)) works well with the regular complete() function and is thus better than using complete_() as in my original solution.
library( tidyverse )
X2 <- X %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  spread(Group, Value) %>%
  select(-ID) %>%
  complete(!!!syms(names(.))) %>%
  na.omit()
X2
# # A tibble: 6 x 2
# Group1 Group2
# <chr> <chr>
# 1 A d
# 2 A e
# 3 B d
# 4 B e
# 5 C d
# 6 C e
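To see that this handles more than two groups, here is a quick check with a third group (a sketch; X3 and its values are invented for illustration):
X3 <- bind_rows(X, data_frame(Group = "Group3", Value = c("x", "y")))
X3 %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  spread(Group, Value) %>%
  select(-ID) %>%
  complete(!!!syms(names(.))) %>%
  na.omit()
# 12 rows: the 3 x 2 x 2 combinations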
I often use tidyr::crossing() to join all values from group2 to group.
data_frame(group = c(LETTERS[1:3])) %>%
  crossing(group2 = letters[4:5])
I might do something like this:
data %>%
  distinct(group) %>%
  crossing(group2)
A more specific example:
dates <- lubridate::make_date(2000:2018)
data_frame(group = letters[1:5]) %>%
  crossing(dates)
This still works with expand() after spread():
X %>%
  mutate(id = row_number()) %>%
  spread(Group, Value) %>%
  expand(Group1, Group2) %>%
  na.omit()
I have a dataset like so:
df <- data.frame(x = c("A","A","A","A", "B","B","B","B","B",
                       "C","C","C","C","C", "D","D","D","D","D"),
                 y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                                 "Eoissp1", "Eoisp4", "Automerissp1",
                                 "Automerissp2", "Acharias",
                                 rep("Eoissp2", 3), rep("Eoissp1", 2))))
I want to identify, for each subset of x, the corresponding levels in y that are entirely duplicates containing the expression Eois. Therefore, A, B, and D will be returned in a vector, because every level of A, B, and D contains the expression Eois, while level C consists of various unique levels (e.g. Eois, Automeris and Acharias). For this example the output would be:
output<- c("A", "B", "D")
Using the new df:
> df %>% filter(str_detect(y, "Eois")) %>% group_by(x) %>% distinct(y) %>%
    count() %>% filter(n == 1) %>% select(x)
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
(Answer below uses the original df posted by the question author.)
Using the pipe function in magrittr & functions from dplyr:
> df %>% group_by(x) %>% distinct(y)
# A tibble: 7 x 2
# Groups: x [3]
x y
<fct> <fct>
1 A plant1a
2 B plant1b
3 C plant1a
4 C plant2a
5 C plant3a
6 C plant4a
7 C plant5a
Then you can roll up the results like this:
> results <- df %>% group_by(x) %>% distinct(y) %>%
    count() %>% filter(n == 1) %>% select(x)
> results
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
If you know your original data frame is always going to come with the x's in order, you can drop the group_by part.
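For instance, a sketch of an ungrouped variant (distinct() is given both columns, and count(x) does its own grouping, so this does not even depend on row order):
df %>% distinct(x, y) %>% count(x) %>% filter(n == 1) %>% select(x)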
A dplyr-based solution could be:
library(dplyr)
df %>% group_by(x) %>%
  filter(grepl("Eoiss", y)) %>%
  mutate(y = sub("\\d+", "", y)) %>%
  filter(n() > 1 & length(unique(y)) == 1) %>%
  select(x) %>% unique(.)
# A tibble: 3 x 1
# Groups: x [3]
# x
# <fctr>
#1 A
#2 B
#3 D
Data
df <- data.frame(x = c("A","A","A","A", "B","B","B","B","B",
                       "C","C","C","C","C", "D","D","D","D","D"),
                 y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                                 "Eoissp1", "Eoisp4", "Automerissp1",
                                 "Automerissp2", "Acharias",
                                 rep("Eoissp2", 3), rep("Eoissp1", 2))))
Let's say I have the following data frame:
(dat = data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10))
# A tibble: 10 × 2
# v1 v2
# <chr> <int>
# 1 a 1
# 2 a 2
# 3 a 3
# 4 b 4
# 5 b 5
# 6 b 6
# 7 c 7
# 8 c 8
# 9 c 9
# 10 c 10
What I want to be able to do is compute a sum for each group (i.e. "a", "b", and "c") that is equal to the sum of v2 where v1 is not equal to the grouping value. So it should look like this:
# A tibble: 3 × 2
# v1 sum
# <chr> <int>
# 1 a 49
# 2 b 40
# 3 c 21
Based on what I've been seeing online, this looks like a job for do, but I can't wrap my head around how to achieve this. I thought it would look something like this:
dat %>%
  group_by(v1) %>%
  do(data.frame(sum = sum(.$v2[dat$v1 != unique(.$v1)])))
But this just gives me a dataframe with sum equal to NA for all three groups. How would I go about doing this?
It may be easier using an intermediate column:
dat %>% mutate(total = sum(v2)) %>% group_by(v1) %>% summarize(sum = max(total) - sum(v2))
You can nest and then index the list column negatively:
library(tidyverse)
dat %>%
  nest(v2) %>%
  mutate(sum = map_int(seq(n()), ~sum(unlist(data[-.x]))))
## # A tibble: 3 × 3
## v1 data sum
## <chr> <list> <int>
## 1 a <tibble [3 × 1]> 49
## 2 b <tibble [3 × 1]> 40
## 3 c <tibble [4 × 1]> 21
The advantage of this approach is that it's really easy to save the original data and align the computed values with them.
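For instance, unnesting afterwards puts the computed sums back beside the original rows (a sketch, using the same legacy nest/unnest syntax as above):
dat %>%
  nest(v2) %>%
  mutate(sum = map_int(seq(n()), ~sum(unlist(data[-.x])))) %>%
  unnest()
# 10 rows: each original v1/v2 pair now carries its group's excluded sum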
A small function without using dplyr:
dat <- data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10)
test_func <- function(df) {
  # sum v2 over all rows outside each hard-coded group
  a <- sum(df[df$v1 != "a", ][, 2])
  b <- sum(df[df$v1 != "b", ][, 2])
  c <- sum(df[df$v1 != "c", ][, 2])
  out <- rbind(a, b, c)
  return(out)
}
test_func(dat)
[,1]
a 49
b 40
c 21
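The hard-coded group labels generalize easily; a sketch (test_func2 is a hypothetical name):
test_func2 <- function(df) {
  # derive the group labels instead of hard-coding "a", "b", "c"
  sapply(unique(df$v1), function(g) sum(df$v2[df$v1 != g]))
}
test_func2(dat)
#  a  b  c
# 49 40 21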
@67342343's solution seems like the way to go here. If you have more complex overlapping/excluded groups, then maybe something like the following would be helpful:
library(tidyverse)
dat = data_frame(v1 = rep(letters[1:5], 3), v2 = 1:(5*3))
c(combn(unique(dat$v1), 2, simplify = FALSE),
  combn(unique(dat$v1), 3, simplify = FALSE)) %>%
  map_df(~ dat %>%
           group_by(v1) %>%
           summarise(v2 = sum(v2)) %>%
           filter(v1 %in% .x) %>%
           ungroup() %>%
           summarise(groups = paste(.x, collapse = ","),
                     sum = sum(v2)))
groups sum
1 a,b 39
2 a,c 42
3 a,d 45
4 a,e 48
5 b,c 45
...
18 b,c,e 75
19 b,d,e 78
20 c,d,e 81
Keeping it simple:
dat %>% group_by(v1) %>% summarize(foo = sum(dat$v2) - sum(v2))
This is crass if you are in the middle of a long dplyr chain and have modified dat. (But then, why not relax and just store your data?)
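A mid-chain alternative that never refers back to dat by name (a sketch): summarise per group first, then subtract each group's sum from the grand total.
dat %>%
  group_by(v1) %>%
  summarize(s = sum(v2)) %>%
  mutate(foo = sum(s) - s) %>%
  select(v1, foo)
#   v1   foo
# 1 a     49
# 2 b     40
# 3 c     21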
I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can achieve this in a few different ways, but would like to know how to do this 'properly'.
Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times and the number of observations that occur within some window after each start time, e.g.:
set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
The best I've got so far with dplyr is the following (but this loses the identity variables):
df1 %>%
  rowwise() %>%
  do(count = filter(df2, time > .$start_time, time < (.$start_time + 15))) %>%
  mutate(n = nrow(count))
output:
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 × 2
count n
<list> <int>
1 <data.frame [17 × 1]> 17
2 <data.frame [18 × 1]> 18
3 <data.frame [10 × 1]> 10
I was expecting to be able to do this:
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>%
  group_by(id) %>%
  mutate(count = nrow(filter(df2, time > start_time, time < (start_time + 15))))
but this returns the error:
Error: comparison (6) is possible only for atomic and list types
What is the dplyr way of doing this?
Here is one option with data.table, where we can use non-equi joins:
library(data.table) # v1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
           by = .EACHI][, c('id', 'N'), with = FALSE]
# id N
#1: A 17
#2: B 18
#3: C 10
which gives the same count as in the OP's base R method
sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10
If we also need the 'id' variable in the dplyr output, we can modify the OP's code:
df1 %>%
  rowwise() %>%
  do(data.frame(., count = filter(df2, time > .$start_time,
                                  time < (.$start_time + 15)))) %>%
  group_by(id) %>%
  summarise(n = n())
# id n
# <fctr> <int>
#1 A 17
#2 B 18
#3 C 10
Another option is map from purrr together with dplyr:
library(purrr)
df1 %>%
  split(.$id) %>%
  map_df(~mutate(., N = sum(df2$time > start_time & df2$time < start_time + 15))) %>%
  select(-start_time)
# id N
#1 A 17
#2 B 18
#3 C 10
Another slightly different approach using dplyr:
result <- df1 %>% group_by(id) %>%
  summarise(count = length(which(df2$time > start_time &
                                 df2$time < (start_time + 15))))
print(result)
## # A tibble: 3 x 2
## id count
## <fctr> <int>
##1 A 17
##2 B 18
##3 C 10
I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then, group by id and use this to summarise.
If there are possibly more than one start_time per id, then you can use the same function, but rowwise and with mutate:
result <- df1 %>% rowwise() %>%
  mutate(count = length(which(df2$time > start_time &
                              df2$time < (start_time + 15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
## # A tibble: 3 x 3
## id start_time count
## <fctr> <dbl> <int>
##1 A 10 17
##2 B 20 18
##3 C 30 10