Counting the result of a left join using dplyr - r

What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.

With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.
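To make the by=.EACHI idea concrete, here is a minimal base R sketch of the same per-row counting. It only illustrates the concept (for each row of a, count the matching rows of b); it is not how data.table implements the join, and the name res is hypothetical.

```r
a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# For each id in a (each row of i), count the rows of b (x) that match it.
# An id with no match gets 0 rather than 1.
n_refs <- vapply(a$id, function(k) sum(b$id == k), integer(1))
res <- data.frame(id = a$id, N = n_refs)
res
```

Because the count is taken over b's rows only, an unmatched id such as 2 never contributes a placeholder row.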

There are already some good answers, but since the question asks not to add packages, here is a base R one. We perform a left join on a and b and append a refs column that is TRUE if ref_id is not NA, then use aggregate to sum the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1

It does require another package, but I'd feel remiss not mentioning tidylog, which reports what a wide range of tidyverse verbs did. In your case, it would produce a report like:
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x 1
> rows only in y (0)
> matched rows 6 (includes duplicates)
> ===
> rows total 7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f

I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
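Another way to avoid counting the NA placeholder row, sketched here as one possible alternative rather than the canonical answer, is to count the rows of b first and only then join the counts onto a. The name counts is hypothetical; count(), left_join() and coalesce() are existing dplyr functions.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

counts <- a %>%
  # count(b, id) has one row per id that actually appears in b
  left_join(count(b, id, name = "refs"), by = "id") %>%
  # ids missing from b come back as NA after the join; turn them into 0
  mutate(refs = coalesce(refs, 0L))
counts
```

Here the join never multiplies rows, so the NA produced for an unreferenced id is unambiguous and can be replaced by 0.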

Related

Filtering ids when they have the same value across a column in r [duplicate]

I have a data frame like below
sample <- data.frame(ID = 1:9,
Group = c('AA','AA','AA','BB','BB','CC','CC','BB','CC'),
Value = c(1,1,1,2,2,2,3,2,3))
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
6 CC 2
7 CC 3
8 BB 2
9 CC 3
I want to select groups according to the number of distinct (unique) values within each group. For example, select groups where all values within the group are the same (one distinct value per group). If you look at the group CC, it has more than one distinct value (2 and 3) and should thus be removed. The other groups, with only one distinct value, should be kept. Desired output:
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
8 BB 2
Would you tell me simple and fast code in R that solves the problem?
Here's a solution using dplyr:
library(dplyr)
sample <- data.frame(
ID = 1:9,
Group= c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)
sample %>%
group_by(Group) %>%
filter(n_distinct(Value) == 1)
We group the data by Group, and then only select groups where the number of distinct values of Value is 1.
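To see what the filter condition evaluates per group, here is a small base R sketch that computes the number of distinct values for each group explicitly. The names groups_df, n_vals and kept are hypothetical; the data is the example from the question.

```r
groups_df <- data.frame(
  ID = 1:9,
  Group = c("AA", "AA", "AA", "BB", "BB", "CC", "CC", "BB", "CC"),
  Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)

# Number of distinct values of Value within each Group
n_vals <- tapply(groups_df$Value, groups_df$Group,
                 function(x) length(unique(x)))

# Keep only the rows whose group has exactly one distinct value
kept <- groups_df[groups_df$Group %in% names(n_vals)[n_vals == 1], ]
kept
```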
data.table version:
library(data.table)
sample <- as.data.table(sample)
sample[ , if(uniqueN(Value) == 1) .SD, by = Group]
# Group ID Value
#1: AA 1 1
#2: AA 2 1
#3: AA 3 1
#4: BB 4 2
#5: BB 5 2
#6: BB 8 2
An alternative using ave, if the data is numeric, is to check whether the variance within each group is 0 (note that var returns NA for a group with a single row):
sample[with(sample, ave(Value, Group, FUN=var ))==0,]
An alternative solution that could be faster on large data is:
setkey(sample, Group, Value)
# by = key(sample) is needed in current data.table versions, where unique() compares all columns by default
ans <- sample[unique(sample, by = key(sample))[, .N, by=Group][N==1, Group]]
The point is that calculating unique values for each group can be time-consuming when there are many groups. Instead, we set the key on the data.table, take the unique rows by key (which is extremely fast), and count the rows per group; each row is then one distinct (Group, Value) pair. We keep only the groups where the count is 1, and finally perform a join on those groups (which is once again very fast). Here's a benchmark on large data:
require(data.table)
set.seed(1L)
sample <- data.table(ID=1:1e7,
Group = sample(rep(paste0("id", 1:1e5), each=100)),
Value = sample(2, 1e7, replace=TRUE, prob=c(0.9, 0.1)))
system.time (
ans1 <- sample[,if(length(unique(Value))==1) .SD ,by=Group]
)
# minimum of three runs
# user system elapsed
# 14.328 0.066 14.382
system.time ({
setkey(sample, Group, Value)
ans2 <- sample[unique(sample, by = key(sample))[, .N, by=Group][N==1, Group]]
})
# minimum of three runs
# user system elapsed
# 5.661 0.219 5.877
setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE
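The keyed approach can be traced step by step on the small example data. This is a sketch with hypothetical intermediate names (dt, pairs, counts, keep); it passes by = key(dt) explicitly, on the assumption that unique() would otherwise compare all columns.

```r
library(data.table)

dt <- data.table(
  ID = 1:9,
  Group = c("AA", "AA", "AA", "BB", "BB", "CC", "CC", "BB", "CC"),
  Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)
setkey(dt, Group, Value)

# One row per distinct (Group, Value) pair
pairs <- unique(dt, by = key(dt))
# Number of distinct values per group
counts <- pairs[, .N, by = Group]
# Groups with a single distinct value
keep <- counts[N == 1, Group]
# Join back to retrieve all rows of those groups
ans <- dt[.(keep), on = "Group"]
ans
```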
You can build a selector for sample using ave in many different ways.
sample[ ave( sample$Value, sample$Group, FUN = function(x) length(unique(x)) ) == 1,]
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) sum(x != x[1]) ) == 0,]
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) diff(range(x)) ) == 0,]
Here's an approach using aggregate:
keep <- aggregate(Value ~ Group, data = sample, FUN = function(x) length(unique(x)) == 1)
sample[sample$Group %in% keep$Group[keep$Value], ]
ID Group Value
1 1 AA 1
2 2 AA 1
3 3 AA 1
4 4 BB 2
5 5 BB 2
8 8 BB 2

How to group the data by id and get unique values of all columns in R?

I have a table with an ID column and other columns. I want to group the data by ID and get the unique values across all the other columns.
From the table above, group by ID and get unique(Alt1, Alt2, Alt3).
The result should be in vector form:
A -> 1,2,3,5
B ->1,3,4,5,7
We can get data in long format and for each ID make a list of unique values.
library(dplyr)
library(tidyr)
df1 <- df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(value = list(unique(value))) %>%
unnest(value)
df1
# ID value
# <fct> <dbl>
# 1 A 1
# 2 A 3
# 3 A 2
# 4 A 5
# 5 B 1
# 6 B 4
# 7 B 5
# 8 B 3
# 9 B 6
#10 B 7
We can store it as a list if needed using split.
split(df1$value, df1$ID)
#$A
#[1] 1 3 2 5
#$B
#[1] 1 4 5 3 6 7
data.table equivalent of the above would be :
library(data.table)
setDT(df)
df2 <- melt(df, id.vars = 'ID')[, .(value = list(unique(value))), ID]
The unique values are stored in df2$value as a list with one vector per ID.
data
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
Alt1 = c(1, 2, 1, 3),
Alt2 = c(3, 5, 4, 6),
Alt3 = c(1, 3, 5, 7))
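For comparison, here is a base R sketch of the same idea without reshaping: split the non-ID columns by ID and collect the unique values of each piece. The name vals is hypothetical, and the values are sorted only to make the output easy to read.

```r
df <- data.frame(ID = c("A", "A", "B", "B"),
                 Alt1 = c(1, 2, 1, 3),
                 Alt2 = c(3, 5, 4, 6),
                 Alt3 = c(1, 3, 5, 7))

# Split the value columns by ID, flatten each piece, keep unique values
vals <- lapply(split(df[-1], df$ID), function(g) {
  sort(unique(unlist(g, use.names = FALSE)))
})
vals
```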

In R: Extract similar trajectory patterns from a data table

I have a data table that contains several patterns for going from a to c. These patterns are assigned to different expeditions. I want to extract similar patterns for the different expedition_id.
dt<- data.table(departure = c('a', 'a', 'a', 'b', 'a','d','a', 'b'), arrival =
c('a','a','b','c','d','c','b','c'), expedition_id = c(1,2,1,1,3,3,2,2))
>dt
departure arrival expedition_id
a a 1
a a 2
a b 1
b c 1
a d 3
d c 3
a b 2
b c 2
The results that I am trying to get look like different data tables for each unique pattern.
>dt1
departure arrival expedition_list
a a 1,2
a b 1,2
b c 1,2
>dt2
departure arrival expedition_list
a d 3
d c 3
I'd appreciate your help on this one.
You can try:
library(data.table)
dt <- dt[, .(expedition_list = toString(expedition_id)), by = .(departure, arrival)]
dt_list <- split(dt, dt$expedition_list)
list2env(
setNames(
dt_list,
paste0('dt', 1:length(dt_list))
),
.GlobalEnv
)
Output:
dt1
departure arrival expedition_list
1: a a 1, 2
2: a b 1, 2
3: b c 1, 2
dt2
departure arrival expedition_list
1: a d 3
2: d c 3
You asked for data.table, but for others this dplyr version might also be helpful (run it on the original dt, before it is overwritten above):
library(dplyr)
dt %>%
group_by(departure, arrival) %>%
summarise(expedition_list = paste(expedition_id, collapse = ","))
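The same grouping can be sketched in base R: collapse the expedition ids per (departure, arrival) leg, then split the legs by that collapsed string. The names trips, legs and patterns are hypothetical, and the data is rebuilt as a plain data frame.

```r
trips <- data.frame(
  departure = c("a", "a", "a", "b", "a", "d", "a", "b"),
  arrival = c("a", "a", "b", "c", "d", "c", "b", "c"),
  expedition_id = c(1, 2, 1, 1, 3, 3, 2, 2)
)

# One row per leg, with the expeditions that use it as a string
legs <- aggregate(expedition_id ~ departure + arrival,
                  data = trips, FUN = toString)
names(legs)[3] <- "expedition_list"

# Legs that share the same set of expeditions form one pattern
patterns <- split(legs, legs$expedition_list)
patterns
```

Note that this treats two legs as belonging to the same pattern only when their expedition strings are identical, which relies on the ids appearing in the same order for each leg.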

New column conditional on whether number is even/uneven and on column

Say I have the following df:
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
id name
1 1 a
2 1 t
3 1 signal
4 1 b
5 1 s
6 1 e
7 1 signal
8 2 x
9 2 signal
10 2 r
11 2 s
12 2 t
13 2 signal
I want to add a new column with a character value conditional on whether the id number is even or not, and whether the string 'signal' is reached in the 'name' column.
For odd id numbers, up to and including 'signal' in the 'name' column, I would like the character 'T'. After the signal, the character should become 'C'.
For even id numbers, up to and including 'signal' in the 'name' column, I would like the character 'C'. After the signal, the character should become 'T'.
For the example given, this should result in the following data.frame:
id, name, condition
1, a, T
1, t, T
1, signal, T
1, b, C
1, s, C
1, e, C
1, signal, C
2, x, C
2, signal, C
2, r, T
2, s, T
2, t, T
2, signal, T
Any help is very much appreciated!
This is not a vectorized solution, but it works for me.
Data preparation: I add a new column to hold the condition.
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
df$condition <- rep("X", nrow(df))
I need to track two pieces of state: (i) whether the signal has been seen yet, and (ii) the id of the previous row, so that the signal can be reset when the id changes. Then I read row by row and update the condition along with these two variables.
signal <- FALSE
last <- df[1, "id"]
for (i in seq_len(nrow(df))) {
  # id changed - reset signal
  if (last != df[i, "id"]) signal <- FALSE
  is_odd <- df[i, "id"] %% 2 == 1
  if (!signal) {
    df[i, "condition"] <- if (is_odd) "T" else "C"
  } else {
    df[i, "condition"] <- if (is_odd) "C" else "T"
  }
  # signal is on from the next row onwards
  if (df[i, "name"] == "signal") signal <- TRUE
  # save the id of this row
  last <- df[i, "id"]
}
I hope it helps.
We could make use of %% with == to create the column
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(ind = (cumsum(lag(name, default = name[1]) == 'signal')>0) + 1,
condition = c('T', 'C')[ifelse(id %%2 > 0, ind,
as.integer(factor(ind, levels = rev(unique(ind)))))] ) %>%
select(-ind)
# A tibble: 13 x 3
# Groups: id [2]
# id name condition
# <int> <chr> <chr>
# 1 1 a T
# 2 1 t T
# 3 1 signal T
# 4 1 b C
# 5 1 s C
# 6 1 e C
# 7 1 signal C
# 8 2 x C
# 9 2 signal C
#10 2 r T
#11 2 s T
#12 2 t T
#13 2 signal T
data
df1 <- data.frame(id, name, stringsAsFactors=FALSE)
Another approach could be
id <- rep(1:2,c(7,6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
library(dplyr)
df %>%
group_by(id) %>%
mutate(FirstSignalIndex=min(which(name=='signal'))) %>%
mutate(condition = ifelse((id %% 2)==0,
ifelse(row_number()>FirstSignalIndex, 'T', 'C'),
ifelse(row_number()>FirstSignalIndex, 'C', 'T')))
Hope this helps!
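For readers who prefer base R, the rule can also be sketched without a loop or extra packages. The idea, offered here as one possible formulation: a row is "after the signal" if at least one 'signal' row precedes it within the same id, and the label is 'T' exactly when "odd id" and "after the signal" disagree. The names seen, after and odd are hypothetical.

```r
id <- rep(1:2, c(7, 6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name, stringsAsFactors = FALSE)

# Number of 'signal' rows strictly before each row, within each id
is_signal <- as.integer(df$name == "signal")
seen <- ave(is_signal, df$id, FUN = function(x) cumsum(x) - x)

after <- seen > 0        # TRUE once the first signal has passed
odd <- df$id %% 2 == 1   # TRUE for odd ids

# odd & before -> T, odd & after -> C, even & before -> C, even & after -> T
df$condition <- ifelse(xor(odd, after), "T", "C")
df
```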

Sample by group with a condition (r)

I need to randomly select a diary for each individual (id) but only for those who filled more than one.
Let us suppose my data look like this
dta = rbind(c(1, 1, 'a'),
c(1, 2, 'a'),
c(1, 3, 'b'),
c(2, 1, 'a'),
c(3, 1, 'b'),
c(3, 2, 'a'),
c(3, 3, 'c'))
colnames(dta) <- c('id', 'DiaryNumber', 'type')
dta = as.data.frame(dta)
dta
id DiaryNumber type
1 1 a
1 2 a
1 3 b
2 1 a
3 1 b
3 2 a
3 3 c
For example, id 1 filled 3 diaries. What I need is to randomly select one of the 3 diaries. Id 2 only filled one diary, so I do not need to do anything with it.
I have no idea how I could do that.
Any ideas?
You can use sample_n:
library(dplyr)
dta %>% group_by(id) %>% sample_n(1)
## Source: local data frame [3 x 3]
## Groups: id
##
## id DiaryNumber type
## 1 1 2 a
## 2 2 1 a
## 3 3 1 b
Base package:
set.seed(123)
df <- lapply(split(dta, dta$id), function(x) x[sample(nrow(x), 1), ])
do.call("rbind", df)
Output:
id DiaryNumber type
1 1 1 a
2 2 1 a
3 3 2 a
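Both answers sample within every id, which gives the right result because an id with a single diary can only return that diary. If you want the "more than one diary" condition to be explicit, a sketch along these lines might help. It assumes a dplyr version that provides slice_sample() (the successor to sample_n()), rebuilds dta as a plain data frame, and uses the hypothetical names multi, single and picked.

```r
library(dplyr)

dta <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3),
  DiaryNumber = c(1, 2, 3, 1, 1, 2, 3),
  type = c("a", "a", "b", "a", "b", "a", "c")
)

set.seed(123)  # only to make the random draw reproducible

# Individuals with several diaries: draw one diary at random
multi <- dta %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Individuals with a single diary: keep it untouched
single <- dta %>%
  group_by(id) %>%
  filter(n() == 1) %>%
  ungroup()

picked <- bind_rows(multi, single) %>% arrange(id)
picked
```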
