Get top k records per group, where k differs by group, in R data.table - r

I have two data.tables:
1. Values to extract the top k from, per group.
2. A mapping from each group to its k value.
The question "how to find the top N values by group or within category (groupwise) in an R data.frame" addresses the case where k does not vary by group. How can I do this when k differs by group? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
                  group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
                       k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
                      group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution from the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE

I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it makes the intended operation quite clear. j can be .SD[1:k] as well, of course. Both these expressions will very likely be (further) optimised for speed in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap up those vignettes.
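For example, on the sample data above, the .SD[1:k] form gives the desired rows directly (a sketch assuming data.table >= 1.9.6, which introduced the on= argument):
dt[group.k, .SD[1:k], by=.EACHI, on="group"]
#    group id
# 1:     1  1
# 2:     1  2
# 3:     2  6
# 4:     2  7
# 5:     2  8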

After merging in the k by group, a similar approach to the solution at https://stackoverflow.com/a/14800271/1840471 can be applied; you just need unique() to avoid the length(n) error, since head() requires a length-1 n but within each group k is repeated once per row:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3
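The output above carries the k column along, since it is part of .SD. If you want it to match the desired result exactly, drop it afterwards:
result[, k := NULL]
result
#    group id
# 1:     1  1
# 2:     1  2
# 3:     2  6
# 4:     2  7
# 5:     2  8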

Related

Unique ID for interconnected cases

I have the following data frame, which shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
  newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1;
2. The final loop is agonizingly slow, which would render it useless for my real use case of 1M rows, and the traditional := approach does not work with which().
I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example)  # the edge list defined above
df_membership <- clusters(g)$membership
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
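If you need the result in exactly the shape of the desired output (an integer DebtorId column and a group column, ordered by DebtorId), a small post-processing step along these lines should work; setDT() and the column renaming are the only additions here. (In recent igraph versions, graph.data.frame() and clusters() go by graph_from_data_frame() and components().)
library(data.table)
# stack() returns columns 'values' (cluster id) and 'ind' (vertex name, as factor)
res <- setDT(stack(df_membership))[, .(DebtorId = as.integer(as.character(ind)), group = values)]
setorder(res, DebtorId)
res
#    DebtorId group
# 1:        1     1
# 2:        2     1
# 3:        3     1
# 4:        4     1
# 5:        5     1
# 6:        6     2
# 7:        7     2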

Repeat sequence by group

I have the following dataframe:
a <- data.frame(
  group1=factor(rep(c("a","b"), each=6, times=1)),
  time=rep(1:6, each=1, times=2),
  newcolumn = c(1,1,2,2,3,3,1,1,2,2,3,3)
)
I'm looking to replicate the output of newcolumn with a rep-by-group function (the time variable is there for ordering purposes). In other words, for each group, ordered by time, how can I assign a sequence 1,1,2,2,...,n,n? I also need a general solution, for cases where groups have differing numbers of rows or where I want to repeat each value 3, 10, or n times.
For instance, I can generate that sequence with this:
newcolumn=rep(1:3,each=2,times=2)
But that wouldn't work in a group by statement where group1 has differing rows.
We specify length.out in the rep after grouping by 'group1':
library(dplyr)
a %>%
  group_by(group1) %>%
  mutate(new = rep(seq_len(n()/2), each = 2, length.out = n()))
NOTE: each and times are not used in the same call; we use either each or times.
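For the more general case the question asks about (repeating each value m times instead of 2), the same idea applies; this is a sketch where m is a hypothetical repeat count, and ceiling() handles groups whose size is not a multiple of m:
m <- 3  # hypothetical repeat count
a %>%
  group_by(group1) %>%
  mutate(new = rep(seq_len(ceiling(n()/m)), each = m, length.out = n()))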
EDIT: Based on comments from @r2evans, a data.table alternative:
library(data.table)
DT <- as.data.table(a[1:2])
DT[order(time),newcolumn := rep(seq_len(.N/2), each=2, length.out=.N),by=c("group1")]
DT
# group1 time newcolumn
# 1: a 1 1
# 2: a 2 1
# 3: a 3 2
# 4: a 4 2
# 5: a 5 3
# 6: a 6 3
# 7: b 1 1
# 8: b 2 1
# 9: b 3 2
# 10: b 4 2
# 11: b 5 3
# 12: b 6 3

Lowest pair sequential combination data table

I have a set with two columns. The rows are pairs of values (a,b).
require(data.table)
dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3))
I want to assign to each pair of values the lowest number, BUT if one of the values appears again in a later row, it must be compared again with the new pair, taking the lowest value in the history. For example, the last pair (6,3) gets 1: the value 6 already appeared in row 4, whose result had itself become 1 via the value 2 shared with row 1. The result must be this one:
res.dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3), res=c(1,9,8,1,5,3,1))
a b res
1: 1 2 1
2: 11 9 9
3: 11 8 8
4: 2 6 1
5: 7 5 5
6: 5 3 3
7: 6 3 1
To state the problem differently: For each row i, we need to iteratively update res with the smallest value in rows j <= i where (a_i,b_i) and (a_j,b_j) have a non-empty intersection.
We can do this efficiently with non-equi joins in data.table (v>=1.9.8), but since this feature only allows single-element comparisons (>,>=,==,<=, or <), we need to find intersections by comparing (a_i,a_j), (a_i,b_j), (b_i,a_j), (b_i,b_j) separately. (There is an intersection if at least one of these pairs contains identical elements.) Doing this iteratively accounts for the entire history, and we can stop when the result converges:
# initialize: row index, per-row minimum, and a convergence tracker
dt[, `:=`(idx=.I, res=pmin(a,b), prev_res=NA)]
while (dt[, !identical(res, prev_res)]) {  # repeat until a full pass changes nothing
  dt[, prev_res := res]
  # Use non-equi joins to update 'res' for intersecting pairs downstream
  dt[dt[, .(i.a=a, i.res=res, i=.I)], on=.(a==i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.a=a, i.res=res, i=.I)], on=.(b==i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b=b, i.res=res, i=.I)], on=.(a==i.b, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b=b, i.res=res, i=.I)], on=.(b==i.b, idx > i), res := pmin(res, i.res)]
}
The result:
> dt[, .(a,b,res)]
# a b res
# 1: 1 2 1
# 2: 11 9 9
# 3: 11 8 8
# 4: 2 6 1
# 5: 7 5 5
# 6: 5 3 3
# 7: 6 3 1

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get the following (note the absence of the (3,2) tuple):
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the mention of value v in field f makes it confusing.
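Applied to the question's sample data, grouping on f2, this gives the desired rows (with the grouping column printed first):
dt[, if (.N > 1) .SD, by = f2]
#    f2 f1
# 1:  1  1
# 2:  1  2
# 3:  3  4
# 4:  3  5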
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]  # add group sizes, filter on them, then drop the helper column
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

How to count occurrences combinations in data.table in R

I have two data.tables. For each combination of values in one table, I would like to count the number of matching rows in the other table. I have checked the data.table documentation but have not found my answer. I am using data.table 1.9.2.
DT1 <- data.table(a=c(3,2), b=c(8,3))
DT2 <- data.table(w=c(3,3,3,2,3), x=c(8,8,8,3,7), z=c(2,6,7,2,2))
DT1
# a b
# 1: 3 8
# 2: 2 3
DT2
# w x z
# 1: 3 8 2
# 2: 3 8 6
# 3: 3 8 7
# 4: 2 3 2
# 5: 3 7 2
Now I would like to count the number of (3, 8) pairs and (2, 3) pairs in DT2.
setkey(DT2, w, x)
nrow(DT2[J(3, 8), nomatch=0])
# [1] 3 ## OK !
nrow(DT2[J(2, 3), nomatch=0])
# [1] 1 ## OK !
DT1[,count_combination_in_dt2 := nrow(DT2[J(a, b), nomatch=0])]
DT1
# a b count_combination_in_dt2
# 1: 3 8 4 ## not ok.
# 2: 2 3 4 ## not ok.
Expected result:
# a b count_combination_in_dt2
# 1: 3 8 3
# 2: 2 3 1
setkey(DT2, w, x)
DT2[DT1, .N, by = .EACHI]
# w x N
#1: 3 8 3
#2: 2 3 1
# In versions <= 1.9.2, use DT2[DT1, .N] instead
The above simply does the merge and counts the number of rows for each group defined by the i-expression, thus by = .EACHI.
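If you want the counts attached to DT1 as the count_combination_in_dt2 column from the question, one way (a sketch, relying on by = .EACHI preserving the row order of DT1) is to pull N out of the join result:
DT1[, count_combination_in_dt2 := DT2[DT1, .N, by = .EACHI]$N]
DT1
#    a b count_combination_in_dt2
# 1: 3 8                        3
# 2: 2 3                        1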
You just need to add by=list(a,b).
DT1[,count_combination_in_dt2:=nrow(DT2[J(a,b),nomatch=0]), by=list(a,b)]
DT1
##
## a b count_combination_in_dt2
## 1: 3 8 3
## 2: 2 3 1
EDIT: Some more details: in your original version, you used DT2[DT1, nomatch=0] (because you used all a, b combinations at once). If you want to use J(a,b) for each a, b combination separately, you need to use the by argument. The data.table is then grouped by a, b and nrow(...) is evaluated within each group.
