How to count occurrences combinations in data.table in R - r

I have two data.tables. I would like to count the number of rows matching a combination of a table in another table. I have checked the data.table documentation but I have not found my answer. I am using data.table 1.9.2.
DT1 <- data.table(a=c(3,2), b=c(8,3))
DT2 <- data.table(w=c(3,3,3,2,3), x=c(8,8,8,3,7), z=c(2,6,7,2,2))
DT1
# a b
# 1: 3 8
# 2: 2 3
DT2
# w x z
# 1: 3 8 2
# 2: 3 8 6
# 3: 3 8 7
# 4: 2 3 2
# 5: 3 7 2
Now I would like to count the number of (3, 8) pairs and (2, 3) pairs in DT2.
setkey(DT2, w, x)
nrow(DT2[J(3, 8), nomatch=0])
# [1] 3 ## OK !
nrow(DT2[J(2, 3), nomatch=0])
# [1] 1 ## OK !
DT1[,count_combination_in_dt2 := nrow(DT2[J(a, b), nomatch=0])]
DT1
# a b count_combination_in_dt2
# 1: 3 8 4 ## not ok.
# 2: 2 3 4 ## not ok.
Expected result:
# a b count_combination_in_dt2
# 1: 3 8 3
# 2: 2 3 1

setkey(DT2, w, x)
DT2[DT1, .N, by = .EACHI]
# w x N
#1: 3 8 3
#2: 2 3 1
# In versions <= 1.9.2, use DT2[DT1, .N] instead
The above simply does the merge and counts the number of rows for each group defined by the i-expression, thus by = .EACHI.

You just need to add by=list(a,b).
DT1[,count_combination_in_dt2:=nrow(DT2[J(a,b),nomatch=0]), by=list(a,b)]
DT1
##
## a b count_combination_in_dt2
## 1: 3 8 3
## 2: 2 3 1
EDIT: Some more details: In your original version, you used DT2[DT1, nomatch=0] (because you used all a, b combinations. If you want to use J(a,b) for each a, b combination separately, you need to use the by argument. The data.table is then grouped by a, b and the nrow(...) is evaluated within each group.

Related

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

Get top k records per group, where k differs by group, in R data.table

I have two data.tables:
Values to extract the top k from, per group.
A mapping from group to the k values to select for that group.
how to find the top N values by group or within category (groupwise) in an R data.frame addresses this question when k does not vary by group. How can I do this? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution to the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it's quite clear to see what the intended operation is. j can be .SD[1:k] of course. Both these expressions will very likely be (further) optimised (for speed) in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap those vignettes.
After merging in the k by group, a similar approach to https://stackoverflow.com/a/14800271/1840471's solution can be applied, you just need a unique to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3

get rows of unique values by group

I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
Thanks to dplyR
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))

Cumulative sequence of occurrences of values [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
I have a dataset that looks something like this, with a column that can have four different values:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
In R, I'd like to create a second column that tallies, in sequence, the cumulative number of rows containing a particular value. Thus the output column would look like this:
out
1
1
1
2
1
2
2
3
2
3
3
4
Try this:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
with(dataset, ave(as.character(out), out, FUN = seq_along))
# [1] "1" "1" "1" "2" "1" "2" "2" "3" "2" "3" "3" "4"
Of course, you can assign the output to a column in your data.frame using something like out$asNumbers <- with(dataset, ave(as.character(out), out, FUN = seq_along))
Update
The "dplyr" approach is also quite nice. The logic is very similar to the "data.table" approach. An advantage is that you don't need to wrap the output with as.numeric which would be required with the ave approach mentioned above.
dataset %>% group_by(out) %>% mutate(count = sequence(n()))
# Source: local data frame [12 x 2]
# Groups: out
#
# out count
# 1 a 1
# 2 b 1
# 3 c 1
# 4 a 2
# 5 d 1
# 6 b 2
# 7 c 2
# 8 a 3
# 9 d 2
# 10 b 3
# 11 c 3
# 12 a 4
A third option is to use getanID from my "splitstackshape" package. For this particular example, you just need to specify the data.frame name (since it's a single column), however, generally, you would be more specific and mention the column(s) that presently serve as "ids", and the function would check whether they are unique or whether a cumulative sequence is required to make them unique.
library(splitstackshape)
# getanID(dataset, "out") ## Example of being specific about column to use
getanID(dataset)
# out .id
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4
Update:
As Ananda pointed out, you can use the simpler:
DT[, counts := sequence(.N), by = "V1"]
(where DT is as below)
You can create a "counts" column, initialized to 1, then tally the cumulative sum, by factor.
below is a quick implementation with data.table
# Called the column V1
dataset<-data.frame(V1=c("a","b","c","a","d","b","c","a","d","b","c","a"))
library(data.table)
DT <- data.table(dataset)
DT[, counts := 1L]
DT[, counts := cumsum(counts), by=V1]; DT
# V1 counts
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4

Add a countdown column to data.table containing rows until a special row encountered

I have a data.table with ordered data labled up, and I want to add a column that tells me how many records until I get to a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just True/False if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
I'm sure (or at least hopeful) that a purely "data.table" solution would be generated, but in the meantime, you could make use of rle. In this case, you're interested in reversing the countdown, so we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
Here's a mix of Josh's and Ananda's solution, in that, I use RLE to generate the way Josh has given the answer:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: Taking use of the fact that there's only one b always (or assuming here), you could do this one better:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From OP's comment, it seems clear that there is more than 1 b possible and in such cases, all b should be 0. The first step in doing this is to create groups where b ends after each consecutive a's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0

Resources