R data.table filtering on group size - r

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example

The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.

If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

Related

Renaming multiple columns in R data.table

This is related to this question from Henrik
Assign multiple columns using := in data.table, by group
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is it something I can do with one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt2[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32mil rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)){
index <- (dt$Origin == i | dt$Destination == i)
dt[dt$Tel ==i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Where N tells that Tel=1 appears 1, Tel=2 appears 1, Tel=3,4 and 5 each appear 2 times.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Or another option is to loop through the columns 2 and 3, use %in% to check whether the values in 'Tel' are present, then with Reduce and + get the sum of logical elements for each 'Tel', assign (:=) the values to 'N'
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joins to the original. This is longer and likely less efficient than #akrun's, but can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join the original table on
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))

Get top k records per group, where k differs by group, in R data.table

I have two data.tables:
Values to extract the top k from, per group.
A mapping from group to the k values to select for that group.
how to find the top N values by group or within category (groupwise) in an R data.frame addresses this question when k does not vary by group. How can I do this? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution to the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it's quite clear to see what the intended operation is. j can be .SD[1:k] of course. Both these expressions will very likely be (further) optimised (for speed) in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap those vignettes.
After merging in the k by group, a similar approach to https://stackoverflow.com/a/14800271/1840471's solution can be applied, you just need a unique to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3

data preparation part II

there's another problem I encountered which is (as I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level of the data. For each letter in K, there should only be one line and "A_sum" should include the length of A where B has the same value. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3

Add a countdown column to data.table containing rows until a special row encountered

I have a data.table with ordered data labled up, and I want to add a column that tells me how many records until I get to a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just True/False if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
I'm sure (or at least hopeful) that a purely "data.table" solution would be generated, but in the meantime, you could make use of rle. In this case, you're interested in reversing the countdown, so we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
Here's a mix of Josh's and Ananda's solution, in that, I use RLE to generate the way Josh has given the answer:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: Taking use of the fact that there's only one b always (or assuming here), you could do this one better:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From OP's comment, it seems clear that there is more than 1 b possible and in such cases, all b should be 0. The first step in doing this is to create groups where b ends after each consecutive a's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0

Resources