How to perform a "serial join" in data.table?

I have two data.tables: an experiment data table x and a category lookup table dict.
library(data.table)
set.seed(123)
x = data.table(samp=c(1,1,2,3,3,3,4,5,5,5,6,7,7,7,8,9,9,10,10), y=rnorm(19))
x
#     samp           y
#  1:    1 -0.56047565
#  2:    1 -0.23017749
#  3:    2  1.55870831
#  4:    3  0.07050839
#  5:    3  0.12928774
#  6:    3  1.71506499
#  7:    4  0.46091621
#  8:    5 -1.26506123
#  9:    5 -0.68685285
# 10:    5 -0.44566197
# 11:    6  1.22408180
# 12:    7  0.35981383
# 13:    7  0.40077145
# 14:    7  0.11068272
# 15:    8 -0.55584113
# 16:    9  1.78691314
# 17:    9  0.49785048
# 18:   10 -1.96661716
# 19:   10  0.70135590
dict = data.table(samp=c(1:5, 4:8, 7:10), cat=c(rep(1,length(1:5)), rep(2,length(4:8)), rep(3,length(7:10))))
dict
# samp cat
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 4 2
# 7: 5 2
# 8: 6 2
# 9: 7 2
# 10: 8 2
# 11: 7 3
# 12: 8 3
# 13: 9 3
# 14: 10 3
For each samp, I need to first compute the product of all y's associated with it. I then need to compute the sum of these products for each sample category given in dict$cat. Note that a samp can map to more than one dict$cat.
One way of doing this is to merge x and dict right away, allowing row duplication (allow.cartesian=T):
setkey(dict, samp)
setkey(x, samp)
step0 = dict[x, allow.cartesian=T]
setkey(step0, samp, cat)
step1 = step0[, list(prodY = prod(y)), by = c("samp", "cat")]
resMet1 = step1[, sum(prodY), by="cat"]
I wonder, however, whether this joining step can be avoided. There are a few reasons for this - for example, if x is enormous, the duplication will use extra memory (am I right?). Also, summary tables with duplicated rows are quite confusing, making the analysis more error-prone.
So instead I was thinking of using samples in each dict$cat for a binary search in x. I know how to do it for a single category, so an ugly way of doing it for all of them would be with a loop:
setkey(x, samp)
setkey(dict,samp)
pool = vector("list")
for (n in unique(dict$cat)) {
  thisCat = x[J(dict[cat == n])]
  setkey(thisCat, samp)
  step1 = thisCat[, list(prodY = prod(y)[1], cat = cat[1]), by = "samp"]
  pool[[n]] = step1[, sum(prodY), by = "cat"]
}
resMet2 = rbindlist(pool)
But of course such loops are to be avoided. So I'm wondering if there's any way to somehow get data.table to iterate over the key values inside of J()?

IIUC, I'd formulate your question as follows: for each dict$cat, I'd like to get prod(y) corresponding to each sample for that cat, and then sum them all up.
Let's construct this step by step now:
For each dict$cat - sounds like you need to group by cat:
dict[, ,by=cat]
All that's left is to fill up j properly.
You need to get prod(y) from x for each sample in this group:
x[samp %in% .SD$samp, prod(y), by=samp]
extracts those rows from x corresponding to this group's samp (using .SD which stands for subset of data) and computes prod(y) on them, grouped by samp. Great!
We still need to sum them.
sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1)
Now we have the complete j expression. Let's plug it all in:
dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
# cat V1
# 1: 1 1.7770272
# 2: 2 0.7578771
# 3: 3 -1.0295633
Hope this helps.
Note 1: there's some redundant computation of prod(y) here, but the upside is that we don't materialise much intermediate data, so it's memory efficient. If you have too many groups, this might get slower, and you might want to compute prod(y) in a separate table first, like so:
x_p = x[, .(p = prod(y)), by=samp]
With this, we can simplify j as follows:
dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]
Note 2: the %in% expression creates an auto index on x's samp column on the first run, so binary search based subsetting is used from then on. Therefore there is no need to worry about performance due to vector scans.
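If you want to check that the auto index really was created, here is a small sketch (my addition; it assumes a reasonably recent data.table where indices() is exported, and the exact reported name may vary by version):
x[samp %in% dict$samp]   # the first %in% subset builds the auto index on 'samp'
indices(x)               # should now report an index on "samp"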

You might as well collapse x to the samp level first.
xprod = x[, .(py = prod(y)), by=samp]
Merge
res2 <- xprod[dict, on = "samp"][, sum(py), by=cat]
identical(res2, resMet2) # test passed
Or subset
If samp is the row number in xprod (as here), you can subset instead of merging:
res3 <- xprod[(dict$samp), sum(py), by=.(cat=dict$cat)]
identical(res3, resMet2) # test passed
It's very simple to relabel sample IDs so that this is true.
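For example, one way to do that relabelling (my own sketch, not part of the original answer) is to replace samp in dict with the matching row number of xprod:
dict[, samp := match(samp, xprod$samp)]   # samp now equals the row number in xprod
res3 <- xprod[(dict$samp), sum(py), by=.(cat=dict$cat)]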

Related

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
  start end group
1     1   3     A
2     2   4     B
3     3   5     C
4     4   6     D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
   V1 V2
1   1  A
2   2  A
3   3  A
4   2  B
5   3  B
6   4  B
7   3  C
8   4  C
9   5  C
10  4  D
11  5  D
12  6  D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
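If your real data could have repeated group values, one possible workaround (my own sketch, not from the original answer) is to group by row number and carry group along:
# group by row number so that duplicated 'group' values are handled too
setDT(a)[, .(group = group, V1 = start:end), by = .(row = seq_len(nrow(a)))][, row := NULL][]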
If you want a very fast solution in base R, you can manually create the data.frame in two steps: (1) use mapply to create a list of your ranges from "start" to "end"; (2) use rep + lengths to repeat the "group" column to the expected number of rows. Unlike the data.table approach above, this doesn't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
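A quick usage and timing check on the larger table y (my own sketch):
res <- myFun(y)
head(res)
system.time(myFun(y))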
Note that this does seem to slow down as the number of unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)

data.table aggregating over a subset of keys and joining with the subset

I have a table, Y, which contains a subset of unique keys from a much larger table, X, which has many duplicate keys. For each key in Y, I want to aggregate the same keys in X and add the aggregated variables to Y. I've been playing around with data.table and I've come up with a way that works without having to make a copy, but I'm hoping there is a faster and less syntactically dizzying solution. As more variables are added the syntax gets harder and harder to read and more helper references are made to table X when I really only care about them in table Y.
My question, just to clarify, is whether there is a more efficient and/or syntactically simpler way to do this operation.
My solution:
Y[X[Y, b:= sum(a)], b := i.b, nomatch=0]
For example:
set.seed(1)
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
#head(X)
#id a
#1: 1 0.4112744
#2: 1 0.3162717
#3: 2 0.6470602
#4: 2 0.2447973
#5: 3 0.4820801
#6: 3 0.8273733
Y[X[Y, b := sum(a)], b := i.b, nomatch=0]
#head(Y)
# id b
#1: 1 0.7275461
#2: 2 0.8918575
#3: 3 3.0622883
#4: 4 2.9098465
#5: 5 0.7893562
IIUC, we could use data.table's by-without-by feature here...
## <= 1.9.2
X[Y, list(b=sum(a))] ## implicit by-without-by
## 1.9.3
X[Y, list(b=sum(a)), by=.EACHI] ## explicit by
# id b
# 1: 1 0.7275461
# 2: 2 0.8918575
# 3: 3 3.0622883
# 4: 4 2.9098465
# 5: 5 0.7893562
In 1.9.3, by-without-by has been changed to require an explicit by (by=.EACHI). You can read more about it under the 1.9.3 new features, points (1) and (2), and the links from there.
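For reference, on current data.table versions the same aggregation can be written without setting keys at all, using an ad hoc on= join together with by=.EACHI (a sketch of the equivalent call, not part of the original answer):
X[Y, .(b = sum(a)), on = "id", by = .EACHI]   # for each row of Y, sum a over matching rows of X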
Is this what you had in mind?
# set up a reproducible example
library(data.table)
set.seed(1) # for reproducible example
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
# this statement does the work
result <- X[,list(b=sum(a)),keyby=id][Y]
result
# id b
# 1: 1 0.7275461
# 2: 2 0.8918575
# 3: 3 3.0622883
# 4: 4 2.9098465
# 5: 5 0.7893562
This might be faster, as it subsets X first.
result.2 <- X[Y][,list(b=sum(a)),by=id]
identical(result, result.2)
# [1] TRUE
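If, as in the original question, you'd rather add the aggregate to Y as a column than build a new table, a hedged sketch using an update join:
# aggregate X once, then update Y by reference (no copy of Y)
Y[X[, .(b = sum(a)), by = id], b := i.b, on = "id"]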

Don't resort data.table rows

I am learning data.table, so I'm very new to its syntax. I am trying to use the package as a hash lookup and it works well, except that, because of my ignorance of the syntax, it reorders the rows. I want to avoid reordering the rows without sacrificing speed (i.e., I'm after the efficient way to accomplish this). Here is an example and the desired output:
library(data.table)
(key <- setNames(aggregate(mpg~as.character(carb), mtcars, mean), c("x", "y")))
set.seed(10)
terms <- data.frame(x = c(9, 12, sample(key[, 1], 6, TRUE)), stringsAsFactors = FALSE)
## > terms$x
## [1] "9" "12" "4" "2" "3" "6" "1" "2"
setDT(key)
setDT(terms)
setkey(key, x)
setkey(terms, x)
terms[key, out := i.y]
terms
This gives:
## x out
## 1: 1 25.34286
## 2: 12 NA
## 3: 2 22.40000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 4 15.79000
## 7: 6 19.70000
## 8: 9 NA
I want:
## x out
## 1: 9 NA
## 2: 12 NA
## 3: 4 15.79000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 6 19.70000
## 7: 1 25.34286
## 8: 2 22.40000
In data.table, a join x[i] has to have a key set for x, but it's not essential for the key to be set for i.
NOTE: But if you don't set the key for i,
1) Ensure that the columns of i are in the same order as the key columns of x (reorder if necessary, using setcolorder), as it doesn't join by checking for names (yet).
2) It could be a tad slower (but not by much in my benchmarks).
The issue therefore is that, if you just want to do an x[i] join without any additional preprocessing, then terms has to take the place of i, with no key set, in order to get the results in the order you require.
With this in mind, we can approach this in two ways (that I could think of).
First method:
This one requires no additional preprocessing. That is, we treat key as the x mentioned above, meaning its key has to be set. We don't set a key for terms.
setkey(key, x)
The first column of terms is also named x and that's the column we want to join with. So, no reordering needed here.
ans = key[terms]
> ans
# x y
# 1: 9 NA
# 2: 12 NA
# 3: 4 15.79000
# 4: 2 22.40000
# 5: 3 16.30000
# 6: 6 19.70000
# 7: 1 25.34286
# 8: 2 22.40000
The difference is that this is an entirely new data.table, not just assigning the column by reference.
Second method:
We do a little extra preprocessing - addition of an extra column N to terms, by reference, which runs from 1:nrow(terms). This basically helps us to rearrange the data back in the order required, after the join. Here, we'll consider terms as x.
terms[, N := 1:.N]
setkey(terms, x)
It doesn't matter whether key has its 'x' column set as a key. But again, ensure that x is the first column in key if its key isn't set. In my case, I'll set the key of key to x.
setkey(key, x)
setkey(terms[key, out := i.y], N)
> terms
# x N out
# 1: 9 1 NA
# 2: 12 2 NA
# 3: 4 3 15.79000
# 4: 2 4 22.40000
# 5: 3 5 16.30000
# 6: 6 6 19.70000
# 7: 1 7 25.34286
# 8: 2 8 22.40000
Personally, since you require terms unsorted, I'd go with the first method here. But feel free to benchmark on your real data dimensions and choose which suits your need best.
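A further option worth noting (my addition, not in the original answer): on data.table versions with on= joins you can skip setkey(terms, x) entirely, so terms never gets reordered in the first place:
# assumes terms has not been keyed/reordered; no setkey calls needed at all
terms[key, out := i.y, on = "x"]
terms   # stays in its original row order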

Replacing all missing values in R data.table with a value

If you have an R data.table that has missing values, how do you replace all of them with say, the value 0? E.g.
aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]
V1 X V2
1: 1 NA 1
2: 2 NA 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 NA 4
8: 8 NA 4
9: 9 NA 4
10: 10 NA 4
Any way to do this in one line? If it were just a matrix, you could just do:
tt[is.na(tt)] = 0
is.na (being a primitive) has relatively little overhead and is usually quite fast. So, you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic data.table way.
First I'll illustrate how to do it, and then show how slow this can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
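If you do this kind of NA replacement often, you could wrap the loop in a small helper (my own sketch; the function name is made up):
replace_na <- function(DT, value = 0) {
  for (j in seq_along(DT))
    set(DT, i = which(is.na(DT[[j]])), j = j, value = value)
  invisible(DT)
}
replace_na(tt)   # modifies tt in place, by reference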
Why shouldn't you use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
  for (i in seq_along(tt))
    set(tt, i = which(is.na(tt[[i]])), j = i, value = 0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
Nothing unusual here:
tt[is.na(tt)] = 0
...will work.
This is somewhat confusing, however, given that:
tt[is.na(tt)]
...currently returns:
Error in [.data.table(tt, is.na(tt)) : i is invalid type
(matrix). Perhaps in future a 2 column matrix could return a list of
elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
datatable-help know if you'd like this, or add your comments to FR #1611.
I would make use of data.table and lapply, namely:
tt[, lapply(.SD, function(kkk) ifelse(is.na(kkk), -666, kkk)), .SDcols = names(tt)]
yielding:
V1 X V2
1: 1 -666 1
2: 2 -666 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 -666 4
8: 8 -666 4
9: 9 -666 4
10: 10 -666 4
The specific problem the OP posted could also be solved by:
tt[is.na(X), X := 0]
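A side note I'm adding (not from the original answers): for purely numeric tables, like the benchmark tt built from matrix() further up, recent data.table versions also ship setnafill(), which fills NAs by reference; it does not apply to character columns such as X in the original example.
# assumes a data.table version that exports setnafill(); numeric columns only
setnafill(tt, type = "const", fill = 0)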

Using .BY with a lookup table--unexpected results

I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by=. here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
The OP said that this creates a new data.table, which is undesirable for them.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
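On newer data.table versions the same update join can be written without any setkey calls, using on= (a sketch, not part of the original answer):
dt[k, b := i.b, on = "x"]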
To explain why the unexpected result happens:
In the first case you do dt[, b := k[.BY, list(b)], by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired result as well. As to why you get what you get when you do b := k[.BY, list(b)]: since the RHS returns a data.table and you're assigning it to a single variable, it takes the first element (the x column) and drops the rest. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)]:
In the first case, k[6], you are accessing the 6th row of k, which is 10 d. But in the second case, you're asking for a join. So it searches for x = 6 in the key column, and since there isn't one in k, it returns 6 NA. In your case, since .BY is a list, k[.BY] is a join, which fetches the right value.
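To see that difference concretely (a small illustration I'm adding):
k[6]          # positional subset: the 6th row of k
#     x b
# 1: 10 d
k[list(6)]    # join on the key: looks for x == 6
#    x  b
# 1: 6 NA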
I hope this helps.
