R data.table Multiple Conditions Join

I’ve devised a solution to look up values from multiple columns of two separate data tables and add a new column based on calculations of their values (multiple conditional comparisons). Code below. It involves joining the data.tables while calculating values from both tables; however, the tables aren’t joined on the columns I’m comparing, so I suspect I’m not getting the speed advantages inherent to data.table that I’ve read so much about and am eager to tap into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine. But I can’t go much bigger without introducing an enormous performance hit (200 X 200 takes ~22 sec).
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality.
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / efficient outcome?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
  dtEventsWithinRadius <- dtEvents[dtGrid, {
    val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]; # basic circle formula: x^2 + y^2 = radius^2
    list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
  }, by=.EACHI]
)

Very interesting question.. and great use of by = .EACHI! Here's another approach using the NEW non-equi joins from the current development version, v1.9.7.
Issue: Your use of by=.EACHI is completely justified because the other alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents) but that's too exhaustive and is bound to explode very quickly.
However, by = .EACHI is performed along with an equi-join on a dummy column, which results in computing all distances anyway (except that it does them one group at a time, and is therefore memory efficient). That is, in your code, for each row of dtGrid, the distances to all rows of dtEvents are still computed; hence it doesn't scale as well as expected.
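For illustration only, here's a sketch of that exhaustive cross-join alternative (hypothetical code, not what this answer recommends; it materialises nrow(dtGrid) * nrow(dtEvents) rows before counting, which is exactly what blows up):
# full cross join via the dummy column, then count per grid point
full <- dtEvents[dtGrid, on = "DummyJoin", allow.cartesian = TRUE]
full[, .(EventsWithinRadius = sum((x - i.x)^2 + (y - i.y)^2 < 1)), by = .(i.x, i.y)]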
Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, all j that satisfy (x1 - a_j)^2 + (y1 - b_j)^2 < 1 need to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where,
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
This constraint can be directly translated to a non-equi join, and combined with by = .EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1 as follows:
foo1 <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]           ## (1)
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI,
            allow=TRUE, nomatch=0L
           ][, c("xp", "yp") := NULL]                    ## (2)
  tmp[]
}
## (1) constructs all columns necessary for non-equi joins (since expressions are not allowed in the formula for on= yet).
## (2) performs a non-equi join that computes distances and checks for all distances that are < 1 on the restricted set of combinations for each row in dtGrid -- hence should be much faster.
Benchmarks:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
  ans = dt2[dt1,
            {
              val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
              .(xm=i.x, ym=i.y, V1=sum(val))
            },
            by=.EACHI][, "DummyJoin" := NULL]
  ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
Note that the rows in dtGrid for which the condition is not satisfied even for a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll have to also add one of the xm/xp/ym/yp cols.. and check them for NA (= no matches).
This is the reason we had to remove all 0 counts to get identical = TRUE.
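If you do need those rows, here is one possible variation as a minimal sketch (the wrapper name foo1_all and the count column N are mine; it assumes the same dtGrid/dtEvents setup as foo1, leaves nomatch at its default of NA, and then checks the computed count for NA rather than the xm/xp columns):
foo1_all <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(N = sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI,
            allow.cartesian=TRUE]                 # nomatch left at its default, NA
  tmp[is.na(N), N := 0L][, c("xp", "yp") := NULL] # unmatched grid points get a 0 count
  tmp[]
}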
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.

Related

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is that the rows I need come before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones <- which(df$y == 1)
selection <- NULL
for (i in ones) {
  jj <- (i-2):(i+4)
  selection <- c(selection, jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos <- which(df$y == 2)
selection <- NULL
for (i in twos) {
  jj <- (i-2):(i+4)
  selection <- c(selection, jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function doing something similar to this imaginary selector(data=df$y, values=c(1,2), before=2, after=5, afterafter=FALSE, beforebefore=FALSE), where values takes the critical values, before the number of rows to select before each value, and after the number to select after it.
afterafter would allow selecting a window that runs from a certain number of rows after the value up to a further number of rows after it, e.g. after=5, afterafter=10 (with beforebefore doing the same in the other direction).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$selection[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats each row index that meets your criterion 7 times (two before, the value itself, and four after; the L indicates that the argument is an integer). Adding the values -2 through 4 then yields the desired indices. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post why are these numbers not equal for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
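If you want something closer to the selector() helper imagined in the question, here is a minimal sketch built on the same rep() idiom (the function and its arguments are illustrative, not from an existing package; it drops out-of-range indices and ignores the afterafter/beforebefore options):
selector <- function(y, values, before, after) {
  sel <- rep(0, length(y))
  for (v in values) {
    idx <- rep(which(y == v), each = before + after + 1L) + (-before):after
    idx <- idx[idx >= 1L & idx <= length(y)]  # drop indices that fall outside the data
    sel[idx] <- v
  }
  sel
}
df$selection <- selector(df$y, values = c(1, 2), before = 2, after = 4)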

Round to a multiple and filter in data.table

I have a very interesting problem, though I'd rather not have one.
I have to round a number to the closest multiple, so I followed the solution here.
It used to work OK, until I discovered this bug with data.table:
library(data.table)
options(digits = 20) # to see number representation
mround <- function (number, multiple) {
  return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
DT[a == 112.3,] # works as expected, i.e returns one row
DT[a == 112.3 & b == 'B', ] # doesn't work
To be fair, with data.frame even the first filter doesn't work. Any ideas how to fix that?
Just to add to #Tens great answer.
Three things seem to be happening here:
You have a floating point issue (as mentioned already)
You are using an old data.table version
Secondary indices are kicking in while you aren't aware of it
Using your setup
library(data.table)
options(digits = 20) # to see number representation
mround <- function (number, multiple) {
  return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
So let's address the points above. Since you have a floating point issue, quoting ?setNumericRounding:
Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double'.
This led the data.table devs to implement setNumericRounding, which auto-rounded floats so that the radix algorithm would behave as expected.
Prior to v1.9.8, setNumericRounding(2) was the default (hence your first example works), but after some complaints from users about inconsistency on GH (IIRC), since v1.9.8 the default was set back to setNumericRounding(0) in order to be consistent with data.frame behaviour (see here). So if you update your data.table to the latest version, you will see that data.table and data.frame behave the same for both of your examples (and both of your examples will fail).
Compare
setNumericRounding(0)
DT[a == 112.3]
## Empty data.table (0 rows) of 2 cols: a,b
To
setNumericRounding(1)
DT[a == 112.3]
# a b
# 1: 112.30000000000001 B
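For completeness, a quick way to see the two different doubles involved here (output from a typical IEEE 754 platform; the exact trailing digits shown are an assumption and may print slightly differently for you):
sprintf("%.17f", c(mround(112.3, 0.1), 112.3))
# [1] "112.30000000000001137" "112.29999999999999716"
mround(112.3, 0.1) == 112.3
# [1] FALSE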
So you may ask, "what on earth does the radix algorithm have to do with anything here?" This is where we reach the third point above: secondary indices (please read this). Let's see what actually happens when you run your code above
options(datatable.verbose = TRUE)
DT[a == 112.3] # works as expected, i.e returns one row
# Creating new index 'a' <~~~~
# forder took 0 sec
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.30000000000001 B
Let's check the new secondary indices
indices(DT)
#[1] "a"
When you ran ==, data.table set a as a secondary index in order to perform future operations much more efficiently (this was introduced in v1.9.4, see here). In other words, you performed a binary join on a instead of the usual vector scan used prior to v1.9.4. (As a side note, this can be disabled with options(datatable.auto.index = FALSE); in that case, none of your examples will work even with setNumericRounding(1), unless you explicitly specify a key using setkey or the on argument.)
This probably also explains why
DT[a == 112.30000 & b == 'B']
doesn't work. You are subsetting here by two columns, and neither secondary indices nor binary joins (automatically) kick in for expressions such as == & == (yet); hence you did a normal vector scan and setNumericRounding(1) didn't kick in.
You can set the keys manually and make it work, though. For instance (as I commented under #Tens answer), you can do
setNumericRounding(1) # make sure auto-rounding is turned on
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
Or using the old way
setkey(DT, a, b)
DT[.(112.3, 'B'), nomatch = 0L]
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
It's a problem of floating point precision.
See DT[abs(a - 112.3) < 1.e-6 & b == 'B', ]: using an error margin of 0.000001 will give you the proper result.
If you want more precision you can use .Machine$double.eps^0.5, as all.equal does.
The general advice is to never compare floats for equality, but to compare their difference against a value close enough to machine precision to get around the precision drift of floating point numbers; more details here.
One way to fix your problem could be to refactor your function to:
mround <- function (number, multiple, digits=nchar(strsplit(as.character(multiple), ".", fixed=TRUE)[[1]][2])) {
  round(multiple * round(number / multiple), digits)
}
I used a "convoluted" method to get the number of significant digits from the multiple passed in, as the default for digits; adapt it to your needs (you may use 2 here for example, or force the precision when calling).
I removed the unnecessary return, which just causes the interpreter to make one more function call at the end.
This way your output should be precise enough, but you'll still have corner cases:
> mround(112.37,0.2)
[1] 112.40000000000001
To use floats in joins, you can use (courtesy of David Arenburg):
setNumericRounding(1)
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]

R data.table transform with row explosion

I would like to split one row into two (or more) rows when the cumsum of one of the columns breaks the period.
Is there any elegant way to perform such specific row explosion using data.table?
Do not focus on the cumsum (which I used in reverse order to have the cumsum run from the most recent row to the oldest one); strictly speaking, I want to transform dt into rdt in the code below.
# current data
dt <- data.table(
time_id = 101:110,
desc = c('asd','qwe','xyz','qwe','qwe','xyz','asd','asd','qwe','asd'),
value = c(5.5,3.5,14,0.7,6,5.5,9.3,29.8,4,7.2)
)
dt[, cum_value_from_now := rev(cumsum(rev(value)))]
period_width <- 10
dt[, value_period := ceiling(cum_value_from_now/period_width)*period_width]
dt
# expected result
rdt <- data.table(
time_id = c(101,102,103,103,104,105,105,106,107,107,108,108,108,108,109,109,110),
desc = c('asd','qwe','xyz','xyz','qwe','qwe','qwe','xyz','asd','asd','asd','asd','asd','asd','qwe','qwe','asd'),
value = c(5.5,3.5,6.5,7.5,0.7,1.8,4.2,5.5,0.3,9,1,10,10,8.8,1.2,2.8,7.2)
)[, cum_value_from_now := rev(cumsum(rev(value)))][, value_period := ceiling(cum_value_from_now/period_width)*period_width]
rdt
# validation
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
edit: I realized my question did not explain well the transformation I want to perform. To better understand what "breaks the period" means, please look at the cum_value_from_now values in my rdt from last to first. Each value_period is completely filled by the cumsum of value; the remainder of value is pushed into a new row (if value is big enough, it is split into multiple rows) to fit into the next period(s). Thanks
First, you seem to be applying your rules inconsistently. If "breaking the period" means that a row has value_period different from the previous row, then row 2 breaks the period, but you do not treat it that way.
Second, you never explain the partitioning of value. For instance, row 3 has value=14. This is replaced in rdt with two rows with values 6.5 and 7.5. These add to 14 all right, but there is no explanation of why this should be 6.5 and 7.5, rather than, say, 7 and 7. So in the solution below I partition equally.
The code below produces a result which passes your test, but it is not quite the same as your rdt, due to the above-mentioned problems with your question.
# number of period boundaries each row crosses (period_width = 10)
dt[, diff := c(-diff(value_period)/10, 0)]
# split each row's value equally across (diff + 1) rows
rdt <- dt[, list(value=as.numeric(rep(value/(diff+1), diff+1))),
          by=list(time_id, desc, cum_value_from_now, value_period)]
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
# [1] TRUE

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish is that for each number in coordvec, find the average of V3 for rows in subscores in which V1 and V2 encompass the number in coordvec. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same.
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
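For example (this is only an assumption about the semantics you might want, not a requirement of the approach), restricting matches to queries that fall entirely within a subject range would look like:
olaps_within <- findOverlaps(q, s, type = "within")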
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
score=scores[subjectHits(olaps)])
res1 <- dt[,mean(score), by=coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster :
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for, but you can use one of the apply family with a function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
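If you do try mclapply, a minimal sketch of the parallel version (it assumes a Unix-alike OS, since mclapply relies on forking, and the core count is arbitrary):
library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4L))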

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
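Both steps rely on exact floating point matches of the differences; if that ever becomes an issue with your data, a tolerance-based variant of the idx step could look like this (the 1e-8 tolerance is an arbitrary choice, adjust to your data):
shifted <- observations - theLag
idx <- which(sapply(ground.truth, function(g) any(abs(shifted - g) < 1e-8)))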
The following should work if your time series are not too long.
You have two vectors of time-stamps,
the second one being a shifted and incomplete copy of the first,
and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one,
compute how bad the alignment is,
by matching each observed timestamp with the closest
"truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth", value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1, d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g='truth'
             AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if (any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
