I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?
Though you are seeking a data.table solution, here is a base R solution which should be fast enough for your dataset.
indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]
On a slightly bigger dataset, system.time comparisons revealed the following:
system.time({
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
})
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
# user system elapsed
#1265.792 2.305 1267.836
So the base R method is faster than the data.table method in the post.
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10,
replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999
set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
For the row-wise maximum (i.e. the maximum over columns) in a data.table,
dt[, max:= do.call(pmax, .SD)]
is much faster than dt[, nmax:= max(.SD), by= 1:nrow(dt)], and faster than the above base R solution:
library(data.table)
ncols=100
nrows=8000000
dfi <- as.data.frame(matrix(runif(ncols*nrows), ncol = ncols, nrow = nrows))
df=dfi
system.time({
indx <- max.col(df, ties.method='first')
df$max <- df[cbind(1:nrow(df), indx)]
})
# user system elapsed
# 8.89 1.37 10.45
dt <- as.data.table(dfi)
system.time({
dt[, max:= do.call(pmax, .SD)]
})
# user system elapsed
# 3.31 0.01 3.33
Once you have calculated the Colmax index, use it to retrieve the rows whose maximum falls in a given column:
dt[Colmax == <value>]
or,
dt[J(<values>), on = 'Colmax']
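For a concrete end-to-end sketch combining the column index and the row maximum (a small illustration on made-up data; the column names V1..V5, the id pmxid, and the use of max.col/pmax are assumptions, not the original poster's setup):
library(data.table)
set.seed(1)
dt <- as.data.table(matrix(sample(0:20, 1e5 * 5, replace = TRUE), ncol = 5))
dt[, pmxid := .I]
cols <- paste0("V", 1:5)
dt[, maxCol := max.col(.SD, ties.method = "first"), .SDcols = cols]  # which column holds the row max
dt[, nmax   := do.call(pmax, .SD), .SDcols = cols]                   # the row maximum itself
dt[maxCol == 3L]   # e.g. all rows whose maximum sits in column V3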
Also, the syntax in
dt[, nmax := max(.SD), by = pmxid]
is problematic: max(.SD) collates a vector of length nrow(dt) * length(.SD) before taking the maximum (see the Note in the Description of ?max).
Instead try:
dt[, nmax := apply(.SD, 1, max), by = pmxid]
or the parallel max (pmax needs the columns of .SD as separate arguments, hence do.call):
dt[, nmax := do.call(pmax, .SD), by = pmxid]
I have a list of vectors of integers, for example:
set.seed(1)
vec_list <- replicate(100, sample(1:10000000, size=sample(1:10000, 100)), simplify=FALSE)
And a vector of integers, for example:
vec <- sample(1:10000000, size=10000)
How can I count the number of integers in each vector in vec_list that appear in the vector vec? I can do this using a for loop. For example:
total_match <- rep(NA, length(vec_list))
for (i in 1:length(vec_list)){
total_match[i] <- length(which(vec_list[[i]] %in% vec))
print(i)
}
However, the list and vector I am trying to apply this to are very large, and this is slow. Please help with suggestions on how to improve performance.
Using data.table is much faster, but it does not return 0s when there are no matches. For example:
DT <- data.table(repid=rep(1:length(vec_list), sapply(vec_list, length)), val=unlist(vec_list))
total_match2 <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]$N
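One way to get the zeros back (a sketch reusing the DT built above; the counts name is just for illustration) is to join the grouped counts onto the full set of list indices and fill the gaps with 0:
counts <- DT[.(vec), on = .(val), nomatch = 0L, .N, keyby = .(repid)]
total_match2 <- counts[.(seq_along(vec_list)), on = .(repid)]$N
total_match2[is.na(total_match2)] <- 0L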
What about:
sapply(vec_list, function(x) sum(x %in% vec))
Maybe try:
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
Another, a variant of #chinsoon's:
nvec = 5000
max_size = 10000
nv = 10000000
set.seed(1)
vec_list <- replicate(nvec, sample(nv, size=sample(max_size, 1)), simplify=FALSE)
vec <- sample(nv, size=max_size)
system.time(
res <- rbindlist(lapply(vec_list, list), id=TRUE)[.(vec), on=.(V1), nomatch=0, .N, keyby=.id]
)
# user system elapsed
# 0.86 0.20 0.47
system.time({
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
res2 <- DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
})
# user system elapsed
# 1.03 0.45 1.00
identical(res2[res2 != 0], res$N) # TRUE
I have a list of data frames with different sets of columns. I would like to combine them by rows into one data frame. I use plyr::rbind.fill to do that. I am looking for something that would do this more efficiently, but is similar to the answer given here
require(plyr)
set.seed(45)
sample.fun <- function() {
nam <- sample(LETTERS, sample(5:15))
val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
setNames(val, nam)
}
ll <- replicate(1e4, sample.fun())
rbind.fill(ll)
UPDATE: See this updated answer instead.
UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:
DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)
rbind(DT1, DT2, fill = TRUE)
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 NA 1
#4: 4 NA 2
FR #4790 added now - rbind.fill (from plyr)-like functionality to merge a list of data.frames/data.tables
Note 1:
This solution uses data.table's rbindlist function to "rbind" list of data.tables and for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.
Note 2:
When binding lists of data.frames/data.tables, rbindlist as of now retains each column's data type from the first item. That is, if a column in the first data.frame is character and the same column in the 2nd data.frame is factor, rbindlist will result in this column being character. So, if your data.frames consist of all character columns, the result of this method will be identical to the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them back to factor yourself afterwards. Hopefully this behaviour will change in the future.
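A minimal sketch of that post-hoc conversion on toy data (assuming you know which columns should end up as factors):
library(data.table)
A <- data.table(x = factor(c("a", "b")), y = 1:2)
B <- data.table(x = c("c", "d"), y = 3:4)
res <- rbindlist(list(A, B))
res[, x := as.factor(x)]   # convert back to factor after binding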
And now here's using data.table (and benchmarking comparison with rbind.fill from plyr):
require(data.table)
rbind.fill.DT <- function(ll) {
# changed sapply to lapply to return a list always
all.names <- lapply(ll, names)
unq.names <- unique(unlist(all.names))
ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
tt <- ll[[x]]
setattr(tt, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(tt, 0L)
invisible(alloc.col(tt))
tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
setcolorder(tt, unq.names)
}))
}
rbind.fill.PLYR <- function(ll) {
rbind.fill(ll)
}
require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
# expr min lq median uq max neval
# t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10
# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))
identical(t1, t2) # [1] TRUE
It should be noted that plyr's rbind.fill edges out this particular data.table solution up to a list size of about 500.
Benchmarking plot:
Here's the plot for runs with list lengths of seq(1000, 10000, by = 1000) data.frames. I've used microbenchmark with 10 reps at each of these list lengths.
Benchmarking gist:
Here's the gist for benchmarking, in case anyone wants to replicate the results.
Now that rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all, this answer of mine seems a bit too outdated.
Here's the relevant NEWS entry for rbindlist:
o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
-> use.names by default is FALSE for backwards compatibility (doesn't bind by
names by default)
-> rbind(...) now just calls rbindlist() internally, except that 'use.names'
is TRUE by default, for compatibility with base (and backwards compatibility).
-> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
-> At least one item of the input list has to have non-null column names.
-> Duplicate columns are bound in the order of occurrence, like base.
-> Attributes that might exist in individual items would be lost in the bound result.
-> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
-> And incredibly fast ;).
-> Documentation updated in much detail. Closes DR #5158.
So, I've benchmarked the newer (and faster versions) on relatively bigger data below.
New Benchmark:
We'll create a total of 10,000 data.tables, each with between 200 and 300 columns whose names are drawn from a pool of 500, so that the bound result has 500 columns.
Functions to create data:
require(data.table) ## 1.9.3 commit 1267
require(dplyr) ## commit 1504 devel
set.seed(1L)
names = paste0("V", 1:500)
foo <- function() {
cols = sample(200:300, 1)
data = setDT(lapply(1:cols, function(x) sample(10)))
setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
.Call("Csetlistelt", ll, i, foo())
}
And here are the timings:
## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.993 0.106 2.107
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.644 0.092 1.744
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.297 0.088 1.389
## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
# user system elapsed
# 9.525 0.121 9.761
# user system elapsed
# 9.194 0.112 9.370
# user system elapsed
# 8.665 0.081 8.780
identical(ans1, setDT(ans2)) # [1] TRUE
There is still something to be gained if you parallelize both rbind.fill and rbindlist.
The results were obtained with data.table version 1.8.8, as version 1.8.9 broke when I tried it with the parallelized function. So the results aren't identical between the data.table and plyr solutions, but they are identical within each: parallelized plyr matches unparallelized plyr, and likewise for data.table.
Here are the benchmark scripts. The parallel.rbind.fill.DT looks horrible, but that's the fastest one I could pull off.
require(plyr)
require(data.table)
require(ggplot2)
require(rbenchmark)
require(parallel)
# data.table::rbindlist solutions
rbind.fill.DT <- function(ll) {
all.names <- lapply(ll, names)
unq.names <- unique(unlist(all.names))
rbindlist(lapply(seq_along(ll), function(x) {
tt <- ll[[x]]
setattr(tt, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(tt, 0L)
invisible(alloc.col(tt))
tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
setcolorder(tt, unq.names)
}))
}
parallel.rbind.fill.DT <- function(ll, cluster=NULL){
all.names <- lapply(ll, names)
unq.names <- unique(unlist(all.names))
if(is.null(cluster)){
ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
tt <- ll[[x]]
setattr(tt, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(tt, 0L)
invisible(alloc.col(tt))
tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
setcolorder(tt, unq.names)
}))
}else{
cores <- length(cluster)
sequ <- as.integer(seq(1, length(ll), length.out = cores+1))
Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ")
ll <- eval(parse(text=paste("list(", Call, ")")))
rbindlist(clusterApply(cluster, ll, function(ll, unq.names){
rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) {
tt <- ll[[x]]
setattr(tt, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(tt, 0L)
invisible(alloc.col(tt))
tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_]
setcolorder(tt, unq.names)
}, ll=ll, unq.names=unq.names))
}, unq.names=unq.names))
}
}
# plyr::rbind.fill solutions
rbind.fill.PLYR <- function(ll) {
rbind.fill(ll)
}
parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){
if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){
rbind.fill(ll)
}else{
cores <- length(cluster)
sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst)))
Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ")
ll <- eval(parse(text=paste("list(", Call, ")")))
rbind.fill(parLapply(cluster, ll, rbind.fill))
}
}
# Function to generate sample data of varying list length
set.seed(45)
sample.fun <- function() {
nam <- sample(LETTERS, sample(5:15))
val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
setNames(val, nam)
}
ll <- replicate(10000, sample.fun())
cl <- makeCluster(4, type="SOCK")
clusterEvalQ(cl, library(data.table))
clusterEvalQ(cl, library(plyr))
benchmark(t1 <- rbind.fill.PLYR(ll),
t2 <- rbind.fill.DT(ll),
t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400),
t4 <- parallel.rbind.fill.DT(ll, cluster=cl),
replications=5)
stopCluster(cl)
# Results for rbinding 10000 dataframes
# done with 4 cores, i5 3570k and 16gb memory
# test reps elapsed relative
# rbind.fill.PLYR 5 321.80 16.682
# rbind.fill.DT 5 26.10 1.353
# parallel.rbind.fill.PLYR 5 28.00 1.452
# parallel.rbind.fill.DT 5 19.29 1.000
# checking are results equal
t1 <- as.matrix(t1)
t2 <- as.matrix(t2)
t3 <- as.matrix(t3)
t4 <- as.matrix(t4)
t1 <- t1[order(t1[, 1], t1[, 2]), ]
t2 <- t2[order(t2[, 1], t2[, 2]), ]
t3 <- t3[order(t3[, 1], t3[, 2]), ]
t4 <- t4[order(t4[, 1], t4[, 2]), ]
identical(t2, t4) # TRUE
identical(t1, t3) # TRUE
identical(t1, t2) # FALSE, mismatch between plyr and data.table
As you can see, parallelizing rbind.fill made it comparable to data.table, and you can get a marginal speed increase by parallelizing data.table even with this low a data.frame count.
Simply using dplyr::bind_rows will do the job:
library(dplyr)
merged_list <- bind_rows(ll)
#check it
> nrow(merged_list)
[1] 100000
> ncol(merged_list)
[1] 26
Time taken
> system.time(merged_list <- bind_rows(ll))
user system elapsed
0.29 0.00 0.28
I have a code that replaces impossible values in a dataset with NA.
I'm trying to convert the code to be based on data.table; as an example, I replace a height of 0 with NA.
(Dummy) data
DT <- data.table(id = 1:5e6,
height = sample(c(0, 100:240), 5e6, replace = TRUE))
My current solution is slower and at least as verbose as my data.frame version. I assume I am doing something wrong...
DT[height == 0, height := NA]
While researching this question I found another solution which is much faster (but uglier).
set(DT, which(DT[["height"]] == 0), "height", value = NA)
All suggestions appreciated.
Since v1.9.4, data.table by default automatically creates an index on columns during subsets of the form x == val and x %in% val used within a [.data.table call. This makes subsequent subsets very fast, with only a slightly higher price to pay on the first subset (since data.table's radix ordering is quite fast). The first subset can be slower because it includes the time to:
create the index
and then subset.
To illustrate this (using #akrun's data):
require(data.table)
getOption("datatable.auto.index") # [1] TRUE ===> enabled
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.396 0.059 0.452 ## first run
# 0.003 0.000 0.004 ## second run is very fast
Now if we disable auto indexing:
require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # [1] FALSE
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.037 0.007 0.042 ## first run
# 0.039 0.010 0.045 ## second run (~ 10x slower than 2nd run above)
options(datatable.auto.index = TRUE) # restore auto indexing if necessary
But your case is special because, you update the same column you subset. In essence, this is what is happening:
The i expression is seen to be an expression that can be optimised for auto indexing. An index is created and saved for blazing fast subsets later on.
The j expression is seen and the column is updated.
The column on which the index has been set has been updated. So index is removed.
The auto-indexing logic should detect this and skip creating the index altogether whenever any of the rows evaluate to TRUE, since the column is then updated and the created index is essentially useless.
Could you please file an issue on the project issues page? Just linking to this SO Q should be sufficient.
To answer your Q, disable auto indexing and run the subset, and it should be more or less equal to the time you get with set().
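A minimal sketch of that comparison (timings are machine-dependent; the data follows the same pattern as above):
library(data.table)
set.seed(24)
DT  <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
DT2 <- copy(DT)
options(datatable.auto.index = FALSE)
system.time(DT[height == 0, height := NA])                        # no auto-index overhead now
system.time(set(DT2, which(DT2[["height"]] == 0), "height", NA))  # should be roughly comparable
options(datatable.auto.index = TRUE)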
A base R solution just can't be faster here, since it copies the entire column just to update those entries; that is simply how base R assignment works.
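To see that copy-versus-in-place difference directly, here is a small sketch using data.table's address() helper on toy data (an illustration, not part of the benchmark):
library(data.table)
DT <- data.table(height = c(0, 150, 0, 180))
address(DT$height)                    # address of the height vector
DT$height[DT$height == 0] <- NA       # base-style replacement
address(DT$height)                    # different address: the column was copied
DT2 <- data.table(height = c(0, 150, 0, 180))
address(DT2$height)
set(DT2, which(DT2[["height"]] == 0), "height", NA_real_)  # update by reference
address(DT2$height)                   # same address: no copy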
A speed test with one evaluation on 100 million rows:
library(data.table)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
DT2 <- copy(DT);DT3 <- copy(DT); DT4 <- copy(DT); DT5 <- copy(DT); DT6 <- copy(DT);DT7 <- copy(DT)
library(microbenchmark)
microbenchmark(
David = set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA),
OP = DT2[height == 0, height := NA],
akrun = setkey(DT3, "height")[.(0), height := NA],
isna = {is.na(DT4$height) <- DT4$height == 0},
assignNA = {DT5$height[DT5$height == 0] <- NA},
indexset = {setindex(DT6, height); DT6[height==0, height := NA_real_]},
exponent = DT7[, height:= NA^(!height)*height],
times=1L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# David 585.9044 585.9044 585.9044 585.9044 585.9044 585.9044 1
# OP 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 1
# akrun 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 1
# isna 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 1
# assignNA 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 1
# indexset 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 1
# exponent 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1
We can try exploiting the fact that NA^0 is 1 and NA^1 is NA: !height is TRUE exactly when height is 0, so NA^(!height) * height turns zeros into NA and leaves the other heights unchanged.
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.03 0.05 0.08
OP's code
system.time(DT[height == 0, height := NA])
# user system elapsed
# 0.42 0.04 0.49
A base R option that should be faster:
system.time(DT$height[DT$height == 0] <- NA)
# user system elapsed
# 0.19 0.05 0.23
and the is.na route
system.time(is.na(DT$height) <- DT$height == 0)
# user system elapsed
# 0.22 0.06 0.28
#DavidArenburg's suggestion
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.06 0.00 0.06
NOTE: All these benchmarks are done by freshly creating the dataset before each run, so as to provide unbiased timings. I could use microbenchmark, but there could be some bias across repeated runs, as the assignment happens in the 1st run.
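One common workaround for that bias is to rebuild the data inside each timed expression (a sketch; the copy() time is then included in every measurement, so read the numbers as relative):
library(data.table)
library(microbenchmark)
set.seed(24)
DT0 <- data.table(id = 1:1e6, height = sample(c(0, 100:240), 1e6, replace = TRUE))
microbenchmark(
  exponent = { DT <- copy(DT0); DT[, height := NA^(!height) * height] },
  set      = { DT <- copy(DT0); set(DT, which(DT[["height"]] == 0), "height", NA) },
  times = 10L
)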
Using a bigger dataset
set.seed(24)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.58 0.24 0.81
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.49 0.12 0.61
data
set.seed(24)
DT <- data.table(id = 1:1e7,
height = sample(c(0, 100:240), 1e7, replace = TRUE))
This is the code snippet from recommenderlab package, that takes matrix with ratings and returns top 5 elements for each user -
reclist <- apply(ratings, MARGIN=2, FUN=function(x)
head(order(x, decreasing=TRUE, na.last=NA), 5))
For a large matrix (>10K columns) it takes too long to run. Is there any way to rewrite it to be more efficient, maybe using the dplyr or data.table package? Writing C++ code is not an option for me.
An answer with data.table and base R
# 10000 column dummy matrix
cols <- 10000
mat <- matrix(rnorm(100*cols), ncol=cols)
With data.table:
library(data.table)
dt1 <- data.table(mat)
# sort every column, return first 5 rows
dt1[, lapply(.SD, sort, decreasing=T)][1:5]
system.time(dt1[, lapply(.SD, sort, decreasing=T)][1:5])
result:
user system elapsed
2.904 0.013 2.916
In plain old base, it's actually faster! (thanks for the comment Arun)
system.time(head(apply(mat, 2, sort, decreasing=T), 5))
user system elapsed
0.473 0.002 0.475
However, both are faster than the original code sample from the question, according to system.time():
system.time(
apply(mat, MARGIN=2, FUN=function(x) {
head(order(x, decreasing=TRUE, na.last=NA), 5)
}))
user system elapsed
3.063 0.031 3.094
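Note that both alternatives above return the top 5 values per column, whereas the original snippet returns the row indices of the top 5 entries. A sketch that keeps the index semantics with data.table (reusing the dummy mat from above; melt reshapes to long form first):
library(data.table)
long <- as.data.table(mat)[, row := .I]
long <- melt(long, id.vars = "row", variable.name = "col", value.name = "rating")
top5 <- long[order(-rating), head(row, 5L), by = col]   # row indices of the 5 largest ratings per column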
With the R package data.table, is it possible to find the values that are in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by #user1935457 (see the comment in #user1935457's post).
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from #user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# #user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should grow with a larger DT and a smaller IT. Thanks again to #user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
That out of the way, we need a bigger example to exhibit the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the Data Table, is a not-too-random subset of 1:n. IT, the Interval Table, is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As #mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
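For completeness, newer data.table versions (1.9.8+) support non-equi joins, which handle both the single-interval and the multi-interval case without scanning DT once per interval. A sketch on the small example data from the question (i.i and x.x are the usual join-prefix names for the columns of IT and DT):
library(data.table)
DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
IT <- data.table(i = 1:2, min = c(3, 20), max = c(10, 40))
# for each interval in IT, pull the x values of DT that fall inside it
DT[IT, .(i = i.i, x = x.x), on = .(x >= min, x <= max), nomatch = 0L]
#    i  x
# 1: 1  3
# 2: 1  5
# 3: 1  8
# 4: 2 21
# 5: 2 34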