Replacing impossible values with NA using R's data.table [duplicate] - r

This question already has an answer here:
Fastest method to replace data values conditionally in data.table (speed comparison)
(1 answer)
Closed 6 years ago.
I have code that replaces impossible values in a dataset with NA.
I'm trying to convert the code to be based on data.table. As an example, I replace a height of 0 with NA.
(Dummy) data
DT <- data.table(id = 1:5e6,
height = sample(c(0, 100:240), 5e6, replace = TRUE))
My current solution is slower and at least as verbose as my data.frame version. I assume I am doing something wrong...
DT[height == 0, height := NA]
While researching this question I found another solution which is much faster (but uglier).
set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA)
All suggestions appreciated.

Since v1.9.4, data.table by default automatically creates an index on columns during subsets of the form x == val and x %in% val used within a [.data.table call. This makes subsequent subsets very fast, with only a slightly higher price to pay on the first subset (since data.table's radix ordering is quite fast). The first subset can be slower because it includes the time to:
create the index
and then subset.
To illustrate this (using #akrun's data):
require(data.table)
getOption("datatable.auto.index") # [1] TRUE ===> enabled
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.396 0.059 0.452 ## first run
# 0.003 0.000 0.004 ## second run is very fast
Now if we disable auto indexing:
require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # [1] FALSE
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.037 0.007 0.042 ## first run
# 0.039 0.010 0.045 ## second run (~ 10x slower than 2nd run above)
options(datatable.auto.index = TRUE) # restore auto indexing if necessary
But your case is special because you update the same column that you subset on. In essence, this is what is happening:
The i expression is seen to be an expression that can be optimised for auto indexing. An index is created and saved for blazing fast subsets later on.
The j expression is seen and the column is updated.
The column on which the index was set has just been updated, so the index is removed.
The auto indexing logic should detect this and skip creating the index altogether when the column being subset is also the one being updated (i.e. whenever any rows match), since the created index is essentially useless.
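You can watch the index appear and disappear (a minimal sketch; indices() is exported in recent data.table versions, and newer releases may have changed the exact behaviour, e.g. by not creating the index here at all):
library(data.table)
DT <- data.table(id = 1:1e6,
                 height = sample(c(0, 100:240), 1e6, replace = TRUE))
DT[height == 0L]                 # plain subset: an auto-index on "height" is created
indices(DT)                      # typically "height"
DT[height == 0, height := NA]    # i uses/creates the index, j updates that very column
indices(DT)                      # the index on "height" is dropped again (NULL)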
Could you please file an issue on the project issues page? Just linking to this SO Q should be sufficient.
To answer your Q, disable auto indexing and run the subset, and it should be more or less equal to the time you get with set().
A base R solution just can't be faster here, since it copies the entire column just to update those entries; but that is simply how base R chose to implement the replacement.
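A rough way to see that copy (a sketch only; the exact tracemem output depends on your R build and version):
x <- data.frame(height = c(0, 150, 0, 180))
tracemem(x$height)                # trace the column vector
x$height[x$height == 0] <- NA     # the base R replacement duplicates the whole column,
                                  # so tracemem typically reports a copy here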

A speed test with one evaluation on 100 million rows:
library(data.table)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
DT2 <- copy(DT); DT3 <- copy(DT); DT4 <- copy(DT); DT5 <- copy(DT); DT6 <- copy(DT); DT7 <- copy(DT)
library(microbenchmark)
microbenchmark(
David = set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA),
OP = DT2[height == 0, height := NA],
akrun = setkey(DT3, "height")[.(0), height := NA],
isna = {is.na(DT4$height) <- DT4$height == 0},
assignNA = {DT5$height[DT5$height == 0] <- NA},
indexset = {setindex(DT6, height); DT6[height==0, height := NA_real_]},
exponent = DT7[, height:= NA^(!height)*height],
times=1L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# David 585.9044 585.9044 585.9044 585.9044 585.9044 585.9044 1
# OP 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 1
# akrun 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 1
# isna 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 1
# assignNA 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 1
# indexset 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 1
# exponent 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1

We can try
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.03 0.05 0.08
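The exponent trick works because NA^0 is 1 while NA^1 is NA, so NA^(!height) is 1 wherever height is non-zero and NA where it is zero. A tiny check:
h <- c(0, 150, 0, 180)
NA^(!h)        # NA   1  NA   1
NA^(!h) * h    # NA 150  NA 180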
OP's code
system.time(DT[height == 0, height := NA])
# user system elapsed
# 0.42 0.04 0.49
A base R option that should be faster:
system.time(DT$height[DT$height == 0] <- NA)
# user system elapsed
# 0.19 0.05 0.23
and the is.na route
system.time(is.na(DT$height) <- DT$height == 0)
# user system elapsed
# 0.22 0.06 0.28
#DavidArenburg's suggestion
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.06 0.00 0.06
NOTE: All these benchmarks were done by freshly creating the dataset before each run so as to provide unbiased timings. I could use microbenchmark, but there could be some bias across runs because the assignment happens in the first run.
Using a bigger dataset
set.seed(24)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.58 0.24 0.81
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.49 0.12 0.61
data
set.seed(24)
DT <- data.table(id = 1:1e7,
height = sample(c(0, 100:240), 1e7, replace = TRUE))

Related

Row maximum in data table

I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?
Though you are seeking a data.table solution, here is a base R solution that should be fast enough for your dataset.
indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]
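To see what these two lines do, here is a tiny made-up example:
m <- data.frame(a = c(1, 9, 2), b = c(5, 3, 8), c = c(4, 4, 4))
indx <- max.col(m, ties.method = 'first')   # 2 1 2: the column holding each row's maximum
m[cbind(1:nrow(m), indx)]                   # 5 9 8: the row maxima, via matrix indexing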
On a slightly bigger dataset, system.time comparisons revealed
system.time({
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
})
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
# user system elapsed
#1265.792 2.305 1267.836
i.e. the base R method is faster than the data.table method in the post.
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10,
replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999
set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
For the maximum over columns in a data.table,
dt[, max:= do.call(pmax, .SD)]
is much faster than dt[, nmax := max(.SD), by = 1:nrow(dt)], and faster than the above base R solution:
library(data.table)
ncols=100
nrows=8000000
dfi <- as.data.frame(matrix(runif(ncols*nrows), ncol = ncols, nrow = nrows))
df=dfi
system.time({
indx <- max.col(df, ties.method='first')
df$max <- df[cbind(1:nrow(df), indx)]
})
# user system elapsed
# 8.89 1.37 10.45
dt <- as.data.table(dfi)
system.time({
dt[, max:= do.call(pmax, .SD)]
})
# user system elapsed
# 3.31 0.01 3.33
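The reason this is fast: .SD is a list of the columns, so do.call(pmax, .SD) expands to pmax(V1, V2, ..., V100), one vectorised element-wise (row-wise) maximum over whole columns. A small sketch:
library(data.table)
small <- data.table(V1 = c(1, 9, 2), V2 = c(5, 3, 8), V3 = c(4, 4, 4))
small[, max := do.call(pmax, .SD)]
small$max    # 5 9 8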
Once you have calculated the Colmax index (the column holding each row's maximum), you can use it to retrieve the rows whose maximum lies in a given column:
dt[Colmax == <value>]
or,
dt[J(<values>), on = 'Colmax']
Also, wrong syntax in
dt[, nmax := max(.SD), by = pmxid]
this collates a vector of length nrow(dt) * length(.SD) (see the Note in the Description of max())
Instead try:
dt[, nmax := apply(.SD, 1, max), by = pmxid]
or, the parallel maximum (pmax() needs the columns as separate arguments, hence do.call):
dt[, nmax := do.call(pmax, .SD), by = pmxid]

Group by variable in data.table and carry on other variables

I am working on some summaries for financial datasets and I would like to sort the summary with regard to a certain criterion, but without losing the remaining summary values in a row. Here is a simple example:
set.seed(1)
tseq <- seq(Sys.time(), length.out = 36, by = "mins")
dt <- data.table(TM_STMP = tseq, COMP = rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 3), SEC = rep(letters[1:12],3), VOL = rpois(36, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
dt2
COMP SEC SUM MEAN
1: A a 9000329 3000110
2: A b 9001274 3000425
3: A c 9003505 3001168
4: A d 9002138 3000713
Now I would like to get the SEC per COMP with highest VOL:
dt3 <- dt2[, list(SUM = max(SUM)), by = list(COMP)]
dt3
COMP SUM
1: A 9003505
2: B 9002888
3: C 9005042
This gives me what I want, but I would like to keep the other values in the specific rows (SEC and MEAN) such that it looks like this (made by hand):
COMP SUM SEC MEAN
1: A 9003505 c 3001168
2: B 9002888 f 3000963
3: C 9005042 k 3001681
How can I achieve this?
If you are looking for the SEC and the MEAN corresponding to max of SUM:
dt3 <- dt2[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
> dt3
COMP SUM SEC MEAN
1: A 9003110 a 3001037
2: B 9000814 e 2999612
3: C 9002707 i 2999741
Edit: This'll be faster:
dt2[dt2[, .I[which.max(SUM)], by = list(COMP)]$V1]
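How the .I idiom works, on a toy table (data made up): .I holds the row numbers of the original table, so the inner call returns one row number per group, which is then used to subset dt2.
library(data.table)
toy <- data.table(COMP = c("A", "A", "B", "B"), SUM = c(1, 7, 5, 2))
toy[, .I[which.max(SUM)], by = COMP]           # V1 = global row numbers 2 and 3
toy[toy[, .I[which.max(SUM)], by = COMP]$V1]   # the full rows holding each group's maximum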
Another way to do this would be to set the key of the data.table to COMP, SUM and then use mult="last" as follows (since the key sorts rows by SUM within each COMP, the last row of each COMP holds the maximum):
setkey(dt2, COMP, SUM)
dt2[J(unique(COMP)), mult="last"]
# COMP SEC SUM MEAN
# 1: A c 9002500 3000833
# 2: B g 9003312 3001104
# 3: C i 9000058 3000019
Edit: To answer Simon's benchmarking about the speed differences between this and #Metrics' approach:
set.seed(45)
N <- 1e6
tseq <- seq(Sys.time(), length.out = N, by = "mins")
ff <- function(x) paste(sample(letters, x, TRUE), collapse="")
val1 <- unique(unlist(replicate(1e5, ff(8), simplify=FALSE)))
val2 <- unique(unlist(replicate(1e5, ff(12), simplify=FALSE)))
dt <- data.table(TM_STMP = tseq, COMP = rep(val1, each=100), SEC = rep(val2, each=100), VOL = rpois(1e6, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
require(microbenchmark)
metrics <- function(x=copy(dt2)) {
x[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
}
arun <- function(x=copy(dt2)) {
setkey(x, COMP, SUM)
x[J(unique(COMP)), mult="last"]
}
microbenchmark(ans1 <- metrics(dt2), ans2 <- arun(dt2), times=20)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- metrics(dt2) 749.0001 804.0651 838.0750 882.3869 1053.3389 20
# ans2 <- arun(dt2) 301.7696 321.6619 342.4779 359.9343 392.5902 20
setkey(ans1, COMP, SEC)
setkey(ans2, COMP, SEC)
setcolorder(ans1, names(ans2))
identical(ans1, ans2) # [1] TRUE
From your sample output it's not exactly clear what you would like to keep or drop, but you can simply list your additional columns in the j argument of DT[i, j]:
> dt2[, list(SUM = max(SUM), SEC, MEAN), by = list(COMP)]
COMP SUM SEC MEAN
1: A 9007273 a 3000131
2: A 9007273 b 3000938
3: A 9007273 c 2999502
4: A 9007273 d 3002424
5: B 9004829 e 3001610
6: B 9004829 f 2999991
7: B 9004829 g 2998471
8: B 9004829 h 2999571
9: C 9002479 i 3000826
10: C 9002479 j 2999826
11: C 9002479 k 3000728
12: C 9002479 l 2999634
I was very interested in the performance of the two different approaches: the one from #Metrics, which I denote in the following as which.func, and the one from #Arun, which I denote as innate.func. So I did some benchmarking with the example given in my question above. Here are the results:
which.func <- function() {dt3 <- dt2[, list(SUM = max(SUM), SEC=SEC[which.max(SUM)], MENA=MEAN[which.max(SUM)]), by = list(COMP)]}
innate.func <- function() {dt3 <- dt2[J(unique(COMP)), mult = "last"]}
library(rbenchmark)
benchmark(which.func, innate.func, replications = 10e+6)
test replications elapsed relative user.self sys.self
2 innate 10000000 24.689 1.000 24.259 0.425
1 which.func 10000000 32.664 1.323 32.216 0.446
Of course this is maybe a little unfair towards which.func, because innate.func involves a call to setkey, which is time-consuming especially for large samples. If I include the setkey call in the function I get the following:
innate.func <- function() {setkey(dt2, COMP, SUM); dt3 <- dt2[J(unique(COMP)), mult = "last"]; setkey(dt2, NULL)}
test replications elapsed relative user.self sys.self
2 innate.func 10000000 25.271 1.000 24.834 0.430
1 which.func 10000000 26.476 1.048 26.062 0.397
It seems that the two approaches have very similar performance. The approach of #Arun has perhaps a more elegant style with regard to data.table and needs less code. Its disadvantage may come with aggregation functions other than max or min, where the approach of #Metrics shows its strength of being applicable in a more general setting.
I learned from both approaches and put them into my toolbox.
During my further work with the solutions given here, I encountered another problem with the summary shown in my question above, and I found a solution that I would like to share.
If I want to provide a choice to the user for
an aggregation function, denoted by aggregate and
a criterion (variable of the summary) the aggregate method should be applied to, denoted by crit,
then I encounter the problem that I have to check which of the columns remain (see e.g. #Metrics' answer that uses which.max). A simple example:
We take the data.table dt2 from my question above. A user now wants to apply the aggregate = "max" method to the crit = "SUM" variable in the summary data.table dt2. Here is a solution I found that works fine (any discussion is of course appreciated):
aggregate = "max"
crit = "SUM"
mycall <- expression(do.call(aggregate, list(get(crit))))
dt2[, .SD[which(get(crit) == eval(mycall))], by = COMP]
dt2
COMP SEC SUM MEAN
1: A c 9002500 3000833
2: B g 9003312 3001104
3: C i 9000058 3000019
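The same pattern with a different (purely hypothetical) user choice, e.g. the row with the minimal MEAN per COMP:
aggregate <- "min"
crit <- "MEAN"
mycall <- expression(do.call(aggregate, list(get(crit))))
dt2[, .SD[which(get(crit) == eval(mycall))], by = COMP]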

Use data.table to select non-unique rows

I have a large table consisting of several genes (newID) with associated values. Some genes (newID) are unique, some have several instances (appear in multiple rows). How can I exclude from the table those with only one occurrence (row)? In the example below, only the last row would be removed as it is unique.
head(exons.s, 10)
Row.names exonID pvalue log2fold.5_t.GFP_t. newID
1 ENSMUSG00000000001_Gnai3:E001 E001 0.3597070 0.029731989 ENSMUSG00000000001
2 ENSMUSG00000000001_Gnai3:E002 E002 0.6515167 0.028984837 ENSMUSG00000000001
3 ENSMUSG00000000001_Gnai3:E003 E003 0.8957798 0.009665072 ENSMUSG00000000001
4 ENSMUSG00000000001_Gnai3:E004 E004 0.5308266 -0.059273822 ENSMUSG00000000001
5 ENSMUSG00000000001_Gnai3:E005 E005 0.4507640 -0.061276835 ENSMUSG00000000001
6 ENSMUSG00000000001_Gnai3:E006 E006 0.5147357 -0.068357886 ENSMUSG00000000001
7 ENSMUSG00000000001_Gnai3:E007 E007 0.5190718 -0.063959853 ENSMUSG00000000001
8 ENSMUSG00000000001_Gnai3:E008 E008 0.8999434 0.032186993 ENSMUSG00000000001
9 ENSMUSG00000000001_Gnai3:E009 E009 0.5039369 0.133313175 ENSMUSG00000000001
10 ENSMUSG00000000003_Pbsn:E001 E001 NA NA ENSMUSG00000000003
> dim(exons.s)
[1] 234385 5
With plyr I would go about it like this:
## remove single exon genes:
multEx <- function(df){
if (nrow(df) > 1){return(df)}
}
genes.mult.ex <- ddply(exons.s , .(newID), multEx, .parallel=TRUE)
But this is very slow. I thought this would be easy with data.table but I can't figure it out:
exons.s <- data.table(exons.s, key="newID")
x.dt.out <- exons.s[, lapply(.SD, multEx), by=newID]
I am new to data.table so any pointers in the right direction would be welcome.
Create a column giving the number of rows in each group, then subset:
exons.s[,n:=.N,by=newID]
exons.s[n>1]
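A tiny illustration of what the .N counter does (hypothetical data):
library(data.table)
ex <- data.table(newID = c("g1", "g1", "g2"), exonID = c("E1", "E2", "E1"))
ex[, n := .N, by = newID]
ex           # the two g1 rows get n = 2, the single g2 row gets n = 1
ex[n > 1]    # keeps only genes appearing in more than one row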
There is a simpler and more efficient way of doing this using the duplicated() function instead of counting the group sizes.
First we need to generate a test dataset:
# Generate test datasets
smallNumberSampled <- 1e3
largeNumberSampled <- 1e6
smallDataset <- data.table(id=paste('id', 1:smallNumberSampled, sep='_'), value1=sample(x = 1:26, size = smallNumberSampled, replace = T), value2=letters[sample(x = 1:26, size = smallNumberSampled, replace = T)])
largeDataset <- data.table(id=paste('id', 1:largeNumberSampled, sep='_'), value1=sample(x = 1:26, size = largeNumberSampled, replace = T), value2=letters[sample(x = 1:26, size = largeNumberSampled, replace = T)])
# add 2 % duplicated rows:
smallDataset <- rbind(smallDataset, smallDataset[sample(x = 1:nrow(smallDataset), size = nrow(smallDataset)* 0.02)])
largeDataset <- rbind(largeDataset, largeDataset[sample(x = 1:nrow(largeDataset), size = nrow(largeDataset)* 0.02)])
Then we implement the three solutions as functions:
# Original suggestion
getDuplicatedRows_Count <- function(dt, columnName) {
dt[,n:=.N,by=columnName]
return( dt[n>1] )
}
# Duplicated using subsetting
getDuplicatedRows_duplicated_subset <- function(dt, columnName) {
# .. means "look up one level"
return( dt[which( duplicated(dt[, ..columnName]) | duplicated(dt[, ..columnName], fromLast = T) ),] )
}
# Duplicated using the "by" argument to avoid copying
getDuplicatedRows_duplicated_by <- function(dt, columnName) {
return( dt[which( duplicated(dt, by = columnName) | duplicated(dt, by = columnName, fromLast = TRUE) ),] )
}
Then we test that they give the same results
results1 <- getDuplicatedRows_Count (smallDataset, 'id')
results2 <- getDuplicatedRows_duplicated_subset(smallDataset, 'id')
results3 <- getDuplicatedRows_duplicated_by(smallDataset, 'id')
> identical(results1, results2)
[1] TRUE
> identical(results2, results3)
[1] TRUE
And then we time the average performance of the three solutions:
# Small dataset
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_Count (smallDataset, 'id')) ) / 100
user system elapsed
0.00176 0.00007 0.00186
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_subset(smallDataset, 'id')) ) / 100
user system elapsed
0.00206 0.00005 0.00221
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_by (smallDataset, 'id')) ) / 100
user system elapsed
0.00141 0.00003 0.00147
#Large dataset
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_Count (largeDataset, 'id')) ) / 100
user system elapsed
0.28571 0.01980 0.31022
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_subset(largeDataset, 'id')) ) / 100
user system elapsed
0.24386 0.03596 0.28243
> system.time( temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_by (largeDataset, 'id')) ) / 100
user system elapsed
0.22080 0.03918 0.26203
Which shows that the duplicated() approach scales better, especially if the "by=" option is used.
UPDATE 21 Nov 2014: The test of identical output (as suggested by Arun, thanks) identified a problem with my use of data.table v1.9.2, where duplicated's fromLast does not work. I updated to v1.9.4 and redid the analysis, and now the differences are much smaller.
UPDATE 26 Nov 2014: Included and tested the "by=" approach to extract the column from the data.table (as suggested by Arun, so credit goes there). Furthermore, the runtime was averaged over 100 tests to ensure correctness of the result.

Find values in a given interval without a vector scan

With the R package data.table, is it possible to find the values that are in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by #user1935457 (see the comment in #user1935457's post).
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from #user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# #user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should be more noticeable with a larger DT and a smaller IT. Thanks again to #user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
That out of the way, we need a bigger example to exhibit the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the Data Table, is a not-too-random subset of 1:n. IT, the Interval Table, is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
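For readers unfamiliar with the roll argument used above, here is a minimal toy sketch (the data is made up):
library(data.table)
A <- data.table(x = c(1, 5, 10), val = c("a", "b", "c"), key = "x")
B <- data.table(x = c(4, 11))
A[B, roll = Inf]    # x = 4 picks up val "a" (rolled forward from x = 1); x = 11 gets "c"
A[B, roll = -Inf]   # rolls backward instead: x = 4 gets "b" (from x = 5); x = 11 gets NA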
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table :
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT : As #mnel pointed out, %between% still does vector scans. The Note section of the help page says :
Current implementation does not make use of ordered keys.
So this doesn't answer your question.

return type for j parameter in data.table

I have been using data.table for some computation and am wondering what the possible return types for the j parameter are, so that it stacks up my output correctly. I know data.frame is acceptable, so list must be as well? My function returns multiple rows and multiple columns for each id. So imagine:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(c(dt$a+1, dt$b+1, dt$c+1))}
dtb[,f(.SD), by=id]
This clearly does not work properly. This does:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(data.frame(a=dt$a+1, b=dt$b+1, c=dt$c+1))}
dtb[,f(.SD), by=id]
Constructing these data.frames seems like a really inefficient way to do things. What are some suggestions? The by must be used.
Your approach to the j component is not native data.table-speak.
It is worth reading the data.table wiki on do's and don'ts regarding data.table syntax (using data.frame is terrible in terms of performance!).
You may also refer to this question, and perhaps you will start to understand how using j and list works.
You are passing a list of expressions that will be evaluated within the data.table (or a grouped subset thereof).
These are unevaluated expressions, and (currently) the function [ relies on seeing list to properly evaluate them within the correct environment (the data.table, or .SD, the grouped subset).
This call will work
dtb[,list(a = a+1, b = b + 1, c = c+1), by = id]
As will this (passing an unevaluated expression which happens to be a call to list(...)):
library(plyr) # for as.quoted
my_list <- as.quoted(paste('list(',paste(letters[1:3], '=', letters[1:3], '+1',collapse= ','),')'))[[1]]
my_list
## list(a = a + 1, b = b + 1, c = c + 1)
dtb[,eval(my_list), by = id]
There is also the possibility of combining a call to lapply(.SD, a_function) with .SDcols. The .SDcols argument lets you pass a character vector of column names on which you want the function to be evaluated, so this will work:
dtb[, lapply(.SD,base::'+',1),by= id, .SDcols = c('a','b','c')]
or
dtb[,lapply(.SD, .Primitive('+'),1), by= id, .SDcols = c('a','b','c')]
note that I called base::'+' or .Primitive('+') instead of '+', as data.table cannot find '+' as a function
Benchmarking
Benchmarking these solutions
benchmark(
lstdt = dtb[, flst(.SD), by = id],
dfdt = dtb[, fdf(.SD), by = id],
lapplySD = dtb[, lapply(.SD, base::'+', 1), by = id, .SDcols = c('a','b','c')],
lapplySD2 = dtb[, lapply(.SD, .Primitive('+'), 1), by = id, .SDcols = c('a','b','c')],
just_list = dtb[, list(a = a+1, b = b+1, c = c+1), by = id],
eval_mylist = dtb[, eval(my_list), by = id],
replications = 10^2
)
## test replications elapsed relative user.self
## 2 dfdt 100 0.36 4.000000 0.34
## 6 eval_mylist 100 0.09 1.000000 0.10
## 5 just_list 100 0.11 1.222222 0.10
## 3 lapplySD 100 0.14 1.555556 0.14
## 4 lapplySD2 100 0.11 1.1 0.11
## 1 lstdt 100 0.18 2.000000 0.17
The unevaluated expression (passing the list of expressions) is the fastest, which is consistent with Matthew Dowle's points in this previous question.
When you wrote c(dt$a+1, dt$b+1, dt$c+1) you should have expected a single vector (plus the group id column). Try this instead:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(list(dt$a+1, dt$b+1, dt$c+1))}
dtb[,f(.SD), by=id]
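The difference in one line each: c() flattens everything into a single vector (so j returns one stacked column), while list() keeps separate elements (so j returns separate columns):
length(c(1:3, 4:6, 7:9))       # 9  -> one long vector, i.e. a single stacked column
length(list(1:3, 4:6, 7:9))    # 3  -> three elements, i.e. three columns from j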
EDIT2 (there was an error in my earlier edit that I only noticed when posting the full code). To the question about "cheaper": Here's a benchmark run that shows list construction to be 'cheaper':
flst <- function(dt) { return(list(dt$a+1, dt$b+1, dt$c+1))}
fdf <- function(dt) { return(data.frame(dt$a+1, dt$b+1, dt$c+1))}
require(rbenchmark)
benchmark(
lstdt=dtb[ , flst(.SD), by=id],
dfdt=dtb[ , fdf(.SD), by=id],
replications=10^2
)
test replications elapsed relative user.self sys.self user.child sys.child
2 dfdt 100 0.466 2.89441 0.457 0.010 0 0
1 lstdt 100 0.161 1.00000 0.159 0.003 0 0
