Find values in a given interval without a vector scan - r

With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
Is it possible to do the same thing for a set of non-overlapping intervals, such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot

Here is a variation of the code proposed by @user1935457 (see the comment in @user1935457's post).
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
# below is the line that differs from @user1935457's code:
# using IT to address the lines of DT creates a smaller intermediate table;
# we can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# @user1935457's code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should become more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for your answer.

First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
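If depending on the unexported function is a concern, the same index expansion can be built from exported tools; a minimal stand-in, assuming R >= 4.0 (where sequence() gained a from argument):
# stand-in for data.table:::vecseq(from, len, NULL): expand each start index
# into the run from, from+1, ..., from+len-1, all concatenated
vecseq_safe <- function(from, len) sequence(nvec = len, from = from)
# on older R versions, unlist(Map(seq.int, from, length.out = len)) does the same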
That out of the way, we need a bigger example to exhibit the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the data table, is a not-too-random subset of 1:n. IT, the interval table, holds ni = n / 10 non-overlapping intervals within 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE

If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
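For completeness: data.table 1.9.8 and later support non-equi joins, which answer both the single-interval case and the bonus question directly, using binary search on the join conditions rather than a vector scan. A minimal sketch with the DT and I tables from the question (the single-interval case is just an I with one row):
# for each row of DT, find the interval in I that contains x
ans <- I[DT, on = .(min <= x, max >= x), nomatch = 0L, .(i, x = i.x)]
# ans should match the expected output shown in the question:
# column i is the interval id, column x the DT values falling inside that interval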

Related

Sum over 7 day rolling window in R [duplicate]

I want to get the rolling 7-day sum by ID. Suppose my data looks like this:
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
Date USD ID
1 2014-05-01 1 1
2 2014-05-04 2 2
3 2014-05-07 3 1
4 2014-05-10 4 2
5 2014-05-13 5 1
6 2014-05-16 6 2
7 2014-05-19 1 1
8 2014-05-22 2 2
9 2014-05-25 3 1
10 2014-05-28 4 2
How can I add a new column that will contain the rolling 7-day sum by ID?
If your data is big, you might want to check out this solution which uses data.table. It is pretty fast. If you need more speed, you can always change mapply to mcmapply and use multiple cores.
#Load data.table and convert to data.table object
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})]
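As noted above, mapply can be swapped for mcmapply from the base parallel package when more speed is needed; a sketch of that variant (it uses forking, so it will not parallelize on Windows; mc.cores = 2L is just an example value):
# Parallel variant of the same rolling computation
library(parallel)
data[, Roll.Val := mcmapply(RD = Date, NUM = ID2, mc.cores = 2L,
  FUN = function(RD, NUM) {
    d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
    sum((d <= 0 & d >= -7) * Ref$Compare_Value[[NUM]])
  })]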
The dataset provided by the OP does not expose the complexity of the task. In terms of addressing the OP's question, so far only Mike's answer was the correct one.
In fact it is correct for an 8-day rolling window rather than a 7-day one, because d <= 0 & d >= -7 spans eight days.
The zoo solution by @G. Grothendieck is almost valid; it would only be correct if the merge were made within each group of ID.
Below is a second data.table solution, this time with valid results, using the development version of RcppRoll, which allows na.rm=TRUE.
And a slightly reformatted version of Mike's solution output.
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
library(microbenchmark)
library(RcppRoll) # install_github("kevinushey/RcppRoll")
library(data.table) # install_github("Rdatatable/data.table")
correct_jan_dt = function(n, partial=TRUE){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,ID,Date)
r = DT[CJ(unique(ID),all.dates)][, c("roll") := as.integer(roll_sumr(USD, n, normalize = FALSE, na.rm = TRUE)), by="ID"][!is.na(USD)]
# This could be simplified when `partial` arg will be implemented in [kevinushey/RcppRoll](https://github.com/kevinushey/RcppRoll)
if(isTRUE(partial)){
r[is.na(roll), roll := cumsum(USD), by="ID"][]
}
return(r[order(Date,ID)])
}
correct_mike_dt = function(){
data = as.data.table(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[, c("roll") := mapply(RD = Date,NUM=ID2, function(RD, NUM){
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})][,ID2:=NULL][]
}
identical(correct_mike_dt(), correct_jan_dt(n=8,partial=TRUE))
# [1] TRUE
microbenchmark(unit="relative", times=5L, correct_mike_dt(), correct_jan_dt(8))
# Unit: relative
# expr min lq mean median uq max neval
# correct_mike_dt() 274.0699 273.9892 267.2886 266.6009 266.2254 256.7296 5
# correct_jan_dt(8) 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 5
Looking forward to an update from @Khashaa.
Edit (20150122.2): The benchmarks below do not answer the OP's question.
Timing on a bigger (still very tiny) dataset, 5439 rows:
library(zoo)
library(data.table)
library(dplyr)
library(RcppRoll)
library(microbenchmark)
data<-as.data.frame(matrix(NA,5439,3))
data$V1<-seq(as.Date("1970-01-01"),as.Date("2014-09-01"),by=3)
data$V2<-sample(1:6,5439,TRUE)
data$V3<-sample(c(1,2),5439,TRUE)
colnames(data)<-c("Date","USD","ID")
zoo_f = function(){
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
}
dt_f = function(){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,Date)
DT[.(all.dates)
][order(Date), c("roll") := rowSums(setDT(shift(USD, 0:6, NA, "lag")),na.rm=FALSE), by="ID"
][!is.na(ID)]
}
dp_f = function(){
data %>% group_by(ID) %>%
mutate(roll=roll_sum(c(rep(NA,6), USD), 7))
}
dt2_f = function(){
# this can be speedup by setDT()
as.data.table(data)[, c("roll") := roll_sum(c(rep(NA,6), USD), 7), by="ID"][]
}
identical(as.data.table(zoo_f()),dt_f())
# [1] TRUE
identical(setDT(as.data.frame(dp_f())),dt_f())
# [1] TRUE
identical(dt2_f(),dt_f())
# [1] TRUE
microbenchmark(unit="relative", times=20L, zoo_f(), dt_f(), dp_f(), dt2_f())
# Unit: relative
# expr min lq mean median uq max neval
# zoo_f() 140.331889 141.891917 138.064126 139.381336 136.029019 137.730171 20
# dt_f() 14.917166 14.464199 15.210757 16.898931 16.543811 14.221987 20
# dp_f() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# dt2_f() 1.536896 1.521983 1.500392 1.518641 1.629916 1.337903 20
Yet I'm not sure if my data.table code is already optimal.
The functions above did not answer the OP's question; read the top of the post for the update. Mike's solution was the correct one.
1) Assuming you mean every successive overlapping 7 rows for that ID:
library(zoo)
transform(data, roll = ave(USD, ID, FUN = function(x) rollsumr(x, 7, fill = NA)))
2) If you really did mean 7 days and not 7 rows then try this:
library(zoo)
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
Update: Added (2) and made some improvements.
library(data.table)
data <- data.table(Date = seq(as.Date("2014-05-01"),
as.Date("2014-09-01"),
by = 3),
USD = rep(1:6, 7),
ID = rep(c(1, 2), 21))
data[, Rolling7DaySum := {
d <- data$Date - Date
sum(data$USD[ID == data$ID & d <= 0 & d >= -7])
},
by = list(Date, ID)]
I found that there is a problem with Mike.Gahan's suggested code and corrected it as below after testing it out.
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
Ref <-data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))),by=c("ID2")]
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref[ID2 == NUM,]$Compare_Date[[1]] - RD)
sum((d <= 0 & d >= -7)*Ref[ID2 == NUM,]$Compare_Value[[1]])})]

Is there a RAM efficient way to calculate the median over a complement set?

I am looking for a RAM-efficient way to calculate the median over a complement set with the help of data.table.
For a set of observations from different groups, I am interested in an implementation of a median of the "other groups". I.e., if I have a data.table with one value column and one grouping column, I want, for each group, to calculate the median of the values in all other groups except the current group. E.g. for group 1 we calculate the median from all values except the values that belong to group 1, and so on.
A concrete example data.table
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2))
dt
# value groupId
# 1: 1 1
# 2: 2 1
# 3: 3 2
# 4: 4 2
# 5: 5 2
I would like the medianOfAllTheOtherGroups to be defined as 1.5 for group 2
and defined as 4 for group 1, repeated for each entry in the same data.table:
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2), medianOfAllTheOtherGroups = c(4, 4, 1.5, 1.5, 1.5))
dt
# value groupId medianOfAllTheOtherGroups
# 1: 1 1 4.0 # median of all groups _except_ 1
# 2: 2 1 4.0
# 3: 3 2 1.5 # median of all groups _except_ 2
# 4: 4 2 1.5
# 5: 5 2 1.5
To calculate the median for each group only once, and not for each observation, we went for an implementation with a loop.
The current complete implementation works nicely for small data.tables as input, but suffers badly from RAM consumption for larger data sets, with the medians called in a loop as the bottleneck (note: for the real use case we have a dt with 3,000,000 rows and 100,000 groups).
I have worked very little on improving RAM consumption. Can an expert help to improve RAM usage for the minimal example that I provide below?
MINIMAL EXAMPLE:
library(data.table)
set.seed(1)
numberOfGroups <- 10
numberOfValuesPerGroup <- 100
# Data table with columns:
# groupId - id of the group each observation belongs to
# value - value we want to calculate the median over
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# calculate the median from all observations for those groups that do not
# require a separate treatment
medianOfAllGroups <- median(dt$value)
dt$medianOfAllTheOtherGroups <- medianOfAllGroups
# generate extra data.table to collect results for selected groups
includedGroups <- dt[, unique(groupId)]
dt_otherGroups <-
data.table(groupId = includedGroups,
medianOfAllTheOtherGroups = as.numeric(NA)
)
# loop over all selected groups and calculate the median from all observations
# except those that belong to this group
for (id in includedGroups){
dt_otherGroups[groupId == id,
medianOfAllTheOtherGroups := median(dt[groupId != id, value])]
}
# merge subset data to overall data.table
dt[dt_otherGroups, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c("groupId")]
PS: here is the example output for 10 groups with 100 observations each:
dt
# groupId value medianOfAllTheOtherGroups
# 1: 1 0.2655 0.48325
# 2: 1 0.3721 0.48325
# 3: 1 0.5729 0.48325
# 4: 1 0.9082 0.48325
# 5: 1 0.2017 0.48325
# ---
# 996: 10 0.7768 0.48590
# 997: 10 0.6359 0.48590
# 998: 10 0.2821 0.48590
# 999: 10 0.1913 0.48590
# 1000: 10 0.2655 0.48590
Some numbers for different settings of the minimal example (tested on a MacBook Pro with 16 GB RAM):
numberOfGroups  numberOfValuesPerGroup  Memory (GB)  Runtime (s)
500             50                        0.48         1.47
5000            50                       39.00        58.00
50              5000                      0.42         0.65
All memory values were extracted from the output of profvis.
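For reference, a sketch of how such memory figures can be obtained with profvis, assuming the objects from the minimal example above are already in the workspace:
library(profvis)
# profile the loop that computes the complement medians; peak allocation can be
# read off the memory column / flame graph of the profvis output
profvis({
  for (id in includedGroups) {
    dt_otherGroups[groupId == id,
                   medianOfAllTheOtherGroups := median(dt[groupId != id, value])]
  }
})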
How about this approach:
setkey(dt, groupId)
dt[, median_val := median(dt$value[dt$groupId != groupId]), by = .(groupId)]
For the case of 5000 groups with 50 values each, this took ~34 seconds on my MBP. I haven't checked RAM usage, though.
Edit: here's another version with two changes, (1) using collapse::fmedian as suggested by Henrik and (2) pre-aggregating the values into a list column by group.
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, median_val :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
This took around 18 seconds for the 5000/50 example on my machine.
RAM usage: approach 1 ~28 GB, approach 2 ~15 GB, according to profvis.
Disclaimer: For some reason the profiling keeps crashing my session, so unfortunately I have no such results. However, because my alternatives were a bit faster than the OP's, I thought it could still be worth posting them so that the OP may assess their memory use.
Data
# numberOfGroups <- 5000
# numberOfValuesPerGroup <- 50
# dt <- ...as in OP...
d1 = copy(dt)
d1[ , ri := .I] # just to be able to restore the original order when comparing results with the OP's
d2 = copy(dt)
d3 = copy(dt)
Explanation
I shamelessly borrow lines 28, 30-32 from median.default to make a stripped-down version of median.
Calculate total number of rows in the original data (nrow(d1)). Order data by 'value' (setorder). By ordering, two instances of sort in the median code can be removed.
For each 'groupID' (by = groupId):
Calculate length of "other" (number of rows in the original data minus number of rows of current group (.N)).
Calculate median, where the input values are d1$value[-.I], i.e. the original values except the indices of the current group; ?.I: "While grouping, it holds for each item in the group, its row location in x".
Code & Timing
system.time({
# number of rows in original data
nr = nrow(d1)
# order by value
setorder(d1, value)
d1[ , med := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) d1$value[-.I][half]
else mean(d1$value[-.I][half + 0L:1L])
}, by = groupId]
})
# user system elapsed
# 4.08 0.01 4.07
# OP's code on my (old) PC
# user system elapsed
# 84.02 7.26 86.75
# restore original order & check equality
setorder(d1, ri)
all.equal(dt$medianOfAllTheOtherGroups, d1$med)
# [1] TRUE
Comparison with base::median & collapse::fmedian
I also tried the "-.I" with base::median and collapse::fmedian, where the latter was about twice as fast as base::median.
system.time(
d2[ , med := median(d2$value[-.I]), by = groupId]
)
# user system elapsed
# 26.86 0.02 26.85
library(collapse)
system.time(
d3[ , med := fmedian(d3$value[-.I]), by = groupId]
)
# user system elapsed
# 16.95 0.00 16.96
all.equal(dt$medianOfAllTheOtherGroups, d2$med)
# TRUE
all.equal(dt$medianOfAllTheOtherGroups, d3$med)
# TRUE
Thanks a lot to @Cole for helpful comments which improved the performance.
The median is the midpoint of a dataset that's been ordered. For an odd number of values in a dataset, the median is simply the middle number. For an even number of values in a dataset, the median is the mean of the two numbers closest to the middle.
To demonstrate, consider the simple vector of 1:8
1 | 2 | 3 |** 4 | 5 **| 6 | 7 | 8
In this case, our midpoint is 4.5. And because this is a very simple example, the median itself is 4.5
Now consider groupings where one grouping is the first value of the vector. That is, our group is only 1. We know that this will shift our median towards the right (i.e. larger) because we removed a low value of the distribution. Our new distribution is 2:8 and the median is now 5.
2 | 3 | 4 | *5* | 6 | 7 | 8
This is only interesting if we can determine a relationship between these shifts. Specifically, our original midpoint was 4.5. Our new midpoint based on the original vector is 5.
Let's demonstrate a larger mixture with a group of 1, 3, and 7. In this case, we have 2 values below the original midpoint and one value above the original midpoint. Our new median is 5:
2 | 4 | ** 5 ** | 6 | 8
So empirically, we have determined that removing smaller numbers from the distribution shifts our mid_point index by +0.5 per value removed, and removing larger numbers from the distribution shifts our mid_point index by -0.5 per value removed. There are a few other stipulations:
We need to make sure that our grouping index is not in the new mid_point calculation. Consider a group of 1, 2, and 5. Based on my math, we would shift up by 0.5 based on (2 below - 1 above) / 2 for a new mid_point of 5. That's wrong because 5 was already used up! We need to account for this.
3 | 4 | ** 6 ** | 7 | 8
Likewise, with our shifted mid_point, we also need to look back to verify that our ranking values are still aligned. In a sequence of 1:20, consider a group of c(1:9, 11). While 11 is above the original mid_point of 10.5, it is not above the shifted mid_point of 10.5 + (9 below - 1 above) / 2 = 14.5. But our actual median would be 15.5, because 11 is now below the new midway point.
10 | 12 | 13 | 14 | ** 15 | 16 **| 17 | 18 | 19 | 20
TL;DR: what's the code?
In all of the examples above, the group's ranking vector is given in data.table by the special symbol .I, assuming we did setorder(). If we do the same math as above, we don't have to waste time subsetting the dataset; we can instead determine what the new index(es) should be based on what has been removed from the distribution.
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop incase there are multiple values that this is true
if (new_midpoint > mid_point && above &&.I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
dt$value[new_midpoint]
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
Performance
This is using bench::mark, which checks that all results are equal. For Henrik's and my solutions, I do reorder the results back to the original grouping so that they are all equal.
Note that while this (complicated) algorithm is the most efficient, I do want to emphasize that most of these approaches likely do not have extreme peak RAM usage. The other answers have to subset 5,000 times to allocate a vector of length 249,950 in order to calculate a new median. That's about 2 MB per iteration just on allocation (i.e., roughly 10 GB over all groups).
# A tibble: 6 x 13 (remaining columns not shown)
  expression             min   median `itr/sec` mem_alloc
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 cole               225.7ms  271.8ms    3.68      6.34MB
2 henrik_smart_med     17.7s    17.7s    0.0564   23.29GB
3 henrik_base_med       1.6m     1.6m    0.0104   41.91GB
4 henrik_fmed          55.9s    55.9s    0.0179   32.61GB
5 christian_lookup     54.7s    54.7s    0.0183   51.39GB
6 talat_unlist         35.9s    35.9s    0.0279   19.02GB
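A quick back-of-the-envelope check of the per-iteration allocation mentioned above (illustrative arithmetic only):
len_other <- 5000L * 50L - 50L   # 249,950 values remain after dropping one group
len_other * 8 / 2^20             # ~1.9 MB of doubles allocated per median call
5000 * len_other * 8 / 2^30      # ~9.3 GB allocated across all 5,000 groups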
Full profile code
library(data.table)
library(collapse)
set.seed(76)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = (rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup, 0, 10), 4)
)
## this is largely instantaneous.
dt[ , ri := .I]
bench::mark( cole = {
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop incase there are multiple values that this is true
if (new_midpoint > mid_point && above &&.I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
as.numeric(dt$value[new_midpoint])
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
setorder(dt, ri)
},
henrik_smart_med = {
# number of rows in original data
nr = nrow(dt)
# order by value
setorder(dt, value)
dt[ , medianOfAllTheOtherGroups := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) dt$value[-.I][half]
else mean(dt$value[-.I][half + 0L:1L])
}, by = groupId]
setorder(dt, ri)
},
henrik_base_med = {
dt[ , med := median(dt$value[-.I]), by = groupId]
},
henrik_fmed = {
dt[ , med := fmedian(dt$value[-.I]), by = groupId]
},
christian_lookup = {
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
},
talat_unlist = {
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, medianOfAllTheOtherGroups :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
})
Approach for exact results:
The median is "the middle" value of a sorted vector (or the mean of the two middle values for an even-length vector).
If we know the length of the sorted vector of the others, we can directly look up the corresponding vector element(s) for the median, thus avoiding actually computing the median once per group:
library(data.table)
set.seed(1)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# group count match table + idx position for median of others
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
There might be more data.table-ish ways of improving performance further, I guess.
Memory/runtime for numberOfGroups = 5000 and numberOfValuesPerGroup = 50: 20 GB, 27,000 ms.

Faster way to calculate distance between all individuals during each time step

I have data on positions of several individuals, each registered at several time steps. I want to calculate distance between each animal to all other animals registered at the same time step.
Here's a simplified example, with data on three individuals ('animal_id') registered on three dates ('date') each, on different positions ('x', 'y'):
library(data.table)
dt1 <- data.table(animal_id = 1, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt2 <- data.table(animal_id = 2, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt3 <- data.table(animal_id = 3, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt <- rbindlist(list(dt1, dt2, dt3))
# Create dist function between two animals at same time
dist.between.animals <- function(collar_id1, x1, y1, collar_id2, x2, y2) {
if (collar_id1 == collar_id2) return(NA)
sqrt((x1 - x2)^2 + (y1 - y2)^2)
}
# Get unique collar id of each animal, create column name for all animals per animal
animal_ids <- dt[ , unique(animal_id)]
animal_ids_str <- dt[,paste0("dist_to_", unique(animal_id))]
datetimes <- dt[ , unique(date)]
# Calculate distance of each animal to all animals, at same time
for (i in 1:length(animal_ids)) {
for (j in 1:length(datetimes)) {
x1 <- dt[.(animal_ids[i], datetimes[j]), x, on = .(animal_id, date)]
y1 <- dt[.(animal_ids[i], datetimes[j]), y, on = .(animal_id, date)]
dt[date == datetimes[j], animal_ids_str[i] := mapply(function(c, x2, y2) dist.between.animals(animal_ids[i], x1, y1, c, x2, y2), animal_id, x, y)]
}
}
Here's an example of what the output should look like:
animal_id date x y dist_to_1 dist_to_2 dist_to_3
1: 1 2014-01-01 -7.0276047 4.7660664 NA 7.1354265 13.7962800
2: 1 2014-01-02 -6.6383802 7.0087919 NA 3.7003879 16.4294999
3: 1 2014-01-03 -0.9722872 -4.8638019 NA 11.6447645 11.8313410
4: 2 2014-01-01 0.1076661 4.8131960 7.135426 NA 7.7052205
5: 2 2014-01-02 -8.9042124 4.0832364 3.700388 NA 13.3225921
6: 2 2014-01-03 8.2858839 2.1992575 11.644764 NA 0.4569632
7: 3 2014-01-01 5.7519522 -0.4320359 13.796280 7.7052205 NA
8: 3 2014-01-02 -9.0805265 -9.2381889 16.429500 13.3225921 NA
9: 3 2014-01-03 8.6832729 1.9736531 11.831341 0.4569632 NA
However, my real data have about 30 animals with 20,000 observations per animal, so my current code takes a long time to run. Is there a more efficient way to do this?
OK, so here's kind of an unorthodox method, especially given that for once I think data.tables make the situation worse. I'm using the dist function, which calculates the Euclidean distance (or any other, your pick). If you use diag=T, upper=T, it generates a matrix that you can then assign to the specified rows-columns. Creating the columns might get tedious with multiple animals, but nothing that the paste function can't fix.
library(dplyr) # arrange() below is from dplyr (plyr::arrange() works as well), not base R
dt[, c("dist_to_1", "dist_to_2", "dist_to_3") := NA]
dt <- arrange(dt, date, animal_id) # order by date and id; here dt turns into a data.frame
for (i in 1:length(unique(dt$date))){
sub<- subset(dt, dt$date == unique(dt$date)[i])
dt[which(dt$date == unique(sub$date)), c("dist_to_1", "dist_to_2", "dist_to_3")]<- as.matrix(dist(sub[, c("x","y")], diag=T, upper=T))
}
dt[dt==0] <- NA # assign NAs for 0s; not necessary if it's OK for the diagonal to be 0
setDT(dt) # back to datatable. Again this part is not really necessary.
dt<- dt[order(animal_id, date)] # order as initially ordered
Using this code:
> proc.time()-ptm
user system elapsed
0.051 0.007 0.068
Using earlier code:
> proc.time()-ptm
user system elapsed
0.083 0.004 0.092
If you find a way to use both dist and data.table you're golden, but I couldn't figure it out. It's pretty fast, since dist calls C code, and the advantage grows as you add more observations.
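For what it's worth, one way to combine dist with data.table's grouping is sketched below, under the assumption that every animal appears exactly once per date and that rows are ordered by animal_id within each date (run setorder(dt, date, animal_id) first if not); @digEmAll's answer further down handles the general case more carefully:
cols <- paste0("dist_to_", sort(unique(dt$animal_id)))
dt[, (cols) := {
  m <- as.matrix(dist(cbind(x, y)))  # pairwise distances within this date
  diag(m) <- NA                      # distance to self is not meaningful here
  as.data.table(m)
}, by = date]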
You can make a self-join on date (dt[dt, on = "date"]), and for each match (by = .EACHI) calculate the distance:
dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI]
If you wish to turn the data into wide format (dcast), chain this onto the code above:
[ , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]
It seems to perform OK in a benchmark using the data of @digEmAll:
library(microbenchmark)
microbenchmark(
digemall = dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date],
henrik = dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
, dcast(.SD, from_id + date ~ to_id, value.var = "dist")],
times = 5, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# digemall 3.206063 2.058547 2.189487 2.035975 2.032324 2.019082 5
# henrik 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5
Note that I haven't renamed the "to_id" in my code. That basically reflects my preference to keep the data in long format, and in that format I would like to have both 'from_id' and 'to_id' without any prefix. If you want a prefix on the columns in the wide format, just add to_id = paste0("dist_to_", i.animal_id) in the first step.
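For reference, the wide-format variant with the dist_to_ prefix would look roughly like this (a sketch of the modification just described):
dt[dt, on = "date",
   .(from_id = animal_id, to_id = paste0("dist_to_", i.animal_id),
     dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
   , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]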
Here's an alternative approach which should be much faster:
library(data.table)
### CREATE A BIG DATASET
set.seed(123)
nSamples <- 20000
nAnimals <- 30
allDates <- as.POSIXct(c("2014-01-01")) + (1:nSamples)*24*3600
dts <- lapply(1:nAnimals, function(id){
data.table(animal_id=id,date=allDates,
x=runif(nSamples,-10,10), y=runif(nSamples,-10,10))
})
dt <- rbindlist(dts)
### ALTERNATIVE APPROACH (NO LOOP)
animal_ids_str <- dt[,paste0("dist_to_",sort(unique(animal_id)))]
# set keys
setkey(dt,animal_id,date)
# add the distance columns
dt[,(animal_ids_str):=as.double(NA)]
# custom function that computes animal distances for a subset of dt at the same date
distancesInSameDate <- function(subsetDT,animal_ids_str){
m <- as.matrix(dist(subsetDT[,.(x,y)]))
diag(m) <- NA
cols <- paste0("dist_to_",subsetDT$animal_id)
missingCols <- animal_ids_str[is.na(match(animal_ids_str,cols))]
m <- cbind(m,matrix(NA,nrow=nrow(m),ncol=length(missingCols)))
colnames(m) <- c(cols,missingCols)
DF <- as.data.frame(m,stringsAsFactors=F)
DF <- DF[,match(animal_ids_str,colnames(DF))]
return(DF)
}
# let's compute the distances
system.time( dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date] )
On my machine it takes:
user system elapsed
13.76 0.00 13.82

Group by variable in data.table and carry on other variables

I am working on some summaries for financial datasets and I would like to sort the summary with regard to a certain criterion, but without losing the remaining summary values in a row. Here is a simple example:
set.seed(1)
tseq <- seq(Sys.time(), length.out = 36, by = "mins")
dt <- data.table(TM_STMP = tseq, COMP = rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 3), SEC = rep(letters[1:12],3), VOL = rpois(36, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
dt2
COMP SEC SUM MEAN
1: A a 9000329 3000110
2: A b 9001274 3000425
3: A c 9003505 3001168
4: A d 9002138 3000713
Now I would like to get, for each COMP, the SEC with the highest summed VOL:
dt3 <- dt2[, list(SUM = max(SUM)), by = list(COMP)]
dt3
COMP SUM
1: A 9003505
2: B 9002888
3: C 9005042
This gives me what I want, but I would like to keep the other values in the specific rows (SEC and MEAN) such that it looks like this (made by hand):
COMP SUM SEC MEAN
1: A 9003505 c 3001168
2: B 9002888 f 3000963
3: C 9005042 k 3001681
How can I achieve this?
If you are looking for the SEC and the MEAN corresponding to max of SUM:
dt3 <- dt2[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
> dt3
COMP SUM SEC MEAN
1: A 9003110 a 3001037
2: B 9000814 e 2999612
3: C 9002707 i 2999741
Edit: This'll be faster:
dt2[dt2[, .I[which.max(SUM)], by = list(COMP)]$V1]
Another way to do this would be to set the key of the data.table to COMP, SUM and then use mult="last" as follows:
setkey(dt2, COMP, SUM)
dt2[J(unique(COMP)), mult="last"]
# COMP SEC SUM MEAN
# 1: A c 9002500 3000833
# 2: B g 9003312 3001104
# 3: C i 9000058 3000019
Edit: To answer Simon's benchmarking of the speed differences between this and @Metrics' approach:
set.seed(45)
N <- 1e6
tseq <- seq(Sys.time(), length.out = N, by = "mins")
ff <- function(x) paste(sample(letters, x, TRUE), collapse="")
val1 <- unique(unlist(replicate(1e5, ff(8), simplify=FALSE)))
val2 <- unique(unlist(replicate(1e5, ff(12), simplify=FALSE)))
dt <- data.table(TM_STMP = tseq, COMP = rep(val1, each=100), SEC = rep(val2, each=100), VOL = rpois(1e6, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
require(microbenchmark)
metrics <- function(x=copy(dt2)) {
x[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
}
arun <- function(x=copy(dt2)) {
setkey(x, COMP, SUM)
x[J(unique(COMP)), mult="last"]
}
microbenchmark(ans1 <- metrics(dt2), ans2 <- arun(dt2), times=20)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- metrics(dt2) 749.0001 804.0651 838.0750 882.3869 1053.3389 20
# ans2 <- arun(dt2) 301.7696 321.6619 342.4779 359.9343 392.5902 20
setkey(ans1, COMP, SEC)
setkey(ans2, COMP, SEC)
setcolorder(ans1, names(ans2))
identical(ans1, ans2) # [1] TRUE
From your sample output, it's not exactly clear what you would like to keep / drop, but you can simply list your additional columns in the j argument of DT[i, j]:
> dt2[, list(SUM = max(SUM), SEC, MEAN), by = list(COMP)]
COMP SUM SEC MEAN
1: A 9007273 a 3000131
2: A 9007273 b 3000938
3: A 9007273 c 2999502
4: A 9007273 d 3002424
5: B 9004829 e 3001610
6: B 9004829 f 2999991
7: B 9004829 g 2998471
8: B 9004829 h 2999571
9: C 9002479 i 3000826
10: C 9002479 j 2999826
11: C 9002479 k 3000728
12: C 9002479 l 2999634
I was very interested in the performance of the two different approaches, from @Metrics (denoted in the following as which.func) and from @Arun (denoted as innate.func). So I did some benchmarking with the example given in my question above. Here are the results:
which.func <- function() {dt3 <- dt2[, list(SUM = max(SUM), SEC=SEC[which.max(SUM)], MEAN=MEAN[which.max(SUM)]), by = list(COMP)]}
innate.func <- function() {dt3 <- dt2[J(unique(COMP)), mult = "last"]}
library(rbenchmark)
benchmark(which.func, innate.func, replications = 10e+6)
test replications elapsed relative user.self sys.self
2 innate 10000000 24.689 1.000 24.259 0.425
1 which.func 10000000 32.664 1.323 32.216 0.446
Of course this is maybe a little unfair towards which.func, because innate.func involves a call to setkey, which is a time consumer especially for large samples. If I include the setkey call in the function, I get the following:
innate.func <- function() {setkey(dt2, COMP, SUM); dt3 <- dt2[J(unique(COMP)), mult = "last"]; setkey(dt2, NULL)}
test replications elapsed relative user.self sys.self
2 innate.func 10000000 25.271 1.000 24.834 0.430
1 which.func 10000000 26.476 1.048 26.062 0.397
It seems that the two approaches have very similar performance. The approach of @Arun has perhaps a more elegant style with regard to data.table and needs less code. Its disadvantage may come with aggregation functions other than max or min, where the approach of @Metrics plays out its strength of being applicable in a more general setting.
I learned from both approaches and put them into my toolbox.
During my further work with the solutions given here I encountered another problem with the summary shown above in my question, and I found a solution that I would like to share.
If I want to provide a choice to the user for
an aggregation function, denoted by aggregate, and
a criterion (a variable of the summary) that the aggregate method should be applied to, denoted by crit,
then I encounter the problem that I have to check which of the columns remain (see e.g. @Metrics' answer that uses which). A simple example:
We take the data.table dt2 from my question above. A user now wants to apply the aggregate = "max" method to the crit = "SUM" variable in the data.table summary dt2. Here is a solution I found that works fine (any discussion is of course appreciated):
aggregate = "max"
crit = "SUM"
# build the user's call as an expression; it is evaluated once per group below
mycall <- expression(do.call(aggregate, list(get(crit))))
dt2[, .SD[which(get(crit) == eval(mycall))], by = COMP]
COMP SEC SUM MEAN
1: A c 9002500 3000833
2: B g 9003312 3001104
3: C i 9000058 3000019
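The same pattern works with other choices of aggregate and crit, e.g. picking, per COMP, the row with the smallest MEAN (illustrative only, output not shown):
aggregate = "min"
crit = "MEAN"
mycall <- expression(do.call(aggregate, list(get(crit))))
dt2[, .SD[which(get(crit) == eval(mycall))], by = COMP]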
