I am looking for a RAM-efficient way to calculate the median over a complement set with the help of data.table.
For a set of observations from different groups, I am interested in an implementation of a median of "other groups". I.e., if I have a data.table with one value column and one grouping column, I want to calculate, for each group, the median of the values in all other groups except the current group. E.g. for group 1 we calculate the median from all values except those that belong to group 1, and so on.
A concrete example data.table
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2))
dt
# value groupId
# 1: 1 1
# 2: 2 1
# 3: 3 2
# 4: 4 2
# 5: 5 2
I would like the medianOfAllTheOtherGroups to be defined as 1.5 for group 2
and defined as 4 for group 1, repeated for each entry in the same data.table:
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2), medianOfAllTheOtherGroups = c(4, 4, 1.5, 1.5, 1.5))
dt
# value groupId medianOfAllTheOtherGroups
# 1: 1 1 4.0 # median of all groups _except_ 1
# 2: 2 1 4.0
# 3: 3 2 1.5 # median of all groups _except_ 2
# 4: 4 2 1.5
# 5: 5 2 1.5
To calculate the median for each group only once and not for each observation, we went for an implementation with a loop.
The current complete implementation works nicely for small data.tables as input, but
suffers badly from large RAM consumption for larger data sets, with the medians called in a loop as the bottleneck (note: for the real use case we have a dt with 3,000,000 rows and 100,000 groups).
I have worked very little on improving RAM consumption. Can an expert help here to improve RAM usage for the minimal example that I provide below?
MINIMAL EXAMPLE:
library(data.table)
set.seed(1)
numberOfGroups <- 10
numberOfValuesPerGroup <- 100
# Data table with columns
# groupId - ids for the groups available
# value - value we want to calculate the median over
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# calculate the median from all observations for those groups that do not
# require a separate treatment
medianOfAllGroups <- median(dt$value)
dt$medianOfAllTheOtherGroups <- medianOfAllGroups
# generate extra data.table to collect results for selected groups
includedGroups <- dt[, unique(groupId)]
dt_otherGroups <-
data.table(groupId = includedGroups,
medianOfAllTheOtherGroups = as.numeric(NA)
)
# loop over all selected groups and calculate the median from all observations
# except of those that belong to this group
for (id in includedGroups){
dt_otherGroups[groupId == id,
medianOfAllTheOtherGroups := median(dt[groupId != id, value])]
}
# merge subset data to overall data.table
dt[dt_otherGroups, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c("groupId")]
PS: here is the example output for 10 groups with 100 observations each:
dt
# groupId value medianOfAllTheOtherGroups
# 1: 1 0.2655 0.48325
# 2: 1 0.3721 0.48325
# 3: 1 0.5729 0.48325
# 4: 1 0.9082 0.48325
# 5: 1 0.2017 0.48325
# ---
# 996: 10 0.7768 0.48590
# 997: 10 0.6359 0.48590
# 998: 10 0.2821 0.48590
# 999: 10 0.1913 0.48590
# 1000: 10 0.2655 0.48590
Some numbers for different settings of the minimal example (tested on a MacBook Pro with 16 GB RAM):
numberOfGroups | numberOfValuesPerGroup | Memory (GB) | Runtime (s)
           500 |                     50 |        0.48 |        1.47
          5000 |                     50 |       39.00 |       58.00
            50 |                   5000 |        0.42 |        0.65
All memory values were extracted from the output of profvis (see the example screenshot for the smallest setting).
How about this approach:
setkey(dt, groupId)
dt[, median_val := median(dt$value[dt$groupId != groupId]), by = .(groupId)]
For the case of 5000 groups with 50 values each, this took ~34 seconds on my MBP. I haven't checked RAM usage, though.
Edit: here's another version with two changes, (1) using collapse::fmedian as suggested by Henrik and (2) pre-aggregating the values into a list column by group.
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, median_val :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
This took around 18 seconds for the 5000/50 example on my machine.
RAM usage according to profvis: approach 1 ~28 GB, approach 2 ~15 GB.
Disclaimer: For some reason the profiling keeps crashing my session, so unfortunately I have no such results. However, because my alternatives were a bit faster than the OP's, I thought it could still be worth posting them so that the OP may assess their memory use.
Data
# numberOfGroups <- 5000
# numberOfValuesPerGroup <- 50
# dt <- ...as in OP...
d1 = copy(dt)
d1[ , ri := .I] # just to be able to restore the original order when comparing the result with OP
d2 = copy(dt)
d3 = copy(dt)
Explanation
I shamelessly borrow lines 28, 30-32 from median.default to make a stripped-down version of median.
Calculate total number of rows in the original data (nrow(d1)). Order data by 'value' (setorder). By ordering, two instances of sort in the median code can be removed.
For each 'groupID' (by = groupId):
Calculate length of "other" (number of rows in the original data minus number of rows of current group (.N)).
Calculate median, where the input values are d1$value[-.I], i.e. the original values except the indices of the current group; ?.I: "While grouping, it holds for each item in the group, its row location in x".
Code & Timing
system.time({
# number of rows in original data
nr = nrow(d1)
# order by value
setorder(d1, value)
d1[ , med := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) d1$value[-.I][half]
else mean(d1$value[-.I][half + 0L:1L])
}, by = groupId]
})
# user system elapsed
# 4.08 0.01 4.07
# OP's code on my (old) PC
# user system elapsed
# 84.02 7.26 86.75
# restore original order & check equality
setorder(d1, ri)
all.equal(dt$medianOfAllTheOtherGroups, d1$med)
# [1] TRUE
Comparison with base::median & collapse::fmedian
I also tried the "-.I" with base::median and collapse::fmedian, where the latter was about twice as fast as base::median.
system.time(
d2[ , med := median(d2$value[-.I]), by = groupId]
)
# user system elapsed
# 26.86 0.02 26.85
library(collapse)
system.time(
d3[ , med := fmedian(d3$value[-.I]), by = groupId]
)
# user system elapsed
# 16.95 0.00 16.96
all.equal(dt$medianOfAllTheOtherGroups, d2$med)
# TRUE
all.equal(dt$medianOfAllTheOtherGroups, d3$med)
# TRUE
Thanks a lot to @Cole for helpful comments which improved the performance.
The median is the midpoint of a dataset that's been ordered. For an odd number of values in a dataset, the median is simply the middle number. For an even number of values in a dataset, the median is the mean of the two numbers closest to the middle.
To demonstrate, consider the simple vector of 1:8
1 | 2 | 3 |** 4 | 5 **| 6 | 7 | 8
In this case, our midpoint is 4.5. And because this is a very simple example, the median itself is 4.5
Now consider groupings where one grouping is the first value of the vector. That is, our group is only 1. We know that this will shift our median towards the right (i.e. larger) because we removed a low value of the distribution. Our new distribution is 2:8 and the median is now 5.
2 | 3 | 4 | *5* | 6 | 7 | 8
This is only interesting if we can determine a relationship between these shifts. Specifically, our original midpoint was 4.5. Our new midpoint based on the original vector is 5.
Let's demonstrate a larger mixture with a group of 1, 3, and 7. In this case, we have 2 values below the original midpoint and one value above the original midpoint. Our new median is 5:
2 | 4 | ** 5 ** | 6 | 8
So empirically, we have determined that removing smaller numbers from the distribution shifts our mid_point index by 0.5 and removing larger numbers from the distribution shifts our mid_point index by -0.5. There are a few other stipulations:
We need to make sure that our grouping index is not in the new mid_point calculation. Consider a group of 1, 2, and 5. Based on my math, we would shift up by 0.5 based on (2 below - 1 above) / 2 for a new mid_point of 5. That's wrong because 5 was already used up! We need to account for this.
3 | 4 | ** 6 ** | 7 | 8
Likewise, with our shifted mid_point, we also need to look back to verify that our ranking values are still aligned. In a sequence of 1:20, consider a group of c(1:9, 11). While 11 is above the original mid_point of 10.5, it is not above the shifted mid_point of 10.5 + (9 below - 1 above) / 2 = 14.5. But our actual median would be 15.5 because 11 is now below the new mid_point.
10 | 12 | 13 | 14 | ** 15 | 16 **| 17 | 18 | 19 | 20
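A quick numeric check of the two trickier cases above (illustration only, plain base R):
x <- 1:8
median(x[-c(1, 2, 5)])    # 6: the shifted mid_point has to skip the removed value 5
x20 <- 1:20
median(x20[-c(1:9, 11)])  # 15.5: 11 sits below the shifted mid_point, pushing it up once more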
TL;DR: what's the code?
In all of the examples above, the grouping's ranking vector is given in data.table by the special symbol .I, assuming we did setorder(). If we do the same math as above, we don't have to waste time subsetting the dataset. We can instead determine what the new index(es) should be based on what's been removed from the distribution.
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop in case there are multiple values for which this is true
if (new_midpoint > mid_point && above && .I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
dt$value[new_midpoint]
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
Performance
This is using bench::mark, which checks that all results are equal. For Henrik's and my solutions, I reorder the results back to the original grouping so that they are all equal.
Note that while this (complicated) algorithm is the most efficient, I do want to emphasize that most of these mem_alloc figures likely do not reflect peak RAM usage. The other answers have to subset 5,000 times, each time allocating a vector of length 249,950 to calculate a new median. That's about 2 MB per loop just on that allocation (i.e., around 10 GB overall).
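As a back-of-the-envelope check of that allocation figure (my arithmetic, not part of the benchmark), a double vector of length 249,950 costs about 8 bytes per element:
249950 * 8 / 2^20           # ~1.9 MB allocated per subset
5000 * 249950 * 8 / 2^30    # ~9.3 GB allocated across all 5000 groups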
# A tibble: 6 x 13
  expression            min   median `itr/sec` mem_alloc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 cole              225.7ms  271.8ms    3.68      6.34MB
2 henrik_smart_med    17.7s    17.7s    0.0564   23.29GB
3 henrik_base_med      1.6m     1.6m    0.0104   41.91GB
4 henrik_fmed         55.9s    55.9s    0.0179   32.61GB
5 christian_lookup    54.7s    54.7s    0.0183   51.39GB
6 talat_unlist        35.9s    35.9s    0.0279   19.02GB
Full profile code
library(data.table)
library(collapse)
set.seed(76)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = (rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup, 0, 10), 4)
)
## this is largely instantaneous.
dt[ , ri := .I]
bench::mark( cole = {
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop in case there are multiple values for which this is true
if (new_midpoint > mid_point && above && .I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
as.numeric(dt$value[new_midpoint])
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
setorder(dt, ri)
},
henrik_smart_med = {
# number of rows in original data
nr = nrow(dt)
# order by value
setorder(dt, value)
dt[ , medianOfAllTheOtherGroups := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) dt$value[-.I][half]
else mean(dt$value[-.I][half + 0L:1L])
}, by = groupId]
setorder(dt, ri)
},
henrik_base_med = {
dt[ , med := median(dt$value[-.I]), by = groupId]
},
henrik_fmed = {
dt[ , med := fmedian(dt$value[-.I]), by = groupId]
},
christian_lookup = {
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
},
talat_unlist = {
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, medianOfAllTheOtherGroups :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
})
Approach for exact results:
The median is "the middle" value of a sorted vector (or the mean of the two middle values for an even-length vector).
If we know the length of the sorted vector of the other groups' values, we can directly look up the index of the vector element(s) corresponding to the median, thus avoiding actually computing the median over the large "others" vector once per group:
library(data.table)
set.seed(1)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# group count match table + idx position for median of others
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
There might be more data.table-ish ways of improving performance further, I guess.
Memory/runtime for numberOfGroups = 5000 and numberOfValuesPerGroup = 50: 20 GB, 27,000 ms.
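A quick sanity check one could run for a single group, comparing the looked-up value against base::median (illustrative only; it assumes the objects created above are still available):
g <- dt_match$groupId_match[1]
all.equal(dt_match[groupId_match == g, medianOfAllTheOtherGroups],
          median(dt[groupId != g, value]))
# should return TRUE if the index lookup picks the right element(s)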
Is it possible to use data.table to apply a two-parameter function quickly by group across a data set? On a 1 million row data set, I am finding that calling the simple function defined below is taking over 11 seconds, which is much longer than I would expect for something of this complexity.
The self-contained code below outlines the essentials of what I am trying to do:
# generate data frame - 1 million rows
library(data.table)
set.seed(42)
nn = 1e6
daf = data.frame(aa=sample(1:1000, nn, repl=TRUE),
bb=sample(1:1000, nn, repl=TRUE),
xx=rnorm(nn),
yy=rnorm(nn),
stringsAsFactors=FALSE)
# myfunc is the function to apply to each group
myfunc = function(xx, yy) {
if (max(yy)>1) {
return(mean(xx))
} else {
return(weighted.mean(yy, ifelse(xx>0, 2, 1)))
}
}
# running the function takes around 11.5 seconds
system.time({
dt = data.table(daf, key=c("aa","bb"))
dt = dt[,myfunc(xx, yy), by=c("aa","bb")]
})
head(dt)
# OUTPUT:
# aa bb V1
# 1: 1 2 -1.02605645
# 2: 1 3 -0.49318243
# 3: 1 4 0.02165797
# 4: 1 5 0.40811793
# 5: 1 6 -1.00312393
# 6: 1 7 0.14754417
Is there a way to significantly reduce the time for a function call like this?
I am interested in whether there is a more efficient way to perform the above calculation without completely re-writing the function call, or whether it can only be sped up by breaking apart the function and somehow rewriting it in data.table syntax.
Many thanks in advance for your replies.
Your results:
system.time({
dt = data.table(daf, key = c("aa","bb"))
dt = dt[,myfunc(xx, yy), by = c("aa","bb")]
}) # 21.25
dtInitial <- copy(dt)
V1: if NA values do not concern you, you can modify your function like this:
myfunc2 = function(xx, yy) {
if (max(yy) > 1) {
return(mean(xx))
} else {
w <- ifelse(xx > 0, 2, 1)
return(sum((yy * w)[w != 0])/sum(w))
}
}
system.time({
dt = data.table(daf, key = c("aa","bb"))
dtM = dt[, myfunc2(xx, yy), by = c("aa","bb")]
}) # 6.69
all.equal(dtM, dtInitial)
# [1] TRUE
V2: Also, you can do it faster like this:
system.time({
dt3 <- data.table(daf, key = c("aa","bb"))
dt3[, maxy := max(yy), by = c("aa","bb")]
dt3[, meanx := mean(xx), by = c("aa","bb")]
dt3[, w := ifelse(xx > 0, 2, 1)]
dt3[, wm2 := sum((yy * w)[w != 0])/sum(w), by = c("aa","bb")]
r2 <- dt3[, .(aa, bb, V1 = ifelse(maxy > 1, meanx, wm2))]
r2 <- unique(r2)
}) #2.09
all.equal(r2, dtInitial)
# [1] TRUE
20 sec vs 2 sec for me
Update:
Or a little bit faster:
system.time({
dt3 <- data.table(daf, key = c("aa","bb"))
dt3[, w := ifelse(xx > 0, 2, 1)]
dt3[, yyw := yy * w]
r2 <- dt3[, .(maxy = max(yy),
meanx = mean(xx),
wm2 = sum(yyw)/sum(w)),
by = c("aa","bb")]
r2[, V1 := ifelse(maxy > 1, meanx, wm2)]
r2[, c("maxy", "meanx", "wm2") := NULL]
}) # 1.51
all.equal(r2, dtInitial)
# [1] TRUE
Another solution
system.time({
dat <- data.table(daf, key = c("aa","bb"))
dat[, xweight := (xx > 0) * 1 + 1]
result <- dat[, list(MaxY = max(yy), Mean1 = mean(xx), Mean2 = sum(yy*xweight)/sum(xweight)), keyby=c("aa", "bb")]
result[, FinalMean := ifelse(MaxY > 1, Mean1, Mean2)]
})
user system elapsed
1.964 0.059 1.348
I've found a way to gain a further speedup of 8x, which reduces the time down to around 0.2 seconds on my machine. See below. Rather than calculating sum(yyw)/sum(w) directly for each group, which is time-consuming, we instead calculate the quantities sum(yyw) and sum(w) for each group, and only afterwards perform the division. Magic!
system.time({
dt <- data.table(daf, key = c("aa","bb"))
dt[, w := 1][xx > 0, w := 2]
dt[, yyw := yy * w]
res <- dt[, .(maxy = max(yy),
meanx = mean(xx),
wm2num = sum(yyw),
wm2den = sum(w)),
by = c("aa","bb")]
res[, wm2 := wm2num/wm2den]
res[, V1 := wm2][maxy > 1, V1 := meanx]
res[, c("maxy", "meanx", "wm2num", "wm2den", "wm2") := NULL]
}) # 0.19
all.equal(res, dtInitial)
# [1] TRUE
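A plausible reason for this speedup, offered here only as a guess since the answer doesn't say, is data.table's GForce optimization: a j consisting purely of simple calls such as sum(), mean() and max() by group can be handled by optimized internal routines, whereas a compound expression like sum(yyw)/sum(w) may force ordinary per-group evaluation. Which path data.table takes can be checked with its verbose option:
options(datatable.verbose = TRUE)
dt[, .(wm2num = sum(yyw), wm2den = sum(w)), by = c("aa", "bb")]  # look for a "GForce optimized j" message
options(datatable.verbose = FALSE)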
I've implemented a simple dynamic programming example described here, using data.table, in the hope that it would be as fast as vectorized code.
library(data.table)
B=100; M=50; alpha=0.5; beta=0.9;
n = B + M + 1
m = M + 1
u <- function(c)c^alpha
dt <- data.table(s = 0:(B+M))[, .(a = 0:min(s, M)), s] # State Space and corresponding Action Space
dt[, u := (s-a)^alpha,] # rewards r(s, a)
dt <- dt[, .(s_next = a:(a+B), u = u), .(s, a)] # all possible (s') for each (s, a)
dt[, p := 1/(B+1), s] # transition probs
# s a s_next u p
# 1: 0 0 0 0 0.009901
# 2: 0 0 1 0 0.009901
# 3: 0 0 2 0 0.009901
# 4: 0 0 3 0 0.009901
# 5: 0 0 4 0 0.009901
# ---
#649022: 150 50 146 10 0.009901
#649023: 150 50 147 10 0.009901
#649024: 150 50 148 10 0.009901
#649025: 150 50 149 10 0.009901
#649026: 150 50 150 10 0.009901
To give a little context to my question: conditional on s and a, the future value of s (s_next) is realized as one of a:(a+B), each with probability p = 1/(B+1). The u column gives u(s, a) for each combination (s, a).
Given initial values V (always an n-by-1 vector) over the unique states s, V updates according to V[s] = max over a of ( u(s, a) + beta * sum(p * V[s_next]) ) (the Bellman equation).
Maximization is wrt a, hence, [, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] in the iteration below.
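For reference, here is my reading of the update the loop implements, written out as a single equation (with u(s, a) = (s - a)^alpha and p = 1/(B + 1)):
V_{k+1}(s) = \max_{0 \le a \le \min(s, M)} \Big\{ (s - a)^{\alpha} + \beta \sum_{s' = a}^{a + B} \tfrac{1}{B + 1} \, V_k(s') \Big\}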
Actually, there is a very efficient vectorized solution. I thought the data.table solution would be comparable in performance to the vectorized approach.
I know that the main culprit is dt[, v := V[s_next + 1]]. Alas, I have no idea how to fix it.
# Iteration starts here
system.time({
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
dt[, v := V[s_next + 1]]
dt[, v := u + beta * sum(p*v), by = .(s, a)
][, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] # Iteration
dt1 <- dt[, .(v[1L], i[1L]), by = s]
Vnew <- dt1$V1
sig <- dt1$V2
tol <- max(abs(V - Vnew))
V <- Vnew
i <- i + 1
}
})
# user system elapsed
# 5.81 0.40 6.25
To my dismay, the data.table solution is even slower than the following highly non-vectorized solution. As a sloppy data.table user, I must be missing some data.table functionality. Is there a way to improve things, or is data.table simply not suitable for these kinds of computations?
S <- 0:(n-1) # StateSpace
VFI <- function(V){
out <- rep(0, length(V))
for(s in S){
x <- -Inf
for(a in 0:min(s, M)){
s_next <- a:(a+B) # (s')
x <- max(x, u(s-a) + beta * sum(V[s_next + 1]/(B+1)))
}
out[s+1] <- x
}
out
}
system.time({
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
Vnew <- VFI(V)
tol <- max(abs(V - Vnew))
V <- Vnew
i <- i + 1
}
})
# user system elapsed
# 3.81 0.00 3.81
Here's how I would do this...
DT = CJ(s = seq_len(n)-1L, a = seq_len(m)-1L, s_next = seq_len(n)-1L)
DT[ , p := 0]
#p is 0 unless this is true
DT[between(s_next, a, a + B), p := 1/(B+1)]
#may as well subset to eliminate irrelevant states
DT = DT[p>0 & s>=a]
DT[ , util := u(s - a)]
#don't technically need by, but just to be careful
DT[ , V0 := rep(0, n), by = .(a, s_next)]
while(TRUE) {
#for each s, maximize given past value;
# within each s, have to sum over s_nexts,
# to do so, sum by a
DT[ , V1 := max(.SD[ , util[1L] + beta*sum(V0*p), by = a],
na.rm = TRUE), by = s]
if (DT[ , max(abs(V0 - V1))] < 1e-4) break
DT[ , V0 := V1]
}
On my machine this takes about 15 seconds (so not good)... but maybe this will give you some ideas. For example, this data.table is far too large since there really only are n unique values of V ultimately.
I have data that looks like this:
char_column date_column1 date_column2 integer_column
415 18JT9R6EKV 2014-08-28 2014-09-06 1
26 18JT9R6EKV 2014-12-08 2014-12-11 2
374 18JT9R6EKV 2015-03-03 2015-03-09 1
139 1PEGXAVCN5 2014-05-06 2014-05-10 3
969 1PEGXAVCN5 2014-06-11 2014-06-15 2
649 1PEGXAVCN5 2014-08-12 2014-08-16 3
I want to perform a loop that checks every row against the preceding row and, given certain conditions, assigns them the same number (so I can group them later). The point is that if the date segments are close enough, I would collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
z[i] <- ifelse(df[i,'char_column'] == df[i-1,'char_column'],
ifelse((df[i,'date_column1'] - df[i-1,'date_column2']) <= 5,
ifelse(df[i,'integer_column'] == df[i-1,'integer_column'],
v, v<- v+1),
v <- v+1),
v <- v+1)}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
This method works perfectly for, say, 100,000 rows (22.86 seconds),
but for a million rows it takes 33.18 minutes! I have over 60 million rows to process;
is there a way I can make the process more efficient?
PS: to generate a similar table you can use the following code:
x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '')}
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
date_column1 = as.Date(y, origin = '1970-01-01'),
date_column2 = as.Date(y2,origin = '1970-01-01'),
integer_column = sample(1:3,1000, replace = T),
row.names = NULL)
df <- df[order(df$char_column, df$date_column1),]
Since data.table::rleid does not work, I post another (hopefully) fast solution
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation; use if instead.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A & B) x else y.
f1 <- function(df){
z <- rep(NA, nrow(df))
z[1] <- 1
char_col <- df[, 'char_column']
date_col1 <- df[, 'date_column1']
date_col2 <- df[, 'date_column2']
int_col <- df[, 'integer_column']
for (i in 2:nrow(df)){
if((char_col[i] == char_col[i-1])&((date_col1[i] - date_col2[i-1]) <= 5)&(int_col[i] == int_col[i-1]))
{
z[i] <- z[i-1]
}
else
{
z[i] <- z[i-1]+1
}
}
z
}
f1 is about 40% faster than the original solution for 10,000 rows.
system.time(f1(df))
user system elapsed
2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
library(data.table)
f2 <- function(df){
z <- rep(NA, nrow(df))
z[1] <- 1
char_col <- df[, 'char_column']
date_col1 <- df[, 'date_column1']
date_col2 <- df[, 'date_column2']
int_col <- df[, 'integer_column']
cond <- (char_col==shift(char_col))&(date_col1 - shift(date_col2) <= 5)&(int_col==shift(int_col))
for (i in 2:nrow(df)){
if(cond[i])
{
z[i] <- z[i-1]
}
else
{
z[i] <- z[i-1]+1
}
}
z
}
# for 10000 rows
system.time(f2(df))
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).
f3 <- function(df){
setDT(df)
df[, cond := (char_column==shift(char_column))&(date_column1 - shift(date_column2) <= 5)&(integer_column==shift(integer_column)),]
df[, group := cumsum(!c(FALSE, cond[-1L])),]
}
For 1M rows
system.time(f3(df))
# user system elapsed
# 0.05 0.05 0.09
system.time(f2(df))
# user system elapsed
# 1.83 0.05 1.87
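To complete the picture, the collapse step described in the question ("group using min(date_column1) and max(date_column2)") could then look something like the following sketch; it uses the group column added by f3, and the choice of kept columns is my assumption:
f3(df)  # adds 'cond' and 'group' to df by reference (df becomes a data.table)
collapsed <- df[, .(char_column = first(char_column),
                    date_column1 = min(date_column1),
                    date_column2 = max(date_column2),
                    integer_column = first(integer_column)),
                by = group]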
With the R package data.table, is it possible to find the values that are in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by @user1935457 (see the comment in @user1935457's post).
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from #user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# #user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should be more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
That out of the way, we need a bigger example to exhibit the difference from the vector-scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the Data Table is a not-too-random subset of 1:n. IT, the Interval Table is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
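For completeness, here is a sketch of how the bonus question could be handled with a non-equi join; note this assumes data.table >= 1.9.8 (newer than the answers above), where the x.x prefix refers back to DT's own x column:
library(data.table)
DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
I  <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))
# one ranged join instead of scanning DT once per interval
DT[I, .(i, x = x.x), on = .(x >= min, x <= max), nomatch = 0L]
# expected: i = 1 -> 3, 5, 8 and i = 2 -> 21, 34, matching the desired output above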