UPDATE: I reduced my code to the pivotal elements to shorten it
The function impact_calc is very slow (26 seconds for a 100,000-record data frame). I think the main reason is the for loop (maybe apply or map would help?). Below I simulate the data, write the impact_calc function, and record the run time.
library(dplyr)
library(data.table)
library(reshape2)
###########################################################
# Start Simulate Data
###########################################################
BuySell <- function(m = 40, s = 4) {
S <- pmax(round(rnorm(10, m, s), 2), 0)
S.sorted <- sort(S)
data.frame(buy = rev(head(S.sorted, 5)), sell = tail(S.sorted, 5))
}
number_sates <- 10000
lst <- replicate(number_sates, BuySell(), simplify = FALSE)
# assemble prices data frame
prices <- as.data.frame(data.table::rbindlist(lst))
prices$state_id <- rep(1:number_sates, each = 5)
prices$level <- rep(1:5, times = number_sates)
prices$quantities <- round(runif(number_sates * 5, 100000, 1000000), 0)
# reshape to long format
prices_long <- reshape2::melt(prices,
id.vars = c("state_id", "quantities", "level"),
value.name = "price"
) %>%
rename("side" = "variable") %>%
setDT()
###########################################################
# End Simulate Data
###########################################################
Here is the function impact_calc which is very slow.
##########################################################
# function to optimize
impact_calc <- function(data, required_quantity) {
# get best buy and sell
best_buy <- data[, .SD, .SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
best_sell <- data[, .SD, .SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]
# calculate mid
mid <- 0.5 * (best_buy + best_sell)
# buys
remaining_qty <- required_quantity
impact <- 0
data_buy <- data[, ,][side == "buy"]
levels <- data_buy[, ,][side == "buy"][, level]
# i think this for loop is slow!
for (level in levels) {
price_difference <- mid - data_buy$price[level]
if (data_buy$quantities[level] >= remaining_qty) {
impact <- impact + remaining_qty * price_difference
remaining_qty <- 0
break
} else {
impact <- impact + data_buy$quantities[level] * price_difference
remaining_qty <- remaining_qty - data_buy$quantities[level]
}
}
rel_impact <- impact / required_quantity / mid
return_list <- list("relative_impact" = rel_impact)
}
The results with run time:
start_time <- Sys.time()
impact_buys <- prices_long[, impact_calc(.SD, 600000), by = .(state_id)]
end_time <- Sys.time()
end_time - start_time
# for a 100,000-row data frame it takes
# Time difference of 26.54057 secs
Thanks for your help!
The OP's suspicion is correct: by replacing the for loop with vectorized operations, we can speed up the calculation by a factor of over 100:
required_quantity <- 600000
setDT(prices)
library(bench)
mark(
orig = prices_long[, impact_calc(.SD, required_quantity), by = .(state_id)],
mod1 = prices_long[, impact_calc2(.SD, required_quantity), by = .(state_id)],
vec_w = prices[, {
mid <- 0.5 * (buy[1L] + sell[1L])
tmp <- cumsum(quantities) - required_quantity
list(relative_impact =
sum(pmin(quantities, pmax(0, quantities - tmp)) * (mid - buy)) /
required_quantity / mid)
}, by = .(state_id)],
min_time = 1.0
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> <list>
1 orig 28.1s 28.1s 0.0356 2.21GB 1.39 1 39 28.1s <data.ta~ <Rprofme~ <bch~ <tibb~
2 mod1 13.1s 13.1s 0.0762 658.42MB 1.45 1 19 13.12s <data.ta~ <Rprofme~ <bch~ <tibb~
3 vec_w 175.1ms 196.9ms 5.19 440.19KB 2.59 6 3 1.16s <data.ta~ <Rprofme~ <bch~ <tibb~
In addition to the speed-up, the vectorized version vec_w allocates remarkably less memory (about 5000 times less).
Note that the vectorized version vec_w uses the original prices dataset in wide format. So, there is no need to reshape the data from wide to long format.
The second benchmark case mod1 is a version of impact_calc() where the code outside of the for loop has been modified to make better use of the data.table syntax. These minor modifications alone account for a speed-up by a factor of 2.
The results are identical, which is verified by mark().
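For reference, the vec_w expression can also be wrapped as a standalone helper so it is usable outside of bench::mark() (the function name impact_calc_wide is my own; same logic as vec_w above, assuming setDT(prices) has been run as shown):
impact_calc_wide <- function(data, required_quantity) {
  data[, {
    mid <- 0.5 * (buy[1L] + sell[1L])
    tmp <- cumsum(quantities) - required_quantity
    list(relative_impact =
           sum(pmin(quantities, pmax(0, quantities - tmp)) * (mid - buy)) /
           required_quantity / mid)
  }, by = .(state_id)]
}
impact_buys_wide <- impact_calc_wide(prices, 600000)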
Explanation of vec_w
If I understand correctly, the OP considers quantities in the given level order until required_quantity is reached. The last level is considered only partially, to the extent required to meet required_quantity exactly.
In a vectorised version this can be achieved by nested ifelse() as shown in this example:
library(data.table)
r <- 5
dt <- data.table(q = 1:4)
dt[, csq := cumsum(q)]
dt[, tmp := csq - r]
dt[, aq := ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0))][]
q csq tmp aq
1: 1 1 -4 1
2: 2 3 -2 2
3: 3 6 1 2
4: 4 10 5 0
The temporary variable tmp holds the difference between the cumulated sum of quantities q and the required quantity r.
The first ifelse() tests whether the cumulated sum of quantities q is below the required quantity r. If so, q is used without deduction. If not, only the part of q that is required to fill up the cumulated sum of actual quantities aq to meet the required quantity r is used.
The second ifelse() ensures that the quantity q minus deduction is positive (which is the case for the incomplete level) or zero (for the remaining levels below).
The actual quantities aq = c(1, 2, 2, 0) derived by the previous steps do sum up to the requested quantity r = 5.
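A quick check confirms this:
dt[, sum(aq)]
#> [1] 5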
Now, the ifelse() constructs can be replaced by pmin() and pmax():
dt[, aq := pmin(q, pmax(q - tmp, 0))]
I have verified in a separate benchmark (not posted here) that the pmin()/pmax() approach is slightly faster than the nested ifelse().
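As an illustration of how such a comparison could look (a minimal sketch on my own toy data, not the benchmark referred to above; mark() also verifies that both variants return the same values):
library(bench)
q <- runif(1e6, 1e5, 1e6)
tmp <- cumsum(q) - 6e5
mark(
  nested_ifelse = ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0)),
  pmin_pmax = pmin(q, pmax(q - tmp, 0))
)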
Explanation of mod1
In function impact_calc() some code lines can be modified to make use of the data.table syntax.
Thus,
best_buy <- data[, .SD,.SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
best_sell <- data[, .SD,.SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]
become
best_buy <- data[side == "buy" & level == 1, first(price)]
best_sell <- data[side == "sell" & level == 1, first(price)]
and
data_buy <- data[, ,][side == "buy"]
levels <- data_buy[, ,][side == "buy"][, level]
become
data_buy <- data[side == "buy"]
levels <- data[side == "buy", level]
I was quite surprised to learn that these modifications outside of the for loop already gained a substantial speed increase.
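For completeness, impact_calc2() used in the benchmark above is not shown explicitly; a sketch of what it could look like is simply the original function with these two modifications applied:
impact_calc2 <- function(data, required_quantity) {
  # get best buy and sell via data.table syntax
  best_buy <- data[side == "buy" & level == 1, first(price)]
  best_sell <- data[side == "sell" & level == 1, first(price)]
  # calculate mid
  mid <- 0.5 * (best_buy + best_sell)
  remaining_qty <- required_quantity
  impact <- 0
  data_buy <- data[side == "buy"]
  levels <- data[side == "buy", level]
  # the original for loop is kept unchanged
  for (level in levels) {
    price_difference <- mid - data_buy$price[level]
    if (data_buy$quantities[level] >= remaining_qty) {
      impact <- impact + remaining_qty * price_difference
      remaining_qty <- 0
      break
    } else {
      impact <- impact + data_buy$quantities[level] * price_difference
      remaining_qty <- remaining_qty - data_buy$quantities[level]
    }
  }
  list(relative_impact = impact / required_quantity / mid)
}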
Related
I have a dataset that looks like the below:
> head(mydata)
id value1 value2
1: 1 200001 300001
2: 2 200002 300002
3: 3 200003 300003
4: 4 200004 300004
5: 5 200005 300005
6: 6 200006 300006
value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).
After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).
Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.
Below is a reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!
library(data.table)
library(purrr)
mydata <- data.table(id = 1:1000, value1 = 2001:3000, value2 = 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
Just use the formula for linear interpolation to vectorize over the whole data.table.
mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
system.time({
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
})
#> user system elapsed
#> 41.50 0.53 42.05
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0 0 0
identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE
It also works just as fast when m is a vector.
m <- sample(12, 1e6 + 1, 1)
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0.01 0.00 0.02
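As an aside, if the get() calls mentioned in the question ever do become a concern, one simple workaround (a sketch, not benchmarked here; the column name interpolated_value3 is just for illustration) is to pull the columns out with [[ once and reuse them:
fv <- mydata[[floor_value]]
cv <- mydata[[ceiling_value]]
mydata[, interpolated_value3 := fv + m * (cv - fv) / 12]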
I am looking for a RAM-efficient way to calculate the median over a complement set with the help of data.table.
For a set of observations from different groups, I am interested in an implementation of a median of the "other groups". I.e., if I have a data.table with one value column and one grouping column, I want to calculate, for each group, the median of the values in all groups except the current one. E.g. for group 1 we calculate the median from all values except the values that belong to group 1, and so on.
A concrete example data.table
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2))
dt
# value groupId
# 1: 1 1
# 2: 2 1
# 3: 3 2
# 4: 4 2
# 5: 5 2
I would like the medianOfAllTheOtherGroups to be defined as 1.5 for group 2
and defined as 4 for group 1, repeated for each entry in the same data.table:
dt <- data.table(value = c(1,2,3,4,5), groupId = c(1,1,2,2,2), medianOfAllTheOtherGroups = c(4, 4, 1.5, 1.5, 1.5))
dt
# value groupId medianOfAllTheOtherGroups
# 1: 1 1 4.0 # median of all groups _except_ 1
# 2: 2 1 4.0
# 3: 3 2 1.5 # median of all groups _except_ 2
# 4: 4 2 1.5
# 5: 5 2 1.5
To calculate the median for each group only once and not for each observation, we went for an implementation with a loop.
The current complete implementation works nicely for small data.tables as input, but
suffers from large RAM consumption for larger data sets, with the medians called in a loop being the bottleneck (note: for the real use case we have a dt with 3,000,000 rows and 100,000 groups).
I have worked very little on improving RAM consumption. Can an expert help improve the RAM usage for the minimal example I provide below?
MINIMAL EXAMPLE:
library(data.table)
set.seed(1)
numberOfGroups <- 10
numberOfValuesPerGroup <- 100
# Data table with column
# groupIds - Ids for the groups available
# value - value we want to calculate the median over
# includeOnly - boolean that indicates which example should get a "group specific" median
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# calculate the median from all observations for those groups that do not
# require a separate treatment
medianOfAllGroups <- median(dt$value)
dt$medianOfAllTheOtherGroups <- medianOfAllGroups
# generate extra data.table to collect results for selected groups
includedGroups <- dt[, unique(groupId)]
dt_otherGroups <-
data.table(groupId = includedGroups,
medianOfAllTheOtherGroups = as.numeric(NA)
)
# loop over all selected groups and calculate the median from all observations
# except of those that belong to this group
for (id in includedGroups){
dt_otherGroups[groupId == id,
medianOfAllTheOtherGroups := median(dt[groupId != id, value])]
}
# merge subset data to overall data.table
dt[dt_otherGroups, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c("groupId")]
PS: here is the example output for 10 groups with 100 observations each:
dt
# groupId value medianOfAllTheOtherGroups
# 1: 1 0.2655 0.48325
# 2: 1 0.3721 0.48325
# 3: 1 0.5729 0.48325
# 4: 1 0.9082 0.48325
# 5: 1 0.2017 0.48325
# ---
# 996: 10 0.7768 0.48590
# 997: 10 0.6359 0.48590
# 998: 10 0.2821 0.48590
# 999: 10 0.1913 0.48590
# 1000: 10 0.2655 0.48590
Some numbers for different settings of the minimal example (tested on a Mac Book Pro with 16Gb RAM):
numberOfGroups  numberOfValuesPerGroup  Memory (GB)  Runtime (s)
           500                      50         0.48         1.47
          5000                      50        39.00        58.00
            50                    5000         0.42         0.65
All memory values were extracted from the output of profvis (the example screenshot for the smallest setting is not reproduced here).
How about this approach:
setkey(dt, groupId)
dt[, median_val := median(dt$value[dt$groupId != groupId]), by = .(groupId)]
For the 5000 groups with 50 values each case this took ~34 seconds on my MBP. Haven't checked RAM usage though.
Edit: here's another version with two changes, (1) using collapse::fmedian as suggested by Henrik and (2) pre-aggregating the values into a list column by group.
library(collapse)
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, median_val :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
This took around 18 seconds for the 5000/50 example on my machine.
RAM usage: approach 1 ~28GB approach 2 ~15GB according to profvis
Disclaimer: For some reason the profiling keeps crashing my session, so unfortunately I have no such results. However, because my alternatives were a bit faster than the OP's, I thought it could still be worth posting them so that the OP may assess their memory use.
Data
# numberOfGroups <- 5000
# numberOfValuesPerGroup <- 50
# dt <- ...as in OP...
d1 = copy(dt)
d1[ , ri := .I] # just to be able to restore the original order when comparing the result with the OP's
d2 = copy(dt)
d3 = copy(dt)
Explanation
I shamelessly borrow lines 28, 30-32 from median.default to make a stripped-down version of median.
Calculate total number of rows in the original data (nrow(d1)). Order data by 'value' (setorder). By ordering, two instances of sort in the median code can be removed.
For each 'groupID' (by = groupId):
Calculate length of "other" (number of rows in the original data minus number of rows of current group (.N)).
Calculate median, where the input values are d1$value[-.I], i.e. the original values except the indices of the current group; ?.I: "While grouping, it holds for each item in the group, its row location in x".
Code & Timing
system.time({
# number of rows in original data
nr = nrow(d1)
# order by value
setorder(d1, value)
d1[ , med := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) d1$value[-.I][half]
else mean(d1$value[-.I][half + 0L:1L])
}, by = groupId]
})
# user system elapsed
# 4.08 0.01 4.07
# OP's code on my (old) PC
# user system elapsed
# 84.02 7.26 86.75
# restore original order & check equality
setorder(d1, ri)
all.equal(dt$medianOfAllTheOtherGroups, d1$med)
# [1] TRUE
Comparison with base::median & collapse::fmedian
I also tried the "-.I" with base::median and collapse::fmedian, where the latter was about twice as fast as base::median.
system.time(
d2[ , med := median(d2$value[-.I]), by = groupId]
)
# user system elapsed
# 26.86 0.02 26.85
library(collapse)
system.time(
d3[ , med := fmedian(d3$value[-.I]), by = groupId]
)
# user system elapsed
# 16.95 0.00 16.96
all.equal(dt$medianOfAllTheOtherGroups, d2$med)
# TRUE
all.equal(dt$medianOfAllTheOtherGroups, d3$med)
# TRUE
Thanks a lot to #Cole for helpful comments which improved the performance.
The median is the midpoint of a dataset that's been ordered. For an odd number of values in a dataset, the median is simply the middle number. For an even number of values in a dataset, the median is the mean of the two numbers closest to the middle.
To demonstrate, consider the simple vector of 1:8
1 | 2 | 3 |** 4 | 5 **| 6 | 7 | 8
In this case, our midpoint is 4.5. And because this is a very simple example, the median itself is 4.5
Now consider groupings where one grouping is the first value of the vector. That is, our group is only 1. We know that this will shift our median towards the right (i.e. larger) because we removed a low value of the distribution. Our new distribution is 2:8 and the median is now 5.
2 | 3 | 4 | *5* | 6 | 7 | 8
This is only interesting if we can determine a relationship between these shifts. Specifically, our original midpoint was 4.5. Our new midpoint based on the original vector is 5.
Let's demonstrate a larger mixture with a group of 1, 3, and 7. In this case, we have 2 values below the original midpoint and one value above the original midpoint. Our new median is 5:
2 | 4 | ** 5 ** | 6 | 8
So empirically, we have determined that removing smaller numbers from the distribution shifts our mid_point index by +0.5 and removing larger numbers from the distribution shifts our mid_point index by -0.5. There are a few other stipulations:
We need to make sure that our grouping index is not in the new mid_point calculation. Consider a group of 1, 2, and 5. Based on my math, we would shift up by 0.5 based on (2 below - 1 above) / 2 for a new mid_point of 5. That's wrong because 5 was already used up! We need to account for this.
3 | 4 | ** 6 ** | 7 | 8
Likewise, with our shifted mid_point, we also need to look back to verify that our ranking values are still aligned. In a sequence of 1:20, consider a group of c(1:9, 11). While 11 is originally above the original mid_point of 10.5, it is not above the shifted mid_point of 10.5 + (9 below - 1 above) / 2 = 14.5. But our actual median would be 15.5 because 11 is now below the new midway point.
10 | 12 | 13 | 14 | ** 15 | 16 **| 17 | 18 | 19 | 20
TL;DR: what's the code?
In all of the examples above, the grouping's rankings vector is given in data.table by the special symbol .I, assuming we did setorder(). If we do the same math as above, we don't have to waste time subsetting the dataset. We can instead determine what the new index(es) should be based on what's been removed from the distribution.
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop incase there are multiple values that this is true
if (new_midpoint > mid_point && above &&.I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
dt$value[new_midpoint]
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
Performance
This is using bench::mark which checks that all results are equal. For Henrik's and my solutions, I reorder the results back to the original order so that they are all equal.
Note that while this (complicated) algorithm is the most efficient, I do want to emphasize that most of these approaches likely do not have extreme peak RAM usage. The other answers have to subset 5,000 times, each time allocating a vector of length 249,950, to calculate a new median. That's about 2 MB per loop just on allocation (i.e., about 10 GB overall).
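The rough arithmetic behind that estimate:
# each grouped subset is a double vector of length 250000 - 50 = 249950
249950 * 8 / 1e6        # ~2 MB allocated per group
#> [1] 1.9996
249950 * 8 * 5000 / 1e9 # ~10 GB across all 5000 groups
#> [1] 9.998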
# A tibble: 6 x 13
  expression            min   median `itr/sec` mem_alloc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 cole 225.7ms 271.8ms 3.68 6.34MB
2 henrik_smart_med 17.7s 17.7s 0.0564 23.29GB
3 henrik_base_med 1.6m 1.6m 0.0104 41.91GB
4 henrik_fmed 55.9s 55.9s 0.0179 32.61GB
5 christian_lookup 54.7s 54.7s 0.0183 51.39GB
6 talat_unlist 35.9s 35.9s 0.0279 19.02GB
Full profile code
library(data.table)
library(collapse)
set.seed(76)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = (rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup, 0, 10), 4)
)
## this is largely instantaneous.
dt[ , ri := .I]
bench::mark( cole = {
setorder(dt, value)
nr = nrow(dt)
is_even = nr %% 2L == 0L
mid_point = (nr + 1L) / 2L
dt[, medianOfAllTheOtherGroups :=
{
below = sum(.I < mid_point)
is_midpoint = is_even && below && (.I[below] + 1L == mid_point)
above = .N - below - is_midpoint
new_midpoint = (below - above) / 2L + mid_point
## TODO turn this into a loop incase there are multiple values that this is true
if (new_midpoint > mid_point && above &&.I[below + 1] < new_midpoint) { ## check to make sure that none of the indices were above
below = below - 1L
new_midpoint = new_midpoint + 1L
} else if (new_midpoint < mid_point && below && .I[below] > new_midpoint) {
below = below + 1L
new_midpoint = new_midpoint - 1L
}
if (((nr - .N + 1L) %% 2L) == 0L) {
as.numeric(dt$value[new_midpoint])
} else {
##TODO turn this into a loop in case there are multiple values that this is true for.
default_inds = as.integer(new_midpoint + c(-0.5, 0.5))
if (below) {
if (.I[below] == default_inds[1L])
default_inds[1L] = .I[below] - 1L
}
if (above) {
if (.I[below + 1L + is_midpoint] == default_inds[2L])
default_inds[2L] = .I[below + 1L] + 1L
}
mean(dt$value[default_inds])
}
}
, by = groupId]
setorder(dt, ri)
},
henrik_smart_med = {
# number of rows in original data
nr = nrow(dt)
# order by value
setorder(dt, value)
dt[ , medianOfAllTheOtherGroups := {
# length of "other"
n = nr - .N
# ripped from median
half = (n + 1L) %/% 2L
if (n %% 2L == 1L) dt$value[-.I][half]
else mean(dt$value[-.I][half + 0L:1L])
}, by = groupId]
setorder(dt, ri)
},
henrik_base_med = {
dt[ , med := median(dt$value[-.I]), by = groupId]
},
henrik_fmed = {
dt[ , med := fmedian(dt$value[-.I]), by = groupId]
},
christian_lookup = {
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
},
talat_unlist = {
d2 = dt[, .(value = list(value)), keyby = .(groupId)]
setkey(dt, groupId)
dt[, medianOfAllTheOtherGroups :=
fmedian(d2[-.GRP, unlist(value, use.names = FALSE, recursive = FALSE)]),
by = .(groupId)]
})
Approach for exact results:
The median is "the middle" value of a sorted vector (or the mean of the two middle values for an even-length vector).
If we know the length of the sorted vector of the other groups' values, we can directly look up the index of the vector element(s) corresponding to the median, thus avoiding actually computing the median for each groupId:
library(data.table)
set.seed(1)
numberOfGroups <- 5000
numberOfValuesPerGroup <- 50
dt <-
data.table(
groupId = as.character(rep(1:numberOfGroups, each = numberOfValuesPerGroup)),
value = round(runif(n = numberOfGroups * numberOfValuesPerGroup), 4)
)
# group count match table + idx position for median of others
nrows <- dt[, .N]
dt_match <- dt[, .(nrows_other = nrows- .N), by = .(groupId_match = groupId)]
dt_match[, odd := nrows_other %% 2]
dt_match[, idx1 := ceiling(nrows_other/2)]
dt_match[, idx2 := ifelse(odd, idx1, idx1+1)]
setkey(dt, value)
dt_match[, medianOfAllTheOtherGroups := dt[groupId != groupId_match][c(idx1, idx2), sum(value)/2], by = groupId_match]
dt[dt_match, medianOfAllTheOtherGroups := i.medianOfAllTheOtherGroups,
on = c(groupId = "groupId_match")]
There might be more data.table-ish ways of improving performance further, I guess.
Memory/runtime for numberOfGroups = 5000 and numberOfValuesPerGroup = 50: 20GB, 27000ms
Currently I have the following data.table :
item city dummyvar
A Austin 1
A Austin 1
A Austin 100
B Austin 2
B Austin 2
B Austin 200
A NY 1
A NY 1
A NY 100
B NY 2
B NY 2
B NY 200
and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and returns the rows where it detects an abrupt change in dummyvar. The way I am doing this is as follows:
my.data.table[,
.(item, city , imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = T))) ),
by = .(city, item)
]
And for the NY case let's say that I get a data.table object like the following:
item city imb.points
A NY 3,449
where the column imb.points has nested lists as its elements, and for this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY and item = A. However, the problem I am facing is that I have approx. 3000 different items for 12 different cities, and it is taking a long time to calculate this. I was wondering if you could give me an idea of how to vectorize/speed up this calculation, since the last time I attempted this it took almost 2 hours and didn't finish.
I don't know if it's of any help, but I am also attaching the ImbalancePoints function:
library(pracma)
ImbalancePr <- function(eval.column) {
n <- length(eval.column)
imbalance <- rep(0, n)
b_t = rep(0,n)
elem_diff <- diff(eval.column)
for(i in 2:n)
{
imbalance[i] <- sign(elem_diff[i-1]) * (elem_diff[i-1] != 0)
+ imbalance[i-1]*(elem_diff[i-1] == 0)
}
return(imbalance)
}
ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
bv_t <- ImbalancePr(eval.column)
w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
Tstar <- w0
E0t <- Tstar
repeat{
Tlast <- sum(Tstar)
nbt <- min(bkw_b, Tlast-1)
P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
P <- tail(P,1)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if(max(bv_t_cumsum) < bv_t_expected){break}else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if(Tlast > length(eval.column)[1]){break}else{
Tstar <- c(Tstar,Tnew)
if(length(Tstar) <= 2){
E0t <- mean(Tstar)
}else{
nt <- min(bkw_T,length(Tstar)-1)
E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
E0t <- tail(E0t,1)
}
}
}
return(sort(unique(Tstar)))
}
EDIT: Thanks to Paul's insight, my problem is just to vectorize the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.
This posting consists of several sections addressing different issues:
Vectorizing ImbalancePr()
Profiling ImbalancePoints()
Speeding-up movavg() with Rcpp by a factor of 4
Vectorizing ImbalancePr()
I believe ImbalancePr() can be replaced by
fImbalancePr <- function(x) c(0, sign(diff(x)))
At least, it returns the same result when benchmarked (with a check of the results):
library(bench)
library(ggplot2)
bm <- press(
n = c(10, 100, 1000, 10000),
{
x <- rep(0, n)
set.seed(123)
x[sample(n, n/5)] <- 100
print(table(x))
mark(
ImbalancePr(x),
fImbalancePr(x)
)
}
)
Running with:
n
1 10
x
0 100
8 2
2 100
x
0 100
80 20
3 1000
x
0 100
800 200
4 10000
x
0 100
8000 2000
autoplot(bm)
fImbalancePr() is always faster than OP's original version. The speed advantage increases with vector length.
Profiling ImbalancePoints()
However, this improvement does not have much impact on the overall performance of ImbalancePoints():
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
}
)
}
)
fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr().
autoplot(bm)
There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.
We can use profvis to identify where the time is spent within ImbalancePoints():
library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
for (i in 1:100) fImbalancePoints(x)
})
Timings are collected by sampling, therefore a sufficient number of repetitions is required to get a good coverage.
The results from one run (shown in a screenshot from RStudio's profiling view, not reproduced here) can be summarized as follows:
movavg() consumes 25% of the time spent in ImbalancePoints().
According to the profiling, another 20% are required for the double colon operator in pracma::movavg(). It might be worthwhile to test if there is a speedup from loading the pracma package beforehand using library(pracma).
10% are spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster.
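For instance (a quick illustrative micro-benchmark on a made-up vector; mark() also checks that both return the same value):
library(bench)
x <- runif(1e3)
mark(
  tail = tail(x, 1),
  index = x[length(x)]
)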
If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that there is an iterative loop which cannot be vectorized:
...
else if (type == "e") {
a <- 2/(n + 1)
y[1] <- x[1]
for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...
In addition, only the last value of the time series created by the call to movavg() is used. So, there might be two options for performance improvements here:
Choose a different weighted means function which uses only data points within a limited window.
Re-implement movavg() in C++ using Rcpp.
Speeding-up movavg() with Rcpp
Replacing the call to pracma::movavg() and the subsequent call to tail() by an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.
EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1)
library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
int nx = x.size();
double a = 2.0 / (n + 1.0);
double b = 1.0 - a;
double y;
y = x[0];
for(int k = 1; k < nx; k++){
y = a * x[k] + b * y;
}
return y;
}
")
Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):
fImbalancePoints <-
function(eval.column,
w0 = 100,
bkw_T = 10,
bkw_b = 10) {
# bv_t <- ImbalancePr(eval.column)
bv_t <- c(0, sign(diff(eval.column)))
# w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
Tstar <- w0
E0t <- Tstar
repeat {
Tlast <- sum(Tstar)
# remove warning:
# In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
if (Tlast >= length(bv_t)) break
nbt <- min(bkw_b, Tlast - 1)
# P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
# P <- tail(P, 1)
P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if (max(bv_t_cumsum) < bv_t_expected) {
break
} else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if (Tlast > length(eval.column)[1]) {
break
} else{
Tstar <- c(Tstar, Tnew)
if (length(Tstar) <= 2) {
E0t <- mean(Tstar)
} else{
nt <- min(bkw_T, length(Tstar) - 1)
# E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
# E0t <- tail(E0t, 1)
E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
}
}
}
return(sort(unique(Tstar)))
}
The benchmark
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
},
min_time = 1
)
}
)
bm
expression n min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <int> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 original 10 315.1us 369us 2318. 2.66KB 4.16 2231 4 962.49ms
2 modified 10 120us 136us 6092. 195.11KB 5.31 5733 5 940.99ms
3 original 100 583.4us 671us 1283. 55.09KB 4.16 1234 4 961.78ms
4 modified 100 145.4us 167us 5146. 47.68KB 4.19 4916 4 955.29ms
5 original 1000 438.1ms 469ms 2.17 157.37MB 4.33 3 6 1.38s
6 modified 1000 97.1ms 103ms 9.53 152.09MB 17.1 10 18 1.05s
shows that the modified version is about 3 to 5 times faster than the original version. This may help the OP to reduce the compute time for his production dataset from 2+ hours by a significant factor.
I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (the amount of "anatomical" embryonic development).
I am calculating a weighted median. In this case, the median is temperature and it is weighted by development.
I have a script here that I am using to calculate the weighted median:
weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
x <- as.numeric(as.vector(x))
w <- as.numeric(as.vector(w))
if(anyNA(x) || anyNA(w)) {
ok <- !(is.na(x) | is.na(w))
x <- x[ok]
w <- w[ok]
}
stopifnot(all(w >= 0))
if(all(w == 0)) stop("All weights are zero", call.=FALSE)
#'
oo <- order(x)
x <- x[oo]
w <- w[oo]
Fx <- cumsum(w)/sum(w)
#'
result <- numeric(length(probs))
for(i in seq_along(result)) {
p <- probs[i]
lefties <- which(Fx <= p)
if(length(lefties) == 0) {
result[i] <- x[1]
} else {
left <- max(lefties)
result[i] <- x[left]
if(Fx[left] < p && left < length(x)) {
right <- left+1
y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]- Fx[left])
if(is.finite(y)) result[i] <- y
}
}
}
names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
return(result)
}
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
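For example, on a small made-up input:
weighted.median(x = c(30, 31, 32), w = c(1, 2, 1))
#>  50%
#> 30.5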
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the datasheet for Nest 1, which contains two vectors, temp and devo (devo is the weight)):
weighted.median(nest1$temp[c(1,1:24)],nest1$devo[c(1,1:24)],na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,25:48)],nest1$devo[c(1,25:48)],na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,49:72)],nest1$devo[c(1,49:72)],na.rm=TRUE)
all the way to
weighted.median(nest1$temp[c(1,(n-23):n)],nest1$devo[c(1,(n-23):n)],na.rm=TRUE)
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
group_by(day) %>%
summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277
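Since the rest of this thread leans heavily on data.table, the same grouped calculation could also be written as follows (a sketch, not benchmarked):
library(data.table)
setDT(nest1)
nest1[, .(x = weighted.median(temp, devo, na.rm = TRUE)), by = day]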
I have data on positions of several individuals, each registered at several time steps. I want to calculate the distance from each animal to all other animals registered at the same time step.
Here's a simplified example, with data on three individuals ('animal_id') registered on three dates ('date') each, on different positions ('x', 'y'):
library(data.table)
dt1 <- data.table(animal_id = 1, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt2 <- data.table(animal_id = 2, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt3 <- data.table(animal_id = 3, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt <- rbindlist(list(dt1, dt2, dt3))
# Create dist function between two animals at same time
dist.between.animals <- function(collar_id1, x1, y1, collar_id2, x2, y2) {
if (collar_id1 == collar_id2) return(NA)
sqrt((x1 - x2)^2 + (y1 - y2)^2)
}
# Get unique collar id of each animal, create column name for all animals per animal
animal_ids <- dt[ , unique(animal_id)]
animal_ids_str <- dt[,paste0("dist_to_", unique(animal_id))]
datetimes <- dt[ , unique(date)]
# Calculate distance of each animal to all animals, at same time
for (i in 1:length(animal_ids)) {
for (j in 1:length(datetimes)) {
x1 <- dt[.(animal_ids[i], datetimes[j]), x, on = .(animal_id, date)]
y1 <- dt[.(animal_ids[i], datetimes[j]), y, on = .(animal_id, date)]
dt[date == datetimes[j], animal_ids_str[i] := mapply(function(c, x2, y2) dist.between.animals(animal_ids[i], x1, y1, c, x2, y2), animal_id, x, y)]
}
}
Here's an example of what the output should look like:
animal_id date x y dist_to_1 dist_to_2 dist_to_3
1: 1 2014-01-01 -7.0276047 4.7660664 NA 7.1354265 13.7962800
2: 1 2014-01-02 -6.6383802 7.0087919 NA 3.7003879 16.4294999
3: 1 2014-01-03 -0.9722872 -4.8638019 NA 11.6447645 11.8313410
4: 2 2014-01-01 0.1076661 4.8131960 7.135426 NA 7.7052205
5: 2 2014-01-02 -8.9042124 4.0832364 3.700388 NA 13.3225921
6: 2 2014-01-03 8.2858839 2.1992575 11.644764 NA 0.4569632
7: 3 2014-01-01 5.7519522 -0.4320359 13.796280 7.7052205 NA
8: 3 2014-01-02 -9.0805265 -9.2381889 16.429500 13.3225921 NA
9: 3 2014-01-03 8.6832729 1.9736531 11.831341 0.4569632 NA
However, my real data have about 30 animals with 20,000 observations per animal, so my current code takes a long time to run. Is there a more efficient way to do this?
OK, so here's kind of an unorthodox method, especially given that for once I think data.tables make the situation worse. I'm using the dist() function, which calculates the Euclidean distance (or any other metric, your pick). If you use diag = T, upper = T, it generates a matrix that you can then assign to the specified rows/columns. Creating the columns might get tedious with multiple animals, but nothing that the paste() function can't fix.
dt[, c("dist_to_1", "dist_to_2", "dist_to_3") := NA]
dt <- arrange(dt, date, animal_id) # order by date (arrange() needs dplyr or plyr loaded; dt becomes a data.frame here)
for (i in 1:length(unique(dt$date))){
  sub <- subset(dt, dt$date == unique(dt$date)[i])
  dt[which(dt$date == unique(sub$date)), c("dist_to_1", "dist_to_2", "dist_to_3")] <- as.matrix(dist(sub[, c("x","y")], diag = T, upper = T))
}
dt[dt == 0] <- NA # assign NAs for 0s; not necessary if it's OK for the diagonal to be 0
setDT(dt) # back to data.table; again, this part is not strictly necessary
dt <- dt[order(animal_id, date)] # restore the initial order
Using this code:
> proc.time()-ptm
user system elapsed
0.051 0.007 0.068
Using earlier code:
> proc.time()-ptm
user system elapsed
0.083 0.004 0.092
If you find a way to use both dist and data.table you're golden, but I couldn't figure it out. It's pretty fast, since dist() calls C code, and its relative advantage grows as you add more observations.
You can make a self-join on date (dt[dt, on = "date"]), and for each match (by = .EACHI) calculate the distance:
dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI]
If you wish to turn the data into a wide format (dcast), chain this to the code above:
[ , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]
It seems to perform OK in a benchmark using the data of #digEmAll
library(microbenchmark)
microbenchmark(
digemall = dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date],
henrik = dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
, dcast(.SD, from_id + date ~ to_id, value.var = "dist")],
times = 5, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# digemall 3.206063 2.058547 2.189487 2.035975 2.032324 2.019082 5
# henrik 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5
Note that I haven't renamed the "to_id" in my code. That basically reflects my preference to keep the data in long format, and in that format I would like to have both 'from_id' and 'to_id' without any prefix. If you want prefixed columns in the wide format, just add to_id = paste0("dist_to_", i.animal_id) in the first step.
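Put together with that prefix, the wide variant could look like this (a sketch combining the two chained steps above):
dt[dt, on = "date",
   .(from_id = animal_id, to_id = paste0("dist_to_", i.animal_id),
     dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
   , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]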
Here's an alternative approach which should be much faster:
library(data.table)
### CREATE A BIG DATASET
set.seed(123)
nSamples <- 20000
nAnimals <- 30
allDates <- as.POSIXct(c("2014-01-01")) + (1:nSamples)*24*3600
dts <- lapply(1:nAnimals, function(id){
data.table(animal_id=id,date=allDates,
x=runif(nSamples,-10,10), y=runif(nSamples,-10,10))
})
dt <- rbindlist(dts)
### ALTERNATIVE APPROACH (NO LOOP)
animal_ids_str <- dt[,paste0("dist_to_",sort(unique(animal_id)))]
# set keys
setkey(dt,animal_id,date)
# add the distance columns
dt[,(animal_ids_str):=as.double(NA)]
# custom function that computes animal distances for a subset of dt at the same date
distancesInSameDate <- function(subsetDT,animal_ids_str){
m <- as.matrix(dist(subsetDT[,.(x,y)]))
diag(m) <- NA
cols <- paste0("dist_to_",subsetDT$animal_id)
missingCols <- animal_ids_str[is.na(match(animal_ids_str,cols))]
m <- cbind(m,matrix(NA,nrow=nrow(m),ncol=length(missingCols)))
colnames(m) <- c(cols,missingCols)
DF <- as.data.frame(m,stringsAsFactors=F)
DF <- DF[,match(animal_ids_str,colnames(DF))]
return(DF)
}
# let's compute the distances
system.time( dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date] )
On my machine it takes:
user system elapsed
13.76 0.00 13.82