Scale columns of a data.table to the unit interval

I have a data.table with a mixture of numeric and factor data, such as:
R> dat
x z y w
1: 3.307590 -1.66951137 b a
2: 1.809447 4.10331322 b b
3: 3.314621 3.69436879 a a
4: 1.896529 -0.08143017 c b
5: 3.317341 1.01839533 c a
6: 1.806456 -2.09547272 a b
...
I need to scale each of the numeric variables (x and z) to the unit interval. I store their minima and maxima in a separate matrix (the maximum is not simply max(x)). The first row is the min of each numeric variable, the second row is the max.
> cts.mat
x z
[1,] 1 -3
[2,] 4 5
How can I scale the x and z columns using the bounds in the matrix?
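In other words, each value should be transformed as (value - min) / (max - min). With the bounds above, the first row works out to roughly:
(3.307590 - 1) / (4 - 1)           # x -> 0.7691967
(-1.66951137 - (-3)) / (5 - (-3))  # z -> 0.1663111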
I tried something like
dat[, lapply(.SD, range01, cts.mat), .SDcol = c("x", "z")]
where range01 is the function
range01 <- function(x, cts.mat) {
  x.as.string <- deparse(substitute(x))
  # This is (x - lower) / (upper - lower)
  (x - cts.mat[, x.as.string][1]) / (cts.mat[, x.as.string][2] - cts.mat[, x.as.string][1])
}
But this does not work. I think my core problem is that I don't know how to run an lapply with arguments that change for each column of dat. The changing arguments in this case are the min and max of each numeric column.
Thanks for any help.

I found out a way to do it with a loop, which I think is good enough. If anyone knows how to do it with a data.table lapply, I'm still interested in a solution.
cts.names <- c("x", "z")
for (var in cts.names) {
dat[, (var) := scales::rescale(get(var),
from = c(0, 1),
to = cts.mat[, var])]
}
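For the record, one possible data.table lapply-style version of the same loop (my own sketch, not from the thread) uses Map() to pair each column in .SD with its column name, so the bounds pulled from cts.mat change per column, which is the part a plain lapply() cannot express on its own:
cts.names <- c("x", "z")
dat[, (cts.names) := Map(function(col, nm)
        scales::rescale(col, from = c(0, 1), to = cts.mat[, nm]),
      .SD, cts.names),
    .SDcols = cts.names]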

I'm very surprised by the lapply performance boost:
microbenchmark::microbenchmark(
  set_loop = {
    for (var in colnames(cts.mat)) {
      set(dat, j = var, value = scales::rescale(dat[[var]],
                                                from = c(0, 1),
                                                to = cts.mat[, var]))
    }
  },
  dt_loop = {
    for (var in colnames(cts.mat)) {
      dat[, c(var) := scales::rescale(dat[[var]],
                                      from = c(0, 1),
                                      to = cts.mat[, var])]
    }
  },
  lapply = {
    lapply(colnames(cts.mat),
           function(var) set(dat, j = var, value = scales::rescale(dat[[var]],
                                                                   from = c(0, 1),
                                                                   to = cts.mat[, var])))
  })
# Unit: microseconds
# expr min lq mean median uq max neval
# set_loop 2342.9 2380.90 2523.414 2437.15 2531.30 4856.3 100
# dt_loop 3109.0 3176.40 4000.359 3247.70 3383.35 64652.8 100
# lapply 65.8 74.85 103.510 83.85 90.00 2100.3 100
I verified that the results are the same with the three methods; I'd be interested if there is an explanation for this. Perhaps try it out with bigger data sets?
library(data.table)
dat <- read.table(text='x z y w
1: 3.307590 -1.66951137 b a
2: 1.809447 4.10331322 b b
3: 3.314621 3.69436879 a a
4: 1.896529 -0.08143017 c b
5: 3.317341 1.01839533 c a
6: 1.806456 -2.09547272 a b ',header=T)
setDT(dat)
cts.mat <- read.table(text='
x z
1: 1 -3
2: 4 5', header=T)
cts.mat <- as.matrix(cts.mat)
dat.ref <- copy(dat)
dat <- copy(dat.ref)
# set + loop
for (var in colnames(cts.mat)) {
  set(dat, j = var, value = scales::rescale(dat[[var]],
                                            from = c(0, 1),
                                            to = cts.mat[, var]))
}
result.set.loop <- copy(dat)
# dt + loop
dat <- copy(dat.ref)
for (var in colnames(cts.mat)) {
  dat[, c(var) := scales::rescale(dat[[var]],
                                  from = c(0, 1),
                                  to = cts.mat[, var])]
}
result.dt.loop <- copy(dat)
# set + lapply
dat <- copy(dat.ref)
lapply(colnames(cts.mat),
       function(var) set(dat, j = var, value = scales::rescale(dat[[var]],
                                                               from = c(0, 1),
                                                               to = cts.mat[, var])))
#> [[1]]
#> x z y w
#> 1: 10.922770 -16.356091 b a
#> 2: 6.428341 29.826506 b b
#> 3: 10.943863 26.554950 a a
#> 4: 6.689587 -3.651441 c b
#> 5: 10.952023 5.147163 c a
#> 6: 6.419368 -19.763782 a b
#>
#> [[2]]
#> x z y w
#> 1: 10.922770 -16.356091 b a
#> 2: 6.428341 29.826506 b b
#> 3: 10.943863 26.554950 a a
#> 4: 6.689587 -3.651441 c b
#> 5: 10.952023 5.147163 c a
#> 6: 6.419368 -19.763782 a b
result.set.lapply <- copy(dat)
all.equal(result.dt.loop,result.set.loop)
#> [1] TRUE
all.equal(result.set.loop,result.set.lapply)
#> [1] TRUE
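One caveat about the benchmark above (my own note, not from the thread): every expression modifies dat in place, so each microbenchmark iteration rescales values that earlier iterations have already rescaled. A sketch of a fairer setup, giving each expression a fresh copy (copy() adds its own overhead, so the timings are only indicative):
microbenchmark::microbenchmark(
  set_loop = {
    d <- copy(dat.ref)
    for (var in colnames(cts.mat)) {
      set(d, j = var, value = scales::rescale(d[[var]],
                                              from = c(0, 1),
                                              to = cts.mat[, var]))
    }
  },
  lapply = {
    d <- copy(dat.ref)
    lapply(colnames(cts.mat), function(var)
      set(d, j = var, value = scales::rescale(d[[var]],
                                              from = c(0, 1),
                                              to = cts.mat[, var])))
  })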

Related

Closest other Value in the same Vector

I have a vector
set.seed(2)
x <- sample.int(20, 5)
[1] 4 14 11 3 16
Now, for every element I want to find:
- the element with the minimum distance (min(abs(x[i]-x[-i])) for element i), which here would be
[1] 3 16 14 4 14
- the (first) index of the element with the minimum distance, which here would be
[1] 4 5 2 1 2
The point is that the element itself is not considered, only all the other elements, which is why this question (R - Fastest way to find nearest value in vector) is not the answer.
If the actual answer is out there, sorry - I didn't find it.
1) Rfast Using dista in Rfast we get the indexes of the closest two. Take the second closest as the closest will be the same value.
library(Rfast)
x <- c(4, 14, 11, 3, 16) # input
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1] 3 16 14 4 14
2) sqldf Using SQL we can left join DF to itself excluding the same value and take the row with the minimum distance.
DF <- data.frame(x) # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[1:2]
giving:
x nearest
1 4 3
2 14 16
3 11 14
4 3 4
5 16 14
3) zoo Sort the input, take the value corresponding to the least difference on either side of each element, and order it back.
library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1] 3 16 14 4 14
4) Base R Using ix and least from (3) we can mimic (3) using only base functions as follows.
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
## [1] 3 16 14 4 14
4a) This slightly shorter variation would also work:
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
## [1] 3 16 14 4 14
4b) Simplifying further we have the following base solution where, again, ix is from (3):
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 3 16 14 4 14
Duplicates
The example in the question had no duplicates, but if there were duplicates there is some question regarding the problem definition. For example, if we had c(1, 3, 4, 1), then looking at the first value, 1, there is another value exactly equal to it, so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned, which in this case is 3. In the code above, the sqldf solution gives the closest value not equal to the current value, whereas the others give the closest value among the remaining values.
If we wanted the closest-not-equal interpretation for the methods other than sqldf, we could use rle after ordering to compress the vector down to unique values and then apply inverse.rle afterwards, as shown in this modified version of 4b:
x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3
I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN). The code is appended below. TL;DR: The base R version 4b by user G. Grothendieck was most efficient, and by a significant margin.
library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)
# All functions take a vector as argument,
# and return the values of nearest neighbours (not their index)
# Using base R, by ThomasIsCoding
base_nn <- function(x) {
  d <- data.frame(`diag<-`(as.matrix(dist(x)), Inf))
  id <- unlist(Map(which.min, d))
  x[id]
}
# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
  x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}
# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
  DF <- data.frame(x) # x is from (1)
  unname(
    unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
                  from DF a
                  left join DF b on a.x != b.x
                  group by a.rowid")[2])
  )
}
# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}
# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
}
# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  -apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
}
# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
  ix <- order(x)
  xx <- x[ix]
  x1 <- c(-Inf, xx[-length(xx)])
  x2 <- c(xx[-1], Inf)
  ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}
# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
  dt <- setkey(data.table(x), x)
  dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
  dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
  dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
  unname(unlist(dt[, .(closex)]))
}
# Using RANN, by me
rann_nn <- function(x) {
  id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
  x[id]
}
### Apply all methods
# Test that all have the same output:
x <- c(4, 14,11,3,16)
rann_nn(x)
# [1] 3 16 14 4 14
base_nn(x)
# [1] 3 16 14 4 14
rfast_nn(x)
# [1] 3 16 14 4 14
sqldf_nn(x)
# [1] 3 16 14 4 14
zoo_nn(x)
# [1] 3 16 14 4 14
base2_nn(x)
# [1] 3 16 14 4 14
base3_nn(x)
# [1] 3 16 14 4 14
base4_nn(x)
# [1] 3 16 14 4 14
dt_nn(x) # ordered differently: setkey() sorts the table by x, so the result follows sorted order
# [1] 4 3 14 16 14
# Compare running times
library(microbenchmark)
# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
rann_nn(benchmark_data),
base_nn(benchmark_data),
rfast_nn(benchmark_data),
sqldf_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 641.180 684.1975 776.5467 711.6680 775.3635 3822.023 100
# base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712 100
# rfast_nn(benchmark_data) 45160.603 47032.5225 47681.0557 47594.0075 48308.8440 50579.839 100
# sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960 100
# zoo_nn(benchmark_data) 4359.359 4604.0275 5008.4291 4785.1515 5037.9705 14999.802 100
# base2_nn(benchmark_data) 1292.322 1407.4875 1747.8404 1462.7295 1588.1580 11297.321 100
# base3_nn(benchmark_data) 1263.644 1396.9210 1615.7495 1472.9940 1571.8575 11828.015 100
# base4_nn(benchmark_data) 119.543 146.1080 254.5075 178.1065 197.4265 7726.156 100
# dt_nn(benchmark_data) 5290.337 6580.6965 7111.1816 6892.3800 7351.3795 29469.815 100
# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
rann_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882 100
# zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397 100
# base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470 100
# base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740 100
# base4_nn(benchmark_data) 9.354764 10.46687 17.22086 12.36354 14.22882 166.4758 100
# dt_nn(benchmark_data) 96.503882 104.06914 117.95408 108.20284 121.11428 247.2092 100
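A side note on dt_nn() above (my own sketch, not from the original answer): setkey() sorts the working table by x, which is why its result comes back in sorted rather than input order. Carrying the original row index through the same non-equi joins restores the input order; like the original, this assumes a data.table version recent enough for fcoalesce() and fifelse():
dt_nn_ordered <- function(x) {
  dt <- data.table(x = x, idx = seq_along(x))
  setkey(dt, x)  # physical sort by x, as in dt_nn()
  dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
  dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
  dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
  dt[order(idx), closex]  # back to the original element order
}
dt_nn_ordered(c(4, 14, 11, 3, 16))
# [1] 3 16 14 4 14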
Here is a base R solution
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]
such that
> ids
X1 X2 X3 X4 X5
4 5 2 1 2
> val
[1] 3 16 14 4 14
DATA
x <- c(4, 14,11,3,16)
Option with a data.table non-equi-join
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]
# x closex
# 1: 3 4
# 2: 4 3
# 3: 11 14
# 4: 14 16
# 5: 16 14

How to calculate mean per group from list of data.tables?

Let's say we have a list of data.table's like this:
dt <- data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T))
myList <- list(dt1=dt,dt2=dt,dt3=dt,dt4=dt,dt5=dt)
If I wanted a solution that would calculate the mean per group across all data.tables, I could do the following:
bigDT <- rbindlist(myList)
bigDT[,list('average'=mean(x)),by=letters]
With my data, however, each dt is quite large (millions of rows) and each list is substantial as well (500-1000 dt in each list). There are also considerably more than two options for the by choice.
Part of a function I plan on optimizing using genetic algorithms requires computing the above mean by group. I was wondering if there is a more efficient solution than rbind-ing the list before using data.table's ability to calculate per group? Otherwise, the maximization algorithm will make many function calls to this potential bottleneck.
Any help would be appreciated!
microbenchmark(doThis())
Unit: milliseconds
expr min lq mean median uq max neval
doThis() 151.512 154.3395 174.8071 167.7151 170.2952 440.9359 100
One method is to compute the grouped means for each table within the list, then bind them, then compute a weighted mean of those. Since you have different counts of each letter, you'll need to preserve the .N (group count) as well.
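To see why the counts matter, a tiny made-up example (my own, not from the thread): an unweighted mean of group means ignores group sizes, while the count-weighted mean reproduces the pooled mean.
mu <- c(10, 20)  # per-table group means
n  <- c(1, 99)   # per-table group counts
mean(mu)              # 15   -- not the pooled mean
sum(n * mu) / sum(n)  # 19.9 -- equals the mean of the pooled data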
I'm going to change each element of the list so that we can verify the weighted-mean calculations. For reproducibility:
set.seed(1)
myList <- replicate(5, data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T)),
simplify=FALSE)
myList[1:2]
# [[1]]
# x letters
# 1: 93.73546 P
# 2: 101.83643 I
# 3: 91.64371 F
# 4: 115.95281 V
# 5: 103.29508 D
# ---
# 999996: 109.24487 Q
# 999997: 99.86486 K
# 999998: 93.95941 J
# 999999: 116.28763 O
# 1000000: 106.93750 E
# [[2]]
# x letters
# 1: 97.53576 R
# 2: 105.27503 T
# 3: 107.53592 L
# 4: 102.21228 M
# 5: 98.71087 G
# ---
# 999996: 109.46843 C
# 999997: 99.14458 M
# 999998: 96.76845 Y
# 999999: 94.22413 E
# 1000000: 98.25855 K
To do this for just one table:
head(myList[[1]][,.(mu = mean(x), n = .N), keyby=letters])
# letters mu n
# 1: A 100.04987 39005
# 2: B 100.01288 38576
# 3: C 99.97402 38547
# 4: D 99.99909 38460
# 5: E 100.03689 38030
# 6: F 100.02697 38293
First, compute the averages per-list-element:
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
Now do the weighted-mean either manually or with Hmisc::wtd.mean:
cbind(
# just to verify the below answer is the same as the brute-force method of rbind-then-average
rbindlist(myList)[,.(mu = mean(x)), keyby=letters],
# either of these is your answer
myAgg[,.(mu = sum(n*mu)/sum(n)),keyby=letters],
myAgg[,.(mu = Hmisc::wtd.mean(mu, weights=n)),keyby=letters]
)
# letters mu letters mu letters mu
# 1: A 100.02325 A 100.02325 A 100.02325
# 2: B 100.03473 B 100.03473 B 100.03473
# 3: C 100.00688 C 100.00688 C 100.00688
# 4: D 100.04041 D 100.04041 D 100.04041
# 5: E 100.00780 E 100.00780 E 100.00780
# 6: F 100.01202 F 100.01202 F 100.01202
# 7: G 100.01200 G 100.01200 G 100.01200
# 8: H 99.97232 H 99.97232 H 99.97232
# 9: I 100.00495 I 100.00495 I 100.00495
# 10: J 100.03019 J 100.03019 J 100.03019
# 11: K 99.96851 K 99.96851 K 99.96851
# 12: L 100.01850 L 100.01850 L 100.01850
# 13: M 100.00976 M 100.00976 M 100.00976
# 14: N 100.01299 N 100.01299 N 100.01299
# 15: O 100.02108 O 100.02108 O 100.02108
# 16: P 100.02052 P 100.02052 P 100.02052
# 17: Q 100.03814 Q 100.03814 Q 100.03814
# 18: R 99.99013 R 99.99013 R 99.99013
# 19: S 99.95219 S 99.95219 S 99.95219
# 20: T 99.97721 T 99.97721 T 99.97721
# 21: U 99.96310 U 99.96310 U 99.96310
# 22: V 99.94430 V 99.94430 V 99.94430
# 23: W 99.98877 W 99.98877 W 99.98877
# 24: X 100.07352 X 100.07352 X 100.07352
# 25: Y 99.96677 Y 99.96677 Y 99.96677
# 26: Z 99.99397 Z 99.99397 Z 99.99397
# letters mu letters mu letters mu
Quick benchmarking, for comparison:
library(microbenchmark)
microbenchmark(
bruteforce = rbindlist(myList)[,.(mu = mean(x)), keyby=letters],
# either of these is your answer
baseR = {
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
myAgg[,.(mu = sum(n*mu)/sum(n)),keyby=letters]
},
Hmisc = {
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
myAgg[,.(mu = Hmisc::wtd.mean(mu, weights=n)),keyby=letters]
},
times=50
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# bruteforce 131.8770 139.4562 153.93202 151.95375 159.6329 315.6117 50
# baseR 89.7047 93.3623 109.20174 98.11670 115.0171 268.2517 50
# Hmisc 89.2784 91.5927 97.87455 93.73475 98.1655 119.2671 50

How can I delete certain rows according to two columns which have symmetrical values in data.table in R?

For example, I have a table as follows:
DT <- data.table(
A = c(1,1,1,2,2,2,3,3,3),
B = c(1,2,3,1,2,3,1,2,3),
key = "A"
)
I want to delete rows under a condition such as "A" == 2 and "B" == 1, since there is already a row with "A" == 1 and "B" == 2.
In short, I want to delete the rows whose (A, B) pair already appears, with the two values swapped, in a previous row. How can I achieve this?
Maybe not the most efficient, but it leverages the duplicated.matrix method:
DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
# A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
Another option:
DT[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
So ...
- make a grouping variable as B + A,
- flip it to A + B on the subset where A < B (so the smaller value always comes first),
- dedupe on the grouping variable.
The last step could alternatively be unique(DT, by="g"); a small sketch of that variant follows.
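A small sketch of that unique(DT, by = "g") variant (mine, not from the answer; it builds the same canonical key in one step with pmin/pmax instead of the two-step paste):
DT[, g := paste(pmin(A, B), pmax(A, B), sep = "_")]
unique(DT, by = "g")[, !"g"]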
If you only have two columns, then you could do:
unique(do.call(function(A,B)data.table(A=pmin(A,B),B=pmax(A,B)),DT))
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
Another method for the case where there are only 2 columns, using an anti-join.
dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]
timing code:
library(data.table)
set.seed(0L)
nr <- 1e5
nElem <- 1e3
mat <- matrix(sample(nElem, nr*2, replace=TRUE), ncol=2)
DT <- as.data.table(mat)
setnames(DT, c("A", "B"))
DT2 <- copy(DT)
library(microbenchmark)
mtd1 <- function() unique(data.frame(A=pmin(mat[, 1], mat[, 2]), B=pmax(mat[, 1], mat[, 2])))
mtd2 <- function() DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
mtd3 <- function() DT2[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
mtd4 <- function() {
dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]
}
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),times=3L)
some timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 118.62051 129.50581 153.77216 140.39111 171.34799 202.30487 3
mtd2() 3500.47877 3552.80879 3732.67006 3605.13882 3848.76571 4092.39260 3
mtd3() 89.22901 92.94830 97.22658 96.66759 101.22536 105.78313 3
mtd4() 28.61628 32.37641 50.90126 36.13654 62.04375 87.95096 3
But the fastest is eddi's method (from: data.table with two string columns of set elements, extract unique rows with each row unsorted):
mtd5 <- function() DT[DT[, .I[1L], by=.(pmin(A, B), pmax(A, B))]$V1]
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),mtd5(),times=3L)
timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 149.62224 150.70685 175.66394 151.79146 188.68479 225.57813 3
mtd2() 4126.51014 4140.72876 4277.37907 4154.94738 4352.81353 4550.67968 3
mtd3() 126.01679 131.26463 134.63642 136.51247 138.94624 141.38000 3
mtd4() 39.24141 42.42815 45.65804 45.61489 48.86635 52.11781 3
mtd5() 12.58396 16.68156 18.21613 20.77915 21.03221 21.28527 3
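For reference, my reading of eddi's one-liner: grouping by the sorted pair (pmin(A, B), pmax(A, B)) puts symmetric rows into the same group, .I[1L] keeps the first row number of each group, and the outer DT[...] subsets to those rows. On the original 9-row example DT (not the larger benchmark table) the inner part reduces to:
idx <- DT[, .I[1L], by = .(pmin(A, B), pmax(A, B))]$V1
idx
# [1] 1 2 3 5 6 9
DT[idx]
#    A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3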

Expanding window (cumulative calculation) in data.table: how to improve performance

I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times within and among time steps.
Some toy data:
df <- data.frame(grp = rep(1:2, each = 8),
time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
df
# grp time val
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 4 1 2 2
# 5 1 2 3
# 6 1 3 2
# 7 1 3 3
# 8 1 3 4
# 9 2 1 1
# 10 2 1 2
# 11 2 1 3
# 12 2 2 1
# 13 2 2 1
# 14 2 3 1
# 15 2 3 2
# 16 2 3 3
Objectives
I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.
For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
My data.table code produces the desired result. On a small data set it is slower than my base attempt, which I believe is fair enough, given some possible overhead in the data.table code. With a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.
What made me post this question is that I believe the relative performance of my code is a strong indicator of me abusing data.table (I am sure the reason is not data.table performance itself). The main objective of my question is therefore to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?
My data.table code:
library(data.table)
f_dt <- function(df){
  setDT(df, key = c("grp", "time", "val"))[ , {
    # key or not only affects speed marginally
    # unique time steps
    times <- .SD[ , unique(time)]
    # index vector to loop over
    idx <- seq_along(times)
    # pre-allocate data table
    d2 <- data.table(time = times,
                     n_val = integer(1),
                     n_re = integer(1),
                     re_rate = numeric(1))
    # loop to generate expanding window
    for(i in idx){
      # number of registrations per val
      n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]
      # number of unique val
      set(x = d2, i = i, j = 2L, length(n))
      # number of val registered more than once
      set(x = d2, i = i, j = 3L, sum(n > 1))
    }
    # proportion of values registered more than once
    d2[ , re_rate := round(n_re / n_val, 2)]
    d2
  }
  , by = grp]
}
...which gives the desired result:
f_dt(df)
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.50
# 2: 1 2 3 2 0.67
# 3: 1 3 4 3 0.75
# 4: 2 1 3 0 0.00
# 5: 2 2 3 1 0.33
# 6: 2 3 3 3 1.00
Corresponding base code:
f_by <- function(df){
  do.call(rbind,
          by(data = df, df$grp, function(d){
            times <- unique(d$time)
            idx <- seq_along(times)
            d2 <- data.frame(grp = d$grp[1],
                             time = times,
                             n_val = integer(1),
                             n_re = integer(1),
                             re_rate = numeric(1))
            for(i in idx){
              dat <- d[d$time %in% times[seq_len(i)], ]
              tt <- table(dat$val)
              n_re <- sum(tt > 1)
              n_val <- length(tt)
              re_rate <- round(n_re / n_val, 2)
              d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
            }
            d2
          })
  )
}
Timings:
Tiny toy data from above:
library(microbenchmark)
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686 10
Some larger data:
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983 10
No, the data is still not large, but I would expect data.table to catch up by now, if coded properly. I believe this suggests that there is large potential for improvement in my code. Any advice is highly appreciated.
f <- function(df){
  setDT(df)[, n_val := cumsum(!duplicated(val)), grp
            ][, occ := 1:.N, .(grp, val)
            ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
            ][, n_re := n_val - occ1
            ][, re_rate := round(n_re/n_val, 2)
            ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), .(grp, time)]
}
where
cumsum(!duplicated(val)) counts the (cumulative) number of unique values seen so far, n_val,
occ counts the cumulative occurrences of each value (note that it is grouped by val),
occ1 then counts the number of values in val that have occurred only once so far.
The number of values that have occurred only once increases by 1 when occ == 1 and decreases by 1 when occ == 2; hence cumsum(occ == 1) - cumsum(occ == 2).
The number of values which have occurred more than once is n_val - occ1.
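To make the steps concrete, here is a small inspection sketch (my own addition) that rebuilds the toy data and the intermediate columns for grp 1; the last row of each time block carries the reported n_val and n_re:
library(data.table)
toy <- data.frame(grp = rep(1:2, each = 8),
                  time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                  val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
setDT(toy)
d <- toy[grp == 1]
d[, n_val := cumsum(!duplicated(val))]
d[, occ := seq_len(.N), by = val]
d[, occ1 := cumsum(occ == 1) - cumsum(occ == 2)]
d[, n_re := n_val - occ1][]
#    grp time val n_val occ occ1 n_re
# 1:   1    1   1     1   1    1    0
# 2:   1    1   2     2   1    2    0
# 3:   1    1   1     2   2    1    1
# 4:   1    2   2     2   2    0    2
# 5:   1    2   3     3   1    1    2
# 6:   1    3   2     3   3    1    2
# 7:   1    3   3     3   2    0    3
# 8:   1    3   4     4   1    1    3
The last rows of the three time blocks give n_val/n_re of 2/1, 3/2 and 4/3, i.e. re_rate 0.50, 0.67 and 0.75, matching f_dt(df) above.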
Speed Comparison
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
system.time(f(df))
# user system elapsed
# 0.038 0.000 0.038
system.time(f_dt(df))
# user system elapsed
# 16.617 0.013 16.727
system.time(f_by(df))
# user system elapsed
# 16.077 0.040 16.122
Hope this helps.
I was looking for a better way to code an expanding window over non-duplicated groups and came across this question.
This question seems to be more about an expanding window where the group (i.e. time in the question) is duplicated. Below is a solution making use of between.
# expanding group by where groups are duplicated
library(data.table)
setDT(df)
df[ , {
  # get list of unique time groups to be used in the expanding group
  uniqt <- unique(time)
  c(list(time = uniqt),  # output time as well
    # expanding window of each unique time group
    do.call(rbind, lapply(uniqt, function(n) {
      # tabulate the occurrences
      x <- table(val[between(time, uniqt[1L], n)])
      # calculate desired values
      n_val <- length(x)
      n_re <- sum(x > 1)
      data.frame(n_val = n_val, n_re = n_re, re_rate = n_re/n_val)
    })))
}, by = grp]
result:
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.5000000
# 2: 1 2 3 2 0.6666667
# 3: 1 3 4 3 0.7500000
# 4: 2 1 3 0 0.0000000
# 5: 2 2 3 1 0.3333333
# 6: 2 3 3 3 1.0000000
I was unable to find in which version of data.table between was first released; it might have been released after this question was posted.

Label portions of a dataframe based on boolean value, including previous rows?

For a given dataframe, I'd like to split it based on some boolean value, and then apply a label to that row and the previous rows up until that point.
Assuming the following dataframe:
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
I'd ultimately like to create a new column that would contain a label for that specific portion of the dataframe. Ideally, something like the following:
x y z
1 F 1
2 F 1
3 F 1
4 T 1
5 F 2
6 F 2
7 T 2
8 F 3
9 F 3
10 F 3
My current thought is that I need to loop through the dataframe with a function similar to the following (but not exactly):
label.portion <- function(test) {
  for (i in 1:nrow(test)) {
    z <- 1
    if (test$y[i]) { z <- z + 1 }
    return(z)
  }
}
What is the best/easiest way of doing this? Any help is much appreciated.
Your z column can be built as
z <- with(test, sum(y)-rev(cumsum(rev(y)))+1)
in order to make every new z value start at a FALSE y after a TRUE y, as per your example.
Then you can do cbind(test, z) to get what you want.
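To see how this counts down the remaining TRUEs and turns that into block labels, a short trace on the example test (my own addition):
with(test, rev(cumsum(rev(y))))               # TRUEs remaining from each row onwards
# [1] 2 2 2 2 1 1 1 0 0 0
with(test, sum(y) - rev(cumsum(rev(y))) + 1)  # the z labels
# [1] 1 1 1 1 2 2 2 3 3 3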
One-liner solution using transform
transform(test,z= cumsum(c(0,diff(y)) == -1)+1)
x y z
1 1 FALSE 1
2 2 FALSE 1
3 3 FALSE 1
4 4 TRUE 1
5 5 FALSE 2
6 6 FALSE 2
7 7 TRUE 2
8 8 FALSE 3
9 9 FALSE 3
10 10 FALSE 3
Another one-liner solution, which will be slightly faster than the other solutions (except data.table):
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
test$z <- c(1, head(cumsum(test$y), -1) + 1)
test
## x y z
## 1 1 FALSE 1
## 2 2 FALSE 1
## 3 3 FALSE 1
## 4 4 TRUE 1
## 5 5 FALSE 2
## 6 6 FALSE 2
## 7 7 TRUE 2
## 8 8 FALSE 3
## 9 9 FALSE 3
## 10 10 FALSE 3
Benchmarks with other solutions provided (excluding data.table)
test <- data.frame(x = 1:1e+05, y = sample(c(T, F), size = 1e+05, replace = TRUE))
microbenchmark(c(1, head(cumsum(test$y), -1) + 1),
               cumsum(c(0, diff(test$y)) == -1) + 1,
               with(test, sum(y) - rev(cumsum(rev(y))) + 1),
               times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## c(1, head(cumsum(test$y), -1) + 1) 1.685473 1.758474 1.865409 4.647218 5.091512 100
## cumsum(c(0, diff(test$y)) == -1) + 1 4.064867 4.379714 6.936561 7.338810 7.657961 100
## with(test, sum(y) - rev(cumsum(rev(y))) + 1) 2.568766 2.720395 5.396096 5.701176 30.642436 100
Here is an approach using na.locf (from zoo, loaded via xts) and data.table, for coding elegance (and efficiency):
library(data.table)
library(xts) # for na.locf
test <- data.table(test)
test[(y), grp := seq_along(y)][, grp := na.locf(grp, fromLast = TRUE)]
test[is.na(grp), grp := max(test[, grp], na.rm =TRUE) + 1L]
And a far clearer and faster approach
test[, grp := {xx <- diff(c(0,.I[y], length(.I))); rep.int(seq_along(xx),xx)}]
Note that diff uses a for loop implemented in R, so an Rcpp (sugar) implementation would be faster (I'm sure that a C++ function would blow most of these out of the water).
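A quick trace of that expression on the example data (my own addition; test is rebuilt so the snippet is self-contained):
library(data.table)
test <- data.table(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
test[, .I[y]]                                # row numbers of the TRUE values
# [1] 4 7
xx <- diff(c(0, test[, .I[y]], nrow(test)))  # lengths of the three blocks
xx
# [1] 4 3 3
rep.int(seq_along(xx), xx)                   # the grp labels
# [1] 1 1 1 1 2 2 2 3 3 3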
