I'm trying to add columns to my data.table that essentially append a cumulative frequency table for each group that is aggregated. Unfortunately, my current solution is about ten times slower than I had hoped.
Here is what I'm using (apologies for the ugly one-liner):
DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(hist(colx,c(lbound,bound1,bound2, bound3,ubound),plot=FALSE)$counts)), by=category]
If the bin boundaries are set at 0,25,50,75,100, I would like my table to look like:
id category colx bin1 bin2 bin3 bin4
1 a 5 1 2 2 3
2 a 30 1 2 2 3
3 b 21 1 2 3 4
4 c 62 0 1 3 3
5 b 36 1 2 3 4
6 a 92 1 2 2 3
7 c 60 0 1 3 3
8 b 79 1 2 3 4
9 b 54 1 2 3 4
10 c 27 0 1 3 3
In the actual dataset I'm grouping using 4 different columns and there are millions of rows and unique groups. When I try a simpler function, such as sum, it takes an acceptable amount of time to do the calculation. Is there any way to significantly speed up the counting process?
Okay, here's one way (using data.table v1.9.3). Remove the by=.EACHI if you're on a version <= 1.9.2.
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, category, ival)
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
id category colx bin_1 bin_2 bin_3 bin_4
1: 1 a 5 1 2 2 3
2: 2 a 30 1 2 2 3
3: 6 a 92 1 2 2 3
4: 3 b 21 1 2 3 4
5: 5 b 36 1 2 3 4
6: 9 b 54 1 2 3 4
7: 8 b 79 1 2 3 4
8: 10 c 27 0 1 3 3
9: 4 c 62 0 1 3 3
10: 7 c 60 0 1 3 3
Benchmark on simulated large data:
Here I generate a data.table with 20 million rows and a total of 1 million groups, using 2 grouping columns (instead of 4 as in your question).
K = 1e3L
N = 20e6L
sim_data <- function(K, N) {
set.seed(1L)
ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)
data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}
dt <- sim_data(K, N)
method1 <- function(x) {
dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, x, y, ival)
ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
}
system.time(ans1 <- method1(dt))
# user system elapsed
# 13.148 2.778 16.209
I hope this is faster than your original solution and scales well for your real data dimensions.
Update: Here's another version using data.table's rolling joins instead of findInterval from base. We have to modify the intervals slightly so that the rolling join finds the right match.
dt <- sim_data(K, N)
method2 <- function(x) {
ivals = seq(24L, 100L, by=25L)
ivals[length(ivals)] = 100L
setkey(dt, x,y,val)
dt[, ival := seq_len(.N), by="x,y"]
ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
dt[, ival := NULL]
ans2 <- dt[ans]
}
system.time(ans2 <- method2(dt))
# user system elapsed
# 12.538 2.649 16.079
## check if both methods give identical results:
setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)
identical(ans1, ans2) # [1] TRUE
Edit: some explanation of why the OP's approach is so time consuming:
A huge reason, I suspect, for the difference in runtime between these solutions and hist is that both answers here are vectorised (written entirely in C and operating on the whole data set directly), whereas hist is an S3 generic: dispatch to the .default method takes time, and the method itself is written in R. So you are essentially executing hist, an R function, about a million times (once per group), whereas the two vectorised solutions each make a single C-level call and never have to be invoked per group.
And since that is the most expensive part of your computation, it obviously slows things down.
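For what it's worth, here is a minimal sketch of the same idea without any join at all, relying on a single vectorised findInterval() call plus a cheap tabulate()/cumsum() per group. This is my own illustration (not part of the answer above), and it treats the bins as left-closed, which can differ slightly from hist()'s default right-closed intervals:
# one C-level cut of the whole column
DT[, ival := findInterval(colx, c(lbound, bound1, bound2, bound3, ubound), rightmost.closed=TRUE)]
# per group, only tabulate() + cumsum() on a small integer vector is needed
DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(tabulate(ival, nbins=4L))), by=category]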
This is related to this question. I have data like this:
x t
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
I'd like to flag the last observation in every group (while keeping the other observations), where groups are defined by x and the "last" observation is the one with the largest t. I tried this:
dt[order(x, t), flag_last := 1, by = "x", mult = "last"]
but that returns
x t flag_last
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 1
5: 2 2 1
6: 2 3 1
The desired output is
x t flag_last
1: 1 1 0
2: 1 2 0
3: 1 3 1
4: 2 1 0
5: 2 2 0
6: 2 3 1
Am I going about this the wrong way?
A couple of caveats:
The actual dataset is roughly 61 GB and there are only a couple of observations per x group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do.
Obviously this is simple data. The number of observations within each group is not necessarily the same, and the values for t differ too, so simply picking out t == 3 will not work.
Use the built-in .I like this:
DT[, is.end := .I == last(.I), by = "x"]
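As a small follow-up sketch of my own (not part of the answer above): if the table is not already sorted by t within x, and you want the 0/1 integer flag_last from the question rather than a logical column, something along these lines should work:
setorder(dt, x, t)  # make sure "last" means largest t within each x
dt[, flag_last := as.integer(.I == last(.I)), by = x]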
dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]
# x t flag_last
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 0
# 6: 2 3 1
One option is to use .N and which.max to check for equality between the row index and the row index at which t is maximized
df[, flag := as.integer(1:.N == which.max(t)), x]
But benchmarking shows replace is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I is fastest.
df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))
library(microbenchmark)
microbenchmark(
replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
use.max = df[, flag := as.integer(t==max(t)), x],
use.I = {
df[, flag := 0L]
df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
},
use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
times = 20)
# Unit: seconds
# expr min lq mean median uq max neval cld
# replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990 20 b
# use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207 20 c
# use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942 20 c
# use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353 20 bc
# use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989 20 a
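If you go with the NA-producing use.I.no0 variant but still need literal 0s, one cheap follow-up is to fill the NAs by reference afterwards (my own sketch; setnafill() requires data.table >= 1.12.4):
df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
setnafill(df, type = "const", fill = 0L, cols = "flag")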
This would do the trick if you create an id variable that you can then use to merge the two datasets together:
library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))
data <- data.frame(x,t, id)
You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.
sliced <- data %>%
group_by(x) %>%
slice(which.max(t)) %>%
ungroup() %>%
select(id) %>%
mutate(max = "max_group")
tot <- left_join(data, sliced, c("id"))
The sliced data frame has only two variables, so it shouldn't be too bad to work with. This is the easiest way that came to mind.
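If you then want the 0/1 flag_last column from the question rather than the "max_group" label, a small follow-up along these lines should do (my addition, not part of the answer above):
tot <- tot %>%
  mutate(flag_last = as.integer(!is.na(max))) %>%
  select(-max)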
This is my first question here in a long time :).
I've got a data frame with data about patient visits to a clinic.
visit_id <- c(1,2,3,4,5,6,7,8,9,10)
patient_id <- c(1,2,1,1,3,2,1,4,5,6)
visit_date <- as.Date(c('2016-12-02','2016-12-02','2016-12-30',
'2016-12-15','2016-12-30','2017-02-01',
'2017-02-15','2017-02-10','2017-01-15','2017-03-01'))
df <- data.frame(visit_id, patient_id, visit_date)
It looks like this:
visit_id patient_id visit_date
1 1 1 2016-12-02
2 2 2 2016-12-02
3 3 1 2016-12-30
4 4 1 2016-12-15
5 5 3 2016-12-30
6 6 2 2017-02-01
7 7 1 2017-02-15
8 8 4 2017-02-10
9 9 5 2017-01-15
10 10 6 2017-03-01
I want to add one more column that would indicate the number of times the patient has been to the clinic in the last 20 weeks:
visit_id patient_id visit_date visits_previous_20_weeks
1 1 1 2016-12-02 0
2 2 2 2016-12-02 0
3 3 1 2016-12-30 2
4 4 1 2016-12-15 1
5 5 3 2016-12-30 0
6 6 2 2017-02-01 1
7 7 1 2017-02-15 3
8 8 4 2017-02-10 0
9 9 5 2017-01-15 0
10 10 6 2017-03-01 0
The only data source is this table. At the beginning of the table, since this is patient 1's first record, the count is 0. But on December 15th, 2016, the patient comes back to the clinic, so the number of visits in the previous 20 weeks (as of that date) is 1.
One inefficient way to do this would be a loop that, for each row in the data frame, goes through the whole data frame and tallies the number of visits by the same patient in the previous 20 weeks (see the sketch below). Is there a better way to do this in R?
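For reference, this is roughly what that naive approach looks like (a sketch only; it scans the whole table once per row, so it will not scale):
df$visits_previous_20_weeks <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$patient_id == df$patient_id[i] &
      df$visit_date  <  df$visit_date[i] &
      df$visit_date  >  df$visit_date[i] - 20 * 7)  # strictly within the previous 20 weeks
})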
Thanks :)
Here's a way using the data.table package. What this is basically doing is first creating a 20-week boundary column and then performing a non-equi self join while counting the matches.
library(data.table)
setDT(df)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
df[, visit_date20 := visit_date - 20*7] # Create a 20-week boundary
## Count previous visits within the range
df[df, .(Visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
# patient_id visit_date visit_date Visits
# 1: 1 2016-12-02 2016-07-15 0
# 2: 2 2016-12-02 2016-07-15 0
# 3: 1 2016-12-30 2016-08-12 2
# 4: 1 2016-12-15 2016-07-28 1
# 5: 3 2016-12-30 2016-08-12 0
# 6: 2 2017-02-01 2016-09-14 1
# 7: 1 2017-02-15 2016-09-28 3
# 8: 4 2017-02-10 2016-09-23 0
# 9: 5 2017-01-15 2016-08-28 0
# 10: 6 2017-03-01 2016-10-12 0
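If you want those counts written back into df as the requested visits_previous_20_weeks column, a possible follow-up (my own sketch, relying on by = .EACHI returning one row per row of the inner df, in the same order) is:
df[, visits_previous_20_weeks := df[df, .N,
     on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
     by = .EACHI]$N]
df[, visit_date20 := NULL]  # drop the helper column afterwards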
If I understood you correctly, here is a solution using the data.table package. I found two options (the first one has better performance).
Convert the original data frame into data.table object:
dt <- data.table(df) # Create a data table from the data frame
setorder(dt, patient_id, visit_date) # Sort by patient_id, then by visit_date
Define the week threshold parameter:
weekNum = 20L # Considering a threshold of: 20-weeks.
OPTION 1: Computing the number of weeks directly from the visit_date column
We define the following function that makes the calculation for each group:
visitFreq <- function(x) {
n <- length(x)
result <- numeric(n)
if (n > 1) {
for (i in 1:n) {
# For each row of the column by patient_id
ref <- x[i] # reference date
x.prev <- x[x < ref] # select previous dates
if (length(x.prev) > 0) {
x.prev <- sapply(x.prev, function(y) {
ifelse(difftime(ref, y, units = "weeks") <= weekNum, 1, 0)
})
result[i] <- sum(x.prev)
}
}
}
return(result)
}
For each x[i] it finds the previous visits and then computes whether each previous date is within the defined threshold or not. Then it only remains to count the number of previous visits that fall within the threshold.
Once we know how to make the calculation, we just need to apply this function to the visit_date column for each patient_id:
dt[, visits := visitFreq(visit_date), by = patient_id]
Note: the function visitFreq has to be written as a vectorised function: it receives a vector of visit_date values and returns a vector of the same length.
OPTION 2: Creating an artificial variable that collects all visit_date values for a given patient.
Now we need a function that computes, for a given reference date, the number of previous visits within the week threshold:
calc <- function(vec, x) {
vec.prev <- vec[vec < x] # Select all dates before x
n <- 0
if (length(vec.prev) > 0) {
vec.prev <- sapply(vec.prev, function(y) {
ifelse(difftime(x, y, units = "weeks") <= weekNum, 1, 0)
})
n <- sum(vec.prev)
}
return(n)
}
where:
vec: a vector of dates
x: the reference date
We keep only the dates previous to x. Then sapply is applied to each element of vec.prev, computing the time difference between y (each previous date) and the reference date x in weeks. The result is 1 for any difference of at most weekNum weeks and 0 otherwise. The number of previous visits within the threshold is then just the sum of those 1s.
Now we use this function in a data.table object like this:
dt[, visits := .(list(visit_date)), by = patient_id][
   , visits := mapply(calc, visits, visit_date)][order(patient_id)][]
Let's explain it a little bit:
We create a visits variable that is a list of all dates for a given patient_id (because of the by clause).
If we execute the first expression it will produce something like this:
> dt[, visits := .(list(visit_date)), by = patient_id][]
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 2016-12-02,2016-12-15,2016-12-30,2017-02-15
2: 4 1 2016-12-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
3: 3 1 2016-12-30 2016-12-02,2016-12-15,2016-12-30,2017-02-15
4: 7 1 2017-02-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
5: 2 2 2016-12-02 2016-12-02,2017-02-01
6: 6 2 2017-02-01 2016-12-02,2017-02-01
7: 5 3 2016-12-30 2016-12-30
8: 8 4 2017-02-10 2017-02-10
9: 9 5 2017-01-15 2017-01-15
10: 10 6 2017-03-01 2017-03-01
>
The second statement (the second []-block) just does the calculation, re-assigning the previously created visits variable, but now counting the number of previous visits with respect to the reference date. We need mapply to do the vectorised computation: on each invocation of the calc function the input arguments are dt[i]$visits (a list of dates) and the corresponding dt[i]$visit_date. mapply simply iterates over all i elements, invoking calc.
RESULT
Finally, the result will be:
> dt
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 0
2: 4 1 2016-12-15 1
3: 3 1 2016-12-30 2
4: 7 1 2017-02-15 3
5: 2 2 2016-12-02 0
6: 6 2 2017-02-01 1
7: 5 3 2016-12-30 0
8: 8 4 2017-02-10 0
9: 9 5 2017-01-15 0
10: 10 6 2017-03-01 0
>
and I guess this is what you wanted.
Note: there is probably a way to do the calculation on the fly (see the sketch below), but I was not able to see how. Perhaps other folks can suggest a more syntactically succinct way.
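For what it's worth, one possible "on the fly" formulation (my own sketch, not from the answer above; it keeps the same quadratic per-group logic but skips the intermediate list column):
dt[, visits := sapply(seq_len(.N), function(i)
       sum(visit_date < visit_date[i] &
           difftime(visit_date[i], visit_date, units = "weeks") <= weekNum)),
   by = patient_id]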
PERFORMANCE
I was wondering which option has the better performance (I expected OPTION 1), so let's check:
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
times=100L)
print(op)
It produce the following output:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 3.467451 3.552916 4.165517 3.642150 4.200413 7.96348 100 a
OP2 4.732729 4.832695 5.799648 5.063985 6.073467 13.17264 100 b
>
Therefore, the first option has the better performance.
EDIT (adding the solution proposed by @DavidArenburg)
Let's include the join solution as a third option, this time increasing the size of the input by repeating the input vectors, for example:
nSample <- 100
patient_id <- rep(c(1, 2, 1, 1, 3, 2, 1, 4, 5, 6), nSample)
visit_id <- 1:nSample
visit_date <- rep(as.Date(c('2016-12-02', '2016-12-02', '2016-12-30',
'2016-12-15', '2016-12-30', '2017-02-01',
'2017-02-15', '2017-02-10', '2017-01-15', '2017-03-01')), nSample)
df <- data.frame(visit_id, patient_id, visit_date)
opc3 <- function(df) {
df[, visit_date20 := visit_date - 20 * 7] # Create a 20-week boundary
## Count previous visits within the range
df[df, .(visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
}
dt <- data.table(df)
dt3 <- copy(dt)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
OP3 = opc3(copy(dt3)),
times = 10L)
print(op)
I get the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 6315.73724 6485.111937 10744.808669 11789.230998 15062.957734 15691.445961 10 b
OP2 6266.80130 6431.330087 11074.441187 11773.459887 13928.861934 15335.733525 10 b
OP3 2.38427 2.845334 5.157246 5.383949 6.711482 8.596792 10 a
>
The @DavidArenburg solution scales much better as the number of rows increases.
How about this solution, using dplyr and lubridate?
library(dplyr)
library(lubridate)
no_of_weeks <- 4 #You want 20 here, but the result will be NULL for the example dataset you've given
df %>%
mutate(week_filter=visit_date<Sys.Date()-weeks(no_of_weeks)) %>%
group_by(patient_id) %>%
mutate(visits_previous_n_weeks=cumsum(week_filter)) %>%
ungroup()
I am not very clear about the use of .SD and by.
For instance, does the snippet below mean 'change all the columns in DT to factor, except A and B'? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so are columns A and B excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that by works like 'group by' in SQL when you do aggregation. For instance, if I would like to take column sums (like colSums) over all the columns except A and B, do I still use something similar? Or, in this case, does the code below mean 'take the sum, grouping by the values in columns A and B' (as in SQL's SUM ... GROUP BY A, B)?
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except A and B?
Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
You said "does the below snippet... change all the columns in DT..."
The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].
This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
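To tie this back to your as.factor example, here is a minimal sketch of the by-reference version for "all columns except A and B" (assuming the remaining columns are the ones you want converted):
cols = setdiff(names(DT), c("A", "B"))
DT[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
No copy of DT is made; only the listed columns are replaced in place.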
You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can see this from the output when adding the verbose = TRUE argument within []); to see it in action, let's beef up your DT a bit and run a benchmark (the results are shown as comments below the call):
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
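If you want to see the internal optimisation mentioned above for yourself, a quick check is to add verbose = TRUE to a grouped call; the exact wording of the message varies across data.table versions, but it should report that j was optimised (GForce):
DT[ , lapply(.SD, sum), by = A, .SDcols = !c("A", "B"), verbose = TRUE]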
For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented
Is there a quick way to evaluate the i of a data.table[i,j] over multiple conditions? (My actual data.table has 2M rows. I want to do this with data.table operations, not loops or lapply.)
For example, let's say I have:
require(data.table)
data = data.table(seq(0.25,10, by = 0.25), rep(c("a","b","c","d"),10))
filter = seq(0,10,by=1)
I now want to filter, say:
data[V1 > filter[4], .N, by=V2]
How can I evaluate this expression for all elements of filter?
Note: I know I can loop through the data! This is not the answer I want!
Edit: I want to do this internally in the data.table and not via lapply, as kindly pointed out.
Edit 2: The reason for this is that in the real problem, I have a dataset of 2 million rows that must be evaluated using a filter based on the same dataset at about 200 intervals. This must be repeated multiple times. So speed is of the essence.
Try the following:
data = data.table(val = seq(0.25,10, by = 0.25), grp = rep(c("a","b","c","d"),10))
filter = seq(0,10,by=1)
fl = data.table(filter, key = 'filter')
# to get strict inequality I subtracted a "small" number
# adjust it appropriately for your data
data[, max.filter := fl[J(val - 1e-7), .I, roll = Inf]$.I][,
lapply(seq_along(filter), function(i) sum(max.filter >= i)), by = grp]
# grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
#1: a 10 9 8 7 6 5 4 3 2 1 0
#2: b 10 9 8 7 6 5 4 3 2 1 0
#3: c 10 9 8 7 6 5 4 3 2 1 0
#4: d 10 9 8 7 6 5 4 3 2 1 0
Testing on data with 2M rows and 200 filter values it takes a bit over 5 seconds (and, to compare, naive lapply takes over a minute):
N = 2e6
data = data.table(val = runif(N, 1, N), grp = sample(letters, N, T))
filter = seq(0, N, by = N/200)
fl = data.table(filter, key = 'filter')
system.time(data[, max.filter := fl[J(val - 1e-7), .I, roll = Inf]$.I][, lapply(seq_along(filter), function(i) sum(max.filter >= i)), by = grp])
# user system elapsed
# 5.24 0.00 5.41
system.time(lapply(filter, function(x) data[val > x, .N, by = grp]))
# user system elapsed
# 71.07 0.03 73.75
Not sure if this'll work as well on your real data as it does here, but this is more or less the idea:
1) First, set the key on V2 followed by V1:
setkey(data, V2, V1)
2) Next we add a couple of columns we'll use later:
data[, `:=`(id = 1:.N, N = .N), by=V2][, id := id[1L], by=list(V2,V1)]
The first is a running sequence from 1 to .N within each group, and the second is the total count per group.
3) Here's where all the magic happens:
data[, V1.b := V1] ### this is necessary in 1.8.10 as V1 is not available in `j`
### fixed in v1.8.11
ans <- data[CJ(x=unique(V2), y=filter),
list(start=id[1L], end=N[1L],
actual_num=V1.b[1L], close_match=y[1L]),
roll="nearest"]
So let's break this down and understand it. The first part, CJ, creates the combinations so that we can get all results at once with a join (hence the setkey first). For each value in i, we use roll="nearest" to make sure we definitely get a match (the closest of the values available), and we record the start, end, V1 and y values for that match. So why do we need all these values? In particular, why V1 and y?
4) Now, from this result, start gives the first position of the match and end always gives the total number of elements for that V2. However, there's one catch. If you're counting rows where V1 > 5 and the closest value in V1 is 5.5 (> 5), then the start position is correct. But if the closest value of V1 is 4.5, then we have to increment start by 1, because the match we have is the previous row.
A lot to take in... but doing it step by step should help. So, basically, we now do:
ans[actual_num <= close_match, start := start+1L]
Exactly what I explained before (here, V1 is actual_num).
5) Now, we can do end-start+1 to get the total count:
ans[, tot_cnt := end-start+1L]
6) Cleaning up:
ans[, `:=`(start=NULL, end=NULL, close_match=NULL, actual_num=NULL)]
setnames(ans, 'V1', 'filter')
setkey(ans, filter)
Putting it all together:
setkey(data, V2, V1)
data[, `:=`(id = 1:.N, N = .N), by=V2][, id := id[1L], by=list(V2,V1)]
data[, V1.b := V1] ### required for 1.8.10, V1 is not available in `j`
### fixed in 1.8.11
ans <- data[CJ(x=unique(V2), y=filter),
list(start=id[1L], end=N[1L],
actual_num=V1.b[1L], close_match=y[1L]),
roll="nearest"]
ans[actual_num <= close_match, start := start+1L]
ans[, tot_cnt := end-start+1L]
ans[, `:=`(start=NULL, end=NULL, close_match=NULL, actual_num=NULL)]
setnames(ans, 'V1', 'filter')
setkey(ans, filter)
Running this on @eddi's data takes about 2.4 seconds.
If I understand correctly, the OP wants to count the number of values which exceed a threshold for each grp and for each threshold given in filter.
This can be solved using non-equi joins which became available with data.table version v1.9.8 (on CRAN 25 Nov 2016).
Here are two variants. The first one simply counts all values above each threshold, which means that some values are compared and counted multiple times.
The second variant tries to avoid the repeated comparisons: it counts the values within each interval and then computes a cumulative sum in a second step.
As the benchmark below shows, the speedup is rather small.
Variant 1
aggregated <- data[CJ(grp = unique(grp), lb = filter),
on = .(grp, val > lb), .N, by = .EACHI]
dcast(aggregated, grp ~ val, value.var = "N")
grp 0 1 2 3 4 5 6 7 8 9 10
1: a 10 9 8 7 6 5 4 3 2 1 0
2: b 10 9 8 7 6 5 4 3 2 1 0
3: c 10 9 8 7 6 5 4 3 2 1 0
4: d 10 9 8 7 6 5 4 3 2 1 0
Note that the column headers denote threshold values.
Variant 2
fl <- CJ(grp = unique(data$grp), lower = filter)[
, upper := shift(lower, type = "lead", fill = Inf)][]
tmp <- data[fl, on = .(grp, val > lower, val <= upper), .N, by = .EACHI ][
order(-val), .(val, N = cumsum(N)), by = grp][]
dcast(tmp, grp ~ val, value.var = "N")
grp 0 1 2 3 4 5 6 7 8 9 10
1: a 10 9 8 7 6 5 4 3 2 1 0
2: b 10 9 8 7 6 5 4 3 2 1 0
3: c 10 9 8 7 6 5 4 3 2 1 0
4: d 10 9 8 7 6 5 4 3 2 1 0
Benchmark
Unfortunately, I could not get the solutions provided by Arun and Eddi to work with the current data.table version 1.10.5. So this benchmark just compares the two non-equi join variants using Eddi's benchmark data.
# create benchmark data
N = 2e6
set.seed(123L)
data = data.table(val = runif(N, 1, N), grp = sample(letters, N, T))
filter = seq(0, N, by = N/200)
# define check function
my_check <- function(values) {
all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
#run benchmark
microbenchmark::microbenchmark(
nej1 = {
aggregated <- data[CJ(grp = unique(grp), lb = filter), on = .(grp, val > lb), .N, by = .EACHI]
dcast(aggregated, grp ~ val, value.var = "N")
},
nej2 =
{
fl <- CJ(grp = unique(data$grp), lower = filter)[, upper := shift(lower, type = "lead", fill = Inf)][]
tmp <- data[fl, on = .(grp, val > lower, val <= upper), .N, by = .EACHI ][
order(-val), .(val, N = cumsum(N)), by = grp][]
dcast(tmp, grp ~ val, value.var = "N")
},
check = my_check,
times = 20L
)
Unit: milliseconds
expr min lq mean median uq max neval cld
nej1 358.6231 368.8033 389.6501 378.1158 391.8690 556.3323 20 b
nej2 347.6321 360.4038 365.2218 366.9392 370.1697 382.8185 20 a
The speed advantage of variant 2 is less than 5 per cent. This may vary if a computationally more intensive aggregation function is used.
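For illustration only (my own sketch, not benchmarked here): a computationally heavier aggregate would replace the cheap .N, e.g. averaging the matched values in variant 1, where the x. prefix is needed to reach the left table's val inside the join:
aggregated <- data[CJ(grp = unique(grp), lb = filter),
                   on = .(grp, val > lb), .(mean_val = mean(x.val)), by = .EACHI]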