How do you change the data.table rolling join condition from weak inequality to strict inequality?

Consider the following two datasets, each with a 'time' column representing a general timestamp; integers are used to keep the example simple:
library(data.table)
df_test_1 <- data.table(time = c(1:10, seq(20, 30, by = 5)))
df_test_1$values <- -df_test_1$time
setkey(df_test_1, time)
df_test_2 <- data.table(time = c(15, 20, 26, 28, 31))
setkey(df_test_2, time)
so that:
> df_test_1
time values
...
5: 5 -5
6: 6 -6
7: 7 -7
8: 8 -8
9: 9 -9
10: 10 -10
11: 20 -20
12: 25 -25
13: 30 -30
and:
> df_test_2
time
1: 15
2: 20
3: 26
4: 28
5: 31
The rolling join df_test_1[df_test_2, roll = -Inf] produces:
> df_test_1[df_test_2, roll = -Inf]
time values
1: 15 -20
2: 20 -20
3: 26 -30
4: 28 -30
5: 31 NA
That is, for each time value in df_test_1, find all time values in df_test_2 that are smaller than or equal to it, and associate the corresponding value with those rows of df_test_2. For example, df_test_1$time == 20 matches the time values 15 and 20 in df_test_2$time, so the corresponding value of -20 is associated with those rows of df_test_2.
I would like to change the join condition from "smaller than or equal to" to strictly "smaller than", that is, the answer produced should be:
time values
1: 15 -20
2: 20 -25
3: 26 -30
4: 28 -30
5: 31 NA
The difference here is that the value at df_test_1$time == 25 should be matched to the row of df_test_2 where df_test_2$time == 20.
An alternative way of producing the desired result would be to subtract a tiny amount from time:
library(dplyr)  # for %>% and mutate()
df_test_3 <-
  df_test_1 %>%
  mutate(time = time - 0.1) %>%
  setkey(time)
so that:
> df_test_3[df_test_2, roll = -Inf]
time values
1: 15 -20
2: 20 -25
3: 26 -30
4: 28 -30
5: 31 NA

Using the new non-equi joins feature in the current development version of data.table, this is straightforward:
# v1.9.7+
df_test_1[df_test_2, on=.(time > time), mult="first"]
Keyed joins can only perform equi joins; the on argument is essential for conditional (non-equi) joins.
Note that there's no need for the data.tables to be keyed if the on argument is used. Even if you wish to key the data.tables, specifying on is better because it makes the code immediately understandable when you return to it later.
See the installation instructions for the development version here.
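For reference, running that call on the example data should reproduce the desired output from the question; note that in a non-equi join the time column of the result displays the values coming from df_test_2 (the i table):
df_test_1[df_test_2, on = .(time > time), mult = "first"]
#    time values
# 1:   15    -20
# 2:   20    -25
# 3:   26    -30
# 4:   28    -30
# 5:   31     NA
Non-equi joins have also been available in the CRAN releases since data.table 1.9.8.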

Related

Perform row-wise operation in datatable with multiple elements

I have the following data table:
library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))
Now the problem I have is that I need to take every value in column "x" (that is, for each given ID) and add all the existing values in column "y" to it. I also need to do this separately per column "group". Let's assume we start with ID = 1. This element has the values x_1 = 68 and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y_2, x_1 + y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simply be split by group).
Edit: Example code to do this only for x_1 and group A:
current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT
One apparent issue with this approach is that, if it were applied to all x, a lot of columns would be added to the final DT, so I am not sure it is the best approach. In the end, I am just looking for the lowest sum (per element of x, with each element of y) within each group.
I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.
Thanks for any hints how to approach this.
You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special symbol .BY, which is available in j when using by. Basically, .BY is a named list containing the values of the grouping variables. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group).
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
You can also accomplish this via a join:
DT[, !c("y")][DT[, .(y, group)], on = .(group), allow.cartesian = TRUE][, total := x + y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68
If I understand correctly, the requested result requires a cross join where each element of x is combined with each element of y (within each group).
This can be accomplished easily using the CJ() function:
DT[, CJ(x, y, sorted = FALSE), by = group][, sum_x_y := x + y][]
group x y sum_x_y
1: A 68 76 144
2: A 68 39 107
3: A 68 24 92
4: A 68 53 121
5: A 68 92 160
---
4996: B 4 21 25
4997: B 4 62 66
4998: B 4 79 83
4999: B 4 23 27
5000: B 4 64 68
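If, as the question's last remark suggests, only the lowest sum per element of x (within its group) is actually needed, the full cross join can be skipped entirely: the minimum of x + y over a group's y values is just x plus the group's smallest y. A minimal sketch (the column name min_sum_x_y is illustrative):
# lowest possible x + y for each row, computed per group
DT[, min_sum_x_y := x + min(y), by = group]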

reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the enclosing environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar with it, fcoalesce operates like SQL's COALESCE, a vector-friendly way of finding the first non-NA value across its arguments.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
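As a tiny, made-up illustration of that pattern (do the arithmetic, then replace any resulting NA with 0):
library(data.table)
x <- c(10, NA, 30)
fcoalesce(x - 5, 0)
# [1]  5  0 25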
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from then create the same vectors for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong. For instance, stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(clm - hcc * hccthreshold, 0),
mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 46 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 14
# 3: 28 91 96 FALSE TRUE TRUE 28 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 25 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 8
# 6: 49 89 49 FALSE TRUE FALSE 49 39 49
# 7: 69 97 37 TRUE TRUE FALSE 19 47 37
# 8: 13 11 84 FALSE FALSE TRUE 13 11 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 41
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
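One caveat, in case it matters: fcoalesce(clm - hcc * hccthreshold, 0) leaves rows at or below the threshold at their original claim value (as the output above shows), whereas the question's original code set those rows to 0. If that original behaviour is wanted, a variant along these lines (a sketch, not tested on your real data) should do it without needing the HCC* columns at all:
# excess over the threshold, floored at 0; NA claims become 0 via fcoalesce
dat[, c(abovenames) := lapply(mget(claimnames),
                              function(clm) fcoalesce(pmax(clm - hccthreshold, 0), 0)) ]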

Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely Zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to calculate the index in MS of the point 15 seconds ago, so if we subtract that from 1:n we get w, the number of positions to average. Exactly which intervals to produce is not discussed in the question, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
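If findInterval is new to you: for each element of its first argument it returns how many elements of the (sorted) second argument are less than or equal to it, which is why subtracting it from the row index gives the number of observations inside each trailing 15-second window. A tiny illustration with made-up numbers:
findInterval(c(2, 5, 7), c(1, 3, 4, 6))
# [1] 1 3 4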
An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
by=.EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
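A note on the mechanics of the join above, since by = .EACHI can be cryptic: the i argument .(ID = ID, start = MS - 15000, end = MS) builds one [start, end] window per row, the on condition selects the rows of df that fall inside that window, and by = .EACHI evaluates mean(HR) once per window, returned as V1. The same pattern on a tiny made-up table (names and numbers are purely illustrative):
library(data.table)
toy <- data.table(ID = 1, MS = c(0, 5, 12, 20), HR = c(80, 90, 100, 110))
# mean HR over the trailing 10-unit window ending at each MS value
toy[, avg := toy[.(ID = ID, lo = MS - 10, hi = MS),
                 on = .(ID, MS >= lo, MS <= hi),
                 by = .EACHI, mean(HR)]$V1]
toy
#    ID MS  HR avg
# 1:  1  0  80  80
# 2:  1  5  90  85
# 3:  1 12 100  95
# 4:  1 20 110 105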
I'm not totally sure how you want to apply the 15s rolling average, but here is one way to go about what I think you're looking for. First we subset the data that falls between 7.5s before and 7.5s after each point, then we take the average. This, however, will have an edge effect since there is no data 7.5s before the first value.
library(tidyverse)
roll_vec <- c()
for(i in 1:nrow(df)){
ref <- df$MS[[i]]
val <- df %>%
filter(MS <= ref + 7500 & MS >= ref- 7500) %>%
pull(HR) %>%
mean
roll_vec[[i]] <- val
}
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
mutate(roll_15s = roll_vec) %>%
ggplot(aes(MS, HR))+
geom_line()+
geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
One possible solution:
library(data.table)  # for %between%
library(magrittr)    # for %>%
start_range <- df$MS[df$MS < max(df$MS)-15000]
lapply(start_range,function(t){
data.frame(MS = mean(df$MS[df$MS %between% c(t,t+15000)]),
HR = mean(df$HR[df$MS %between% c(t,t+15000)]))
}) %>% Reduce(rbind,.)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (I had only two points with the data you give):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points having a time between this value (t in lapply) and 15 s after it.
I restrict this to the range where there are values spanning the full 15 s: the start_range vector.

How to effectively determine the maximum difference between the variable value in each row and same variable subsequent row values in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) in each row and the subsequent values of the same variable (X) within group (Y) in data.table in R?
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
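If reordering rows via .N:1 in i feels opaque, an equivalent formulation (my own rephrasing, not from the original answer) reverses within each group instead and should give the same result:
# running maximum of X taken from the end of each group backwards, minus X itself
dt[, max_diff3 := rev(cummax(rev(X))) - X, by = Y]
all.equal(dt$max_diff, dt$max_diff3)
# should be TRUE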

Subset by multiple ranges [duplicate]

This question already has answers here:
Efficient way to filter one data frame by ranges in another
(3 answers)
Closed 5 years ago.
I want to get a list of values that fall in between multiple ranges.
library(data.table)
values <- data.table(value = c(1:100))
range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))
I need the results to include only the values that fall in between those ranges:
results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92)
I am currently doing this with a for loop,
results <- data.table(NULL)
for (i in 1:NROW(range)) {
results <- rbind(results,
data.table(result = values[value >= range[i, start] &
value <= range[i, end], value]))}
however the actual dataset is quite large and I am looking for a more efficient way.
Any suggestions are appreciated! Thank you!
Using the non-equi join possibility of data.table:
values[range, on = .(value >= start, value <= end), .(results = x.value)]
which gives:
results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
Or, as per the suggestion of @Henrik: values[value %inrange% range]. This also works very well on data.tables with multiple columns:
# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))
> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46
If you have the latest CRAN version of data.table you can use non-equi joins. For example, you can create an index which you can then use to subset your original data:
idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
# [1] 6 7 8 9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
values[idx]
Here is one method using lapply and %between%
rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]]))
This method loops over the rows of the range data.table and subsets values in each iteration according to the start and end values in that row. lapply returns a list, which rbindlist combines into a data.table. If you want a vector instead, replace rbindlist with unlist, as sketched below.
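As a concrete sketch of that last point, extracting the value column inside the loop and flattening with unlist yields a plain vector rather than a one-column data.table:
unlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i], value]))
# [1]  6  7  8  9 10 29 30 31 32 33 34 35 87 88 89 90 91 92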
benchmarks
Just to check the speeds of each suggestion on the given data, I ran a quick comparison
microbenchmark(
lmo=rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
dd={idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
jaap=values[range, on = .(value >= start, value <= end), .(results = x.value)],
inrange=values[value %inrange% range])
This returned
Unit: microseconds
expr min lq mean median uq max neval cld
lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311 100 c
dd 688.230 766.7750 885.1826 792.8615 825.220 3609.644 100 b
jaap 798.279 897.6355 935.9474 921.7265 970.906 1347.380 100 b
inrange 463.002 518.3110 563.9724 545.5375 575.758 1944.948 100 a
As might be expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange%, which is essentially a vectorized extension of %between%.
