How to efficiently determine the maximum difference between a variable's value in each row and its values in subsequent rows in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) in each row and the subsequent values of the same variable (X) within each group (Y) in data.table in R?
Example:
library(data.table)
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
                 Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?

Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
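The idea can be checked on a single group in base R. The following is a minimal sketch using the X values of the first group (Y = 10) from the example output above; the helper name max_future_diff is made up for illustration and is not part of any package.

```r
# Hypothetical helper: for each element, the largest value at or after it,
# minus the element itself.
max_future_diff <- function(x) {
  # cummax() on the reversed vector is the running maximum counted from the end;
  # reversing again aligns it with the original order.
  rev(cummax(rev(x))) - x
}

x <- c(126, 137, 157, 191, 120, 190, 195, 166, 163, 106)  # X values for Y == 10
max_future_diff(x)
# [1] 69 58 38  4 75  5  0  0  0  0
```

This does one pass and one subtraction per element, instead of one max() over the remaining elements per row.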
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
  as_tibble() %>%                       # as_data_frame() is deprecated
  mutate(orig_row = row_number()) %>%
  group_by(Y) %>%
  arrange(-orig_row) %>%
  mutate(max_diff2 = cummax(X) - X) %>%
  ungroup() %>%
  arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.

Related

Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to find the index in MS of the point 15 seconds earlier; subtracting that from 1:n gives w, the number of positions to average. The question does not say exactly which intervals to produce, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
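To see which rows each window covers, the same trailing 15-second mean can be written directly in base R. This is only a sketch under the assumption stated above (each window ends at an input point and reaches back 15000 ms, excluding a point exactly 15000 ms earlier); roll_base is an illustrative name.

```r
MS <- c(36148, 36753, 37364, 38062, 38737, 39580, 40029, 40387, 41208, 42006,
        42796, 43533, 44274, 44988, 45696, 46398, 47079, 47742, 48429, 49135,
        49861, 50591, 51324, 52059)
HR <- c(84, 84, 84, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 84,
        84, 84, 84, 84, 84, 84)
df <- data.frame(MS, HR)

# For each row i, average HR over the rows with MS in (MS[i] - 15000, MS[i]]
roll_base <- sapply(seq_along(df$MS), function(i) {
  in_window <- df$MS > df$MS[i] - 15000 & df$MS <= df$MS[i]
  mean(df$HR[in_window])
})

# Window widths as computed with findInterval in the zoo answer
w <- seq_along(df$MS) - findInterval(df$MS - 15000, df$MS)
```

Here w[i] is the number of rows in the i-th window, so roll_base should agree row for row with the rollapplyr result, at the cost of rescanning MS for every row.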
An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
by=.EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
I'm not totally sure how you want to apply the 15s rolling average, but here is one way to go about what I think you're looking for. For each row, we first subset the data between 7.5s before and 7.5s after that row's time, then take the average. This, however, will have an edge effect, since there is no data 7.5s before the first value.
library(tidyverse)
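roll_vec <- numeric(nrow(df))  # preallocate the result vector
for (i in seq_len(nrow(df))) {
  ref <- df$MS[[i]]
  # mean HR over the window from 7.5 s before to 7.5 s after the reference time
  val <- df %>%
    filter(MS <= ref + 7500 & MS >= ref - 7500) %>%
    pull(HR) %>%
    mean()
  roll_vec[[i]] <- val
}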
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
mutate(roll_15s = roll_vec) %>%
ggplot(aes(MS, HR))+
geom_line()+
geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
One possible solution:
library(magrittr)    # for %>%
library(data.table)  # for %between%
start_range <- df$MS[df$MS < max(df$MS) - 15000]
lapply(start_range, function(t) {
  data.frame(MS = mean(df$MS[df$MS %between% c(t, t + 15000)]),
             HR = mean(df$HR[df$MS %between% c(t, t + 15000)]))
}) %>% Reduce(rbind, .)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (I got only two points with the data you gave):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points with a time between this value (t in lapply) and 15 s after it.
I restrict that on the range where I have values encompassing the 15s : the start_range vector.

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values for a minimum (dds_min) to maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did....
dss_levels <-dds_min.max %>%
group_by(fs) %>%
mutate(dds=seq(dds_min,dds_max,length.out=100))
I intended to create a new variable (dds), that has to be 100 length and start and end at different values depending on "fs". My expectation was to end with another dataframe (dss_levels) with two columns (fs and dds), 200 values on it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes, you can change it to 100.
library(dplyr)  # for mutate() and %>%
library(purrr)
library(tidyr)
dds_min.max %>%
  mutate(dds = map2(dds_min, dds_max, seq, length.out = 5)) %>%
  unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
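For comparison, the same expansion can be sketched in base R with Map, which pairs each group's minimum and maximum and builds one small data frame per group. The names n_points and dds_levels are illustrative, and stringsAsFactors = FALSE is set explicitly so the sketch does not depend on the R version's default.

```r
fs <- c("early", "late")
dds_min <- c(47.2, 40)
dds_max <- c(122, 105)
dds_min.max <- data.frame(fs, dds_min, dds_max, stringsAsFactors = FALSE)

n_points <- 100  # length of the sequence per group

# One data frame per group, each holding n_points values from min to max
pieces <- Map(function(f, lo, hi) {
  data.frame(fs = f, dds = seq(lo, hi, length.out = n_points),
             stringsAsFactors = FALSE)
}, dds_min.max$fs, dds_min.max$dds_min, dds_min.max$dds_max)

dds_levels <- do.call(rbind, pieces)
rownames(dds_levels) <- NULL
```

The result has two columns (fs and dds) and n_points rows per group, which is the shape the question asked for.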

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from the current row looking back 3 rows. It then divides this sum by the total sum of columns B and C, again looking back 3 rows.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) } )
If I could get [1] to sum the previous 3 line range and
If I could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
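To make the embed step less opaque, here is a small sketch on the first five adv values only. It assumes nothing beyond base R; the variable names are illustrative.

```r
adv <- c(69, 113, 270, 74, 308)

# Pad with two NAs so that the first two windows are incomplete.
# Each row of the result holds the current value and the two before it.
e <- embed(c(NA, NA, adv), 3)
e
#      [,1] [,2] [,3]
# [1,]   69   NA   NA
# [2,]  113   69   NA
# [3,]  270  113   69
# [4,]   74  270  113
# [5,]  308   74  270

# Row sums are the 3-row rolling sums; incomplete windows give NA
roll_sum <- rowSums(e)
roll_sum
# [1]  NA  NA 452 457 652
```

In approach (4), lapply applies the same embed call to each column of the padded data frame, so prop.table receives the rolling sums of adv and dec side by side.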
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Subset by multiple ranges [duplicate]

I want to get a list of values that fall in between multiple ranges.
library(data.table)
values <- data.table(value = c(1:100))
range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))
I need the results to include only the values that fall in between those ranges:
results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92)
I am currently doing this with a for loop,
results <- data.table(NULL)
for (i in 1:NROW(range)) {
  results <- rbind(results,
                   data.table(result = values[value >= range[i, start] &
                                              value <= range[i, end], value]))
}
however the actual dataset is quite large and I am looking for a more efficient way.
Any suggestions are appreciated! Thank you!
Using the non-equi join possibility of data.table:
values[range, on = .(value >= start, value <= end), .(results = x.value)]
which gives:
results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
Or, as per the suggestion of @Henrik: values[value %inrange% range]. This also works very well on data.tables with multiple columns:
# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))
> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46
If you have the latest CRAN version of data.table you can use non-equi joins. For example, you can create an index which you can then use to subset your original data:
idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
# [1] 6 7 8 9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
values[idx]
Here is one method using lapply and %between%
rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]]))
This method loops through the rows of the range data.table and subsets values in each iteration according to that row's start and end. lapply returns a list, which rbindlist combines into a data.table. If you want a vector, replace rbindlist with unlist.
benchmarks
Just to check the speeds of each suggestion on the given data, I ran a quick comparison
library(microbenchmark)
microbenchmark(
  lmo = rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
  dd = {idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
  jaap = values[range, on = .(value >= start, value <= end), .(results = x.value)],
  inrange = values[value %inrange% range])
This returned
Unit: microseconds
expr min lq mean median uq max neval cld
lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311 100 c
dd 688.230 766.7750 885.1826 792.8615 825.220 3609.644 100 b
jaap 798.279 897.6355 935.9474 921.7265 970.906 1347.380 100 b
inrange 463.002 518.3110 563.9724 545.5375 575.758 1944.948 100 a
As might be expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange%, which is essentially a vectorized extension of %between%.
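To make explicit what "falls in any of the ranges" means, the check can be sketched in base R with outer. This is for illustration only and is not how data.table implements %inrange%; it builds a length(value) by number-of-ranges logical matrix, so it will not scale the way the data.table solutions do.

```r
value <- 1:100
start <- c(6, 29, 87)
end   <- c(10, 35, 92)

# One column per range: TRUE where the value lies within [start, end]
in_range <- outer(value, start, ">=") & outer(value, end, "<=")

# Keep the values that fall in at least one range
results <- value[rowSums(in_range) > 0]
results
# [1]  6  7  8  9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
```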

How do you change the data table rolling join condition from weak inequality to strict inequality?

Consider the following two datasets, each with a 'time' column that represents a general timestamp (integers are used to keep the example simple):
library(data.table)
df_test_1 <-
  data.table(time = c(1:10, seq(20, 30, by = 5)))
df_test_1$values <- -df_test_1$time
df_test_1 <- setkey(df_test_1, time)
df_test_2 <-
  data.table(time = c(15, 20, 26, 28, 31))
df_test_2 <- setkey(df_test_2, time)
so that:
> df_test_1
time values
...
5: 5 -5
6: 6 -6
7: 7 -7
8: 8 -8
9: 9 -9
10: 10 -10
11: 20 -20
12: 25 -25
13: 30 -30
and:
> df_test_2
time
1: 15
2: 20
3: 26
4: 28
5: 31
The rolling join df_test_1[df_test_2, roll = -Inf] produces:
> df_test_1[df_test_2, roll = -Inf]
time values
1: 15 -20
2: 20 -20
3: 26 -30
4: 28 -30
5: 31 NA
That is, for each time value in df_test_1, find all time values in df_test_2 smaller than or equal to it, and associate the corresponding value to this row of df_test_2. For example, df_test_1$time == 20 matches the time values 15 and 20 in df_test_2$time, thus the corresponding value of -20 is associated to these rows of df_test_2.
I would like to change the join condition from "smaller than or equal to" to strictly "smaller than", that is, the answer produced should be:
time values
1: 15 -20
2: 20 -25
3: 26 -30
4: 28 -30
5: 31 NA
The difference here is that the value at df_test_1$time == 25 should be matched to the row of df_test_2 where df_test_2$time == 20.
An alternate way of producing the desired result would be to take away a minuscule portion from time:
library(dplyr)  # for mutate() and %>%
df_test_3 <-
  df_test_1 %>%
  mutate(time = time - 0.1) %>%
  setkey(time)
so that:
> df_test_3[df_test_2, roll = -Inf]
time values
1: 15 -20
2: 20 -25
3: 26 -30
4: 28 -30
5: 31 NA
Using the non-equi joins feature of data.table (new in the development version at the time of writing), this is straightforward:
# v1.9.7+
df_test_1[df_test_2, on=.(time > time), mult="first"]
Keyed joins are capable of equi joins only. The on argument is essential for conditional (non-equi) joins.
Note that there's no need for data.tables to be keyed if the on argument is used. Even if you wish to key the data.tables, specifying on is better as it helps understand the code immediately at a later point.
See the data.table installation instructions for the devel version.
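The difference between the weak and the strict condition can also be checked without a join. The sketch below uses base R's findInterval on the example data; it is not how data.table implements rolling or non-equi joins, only a way to verify the expected results. The vector names are illustrative.

```r
times  <- c(1:10, seq(20, 30, by = 5))  # df_test_1$time (sorted)
values <- -times                        # df_test_1$values
query  <- c(15, 20, 26, 28, 31)         # df_test_2$time

# Weak inequality: first time >= query (the roll = -Inf result).
# With left.open = TRUE, findInterval() counts the times strictly below each query.
idx_weak <- findInterval(query, times, left.open = TRUE) + 1
values[idx_weak]
# [1] -20 -20 -30 -30  NA

# Strict inequality: first time > query (the desired result).
# By default, findInterval() counts the times at or below each query.
idx_strict <- findInterval(query, times) + 1
values[idx_strict]
# [1] -20 -25 -30 -30  NA
```

An index past the end of values yields NA, which corresponds to the unmatched row for time 31.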

Resources