Calculate second highest cumulative value by group - r

I have data with a grouping variable 'grps' and a value 'x'. I have calculated the cumulative max of 'x' within each group, 'cmx'. Now I need to find the second highest cumulative value of 'x' within each group, 'scmx'.
Some data, including the desired column scmx:
library(data.table)
d = structure(list(date = structure(rep(c(18690, 18691, 18692, 18693, 18694, 18695, 18696, 18697), 2), class = "Date"),
x = c(18, 70, 57, 94, 94, 13, 98, 23, 20, 72, 59, 96, 96, 15, 100, 25),
grps = c(rep("g1", 8), rep("g2", 8))),
row.names = c(NA, -16L), class = c("data.table", "data.frame"))
d[, cmx := cummax(x), by = .(grps)]
d[, scmx := c(18, 18, 57, 70, 70, 70, 94, 94, 20, 20, 59, 72, 72, 72, 96, 96)]
Context
If x corresponds to a performance rating, what I am trying to do is locate the date when they achieved their best performance and their second best. A similar question of mine where I needed to locate the row which corresponded to the highest cumulative value in a column:
Fill down first row within each cumulative max, with a twist

A data.table alternative:
d[ , scmx2 := {
c(x[1], sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# date x grps cmx scmx scmx2
# 1: 2021-03-04 18 g1 18 18 18
# 2: 2021-03-05 70 g1 70 18 18
# 3: 2021-03-06 57 g1 70 57 57
# 4: 2021-03-07 94 g1 94 70 70
# 5: 2021-03-08 94 g1 94 70 70
# 6: 2021-03-09 13 g1 94 70 70
# 7: 2021-03-10 98 g1 98 94 94
# 8: 2021-03-11 23 g1 98 94 94
# 9: 2021-03-04 20 g2 20 20 20
# 10: 2021-03-05 72 g2 72 20 20
# 11: 2021-03-06 59 g2 72 59 59
# 12: 2021-03-07 96 g2 96 72 72
# 13: 2021-03-08 96 g2 96 72 72
# 14: 2021-03-09 15 g2 96 72 72
# 15: 2021-03-10 100 g2 100 96 96
# 16: 2021-03-11 25 g2 100 96 96
Within each group (by = grps), loop (sapply) over a sequence from 2 to the number of rows in the current group (seq(.N)[-1]). In each step, subset 'x' from the start of the vector up to index 'i' (v = x[1:i]).
Calculate the dense rank and check if the rank is 2 (frank(-v, ties.method = "dense") == 2), i.e. the rank of the second largest number. Use the logical indices to subset 'v' (v[...]). Select the first match ([1], in case several values have rank 2). Concatenate the result from this 'expanding window' with the first element of 'x' (c(x[1], ...)).
In the first window, with only one value, there is clearly no second highest value. Here the OP has chosen to return the first value. The same choice must also be made for longer windows in which all values are equal, which happens when there are leading runs of equal values. If we would rather return NA than the first value, replace the x[1] in the line
c(x[1], sapply(seq(.N)[-1], function(i){
...with NA_real_.
Small demo:
d = data.table(grps = c(1, 1, 2, 2, 2), x = c(3, 3, 4, 4, 5))
d[ , scmx2 := {
c(NA_real_, sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# grps x scmx2
# 1: 1 3 NA # grp 1: all values equal in all windows -> all NA
# 2: 1 3 NA
# 3: 2 4 NA
# 4: 2 4 NA
# 5: 2 5 4 # grp 2: only the last window has a second highest value
This question is indeed similar to the post I linked to above (Finding cumulative second max per group in R). However, here the OP asked for a data.table solution.
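For long groups, the quadratic re-ranking of every prefix can be avoided by carrying the running maximum and the running second-distinct maximum through the group in a single pass. A hedged sketch (scmx3 is a made-up column name; ties with the current maximum are skipped so the result matches the dense-rank logic above, and rows with no second distinct value yet fall back to the first value, as the OP chose):
library(data.table)
d[, scmx3 := {
  # state = c(running max, running 2nd distinct max); -Inf means "not seen yet"
  states <- Reduce(function(s, xi) {
    if (xi > s[1]) c(xi, s[1])                   # new maximum: old max becomes 2nd
    else if (xi < s[1]) c(s[1], max(s[2], xi))   # below the max: candidate 2nd
    else s                                       # tie with the max: ignore (dense rank)
  }, x, init = c(-Inf, -Inf), accumulate = TRUE)[-1]
  second <- vapply(states, `[`, numeric(1), 2)
  # no 2nd distinct value yet -> fall back to the first value, as in the answer above
  fifelse(is.finite(second), second, x[1])
}, by = grps]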

Here is another option using non-equi join:
d[, s2 := .SD[.SD, on=.(grps, date<=date, x<cmx), by=.EACHI, max(x.x)]$V1]
d[is.na(s2), s2 := x][]
output:
date x grps cmx scmx s2
1: 2021-03-04 18 g1 18 18 18
2: 2021-03-05 70 g1 70 18 18
3: 2021-03-06 57 g1 70 57 57
4: 2021-03-07 94 g1 94 70 70
5: 2021-03-08 94 g1 94 70 70
6: 2021-03-09 13 g1 94 70 70
7: 2021-03-10 98 g1 98 94 94
8: 2021-03-11 23 g1 98 94 94
9: 2021-03-04 20 g2 20 20 20
10: 2021-03-05 72 g2 72 20 20
11: 2021-03-06 59 g2 72 59 59
12: 2021-03-07 96 g2 96 72 72
13: 2021-03-08 96 g2 96 72 72
14: 2021-03-09 15 g2 96 72 72
15: 2021-03-10 100 g2 100 96 96
16: 2021-03-11 25 g2 100 96 96
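The second step (filling the unmatched rows with x) can also be folded into the join using fcoalesce. A hedged sketch, assuming data.table >= 1.12.4; s2b is a made-up column name:
d[, s2b := fcoalesce(
  .SD[.SD, on = .(grps, date <= date, x < cmx), by = .EACHI, max(x.x)]$V1,
  x)]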

Create a sequence the length of the column x. For each position in that sequence, apply a function to the values of x from index 1 up to the current position, keeping only the unique values. Rfast::nth can then be used to take the 2nd highest number in each window.
library(Rfast)
sapply(seq(length(d$x)), function(x) {
return(nth(unique(d$x[1:x]), 2, descending=TRUE))
})
[1] 2.652495e-315 1.800000e+01 5.700000e+01 7.000000e+01
[5] 7.000000e+01 7.000000e+01 9.400000e+01 9.400000e+01
To do it for the full data frame with both groups, we can still use the function created above. Reshape the data frame so that each group's values are in their own column, then use lapply with rollapplyr to capture the 2nd largest unique value in each expanding window.
library(dplyr)
library(tidyr)
library(zoo)
d1 = d %>% select(-cmx) %>%
pivot_wider(names_from=grps, values_from=x)
lapply(d1[-1], function(x) {
my_list=rollapplyr(x, seq(length(x)), function(x) {return(nth(unique(x), 2, descending=TRUE))})
return(my_list)
})
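An alternative that keeps the long format and the grouping is to pass a vector of expanding widths to rollapplyr directly. A hedged sketch (scmx_rfast is a made-up column name; the single-value window is guarded explicitly, because Rfast::nth is undefined there, as the stray 2.65e-315 in the output above shows):
library(data.table)
library(zoo)
library(Rfast)

d[, scmx_rfast := rollapplyr(x, seq_len(.N), function(v) {
  u <- unique(v)
  # fewer than two distinct values: no second highest yet
  if (length(u) < 2) NA_real_ else nth(u, 2, descending = TRUE)
}), by = grps]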

Related

R data.table divide set of columns and flag using any

I am working on a data set which is large and has many columns. I am using data.table to speed up the calculations. However, at certain points I am not sure how to proceed with data.table, so I convert it back to a data.frame and do the calculation there. This slows down the process. It would help a lot to have suggestions on how I can write the steps below in data.table. Below is a snapshot of my code on dummy data -
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
p01 <- sample(1:100,1000,replace = T)
p02 <- sample(1:100,1000,replace = T)
p03 <- sample(1:100,1000,replace = T)
p04 <- sample(1:100,1000,replace = T)
p05 <- sample(1:100,1000,replace = T)
p06 <- sample(1:100,1000,replace = T)
p07 <- sample(1:100,1000,replace = T)
#### create the data.table
data <- data.table(cbind(p01,p02,p03,p04,p05,p06,p07))
###user input for last column
lcol <- 6
###calculate start column as last - 3
scol <- lcol-3
###calculate average for scol:lcol
data <- data[,avg:= apply(.SD,1,mean,na.rm=T),.SDcols=scol:lcol]
###converting to data.frame since do not know the solution in data.table
data <- as.data.frame(data)
###calculate the trend in percentage
data$t01 <- data[,lcol-00]/data[,"avg"]-1
data$t02 <- data[,lcol-01]/data[,"avg"]-1
data$t03 <- data[,lcol-02]/data[,"avg"]-1
data$t04 <- data[,lcol-03]/data[,"avg"]-1
data$t05 <- data[,lcol-04]/data[,"avg"]-1
###converting back to data.table
data <- as.data.table(data)
###calculate the min and max for the trend
data1 <- data[,`:=` (trend_min = apply(.SD,1,min,na.rm=T),
trend_max = apply(.SD,1,max,na.rm=T)),.SDcols=c(scol:lcol)]
###calculate flag if any of t04 OR t05 is an outlier for min and max values. This would be many columns in actual data
data1$flag1 <- ifelse(data1$t04 < data1$trend_min | data1$t04 > data1$trend_max,1,0)
data1$flag2 <- ifelse(data1$t05 < data1$trend_min | data1$t05 > data1$trend_max,1,0)
data1$flag <- ifelse(data1$flag1 == 1 | data1$flag2 == 1,1,0)
So basically, how can I -
calculate the percentages based on the user-supplied column index (note it is not a simple division but a percentage change), and
create the flag variable? I think I need to use the any function but am not sure how.
Some steps can be made more efficient, e.g. instead of using apply with MARGIN = 1, the mean, min, and max can be replaced with rowMeans, pmin, and pmax.
library(data.table)
data[ , avg:= rowMeans(.SD, na.rm = TRUE) ,.SDcols=scol:lcol]
data[, sprintf('t%02d', 1:5) := lapply(.SD, function(x) x/avg - 1),
.SDcols = patterns("^p0[1-5]")]
data[,`:=` (trend_min = do.call(pmin, c(.SD,na.rm=TRUE)),
trend_max = do.call(pmax, c(.SD,na.rm=TRUE)) ),.SDcols=c(scol:lcol)]
data
# p01 p02 p03 p04 p05 p06 p07 avg t01 t02 t03 t04 t05 trend_min trend_max
# 1: 35 53 22 82 100 59 69 65.75 -0.46768061 -0.19391635 -0.6653992 0.24714829 0.5209125 22 100
# 2: 78 75 15 65 70 69 66 54.75 0.42465753 0.36986301 -0.7260274 0.18721461 0.2785388 15 70
# 3: 15 45 27 61 63 75 99 56.50 -0.73451327 -0.20353982 -0.5221239 0.07964602 0.1150442 27 75
# 4: 41 80 13 22 63 84 17 45.50 -0.09890110 0.75824176 -0.7142857 -0.51648352 0.3846154 13 84
# 5: 53 9 75 47 25 75 66 55.50 -0.04504505 -0.83783784 0.3513514 -0.15315315 -0.5495495 25 75
# ---
# 996: 33 75 9 61 74 55 57 49.75 -0.33668342 0.50753769 -0.8190955 0.22613065 0.4874372 9 74
# 997: 24 68 74 11 43 75 37 50.75 -0.52709360 0.33990148 0.4581281 -0.78325123 -0.1527094 11 75
# 998: 62 78 82 97 56 50 74 71.25 -0.12982456 0.09473684 0.1508772 0.36140351 -0.2140351 50 97
# 999: 70 88 93 4 39 75 93 52.75 0.32701422 0.66824645 0.7630332 -0.92417062 -0.2606635 4 93
#1000: 20 50 99 94 62 66 98 80.25 -0.75077882 -0.37694704 0.2336449 0.17133956 -0.2274143 62 99
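The trend columns can also follow the user-supplied lcol directly, rather than a hard-coded column pattern. A hedged sketch that mirrors the OP's ordering (t01 = column lcol, t02 = column lcol - 1, and so on):
data[, sprintf('t%02d', 1:5) := lapply(.SD, function(x) x/avg - 1),
     .SDcols = names(data)[lcol:(lcol - 4)]]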
and then create the 'flag'
data[, flag := +(Reduce(`|`, lapply(.SD, function(x)
x < trend_min| x > trend_max))), .SDcols = t04:t05]
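If you prefer something closer to a per-row any, the same flag can be built from a logical matrix with rowSums. A hedged sketch; flag2 is a made-up column name:
data[, flag2 := +(rowSums(sapply(.SD, function(x)
  x < trend_min | x > trend_max)) > 0),
  .SDcols = c("t04", "t05")]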

Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely Zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to calculate the index in MS 15 seconds ago; subtracting that from 1:n gives w, the number of positions to average over. Exactly which intervals to produce is not discussed in the question, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
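The same width vector also plugs directly into data.table's adaptive rolling mean. A hedged sketch, assuming data.table >= 1.12.0 for adaptive = TRUE; roll_dt is a made-up column name:
library(data.table)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))  # same widths as above
setDT(df)[, roll_dt := frollmean(HR, n = w, adaptive = TRUE)]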
An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
by=.EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
I'm not totally sure how you want to apply the 15 s rolling average, but here is one way to go about what I think you're looking for. First we subset the data that falls between 7.5 s before and 7.5 s after each time point, then we take the average. This, however, will have an edge effect since there is no 7.5 s of data before the first value.
library(tidyverse)
roll_vec <- c()
for(i in 1:nrow(df)){
ref <- df$MS[[i]]
val <- df %>%
filter(MS <= ref + 7500 & MS >= ref- 7500) %>%
pull(HR) %>%
mean
roll_vec[[i]] <- val
}
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
mutate(roll_15s = roll_vec) %>%
ggplot(aes(MS, HR))+
geom_line()+
geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
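The explicit for loop can also be replaced by a vectorised lookup inside mutate. A hedged sketch using purrr::map_dbl (roll_15s_2 is a made-up column name; dplyr and purrr are both attached by library(tidyverse) above):
library(dplyr)
library(purrr)

df %>%
  mutate(roll_15s_2 = map_dbl(MS, ~ mean(HR[MS >= .x - 7500 & MS <= .x + 7500])))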
One possible solution:
library(magrittr)
library(data.table)  # provides %between%
start_range <- df$MS[df$MS < max(df$MS)-15000]
lapply(start_range,function(t){
data.frame(MS = mean(df$MS[df$MS %between% c(t,t+15000)]),
HR = mean(df$HR[df$MS %between% c(t,t+15000)]))
}) %>% Reduce(rbind,.)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (I had only two points with the data you give):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points whose time lies between this value (t in lapply) and 15 s after it.
I restrict this to the range where there are values spanning the full 15 s: the start_range vector.

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other lines which essentially do the same thing. Notice that the conditions change throughout this loop and are of two types: one of the form i %in% 45:60 == T and the other of the form i == 99.
My original code has multiple such ifs within this loop so any help on how I can write it more efficiently or compactly will be appreciated.
The OP wants to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to rules hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Here is a quick and dirty solution that should do the job (I'm sure there is a more efficient/elegant way to do this). We can set up a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to guarantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of #Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here's two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na
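For completeness, the same breakpoints used in the lookup table from the data.table answer above can be handed to base R's cut(). A hedged sketch (NACE2.Sector3 is a made-up column name):
# 0 plus the upper bound of each sector: A=1-3, B=4-20, C=21-44, D=45-60, E..R=61..74, S=75-97, T=98, U=99
breaks <- c(0, 3, 20, 44, 60:74, 97, 98, 99)
H07_NACE$NACE2.Sector3 <- LETTERS[cut(H07_NACE$NACE2.Code, breaks, labels = FALSE)]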

Subset by multiple ranges [duplicate]

This question already has answers here:
Efficient way to filter one data frame by ranges in another
(3 answers)
Closed 5 years ago.
I want to get a list of values that fall in between multiple ranges.
library(data.table)
values <- data.table(value = c(1:100))
range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))
I need the results to include only the values that fall in between those ranges:
results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92)
I am currently doing this with a for loop,
results <- data.table(NULL)
for (i in 1:NROW(range)){
results <- rbind(results,
data.table(result = values[value >= range[i, start] &
value <= range[i, end], value]))}
however the actual dataset is quite large and I am looking for a more efficient way.
Any suggestions are appreciated! Thank you!
Using the non-equi join possibility of data.table:
values[range, on = .(value >= start, value <= end), .(results = x.value)]
which gives:
results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
Or, as per the suggestion of #Henrik: values[value %inrange% range]. This also works very well on data.tables with multiple columns:
# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))
> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46
If you have the latest CRAN version of data.table you can use non-equi joins. For example, you can create an index which you can then use to subset your original data:
idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
# [1] 6 7 8 9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
values[idx]
Here is one method using lapply and %between%
rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]]))
This method loops through the range data.table and subsets values in each iteration according to the corresponding row of range. lapply returns a list, which rbindlist binds into a single data.table. If you want a vector instead, replace rbindlist with unlist.
benchmarks
Just to check the speeds of each suggestion on the given data, I ran a quick comparison
library(microbenchmark)
microbenchmark(
lmo=rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
dd={idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
jaap=values[range, on = .(value >= start, value <= end), .(results = x.value)],
inrange=values[value %inrange% range])
This returned
Unit: microseconds
expr min lq mean median uq max neval cld
lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311 100 c
dd 688.230 766.7750 885.1826 792.8615 825.220 3609.644 100 b
jaap 798.279 897.6355 935.9474 921.7265 970.906 1347.380 100 b
inrange 463.002 518.3110 563.9724 545.5375 575.758 1944.948 100 a
As might be expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange%, which is essentially a vectorized extension of %between%.

Add new columns to a data.table containing many variables

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
Stock = rep(LETTERS[1:2], each=3 ),
x1 = sample(1:50, 6),
x2 = sample(1:50, 6),
x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables, and it also triggers an error message with scale (but not with sum), since scale returns a matrix rather than a vector.
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think with a small modification to your last code you can easily do both for as many variables as you want
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
For simple functions (that don't need special treatment like scale) you could easily do something like
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
Another variation using data.table
vars <- c("x2", "x3")
d[, paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from #Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))),
rec=F), by=Stock, .SDcols=vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). Then, from dplyr you could use the function mutate_at as I do in this example (with the first data set you showed in your question):
library(dplyr)
dt %>%
group_by(Stock) %>%
mutate_at(vars(x2, x3), funs(sum, scale))
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated, which will create more columns for you. Note that mutate_at applies the functions only to the columns selected in vars() (which is what I did in this example), and it never touches the grouping variable (Stock). Alternatively, you can specify which columns you don't want the functions applied to by negating the selection (e.g. vars(-c(x2, x3)) instead of vars(x2, x3)).
EDIT: replaced mutate_each above with mutate_at, as mutate_each will be deprecated in the near future.
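funs() has itself since been superseded; in dplyr >= 1.0 the same idea is written with across(). A hedged sketch, not part of the original answer:
library(dplyr)
dt %>%
  group_by(Stock) %>%
  # produces x2_sum, x2_scale, x3_sum, x3_scale via the default "{.col}_{.fn}" naming
  mutate(across(c(x2, x3), list(sum = sum, scale = ~ scale(.x)[, 1])))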
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), rec=F),
.SDcols=cols,
by=Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
