Using Rolling Average to Calculate over Window of Values - r

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely Zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!

rollapplyr in zoo accepts a vector of widths, and findInterval can be used to find the index in MS of the reading 15 seconds earlier. Subtracting that index from 1:n gives w, the number of positions to average at each row. The question does not say exactly which intervals to produce, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
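As a rough cross-check of the width vector, the same right-aligned means can be sketched in base R without zoo. This assumes MS is sorted in increasing order (findInterval requires that) and treats each window as the half-open interval (t - 15000, t]. The names n_before and roll_base are illustrative only.

```r
MS <- c(36148, 36753, 37364, 38062, 38737, 39580, 40029, 40387, 41208, 42006,
        42796, 43533, 44274, 44988, 45696, 46398, 47079, 47742, 48429, 49135,
        49861, 50591, 51324, 52059)
HR <- c(84, 84, 84, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 84,
        84, 84, 84, 84, 84, 84)

# For each reading, count the readings taken at or before (t - 15000);
# those fall outside the 15 s window
n_before <- findInterval(MS - 15000, MS)

# Window width in rows, the same w that is passed to rollapplyr
w <- seq_along(MS) - n_before

# Average the last w[i] heart-rate values ending at row i
roll_base <- vapply(seq_along(MS),
                    function(i) mean(HR[(i - w[i] + 1):i]),
                    numeric(1))
round(roll_base, 2)
```

For the first rows the window simply contains every reading so far, because fewer than 15 seconds of data exist yet.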

An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
by=.EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
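To see what the join computes per participant, here is a minimal base R sketch of the same closed window [MS - 15000, MS] applied within each ID. The two-participant data and the helper roll_mean are made up for illustration and are not part of the answer above; on large data the non-equi join should be considerably faster than this loop.

```r
# Hypothetical data for two participants
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  MS = c(0, 10000, 20000, 0, 5000, 16000),
  HR = c(80, 90, 100, 60, 70, 80)
)

# Mean of hr over all readings within `width` ms before (and including) each reading
roll_mean <- function(ms, hr, width = 15000) {
  vapply(seq_along(ms),
         function(i) mean(hr[ms >= ms[i] - width & ms <= ms[i]]),
         numeric(1))
}

# Apply the window separately to each participant
df$avgHR <- NA_real_
for (id in unique(df$ID)) {
  rows <- df$ID == id
  df$avgHR[rows] <- roll_mean(df$MS[rows], df$HR[rows])
}
df
```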

I'm not totally sure how you want to apply the 15s rolling average, but here is one way to go about what I think you're looking for. First we subset the data that is between 7.5s before and 7.5s after each reading, then we take the average. This, however, will have an edge effect, since there is no data in the 7.5s before the first value.
library(tidyverse)
roll_vec <- c()
for(i in 1:nrow(df)){
ref <- df$MS[[i]]
val <- df %>%
filter(MS <= ref + 7500 & MS >= ref- 7500) %>%
pull(HR) %>%
mean
roll_vec[[i]] <- val
}
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
mutate(roll_15s = roll_vec) %>%
ggplot(aes(MS, HR))+
geom_line()+
geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
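The centered window can also be sketched without the loop and without tidyverse, using vapply. This is only an illustration on the question's 24 readings; the names half, roll_centered and n_in_window are assumptions of this sketch. Counting the readings per window makes the edge effect visible: windows near either end contain fewer points.

```r
MS <- c(36148, 36753, 37364, 38062, 38737, 39580, 40029, 40387, 41208, 42006,
        42796, 43533, 44274, 44988, 45696, 46398, 47079, 47742, 48429, 49135,
        49861, 50591, 51324, 52059)
HR <- c(84, 84, 84, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 96, 84, 84, 84,
        84, 84, 84, 84, 84, 84)

half <- 7500  # half of the 15 s window, in milliseconds

# Mean heart rate of all readings within +/- 7.5 s of each reading
roll_centered <- vapply(MS,
                        function(ref) mean(HR[MS >= ref - half & MS <= ref + half]),
                        numeric(1))

# Number of readings contributing to each window
n_in_window <- vapply(MS,
                      function(ref) sum(MS >= ref - half & MS <= ref + half),
                      integer(1))

data.frame(MS, HR, roll_15s = roll_centered, n = n_in_window)
```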

One possible solution:
library(magrittr)
library(data.table) # provides %between%
start_range <- df$MS[df$MS < max(df$MS)-15000]
lapply(start_range,function(t){
data.frame(MS = mean(df$MS[df$MS %between% c(t,t+15000)]),
HR = mean(df$HR[df$MS %between% c(t,t+15000)]))
}) %>% Reduce(rbind,.)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (I got only two points with the data you gave):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and the mean of the time MS over all points with a time between this value (t in lapply) and 15 s after it.
I restrict this to the range where a full 15 s of following data is available: the start_range vector.

Related

Perform row-wise operation in datatable with multiple elements

I have the following data table:
library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))
Now the problem I have is that I need to take every value in column "x" (that is, each given ID), and add all the existing values in column "y" to it. I also need to do it separately per column "group". Let's assume we start with ID = 1. This element has the value: x_1 = 68, and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y_2, x_1 + y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simply be split by group).
Edit: Exemplary code to do this only for X_1 and group A:
current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT
One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.
I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.
Thanks for any hints how to approach this.
You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special symbol .BY, which is available in j when using by. .BY is a named list containing the values of the grouping variables. Here, I add the value of x (.BY$x) to the vector of y values from the subset of DT where group equals the current group value (.BY$group).
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
You can also accomplish this via a join:
DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=TRUE][, total:=x+y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68
If I understand correctly, the requested result requires a cross join where each element of x is combined with each element of y (within each group).
This can be accomplished easily using the CJ() function:
DT[, CJ(x, y, sorted = FALSE), by = group][, sum_x_y := x + y][]
group x y sum_x_y
1: A 68 76 144
2: A 68 39 107
3: A 68 24 92
4: A 68 53 121
5: A 68 92 160
---
4996: B 4 21 25
4997: B 4 62 66
4998: B 4 79 83
4999: B 4 23 27
5000: B 4 64 68
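If only the lowest sum per x is needed, as the question states at the end, the cross join may not be necessary at all: for a fixed x, min(x + y) equals x + min(y), so one group-wise minimum of y is enough. The sketch below uses base R and a small made-up table; the column min_sum and the brute-force check are illustrative and not taken from the answers above.

```r
# Small hypothetical table with the same columns as DT
DT <- data.frame(
  ind   = 1:6,
  x     = c(5, 1, 9, 4, 7, 2),
  y     = c(3, 8, 6, 10, 2, 5),
  group = rep(c("A", "B"), each = 3)
)

# Lowest x + y per row: add the group-wise minimum of y to each x
DT$min_sum <- DT$x + ave(DT$y, DT$group, FUN = min)

# Brute-force check: build every x + y combination per group, take row minima
# (rows are already ordered by group, so the results line up with DT)
brute <- unlist(lapply(split(DT, DT$group), function(g) {
  apply(outer(g$x, g$y, "+"), 1, min)
}), use.names = FALSE)

DT
```

This keeps the result at one row per x instead of N rows per x, which matters when the table is large.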

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values for a minimum (dds_min) to maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did....
dss_levels <-dds_min.max %>%
group_by(fs) %>%
mutate(dds=seq(dds_min,dds_max,length.out=100))
I intended to create a new variable (dds), that has to be 100 length and start and end at different values depending on "fs". My expectation was to end with another dataframe (dss_levels) with two columns (fs and dds), 200 values on it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes; you can change it to 100.
library(dplyr) # mutate() and %>%
library(purrr)
library(tidyr)
dds_min.max %>%
mutate(dds= map2(dds_min, dds_max, seq, length.out = 5)) %>%
unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
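For comparison, the same expansion can be sketched in base R with Map, which calls seq once per row and stacks the pieces. This is only an illustration; n, the anonymous helper and the result name dds_levels are assumptions of this sketch, and the length is kept at 5 as above.

```r
dds_min.max <- data.frame(fs = c("early", "late"),
                          dds_min = c(47.2, 40),
                          dds_max = c(122, 105),
                          stringsAsFactors = FALSE)

n <- 5  # change to 100 for the sequence length asked for in the question

# One small data frame per row of dds_min.max, then stack them
pieces <- Map(function(fs, lo, hi) {
  data.frame(fs = fs, dds = seq(lo, hi, length.out = n),
             stringsAsFactors = FALSE)
}, dds_min.max$fs, dds_min.max$dds_min, dds_min.max$dds_max)

dds_levels <- do.call(rbind, pieces)
rownames(dds_levels) <- NULL
dds_levels
```

The result has only the two columns fs and dds, which is the shape the question asked for.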

How to effectively determine the maximum difference between the variable value in each row and same variable subsequent row values in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R?
Example:
library(data.table)
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution which I consider ineffective and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you suggest a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_tibble() %>% # replaces the deprecated as_data_frame()
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
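The reverse cumulative-maximum idea can be checked on a small vector in base R. The sketch below uses the ten X values of group 10 shown in the question's output plus the first three values of group 11 (a truncated group, so its results differ from the full output). The helper names max_diff_fast and max_diff_slow are illustrative.

```r
# Running maximum from the end of the vector, minus the value itself
max_diff_fast <- function(x) rev(cummax(rev(x))) - x

# The question's row-by-row approach, for comparison
max_diff_slow <- function(x) {
  vapply(seq_along(x), function(i) max(x[i:length(x)] - x[i]), numeric(1))
}

X <- c(126, 137, 157, 191, 120, 190, 195, 166, 163, 106, 120, 117, 169)
Y <- c(rep(10, 10), rep(11, 3))

# ave() applies the function within each group and keeps the original row order
fast <- ave(X, Y, FUN = max_diff_fast)
slow <- ave(X, Y, FUN = max_diff_slow)

data.frame(X, Y, max_diff = fast)
```

The fast version does one pass per group instead of one pass per row, which is where the speedup described above comes from.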

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other lines which essentially do the same thing. Notice that the conditions change from one if to the next within this loop and are of two types. One is, for example, of the type i %in% 45:60 == T and the other of the type i == 99.
My original code has multiple such ifs within this loop so any help on how I can write it more efficiently or compactly will be appreciated.
The user has requested to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to given rules he has hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
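The start-point lookup used for the rolling join can also be sketched in base R with findInterval, which returns the index of the last start value that is less than or equal to each code. This assumes every code lies in 1:99, so the index is never 0; the names starts, codes and sectors are illustrative, and the codes are the ten sample values shown above.

```r
# Start of each sector's code range, as in the lookup table above
starts <- c(1, 4, 21, 45, 61:75, 98, 99)

# The ten sample codes from the output above
codes <- c(27, 37, 57, 90, 20, 89, 94, 66, 63, 7)

# Index of the range each code falls into, mapped to the letters A..U
sectors <- LETTERS[findInterval(codes, starts)]

data.frame(NACE2.Code = codes, NACE2.Sector = sectors)
```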
Here is a quick and dirty solution that should do the job (I'm sure there is a more efficient/elegant way to do this). We can set up a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to guarantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of @Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here are two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the number of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I rearrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The tidyr package offers analogous functions (gather(), or pivot_longer() in current versions).
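If the goal is to avoid stacking altogether, one option is to apply the statistics column by column with sapply in base R. This is a minimal sketch that assumes every column of the data frame is a site; the data frame DF is rebuilt from the table in the question, and the name stats is illustrative.

```r
DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# One named vector of statistics per column; t() puts sites in rows
stats <- t(sapply(DF, function(v) {
  c(N    = length(v),
    mean = mean(v),
    sd   = sd(v),
    se   = sd(v) / sqrt(length(v)),
    sum  = sum(v))
}))
stats
```

The result is a numeric matrix with one row per site; as.data.frame(stats) converts it if a data frame is preferred.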
