time between max and min of cycles - r

I have a series of 60,000 data points. Part of it looks like figure 1; the whole curve is not as clean and uniform as that image (other parts look like the second image), and it contains many cycles with different periods.
I need to calculate the durations marked by the red, green and purple rectangles for each cycle (**the time between each maximum and minimum, and the total time of each cycle**).
Can you give me some ideas on how to do this in R? Is there a special command or package that I can use?

The premise is that the midpoint of the data range is used to split the data into peak and non-peak points. A running id is then generated to group each consecutive stretch, so that the appropriate min or max can be determined within it. half_cycle provides the red and green boxes, while full_cycle provides the purple box for max-to-max and min-to-min. There is likely room for improvement, but it gives a method that can be adjusted as needed.
This sample uses random data since no sample data was provided.
library(dplyr)

# Synthetic noisy wave: five repetitions of a rise-plateau-fall pattern
wave <- c(seq(20, 50, 10), seq(50, 60, 0.5), seq(50, 20, -10))
df1 <- data.frame(time = seq_len(length(wave) * 5),
                  data = as.vector(replicate(5, wave + rnorm(length(wave), sd = 5))))

df1 %>%
  # TRUE where the value lies above the midpoint of the overall range
  mutate(peak = data > mean(range(data))) %>%
  # running id that increments each time peak flips
  mutate(run = cumsum(peak != lag(peak, default = TRUE))) %>%
  group_by(run) %>%
  mutate(max = max(data), min = min(data)) %>%
  # keep the maximum of each peak run and the minimum of each non-peak run
  filter((peak & data == max) | (!peak & data == min)) %>%
  # if_else() needs a typed NA here, not NULL
  mutate(max = if_else(peak, max, NA_real_), min = if_else(!peak, min, NA_real_)) %>%
  ungroup() %>%
  mutate(half_cycle = time - lag(time), full_cycle = time - lag(time, n = 2L))
# A tibble: 11 x 8
    time  data peak    run   max   min half_cycle full_cycle
   <int> <dbl> <lgl> <int> <dbl> <dbl>      <int>      <int>
 1     2  24.0 FALSE     1  NA    24.0         NA         NA
 2    12  67.1 TRUE      2  67.1  NA           10         NA
 3    29  15.1 FALSE     3  NA    15.1         17         27
 4    54  68.5 TRUE      4  68.5  NA           25         42
 5    59  20.8 FALSE     5  NA    20.8          5         30
 6    80  70.6 TRUE      6  70.6  NA           21         26
 7    87  18.3 FALSE     7  NA    18.3          7         28
 8   108  63.1 TRUE      8  63.1  NA           21         28
 9   117  13.8 FALSE     9  NA    13.8          9         30
10   140  64.5 TRUE     10  64.5  NA           23         32
11   145  22.4 FALSE    11  NA    22.4          5         28
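To see the run-id step in isolation, here is a minimal base R sketch on a made-up vector (the values in x and the helper name idx are illustrative, not from the question). It flags points above the mid-range, numbers the runs, and picks one extreme per run:

```r
# Toy series (hypothetical values), one reading per time step
x <- c(20, 30, 55, 60, 52, 25, 18, 58, 61, 22)

# TRUE where the value lies above the midpoint of the overall range
peak <- x > mean(range(x))

# Run id: starts at 1 and increments whenever peak flips
run <- cumsum(c(TRUE, peak[-1] != peak[-length(peak)]))

# Index of the extreme in each run: maximum for peak runs, minimum otherwise
idx <- vapply(split(seq_along(x), run), function(i) {
  if (peak[i[1]]) i[which.max(x[i])] else i[which.min(x[i])]
}, integer(1))

half_cycle <- diff(idx)           # time between consecutive extremes
full_cycle <- diff(idx, lag = 2)  # max-to-max or min-to-min
```

Because the index doubles as the time here, the differences are already durations; with a real time column you would difference time[idx] instead.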


Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants and as such the millisecond values can potentially be repeated multiple times, and due to inconsistent time readings, creating intervals by row is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to find the index in MS of the last reading that is at least 15 seconds old. Subtracting that index from 1:n gives w, the number of positions to average. The question does not say exactly which intervals to produce, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
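To make the width calculation concrete, here is a small base R sketch on made-up timestamps (ms and hr are illustrative values). It computes w the same way and then averages the last w readings by hand, which should correspond to a right-aligned rolling mean with per-point widths:

```r
ms <- c(0, 5000, 12000, 16000, 21000)  # hypothetical timestamps in ms
hr <- c(80, 90, 100, 110, 120)          # hypothetical heart rates

# findInterval() counts how many timestamps are <= ms - 15000,
# so w is the number of readings in the window (ms - 15000, ms]
w <- seq_along(ms) - findInterval(ms - 15000, ms)

# Right-aligned mean over the last w[i] readings
roll <- vapply(seq_along(ms),
               function(i) mean(hr[(i - w[i] + 1):i]),
               numeric(1))
```

Note that findInterval requires the timestamps to be sorted in non-decreasing order, so data for several participants would need to be handled one participant at a time.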
An option using a non-equi join in data.table, which also handles an ID column:
library(data.table)
setDT(df)[, avgHR :=
    df[.(ID=ID, start=MS-15000, end=MS), on=.(ID, MS>=start, MS<=end),
        by=.EACHI, mean(HR)]$V1
]
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
I'm not totally sure how you want to apply the 15 s rolling average, but here is one way to go about what I think you're looking for. For each point, we first subset the data between 7.5 s before and 7.5 s after it, then take the average. This has an edge effect, however, since there are no data 7.5 s before the first value.
library(dplyr)
roll_vec <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  ref <- df$MS[[i]]
  # mean HR of all readings within +/- 7.5 s of the reference time
  val <- df %>%
    filter(MS <= ref + 7500 & MS >= ref - 7500) %>%
    pull(HR) %>%
    mean()
  roll_vec[[i]] <- val
}
df %>%
  mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
library(ggplot2)
df %>%
  mutate(roll_15s = roll_vec) %>%
  ggplot(aes(MS, HR)) +
  geom_line() +                                 # raw data (black)
  geom_line(aes(y = roll_15s), color = "blue")  # 15 s rolling average
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
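The same centred window can also be written without an explicit loop. This is only a sketch in base R using the MS and HR vectors from the question; the name roll_centred is mine:

```r
MS <- c(36148, 36753, 37364, 38062, 38737, 39580, 40029, 40387, 41208, 42006,
        42796, 43533, 44274, 44988, 45696, 46398, 47079, 47742, 48429, 49135,
        49861, 50591, 51324, 52059)
HR <- c(84, 84, 84, 84, 84, 96, 84, 84, 96, 84, 84, 96,
        84, 84, 96, 84, 84, 84, 84, 84, 84, 84, 84, 84)

# For each timestamp t, average every reading within 7.5 s on either side
roll_centred <- vapply(MS,
                       function(t) mean(HR[abs(MS - t) <= 7500]),
                       numeric(1))
```

Each call scans the whole vector, so this is quadratic in the number of readings; for long recordings a join-based or findInterval-based approach is likely to scale better.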
One possible solution:
library(dplyr)       # for %>%
library(data.table)  # for %between%
start_range <- df$MS[df$MS < max(df$MS)-15000]
lapply(start_range, function(t) {
  data.frame(MS = mean(df$MS[df$MS %between% c(t,t+15000)]),
             HR = mean(df$HR[df$MS %between% c(t,t+15000)]))
}) %>% Reduce(rbind,.)
        MS       HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (with the data you gave, only two points had a full 15 s of readings after them):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points with a time between this value (t in lapply) and 15 s later.
I restrict this to the range where the data cover the full 15 s: the start_range vector.

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did:
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  mutate(dds = seq(dds_min, dds_max, length.out = 100))
I intended to create a new variable (dds) of length 100 per group that starts and ends at different values depending on "fs". I expected to end up with another data frame (dss_levels) with two columns (fs and dds) and 200 rows.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
I made the sequence length 5 for illustrative purposes; you can change it to 100.
library(dplyr)
library(purrr)  # map2
library(tidyr)  # unnest
dds_min.max %>%
  mutate(dds = map2(dds_min, dds_max, seq, length.out = 5)) %>%
  unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
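The reason behind the cbind warning can be checked directly: cbind on a character vector and numeric vectors builds a character matrix, so the numbers stop being numeric (and turned into factors under the older stringsAsFactors default). A minimal base R sketch, which also shows one way to build the per-group sequences without extra packages (the names m and out are illustrative):

```r
fs <- c("early", "late")
dds_min <- c(47.2, 40)
dds_max <- c(122, 105)

# cbind() returns a matrix, and a matrix can hold only one type
m <- cbind(fs, dds_min, dds_max)
typeof(m)  # "character": every column has been coerced to text

# data.frame() keeps each column's own type
dds_min.max <- data.frame(fs, dds_min, dds_max)

# One sequence per row, stacked into a long data frame
out <- do.call(rbind, Map(function(g, lo, hi) {
  data.frame(fs = g, dds = seq(lo, hi, length.out = 5))
}, as.character(dds_min.max$fs), dds_min.max$dds_min, dds_min.max$dds_max))
```

Changing length.out to 100 gives the 200-row result described in the question.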

Rank subset into quantiles using Ntile

I have a dataset containing 42,840 observations covering 119 unique months (Dataset$date). I want to assign a quantile to every Dataset$Value within each month, ranking the values from 1 (lowest) to 5 (highest).
Date     Name(ID)  Value  Quantile (the column I want to add, with a value from 1 to 5)
2009-03  1         35     (1-5)
2009-04  1         20     ...
2009-05  1         65     ...
2009-03  2         24     ...
2009-04  2         77     ...
2009-03  3         110    ...
2018-12  3         125    ...
2009-03  56        24     ...
2009-04  56        65     ...
2009-03  57        26     ...
2009-04  57        67     ...
2009-03  58        99     ...
I've tried the ntile function, which works great for the whole dataset, but I can't find a way to apply it separately within each date.
Any suggestions?
You could use the base rank function with dplyr's group_by:
library(dplyr)
# Create some data
N <- 3
dat <- tibble(
  date = rep(1:12, N),
  value = runif(12*N, 0, 100)
)
# The rescale function we will use later to fit on your 1-5 scale
## Adapted From https://stackoverflow.com/questions/25962508/rescaling-a-variable-in-r
RESCALE <- function (x, nx1, nx2, minx, maxx) {
  nx = nx1 + (nx2 - nx1) * (x - minx)/(maxx - minx)
  return(nx)
}
# What you want
dat %>%
  group_by(date) %>% # group by date so that mutate computes the ranks within each month
  mutate(rank_detail = rank(value), # ranks the values within each group
         rank_group = RESCALE(rank_detail, 1, 5, min(rank_detail), max(rank_detail)) ) %>% # rescales the ranking to your 1 to 5 scale
  arrange(date) # sort by date so each month's rows appear together
# A tibble: 36 x 4
# # Groups: date [12]
# date value rank_detail rank_group
# <int> <dbl> <dbl> <dbl>
# 1 1 92.7 3 5
# 2 1 53.6 2 3
# 3 1 47.8 1 1
# 4 2 24.6 2 3
# 5 2 72.2 3 5
# 6 2 11.5 1 1
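Since the question asks about ntile specifically, it may be worth noting that ntile can be combined with group_by in the same way. A minimal sketch, assuming dplyr is installed and using made-up data (the toy data frame and the quantile column name are illustrative):

```r
library(dplyr)

set.seed(42)
# Hypothetical data: two months with ten distinct values each
toy <- tibble(
  date  = rep(c("2009-03", "2009-04"), each = 10),
  value = c(sample(1:100, 10), sample(1:100, 10))
)

ranked <- toy %>%
  group_by(date) %>%                    # one set of buckets per month
  mutate(quantile = ntile(value, 5)) %>%  # 1 = lowest fifth, 5 = highest fifth
  ungroup()
```

Unlike the rescaled rank, ntile always returns whole bucket numbers, which may be closer to what the question describes.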

Automate regression by rows

I have a data.frame
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row of the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
  age = c(exp$age1[i], exp$age2[i])
  h = c(exp$h[i], exp$h2[i])
  model = lm(age ~ h)
  exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
It works well but takes some time with very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try rowwise() and do. Inside the do, we concatenate (c) 'age1' and 'age2' to create 'age', create 'h' likewise, apply lm, and extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
  rowwise() %>%
  do({
    age <- c(.$age1, .$age2)
    h <- c(.$h, .$h2)
    model <- lm(age ~ h)
    data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
  } )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
modelh <- melt(setDT(exp, keep.rownames=TRUE), measure=patterns('^age', '^h'),
value.name=c('age', 'h'))[, {model <- lm(age ~h)
coef(model)[1] + 100 * coef(model)[2]},rn]$V1
exp[, modelh:= modelh][, rn := NULL]
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise then you'll create more and more columns as your age and h observations increase. If you get N observations you'll have to use 2xN columns for those 2 variables only.
I'd suggest to use a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
        varying = list(c("age1","age2"),
                       c("h","h2")),
        v.names = c("age_value","h_value"),
        direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create formulas. And lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
library(dplyr)
library(microbenchmark)
big.exp <- data.frame(age1=sample(30, 1e4, T),
                      age2=sample(30:50, 1e4, T),
                      h=runif(1e4, 10, 40),
                      h2= 40 + runif(1e4,4,9))
microbenchmark(
  plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
    lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
    big.exp$modelh <- unlist(lst)},
  akdplyr = {big.exp %>%
    rowwise() %>%
    do({
      age <- c(.$age1, .$age2)
      h <- c(.$h, .$h2)
      model <- lm(age ~ h)
      data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
    } )},
  times = 5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
The lapply approach seems to perform well compared to dplyr with respect to speed, but its rounding (sprintf's %f keeps six decimal places) may be an issue. Improvements may be possible, perhaps by using apply after converting to a matrix to increase speed and efficiency.
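One possible improvement, offered as a sketch rather than a drop-in replacement: with exactly two points per row, the fitted line passes through both points, so the slope and intercept can be computed directly with vectorised arithmetic instead of calling lm once per row. This assumes h and h2 are never equal within a row; the data frame name d is illustrative.

```r
set.seed(1)
# Same column layout as the question's data, with hypothetical random values
d <- data.frame(age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
                h = runif(20, 10, 40), h2 = 40 + runif(20, 4, 9))

# Line through (h, age1) and (h2, age2), computed for all rows at once
slope     <- (d$age2 - d$age1) / (d$h2 - d$h)
intercept <- d$age1 - slope * d$h
d$modelh  <- intercept + 100 * slope

# Cross-check the first row against lm()
fit   <- lm(c(d$age1[1], d$age2[1]) ~ c(d$h[1], d$h2[1]))
check <- unname(coef(fit)[1] + 100 * coef(fit)[2])
```

This shortcut only holds for the two-point case; with more observations per row, a real least-squares fit is needed again.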

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from each of Ozone, Solar.R, Wind, ...
Also, if possible, how would I sort Solar.R or any other column of this data frame in descending order?
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
but got "subset" must be logical."
I can set an object to hold the subset of each column, by the following commands
ozone <- subset(ozone, Ozone >0)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc, you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use the colMax function on the sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
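For readers wondering about the ..., it simply forwards any extra arguments from colSort to sort. A small sketch on made-up data (d and d_na are illustrative). Note that sort drops NAs by default, so when columns contain different numbers of NAs, sapply can no longer return a matrix and the result becomes a list:

```r
colSort <- function(data, ...) sapply(data, sort, ...)

d <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20))
asc  <- colSort(d)                     # sort() default: increasing
desc <- colSort(d, decreasing = TRUE)  # decreasing = TRUE travels through ...

# With an NA in one column the sorted columns differ in length
d_na <- data.frame(a = c(3, NA, 2), b = c(10, 30, 20))
res_na <- colSort(d_na)                # a list rather than a matrix
```

Also keep in mind that sorting each column independently breaks the row alignment, so this is only meaningful when the columns are looked at separately.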
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
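And to sort by a column in increasing order:
ozone[order(ozone$Solar.R), ]
Or to sort the other direction:
ozone[order(ozone$Solar.R, decreasing = TRUE), ]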
Here's a dplyr solution:
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do it:
# find max for each numeric column
ozone %>%
  summarise_if(is.numeric, max, na.rm = TRUE)
# or for columns selected by position
ozone %>%
  summarise_at(vars(1:6), max, na.rm = TRUE)
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
do.call('pmax', c(as.data.frame(t(ozone)),na.rm=TRUE))
#[1] 41.0 313.0 20.1 74.0 5.0 9.0
There is a package, matrixStats, that provides functions for column and row summaries (see the package vignette), but you have to convert your data.frame into a matrix.
Then you run: colMaxs(as.matrix(ozone), na.rm = TRUE)
max(may$Ozone, na.rm = TRUE)
Without $Ozone, the maximum is taken over the whole data frame; this is covered in the swirl library.
I'm studying this course on Coursera too ~
Assuming that your data are in a data.frame called maxinozone, you can do this:
max(maxinozone[, 1], na.rm = TRUE) # first column; [1, ] would select the first row instead
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
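The difference between these calls can be reproduced on two of the columns shown in the question (a sketch; oz is an illustrative name):

```r
oz <- data.frame(
  Ozone   = c(41, 36, 12, 18, NA, 28, 23, 19, 8),
  Solar.R = c(190, 118, 149, 313, NA, NA, 299, 99, 19)
)

max(oz$Ozone)                  # NA, because the column contains a missing value
max(oz$Ozone, na.rm = TRUE)    # maximum of that one column
max(oz, na.rm = TRUE)          # maximum over every cell of the data frame
sapply(oz, max, na.rm = TRUE)  # one maximum per column
```

This is why calling max on the whole data frame returns a single number that belongs to whichever column happens to hold the largest value.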
Try this solution:
Oz <- subset(data, Month == 5, select = Ozone) # select the Ozone values for the month of May (i.e. Month = 5)
summary(Oz) # gives characteristics of the table (one Ozone column), including max, min, ...
