Find historical maximums in a time series in R

I am trying to identify historical maximum records from time-series data. I need to identify maximum records only as they pertain to the data up to that point, not the whole vector.
An example:
set.seed(431)
df <- data.frame(time = c(1:10), value = runif(10, 1, 10))
df
time value
1 1 7.758703
2 2 6.262849
3 3 8.281712
4 4 8.243617
5 5 6.781752
6 6 2.078103
7 7 4.455353
8 8 1.339119
9 9 3.635554
10 10 9.084619
What I want to do is produce a vector that identifies the record-high values moving forward in time:
time value record
1 1 7.758703 yes
2 2 6.262849 no
3 3 8.281712 yes
4 4 8.243617 no
5 5 6.781752 no
6 6 2.078103 no
7 7 4.455353 no
8 8 1.339119 no
9 9 3.635554 no
10 10 9.084619 yes
The value at time 1 is a record because no values exist prior to it, therefore it is the maximum. The value at time 3 is a record because it's higher than the one at time 1. The value at time 10 is a record because it's higher than the one at time 3.
All I have been able to do is test the max value for the whole vector (i.e. identify the value at time 10), rather than the vector up to the time value being considered. I tried mutate through dplyr but it wouldn't work. Then I looked at writing a for loop that would append values to a vector and look for the maximum within that new vector, but that led me to posts suggesting the approach was more Pythonic than idiomatic R.
Can anyone help? I imagine this is easy.

An option is to get the cummax of 'value' and check whether it is equal to 'value':
library(dplyr)
df %>%
  mutate(record = c('no', 'yes')[(value == cummax(value)) + 1])  # FALSE/TRUE + 1 indexes 'no'/'yes'
# A tibble: 10 x 3
# time value record
# <int> <dbl> <chr>
# 1 1 7.76 yes
# 2 2 6.26 no
# 3 3 8.28 yes
# 4 4 8.24 no
# 5 5 6.78 no
# 6 6 2.08 no
# 7 7 4.46 no
# 8 8 1.34 no
# 9 9 3.64 no
#10 10 9.08 yes
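
The same cummax() test also works in base R, if the dplyr dependency is unwanted (a minimal sketch of the same idea):

df$record <- ifelse(df$value == cummax(df$value), "yes", "no")

As in the dplyr version, a later value that exactly ties the running maximum would also be flagged as a record.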

Related

Multiple conditions when creating a new column

I have a dataset with two columns, and I need to create a third column based on conditions involving the first two.
library(dplyr)
set.seed(1)
x1 <- sample(1:10, 100, replace = TRUE)
y1 <- sample(seq(1, 10, 0.1), 100, replace = TRUE)
z <- cbind(x1, y1)
unique(as.data.frame(z)$x1)
z %>% as.data.frame() %>% dplyr::filter(x1 == 3)
table(x1)
1 2 3 4 5 6 7 8 9 10
7 6 11 14 14 5 11 15 11 6
> z%>%as.data.frame()%>%dplyr::filter(x1==3)
x1 y1
1 3 6.9
2 3 9.5
3 3 10.0
4 3 5.6
5 3 4.1
6 3 2.5
7 3 5.3
8 3 9.5
9 3 5.5
10 3 8.9
11 3 1.2
For example, when I filter x1 == 3, the y1 values can be seen; I need to write 1 on the 11th row and 0 on the rest, i.e. I need to find the minimum in that column. My original dataset has 43545 rows but only 638 unique values of x1. table(x1) shows that 1 is repeated 7 times, but in my dataset some values have a frequency of 1 and some a frequency of 100. I thought I should use case_when, but how can I check every y1 within a group to find the smallest one and assign it a 1?
If I understand correctly, you are looking for the row with the minimal y1 value for each value of x1:
library(tidyverse)
z %>% as.data.frame() %>%
  group_by(x1) %>%
  arrange(y1) %>%                                      # sort by increasing y1
  mutate(flag = ifelse(row_number() == 1, 1, 0)) %>%   # flag the first (smallest) row per group
  ungroup()
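
A variant that skips the sort and compares against the group minimum directly (a sketch; note that it would flag every row tied for the minimum, if ties exist):

z %>% as.data.frame() %>%
  group_by(x1) %>%
  mutate(flag = ifelse(y1 == min(y1), 1, 0)) %>%
  ungroup()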

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to start a new group whenever the diff is larger than a certain value. I tried the code below. However, it does not work, since the dialog variable is not yet available inside the mutate() call that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
  mutate(t_diff = c(NA, diff(time))) %>%
  # This generates an error, as dialog is not yet a variable at this point
  mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group, where groups are split at points where the diff to the previous element is 500 or more.
Unfortunately, I have not found a clever way to achieve this efficiently in dplyr. Obviously, iterating over the data.frame with a loop would work, but it would be very inefficient.
Is there a way to achieve this in dplyr?
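
A common pattern here (a sketch, not an answer from the original thread: it assumes a new group starts on the first row and whenever t_diff is at least 500) is to build a group id with cumsum() and take the first id per group:

library(dplyr)
df %>%
  mutate(t_diff = c(NA, diff(time)),
         grp = cumsum(is.na(t_diff) | t_diff >= 500)) %>%  # TRUE starts a new group
  group_by(grp) %>%
  mutate(dialog = first(id)) %>%
  ungroup() %>%
  select(-grp)

On the example data this yields the desired dialog column of 1 for the first five rows and 6 for the rest.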

How to calculate variance in a data.table

I am a newbie to R. I have a data.table DT as:
id time day type
1 1 9 10
2 2 3 10
1 3 6 12
3 8 9 10
6 9 9 10
8 2 6 18
9 3 5 10
9 1 4 12
From this I initially wanted the count grouped by day, time, and type, so I did
DT[, .N, by = list(day, time, type)]
which gives the count for each group.
Now I need to calculate the variance for each group, so I tried
DT[, var(.N), by = list(day, time, type)]
But this gave NA for all fields. Any help is appreciated.
The issue is that var(.N) takes the variance of a single number within each group (the group's count), and the variance of a single value is NA, so every group returns NA.
DT <- data.frame(id   = c(1, 2, 1, 3, 6, 8, 9, 9),
                 time = c(1, 2, 3, 8, 9, 2, 3, 1),
                 day  = c(9, 3, 6, 9, 9, 6, 5, 4),
                 type = c(10, 10, 12, 10, 10, 18, 10, 12))
# Per-id variance of every column, as an illustration
aggregate(DT, list(DT$id), FUN = var)
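
If the intent was instead the variance of the group counts themselves (an assumption; the question is ambiguous on this point), a sketch would compute the counts first and then take var() across them:

library(data.table)
DT <- as.data.table(DT)
counts <- DT[, .N, by = list(day, time, type)]  # one count per group, as in the question
var(counts$N)                                   # variance of those counts across groups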

Rolling sum in specified range

For df below I want to take the rolling sum of the Value column over the last 10 seconds, with Time given in seconds. The data frame is very large, so using tidyr::complete is not an option (millions of data points at millisecond resolution). I would prefer a dplyr solution but think it may be possible with a data.table join; I just can't make it work.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
Solution would add a column (Sum.10S) that takes the rolling sum of past 10 seconds:
df$Sum.10S=c(4,11,13,8,3,11,3)
Define a function sum10 which sums the values from the last 10 seconds, and use it with rollapplyr. This avoids explicit looping and runs about 10x faster than the explicit loop on the data in the question.
library(zoo)
# x is the trailing window of rows passed in by rollapplyr(); sum the Values
# whose Time falls within 10 seconds of the window's last row.
sum10 <- function(x) {
  if (is.null(dim(x))) x <- t(x)  # a single-row window arrives as a plain vector
  tt <- x[, "Time"]
  sum(x[tt >= tail(tt, 1) - 10, "Value"])
}
transform(df, S10 = rollapplyr(df, 10, sum10, by.column = FALSE, partial = TRUE))
giving:
Row Value Time S10
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
Well, I wasn't fast enough to get the first answer in, but this solution is simpler and doesn't require an external library.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
df$SumR <- NA
for (i in 1:nrow(df)) {
  df$SumR[i] <- sum(df$Value[which(df$Time <= df$Time[i] & df$Time >= df$Time[i] - 10)])
}
Row Value Time SumR
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
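
Since the question mentions a data.table join, here is a sketch of the non-equi self-join approach (it assumes data.table 1.9.8+, which introduced non-equi joins):

library(data.table)
dt <- as.data.table(df)
dt[, Start := Time - 10]
# For each row of dt, sum Value over all rows whose Time lies in [Time - 10, Time]
dt[, Sum.10S := dt[dt, on = .(Time >= Start, Time <= Time),
                   sum(Value), by = .EACHI]$V1]
dt[, Start := NULL]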

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1, nyear), rep(2, nyear), rep(3, nyear), rep(4, nyear),
           rep(5, nyear), rep(6, nyear), rep(7, nyear), rep(8, nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.numeric(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group() from groupdata2 (disclaimer: my package) to greedily divide the data points into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor, e.g. by number of groups, by a list of group sizes, or by having a group start whenever the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z"), the grouping factor would be c(1,1,2,3,3)).
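
That last grouping idea can also be sketched in base R with rle(), assuming a new group should start whenever the value changes:

v <- c("x", "x", "y", "z", "z")
r <- rle(v)                          # runs of repeated values
rep(seq_along(r$lengths), r$lengths) # gives 1 1 2 3 3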
