Rolling sum in specified range - R

For df I want to take the rolling sum of the Value column over the last 10 seconds, with Time given in seconds. The data frame is very large (millions of data points at millisecond resolution), so using tidyr::complete is not an option. I would prefer a dplyr solution, but I think it may be possible with a data.table left join; I just can't make it work.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
The solution would add a column (Sum.10S) containing the rolling sum over the past 10 seconds:
df$Sum.10S=c(4,11,13,8,3,11,3)

Define a function sum10 that sums the values from the last 10 seconds, and apply it with rollapplyr. This avoids writing an explicit loop and runs about 10x faster than the looped version on the data in the question.
library(zoo)

# Sum the "Value" entries whose "Time" lies within 10 seconds of the
# window's most recent time.
sum10 <- function(x) {
  if (is.null(dim(x))) x <- t(x)  # a width-1 partial window arrives as a vector
  tt <- x[, "Time"]
  sum(x[tt >= tail(tt, 1) - 10, "Value"])
}

# width 10 assumes at most 10 observations ever fall within one 10-second window
transform(df, S10 = rollapplyr(df, 10, sum10, by.column = FALSE, partial = TRUE))
giving:
Row Value Time S10
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3

Well, I wasn't fast enough to get the first answer in, but this solution is simpler and doesn't require an external library.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
df$SumR <- NA
for (i in 1:nrow(df)) {
  df$SumR[i] <- sum(df$Value[which(df$Time <= df$Time[i] & df$Time >= df$Time[i] - 10)])
}
Row Value Time SumR
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
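For reference, the data.table join the asker alludes to can be done with a non-equi self-join. A rough sketch (win is just an illustrative name for the per-row window table):

library(data.table)
dt <- as.data.table(df)
# one [Time - 10, Time] window per row
win <- dt[, .(t1 = Time - 10, t2 = Time)]
# for each window, sum the Values of the rows whose Time falls inside it
dt[, Sum.10S := dt[win, on = .(Time >= t1, Time <= t2), sum(Value), by = .EACHI]$V1]

This should scale far better than a row-wise loop on millions of rows, which is the asker's actual constraint.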

Related

R Looking up closest value in data.frame less than or equal to another value

I have two data.frames, lookup_df and values_df. For each row in lookup_df I want to lookup the closest value in the values_df that is less than or equal to an index value.
Here's my code so far:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
What I'm wanting for the result_df is the following:
> result_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
I know how to do this with SQL fairly easily, but I'm curious whether there is a straightforward R way. I could iterate over the rows of lookup_df and then loop through the rows of values_df, but that is not computationally efficient. I'm open to using the dplyr library if someone knows how to solve the problem with it.
If values_df is sorted by idx ascending, then findInterval will work:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
lookup_df$values <- values_df$values[findInterval(lookup_df$ids,values_df$idx)]
> lookup_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
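To see why this works: provided vec is sorted, findInterval(x, vec) returns, for each element of x, the index of the largest element of vec that is less than or equal to it (0 if there is none). A minimal base-R illustration:

findInterval(c(1, 2, 3, 6, 7, 10), c(1, 3, 7))
# [1] 1 1 2 2 3 3

One caveat: an id smaller than min(idx) would produce index 0, which silently drops that element and breaks the column assignment, so this assumes every id is at least min(idx).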

Finding the minimum positive value

I guess I don't know which.min as well as I thought.
I'm trying to find the occurrence in a vector of a minimum value that is positive.
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862, 10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)
I want to know, for each value z from 1 to 19, the index of the TIME element that is closest to, but above, z. I tried the following code:
vec <- sapply(seq(1,19,1), function(z) which.min((z-TIME > 0)))
vec
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 1 1
To my mind, the last two values of vec should be 12, 12. The reason it's doing this, I assumed, is that it thinks 0.0000 is the value closest to 0.
So, I thought that maybe it was because I exported the data from external software and that 0.0000 wasn't really 0. But,
TIME[1]==0 #TRUE
Then I got further confused. Why do these give the answer of index 1, when really they should be an ERROR?
which.min(0 > 0 ) #1
which.min(-1 > 0 ) #1
I'll be glad to be put right.
EDIT:
I guess in a nutshell, what is the better way to get this result:
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
which shows the index of TIME that gives the smallest possible positive value, when subtracting each element of TIME from the values of 1 to 19.
(As an aside, the puzzling results are not errors: 0 > 0 evaluates to the length-one logical vector FALSE, and which.min(FALSE) is 1 because the minimum of that vector sits at index 1. Likewise, for z of 18 and 19, every element of z - TIME > 0 is TRUE, so which.min returns the first index.) The natural function to use here (both to limit typing and for efficiency) is actually not which.min plus sapply but the cut function, which determines which range of times each of the values 1:19 falls into:
cut(1:19, breaks=TIME, right=FALSE)
# [1] [0,4.47) [0,4.47) [0,4.47) [0,4.47) [4.47,6.11) [4.47,6.11) [6.74,8.17)
# [8] [6.74,8.17) [8.81,10) [8.81,10) [10,11) [11,14.8) [11,14.8) [11,14.8)
# [15] [14.8,15.5) [15.5,16.5) [16.5,17.1) <NA> <NA>
# 11 Levels: [0,4.47) [4.47,6.11) [6.11,6.74) [6.74,8.17) [8.17,8.81) ... [16.5,17.1)
From this, you can easily determine what you're looking for, which is the index of the smallest element in TIME greater than the cutoff:
(x <- as.numeric(cut(1:19, breaks=TIME, right=FALSE))+1)
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 NA NA
The last two entries appear as NA because there is no element in TIME that exceeds 18 or 19. If you wanted to replace these with the largest element in TIME, you could do so with replace:
replace(x, is.na(x), length(TIME))
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
Here's one way:
x <- t(outer(TIME, 1:19, `-`))
max.col(ifelse(x < 0, x, Inf), ties.method = "first")
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
It's computationally wasteful to take all the differences in this way, since both vectors are ordered.
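Since both vectors are sorted, findInterval can do the same search without materialising the full difference matrix; a base-R sketch matching the strictly-above semantics of the cut answer:

idx <- findInterval(1:19, TIME) + 1  # index of the first TIME element strictly above z
pmin(idx, length(TIME))              # clamp the out-of-range z = 18, 19 to the last index
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12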

Get z-standardized scores within each group

Here is the data.
set.seed(23)
data <- data.frame(ID = rep(1:12), group = rep(1:3, times = 4), value = rnorm(12, mean = 0.5, sd = 0.3))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I try:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, this looks wrong, because I want the standardized scores added as a new variable alongside value. How can I do that? Kindly provide some suggestions (sample code). Any help is greatly appreciated.
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split or lapply.
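The reason no re-merging is needed: ave() applies FUN within each group and returns the results already aligned with the original row order. A tiny base-R illustration with made-up numbers:

ave(c(1, 2, 10, 20), c("a", "b", "a", "b"), FUN = function(x) x - mean(x))
# [1] -4.5 -9.0  4.5  9.0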
One way using the data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
the result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
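A more compact variant of the same computation updates data by reference with := instead of building dataNew, keeping the original column order; a sketch:

data[, stan := stdz(value), by = group]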
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the base R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
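Since the question asks for the new variable next to value, a grouped mutate is another option; a sketch using base scale() for the z-scores, as above:

library(dplyr)
data %>%
  group_by(group) %>%
  mutate(value.s = as.numeric(scale(value))) %>%
  ungroup()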
Add the new column while in your function, and have the function return the whole data frame.
stanL <- lapply(data_split, function(x) {
  x$stan <- stdz(x$value)
  x
})
stan <- do.call(rbind, stanL)

Excel OFFSET function in R

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value, but I would like to return a range: a group of values with an offset of 1 and a group size of 2. For example, on row 4 I would like a group with the values of column a from rows 3 and 2. Sorry, but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types (unless you want to resort to complex numbers). head with a negative argument drops that many elements from the end of the vector; try head(1:10, -2). rep is repetition, c is concatenation, and the <- assignment adds a new column if it's not there yet.
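To see the shift concretely (just echoing the intermediate pieces):

head(df$a, -1)         # 1 2 3 4 5 6 7 8 9  (drop the last element)
c(NA, head(df$a, -1))  # NA 1 2 3 4 5 6 7 8 9  (the lag-1 column b1)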
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
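Taking it apart piece by piece (padded and m are illustrative names, not part of the original answer):

padded <- c(NA, NA, df$a)  # pad so the first rows have nothing to look back at
m <- embed(padded, 3)      # row i is c(padded[i+2], padded[i+1], padded[i])
m[, c(3, 2)]               # column 3 holds the lag-2 values, column 2 the lag-1 values

And for the asker's follow-up about summing or averaging the group, something like rowSums(m[, c(3, 2)], na.rm = TRUE), or the corresponding rowMeans call, works on either construction.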
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Is there any way to bind data to data.frame by some index?

#For say, I got a situation like this
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
#And I've written a function to calculate intervals
interval <- function(data) {
  interval = c(Inf)
  for (i in seq(1, length(data$time))) {
    intv = data$time[i] - data$time[i-1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return(data)
}
#But when I want to get intervals by user_id and bind them to the data.frame,
#I can't find a proper way
#Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
To that we can prepend Inf to the differences via c(Inf, diff(x)).
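For instance:

c(Inf, diff(c(1, 3, 6, 10)))
# [1] Inf   2   3   4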
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame whose time component is itself a matrix, with one row per user_id. We need to convert that matrix to a vector, relying on the fact that R fills matrices column first; the unpacked vector lines up with the original rows only because visit_log cycles through the users in the same repeating order. Finally, we add an interval column to the input data, as in your original version of the function.
interval <- function(x) {
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  diffs <- as.numeric(diffs$time)
  x <- within(x, interval <- diffs)
  x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
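For comparison, ave() sidesteps the matrix-unpacking subtlety entirely, because it returns its per-group results already aligned with the input rows; a one-line base-R sketch:

visit_log$interval <- ave(visit_log$time, visit_log$user_id,
                          FUN = function(t) c(Inf, diff(t)))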
