Cumulative sum with a threshold window in R data.table

I want to calculate the rolling sum of the last n rows in my dataset, where the window size n depends on the sum itself: the window should slide as soon as the rolling sum of time exceeds 5 minutes. Basically, I want to calculate how much distance the person traveled in the last 5 minutes, but the time steps are not equally spaced. I am looking for a data.table solution in R. Here's a dummy data.table for clarity (the last two columns are the ones I need computed):
Input data table:

ID  Distance  Time
 1         2     2
 1         4     1
 1         2     1
 1         2     2
 1         3     3
 1         6     3
 1         1     1
Desired Output:

ID  Distance  Time  5.min.rolling.distance  5.min.rolling.time
 1         2     2                      NA                  NA
 1         4     1                      NA                  NA
 1         2     1                      NA                  NA
 1         2     2                      10                   6
 1         3     3                       5                   5
 1         6     3                       9                   6
 1         1     1                      10                   7

Here is a solution that works with double time units as well as a simpler solution that will work with integer time units. I tested the double solution on 10,000 records and on my 2015 laptop it executed instantly. I can't make any guarantees about performance on 40 GB of data.
If you wanted to generalize this code, I'd look at the RcppRoll package and learn how to implement C++ code in R.
Solution with double time units
I broke this down into two problems. First, figure out the window size by looking back until we reach at least 5 minutes (or run out of data). Second, take the sum of distances and times from the current observation back to that look-back position.
Bad loop code in R usually tries to 'grow' a vector; it's a huge efficiency gain to pre-allocate the vector to its full length and then assign into its elements.
input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
var_window_cumsum <- function(input, MIN_TIME) {
  if (is.null(input$time) || is.null(input$dist)) {
    stop("input must have variables time and dist that record the row's duration and distance traveled.")
  }
  n <- nrow(input)
  # First, figure out how far back we need to look. This vector stores the
  # position of the first record that gets our target record up to MIN_TIME
  # or more. If we can't look back that far, we leave it as NA.
  time_indx <- rep(NA_integer_, length.out = n) # always preallocate your vector!
  for (time in 1:n) {
    prior <- time # start at self in case the observation is already >= MIN_TIME
    while (sum(input$time[time:prior]) < MIN_TIME && prior > 1) {
      prior <- prior - 1
    }
    # if we can't look back to our minimum time, leave the indx as NA
    if (sum(input$time[time:prior]) >= MIN_TIME) {
      time_indx[time] <- prior
    }
  }
  # Now that we know how far to look back, it's easy to find the total
  # distance and total time. (NA_real_, not NA_integer_: the sums can be
  # fractional, as in the 10k-row test below.)
  dist5 <- rep(NA_real_, n)
  time5 <- rep(NA_real_, n)
  for (i in 1:n) {
    if (!is.na(time_indx[i])) {
      dist5[i] <- sum(input$dist[i:time_indx[i]])
      time5[i] <- sum(input$time[i:time_indx[i]])
    }
  }
  cbind(input,
        window_dist = dist5,
        window_time = time5,
        window_start = time_indx)
}
# output looks good
# Warning: example data does not include exhaustive cases
# I have not set up thorough testing
var_window_cumsum(input, 5)
# Test on a larger dataset, 10k records
set.seed(1234)
n <- 10000
med_input <- data.frame(
  dist = sample(1:5, n, replace = TRUE),
  time = sample(1:60, n, replace = TRUE) / 10
)
# you should inspect this to make sure there are no errors
med_output <- var_window_cumsum(med_input, 5)
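Since the question asks for data.table specifically, here is a vectorized sketch of the same look-back logic using cumulative sums and findInterval. This is my own translation rather than the code above, checked only against the example data, and it assumes every time value is positive (so cumulative time is strictly increasing within an ID):
library(data.table)

dt <- data.table(ID = 1,
                 dist = c(2, 4, 2, 2, 3, 6, 1),
                 time = c(2, 1, 1, 2, 3, 3, 1))

# ct/cd: cumulative time/distance including the current row;
# ct0/cd0: cumulative time/distance just before the current row
dt[, `:=`(ct = cumsum(time), cd = cumsum(dist)), by = ID]
dt[, `:=`(ct0 = ct - time, cd0 = cd - dist)]
# window start = last row j in the group with ct0[j] <= ct[i] - 5;
# findInterval returns 0 when there is not yet 5 minutes of history
dt[, start := findInterval(ct - 5, ct0), by = ID]
dt[, `:=`(
  roll.time = fifelse(start > 0, ct - ct0[pmax(start, 1L)], NA_real_),
  roll.dist = fifelse(start > 0, cd - cd0[pmax(start, 1L)], NA_real_)
), by = ID]
On the example data, roll.dist and roll.time come out as NA, NA, NA, 10, 5, 9, 10 and NA, NA, NA, 6, 5, 6, 7, matching the desired output columns.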
Solution with integer time units
If your time unit is in integers and your data isn't too big, it may work to complete your dataset. This is a little bit of a hack, but here I create a continuous timeid variable that goes from the starting time to the maximum time, and create one row for each integer unit of time. From there it's easy to calculate a rolling cumulative sum for the last five time units. Finally, we get rid of all the fake rows we added (you want to make sure to do that, because they carry invalid cumulative sums). Also, it's important to note that I use roll_sumr and not roll_sum; roll_sumr pads the left side of the output vector with NAs (4 of them for a window of 5), so the result stays aligned with the input rows.
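A quick illustration of the difference with a window of 3:
library(RcppRoll)
roll_sum(1:6, n = 3)   # 6 9 12 15         (length 4, no padding)
roll_sumr(1:6, n = 3)  # NA NA 6 9 12 15   (length 6, left-padded)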
library(tidyverse)
library(RcppRoll)
input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
desired_dist5 <- c(NA, NA, NA, 10, 5, 9, 10)
desired_time5 <- c(NA, NA, NA, 6, 5, 6, 7)
output <- input %>%
  mutate(timeid = cumsum(time),
         realrow = TRUE) %>%
  complete(timeid = 1:max(timeid)) %>%
  mutate(dist5 = roll_sumr(dist, 5, na.rm = TRUE),
         time5 = roll_sumr(time, 5, na.rm = TRUE)) %>%
  filter(realrow) %>%
  select(-c(realrow, timeid))
# Check against example table
output$dist5 == desired_dist5
output$time5 == desired_time5
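Keep in mind that == returns NA wherever both sides are NA, so the two checks above print NA for the first three elements even though they match. A stricter single check (assuming both vectors are numeric):
isTRUE(all.equal(output$dist5, desired_dist5))
isTRUE(all.equal(output$time5, desired_time5))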

Related

Trying to create a rolling period cummax

I am trying to create a function that buys an N period high. So if I have a vector:
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
I want to take the rolling 3 period high. This is how I would like the function to look
x = c(1, 2, 3, 4, 5, 5, 5, 3, 4, 5)
I am trying to do this on an xts object.
Here is what I tried:
rollapplyr(SPY$SPY.Adjusted, width = 40, FUN = cummax)
rollapply(SPY$SPY.Adjusted, width = 40, FUN = "cummax")
rapply(SPY$SPY.Adjusted, width = 40, FUN = cummax)
The error I am receiving is:
Error in `dimnames<-.xts`(`*tmp*`, value = dn) :
length of 'dimnames' [2] not equal to array extent
Thanks in advance
You're close. Realize that rollapply (et al.) in this case expects a single number back, but cummax returns a vector. Let's trace through this:
When using rollapply(..., partial=TRUE), the first pass is just the first number: 1
Second call, the first two numbers. You are expecting 2 (so that it appends to the previous step's 1), but look at cummax(1:2): it has length 2. Conclusion from this step: the cum* functions are cumulative by design; they always consider everything up to and including the current number when they perform their logic/transformation.
Third call, our first visit to a full window (in this case): considering 1 2 3, we want 3. max works.
So I think you want this:
zoo::rollapplyr(x, width = 3, FUN = max, partial = TRUE)
# [1] 1 2 3 4 5 5 5 3 4 5
partial allows us to look at 1 and 1-2 before moving on to the first full window of 1-3. From the help page:
partial: logical or numeric. If 'FALSE' (default) then 'FUN' is only
applied when all indexes of the rolling window are within the
observed time range. If 'TRUE', then the subset of indexes
that are in range are passed to 'FUN'. A numeric argument to
'partial' can be used to determine the minimal window size for
partial computations. See below for more details.
Perhaps it is helpful -- if not perfectly accurate -- to think of cummax as equivalent to
rollapplyr(x, width = length(x), FUN = max, partial = TRUE)
# [1] 1 2 3 4 5 5 5 5 5 5
cummax(x)
# [1] 1 2 3 4 5 5 5 5 5 5

xts - keep calculation of columns intact, even though some columns might be missing

I am setting up a "decision map" in xts. The code below generates the following:

           dec.1 dec.2 dec.3 dec.4 Master.dec
2017-01-01     2     2     2     2          2
2017-02-02     3     3     3     3          3
2017-03-03     0     0     0     0          0
There will always be at least one of these columns (dec.1 through dec.4), but it is unknown in advance whether 1, 2, 3, or 4 of them are present.
In the original solution the decision columns will be spread out through the xts sheet, so I will not be able to use column numbers as identifiers.
Question:
In column "Master.dec" I calculate based on the decision columns (dec.1 - dec.4), of which sometimes only some will be present. Is there a way to keep the calculation done in "Master.dec" intact and working even if 1 to 3 of the decision columns are not present?
To reproduce the problem I encounter:
1) run the complete script
2) delete one column: xts1$dec.1 <- NULL
3) run only the section 2 of the script (2.add a rules system)
...you will get the error:
Error in NextMethod(.Generic) :
dims [product 3] do not match the length of object [0]
Note! The solution should tolerate 1-3 of the decision columns being removed; there will always be at least one decision column, but it is unknown which ones.
# dependent libraries
library(matrixStats)
library(xts)
#############################################
# 1. Create the xts from a data.frame base
#############################################
# creates a dataframe
df1 <- data.frame(date = c("2017-01-01", "2017-02-02", "2017-03-03"),
                  other.1 = c(1998, 1999, 2000),
                  dec.1 = c(2, 3, 0),
                  other.2 = c(58, 54, 32),
                  other.3 = c(12, 3, 27),
                  dec.2 = c(2, 3, 0),
                  dec.3 = c(2, 3, 0),
                  other.4 = c(2, 5, 27),
                  dec.4 = c(2, 3, 0))
# transforms the column date to date-format
df1 = transform(df1,date=as.Date(as.character(date),format='%Y-%m-%d'))
# creates the xts, based on the dataframe df1
xts1 <- xts(df1[,-1],order.by = df1$date)
#############################################
# 2.Add a rule system:
# if all "dec"-columns are 2, add value 2 in master.dec
# if all "dec"-columns are 3, add value 3 in master.dec
# if all "dec"-columns are 0, (or any other combination then above) add value 0 in master.dec
#############################################
xts1$m.dec <- ifelse(rowSds(xts1)==0,rowMins(xts1),0)
Since an xts object is essentially just an indexed matrix, you could try calculating the row-wise standard deviation. If the result is 0 (i.e. all values are the same), then you assign the row minimum (or maximum, whichever you prefer) to your new column, otherwise 0.
An efficient and concise solution can be found using the matrixStats package:
library(matrixStats)
xts1$m.dec <- ifelse(rowSds(xts1)==0,rowMins(xts1),0)
#            dec.1 dec.2 dec.3 dec.4 m.dec
# 2017-01-01     2     2     2     2     2
# 2017-02-02     3     3     3     3     3
# 2017-03-03     0     0     0     0     0
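Note that rowSds(xts1) runs over every column of xts1, which only matches the printed output if the xts contains just the dec columns. To make the rule robust to the question's real setup (other.* columns present, and 1-3 dec columns possibly deleted), one option is to select the surviving decision columns by name first. A sketch of that idea (my own addition, not tested against the full script; rowSds returns NA for a single column because the sd of one value is undefined, so that case is handled separately):
library(xts)
library(matrixStats)

# pick out whichever dec.* columns survive, by name rather than position
dec_cols <- grep("^dec\\.", colnames(xts1), value = TRUE)
dec_mat  <- coredata(xts1[, dec_cols, drop = FALSE])

if (length(dec_cols) == 1L) {
  # with a single decision column every row trivially "agrees" with itself
  xts1$m.dec <- dec_mat[, 1]
} else {
  xts1$m.dec <- ifelse(rowSds(dec_mat) == 0, rowMins(dec_mat), 0)
}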

average previous time series in R

I'm wondering if there is an easy way to average over the previous 30 seconds of data in R when there may be more than one data point per second.
For instance, for the sample weight taken at 32 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 9, 10, 7, ..., 14, 20, 18, 2. For the sample weight taken at 31 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 5, 9, 10, 7, ..., 14, 20, 18. It's technically not a rolling average over the 30 previous measurements, because there can be more than one measurement per second.
I'd like to do this in R.
1) sqldf Using DF below and a 3-second window, join the last 3 seconds of data to each row of DF and then take the mean over them:
DF <- data.frame(time = c(1, 2, 2, 3, 4, 5, 6, 7, 8, 10), data = 1:10)
library(sqldf)
sqldf("select a.*, avg(b.data) mean
from DF a join DF b on b.time between a.time - 3 and a.time
group by a.rowid")
giving:
   time data mean
1     1    1  1.0
2     2    2  2.0
3     2    3  2.0
4     3    4  2.5
5     4    5  3.0
6     5    6  4.0
7     6    7  5.5
8     7    8  6.5
9     8    9  7.5
10   10   10  9.0
The first mean value is the mean(1) which is 1, the second and third mean values are mean(1:3) which is 2, the fourth mean value is mean(1:4) which is 2.5, the fifth mean value is mean(1:5) which is 3, the sixth mean value is mean(2:6) which is 4, the seventh mean value is mean(3:7) which is 5 and so on.
2) This 2nd solution uses no packages. For each row of DF it finds the rows within 3 seconds back and takes the mean of their data:
Mean3 <- function(i) with(DF, mean(data[time <= time[i] & time >= time[i] - 3]))
cbind(DF, mean = sapply(1:nrow(DF), Mean3))
The rollapply function should do the trick.
library(zoo)
rollapply(weight.vector, 30, mean)
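Note that a fixed width of 30 rows only equals 30 seconds if there is exactly one reading per second, which the question rules out. A sketch of one way to adapt this using zoo's support for a vector of per-position widths (my own variation; assumes a data frame df sorted by time, and ties at identical timestamps may need extra care):
library(zoo)
# for each row, count how many rows fall inside (time - 30, time]
w <- sapply(seq_along(df$time), function(i)
  sum(df$time > df$time[i] - 30 & df$time <= df$time[i]))
# rollapplyr accepts one width per window end
df$mean30 <- rollapplyr(df$concentration, width = w, FUN = mean)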
You can do (assuming your data is stored in a dataframe called df):
now <- 32
step <- 30
subsetData <- subset(df, time >= (now-step) & time < now)
average <- mean(subsetData$concentration)
And if you want to calculate the mean at more time points, you can put this in a loop, adjusting now at each iteration.
My first idea would be to summarise the data so the value column would contain a list of all values.
test.data <- data.frame(t = 1:50 + rbinom(50, 30, 0.3), y=rnorm(50)) %>% arrange(t)
prep <- test.data %>% group_by(t) %>% summarise(vals = list(y))
wrk <- left_join(data.frame(t=1:max(test.data$t)), prep, by='t')
Unfortunately zoo's rollapply would not work on such a data.frame.
For testing I was thinking to only use a window of 5 lines.
I tried commands along the lines of rollapply(wrk, 5, function(z) mean(unlist(z))), but maybe someone else can fill in the missing bit of information.
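One possible way to fill in that missing bit is to index the list-column directly instead of going through zoo (a sketch; it assumes the unmatched rows of wrk hold NULL, which unlist() simply drops):
wrk$mean5 <- sapply(seq_len(nrow(wrk)), function(i) {
  lo <- max(1, i - 4)           # the 5 most recent time units
  mean(unlist(wrk$vals[lo:i]))  # windows with no observations yield NaN
})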
This is sufficiently different that it warrants another answer.
This should do what you're asking with no extra libraries needed.
It just loops through each row, filters based on that row's time, and computes the mean.
Don't fear a simple loop :)
count = 200          # dataset rows
windowTimespan = 30  # timespan of window
# first lets make some data
df = data.frame(
  # 200 random numbers from 0-99
  time = sort(floor(runif(count) * 100)),
  concentration = runif(count),
  weight = runif(count)
)
# add placeholder column(s)
df$rollingMeanWeight = NA
df$rollingMeanConcentration = NA
# for each row
for (r in 1:nrow(df)) {
  # get the time in this row
  thisTime = df$time[r]
  # find all the rows within the acceptable timespan
  # note: figure out if you want < vs <=
  thisSubset = df[
    df$time < thisTime &
    df$time >= thisTime - windowTimespan
  , ]
  # get the mean of the subset
  df$rollingMeanWeight[r] = mean(thisSubset$weight)
  df$rollingMeanConcentration[r] = mean(thisSubset$concentration)
}

How to add a column that gives the result of an operation on every row relative to current row?

I have a data frame with a group of x and y points. I need to calculate the euclidean distance of every point relative to every other point. Then I have to figure, for each row, how many are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that signals how many points (if we consider these points to be in a cartesian plane) are within a distance of 3 units from every other point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or another library.
This is what I have:
ddply(.data = mydataframe, .variables = c('x', 'y'), .fun = count.in.range)
count.in.range <- function(df) {
  xp <- df$x
  yp <- df$y
  return(nrow(filter(df, dist(rbind(c(x, y), c(xp, yp))) < 3)))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given

df_ <- data.frame(x = c(1, 2, 9),
                  y = c(2, 2, 9))

you can use the function dist:

matrix_dist <- as.matrix(dist(df_))
# the diagonal is 0 (each point's distance to itself),
# so subtract 1 to exclude the point from its own count
df_$n <- rowSums(matrix_dist <= 3) - 1
This is a base approach with a straightforward application of a "distance function", but only on a row-by-row basis:
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2+(x[2]-df_[['y']])^2 <=9 )-1 )
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
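For what it's worth, a fully vectorized version of the same idea using outer() (a sketch; it materializes the full n-by-n squared-distance matrix, so memory grows quadratically with the number of points):
# squared distances between every pair of points
d2 <- outer(df_$x, df_$x, "-")^2 + outer(df_$y, df_$y, "-")^2
# count neighbours within distance 3, excluding the point itself
df_$n <- rowSums(d2 <= 9) - 1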
I would suggest you work with pairs of points in long format and then use a data.table solution, which is probably one of the fastest alternatives for working with large datasets.
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
                 y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df, df)
# convert to data.table and rename columns
setDT(df_long)
setnames(df_long, c("x", "y", "x1", "y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[, mydist := dist(matrix(c(x, x1, y, y1), ncol = 2, nrow = 2)), by = .(x, y, x1, y1)]
# count how many points are within a distance of 3 units
# (note: each point is paired with itself too, so these counts are
# one higher than the n column in the question)
df_long[mydist < 3, .(count = .N), by = .(x, y)]
#> x y count
#> 1: 1 2 2
#> 2: 2 2 2
#> 3: 9 9 1

Permutations from columns of a data frame in R with specific conditions

This may be a rather complex question, so if someone can at least point me in the right direction, I can probably figure out the rest on my own.
Sample data:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))

   A  B C  D
1  1  6 5  5
2  4  5 3 NA
3  5 NA 1  3
4  3  5 5 10
5 NA  3 3  4
6  5  5 7  5
I would like to find all possible permutations of letter sequences of different lengths from the table shown above. For example, one valid letter sequence might be: A C A D D B. Another valid sequence could be B C C.
However, there are a few exceptions to this I'd like to follow:
1. Must be able to specify the minimum length of the returned sequence.
Note that in my example above, the min sequence length was 3 and the max sequence length was equal to the number of rows. I would like to be able to specify the min value (the max value will always be equal to the number of rows, 6 in the case of the sample data).
Note that if the sequence length is shorter than 6, it cannot be generated by skipping rows. In other words, any short sequence must come from consecutive rows. Clarification based on comments: short sequences do not have to start on row 1. A short sequence could start on row 3 and continue through consecutive rows to row 6.
2. Letters with an NA value are not available for sampling.
Note that in row 2 there is an NA in the D column. This means that D would not be available for sampling in row 2. So A B D would be a valid combination but A D D would not be valid.
3. The sequences must be ranked based on the values in each cell.
Notice how each cell has a specific value in it. Each sequence can be ranked by summing the values shown in the table for the chosen letters. Using the example from above, A C A D D B would have a rank of 1+3+5+10+4+5 = 28. So when generating all possible sequences, they should be ordered from highest rank to lowest rank.
I would like to apply all three of these rules to the data table listed above to find all combinations of sequences possible of minimum length 3 and maximum length 6.
Please let me know if I need to clarify anything!
In principle, you want to do this using expand.grid, I believe. Using your example data, I worked out the basics here:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5),
                  B = c(6, 5, NA, 5, 3, 5),
                  C = c(5, 3, 1, 5, 3, 7),
                  D = c(5, NA, 3, 10, 4, 5))
dat[,1][!is.na(dat[,1])] <- paste("A", na.omit(dat[,1]), sep = "-")
dat[,2][!is.na(dat[,2])] <- paste("B", na.omit(dat[,2]), sep = "-")
dat[,3][!is.na(dat[,3])] <- paste("C", na.omit(dat[,3]), sep = "-")
dat[,4][!is.na(dat[,4])] <- paste("D", na.omit(dat[,4]), sep = "-")
transp_data <- as.data.frame(t(dat))
data_list <- list(V1 = as.vector(na.omit(transp_data$V1)),
                  V2 = as.vector(na.omit(transp_data$V2)),
                  V3 = as.vector(na.omit(transp_data$V3)),
                  V4 = as.vector(na.omit(transp_data$V4)),
                  V5 = as.vector(na.omit(transp_data$V5)),
                  V6 = as.vector(na.omit(transp_data$V6)))
This code lets you essentially transform your data frame into a list of vectors of different lengths (one element for each variable in your original data, but omitting NAs and such). The reason you would want to do this is because it makes finding the acceptable combinations trivially easy by using the expand.grid function.
To solve for the six, you would simply use:
grid_6 <- do.call(what = expand.grid, args = data_list)
This gives you every length-six permutation that meets your criteria (the NA cells were already dropped when building data_list). You can extract the numeric data back out using some regular expressions (not a very vectorized way of doing it, but this is a complex thing that I don't have time to fully put into a function):
grid_6_letters <- grid_6
for (x in 1:ncol(grid_6_letters)) {
  for (y in 1:nrow(grid_6_letters)) {
    grid_6_letters[y, x] <- gsub(pattern = "-[0-9]*", replacement = "", x = grid_6_letters[y, x])
  }
}
grid_6_numbers <- grid_6
for (x in 1:ncol(grid_6_numbers)) {
  for (y in 1:nrow(grid_6_numbers)) {
    grid_6_numbers[y, x] <- gsub(pattern = "^[ABCD]-", replacement = "", x = grid_6_numbers[y, x])
  }
  grid_6_numbers[[x]] <- as.numeric(grid_6_numbers[[x]])
}
grid_6_letters$Total <- rowSums(grid_6_numbers)
grid_6_letters <- grid_6_letters[order(grid_6_letters$Total, decreasing = TRUE), ]
Anyway, if you wanted to get the various lower-level combinations, you could do it by simply using expand.grid on subsets of the list and combining them with rbind (with some judicious use of setNames as needed). Example:
grid_3 <- rbind(
  setNames(do.call(what = expand.grid, args = list(data_list[1:3], stringsAsFactors = FALSE)), nm = c("V1", "V2", "V3")),
  setNames(do.call(what = expand.grid, args = list(data_list[2:4], stringsAsFactors = FALSE)), nm = c("V1", "V2", "V3")),
  setNames(do.call(what = expand.grid, args = list(data_list[3:5], stringsAsFactors = FALSE)), nm = c("V1", "V2", "V3")),
  setNames(do.call(what = expand.grid, args = list(data_list[4:6], stringsAsFactors = FALSE)), nm = c("V1", "V2", "V3"))
)
Anyway, with some time and programming, you can likely wrap this into a function that is much better than my example, but hopefully it will get you started.
Sorry, I don't do any R anymore, so I'll try to help with some dirty code...
addPointsToSequence <- function(seq0, currRow) {
  for (i in 1:4) {  # 4 is the number of columns
    seq2 <- seq0
    if (!is.na(dat[currRow, i])) {
      # add the point at the end of seq2
      # (here I add the value, but you may prefer adding colnames(dat)[i]
      # and accumulating the value in another variable to rank the sequence)
      seq2 <- cbind(seq2, dat[currRow, i])
      if (length(seq2) >= 3) {
        # save seq2 as a valid sequence wherever you need to
        print(seq2)
      }
      if (currRow < 6) {  # 6 is the number of rows in dat (use nrow?)
        addPointsToSequence(seq2, currRow + 1)
      }
    }
  }
}
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))
for (startingRow in 1:4) {
  # 4 is the last row you can start from and still get a length-3 sequence
  emptySequence <- NULL
  addPointsToSequence(emptySequence, startingRow)  # was i, which is undefined here
}
