R: find consecutive occurrence of a number

First, define helper functions to bind a list of vectors row-wise and column-wise:
# a function to append vectors row wise
rbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- rbind(res, list[[i]])
return(res)
}
cbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- cbind(res, list[[i]])
return(res)
}
# generate sample data
sample.dat <- list()
set.seed(123)
for(i in 1:365){
vec1 <- sample(c(0,1), replace=TRUE, size=5)
sample.dat[[i]] <- vec1
}
dat <- rbindlist(sample.dat)
dat has five columns. Each column is a location and has 365 days of the year (365 rows) with values 1 or 0.
I have another dataframe (see below) which has certain days of the year for each column (location) in dat.
# generate second sample data
set.seed(123)
sample.dat1 <- list()
for(i in 1:5){
vec1 <- sort(sample(c(258:365), replace=TRUE, size=4), decreasing = F)
sample.dat1[[i]] <- vec1
}
dat1 <- cbindlist(sample.dat1)
I need to use dat1 to subset days in dat to do a calculation. An example below:
1) For location 1 (first column in both dat1 and dat):
In column 1 of dat, select the days from 289 till 302 (using dat1), find the longest consecutive occurrence of 1.
Repeat it and this time select the days from 303 (302 + 1) till 343 from dat, find the longest consecutive occurrence of 1.
Repeat it for 343 till 353: select the days from 344 (343 + 1) till 353, find the longest consecutive occurrence of 1.
2) Do this for all the columns
If I want to do sum of 1s, I can do this:
library(tibble) # as_tibble(); as.tibble() is deprecated
library(purrr) # pmap()
dat <- as_tibble(dat)
dat1 <- as_tibble(dat1)
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
sum.range1 <- sum(..1[range1:range2]) # this will generate sum between range 1 and range 2
sum.range2 <- sum(..1[range2:range3]) # this will generate sum between range 2 and range 3
sum.range3 <- sum(..1[range3:range4]) # this will generate sum between range 3 and range 4
c(sum.range1=sum.range1,sum.range2=sum.range2,sum.range3=sum.range3)
})
For the longest consecutive occurrence of 1 within each range, I thought of using the rle function. Example below:
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
spell.range1 <- rle(..1[range1:range2]) # run-length encode the range; this holds the runs of BOTH values (0 and 1)
spell.1.range1 <- tapply(spell.range1$lengths, spell.range1$values, max)[2] # this should select the maximum consecutive run of 1
spell.range2 <- rle(..1[range2:range3]) # run-length encode the range; this holds the runs of BOTH values (0 and 1)
spell.1.range2 <- tapply(spell.range2$lengths, spell.range2$values, max)[2] # this should select the maximum consecutive run of 1
spell.range3 <- rle(..1[range3:range4]) # run-length encode the range; this holds the runs of BOTH values (0 and 1)
spell.1.range3 <- tapply(spell.range3$lengths, spell.range3$values, max)[2] # this should select the maximum consecutive run of 1
c(spell.1.range1 = spell.1.range1, spell.1.range2 = spell.1.range2, spell.1.range3 = spell.1.range3)
})
I get an error, which I think is because I am not using the rle function properly here. I would really like to keep the code as above, since
my other code follows the same pattern and the format of the output suits my needs, so I would appreciate it if someone could suggest how to fix it.

OP's code does work for me, so without a specific error message it is impossible to tell why it is not working for the OP.
However, the sample datasets created by the OP are matrices (before they were coerced to tibble), and I felt challenged to solve the task in base R without using purrr.
To find the length of the longest run of consecutive occurrences of a particular value val in a vector x, we can use the following function:
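max_rle <- function(x, val) {
# run-length encode x, then keep only the runs whose value equals val
y <- rle(x)
len <- y$lengths[y$values == val]
# NA signals that val does not occur in x at all
if (length(len) > 0) max(len) else NA
}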
Examples:
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 1)
[1] 4
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 0)
[1] 2
# find consecutive occurrences in column batches
lapply(seq_len(ncol(dat1)), function(col_num) {
start <- head(dat1[, col_num], -1L)
end <- tail(dat1[, col_num], -1L) - 1
sapply(seq_along(start), function(range_num) {
max_rle(dat[start[range_num]:end[range_num], col_num], 1)
})
})
[[1]]
[1] 8 4 5
[[2]]
[1] 4 5 2
[[3]]
[1] NA 3 4
[[4]]
[1] 5 5 4
[[5]]
[1] 3 2 3
The outer lapply() loops over the columns of dat and dat1. The inner sapply() loops over the row ranges stored in dat1 and subsets dat accordingly. Each range runs from one day listed in dat1 up to the day before the next listed day.
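The question does not say which error occurred, so the following is only a guess at one way the tapply(...)[2] indexing can misbehave: when a range contains only one of the two values, the tapply() result has a single element and position 2 does not exist. A minimal sketch with made-up vectors (only_ones and mixed are hypothetical, not taken from dat):

```r
# Hypothetical ranges, not taken from the question's data
only_ones <- c(1, 1, 1)
mixed <- c(0, 1, 1, 0)

r1 <- rle(only_ones)
max_by_value <- tapply(r1$lengths, r1$values, max)  # one group only, named "1"

by_pos <- max_by_value[2]     # position 2 does not exist, so this is NA
by_name <- max_by_value["1"]  # lookup by name finds the run of 1s

r2 <- rle(mixed)
mixed_by_name <- tapply(r2$lengths, r2$values, max)["1"]
```

Looking the result up by name, or filtering the runs by value as max_rle() does, avoids relying on the position of the group.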

Calculate euclidean distance with R

I have data where rows are points and columns are coordinates x,y,z.
I'd like to calculate the Euclidean distance between points in pairs, such as 3-4, 11-12, 18-19 and so on. For example, I don't need the distance between 3 and 11, 12 or 18.
The problem is that I have to analyze 1074 tables with 1000 rows or more, so I'm looking for a way to do it automatically, maybe using the fact that I want the distance between each odd-numbered row and the even-numbered row that follows it. I don't care too much about the output format, but please consider that afterwards I have to select only distances < 3.2, so a data frame would be great.
THANK YOU! :*
How about something like this:
First, I'll make some fake data
set.seed(4304)
df <- data.frame(
x = runif(1000, -1, 1),
y = runif(1000, -1, 1),
z = runif(1000, -1,1)
)
Make a sequence of values from 1 to the number of rows of your dataset by 2s.
s <- seq(1, nrow(df), by=2)
Use sapply() to compute the distance between each pair of points.
out <- sapply(s, function(i){
sqrt(sum((df[i,] - df[(i+1), ])^2))
})
Organize the distances into a data frame
res <- data.frame(
pair = paste(rownames(df)[s], rownames(df)[(s+1)], sep="-"),
dist=out)
head(res)
# pair dist
# 1 1-2 1.379992
# 2 3-4 1.303511
# 3 5-6 1.242302
# 4 7-8 1.257228
# 5 9-10 1.107484
# 6 11-12 1.392247
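The same pairwise distances can probably be computed without an explicit loop, by subtracting the even rows from the odd rows in one step and summing the squares per row. This is only a sketch: pts is hypothetical data, and an even number of rows is assumed so that every odd row has a partner.

```r
set.seed(1)
# Hypothetical points; an even number of rows is assumed
pts <- data.frame(x = runif(10), y = runif(10), z = runif(10))

odd <- seq(1, nrow(pts), by = 2)

# Vectorised: difference of odd and following even rows, then row-wise norm
d_vec <- sqrt(rowSums((pts[odd, ] - pts[odd + 1, ])^2))

# Loop version, as in the answer above, for comparison
d_loop <- sapply(odd, function(i) sqrt(sum((pts[i, ] - pts[i + 1, ])^2)))

res <- data.frame(pair = paste(odd, odd + 1, sep = "-"), dist = unname(d_vec))

# Keep only the pairs closer than the threshold mentioned in the question
close_pairs <- res[res$dist < 3.2, ]
```

Because the result is already a data frame, the filter on dist < 3.2 is a single subsetting step.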
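Here is a function that can be applied to a data.frame holding the data (a matrix would need to be converted with as.data.frame() first, because split() only splits a data.frame by rows).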
DistEucl <- function(X){
i <- cumsum(seq_len(nrow(X)) %% 2 == 1)
sapply(split(X, i), function(Y){
sqrt(sum((Y[1, ] - Y[2, ])^2))
})
}
DistEucl(df1)
# 1 2 3 4
#1.229293 1.234273 1.245567 1.195319
With the data in DaveArmstrong's answer, the results are the same except for a names attribute in the above function's return value.
out2 <- DistEucl(df)
all.equal(out, out2)
#[1] "names for current but not for target"
identical(out, unname(out2))
#[1] TRUE
Data in the question
x <- c(13.457, 13.723, 15.319, 15.713, 18.446, 19.488, 19.762, 19.743)
y <- c(28.513, 29.656, 28.510, 27.342, 28.827, 28.24, 29.841, 30.942)
z <- c(40.513, 40.147, 43.281, 43.218, 43.095, 43.443, 40.094, 40.559)
df1 <- data.frame(x, y, z)

Find combination of n vectors across k dataframes with highest correlation

Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find the n vectors in each data frame that have the highest correlation with each other. For this simple example, let's say I want to find the n = 1 vector in each of the k = 4 data frames that shows the overall strongest positive correlation, as measured by cor().
I'm not interested in the correlation of vectors within a data frame, but in the correlation between data frames, since I wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...), where ... stands for the data frames, which would return a list of the names of the highest-correlating vectors.
Based on your example, I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach that I think gets what you want.
Based on your 4 data frames, I create list_df; then, in the function, I generate all the possible combinations of variables and calculate their correlation. At the end I select the n combinations with the highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
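A note on the score used above: sum(cor(...)) also adds the diagonal (all 1s) and counts every pair twice, so for k selected columns it equals k plus twice the sum of the pairwise correlations. The ranking of combinations should therefore be the same either way. A small sketch of that relationship, using hypothetical random sets rather than setA to setD:

```r
set.seed(1)
# Hypothetical stand-ins for the question's sets (3 sets of 2 columns each)
sets <- list(
  A = data.frame(a1 = rnorm(10), a2 = rnorm(10)),
  B = data.frame(b1 = rnorm(10), b2 = rnorm(10)),
  C = data.frame(c1 = rnorm(10), c2 = rnorm(10))
)
k <- length(sets)
all_cols <- do.call(cbind, unname(sets))

# One row per way of picking one column from each set
grid <- expand.grid(lapply(sets, names), stringsAsFactors = FALSE)
picks <- lapply(seq_len(nrow(grid)),
                function(i) unlist(grid[i, ], use.names = FALSE))

# Score used in the answer: sum over the whole correlation matrix
full_sum <- function(cols) sum(cor(all_cols[, cols]))

# Alternative: each between-set pair counted once, diagonal excluded
pair_sum <- function(cols) {
  m <- cor(all_cols[, cols])
  sum(m[upper.tri(m)])
}

full <- sapply(picks, full_sum)
pair <- sapply(picks, pair_sum)
best <- picks[[which.max(pair)]]
```

Here full equals k + 2 * pair up to floating-point error, so which.max() picks the same combination for both scores.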
You could use the combn function:
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
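# assumption: `a` is not defined in the original answer; here it is taken to be the four sets bound column-wise
a = cbind(setA, setB, setC, setD)
unlist(combn(a,4,fun,simplify = FALSE))[1:3] # only the first 3 are printed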
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define a helper function to aggregate the correlations.
# It is called later for each combination of selected columns.
# cmbns: a matrix giving the columns to select from each data frame
# (i-th row corresponds to i-th df);
# returns the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggregation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3

Trying to make a script calculate a value (using a function) for every 24 rows

I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (amount of "anatomical" embryonic development).
I am calculating a weighted median. In this case, the median is of temperature and it is weighted by development.
I have a script here that I am using to calculate the weighted median:
weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
x <- as.numeric(as.vector(x))
w <- as.numeric(as.vector(w))
if(anyNA(x) || anyNA(w)) {
ok <- !(is.na(x) | is.na(w))
x <- x[ok]
w <- w[ok]
}
stopifnot(all(w >= 0))
if(all(w == 0)) stop("All weights are zero", call.=FALSE)
#'
oo <- order(x)
x <- x[oo]
w <- w[oo]
Fx <- cumsum(w)/sum(w)
#'
result <- numeric(length(probs))
for(i in seq_along(result)) {
p <- probs[i]
lefties <- which(Fx <= p)
if(length(lefties) == 0) {
result[i] <- x[1]
} else {
left <- max(lefties)
result[i] <- x[left]
if(Fx[left] < p && left < length(x)) {
right <- left+1
y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]- Fx[left])
if(is.finite(y)) result[i] <- y
}
}
}
names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
return(result)
}
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the datasheet for Nest 1, which contains two vectors, temp and devo; devo is the weight):
weighted.median(nest1$temp[c(1,1:24)],nest1$devo[c(1,1:24)],na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,25:48)],nest1$devo[c(1,25:48)],na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,49:72)],nest1$devo[c(1,49:72)],na.rm=TRUE)
all the way to
weighted.median(nest1$temp[c(1,(n-23):n)],nest1$devo[c(1,(n-23):n)],na.rm=TRUE)
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
group_by(day) %>%
summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277
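The rep(..., each = 24) construction above assumes that the number of rows is an exact multiple of 24. If a trace ends part-way through a day, one option is to derive the day label from the row number with ceiling(). The sketch below uses a hypothetical row count of 130, and weighted.mean() from base R stands in for the question's weighted.median() so that the block runs on its own.

```r
set.seed(123)
n <- 130  # hypothetical: 5 full days plus 10 extra hours
nest <- data.frame(temp = rnorm(n), devo = rpois(n, 5))

# Rows 1-24 get day 1, rows 25-48 get day 2, ...; the last day may be partial
nest$day <- ceiling(seq_len(n) / 24)

# weighted.mean() is only a stand-in for weighted.median() here
daily <- sapply(split(nest, nest$day),
                function(d) weighted.mean(d$temp, d$devo))

hours_per_day <- table(nest$day)
```

Whether a partial last day should be kept or dropped is a judgment call; hours_per_day makes it easy to filter out days with fewer than 24 rows.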

How to create a conditional dummy in R?

I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has temperature above a threshold of 5C. This would be easy in itself, but an additional condition exists: counting starts only after ten consecutive days above the threshold occurs. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 10:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where, at the same time, lengths is at least 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths), and as.integer in order to return 1/0 instead of TRUE/FALSE.
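To see the three steps on something small enough to check by hand, here is a sketch with a made-up temperature vector and a minimum run length of 3 instead of 10 (both are assumptions for illustration only):

```r
temp <- c(1, 6, 7, 2, 6, 7, 8, 9, 3)  # hypothetical temperatures
min_run <- 3                           # hypothetical, instead of 10

runs <- rle(temp > 5)
# runs$lengths: 1 2 1 4 1
# runs$values : FALSE TRUE FALSE TRUE FALSE

# A run qualifies only if it is above the threshold AND long enough
keep <- runs$values & runs$lengths >= min_run

# Expand back to one entry per day
dummy <- rep(as.integer(keep), runs$lengths)
# dummy: 0 0 0 0 1 1 1 1 0
```

The run of two warm days (6, 7) is too short and gets 0, while the run of four (6, 7, 8, 9) is marked with 1s.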
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
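# find the first run of 10 consecutive values above the threshold
# (stats::filter is called explicitly because dplyr::filter masks it when dplyr is attached)
test1 <- stats::filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
# index of the day that completes the first such run
n <- which(cumsum(test1) == 1L)[1L]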
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)
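The one-sided filter in this answer is a running sum over the current day and the days before it, so it equals the window length exactly when the whole window is above the threshold. A small sketch with a made-up 0/1 vector and a window of 3 instead of 10 (both hypothetical):

```r
above <- c(0, 1, 1, 1, 0, 1, 1, 1, 1)  # hypothetical: 1 = day above threshold
k <- 3                                  # hypothetical window, instead of 10

# Running sum over the current day and the k - 1 days before it;
# the first k - 1 entries are NA because the window is incomplete
run_count <- as.numeric(stats::filter(above, rep(1, k), sides = 1))

# Day on which the first full window of k warm days is completed
full_window <- !is.na(run_count) & run_count == k
first_hit <- which(full_window)[1]

# Count warm days only after that point
count <- numeric(length(above))
after <- (first_hit + 1):length(above)
count[after] <- cumsum(above[after])
```

With this input the first full window ends on day 4, and the counter then runs 0, 1, 2, 3, 4 over days 5 to 9.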

R: Rolling window function with adjustable window and step-size for irregularly spaced observations

Say there is a 2-column data frame with a time or distance column, which increases sequentially, and an observation column, which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over by Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in the window is based on the time column, so both the number of observations per window and the number of observations by which the window slides may vary. The function should accept any window size up to the number of observations and a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several millions of observations contained in a list of data frames (max data frame is ~400,000 observations).
Below is some extra info: a larger sample set.
Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
NumericVector & vals,
double start,
const double winlen,
const double winshift) {
int n = ceil((max(times) - start) / winshift);
NumericVector winvals;
NumericVector means(n);
int ind1(0), ind2(0);
for(int i=0; i < n; i++) {
if (times[0] < (start+winlen)) {
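// use && (short-circuit) and test the index bound first,
// so that times[ind1+1] is never read past the end of the vector
while((ind1 < (times.size() - 1)) &&
(times[ind1] <= start) &&
(times[ind1+1] <= (start+winlen))) {
ind1++;
}
while((ind2 < (times.size() - 1)) && (times[ind2+1] <= (start+winlen))) {
ind2++;
}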
if (times[ind1] >= start) {
winvals = vals[seq(ind1, ind2)];
means[i] = mean(winvals);
} else {
means[i] = NA_REAL;
}
} else {
means[i] = NA_REAL;
}
start += winshift;
}
return means;
}
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
# time_col is name of time column
# fun is function to apply to the subsetted data frames
# min_window is the start time of the earliest window
times <- DF[, time_col]
# window_starts is a vector of the windows' minimum times
window_starts <- seq(from=min_window, to=max(times), by=step_size)
# The i-th element of window_rows is a vector that tells us the row numbers of
# the data-frame rows that are present in window i
window_rows <- lapply(window_starts, function(x) { which(times>=x & times<x+window_size) })
window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}
rolling_summary(DF=dat,
time_col="time",
fun=function(DF) mean(DF$measure),
window_size=5,
step_size=2.5,
min_window=-2.5)
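If the time column is sorted, the row range of each window can probably be located with findInterval() instead of scanning all rows once per window. The sketch below is untimed and only covers the mean; roll_mean_sorted and the small tm and x vectors are hypothetical names and data. It uses the same window definition as rolling_summary() above (start <= time < start + window), but returns NA rather than NaN for an empty window.

```r
# Assumes `time` is sorted in increasing order
roll_mean_sorted <- function(time, x, start, win, step) {
  starts <- seq(from = start, to = max(time), by = step)
  # With left.open = TRUE, findInterval() counts the elements of `time`
  # that are strictly smaller than each query value
  lo <- findInterval(starts, time, left.open = TRUE) + 1L   # first row with time >= start
  hi <- findInterval(starts + win, time, left.open = TRUE)  # last row with time < start + win
  vapply(seq_along(starts), function(k) {
    if (hi[k] < lo[k]) NA_real_ else mean(x[lo[k]:hi[k]])
  }, numeric(1))
}

# Hypothetical toy data
tm <- c(0.5, 1.2, 3.1, 4.0, 7.7)
x <- c(1, 2, 3, 4, 5)
res <- roll_mean_sorted(tm, x, start = 0, win = 2, step = 1)
```

Because mean() is called without na.rm, an NA inside a window still yields NA, which matches the na.rm=FALSE case in the question.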
Here are some functions that will give the same output on your first example:
partition <- function(x, window, step = 0){
a = x[x < step]
b = x[x >= step]
ia = rep(0, length(a))
ib = cut(b, seq(step, max(b) + window, by = window))
c(ia, ib)
}
roll <- function(df, window, step = 0, fun, ...){
tapply(df$measure, partition(df$time, window, step), fun, ...)
}
roll_steps <- function(df, window, steps, fun, ...){
X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
names(X) = steps
X
}
Output for your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also ignore missing values this way easily:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
This can also be used for a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
pos_vec <- my_dt[, which(time>=start_tm & time<=start_tm+window_len)]
return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
window_len=5, stat_fn=mean, my_dt=dat)
On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs; all rows in 51.58614 secs
Here's another attempt to use pure data.table approach and its between function.
I have compared it with Rprof against the above answers (except @Roland's answer) and it seems to be the most optimized one.
I haven't tested it for bugs, though; if you like it, I'll expand the answer.
Using your dat from above
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
temp <- seq.int(-slide, max(dat$time), by = slide)
temp <- cbind(temp, temp + wind)
setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also pass a different function and its arguments, for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will also work
Edit: I did some benchmarks against @Roland's answer and his method clearly wins (by far), so I would go with the Rcpp approach.
