I'm wondering if there is an easy way to average over the previous 30 seconds of data in R when there may be more than one data point per second.
For instance, for the sample weight taken at 32 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 9, 10, 7, ..14,20, 18, 2). For the sample weight taken at 31 seconds,I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 5, 9, 10, 7, .. 14,20, 18). It's technically not a rolling average over the 30 previous measurements because there can be more than one measurement per second.
I'd like to do this in R.
1) sqldf Using DF below and 3 seconds join the last three seconds of data to each row of DF and then take the mean over them:
DF <- data.frame(time = c(1, 2, 2, 3, 4, 5, 6, 7, 8, 10), data = 1:10)
library(sqldf)
sqldf("select a.*, avg(b.data) mean
from DF a join DF b on b.time between a.time - 3 and a.time
group by a.rowid")
giving:
time data mean
1 1 1 1.0
2 2 2 2.0
3 2 3 2.0
4 3 4 2.5
5 4 5 3.0
6 5 6 4.0
7 6 7 5.5
8 7 8 6.5
9 8 9 7.5
10 10 10 9.0
The first mean value is the mean(1) which is 1, the second and third mean values are mean(1:3) which is 2, the fourth mean value is mean(1:4) which is 2.5, the fifth mean value is mean(1:5) which is 3, the sixth mean value is mean(2:6) which is 4, the seventh mean value is mean(3:7) which is 5 and so on.
2) This 2nd solution uses no packages. For each row of DF it finds the rows within 3 seconds back and takes the mean of their data:
Mean3 <- function(i) with(DF, mean(data[time <= time[i] & time >= time[i] - 3]))
cbind(DF, mean = sapply(1:nrow(DF), Mean3))
The rollapply function should do the trick.
library(zoo)
rollapply(weight.vector, 30, mean)
You can do (assuming your data is stored in a dataframe called df):
now <- 32
step <- 30
subsetData <- subset(df, time >= (now-step) & time < now)
average <- mean(subsetData$concentration)
And if you want to calculate the mean for at more time points, you can put this in a loop where you must adjust now
My first idea would be to summarise the data so the value column would contain a list of all values.
test.data <- data.frame(t = 1:50 + rbinom(50, 30, 0.3), y=rnorm(50)) %>% arrange(t)
prep <- test.data %>% group_by(t) %>% summarise(vals = list(y))
wrk <- left_join(data.frame(t=1:max(test.data$t)), prep, by='t')
Unfortunately zoos rollapply would not work on such a data.frame.
For testing I was thinking to only use a window of 5 lines.
I tried commands along: rollapply(wrk, 5, function(z) mean(unlist(z)))
But maybe someone else can fill in the missing bit of information.
This is sufficiently different that it warrants another answer.
This should do what you're asking with no extra libraries needed.
It just loops through each row, filters based on that row's time, and computes the mean.
Don't fear a simple loop :)
count = 200 # dataset rows
windowTimespan = 30 # timespan of window
# first lets make some data
df = data.frame(
# 200 random numbers from 0-99
time = sort(floor(runif(count)*100)),
concentration = runif(count),
weight = runif(count)
)
# add placeholder column(s)
df$rollingMeanWeight = NA
df$rollingMeanConcentration = NA
# for each row
for (r in 1:nrow(df)) {
# get the time in this row
thisTime = df$time[r]
# find all the rows within the acceptable timespan
# note: figure out if you want < vs <=
thisSubset = df[
df$time < thisTime &
df$time >= thisTime-windowTimespan
,]
# get the mean of the subset
df$rollingMeanWeight[r] = mean(thisSubset$weight)
df$rollingMeanConcentration[r] = mean(thisSubset$concentration)
}
Related
I want to calculate the rolling sum of n rows in my dataset where the window size 'n' depends on the sum itself. For example, I want to slide the window as soon as the rolling sum of time exceeds 5 mins. Basically, I want to calculate how much distance the person traveled in the last 5 mins but the time steps are not equally spaced. Here's a dummy data.table for clarity (the last two columns are required):
I am looking for a data.table solution in R
Input data table:
ID
Distance
Time
1
2
2
1
4
1
1
2
1
1
2
2
1
3
3
1
6
3
1
1
1
Desired Output:
ID
Distance
Time
5.min.rolling.distance
5.min.rolling.time
1
2
2
NA
NA
1
4
1
NA
NA
1
2
1
NA
NA
1
2
2
10
6
1
3
3
5
5
1
6
3
9
6
1
1
1
10
7
Here is a solution that works with double time units as well as a simpler solution that will work with integer time units. I tested the double solution on 10,000 records and on my 2015 laptop it executed instantly. I can't make any guarantees about performance on 40 GB of data.
If you wanted to generalize this code I'd look at the RcppRoll package and learn how to implement c++ code in R.
Solution with double time units
I broke this down into two problems. First, figure out the window size by looking back until we get to at least 5 minutes (or run out of data). Second, take the sum of distances and time from the current observation to the look back unit.
Bad loop code in R usually tries to 'grow' a vector, its a huge efficiency gain to pre-allocate the vector length and then change elements in it.
input <- data.frame(
dist = c(2, 4, 2, 2, 3, 6, 1),
time = c(2, 1, 1, 2, 3, 3, 1)
)
var_window_cumsum <- function(input, MIN_TIME) {
if(is.null(input$time) | is.null(input$dist)) {
stop("input must have variables time and dist that record the row's duration and distance traveled.")
}
n <- nrow(input)
# First, figure out how far we need to look back to, this vector will store
# the position of the first record that gets our target record up to 5 min or
# more. If we cant look back to 5 min, we leave it as NA.
time_indx = rep(NA_integer_, length = n) # always preallocate your vector!
for(time in (1:n)) {
prior = time # start at self in case observation is already >= MIN_TIME
while(sum(input$time[time:prior]) < MIN_TIME & prior > 1) {
prior = prior - 1
}
# if we cant look back to our minimum time, leave the indx as NA
if (sum(input$time[time:prior]) >= MIN_TIME) {
time_indx[time] = prior
}
}
# Now that we know how far to look back, its easy to find out the total distance
# and total time.
dist5 = rep(NA_integer_, n)
time5 = rep(NA_integer_, n)
for (i in 1:n) {
dist5[i] <- ifelse(!is.na(time_indx[i]),
sum(input$dist[i:time_indx[i]]),
NA)
time5[i] <- ifelse(!is.na(time_indx[i]),
sum(input$time[i:time_indx[i]]),
NA)
}
cbind(input,
window_dist = dist5,
window_time = time5,
window_start = time_indx)
}
# output looks good
# Warning: example data does not include exhaustive cases
# I have not setup thorough testing
var_window_cumsum(input, 5)
# Test on a larger dataset, 10k records
set.seed(1234)
n <- 10000
med_input <- data.frame(
dist = sample(1:5, n, replace = TRUE),
time = sample(1:60, n, replace = TRUE) / 10
)
# you should inspect this to make sure there are no errors
med_output <- var_window_cumsum(med_input, 5)
Solution with integer time units
If your time unit is in integers and your data isn't too big, it may work to complete your dataset. This is a little bit of a hack, but here I create a continuos timeid variable that goes from the starting time to the maximum time, and create one row for each integer unit of time. From there its easy to calculate a rolling cumulative sum for the last five time units. Finally, we get rid of all the fake rows we added in (you want to make sure to do that because they will have invalid cumulative sum data. Also, important to note that I use roll_sumr and not roll_sum; roll_sumr includes 4 padding NA's on the left side of the output vector for the first 4 units.
library(tidyverse)
library(RcppRoll)
input <- data.frame(
dist = c(2, 4, 2, 2, 3, 6, 1),
time = c(2, 1, 1, 2, 3, 3, 1)
)
desired_dist5 <- c(NA, NA, NA, 10, 5, 9, 10)
desired_time5 <- c(NA, NA, NA, 6, 5, 6, 7)
output <- input %>%
mutate(timeid = cumsum(time),
realrow = TRUE) %>%
complete(timeid = 1:max(timeid)) %>%
mutate(dist5 = roll_sumr(dist, 5, na.rm = T),
time5 = roll_sumr(time, 5, na.rm = T)) %>%
filter(realrow) %>%
select(-c(realrow, timeid))
# Check against example table
output$dist5 == desired_dist5
output$time5 == desired_time5
I am trying to dynamically populate a variable, which requires me to reference rows.
Given are 3 columns: time, group, and val.
I want to populate rows 3, 4, 7, and 8's val which are initially NA.
Here is my toy data:
df <- expand.grid(time = rep(c(1,2,3,4)), group = rep(c("A", "B")))
df$val <- c(50,40,NA,NA)
df
> df
time group val
1 1 A 50
2 2 A 40
3 3 A NA
4 4 A NA
5 1 B 50
6 2 B 40
7 3 B NA
8 4 B NA
I have two grouping variables (time and group) and, as example, I need to populate row 3 above by this set of rules:
1. Order by group and time (in ascending order)
2. For time = 3, the value of **val** is the arithmetic average of two previous rows;
(2a). i.e. the average of time 2 and time 1 values, so it will be 1/2 * (40+50) = 45.
3. For time = 4, the value of **val** is the arithmetic average of two previous rows;
(3a). i.e. the average of time 3 and time 2 values, so it will be 1/2 * (45+40) = 42.5.
And so on, going down to the last row of each group as defined by time and group variables.
I want to avoid using loops and referencing row index to achieve this, and prefer to stay within dplyr, as the rest of my scripts are in the dplyr ecosystem. Is there an efficient way to achieve this?
This isn't the cleanest solution, but it gets the job done:
df2 = df %>%
arrange(group, time) %>%
mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val)) %>%
mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val))
Again, it's not pretty, but it seems to work. Hope that helps give you something to start from.
I have a data frame with a group of x and y points. I need to calculate the euclidean distance of every point relative to every other point. Then I have to figure, for each row, how many are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that signals how many points (if we consider these points to be in a cartesian plane) are within a distance of 3 units from every other point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or other library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function (df) {
xp <- df$x
yp <- df$y
return(nrow(filter(df, dist( rbind(c(x,y), c(xp,yp)) ) < 3 )))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist":
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3)
This is base approach with straightforward application of a "distance function" but only on a row-by-row basis:
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2+(x[2]-df_[['y']])^2 <=9 )-1 )
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
I would suggest you work with pairs of points in the long format and then use a data.table solution, which is probably one of the fastest alternatives to work with large datasets
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df,df)
# rename columns
setDT(df_long )
setnames(df_long, c("x","y","x1","y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[ , mydist := dist ( matrix(c(x,x1,y,y1), ncol = 2, nrow = 2) ) , by=.(x,y,x1,y1)]
# count how many points are within a distance of 3 units
df_long[mydist <3 , .(count = .N), by=.(x,y)]
#> x y count
#> 1: 1 2 2
#> 2: 2 2 2
#> 3: 9 9 1
I want to split my data into 3 parts with the ratio of 6:2:2. Is there a R command that can do that? Thanks.
I used createDataPartition in the caret package, that can split data into two parts. But how to do it with 3 splits? Is that possible? Or I need two steps to do that?
You randomly split with (roughly) this ratio using sample:
set.seed(144)
spl <- split(iris, sample(c(1, 1, 1, 2, 3), nrow(iris), replace=T))
This split your initial data frame into a list. Now you can check that you've gotten the split ratio you were looking for using lapply with nrow called on each element of your list:
unlist(lapply(spl, nrow))
# 1 2 3
# 98 26 26
If you wanted to randomly shuffle but to get exactly your ratio for each group, you could shuffle the indices and then select the correct number of each type of index from the shuffled list. For iris, we would want 90 for group 1, 30 for group 2, and 30 for group 3:
set.seed(144)
nums <- c(90, 30, 30)
assignments <- rep(NA, nrow(iris))
assignments[sample(nrow(iris))] <- rep(c(1, 2, 3), nums)
spl2 <- split(iris, assignments)
unlist(lapply(spl2, nrow))
# 1 2 3
# 90 30 30
I have a data frame with three columns: DATE, HOUR, HRC
(So there are 24 rows for each DATE)
The HRC column is sometimes a number and sometimes NA.
I am trying to figure out a way of taking a subset of DATEs and then figuring out the HOURs that have non-NA values across all days.
Example: so if DATES are Aug16, Aug18, Aug19, and HRC column has non-NA values on Aug16 at HOURS 8, 9, 10, 11, 12... Aug18 at HOURS 7, 8, 9, 10, 11...Aug19 at HOURS 9, 10, 11, 12, 13. I would like the outcome to be the list of HOURS 9, 10, 11 since those are the non-NA HOURS for all DATES.
Adjusting sum(is.na(x$HRC)) to sum(!is.na(x$HRC)) in Gary's solution did the trick. Thanks everyone!
You didn't produce an example, so we are really confused about your question. It is generally constructive to provide a reproducible example. Even if I admit that it is little bit challenging to create example with date types.
set.seed(1234)
#generate sequence of 25 days hour by hour
x <- Sys.time() + seq(1,by=60*60,length.out=24*25)
hh <- as.POSIXlt(x)$hour
## generate the data.frame
dat <- data.frame(DATE = as.POSIXct(format(x,"%Y-%m-%d")),
HOUR=as.POSIXlt(x)$hour,
HRC = 1:length(x))
## introduce random NA
id <- sample(nrow(dat),10,rep=F)
dat$HRC[id] <- NA
Here begins my solution; it is similar to Gary solution, I am using plyr package but with different function.
## I choose 2 dates to subset
min.d <- as.POSIXct('2013-03-01')
max.d <- as.POSIXct('2013-03-15')
dat.s <- subset(dat, DATE >=min.d & DATE <= max.d )
res <- ddply(dat.s, .(HOUR), ## grouping by hour
function(x){
any(is.na(x$HRC)) ## I retuen one HRC at least is NA
})
The result:
res[res$V1,]
HOUR V1
6 5 TRUE
12 11 TRUE
14 13 TRUE
17 16 TRUE
19 18 TRUE
22 21 TRUE
You might try something like this:
library(plyr)
# assuming your dates are in some date format
d_0 <- as.Date('02-01-2010',format='%m-%d-%Y')
d_1 <- as.Date('02-10-2010',format='%m-%d-%Y')
# assuming your data are in data frame 'dat', get some subset of dates
some_dates <- subset(dat, DATE > d_0 & DATE < d_1)
# count the NAs for each hour
hr_count <- ddply(some_dates, .(HOUR), function(x) sum(!is.na(x$HRC)))