Summarise dataframe to obtain diff (lagged difference) - r

I have a dataframe that I want to group and then obtain the median of the diff (lagged difference) in consistent units. It is very similar to the example below. As you can see by running the code, the problem is that diff returns a difftime with a units attribute (which can differ between groups), and my summarise call does not take it into account.
library(tidyverse)
# Initialise random data
t = Sys.time()
rnd <- sample(1:10000,10,replace=F)
add <- rnd[order(rnd)]
# Create 2 data frames
time1 <- data.frame(datetime = t + add)
time2 <- data.frame(datetime = t + add * 1000)
# Bind dataframe together
mydata <- bind_rows(time1, time2, .id = "group")
# Trying to get a summary table
mydata %>% group_by(group) %>% summarise(elapsed = median(diff(datetime[order(datetime)])))
# These are the values that I should get in my summary table
median(diff(time1$datetime))
median(diff(time2$datetime))

What about using difftime and setting the units?
mydata %>%
group_by(group) %>%
summarise(elapsed = median(difftime(datetime, lag(datetime), units = "mins"), na.rm = TRUE))

Here's one option, which will show all results in seconds. Use dminutes(1) or dhours(1) or ddays(1) if more appropriate.
mydata %>%
group_by(group) %>%
summarise(elapsed = median(diff(datetime[order(datetime)])) / lubridate::dseconds(1))
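A base R alternative, as a minimal sketch: diff on date-times picks its units automatically from the size of the gaps, so two groups can end up in different units. Converting each median with as.numeric(..., units = "secs") makes them comparable. The toy offsets below are made up for illustration and are not the random data from the question.

```r
# Fixed start time and hand-picked offsets (in seconds) so the result is reproducible
t0 <- as.POSIXct("2021-01-01 00:00:00", tz = "UTC")
add <- c(10, 25, 70, 100)

x1 <- t0 + add          # gaps of 15, 45 and 30 seconds
x2 <- t0 + add * 1000   # the same gaps multiplied by 1000

d1 <- median(diff(x1))  # difftime, units chosen automatically
d2 <- median(diff(x2))  # difftime, possibly in different units than d1

# Convert both medians to plain numbers in the same unit
secs1 <- as.numeric(d1, units = "secs")
secs2 <- as.numeric(d2, units = "secs")
```

Inside summarise() the same conversion would read elapsed = as.numeric(median(diff(sort(datetime))), units = "secs"), which returns an ordinary numeric column in seconds for every group.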

Related

R rolling mean on a non-continuous time series

I want to make a rolling mean on the last X number of days. rollmean() does that using rows. Since I am using loggers that sometimes fail, and also the data were cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great. Except my data need to be grouped (in the example by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
# making some example data
# vector with days since the beginning of experiment
days <- 0:30
# random values
df1 <- tibble::tibble(
days_since_beginning = days,
value_to_used = rnorm(length(days)),
treatment = sample(letters[1],31,replace = TRUE) )
df2 <- tibble::tibble(
days_since_beginning = days,
value_to_used = rnorm(length(days)),
treatment = sample(letters[2],31,replace = TRUE) )
df <- full_join(df1, df2)
# how long should be the period for mean
time_period <- 10 # calculate for last 10 days
df_mean <- df %>% dplyr::mutate(
# calculate rolling mean
roll_mean = purrr::map_dbl(
.x = days_since_beginning,
.f = ~ df %>%
# select only data for the last `time_period`
dplyr::filter(days_since_beginning >= .x - time_period &
days_since_beginning <= .x) %>%
purrr::pluck("value_to_used") %>%
mean() %>%
return()
) )
This takes the mean over the last 10 days by treatment. The width argument includes a computation of how many rows back to use so that it corresponds to 10 days rather than 10 rows. This uses the fact that width can be a vector.
library(dplyr)
library(zoo)
df %>%
group_by(treatment) %>%
mutate(roll = rollapplyr(value_to_used,
seq_along(days_since_beginning) - findInterval(days_since_beginning - 10, days_since_beginning),
mean)) %>%
ungroup
Same colleague came up with his own solution:
df_mean <-
df %>%
dplyr::group_by(treatment) %>%
tidyr::nest() %>%
dplyr::mutate(
data_with_mean = purrr::map(
.x = data,
.f = ~ {
dataset <- .x
dataset %>%
dplyr::mutate(
# calculate rolling mean
roll_mean = purrr::map_dbl(
.x = days_since_beginning,
.f = ~ dataset %>%
# select only data for the last `time_period`
dplyr::filter(days_since_beginning >= .x - time_period &
days_since_beginning <= .x) %>%
purrr::pluck("value_to_used") %>%
mean() %>%
return()
)) %>%
return()
}
)) %>%
dplyr::select(-data) %>%
tidyr::unnest(data_with_mean) %>%
dplyr::ungroup()
I compared the results with G. Grothendieck's idea, and it only matches if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how the time_period is used, and I am confused about why it happens.
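One likely explanation is how the window boundaries are treated. The filter keeps both endpoints (days >= .x - time_period and days <= .x), whereas the width computed with findInterval drops every row up to and including day .x - 10. A small check on the example days, as a sketch (the variable names n_filter, n_width and n_width_plus1 are only for illustration):

```r
days <- 0:30
time_period <- 10
x <- 10  # look at the window ending on day 10

# Filter-based version: closed interval [x - time_period, x]
n_filter <- sum(days >= x - time_period & days <= x)

# findInterval() counts the rows with days <= x - time_period; subtracting that
# count from the current row index leaves the half-open interval (x - time_period, x]
n_width <- which(days == x) - findInterval(x - time_period, days)

# Using time_period + 1 in the width computation restores the extra day
n_width_plus1 <- which(days == x) - findInterval(x - (time_period + 1), days)

c(filter = n_filter, width = n_width, width_plus1 = n_width_plus1)
```

So the filter averages over 11 days (day 0 through day 10), while the width-based version averages over the 10 days after day 0. Which one is "the last 10 days" depends on whether the current day should count.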

Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?

Purpose
I am trying to check whether a pair of values in two columns appear in the previous event, and aggregate the dummy variables by groups.
Specifically, I have each event id (i.e., oid) and dyad-level observations associated with each event: agent (i.e., aid), partner (i.e., pid). The events are sorted by time when the event occurs (i.e., o4.in).
(1) I made a dummy variable indicating whether a pair of agent and partner appear together in the previous event.
(2) I also used ddply to aggregate the dummy variable by groups, as shown in the example below.
I find that ddply and the lag functions take a long time with a large dataset, and I am wondering if there is a faster way to achieve these tasks.
Dataset
library(plyr) # provides ddply(); loaded before tidyverse so that dplyr's verbs mask plyr's
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10001)
cases <- sample(1:5, 1000, replace=T)
set.seed(10002)
agent <- sample(1:20, 1000, replace=T)
set.seed(10003)
partner <- sample(1:20, 1000, replace=T)
set.seed(123)
n <- 1000 # no of random datetimes needed
minDate <- as.POSIXct("1999/01/01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <-
as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)), origin = epoch)
df <-
data.frame(cases, agent, partner, timestamps) %>%
rename(
aid = agent,
pid = partner,
oid = cases,
o4.in = timestamps
) %>%
filter(aid != pid)
Current Methods
# creating dummy variable
d <-
df %>%
arrange(o4.in) %>%
group_by(aid) %>%
mutate(
oid.lag.a = lag(oid)
) %>%
ungroup %>%
group_by(pid) %>%
mutate(
oid.lag.p = lag(oid)
) %>%
ungroup %>%
mutate(
j2.consecutive = ifelse(oid.lag.a == oid.lag.p, 1, 0),
j2.consecutive = ifelse(is.na(j2.consecutive), 0, j2.consecutive)
) %>%
select(-oid.lag.a, -oid.lag.p)
# aggregating the dummy variable by groups
t <-
d %>%
ungroup %>%
ddply(c('oid', 'aid'), function(i){
i %>%
mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
} , .progress = 'text') %>%
arrange(oid, pid) %>%
ddply(c('oid', 'pid'), function(i){
i %>%
mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
} , .progress = 'text')
Update for Future Readers
Task (1) is achieved by the answer by @akrun below.
Task (2) is answered by @akrun in a separate post: A faster way than applying 'ddply' to aggregate a variable by a function by groups
Special thanks to @akrun!!
We can use data.table methods to make it faster
library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some steps in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary; as.integer or coercion with unary + does the same. The second ifelse, which replaces NA with 0, can likewise be replaced by coalesce.
library(dplyr)
out <- df %>%
arrange(o4.in) %>%
group_by(aid) %>%
mutate(
oid.lag.a = lag(oid)
) %>%
group_by(pid) %>%
mutate(
oid.lag.p = lag(oid)
) %>%
ungroup %>%
mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))
Checking that the dplyr and data.table outputs match:
all(out$j2.consecutive == df3$j2.consecutive )
[1] TRUE
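For task (2), the ddply calls compute a leave-one-out mean within each group: (group sum minus own value) / (group size minus 1). A rough sketch of the same arithmetic with grouped vector operations in base R, using ave on a small made-up table (the toy values are assumptions, and this is not the answer from the separate post mentioned above):

```r
# Toy data with the same column names as in the question
d <- data.frame(
  oid = c(1, 1, 1, 2, 2),
  aid = c(7, 7, 7, 7, 9),
  j2.consecutive = c(1, 0, 1, 1, 0)
)

# Group-wise sum and group size, repeated for every row of the group
grp_sum <- ave(d$j2.consecutive, d$oid, d$aid, FUN = sum)
grp_n   <- ave(d$j2.consecutive, d$oid, d$aid, FUN = length)

# Leave-one-out mean; groups with a single row give 0/0 = NaN,
# as the original ddply expression does
d$aj1.consecutive <- (grp_sum - d$j2.consecutive) / (grp_n - 1)
```

The same pattern with oid and pid as grouping columns would give apj1.consecutive. Because everything is vectorised per column rather than per group, no function is called on a data frame for each group.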

Reduce a data frame by combining like rows according to two qualitative factors

I have a dataframe like the following:
observations<- data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00KS177011","00P0006","00P006","00P006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29), hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Black","Red","White","White","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to be able to reduce this so that anytime X and Y are the same for two rows, the observations are combined i.e.
data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00P0006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,0,6), hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Red","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?
Edit: OK, now that I see from your comment above how hzdept and hzdepb are supposed to be combined:
library(tidyverse)
df <- observations %>% count(X,Y,wt = Z,name = "Z")
df_hzdept <- observations %>%
arrange(hzdept) %>%
distinct(X,Y,.keep_all = T) %>%
select(X,Y,hzdept)
df_hzdepb <- observations %>%
arrange(desc(hzdepb)) %>%
distinct(X,Y,.keep_all = T) %>%
select(X,Y,hzdepb)
df <- df %>% left_join(df_hzdept) %>% left_join(df_hzdepb)
Using dplyr
Here is how you would group by two columns and summarise the other columns of a dataframe using their minimum, maximum, and sum:
library(magrittr) # For the pipe: %>%
observations %>%
dplyr::group_by(X, Y) %>%
dplyr::summarise(hzdept = min(hzdept),
hzdepb = max(hzdepb),
Z = sum(Z), .groups = 'drop')

sample multiple different sample sizes using crossing and sample_n to create single df

I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size= argument at a time, however, I would like to sample sizes from 2 to the max # of rows in the df. Unfortunately, the code I have compiled below does not do the job. The needed output would be a dataframe with an id= column or a list divided by the id column from crossing().
df <- data.frame(Date = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
data_sampled_by_stratum <- df %>%
group_by(Date) %>%
crossing(id = seq(500)) %>% # repeat dataframes
group_by(id) %>%
sample_n(size=c(2:15)) %>%
group_by(CLUSTER_ID,Date) %>% filter(n() > 2)
If you had a column with different sites you could do this.
data_sampled_by_stratum <- data_grouped_by_stratum %>%
group_by(siteid, Date) %>%
crossing(id = seq(500)) %>% # repeat dataframes
sample_n(rbinom(1,sum(siteid==i),(1-s)^2))
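For the original request (one sample for each size from 2 up to the number of rows, with an id column), a minimal base R sketch is to loop over the sizes and stack the results. It draws a single sample per size instead of the 500 repeats in the question, and here id simply records the sample size; both are assumptions about what is wanted.

```r
set.seed(1)  # only to make the draw reproducible

df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

sizes <- 2:nrow(df)

# Draw one sample (without replacement) per size and tag it with that size
samples <- lapply(sizes, function(n) {
  out <- df[sample(nrow(df), n), ]
  out$id <- n
  out
})

# Either a single stacked data frame ...
data_sampled <- do.call(rbind, samples)
# ... or a list split by the id column
sampled_list <- split(data_sampled, data_sampled$id)
```

To repeat each size several times, the loop can run over a grid of size and repeat number instead of over sizes alone.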

How to Calculate A Daily Max for Different Locations in R?

So currently I am able to calculate a daily max for one site using the following code:
library('dplyr')
library('data.table')
library('tidyverse')
library('tidyr')
library('lubridate')
# Running count of objects on site: +1 at each entry time, -1 at each exit time
funcVolume <- function(enter_yard, exit_yard)
{
vecOnes <- rep(1, length(enter_yard))
vecTime <- c(enter_yard, exit_yard)
vecCount <- c(vecOnes, -vecOnes)
df_test <- data.frame(T = vecTime, Count = vecCount)
df_test <- df_test %>%
arrange(T) %>%
mutate(Volume = cumsum(Count))
df_test
}
# max_data holds the enter_yard and exit_yard datetime columns
df_test <- funcVolume(max_data$enter_yard, max_data$exit_yard)
df_test2 <- df_test
df_test2$date <- as.Date(format(df_test$T, "%Y-%m-%d"))
df_test3 <- df_test2
df_test3 <- tibble(x = df_test2$Volume, y = df_test2$date) %>%
arrange(y)
dataset <- df_test3 %>%
group_by(y) %>%
dplyr::filter(x == max(x)) %>%
distinct(x,.keep_all = T) %>%
ungroup()
However, I would like to do this for multiple locations. In my original dataframe, I have a column that lists the name of the site, and two columns for when an object enters or leaves a site. The name is just a general text column, and the other two columns are datetime columns. Ideally, I would want an output that looks like the following:
Date | Max Count | Site
x | y | z
x | a | b
I also have a couple million rows of data, so I need something that can run in a reasonable time frame.
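One way to extend the single-site logic, as a sketch in base R: stack entries (+1) and exits (-1) into one event table, take the cumulative sum within each site, and then take the maximum per site and day. The column name site and the toy timestamps are assumptions; enter_yard and exit_yard are as in the code above.

```r
# Toy stand-in for max_data; "site" is a hypothetical name for the site column
max_data <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  enter_yard = as.POSIXct(c("2021-01-01 08:00", "2021-01-01 09:00", "2021-01-02 10:00",
                            "2021-01-01 08:30", "2021-01-01 08:45"), tz = "UTC"),
  exit_yard  = as.POSIXct(c("2021-01-01 12:00", "2021-01-01 10:00", "2021-01-02 11:00",
                            "2021-01-01 09:30", "2021-01-01 09:00"), tz = "UTC")
)

# One row per event: +1 when an object enters, -1 when it leaves
events <- data.frame(
  site  = rep(max_data$site, 2),
  time  = c(max_data$enter_yard, max_data$exit_yard),
  count = rep(c(1, -1), each = nrow(max_data))
)

# Running volume within each site, in time order
events <- events[order(events$site, events$time), ]
events$volume <- ave(events$count, events$site, FUN = cumsum)
events$date <- as.Date(format(events$time, "%Y-%m-%d", tz = "UTC"))

# Daily maximum per site
daily_max <- aggregate(volume ~ date + site, data = events, FUN = max)
names(daily_max) <- c("date", "site", "max_count")
```

For a couple of million rows the same three steps (stack, grouped cumsum, grouped max) should carry over to data.table or dplyr grouping, which would probably be faster than aggregate, though that has not been timed here.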
