I want to compute a rolling mean over the last X days. rollmean() does that using rows, but since I am using loggers that sometimes fail, and the data were also cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great, except that my data need to be grouped (in the example, by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
# making some example data
# vector with days since the beginning of the experiment
days <- 0:30
# random values
df1 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[1], 31, replace = TRUE)
)
df2 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[2], 31, replace = TRUE)
)
df <- dplyr::full_join(df1, df2)
# how long the averaging period should be
time_period <- 10 # calculate over the last 10 days
df_mean <- df %>%
  dplyr::mutate(
    # calculate rolling mean
    roll_mean = purrr::map_dbl(
      .x = days_since_beginning,
      .f = ~ df %>%
        # keep only data from the last `time_period` days
        dplyr::filter(days_since_beginning >= .x - time_period &
                        days_since_beginning <= .x) %>%
        purrr::pluck("value_to_used") %>%
        mean()
    )
  )
This takes the mean over the last 10 days by treatment. The width argument is computed as the number of rows back to use so that the window corresponds to 10 days rather than 10 rows; this relies on the fact that width can be a vector with one entry per row.
library(dplyr)
library(zoo)

df %>%
  group_by(treatment) %>%
  mutate(roll = rollapplyr(
    value_to_used,
    seq_along(days_since_beginning) - findInterval(days_since_beginning - 10, days_since_beginning),
    mean
  )) %>%
  ungroup()
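To see how that width vector is built, here is a minimal sketch (the day values are made up for illustration): findInterval() returns, for each day, the number of observations at or before day - 10, and subtracting that from the row index gives the window size in rows.

days_ex <- c(0, 2, 5, 11, 12)
# number of observations at or before (day - 10)
lower <- findInterval(days_ex - 10, days_ex)
# rows from the first observation after (day - 10) through the current row
seq_along(days_ex) - lower
#> [1] 1 2 3 3 4

Note that an observation falling exactly 10 days back is counted by findInterval() and therefore excluded from the window, so the window is effectively half-open: (day - 10, day].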
The same colleague came up with his own solution:
df_mean <- df %>%
  dplyr::group_by(treatment) %>%
  tidyr::nest() %>%
  dplyr::mutate(
    data_with_mean = purrr::map(
      .x = data,
      .f = ~ {
        dataset <- .x
        dataset %>%
          dplyr::mutate(
            # calculate rolling mean within this treatment's data
            roll_mean = purrr::map_dbl(
              .x = days_since_beginning,
              .f = ~ dataset %>%
                # keep only data from the last `time_period` days
                dplyr::filter(days_since_beginning >= .x - time_period &
                                days_since_beginning <= .x) %>%
                purrr::pluck("value_to_used") %>%
                mean()
            )
          )
      }
    )
  ) %>%
  dplyr::select(-data) %>%
  tidyr::unnest(data_with_mean) %>%
  dplyr::ungroup()
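For reference, the slider package handles index-based windows directly and avoids the nesting entirely. slider is not part of the original solutions, so treat this as a hedged alternative sketch of the same grouped rolling mean:

library(dplyr)
library(slider)

df %>%
  group_by(treatment) %>%
  arrange(days_since_beginning, .by_group = TRUE) %>%
  mutate(roll_mean = slide_index_dbl(
    value_to_used,
    days_since_beginning, # the index is the actual day, so gaps are respected
    mean,
    .before = time_period # closed window [day - time_period, day], like the filter() version
  )) %>%
  ungroup()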
I compared the results with G. Grothendieck's idea, and they only match if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how time_period is used. Presumably the cause is the interval endpoints: my colleague's filter keeps a closed interval [x - time_period, x], which for integer days spans time_period + 1 values, while findInterval(days_since_beginning - 10, days_since_beginning) counts observations at or before x - 10, making that window half-open (x - 10, x].
I am trying to create an index from a set of variables by taking the mean of the selected variables, using the following code:
data <- data %>%
  group_by(country) %>%
  # Standardize each component/measure
  mutate(
    std_var1 = standardize(var1, Z),
    std_var2 = standardize(var2, Z),
    std_var3 = standardize(var3, Z),
    std_var4 = standardize(var4, Z)
  ) %>%
  ungroup() %>%
  dplyr::select(std_var1, std_var2, std_var3, std_var4) %>%
  # Average all z scores for an individual
  mutate(index = pmap_dbl(., ~ mean(c(...), na.rm = TRUE))) %>%
  cbind(data, .) %>%
  unnest()
I also use the idx_mean package, which takes the following syntax:
mutate(data, idx_var = idx_mean(std_var1, std_var2, std_var3, std_var4))
and obtain similar but not exactly the same index values (not just a matter of rounding).
Is there one approach that is more accurate here?
The 4th and 5th columns display the index values created by the idx function (4th column) and by the other approach (5th column).
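One way to narrow down the discrepancy: rowMeans(..., na.rm = TRUE) is equivalent to the pmap_dbl() row-wise mean above, so comparing it against idx_mean isolates whether NA handling explains the difference. This is a diagnostic sketch only; how idx_mean treats missing values is an assumption I cannot verify here, and index_rowmeans is a name introduced for illustration.

library(dplyr)

data %>%
  mutate(index_rowmeans = rowMeans(
    across(c(std_var1, std_var2, std_var3, std_var4)),
    na.rm = TRUE # rows that are all-NA yield NaN here
  ))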
A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(
  id = rep(1:5, each = 10),
  datetime = seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 03:00:00"), by = "hour"),
  land = sample(0:1, n, replace = TRUE)
)
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- as.Date(dat$datetime) # make a column for year-month-day
# cumulative sum of land points within each id-day group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))

# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))

# combine the flagged dataframe with the full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data so that only one row per id and date remains, indicating whether the animal was on land at least once in that entire day.
Use zoo's rollapply function with a left-aligned window of 3 to mark a day as TRUE when the animal was on land on that day and on each of the following two days.
library(dplyr)
library(zoo)

dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
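If you also need the specific hourly point to flag, here is a hedged follow-up sketch. It assumes the joined result above is saved as res (a name introduced here for illustration): among the days that start a 3-day land streak, take each id's earliest hourly point that is actually on land.

res %>%
  filter(consec_three, land == 1) %>%
  group_by(id) %>%
  slice_min(datetime, n = 1, with_ties = FALSE) %>% # earliest qualifying land point per id
  ungroup()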
I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not only for one day, but for every day in the vector:
seq(from = as.Date('2020-03-02'), to = Sys.Date(), by = 'days') # .... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))

STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since '2020-03-02', one after another. And each KOKOKO and STAVPTF created for a unique day like this I want to save as a separate data.frame, and store all of them in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
  filter(DATE < .x) %>%
  summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function makes it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)
  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))
  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()
  return(STAVPTF)
}
and call it as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example. That is easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as simple cumulative sum:
KOKOKO <- data.frame %>%
  arrange(DATE) %>% # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>% # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding the current DATE
Even the STAVPTF code can probably be rewritten without iteration. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then this value is assigned to the next DATE:
STAVPTF <- data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>% # if necessary
  mutate(DATE = lead(DATE))
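If you still need one data.frame per day stored in a list, as the question asks, the vectorized result can simply be split by date afterwards (a short usage sketch):

stavptf_list <- split(STAVPTF, STAVPTF$DATE)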
Purpose
I am trying to check whether a pair of values in two columns appear in the previous event, and aggregate the dummy variables by groups.
Specifically, I have each event id (i.e., oid) and dyad-level observations associated with each event: agent (i.e., aid), partner (i.e., pid). The events are sorted by time when the event occurs (i.e., o4.in).
(1) I made a dummy variable indicating whether a pair of agent and partner appear together in the previous event.
(2) Also, I used ddply to aggregate the dummy variable by groups, as specified in the example below.
I find that the ddply and lag functions take a lot of time with a large dataset, and I am wondering if there is a faster way to achieve these tasks.
Dataset
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10001)
cases <- sample(1:5, 1000, replace=T)
set.seed(10002)
agent <- sample(1:20, 1000, replace=T)
set.seed(10003)
partner <- sample(1:20, 1000, replace=T)
set.seed(123)
n <- 1000 # no of random datetimes needed
minDate <- as.POSIXct("1999/01/01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <-
as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)), origin = epoch)
df <- data.frame(cases, agent, partner, timestamps) %>%
  rename(
    aid = agent,
    pid = partner,
    oid = cases,
    o4.in = timestamps
  ) %>%
  filter(aid != pid)
Current Methods
# creating the dummy variable
d <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  ungroup() %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup() %>%
  mutate(
    j2.consecutive = ifelse(oid.lag.a == oid.lag.p, 1, 0),
    j2.consecutive = ifelse(is.na(j2.consecutive), 0, j2.consecutive)
  ) %>%
  select(-oid.lag.a, -oid.lag.p)
# aggregating the dummy variable by groups (ddply is from plyr)
t <- d %>%
  ungroup() %>%
  plyr::ddply(c('oid', 'aid'), function(i) {
    i %>%
      mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1))
  }, .progress = 'text') %>%
  arrange(oid, pid) %>%
  plyr::ddply(c('oid', 'pid'), function(i) {
    i %>%
      mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1))
  }, .progress = 'text')
Update for Future Readers
Task (1) is achieved by the answer by #akrun below.
Task (2) solution is answered by #akrun in a separate post: A faster way than applying 'ddply' to aggregate a variable by a function by groups
Special thanks to #akrun!!
We can use data.table methods to make it faster
library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some things in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary: that can just be as.integer or coercion with +. The second ifelse line can likewise be replaced with coalesce.
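A minimal demo of that coercion, with illustrative values (output shown as comments):

+(c(TRUE, FALSE, NA)) # unary + coerces logical to integer
#> [1]  1  0 NA
dplyr::coalesce(+(c(TRUE, FALSE, NA)), 0) # coalesce then replaces the NA with 0
#> [1] 1 0 0

The rewritten dplyr version then becomes: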
library(dplyr)

out <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup() %>%
  mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))
Checking that the dplyr and data.table outputs match:
all(out$j2.consecutive == df3$j2.consecutive )
[1] TRUE
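For task (2), the same leave-one-out group means can also be computed with vectorized grouped mutate() calls, avoiding ddply's per-group function overhead. This is a sketch of the same computation, not the solution from the linked post, and the row ordering may differ from the ddply output:

library(dplyr)

t2 <- d %>%
  group_by(oid, aid) %>%
  mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1)) %>%
  group_by(oid, pid) %>%
  mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1)) %>%
  ungroup() %>%
  arrange(oid, pid)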
I have a dataframe that I want to group and then compute the median of the lagged differences (diff) in consistent units. It is very similar to the example below. As you can see by running the code, I have problems because diff returns a difftime whose units attribute is not taken into account by my summarise function.
library(tidyverse)

# Initialise random data
t <- Sys.time()
rnd <- sample(1:10000, 10, replace = FALSE)
add <- rnd[order(rnd)]

# Create 2 dataframes
time1 <- data.frame(datetime = t + add)
time2 <- data.frame(datetime = t + add * 1000)

# Bind the dataframes together
mydata <- bind_rows(time1, time2, .id = "group")

# Trying to get a summary table
mydata %>%
  group_by(group) %>%
  summarise(elapsed = median(diff(datetime[order(datetime)])))

# These are the values that I should get in my summary table
median(diff(time1$datetime))
median(diff(time2$datetime))
What about using difftime and setting the units?
mydata %>%
group_by(group) %>%
summarise(elapsed = median(difftime(datetime, lag(datetime), units = "mins"), na.rm = TRUE))
Here's one option, which will show all results in seconds. Use dminutes(1) or dhours(1) or ddays(1) if more appropriate.
mydata %>%
group_by(group) %>%
summarise(elapsed = median(diff(datetime[order(datetime)])) / lubridate::dseconds(1))
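If you prefer to handle the units in base R, you can also force the difftime units before taking the median; a sketch on the same data:

mydata %>%
  group_by(group) %>%
  summarise(elapsed = {
    d <- diff(sort(datetime))
    units(d) <- "secs" # force consistent units before the median
    median(d)
  })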