I have a data frame like the one below, and I want to get the sum of VALUE over each rolling 4-month window.
Edit: The output shows "2018-12", which does not appear in the input. That is a typo here; my actual data contains "2018-12".
I would prefer to use dplyr:
group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group,Month,VALUE)
library(dplyr)
d_f %>%
group_by(group) %>%
summarise(value = sum(VALUE)) # the column is named VALUE (R is case-sensitive)
Can anyone please help me with how to handle the rolling 4-month window? Thanks a lot for your valuable time.
With lubridate, you can use floor_date() to group your dates into fixed 4-month intervals.
library(tidyverse)
library(lubridate)
d_f %>%
mutate(date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")) %>%
arrange(date) %>%
group_by(group, startdategroup = floor_date(date, "4 months")) %>%
summarise(value = sum(VALUE)) %>%
mutate(enddategroup = startdategroup %m+% months(4) - 1)
Output
# A tibble: 6 x 4
# Groups: group [2]
group startdategroup value enddategroup
<fct> <date> <dbl> <date>
1 green 2019-01-01 20 2019-04-30
2 green 2019-05-01 210 2019-08-31
3 green 2019-09-01 330 2019-12-31
4 red 2019-01-01 80 2019-04-30
5 red 2019-05-01 50 2019-08-31
6 red 2019-09-01 90 2019-12-31
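To make the binning explicit, here is a minimal base R sketch of the arithmetic that matches the interval starts in the output above (January, May, September). The month strings are illustrative, not the question's full data, and the formula is an assumption about how the 4-month bins line up, checked only against that output.

```r
# Illustrative month strings in "YYYY-MM" form (not the full data set)
Month <- c("2019-01", "2019-03", "2019-05", "2019-08", "2019-09", "2019-12")

# Month number 1..12
m <- as.integer(substr(Month, 6, 7))

# Months 1-4 map to 1, 5-8 map to 5, 9-12 map to 9
start_month <- ((m - 1L) %/% 4L) * 4L + 1L

# First day of the 4-month bin each month falls into
bin_start <- as.Date(sprintf("%s-%02d-01", substr(Month, 1, 4), start_month))
bin_start
```

These bins are fixed calendar blocks, so they do not slide month by month the way a rolling window would.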
Edit: To allow for an "overlap month" (months on the edge of two sequential date intervals), I might take a different approach.
First, I might create a sequence of start and end dates for the intervals (based on minimum and maximum dates in your data frame). The sequence would have date intervals every 4 months.
Then, I would do a fuzzy_left_join (using >= and <= logic) and merge this new data frame with yours. Then a row of data for a single month could be counted twice (once for each of two different intervals).
library(fuzzyjoin)
d_f$date <- as.Date(paste0(d_f$Month, '-01'), format = "%Y-%m-%d")
d_f2 <- data.frame(date_start = seq.Date(min(d_f$date), max(d_f$date), "4 months"))
# date_start lives inside d_f2, so it has to be referenced as d_f2$date_start
d_f2$date_end <- d_f2$date_start %m+% months(4)
d_f %>%
fuzzy_left_join(d_f2,
by = c("date" = "date_start", "date" = "date_end"),
match_fun = list(`>=`, `<=`)) %>%
group_by(group, date_start, date_end) %>%
summarise(value = sum(VALUE))
Output
# A tibble: 6 x 4
# Groups: group, date_start [6]
group date_start date_end value
<fct> <date> <date> <dbl>
1 green 2019-01-01 2019-05-01 20
2 green 2019-05-01 2019-09-01 210
3 green 2019-09-01 2020-01-01 330
4 red 2019-01-01 2019-05-01 130
5 red 2019-05-01 2019-09-01 140
6 red 2019-09-01 2020-01-01 90
One approach is to use the lag()/lead() functions in dplyr. Note that lag() steps back by rows, not by calendar months, so this assumes one row per group per month with no gaps. Something like:
df2 <- d_f %>%
group_by(group) %>%
mutate(prev_value = lag(VALUE, 1, order_by = Month),
prev_value2 = lag(VALUE, 2, order_by = Month),
prev_value3 = lag(VALUE, 3, order_by = Month)) %>%
mutate(avg = (VALUE + prev_value + prev_value2 + prev_value3) / 4)
And then filter away the intervals you are not interested in.
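If the months have gaps or duplicates, as in the question's data, a calendar-based window may be safer than counting rows. The following is a minimal base R sketch, not any answerer's method: it collapses the data to one row per group and month, then sums each month together with the three calendar months before it. The names monthly, idx and roll4 are made up for illustration.

```r
group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group, Month, VALUE)

# One row per group and month
monthly <- aggregate(VALUE ~ group + Month, data = d_f, FUN = sum)

# Running month index so that consecutive months differ by exactly 1
yr <- as.integer(substr(monthly$Month, 1, 4))
mo <- as.integer(substr(monthly$Month, 6, 7))
monthly$idx <- yr * 12L + mo

# For each row: sum of the same group's values in this month and the 3 before it
monthly$roll4 <- vapply(seq_len(nrow(monthly)), function(i) {
  in_window <- monthly$group == monthly$group[i] &
    monthly$idx <= monthly$idx[i] &
    monthly$idx > monthly$idx[i] - 4L
  sum(monthly$VALUE[in_window])
}, numeric(1))

monthly[order(monthly$group, monthly$Month), c("group", "Month", "VALUE", "roll4")]
```

Months with no data simply contribute nothing to the window; they do not get a row of their own in the result.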
Related
I have a data frame that contains events and dates. I want to find the patients who had the event at least 4 times in any 14-day period and, for each such patient, return the ID and the earliest date that starts a run of 4 or more events. How can I do this in R?
the data frame:
df <-data.frame(ID=c("P01","P01","P01","P01","P01","P01","P01","P02","P02","P02","P02","P02","P03","P03","P03","P03","P03","P03"),
date=c("2019-07-08","2019-07-26","2019-07-27","2019-07-30","2019-08-01","2019-08-03","2019-08-05", "2019-09-08","2019-09-14","2020-06-20","2020-06-23","2020-06-30","2019-11-25","2019-11-26","2019-12-11","2019-12-12","2019-12-20","2019-12-23"))
output:
P01 2019-07-26
P03 2019-12-11
Here I convert the dates to Date format, add a counting column, and then group by ID and count how many appearances occur in a 14 day window looking forward (current day + 13 days), and finally filter to only keep the first day for each ID where the window count is 4 or more.
library(dplyr)
df %>%
mutate(date = as.Date(date), count = 1) %>%
arrange(ID, date) %>%
group_by(ID) %>%
mutate(count14d = slider::slide_index_dbl(count, date, sum,
.after = lubridate::days(13))) %>%
filter(date == min(date[count14d >= 4])) %>%
ungroup()
Result is the first rows that begin qualifying "4 in 14" streaks:
# A tibble: 2 × 4
ID date count count14d
<chr> <date> <dbl> <dbl>
1 P01 2019-07-26 1 6
2 P03 2019-12-11 1 4
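As a cross-check that does not need slider, the same forward-looking count can be sketched in base R. This is only an illustrative sketch: the inclusive 14-day window [date, date + 13] follows the description above, and the names first_date and result are made up.

```r
df <- data.frame(
  ID = c("P01","P01","P01","P01","P01","P01","P01","P02","P02","P02","P02","P02","P03","P03","P03","P03","P03","P03"),
  date = as.Date(c("2019-07-08","2019-07-26","2019-07-27","2019-07-30","2019-08-01","2019-08-03","2019-08-05",
                   "2019-09-08","2019-09-14","2020-06-20","2020-06-23","2020-06-30",
                   "2019-11-25","2019-11-26","2019-12-11","2019-12-12","2019-12-20","2019-12-23"))
)

ids <- unique(df$ID)
first_date <- as.Date(rep(NA, length(ids)))

for (k in seq_along(ids)) {
  d <- sort(df$date[df$ID == ids[k]])
  # Number of events in the 14-day window starting at each event date
  n_in_window <- vapply(seq_along(d),
                        function(i) sum(d >= d[i] & d <= d[i] + 13),
                        integer(1))
  hit <- which(n_in_window >= 4)
  if (length(hit) > 0) first_date[k] <- d[hit[1]]
}

# Keep only IDs that have at least one qualifying window
result <- data.frame(ID = ids, date = first_date)
result <- result[!is.na(result$date), ]
result
```

IDs without any qualifying window (P02 here) are dropped by the final NA filter.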
Or, if you want the first event of any kind for these IDs, you could add:
...%>%
select(ID) %>%
left_join(df %>% mutate(date = as.Date(date)) %>%
group_by(ID) %>% summarize(first_event = min(date)))
to get:
Joining, by = "ID"
# A tibble: 2 × 2
ID first_event
<chr> <date>
1 P01 2019-07-08
2 P03 2019-11-25
I have a data set that I would like to split into 10-day intervals. The code I included below does that, but at the end of a month some days (e.g., the 30th or 31st) end up in an interval by themselves.
I would like to either remove the intervals that contain these leftover days or merge them into the previous interval.
For example:
If I separate the month of January into 10-day intervals, the first 10 days go into one element of a list, the second 10 days into another element, and the third 10 days into another. January 31st then ends up in a list element by itself.
My desired output would either remove these elements from the list or, preferably, include them in the third 10-day interval. Can that be done? If so, what would be the best way to do it?
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
int <- df %>%
arrange(ID) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
# mutate(cut = data.table::rleid(cut(new, breaks = "10 day"))) %>%
group_by(new) %>%
group_split()
Here is a solution which splits the months by 10-day intervals but corrects new to assign day 31 of a month to the last period. So,
days 1 to 10 belong to the first third of a month,
days 11 to 20 to the second third, and
days 21 to 31 to the third third.
int <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(new) %>%
group_split()
int[[1]]
# A tibble: 6 x 5
date x y ID new
<date> <dbl> <dbl> <int> <date>
1 2010-12-26 71469. 819084. 1 2010-12-21
2 2010-12-27 69417. 893227. 2 2010-12-21
3 2010-12-28 70865. 831341. 3 2010-12-21
4 2010-12-29 68322. 812423. 4 2010-12-21
5 2010-12-30 65643. 837395. 5 2010-12-21
6 2010-12-31 63638. 892200. 1 2010-12-21
Now, 2010-12-31 was assigned to the third third of December.
Note that new now indicates the start of the interval, because floor_date() is called instead of ceiling_date(). This is done to avoid potential problems with day arithmetic across month boundaries and to make clear which month the interval belongs to. For instance, for the last day of February, ceiling_date(ymd('2011-02-28'), '10 day') returns "2011-03-03", which is a date in March.
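The same assignment of days to thirds can also be written as plain integer arithmetic, which may help when checking edge cases such as the 31st or the end of February. This is a base R sketch with made-up example dates; it does not use floor_date().

```r
dates <- as.Date(c("2011-01-01", "2011-01-10", "2011-01-11", "2011-01-20",
                   "2011-01-21", "2011-01-31", "2011-02-28"))

day_of_month <- as.integer(format(dates, "%d"))

# 0 for days 1-10, 1 for days 11-20, 2 for days 21-31 (day 31 is capped at 2)
third <- pmin((day_of_month - 1L) %/% 10L, 2L)

# Start of the third within the same month: day 1, 11 or 21
start_day <- third * 10L + 1L
new <- as.Date(paste0(format(dates, "%Y-%m-"), sprintf("%02d", start_day)))

data.frame(date = dates, new = new)
```

Because the start day is always 1, 11 or 21, the interval label never crosses into the next month.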
If a group contains only a single row, give that row the previous new value. Try this:
library(dplyr)
library(lubridate)
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
add_count(new) %>%
mutate(new = if_else(n == 1, lag(new), new)) %>%
select(-n) %>%
group_split(new)
The code above only combines groups that contain exactly 1 observation. If we want to merge groups that cover more than 1 day, use the code below, which counts the number of distinct days in each group and merges a group into the previous one when it has fewer than n days.
n <- 2
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day'),
ID = match(new, unique(new))) -> tmp
tmp %>%
group_by(new, ID) %>%
summarise(count_unique = n_distinct(date)) %>%
ungroup %>%
mutate(new = if_else(count_unique < n, lag(new), new)) %>%
inner_join(tmp, by = 'ID') %>%
select(new = new.x, date, x, y) %>%
group_split(new)
Alternative solution
library(lubridate)
library(tidyverse)
dt <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(dt = dt,
x = runif(length(dt), min = 60000, max = 80000),
y = runif(length(dt), min = 800000, max = 900000),
ID)
Include extra days (31st) into the last third
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
TRUE ~ 21
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 6 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
#> 6 2010-12-31 61004. 876191. 1 2010-12-21
Exclude extra days (31st)
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
day_date <= 30 ~ 21,
TRUE ~ 31
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
filter(day_date != 31) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 5 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
Created on 2021-07-03 by the reprex package (v2.0.0)
I am looking to calculate a 3-month rolling sum of the values in one column of a data frame, based on the dates in another column and grouped by product.
newResults data frame columns: Product, Date, Value
In this example, I wish to calculate the rolling sum of Value per Product over 3 months. I have sorted the data frame by Product and Date.
Dataset example: (image of the sample dataset, not reproduced here)
My Code:
newResults = newResults %>%
group_by(Product) %>%
mutate(Roll_12Mth =
rollapplyr(Value, width = 1:n() - findInterval( Date %m-% months(3), date), sum)) %>%
ungroup
Error: Problem with mutate() input Roll_12Mth.
x could not find function "%m-%"
i Input Roll_12Mth is rollapplyr(...).
Output: (image, not reproduced here)
If the dates are always spaced 1 month apart, it is easy.
dat=data.frame(Date=seq(as.Date("2/1/2017", "%m/%d/%Y"), as.Date("1/1/2018", "%m/%d/%Y"), by="month"),
Product=rep(c("A", "B"), each=6),
Value=c(4182, 4822, 4805, 6235, 3665, 3326, 3486, 3379, 3596, 3954, 3745, 3956))
library(zoo)
library(dplyr)
dat %>%
group_by(Product) %>%
arrange(Date, .by_group=TRUE) %>%
mutate(Value=rollapplyr(Value, 3, sum, partial=TRUE))
Date Product Value
<date> <fct> <dbl>
1 2017-02-01 A 4182
2 2017-03-01 A 9004
3 2017-04-01 A 13809
4 2017-05-01 A 15862
5 2017-06-01 A 14705
6 2017-07-01 A 13226
7 2017-08-01 B 3486
8 2017-09-01 B 6865
9 2017-10-01 B 10461
10 2017-11-01 B 10929
11 2017-12-01 B 11295
12 2018-01-01 B 11655
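Two notes on the question's code, offered as a sketch rather than a definitive fix. The error could not find function "%m-%" usually means lubridate is not attached, since %m-% is a lubridate operator. And if the dates are not evenly spaced, a fixed-width rollapplyr() counts rows rather than months. The example below sums by an actual date window instead; the data and the column name Roll_3Mth are made up for illustration, and the window is assumed to be (Date minus 3 months, Date].

```r
library(lubridate)  # provides %m-% and months()

# Illustrative data with a gap between March and June
dat <- data.frame(
  Date = as.Date(c("2017-02-01", "2017-03-01", "2017-06-01", "2017-07-01")),
  Product = "A",
  Value = c(1, 2, 4, 8)
)

# For each row, sum the same product's values dated within the last 3 months
dat$Roll_3Mth <- vapply(seq_len(nrow(dat)), function(i) {
  window_start <- dat$Date[i] %m-% months(3)
  in_window <- dat$Product == dat$Product[i] &
    dat$Date > window_start &
    dat$Date <= dat$Date[i]
  sum(dat$Value[in_window])
}, numeric(1))

dat
```

With monthly first-of-month dates, this window covers the current month and the two before it, so the June row does not pick up the March value.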
I have a dataset that I'd now like to split at 12:00pm (midday) into two, i.e. if variable goes from 08:00-13:00 it becomes 08:00-12:00 and 12:00-13:00 across two rows. The variable duration and cumulative sum would need to be changed accordingly, but the other variables should be as in the original (unchanged).
This should be applicable across different id variables.
id = unchanged from row 1, just repeated
start = changed in both rows
end = changed in both rows
day = unchanged from row 1, just repeated
duration = changed in both rows
cumulative time = changed in both rows
ORIGINAL DATAFILE
#Current dataframe
id<-c("m1","m1")
x<-c("2020-01-03 10:00:00","2020-01-03 19:20:00")
start<-strptime(x,"%Y-%m-%d %H:%M:%S")
y<-c("2020-01-03 16:00:00","2020-01-03 20:50:00")
end<-strptime(y,"%Y-%m-%d %H:%M:%S")
day<-c(1,1)
mydf<-data.frame(id,start,end,day)
# calculate duration and time
mydf$duration<-as.numeric(difftime(mydf$end,mydf$start,units = "hours"))
mydf$time<-c(cumsum(mydf$duration))
REQUIRED DATAFILE
#Required dataframe
id2<-c("m1","m1","m1")
x2<-c("2020-01-03 10:00:00","2020-01-03 12:00:00","2020-01-03 19:20:00")
start2<-strptime(x2,"%Y-%m-%d %H:%M:%S")
y2<-c("2020-01-03 12:00:00","2020-01-03 16:00:00","2020-01-03 20:50:00")
end2<-strptime(y2,"%Y-%m-%d %H:%M:%S")
day2<-c(1,1,1)
mydf2<-data.frame(id2,start2,end2,day2)
# calculate duration and time
mydf2$duration<-c(2,4,1.5)
mydf2$time<-c(2,6,7.5)
Good question. So, each line implicitly contains either one or two intervals, so you should be able to just define those interval(s) on each line and then pivot to long, but you can't pivot with interval values (yet?). So, here's my approach, which computes up to two shift start times for each line, and then infers the shift end from the start of the next shift after pivoting. Comments inline.
library(lubridate, warn.conflicts = FALSE)
library(tidyverse)
library(magrittr, warn.conflicts = FALSE)
library(hablar, warn.conflicts = FALSE)
(mydf <- tibble(
id = "m1",
start = as_datetime(c("2020-01-03 10:00:00", "2020-01-03 19:20:00")),
end = as_datetime(c("2020-01-03 16:00:00", "2020-01-03 20:50:00")),
day = 1
))
#> # A tibble: 2 x 4
#> id start end day
#> <chr> <dttm> <dttm> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 1
#> 2 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1
(mydf2 <-
mydf %>%
# Assume the relevant noontime cutoff is on the same day as the start
mutate(midday =
start %>% as_date() %>%
add(12 %>% hours()) %>%
fit_to_timeline() %>%
# No relevant midday if the shift doesn't include noon
na_if(not(. %within% interval(start, end)))) %>%
# Make an original row ID since there doesn't seem to be one, and we will need
# to build intervals within the data stemming from each original row
rownames_to_column("orig_shift") %>%
pivot_longer(cols = c(start, midday, end),
# The timestamps we have here will be treated as start times
values_to = "start",
# Drop rows that would exist due to irrelevant middays
values_drop_na = TRUE) %>%
select(-name) %>%
# Infer shift end times as the start of the next shift, within lines defined
# by the original shifts
group_by(orig_shift) %>%
arrange(start) %>%
mutate(end = lead(start)) %>%
ungroup() %>%
# Drop lines that represent the end of the last shift and not a full one
drop_na() %>%
# Compute those durations and times (should times really be globally
# cumulative? Also, your specified mydf2 seems to have an incorrect first time
# value)
mutate(duration = start %--% end %>% as.numeric("hours"),
time = cumsum(duration)) %>%
select(id, start, end, day, duration, time))
#> # A tibble: 3 x 6
#> id start end day duration time
#> <chr> <dttm> <dttm> <dbl> <dbl> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2 2
#> 2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4 6
#> 3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
Created on 2019-10-23 by the reprex package (v0.3.0)
Here is my solution for the more general case where you have many observations on different dates. The logic is as follows.
First, I create a data frame of 12:00pm (midday) split points.
Next, I identify the rows that should be split by joining this data frame to the original one, and save them in a separate data frame.
Next, I duplicate those rows and create split_rows.
Finally, I delete the rows that were split from the original dataset and join in the corrected, doubled rows.
library(dplyr)
split_time_data =
tibble(split_time = as.POSIXct(seq(0, 365*60*60*24, 60*60*24),
origin="2020-01-01 17:00:00")) %>%
mutate(key = TRUE) # I use 17:00 to make it 12:00 EST; adjust for your purposes
data_to_split =
mydf %>%
mutate(key = TRUE) %>%
left_join(split_time_data) %>%
filter(between(split_time, start, end)) %>%
select(-key)
library(lubridate)
split_rows =
data_to_split %>%
rbind(data_to_split) %>%
arrange(start) %>%
group_by(start) %>%
mutate(row_number = row_number() ) %>%
ungroup() %>%
mutate(start = if_else(row_number == 1, start, split_time ),
end = if_else(row_number == 1, split_time, end )) %>%
select(-row_number, -split_time) %>%
# difftime keeps the minutes; hour(end) - hour(start) would drop them
mutate(duration = as.numeric(difftime(end, start, units = "hours")))
mydf %>%
anti_join(data_to_split) %>%
full_join(split_rows) %>%
arrange(start) %>%
mutate(time = cumsum(duration) )
The output
id start end day duration time
1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2.0 2.0
2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4.0 6.0
3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
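For comparison, the split at midday can also be sketched in base R without any joins or pivots. This is a minimal sketch under stated assumptions: times are treated as UTC, each interval starts and ends on the same day, and the names noon, needs_split, before, after and out are made up.

```r
mydf <- data.frame(
  id = c("m1", "m1"),
  start = as.POSIXct(c("2020-01-03 10:00:00", "2020-01-03 19:20:00"), tz = "UTC"),
  end = as.POSIXct(c("2020-01-03 16:00:00", "2020-01-03 20:50:00"), tz = "UTC"),
  day = c(1, 1)
)

# Midday on the start date of each row
noon <- as.POSIXct(paste(format(mydf$start, "%Y-%m-%d"), "12:00:00"), tz = "UTC")

# Only rows that strictly contain midday need splitting
needs_split <- mydf$start < noon & mydf$end > noon

# First halves: the original rows, cut at noon where needed
before <- mydf
before$end[needs_split] <- noon[needs_split]

# Second halves: copies of the split rows, starting at noon
after <- mydf[needs_split, ]
after$start <- noon[needs_split]

out <- rbind(before, after)
out <- out[order(out$id, out$start), ]
rownames(out) <- NULL

# Recompute duration (hours) and the cumulative time within each id
out$duration <- as.numeric(difftime(out$end, out$start, units = "hours"))
out$time <- ave(out$duration, out$id, FUN = cumsum)
out
```

Here the cumulative time restarts for each id; drop the ave() grouping if a global running total is wanted.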
I am working on a data set which is similar to
data <-tribble(
~id, ~ dates, ~days_prior,
1,20190101, NA,
1,NA, 15,
1,NA, 20,
2, 20190103, NA,
2,NA, 3,
2,NA, 4)
I have the first date for each ID, and I am trying to calculate each following date by adding days_prior to the previous date. I am using the lag function to refer to the previous date.
data <- data %>% mutate(dates = as.Date(ymd(dates)), days_prior = as.integer(days_prior))
data <- data %>% mutate(dates =
as.Date(ifelse(is.na(days_prior), dates, days_prior + lag(dates)),
origin = "1970-01-01"))
This works, but only for the next row, as you can see in the attached data.
What am I doing wrong? I would like all the dates to be calculated by mutate(). What different approach should I take to calculate this?
I don't really see how lag would help here. Unless I have misunderstood, here is an option using tidyr::fill:
data %>%
group_by(id) %>%
mutate(dates = as.Date(ymd(dates))) %>%
fill(dates) %>%
mutate(dates = dates + if_else(is.na(days_prior), 0L, as.integer(days_prior))) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 NA
#2 1 2019-01-16 15
#3 1 2019-01-21 20
#4 2 2019-01-03 NA
#5 2 2019-01-06 3
#6 2 2019-01-07 4
Or a slight variation, replacing the NA entries in days_prior with 0
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + as.integer(days_prior)) %>%
ungroup()
Update
In response to your clarifications in the comments, here is what you can do
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + cumsum(days_prior)) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 0
#2 1 2019-01-16 15
#3 1 2019-02-05 20
#4 2 2019-01-03 0
#5 2 2019-01-06 3
#6 2 2019-01-10 4
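The cumulative version can also be sketched in base R with ave() and cumsum(), which may make the logic easier to follow: carry each id's first date forward, then add the running total of days_prior. This is an illustrative sketch; the helper names first_date, start and offset are made up, and it assumes each id has exactly one observed date.

```r
data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  dates = c(20190101, NA, NA, 20190103, NA, NA),
  days_prior = c(NA, 15, 20, NA, 3, 4)
)

# Parse the yyyymmdd numbers; NA stays NA
first_date <- as.Date(as.character(data$dates), format = "%Y%m%d")

# Each id's observed date, repeated on every row of that id (as day counts)
start <- ave(as.numeric(first_date), data$id,
             FUN = function(x) x[!is.na(x)][1])

# Running total of days_prior within each id, treating NA as 0
offset <- ave(ifelse(is.na(data$days_prior), 0, data$days_prior),
              data$id, FUN = cumsum)

data$dates <- as.Date(start + offset, origin = "1970-01-01")
data
```

The dates here match the output of the updated dplyr answer above; days_prior is left untouched rather than replaced with 0.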
You can use na.locf() from the zoo package to fill in the last observed date before adding the prior days.
library("tidyverse")
library("zoo")
data %>%
# Fill in NA dates with the previous non-NA date
# The `locf` stands for "last observation carried forward"
# Fill in NA days_prior with 0
mutate(dates = zoo::na.locf(dates),
days_prior = replace_na(days_prior, 0)) %>%
mutate(dates = lubridate::ymd(dates) + days_prior)
This solution makes two assumptions:
The rows are sorted by id. You can get around this assumption with a group_by(id) followed by an ungroup() statement, as shown in the solution by Maurits Evers.
For each id, the row with the observed date is first in the group. This must hold with either na.locf or fill, because both functions fill in NAs using the previous non-NA entry.
If you don't want to make any assumptions about the ordering, you can sort the rows at the start with data %>% arrange(id, dates).