I have a data set that I would like to split into 10-day intervals. The code that I included below does that, but at the end of a month some days (e.g., the 30th or 31st) end up in an interval by themselves.
I would like to either remove the intervals that create this or include them in the previous intervals.
For example:
If I separate the month of January by 10-day intervals, it would put the first 10 days in an element of a list, the second 10 days into another element and the third 10 days into another one. It would then put January 31st into an element of the list by itself.
My desired output would be to either remove these elements from the list or more preferably include them in the third 10-day interval. Can that be done? If so, what would be the best way to do so?
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
int <- df %>%
arrange(ID) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
# mutate(cut = data.table::rleid(cut(new, breaks = "10 day"))) %>%
group_by(new) %>%
group_split()
Here is a solution which splits the months by 10-day intervals but corrects new to assign day 31 of a month to the last period. So,
days 1 to 10 belong to the first third of a month,
days 11 to 20 to the second third, and
days 21 to 31 to the third third.
int <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(new) %>%
group_split()
int[[1]]
# A tibble: 6 x 5
date x y ID new
<date> <dbl> <dbl> <int> <date>
1 2010-12-26 71469. 819084. 1 2010-12-21
2 2010-12-27 69417. 893227. 2 2010-12-21
3 2010-12-28 70865. 831341. 3 2010-12-21
4 2010-12-29 68322. 812423. 4 2010-12-21
5 2010-12-30 65643. 837395. 5 2010-12-21
6 2010-12-31 63638. 892200. 1 2010-12-21
Now, 2010-12-31 was assigned to the third third of December.
Note that new indicates the start of the interval because floor_date() is called instead of ceiling_date(). This is done to avoid potential problems with day arithmetic across month boundaries and to make clear which month the interval belongs to. For instance, for the last day of February, ceiling_date(ymd('2011-02-28'), '10 day') returns "2011-03-03", which is a date in March.
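The three-thirds rule described above (days 1 to 10, 11 to 20, and 21 to the end of the month) can also be written as plain integer arithmetic on the day of the month. The following is a minimal base-R sketch; third_start() is a hypothetical helper name, not part of the answer's code, and it sidesteps floor_date() entirely rather than reproducing it.

```r
# Hypothetical helper: map each date to the first day of its third of the month.
# Days 21 to 31 all map to day 21, so the 31st never forms its own group.
third_start <- function(d) {
  dom <- as.integer(format(d, "%d"))        # day of month, 1..31
  idx <- pmin((dom - 1L) %/% 10L, 2L)       # 0, 1 or 2 (capped at 2)
  as.Date(paste0(format(d, "%Y-%m-"), sprintf("%02d", idx * 10L + 1L)))
}

d <- as.Date(c("2011-01-10", "2011-01-11", "2011-01-31", "2011-02-28"))
third_start(d)
```

Because the result always lies in the same month as the input, no interval label can spill into the following month.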
If there is a single row in a group give it the previous new value. Try this -
library(dplyr)
library(lubridate)
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
add_count(new) %>%
mutate(new = if_else(n == 1, lag(new), new)) %>%
select(-n) %>%
group_split(new)
The above only works for merging groups that have a single observation. If we want to merge groups that span more than one day, use the code below, which counts the number of distinct days in each group and merges the group into the previous one if that count is less than n.
n <- 2
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day'),
ID = match(new, unique(new))) -> tmp
tmp %>%
group_by(new, ID) %>%
summarise(count_unique = n_distinct(date)) %>%
ungroup %>%
mutate(new = if_else(count_unique < n, lag(new), new)) %>%
inner_join(tmp, by = 'ID') %>%
select(new = new.x, date, x, y) %>%
group_split(new)
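The merging idea used in this answer (relabel a group that is too small with the label of the group before it) can be sketched on a plain vector of group labels. This is an illustrative base-R version under the assumption that the labels are already sorted so that equal labels are contiguous; merge_small_groups() is a hypothetical name, and it counts rows per group rather than distinct dates.

```r
# Hypothetical helper: any run of labels shorter than min_size takes the
# label of the run immediately before it.
merge_small_groups <- function(key, min_size) {
  r <- rle(as.character(key))
  vals <- r$values
  for (i in seq_along(vals)[-1]) {          # the first run has no predecessor
    if (r$lengths[i] < min_size) vals[i] <- vals[i - 1]
  }
  rep(vals, r$lengths)
}

merge_small_groups(c("a", "a", "a", "b", "c", "c", "c"), min_size = 2)
# "a" "a" "a" "a" "c" "c" "c"
```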
Alternative solution
library(lubridate)
library(tidyverse)
dt <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(dt = dt,
x = runif(length(dt), min = 60000, max = 80000),
y = runif(length(dt), min = 800000, max = 900000),
ID)
Include extra days (31st) into the last third
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
TRUE ~ 21
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 6 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
#> 6 2010-12-31 61004. 876191. 1 2010-12-21
Exclude extra days (31st)
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
day_date <= 30 ~ 21,
TRUE ~ 31
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
filter(day_date != 31) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 5 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
Created on 2021-07-03 by the reprex package (v2.0.0)
Related
I have a long time series data frame grouped by id. The series have different start dates and also missing observations. I want to complete the missing observations by completing the date and id and filling the value with 0.
What I want to avoid in the process is completing the missing observations at the beginning, because those only indicate that the time series has a later starting point (a different product launch date, for example).
In my reprex I used complete from tidyr. It does the opposite of what I want. Instead of completing the id "A1" with "2015-01-04", it completes the id "B1" with "2015-01-01", which is not needed in this case. Does complete always create groups of the same size? Maybe then it is the wrong function.
How can I achieve the opposite in the following example?
library(tidyr)
data <- data.frame (id = as.character(c(rep("A1",6),rep("B1",5))),
value = c(seq( 1, 9, length.out = 11)),
date = as.Date(c(c("2015-01-01","2015-01-02","2015-01-03",
"2015-01-05","2015-01-06","2015-01-07"),
c("2015-01-02","2015-01-03","2015-01-05",
"2015-01-06","2015-01-07")
)
)
)
data %>% complete(date, id, fill = list(value = 0))
You need to provide the dates to fill explicitly:
data %>%
group_by(id) %>%
complete(date = seq(min(date), max(date), by = 1), fill = list(value = 0))
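To make the effect of filling each id only between its own first and last date concrete, here is a small base-R sketch of the same idea. The toy data and the helper name fill_gaps() are assumptions for illustration, and merge() stands in for complete().

```r
# Toy data (an assumption): A1 starts on Jan 1, B1 starts on Jan 2,
# and both are missing Jan 3.
data <- data.frame(
  id    = c("A1", "A1", "A1", "B1", "B1"),
  date  = as.Date(c("2015-01-01", "2015-01-02", "2015-01-04",
                    "2015-01-02", "2015-01-04")),
  value = c(1, 2, 3, 4, 5)
)

# Hypothetical helper: fill the gaps of one id between its own min and max date.
fill_gaps <- function(d) {
  full <- data.frame(id = d$id[1],
                     date = seq(min(d$date), max(d$date), by = "day"))
  out <- merge(full, d, by = c("id", "date"), all.x = TRUE)
  out$value[is.na(out$value)] <- 0          # gaps inside the series become 0
  out
}

filled <- do.call(rbind, lapply(split(data, data$id), fill_gaps))
filled
```

In this sketch B1 gains a row for Jan 3 but none for Jan 1, because its sequence starts at its own first date.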
Doing it rectangularly is easiest to express.
You can reintroduce the missingness as follows:
data %>%
tidyr::complete(date, id, fill = list(value = 0)) %>%
dplyr::group_by(id) %>%
dplyr::arrange(date) %>% # Ensure it's sorted by date
dplyr::filter(!cumall(value == 0)) %>% # Don't keep zeros that didn't have non-0 rows before
dplyr::ungroup()
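The filter step relies on cumall(), which stays TRUE only until the first FALSE. A rough base-R equivalent, shown here on a single made-up series, is cumprod() on the logical vector; this is a sketch of the logic, not a replacement for the pipeline above.

```r
# One id's values in date order after a rectangular complete():
# two leading zeros were added, one zero is a real gap in the middle.
value <- c(0, 0, 5.8, 0, 7.4)

# TRUE while every value so far has been zero (same idea as cumall(value == 0))
leading_zero <- cumprod(value == 0) == 1

value[!leading_zero]
# 5.8 0.0 7.4
```

Note that a genuine zero observation at the very start of a series would be dropped by this rule as well.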
library(tidyverse)
data <- data.frame(
id = as.character(c(rep("A1", 6), rep("B1", 5))),
value = c(seq(1, 9, length.out = 11)),
date = as.Date(c(
c(
"2015-01-01", "2015-01-02", "2015-01-03",
"2015-01-05", "2015-01-06", "2015-01-07"
),
c(
"2015-01-02", "2015-01-03", "2015-01-05",
"2015-01-06", "2015-01-07"
)
))
)
all_dates <- seq(min(data$date), max(data$date), by = "day") %>% as.character()
# complete all dates for each id
data %>%
as_tibble() %>%
group_by(id) %>%
mutate(date = date %>% as.character() %>% factor(levels = all_dates)) %>%
complete(date, fill = list(value = 0)) %>%
mutate(date = date %>% as.Date())
#> # A tibble: 14 × 3
#> # Groups: id [2]
#> id date value
#> <chr> <date> <dbl>
#> 1 A1 2015-01-01 1
#> 2 A1 2015-01-02 1.8
#> 3 A1 2015-01-03 2.6
#> 4 A1 2015-01-04 0
#> 5 A1 2015-01-05 3.4
#> 6 A1 2015-01-06 4.2
#> 7 A1 2015-01-07 5
#> 8 B1 2015-01-01 0
#> 9 B1 2015-01-02 5.8
#> 10 B1 2015-01-03 6.6
#> 11 B1 2015-01-04 0
#> 12 B1 2015-01-05 7.4
#> 13 B1 2015-01-06 8.2
#> 14 B1 2015-01-07 9
Created on 2022-04-01 by the reprex package (v2.0.0)
This is not very elegant, but it works.
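# all calendar days in the data and all ids (not defined in the original snippet)
dates <- seq(min(data$date), max(data$date), by = "day")
ids <- unique(data$id)
data.frame(date = rep(dates, length(ids)),
id = rep(ids, each = length(dates))) |>
full_join(data, by = c("date", "id")) |>
arrange(id, date) |>
group_by(id) |>
# drops only a missing first day; longer leading gaps need a different test
filter(!is.na(value) | row_number() > 1) |>
mutate(value = replace_na(value, 0)) |>
ungroup()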
I have a data set that I would like to separate by 10-day intervals. For example, I would like to get all of the dates from 26-12-2010 to 04-01-2011 for ID 1 together, then the next 10 days for ID 1 together. I would like to do this for each ID, and compile the 10-day intervals into a list.
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df %>%
mutate(interval = map(1:50, ~rep(.x, 10)) %>% reduce(c)) %>%
group_split(interval) %>%
map(~arrange(.x, ID)) %>%
map(~ group_split(.x, ID)) %>%
head(2)
When I use the last lines of code, the data are split by days and IDs, but the observations that are supposed to fall within the same 10 days are not grouped together.
I had difficulty understanding your desired output yesterday, and I am not sure why you don't start by arranging all IDs first. I hope this is what you are looking for:
library(dplyr)
library(magrittr)
# slicing first 2 elements only
df %>%
arrange(ID) %>%
mutate(cut = data.table::rleid(cut(date, breaks = "10 day"))) %>%
group_split(ID, cut) %>%
extract(1:2)
[[1]]
# A tibble: 2 x 5
date x y ID cut
<date> <dbl> <dbl> <int> <int>
1 2010-12-26 73719. 803002. 1 1
2 2010-12-31 66825. 870527. 1 1
[[2]]
# A tibble: 2 x 5
date x y ID cut
<date> <dbl> <dbl> <int> <int>
1 2011-01-05 63023. 807545. 1 2
2 2011-01-10 76356. 875837. 1 2
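The grouping above comes from cut() applied to dates with breaks = "10 day", which starts counting at the earliest date. A minimal base-R sketch on a made-up run of 20 consecutive days shows the resulting bucket numbers; the variable names are illustrative only.

```r
# 20 consecutive days starting at the first date of the example data
dates <- as.Date("2010-12-26") + 0:19

# cut.Date() accepts "10 days": buckets start at the earliest date
bucket <- as.integer(cut(dates, breaks = "10 days"))

split(dates, bucket)   # list of two 10-day blocks
```

The first block runs from 2010-12-26 to 2011-01-04 and the second starts on 2011-01-05, which agrees with the cut column in the output above.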
I have a data frame like the one below, and I want to get the sum of VALUE for each rolling 4-month window.
Edit: In the output I have "2018-12". But it's not shown in the input. It's a typo, my actual data contain "2018-12".
I prefer to use dplyr:
group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group,Month,VALUE)
d_f %>%
group_by(group) %>%
summarise(value = sum(VALUE))
Can anyone please help me with how to handle the rolling 4-month window? Thanks a lot for your valuable time.
Using lubridate you can use floor_date and group your dates by 4 month intervals.
library(tidyverse)
library(lubridate)
d_f %>%
mutate(date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")) %>%
arrange(date) %>%
group_by(group, startdategroup = floor_date(date, "4 months")) %>%
summarise(value = sum(VALUE)) %>%
mutate(enddategroup = startdategroup %m+% months(4) - 1)
Output
# A tibble: 6 x 4
# Groups: group [2]
group startdategroup value enddategroup
<fct> <date> <dbl> <date>
1 green 2019-01-01 20 2019-04-30
2 green 2019-05-01 210 2019-08-31
3 green 2019-09-01 330 2019-12-31
4 red 2019-01-01 80 2019-04-30
5 red 2019-05-01 50 2019-08-31
6 red 2019-09-01 90 2019-12-31
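As the output shows, flooring to "4 months" yields fixed calendar blocks starting in January, May and September. The same bucketing can be sketched with integer arithmetic on the month number; this is an illustration of the grouping rule seen in the output, not of lubridate's internals.

```r
# Month numbers 1..12 mapped to the first month of their 4-month block
month_num <- c(1, 3, 5, 8, 10)
bucket_start <- (month_num - 1) %/% 4 * 4 + 1
bucket_start
# 1 1 5 5 9
```

These are fixed blocks rather than a sliding window, so each row contributes to exactly one sum.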
Edit: To allow for an "overlap month" (months on the edge of two sequential date intervals), I might take a different approach.
First, I might create a sequence of start and end dates for the intervals (based on minimum and maximum dates in your data frame). The sequence would have date intervals every 4 months.
Then, I would do a fuzzy_left_join (using >= and <= logic) and merge this new data frame with yours. Then a row of data for a single month could be counted twice (once for each of two different intervals).
library(fuzzyjoin)
d_f$date = as.Date(paste0(d_f$Month, '-01'), format = "%Y-%m-%d")
d_f2 <- data.frame(date_start = seq.Date(min(d_f$date), max(d_f$date), "4 months"))
d_f2$date_end = d_f2$date_start %m+% months(4)
d_f %>%
fuzzy_left_join(d_f2,
by = c("date" = "date_start", "date" = "date_end"),
match_fun = list(`>=`, `<=`)) %>%
group_by(group, date_start, date_end) %>%
summarise(value = sum(VALUE))
Output
# A tibble: 6 x 4
# Groups: group, date_start [6]
group date_start date_end value
<fct> <date> <date> <dbl>
1 green 2019-01-01 2019-05-01 20
2 green 2019-05-01 2019-09-01 210
3 green 2019-09-01 2020-01-01 330
4 red 2019-01-01 2019-05-01 130
5 red 2019-05-01 2019-09-01 140
6 red 2019-09-01 2020-01-01 90
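The double counting of a boundary month follows from using >= on the start and <= on the end of each interval. A short base-R check, using the interval bounds from the output above, shows that a date on the boundary matches two intervals.

```r
starts <- as.Date(c("2019-01-01", "2019-05-01", "2019-09-01"))
ends   <- as.Date(c("2019-05-01", "2019-09-01", "2020-01-01"))

# 2019-05-01 is both the end of interval 1 and the start of interval 2
d <- as.Date("2019-05-01")
which(d >= starts & d <= ends)
# 1 2
```

This is why red's value of 50 for 2019-05 appears in both the 130 and the 140 totals.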
One approach is to use the lag/lead functions in dplyr. Something like:
# assumes one row per group and month, with no missing months
df2 = d_f %>%
group_by(group) %>%
mutate(prev_value = lag(VALUE, 1, order_by = Month),
prev_value2 = lag(VALUE, 2, order_by = Month),
prev_value3 = lag(VALUE, 3, order_by = Month)) %>%
mutate(avg = (VALUE + prev_value + prev_value2 + prev_value3) / 4)
And then filter away the intervals you are not interested in.
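If a sum over the current and three previous rows is wanted instead of chaining lags by hand, base R's stats::filter() can compute it in one call. This sketch assumes one value per month, in order and without gaps; the vector x is made up for illustration.

```r
# One value per consecutive month (assumption: no gaps, already sorted)
x <- c(10, 20, 30, 40, 50)

# sides = 1 uses the current value and the three before it
rolling_sum <- as.numeric(stats::filter(x, rep(1, 4), sides = 1))
rolling_sum
# NA NA NA 100 140
```

The first three positions are NA because fewer than four values are available there.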
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure if it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
One option is to convert 'date' to Date class and, grouped by 'ID', take the absolute difference between 'date' and its lag. Checking whether that difference is greater than 365 and applying cumsum gives a grouping index; summarise then counts the distinct elements of 'meds' within each group, and the maximum per 'ID' is the result.
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>%
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
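The grouping index in this answer is built by starting a new group whenever the gap to the previous row exceeds 365 days. A minimal base-R sketch of just that step, using the dates of ID 1 from the question, looks like this; the variable names are illustrative.

```r
# Dates for ID 1 from the question, in row order
dates <- as.Date(c("2003-11-24", "2003-11-24", "2004-01-09",
                   "2013-11-25", "2013-11-26", "2013-11-26"))

gap <- c(0, abs(as.numeric(diff(dates))))   # days since the previous row
grp <- cumsum(gap > 365)                    # new group after each long gap
grp
# 0 0 0 1 1 1
```

Note that this chains consecutive rows together, so two rows in the same group can be more than 365 days apart if every step between them is shorter than that.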
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
I want to calculate a conditional sum based on specified dates in R. My sample df is:
start_date = c("7/24/2017", "7/1/2017", "7/25/2017")
end_date = c("7/27/2017", "7/4/2017", "7/28/2017")
`7/23/2017` = c(1,5,1)
`7/24/2017` = c(2,0,2)
`7/25/2017` = c(0,0,10)
`7/26/2017` = c(2,2,2)
`7/27/2017` = c(0,0,0)
df = data.frame(start_date,end_date,`7/23/2017`,`7/24/2017`,`7/25/2017`,`7/26/2017`,`7/27/2017`)
In Excel it looks like:
I want to perform calculations as specified in Column H which is a conditional sum of columns C through G based on the dates specified in columns A and B.
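Excel allows dates as column headers, but R's data.frame() converts them to syntactic names (e.g. X7.23.2017) by default, so it is easier to reshape the data to long format first.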
#wide to long format
dat <- reshape(df, direction="long", varying=list(names(df)[3:7]), v.names="Value",
idvar=c("start_date","end_date"), timevar="Date",
times=seq(as.Date("2017/07/23"),as.Date("2017/07/27"), "day"))
#convert from factor to date class
dat$end_date <- as.Date(dat$end_date, format = "%m/%d/%Y")
dat$start_date <- as.Date(dat$start_date, format = "%m/%d/%Y")
library(dplyr)
dat %>% group_by(start_date, end_date) %>%
mutate(mval = ifelse(between(Date, start_date, end_date), Value, 0)) %>%
summarise(conditional_sum=sum(mval))
# # A tibble: 3 x 3
# # Groups: start_date [?]
# start_date end_date conditional_sum
# <date> <date> <dbl>
# 1 2017-07-01 2017-07-04 0
# 2 2017-07-24 2017-07-27 4
# 3 2017-07-25 2017-07-28 12
You could achieve that as follows:
# number of leading columns without numeric values
c = 2
# create a separate vector with the dates
dates = as.Date(gsub("X","",tail(colnames(df),-c)),format="%m.%d.%Y")
# convert date columns in dataframe
df$start_date = as.Date(df$start_date,format="%m/%d/%Y")
df$end_date = as.Date(df$end_date,format="%m/%d/%Y")
# calculate sum
sapply(1:nrow(df),function(x) {y = df[x,(c+1):ncol(df)][dates %in%
seq(df$start_date[x],df$end_date[x],by="day") ]; ifelse(length(y)>0,sum(y),0) })
returns:
[1] 4 0 12
Hope this helps!
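The same conditional sum can also be expressed as a logical mask over a numeric matrix, which avoids looping over rows. This is a self-contained base-R sketch that re-enters the question's numbers by hand; the object names are assumptions.

```r
# Column dates and the value matrix from the question (one row per record)
dates <- as.Date("2017-07-23") + 0:4
vals  <- rbind(c(1, 2, 0, 2, 0),
               c(5, 0, 0, 2, 0),
               c(1, 2, 10, 2, 0))
start <- as.Date(c("2017-07-24", "2017-07-01", "2017-07-25"))
end   <- as.Date(c("2017-07-27", "2017-07-04", "2017-07-28"))

# in_range[i, j] is TRUE when column date j lies within [start[i], end[i]]
in_range <- outer(as.numeric(start), as.numeric(dates), "<=") &
            outer(as.numeric(end),   as.numeric(dates), ">=")

rowSums(vals * in_range)
# 4 0 12
```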
Here's a solution all in one dplyr pipe:
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
gather(date, value, -c(1, 2)) %>%
mutate(date = gsub('X', '', date)) %>%
mutate(date = gsub('\\.', '/', date)) %>%
mutate(date = mdy(date)) %>%
filter(date >= mdy(start_date) & date <=mdy(end_date)) %>%
group_by(start_date, end_date) %>%
summarize(Conditional_Sum = sum(value)) %>%
right_join(df) %>%
mutate(Conditional_Sum = ifelse(is.na(Conditional_Sum), 0, Conditional_Sum)) %>%
select(-one_of('Conditional_Sum'), one_of('Conditional_Sum'))
## start_date end_date X7.23.2017 X7.24.2017 X7.25.2017 X7.26.2017 X7.27.2017 Conditional_Sum
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7/24/2017 7/27/2017 1 2 0 2 0 4
## 2 7/1/2017 7/4/2017 5 0 0 2 0 0
## 3 7/25/2017 7/28/2017 1 2 10 2 0 12