Convert from start/end date to cumulate events per time period - r

The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In a more general term; I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I'm trying to create a data frame with one row per date, and a column summing the total of patients (id) that are taking that drug on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm functional with dplyr and tidyr, but unsure how to use spread with dates and durations.

I'd expand out the data to use all dates using a do loop:
library(dplyr)
library(tidyr)
library(zoo)
df %>%
group_by(id, med) %>%
do(with(.,
data_frame(
date = (start_date:end_date) %>% as.Date) ) ) %>%
group_by(date, med) %>%
summarize(frequency = n() ) %>%
spread(med, frequency)

Related

Create a group after a condition in R

I am trying to split my data in two based on a gap in the dates. The problem is that in the real data, the duration of the observations is not constant. I am assigning all values of lactation to be 1, and trying to make everything after the long gap to become two.
What I am trying to do:
Identify the gap in days, if the gap is longer than 20 days, we will start counting from 1 again using group_by and row_number.
The problem here is that the lag() function is not carrying the new value after the condition.
###Code
library(dplyr)
library(lubridate)
#simulating the data
name<-"cow1"
milk<-rnorm(500,15,6)
date1<-seq(ymd('2012-01-01'),ymd('2012-09-06'),by='days') %>% as_tibble()
date2<-seq(ymd('2013-01-01'),ymd('2013-09-07'),by='days') %>% as_tibble()
date<-bind_rows(date1,date2) %>% rename("day"=value)
cow1<- milk %>% as_tibble() %>% rename("Yield"=value) %>% mutate(cowid=name,day=date$value)
cow1.1 <- cow1 %>% mutate(lactation=1) %>%
mutate(gap = day - lag(day, default = day[1])) %>%
mutate(lactation=ifelse(gap>20,lag(lactation)+1,lag(lactation))) %>%
group_by(lactation) %>% mutate(dim=row_numer())
Sample result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 1 1 days
Desired result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 2 1 days

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending a different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, if will create a monthly sequence using seq (or seq.Date) from start to end month (determined from floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence). So, unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when start month and completed month are the same, if that can exist), you can subtract 1 month from the sequence of months created. In this case, you can use pmax so that if both start and end months are the same, it will still count the month).
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)

How to find if a date is within a given time interval

I have the following data frame with dates.
ID start_date end_date Intrvl a_date b_date c_date
1 2013-12-01 2014-05-01 2013-12-01--2014-05-01 2014-01-01 2014-03-10 2015-03-10
2 2016-01-01 2016-07-01 2016-01-01--2016-07-01 2014-02-01 NA 2016-02-01
3 2014-01-01 2014-07-01 2014-01-01--2014-07-01 2014-02-01 2016-02-01 2014-07-01
I want to know,
if the dates from columns a_date, b_date and c_date are within the interval period that I have calculated using
lubridate:: interval (start_date, end_date). In real I have a data frame with 400 columns.
The names of date columns if the dates are within the corresponding interval. Like the output below
ID Within_Intrvl
1 a_b
2 a
3 a_c
I have read the answers of this question [link], but did not help me.
Thank you!
Assuming your data is already converted with lubridate,
input<- df %>%
mutate(start_date=ymd(start_date)) %>%
mutate(end_date=ymd(end_date)) %>%
mutate(a_date=ymd(a_date)) %>%
mutate(b_date=ymd(b_date)) %>%
mutate(c_date=ymd(c_date)) %>%
mutate(Intrvl=interval(start_date, end_date))
you could use the %within% operator in lubridate
result <- input %>%
mutate(AinIntrvl=if_else(a_date %within% Intrvl,"a","")) %>%
mutate(BinIntrvl=if_else(b_date %within% Intrvl,"b","")) %>%
mutate(CinIntrvl=if_else(c_date %within% Intrvl,"c","")) %>%
mutate(Within_Intrvl=paste(AinIntrvl,BinIntrvl,CinIntrvl,sep="_")) %>%
select(-start_date,-end_date,-Intrvl,-a_date,-b_date,-c_date )
You can format the Within_Intrvl column as you like, and well as decide how you want to deal with NAs

sub-setting rows based on dates in R

I am trying to solve a problem in R. I have 2 data frames which look like this:
df1 <-
Date Rainfall_Duration
6/14/2016 10
6/15/2016 20
6/17/2016 10
8/16/2016 30
8/19/2016 40
df2 <-
Date Removal.Rate
6/17/2016 64.7
6/30/2016 22.63
7/14/2016 18.18
8/19/2016 27.87
I want to look up the dates from df2 in df1 and their corresponding Rainfall_Duration data. For example, I want to look for the 1st date of df2 in df1 and subset rows in df1 for that specific date and 7 days prior to that. additionally, for example: for 6/30/2016 (in df2) there is no dates available in df1 within it's 7 days range. So, in this case I just want to extract the results same as it's previous date (6/17/2016) in df2. Same logic goes for 7/14/2016(df2).
The output should look like this:
df3<-
Rate.Removal.Date Date Rainfall_Duration
6/17/2016 6/14/2016 10
6/17/2016 6/15/2016 20
6/17/2016 6/17/2016 10
6/30/2016 6/14/2016 10
6/30/2016 6/15/2016 20
6/30/2016 6/17/2016 10
7/14/2016 6/14/2016 10
7/14/2016 6/15/2016 20
7/14/2016 6/17/2016 10
8/19/2016 8/16/2016 30
8/19/2016 8/19/2016 40
I could subset data for the 7 days range. But could not do it when no dates are available in that range. I have the following code:
library(plyr)
library (dplyr)
df1$Date <- as.Date(df1$Date,format = "%m/%d/%Y")
df2$Date <- as.Date(df2$Date,format = "%m/%d/%Y")
df3 <- lapply(df2$Date, function(x){
filter(df1, between(Date, x-7, x))
})
names(df3) <- as.character(df2$Date)
bind_rows(df3, .id = "Rate.Removal.Date")
df3 <- ldply (df3, data.frame, .id = "Rate.Removal.Date")
I hope I could explain my problem properly. I would highly appreciate if someone can help me out with this code or a new one. Thanks in advance.
I would approach this by explicitly generating all of the dates on which you want to collect the rainfall duration then binding.
To that end, I split each row and generated a rainfallDate column for each that included the seven previous days. Then, I bound the rows back together and used left_join to get the rainfall data. Finally, I filter out the rows with missing rainfall information
df2 %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
x
, rainfallDate = seq(x$Date - 7, x$Date, 1)
)
}) %>%
bind_rows() %>%
left_join(df1, by = c(rainfallDate = "Date")) %>%
filter(!is.na(Rainfall_Duration))
gives
Date Removal.Rate rainfallDate Rainfall_Duration
1 2016-06-17 64.70 2016-06-14 10
2 2016-06-17 64.70 2016-06-15 20
3 2016-06-17 64.70 2016-06-17 10
4 2016-08-19 27.87 2016-08-16 30
5 2016-08-19 27.87 2016-08-19 40
Note that if the dates without rainfall information are actually 0's, you could skip the filter line and use replace_na from tidyr to set them to explicit zeros (e.g., if you want average rainfall or to include the removal dates without any days of rainfall ahead of them).
An alternative is possible if you are actually interested in a summary value for each date (e.g., total rain in the past seven days). First, I would generate a rainfall dataset that actually had entries for every day of observation. For that, I used complete from tidyr to add an explicit zero entry for each day then used rollapply from zoo to calculate the sum in the rolling previous seven days.
completeRainfall <-
df1 %>%
complete(Date = full_seq(c(as.Date("2016-06-10"), max(Date)), 1)
, fill = list(Rainfall_Duration = 0)) %>%
mutate(prevSeven = rollapply(Rainfall_Duration, 7, sum
, fill = NA, align = "right"))
Then, a simple join will give you both that day's rainfall and the summarized total from the past seven days. This is particularly useful if your Dates for df1 are ever within seven days of each other to avoid copying the same rows multiple times.
left_join(
df2
, completeRainfall
)
Gives
Date Removal.Rate Rainfall_Duration prevSeven
1 2016-06-17 64.70 10 40
2 2016-06-30 22.63 0 0
3 2016-07-14 18.18 0 0
4 2016-08-19 27.87 40 70

Calculate mean date across years

I am trying to calculate the mean date independent of year for each level of a factor.
DF <- data.frame(Date = seq(as.Date("2013-2-15"), by = "day", length.out = 730))
DF$ID = rep(c("AAA", "BBB", "CCC"), length.out = 730)
head(DF)
Date ID
1 2013-02-15 AAA
2 2013-02-16 BBB
3 2013-02-17 CCC
4 2013-02-18 AAA
5 2013-02-19 BBB
6 2013-02-20 CCC
With the data above and the code below, I can calculate the mean date for each factor, but this includes the year.
I want a mean month and day across years. The preferred result would be a POSIXct time class formatted as month-day (eg. 12-31 for Dec 31st) representing the mean month and day across multiple years.
library(dplyr)
DF2 <- DF %>% group_by(ID) %>% mutate(
Col = mean(Date, na.rm = T))
DF2
Addition
I am looking for the mean day of the year with a month and day component, for each factor level. If the date represents, for example, the date an animal reproduced, I am not interested in the yearly differences between years, but instead want a single mean day.
I The end result would look like DF2 but with the new value calculated as previously described (mean day of the year with a month day component.
Sorry this was not more clear.
If I understand your question correctly, here's how to get a mean date column. I first extract the day of the year with yday from POSIXlt. I then calculate the mean. To get a date back, I have to add those days to an actual year, hence the creation of the Year object. As requested, I put the results in the same format as DF2 in your example.
library(dplyr)
DF2 <- DF %>%
mutate(Year=format(Date,"%Y"),
Date_day=as.POSIXlt(Date, origin = "1960-01-01")$yday)%>%
group_by(ID) %>%
mutate(Col = mean(Date_day, na.rm = T),Mean_date=format(as.Date(paste0(Year,"-01-01"))+Col,"%m-%d"))%>%
select(Date,ID,Mean_date)
DF2
> DF2
Source: local data frame [730 x 3]
Groups: ID [3]
Date ID Mean_date
(date) (chr) (chr)
1 2013-02-15 AAA 07-02
2 2013-02-16 BBB 07-02
3 2013-02-17 CCC 07-01
4 2013-02-18 AAA 07-02
5 2013-02-19 BBB 07-02
6 2013-02-20 CCC 07-01
7 2013-02-21 AAA 07-02
8 2013-02-22 BBB 07-02
9 2013-02-23 CCC 07-01
10 2013-02-24 AAA 07-02
.. ... ... ...
You can take the mean of dates by using the mean function. However, note that the mean implementation (and result) will be different depending on the data type. For POSIXct, the mean will be calculated and return the date and time - think of taking the mean of a bunch of integers and you will likely get a float or numeric. For Date, it will essentially 'round' the date to the nearest date.
For example, I recently took a mean of dates. Look at the output when different data types are used.
> mean(as.Date(stationPointDf$knockInDate))
[1] "2018-06-04"
> mean(as.POSIXct(stationPointDf$knockInDate))
[1] "2018-06-03 21:19:21 CDT"
If I am looking for a mean Month and Day across years, I convert all the dates to have the current year using lubridate package.
library(lubridate)
year(myVectorOfDates) <- 2018
Then, I compute the mean and drop the year.

Resources