I have the following data frame with dates.
ID start_date end_date Intrvl a_date b_date c_date
1 2013-12-01 2014-05-01 2013-12-01--2014-05-01 2014-01-01 2014-03-10 2015-03-10
2 2016-01-01 2016-07-01 2016-01-01--2016-07-01 2014-02-01 NA 2016-02-01
3 2014-01-01 2014-07-01 2014-01-01--2014-07-01 2014-02-01 2016-02-01 2014-07-01
I want to know whether the dates in columns a_date, b_date and c_date fall within the interval that I calculated using lubridate::interval(start_date, end_date). My real data frame has 400 columns.
For each row, I need the names of the date columns whose dates fall within the corresponding interval, like the output below:
ID Within_Intrvl
1 a_b
2 a
3 a_c
I have read the answers to this question [link], but they did not help me.
Thank you!
Assuming your data is converted with lubridate like this,
library(dplyr)
library(lubridate)
input <- df %>%
mutate(start_date=ymd(start_date)) %>%
mutate(end_date=ymd(end_date)) %>%
mutate(a_date=ymd(a_date)) %>%
mutate(b_date=ymd(b_date)) %>%
mutate(c_date=ymd(c_date)) %>%
mutate(Intrvl=interval(start_date, end_date))
you could use lubridate's %within% operator:
result <- input %>%
mutate(AinIntrvl=if_else(a_date %within% Intrvl,"a","")) %>%
mutate(BinIntrvl=if_else(b_date %within% Intrvl,"b","")) %>%
mutate(CinIntrvl=if_else(c_date %within% Intrvl,"c","")) %>%
mutate(Within_Intrvl=paste(AinIntrvl,BinIntrvl,CinIntrvl,sep="_")) %>%
select(-start_date,-end_date,-Intrvl,-a_date,-b_date,-c_date )
You can format the Within_Intrvl column as you like, as well as decide how you want to deal with NAs.
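With 400 columns, writing one mutate() per column is impractical. As a minimal sketch of the same %within% idea applied to every date column at once, assuming the columns to test all end in _date (the names date_cols, inside and labels are made up for illustration), with NA treated as "not within the interval":

```r
library(lubridate)

# Sample data from the question
df <- data.frame(
  ID         = 1:3,
  start_date = as.Date(c("2013-12-01", "2016-01-01", "2014-01-01")),
  end_date   = as.Date(c("2014-05-01", "2016-07-01", "2014-07-01")),
  a_date     = as.Date(c("2014-01-01", "2014-02-01", "2014-02-01")),
  b_date     = as.Date(c("2014-03-10", NA, "2016-02-01")),
  c_date     = as.Date(c("2015-03-10", "2016-02-01", "2014-07-01"))
)

intrvl <- interval(df$start_date, df$end_date)

# Assumption: every column to test ends in "_date", apart from the two bounds
date_cols <- grep("_date$", setdiff(names(df), c("start_date", "end_date")),
                  value = TRUE)

# Logical matrix, one column per date column; `%in% TRUE` turns NA into FALSE
inside <- vapply(
  date_cols,
  function(col) (df[[col]] %within% intrvl) %in% TRUE,
  logical(nrow(df))
)

# Paste the short names of the matching columns for each row
labels <- sub("_date$", "", date_cols)
df$Within_Intrvl <- apply(inside, 1, function(hit) paste(labels[hit], collapse = "_"))

df[, c("ID", "Within_Intrvl")]
```

With the sample data as posted, ID 2 comes out as c rather than a, since only c_date (2016-02-01) lies inside 2016-01-01--2016-07-01.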
Hi, I'm struggling with how to count the number of cases up to a given cutoff date. Table A has IDs with a cutoff date attached to each ID. Table B has IDs with the dates on which a claim case happened. For each ID and cutoff date in Table A, I want to count the number of cases that ID has had up to and including that date.
Table A
ID Date
A 2019-01-03
A 2019-05-03
A 2019-09-23
B 2019-02-04
B 2019-03-16
Table B
ID Claim_Date
A 2018-12-03
A 2019-04-23
B 2019-03-16
I want to achieve below data structure:
ID Date Claims
A 2019-01-03 1
A 2019-05-03 2
A 2019-09-23 2
B 2019-02-04 0
B 2019-03-16 1
I've been trying multiple ways but nothing worked. Could someone help me on this? Many thanks in advance for your help!
You can try the following with dplyr, tidyr and lubridate:
library(dplyr)
library(tidyr)
library(lubridate)
# Transform to date columns
TableA <- TableA %>%
  mutate(Date = lubridate::date(Date))
TableB <- TableB %>%
  mutate(Claim_Date = lubridate::date(Claim_Date))
# Join the tables on ID and count the claims dated on or before each cutoff date.
# na.rm = TRUE gives 0 instead of NA for IDs that have no claims at all.
TableA %>%
  left_join(TableB, by = "ID") %>%
  mutate(Claims = ifelse(Date >= Claim_Date, 1, 0)) %>%
  group_by(ID, Date) %>%
  summarise(Claims = sum(Claims, na.rm = TRUE))
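For comparison, the same count can be sketched in base R without a join. This is only an illustration on the sample tables from the question, not a tuned solution for large data; it loops over the rows of Table A and counts the matching claims in Table B:

```r
# Sample tables from the question
TableA <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2019-01-03", "2019-05-03", "2019-09-23",
                   "2019-02-04", "2019-03-16"))
)
TableB <- data.frame(
  ID         = c("A", "A", "B"),
  Claim_Date = as.Date(c("2018-12-03", "2019-04-23", "2019-03-16"))
)

# For each cutoff row, count claims with the same ID dated on or before the cutoff
TableA$Claims <- vapply(
  seq_len(nrow(TableA)),
  function(i) sum(TableB$ID == TableA$ID[i] & TableB$Claim_Date <= TableA$Date[i]),
  integer(1)
)

TableA
```

An ID with no claims at all simply gets 0 here, because the logical vector being summed contains no TRUE values.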
I have a database of hourly data organized in rows and would like to reshape it in such a way as to obtain the start and end times of the periods in which the data meet a certain criterion.
Consider the following example: one column holds the sequential hourly times, and the second column holds the dummy variable data.
Yrs= data.frame(Date=seq(as.POSIXct("2019-02-04 01:00:00",tz="UTC"), as.POSIXct("2019-02-04 23:00:00",tz="UTC"), by="hour"))
Yrs$Var=c(1:12,1:11)
I would like to obtain the start and end dates of the period in which the Variable was between say 3 and 7.
Expected result:
StartDate EndDate
2019-02-04 03:00:00 2019-02-04 07:00:00
2019-02-04 15:00:00 2019-02-04 19:00:00
I figure I can create a new column indicating the rows where the criterion is met, but I do not know how to get the start and end of those consecutive periods:
Yrs$Period= ifelse(Yrs$Var >= 3 & Yrs$Var <=7, 1, 0)
I found a reverse example of this problem in "Given start date and end date, reshape/expand data for each day between (each day on a row)",
but I am struggling to figure this out. Any help will be greatly appreciated.
Maybe something like:
library(data.table)
setDT(Yrs)[, .(StartDate=Date[Var==3L], EndDate=Date[Var==7L]),
by=.(c(0L, cumsum(diff(Var) < 1L)))][, -1L]
output:
StartDate EndDate
1: 2019-02-04 03:00:00 2019-02-04 07:00:00
2: 2019-02-04 15:00:00 2019-02-04 19:00:00
Why not filter and spread?
library(dplyr)
library(tidyr)
Yrs %>%
  filter(Var == 3 | Var == 7) %>%
  group_by(Var) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  spread(Var, Date) %>%
  select(-ind) %>%
  rename(Start_Date = `3`, End_Date = `7`)
# Start_Date End_Date
# <dttm> <dttm>
#1 2019-02-04 03:00:00 2019-02-04 07:00:00
#2 2019-02-04 15:00:00 2019-02-04 19:00:00
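Both answers above match the exact values 3 and 7, which works because Var rises in steps of one. As a hedged alternative sketch that uses the actual criterion (Var between 3 and 7) and so does not depend on those exact values occurring, base R's rle() can find the consecutive runs (the names in_range, runs, starts, ends and periods are illustrative):

```r
# Example data from the question
Yrs <- data.frame(
  Date = seq(as.POSIXct("2019-02-04 01:00:00", tz = "UTC"),
             as.POSIXct("2019-02-04 23:00:00", tz = "UTC"),
             by = "hour")
)
Yrs$Var <- c(1:12, 1:11)

# TRUE where the criterion is met
in_range <- Yrs$Var >= 3 & Yrs$Var <= 7

# Run-length encode the flag, then recover the first and last row of every run
runs   <- rle(in_range)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs where the criterion holds
periods <- data.frame(
  StartDate = Yrs$Date[starts[runs$values]],
  EndDate   = Yrs$Date[ends[runs$values]]
)

periods
```

On the example data this yields the two periods from the expected result, 03:00 to 07:00 and 15:00 to 19:00.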
I have a big data frame with dates and I need to find the first date of a continuous run of periods, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates overlap, and I need to return, by ID (not ID_2), the date when the first gap (going backwards in time) appears. I've tried using a for loop but it's extremely slow (the data frame has 150k rows). I've been experimenting with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, sorting in descending or ascending order and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate(across(c(END, BEG), as.Date)) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Converts the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row in each group that satisfies your criteria, i.e. the first row that is a gap (TRUE) or, if there is no gap, the last row of the group (where lead(END) is missing, which is why is.na(BEG > ( lead(END) + 1 )) is included).
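To make those steps easier to follow, here is a sketch of the same logic on the question's data with the intermediate values spelled out. The column names prev_end and is_boundary are made up for illustration, and the rows are assumed to be sorted from newest to oldest within each ID, as in the question:

```r
library(dplyr)

# Data from the question
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END  = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26",
                   "2014-09-30", "2013-04-30", "2012-12-31",
                   "2010-12-31", "2007-12-31", "2005-12-31")),
  BEG  = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24",
                   "2013-05-01", "2011-01-01", "2011-01-01",
                   "2008-01-01", "2006-01-01", "1980-10-20"))
)

first_gap <- df %>%
  group_by(ID) %>%
  mutate(
    prev_end    = lead(END),                                # END of the next (older) row
    is_boundary = is.na(prev_end) | BEG > prev_end + 1      # gap, or no older row left
  ) %>%
  slice(which.max(is_boundary)) %>%                         # first TRUE within each ID
  ungroup() %>%
  select(ID, ID_2, END, BEG)

first_gap
```

For ID 1 the second row is kept because its BEG (2015-11-12) is more than one day after the END of the row below it; for ID 2 no gap exists, so the oldest row is returned.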
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07.03.2003 to 31.12.2016.
I need to split/filter the data into multiple time periods, as below.
Dates required in the new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(dplyr)
library(lubridate)
df <- data.frame(date = ymd(c("2017-02-02", "2016-02-02", "2014-02-01", "2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
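The question asks for two ranges at once. One way to sketch that is to combine two between() conditions with | inside a single filter(). The sample dates below are invented to cover both the wanted and the excluded periods, and the name kept is illustrative:

```r
library(dplyr)
library(lubridate)

# Hypothetical sample dates: some inside the two wanted ranges, some in the excluded period
df <- data.frame(
  date = ymd(c("2003-03-07", "2004-06-15", "2005-03-07",
               "2010-01-01", "2013-01-01", "2016-12-31"))
)

# Keep rows in 2003-03-07..2005-03-06 OR in 2013-01-01..2016-12-31 (both ends inclusive)
kept <- df %>%
  filter(
    between(date, ymd("2003-03-07"), ymd("2005-03-06")) |
      between(date, ymd("2013-01-01"), ymd("2016-12-31"))
  )

kept
```

The rows dated 2005-03-07 and 2010-01-01 are dropped, since they fall in the period that should be excluded.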
I would go with lubridate. In particular:
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after applying it:
plot(my_dt)#before
my_dt<-my_dt[!(date_sample %within% forbidden_dates)]#applying the temporal cut
plot(my_dt)#after
The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In more general terms, I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I'm trying to create a data frame with one row per date, and a column summing the total of patients (id) that are taking that drug on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm functional with dplyr and tidyr, but unsure how to use spread with dates and durations.
I'd expand the data out to one row per date using do():
library(dplyr)
library(tidyr)
df %>%
  group_by(id, med) %>%
  # Assumes one row per id/med pair and Date-class columns;
  # seq() keeps the Date class, so no numeric conversion (or zoo) is needed
  do(tibble(date = seq(.$start_date, .$end_date, by = "day"))) %>%
  group_by(date, med) %>%
  summarize(frequency = n()) %>%
  spread(med, frequency)
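As a sketch of the same expand-then-count idea with newer tidyr verbs (a list-column plus unnest(), then pivot_wider() in place of spread()), run on the sample rows from the question. The name daily is illustrative, and dates on which nobody takes any drug do not appear in the result:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  id         = c(1, 2, 3, 4, 5, 5),
  med        = c("drug_a", "drug_a", "drug_b", "drug_b", "drug_c", "drug_a"),
  start_date = as.Date(c("2010-08-24", "2011-06-07", "2010-03-26",
                         "2012-08-14", "2012-03-01", "2012-04-01")),
  end_date   = as.Date(c("2011-03-03", "2011-08-12", "2010-10-31",
                         "2013-01-31", "2012-06-20", "2012-06-14"))
)

daily <- df %>%
  rowwise() %>%
  # One list element per prescription, holding every day it covers
  mutate(date = list(seq(start_date, end_date, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  # Number of patients per drug per day
  count(date, med) %>%
  # One column per drug; days without a given drug get 0
  pivot_wider(names_from = med, values_from = n, values_fill = 0) %>%
  arrange(date)

head(daily)
```

The wide result has one row per date and one column per drug, which is a shape that stacked area charts can usually be built from after reshaping back to long form if the plotting library requires it.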