count number of cases by cut off date - r

Hi I'm struggling with how to count number of cases given a cut off date. On Table A I have IDs with cut off dates attached to each ID. On Table B I have IDs with dates where a claim case happened. I hope to count the number of cases that an ID has been through by cut off date on Table A.
Table A
ID Date
A 2019-01-03
A 2019-05-03
A 2019-09-23
B 2019-02-04
B 2019-03-16
Table B
ID Claim_Date
A 2018-12-03
A 2019-04-23
B 2019-03-16
I want to achieve below data structure:
ID Date Claims
A 2019-01-03 1
A 2019-05-03 2
A 2019-09-23 2
B 2019-02-04 0
B 2019-03-16 1
I've been trying multiple ways but nothing worked. Could someone help me on this? Many thanks in advance for your help!

You can try the following with dplyr, tidyr and lubridate:
library(dplyr)
library(tidyr)
library(lubridate)
# Transform to date columns
TableA <- TableA %>%
mutate(Date = lubridate::date(Date))
TableB <- TableB %>%
mutate(Claim_Date = lubridate::date(Claim_Date))
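# Join the tables by ID and flag claims dated on or before each cut-off date.
TableA %>%
left_join(TableB, by = "ID") %>%
mutate(Claims = ifelse(Date >= Claim_Date, 1, 0)) %>%
group_by(ID, Date) %>%
# na.rm = TRUE so an ID with no claims at all counts as 0 instead of NA
summarise(Claims = sum(Claims, na.rm = TRUE))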
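For comparison, here is a minimal base R sketch of the same count, without a join. It rebuilds the two sample tables from the question (the `stringsAsFactors = FALSE` argument is only there so the ID comparison also works on older R versions) and, for each cut-off row, counts the claims of the same ID dated on or before that cut-off. It is an illustration on the posted sample, not a tuned solution for large tables.

```r
# Sample data copied from the question.
TableA <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2019-01-03", "2019-05-03", "2019-09-23",
                   "2019-02-04", "2019-03-16")),
  stringsAsFactors = FALSE
)
TableB <- data.frame(
  ID         = c("A", "A", "B"),
  Claim_Date = as.Date(c("2018-12-03", "2019-04-23", "2019-03-16")),
  stringsAsFactors = FALSE
)

# For each row of TableA, count claims with the same ID dated on or
# before that row's cut-off date. IDs without any claim get 0.
TableA$Claims <- vapply(
  seq_len(nrow(TableA)),
  function(i) sum(TableB$ID == TableA$ID[i] &
                  TableB$Claim_Date <= TableA$Date[i]),
  integer(1)
)

TableA
```

This loops over the rows of TableA, so it is simple to read but does one scan of TableB per cut-off date; for big tables the join-based approach is likely the better fit.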

Related

Is there an R function for finding a list of all dates between two values. then inserting them as rows?

I have a dataframe in the following format:
Contract_Begin Contract_End FP
2020-01-01 2020-01-31 5
2020-01-01 2020-03-31 6
If Contract_End - Contract_Begin is more than 1 month, I want to insert the additional months as rows below. Here is the desired output.
Contract_Begin Contract_End FP
2020-01-01 2020-01-31 5
2020-01-01 6
2020-02-01 6
2020-03-01 6
Trying to accomplish in R as a part of pre data processing. Any help is greatly appreciated.
We can use map2 to get the sequence of dates from 'Contract_Begin' to 'Contract_End', and then unnest the list column created by map2 to expand the rows:
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
mutate_at(1:2, as.Date) %>%
mutate(Contract_Begin = map2(Contract_Begin, Contract_End, seq,
by = "1 month")) %>%
unnest(c(Contract_Begin))
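The same expansion can be sketched in base R, assuming the two date columns are already of class Date. The names `month_starts`, `n_months` and `expanded` are made up for this illustration. Like the pipeline above, it keeps Contract_End on every expanded row rather than blanking it.

```r
# Sample data copied from the question.
df1 <- data.frame(
  Contract_Begin = as.Date(c("2020-01-01", "2020-01-01")),
  Contract_End   = as.Date(c("2020-01-31", "2020-03-31")),
  FP             = c(5, 6)
)

# One sequence of month starts per contract row.
month_starts <- lapply(seq_len(nrow(df1)), function(i) {
  seq(df1$Contract_Begin[i], df1$Contract_End[i], by = "month")
})
n_months <- lengths(month_starts)

# Repeat each contract's other columns once per generated month.
expanded <- data.frame(
  Contract_Begin = do.call(c, month_starts),
  Contract_End   = rep(df1$Contract_End, times = n_months),
  FP             = rep(df1$FP, times = n_months)
)

expanded
```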

Is there an r function that would help me find the number of rows which carry data from a particular year?

I am completely new to R. Here is how my data looks like:
incident_id date
1 461105 2013-01-01
2 460726 2013-01-01
3 478855 2013-01-01
4 478925 2013-01-05
5 478959 2013-01-07
6 478948 2013-01-07
7 479363 2013-01-19
8 479374 2013-01-21
9 479389 2013-01-21
10 492151 2013-01-23
I would like to find out the number of times an incident has been reported in a given year.
The tail looks like this:
incident_id date
239668 1082234 2018-03-31
239669 1081742 2018-03-31
239670 1082990 2018-03-31
239671 1081752 2018-03-31
239672 1082061 2018-03-31
239673 1083142 2018-03-31
239674 1083139 2018-03-31
239675 1083151 2018-03-31
239676 1082514 2018-03-31
239677 1081940 2018-03-31
I have tried SQL, but I would want to use R for this.
Code to find out incidents reported each year
First, creating a subset with only the date and incident ID:
dfgvdates = dfgv[,1:2]
head(dfgvdates, 10)
I would like to use the count() function but I guess it can only be used if I use a library.
You could extract the year from the date and then count the length. Using aggregate we could do
aggregate(incident_id~date, transform(df, date = format(as.Date(date),"%Y")),length)
Or with table
stack(table(format(as.Date(df$date), "%Y")))
Using dplyr, we could do
library(dplyr)
df %>%
group_by(date = format(as.Date(date), "%Y")) %>%
summarise(n = n())
Or using count
df %>% count(date = format(as.Date(date), "%Y"))
We can use data.table
library(data.table)
setDT(df)[, .(N = .N), by = .(Year = year(as.IDate(date)))]

How to find if a date is within a given time interval

I have the following data frame with dates.
ID start_date end_date Intrvl a_date b_date c_date
1 2013-12-01 2014-05-01 2013-12-01--2014-05-01 2014-01-01 2014-03-10 2015-03-10
2 2016-01-01 2016-07-01 2016-01-01--2016-07-01 2014-02-01 NA 2016-02-01
3 2014-01-01 2014-07-01 2014-01-01--2014-07-01 2014-02-01 2016-02-01 2014-07-01
I want to know:
Whether the dates in columns a_date, b_date and c_date fall within the interval that I calculated using lubridate::interval(start_date, end_date). My real data frame has 400 columns.
The names of the date columns whose dates fall within the corresponding interval, like the output below:
ID Within_Intrvl
1 a_b
2 a
3 a_c
I have read the answers of this question [link], but did not help me.
Thank you!
Assuming your data is already converted with lubridate,
input<- df %>%
mutate(start_date=ymd(start_date)) %>%
mutate(end_date=ymd(end_date)) %>%
mutate(a_date=ymd(a_date)) %>%
mutate(b_date=ymd(b_date)) %>%
mutate(c_date=ymd(c_date)) %>%
mutate(Intrvl=interval(start_date, end_date))
you could use the %within% operator in lubridate
result <- input %>%
mutate(AinIntrvl=if_else(a_date %within% Intrvl,"a","")) %>%
mutate(BinIntrvl=if_else(b_date %within% Intrvl,"b","")) %>%
mutate(CinIntrvl=if_else(c_date %within% Intrvl,"c","")) %>%
mutate(Within_Intrvl=paste(AinIntrvl,BinIntrvl,CinIntrvl,sep="_")) %>%
select(-start_date,-end_date,-Intrvl,-a_date,-b_date,-c_date )
You can format the Within_Intrvl column as you like, as well as decide how you want to deal with NAs.
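Since the real data has far more than three date columns, one way to avoid writing a mutate per column is to loop over the column names. The following is a base R sketch under a few assumptions: the columns are already of class Date, the interval is inclusive at both ends, an NA date counts as "not within", and the label is the column name with the `_date` suffix removed. The names `date_cols`, `inside` and `labels` are made up for this illustration.

```r
# Sample data copied from the question (Intrvl is recomputed from start/end).
df <- data.frame(
  ID         = 1:3,
  start_date = as.Date(c("2013-12-01", "2016-01-01", "2014-01-01")),
  end_date   = as.Date(c("2014-05-01", "2016-07-01", "2014-07-01")),
  a_date     = as.Date(c("2014-01-01", "2014-02-01", "2014-02-01")),
  b_date     = as.Date(c("2014-03-10", NA,           "2016-02-01")),
  c_date     = as.Date(c("2015-03-10", "2016-02-01", "2014-07-01"))
)

# Columns to test; with many columns this could come from
# grep("_date$", names(df), value = TRUE) minus start_date and end_date.
date_cols <- c("a_date", "b_date", "c_date")

# Logical matrix: one row per record, one column per date column.
inside <- vapply(date_cols, function(col) {
  x <- df[[col]]
  !is.na(x) & x >= df$start_date & x <= df$end_date
}, logical(nrow(df)))

# Join the labels of the columns that fall inside the interval.
labels <- sub("_date$", "", date_cols)
df$Within_Intrvl <- apply(inside, 1, function(r) paste(labels[r], collapse = "_"))

df[, c("ID", "Within_Intrvl")]
```

Rows where no date falls inside the interval get an empty string, which avoids stray underscores or "NA" text in the label.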

How to check for continuity minding possible gaps in dates

I have a big data frame with dates and I need to find the first date of the continuous period for each ID, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the date ranges overlap, and I need to return, by ID (not ID_2), the date where the first gap (going backwards in time) appears. I've tried using a for loop but it's extremely slow (the data frame has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order, and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate_at(vars(END, BEG), as.Date) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the highest row that satisfies your criteria, i.e. the highest row which is either a gap (TRUE), or if there is no gap it is the first row (meaning it has a missing value when checking for a gap, this is why is.na(BEG > ( lead(END) + 1 ))).
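The steps above can also be sketched in base R, which may make the logic easier to follow. This assumes the rows are already sorted from newest to oldest within each ID, as in the question; `first_gap` and `next_end` are names made up for this illustration.

```r
# Sample data copied from the question.
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END  = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26",
                   "2014-09-30", "2013-04-30", "2012-12-31",
                   "2010-12-31", "2007-12-31", "2005-12-31")),
  BEG  = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24",
                   "2013-05-01", "2011-01-01", "2011-01-01",
                   "2008-01-01", "2006-01-01", "1980-10-20"))
)

# For one ID: return the first row (going backwards in time) whose BEG
# is more than one day after the END of the next, older row. The last
# row has no older row, so its comparison is NA and it is the fallback.
first_gap <- function(g) {
  next_end <- c(g$END[-1], NA)          # END of the following row
  is_gap   <- g$BEG > next_end + 1
  g[which.max(is_gap | is.na(is_gap)), ]
}

result <- do.call(rbind, lapply(split(df, df$ID), first_gap))
result
```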
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet

Filter a data frame by two time series

Hi I am new to R and would like to know if there is a simple way to filter data over multiple dates.
I have a data which has dates from 07.03.2003 to 31.12.2016.
I need to split/ filter the data by multiple time series, as per below.
Dates require in new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e the new data frame should not include dates from 07/03/2005 to 31/12/2012
Let's take the following data.frame with dates:
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd together with dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
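To keep two separate ranges at once, as the question asks, the two range conditions can be combined with a logical OR. Below is a minimal base R sketch using the date limits from the question; the data frame `dat` and its `value` column are made up for this illustration.

```r
# Hypothetical sample covering both wanted ranges and the excluded period.
dat <- data.frame(
  date  = as.Date(c("2003-03-07", "2005-03-06", "2005-03-07", "2010-06-15",
                    "2012-12-31", "2013-01-01", "2016-12-31")),
  value = 1:7
)

# One logical vector per wanted range (both ends inclusive).
in_first  <- dat$date >= as.Date("2003-03-07") & dat$date <= as.Date("2005-03-06")
in_second <- dat$date >= as.Date("2013-01-01") & dat$date <= as.Date("2016-12-31")

# Keep rows that fall in either range.
kept <- dat[in_first | in_second, ]
kept
```

With dplyr the same idea would be written as filter(dat, between(date, lo1, hi1) | between(date, lo2, hi2)), where lo1, hi1, lo2 and hi2 stand for the four limits as Date values.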
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt<-data.table(date_sample=sample(seq(date1, date4, by="day"), N), numeric_sample=sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut:
before
>plot(my_dt)
my_dt<-my_dt[!(date_sample %within% forbidden_dates)]#applying the temporal cut
after
>plot(my_dt)
