I would like to know which IDs are repeated at least a certain number of times (e.g. ≥3) within a given period (2 years).
The conditions for the result are: I would like to get TRUE if the ID is repeated ≥3 times within a 2-year period and ≥1 of those times is before index.
I want the result to be FALSE if the ID is repeated <3 times; or ≥3 times but over a period longer than 2 years; or ≥3 times in 2 years but solely after index; or ≥3 times before and after index but not within a 2-year period.
I have the following table as an example:
ID Date index
1 1998-08-04 2002-04-05
1 1999-12-01 2002-04-05
1 1999-12-12 2002-04-05
2 2000-04-04 2001-06-13
2 2001-08-12 2001-06-13
2 2001-10-18 2001-06-13
3 2002-04-04 2002-09-12
3 2002-05-08 2002-09-12
4 2006-04-08 2001-01-03
4 2006-12-18 2001-01-03
4 2007-01-01 2001-01-03
5 2007-07-07 2007-08-12
5 2012-04-03 2007-08-12
5 2012-05-03 2007-08-12
5 2012-06-06 2007-08-12
6 2012-05-04 2021-04-09
6 2012-07-04 2021-04-09
6 2016-04-08 2021-04-09
6 2016-05-22 2021-04-09
6 2020-01-01 2021-04-09
I would like to obtain the following result:
ID Rep
1 TRUE
2 TRUE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
I've been doing something like this to count repeated IDs in a 2-year period:
library(dplyr)
df %>%
  arrange(Date) %>%
  group_by(ID) %>%
  summarize(Rep = sum(diff(Date) < (2*365.25)) >= 2)
I get the following result, but it's not what I'm looking for. I'm also not able to use index as a reference for each ID group.
ID Rep
1 TRUE
2 TRUE
3 FALSE
4 TRUE
5 TRUE
6 TRUE
Maybe I need to add an IF statement or case_when but I really don't know how to continue.
Any thoughts?
Many thanks!
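One possible approach, sketched here in base R rather than dplyr. Counting short gaps with diff() does not check that three dates fall inside a single 2-year window (ID 6), and it never looks at index (IDs 4 and 5). The sketch below instead checks every run of n consecutive sorted dates and requires the earliest date of the run to be before index. The function name has_cluster and its arguments are hypothetical; it assumes "before index" means strictly earlier and that 2 years is 2*365.25 days, as in the attempt above.

```r
# Example data from the question
df <- data.frame(
  ID = rep(1:6, times = c(3, 3, 2, 3, 4, 5)),
  Date = as.Date(c(
    "1998-08-04", "1999-12-01", "1999-12-12",
    "2000-04-04", "2001-08-12", "2001-10-18",
    "2002-04-04", "2002-05-08",
    "2006-04-08", "2006-12-18", "2007-01-01",
    "2007-07-07", "2012-04-03", "2012-05-03", "2012-06-06",
    "2012-05-04", "2012-07-04", "2016-04-08", "2016-05-22", "2020-01-01"
  )),
  index = as.Date(rep(
    c("2002-04-05", "2001-06-13", "2002-09-12",
      "2001-01-03", "2007-08-12", "2021-04-09"),
    times = c(3, 3, 2, 3, 4, 5)
  ))
)

# Hypothetical helper: TRUE if some run of `n` consecutive sorted dates spans
# less than `window_days` and the first date of that run is before `index`.
has_cluster <- function(dates, index, n = 3, window_days = 2 * 365.25) {
  dates <- sort(dates)
  if (length(dates) < n) return(FALSE)
  first <- dates[seq_len(length(dates) - n + 1)]  # start of each run
  last  <- dates[n:length(dates)]                 # end of each run
  any(as.numeric(last - first) < window_days & first < index)
}

# Apply per ID; index is constant within an ID, so take its first value
rep_flag <- sapply(split(df, df$ID),
                   function(g) has_cluster(g$Date, g$index[1]))
result <- data.frame(ID = as.integer(names(rep_flag)), Rep = unname(rep_flag))
result
```

Checking only runs that start at each date is enough: if any group of n dates fits in the window with one of them before index, then the earliest date of that group is before index too, and the n consecutive dates starting there also fit in the window.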
I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
+ enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
+ '5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
+ '10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use left_join and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced with R and couldn't figure it out. I would really appreciate it if someone could walk me through it. Thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
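df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
  # default: a far-future "next enter" date for each ID's last service period
  df$next.date = as.Date('2100-12-31')
  # otherwise next.date is the enter date of the following row for the same ID
  if (nrow(df) > 1)
    df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
  df
}), df1$ID)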
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
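For comparison, the same idea can be sketched in base R without sqldf: add the next enter date per ID, join every service period to every eval for that ID with merge(), then keep only the evals that fall between exit and the next enter. The column name next.enter is a hypothetical stand-in for next.date above, and the inclusive bounds are an assumption chosen to mirror SQL's BETWEEN.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  enter.date = as.Date(c("2015-05-07", "2015-07-10", "2017-08-25", "2016-09-01",
                         "2018-01-05", "2016-05-01", "2017-04-08")),
  exit.date = as.Date(c("2015-07-01", "2015-10-15", "2017-09-03", "2016-09-30",
                        "2019-06-01", "2017-05-01", "2017-06-08"))
)
df2 <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  eval.date = as.Date(c("2015-10-30", "2016-10-10", "2019-09-10",
                        "2018-05-15", "2015-01-19"))
)

# Sort so that "next row" means "next service period" within each ID
df1 <- df1[order(df1$ID, df1$enter.date), ]

# Next enter date per ID as a day number; Inf for the last period of an ID
df1$next.enter <- ave(as.numeric(df1$enter.date), df1$ID,
                      FUN = function(x) c(x[-1], Inf))

# All period/eval pairs per ID, then keep evals in [exit, next enter]
m <- merge(df1, df2, by = "ID")
keep <- m$eval.date >= m$exit.date & as.numeric(m$eval.date) <= m$next.enter
m <- m[keep, ]

# Same sign convention as the question: exit minus eval, in days
m$time.diff <- as.numeric(m$exit.date - m$eval.date)
avg <- aggregate(time.diff ~ ID, data = m, FUN = mean)
avg
```

merge() builds every period/eval combination within an ID before filtering, so on large data this intermediate table can get big; a non-equi join avoids that, but the filter logic stays the same.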
I have a dataset of course enrollments over time, as indicated by a start_date and end_date for ~50,000 ids. I am interested in capturing both periods of enrollment and non-enrollment for this group over a specific time frame, specifically from '2001-06-01' to '2015-01-01', even though the data goes back earlier. I would like to create rows for each id that capture the date intervals of these periods of enrollment/non-enrollment, plus an additional variable that categorizes each period: 'prior to enrollment', 'enrolled', 'between enrollment', 'never enrolled'.
To clarify:
Some example data
id = c(1,3,3,3,5,5)
entry_date = c(NA, '1996-05-09', '2005-12-12', '2013-12-19', '2011-11-05','2012-12-10')
exit_date = c(NA, '2005-12-10', '2008-12-10', '2016-01-01', '2012-09-01', '2013-02-20')
test = data.frame(id, entry_date, exit_date)
test
id entry_date exit_date
1 NA NA
3 1996-05-09 2005-12-10
3 2005-12-12 2008-12-10
3 2013-12-19 2016-01-01
5 2011-11-05 2012-09-01
5 2012-12-10 2013-02-20
My intended output would be:
id entry_date exit_date category
1 2001-06-01 2016-01-01 Never enrolled
3 2001-06-01 2005-12-10 Enrolled
3 2005-12-10 2005-12-12 Between enrollment
3 2005-12-12 2008-12-10 Enrolled
3 2008-12-10 2013-12-19 Between enrollment
3 2013-12-19 2016-01-01 Enrolled
5 2001-06-01 2011-11-05 Prior to enrollment
5 2011-11-05 2012-09-01 Enrolled
5 2012-09-01 2012-12-10 Between enrollment
5 2012-12-10 2013-02-20 Enrolled
5 2013-02-20 2016-01-01 Between enrollment
Any help in how to achieve something like this would be really appreciated!
This is a follow up to my only other question, but hopefully more direct. I need data that looks like this:
custID custChannel custDate
1 151 Direct 2015-10-10 00:15:32
2 151 GooglePaid 2015-10-10 00:16:45
3 151 Converted 2015-10-10 00:17:01
4 5655 BingPaid 2015-10-11 00:20:12
5 7855 GoogleOrganic 2015-10-12 00:05:32
6 7862 YahooOrganic 2015-10-13 00:18:20
7 9655 GooglePaid 2015-10-13 00:08:35
8 9655 GooglePaid 2015-10-13 00:11:11
9 9655 Converted 2015-10-13 00:11:35
10 9888 GooglePaid 2015-10-14 00:08:35
11 9888 GooglePaid 2015-10-14 00:11:11
12 9888 Converted 2015-10-14 00:11:35
To be sorted so that the output looks like this:
Path Path Count
BingPaid 1
Direct>GooglePaid>Converted 1
GoogleOrganic 1
GooglePaid>GooglePaid>Converted 2
YahooOrganic 1
The idea is to capture customer paths (as identified by custID) and count for the entire data set how many people took that exact path (Path Count). I need to perform this over a data set of 5 million rows.
Using data.table you can do this as follows:
require(data.table)
setDT(dat)[,paste(custChannel, collapse = ">"), custID][,.("path length"=.N), .(path=V1)]
Result:
path path length
1: Direct>GooglePaid>Converted 1
2: BingPaid 1
3: GoogleOrganic 1
4: YahooOrganic 1
5: GooglePaid>GooglePaid>Converted 2
Step by step:
setDT(dat) # make dat a data.table
# get path by custID
dat_path <- dat[,paste(custChannel, collapse = ">"), custID]
# count how many custIDs share each path created in the previous step
res <- dat_path[,.("path length"=.N), by=.(path=V1)]
Have a look at dat_path and res to understand what happened.
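The same two steps can also be sketched in base R. This version sorts by custID and custDate first, on the assumption that the order of events in time is what defines a path; the data.table code above relies on the rows already being in that order. The column name PathCount is a hypothetical choice.

```r
dat <- data.frame(
  custID = c(151, 151, 151, 5655, 7855, 7862, 9655, 9655, 9655, 9888, 9888, 9888),
  custChannel = c("Direct", "GooglePaid", "Converted", "BingPaid", "GoogleOrganic",
                  "YahooOrganic", "GooglePaid", "GooglePaid", "Converted",
                  "GooglePaid", "GooglePaid", "Converted"),
  custDate = as.POSIXct(c(
    "2015-10-10 00:15:32", "2015-10-10 00:16:45", "2015-10-10 00:17:01",
    "2015-10-11 00:20:12", "2015-10-12 00:05:32", "2015-10-13 00:18:20",
    "2015-10-13 00:08:35", "2015-10-13 00:11:11", "2015-10-13 00:11:35",
    "2015-10-14 00:08:35", "2015-10-14 00:11:11", "2015-10-14 00:11:35"
  ), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Put each customer's events in time order before building the path
dat <- dat[order(dat$custID, dat$custDate), ]

# One path string per customer
paths <- tapply(dat$custChannel, dat$custID, paste, collapse = ">")

# Count how many customers took each exact path
counts <- as.data.frame(table(Path = paths),
                        responseName = "PathCount",
                        stringsAsFactors = FALSE)
counts
```

For 5 million rows the data.table version is likely to be considerably faster; this sketch is mainly meant to make the group-then-count logic explicit.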
I have data with the following columns:
CaseID, Time, Value.
The 'Time' column values are not at regular 1-minute intervals. I am trying to add rows for the missing values of Time, with 'NA' in the rest of the columns except CaseID.
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 12 08:38:00
Desired output:
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 NA 07:54:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 NA 08:37:00
2 12 08:38:00
I tried dt[CJ(unique(CaseID),seq(min(Time),max(Time),"min"))] but it gives the following error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
Join results in 9827315 rows; more than 9620640 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I am not able to make it work. Any help would be appreciated.
Like this?
dt[,Time:=as.POSIXct(Time,format="%H:%M:%S")]
result <- dt[,list(Time=seq(min(Time),max(Time),by="1 min")),by=Case]
setkey(result,Case,Time)
setkey(dt,Case,Time)
result <- dt[result][,Time:=format(Time,"%H:%M:%S")]
result
# Case Value Time
# 1: 1 100 07:52:00
# 2: 1 110 07:53:00
# 3: 1 NA 07:54:00
# 4: 1 120 07:55:00
# 5: 2 10 08:35:00
# 6: 2 11 08:36:00
# 7: 2 NA 08:37:00
# 8: 2 12 08:38:00
Another way:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
setkey(dt, Time)
dt[, .SD[J(seq(min(Time), max(Time), by='1 min'))], by=Case]
We group by Case and join on Time on each group using .SD (hence setting key on Time). From here you can use format() as shown above.
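For readers without data.table, the per-group idea can be sketched in base R: build a minute-by-minute grid of times separately for each Case, then left-join the original rows onto that grid so that missing minutes get NA. The helper name fill_case is hypothetical, and the sketch assumes Time is stored as "HH:MM:SS" text and that all times for a Case fall on the same day.

```r
dt <- data.frame(
  Case  = c(1, 1, 1, 2, 2, 2),
  Value = c(100, 110, 120, 10, 11, 12),
  Time  = c("07:52:00", "07:53:00", "07:55:00",
            "08:35:00", "08:36:00", "08:38:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every minute between a Case's first and last Time
fill_case <- function(g) {
  t <- as.POSIXct(g$Time, format = "%H:%M:%S", tz = "UTC")
  data.frame(
    Case = g$Case[1],
    Time = format(seq(min(t), max(t), by = "1 min"), "%H:%M:%S"),
    stringsAsFactors = FALSE
  )
}

# One grid per Case, stacked into a single data frame
grid <- do.call(rbind, lapply(split(dt, dt$Case), fill_case))

# Left join: grid rows with no match in dt get NA for Value
filled <- merge(grid, dt, by = c("Case", "Time"), all.x = TRUE)
filled <- filled[order(filled$Case, filled$Time), c("Case", "Value", "Time")]
rownames(filled) <- NULL
filled
```

Building the sequence per Case, rather than from the global min and max of Time as in the CJ() attempt, keeps the grid to the minutes each Case actually spans.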