Compare different columns in separate rows in R - r

I would like to check that an individual does not have any gaps in their eligibility status. I define a gap as a date_of_claim that occurs 30 days after the last elig_end_date. therefore, what I would like to do is check that each date_of_claim is no longer than the elig_end_date +30days in the row immediately preceeding. Ideally I would like an indicator that says 0 for no gap and 1 if there is a gap per person and where the gap occurs. Here is a sample df with the solution built in as 'gaps'.
names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1
I have found this post quite useful How can I compare a value in a column to the previous one using R?, but feel that I cant use a loop as my df has 4 million rows, and I have had a lot of difficulty trying to run a loop on it already.
to this end, i think the code i need is something like this:
df$gaps<-ifelse(df$date_of_claim>=df$elig_end_date+30,1,0) ##this doesn't use the preceeding row.
I've made a clumsy attempt using this:
df$gaps<-df$date_of_claim>=df$elig_end_date[-1,]
but I get an error to say i have an incorrect number of dimensions.
all help greatly appreciated! thank you.

With four million observations I would use data.table:
DF <- read.table(text="names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1", header=TRUE)
library(data.table)
DT <- data.table(DF)
DT[, c("date_of_claim", "elig_end_date") := list(as.Date(date_of_claim), as.Date(elig_end_date))]
DT[, gaps2:= c(NA, date_of_claim[-1] > head(elig_end_date, -1)+30), by=names]
# names date_of_claim elig_end_date obs gaps gaps2
# 1: tom 2010-01-01 2010-07-01 1 NA NA
# 2: tom 2010-05-04 2010-07-01 1 0 FALSE
# 3: tom 2010-06-01 2014-01-01 2 0 FALSE
# 4: tom 2010-10-10 2014-01-01 2 0 FALSE
# 5: mary 2010-03-01 2014-06-14 1 NA NA
# 6: mary 2010-05-01 2014-06-14 1 0 FALSE
# 7: mary 2010-08-01 2014-06-14 1 0 FALSE
# 8: mary 2010-11-01 2014-06-14 1 0 FALSE
# 9: mary 2011-01-01 2014-06-14 1 0 FALSE
# 10: john 2010-03-27 2011-03-01 1 NA NA
# 11: john 2010-07-01 2011-03-01 1 0 FALSE
# 12: john 2010-11-01 2011-03-01 1 0 FALSE
# 13: john 2011-02-01 2011-03-01 1 0 FALSE
# 14: sue 2010-02-01 2010-04-30 1 NA NA
# 15: sue 2010-02-27 2010-04-30 1 0 FALSE
# 16: sue 2010-03-13 2010-05-31 2 0 FALSE
# 17: sue 2010-04-27 2010-06-30 3 0 FALSE
# 18: sue 2010-04-27 2010-06-30 3 0 FALSE
# 19: sue 2010-05-06 2010-08-31 4 0 FALSE
# 20: sue 2010-06-08 2010-09-30 5 0 FALSE
# 21: mike 2010-05-01 2010-07-30 1 NA NA
# 22: mike 2010-06-01 2010-07-30 1 0 FALSE
# 23: mike 2010-11-12 2011-07-30 2 1 TRUE
# names date_of_claim elig_end_date obs gaps gaps2

Related

Define periods/episodes of exposition with overlaping and concatenated intervals of time

I'm trying to identify periods/episodes of exposition to a drug with prescriptions. If those prescriptions are separated for 30 days it's considered a new period/episode of exposition. Prescriptions can overlap during certain time or be consecutive. If the sum of separated days of two consecutive prescripction is greater than 30 days it's not considered a new episode.
I have data like this:
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference of start and last end observation (last_diffdays)
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase statement here would be a bad idea and I'm not comfortable doing it.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposition (noexp_days) of the before prescription.
I dunno If I'm the right path but I think I need to calculate noexp_days variable and then make a cumsum((noexp_days)>30)+1.
If there is a much better solution I don't see or any other possibility I haven't considered I will appreciate to read about them.
Thanks in advance for any help! :)
Try :
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]
DT[, period := cumsum(noexp_days > 30) + 1, id]
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3

How can I receive from a column data points with the highest Information (Gain)?

Suppose I have this dataframe:
> df1
date count
1 2012-07-01 2.867133
2 2012-08-01 2.018745
3 2012-09-01 5.237515
4 2012-10-01 8.320493
5 2012-11-01 4.119850
6 2012-12-01 3.648649
7 2013-01-01 3.172867
8 2013-02-01 4.065041
9 2013-03-01 2.914798
10 2013-04-01 4.735683
11 2013-05-01 3.775411
12 2013-06-01 3.825717
13 2013-07-01 3.273427
14 2013-08-01 2.716469
15 2013-09-01 2.687296
16 2013-10-01 3.674121
17 2013-11-01 3.325942
18 2013-12-01 2.524038
I now want to split df1$count in such a way, that I get groups/ranges of where the Information is the highest. My thoughts go towards Information Gain, but I know IG is for attributes, not a column.
If you plot the data, you can distinguish a high rise and decrease...so my goal is to always find these significant increases/decreases which contain a high Information Gain.
Any ideas on how I could do this?
Something like this?
df1%>%
mutate(dif=ifelse((lag(count)-count)>0,0,1))%>%
mutate(group=rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count dif group
1 2012-07-01 2.867133 NA 1
2 2012-08-01 2.018745 0 2
3 2012-09-01 5.237515 1 3
4 2012-10-01 8.320493 1 3
5 2012-11-01 4.119850 0 4
6 2012-12-01 3.648649 0 4
7 2013-01-01 3.172867 0 4
8 2013-02-01 4.065041 1 5
9 2013-03-01 2.914798 0 6
10 2013-04-01 4.735683 1 7
11 2013-05-01 3.775411 0 8
12 2013-06-01 3.825717 1 9
13 2013-07-01 3.273427 0 10
14 2013-08-01 2.716469 0 10
15 2013-09-01 2.687296 0 10
16 2013-10-01 3.674121 1 11
17 2013-11-01 3.325942 0 12
18 2013-12-01 2.524038 0 12
UPDATE
df1%>%
mutate(nxt=lag(count),
dif=ifelse( abs(count-lag(count))>2 | count/lag(count)>3 | lag(count)/count>3,1,0))%>%
+ mutate(group=rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count nxt dif group
1 2012-07-01 2.867133 NA NA 1
2 2012-08-01 2.018745 2.867133 0 2
3 2012-09-01 5.237515 2.018745 1 3
4 2012-10-01 8.320493 5.237515 1 3
5 2012-11-01 4.119850 8.320493 1 3
6 2012-12-01 3.648649 4.119850 0 4
7 2013-01-01 3.172867 3.648649 0 4
8 2013-02-01 4.065041 3.172867 0 4
9 2013-03-01 2.914798 4.065041 0 4
10 2013-04-01 4.735683 2.914798 0 4
11 2013-05-01 3.775411 4.735683 0 4
12 2013-06-01 3.825717 3.775411 0 4
13 2013-07-01 3.273427 3.825717 0 4
14 2013-08-01 2.716469 3.273427 0 4
15 2013-09-01 2.687296 2.716469 0 4
16 2013-10-01 3.674121 2.687296 0 4
17 2013-11-01 3.325942 3.674121 0 4
18 2013-12-01 2.524038 3.325942 0 4

How can I connect a data set to another when the date is between 2 other dates R

I need to merge two datasets, but the rows have to merge if the date of the one dataset is between two dates of the other one. The first dataset data looks like this:
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-02-07 1185 7 2
2013-02-14 915 7 2
2013-02-21 1090 7 2
2013-03-01 1065 9 2
2013-01-19 500 4 9
2013-01-23 1040 3 9
2013-01-28 415 5 9
2013-01-31 650 3 9
2013-02-04 725 4 9
2013-02-07 450 3 9
2013-02-11 550 4 9
The other data set matches looks like this:
Date winning
2013-01-20 1
2013-01-27 0
2013-02-03 1
2013-02-10 0
2013-02-17 1
2013-02-24 0
I wrote a code to connect the winning column from matches to the data set "data":
data$winning <- NA
for(i in 1:nrow(data)) {
for(j in 1:nrow(matches)) {
if((data$Date[i]-data$diff[i]) < matches$Date[j] & data$Date[i] > matches$Date[j]) {
data$winning[i] <- matches$winning[j]
}
}
}
This code takes 3 days to run, is there a faster way to do this?
My expected output is:
Date Weight diff Loc.nr winning
2013-01-24 1040 7 2 1
2013-01-31 1000 7 2 0
2013-02-07 1185 7 2 1
2013-02-14 915 7 2 0
2013-02-21 1090 7 2 1
2013-03-01 1065 9 2 0
2013-01-19 500 4 9 NA
2013-01-23 1040 3 9 NA
2013-01-28 415 5 9 0
2013-01-31 650 3 9 NA
2013-02-04 725 4 9 1
2013-02-07 450 3 9 NA
2013-02-11 550 4 9 0
With non-equi join as suggested by Gregor you can try something along
library(data.table)
setDT(data)[, winning := setDT(matches)[data[, .(upper = Date, lower = Date - diff)],
on = .(Date < upper, Date > lower)]$winning][]
Date Weight diff Loc.nr winning
1: 2013-01-24 1040 7 2 1
2: 2013-01-31 1000 7 2 0
3: 2013-02-07 1185 7 2 1
4: 2013-02-14 915 7 2 0
5: 2013-02-21 1090 7 2 1
6: 2013-03-01 1065 9 2 0
7: 2013-01-19 500 4 9 NA
8: 2013-01-23 1040 3 9 NA
9: 2013-01-28 415 5 9 0
10: 2013-01-31 650 3 9 NA
11: 2013-02-04 725 4 9 1
12: 2013-02-07 450 3 9 NA
13: 2013-02-11 550 4 9 0

Remove rows after a certain date based on a condition in R

There are similar questions I've seen, but none of them apply it to specific rows of a data.table or data.frame, rather they apply it to the whole matrix.
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(ID = seq(1,10,1), OA = c(1,0,0,1,0,0,0,1,1,0),
oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005","01/01/2001","01/01/2001","01/01/2001","05/06/2010", "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005", "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010", "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2009-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My tries are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0) which is coerced to logical (TRUE/FALSE) in the i-expression.
You can try
dt=dt[dt$OA==0|(dt$OA==1&!(dt$oa.date < start)),]

Subsetting data based on variable start date in R

I am trying to write a bit of code that will grab a portion of a data frame based on start date, where each start date is different for each user.
Suppose I have the following data.frames (in reality my dataset is several orders of magnitude larger, but this will suffice as a sample set)
df1:
> df
name start.date
1 Allison 2013-03-16
2 Andrew 2013-03-16
3 Carl 2013-03-16
4 Dora 2013-03-17
5 Hilary 2013-03-17
6 Louis 2013-03-18
7 Mary 2013-03-19
8 Mickey 2013-03-20
And df2:
> df2
names X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
2001 Allison 5 5 0 0
2002 Andrew 2 0 0 0
2003 Carl 10 8 11 10
2004 Dora 0 4 0 0
2005 Hilary 0 3 5 0
2006 Louis 0 0 8 3
2007 Mary 0 0 0 7
2008 Mickey 0 0 0 0
I merged these two data frames into one called tmp:
>tmp
name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
1 Allison 2013-03-16 5 5 0 0
2 Andrew 2013-03-16 2 0 0 0
3 Carl 2013-03-16 10 8 11 10
4 Dora 2013-03-17 0 4 0 0
5 Hilary 2013-03-17 0 3 5 0
6 Louis 2013-03-18 0 0 8 3
7 Mary 2013-03-19 0 0 0 7
8 Mickey 2013-03-20 0 0 0 0
I also have a list of the column names of df2 converted to dates:
>dts
[1] "2014-03-16" "2014-03-17" "2014-03-18" "2014-03-19"
I thought one way to approach this problem is to change all the zero entries in df2 that occur before each user's start date to NA using the following nested loops:
for (i in 1:dim(tmp)[1]){
for (j in 1:length(dts)){
for (z in 4:dim(tmp)[2]){
if (dts[j]< tmp$Date.of.Sign.Up[i]){
tmp[i,z]<-NA
} else {tmp[i,z]<-tmp[i,z]}
}
}
}
The trouble with this loop is that 1. It will run infinitely and 2. Doesn't work. It is changing all the values in tmp from tmp[,3:end] to zero, regardless of start date. Ideally I would end up with something like this:
name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
Allison 2013-03-16 5 5 0 0
Andrew 2013-03-16 2 0 0 0
Carl 2013-03-16 10 8 11 10
Dora 2013-03-17 NA 4 0 0
Hilary 2013-03-17 NA 3 5 0
Louis 2013-03-18 NA NA 8 3
Mary 2013-03-19 NA NA NA 7
Mickey 2013-03-20 NA NA NA NA
Any suggestions? Thank you in advance!
You may reshape 'tmp' to long format, convert the former headers to dates, compare them with the start dates, and insert NA when start dates occur after the 'header dates':
library(reshape2)
# melt data from wide to long format
df3 <- melt(tmp, id.vars = c("name", "start.date"))
# convert 'variable' to class Date
df3$variable <- as.Date(df3$variable, format = "X%m.%d.%Y")
# compare start dates with 'variable dates' and insert NA
df3$value[df3$start.date > df3$variable] <- NA
# reshape back to wide
dcast(df3, name + start.date ~ variable)
# name start.date 2013-03-16 2013-03-17 2013-03-18 2013-03-19
# 1 Allison 2013-03-16 5 5 0 0
# 2 Andrew 2013-03-16 2 0 0 0
# 3 Carl 2013-03-16 10 8 11 10
# 4 Dora 2013-03-17 NA 4 0 0
# 5 Hilary 2013-03-17 NA 3 5 0
# 6 Louis 2013-03-18 NA NA 8 3
# 7 Mary 2013-03-19 NA NA NA 7
# 8 Mickey 2013-03-20 NA NA NA NA
Another possibility where we loop over the 'date column names' in 'tmp':
dates <- names(tmp)[-c(1, 2)]
tmp[ , -c(1, 2)] <- sapply(dates, function(x){
date <- as.Date(x, format = "X%m.%d.%Y")
tmp[ , x][df2$start.date > date] <- NA
tmp[ , x]
})
tmp
# name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
# 1 Allison 2013-03-16 5 5 0 0
# 2 Andrew 2013-03-16 2 0 0 0
# 3 Carl 2013-03-16 10 8 11 10
# 4 Dora 2013-03-17 NA 4 0 0
# 5 Hilary 2013-03-17 NA 3 5 0
# 6 Louis 2013-03-18 NA NA 8 3
# 7 Mary 2013-03-19 NA NA NA 7
# 8 Mickey 2013-03-20 NA NA NA NA

Resources