Combining observations with overlapping dates - r

Each observations in my dataframe contains a different "before date" and "after date instance". The problem is some dates overlap for each ID. For instance, in the table below, ID's 1 and 4 contain overlapping date values.
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 9/30/2003
1 1/1/2000 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
4 10/1/2004 12/30/2004
4 10/3/2004 11/28/2004
I am trying to get something like this:
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
Basically, I would like to replace any overlapping date values with the date range of the values with the overlap, leave the non-overlapping values alone, and delete any unnecessary rows. Not sure how to go about doing this

Firstly, you should convert your string dates into Date-classed values, which will make comparison possible. Here's how I've defined and coerced your data:
df <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') );
dcis <- grep('date$',names(df));
df[dcis] <- lapply(df[dcis],as.Date,'%m/%d/%Y');
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-12-30
## 8 4 2004-10-03 2004-11-28
Now, my solution involves computing an "overlapping grouping" vector which I've called og. It makes the assumption that the input df is ordered by ID and then before.date, which it is in your example data. If not, this could be achieved by df[order(df$ID,df$before.date),]. Here's how I compute og:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
Unfortunately, the base R cummax() function doesn't work on Date-classed objects, so I had to write a cummax.Date() shim. I'll explain the need for the ave() and cummax() business at the end of the post.
As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via [-1]. This allows us to compare a record's ID for equality with the following record's ID, and also compare if its after.date is after the before.date of the following record. The resulting logical vectors are ANDed (&) together. The negation of that logical vector then represents adjacent pairs of records that do not overlap, and thus we can cumsum() the result (and prepend zero, as the first record must start with zero) to get our grouping vector.
Finally, for the final piece of the solution, I've used by() to work with each overlapping group independently:
do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Since all records in a group must have the same ID, and we've made the assumption that records are ordered by before.date (after being ordered by ID, which is no longer relevant), we can get the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need to get the greatest after.date from the group via max(g$after.date), and overwrite the first record's after.date with that, which I've done with transform().
A word about performance: The assumption about ordering aids performance, because it allows us to simply compare each record against the immediately following record via lagged vectorized comparisons, rather than comparing every record in a group with every other record.
Now, for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution, which happens to not be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps with both of the following two records, and then the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group of the previous two records.
The solution is to not simply use the after.date of the current record when comparing against the following record, but instead use the cumulative maximum after.date within the group. If any earlier record sprawled completely beyond its immediately following record, then it obviously overlapped with that record, and its after.date is what's important in considering overlapping groups for subsequent records.
Here's a demonstration of input data that requires this fix, using your df as a base:
df2 <- df;
df2[7,'after.date'] <- '2004-10-02';
df2;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-10-02
## 8 4 2004-10-03 2004-11-28
Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Here's a proof that the og calculation would be wrong without the ave()/cummax() fix:
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5
Minor adjustment to the solution, to overwrite after.date in advance of the og computation, and avoid the max() call (makes more sense if you're planning on overwriting the original df with the new aggregation):
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2005-09-30
## 8 4 2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30

Related

Finding newest data older than a specific date in R

I have a two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have a identification number (a variable we can call ”individual”) that occurs in both of the data.frames and each frame has information on when the data was collected, stored in a column that we can call ”some.date”.
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have a code that solves the problem, but it is too slow for the data that I have in mind (which has well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ] # We only look at the individual in question.
subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]# Here we get all the rows that have data that are measured BEFORE timepoint.
row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"]) # This can be done, since we have already made sure that fromdatum < timepoint.
ifelse(length(row.index)!= 0, as.integer(rownames(subsetted.dataset[row.index,])), NA) # Then we output the row that had that information.
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
output <- rbind(output, cbind(dataset.new[, "individual"][i], find.previous(dataset.old, dataset.new[, "individual"][i], dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts on here stackoverflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
6 2 2016-01-28 2012-01-27 5
8 7 2016-03-03 NA NA
Edit:
I'm partial to data.table. The code would look something like:
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
dataset.merge <- merge(dataset.old, dataset.new, by = "individual", all.x = TRUE)
dataset.merge <- dataset.merge[, new.date > old.date]
dataset.merge[old.date == max(old.date) | is.na(old.date), by = individual]
We can skip the NA search by finding the minimum square root. The negative values will be coerced to missing for us:
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
libary(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]

Identify and remove duplicates by a criteria in R

Hi I am puzzled with a problem concerning duplicates in R. I have looked around a lot and don't seem to find any help. I have a dataset like that
x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
"08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
"02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
Group = c(1,1,1,2,2,3,4,2,3,4,4),
TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
"NA", "07/09/2009", "07/09/2009", "08/10/2009"),
Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
)
> x
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
2 A 09/07/2006 06/08/2006 1 08/09/2006 4
3 A 09/07/2006 06/08/2006 1 08/10/2006 4858
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
9 B 08/06/2009 06/07/2009 3 07/09/2009 795
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
So basically what I am trying to do is to identify duplicates in the TestDate variable by ID. For example dates 08/09/2006 and 08/10/2006 seem to be repeated in the same person but for different Group and I don't want the same Testdate to be in different Group by ID. The criteria to choose which TestDate to choose is to take the difference in days of TestDate with StartDate and EndDate for the different groups and then keep the one with the smallest difference in days. For example, about the date 08/10/2006 I would like to keep row 5 as the TestDate there is closer to the StartDate, than compared with the same differences in row 3. Eventually, I would like to get with a dataset like that
> xfinal
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
Any help on that will be much appreciated. Thanks
x$StartDate <- as.Date(x$StartDate,format="%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate,format="%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate,format="%d/%m/%Y")
x$Diff <- difftime(x$EndDate,x$StartDate,"days")
x <- x[order(x$id,x$Diff),]
x <- x[!duplicated(x[,c("id","TestDate")]),]
x$Diff <- NULL
x

Comparing elements between groups in R

I have a data frame arranged as below:
DEPUTIES CHAMBER (...)
1 2496 1
2 2577 1
3 2577 2
4 2577 3
5 2577 4
6 2578 2
(...)
I have 2322 different deputies and 4 chambers, but some deputies appear in more than one chamber. What I want to do is to create a variable that indicates whether a deputy was in the previous chamber ("reelection") or not (the first chamber will be discarded later). I think it is probably simple, but could someone help me out?
Like this?
df <- df[order(df$DEPUTIES,df$CHAMBER),]
df$r <- unlist(aggregate(CHAMBER~DEPUTIES,df,function(x)c(NA,diff(x)))$CHAMBER)
df
# DEPUTIES CHAMBER r
# 1 2496 1 NA
# 2 2577 1 NA
# 3 2577 2 1
# 4 2577 3 1
# 5 2577 4 1
# 6 2578 2 NA
This orders the df by deputies and chamber (already ordered that way it seems, but to be sure...). Then, using aggregate(...), for each deputy calculate the difference between the current chamber number and the previous. If this is >0, they went from, e.g ,chamber 1 to chamber 2. Not sure what to do if someone starts in a chamber > 1 but never advances??

R finding date intervals by ID

Having the following table which comprises some key columns which are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG Format, in that you will get multi line items for the 1 Customer ID.
I can get the first date last date using R DateDiff but converting the file to WIDE format using Plyr, still end up with the same problem of getting multiple orders by customer, just less rows and more columns.
Is there an R function that extends R DateDiff to work out how to get the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on assuming these orders exists.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what do you want to do since you don't give the expected result. But I guess you want to the the intervals between 2 orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.

Resources