Identify and remove duplicates by a criterion in R - r

Hi, I am puzzled by a problem concerning duplicates in R. I have looked around a lot and can't seem to find any help. I have a dataset like this:
x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
"08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
"02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
Group = c(1,1,1,2,2,3,4,2,3,4,4),
TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
"NA", "07/09/2009", "07/09/2009", "08/10/2009"),
Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
)
> x
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
2 A 09/07/2006 06/08/2006 1 08/09/2006 4
3 A 09/07/2006 06/08/2006 1 08/10/2006 4858
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
9 B 08/06/2009 06/07/2009 3 07/09/2009 795
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
What I am trying to do is identify duplicates in the TestDate variable within each id. For example, the dates 08/09/2006 and 08/10/2006 are repeated for the same person but in different Groups, and I don't want the same TestDate to appear in more than one Group for a given id. To decide which row to keep, I take the difference in days between TestDate and the StartDate and EndDate of each group, and keep the row with the smallest difference. For example, for the date 08/10/2006 I would like to keep row 5, because the TestDate there is closer to the StartDate than it is in row 3. Eventually, I would like to end up with a dataset like this:
> xfinal
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
Any help on that will be much appreciated. Thanks

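# Parse the dd/mm/yyyy strings; the literal "NA" strings become missing dates
x$StartDate <- as.Date(x$StartDate,format="%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate,format="%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate,format="%d/%m/%Y")
# Days from TestDate to the nearer of StartDate and EndDate
x$Diff <- pmin(abs(as.numeric(x$TestDate - x$StartDate)),
               abs(as.numeric(x$TestDate - x$EndDate)))
# Sort so the closest row per id/TestDate comes first, then drop later duplicates
x <- x[order(x$id,x$Diff),]
x <- x[!duplicated(x[,c("id","TestDate")]),]
# Restore the original row order and remove the helper column
x <- x[order(as.integer(rownames(x))),]
x$Diff <- NULL
x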
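As a variation on the idea above, the sketch below flags the closest row per id/TestDate pair with ave() instead of sorting, so the original row order is never disturbed. It is only an illustration under a few assumptions: the helper d() and the variables dist, key, best and keep are names introduced here, missing test dates are entered as real NA values rather than the string "NA", and tied distances would keep every tied row.

```r
x <- data.frame(
  id = c("A","A","A","A","A","A","A","B","B","B","B"),
  StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006", "08/10/2006",
                "09/04/2007", "02/03/2011", "05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
  EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006",
              "07/05/2007", "30/03/2011", "02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
  TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006",
               NA, "02/03/2011", NA, "07/09/2009", "07/09/2009", "08/10/2009")
)

# d() is a hypothetical helper that parses dd/mm/yyyy strings
d <- function(s) as.Date(s, format = "%d/%m/%Y")

# Days from TestDate to the nearer of StartDate and EndDate (NA when TestDate is missing)
dist <- pmin(abs(as.numeric(d(x$TestDate) - d(x$StartDate))),
             abs(as.numeric(d(x$TestDate) - d(x$EndDate))))

# Smallest distance within each id/TestDate pair
key  <- paste(x$id, x$TestDate)
best <- ave(dist, key, FUN = min)

# Keep rows without a TestDate, and rows that achieve the smallest distance
keep <- is.na(dist) | dist == best
xfinal <- x[keep, ]
rownames(xfinal)  # "1" "4" "5" "6" "7" "8" "10" "11"
```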

Related

Issue with merge statement

My merge does not seem to be working anymore and I do not know why. Below are the code and a sample of the two data sets I am merging:
head(lookup.table)
code label
1: I-2 1
2: I-3 2
3: I-4 3
4: I-5 4
5: I-6 5
6: I-7 6
df
Rate
1 S-4
2 S-4
3 S-4
4 S-1
5 S-2
6 S-4
Code to reproduce the example
library(data.table)
letter=c('I','S','P','D')
start=c(2,1,1,1)
end=c(7,4,3,2)
label=1:15
code.table = data.table(letter,start,end)
code.vector = unlist(apply(code.table,1,function(x) paste(x[1],x[2]:x[3],sep='-')))
lookup.table = data.table(code=code.vector,label=label)
df = data.table(Rate = paste0("S-", c(4,4,4,1,2,4)))
Attempt:
df$rank = merge(df,lookup.table,by.x="Rate",by.y="code",all.x=TRUE,sort=F)$label
Below is a sample of the output; the merge is not producing the expected results. I expect the merge to join lookup.table and df where code equals Rate.
rank Rate
1 10 S-4
2 10 S-4
3 10 S-3
4 10 I-5
5 10 I-5
6 10 I-6
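One way to attach the label without depending on the row order that merge() returns is a direct lookup with match(). The sketch below is a minimal illustration rather than a diagnosis of the original problem: it rebuilds the 15-row lookup table from the reproducible example with plain data frames (the paste0() construction is an assumption that mirrors the generated codes) and indexes label by the position of each Rate in code.

```r
# Re-create the lookup table from the reproducible example (15 codes, labels 1..15)
code <- c(paste0("I-", 2:7), paste0("S-", 1:4), paste0("P-", 1:3), paste0("D-", 1:2))
lookup.table <- data.frame(code = code, label = 1:15)
df <- data.frame(Rate = paste0("S-", c(4, 4, 4, 1, 2, 4)))

# match() gives, for every Rate, the row of lookup.table holding the same code
# (NA when a Rate has no matching code), so the result lines up with df row by row
df$rank <- lookup.table$label[match(df$Rate, lookup.table$code)]
df$rank  # 10 10 10 7 8 10
```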

How to calculate moving average for different starting date?

I would like to calculate moving averages for each participant in the dataset.
A participant may have more than one visit date, and I would like to calculate the average value over the past 3 days and over the past 2 days before each visit (not including the day of the visit).
For example, let id=1, date=6/6/2017.
Average value in the past 2 days should be an average of value on 6/5/2017 and 6/4/2017.
Sample datasets are generated as below.
I am working on a much larger dataset, with more participants, more visits, and more days of value. I want to find an efficient way to calculate these averages.
timeseries <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3), date=c("6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017"),
value=c(2,3,4,NA,6,7,
NA,9,5,NA,3,2,
5,7,3,8,3,5))
> timeseries
id date value
1 1 6/1/2017 2
2 1 6/2/2017 3
3 1 6/3/2017 4
4 1 6/4/2017 NA
5 1 6/5/2017 6
6 1 6/6/2017 7
7 2 6/1/2017 NA
8 2 6/2/2017 9
9 2 6/3/2017 5
10 2 6/4/2017 NA
...
visit <- data.frame(id=c(1,1,2,3,3,3),
date=c("6/6/2017","6/5/2017",
"6/6/2017",
"6/6/2017","6/5/2017","6/4/2017"))
> visit
id date
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
The result table should be something like this, where mean3 is the average value in the past 3 days, and mean2 is the average value in the past 2 days
> result
id date mean3 mean2
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
For each id of visit, I subset corresponding data from timeseries and then calculated mean of the value within n_days.
library(lubridate)
n_days = 2
sapply(1:NROW(visit), function(i)
with(subset(x = timeseries,
subset = timeseries$id == visit$id[i]),
mean(x = value[difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") <= n_days &
difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") > 0],
na.rm = TRUE)))
#[1] 6.0 4.0 3.0 5.5 5.5 5.0
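The same subsetting idea can be wrapped in a small function and called once per window to fill both columns of the requested result table. This is a rough sketch in base R only: past_mean() is a hypothetical helper introduced here, dates are parsed once with as.Date() instead of mdy(), and the data are re-entered from the question.

```r
timeseries <- data.frame(
  id = rep(1:3, each = 6),
  date = rep(paste0("6/", 1:6, "/2017"), times = 3),
  value = c(2, 3, 4, NA, 6, 7,
            NA, 9, 5, NA, 3, 2,
            5, 7, 3, 8, 3, 5)
)
visit <- data.frame(
  id = c(1, 1, 2, 3, 3, 3),
  date = c("6/6/2017", "6/5/2017", "6/6/2017", "6/6/2017", "6/5/2017", "6/4/2017")
)

# Parse the m/d/Y strings once, outside the loop
ts_date    <- as.Date(timeseries$date, format = "%m/%d/%Y")
visit_date <- as.Date(visit$date, format = "%m/%d/%Y")

# past_mean(): mean of value over the n_days before each visit, visit day excluded
past_mean <- function(n_days) {
  sapply(seq_len(nrow(visit)), function(i) {
    lag_days <- as.numeric(visit_date[i] - ts_date)
    keep <- timeseries$id == visit$id[i] & lag_days > 0 & lag_days <= n_days
    mean(timeseries$value[keep], na.rm = TRUE)
  })
}

result <- cbind(visit, mean3 = past_mean(3), mean2 = past_mean(2))
result
```

For a much larger dataset this still scans timeseries once per visit, so it shares the cost profile of the sapply() call above; it only tidies the bookkeeping.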

Finding newest data older than a specific date in R

I have two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. Each individual has an identification number (a variable we can call "individual") that occurs in both data.frames, and each frame records when the data was collected in a column that we can call "some.date".
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way to get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have a code that solves the problem, but it is too slow for the data that I have in mind (which has well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ] # We only look at the individual in question.
subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]# Here we get all the rows that have data that are measured BEFORE timepoint.
row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"]) # This can be done, since we have already made sure that fromdatum < timepoint.
ifelse(length(row.index)!= 0, as.integer(rownames(subsetted.dataset[row.index,])), NA) # Then we output the row that had that information.
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
output <- rbind(output, cbind(dataset.new[, "individual"][i], find.previous(dataset.old, dataset.new[, "individual"][i], dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried searching Google as well as reading other posts here on Stack Overflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
8 7 2016-03-03 NA NA
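The three steps above can be sketched in base R as follows. This is a minimal illustration that assumes one row per individual in dataset.new, as in the example; the working copies old and new, the merged frame m, and the variable latest are names introduced here.

```r
dataset.new <- data.frame(individual = c(1, 2, 7),
                          some.date = as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                          some.date = as.Date(c("2016-01-12", "2015-12-30", "2016-04-27",
                                                "2016-05-02", "2015-11-15", "2012-01-27",
                                                "2016-02-06", "2016-04-30", "2016-01-27",
                                                "2016-03-01", "2011-01-16")))

# Working copies with distinct date names and the original row number of dataset.old
old <- data.frame(individual = dataset.old$individual,
                  old.date = dataset.old$some.date,
                  old.rownumber = seq_len(nrow(dataset.old)))
new <- data.frame(individual = dataset.new$individual,
                  new.date = dataset.new$some.date)

# Step 1: left join, so individuals without any history (here 7) survive with NA
m <- merge(new, old, by = "individual", all.x = TRUE)

# Step 2: keep only strictly older observations, plus the unmatched rows
m <- m[is.na(m$old.date) | m$new.date > m$old.date, ]

# Step 3: within each individual, keep the most recent remaining observation
latest <- ave(as.numeric(m$old.date), m$individual, FUN = max)
m <- m[is.na(m$old.date) | as.numeric(m$old.date) == latest, ]

result <- m[, c("individual", "old.rownumber")]
result$old.rownumber  # 3 5 NA
```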
Edit:
I'm partial to data.table. The code would look something like:
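library(data.table)
setDT(dataset.old); setDT(dataset.new)  # convert the data.frames by reference
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
# Left join from dataset.new so individuals without history (e.g. 7) are kept
dataset.merge <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE)
# Keep only older observations, plus the unmatched rows
dataset.merge <- dataset.merge[new.date > old.date | is.na(old.date)]
# Within each individual, keep the most recent remaining row
dataset.merge[, .SD[is.na(old.date) | old.date == max(old.date)], by = individual]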
We can skip the NA search by finding the minimum square root. The negative values will be coerced to missing for us:
library(dplyr)
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA

R finding date intervals by ID

I have the following table, whose key columns are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in long format, so a single customer ID can have multiple line items.
I can get the first and last date using R DateDiff, but after converting the file to wide format using plyr I still end up with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that extends R DateDiff to work out the time interval between purchases by customer ID? That is, the time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result. But I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
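For completeness, the same "days since the previous order" column can be produced in base R without extra packages. The sketch below reuses the toy df from the split() answer and assumes the rows should be ordered by customer and date first; SinceLast is the column name borrowed from the dplyr answer.

```r
df <- data.frame(customerID = c(rep("A", 3), rep("B", 4)),
                 OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                                       "2013-06-01", "2013-06-02", "2013-06-03",
                                       "2013-07-01")))

# Sort so that diff() compares each order with the one directly before it
df <- df[order(df$customerID, df$OrderDate), ]

# Within each customer: NA for the first order, then the gap in days to the previous order
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(d) c(NA, diff(d)))
df
```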

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multi-column table which shows the volume traded at each price in a particular type of session, with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
Thanks
Does the following do what you want?
It uses a combination of reshape2 and data.table:
library(reshape2)
.DT <- DT[,sum(volume),by = list(price,date,session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
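If a plain price-by-date table for one session is enough, base R's xtabs() can do the sum and the reshaping in a single call. The sketch below is an alternative to the dcast() approach, not a reproduction of it: it uses an ordinary data frame DF (a name introduced here) built like the example data, and note that xtabs() reports 0 rather than NA for price/date combinations that never occur.

```r
set.seed(1)  # volumes are random, so the actual numbers differ from the output above
DF <- data.frame(date = rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each = 10),
                 session = rep(c(1, 2, 3), length.out = 30),
                 price = rep(c(10, 11, 12, 13, 14), length.out = 30),
                 volume = runif(30, min = 10, max = 1000))

# Sum of volume for every price (rows) and date (columns) within session 1
wide <- xtabs(volume ~ price + date, data = DF[DF$session == 1, ])
wide
```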
