data.table join + update with mult='first' gives unexpected result

In the below example, I have a table of users and a table of transactions where one user can have 0, 1, or more transactions. I execute a join+update with mult='first' on the users table to attempt to insert a column indicating the date of the first occurring transaction for each user.
library(data.table) # v1.10.4
# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")
# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]
users
   UserID     User Gender Registered  Cancelled
1:      1  Charles   male 2012-12-21       <NA>
2:      2    Pedro   male 2010-08-01 2010-08-08
3:      3 Caroline female 2012-10-23 2016-06-07
4:      4  Brielle female 2013-07-17       <NA>
5:      5 Benjamin   male 2010-11-25       <NA>
transactions
    TransactionID TransactionDate UserID ProductID Quantity
 1:             1      2010-08-21      7         2        1
 2:             2      2011-05-26      3         4        1
 3:             3      2011-06-16      3         3        1
 4:             4      2012-08-26      1         2        3
 5:             5      2013-06-06      2         4        1
 6:             6      2013-12-23      2         5        6
 7:             7      2013-12-30      3         4        1
 8:             8      2014-04-24     NA         2        3
 9:             9      2015-04-24      7         4        3
10:            10      2016-05-08      3         4        4
##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]
# Unexpected result
users[UserID == 2]
   UserID  User Gender Registered  Cancelled FirstTransactionDate
1:      2 Pedro   male 2010-08-01 2010-08-08           2013-12-23  # <- shouldn't this be 2013-06-06?
Why does FirstTransactionDate 2013-12-23 get set for user 2 when an earlier transaction in the transactions table is tied to that user? Is this a bug?

Reading the documentation for data.table's mult more closely, it says that:
When i is a list (or data.frame or data.table) and multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".
So if there were multiple rows in x ("users") that matched a row in i ("transactions"), then mult would control which row of x is returned. However, in your case there aren't multiple rows in x that match to i; rather, there are multiple rows in i that match to x, so mult has no effect.
As @Arun suggested, the best option would be to change your join around so that mult = "first" is relevant:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
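One caveat worth noting: mult = "first" returns the first matching row in x's current row order, not necessarily the earliest date, so this relies on transactions happening to be sorted by TransactionDate (it is, in this data). A defensive sketch, assuming the objects as loaded above:
# sort by date first so "first matching row" means "earliest transaction"
setorder(transactions, TransactionDate)
users[, FirstTransactionDate := transactions[users, TransactionDate,
                                             on = "UserID", mult = "first"]]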
Another option would be to change up your merge slightly:
users[transactions[, FirstTransactionDate := min(TransactionDate), by = UserID],
      FirstTransactionDate := i.FirstTransactionDate, on = "UserID"]
I just create the first transaction date within the transactions dataset. It gets merged on multiple times, but that should be fine because it's always the same value for a given UserID.
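Alternatively, if you'd rather not add a column to transactions by reference, a sketch that aggregates first and then does the join+update (same result, assuming the data above):
# one row per user with the earliest date, then a plain join+update
firsts <- transactions[, .(FirstTransactionDate = min(TransactionDate)), by = UserID]
users[firsts, FirstTransactionDate := i.FirstTransactionDate, on = "UserID"]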

Related

Finding newest data older than a specific date in R

I have two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have an identification number (a variable we can call "individual") that occurs in both of the data.frames, and each frame has information on when the data was collected, stored in a column that we can call "some.date".
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have a code that solves the problem, but it is too slow for the data that I have in mind (which has well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
  subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ]  # We only look at the individual in question.
  subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]  # Here we get all the rows that have data measured BEFORE the timepoint.
  row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"])  # This can be done, since we have already made sure that fromdatum < timepoint.
  ifelse(length(row.index) != 0, as.integer(rownames(subsetted.dataset[row.index, ])), NA)  # Then we output the row that had that information.
}
output <- matrix(ncol = 2, nrow = 0)
for(i in 1:nrow(dataset.new)){
  output <- rbind(output,
                  cbind(dataset.new[, "individual"][i],
                        find.previous(dataset.old, dataset.new[, "individual"][i],
                                      dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts here on Stack Overflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
  individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
8 7 2016-03-03 NA NA
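For reference, here is one way those steps could look in base R; a sketch, assuming dataset.new and dataset.old as defined in the question:
# 1. record original row numbers, 2. left-join, 3. keep strictly older dates
#    (or the NA placeholder row), 4. take the newest old date per individual
dataset.old$old.rownumber <- seq_len(nrow(dataset.old))
m <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE,
           suffixes = c(".new", ".old"))
m <- m[m$some.date.new > m$some.date.old | is.na(m$some.date.old), ]
do.call(rbind, lapply(split(m, m$individual), function(g) {
  nr <- if (all(is.na(g$some.date.old))) NA_integer_
        else g$old.rownumber[which.max(g$some.date.old)]
  data.frame(individual = g$individual[1], row.nr = nr)
}))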
Edit:
I'm partial to data.table. The code would look something like:
library(data.table)
setDT(dataset.old); setDT(dataset.new)
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
dataset.merge <- merge(dataset.old, dataset.new, by = "individual", all.y = TRUE)
dataset.merge <- dataset.merge[new.date > old.date | is.na(old.date)]
dataset.merge[, .SD[old.date == max(old.date) | is.na(old.date)], by = individual]
We can skip the NA search by finding the minimum square root. The negative values will be coerced to missing for us:
library(dplyr)  # for %>%, group_by and summarise
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE)
mrg %>% group_by(individual) %>%
  summarise(row.nr = rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA
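For completeness, a data.table rolling join expresses "newest row still older than" almost directly; a sketch, assuming the example data above (note that roll matches an equal date too, whereas the question asks for strictly older):
library(data.table)
old <- as.data.table(dataset.old)[, row.nr := .I]
new <- as.data.table(dataset.new)
# roll = Inf matches each new date to the most recent old date at or before it
old[new, .(individual = i.individual, row.nr), on = .(individual, some.date), roll = Inf]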

Combining observations with overlapping dates

Each observation in my dataframe contains a different "before date" and "after date" instance. The problem is that some dates overlap for each ID. For instance, in the table below, IDs 1 and 4 contain overlapping date values.
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 9/30/2003
1 1/1/2000 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
4 10/1/2004 12/30/2004
4 10/3/2004 11/28/2004
I am trying to get something like this:
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
Basically, I would like to replace any overlapping date values with the date range of the values with the overlap, leave the non-overlapping values alone, and delete any unnecessary rows. I'm not sure how to go about doing this.
Firstly, you should convert your string dates into Date-classed values, which will make comparison possible. Here's how I've defined and coerced your data:
df <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') );
dcis <- grep('date$',names(df));
df[dcis] <- lapply(df[dcis],as.Date,'%m/%d/%Y');
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-12-30
## 8 4 2004-10-03 2004-11-28
Now, my solution involves computing an "overlapping grouping" vector which I've called og. It makes the assumption that the input df is ordered by ID and then before.date, which it is in your example data. If not, this could be achieved by df[order(df$ID,df$before.date),]. Here's how I compute og:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
Unfortunately, the base R cummax() function doesn't work on Date-classed objects, so I had to write a cummax.Date() shim. I'll explain the need for the ave() and cummax() business at the end of the post.
As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via [-1]. This allows us to compare a record's ID for equality with the following record's ID, and also compare if its after.date is after the before.date of the following record. The resulting logical vectors are ANDed (&) together. The negation of that logical vector then represents adjacent pairs of records that do not overlap, and thus we can cumsum() the result (and prepend zero, as the first record must start with zero) to get our grouping vector.
Finally, for the final piece of the solution, I've used by() to work with each overlapping group independently:
do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Since all records in a group must have the same ID, and we've made the assumption that records are ordered by before.date (after being ordered by ID, which is no longer relevant), we can get the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need to get the greatest after.date from the group via max(g$after.date), and overwrite the first record's after.date with that, which I've done with transform().
A word about performance: The assumption about ordering aids performance, because it allows us to simply compare each record against the immediately following record via lagged vectorized comparisons, rather than comparing every record in a group with every other record.
Now, for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution, which happens to not be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps with both of the following two records, and then the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group of the previous two records.
The solution is to not simply use the after.date of the current record when comparing against the following record, but instead use the cumulative maximum after.date within the group. If any earlier record sprawled completely beyond its immediately following record, then it obviously overlapped with that record, and its after.date is what's important in considering overlapping groups for subsequent records.
Here's a demonstration of input data that requires this fix, using your df as a base:
df2 <- df;
df2[7,'after.date'] <- '2004-10-02';
df2;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-10-02
## 8 4 2004-10-03 2004-11-28
Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Here's a proof that the og calculation would be wrong without the ave()/cummax() fix:
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5
Minor adjustment to the solution, to overwrite after.date in advance of the og computation, and avoid the max() call (makes more sense if you're planning on overwriting the original df with the new aggregation):
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2005-09-30
## 8 4 2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
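For comparison, the same grouping idea can be expressed compactly in data.table; a sketch, assuming the df defined above (the cumulative max plays the same role as the ave()/cummax() fix):
library(data.table)
dt <- as.data.table(df)
setorder(dt, ID, before.date)
# start a new group whenever a row begins at/after the running max end date
dt[, grp := cumsum(c(TRUE, as.integer(before.date)[-1] >=
                           cummax(as.integer(after.date))[-.N])), by = ID]
dt[, .(before.date = first(before.date), after.date = max(after.date)),
   by = .(ID, grp)][, grp := NULL][]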

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names = c("john","mary","tom"),
                 dates = as.Date(c("2010-06-01","2010-07-09","2010-06-01")),
                 tours_missed = c(2,12,6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row, In R: Add rows with data of previous row to data frame, add new row to dataframe. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed),
   by = names]
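A rough tidyverse equivalent of that last construction, for those not using data.table; a sketch, assuming df as defined above and dplyr >= 1.1 for reframe():
library(dplyr)
df %>%
  group_by(names) %>%
  reframe(dates = seq(dates, length.out = tours_missed / 2, by = "4 days"),
          tours_missed = tours_missed)  # length-1 value recycled across the new rows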

R finding date intervals by ID

Having the following table, which comprises these key columns: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG format, in that you will get multiple line items for the one Customer ID.
I can get the first date and last date using R DateDiff, but converting the file to WIDE format using plyr, I still end up with the same problem of getting multiple orders by customer, just fewer rows and more columns.
Is there an R function that extends R DateDiff to work out the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do since you don't give the expected result, but I guess you want to get the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
          diff = c(0, diff(sort(as.Date(Order.Date, '%d/%m/%y'))))), CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
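The same idea can also be written with shift(), which makes the "previous order" explicit; a sketch, assuming the question's table is in a data.frame DF as in the answer above (here the first gap per customer comes out as NA rather than 0):
library(data.table)
DT <- as.data.table(DF)
DT[, Order.Date := as.Date(Order.Date, '%d/%m/%y')]
setorder(DT, CID, Order.Date)  # sort so shift() really sees the previous order
DT[, diff := as.integer(Order.Date - shift(Order.Date)), by = CID]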
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID = as.factor(c(rep("A",3), rep("B",4))),
                 OrderDate = as.Date(c("2013-07-01","2013-07-02","2013-07-03",
                                       "2013-06-01","2013-06-02","2013-06-03",
                                       "2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs, function(x){
  tmp <- diff(x$OrderDate)
  tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
    mutate(SinceLast = (interval(ymd(lag(OrderDate)), ymd(OrderDate))) / 86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.

R: Find and add missing (/non existing) rows in time related data frame

I'm struggling with the following.
I have a (big) data frame with the following:
several columns for which the combination of columns is a 'unique' combination, say ID
a time related column
a measure related column
I want to make sure that for each unique ID for each time interval a measure is available in the data frame. And if it is not, I want to add a 0 (or NA) measure for that time/ID.
To illustrate the problem, create the following test data frame:
test <- data.frame(
  YearWeek   = rep(c("2012-01","2012-02"), each = 4),
  ProductID  = rep(c(1,2), times = 4),
  CustomerID = rep(c("a","b"), each = 2, times = 2),
  Quantity   = 5:12
)[1:7,]
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 2 a 6
3 2012-01 1 b 7
4 2012-01 2 b 8
5 2012-02 1 a 9
6 2012-02 2 a 10
7 2012-02 1 b 11
The 8th row is left out, on purpose. This way I simulate a 'missing value' (missing Quantity) for ID '2-b' (ProductID-CustomerID) for the time value "2012-02".
What I want to do is adjust the data.frame so that for all time values (these are known; in this example just "2012-01" and "2012-02") and for all ID combinations (these are not known upfront, but are simply all unique ID combinations in the data frame), a Quantity is available in the data frame.
For this example that should result in the following (if we choose NA for the missing value; typically I want control over that):
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 2 a 6
3 2012-01 1 b 7
4 2012-01 2 b 8
5 2012-02 1 a 9
6 2012-02 2 a 10
7 2012-02 1 b 11
8 2012-02 2 b NA
The ultimate goal is to create time series for these ID combinations, and I therefore want Quantities for all time values. I need to do different aggregations (on time), using different levels of IDs, from a big dataset.
I tried several things, for instance with melt and cast from the reshape package, but so far I haven't managed to do it. The next step would be creating a function with for-loops etc., but that is not really useful from a performance perspective.
Maybe there is an easier way to create time series instantly, given a data.frame like test. Does anybody have an idea on this one?
Thanks in advance!
Note that in the actual problem there are more than two 'ID columns'.
EDIT:
I should describe the problem further. There is a difference between the 'time' column and the 'ID' columns. The first (and great!) answer to the question, by joran, maybe didn't get a clear picture of what I want (and the example I gave didn't make the difference clear). I said above:
for all ID-combinations (these are not known upfront, but this is 'all
unique ID combinations in the data frame', thus the unique set on the
ID columns)
So I do not want 'all possible ID combinations' but 'all ID combinations within the data'.
For each of those combinations I want a value for every unique time-value.
Let me make it clear by expanding test to test2, as follows
> test2 <- rbind(test, c("2012-02", 3, "a", 13))
> test2
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 2 a 6
3 2012-01 1 b 7
4 2012-01 2 b 8
5 2012-02 1 a 9
6 2012-02 2 a 10
7 2012-02 1 b 11
8 2012-02 3 a 13
Which means I want in the resulting data frame no '3-b' ID combination, because this combination is not within test2. If I use the method of the first answer I will get the following:
> vals2 <- expand.grid(YearWeek = unique(test2$YearWeek),
                       ProductID = unique(test2$ProductID),
                       CustomerID = unique(test2$CustomerID))
> merge(vals2,test2,all = TRUE)
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 1 b 7
3 2012-01 2 a 6
4 2012-01 2 b 8
5 2012-01 3 a <NA>
6 2012-01 3 b <NA>
7 2012-02 1 a 9
8 2012-02 1 b 11
9 2012-02 2 a 10
10 2012-02 2 b <NA>
11 2012-02 3 a 13
12 2012-02 3 b <NA>
So I don't want the rows 6 and 12 to be here.
To overcome this problem I found the solution below, in which I split the 'unique time column' from the 'unique ID combinations'. The difference from the above is thus the word 'combination': I take the unique combinations that actually occur, not the unique values of each ID column separately.
> temp_merge <- merge(unique(test2["YearWeek"]),
                      unique(test2[c("ProductID", "CustomerID")]))
> merge(temp_merge,test2,all = TRUE)
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 1 b 7
3 2012-01 2 a 6
4 2012-01 2 b 8
5 2012-01 3 a <NA>
6 2012-02 1 a 9
7 2012-02 1 b 11
8 2012-02 2 a 10
9 2012-02 2 b <NA>
10 2012-02 3 a 13
What are the comments on this one?
Is this an elegant way, or are there better ways?
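(As one point of comparison, tidyr's complete() with nesting() expresses exactly this "all time values × observed ID combinations" idea; a sketch, assuming test2 as built above:)
library(tidyr)
# nesting() keeps only the ProductID/CustomerID pairs that occur in the data
complete(test2, YearWeek, nesting(ProductID, CustomerID))
# pass e.g. fill = list(Quantity = 0) to insert zeros instead of NA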
Use expand.grid and merge:
vals <- expand.grid(YearWeek = unique(test$YearWeek),
                    ProductID = unique(test$ProductID),
                    CustomerID = unique(test$CustomerID))
> merge(vals,test,all = TRUE)
YearWeek ProductID CustomerID Quantity
1 2012-01 1 a 5
2 2012-01 1 b 7
3 2012-01 2 a 6
4 2012-01 2 b 8
5 2012-02 1 a 9
6 2012-02 1 b 11
7 2012-02 2 a 10
8 2012-02 2 b NA
The NAs can be replaced after the fact with whatever values you choose using subsetting and is.na.
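For example, to use zeros instead (a sketch, with the merge result stored in a hypothetical res):
res <- merge(vals, test, all = TRUE)
res$Quantity[is.na(res$Quantity)] <- 0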
