Match dataframes excluding the last non-NA value, disregarding order (R)

I have two dataframes:
Partner <- c("Alpha", "Beta", "Zeta")
COL1 <- c("A", "C", "M")
COL2 <- c("B", "D", "K")
COL3 <- c("C", "F", NA)
COL4 <- c("D", NA, NA)
# stringsAsFactors = FALSE keeps the item columns as character
# (the pre-R 4.0 default would create factors, which can trip up the joins below)
df1 <- data.frame(Partner, COL1, COL2, COL3, COL4, stringsAsFactors = FALSE)
lift <- c(9, 10, 11, 12, 12, 23, 12, 24)
RULE1 <- c("B", "B", "D", "A", "C", "K", "M", "K")
RULE2 <- c("A", "A", "C", "B", "A", "M", "T", "M")
RULE3 <- c("G", "D", "M", "C", "M", "E", NA, NA)
RULE4 <- c(NA, NA, "K", "D", NA, NA, NA, NA)
df2 <- data.frame(lift, RULE1, RULE2, RULE3, RULE4, stringsAsFactors = FALSE)
df1
Partner  COL1  COL2  COL3  COL4
Alpha    A     B     C     D
Beta     C     D     F     NA
Zeta     M     K     NA    NA
df2
lift  RULE1  RULE2  RULE3  RULE4
9     B      A      G      NA
10    B      A      D      NA
11    D      C      M      K
12    A      B      C      D
12    C      A      M      NA
23    K      M      E      NA
12    M      T      NA     NA
24    K      M      NA     NA
This is a market basket analysis. df1 lists, per customer/partner, the items they bought: A, B, C, etc.
df2 holds the recommendations associated with the items bought in the past.
The last non-NA value in each df2 row is the recommendation; the values preceding it in that row form the "basket".
So, for example, the first row of df2 states: if B and A are bought together, recommend G.
I want to figure out whether each partner from df1 bought ALL the values in each df2 row excluding the final value, since that is the recommendation. Then add that recommendation to the end of each row of the new dataframe.
For example:
For partner Alpha, would it be good to recommend value G from the first row? Yes, because Alpha bought all the basket values from that row of df2 (A and B).
For partner Beta, it would not be good to recommend value G, because not all of the values from the first row of df2 are found in Beta's row.
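In other words, for a single partner/rule pair the check I'm after looks something like this (illustrative base R, using the values from the example above):
purch  <- c("A", "B", "C", "D")   # Alpha's purchases
rule   <- c("B", "A", "G")        # first row of df2, NAs dropped
basket <- head(rule, -1)          # items that must all have been bought
rec    <- tail(rule, 1)           # the recommendation
all(basket %in% purch)            # TRUE: the rule applies to Alpha
rec %in% purch                    # FALSE -> "No" in the output column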
Final Output:
Partner  COL1  COL2  COL3  COL4  lift  RULE1  RULE2  RULE3  RULE4  Does Last Non-NA Value Exist in Row?
Alpha    A     B     C     D     9     B      A      G      NA     No
Alpha    A     B     C     D     10    B      A      D      NA     Yes
Alpha    A     B     C     D     12    A      B      C      D      Yes
Alpha    A     B     C     D     12    C      A      M      NA     No
Zeta     M     K     NA    NA    23    K      M      E      NA     No
Zeta     M     K     NA    NA    12    M      T      NA     NA     No
Zeta     M     K     NA    NA    24    K      M      NA     NA     Yes
Written out results for clarity:
df3
row 1 outputs "No" because G is not found in the Alpha row, although all values before G (B, A) show up in the Alpha row
row 2 outputs "Yes" because D is found in the Alpha row and all values before D (B, A) show up in the Alpha row
row 3 outputs "Yes" because D is found in the Alpha row and all values before D (A, B, C) show up in the Alpha row
row 4 outputs "No" because M is not found in the Alpha row, although all values before M (C, A) show up in the Alpha row
row 5 outputs "No" because E is not found in the Zeta row, although all values before E (K, M) show up in the Zeta row
row 6 outputs "No" because T is not found in the Zeta row, although all values before T (M) show up in the Zeta row
row 7 outputs "Yes" because M is found in the Zeta row and all values before M (K) show up in the Zeta row
I think this needs a join or a match of some kind, but I can't figure out how to do it.
It would be extremely helpful if someone could help me out with this.
Thanks.
This was my attempt:
library(dplyr)
library(tidyr)
df1 <- cbind(df1_id = 1:nrow(df1), df1)
df2 <- cbind(df2_id = 1:nrow(df2), df2)
d11 <- df1 %>% gather(Col, Value, starts_with("C"))             # to long format
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n())   # remove last non-NA
d22 <- df2 %>% gather(Rule, Value, starts_with("R"))
res <- inner_join(d11, d22)
rm(d22)
rm(d11)
final <- cbind(df1[res$df1_id, ], df2[res$df2_id, ])
final$Exist <- apply(final, 1, FUN = function(x)
  c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != ""]) != 0) + 1])
But this didn't work, because it didn't take all of the values into account; it only checked whether one of them matched, not all.

This is quite tricky because the purchases of n customers have to be compared against a set of m rules. Besides this, there are two points which add to the complexity:
The last non-NA RULE column in df2 is semantically different from the others. Unfortunately, the given data structure doesn't reflect this: df2 is missing an explicit recommended column.
Finally, it has to be determined whether a partner has already purchased the recommended item.
The approach below relies on the melt(), dcast() and join operations of the data.table package for performance reasons. However, to avoid creating a Cartesian cross product of n * m rows, a loop is used.
EDIT: The dcast() has been moved out of the lapply() function.
Prepare data for n:m join
library(data.table)
# convert to data.table and add row numbers
# here, a copy is used instead of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]
# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
# rnp A B C D F K M
#1: 1 A B C D NA NA NA
#2: 2 NA NA C D F NA NA
#3: 3 NA NA NA NA NA K M
# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
# rnr A B C D K M recommended
#1: 1 A B NA NA NA NA G
#2: 2 A B NA NA NA NA D
#3: 3 NA NA C D NA M K
#4: 4 A B C NA NA NA D
#5: 5 A NA C NA NA NA M
#6: 6 NA NA NA NA K M E
#7: 7 NA NA NA NA NA M T
#8: 8 NA NA NA NA K NA M
Combine rules and purchases
combi <- rbindlist(
# implied loop over rules to find matching purchases for each rule
lapply(seq_len(nrow(rules)), function(i) {
# get col names except last col which is the recommendation
cols <- lr[rnr == i, value[-.N]]
# join single rule with all partners on relevant cols for this rule
wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
})
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)),
.SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
# rnp A B C D F K M rnr recommended already_purchased
#1: 1 A B C D NA NA NA 1 G No
#2: 1 A B C D NA NA NA 2 D Yes
#3: 1 A B C D NA NA NA 4 D Yes
#4: 1 A B C D NA NA NA 5 M No
#5: 3 NA NA NA NA NA K M 6 E No
#6: 3 NA NA NA NA NA K M 7 T No
#7: 3 NA NA NA NA NA K M 8 M Yes
In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done separately for each rule.
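To make that concrete, here is the first loop iteration written out by hand (same objects as above; the single output row corresponds to row 1 of combi):
cols <- lr[rnr == 1, value[-.N]]   # "B" "A": the only join columns for rule 1
wp[wr[1, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
#    rnp A B C D  F  K  M rnr recommended
# 1:   1 A B C D NA NA NA   1           G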
Essentially, we are done now. However, it doesn't look like the desired output.
Final joins
tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
# Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommended already_purchased
#1: Alpha A B C D 9 B A G NA G No
#2: Alpha A B C D 10 B A D NA D Yes
#3: Alpha A B C D 12 A B C D D Yes
#4: Alpha A B C D 12 C A M NA M No
#5: Zeta M K NA NA 23 K M E NA E No
#6: Zeta M K NA NA 12 M T NA NA T No
#7: Zeta M K NA NA 24 K M NA NA M Yes
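For reference, the same logic can also be written as a compact base R cross-check of every partner against every rule. This is only a sketch: it loops over all n * m pairs, which is exactly the Cartesian blow-up the join approach above avoids, and it assumes the original character (non-factor) df1/df2 as defined at the top of the question, without the id columns added in the attempt:
out <- do.call(rbind, lapply(seq_len(nrow(df1)), function(p) {
  purch <- na.omit(unlist(df1[p, -1]))                 # partner p's items
  do.call(rbind, lapply(seq_len(nrow(df2)), function(r) {
    rule <- na.omit(unlist(df2[r, -1]))                # rule r, NAs dropped
    if (!all(head(rule, -1) %in% purch)) return(NULL)  # basket not covered
    cbind(df1[p, ], df2[r, ],
          already_purchased = ifelse(tail(rule, 1) %in% purch, "Yes", "No"))
  }))
}))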

Related

function to rbind a list of dataframes with different columns and rows

I want to create a function that merges a list of dataframes with different numbers of columns, where the rows have different names that I'd like to keep. Essentially, I want to stack dataframes so that the column names just become another row to be appended.
df <- list()
df[[1]] <- data.frame(d = c(4,5), e = c("c", "d"))
rownames(df[[1]]) <- c("df2_row_1", "df2_row_2")
df[[2]] <- data.frame(a = c(1,2,3), b = c("a", "b", "c"), c = c("one", "two", "three"))
rownames(df[[2]]) <- c("df1_row_1", "df1_row_2", "df1_row_3")
df[[3]] <- data.frame(f = c(6,7,8), g = c("e", "f", "g"), h = c("one", "two", "three"), w = c(100,101,102))
rownames(df[[3]]) <- c("df3_row_1", "df3_row_2", "df3_row_3")
Current Output:
library(dplyr)
do.call(bind_rows, df)
d e a b c f g h w
1 4 c NA <NA> <NA> NA <NA> <NA> NA
2 5 d NA <NA> <NA> NA <NA> <NA> NA
3 NA <NA> 1 a one NA <NA> <NA> NA
4 NA <NA> 2 b two NA <NA> <NA> NA
5 NA <NA> 3 c three NA <NA> <NA> NA
6 NA <NA> NA <NA> <NA> 6 e one 100
7 NA <NA> NA <NA> <NA> 7 f two 101
8 NA <NA> NA <NA> <NA> 8 g three 102
Desired Output
d e
df2_row_1 4 c
df2_row_2 5 d
a b c
df1_row_1 1 a one
df1_row_2 2 b two
df1_row_3 3 c three
f g h w
df3_row_1 6 e one 100
df3_row_2 7 f two 101
df3_row_3 8 g three 102
I've tried (unsuccessfully) creating a function that finds the longest data frame, then appends empty columns to the data frames that are shorter than the longest, then gives all the data frames the same name for each of those columns.
I also realize this couldn't be more NOT tidy - is this possible?
Thank you!!!
This can be achieved with a for loop (I think it could be achieved with mapply too; check ?mapply). The overall strategy is padding each df in the list with NA columns (via cbind) and then applying rbindlist() to the resulting list:
library(data.table)
cols <- max(sapply(df, ncol))
# number of NA values needed to pad each df up to 'cols' columns:
lengths <- (cols - sapply(df, ncol)) * sapply(df, nrow)
newdf <- list()
for (i in seq_along(df)) {
  if (ncol(df[[i]]) != cols) {
    newdf[[i]] <- cbind(df[[i]],
                        as.data.frame(matrix(rep(NA, lengths[i]),
                                             ncol = lengths[i] / nrow(df[[i]]))))
  } else {
    newdf[[i]] <- df[[i]]
  }
}
rbindlist(newdf, use.names = FALSE)
Which results in:
d e V1 V2
1: 4 c <NA> NA
2: 5 d <NA> NA
3: 1 a one NA
4: 2 b two NA
5: 3 c three NA
6: 6 e one 100
7: 7 f two 101
8: 8 g three 102
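Note that rbindlist() drops the original rownames, while the desired output keeps them. One way to retain them (a sketch, assuming they should simply become a regular column) is to copy the rownames into each df before padding:
df_named <- lapply(df, function(d) cbind(row_id = rownames(d), d))
# then run the same padding loop on df_named instead of df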

Populate matrix by colname identity

I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D represent colnames, and the abundances are the row values.
I would like to populate my larger matrix, which has as colnames all possible letters (A:Z) and all possible samples (sample1:sampleN) as rows, by matching the colname values.
For example:
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how could I ensure that the correct columns are populated by the right abundance values (ex. column "A" is only filled by values corresponding to abundances of "A" in different samples)? Thanks!
Starting data, changing just a little to highlight differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nr=nrow(m1), nc=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nr=nrow(m2), nc=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; plain rbind also needs the columns to be in the same order:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2
You should be able to take advantage of matrix indexing, like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))
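With a whole list of such per-sample matrices (abun_list is an assumed name here), the fill generalizes to a loop; for multi-row matrices, plain name-based sub-matrix assignment is the simplest form:
abun_list <- list(abun)   # hypothetical: one abundance matrix per sample set
for (a in abun_list) {
  # the row and column names select exactly the right block of 'big'
  big[rownames(a), colnames(a)] <- a
}
big["sample1", c("A", "B", "C", "D")]
# A B C D
# 1 3 4 2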
Another solution uses reduce from the purrr package and union_all from the dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance dataframes:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7

How to compare whether any of the elements in a row are the same

Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
library(data.table)
DT <- data.table(A = c("a","a","b","d","e","f","h","i","j"),
                 B = c("a","b","c","c","f","g",NA,"j",NA),
                 C = c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare a row (or a vector) with another row (vector), and define two rows as matching if any of the elements in each row (vector) are identical, without iterating over the elements one by one.
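In other words, the row-level predicate I'm after is something like:
# TRUE if two rows share at least one non-NA value
rows_match <- function(x, y) any(na.omit(x) %in% na.omit(y))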
We can do this by getting the lead rows of the dataset, pasting each row into a single string, and using grepl and Map to check whether any of the lead row's values (pasted into an alternation pattern) occur in the pasted rows of the original dataset; then unlist and convert to integer:
DT[, D := {
v1 <- do.call(paste, .SD)
v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
v2N <- gsub("NA\\|*|\\|*NA", "", v2)
v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
as.integer(head(c(FALSE, v3), -1))
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Or we can do a split and do comparison using Map
as.integer(c(FALSE, unlist(Map(function(x,y) {
x1 <- na.omit(unlist(x))
y1 <- na.omit(unlist(y))
any(x1 %in% y1 | y1 %in% x1) },
split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))] creates a copy of the data.table, with the lagged variables included (cbinded). The second chain then intersects the unlisted original variables with the unlisted shifted variables and counts the non-NA common elements, once for each row of the copied data.table. The count is extracted with $V1 and turned into binary (0 and 1) using sign.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated, but it does perform pretty well, taking just over a second for a 1-million-row data set with three columns.
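A rough scaffold to check that timing (randomly generated test data assumed here, not the original benchmark):
set.seed(42)
n <- 1e6
DT_big <- data.table(A = sample(letters, n, TRUE),
                     B = sample(letters, n, TRUE),
                     C = sample(letters, n, TRUE))
system.time({
  DT_big[, id := 1:.N]
  bl <- melt(DT_big, id.vars = "id")[!is.na(value)][, id2 := id - 1]
  idx <- bl[bl, on = .(id2 = id, value), nomatch = 0][, unique(id)]
  DT_big[, `:=`(D = as.integer(id %in% idx), id = NULL)]
})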
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
  row1 <- as.character(DT[i, ])
  row2 <- as.character(DT[i + 1, ])
  return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer integers, wrap the output of compare in as.numeric().
Here is a base R solution using intersect:
res <- c(0, sapply(2:nrow(DT), function(i)
  length(intersect(na.omit(as.character(DT[i, ])),
                   na.omit(as.character(DT[i - 1, ])))) > 0))
cbind(DT, D = res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
This solution compares the two rows with %in% (after unlist()):
DT[, result := as.integer(c(NA, sapply(2:DT[, .N], function(i)
  any(na.omit(unlist(DT[i])) %in% unlist(DT[i - 1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Using a combination of intersect and mapply you could do:
# list of unique elements in each row
tableList = apply(DT, 1, function(x) unique(na.omit(x)))
# a lagged list to be compared with the list above
# (note: 2:length(tableList)-1 is (2:length(tableList)) - 1, i.e. indices 1:(length-1))
tableListLag = c(NA, tableList[2:length(tableList)-1])
# find common elements using the intersect function;
# length > 0 implies common elements, hence set the value to 1, else 0
DT$D = mapply(function(x, y) ifelse(length(intersect(x, y)) > 0, 1, 0),
              tableList, tableListLag, SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1

Merge 2 columns in R

I have a data set with columns I'd like to merge similar to this:
library(data.table)
DF <- as.data.table(list(ID = c(1, 2, 3, 4, 5),
                         Product = c('Y', NA, NA, 'Z', NA),
                         Type = c(NA, 'D', 'G', NA, NA)))
DF
ID  Product  Type
1   Y        NA
2   NA       D
3   NA       G
4   Z        NA
5   NA       NA
which I would like to look like this:
DF
ID  Product  Type  Category
1   Y        NA    Y
2   NA       D     D
3   NA       G     G
4   Z        NA    Z
5   NA       NA    NA
My code is:
DF[, Category := na.omit(c(Product, Type)), by = ID][, c("Product", "Type") := NULL]
The problem I have is that I would like Category to be NA when both Product and Type are NA. Also, I don't know whether my code will hold up, because my data set has over 200,000 rows.
DF[ , Category := ifelse(is.na(Product), Type, Product)]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
This assumes that if there are values for both Product and Type, you want Product in Category.
We can do this in two assignments and avoid ifelse, as assignment in place (:=) is faster and more efficient.
DF[, Category := Product][is.na(Product), Category := Type][]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
Or if we assume that there will be only a maximum 1 non-NA value per row for Product/Type, then pmax can be used.
DF[, Category := pmax(Product, Type, na.rm = TRUE)][]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
Benchmarks
DF1 <- DF[rep(1:nrow(DF), 1e6)]
DF2 <- copy(DF1)
DF3 <- copy(DF1)
system.time(DF1[, Category := Product][is.na(Product), Category := Type])
# user system elapsed
# 0.16 0.06 0.17
system.time(DF2[ , Category := ifelse(is.na(Product), Type, Product)])
# user system elapsed
# 1.35 0.19 1.53
system.time(DF3[ ,Category := pmax(Product, Type, na.rm = TRUE)])
# user system elapsed
# 0.04 0.02 0.06
EDIT: Updated with the benchmarks; they clearly show that both methods mentioned in my post are efficient.
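For completeness: newer versions of data.table (1.12.4+) ship fcoalesce(), which expresses the same first-non-NA logic directly; a minimal sketch, assuming such a version:
# fcoalesce() returns, per element, the first non-NA value across its arguments
DF[, Category := fcoalesce(Product, Type)][]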

assign a unique ID to unique rows across multiple columns

I apologize; I'm not sure how to insert a data.table into the question box.
I have a data set with a ton of rows like this:
phylum  class  family  order  genus  species
A       B      C       D      E      NA
A       B      C       D      E      NA
A       B      C       D      NA     NA
A       B      C       D      E      F
A       B      C       D      NA     NA
A       B      C       D      E      F
I would like each matching row to be assigned a unique ID for example:
ID  phylum  class  family  order  genus  species
1   A       B      C       D      E      NA
1   A       B      C       D      E      NA
2   A       B      C       D      NA     NA
3   A       B      C       D      E      F
2   A       B      C       D      NA     NA
3   A       B      C       D      E      F
I have tried using .GRP in a variety of ways, but it's not working.
For example:
DT2 = DT[, i := .GRP, by = key(DT)]
I have looked at other examples, but everything assigns IDs based on the value of a single column (or only two), and I want to use 6 different ones. Any help is greatly appreciated.
A solution with base R:
df2 <- unique(df)
df2$ID <- 1:nrow(df2)
merge(df, df2)
or using data.table:
dt[, ID := .GRP, by = names(dt)]
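If you need base R but want to keep the original row order (merge() re-sorts the result), a match()-based sketch also works:
# paste all six columns into one key per row, then number keys by first appearance
key <- do.call(paste, df)
df$ID <- match(key, unique(key))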
