Generate matrix of unique user-item cross-product combinations - r

I am trying to create a product-by-product matrix of unique user counts in R. I searched for it on SO but could not find what I was looking for. Any help is appreciated.
I have a large dataframe (over a million rows); a sample is shown:
df <- data.frame(Products=c('Product a', 'Product b', 'Product a',
                            'Product c', 'Product b', 'Product c'),
                 Users=c('user1', 'user1', 'user2', 'user1',
                         'user2', 'user3'))
Output of df is:
Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3
I would like to see two matrices:
The first one will show the number of unique users that had either product (OR), so the output will be something like:
          Product a Product b Product c
Product a                   2         3
Product b         2                   3
Product c         3         3
The second matrix will be the number of unique users that had both products (AND):
          Product a Product b Product c
Product a                   2         1
Product b         2                   1
Product c         1         1
Any help is appreciated.
Thanks
UPDATE:
Here is more clarity: Product a is used by User1 and User2, Product b is used by User1 and User2, and Product c is used by User1 and User3. So in the first matrix, Product a and Product b will be 2 since there are 2 unique users between them. Similarly, Product a and Product c will be 3. Whereas in the second matrix, they would be 2 and 1, since I want the intersection.
Thanks

Try
lst <- split(df$Users, df$Products)   # users per product
ln <- length(lst)
m1 <- matrix(0, ln, ln, dimnames=list(names(lst), names(lst)))
# fill the lower triangle with the number of unique users for each product pair (union)
m1[lower.tri(m1, diag=FALSE)] <- combn(seq_along(lst), 2,
                      FUN= function(x) length(unique(unlist(lst[x]))))
m1[upper.tri(m1)] <- t(m1)[upper.tri(m1)]   # mirror into the upper triangle
m1
# Product a Product b Product c
#Product a 0 2 3
#Product b 2 0 3
#Product c 3 3 0
Or using outer
f1 <- function(u, v) length(unique(unlist(c(lst[[u]], lst[[v]]))))
res <- outer(seq_along(lst), seq_along(lst), FUN = Vectorize(f1)) * !diag(ln)
dimnames(res) <- rep(list(names(lst)), 2)
res
res
# Product a Product b Product c
#Product a 0 2 3
#Product b 2 0 3
#Product c 3 3 0
For the second case (the AND matrix):
tcrossprod(table(df)) * !diag(ln)
# Products
#Products Product a Product b Product c
# Product a 0 2 1
# Product b 2 0 1
# Product c 1 1 0
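The two matrices are also related by inclusion-exclusion, |A ∪ B| = |A| + |B| - |A ∩ B|, so the OR counts can be derived from the AND counts. A minimal sketch, assuming each Products/Users pair appears at most once in df, as in the sample (otherwise build the table from unique(df)):
tab <- table(df)                   # Products x Users incidence (0/1)
both <- tcrossprod(tab)            # users with both products; diagonal = users per product
n <- diag(both)                    # unique users per product
either <- outer(n, n, `+`) - both  # |A| + |B| - |A and B|
either * !diag(length(n))          # zero the diagonal, as above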

Related

SQL select column and count from the next column

My referrals table structure is as follows (the ref field is unique):
ID pid ref ref_by
1 1 k NAN
2 2 l k
3 3 m k
4 4 n l
And the user table is:
id name
1 john
2 Bob
3 Tim
4 Rob
I need to get the id, pid, ref, and the count of referrals in the next column. Based on the number of referrals, each user will be assigned points at a constant 100 per referral. The result should look like this:
pid name ref number_of_referals points_earned
1 john k 2 200
2 Bob l 1 100
3 Tim m 0 0
4 Rob n 0 0
You need two joins: the first from the users table to referrals, and the second to a subquery that groups and counts the referrals:
select r.id, u.name, r.ref,
       case when c.counter is null then 0 else c.counter end as number_of_referals,
       case when c.counter is null then 0 else c.counter end * 100 as points_earned
from users u
inner join referrals r on r.pid = u.id
left join (
    select ref_by, count(*) as counter
    from referrals
    group by ref_by
) c on c.ref_by = r.ref
order by r.id

Combination Counts in R [duplicate]

I have a data frame that contains one column which indicates an event ID. There is another column that indicates the products used in that event. Each product would only be used one time for an event and each event contains at least one product. I would like to know how many times each product is used with every other product. Some sample data is below:
set.seed(1)
events <- paste('Event ', sample(1:4, size = 15, replace = TRUE), sep = '')
events <- events[order(events)]
prods <- paste('Product', c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))
test_data <- data.frame(events, prods)
test_data
events prods
1 Event 1 Product 1
2 Event 1 Product 2
3 Event 1 Product 3
4 Event 1 Product 4
5 Event 2 Product 1
6 Event 2 Product 5
7 Event 2 Product 6
8 Event 3 Product 2
9 Event 3 Product 4
10 Event 3 Product 6
11 Event 3 Product 7
12 Event 4 Product 1
13 Event 4 Product 2
14 Event 4 Product 3
15 Event 4 Product 5
Product 1 and Product 2 occur in the same event twice (Event 1 and Event 4). So I would want to return a '2' for that match. Product 1 and Product 7 never occur in the same event, so I'd want to return a 0 for that pair. For 'matches' between the same item, I am comfortable returning the total number of times that product is used.
There are two formats that are possible and I don't have a preference for which I'd like to see returned.
A short and fat data frame that has the products running across the tops as column headers and the side as row headers. The body of this data frame would be populated by the number of matches.
A long, narrow data frame where there are two columns that will serve to represent all possible combinations of Product pairings and then a third column representing the number of times they match.
I have been experimenting with expand.grid with nothing to show for it.
Thank you!
Split prods by events and then calculate all the combn-inations, then aggregate to get the count of each combination.
out <- t(do.call(cbind,
                 lapply(split(as.character(test_data$prods), test_data$events),
                        combn, 2)))
aggregate(count ~ ., data = transform(out, count = 1), FUN = sum)
# X1 X2 count
#1 Product 1 Product 2 2
#2 Product 1 Product 3 2
#3 Product 2 Product 3 2
#4 Product 1 Product 4 1
#5 Product 2 Product 4 2
#6 Product 3 Product 4 1
#7 Product 1 Product 5 2
#8 Product 2 Product 5 1
#9 Product 3 Product 5 1
#10 Product 1 Product 6 1
#11 Product 2 Product 6 1
#12 Product 4 Product 6 1
#13 Product 5 Product 6 1
#14 Product 2 Product 7 1
#15 Product 4 Product 7 1
#16 Product 6 Product 7 1
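If the wide, product-by-product matrix format is preferred instead, here is a minimal sketch using crossprod on the event-by-product incidence table; the diagonal then holds the total number of events each product appears in, which matches what the question allows for same-item matches:
tab <- table(test_data$events, test_data$prods)   # 0/1 event x product table
crossprod(tab)                                    # co-occurrence counts for every product pair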
Maybe this is using a sledgehammer to crack a nut, but you could mine (frequent) item sets, which comes with other fancy stuff. It could work like this:
library(arules)
library(reshape2)
# one-hot event x product table, coerced to an arules "transactions" object
trans <- as(sapply(dcast(test_data, events ~ prods, fun.aggregate = length,
                         value.var = "prods")[, -1], as.logical), "transactions")
sets <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen = 2, maxlen = 2,
                                        target = "frequent itemsets"))
df <- as(sets, "data.frame")
subset(transform(df, n = support * nrow(trans)), n > 0, -support)
# items n
# 2 {Product 6,Product 7} 1
# 4 {Product 4,Product 7} 1
# 6 {Product 2,Product 7} 1
# 7 {Product 5,Product 6} 1
# 8 {Product 3,Product 5} 1
# 10 {Product 1,Product 5} 2
# 11 {Product 2,Product 5} 1
# 13 {Product 4,Product 6} 1
# 14 {Product 1,Product 6} 1
# 15 {Product 2,Product 6} 1
# 16 {Product 3,Product 4} 1
# 17 {Product 1,Product 3} 2
# 18 {Product 2,Product 3} 2
# 19 {Product 1,Product 4} 1
# 20 {Product 2,Product 4} 2
# 21 {Product 1,Product 2} 2
The support value is the proportion of events in which both products appear; multiplying it by the number of transactions gives the frequency count.

Cumulative sum conditional over multiple columns in r dataframe containing the same values

Say my data.frame is as outlined below:
df <- as.data.frame(cbind("Home"=c("a","c","e","b","e","b"),
                          "Away"=c("b","d","f","c","a","f")))
df$Index <- rep(1, nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative sum using the Index column for each character a to f, regardless of whether it appears in the Home or Away column. Thus a column called Cumulative_Sum_Home, say, takes the character in the Home column ("b" in the case of row 6) and counts how many times "b" has appeared in either the Home or Away columns in all previous rows, including row 6. In this case "b" has appeared 3 times cumulatively in the first 6 rows, so Cumulative_Sum_Home takes the value 3. The same logic applies to the Cumulative_Sum_Away column: taking row 5, character "a" appears in the Away column and has cumulatively appeared 2 times in either the Home or Away columns up to that row, so Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at data.table approaches, but I've never used that package before, so I can't immediately see how to solve it. Any tips would be gratefully received.
There is scope to make this leaner, but if that doesn't matter much for you then this should be okay.
NewColumns <- list()
# running appearance count for each player, stored at the rows where that player appears
for (i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {
    NewColumnAddition <- i == df$Home | i == df$Away
    NewColumnAddition[NewColumnAddition] <- cumsum(NewColumnAddition[NewColumnAddition])
    NewColumns[[i]] <- NewColumnAddition
}
df$Cumulative_Sum_Home <- sapply(seq(nrow(df)),
                                 function(i) NewColumns[[as.character(df[i, "Home"])]][i])
df$Cumulative_Sum_Away <- sapply(seq(nrow(df)),
                                 function(i) NewColumns[[as.character(df[i, "Away"])]][i])
> df
  Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1    a    b     1                   1                   1
2    c    d     1                   1                   1
3    e    f     1                   1                   1
4    b    c     1                   2                   2
5    e    a     1                   2                   2
6    b    f     1                   3                   2
Here's a data.table alternative:
setDT(df)
for (i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {
    df[, TotalSum := cumsum(i == Home | i == Away)]
    df[Home == i, Cumulative_Sum_Home := TotalSum]
    df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[, TotalSum := NULL]
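A compact base R alternative is also possible; a minimal sketch assuming the original df from the question (before setDT): interleave the Home and Away values row by row, take a running occurrence count per player with ave(), then split the counts back into the two columns.
long <- c(rbind(as.character(df$Home), as.character(df$Away)))   # a, b, c, d, e, f, b, c, ...
counts <- ave(seq_along(long), long, FUN = seq_along)            # running count per player
df$Cumulative_Sum_Home <- counts[seq(1, length(counts), by = 2)]
df$Cumulative_Sum_Away <- counts[seq(2, length(counts), by = 2)]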

Product usage combination in R

I am trying to figure out a way to get a list of product combinations together with the number of unique users for each combination in R. This is a follow-up to the question Generate matrix of unique user-item cross-product combinations above.
df <- data.frame(Products=c('Product a', 'Product b', 'Product a', 'Product c',
                            'Product b', 'Product c', 'Product d'),
                 Users=c('user1', 'user1', 'user2', 'user1',
                         'user2', 'user3', 'user1'))
Output of df is:
Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3
7 Product d user1
The output I am looking for would be all three-product combinations:
Product a/Product b/Product c - 3
Product a/Product b/Product d - 2
Product b/Product c/Product d - 3
...
Thanks again for your help.
It looks like you want logical-OR treatment as the relation between users and each product set. In other words, you want to count how many unique users have any product in the set. Here's one way of doing it:
df <- data.frame(Products=c('Product a', 'Product b', 'Product a', 'Product c',
                            'Product b', 'Product c', 'Product d'),
                 Users=c('user1', 'user1', 'user2', 'user1', 'user2', 'user3', 'user1'))
comb <- combn(sort(unique(as.character(df$Products))), 3)
data.frame(comb = apply(comb, 2, paste, collapse = '/'),
           num = apply(comb, 2, function(x) length(unique(df$Users[df$Products %in% x]))))
## comb num
## 1 Product a/Product b/Product c 3
## 2 Product a/Product b/Product d 2
## 3 Product a/Product c/Product d 3
## 4 Product b/Product c/Product d 3
Edit: Logical-AND is trickier, since we need to test for the presence of every product for every user. I think I found a good solution using aggregate() and match():
data.frame(comb = apply(comb, 2, paste, collapse = '/'),
           num = apply(comb, 2, function(x) sum(aggregate(Products ~ Users, df, function(y) !any(is.na(match(x, y))))$Products)))
## comb num
## 1 Product a/Product b/Product c 1
## 2 Product a/Product b/Product d 1
## 3 Product a/Product c/Product d 1
## 4 Product b/Product c/Product d 1
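An alternative sketch for the AND counts, reusing comb from above: split the users into one set per product and intersect the sets for each combination with Reduce(intersect, ...).
usr <- split(as.character(df$Users), df$Products)   # user set for each product
data.frame(comb = apply(comb, 2, paste, collapse = '/'),
           num = apply(comb, 2, function(x) length(Reduce(intersect, usr[x]))))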
