I am trying to figure out a way to get a list of product combinations with unique users in R. This is a follow-up question to "Generate matrix of unique user-item cross-product combinations".
df <- data.frame(Products=c('Product a', 'Product b', 'Product a',
'Product c', 'Product b', 'Product c', 'Product d'),
Users=c('user1', 'user1', 'user2', 'user1',
'user2','user3', 'user1'))
Output of df is:
Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3
7 Product d user1
The output I am looking for would be all three-product combinations with their user counts:
Product a/Product b/Product c - 3
Product a/Product b/Product d - 2
Product b/Product c/Product d - 3
...
Thanks again for your help.
It looks like you want logical-OR treatment as the relation between users and each product set. In other words, you want to count how many unique users have any product in the set. Here's one way of doing it:
df <- data.frame(Products=c('Product a','Product b','Product a','Product c','Product b','Product c','Product d'),Users=c('user1','user1','user2','user1','user2','user3','user1'));
## work with the product names directly, so the code does not depend on Products being a factor
comb <- combn(sort(unique(as.character(df$Products))),3);
data.frame(comb=apply(comb,2,paste,collapse='/'),num=apply(comb,2,function(x) length(unique(df$Users[df$Products%in%x]))));
## comb num
## 1 Product a/Product b/Product c 3
## 2 Product a/Product b/Product d 2
## 3 Product a/Product c/Product d 3
## 4 Product b/Product c/Product d 3
Edit: Logical-AND is trickier, since we need to test for the presence of every product for every user. I think I found a good solution using aggregate() and match():
## a user counts only if match() finds every product of the combination among that user's products
data.frame(comb=apply(comb,2,paste,collapse='/'),num=apply(comb,2,function(x) sum(aggregate(Products~Users,df,function(y) !any(is.na(match(x,as.character(y)))))$Products)));
## comb num
## 1 Product a/Product b/Product c 1
## 2 Product a/Product b/Product d 1
## 3 Product a/Product c/Product d 1
## 4 Product b/Product c/Product d 1
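As a cross-check, both counts can also be computed from per-product user sets: the OR count is the size of the union of the sets, the AND count the size of their intersection. This is only a sketch on the sample data from the question; the names users_by_product, triples, n_any and n_all are made up for illustration.

```r
df <- data.frame(
  Products = c("Product a", "Product b", "Product a", "Product c",
               "Product b", "Product c", "Product d"),
  Users    = c("user1", "user1", "user2", "user1", "user2", "user3", "user1"),
  stringsAsFactors = FALSE
)

# One character vector of users per product
users_by_product <- split(df$Users, df$Products)

# All three-product combinations, as a list of character vectors
triples <- combn(names(users_by_product), 3, simplify = FALSE)

result <- data.frame(
  comb  = vapply(triples, paste, character(1), collapse = "/"),
  # OR: users who have at least one product of the combination
  n_any = vapply(triples, function(p) length(unique(unlist(users_by_product[p]))), integer(1)),
  # AND: users who have every product of the combination
  n_all = vapply(triples, function(p) length(Reduce(intersect, users_by_product[p])), integer(1))
)
result
```

On the sample data n_any and n_all agree with the num columns of the two outputs above.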
Related
I have a database consisting of products and raw materials. Each product consumes types of raw materials and these products can also be consumed in other products.
Each row in the database has the relationship of a product to a raw material, the identification of an industrial complex, and a technical consumption index. This index determines how many units of raw material are needed to produce 1 unit of the product.
I need to determine what the global technical index of consumption of a raw material would be based on a certain product, but I am having difficulties creating a way to relate the products in the database.
Product  Raw material  Industrial plant  INDEX
B        A             IND1              3
C        B             IND2              4
C        A             IND2              12
D        C             IND3              8
D        B             IND3              4
D        E             IND3              7
B        A             IND4              4
C        B             IND5              4
C        A             IND5              10
D        C             IND6              6
D        B             IND6              3
D        E             IND6              8
For example, to produce 1 C you need 4 B and 12 A, but those 4 B are themselves made from 12 A (3 A each), so 1 C needs 24 A in total.
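One way to relate the rows is to walk the product tree recursively and multiply the indices along each path. The sketch below is not a full solution: it assumes a single recipe per product (the rows of plants IND1 to IND3 only) and no circular dependencies, and the names recipes and global_index are hypothetical.

```r
# Hypothetical recipe table: one row per (product, input) pair,
# taken from plants IND1, IND2 and IND3 of the sample data.
recipes <- data.frame(
  product = c("B", "C", "C", "D", "D", "D"),
  input   = c("A", "B", "A", "C", "B", "E"),
  index   = c(3, 4, 12, 8, 4, 7),
  stringsAsFactors = FALSE
)

# Units of `raw` needed to make one unit of `product`,
# following intermediate products recursively.
global_index <- function(product, raw, recipes) {
  if (product == raw) return(1)
  rows <- recipes[recipes$product == product, ]
  if (nrow(rows) == 0) return(0)  # some other raw material, contributes nothing
  sum(vapply(seq_len(nrow(rows)), function(i) {
    rows$index[i] * global_index(rows$input[i], raw, recipes)
  }, numeric(1)))
}

global_index("C", "A", recipes)  # 4 * 3 + 12 = 24, as in the example
```

How to combine the plants (per plant, or per chain of plants) is left open here, since the question does not say which plant supplies which.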
My table structure for referrals is shown below; the field ref is unique:
ID pid ref ref_by
1 1 k NAN
2 2 l k
3 3 m k
4 4 n l
And the user table is:
id name
1 john
2 Bob
3 Tim
4 Rob
I need to get the id, pid, ref, and the number of referrals in the next column. Each referral is worth a constant 100 points, so the result should look like this:
pid name ref number_of_referals points_earned
1 john k 2 200
2 Bob l 1 100
3 Tim m 0 0
4 Rob n 0 0
You need 2 joins: the 1st from the users table to referrals, and the 2nd to a subquery that groups and counts the referrals:
select
r.id, u.name, r.ref,
case when c.counter is null then 0 else c.counter end number_of_referals,
case when c.counter is null then 0 else c.counter end * 100 points_earned
from users u inner join referrals r
on r.pid = u.id
left join (select ref_by, count(*) counter from referrals group by ref_by) c
on c.ref_by = r.ref
order by r.id
I have a data frame that contains one column which indicates an event ID. There is another column that indicates the products used in that event. Each product would only be used one time for an event and each event contains at least one product. I would like to know how many times each product is used with every other product. Some sample data is below:
set.seed(1)
events <- paste('Event ', sample(1:4, size = 15, replace = TRUE), sep = '')
events <- events[order(events)]
prods <- paste('Product ', c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5), sep = '')
test_data <- data.frame(events, prods)
test_data
events prods
1 Event 1 Product 1
2 Event 1 Product 2
3 Event 1 Product 3
4 Event 1 Product 4
5 Event 2 Product 1
6 Event 2 Product 5
7 Event 2 Product 6
8 Event 3 Product 2
9 Event 3 Product 4
10 Event 3 Product 6
11 Event 3 Product 7
12 Event 4 Product 1
13 Event 4 Product 2
14 Event 4 Product 3
15 Event 4 Product 5
Product 1 and Product 2 occur in the same event twice (Event 1 and Event 4). So I would want to return a '2' for that match. Product 1 and Product 7 never occur in the same event, so I'd want to return a 0 for that pair. For 'matches' between the same item, I am comfortable returning the total number of times that product is used.
There are two formats that are possible and I don't have a preference for which I'd like to see returned.
A short and fat data frame that has the products running across the tops as column headers and the side as row headers. The body of this data frame would be populated by the number of matches.
A long, narrow data frame where there are two columns that will serve to represent all possible combinations of Product pairings and then a third column representing the number of times they match.
I have been experimenting with expand.grid with nothing to show for it.
Thank you!
Split prods by events and then calculate all the combn-inations, then aggregate to get the count of each combination.
out <- t(do.call(cbind,
lapply(split(as.character(test_data$prods), test_data$events), combn, 2))
)
aggregate(count ~ . , data=transform(out,count=1), FUN=sum)
# X1 X2 count
#1 Product 1 Product 2 2
#2 Product 1 Product 3 2
#3 Product 2 Product 3 2
#4 Product 1 Product 4 1
#5 Product 2 Product 4 2
#6 Product 3 Product 4 1
#7 Product 1 Product 5 2
#8 Product 2 Product 5 1
#9 Product 3 Product 5 1
#10 Product 1 Product 6 1
#11 Product 2 Product 6 1
#12 Product 4 Product 6 1
#13 Product 5 Product 6 1
#14 Product 2 Product 7 1
#15 Product 4 Product 7 1
#16 Product 6 Product 7 1
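If the short-and-fat format is preferred, one possible base R sketch is to cross-multiply the event-by-product table. The sample data is written out explicitly here (an assumption, so that the result does not depend on what sample() returns in a given R version), and co and long are names chosen for illustration.

```r
# Sample data from the question, written out without sample()
test_data <- data.frame(
  events = rep(paste0("Event ", 1:4), times = c(4, 3, 4, 4)),
  prods  = paste0("Product ", c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))
)

# Event-by-product incidence table (each product appears at most once per event)
tab <- table(test_data$events, test_data$prods)

# t(tab) %*% tab: entry [i, j] is the number of events containing both products;
# the diagonal holds the total number of events in which each product is used
co <- crossprod(tab)
co

# The same counts in the long, narrow format
long <- as.data.frame(as.table(co))
```

Unlike the combn approach, this also returns the pairs that never occur together, with a count of 0.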
Maybe this is using a sledgehammer to crack a nut, but you could mine (frequent) item sets, which comes with other fancy stuff. It could work like this:
library(arules)
library(reshape2)
trans <- as(sapply(dcast(test_data, events~prods, fun.aggregate = length, value.var="prods")[, -1], as.logical), "transactions")
sets <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen = 2, maxlen = 2, target = "frequent itemsets"))
df <- as(sets, "data.frame")
subset(transform(df, n=support*nrow(trans)), n>0, -support)
# items n
# 2 {Product 6,Product 7} 1
# 4 {Product 4,Product 7} 1
# 6 {Product 2,Product 7} 1
# 7 {Product 5,Product 6} 1
# 8 {Product 3,Product 5} 1
# 10 {Product 1,Product 5} 2
# 11 {Product 2,Product 5} 1
# 13 {Product 4,Product 6} 1
# 14 {Product 1,Product 6} 1
# 15 {Product 2,Product 6} 1
# 16 {Product 3,Product 4} 1
# 17 {Product 1,Product 3} 2
# 18 {Product 2,Product 3} 2
# 19 {Product 1,Product 4} 1
# 20 {Product 2,Product 4} 2
# 21 {Product 1,Product 2} 2
The support value is the fraction of events in which both products were included. I multiplied it by the number of transactions to get your frequency count.
I am trying to create a cross-product matrix of unique users in R. I searched for it on SO but could not find what I was looking for. Any help is appreciated.
I have a large data frame (over a million rows); a sample is shown below:
df <- data.frame(Products=c('Product a', 'Product b', 'Product a',
'Product c', 'Product b', 'Product c'),
Users=c('user1', 'user1', 'user2', 'user1',
'user2','user3'))
Output of df is:
Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3
I would like to see two matrices:
The first one will show the number of unique users that had either product (OR), so the output will be something like:
           Product a  Product b  Product c
Product a                     2          3
Product b          2                     3
Product c          3          3
The second matrix will be the number of unique users that had both products (AND):
           Product a  Product b  Product c
Product a                     2          1
Product b          2                     1
Product c          1          1
Any help is appreciated.
Thanks
UPDATE:
To clarify: Product a is used by user1 and user2, Product b is used by user1 and user2, and Product c is used by user1 and user3. So in the first matrix, Product a and Product b will be 2, since there are 2 unique users between them. Similarly, Product a and Product c will be 3. In the second matrix, they would be 2 and 1, since I want the intersection.
Thanks
Try
lst <- split(df$Users, df$Products)
ln <- length(lst)
m1 <- matrix(0, ln,ln, dimnames=list(names(lst), names(lst)))
m1[lower.tri(m1, diag=FALSE)] <- combn(seq_along(lst), 2,
FUN= function(x) length(unique(unlist(lst[x]))))
m1[upper.tri(m1)] <- t(m1)[upper.tri(m1)]  # mirror via t() so the order is also right for more than 3 products
m1
# Product a Product b Product c
#Product a 0 2 3
#Product b 2 0 3
#Product c 3 3 0
Or using outer
f1 <- function(u, v) length(unique(unlist(c(lst[[u]], lst[[v]]))))
res <- outer(seq_along(lst), seq_along(lst), FUN= Vectorize(f1)) *!diag(3)
dimnames(res) <- rep(list(names(lst)),2)
res
# Product a Product b Product c
#Product a 0 2 3
#Product b 2 0 3
#Product c 3 3 0
For the second case
tcrossprod(table(df))*!diag(3)
# Products
#Products Product a Product b Product c
# Product a 0 2 1
# Product b 2 0 1
# Product c 1 1 0
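Since the real data has over a million rows, it may be worth noting that the OR counts can also be derived from the AND counts by inclusion-exclusion: |A or B| = |A| + |B| - |A and B|. The sketch below assumes the sample df from the question; inc, and_counts, n_users and or_counts are names made up for illustration.

```r
df <- data.frame(
  Products = c("Product a", "Product b", "Product a",
               "Product c", "Product b", "Product c"),
  Users    = c("user1", "user1", "user2", "user1", "user2", "user3")
)

# 0/1 incidence matrix: rows are products, columns are users
inc <- unclass(table(df$Products, df$Users) > 0) * 1

# AND: users shared by each pair of products (diagonal = users per product)
and_counts <- tcrossprod(inc)

# OR by inclusion-exclusion: |A| + |B| - |A and B|
n_users <- diag(and_counts)
or_counts <- outer(n_users, n_users, "+") - and_counts

and_counts
or_counts
```

The diagonals are left as the per-product user counts here rather than being zeroed out as above.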
How do I count the total number of transactions by id and by date?
Sample data :
f<- data.frame(
id=c("A","A","A","A","C","C","D","D","E"),
start_date=c("6/3/2012","7/3/2012","7/3/2012","8/3/2012","5/3/2012","6/3/2012","6/3/2012","6/3/2012","5/3/2012")
)
Expected output:
id | count
A | 3
C | 2
D | 1
E | 1
Logic:
A has 6 March, 7 March and 8 March, so its count is 3.
C has 5 March and 6 March, so its count is 2.
And so on.
I tried the following code, but I think it only counts the number of times each id occurs in the data.
library(lubridate)
f$date <- dmy(f$start_date)  # the dates are day/month/year, e.g. 6/3/2012 is 6 March
f1 <- f[order(f$id, f$date), ]
How can I change this code to get my desired outcome?
[Note: The actual data is huge, so performance needs to be considered.]
Thanks in advance.
I'm getting a different answer:
with(f, tapply(start_date, id, length))
A C D E
4 2 2 1
You can try the following. f[!duplicated(f), ] removes duplicate rows from f, and aggregate then applies length to each group, i.e. it gives the count of distinct start_date values for each id:
aggregate(start_date ~ id, f[!duplicated(f), ], length)
## id start_date
## 1 A 3
## 2 C 2
## 3 D 1
## 4 E 1
Not sure what format you want the results in, but
rowSums(with(f, table(id, start_date)>0))
will return a named vector with the count of distinct days for each ID.
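Another variant of the same idea, sketched here on the sample data, counts the unique dates within each id directly. It assumes the dates are kept as strings in one consistent format, so that equal dates compare equal; distinct_days is a name chosen for illustration.

```r
f <- data.frame(
  id = c("A", "A", "A", "A", "C", "C", "D", "D", "E"),
  start_date = c("6/3/2012", "7/3/2012", "7/3/2012", "8/3/2012", "5/3/2012",
                 "6/3/2012", "6/3/2012", "6/3/2012", "5/3/2012"),
  stringsAsFactors = FALSE
)

# For each id, count how many different dates occur
distinct_days <- tapply(f$start_date, f$id, function(d) length(unique(d)))
distinct_days
```

On the sample data this gives 3, 2, 1 and 1 for A, C, D and E, the same as the expected output.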