This question already has answers here:
Creating co-occurrence matrix
(5 answers)
Closed 1 year ago.
I have a simple matrix, e.g.
test <- matrix(c("u1","p1","u1","p2","u2","p2","u2",
"p3","u3","p1","u4","p2","u5","p1",
"u5","p3","u6","p3","u7","p4","u7",
"p3","u8","p1","u9","p4"),
ncol=2,byrow=TRUE)
colnames(test) <- c("user","product")
test1<-as.data.frame(test)
test:
user product
1 u1 p1
2 u1 p2
3 u2 p2
4 u2 p3
5 u3 p1
6 u4 p2
7 u5 p1
8 u5 p3
9 u6 p3
10 u7 p4
11 u7 p3
12 u8 p1
13 u9 p4
I want to count how many users bought product pair together, such as p1&p2, p2& p3...
table(test1$product,test1$product) give me this :
p1 p2 p3 p4
p1 4 0 0 0
p2 0 3 0 0
p3 0 0 4 0
p4 0 0 0 2
How can I get the right result as :
p1 p2 p3 p4
p1 4 1 1 0
p2 1 3 1 0
p3 1 1 4 1
p4 0 0 1 2
Looking at your desired output, you are looking for the crossprod function:
crossprod(table(test1))
# product
# product p1 p2 p3 p4
# p1 4 1 1 0
# p2 1 3 1 0
# p3 1 1 4 1
# p4 0 0 1 2
This is the same as crossprod(table(test1$user, test1$product)) (reflecting Dennis's comment).
A similar question tagged to this post requested an efficient solution, but got deleted undeleted now. We decided to post the solution here.
Here is one with RcppEigen to do the crossproduct
library(RcppEigen)
library(inline)
prodFun <- '
typedef Eigen::Map<Eigen::MatrixXi> MapMti;
const MapMti B(as<MapMti>(BB));
const MapMti C(as<MapMti>(CC));
return List::create(B.adjoint() * C);
'
funCPr <- cxxfunction(signature(BB= "matrix", CC = "matrix"),
prodFun, plugin = "RcppEigen")
tbl <- table(test1)
funCPr(tbl, tbl)[[1]]
# [,1] [,2] [,3] [,4]
#[1,] 4 1 1 0
#[2,] 1 3 1 0
#[3,] 1 1 4 1
#[4,] 0 0 1 2
Benchmarks
set.seed(24)
test2 <- data.frame(user = sample(1:5000, 1e6, replace=TRUE),
product = sample(paste0("p", 1:50), 1e6, replace = TRUE),
stringsAsFactors=FALSE)
tbl1 <- table(test2)
library(microbenchmark)
microbenchmark(cPP = funCPr(tbl1, tbl1)[[1]],
CrossP = crossprod(tbl1),
adjMat = adjmat(tbl1)$adjacency,
unit = "relative", times = 10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cPP 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
# CrossP 2.079867 2.070509 2.234376 2.074388 2.290516 2.676798 10 a
# adjMat 6.223034 6.500791 9.619088 7.197824 7.771270 31.394812 10 b
NOTE: This could be made more efficient by doing the table in Rcpp
Ananda's solution is superior (it is lighter weight and requires no external package) but I am putting down another. I believe this is called an adjacency matrix (smarter people feel free to edit this if I'm wrong):
library(qdap)
adjmat(table(test1))$adjacency
## product
## product p1 p2 p3 p4
## p1 4 1 1 0
## p2 1 3 1 0
## p3 1 1 4 1
## p4 0 0 1 2
Related
I am working with the R programming language.
I am trying to simulate random discrete data that contains "correlations" between the variables. For example, this is what I have tried so far (I generated random continuous data with correlations, and converted all values below a certain threshold to 0 else 1):
library(mvtnorm)
n <- 11
A <- matrix(runif(n^2)*2-1, ncol=n)
s <- t(A) %*% A
my_data = MASS::mvrnorm(100, mu = c(rnorm(11,10,1)), Sigma = s)
my_data = data.frame(my_data)
colnames(my_data)[1] <- 'p1'
colnames(my_data)[2] <- 'p2'
colnames(my_data)[3] <- 'p3'
colnames(my_data)[4] <- 'p4'
colnames(my_data)[5] <- 'p5'
colnames(my_data)[6] <- 'p6'
colnames(my_data)[7] <- 'p7'
colnames(my_data)[8] <- 'p8'
colnames(my_data)[9] <- 'p9'
colnames(my_data)[10] <- 'p10'
colnames(my_data)[11] <- 'result'
my_data[my_data < 9] <- 0
my_data[my_data > 9] <- 1
p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 result
1 1 1 1 0 1 1 1 0 0 0 0
2 0 1 1 1 1 0 1 1 1 1 1
3 1 1 1 0 1 0 1 1 1 1 1
4 1 1 1 0 1 1 1 1 1 1 1
5 0 1 1 1 1 0 1 1 0 0 0
6 1 0 1 0 1 1 1 0 1 1 1
I am not sure if I have done this correctly - sure, I have simulated random discrete data, but I am not sure if I have preserved the correlation structure within the data. For instance, I would have liked there to be correlation patterns such as:
When p1 = p5 = p9 = 1 -> "results" are more likely to be 1 (i.e. take all rows where p1 = p5 = p9 = 1 and measure the percentage of 1's in the results column)
When p3 = p4 = 0 and p9 = 1 -> "results" are more likely to be 0
etc.
Is there some other way to do this?
Thanks!
If you are happy with p1 through p10 and just want to use your stated rules to generate the result column, then you do a kind of reverse logistic regression. First of all, set up your rules to give you numerical results. Here, we get a 1 if p1 = p5 = p9 = 1, and we get -1 if p3 = 0, p4 = 0, p9 = 1:
log_odds <- with(my_data, p1 * p5 * p9)
log_odds <- with(my_data, result - (1 - p3) * (1 - p4) * p9)
Now we convert these to probabilities of getting a 1 in our results column:
odds <- exp(log_odds)
probs <- odds / (1 + odds)
Finally, we use probs to generate a binomial sample:
my_data$result <- rbinom(nrow(my_data), size = 1, prob = probs)
We can see that overall our sample has about a 50% chance of having a 1 or 0:
table(my_data$result)
#> 0 1
#> 47 53
But the odds of a 1 are much increased when p1 = p5 = p9 = 1
table(my_data$result[with(my_data, p1 == 1 & p5 == 1 & p9 == 1)])
#> 0 1
#> 3 18
It is possible to control the background probability and strength of correlations by adjusting the weightings for log_odds
This question already has answers here:
Creating co-occurrence matrix
(5 answers)
Closed 1 year ago.
I have a simple matrix, e.g.
test <- matrix(c("u1","p1","u1","p2","u2","p2","u2",
"p3","u3","p1","u4","p2","u5","p1",
"u5","p3","u6","p3","u7","p4","u7",
"p3","u8","p1","u9","p4"),
ncol=2,byrow=TRUE)
colnames(test) <- c("user","product")
test1<-as.data.frame(test)
test:
user product
1 u1 p1
2 u1 p2
3 u2 p2
4 u2 p3
5 u3 p1
6 u4 p2
7 u5 p1
8 u5 p3
9 u6 p3
10 u7 p4
11 u7 p3
12 u8 p1
13 u9 p4
I want to count how many users bought product pair together, such as p1&p2, p2& p3...
table(test1$product,test1$product) give me this :
p1 p2 p3 p4
p1 4 0 0 0
p2 0 3 0 0
p3 0 0 4 0
p4 0 0 0 2
How can I get the right result as :
p1 p2 p3 p4
p1 4 1 1 0
p2 1 3 1 0
p3 1 1 4 1
p4 0 0 1 2
Looking at your desired output, you are looking for the crossprod function:
crossprod(table(test1))
# product
# product p1 p2 p3 p4
# p1 4 1 1 0
# p2 1 3 1 0
# p3 1 1 4 1
# p4 0 0 1 2
This is the same as crossprod(table(test1$user, test1$product)) (reflecting Dennis's comment).
A similar question tagged to this post requested an efficient solution, but got deleted undeleted now. We decided to post the solution here.
Here is one with RcppEigen to do the crossproduct
library(RcppEigen)
library(inline)
prodFun <- '
typedef Eigen::Map<Eigen::MatrixXi> MapMti;
const MapMti B(as<MapMti>(BB));
const MapMti C(as<MapMti>(CC));
return List::create(B.adjoint() * C);
'
funCPr <- cxxfunction(signature(BB= "matrix", CC = "matrix"),
prodFun, plugin = "RcppEigen")
tbl <- table(test1)
funCPr(tbl, tbl)[[1]]
# [,1] [,2] [,3] [,4]
#[1,] 4 1 1 0
#[2,] 1 3 1 0
#[3,] 1 1 4 1
#[4,] 0 0 1 2
Benchmarks
set.seed(24)
test2 <- data.frame(user = sample(1:5000, 1e6, replace=TRUE),
product = sample(paste0("p", 1:50), 1e6, replace = TRUE),
stringsAsFactors=FALSE)
tbl1 <- table(test2)
library(microbenchmark)
microbenchmark(cPP = funCPr(tbl1, tbl1)[[1]],
CrossP = crossprod(tbl1),
adjMat = adjmat(tbl1)$adjacency,
unit = "relative", times = 10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cPP 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
# CrossP 2.079867 2.070509 2.234376 2.074388 2.290516 2.676798 10 a
# adjMat 6.223034 6.500791 9.619088 7.197824 7.771270 31.394812 10 b
NOTE: This could be made more efficient by doing the table in Rcpp
Ananda's solution is superior (it is lighter weight and requires no external package) but I am putting down another. I believe this is called an adjacency matrix (smarter people feel free to edit this if I'm wrong):
library(qdap)
adjmat(table(test1))$adjacency
## product
## product p1 p2 p3 p4
## p1 4 1 1 0
## p2 1 3 1 0
## p3 1 1 4 1
## p4 0 0 1 2
I have the following dataframe (DF_A):
PARTY_ID O1 O2 O3 O4
P1 0 0 1 0
P2 2 1 0 1
P3 0 0 0 0
P4 2 1 1 1
P5 1 0 0 1
I also have another dataframe (DF_B) with the position of the columns that I need in DF_A. This is DF_B:
PARTY_ID POS_1 POS_2
P1 1 2
P2 2 1
P3 3 1
P4 2 1
P5 1 4
I need to give the position of the columns (DF_B) for getting the values of DF_A. The desired result is something like this:
PARTY_ID V1 V2
P1 0 0
P2 1 2
P3 0 0
P4 1 2
P5 1 1
I'm trying to use which function, but it seems not to work.
Can anyone please help me?
SIDE NOTE: I need to do this the fastest way possible because my real data have more than 100K rows.
A quick and dirty way to do this using apply from base R:
DF_C <- apply(DF_A, 1, function(x) {
cols_to_use <- as.numeric(unlist(DF_B[DF_B$"PARTY_ID"==x["PARTY_ID"],2:3]))
x[-1][cols_to_use]
})
DF_C <- cbind(DF_A$PARTY_ID,t(DF_C))
colnames(DF_C) <- c("PARTY_ID", "V1","V2")
> ind <- as.matrix(DF_B[,-1])
> t(sapply(1:nrow(ind), function(i) DF_A[, -1][, ind[i,]][i,] ))
O1 O2
[1,] 0 0
[2,] 1 2
[3,] 0 0
[4,] 1 2
[5,] 1 1
If you want to get a data.frame:
> DF <- t(sapply(1:nrow(ind), function(i) DF_A[, -1][, ind[i,]][i,] ))
> data.frame(PARTY_ID=DF_A[,1], DF)
PARTY_ID O1 O2
1 P1 0 0
2 P2 1 2
3 P3 0 0
4 P4 1 2
5 P5 1 1
The one with a simple for loop:
DF_C <- DF_B # creating dataframe with same dimension and column/row identifiers
for(i in 1:nrow(DF_C)) { DF_C[i,] <- DF_A[i,as.numeric(DF_B[i,])] } #over rows
I have got a matrix describing relationship between diffetent people. If there is any connection between people I have "1" in a particular cell, otherwise - "0". How to turn this into a data.frame with two columns looks like:
person1 -- person4
person1 -- person6
person2 -- person1
?
Use melt from reshape2:
library(reshape2)
set.seed(1)
mx <- matrix(sample(0:1, 9, r=T), nrow=3, dimnames=replicate(2, paste0("p", 1:3), s=F))
# p1 p2 p3
# p1 0 1 1
# p2 0 0 1
# p3 1 1 1
melt(mx)
# Var1 Var2 value
# 1 p1 p1 0
# 2 p2 p1 0
# 3 p3 p1 1
# 4 p1 p2 1
# 5 p2 p2 0
# 6 p3 p2 1
# 7 p1 p3 1
# 8 p2 p3 1
# 9 p3 p3 1
Here is my data:
sub <- paste ("s", 1:6, sep = "")
mark1a <- c("A", "A", "B", "d1", "A", 2)
mark1b <- c("A", "B", "d1", 2, "d1", "A")
myd <- data.frame (sub, mark1a, mark1b)
myd
sub mark1a mark1b
1 s1 A A
2 s2 A B
3 s3 B d1
4 s4 d1 2
5 s5 A d1
6 s6 2 A
I want create a design matrix of the pair of variables (columns) - mark1a and mark1b. A design matrix will consists of length (unique (c(mark1a, mark1b))) for each unique (c(mark1a, mark1b). then 1 or 2 based on if the particular number is present once or twice in the columns and else 0. The following is expected output (not a figure):
I could understand how this can be done:
You could try something like this:
cbind(myd, t(apply(myd, 1, function(x) sapply(unique(unlist(myd[, 2:3])), function(y) sum(x==y)))))
1 s1 A A 2 0 0 0
2 s2 A B 1 1 0 0
3 s3 B d1 0 1 1 0
4 s4 d1 2 0 0 1 1
5 s5 A d1 1 0 1 0
6 s6 2 A 1 0 0 1
First, make sure that the mark1a and mark1b columns share the same levels:
all.levels <- levels(myd["mark1a", "mark1b"])
levels(myd$mark1a) <- all.levels
levels(myd$mark1b) <- all.levels
Then you can compute the sum of two frequency tables and bind it to myd:
library(plyr)
cbind(myd, ddply(myd, "sub", function(x)table(x$mark1a) + table(x$mark1b))[,-1])
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0
I would say the solution from #jmsigner is the way go to for a one-liner, but I usually get confused by those nested apply (and its relatives) solutions.
Here's a similar solution:
# Identify all the levels in `mark1a` and `mark1b`
mydLevels = unique(c(levels(myd$mark1a), levels(myd$mark1b)))
# Use these levels and an anonymous function with `lapply`
temp = data.frame(lapply(mydLevels,
function(x) rowSums(myd[-1] == x)+0))
colnames(temp) = mydLevels
# This gives you the correct output, but not in the order
# that you have in your original question.
cbind(myd, temp)
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0