Efficient way to generate a coincidence matrix in R

I want to generate a simple coincidence matrix. I've looked for R packages but so far could not find one that does this calculation; I don't know whether the English term for this matrix differs from the Portuguese one. Here is what I need to do.
I have a matrix:
     [,1] [,2] [,3] [,4]
[1,]    1    1    2    1
[2,]    1    2    3    1
[3,]    2    3    1    2
[4,]    1    2    3    3
A coincidence matrix is calculated by comparing the rows element by element, generating a dissimilarity distance with the formula:
Diss = 1 - (Coincidences / (Coincidences + Discordances))
So my resulting matrix is a symmetric one with dim 4x4 and diagonal elements equal to 0. In the example, A(1,2) would be:
A(1,2) = 1 - (2 / 4) = 0.5
A(1,3) = 1 - (0/4) = 1.0
And so on...
I have created a function to generate this matrix:
cs_matrix <- function(x) {
  cs.mat <- matrix(0, nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    for (j in 1:nrow(x)) {
      cs.mat[i, j] <- 1 - (sum(x[i, ] == x[j, ]) / ncol(x))
    }
  }
  return(cs.mat)
}
The function works fine, but my actual data set has 2560 observations of 4 variables, generating a 2560 x 2560 coincidence matrix, and the calculation takes quite some time. I wonder whether there is a more efficient way of calculating this, or whether a package already computes this dissimilarity distance. The matrix will later be used in cluster analysis.

I think you can use outer
add <- function(x, y) sum(mat[x, ] == mat[y, ])
nr <- seq_len(nrow(mat))
mat1 <- 1 - outer(nr, nr, Vectorize(add))/ncol(mat)
mat1
# [,1] [,2] [,3] [,4]
#[1,] 0.00 0.50 1 0.75
#[2,] 0.50 0.00 1 0.25
#[3,] 1.00 1.00 0 1.00
#[4,] 0.75 0.25 1 0.00
If diagonal elements need to be 1 do diag(mat1) <- 1.
data
mat <- structure(c(1, 1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 3, 1, 1, 2, 3), .Dim = c(4L,4L))
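If the outer() approach is still slow at 2560 rows (Vectorize() runs the R-level comparison once for every pair of rows), a fully vectorised alternative is to tally the coincidences one column at a time. A sketch, assuming the data is in mat as above:
coincidences <- Reduce(`+`, lapply(seq_len(ncol(mat)), function(k) {
  outer(mat[, k], mat[, k], "==")  # TRUE where two rows agree in column k
}))
diss <- 1 - coincidences / ncol(mat)
This builds only ncol(mat) outer comparisons (4 here) instead of nrow(mat)^2 function calls. Since the matrix feeds into cluster analysis, as.dist(diss) converts it to the dist object that hclust() and most clustering functions expect.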

Related

Matrix manipulation to compute the minimum value of the upper and lower triangular matrix

I want to compute, for each pair of mirrored elements, the minimum of the lower and upper triangular entries of a matrix. For example
A <- matrix(c(2, 4, 3, 1, 5, 7, 4, 2, 4), nrow = 3, ncol = 3, byrow = TRUE)
B <- matrix(c(0, 1, 3, 1, 0, 2, 3, 2, 0), nrow = 3, ncol = 3, byrow = TRUE)
I would like to create a matrix like B, with the diagonal elements set to 0 and each remaining element the minimum of the two mirrored entries, e.g. min(A(1,2), A(2,1)) = min(4, 1) = 1. Can anyone suggest how to achieve this manipulation?
I think you want to use pmin:
A <- matrix(c(2, 4, 3, 1, 5, 7, 4, 2, 4), nrow = 3, ncol = 3, byrow = TRUE)
diag(A) <- 0
output <- pmin(A, t(A))
output
     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    2
[3,]    3    2    0
Do it like this:
B <- ifelse(A < t(A), A, t(A))
diag(B) <- 0
B
     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    2
[3,]    3    2    0
First take the element-wise minimum of A and its transpose, then set the diagonal elements to 0.

Can I vectorise/vectorize this simple cohort retention model in R?

I am creating a simple cohort-based user retention model, based on the number of new users that appear each day, and the likelihood of a user reappearing on day 0 (100%), day 1, day 2, etc. I want to know the number of users active on each day. I am trying to vectorise this and getting in a right muddle. Here is a toy mockup.
rvec <- c(1, .8, .4)     # retention for day 0, 1, 2 (day 0 = 100%, and so forth)
newvec <- c(10, 10, 10)  # new joiners for day 0, 1, 2 (might be different)
playernumbers <- matrix(0, nrow = 3, ncol = 3)
# I want to fill matrix playernumbers such that the sum of each row gives
# the total player numbers on day rownumber - 1.
# Here is a brute-force method (could be simplified via a loop or two),
# but what I am puzzled about is whether there is a way to fully vectorise it.
playernumbers[1,1] <- rvec[1] * newvec[1];
playernumbers[2,1] <- rvec[2] * newvec[1];
playernumbers[3,1] <- rvec[3] * newvec[1];
playernumbers[2,2] <- rvec[1] * newvec[2];
playernumbers[3,2] <- rvec[2] * newvec[2];
playernumbers[3,3] <- rvec[1] * newvec[3];
playernumbers
I can't figure out how to vectorise this fully. I can see how I might do it column-wise, successively using each column number to indicate (a) which rows to update (column number : nrow) and (b) which newvec index value to multiply by. But I'm not sure this is worth doing, as to me the loop is clearer. Still, is there a fully vectorised form I am missing? Thanks!
If you don't insist on your weird indexing logic, you could simply calculate the outer product:
outer(rvec, newvec)
# [,1] [,2] [,3]
#[1,] 10 10 10
#[2,] 8 8 8
#[3,] 4 4 4
In the outer product the product of the second element of vector 1 and the second element of vector 2 is placed at [2,2]. You place it at [3,2]. Why?
Your result:
playernumbers
# [,1] [,2] [,3]
#[1,] 10 0 0
#[2,] 8 10 0
#[3,] 4 8 10
Edit:
This should do the same as your loop:
rvec <- c(1, .8, .4)
newvec <- c(10, 20, 30)
tmp <- outer(rvec, newvec)
tmp <- tmp[, ncol(tmp):1]   # reverse the columns
tmp[lower.tri(tmp)] <- 0    # drop the products that don't belong
tmp <- tmp[, ncol(tmp):1]   # reverse back
res <- tmp * 0
res[lower.tri(res, diag = TRUE)] <- tmp[tmp != 0]  # assumes no retained product is exactly 0
# [,1] [,2] [,3]
#[1,] 10 0 0
#[2,] 8 20 0
#[3,] 4 16 30
rowSums(res)
#[1] 10 28 50
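For what it's worth, the same lower-triangular fill can also be done in one shot with matrix indexing, reading the pattern as playernumbers[i, j] = rvec[i - j + 1] * newvec[j] for i >= j. A sketch under that reading:
n <- length(rvec)
res <- matrix(0, n, n)
keep <- row(res) >= col(res)  # lower triangle, including the diagonal
res[keep] <- rvec[row(res)[keep] - col(res)[keep] + 1] * newvec[col(res)[keep]]
rowSums(res)  # total players active on each day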

Hints to improve performance in nested for loop?

In a 100x100 matrix populated only with integers, I am performing pairwise comparisons WITHIN each row, beginning with (and including) element 1,1. For any comparison that is TRUE, I tally a +1 at the corresponding element of another preallocated matrix x (this is just a similarity matrix).
Using nested for loops, this operation requires N*(N-1)/2 + N comparisons for each row. On my machine the code below doesn't take too long, but is there a better (ok, faster and more elegant) way to do this? I have considered a vectorised calculation using apply, but as of yet with no joy.
result <- matrix(round(rnorm(10000, sample(5))), ncol = 100)
x <- matrix(data = 0, nrow = 100, ncol = 100)
system.time(
  for (i in 1:100) {
    for (j in 1:100) {
      for (k in j:100) {
        if (result[i, ][j] == result[i, ][k]) {
          x[j, ][k] <- x[j, ][k] + 1
        }
      }
    }
  }
)
   user  system elapsed
  6.586   0.599   7.192
Here's a small example:
"result" matrix
     [,1] [,2] [,3] [,4]
[1,]    1    6    1    1
[2,]    6    1    5    3
[3,]    1    5    4    4
[4,]    2    3    4    2
structure(c(1, 6, 1, 2, 6, 1, 5, 3, 1, 5, 4, 4, 1, 3, 4, 2), .Dim = c(4L,4L))
After applying the code, I expect the x matrix to be:
     [,1] [,2] [,3] [,4]
[1,]    4    0    1    2
[2,]    0    4    0    0
[3,]    0    0    4    2
[4,]    0    0    0    4
This is about 100 times faster (50ms) on my machine using your 100-by-100 result matrix:
for (i in 1:ncol(result))
  for (j in i:ncol(result))
    x[i, j] <- sum(result[, i] == result[, j])
And this is about 200 times faster, but maybe a bit harder to understand:
x <- apply(result, 2, function(y) colSums(result == y))
x[lower.tri(x)] <- 0
If it is still not fast enough for your taste, I would check whether this exact function is already implemented in one of the many distance packages, or try an Rcpp implementation, although I'm not sure you'll gain much more, as my suggestions already use a fair amount of vectorisation.
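If you do reach for Rcpp, a minimal sketch of the same column-pair tally might look like the following (assuming the Rcpp package is installed; pairTally is a made-up name):
library(Rcpp)
cppFunction('
IntegerMatrix pairTally(NumericMatrix result) {
  int n = result.ncol(), m = result.nrow();
  IntegerMatrix x(n, n);  // zero-initialised; upper triangle holds the tallies
  for (int j = 0; j < n; ++j)
    for (int k = j; k < n; ++k) {
      int s = 0;
      for (int i = 0; i < m; ++i)
        if (result(i, j) == result(i, k)) ++s;
      x(j, k) = s;
    }
  return x;
}')
x <- pairTally(result)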

Apply a function to every row of a matrix or a data frame

Suppose I have an n-by-2 matrix and a function that takes a 2-vector as one of its arguments. I would like to apply the function to each row of the matrix and get an n-vector. How can I do this in R?
For example, I would like to compute the density of a 2D standard Normal distribution on three points:
bivariate.density <- function(x = c(0, 0), mu = c(0, 0), sigma = c(1, 1), rho = 0) {
  exp(-1/(2*(1-rho^2))*(x[1]^2/sigma[1]^2+x[2]^2/sigma[2]^2-2*rho*x[1]*x[2]/(sigma[1]*sigma[2]))) * 1/(2*pi*sigma[1]*sigma[2]*sqrt(1-rho^2))
}
out <- rbind(c(1, 2), c(3, 4), c(5, 6))
How to apply the function to each row of out?
And how do I pass values for the other arguments, besides the points, to the function?
You simply use the apply() function:
R> M <- matrix(1:6, nrow=3, byrow=TRUE)
R> M
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
R> apply(M, 1, function(x) 2*x[1]+x[2])
[1] 4 10 16
R>
This takes a matrix and applies a (silly) function to each row. You pass extra arguments to the function as fourth, fifth, ... arguments to apply().
In case you want to apply common functions such as sum or mean, you should use rowSums or rowMeans, since they're faster than the apply(data, 1, sum) approach. Otherwise, stick with apply(data, 1, fun). You can pass additional arguments after the FUN argument (as Dirk already suggested):
set.seed(1)
m <- matrix(round(runif(20, 1, 5)), ncol=4)
diag(m) <- NA
m
     [,1] [,2] [,3] [,4]
[1,]   NA    5    2    3
[2,]    2   NA    2    4
[3,]    3    4   NA    5
[4,]    5    4    3   NA
[5,]    2    1    4    4
Then you can do something like this:
apply(m, 1, quantile, probs=c(.25,.5, .75), na.rm=TRUE)
    [,1] [,2] [,3] [,4] [,5]
25%  2.5    2  3.5  3.5 1.75
50%  3.0    2  4.0  4.0 3.00
75%  4.0    3  4.5  4.5 4.00
Here is a short example of applying a function to each row of a matrix.
(Here, the function applied normalizes every row to 1.)
Note: The result from the apply() had to be transposed using t() to get the same layout as the input matrix A.
A <- matrix(c(
  0, 1, 1, 2,
  0, 0, 1, 3,
  0, 0, 1, 3
), nrow = 3, byrow = TRUE)
t(apply(A, 1, function(x) x / sum(x)))
Result:
     [,1] [,2] [,3] [,4]
[1,]    0 0.25 0.25 0.50
[2,]    0 0.00 0.25 0.75
[3,]    0 0.00 0.25 0.75
apply() does the job well but is quite slow. Using sapply() and vapply() can be useful, and dplyr's rowwise() can help as well.
Let's see an example of how to compute the row-wise product of a data frame.
a <- data.frame(t(iris[1:10, 1:3]))
vapply(a, prod, 0)
sapply(a, prod)
Note that assigning the transformed data to a variable before calling vapply/sapply/apply is good practice, as it avoids repeating the transformation and reduces run time considerably. Here are the microbenchmark results:
a <- data.frame(t(iris[1:10, 1:3]))
b <- iris[1:10, 1:3]
library(dplyr)  # for %>%, rowwise(), summarise()
microbenchmark::microbenchmark(
  apply(b, 1, prod),
  vapply(a, prod, 0),
  sapply(a, prod),
  apply(iris[1:10, 1:3], 1, prod),
  vapply(data.frame(t(iris[1:10, 1:3])), prod, 0),
  sapply(data.frame(t(iris[1:10, 1:3])), prod),
  b %>% rowwise() %>%
    summarise(p = prod(Sepal.Length, Sepal.Width, Petal.Length))
)
Have a careful look at how t() is used: the data frame is transposed first so that each original row becomes a column, because sapply() and vapply() iterate over the columns of a data frame.
First step would be making the function object, then applying it. If you want a matrix object that has the same number of rows, you can predefine it and use the object[] form as illustrated (otherwise the returned value will be simplified to a vector):
bvnormdens <- function(x = c(0, 0), mu = c(0, 0), sigma = c(1, 1), rho = 0) {
  exp(-1/(2*(1-rho^2)) * (x[1]^2/sigma[1]^2 +
                          x[2]^2/sigma[2]^2 -
                          2*rho*x[1]*x[2]/(sigma[1]*sigma[2]))) *
    1/(2*pi*sigma[1]*sigma[2]*sqrt(1-rho^2))
}
out <- rbind(c(1, 2), c(3, 4), c(5, 6))
bvout <- matrix(NA, ncol = 1, nrow = 3)
bvout[] <- apply(out, 1, bvnormdens)
bvout
             [,1]
[1,] 1.306423e-02
[2,] 5.931153e-07
[3,] 9.033134e-15
If you want to use values other than the defaults, the call should include the named arguments after the function:
bvout[] <- apply(out, 1, FUN = bvnormdens, mu = c(-1, 1), rho = 0.6)
apply() can also be used on higher dimensional arrays and the MARGIN argument can be a vector as well as a single integer.
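For instance, a small sketch of a vector-valued MARGIN: with a 3-dimensional array, MARGIN = c(1, 2) keeps the first two dimensions and collapses the third:
arr <- array(1:24, dim = c(2, 3, 4))
apply(arr, c(1, 2), sum)  # 2 x 3 matrix; entry [i, j] is sum(arr[i, j, ])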
Another approach, if you want to apply a function over a varying portion of the dataset rather than a single row, is rollapply(data, width, FUN, ...) from the zoo package. Using a vector of widths allows you to apply a function on a varying window of the dataset. I've used this to build an adaptive filtering routine, though it isn't very efficient.
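As a quick illustration, a minimal rolling-mean sketch with zoo:
library(zoo)
x <- 1:10
rollapply(x, width = 3, FUN = mean)  # mean over each window of 3 points
# [1] 2 3 4 5 6 7 8 9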
A dplyr approach using across(), rowSums() and rowMeans():
M <- matrix(1:9, nrow = 3, byrow = TRUE)
M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
library(dplyr)
M %>% as_tibble() %>%
  rowwise() %>%
  mutate(sum = rowSums(across(where(is.numeric)))) %>%
  mutate(mean = rowMeans(across(V1:V3))) %>%
  mutate(Max = max(c_across(V1:V3))) %>%  # c_across() collects the row values; V1:V3 alone would build a sequence
  mutate(Min = min(c_across(V1:V3))) %>%
  as.matrix()
     V1 V2 V3 sum mean Max Min
[1,]  1  2  3   6    2   3   1
[2,]  4  5  6  15    5   6   4
[3,]  7  8  9  24    8   9   7
