Hints to improve performance in a nested for loop in R?

In a 100x100 matrix populated only with integers, I am performing pairwise comparisons within each row, beginning with (and including) element [1,1]. For any comparison that is TRUE, I tally a +1 at the corresponding element of another preallocated matrix, x (this is just a similarity matrix).
Using nested for loops, this operation requires N*(N-1)/2 + N comparisons per row. On my machine the code below doesn't take too long, but is there a better (ok, faster and more elegant) way to do this? I have considered a vectorized calculation using "apply", but so far with no joy.
result <- matrix(round(rnorm(10000, sample(5))), ncol = 100)
x <- matrix(data = 0, nrow = 100, ncol = 100)
system.time(
  for (i in 1:100) {
    for (j in 1:100) {
      for (k in j:100) {
        if (result[i, j] == result[i, k]) {
          x[j, k] <- x[j, k] + 1
        }
      }
    }
  }
)
   user  system elapsed
  6.586   0.599   7.192
Here's a small example:
"result" matrix
     [,1] [,2] [,3] [,4]
[1,]    1    6    1    1
[2,]    6    1    5    3
[3,]    1    5    4    4
[4,]    2    3    4    2
structure(c(1, 6, 1, 2, 6, 1, 5, 3, 1, 5, 4, 4, 1, 3, 4, 2), .Dim = c(4L, 4L))
After running the code, I expect this in the x matrix:
     [,1] [,2] [,3] [,4]
[1,]    4    0    1    2
[2,]    0    4    0    0
[3,]    0    0    4    2
[4,]    0    0    0    4

This is about 100 times faster (50ms) on my machine using your 100-by-100 result matrix:
for (i in 1:ncol(result))
  for (j in i:ncol(result))
    x[i, j] <- sum(result[, i] == result[, j])
And this is about 200 times faster, but maybe a bit harder to understand:
x <- apply(result, 2, function(y)colSums(result == y))
x[lower.tri(x)] <- 0
If it is still not fast enough for your taste, I would check whether this exact function is already implemented in one of the many distance packages, or try an Rcpp implementation. Although I'm not sure you'll gain much more, as my suggestions already use a fair amount of vectorization.
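If you want to squeeze out a bit more, here is a sketch of a fully vectorised variant (my own addition, not part of the original answer; it assumes the matrix contains only a modest number of distinct values, as rounded random draws do here):

```r
# For each distinct value v, (result == v) marks where v occurs; crossprod of
# that indicator counts, for every column pair (j, k), the rows in which both
# columns equal v. Summing over all v gives the number of positions where the
# two columns agree.
vals <- unique(as.vector(result))
x2 <- Reduce(`+`, lapply(vals, function(v) crossprod(result == v)))
x2[lower.tri(x2)] <- 0  # keep only the upper triangle, as in the loop version
```

The work moves into a few matrix products, so this only pays off when the number of distinct values is small relative to the matrix dimensions.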


Look up R matrix cells in a vectorized way

I'd like to lookup matrix cells by using rows and columns from a data frame. Preferably, I'd like to do this in a vectorized way for best performance. However, the most obvious syntax leads to a lookup of all the row-column combinations possible, not only the combinations that stem from one data frame row:
Here is a small example:
> m1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 3, 3)
> m1
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> p1 <- data.frame(row = c(2, 3, 1), column = c(3, 1, 2))
> p1
  row column
1   2      3
2   3      1
3   1      2
> # vectorized indexing that does not work as intended
> m1[p1$row, p1$column]
     [,1] [,2] [,3]
[1,]    8    2    5
[2,]    9    3    6
[3,]    7    1    4
> # this works as intended, but is possibly slow due to R-level looping
> sapply(1:nrow(p1), function(i) m1[p1[i, "row"], p1[i, "column"]])
[1] 8 3 4
The sapply call computes the output I expect (only m1[2, 3], m1[3, 1] and m1[1, 2]), but it's expected to be slow for larger data frames because it loops at the R level.
Any thoughts on a better (ideally vectorized) way?
For your intended purpose you need to subset the matrix with a two-column matrix of row/column index pairs. So you can try:
m1[as.matrix(p1)]
# [1] 8 3 4
Or if you have two vectors:
m1[cbind(row_idx, col_idx)]
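The same two-column index matrix also works on the left-hand side of an assignment; a small sketch (the name m2 is my own) overwrites exactly those three cells:

```r
m2 <- m1
m2[as.matrix(p1)] <- 0  # writes only to (2,3), (3,1) and (1,2)
m2
#      [,1] [,2] [,3]
# [1,]    1    0    7
# [2,]    2    5    0
# [3,]    0    6    9
```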

How do I avoid "dimnames must be a list"-error when creating a matrix?

I'm brushing up on my linear algebra skills. Since my notebook is getting messy I'm trying to create my matrices in R. It is my hope that then I'll manage to avoid random errors.
I try to create a matrix with three rows and six columns:
matrix(
  c(2, 2, 1, 1, 0, 0),
  c(1, 3, 2, 0, 1, 0),
  c(1, 3, 6, 0, 0, 1),
  nrow = 3,
  ncol = 6)
I then get an error message stating that:
Error in matrix(c(2, 2, 1, 1, 0, 0), c(1, 3, 2, 0, 1, 0), c(1, 3, 6, 0, :
'dimnames' must be a list
I'm not sure I understand the hiccup. I have specified my three rows, and there's nothing in the help section claiming a need to name anything at all?
It doesn't say "dimnames must be named", it says "'dimnames' must be a list". It happens because you pass three c() vectors as separate arguments: matrix() takes only the first as the input data, and since nrow and ncol are given by name, the remaining two unnamed vectors are matched positionally to byrow and dimnames, hence the error. Just wrap them in one c():
matrix(
  c(c(2, 2, 1, 1, 0, 0),
    c(1, 3, 2, 0, 1, 0),
    c(1, 3, 6, 0, 0, 1)),
  nrow = 3,
  ncol = 6,
  byrow = TRUE)  # fill row by row, so each inner c() becomes a row
Or put all the numbers in a single c() from the very beginning. You also don't need to give both nrow and ncol; one is enough. Since the data is written row by row, add byrow = TRUE so the matrix is filled that way:
matrix(
  c(2, 2, 1, 1, 0, 0,
    1, 3, 2, 0, 1, 0,
    1, 3, 6, 0, 0, 1),
  nrow = 3,
  byrow = TRUE)
An easy way to create matrices is to just bind your vectors:
rbind(
  c(2, 2, 1, 1, 0, 0),
  c(1, 3, 2, 0, 1, 0),
  c(1, 3, 6, 0, 0, 1))
Use rbind to row-bind a series of vectors, or feed a single vector to the matrix function. Note that if you choose option 2, you need to set byrow=TRUE because matrix fills column by column (column-major) by default.
rbind(c(2, 2, 1, 1, 0, 0),
      c(1, 3, 2, 0, 1, 0),
      c(1, 3, 6, 0, 0, 1))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    2    2    1    1    0    0
#> [2,]    1    3    2    0    1    0
#> [3,]    1    3    6    0    0    1
matrix(
  c(2, 2, 1, 1, 0, 0,
    1, 3, 2, 0, 1, 0,
    1, 3, 6, 0, 0, 1),
  nrow = 3,
  ncol = 6,
  byrow = TRUE)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    2    2    1    1    0    0
#> [2,]    1    3    2    0    1    0
#> [3,]    1    3    6    0    0    1
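For completeness, the dimnames argument that the error message complained about expects a list of two character vectors (row names, then column names); a minimal sketch:

```r
matrix(1:6, nrow = 2,
       dimnames = list(c("r1", "r2"), c("a", "b", "c")))
#>    a b c
#> r1 1 3 5
#> r2 2 4 6
```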

Efficient way to generate a coincidence matrix

I want to generate a simple coincidence matrix. I've looked for R packages but could not find one that does this calculation; I don't know if the English term for this matrix is different from the Portuguese one... so, this is what I need to do.
I have a matrix:
     [,1] [,2] [,3] [,4]
[1,]    1    1    2    1
[2,]    1    2    3    1
[3,]    2    3    1    2
[4,]    1    2    3    3
The coincidence matrix is calculated by comparing the rows element by element, to generate a dissimilarity distance with the formula:
Diss = 1 - (Coincidences / (Coincidences + Discordances))
So my resulting matrix is a symmetric one with dim 4x4 and diagonal elements equal to 0; in the example my A(1,2) would be:
A(1,2) = 1 - (2 / 4) = 0.5
A(1,3) = 1 - (0/4) = 1.0
And so on...
I have created a function to generate this matrix:
cs_matrix <- function(x) {
  cs.mat <- matrix(0, nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    for (j in 1:nrow(x)) {
      cs.mat[i, j] <- 1 - (sum(x[i, ] == x[j, ]) / ncol(x))
    }
  }
  return(cs.mat)
}
The function works fine, but my actual Data Set has 2560 observations of 4 variables, thus generating a 2560 x 2560 coincidence matrix, and it takes quite some time to do the calculation. I wonder if there is a more efficient way of calculating this or even if there is already a package that can calculate this dissimilarity distance. This matrix will be later used in Cluster Analysis.
I think you can use outer
add <- function(x, y) sum(mat[x, ] == mat[y, ])
nr <- seq_len(nrow(mat))
mat1 <- 1 - outer(nr, nr, Vectorize(add)) / ncol(mat)
mat1
#      [,1] [,2] [,3] [,4]
# [1,] 0.00 0.50    1 0.75
# [2,] 0.50 0.00    1 0.25
# [3,] 1.00 1.00    0 1.00
# [4,] 0.75 0.25    1 0.00
If diagonal elements need to be 1 do diag(mat1) <- 1.
data
mat <- structure(c(1, 1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 3, 1, 1, 2, 3), .Dim = c(4L,4L))
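Since outer with Vectorize still calls add once per pair of rows, a further sketch (my own addition, assuming the matrix holds only a small set of distinct values, as categorical data usually does) pushes the whole computation into matrix products:

```r
# For each distinct value v, tcrossprod(mat == v) counts, for every pair of
# rows, the columns in which both rows equal v; summing over v gives the
# total number of coincidences per row pair.
vals <- unique(as.vector(mat))
coinc <- Reduce(`+`, lapply(vals, function(v) tcrossprod(mat == v)))
diss <- 1 - coinc / ncol(mat)
```

On the 4x4 example this reproduces mat1 exactly; for 2560 rows and 4 variables it replaces the R-level loop over row pairs with a handful of BLAS calls.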

R: fast determine top k maximum value in a matrix

I would like to quickly determine the top k maximum values in each row of a matrix, and then set the values that are not among the top k to zero. Currently I have worked out the following solution. Can somebody improve this one? When the matrix has many, many rows, it is not so fast.
thanks.
mat <- matrix(c(5, 1, 6, 4, 9, 1, 8, 9, 10), nrow = 3, byrow = TRUE)
sortedMat <- t(apply(mat, 1, function(x) sort(x, decreasing = TRUE, method = "quick")))
topK <- 2
sortedMat <- sortedMat[, 1:topK, drop = FALSE]
lmat <- mat
for (i in 1:nrow(mat)) {
  lmat[i, ] <- mat[i, ] %in% sortedMat[i, ]
}
kMat <- mat * lmat
> mat
     [,1] [,2] [,3]
[1,]    5    1    6
[2,]    4    9    1
[3,]    8    9   10
> kMat
     [,1] [,2] [,3]
[1,]    5    0    6
[2,]    4    9    0
[3,]    0    9   10
In Rfast, the command sort_mat sorts the columns of a matrix, colOrder gives the order for each column, colRanks gives the ranks for each column, and colnth gives the nth value of each column. I believe at least one of them suits you.
You could use rank to speed this up. In case there are ties, you would have to decide on a method to break these (e.g. ties.method = "random").
kmat <- function(mat, k) {
  mat[t(apply(mat, 1, rank)) <= (ncol(mat) - k)] <- 0
  mat
}
kmat(mat, 2)
##      [,1] [,2] [,3]
## [1,]    5    0    6
## [2,]    4    9    0
## [3,]    0    9   10
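For very wide rows, a sketch of another angle (my own addition; it relies only on base R's partial sorting) avoids fully sorting each row by computing just the k-th largest value as a threshold:

```r
# sort(x, partial = p) only guarantees the element at position p ends up in
# its sorted place, which is cheaper than a full sort for long rows.
topk_row <- function(x, k) {
  p <- length(x) - k + 1
  thr <- sort(x, partial = p)[p]  # the k-th largest value in the row
  x * (x >= thr)                  # ties at the threshold may keep extra entries
}
t(apply(mat, 1, topk_row, k = 2))
```

Like the rank approach, this needs a tie-breaking decision if more than k entries share the threshold value.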

Can I vectorise/vectorize this simple cohort retention model in R?

I am creating a simple cohort-based user retention model, based on the number of new users that appear each day, and the likelihood of a user reappearing on day 0 (100%), day 1, day 2, etc. I want to know the number of users active on each day. I am trying to vectorise this and getting in a right muddle. Here is a toy mockup.
rvec <- c(1, .8, .4)     # retention for day 0, 1, 2 (day 0 = 100%, and so forth)
newvec <- c(10, 10, 10)  # new joiners for day 0, 1, 2 (might be different)
playernumbers <- matrix(0, nrow = 3, ncol = 3)
# I want to fill matrix playernumbers such that the sum of each row gives
# the total player numbers on day rownumber - 1.
# Here is a brute-force method (could be simplified via a loop or two),
# but what I am puzzled about is whether there is a way to fully vectorise it.
playernumbers[1, 1] <- rvec[1] * newvec[1]
playernumbers[2, 1] <- rvec[2] * newvec[1]
playernumbers[3, 1] <- rvec[3] * newvec[1]
playernumbers[2, 2] <- rvec[1] * newvec[2]
playernumbers[3, 2] <- rvec[2] * newvec[2]
playernumbers[3, 3] <- rvec[1] * newvec[3]
playernumbers
I can't figure out how to vectorise this fully. I can see how I might do it column-wise, successively using each column number to indicate (a) which rows to update (column number : nrows), and (b) which newvec index to multiply by. But I'm not sure this is worth doing, as to me the loop is clearer. Is there a fully vectorised form I am missing? Thanks!
If you don't insist on your weird indexing logic, you could simply calculate the outer product:
outer(rvec, newvec)
#      [,1] [,2] [,3]
# [1,]   10   10   10
# [2,]    8    8    8
# [3,]    4    4    4
In the outer product the product of the second element of vector 1 and the second element of vector 2 is placed at [2,2]. You place it at [3,2]. Why?
Your result:
playernumbers
#      [,1] [,2] [,3]
# [1,]   10    0    0
# [2,]    8   10    0
# [3,]    4    8   10
Edit:
This should do the same as your brute-force assignments:
rvec <- c(1, .8, .4)
newvec <- c(10, 20, 30)
tmp <- outer(rvec, newvec)
tmp <- tmp[, ncol(tmp):1]
tmp[lower.tri(tmp)] <- 0
tmp <- tmp[, ncol(tmp):1]
res <- tmp * 0
res[lower.tri(res, diag = TRUE)] <- tmp[tmp != 0]
res
#      [,1] [,2] [,3]
# [1,]   10    0    0
# [2,]    8   20    0
# [3,]    4   16   30
rowSums(res)
# [1] 10 28 50
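A sketch of a more direct construction (my own take, not from the answer above): for i >= j, element [i, j] of the target matrix is rvec[i - j + 1] * newvec[j], and outer can evaluate that in a single pass:

```r
n <- length(rvec)
res <- outer(seq_len(n), seq_len(n), function(i, j)
  ifelse(i >= j, rvec[pmax(i - j + 1, 1)] * newvec[j], 0))
# pmax(..., 1) keeps the index valid where i < j; ifelse zeroes those cells anyway
rowSums(res)  # total active players per day
```

With newvec = c(10, 20, 30) this reproduces the res and rowSums shown above.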
