I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")
The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary"). It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64-bit unsigned integers for speed. The Jaccard distance of two bit vectors x and y is the number of set bits in x ^ y divided by the number of set bits in x | y, where ^ is the xor operator and | is the or operator. The Hamming weight calculation is used to count the set bits whenever the result of the xor or the or is non-zero.
I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary") for a few random datasets.
On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
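As a quick sanity check of the formula, the same ratio can be computed in plain R for two short vectors and compared with dist(). This is only an illustrative sketch (the vectors x and y are made up for the example), not the packed implementation that follows:

```r
# Two made-up binary vectors
x <- c(1L, 0L, 1L, 1L, 0L, 0L)
y <- c(1L, 1L, 0L, 1L, 0L, 0L)

n_xor <- sum(xor(x, y))   # positions where exactly one vector has a 1
n_or  <- sum(x | y)       # positions where at least one vector has a 1
jaccard <- n_xor / n_or   # 2 / 4 = 0.5

# Reference value from base R
ref <- as.numeric(dist(rbind(x, y), method = "binary"))
print(c(jaccard = jaccard, dist = ref))
```

The C++ code does the same counting, but on 64 bits at a time.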
bDist.cpp
#include <Rcpp.h>
#include <cstdint> // for uint64_t
using namespace Rcpp;
//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
const uint64_t m1 = 0x5555555555555555; //binary: 0101...
const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
int countBits(uint64_t x) {
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
return (x * h01)>>56; //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
uint64_t* x = (uint64_t*)RAW(rv);
return(countBits(*x));
}
// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
int nr(mat.nrow()), nc(mat.ncol());
int nw = nr / 8;
NumericVector res(nc * (nc - 1) / 2);
// Access the raw data as unsigned 64 bit integers
uint64_t* data = (uint64_t*)RAW(mat);
uint64_t a(0);
// Work through each possible combination of columns (rows in the original integer matrix)
for (int i = 0; i < nc - 1; i++) {
for (int j = i + 1; j < nc; j++) {
uint64_t sx = 0;
uint64_t so = 0;
// Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
for (int k = 0; k < nw; k++) {
uint64_t o = data[nw * i + k] | data[nw * j + k];
// If (x | y == 0) then (x ^ y) will also be 0
if (o) {
// Use Hamming weight method to calculate number of set bits
so = so + countBits(o);
uint64_t x = data[nw * i + k] ^ data[nw * j + k];
if (x) {
sx = sx + countBits(x);
}
}
}
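// Two all-zero rows have no set bits at all; return 0 for them, as dist(method = "binary") does
res(a++) = so ? (double)sx / so : 0.0;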
}
}
return (res);
}
R source
library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}
as.PackedMatrix <- function(x, packWidth = 64L) {
UseMethod("as.PackedMatrix")
}
# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
stopifnot(packWidth %% 8 == 0, is.matrix(x) || inherits(x, "Matrix"))
storage.mode(x) <- "raw"
if (ncol(x) %% packWidth != 0) {
x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
}
out <- packBits(t(x))
dim(out) <- c(ncol(x) %/% 8, nrow(x))
class(out) <- "PackedMatrix"
out
}
# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
out <- rawToBits(x)
dim(out) <- c(nrow(x) * 8L, ncol(x))
storage.mode(out) <- "integer"
t(out)
}
# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
x <- replicate(nObs, {
y <- integer(nVariables)
y[sample(nVariables, sample(maxBits, 1))] <- 1L
if (packed) {
packRow(y, 64L)
} else {
y
}
})
if (packed) {
class(x) <- "PackedMatrix"
x
} else {
t(x)
}
}
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
i <- i + 1
}
out <- do.call("cbind", lst)
class(out) <- "PackedMatrix"
out
}
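# Wrapper for the C++ code which returns an object of class "dist"
binaryDist <- function(x) {
# inherits() is used because class(x) can have more than one element
if (!inherits(x, "PackedMatrix")) {
x <- as.PackedMatrix(x)
}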
dst <- bDist(x)
attr(dst, "Size") <- ncol(x)
attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
attr(dst, "method") <- "binary"
attr(dst, "call") <- match.call()
class(dst) <- "dist"
dst
}
x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)
system.time(bd <- binaryDist(x))
From original answer:
Other things to consider would be doing some prefiltering of comparisons between two rows with single ones since the distance will either be 0 for duplicates or 1 for any other possibility.
A couple of relatively straightforward options that might be faster without needing much code are the vegdist function from the vegan package and the Dist function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact it supports parallelisation.
The reason this takes so long to compute is that the call to dist is computing and storing more than 760 million pairwise distances. Even if your data is stored sparsely, this will take a long time and a huge amount of storage. If your data is not stored sparsely, then each distance computation requires at least 14,000 operations, for a total operation count exceeding 10 trillion!
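The sizes involved can be worked out directly. The following is just back-of-the-envelope arithmetic for the dimensions given in the question (the variable names are mine):

```r
n_rows <- 39000
n_cols <- 14000

n_pairs <- choose(n_rows, 2)            # 760,480,500 pairwise distances
storage_gib <- n_pairs * 8 / 2^30       # stored as 8-byte doubles: about 5.7 GiB
dense_ops <- n_pairs * n_cols           # about 1.06e13 element comparisons

print(c(pairs = n_pairs, GiB = storage_gib, ops = dense_ops))
```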
An approach that will be much quicker is k-means clustering, since it doesn't require pre-computing a distance matrix; at each iteration you will need only 39000*k distance calculations, where k is the number of clusters. To get pairwise distances that are similar to the Jaccard index (0 if identical, 1 if no indices coincide, in between if some but not all indices coincide), you could divide each row x by sqrt(2*sum(x^2)). For instance, if you had the following input matrix:
(mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 0, 1, 1)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 0 0 0 0
# [2,] 0 0 0 1 1
the normalized version would be (assuming binary values only in the matrix; if this were not the case you would use rowSums(mat^2)):
(mat.norm <- mat / sqrt(2*rowSums(mat)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7071068 0 0 0.0 0.0
# [2,] 0.0000000 0 0 0.5 0.5
These two observations (which have no indices in common) have Euclidean distance 1, coinciding with the Jaccard distance for this case.
dist(mat.norm, "euclidean")
# 1
# 2 1
Additionally, identical observations will clearly have Euclidean distance 0, again corresponding to the Jaccard distance.
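To make the idea concrete, here is a minimal sketch on a made-up 4 x 5 matrix. It checks the three cases (identical, partially overlapping, disjoint) and then runs kmeans on the normalized rows. The choice of centers = 2 and nstart = 5 is arbitrary and only for illustration; for the real data you would have to pick k yourself.

```r
# Rows 1 and 2 are identical, row 3 overlaps row 1 in one position,
# row 4 shares nothing with the others
mat <- rbind(c(1, 1, 0, 0, 0),
             c(1, 1, 0, 0, 0),
             c(1, 0, 1, 0, 0),
             c(0, 0, 0, 1, 1))

# Each row is divided by sqrt(2 * number of ones in that row)
mat.norm <- mat / sqrt(2 * rowSums(mat))
d <- as.matrix(dist(mat.norm, "euclidean"))
print(round(d, 3))

# k-means works on the normalized rows directly; no distance matrix is stored
set.seed(1)
km <- kmeans(mat.norm, centers = 2, nstart = 5)
print(km$cluster)
```

Note that the in-between values are not equal to the Jaccard distance; only the two end points (0 and 1) coincide.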
Do you have duplicate rows? There is no need to compute their distances twice.
All rows with a single 1 will be 100% different from all rows with a single 1 in a different place.
Thus, it does not make sense to run clustering on such data. The output is rather predictable, and boils down to finding the 1.
Try restricting your data set to those objects that have more than one 1 only. Unless you can get interesting results on these only, no need to continue further. Binary data has too little information.
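A minimal sketch of that prefiltering, on a small made-up matrix (the names m, u and multi are only for illustration):

```r
m <- rbind(c(1, 0, 0, 0),
           c(1, 0, 0, 0),   # duplicate of row 1
           c(0, 1, 1, 0),
           c(0, 1, 1, 0),   # duplicate of row 3
           c(0, 0, 0, 1),
           c(1, 1, 0, 1))

# Drop duplicate rows first, so no distance is computed twice
u <- unique(m)

# Keep only the rows with more than one 1
multi <- u[rowSums(u) > 1, , drop = FALSE]
print(multi)
```

On a matrix with 39,000 rows, unique() on a dense matrix may itself be slow, so this is a sketch of the idea rather than a tuned implementation.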
I am writing a function to perform bit inversion for each row of a binary matrix which depends on a predefined n value. The n value will determine the number of 1 bits for each row of the matrix.
set.seed(123)
## generate a random 5 by 10 binary matrix
init <- t(replicate(5, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
n <- 3
## init_1 is used to explain my problem (single row matrix)
init_1 <- t(replicate(1, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
The bit_inversion function does the following:
If the selected row has fewer 1's than n, it randomly selects a few indices (as many as the difference) and inverts them (0 to 1).
Else if the selected row has more 1's than n, it randomly selects a few indices (as many as the difference) and inverts them (1 to 0).
Else it does nothing (when the number of 1's in the row equals n).
Below is the function I implemented:
bit_inversion<- function(pop){
for(i in 1:nrow(pop)){
difference <- abs(sum(pop[i,]) - n)
## checking condition where there are more bits being turned on than n
if(sum(pop[i,]) > n){
## determine position of 1's
bit_position_1 <- sample(which(pop[i,]==1), difference)
## bit inversion
for(j in 1:length(bit_position_1)){
pop[bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
}
}
else if (sum(pop[i,]) < n){
## determine position of 0's
bit_position_0 <- sample(which(pop[i,]==0), difference)
## bit inversion
for(j in 1:length(bit_position_0)){
pop[bit_position_0[j]] <- abs(pop[bit_position_0[j]] - 1)
}
}
}
return(pop)
}
Outcome:
call <- bit_inversion(init)
> rowSums(call) ## suppose to be all 3
[1] 3 4 5 4 3
But when using init_1 (a single row matrix), the function seems to work fine.
Outcome:
call_1 <- bit_inversion(init_1)
> rowSums(call_1)
[1] 3
Is there a mistake in my for and if...else loop?
Change this line in the 'j' for loop
pop[bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
into
pop[i,bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
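You forgot the row index. The same applies to the corresponding assignment in the else if branch, where pop[bit_position_0[j]] should be pop[i, bit_position_0[j]].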
And, here is a more compact version of your for loop:
for(i in 1:nrow(pop)){
difference <- abs(sum(pop[i,]) - n)
logi <- sum(pop[i,]) > n
pop[i,sample(which(pop[i,]==logi), difference)] <- !logi
}
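For completeness, here is a self-contained sketch of the whole function with the row index in place. Two things differ from the code above and are my own assumptions rather than part of the question: n is passed as an argument instead of being read from the global environment, and the positions are drawn with sample.int() so that a row with a single candidate position is not misread by sample() as "sample from 1:position".

```r
bit_inversion <- function(pop, n) {
  for (i in seq_len(nrow(pop))) {
    ones <- sum(pop[i, ])
    difference <- abs(ones - n)
    if (difference == 0) next              # row already has n ones
    target <- as.integer(ones > n)         # 1: switch bits off, 0: switch bits on
    candidates <- which(pop[i, ] == target)
    # index into candidates rather than sampling the positions directly
    flip <- candidates[sample.int(length(candidates), difference)]
    pop[i, flip] <- 1 - target
  }
  pop
}

set.seed(123)
init <- t(replicate(5, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
result <- bit_inversion(init, n = 3)
print(rowSums(result))
```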
My data looks like this:
                         S
0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000
The S indicates the column I am talking about: column 26. All four rows share a 1 at that position.
I would need to be able to count for each row from 2 to 4:
How many columns to the left and right are the same as row 1?
For row 2 it would be 3 to the right (as it reaches 1/0) and 8 to the left (as it reaches 0/1).
The result for every row should be entered into a matrix like this:
row2 8 3
row3 11 9
Is there a fast and efficient way to do that? The matrix I am dealing with is very large.
If you need something fast, you could use Rcpp:
mat <- as.matrix(read.fwf(textConnection("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000"), widths = rep(1, 61)))
library(Rcpp)
cppFunction('
IntegerMatrix countLR(const LogicalMatrix& mat, const int S) {
const int nr(mat.nrow()), nc(mat.ncol());
IntegerMatrix res(nr - 1, 2);
for(int i=1; i<nr;i++){
for(int j=S-2; j>=0;j--) {
if (mat(0,j) != mat(i,j)) break;
else res(i-1,0)++;
}
for(int j=S; j<nc;j++) {
if (mat(0,j) != mat(i,j)) break;
else res(i-1,1)++;
}
}
return(res);
}' )
countLR(mat, 26)
# [,1] [,2]
#[1,] 8 2
#[2,] 10 2
#[3,] 6 0
I assumed that column 26 itself doesn't count for the result. I also assumed that the matrix can only contain 0/1 (i.e., boolean) values. Adjust as needed.
It's pretty easy with strsplit and rle to pull apart and assemble this data:
> S <- scan(what="") #input of character mode
1: 0101001010000000000000000100111100000000000011101100010101010
2: 1001010000000001100000000100000000000100000010101110101010010
3: 1101010101010010000000000100000000100101010010110101010101011
4: 0000000000000000001000000111000110000000000000000000000000000
5:
s2 <- strsplit(S, split="")
sapply(s2, "[[", 26) # verify the 26th position is all ones
#[1] "1" "1" "1" "1"
#length of strings from 26th position to right
rtlen <- length(s2[[1]])-(26-1)
# Pick from the `rle` $values where values TRUE
rle( tail( s2[[1]] == s2[[2]], rtlen) )
Run Length Encoding
lengths: int [1:11] 3 4 5 1 7 1 4 1 1 6 ...
values : logi [1:11] TRUE FALSE TRUE FALSE TRUE FALSE ...
Now that you have an algorithm for a single instance, you can iterate over the rest of the items in s2. To do the backwards look I just did the same operation on a rev-ersed section of the strings.
m<-matrix(NA, 3,2);
for (i in 2:4) { m[i-1,2] <- rle(tail( s2[[1]] == s2[[i]], rtlen) )$lengths[1]
m[i-1, 1] <- rle( rev( head( s2[[1]] == s2[[i]], 26)) )$lengths[1] }
m
[,1] [,2]
[1,] 9 3 # I think you counted wrong
[2,] 11 3
[3,] 7 1
Notice that I was comparing each one to the first row and your results suggest you were doing something else...perhaps comparing to the row above. That could easily be done instead with only a very small mod to the code indices for choice of the comparison vector:
m<-matrix(NA, 3,2);
for (i in 2:4) { m[i-1,2] <- rle(tail( s2[[i-1]] == s2[[i]], rtlen) )$lengths[1]
m[i-1, 1] <- rle( rev( head( s2[[i-1]] == s2[[i]], 26)) )$lengths[1] }
m
[,1] [,2]
[1,] 9 3
[2,] 9 9 #Again I think you may have miscounted. Easy to do, eh?
[3,] 7 1
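The same rle idea can be wrapped in a small helper. This is only a sketch with hypothetical names (count_lr, run); unlike the loops above it does not count column S itself, so the numbers come out one lower on each side.

```r
count_lr <- function(ref, other, S) {
  # TRUE where the two strings agree, position by position
  same <- strsplit(ref, "")[[1]] == strsplit(other, "")[[1]]
  left  <- rev(same[seq_len(S - 1)])              # S-1, S-2, ..., 1
  right <- same[seq_len(length(same) - S) + S]    # S+1, ..., end
  # length of the leading run of TRUE values (0 if the first value is FALSE)
  run <- function(v) if (length(v) > 0 && v[1]) rle(v)$lengths[1] else 0L
  c(left = run(left), right = run(right))
}

row1 <- "0101001010000000000000000100111100000000000011101100010101010"
row2 <- "1001010000000001100000000100000000000100000010101110101010010"
res <- count_lr(row1, row2, 26)
print(res)
```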
This problem intrigued me. Since the matrix is binary, it's far more efficient to pack the matrix into a raw matrix than it is to use sparse matrices. It means that the storage for a 1,000 x 21,000,000 pattern matrix is approx. 2.4 GiB (print(object.size(raw(1000 * 21000000 / 8)), units = "GB")).
The following should be a relatively efficient way to tackle the problem. The Rcpp code takes a raw matrix which indicates the differences between the first row of the original matrix and the other rows. For efficiency in the R code, it's actually arranged with the patterns in columns rather than rows. The other functions help to convert existing sparse or regular matrices into packed ones and to read a matrix directly from a file.
library("Rcpp")
library("Matrix")
writeLines("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000", "example.txt")
cppFunction('
IntegerMatrix countLRPacked(IntegerMatrix mat, long S) {
long l = S - 2;
long r = S;
long i, cl, cr;
int nr(mat.nrow()), nc(mat.ncol());
IntegerMatrix res(nc, 2);
for(int i=0; i<nc;i++){
// First the left side
// Work out which byte is the first to have a 1 in it
long j = l >> 3;
int x = mat(j, i) & ((1 << ((l & 7) + 1)) - 1);
long cl = l & 7;
while(j > 0 && !x) {
j --;
x = mat(j, i);
cl += 8;
}
// Then work out where the 1 is in the byte
while (x >>= 1) --cl;
// Now the right side
j = r >> 3;
x = mat(j, i) & ~((1 << ((r & 7))) - 1);
cr = 8 - (r & 7);
while(j < (nr-1) && !x) {
j ++;
x = mat(j, i);
cr += 8;
}
cr--;
while (x = (x << 1) & 0xff) --cr;
res(i, 0) = cl;
res(i, 1) = cr;
}
return(res);
}')
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readBinaryMatrix <- function(file = NULL, text = NULL) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]))
i <- i + 1
}
do.call("cbind", lst)
}
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of 8
packRow <- function(row) {
packBits(as.raw(c(row, rep(0, (8 - length(row)) %% 8 ))))
}
# Converts a binary integer matrix to a packed raw matrix
# Note the matrix is transposed (makes the subsequent xor more efficient)
packMatrix <- function(mat) {
stopifnot(inherits(mat, c("matrix", "dgCMatrix")))
apply(mat, 1, packRow)
}
# Takes either a packed raw matrix or a binary integer matrix, uses xor to compare the first row
# with all the others and then hands the result over to the Rcpp code for processing
countLR <- function(mat, S) {
stopifnot(inherits(mat, c("matrix", "dgCMatrix")))
if (storage.mode(mat) != "raw") {
mat <- packMatrix(mat)
}
stopifnot(8 * nrow(mat) > S)
y <- xor(mat[, -1, drop = FALSE], mat[, 1, drop = TRUE])
countLRPacked(y, S)
}
sMat <- Matrix(as.matrix(read.fwf("example.txt", widths = rep(1, 61))))
pMat <- readBinaryMatrix("example.txt")
countLR(sMat, 26)
countLR(pMat, 26)
You should note that the width of the pattern matrix is right-padded to a multiple of 8, so if the patterns match all the way to the right hand side this will result in the right hand count being possibly a bit high. This could be corrected if need be.
Slow R version to do this (moved from duplicate):
countLR <- function(mat, S) {
mat2 <- mat[1, ] != t(mat[-1, , drop = FALSE])
l <- apply(mat2[(S - 1):1, , drop = FALSE], 2, function(x) which(x)[1] - 1)
l[is.na(l)] <- S - 1
r <- apply(mat2[(S + 1):nrow(mat2), , drop = FALSE], 2, function(x) which(x)[1] - 1)
# NA in r means no mismatch was found to the right of S
r[is.na(r)] <- ncol(mat) - S
cbind(l, r)
}
a) Create a vector X of length 20, with the kth element in X = 2k, for k=1…20. Print out the values of X.
b) Create a vector Y of length 20, with all elements in Y equal to 0. Print out the values of Y.
c) Using a for loop, reassign the value of the kth element in Y, for k = 1…20. When k < 12, the kth element of Y is reassigned as the cosine of k. When k ≥ 12, the kth element of Y is reassigned as the value of the integral of sqrt(t) dt from 0 to k.
For the first two questions, it is simple.
> x1 <- seq(1,20,by=1)
> x <- 2 * x1
> x
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
> y <- rep(0,20)
> y
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I got stuck on the last one:
t <- function(i) sqrt(i)
for (i in 1:20) {
if (i < 12) {
y[i] <- cos(i)
}
else if (i >= 12) {
y[i] <- integral(t, lower= 0, Upper = 20)
}
}
y // print new y
Any suggestions? thanks.
What may help is that the command to calculate a one-dimensional integral is integrate not integral.
You have successfully completed the first two, so I'll demonstrate a different way of getting those vectors:
x <- 2 * seq_len(20)
y <- double(length = 20)
As for your function, you have the right idea, but you need to clean up your syntax a bit. For example, you may need to double-check your braces (using a set style like Hadley Wickham's will help you prevent syntax errors and make the code more readable), you don't need the "if" in the else, you need to read up on integrate and see what its inputs, and importantly its outputs are (and which of them you need and how to extract it), and lastly, you need to return a value from your function. Hopefully, that's enough to help you work it out on your own. Good Luck!
Update
Slightly different function to demonstrate coding style and some best practices with loops
Given a working answer has been posted, this is what I did when looking at your question. I think it is worth posting, as I think that it is a good habit to 1) pre-allocate answers, 2) prevent confusion about scope by not re-using the input variable name as an output, and 3) use the seq_len and seq_along constructions for for loops, per The R Inferno, which is required reading, in my opinion:
tf <- function(y){
z <- double(length = length(y))
for (k in seq_along(y)) {
if (k < 12) {
z[k] <- cos(k)
} else {
z[k] <- integrate(f = sqrt, lower = 0, upper = k)$value
}
}
return(z)
}
Which returns:
> tf(y)
[1] 0.540302306 -0.416146837 -0.989992497 -0.653643621 0.283662185 0.960170287 0.753902254
[8] -0.145500034 -0.911130262 -0.839071529 0.004425698 27.712816032 31.248114562 34.922139530
[15] 38.729837810 42.666671456 46.728535669 50.911693960 55.212726149 59.628486093
To be honest, you almost have it ready, and it is good that you have shown some code here:
y <- rep(0,20) #y vector from question 2
for ( k in 1:20) { #start the loop
if (k < 12) { #if k less than 12
y[k] <- cos(k) #calculate cosine
} else if( k >= 12) { #else if k greater or equal to 12
y[k] <- integrate( sqrt, lower=0, upper=k)$value #see below for explanation
}
}
print(y) #prints y
> print(y)
[1] 0.540302306 -0.416146837 -0.989992497 -0.653643621 0.283662185 0.960170287 0.753902254 -0.145500034 -0.911130262 -0.839071529 0.004425698
[12] 27.712816032 31.248114562 34.922139530 38.729837810 42.666671456 46.728535669 50.911693960 55.212726149 59.628486093
First of all stats::integrate is the function you need to calculate the integral
integrate( sqrt, lower=0, upper=2)$value
The first argument is a function which in your case is sqrt. sqrt is defined already in R so there is no need to define it yourself explicitly as t <- function(i) sqrt(i)
The other two arguments as you correctly set in your code are lower and upper.
The function integrate( sqrt, lower=0, upper=2) will return:
1.885618 with absolute error < 0.00022
and that is why you need integrate( sqrt, lower=0, upper=2)$value to only extract the value.
Type ?integrate in your console to see the documentation which will help you a lot I think.
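If you want to convince yourself that the numbers are right, the integral of sqrt(t) from 0 to k has the closed form (2/3) * k^(3/2), so the numerical result can be checked against it. A small sketch for k = 12 (the variable names are mine):

```r
k <- 12
numeric_value <- integrate(sqrt, lower = 0, upper = k)$value
closed_form <- (2 / 3) * k^(3 / 2)

# Both are approximately 27.7128
print(c(numeric = numeric_value, closed = closed_form))
```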
I have a nonzero symmetric matrix 'matr' that is 12000 x 12000. I need to find the indices of the top 10000 elements in 'matr' in R. The code I have written takes a long time. I was wondering if there were any pointers to make it faster.
listk <- numeric(0)
for( i in 1:10000) {
idx <- which(matr == max(matr), arr.ind=T)
if( length(idx) != 0) {
listk <- rbind( listk, idx[1,])
matr[idx[1,1], idx[1,2]] <- 0
matr[idx[2,1], idx[2,2]] <- 0
}
}
Here's how you might find the indices (ij) of the 4 largest elements in a 10-by-10 matrix m.
## Sample data
m <- matrix(runif(100), ncol=10)
## Extract the indices of the 4 largest elements
(ij <- which(m >= sort(m, decreasing=T)[4], arr.ind=TRUE))
# row col
# [1,] 2 1
# [2,] 5 1
# [3,] 6 2
# [4,] 3 10
## Use the indices to extract the values
m[ij]
# [1] 0.9985190 0.9703268 0.9836373 0.9914510
Edit:
For large matrices, performing a partial sort will be a faster way to find the 10,000th largest element:
v <- runif(1e7)
system.time(a <- sort(v, decreasing=TRUE)[10000])
# user system elapsed
# 4.35 0.03 4.38
system.time(b <- -sort(-v, partial=10000)[10000])
# user system elapsed
# 0.60 0.09 0.69
a==b
# [1] TRUE
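Putting the two pieces together, the partial sort can supply the cutoff and which() the indices. A minimal sketch on a smaller made-up matrix (n_top and cutoff are names I chose; with tied values, more than n_top rows can come back):

```r
set.seed(42)
m <- matrix(runif(100 * 100), ncol = 100)
n_top <- 10

# The n_top-th largest value, found without a full sort
cutoff <- -sort(-as.vector(m), partial = n_top)[n_top]

# Row/column indices of everything at or above the cutoff
ij <- which(m >= cutoff, arr.ind = TRUE)
print(ij)
```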
I like @JoshO'Brien's answer; the use of partial sorting is great! Here's an Rcpp solution (I'm not a strong C++ programmer, so there are probably bone-headed errors; corrections welcome. How would I template this in Rcpp to handle different types of input vector?)
I start by including the appropriate headers and using namespaces for convenience
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
Then arrange to expose my C++ function to R
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
and define some variables, most importantly a priority_queue to hold as a pair the numeric value and index. The queue is ordered so the smallest values are at the 'top', with small relying on the standard pair<> comparator.
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
Now I'll walk through the input data, adding it to the queue if either (a) I don't yet have enough values or (b) the current value is larger than the smallest value in the queue. In the latter case, I pop off the smallest value and insert its replacement. In this way the priority queue always contains the n largest elements seen so far.
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
And finally I pop the indexes from the priority queue into the return vector, remembering to translate to 1-based R coordinates.
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
and return the result to R
return wrap(result);
This has nice memory use (the priority queue and return vector are both small relative to the original data) and is fast
> library(Rcpp); sourceCpp("top_i_pq.cpp"); z <- runif(12000 * 12000)
> system.time(top_i_pq(z, 10000))
user system elapsed
0.992 0.000 0.998
Problems with this code include:
1. The default comparator greater<Elt> works so that, in the case of a tie spanning the value of the nth element, the last, rather than first, duplicate is retained.
2. NA values (and non-finite values?) may not be handled correctly; I'm not sure whether this is true or not.
3. The function only works for NumericVector input, but the logic is appropriate for any R data type for which an appropriate ordering relationship is defined.
Problems 1 and 2 can likely be dealt with by writing an appropriate comparator; maybe for 2 this is already implemented in Rcpp? I don't know how to leverage C++ language features and the Rcpp design to avoid re-implementing the function for each data type I want to support.
Here's the full code:
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
return wrap(result);
}
A bit late to the party, but I came up with this, which avoids the sort.
Say you want the top 10k elements from your 12k x 12k matrix. The idea is to "clip" the data to the elements corresponding to a quantile of that size.
find_n_top_elements <- function( x, n ){
#set the quantile to correspond to n top elements
quant <- n / (dim(x)[1]*dim(x)[2])
#select the cutpoint to get the quantile above quant
lvl <- quantile(x, probs=1.0-quant)
#select the elements above the cutpoint
res <- x[x>lvl[[1]]]
}
#create a 12k x 12k matrix (about 1.1 GB!)
n <- 12000
x <- matrix( runif(n*n), ncol=n)
system.time( res <- find_n_top_elements( x, 10e3 ) )
Resulting in
system.time( res <- find_n_top_elements( x, 10e3 ) )
user system elapsed
3.47 0.42 3.89
For comparison, just sorting x on my system takes
system.time(sort(x))
user system elapsed
30.69 0.21 31.33
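Note that the function above returns the values, not their positions. If the indices are what you need, the same cutpoint can be passed to which(). This is a sketch on a smaller made-up matrix; it assumes there are no tied values around the cutpoint, otherwise fewer than n rows may come back.

```r
set.seed(7)
x <- matrix(runif(200 * 200), ncol = 200)
n <- 50

# Same cutpoint as in find_n_top_elements
lvl <- quantile(x, probs = 1 - n / length(x), names = FALSE)

# Row/column indices of the elements above the cutpoint
idx <- which(x > lvl, arr.ind = TRUE)
print(nrow(idx))
```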
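A matrix in R is stored like a vector, so you can order it directly and then convert the linear positions into row and column indices.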
mat <- matrix(sample(1:5000, 10000, rep=T), 100, 100)
mat.od <- order(mat, decreasing = T)
mat.od.arr <- cbind(mat.od%%nrow(mat), mat.od%/%nrow(mat)+1)
mat.od.arr[,2][mat.od.arr[,1]==0] <- mat.od.arr[,2][mat.od.arr[,1]==0] - 1
mat.od.arr[,1][mat.od.arr[,1]==0] <- nrow(mat)
head(mat.od.arr)
# [,1] [,2]
# [1,] 58 5
# [2,] 59 72
# [3,] 38 22
# [4,] 23 10
# [5,] 38 14
# [6,] 90 15
mat[58, 5]
# [1] 5000
mat[59, 72]
# [1] 5000
mat[38, 22]
# [1] 4999
mat[23, 10]
# [1] 4998
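The modular arithmetic above can also be left to base R's arrayInd(), which converts linear positions into row/column pairs. A short sketch on the same kind of data (the seed and the choice of the top 6 are arbitrary):

```r
set.seed(1)
mat <- matrix(sample(1:5000, 10000, replace = TRUE), 100, 100)

# Linear positions of the 6 largest values
top <- order(mat, decreasing = TRUE)[1:6]

# Convert to (row, column) pairs
ij <- arrayInd(top, dim(mat))
print(cbind(ij, value = mat[ij]))
```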