I have a nonzero symmetric matrix 'matr' that is 12000 x 12000. I need to find the indices of the top 10,000 elements in 'matr' in R. The code I have written takes a long time; I was wondering if there were any pointers to make it faster.
listk <- numeric(0)
for (i in 1:10000) {
  idx <- which(matr == max(matr), arr.ind=T)
  if (length(idx) != 0) {
    listk <- rbind(listk, idx[1,])
    matr[idx[1,1], idx[1,2]] <- 0
    matr[idx[2,1], idx[2,2]] <- 0
  }
}
Here's how you might find the indices (ij) of the 4 largest elements in a 10-by-10 matrix m.
## Sample data
m <- matrix(runif(100), ncol=10)
## Extract the indices of the 4 largest elements
(ij <- which(m >= sort(m, decreasing=T)[4], arr.ind=TRUE))
# row col
# [1,] 2 1
# [2,] 5 1
# [3,] 6 2
# [4,] 3 10
## Use the indices to extract the values
m[ij]
# [1] 0.9985190 0.9703268 0.9836373 0.9914510
Edit:
For large matrices, performing a partial sort will be a faster way to find the 10,000th largest element:
v <- runif(1e7)
system.time(a <- sort(v, decreasing=TRUE)[10000])
# user system elapsed
# 4.35 0.03 4.38
system.time(b <- -sort(-v, partial=10000)[10000])
# user system elapsed
# 0.60 0.09 0.69
a==b
# [1] TRUE
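Putting the two ideas together for the matrix case, here's a minimal sketch (the helper name top_n_idx is made up; ties at the threshold can return a few extra indices, which the last line trims):
top_n_idx <- function(m, n) {
  thresh <- -sort(-m, partial = n)[n]       # nth largest value via partial sort
  ij <- which(m >= thresh, arr.ind = TRUE)  # candidate indices (may include ties)
  # keep at most n, ordered by decreasing value
  ij[order(m[ij], decreasing = TRUE)[seq_len(min(n, nrow(ij)))], , drop = FALSE]
}
ij <- top_n_idx(m, 4)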
I like @JoshO'Brien's answer; the use of partial sorting is great! Here's an Rcpp solution (I'm not a strong C++ programmer, so there are probably bone-headed errors; corrections welcome... how would I template this in Rcpp, to handle different types of input vector?)
I start by including the appropriate headers and using namespaces for convenience
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
Then arrange to expose my C++ function to R
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
and define some variables, most importantly a priority_queue that holds the numeric value and its index as a pair. The queue is ordered so the smallest value is at the 'top', with 'smallest' determined by the standard pair<> comparator.
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
Now I'll walk through the input data, adding an element to the queue if either (a) I don't yet have enough values or (b) the current value is larger than the smallest value in the queue. In the latter case, I pop off the smallest value and insert its replacement. In this way the priority queue always contains the n largest elements seen so far.
for (int i = 0; i != v.size(); ++i) {
    if (pq.size() < n)
        pq.push(Elt(v[i], i));
    else {
        Elt elt = Elt(v[i], i);
        if (pq.top() < elt) {
            pq.pop();
            pq.push(elt);
        }
    }
}
And finally I pop the indices from the priority queue into the return vector, remembering to translate to 1-based R coordinates.
result.reserve(pq.size());
while (!pq.empty()) {
    result.push_back(pq.top().second + 1);
    pq.pop();
}
and return the result to R
return wrap(result);
This has nice memory use (the priority queue and return vector are both small relative to the original data) and is fast
> library(Rcpp); sourceCpp("top_i_pq.cpp"); z <- runif(12000 * 12000)
> system.time(top_i_pq(z, 10000))
user system elapsed
0.992 0.000 0.998
Problems with this code include:
The default comparator greater<Elt> works so that, in the case of a tie spanning the value of the nth element, the last, rather than the first, duplicate is retained.
NA values (and other non-finite values?) may not be handled correctly; I haven't verified this. (A minimal R-level workaround is sketched after the full code below.)
The function only works for NumericVector input, but the logic is appropriate for any R data type for which an appropriate ordering relationship is defined.
Problems 1 and 2 can likely be dealt with by writing an appropriate comparator; maybe for 2 this is already implemented in Rcpp? I don't know how to leverage C++ language features and the Rcpp design to avoid re-implementing the function for each data type I want to support.
Here's the full code:
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
    typedef pair<double, int> Elt;
    priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
    vector<int> result;

    for (int i = 0; i != v.size(); ++i) {
        if (pq.size() < n)
            pq.push(Elt(v[i], i));
        else {
            Elt elt = Elt(v[i], i);
            if (pq.top() < elt) {
                pq.pop();
                pq.push(elt);
            }
        }
    }

    result.reserve(pq.size());
    while (!pq.empty()) {
        result.push_back(pq.top().second + 1);
        pq.pop();
    }

    return wrap(result);
}
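In the meantime, a minimal R-level workaround for problem 2 is to drop NAs before calling and map positions back (a sketch only; idx_ok is a made-up name, and z stands for any input that might contain NAs):
idx_ok <- which(!is.na(z))                 # positions of the non-NA values
top <- idx_ok[top_i_pq(z[idx_ok], 10000)]  # translate back to positions in z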
A bit late to the party, but I came up with this, which avoids the sort.
Say you want the top 10k elements from your 12k x 12k matrix. The idea is to "clip" the data to the elements corresponding to a quantile of that size. Note that because quantile() interpolates between order statistics, the result may contain slightly more or fewer than n elements.
find_n_top_elements <- function(x, n) {
  # set the quantile to correspond to the n top elements
  quant <- n / (dim(x)[1] * dim(x)[2])
  # select the cutpoint to get the quantile above quant
  lvl <- quantile(x, probs = 1.0 - quant)
  # select (and return) the elements above the cutpoint
  x[x > lvl[[1]]]
}
# create a 12k x 12k matrix (1.1 GB!)
n <- 12000
x <- matrix( runif(n*n), ncol=n)
system.time( res <- find_n_top_elements( x, 10e3 ) )
Resulting in
system.time( res <- find_n_top_elements( x, 10e3 ) )
user system elapsed
3.47 0.42 3.89
For comparison, just sorting x on my system takes
system.time(sort(x))
user system elapsed
30.69 0.21 31.33
A matrix in R is just a vector with dimensions, so you can order it as a vector and convert the linear positions back to row/column indices:
mat <- matrix(sample(1:5000, 10000, rep = TRUE), 100, 100)
mat.od <- order(mat, decreasing = TRUE)
mat.od.arr <- cbind(mat.od %% nrow(mat), mat.od %/% nrow(mat) + 1)
mat.od.arr[, 2][mat.od.arr[, 1] == 0] <- mat.od.arr[, 2][mat.od.arr[, 1] == 0] - 1
mat.od.arr[, 1][mat.od.arr[, 1] == 0] <- nrow(mat)
head(mat.od.arr)
# [,1] [,2]
# [1,] 58 5
# [2,] 59 72
# [3,] 38 22
# [4,] 23 10
# [5,] 38 14
# [6,] 90 15
mat[58, 5]
# [1] 5000
mat[59, 72]
# [1] 5000
mat[38, 22]
# [1] 4999
mat[23, 10]
# [1] 4998
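For what it's worth, base R's arrayInd() does this linear-to-row/column conversion in one call:
mat.od.arr2 <- arrayInd(mat.od, dim(mat))
head(mat.od.arr2)  # same indices as head(mat.od.arr) above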
Is there a way in Rcpp to return an R function with some pre-computed values that are only computed on the first function call? Consider the following R code:
1: func_generator<-function(X) {
2: X_tot<-sum(X)
3: function(b_vec) { (X_tot*b_vec) }
4: }
5: myfunc<-func_generator(c(3,4,5))
6: myfunc(1:2)
7: myfunc(5:6)
8: myfunc2<-func_generator(c(10,11,12,13))
...
Can this be programmed in Rcpp? In practice, assume that something more computationally intensive is done in place of line 2.
To add context: given a vector X and a scalar b, there is some likelihood function f(b|X), which can be re-expressed as f(b, s(X)) for some sufficient statistic s(X) that is a function only of X and involves some computation. This is part of a computationally intensive computer experiment, with many vectors X (many likelihoods) and many separate calls to f(bvec|X) for each likelihood, so I'd rather compute s(X) once for each likelihood and save it in some fashion than re-compute it many times. I started by simply programming f(bvec, X) to evaluate f(b|X) at the points bvec = (b_1, ..., b_n), but this has extra overhead, since the function computes s(X) on every call. I'd like to compute s(X) just once.
Any suggestions to accomplish this task efficiently in Rcpp would be appreciated (whether via returning a function; or via storing intermediate calculations in some other fashion).
One simple way to store intermediate results would be a static variable at function level:
// [[Rcpp::plugins(cpp11)]]
#include <thread>
#include <chrono>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector foo(Rcpp::NumericVector X, Rcpp::NumericVector b, bool useCache = true) {
    static double cache;
    static bool initialized{false};
    if (!(useCache && initialized)) {
        // sleep to simulate actual work
        std::this_thread::sleep_for (std::chrono::seconds(1));
        cache = Rcpp::sum(X);
        initialized = true;
    }
    return cache * b;
}
/*** R
X <- 1:10
b <- 10:20
system.time(r1 <- foo(X, b))
system.time(r2 <- foo(X, b))
all.equal(r1, r2)
system.time(r3 <- foo(X, b, FALSE))
all.equal(r1, r3)
*/
Output:
> system.time(r1 <- foo(X, b))
user system elapsed
0 0 1
> system.time(r2 <- foo(X, b))
user system elapsed
0.002 0.000 0.002
> all.equal(r1, r2)
[1] TRUE
> system.time(r3 <- foo(X, b, FALSE))
user system elapsed
0 0 1
> all.equal(r1, r3)
[1] TRUE
When the cache is used in the second function call, the result is computed almost instantaneously.
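A caveat worth making explicit: the static cache is not keyed on X, so with useCache = TRUE a changed X silently reuses the old sum. A small illustration with the objects from above:
r4 <- foo(X + 1, b)  # returns instantly, but still uses the sum cached for X
all.equal(r1, r4)    # TRUE, even though X changed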
This approach is efficient if you can loop over the different b within a loop over the different X. If this restriction does not work for you, then you could use the memoise package at the R level to efficiently store the output of your expensive function for arbitrary input:
// [[Rcpp::plugins(cpp11)]]
#include <thread>
#include <chrono>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector foo(double total, Rcpp::NumericVector b) {
    return total * b;
}
// [[Rcpp::export]]
double bar(Rcpp::NumericVector X) {
    // sleep to simulate actual work
    std::this_thread::sleep_for (std::chrono::seconds(1));
    return Rcpp::sum(X);
}
/*** R
X1 <- 1:10
b1 <- 10:20
X2 <- 10:1
b2 <- 20:10
library(memoise)
bar2 <- memoise(bar)
system.time(r11 <- foo(bar2(X1), b1))
system.time(r21 <- foo(bar2(X2), b2))
system.time(r12 <- foo(bar2(X1), b1))
system.time(r22 <- foo(bar2(X2), b2))
all.equal(r11, r12)
all.equal(r21, r22)
*/
Output:
> system.time(r11 <- foo(bar2(X1), b1))
user system elapsed
0.001 0.000 1.001
> system.time(r21 <- foo(bar2(X2), b2))
user system elapsed
0.033 0.000 1.033
> system.time(r12 <- foo(bar2(X1), b1))
user system elapsed
0 0 0
> system.time(r22 <- foo(bar2(X2), b2))
user system elapsed
0 0 0
> all.equal(r11, r12)
[1] TRUE
> all.equal(r21, r22)
[1] TRUE
As an alternative you could also use these two functions as building blocks for your function generator:
func_generator <- function(X) {
  X_tot <- bar(X)
  function(b_vec) { foo(X_tot, b_vec) }
}
myfunc <- func_generator(c(3,4,5))
myfunc2 <- func_generator(c(10,11,12,13))
myfunc(1:2)
myfunc(5:6)
myfunc2(1:2)
myfunc2(5:6)
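For these inputs the cached statistics are X_tot = sum(c(3, 4, 5)) = 12 and sum(c(10, 11, 12, 13)) = 46, so the calls above should return, for example:
myfunc(1:2)
# [1] 12 24
myfunc2(1:2)
# [1] 46 92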
So keep the numerically expensive work in C++, but keep it simple. The functional aspects can then be added using R.
I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")
The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary"). It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64-bit unsigned integers for speed. The Jaccard distance of two vectors x and y equals the number of set bits in x ^ y divided by the number of set bits in x | y, where ^ is the XOR operator and | is OR. A Hamming weight calculation is used to count the set bits whenever the result of the XOR or the OR is non-zero.
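As a quick sanity check of that formula in plain R (toy vectors chosen for illustration):
x <- c(1, 0, 1, 1, 0)
y <- c(1, 1, 0, 1, 0)
sum(xor(x, y)) / sum(x | y)           # 2 / 4 = 0.5
dist(rbind(x, y), method = "binary")  # also 0.5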
I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary") for a few random datasets.
On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
bDist.cpp
#include <Rcpp.h>
using namespace Rcpp;
//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
const uint64_t m1 = 0x5555555555555555; //binary: 0101...
const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
int countBits(uint64_t x) {
    x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
    x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits
    return (x * h01) >> 56;         //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
    uint64_t* x = (uint64_t*)RAW(rv);
    return(countBits(*x));
}
// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
    int nr(mat.nrow()), nc(mat.ncol());
    int nw = nr / 8;
    NumericVector res(nc * (nc - 1) / 2);
    // Access the raw data as unsigned 64 bit integers
    uint64_t* data = (uint64_t*)RAW(mat);
    uint64_t a(0);
    // Work through each possible combination of columns (rows in the original integer matrix)
    for (int i = 0; i < nc - 1; i++) {
        for (int j = i + 1; j < nc; j++) {
            uint64_t sx = 0;
            uint64_t so = 0;
            // Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
            for (int k = 0; k < nw; k++) {
                uint64_t o = data[nw * i + k] | data[nw * j + k];
                // If (x | y) == 0 then (x ^ y) will also be 0
                if (o) {
                    // Use Hamming weight method to calculate number of set bits
                    so = so + countBits(o);
                    uint64_t x = data[nw * i + k] ^ data[nw * j + k];
                    if (x) {
                        sx = sx + countBits(x);
                    }
                }
            }
            res(a++) = (double)sx / so;
        }
    }
    return (res);
}
R source
library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
  packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}

as.PackedMatrix <- function(x, packWidth = 64L) {
  UseMethod("as.PackedMatrix")
}

# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
  # inherits() rather than class(x) %in% ..., since class(x) can have length > 1
  stopifnot(packWidth %% 8 == 0, inherits(x, c("matrix", "Matrix")))
  storage.mode(x) <- "raw"
  if (ncol(x) %% packWidth != 0) {
    x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
  }
  out <- packBits(t(x))
  dim(out) <- c(ncol(x) %/% 8, nrow(x))
  class(out) <- "PackedMatrix"
  out
}

# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
  out <- rawToBits(x)
  dim(out) <- c(nrow(x) * 8L, ncol(x))
  storage.mode(out) <- "integer"
  t(out)
}
# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
  x <- replicate(nObs, {
    y <- integer(nVariables)
    y[sample(nVariables, sample(maxBits, 1))] <- 1L
    if (packed) {
      packRow(y, 64L)
    } else {
      y
    }
  })
  if (packed) {
    class(x) <- "PackedMatrix"
    x
  } else {
    t(x)
  }
}
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
  if (missing(file) && !missing(text)) {
    file <- textConnection(text)
    on.exit(close(file))
  }
  if (is.character(file)) {
    file <- file(file, "rt")
    on.exit(close(file))
  }
  if (!inherits(file, "connection"))
    stop("'file' must be a character string or connection")
  if (!isOpen(file, "rt")) {
    open(file, "rt")
    on.exit(close(file))
  }
  lst <- list()
  i <- 1
  while (length(line <- readLines(file, n = 1)) > 0) {
    lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
    i <- i + 1
  }
  out <- do.call("cbind", lst)
  class(out) <- "PackedMatrix"
  out
}
# Wrapper for the C++ code which computes the distances and
# packages them up as a 'dist' object
binaryDist <- function(x) {
  if (!inherits(x, "PackedMatrix")) {
    x <- as.PackedMatrix(x)
  }
  dst <- bDist(x)
  attr(dst, "Size") <- ncol(x)
  attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
  attr(dst, "method") <- "binary"
  attr(dst, "call") <- match.call()
  class(dst) <- "dist"
  dst
}
x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)
system.time(bd <- binaryDist(x))
From original answer:
Other things to consider would be prefiltering comparisons between rows with a single 1, since the distance will be either 0 (for duplicates) or 1 (for any other possibility).
A couple of relatively straightforward options that might be faster without needing much code are the vegdist function from the vegan package and the Dist function from the amap package. The latter will probably only be quicker if you have multiple cores and can take advantage of its support for parallelisation.
The reason this takes so long to compute is that the call to dist is computing and storing more than 760 million pairwise distances. If your data is stored sparsely, this will take a long time and a huge amount of storage. If your data is not stored sparsely, then each distance computation requires at least 14,000 operations, for a total operation count exceeding 10 trillion!
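The arithmetic, for reference:
choose(39000, 2)           # 760480500 pairwise distances
choose(39000, 2) * 14000   # about 1.06e13 operations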
An approach that will be much quicker is k-means clustering, since it doesn't require pre-computing a distance matrix; at each iteration you will need only 39000*k distance calculations, where k is the number of clusters. To get pairwise distances that are similar to the Jaccard index (0 if identical, 1 if no indices coincide, in between if some but not all indices coincide), you could divide each row x by sqrt(2*sum(x^2)). For instance, if you had the following input matrix:
(mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 0, 1, 1)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 0 0 0 0
# [2,] 0 0 0 1 1
the normalized version would be (assuming binary values only in the matrix; if this were not the case you would use rowSums(mat^2)):
(mat.norm <- mat / sqrt(2*rowSums(mat)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7071068 0 0 0.0 0.0
# [2,] 0.0000000 0 0 0.5 0.5
These two observations (which have no indices in common) have Euclidean distance 1, coinciding with the Jaccard distance for this case.
dist(mat.norm, "euclidean")
# 1
# 2 1
Additionally, identical observations will clearly have Euclidean distance 0, again corresponding to the Jaccard distance.
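Here is a minimal sketch of that route (centers = 100 is an arbitrary choice; it assumes inputMatrix fits in memory as a dense matrix and has no all-zero rows, which would produce NaN after scaling):
inputMatrix.norm <- inputMatrix / sqrt(2 * rowSums(inputMatrix))  # binary data: sum(x^2) == sum(x)
km <- kmeans(inputMatrix.norm, centers = 100, iter.max = 50)
table(km$cluster)  # cluster sizes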
Do you have duplicate rows? There is no need to compute their distances twice.
All rows with a single 1 will be 100% different from all rows with a single 1 in a different place.
Thus, it does not make sense to run clustering on such data. The output is rather predictable and boils down to finding where the 1 is.
Try restricting your data set to only those objects that have more than one 1. Unless you can get interesting results on these, there is no need to continue further. Binary data has too little information.
Inside a function of type SEXP myfun(SEXP n, SEXP d) I allocate an (n, d)-matrix which contains the result of a function computed in C:
SEXP res = PROTECT(allocMatrix(REALSXP, n, d));
I would like to 'fill' this matrix (it can then be returned by myfun) and thus would like to convert it to an (n, d)-matrix in C (or a pointer, or similar). How can this be done?
If res was a vector, I could do:
double *res_ = REAL(res);
and then walk through res_ with a for loop. However, when I use that for the above matrix, it loses its dimensions, i.e., indexing via res_[i][j] to compute the result fails. Of course one could work with a vector and keep track of the row/column indices oneself, but ideally I would like to simply write res_[i][j]. Is this doable without a significant amount of extra code?
The matrix is going to be stored as a vector internally (in column-major order), which means that [i][j] indexing is meaningless in C. Here is an example from Writing R Extensions of how you would do this:
#include <R.h>
#include <Rinternals.h>

SEXP out(SEXP x, SEXP y)
{
    R_len_t i, j, nx = length(x), ny = length(y);
    double tmp, *rx = REAL(x), *ry = REAL(y), *rans;
    SEXP ans;

    PROTECT(ans = allocMatrix(REALSXP, nx, ny));
    rans = REAL(ans);
    for(i = 0; i < nx; i++) {
        tmp = rx[i];
        for(j = 0; j < ny; j++)
            rans[i + nx*j] = tmp * ry[j];  /* column-major: element (i, j) is at offset i + nx*j */
    }
    UNPROTECT(1);
    return(ans);
}
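A usage sketch, assuming the snippet is saved as out.c and compiled with R CMD SHLIB out.c (the shared-library extension is platform-dependent):
dyn.load("out.so")  # "out.dll" on Windows
.Call("out", as.numeric(1:3), as.numeric(1:4))
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    2    4    6    8
# [3,]    3    6    9   12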
Speed is of course very important, but I also like code clarity. The solution by BrodieG could be written as a much shorter RcppArmadillo function -- it really is just a single outer product:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat outCpp(arma::colvec x, arma::rowvec y) {
    return x * y;
}
If we test this against BrodieG's version (renamed to outC()) we get this:
R> sourceCpp("/tmp/marius.cpp")
R> library(rbenchmark)
R> a <- as.numeric(1:3)
R> b <- as.numeric(1:4)
R> outC(a, b)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
R> outCpp(a, b)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
R> benchmark(outC(a,b), outCpp(a,b), replications=1e5)[,1:4]
test replications elapsed relative
1 outC(a, b) 100000 0.382 1.000
2 outCpp(a, b) 100000 0.484 1.267
R>
So running 100,000 replications takes 382 vs. 484 msec. That means the difference per call is about one millionth of a second. I think I'll take that in exchange for a single line of code that is easier to read and maintain.
I have a data frame in R that consists of two (numeric) columns, say A and B. I would like to construct the column consisting of the elements
A1+B1
(A1+B1)*A2+B2
((A1+B1)*A2+B2)*A3+B3
...
This probably is an easy one-liner, but I don't see it.
Edit: I deleted the word "vectorized" from the title, since I'm basically just interested in any elegant solution (the dumb ones I can do myself). In F#, which I'm more familiar with, this would be something like the following (assuming the elements were in a list as tuples, which would be more idiomatic):
ABlist |> List.fold (fun acc (a,b) -> acc*a+b) 1
Which is still very short and clear. I'm dragging this up because I'm an R noob and unfamiliar with the language, but I have read somewhere that it's a functional language, so I would guess a solution in terms of a fold over a data frame exists?
This is a different answer, driven by the failure of Bram's effort at using Reduce. It builds a list of the (A, B) pairs (here A <- c(1, 2, 3, 4, 5); B <- c(3, 2, 1, 4, 5), matching the Rcpp example below) and then sets the initial value for the accumulator to 1 so that the first multiplication doesn't get zeroed out:
abList <- mapply(c, A,B, SIMPLIFY=FALSE) # keeps as a list
Reduce( function(acc,x) {acc*x[1]+x[2]},
abList,init=c(1,0),
accumulate=TRUE)[-1] # removes the initialization value
#--------
[[1]]
[1] 4 3
[[2]]
[1] 10 8
[[3]]
[1] 31 25
[[4]]
[1] 128 104
[[5]]
[1] 645 525
It might take some further work with sapply/lapply(..., "[", 1) to pull out just the accumulator values:
> sapply( Reduce( function(acc,x) {acc*x[1]+x[2]},
+ abList,init=c(1,0),
+ accumulate=TRUE)[-1], "[", 1)
[1] 4 10 31 128 645
OK, turns out I was too lazy; I figured it out myself (note: before this there was already a good answer using Rcpp, but I can't use that at work). It's basically just a translation to R of what I wrote in my edit about how I would do this in F#:
a <- c(1,2,3)
b <- c(4,5,6)
abList <- list(a,b)
Reduce( function(acc,x) {acc*x[[1]]+x[[2]]},
abList,
accumulate=TRUE)
Does the trick. Edit: as per the comment below, it actually doesn't do the trick. If one builds abList by
abList <- apply(rbind(a,b),2,as.pairlist)
and then folds by:
Reduce(function(acc,x) {(acc*x[[1]])+x[[2]]},abList,1,accumulate=TRUE)
one gets the right answer (with a 1 prepended, because that's the initial value for the accumulator).
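With a <- c(1, 2, 3) and b <- c(4, 5, 6) as above, that gives:
# [1]  1  5 15 51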
This is relatively straightforward in Rcpp, which won't have the performance problems you would see if you tried to implement this with loops in R.
library(Rcpp)
sum.prod <- cppFunction(
"NumericVector sum_prod(NumericVector x, NumericVector y) {
NumericVector result(x.size());
result[0] = x[0] + y[0];
for (int i=1; i < x.size(); ++i) result[i] = result[i-1]*x[i] + y[i];
return result;
}")
sum.prod(c(1, 2, 3, 4, 5), c(3, 2, 1, 4, 5))
# [1] 4 10 31 128 645
I've found Rcpp to be the simplest way to speed up hard-to-vectorize computations.
X = 1; for ( i in seq(length(A) ) ) { X= B[i]+A[i]*X; print(X) }
[1] 4
[1] 10
[1] 31
[1] 128
[1] 645
If you want to accumulate rather than report:
X = 1; for ( i in seq(length(A) ) ) { X[1+i]= B[i]+A[i]*X[i] }; X[-1]
#[1] 4 10 31 128 645
It will be ploddingly slow compared to the Rcpp solution, but if you need to do the compilation step on the fly, it's only when the lengths exceed 1000 or so that you might even notice the difference:
> A <- sample(1:10000, 1000); B <- sample(1:10000, 1000)
> system.time( {X = 1; for ( i in seq(length(A) ) ) { X[1+i]= B[i]+A[i]*X[i] }; X[-1]})
user system elapsed
0.014 0.002 0.017
> library(Rcpp)
> system.time( {sum.prod <- cppFunction(
+ "NumericVector sum_prod(NumericVector x, NumericVector y) {
+ NumericVector result(x.size());
+ result[0] = x[0] + y[0];
+ for (int i=1; i < x.size(); ++i) result[i] = result[i-1]*x[i] + y[i];
+ return result;
+ }")
+ sum.prod(A,B) } )
user system elapsed
0.012 0.002 0.014