I'm looking for a simple way to build a 3D array using Rcpp and make that array accessible in R. I'm still very new to Rcpp and C++ coding in general, so any help would be greatly appreciated.
Here is my source code:
#include <Rcpp.h>
using namespace Rcpp;
int d[5][5][5] = {0};
// [[Rcpp::export]]
int fit(){
    for (int X = 0; X < 5; X++){
        for (int Y = 0; Y < 5; Y++){
            for (int Z = 0; Z < 5; Z++){
                d[X][Y][Z] = X + Y + Z;
            }
        }
    }
    return d;
}
The idea is that I could assign the result of fit() to an object in R (say D) and then index into locations in that array.
I realize I could just make "fit" a function of x, y, and z; but for my purposes, having the array in R works much faster.
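In other words, the intended R-side usage would be something like the following sketch (it assumes fit() eventually returns the full 5x5x5 array, which the code above does not yet do):
D <- fit()        # would hold the 5x5x5 array
# the C++ loops are 0-based while R indexing is 1-based, so e.g.
D[2, 3, 4]        # would contain (2-1) + (3-1) + (4-1) = 6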
Again, any help would be super appreciated.
Edit
Thank you for the help Dirk,
One thing I'm still struggling with is running the loops. When I run this loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector rcpp_matrix(){
    IntegerVector v = IntegerVector(Dimension(2,2));
    for (int i = 0; i < 2; i++){
        for (int j = 0; j < 2; j++){
            v(i,j) = (i + 1) * (j + 1);
        }
    }
    // Return the vector to R
    return v;
}
It works as intended. However, when I try to span out to three dimensions, like so:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector rcpp_matrix(){
    IntegerVector v = IntegerVector(Dimension(2,2,2));
    for (int i = 0; i < 2; i++){
        for (int j = 0; j < 2; j++){
            for (int k = 0; k < 2; k++){
                v(i,j,k) = (i + 1) * (j + 1) * (k + 1);
            }
        }
    }
    // Return the vector to R
    return v;
}
I get errors:
file10c2d06d0b.cpp: In function ‘Rcpp::IntegerVector rcpp_matrix()’:
file10c2d06d0b.cpp:14:16: error: no match for call to ‘(Rcpp::IntegerVector {aka Rcpp::Vector<13>}) (int&, int&, int&)’
14 | v(i,j,k) = (i + 1) * (j + 1) * (k + 1);
| ^
Is there some other aspect I'm missing?
Thank you again.
Vectors are key in R, and a matrix is just a vector with a 2-d dimension attribute:
> v <- 1:12
> dim(v) <- c(3,4)
> v
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
>
So we can also do 3-d:
> dim(v) <- c(2,3,2)
> v
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
>
and so on. The exact same thing works in Rcpp, as seen e.g. in this unit-test snippet:
// [[Rcpp::export]]
IntegerVector integer_dimension_ctor_3(){
return IntegerVector( Dimension( 2, 3, 4) ) ;
}
which we can test quickly:
> Rcpp::cppFunction("IntegerVector ivec3() { return IntegerVector(Dimension(2, 3, 4));}")
> ivec3()
, , 1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
, , 2
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
, , 3
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
, , 4
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
>
All that said, I recommend you look into Armadillo 'Cubes' and what you can do from RcppArmadillo -- likely more fully featured.
Edit: Here is a much simpler Armadillo variant of your approach, using a cube<int>, aka icube:
Code
#include <RcppArmadillo/Lightest>   // new 'lighter' header
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::icube make3dvec(){
    arma::icube v(2,2,2);
    for (int x = 0; x < 2; x++){
        for (int y = 0; y < 2; y++){
            for (int z = 0; z < 2; z++){
                v(x,y,z) = (x + 1) * (y + 1) * (z + 1);
            }
        }
    }
    return v;
}
/*** R
make3dvec()
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/75036466/answer.cpp")
> make3dvec()
, , 1
[,1] [,2]
[1,] 1 2
[2,] 2 4
, , 2
[,1] [,2]
[1,] 2 4
[2,] 4 8
>
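Since RcppArmadillo returns the cube to R as an ordinary 3-d array, it can be indexed directly; a quick check against the output above:
D <- make3dvec()
dim(D)         # 2 2 2
D[1, 2, 2]     # 1 * 2 * 2 = 4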
Okay. I found a solution, albeit a little convoluted, to the specific issue I was having.
sourceCpp(code = "
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fit(){
    int d[2][2][2] = {0};
    IntegerVector V = IntegerVector(Dimension(8));
    int Count = 0;
    for (int X = 0; X < 2; X++){
        for (int Y = 0; Y < 2; Y++){
            for (int Z = 0; Z < 2; Z++){
                d[X][Y][Z] = (X + 1) * (Y + 1) * (Z + 1);
                V(Count) = d[X][Y][Z];
                Count = Count + 1;
            }
        }
    }
    return V;
}
")
This creates a vector from fit(), which I can then reshape in R into the array I want:
array(fit(), dim = c(2,2,2))
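A variant that skips the separate array() call would be to give the IntegerVector its 3-d dim up front (as in Dirk's Dimension constructor) and fill it through the column-major linear index; fit3d below is a hypothetical name for such a sketch:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fit3d(){
    // vector carrying a 2x2x2 dim attribute; element (i,j,k) sits at
    // the column-major linear index i + 2*j + 4*k
    IntegerVector v(Dimension(2, 2, 2));
    for (int k = 0; k < 2; k++){
        for (int j = 0; j < 2; j++){
            for (int i = 0; i < 2; i++){
                v[i + 2 * j + 4 * k] = (i + 1) * (j + 1) * (k + 1);
            }
        }
    }
    return v;   // already a 2x2x2 array in R
}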
Thank you again for the help. This is speeding up my original model quite a bit.
In R, we can use Matrix::nearPD() to calculate the nearest positive definite matrix.
I have written an Rcpp version, nearPD_c, myself as follows (C++ file):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
vec rep_each(const vec& x, const int each) {
std::size_t n=x.n_elem;
std::size_t n_out=n*each;
vec res(n_out);
auto begin = res.begin();
for (std::size_t i = 0, ind = 0; i < n; ind += each, ++i) {
auto start = begin + ind;
auto end = start + each;
std::fill(start, end, x[i]);
}
return res;
}
mat mat_vec_same_len(mat mt1, vec v1){
//do not check the input...
int t=0;
for(int i=0;i<mt1.n_cols;i++){
for(int j=0;j<mt1.n_rows;j++){
mt1(j,i)=mt1(j,i)*v1(t);
t++;
}
}
return(mt1);
}
// [[Rcpp::export]]
vec pmax_c(double a, vec b){
vec c(b.n_elem);
for(int i=0;i<b.n_elem;i++){
c(i)=std::max(a,b(i));
}
return c;
}
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
mat nearPD_c(mat x,
bool corr = false, bool keepDiag = false
,bool do2eigen = true // if TRUE do a sfsmisc::posdefify() eigen step
,bool doSym = false // symmetrize after tcrossprod()
, bool doDykstra = true // do use Dykstra's correction
,bool only_values = false // if TRUE simply return lambda[j].
, double eig_tol = 1e-6 // defines relative positiveness of eigenvalues compared to largest
, double conv_tol = 1e-7 // convergence tolerance for algorithm
,double posd_tol = 1e-8 // tolerance for enforcing positive definiteness
, int maxit = 100 // maximum number of iterations allowed
, bool trace = false // set to TRUE (or 1 ..) to trace iterations
){
int n = x.n_cols;
vec diagX0;
if(keepDiag) {
diagX0 = x.diag();
}
mat D_S;
if(doDykstra) {
//D_S should be like x, but filled with '0' -- following also works for 'Matrix':
D_S = x;
D_S.zeros(); //set all element
}
mat X = x;
int iter = 0 ;
bool converged = false;
double conv = R_PosInf;
mat Y;
mat R;
mat B;
while (iter < maxit && !converged) {
Y = X;
if(doDykstra){
R = Y - D_S;
}
vec d;
mat Q;
if(doDykstra){
B=R;
}else{
B=Y;
}
eig_sym(d, Q, B);
// create mask from relative positive eigenvalues
uvec p= (d>eig_tol*d[1]);
if(sum(p)==0){
//stop("Matrix seems negative semi-definite")
break;
}
// use p mask to only compute 'positive' part
uvec p_indexes(sum(p));
int p_i_i=0;
for(int i=0;i<p.n_elem;i++){
if(p(i)){
p_indexes(p_i_i)=i;
p_i_i++;
}
}
Q=Q.cols(p_indexes);
X=mat_vec_same_len(Q,rep_each(d.elem(p_indexes),Q.n_rows))*Q.t();
// update Dykstra's correction D_S = \Delta S_k
if(doDykstra){
D_S = X - R;
}
// project onto symmetric and possibly 'given diag' matrices:
if(doSym){
X = (X + X.t())/2;
}
if(corr){
X.diag().ones(); //set diagonals to ones
}
else if(keepDiag){
X.diag() = diagX0;
}
conv = norm(Y-X,"inf")/norm(Y,"inf");
iter = iter + 1;
if (trace){
// cat(sprintf("iter %3d : #{p}=%d, ||Y-X|| / ||Y||= %11g\n",
// iter, sum(p), conv))
Rcpp::Rcout << "iter " << iter <<" : #{p}= "<< sum(p) << std::endl;
}
converged = (conv <= conv_tol);
// force symmetry is *NEVER* needed, we have symmetric X here!
//X <- (X + t(X))/2
if(do2eigen || only_values) {
// begin from posdefify(sfsmisc)
eig_sym(d, Q, X);
double Eps = posd_tol * std::abs(d[1]);
// if (d[n] < Eps) { //should be n-1?
if (d(n-1) < Eps) {
uvec d_comp = d < Eps;
for(int i=0;i<sum(d_comp);i++){
if(d_comp(i)){
d(i)=Eps;
}
}
// d[d < Eps] = Eps; //how to assign values like this?
if(!only_values) {
vec o_diag = X.diag();
X = Q * (d *Q.t());
vec D = sqrt(pmax_c(Eps, o_diag)/X.diag());
x=D * X * rep_each(D, n);
}
}
if(only_values) return(d);
// unneeded(?!): X <- (X + t(X))/2
if(corr) {
X.diag().ones(); //set diag as ones
}
else if(keepDiag){
X.diag()= diagX0;
}
} //end from posdefify(sfsmisc)
}
if(!converged){ //not converged
Rcpp::Rcout << "did not converge! " <<std::endl;
}
return X;
// return List::create(_["mat"] = X,_["eigenvalues"]=d,
//
// _["corr"] = corr, _["normF"] = norm(x-X, "fro"), _["iterations"] = iter,
// _["rel.tol"] = conv, _["converged"] = converged);
}
However, although nearPD and nearPD_c give similar results, they are not identical. For example (in R):
> mt0=matrix(c(0.5416, -0.0668 , -0.1538, -0.2435,
+ -0.0668 , 0.9836 , -0.0135 , -0.0195,
+ -0.1538 , -0.0135 , 0.0226 , 0.0334,
+ -0.2435, -0.0195 , 0.0334 , 0.0487),4,byrow = T)
> nearPD(mt0)$mat
4 x 4 Matrix of class "dpoMatrix"
[,1] [,2] [,3] [,4]
[1,] 0.55417390 -0.06540967 -0.14059121 -0.22075966
[2,] -0.06540967 0.98375373 -0.01203943 -0.01698557
[3,] -0.14059121 -0.01203943 0.03650733 0.05726836
[4,] -0.22075966 -0.01698557 0.05726836 0.08983952
> nearPD_c(mt0)
[,1] [,2] [,3] [,4]
[1,] 0.55417390 -0.06540967 -0.14059123 -0.22075967
[2,] -0.06540967 0.98375373 -0.01203944 -0.01698557
[3,] -0.14059123 -0.01203944 0.03650733 0.05726837
[4,] -0.22075967 -0.01698557 0.05726837 0.08983952
There are some differences in the 7th or 8th decimal place, which make nearPD(mt0) positive definite while nearPD_c(mt0) is not.
> chol(nearPD(mt0)$mat)
4 x 4 Matrix of class "Cholesky"
[,1] [,2] [,3] [,4]
[1,] 7.444286e-01 -8.786561e-02 -1.888579e-01 -2.965491e-01
[2,] . 9.879440e-01 -2.898297e-02 -4.356729e-02
[3,] . . 1.029821e-04 1.014128e-05
[4,] . . . 1.071201e-04
> chol(nearPD_c(mt0))
Error in chol.default(nearPD_c(mt0)) :
the leading minor of order 3 is not positive definite
I sense that there might be some rounding issue in Rcpp, but I couldn't identify it. Any insight into what goes wrong?
There is at least one logic error in your post-processing. In R we have:
e <- eigen(X, symmetric = TRUE)
d <- e$values
Eps <- posd.tol * abs(d[1])
if (d[n] < Eps) {
d[d < Eps] <- Eps
[...]
While you have:
eig_sym(d, Q, X);
double Eps = posd_tol * std::abs(d[1]);
// if (d[n] < Eps) { //should be n-1?
if (d(n-1) < Eps) {
uvec d_comp = d < Eps;
for(int i=0;i<sum(d_comp);i++){
if(d_comp(i)){
d(i)=Eps;
}
}
According to the Armadillo docs, eigenvalues are in ascending order, while they are in decreasing order in R. So R builds Eps based on the largest eigenvalue, while you use the second(!) smallest. Then R compares with the smallest eigenvalue, while you compare with the largest. Something like this should give the same results as R (untested):
eig_sym(d, Q, X);
double Eps = posd_tol * std::abs(d[n-1]);
if (d(0) < Eps) {
uvec d_comp = d < Eps;
for(int i=0;i<sum(d_comp);i++){
if(d_comp(i)){
d(i)=Eps;
}
}
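As an aside, the d[d < Eps] <- Eps assignment that the question's inline comment asks about has a direct Armadillo equivalent via element views, which could replace the manual loop (a sketch, untested):
// Armadillo equivalent of R's  d[d < Eps] <- Eps :
// select the elements below Eps and overwrite them in place
d.elem( arma::find(d < Eps) ).fill(Eps);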
BTW, you only need // [[Rcpp::export]] for functions that you want to call from R.
I have written some Rcpp code to compute element-wise matrix multiplication in R. But when I try to run this code, R stops working and exits. How can I correct this function?
Thanks in advance.
library(Rcpp)
func <- 'NumericMatrix mmult( NumericMatrix m , NumericMatrix v, bool byrow=true )
{
if( ! m.nrow() == v.nrow() ) stop("Non-conformable arrays") ;
if( ! m.ncol() == v.ncol() ) stop("Non-conformable arrays") ;
NumericMatrix out(m) ;
for (int i = 1; i <= m.nrow(); i++)
{
for (int j = 1; j <= m.ncol(); j++)
{
out(i,j)=m(i,j) * v(i,j) ;
}
}
return out ;
}'
cppFunction( func )
m1<-matrix(1:4,2,2)
m2<-m1
r1<-mmult(m1,m2)
r2<-m1*m2
The (at least to me) obvious choice is to use RcppArmadillo:
R> cppFunction("arma::mat matmult(arma::mat A, arma::mat B) { return A % B; }",
+ depends="RcppArmadillo")
R> m1 <- m2 <- matrix(1:4,2,2)
R> matmult(m1,m2)
[,1] [,2]
[1,] 1 9
[2,] 4 16
R>
as Armadillo is strongly typed and has an element-by-element multiplication operator (%), which is all the one-liner needs.
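If you would rather stay with plain Rcpp, a sketch along the same lines (mmult2 is a hypothetical name; Rcpp sugar multiplies element-wise over the underlying vectors, and the matrix shape is restored afterwards) might look like:
Rcpp::cppFunction('NumericMatrix mmult2(NumericMatrix m, NumericMatrix v) {
    if (m.nrow() != v.nrow() || m.ncol() != v.ncol()) stop("Non-conformable arrays");
    NumericVector out = NumericVector(m) * NumericVector(v);   // element-wise sugar product
    return NumericMatrix(m.nrow(), m.ncol(), out.begin());     // restore the dimensions
}')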
You have to keep in mind that C++ uses 0-indexed arrays. (See Why does the indexing start with zero in 'C'? and Why are zero-based arrays the norm?)
So you need to define your loop to run from 0 to m.nrow() - 1.
Try this:
func <- '
NumericMatrix mmult( NumericMatrix m , NumericMatrix v, bool byrow=true )
{
if( ! m.nrow() == v.nrow() ) stop("Non-conformable arrays") ;
if( ! m.ncol() == v.ncol() ) stop("Non-conformable arrays") ;
NumericMatrix out(m) ;
for (int i = 0; i < m.nrow(); i++)
{
for (int j = 0; j < m.ncol(); j++)
{
out(i,j)=m(i,j) * v(i,j) ;
}
}
return out ;
}
'
Then I get:
> mmult(m1,m2)
[,1] [,2]
[1,] 1 9
[2,] 4 16
> m1*m2
[,1] [,2]
[1,] 1 9
[2,] 4 16
I'm working on a program in R to calculate the Gabriel Graph for up to 1000 data points. I used a program I found online first (GabrielGraph based on Bhattacharya et al. 1981 lines 781-830).
Unfortunately it takes quite a bit of time to get the result, so I tried reprogramming it using Rcpp. For this I wrote a couple of small functions and a big one called edges, which calculates the edges of the Gabriel Graph. I'm also new to programming with Rcpp, so I probably made everything more complicated than necessary, but I didn't know how to do it any better.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double vecnorm(NumericVector x){
//to calculate the vectornorm sqrt(sum of (vector entries)^2)
double out;
out = sqrt(sum(pow(x,2.0)));
return out;
}
// [[Rcpp::export]]
NumericVector vektorzugriff(NumericMatrix xy,int i){
//to return a row of the Matrix xy
int col = xy.ncol();
NumericVector out(col);
for(int j=0; j<=col; j++){
out[j] = xy(i-1,j);
}
return out;
}
// [[Rcpp::export]]
IntegerVector vergl(NumericVector eins, NumericVector zwei){
//to see if two Vectors have any identical entries
IntegerVector out = match(eins, zwei);
return out;
}
// [[Rcpp::export]]
IntegerVector verglInt(int eins, NumericVector zwei){
NumericVector dummy = NumericVector::create( eins ) ;
IntegerVector out = match(dummy, zwei);
return out;
}
// [[Rcpp::export]]
NumericVector toVec(NumericVector excluded, int k){
//to append int k to a Vector excluded
NumericVector dummy = NumericVector::create( k ) ;
int len = excluded.size();
int len2 = dummy.size();
int i=0;
NumericVector out(len+len2);
while(i<len+len2){
if(i<len){
out[i]=excluded[i];
i++;
}
else{
out[i]=dummy[i-len];
i++;
}
}
return out;
}
// [[Rcpp::export]]
LogicalVector isNA(IntegerVector x) {
//to see which Vector Entries are NAs
int n = x.size();
LogicalVector out(n);
for (int i = 0; i < n; ++i) {
out[i] = IntegerVector::is_na(x[i]);
}
return out;
}
// [[Rcpp::export]]
NumericMatrix Gab(NumericMatrix Gabriel, NumericVector edges1, NumericVector edges2, int anz){
//to fill a Matrix with the Gabrieledges
for(int i=0; i<anz; i++) {
Gabriel(edges1[i]-1, edges2[i]-1) = 1 ;
Gabriel(edges2[i]-1, edges1[i]-1) = 1 ;
}
return Gabriel;
}
// [[Rcpp::export]]
NumericVector edges(NumericMatrix xy,NumericVector vertices,NumericVector excluded, int i){
//actual function to calculate the edges of the GabrielGraph
int npts = xy.nrow()+1;
double d1;
double d2;
double d3;
for(int r=i+1; r<npts; r++) {
// Skip vertices in excluded
if(!is_true(any(isNA(verglInt(r,excluded))))){
continue;}
d1 = vecnorm(vektorzugriff(xy,i) - vektorzugriff(xy,r));
for(int k=1; k<npts; k++) {
if((k!=r) && (k!=i)){
d2 = vecnorm(vektorzugriff(xy,i) - vektorzugriff(xy,k));
d3 = vecnorm(vektorzugriff(xy,r) - vektorzugriff(xy,k));
//Consider vertices that are not yet excluded
if(!is_true(any(isNA(verglInt(k,vertices[isNA(vergl(vertices,excluded))]))))){
//If d(x,z)^2 > d(x,y)^2 + d(y,z)^2 -> the edge does not belong to the GG
if( pow(d2,2.0) > pow(d1,2.0) + pow(d3,2.0) ) {
excluded = toVec(excluded,k);
}
}
if( pow(d1,2.0) > pow(d2,2.0) + pow(d3,2.0) ){
excluded = toVec(excluded,r);
break;
}
}
}
}
return excluded;
}
I used these Rcpp programs in this R program:
GabrielGraphMatrix <- function(X,Y,PlotIt=FALSE){
# Heuristic rejection Algorithm for Gabriel Graph Construction (Bhattacharya et al. 1981)
# Algorithm is ~ O(d n^2)
#loading Rcpp functions
library(Rcpp)
sourceCpp("... .cpp")
XY <- cbind(X,Y)
ndim <- ncol(XY)
npts <- nrow(XY)
edges1<- c()
edges2<- c()
for( i in 1:(npts-1) ) {
# Candidate set of Gabriel neighbors
vertices <- (i+1):npts
# Initialize list of vertices to be excluded from Ni
excluded <- edges(XY,vertices,vector(),i);
adj <- vertices[which(!match(vertices,excluded,nomatch=F)>0)]
if(length(adj) > 0) {
edges1=c(edges1,rep(i,length(adj)))
edges2=c(edges2,adj)
}
}
anz <- length(edges1)
Gabriel <- Gab(matrix(0, npts, npts),edges1,edges2,anz)
return(list(Gabriel=Gabriel,edges=cbind(edges1,edges2)))
}
For a sample of ten data points it worked fine, for example:
z <- 10
X <- runif(z)*100
Y <- runif(z)*100
GabrielGraphMatrix(X,Y)
returns
> GabrielGraphMatrix(X,Y)
$Gabriel
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 0 0 0 0 0 0 0 0
[2,] 1 0 0 1 0 0 1 0 0 0
[3,] 0 0 0 1 1 0 0 0 0 1
[4,] 0 1 1 0 0 0 0 0 0 0
[5,] 0 0 1 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 1 0 0
[7,] 0 1 0 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 1 0 0 1 1
[9,] 0 0 0 0 0 0 0 1 0 1
[10,] 0 0 1 0 0 0 0 1 1 0
$edges
edges1 edges2
[1,] 1 2
[2,] 2 4
[3,] 2 7
[4,] 3 4
[5,] 3 5
[6,] 3 10
[7,] 6 8
[8,] 8 9
[9,] 8 10
[10,] 9 10
But if I try to put in bigger data sets I get this error message:
Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'builtin'
I would be amazingly grateful if anybody had at least an idea of what I did wrong.
Just in case anybody has the same problem: mine was solved quite easily in the end. The mistake was in the function
// [[Rcpp::export]]
NumericVector vektorzugriff(NumericMatrix xy,int i){
//to return a row of the Matrix xy
int col = xy.ncol();
NumericVector out(col);
for(int j=0; j<=col; j++){
out[j] = xy(i-1,j);
}
return out;
}
The for-loop ran one iteration too far. It should have been for(int j=0; j<col; j++) instead of for(int j=0; j<=col; j++).
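For completeness, the corrected accessor with the fixed bound reads:
// [[Rcpp::export]]
NumericVector vektorzugriff(NumericMatrix xy, int i){
    // return row i (1-based, as passed from R) of the matrix xy
    int col = xy.ncol();
    NumericVector out(col);
    for(int j = 0; j < col; j++){   // strictly less than col
        out[j] = xy(i-1, j);
    }
    return out;
}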
I couldn't reproduce your exact error, but the code threw a variety of similar ones and often made R crash. Here are a couple of obvious problems.
In your C++ function Gab you have at least two problems:
You don't define the variable anz before you use it.
You are using round rather than square bracket to index Gabriel.
This
Gabriel(edges1[i]-1, edges2[i]-1)
should be
Gabriel[edges1[i]-1, edges2[i]-1]
In your R function GabrielGraphMatrix you are growing edges1 and edges2 in a loop. This means that they have to be reallocated in every iteration of the for loop. This will cause problems once you get above trivial loop lengths.
Instead, preallocate them as lists, then call unlist afterwards to get the vector you want.
# before the loop
edges1 <- vector("list", npts - 1)
edges2 <- vector("list", npts - 1)
# in the loop
if(length(adj) > 0) {
edges1[[i]] <- rep(i,length(adj))
edges2[[i]] <- adj
}
# after the loop
edges1 <- unlist(edges1)
edges2 <- unlist(edges2)
I have a big matrix and am interested in computing the correlation between the rows of the matrix. Since the cor function computes correlations between the columns of a matrix, I am transposing the matrix before calling cor. But since the matrix is big, transposing it is expensive and is slowing down my program. Is there a way to compute the correlations among the rows without having to take the transpose?
EDIT: Thanks for the responses; I thought I'd share some findings. My input matrix is 16 rows by 239766 columns and comes from a .mat file. I wrote C# code to do the same thing using the csmatio library; it looks like this:
foreach (var file in Directory.GetFiles(path, interictal_pattern))
{
var reader = new MatFileReader(file);
var mla = reader.Data[0] as MLStructure;
convert(mla.AllFields[0] as MLNumericArray<double>, data);
double sum = 0;
for (var i = 0; i < 16; i++)
{
for (var j = i + 1; j < 16; j++)
{
sum += cor(data, i, j);
}
}
var avg = sum / 120;
if (++count == 10)
{
var t2 = DateTime.Now;
var t = t2 - t1;
Console.WriteLine(t.TotalSeconds);
break;
}
}
static double[][] createArray(int rows, int cols)
{
var ans = new double[rows][];
for (var row = 0; row < rows; row++)
{
ans[row] = new double[cols];
}
return ans;
}
static void convert(MLNumericArray<double> mla, double[][] M)
{
var rows = M.Length;
var cols = M[0].Length;
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
M[i][j] = mla.Get(i, j);
}
static double cor(double[][] M, int i, int j)
{
var count = M[0].Length;
double sum1 = 0, sum2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
sum1 += M[i][ctr];
sum2 += M[j][ctr];
}
var mu1 = sum1 / count;
var mu2 = sum2 / count;
double numerator = 0, sumOfSquares1 = 0, sumOfSquares2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
var x = M[i][ctr] - mu1;
var y = M[j][ctr] - mu2;
numerator += x * y;
sumOfSquares1 += x * x;
sumOfSquares2 += y * y;
}
return numerator / Math.Sqrt(sumOfSquares1 * sumOfSquares2);
}
This gave a throughput of 22.22 s for 10 files, or 2.22 s/file.
Then I profiled my R code:
ptm=proc.time()
for(file in files)
{
i = i + 1;
mat = readMat(paste(path,file,sep=""))
a = t(mat[[1]][[1]])
C = cor(a)
correlations[i] = mean(C[lower.tri(C)])
}
print(proc.time()-ptm)
To my surprise, it's running faster than C#, giving a throughput of 5.7 s per 10 files, or 0.6 s/file (an improvement of almost 4x!). The bottleneck in C# is the methods inside the csmatio library that parse double values from the input stream.
And if I do not convert the csmatio classes into a double[][], the C# code runs extremely slowly (an order of magnitude slower, ~20-30 s/file).
Seeing that this problem arises from a data input issue whose details are not stated (and only hinted at in a comment), I will assume this is a comma-delimited file of unquoted numbers with the number of columns= Ncol. This does the transposition on input.
in.mat <- matrix( scan("path/to/the_file/fil.txt", what =numeric(0), sep=","),
ncol=Ncol, byrow=TRUE)
cor(in.mat)
One dirty work-around would be to apply cor row-wise and assemble the correlation matrix from the results. You could try whether this is any more efficient (which I doubt, though you could fine-tune it by not computing everything twice or skipping the redundant diagonal cases):
# Apply 2-fold nested row-wise functions
set.seed(1)
dat <- matrix(rnorm(1000), nrow=10)
cormat <- apply(dat, MARGIN=1, FUN=function(z) apply(dat, MARGIN=1, FUN=function(y) cor(z, y)))
cormat[1:3,1:3] # Show few first
# [,1] [,2] [,3]
#[1,] 1.000000000 0.002175792 0.1559263
#[2,] 0.002175792 1.000000000 -0.1870054
#[3,] 0.155926259 -0.187005418 1.0000000
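A quick sanity check, using the same dat as above, that this matches the column-wise call on the transpose:
all.equal(cormat, cor(t(dat)))
# should be TRUE (up to numerical tolerance)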
Though, generally, I would expect the transpose to have a really efficient implementation, so it's hard to imagine it being the bottleneck. But you could also dig through the implementation of the cor function and call the underlying C function directly, after making sure your rows are in a suitable form. Type cor at the R prompt to see the implementation, which is mostly a wrapper that prepares the input for the C function:
# Row with C-call from the implementation of 'cor':
# if (method == "pearson")
# .Call(C_cor, x, y, na.method, FALSE)
You can use outer:
outer(seq(nrow(mat)), seq(nrow(mat)),
Vectorize(function(x, y) cor(mat[x , ], mat[y , ])))
where mat is the name of your matrix.
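For example, with a small hypothetical mat, verified against the transpose-based call:
set.seed(1)
mat <- matrix(rnorm(50), nrow = 5)
rowcor <- outer(seq(nrow(mat)), seq(nrow(mat)),
                Vectorize(function(x, y) cor(mat[x , ], mat[y , ])))
all.equal(rowcor, cor(t(mat)))   # TRUE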