Allocating space for a sparse matrix in R

I construct a large, sparse matrix, for which I know the number of non-zero elements in advance. Is it possible in R to allocate space for this matrix, instead of having its space automatically increased every time I add an element? Something like spalloc does in Matlab.
As a simplified code-example of what I want, consider the construction of the following block-wise diagonal matrix.
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
x = Matrix(0, n*q, p*q, sparse=TRUE);
for(i in 1:q) {
x[((i-1)*n+1):(i*n),((i-1)*p+1):(i*p)] = x_i;
}
I think this process would be much faster if I could tell R in advance that the matrix will contain n*p*q non-zero elements.
Thanks in advance!
Edit: I now see that for the blockwise matrix I should use bdiag()
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
lst = list();
for(i in 1:q) {
  lst[[i]] = x_i;
}
x = bdiag(lst);
This is much faster.
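For the general case, where the non-zero entries are known in advance but the matrix is not block-diagonal, the usual way to "preallocate" is to build all (row, column, value) triplets first and create the matrix in a single call to sparseMatrix(). Here is a sketch for the block example above (my addition, not part of the original question; note that the index vectors themselves are n*p*q elements long):
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = matrix(rnorm(n*p), n, p);
# row, column and value triplets for all n*p*q non-zero elements, block by block
i = rep(rep(1:n, times = p), times = q) + rep((0:(q-1))*n, each = n*p);
j = rep(rep(1:p, each = n), times = q) + rep((0:(q-1))*p, each = n*p);
v = rep(as.vector(x_i), times = q);
x = sparseMatrix(i = i, j = j, x = v, dims = c(n*q, p*q));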

Related

Best way to fill a sparse matrix

What is the most efficient way to fill a sparse matrix? I know that sparse matrices are CSC (compressed sparse column), so I expected it to be fast to fill them column by column, like
using SparseArrays
M = 100
N = 1000
sparr = spzeros(Float64, M, N)
for n = 1:N
# do some math here
idx = <<boolean index array of nonzero elements in the nth column>>
vals = <<values of nonzero elements in the nth column>>
sparr[idx, n] = vals
end
However, I find that this scales very poorly with N. Is there a better way to fill the array? Or perhaps, I should not bother with filling the array and instead initialize the matrix differently?
You can do sparse(I, J, V, M, N) directly:
julia> using SparseArrays
julia> M = 100;
julia> N = 1000;
julia> nz = 2000; # number of nonzeros
julia> I = rand(1:M, nz); # dummy I indices
julia> J = rand(1:N, nz); # dummy J indices
julia> V = randn(nz); # dummy matrix values
julia> sparse(I, J, V, M, N)
100×1000 SparseMatrixCSC{Float64, Int64} with 1982 stored entries:
⣻⣿⣿⣿⣿⡿⣾⣿⣿⣿⣿⣿⣿⣷⣾⣽⣿⢿⢿⣿⣿⣿⢿⣿⣾⣿⣽⣿⣿⣾⣿⣿⣿⣿⣿⣿⣿⣿⣻⣿
⣼⣿⣿⡿⣿⣿⡽⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣻⣿⡿⣿⣿⣿⡿⣿⡿⣯⢿⣿⠾⣿⣿⡿⢿⣿⣻⡿⣾
which should scale decently with size. For more expert use, you could directly construct the SparseMatrixCSC object.
EDIT:
Note that if you have to stick with the pseudocode you gave, with a for-loop producing the column indices and values, you can simply concatenate them to build I, J, and V:
I = Int[]
J = Int[]
V = Float64[]
for n = 1:N
# do some math here
idx = <<boolean index array of nonzero elements in the nth column>>
vals = <<values of nonzero elements in the nth column>>
I = [I; findall(idx)]          # convert the boolean mask to row indices
J = [J; fill(n, length(vals))] # one column index per new value
V = [V; vals]
end
but that'd be slower I think.

how to modify my R code to accelerate computational speed

Here is my R code. Could you please give me some advice on how to accelerate the computation? :)
First, the function myfun() generates a complex number.
Second, I compute the elements of the matrix M using myfun().
myfun<-function(a,b,nq,ul,uk)
{
m<-seq(1,(nq/2)+1,length=(nq/2)+1);
k<-m;
D<-matrix(NA,nrow = length(k),ncol = length(k));
for(i in 1:length(k)) # row
for(j in 1:length(m)) # column
{
D[i,j]<-(2/nq)*cos(((j-1)*(i-1)*pi)/(nq*0.5))
}
D[,1]<-D[,1]*0.5;
D[,ncol(D)]<-D[,ncol(D)]*0.5;
# compute the vector v
vseq<-seq(2,nq-2,by=2);
vr<-2/(1-vseq^2);
vr<-c(1,vr,1/(1-nq*nq));
v<-matrix(vr,ncol=1); # v is a N by 1 matrix
# compute the vector w, length(w)=nq/2+1
h<-function(x,ul,uk)
{
((b-a)/2)*(exp((b-a)/2*x+(a+b)/2)+1)^(1i*uk)*cos(((b-a)/2*x+(a+b)/2-a)*ul)
}
w<-matrix(rep(NA,length(v)),ncol=1);
for(i in 1:length(w))
{
w[i]<-h((cos((i-1)*pi/nq)),ul,uk)+h((-cos((i-1)*pi/nq)),ul,uk)
}
res<-t(t(D)%*%v)%*%w; # each element of matrix M
return(res)
}
Next, compute each element of matrix M. The N-th column and N-th row are zeros.
matrix.M<-matrix(0,ncol = N,nrow = N);
for(i in 1:(N-1))
  for(j in 1:(N-1))
  {
    matrix.M[i,j]<-myfun(a,b,nq,i-1,j-1)
  }
We can set parameters as
a <- -173.2;
b <- 78;
alpha <- 0.24;
Dt <- 0.1;
M <- 1000;
N <- 150;
u <- seq(1,150,by=1)*pi/(b-a);
nq <- 3000;
I appreciate your help!
Here are some suggestions for speeding the function up. I use three "tricks":
Vectorize as many functions as possible
Use the outer function instead of a double loop
Use the hidden gem crossprod for the final matrix products
myfun<-function(a,b,nq,ul,uk) {
m<-seq(1,(nq/2)+1,length=(nq/2)+1);
k<-m;
## Use outer to compute the elements of the matrix
D <- outer(1:length(k), 1:length(m), function(i, j) {(2/nq)*cos(((j-1)*(i-1)*pi)/(nq*0.5))} )
D[,1]<-D[,1]*0.5;
D[,ncol(D)]<-D[,ncol(D)]*0.5;
# compute the vector v
vseq<-seq(2,nq-2,by=2);
vr<-2/(1-vseq^2);
vr<-c(1,vr,1/(1-nq*nq));
v<-matrix(vr,ncol=1); # v is a N by 1 matrix
h<-function(x,ul,uk) {
((b-a)/2)*(exp((b-a)/2*x+(a+b)/2)+1)^(1i*uk)*cos(((b-a)/2*x+(a+b)/2-a)*ul)
}
## Compute the full w vector in one go
vect <- seq_along(v)-1
w <- h((cos(vect*pi/nq)),ul,uk) + h((-cos(vect*pi/nq)),ul,uk)
## Compute the cross products.
res <- crossprod(crossprod(D, v), w)
return(res)
}
I think this should save around 80% of the time compared to the original function. The time hog was the initial computation of D. Hope this helps.
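As a quick sanity check (my addition, not part of the answer), you can confirm that the vectorized version returns the same element and time both versions; myfun_orig below is assumed to be the question's original function kept under a different name, and the arguments are arbitrary test values:
a <- -173.2; b <- 78; nq <- 3000
all.equal(myfun_orig(a, b, nq, 3, 5), myfun(a, b, nq, 3, 5))  # should be TRUE
system.time(for (k in 0:9) myfun_orig(a, b, nq, 3, k))
system.time(for (k in 0:9) myfun(a, b, nq, 3, k))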

Find lowest distances between rows of a large matrix: Allocation limit error

I want to calculate the distances between all rows of a large matrix. For each row, I need to find another row which has the lowest distance to it. The final output should be a list containing the IDs of the rows with the lowest distances (see low_dis_ids in the example below).
I was able to find a solution for small sample sizes (example below). However, I am not able to perform these steps with larger sample sizes, because the matrix of distances gets too big. Is there a way to avoid creating such a big matrix? I only need the list with the IDs (like low_dis_ids).
Reproducible Example:
set.seed(123)
# Calculation of distances with small samplesize is working well
N <- 100
data_100 <- data.frame(x1 = rnorm(N, 5, 10),
x2 = rnorm(N, 5, 10),
x3 = rnorm(N, 5, 10),
x4 = rnorm(N, 5, 10),
x5 = rnorm(N, 5, 10))
# Matrix with all distances (no problem for the smaller samplesize)
dist_100 <- as.matrix(dist(data_100))
# Find the row with the smallest distance
for(i in 1:nrow(dist_100)) {
dist_100[i, i] <- Inf
}
low_dis <- numeric()
for(i in 1:nrow(dist_100)) {
low_dis[i] <- as.numeric(sort(dist_100[ , i]))[1]
}
low_dis_ids <- list()
for(i in 1:length(low_dis)) {
low_dis_ids[[i]] <- as.numeric(names(dist_100[ , i][dist_100[ , i] == low_dis[i]]))
}
# low_dis_ids is the desired output and stores the rows with the smallest distances
# The same procedure is not working for larger samplesizes
N <- 100000
data_100000 <- data.frame(x1 = rnorm(N, 5, 10),
x2 = rnorm(N, 5, 10),
x3 = rnorm(N, 5, 10),
x4 = rnorm(N, 5, 10),
x5 = rnorm(N, 5, 10))
dist_100000 <- dist(data_100000)
# Error: cannot allocate vector of size 37.3 Gb
You can definitely avoid the creation of the large matrix that results from using dist. One such solution is to calculate the distances one row at a time: we can write a function that, given the whole data frame and a row id, finds which row corresponds to the smallest distance. For example:
f = function(rowid, whole){
d = colSums((whole[rowid,] - t(whole))^2) # calculate distance
d[rowid] = Inf # replace the zero
which.min(d)
}
The colSums function is fairly well optimized, so this is relatively quick. I suspect which.min is also slightly faster, and possibly neater, than looping through the vector of distances.
To make a solution that applies to any such data frame, I wrote another short function that applies this to every row and gives you a vector of row ids:
mindists = function(dat) do.call(c,lapply(1:nrow(dat),f,whole = as.matrix(dat)))
If you want the list instead of a vector, just omit the do.call function. I used it here to make it easier to check that the output gives what you expected:
all(do.call(c,low_dis_ids) == mindists(data_100))
[1] TRUE
This also runs for the larger example on my laptop. It isn't fast, because you are making nrow(data) calls to f, but it does avoid the creation of one large object. There may be better solutions out there, but this was the first one that sprang to mind that works. Hope that helps.
EDIT:
Edited since there is an additional C++ answer by Roland.
I did a quick benchmark on the smaller data set; the C++ answer is definitely quicker in this case.
An extra selling point of this answer is that it is, I think, easier to understand if you are purely an R programmer (no need to learn C++ and Rcpp), and the R version is trivial to parallelise using a parallel replacement of lapply, as sketched below. This is not to take away from Roland's answer (personally I like Rcpp), just to give some extra information for future readers of this question.
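For completeness, here is a sketch of that parallel variant (my addition, not from the original answers). mclapply() comes from the parallel package and forks the R process, so it works on Linux/macOS but not on Windows:
library(parallel)
mindists_par = function(dat, cores = 4) {
  m = as.matrix(dat)
  # same f as above, applied to each row id on several cores
  unlist(mclapply(1:nrow(m), f, whole = m, mc.cores = cores))
}
# mindists_par(data_100) should equal mindists(data_100), just computed in parallel.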
Use Rcpp since a base R solution will be too slow:
library(Rcpp)
library(inline)
cppFunction(
" IntegerVector findLowestDist(const NumericMatrix X) {
const int n = X.nrow();
const int m = X.ncol();
IntegerVector minind(n);
NumericVector minsqdist(n);
double d;
for (int i = 0; i < n; ++i) {
if (i == 0) {
d = 0;
for (int k = 0; k < m; ++k) {
d += pow(X(i, k) - X(1, k), 2.0);
}
minsqdist(i) = d;
minind(i) = 1;
} else {
d = 0;
for (int k = 0; k < m; ++k) {
d += pow(X(i, k) - X(0, k), 2.0);
}
minsqdist(i) = d;
minind(i) = 0;
}
for (int j = 1; j < n; ++j) {
if (i == j) continue;
d = 0;
for (int k = 0; k < m; ++k) {
d += pow(X(i, k) - X(j, k), 2.0);
}
if (d < minsqdist(i)) {
minsqdist(i) = d;
minind(i) = j;
}
}
}
return minind + 1;
}"
)
all.equal(findLowestDist(as.matrix(data_100)),
unlist(low_dis_ids))
#[1] TRUE
findLowestDist(as.matrix(data_100000))
#works
The algorithm can probably be improved.

R: Convert upper triangular part of a matrix to symmetric matrix

I have the upper triangular part of a matrix in R (without the diagonal) and want to generate a symmetric matrix from it (with 1 on the diagonal, but that can be adjusted later). I usually do that like this:
res.upper <- rnorm(4950)
res <- matrix(0, 100, 100)
res[upper.tri(res)] <- res.upper
rm(res.upper)
diag(res) <- 1
res[lower.tri(res)] <- t(res)[lower.tri(res)]
This works fine, but now I want to work with very large matrices. Thus, I want to avoid having to store res.upper and res (filled with 0) at the same time. Is there any way I can directly convert res.upper to a symmetric matrix without having to initialize the matrix res first?
I think there are two issues here.
now I want to work with very large matrices
Then do not use R code to do this job. R will use much more memory than you expect. Try the following code:
res.upper <- rnorm(4950)
res <- matrix(0, 100, 100)
tracemem(res) ## trace memory copies of `res`
res[upper.tri(res)] <- res.upper
rm(res.upper)
diag(res) <- 1
res[lower.tri(res)] <- t(res)[lower.tri(res)]
This is what you will get:
> res.upper <- rnorm(4950) ## allocation of length 4950 vector
> res <- matrix(0, 100, 100) ## allocation of 100 * 100 matrix
> tracemem(res)
[1] "<0xc9e6c10>"
> res[upper.tri(res)] <- res.upper
tracemem[0xc9e6c10 -> 0xdb7bcf8]: ## allocation of 100 * 100 matrix
> rm(res.upper)
> diag(res) <- 1
tracemem[0xdb7bcf8 -> 0xdace438]: diag<- ## allocation of 100 * 100 matrix
> res[lower.tri(res)] <- t(res)[lower.tri(res)]
tracemem[0xdace438 -> 0xdb261d0]: ## allocation of 100 * 100 matrix
tracemem[0xdb261d0 -> 0xccc34d0]: ## allocation of 100 * 100 matrix
In R, you have to use 5 * (100 * 100) + 4950 double words to finish these operations, while in C you need at most 4950 + 100 * 100 double words (in fact, 100 * 100 is all that is needed, as shown below). It is difficult to overwrite an object in place in R without extra memory allocation.
Is there any way I can directly convert res.upper to a symmetric matrix without having to initialize the matrix res first?
You do have to allocate memory for res because that is what you end up with, but there is no need to allocate memory for res.upper: you can generate the upper-triangular entries and fill in the lower triangle at the same time. Consider the following template:
#include <Rmath.h> // use: double rnorm(double a, double b)
#include <R.h> // use: getRNGstate() and putRNGstate() for randomness
#include <Rinternals.h> // SEXP data type
// N is the matrix dimension, a length-1 integer vector on the R side
// this function returns the matrix you want
SEXP foo(SEXP N) {
int i, j, n = asInteger(N);
SEXP R_res = PROTECT(allocMatrix(REALSXP, n, n)); // allocate memory for `R_res` (an n-by-n numeric matrix)
double *res = REAL(R_res);
double tmp; // a local variable for register reuse
getRNGstate();
for (i = 0; i < n; i++) {
res[i * n + i] = 1.0; // diagonal is 1, as you want
for (j = i + 1; j < n; j++) {
tmp = rnorm(0, 1);
res[j * n + i] = tmp; // initialize upper triangular
res[i * n + j] = tmp; // fill lower triangular
}
}
putRNGstate();
UNPROTECT(1);
return R_res;
}
The code is not fully optimized: using the integer multiplication j * n + i for addressing in the innermost loop incurs a performance penalty. You can, however, move the multiplication outside the inner loop and leave only the addition inside.
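For reference, a possible way to compile and call the template from R (my addition; the file name sym.c and the shared-object name are assumptions):
# in a shell: R CMD SHLIB sym.c   (produces sym.so, or sym.dll on Windows)
dyn.load("sym.so")
res <- .Call("foo", 100L)  # 100 x 100 symmetric matrix with 1 on the diagonal
isSymmetric(res)           # should be TRUE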
To get a symmetric matrix from an upper or lower triangular matrix, you can add the matrix to its transpose and subtract the diagonal elements: S = U + t(U) - diag(diag(U)), where diag(U) is a diagonal matrix containing the diagonal elements of U.
ultosymmetric = function(m) {
  m = m + t(m) - diag(diag(m))
  return(m)
}
If you want the diagonal elements to be 1 you can do this.
ultosymmetric_diagonalone = function(m) {
  m = m + t(m) - 2*diag(diag(m)) + diag(1, nrow = dim(m)[1])
  return(m)
}
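Hypothetical usage with the question's setup (my addition): fill only the upper triangle once, then symmetrize it with a unit diagonal.
res.upper <- rnorm(4950)
res <- matrix(0, 100, 100)
res[upper.tri(res)] <- res.upper
res <- ultosymmetric_diagonalone(res)
isSymmetric(res)  # should be TRUE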

Compute a double sum in R

I have to compute a test statistic that involves a double sum:
T = (1/n) * sum_{j=1..n} sum_{k=1..n} exp(-(Y[j]-Y[k])^2/2) - sqrt(2) * sum_{j=1..n} exp(-Y[j]^2/4) + n/sqrt(3)
I solved it like this:
T <- numeric(1)
for(j in 1:n){
  for(k in 1:n){
    T = T + ((1/n)*(exp(-(1/2)*((Y[j]-Y[k])^2))))
  }
  T = T - ((sqrt(2))*(exp(-(1/4)*((Y[j])^2))))
}
T = T + (n*(3^(-(1/2))))
Is there an easier way to compute the test statistic?
Use
n=100;
Y=runif(100);
T=0;
Ydiff=outer(Y,Y,"-")^2;
Y_1=exp(-0.5*Ydiff);
Y_2=sqrt(2)*exp(-0.25*Y^2);
T=sum(rowMeans(Y_1)-Y_2) + (n*(3^(-(1/2))))
A comparison of the methods given so far gives:
T=0;
n=100;
set.seed(100)
Y=runif(100);
for(j in 1:n){
for(k in 1:n){
T = T + ((1/n)*(exp(-(1/2)*((Y[j]-Y[k])^2))));
}
T = T - ((sqrt(2))*(exp(-(1/4)*((Y[j])^2))));
}
T = T + (n*(3^(-(1/2))));
print(T)
#21.18983
T=0;
Ydiff=outer(Y,Y,"-")^2;
Y_1=exp(-0.5*Ydiff);
Y_2=sqrt(2)*exp(-0.25*Y^2);
T=sum(rowMeans(Y_1)-Y_2) + (n*(3^(-(1/2))));
print(T)
# 21.18983
T=0;
indexes = expand.grid(1:n,1:n);
T = 1/n*sum(exp(-1/2)*((Y[indexes[,1]]-Y[indexes[,2]])));
T = T-(sqrt(2))*sum(exp(-1/4*(Y[1:n])));
T = T+n/sqrt(3);
print(T)
# -66.71403 (differs: this version omits the squaring in the exponents, so it does not compute the same statistic)
It's more useful to create the indexes in advance and then just sum over an array, rather than computing new indices in two nested loops:
indexes = expand.grid(1:n,1:n)
T = 1/n*sum(exp(-1/2*(Y[indexes[,1]]-Y[indexes[,2]])))
T = T-(sqrt(2))*sum(exp(-1/4*(Y[1:n])))
T = T+n/sqrt(3)
Edit: For large n, this is impractical, as an n of 1,000,000 would make a 3.7 TB data frame with expand.grid. You can always use the for loops, even if they are slow, but I would recommend using C++ if you need to have absurdly large N, because that is 1 trillion loops, which will take a very long time to compute.
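If n really is large, a compromise (my sketch, not from the answers above) is to accumulate the double sum in blocks, so that only a block_size-by-n slice of the pairwise differences is held in memory at any time; this avoids both the full n-by-n matrix and a trillion scalar loop iterations:
double_sum_blocked <- function(Y, block_size = 1000) {
  n <- length(Y)
  total <- 0
  for (start in seq(1, n, by = block_size)) {
    idx <- start:min(start + block_size - 1, n)
    # outer() builds only a length(idx) x n slice of exp(-(Y[j]-Y[k])^2/2)
    total <- total + sum(exp(-0.5 * outer(Y[idx], Y, "-")^2))
  }
  total/n - sqrt(2)*sum(exp(-0.25*Y^2)) + n/sqrt(3)
}
# double_sum_blocked(Y) should match the T computed by the double loop above.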
