Populating Large Matrix and Computations - r

I am trying to populate a 25000 x 25000 matrix in a for loop, but R locks up on me. The data has many zero entries, so would a sparse matrix be suitable?
Here is some sample data and code.
x<-c(1,3,0,4,1,0,4,1,1,4)
y<-x
z<-matrix(NA,nrow=10,ncol=10)
for(i in 1:10){
  if(x[i]==0){
    z[i,]=0
  } else{
    for(j in 1:10){
      if(x[i]==y[j]){
        z[i,j]=1
      } else{
        z[i,j]=0
      }
    }
  }
}
One other question: is it possible to do computations on matrices this large? When I perform calculations on sample matrices of this size, I either get an output of NA with an integer overflow warning, or R locks up completely.

You could vectorize this and that should help you. Also, if your data is indeed sparse and you can conduct your analysis on a sparse matrix it definitely is something to consider.
library(Matrix)
# set up all pairs
pairs <- expand.grid(x,x)
# get matrix indices
idx <- which(pairs[,1] == pairs[,2] & pairs[,1] != 0)
# create a dense matrix filled with zeros instead
z<-matrix(0,nrow=10,ncol=10)
z[idx] = 1
# create empty sparse matrix
z2 <-Matrix(0,nrow=10,ncol=10, sparse=TRUE)
z2[idx] = 1
all(z == z2)
[1] TRUE
The comment by @alexis_lax makes this even simpler and faster. I had completely forgotten about the outer function.
# normal matrix
z = outer(x, x, "==") * (x!=0)
# sparse matrix
z2 = Matrix(outer(x, x, "==") * (x!=0), sparse=TRUE)
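Note that outer() still builds the full dense matrix before it is converted, which may not fit in memory at 25000 x 25000. As a rough sketch of one way around that (not part of the original answer), the sparse matrix can be assembled from a sparse indicator matrix instead. The names nz, f, A and z_sparse are illustrative, and the result is only sparse if no single non-zero value is shared by a large fraction of the entries.

```r
library(Matrix)

x <- c(1, 3, 0, 4, 1, 0, 4, 1, 1, 4)

# positions of the non-zero entries and their values as a factor
nz <- which(x != 0)
f  <- factor(x[nz])

# A[i, k] = 1 when x[i] equals the k-th distinct non-zero value;
# rows belonging to zero entries stay empty
A <- sparseMatrix(i = nz, j = as.integer(f), x = 1,
                  dims = c(length(x), nlevels(f)))

# (A %*% t(A))[i, j] is 1 exactly when x[i] == x[j] and both are non-zero
z_sparse <- tcrossprod(A)
```

No dense n x n object is created along the way, so memory use is driven by the number of non-zero cells in the result.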
To answer your second question, whether computations can be done on such a big matrix: yes. You just need to approach it more cautiously and use the appropriate tools. Sparse matrices are nice: many typical matrix functions are available for them, and some other packages are compatible. Here is a link to a page with some examples.
Another thought: if you are working with really large matrices, you may want to look into other packages like bigmemory, which are designed to deal with R's large overhead.
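As for the NA with an integer overflow warning mentioned in the question, one likely cause (an assumption, since the failing calculation is not shown) is arithmetic on integer-typed data exceeding .Machine$integer.max. A small sketch of the effect and the usual workaround of converting to double first; the variable names are illustrative:

```r
# two integer values whose sum exceeds the largest representable integer
counts <- c(.Machine$integer.max, 1L)

# integer arithmetic overflows: R returns NA and raises a warning
total_int <- suppressWarnings(sum(counts))

# converting to double first avoids the overflow
total_dbl <- sum(as.numeric(counts))
```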

Related

Combine list of matrices into a big.matrix

I have a list of large (35000 x 3) matrices in R and I want to combine them into a single matrix but it would be about 1 billion rows long and would exceed the maximum object size in R.
The bigmemory package allows for larger matrices but doesn't appear to support rbind to put multiple matrices together.
Is there some other package or technique that supports the creation of a very large matrix from smaller matrices?
Also, before you ask: this is not a RAM issue, but an R limitation, even on 64-bit R.
You could implement it with a loop:
library(bigmemory)
## Reproducible example
mat <- matrix(1, 50e3, 3)
l <- list(mat)
for (i in 2:100) {
l[[i]] <- mat
}
## Solution
m <- ncol(l[[1]]) ## assuming that all have the same number of columns
n <- sum(sapply(l, nrow))
bm <- big.matrix(n, m)
offset <- 0
for (i in seq_along(l)) {
  mat_i <- l[[i]]
  n_i <- nrow(mat_i)
  ind_i <- seq_len(n_i) + offset
  bm[ind_i, ] <- mat_i
  offset <- offset + n_i
}
## Verification
stopifnot(offset == n, all(bm[, 1] == 1))
Not quite an answer, but a little more than a comment: are you sure that you can't do it by brute force? R now has long vectors (since version 3.0.0; the question you link to refers to R version 2.14.1): from this page,
Arrays (including matrices) can be based on long vectors provided each of their dimensions is at most 2^31 - 1: thus there are no 1-dimensional long arrays.
while the underlying atomic vector can go up to 2^52 - 1 elements ("in theory .. address space limits of current CPUs and OSes will be much smaller"). That means you should in principle be able to create a matrix with up to 2^31 - 1 rows, i.e. about 2.1 billion; since the maximum "long" object size is on the order of 10^15 elements (i.e. literally millions of billions), a matrix of 1 billion rows and 3 columns should (theoretically) not be a problem.
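The arithmetic behind that claim can be checked directly. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the size estimate assumes a numeric (double) matrix at 8 bytes per element.

```r
n_rows <- 1e9
n_cols <- 3
n_elem <- n_rows * n_cols

# each dimension must be at most 2^31 - 1
fits_dim <- n_rows <= 2^31 - 1 && n_cols <= 2^31 - 1

# the underlying vector must stay below the long-vector limit
fits_len <- n_elem <= 2^52 - 1

# approximate size in GiB for a double matrix (8 bytes per element)
size_gib <- n_elem * 8 / 2^30
```

So the limits are not the obstacle, but the object itself is still tens of gigabytes, so whether it is practical depends on the machine.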

An efficient way to diagonalize a sparse vector in R

I'm working with a vector (~14000x1) of various values that I would like to put on the diagonal of a sparse matrix where I'm using the library Matrix. I want to do this while avoiding the need of creating a full matrix and then converting back to a sparse matrix after.
So far I can do this with a for loop, but it takes a long time. Can you think of a more efficient and less memory-intensive way of doing it?
Here's a simple reproducible example:
library(Matrix)
x = Matrix(matrix(1,14000,1),sparse=TRUE)
X = Diagonal(14000)
for(i in 1:13383){
  # copy the i-th value of the vector onto the diagonal
  X[i,i]=x[i]
  print(i)
}
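A possible loop-free approach, sketched here under the assumption that the values are available as a plain numeric vector (called v below, a made-up name): Diagonal() accepts the diagonal entries through its x argument, and sparseMatrix() can build the same thing from triplets while storing only the non-zero entries.

```r
library(Matrix)

# illustrative values; a vector with many zeros
v <- c(2, 0, 5, 0, 0, 7)

# diagonal matrix built directly from the values, no dense intermediate
D <- Diagonal(x = v)

# triplet alternative that stores only the non-zero diagonal entries
idx <- which(v != 0)
S <- sparseMatrix(i = idx, j = idx, x = v[idx],
                  dims = c(length(v), length(v)))
```

Both calls avoid element-by-element assignment, which is what makes the for loop slow.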

Matrix operation efficiency in R

I have 3 matrices X, K and M as follows.
X <- matrix(c(1,2,3,1,2,3,1,2,3),ncol=3)
K <- matrix(c(4,5,4,5,4,5),ncol=3)
M <- matrix(c(0.1,0.2,0.3),ncol=1)
Here is what I need to accomplish.
For example,
Y(1,1)=(1-4)^2*0.1^2+(1-4)^2*0.2^2+(1-4)^2*0.3^2
Y(1,2)=(1-5)^2*0.1^2+(1-5)^2*0.2^2+(1-5)^2*0.3^2
...
Y(3,2)=(3-5)^2*0.1^2+(3-5)^2*0.2^2+(3-5)^2*0.3^2
Currently I used 3 for loops to calculate the final matrix in R. But for large matrices, this is taking extremely long to calculate. And I also need to change the elements in matrix M to find the best value that produces minimal squared errors. Is there a better way to code it up, i.e. Euclidean norm?
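# Y has one row per row of X and one column per row of K
Y <- matrix(0, nrow(X), nrow(K))
for (lin in 1:nrow(X)) {
  for (col in 1:nrow(K)) {
    Y[lin,col] <- 0
    # accumulate the weighted squared differences over the rows of M
    for (m in 1:nrow(M)) {
      Y[lin,col] <- Y[lin,col] + (X[lin,m]-K[col,m])^2 * M[m,1]^2
    }
  }
}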
Edit:
I ended up using Rcpp to write the code in C++ and call it from R. It is significantly faster! It takes 2-3 seconds to fill up a 2000 * 2000 matrix.
Thank you. I was able to figure this out. The change made my calculation twice as fast as before. For anyone who may be interested, I replaced the last for loop for(m in 1:M) with the following:
Y[lin,col] <- norm(as.matrix((X[lin,]-K[col,]) * M[1,]),"F")^2
Note that I transposed the matrix M so that it has 3 columns instead of 1.
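For reference, the whole computation can also be written without any loops by expanding the square, since Y[i,j] = sum_m w[m] * (X[i,m] - K[j,m])^2 with w = M^2. The following is a sketch of that idea rather than something from the thread; w, Xw and Kw are illustrative names.

```r
X <- matrix(c(1,2,3,1,2,3,1,2,3), ncol = 3)
K <- matrix(c(4,5,4,5,4,5), ncol = 3)
M <- matrix(c(0.1,0.2,0.3), ncol = 1)

# squared weights, one per column of X and K
w <- as.vector(M)^2

# weighted squared norms of the rows of X and of K
Xw <- as.vector(X^2 %*% w)
Kw <- as.vector(K^2 %*% w)

# (a - b)^2 = a^2 + b^2 - 2ab, applied to every (row of X, row of K) pair;
# w * t(K) scales row m of t(K) by w[m]
Y <- outer(Xw, Kw, "+") - 2 * X %*% (w * t(K))
```

Because everything reduces to matrix products, changing M while searching for the best weights only requires recomputing these few lines.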

R: Big integer matrices

I have some big integer matrices (1000 x 1000000) that I have to multiply and do rowmax on.
They contain 0 and 1 (approx 99% 1 and 1% 0 and no other values).
My problem is memory consumption: Currently R eats 8 bytes per integer.
I have looked at SparseMatrix, but it seems I cannot set the default value to 1 instead of 0.
How can I represent these matrices in a memory efficient way, but so I can still multiply them as matrices and use rowmax?
Preferably it should work with R-2.15 and not require additional libraries.
Second idea: If you have a couple of these matrices, call them X_1 and X_2, let Y_1 = 1*1' - X_1 and Y_2 = 1*1' - X_2; the Y's can be sparse because they are 99% zero. So their product is
X_1 * X_2 = ( 1*1' - Y_1) * (1*1' - Y_2) = 1*1'*1*1' - Y_1*1*1' - 1*1'*Y_2 + Y_1 * Y_2
which you can simplify even further.
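One way the simplification could look in R, using the Matrix package (so it does not meet the "no additional libraries" preference): with k the shared inner dimension, the product equals k minus the row sums of Y_1 minus the column sums of Y_2 plus Y_1 * Y_2. The sizes and names below are illustrative, and the dense X1 and X2 are built here only to produce example data.

```r
library(Matrix)

set.seed(1)
n <- 5; k <- 8; m <- 4
# small 0/1 example matrices that are mostly 1
X1 <- matrix(rbinom(n * k, 1, 0.9), n, k)
X2 <- matrix(rbinom(k * m, 1, 0.9), k, m)

# complements are mostly 0, so they can be stored sparsely
Y1 <- Matrix(1 - X1, sparse = TRUE)
Y2 <- Matrix(1 - X2, sparse = TRUE)

# X1 %*% X2 = k - rowSums(Y1) 1' - 1 colSums(Y2)' + Y1 %*% Y2
prod_via_sparse <- k - outer(rowSums(Y1), colSums(Y2), "+") +
  as.matrix(Y1 %*% Y2)
```

Only the final n x m result is dense; the large operands are never expanded.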
There are several sparse matrix packages (slam, SparseM, Matrix, ...) but I doubt any will do bitwise representation, or even single char, as you'd need here. You may have to code that up yourself.
Alternatively, packages like ff allow more compact storage but AFAIK will not do matrix ops for you. Maybe you could build that on top of them?
Off the top of my head, I can't think of a packaged solution...
It seems like you could represent this type of data extremely efficiently with run length encoding by row. From there, you could implement a matrix-vector multiply method for rle objects, (which might be hard) and row-max (which should be trivial).
Since there are only 1% 0's it would not be difficult to compress. One trivial example:
pseudo.matrix <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  # linear indices of the cells that are zero
  zeroes.cells <- which(x==0)
  list(nrow=nrow, ncol=ncol, zeroes.cells=zeroes.cells)
}
This alone would reduce their memory size significantly. And it would be easy to recover the original matrix:
recover.matrix <- function(x) {
  m <- matrix(1, x$nrow, x$ncol)
  # set all recorded zero cells in one vectorized assignment
  m[x$zeroes.cells] <- 0
  m
}
I guess it is possible to figure out a way to efficiently multiply these pseudo matrices, since the result for each cell would be something like the number of columns of the first matrix minus an adjustment regarding the number of zeros in the operation, but I am not sure how easy it would be to do this.

How to calculate Euclidean distance (and save only summaries) for large data frames

I've written a short 'for' loop to find the minimum euclidean distance between each row in a dataframe and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while not that much is being saved in memory, it is very very slow for large matrices (my use case of ~150K rows is still running).
I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.
Thanks in advance (and for your patience).
require(proxy)
df<-data.frame(matrix(runif(10*10),nrow=10,ncol=10), row.names=paste("site",seq(1:10)))
min.dist<-function(df) {
  #df for results
  all.min.dist<-data.frame()
  #set up for loop
  for(k in 1:nrow(df)) {
    #calculate dissimilarity between each row and all other rows
    df.dist<-dist(df[k,],df[-k,])
    # find minimum distance
    min.dist<-min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row<-row.names(df)[-k][which.min(df.dist)]
    #combine outputs
    all.min.dist<-rbind(all.min.dist,data.frame(orig_row=row.names(df)[k],
                                                dist=min.dist, closest_row=closest.row))
  }
  #return results
  return(all.min.dist)
}
#example
min.dist(df)
This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.
min.dist <- function(df) {
  # pts is the transposed data: one point per column
  which.closest <- function(k, pts) {
    d <- colSums((pts[, -k] - pts[, k]) ^ 2)
    m <- which.min(d)
    # point names are the column names of the transposed matrix
    data.frame(orig_row = colnames(pts)[k],
               dist = sqrt(d[m]),
               closest_row = colnames(pts)[-k][m])
  }
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.
Edit: Also read https://stackoverflow.com/a/16670220/1201032
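The idea of handling several points at a time could look roughly like the sketch below. It is an illustration rather than a tuned implementation: nearest_chunked and chunk_size are made-up names, and the squared distances are obtained from the expansion |a - b|^2 = |a|^2 + |b|^2 - 2ab, so tiny negative values from rounding are clipped to zero.

```r
nearest_chunked <- function(df, chunk_size = 1000) {
  X <- as.matrix(df)
  n <- nrow(X)
  sq <- rowSums(X^2)                 # squared norm of every row
  nn_idx <- integer(n)
  nn_dist <- numeric(n)
  for (start in seq(1, n, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1, n)
    # squared distances between the rows of this chunk and all rows
    d2 <- outer(sq[rows], sq, "+") - 2 * X[rows, , drop = FALSE] %*% t(X)
    # a point must not be reported as its own nearest neighbour
    d2[cbind(seq_along(rows), rows)] <- Inf
    best <- apply(d2, 1, which.min)
    nn_idx[rows] <- best
    nn_dist[rows] <- sqrt(pmax(d2[cbind(seq_along(rows), best)], 0))
  }
  data.frame(orig_row = rownames(X), dist = nn_dist,
             closest_row = rownames(X)[nn_idx])
}

df <- data.frame(matrix(runif(10*10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))
res <- nearest_chunked(df, chunk_size = 4)
```

Each iteration only holds a chunk_size x n block of distances, so chunk_size is the knob that trades memory for speed.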
Usually, built-in functions are faster than coding it yourself (because they are coded in Fortran or C/C++ and optimized).
It seems that the function dist {stats} answers your question spot on:
Description
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.

Resources