Matrix Math in R on Large Datasets

I've got a big square matrix, from which I've taken the first row for testing purposes, so the test matrix is 1x63000, which is pretty big. Every time I try to multiply it by itself, using
a %*% b
I get:
Error in fooB %*% fooB : non-conformable arguments
However, this works with smaller matrices. Are there any packages for handling mathematical operations on large matrices, or is there a trick I'm missing somewhere?
cheers

You are looking for the cross product of a with itself, i.e. a %*% t(a), and there is a base R function for this. Try:
tcrossprod(a)
(Note that crossprod(a) computes t(a) %*% a instead, which for a 1x63000 row would be an enormous 63000x63000 matrix. The original a %*% a fails because the inner dimensions, 63000 and 1, don't match.)
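For illustration, a minimal sketch with a small 1x5 row matrix standing in for the 1x63000 one:
a <- matrix(runif(5), nrow = 1)  # 1x5 row matrix (stand-in for 1x63000)
# a %*% a                        # fails: non-conformable arguments
a %*% t(a)                       # 1x1 matrix: the sum of squares of the row
tcrossprod(a)                    # same result in one call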

Related

R: difference between apply(object, 1, function(x) sum(x-a)/b) and rowsums((object-a)/b)

I'm new to R and am struggling with the apply function. It is really slow to execute and I was trying to optimize some code I received.
I am trying to do some matrix operations (element-wise multiplication and division on ~10^6 element matrices) then sum the rows of the resulting matrix. I found the fantastic library Rfast and it executes what I thought was the same code in about 1/30 the time, but I am getting systematic differences between my 'optimized' answer and the previous answer.
The original code was something along the lines of
ans <- apply(object, 1, function(x) sum((x - a) / b))
and my code is
ans <- Rfast::rowsums((object - a) / b)
I'm not sure if it's because one of the methods is throwing away precision or making rounding errors - any thoughts?
Edit
Trying to reproduce the error is pretty hard...
I have been able to isolate the discrepancy to when I divide by my vector b, with entries each ~3000 (e.g. [3016.460436, 3021.210321, 3033.3303219]). If I take this term out, the two methods give the same answer.
I then tried two methods to improve my answer, one was dividing b by 1000 then dividing the sum by 1000 at the end. This didn't work, presumably because the float precision is the same either way.
I also tried forcing my b vector to be integers, which also didn't work.
Sample data doesn't reproduce my error either, which is frustrating...
objmat = rbind(rep(c(1,0,0),1000),rep(c(0,0,1),1000))
amat = rbind(rep(c(0.064384654, 0.025465132, 0.36543214),1000))
bmat = rbind(rep(c(1016.460431,1021.210431,1033.330431),1000))
ans = apply(objmat,1,function(x) sum((x-amat)/bmat))
gives
ans[1] = 0.5418828413
rowsums((objmat[1,]-amat)/bmat) = 0.5418828413
I think it has to be a floating point precision error, but I'm not sure why my dummy data doesn't reproduce it, or which method (apply or rowsums) would be more accurate!
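The question is left open above, but the floating-point suspicion is plausible: apply sums each row's terms in one order (and base R's sum accumulates in extended precision where the platform provides it), while an optimized routine such as Rfast's may accumulate in a different order or precision. A minimal sketch of how accumulation order alone can change a result:
x <- c(1e20, 1, -1e20)
sum(x)              # 0: the 1 is absorbed before the large terms cancel
sum(x[c(1, 3, 2)])  # 1: cancelling the large terms first preserves the 1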

Huge diagonal matrix in R

The following code causes a memory error:
diag(1:100000)
Is there any alternative for diag which allows producing a huge diagonal matrix?
Longer answer: I suggest not creating a diagonal matrix, because in most situations you can do without it. To make that clear, consider the most typical matrix operations:
Multiply the diagonal matrix D by a vector v to produce Dv. Instead of maintaining a matrix, keep your "matrix" as a vector d of the diagonal elements, and then multiply d elementwise by v. Same result.
Invert the matrix. Again, easy: take the reciprocal of each element of d (of course, this elementwise reciprocal is the correct matrix inverse only because the matrix is diagonal).
Various decompositions/eigenvalues/determinants/trace. Again, these can all be computed from the vector d: the eigenvalues are the entries of d, the determinant is their product, and the trace is their sum.
In short, though it requires a bit of attention in your code, you can always represent a diagonal matrix as a vector, and that should solve your memory issues.
Shorter answer: Having said all that, people have of course already implemented these steps implicitly using sparse matrices, which handle the above under the hood. In R, the Matrix package is nice for sparse matrices: https://cran.r-project.org/web/packages/Matrix/Matrix.pdf
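A minimal sketch of both approaches, using the Matrix package's Diagonal() constructor, which stores only the diagonal entries:
library(Matrix)
d <- as.numeric(1:100000)   # the diagonal, kept as a plain vector
v <- runif(100000)
dv <- d * v                 # D %*% v via the vector representation
D <- Diagonal(x = d)        # sparse: stores 1e5 numbers, not 1e10
all.equal(as.numeric(D %*% v), dv)  # TRUE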

Output of parApply different from my input

I am still quite new to R (I used to program in Matlab) and I am trying to use the parallel package to speed up some calculations. Below is an example in which I calculate the rolling standard deviation of a matrix (by column) using the zoo package, with and without parallelising the code. However, the shapes of the outputs come out different.
# load library
library('zoo')
library('parallel')
library('snow')
# Data
z <- matrix(runif(100000,0,1),100,1000)
#This is what I want to calculate with timing
system.time(zz <- rollapply(z,10,sd,by.column=T, fill=NA))
# Trying to achieve the same output with parallel computing
cl<-makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
system.time(yy <-parCapply(cl,z,function(x) rollapplyr(x,10,sd,fill=NA)))
stopCluster(cl)
My first output zz has the same dimensions as the input z, whereas the output yy is a vector rather than a matrix. I understand that I can do something like matrix(yy, nrow(z), ncol(z)); however, I would like to know whether I have done something wrong or whether there is a better way to code this. Thank you.
From the documentation:
parRapply and parCapply always return a vector. If FUN always returns
a scalar result this will be of length the number of rows or columns:
otherwise it will be the concatenation of the returned values.
And:
parRapply and parCapply are parallel row and column apply functions
for a matrix x; they may be slightly more efficient than parApply but
do less post-processing of the result.
So, I'd suggest you use parApply.
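A minimal sketch of that suggestion, reusing the setup from the question; with MARGIN = 2, parApply applies the function over columns and, because the function returns a full-length column here, it reassembles a matrix with the same shape as z:
library(zoo)
library(parallel)
z <- matrix(runif(100000, 0, 1), 100, 1000)
cl <- makeCluster(4)              # PSOCK cluster from the parallel package
clusterEvalQ(cl, library(zoo))    # workers need zoo for rollapplyr
yy <- parApply(cl, z, 2, function(x) rollapplyr(x, 10, sd, fill = NA))
stopCluster(cl)
dim(yy)  # 100 x 1000, matching dim(z)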

Translating MatLab to R - array multiplication & complex conjugate transposition

I'm trying to translate code from Matlab into R, but I'm stuck on the following line:
SqO=U.* sqrt(D)*V'
I feel like I'm close:
SqO<-Conj(t(U%*%sqrt(D)*V))
...but the output still isn't matching up. All the variables (SqO, U, D, and V) are 20x20 matrices, if that helps.
Hmmm, I'm no expert in R, but I do know a bit of Matlab. In Matlab the sub-expression
U.* sqrt(D)
does an element-by-element multiplication of U and the square root of D. That is, element (i,j) in U is multiplied by element (i,j) in sqrt(D); so this is not the usual matrix multiplication. Is that what your U%*%sqrt(D) does? Note also that sqrt(D) operates on the individual elements: it is not the matrix square root, i.e. in general sqrt(D)*sqrt(D) ~= D under matrix multiplication.
Then the Matlab code multiplies the result of the previous operation by the transpose of V (if V is a real array); again my R is too weak to know whether you've done this or an equivalent operation.
From what HighPerformanceMark wrote, the translation should be:
SqO = U.*sqrt(D)*V'            % Matlab
SqO <- (U * sqrt(D)) %*% t(V)  # R
The parentheses matter: in R the %*% operator binds tighter than *, so U * sqrt(D) %*% t(V) would compute U * (sqrt(D) %*% t(V)), whereas Matlab evaluates (U.*sqrt(D))*V' left to right. If V is complex, use Conj(t(V)) in place of t(V) for the conjugate transpose.
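A quick sanity check with toy 2x2 matrices (hypothetical values), confirming that the parenthesized R form follows Matlab's evaluation order:
U <- matrix(c(1, 2, 3, 4), 2, 2)
D <- matrix(c(4, 9, 16, 25), 2, 2)   # entries with exact square roots
V <- matrix(c(1, 0, 1, 1), 2, 2)
SqO <- (U * sqrt(D)) %*% t(V)        # (U .* sqrt(D)) * V' in Matlab terms
SqO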

Heatmap function in R dendrogram failure

For the life of me I cannot understand why this method is failing, I would really appreciate an additional set of eyes here:
heatmap.2(TEST,trace="none",density="none",scale="row",
ColSideColors=c("red","blue")[data.test.factors],
col=redgreen,labRow="",
hclustfun=function(x) hclust(x,method="complete"),
distfun=function(x) as.dist((1 - cor(x))/2))
The error that I get is:
row dendrogram ordering gave index of wrong length
If I don't include the distfun, everything works really well and is responsive to the hclust function. Any advice would be greatly appreciated.
The standard call to dist computes the distances between the rows of the matrix it is given, while cor computes the correlations between the columns. So for the above example to work, you need to transpose the matrix inside distfun:
heatmap.2(TEST,trace="none",density="none",scale="row",
ColSideColors=c("red","blue")[data.test.factors],
col=redgreen,labRow="",
hclustfun=function(x) hclust(x,method="complete"),
distfun=function(x) as.dist((1 - cor( t(x) ))/2))
should work. If you use a square matrix, you'll get code that works, but it won't be calculating what you think it is.
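A quick check of that row/column distinction, using a hypothetical 10x20 matrix; the transposed version yields one distance per pair of rows, which is what the row dendrogram needs:
x <- matrix(rnorm(200), nrow = 10)      # 10 rows, 20 columns
d_cols <- as.dist((1 - cor(x)) / 2)     # 20x20: distances between columns
d_rows <- as.dist((1 - cor(t(x))) / 2)  # 10x10: distances between rows
attr(d_cols, "Size")  # 20
attr(d_rows, "Size")  # 10, matching nrow(x)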
This is not reproducible yet ...
TEST <- matrix(runif(100),nrow=10)
heatmap.2(TEST, trace="none", density="none",
scale="row",
labRow="",
hclust=function(x) hclust(x,method="complete"),
distfun=function(x) as.dist((1-cor(x))/2))
works for me. I don't know what redgreen or data.test.factors are.
Have you tried debug(heatmap.2) or options(error=recover) (or traceback(), although it's unlikely to be useful on its own) to try to track down the precise location of the error?
> sessionInfo()
R version 2.13.0 alpha (2011-03-18 r54865)
Platform: i686-pc-linux-gnu (32-bit)
...
other attached packages:
[1] gplots_2.8.0 caTools_1.12 bitops_1.0-4.1 gdata_2.8.2 gtools_2.6.2
Building on Ben Bolker's reply, your code seems to work if TEST is an n×n matrix and data.test.factors is a vector of n integers. So for example starting with
n1 <- 5
n2 <- 5
n3 <- 5
TEST <- matrix(runif(n1*n2), nrow=n1)
data.test.factors <- sample(n3)
then your code will work. However if n1 and n2 are different then you will get the error row dendrogram ordering gave index of wrong length, while if they are the same but n3 is different or data.test.factors has non-integers then you will get the error 'ColSideColors' must be a character vector of length ncol(x).
