I cannot find any function or package to calculate the null space or QR decomposition of a big.matrix (from library(bigmemory)) in R. For example:
library(bigmemory)
a <- big.matrix(1000000, 1000, type='double', init=0)
I tried the following but got the errors shown. How can I find the null space of a bigmemory object?
a.qr <- Matrix::qr(a)
# Error in as.vector(data) :
# no method for coercing this S4 class to a vector
q.null <- MASS::Null(a)
# Error in as.vector(data) :
# no method for coercing this S4 class to a vector
If you want to compute the full SVD of the matrix, you can use the package bigstatsr to perform the computation by blocks. An FBM stands for a Filebacked Big Matrix and is an object similar to a filebacked big.matrix from package bigmemory.
library(bigstatsr)
options(bigstatsr.block.sizeGB = 0.5)
# Initialize FBM with random numbers
a <- FBM(1e6, 1e3)
big_apply(a, a.FUN = function(X, ind) {
X[, ind] <- rnorm(nrow(X) * length(ind))
NULL
}, a.combine = 'c')
# Compute t(a) * a
K <- big_crossprodSelf(a, big_scale(center = FALSE, scale = FALSE))
# Get v and d where a = u * d * t(v) the SVD of a
eig <- eigen(K[])
v <- eig$vectors
d <- sqrt(eig$values)
# Get u if you need it. It will be the same size as a,
# so I store it as an FBM.
u <- FBM(nrow(a), ncol(a))
big_apply(u, a.FUN = function(X, ind, a, v, d) {
X[ind, ] <- sweep(a[ind, ] %*% v, 2, d, "/")
NULL
}, a.combine = 'c', block.size = 50e3, ind = rows_along(u),
a = a, v = v, d = d)
# Verification
ind <- sample(nrow(a), 1000)
all.equal(a[ind, ], tcrossprod(sweep(u[ind, ], 2, d, "*"), v))
This takes approximately 10 minutes on my computer.
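This SVD also gives you the null space asked about in the question: the (right) null space of a is spanned by the columns of v whose singular values are numerically zero. A minimal sketch; the tolerance below is an assumption, so adjust it for your data:
# Columns of v with numerically zero singular values span the null space of a
tol <- max(dim(a)) * max(d) * .Machine$double.eps
null_space <- v[, d < tol, drop = FALSE]
dim(null_space)  # 1000 x 0 here, since a dense random a has full column rank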
@Mahon @user20650 @F.Privé For clarity, I pinged the bigmemory team and asked:
Essentially, is there an implementation of the QR function (QR decomposition) that works with bigmemory matrices?
I felt it useful to get clarity on the original question asked. @F.Privé: nice answer. Hopefully your answer and their response will help guide people in the future. Their response is below:
Thanks for the note. There is not currently an implementation of the qr decomposition. Ideally, you would implement this using Householder reflections (if the matrix is dense) or Givens rotations (if it is sparse).
The irlba package is compatible with bigmemory. It provides a truncated singular value decomposition. So, if your matrix is relatively sparse, you could truncate at the rank of the matrix. This is probably your best option. If you don't know the rank then you can use the package to update the truncation iteratively.
Please note that if your matrix is tall and skinny, or short and fat, then the SO solution is OK. However, any time you resort to calculating the cross-product you lose some numerical stability. This can be an issue if you are planning on inverting the matrix.
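If you prefer to stay with bigstatsr's FBM objects (as in the answer above) rather than combining bigmemory with irlba, a truncated SVD is also available via big_randomSVD. A minimal sketch; the matrix size and the truncation level k are made up:
library(bigstatsr)
a <- FBM(1e5, 100, init = rnorm(1e5 * 100))
# partial (truncated) SVD, computed iteratively without forming the full cross-product
svd_k <- big_randomSVD(a, fun.scaling = big_scale(center = FALSE, scale = FALSE), k = 20)
str(svd_k)  # $d, $u, $v, as for a regular truncated SVD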
Related
I am trying to write my first function in R to calculate emittance using Plank's function for different temperatures. I can do it manually as below for temperatures from 200 to 310 K.
pi <- 3.141593
h <- 6.626068963e-34
c <- 2.99792458e+8
lambda <- 4 * 1e-6
k <- 1.38e-23
t <- c(200:310)
a <- (2*pi*(c^2)*h)/(lambda^5)
b <- exp((h*c)/(lambda*k*t))
B <- a * (1/(b-1))
Where B is the vector of values I want.
Now here is an effort to write a function in R:
P_function <- function(t, pi = 3.141593, h = 6.626068963e-34, c = 2.99792458e+8,
lambda = 4 * 1e-6, k = 1.38e-2) {
((2*pi*(c^2)*h)/(lambda^5)) *((1/(exp((h*c)/(lambda*k*t))-1)))
}
Now for different values of t (200-300K), how do I implement this function?
Couple of problems. First, pi is already a defined constant at better precision than you are using.
> rm(pi) # remove your copy
> pi
[1] 3.141593 # default for console printing is only 7 significant digits
> print(pi, digits=18)
[1] 3.14159265358979312 # but there is more "depth" to be had
Second, it makes no sense to put scientific constants in the parameter list. Since they're constant they can be defined in the body. Parameter lists are for items that might vary from situation to situation.
newPfun <- function(t) { h <- 6.626068963e-34
c <- 2.99792458e+8
lambda <- 4 * 1e-6
k <- 1.38e-23
a <- (2*pi*(c^2)*h)/(lambda^5) #pi is already defined
b <- exp((h*c)/(lambda*k*t))
B <- a * (1/(b-1))
return(B) }
This is just your original code "packaged" to accept a vector of temperatures. (And I'm pretty sure that's not the right spelling of the scientist's name; it should be Planck.)
I'm not sure where your second function is flawed. Perhaps a mismatched parenthesis? After trying to duplicate the results with a single expression and failing multiple times, I'm now wondering if it's really a problem with numerical overflow (or underflow). (Note also that the function's default k = 1.38e-2 drops the e-23 exponent used in the manual calculation, which by itself would change the results.)
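A quick usage sketch for the temperature range from the question:
temps <- 200:310
B <- newPfun(temps)  # vectorized over the whole temperature vector
head(B)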
Hey everyone, I have a large matrix X with dimensions 654 x 7095. I want to subset this matrix and replace the values of the subsetted part of X with another matrix that I have created. The R code is as follows:
install.packages("Matrix")
install.packages("base")
library(Matrix)
library(base)
T = 215
n = 3
k = 33
X = matrix(0,T*n,T*k)
IN = diag(n)
K1 = Matrix(0, n*n, n*(n-1)/2, sparse = TRUE)
for(i in 1:(n-1)){
K1[(2+(i-1)*(n+1)):(i*n), (1+(i-1)*(n-i/2)):(i*(n-i)*(i+1)/2)] <- diag(n-i)
}
yin = matrix(rnorm(645), ncol = 3)
Xu = matrix(rnorm(2150), ncol = 10)
# Up to this point I have only defined the variables and matrices that will be used in the subsetting.
The code above is perfectly fine; however, the code below throws an error:
#Loop for X subsetting
for(i in 1:T){
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- cbind( (t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN))))
}
# kronecker() computes the Kronecker tensor product of two matrices A and B; it is provided by the base package.
When I run the code above, the error shown is:
Error in X[(((i - 1) * n) + 1):(i * n), ] <- cbind((t(kronecker(yin[i, :
number of items to replace is not a multiple of replacement length
However, when I run the same computation in MATLAB, it works perfectly fine. MATLAB code:
X = zeros(T*n,T*k);
for i = 1:T
X((i-1)*n+1:i*n,(i-1)*k+1:i*k) = [kron(yin(i,:),IN)*K1, kron(Xu(i,:),IN)];
end
MATLAB fills in the values for the rows and columns defined in the loop for subsetting X; I have attached a snapshot of the desired MATLAB output. In R, however, the same operation throws the error above.
Can someone enlighten me as to where I am going wrong with the R code?
I appreciate the help, Many thanks.
I think the problem is how the class 'dgeMatrix' is handled. Try
for (i in 1:T) {
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- as.matrix(cbind((t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN)))))
}
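An equivalent fix (a sketch, not tested on the full data) is to coerce K1 to a base matrix once, before the loop, so the matrix product and cbind() already return a plain matrix:
K1 <- as.matrix(K1)  # drop the sparse Matrix class up front
for (i in 1:T) {
  X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <-
    cbind(t(kronecker(yin[i,], IN)) %*% K1, t(kronecker(Xu[i,], IN)))
}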
I'm attempting to apply the sweep function to a sparse matrix (dgCMatrix). Unfortunately, when I do that I get a memory error. It seems that sweep is expanding my sparse matrix to a full dense matrix.
Is there an easy way to perform this operation without it blowing up my memory?
This is what I'm trying to do.
sparse_matrix <- sweep(sparse_matrix, 1, vector_to_multiply, '*')
I'm working with a big and very sparse dgTMatrix (200k rows and 10k columns) in an NLP problem. After hours of thinking about a good solution, I created an alternative sweep function for sparse matrices. It is very fast and memory efficient: it took just 1 second and less than 1 GB of memory to multiply all matrix rows by a vector of weights. For margin = 1 it works for both dgCMatrix and dgTMatrix.
Here it follows:
sweep_sparse <- function(x, margin, stats, fun = "*") {
    f <- match.fun(fun)
    if (margin == 1) {
        idx <- x@i + 1  # @i holds zero-based row indices of the non-zero entries
    } else {
        idx <- x@j + 1  # @j (dgTMatrix only) holds zero-based column indices
    }
    x@x <- f(x@x, stats[idx])  # apply fun to the non-zero values only
    return(x)
}
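A small usage sketch (the example matrix and weight vector below are made up):
library(Matrix)
m <- rsparsematrix(5, 4, density = 0.5)  # a random dgCMatrix
w <- 1:5                                 # one weight per row
sweep_sparse(m, margin = 1, stats = w)   # multiplies row i by w[i]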
I second @user20650's recommendation to use direct multiplication of the form mat * vec, which multiplies every column of your matrix mat with your vector vec by implicitly recycling vec.
Processing time profiling
I understand that your main requirement here is memory, but it's interesting to perform a microbenchmark comparison of the sweep and direct multiplication methods for both a dense and a sparse matrix:
# Sample data
library(Matrix)
set.seed(2018)
mat <- matrix(sample(c(0, 1), 10^6, replace = T), nrow = 10^3)
mat_sparse <- Matrix(mat, sparse = T)
vec <- 1:dim(mat)[1]
library(microbenchmark)
res <- microbenchmark(
sweep_dense = sweep(mat, 1, vec, '*'),
sweep_sparse = sweep(mat_sparse, 1, vec, '*'),
mult_dense = mat * vec,
mult_sparse = mat_sparse * vec
)
res
Unit: milliseconds
expr min lq mean median uq max
sweep_dense 8.639459 10.038711 14.857274 13.064084 18.07434 32.2172
sweep_sparse 116.649865 128.111162 162.736864 135.932811 155.63415 369.3997
mult_dense 2.030882 3.193082 7.744076 4.033918 7.10471 184.9396
mult_sparse 12.998628 15.020373 20.760181 16.894000 22.95510 201.5509
library(ggplot2)
autoplot(res)
On average, the operations involving a sparse matrix are actually slightly slower than the ones with a dense matrix. Note, however, that direct multiplication is faster than sweep.
Memory profiling
We can use profmem to profile the memory usage of the different approaches.
library(profmem)
mem <- list(
sweep_dense = profmem(sweep(mat, 1, vec, '*')),
sweep_sparse = profmem(sweep(mat_sparse, 1, vec, '*')),
mult_dense = profmem(mat * vec),
mult_sparse = profmem(mat_sparse * vec))
lapply(mem, function(x) utils:::format.object_size(sum(x$bytes), units = "Mb"))
#$sweep_dense
#[1] "15.3 Mb"
#
#$sweep_sparse
#[1] "103.1 Mb"
#
#$mult_dense
#[1] "7.6 Mb"
#
#$mult_sparse
#[1] "13.4 Mb"
To be honest, I'm surprised that the memory footprint of the direct multiplication with a sparse matrix is not smaller than that with a dense matrix. Perhaps the sample data are too simplistic. It might be worth exploring this with your actual data (or a representative subset thereof).
I'm trying to optimize an spdep function for my use case, since it is very slow for large data sets. I was doing mostly fine, but I got stuck at one point: finding the trace of my weights matrix for the LM error test. I think the formula is tr[(W' + W) W] (page 82 of Anselin, L., Bera, A. K., Florax, R. and Yoon, M. J. 1996, Simple diagnostic tests for spatial dependence, Regional Science and Urban Economics, 26, 77–104). W is a square weights matrix holding the spatial relation of each observation to the others; tr() is the sum of the diagonal.
In my case, the weights matrix is symmetric and its diagonal is zero, so I thought tr[(W' + W) W] equals 2*sumsq(W), which is super fast to compute. But apparently I am mistaken somewhere, because my result does not match the result of the spdep library, which is likely to be right.
The relevant part of the spdep library is here. Can anybody explain how the result of the following function differs from 2*sumsq(W), or how to make it much faster? This function is where lm.LMtests gets clogged for large data sets.
tracew <- function (listw) {
    dlmtr <- 0
    n <- length(listw$neighbours)
    if (n < 1) stop("non-positive n")
    ndij <- card(listw$neighbours)
    dlmtr <- 0
    for (i in 1:n) {
        dij <- listw$neighbours[[i]]
        wdij <- listw$weights[[i]]
        for (j in seq(length=ndij[i])) {
            k <- dij[j]
            # Luc Anselin 2006-11-11 problem with asymmetric listw
            dk <- which(listw$neighbours[[k]] == i)
            if (length(dk) > 0L && dk > 0L &&
                dk <= length(listw$neighbours[[k]]))
                wdk <- listw$weights[[k]][dk]
            else wdk <- 0
            dlmtr <- dlmtr + (wdij[j]*wdij[j]) + (wdij[j]*wdk)
        }
    }
    dlmtr
}
Additional explanation for those who are not familiar with spdep library of R:
The input of the function, listw, holds a "graph" representation of the weight matrix as two lists of lists. listw$neighbours is a list where each item is a vector of indices of the observations that observation is related to. listw$weights is a list with the same structure as neighbours, except that it holds the weights of those relations.
Thanks in advance for any comments and directions.
# example code
# initiliaze
library(spdep)
library(geosphere)  # distm() used below comes from geosphere
library(multiway)   # sumsq()
# load the tracew function above
data(columbus)
columbus = columbus[rep(row.names(columbus), 20), ] # the difference becomes dramatic when n is high. try not replicating at first to see the results.
# manual calculation, using sumsq
w = distm(cbind(columbus$X, columbus$Y))
w[w > 1000000] = Inf # remove some relations acc. to pre-defined rule
w = 1/(1+w)
diag(w) = 0
w = w / (sum(w) / length(columbus$X)) #"C style" standardization
2*sumsq(w)
# spdep calculation
neighs.band = dnearneigh(cbind(columbus$X, columbus$Y), 0, 1000, longlat = TRUE)
w.spdep = lapply(nbdists(neighs.band, cbind(columbus$X, columbus$Y), longlat = TRUE), function(x) 1/(0.001+x))
my.listw = nb2listw(neighs.band, glist = w.spdep, style="C")
tracew(my.listw)
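One way to see where the two quantities diverge (a sketch, assuming your spdep version exports listw2mat) is to materialize the listw as a dense matrix and compute both expressions from the same W:
W.listw <- listw2mat(my.listw)                 # dense form of the spdep weights
sum(diag((t(W.listw) + W.listw) %*% W.listw))  # tr[(W' + W) W]; should match tracew(my.listw)
2 * sum(W.listw^2)                             # equals the above only if W.listw is symmetric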
I am trying to use cor.test over the rows of 2 matrices, namely cer and par.
cerParCorTest <-mapply(function(x,y)cor.test(x,y),cer,par)
mapply, however, works on columns.
This issue has been discussed in Efficient apply or mapply for multiple matrix arguments by row. I tried the split solution from there (as below)
cer <- split(cer, row(cer))
par <- split(par, row(par))
and it results in the following warning (plus it is slow)
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
I also tried t(par) and t(cer) to get it running over the rows, but it results in the error
Error in cor.test.default(x, y) : not enough finite observations
The matrices are shown below (this is cer; par has the same form):
V1698 V1699 V1700 V1701
YAL002W(cer) 0.01860500 0.01947700 0.02043300 0.0214740
YAL003W(cer) 0.07001600 0.06943900 0.06891200 0.0684330
YAL005C(cer) 0.02298100 0.02391900 0.02485800 0.0257970
YAL007C(cer) -0.00026047 -0.00026009 -0.00026023 -0.0002607
YAL008W(cer) 0.00196200 0.00177360 0.00159490 0.0014258
My question is why transposing the matrices does not work, and what a short solution would be that allows running cor.test() over rows with mapply.
I apologise for the long post and thanks in advance for any help.
I don't know the dimensions of your matrices, but this works fine for me:
N <- 3751 * 1900
cer.m <- matrix(1:N,ncol=1900)
par.m <- matrix(1:N+rnorm(N),ncol=1900)
ll <- mapply(cor.test,
split(par.m,row(par.m)),
split(cer.m,row(cer.m)),
SIMPLIFY=FALSE)
This will give you a list of 3751 elements (the correlation test for each row).
EDIT: Without split, you can pass the row indices directly; this should be fast:
ll <- mapply(function(x,y)cor.test(cer.m[x,],par.m[y,]),
1:nrow(cer.m),
1:nrow(cer.m),
SIMPLIFY=FALSE)
EDIT 2: To get the estimate value, for example:
sapply(ll,'[[','estimate')
You could always just program things in a for loop; it seems reasonably fast at these dimensions:
x1 <- matrix(rnorm(10000000), nrow = 2000)
x2 <- matrix(rnorm(10000000), nrow = 2000)
out <- vector("list", nrow(x1))
system.time(
for (j in seq_along(out)) {
out[[j]] <- cor.test(x1[j, ], x2[j, ])
}
)
user system elapsed
1.35 0.00 1.36
EDIT: If you only want the estimate, I wouldn't store the results in a list, but a simple vector:
out2 <- vector("numeric", nrow(x1))
for (j in seq_along(out)) {
out2[j] <- cor.test(x1[j, ], x2[j, ])$estimate
}
head(out2)
If you want to store all the results and simply extract the estimate from each, then this should do the trick:
> out3 <- as.numeric(sapply(out, "[", "estimate"))
#Confirm they are the same
> all.equal(out2, out3)
[1] TRUE
The tradeoff is that the first method stores all the results in a list, which may be useful for further processing, versus a simpler method that only grabs what you initially want.
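If you keep the full list from the first method, you can still pull out several components afterwards; estimate and p.value are standard fields of a cor.test result:
est  <- sapply(out, "[[", "estimate")  # correlation estimates (named "cor")
pval <- sapply(out, "[[", "p.value")   # corresponding p-values
head(cbind(estimate = unname(est), p.value = pval))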