I need to solve Ax = b, where A is a symmetric positive semi-definite matrix. This can be done efficiently using the Cholesky decomposition. Because the matrix A will have dimensions of at least 25000 x 25000, I cannot waste memory. Therefore I want to use the in-place version of Julia's cholfact:
cholfact!(A, :U, pivot = true)
Compared to
F = cholfact(A, :U, pivot = true)
this would save Gigabytes of memory.
However, after the computation A is of type Matrix{Float64}, while F has the type CholeskyPivoted{Float64}. As far as I understand, the in-place version loses essential information, such as the pivot vector F.piv. How can I compute the Cholesky decomposition correctly without wasting memory?
You want to combine these two:
F = cholfact!(A, :U, pivot = true)
This returns a CholeskyPivoted, which is indeed what you want. But by using cholfact!, you're saying that you don't care whether A gets destroyed in the process. Consequently, it will use the memory allocated for A for storing the factorization (thus destroying A).
Afterwards, you should only use F, not A, because A has been destroyed. Internally, F will contain a reference to A, since it's storing the factorization in A. This may be clearer if you examine how a CholeskyPivoted is represented; A will be used for that UL field.
Related
I need to make quick calculations (+, *, >) with large 3D arrays (tensors) in R (on the order of 1500 x 150 x 30000). Since these arrays are very sparse (only 0.03% of the entries are non-zero), I first use the as_sptensor function from the 'tensorr' library to convert my tensors to sparse ones, like:
x <- array(data = c(1,0,0,0,0,0,0,1,1,1,1,1) , dim = c(3,2,2))
s <- as_dtensor(x)
s1 <- as_sptensor(s)
And then I do some arithmetic operations, e.g. multiplication:
s1*s1
I also have a total memory limit of 8 GB, which also constrains how I can store the result.
The problem is that when I deal with large tensors like:
A <- some_index_matrix[1:3, 1:1000000]
A2 <- sptensor(A, rep(1,ncol(A)), dims=c(max(A[1,]),max(A[2,]),max(A[3,])))
A2*A2
I fail to get this product result within reasonable time. How can I optimize my code for such calculations to be carried out within seconds?
As a trivial case, let's say I'm interested in calculating:
r = v*M^t
Where r and v are vectors and M is an extremely large sparse matrix.
I can solve it one of two ways:
r = v*(M*M*M*M...)
r = ((((v*M)*M)*M)*M)...
The first approach results in intermediate dense matrices that are impractical to store in RAM (I would need at least tens of terabytes for my target minimum use case). The second, by contrast, always results in intermediate vectors, and it does in fact work in practice.
The problem is that at larger values of t, memory allocations are causing a substantial performance bottleneck.
library(pryr)
n = 20000
v = 1:n
M = matrix(1:(n*n), n) # Not a sparse matrix like my use case, but easier to start with
for (i in 1:10) {
  v = v %*% M        # %*% allocates a new result object on every call
  print(address(v))  # the address changes each iteration: v is not updated in place
}
As the address() function shows, v is being reallocated every iteration. It is not being modified in place. Not only are the memory allocations slowing things down, but according to profvis, the garbage collector is constantly being called as well and taking up a large portion of the time.
So my question is, is there a way to perform this calculation (and potentially others similar to it) in R without the excess memory allocations and gc() calls happening under the hood?
I am trying to decrease the memory footprint of some of my datasets, where I have a small set of factor levels per column (repeated a large number of times). Are there better ways to minimize it? For comparison, this is what I get from just using factors:
library(pryr)
N <- 10 * 8
M <- 10
Initial data:
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 1.95 kB
Using Factors:
test2 <- as.factor(test$A)
object_size(test2)
# 1.33 kB
Aside: I naively assumed that they replaced the strings with a number and was pleasantly surprised to see test2 smaller than test3. Can anyone point me to some material on how to optimize factor representation?
test3 <- data.frame(A = c(rep("1", N), rep("2", N)))
object_size(test3)
# 1.82 kB
I'm afraid the difference is minimal.
The principle would be easy enough: instead of (in your example) 160 strings, you would store just 2, along with 160 integers (which are only 4 bytes each).
Except that R already stores character vectors internally in much the same way.
Every modern language supports strings of (virtually) unlimited length. That creates a problem: you can't store a vector (or array) of strings as one contiguous block, because any element can be reassigned to an arbitrary length. If a somewhat longer value is assigned to one element, the rest of the array would have to be shifted, or the OS/language would have to reserve a large amount of space for each string.
Therefore, strings are stored at whatever place in memory is convenient, and arrays (or vectors in R) are stored as blocks of pointers to the places where the values actually are.
In the early days of R, each pointer pointed to its own place in memory, even if the actual value was the same. So in your example, 160 pointers to 160 memory locations. But that has changed; nowadays it is implemented as 160 pointers to 2 memory locations.
There may be some small differences, mainly because a factor can normally support only 2^31 - 1 levels, meaning 32-bit integers are enough to store it, while a character vector mostly uses 64-bit pointers. Then again, there is more overhead in factors.
Generally, there may be some advantage in using a factor if you really have a large percentage of duplicates, but if that's not the case it may even harm your memory usage.
And the example you provided doesn't really show this, as you're comparing a data.frame with a factor, instead of with the bare character vector.
Even stronger: when I reproduce your example, I only get your results if I set stringsAsFactors to FALSE, so you're comparing a factor to a factor in a data.frame.
Comparing the results otherwise gives a much smaller difference: 1568 bytes for the character vector, 1328 for the factor.
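For reference, here is a minimal sketch of that comparison, reusing the N and M from the question (the exact byte counts will vary with R version and platform):
library(pryr)
N <- 10 * 8
M <- 10
chars <- c(rep(strrep("A", M), N), rep(strrep("B", N), N))  # bare character vector
object_size(chars)          # the repeated strings are shared via R's global string cache
object_size(factor(chars))  # integer codes plus a small levels attribute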
And that only works if you have a lot of repeated values; if you look at this, you see that the factor can even be larger:
> object.size(factor(sample(letters)))
2224 bytes
> object.size(sample(letters))
1712 bytes
So generally, there is no real way to compress your data while still keeping it easy to work with, except for using common sense about what you actually want to store.
I don't have a direct answer for your question, but here is some information from the book "Advanced R" by Hadley Wickham:
Factors
One important use of attributes is to define factors. A factor
is a vector that can contain only predefined values, and is used to
store categorical data. Factors are built on top of integer vectors
using two attributes: the class, “factor”, which makes them behave
differently from regular integer vectors, and the levels, which
defines the set of allowed values.
Also:
"While factors look (and often behave) like character vectors, they
are actually integers. Be careful when treating them like strings.
Some string methods (like gsub() and grepl()) will coerce factors to
strings, while others (like nchar()) will throw an error, and still
others (like c()) will use the underlying integer values. For this
reason, it’s usually best to explicitly convert factors to character
vectors if you need string-like behaviour. In early versions of R,
there was a memory advantage to using factors instead of character
vectors, but this is no longer the case."
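To illustrate the quoted description, here is a small sketch (my own example, not from the book) showing that a factor really is an integer vector with two attributes:
f <- factor(c("low", "high", "high", "low"))
typeof(f)         # "integer": the underlying storage
attributes(f)     # the "levels" and "class" attributes described above
as.integer(f)     # the integer codes
as.character(f)   # explicit conversion back to strings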
There is a package in R called fst (Lightning Fast Serialization of Data Frames for R), with which you can create compressed fst objects for your data frame. A detailed explanation can be found in the fst package manual, but I'll briefly explain how to use it and how much space an fst object takes. First, let's make your test dataframe a bit larger, as follows:
library(pryr)
N <- 1000 * 8
M <- 100
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 73.3 kB
Now, let's convert this dataframe into an fst object, as follows:
install.packages("fst") #install the package
library(fst) #load the package
path <- paste0(tempfile(), ".fst") #create a temporary '.fst' file
write_fst(test, path) #write the dataframe into the '.fst' file
test2 <- fst(path) #load the data as an fst object
object_size(test2)
# 2.14 kB
The disk space for the created .fst file is 434 bytes. You can deal with test2 as a normal dataframe (as far as I tried).
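As a possible follow-up (a small sketch, assuming the same path as above): because the data stays on disk, you can also read back only the rows and columns you need, for example:
library(fst)
# Read only column A, rows 1 to 10, without loading the whole file into memory
head_of_A <- read_fst(path, columns = "A", from = 1, to = 10)
object_size(head_of_A)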
Hope this helps.
The following code causes a memory error:
diag(1:100000)
Is there any alternative for diag which allows producing a huge diagonal matrix?
Longer answer: I suggest not creating a diagonal matrix at all, because in most situations you can do without one. To make that clear, consider the most typical matrix operations:
Multiply the diagonal matrix D by a vector v to produce Dv. Instead of maintaining a matrix, keep your "matrix" as a vector d of the diagonal elements, and then multiply d elementwise by v. Same result.
Invert the matrix. Again, easy: invert each element (of course, only for diagonal matrices is this generally the correct inverse).
Various decompositions/eigenvalues/determinants/trace. Again, these can all be done on the vector d.
In short, though it requires a bit of attention in your code, you can always represent a diagonal matrix as a vector, and that should solve your memory issues.
Shorter answer: Now, having said all that, people have of course already implemented the above steps implicitly using sparse matrices, which handle this under the hood. In R, the Matrix package is good for sparse matrices: https://cran.r-project.org/web/packages/Matrix/Matrix.pdf
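To illustrate both the vector representation and the sparse-matrix route, here is a minimal sketch (the object sizes are indicative only):
library(Matrix)
d <- as.numeric(1:100000)   # the "diagonal matrix", kept as a plain vector
v <- runif(length(d))
Dv      <- d * v            # the product D %*% v, without ever forming D
D_inv_v <- v / d            # solve(D) %*% v, also elementwise
D <- Diagonal(x = d)        # sparse diagonal matrix from the Matrix package
object.size(D)              # on the order of the vector d itself (~800 kB), versus ~80 GB for diag(d)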
I'm looking to preallocate a sparse matrix in R (using simple_triplet_matrix) by providing the dimensions of the matrix, m x n, and also the number of non-zero elements I expect to have. Matlab has the function "spalloc" (see below), but I have not been able to find an equivalent in R. Any suggestions?
S = spalloc(m,n,nzmax) creates an all zero sparse matrix S of size m-by-n with room to hold nzmax nonzeros.
Whereas it may make sense to preallocate a traditional dense matrix in R (in the same way that it is much more efficient to preallocate a regular (atomic) vector rather than growing it one element at a time), I'm pretty sure it will not pay to preallocate sparse matrices in R, in most situations.
Why?
For dense matrices, you allocate and then assign "piece by piece", e.g.,
m[i,j] <- value
For sparse matrices, however, that is very different: if you do something like
S[i,j] <- value
the internal code has to check whether [i,j] is an existing entry (typically non-zero) or not. If it is, it can change the value; otherwise, one way or another, the triplet (i, j, value) needs to be stored, and that means extending the current structure, etc. If you do this piece by piece, it is inefficient, mostly irrespective of whether you did any preallocation.
If, on the other hand, you already know in advance all the [i,j] combinations which will contain non-zeros, you could "pre-allocate", but in this case,
just store the vectors i and j of length nnzero, say. Then use your underlying "algorithm" to also construct a vector x of the same length which contains all the corresponding values, i.e., entries.
Now, indeed, as @Pafnucy suggested, use spMatrix() or sparseMatrix(), two slightly different versions of the same functionality: constructing a sparse matrix, given its contents.
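For instance, a minimal sketch with made-up triplets:
library(Matrix)
i <- c(1, 2, 4, 4, 7)       # row indices of the non-zero entries
j <- c(2, 3, 1, 5, 6)       # column indices
x <- c(10, 20, 30, 40, 50)  # corresponding values
S <- sparseMatrix(i = i, j = j, x = x, dims = c(8, 6))  # built in one go, no per-element assignment
S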
I am happy to help further, as I am the maintainer of the Matrix package.