I have a computation loop with adaptive time stepping and I need to store the results at each iteration. In other words, I do not know the vector size before the computation, so I cannot preallocate a vector to store the data. Right now, I build a vector using the push! function:
function comp_loop(model_time)
    run_time = 0.0 # elapsed model time so far
    clock = [0.0]
    data = [0.0]
    while run_time < model_time
        # Calculate new timestep
        timestep = rand(Float64) # Be sure to add Random
        run_time += timestep
        # Build vector
        push!(clock, run_time)
        push!(data, timestep)
    end
    return clock, data
end
Is there a more efficient way to go about this? Again, I know the optimal choice is to preallocate the vector, but I do not have that luxury available. Buffers are theoretically not an option either, as I don't know how large to make them. I'm looking for something more "optimal" on how to implement this in Julia (i.e. maybe some advanced feature available in the language).
Theoretically, you can use a linked list such as the one from DataStructures.jl to get O(1) appending. And then optionally write that out to a Vector afterwards (probably in reverse order, though).
In practice, push!ing to a Vector is often efficient enough: Vectors use a doubling strategy to manage their dynamic size, which leads to amortized constant time per append and the advantage of contiguous memory access.
So you could try the linked list, but be sure to benchmark whether it's worth the effort.
Now, the above is about time complexity. When you care about allocation, the argument is quite similar; with a vector, you are going to end up with memory proportional to the next power of two after your actual requirements. Most often, that can be considered amortized, too.
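To make the amortized argument concrete, here is a toy model of a buffer that doubles its capacity when full. It is only a sketch of the doubling idea described above, not Julia's actual Vector implementation; the function name and the starting capacity of 1 are assumptions made for illustration.

```python
def simulate_doubling_appends(n_appends):
    """Count element copies made by a toy buffer that doubles when full."""
    capacity = 1   # slots currently allocated (assumed starting size)
    size = 0       # slots currently in use
    copies = 0     # elements moved during all reallocations
    for _ in range(n_appends):
        if size == capacity:
            copies += size   # growing means copying every stored element
            capacity *= 2
        size += 1
    return copies, capacity


copies, capacity = simulate_doubling_appends(1000)
print(copies, capacity)  # 1023 copies in total, final capacity 1024
```

For 1000 appends the total number of copies stays below 2 * 1000, which is the amortized constant cost per append, and the final capacity is the next power of two above the actual requirement.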
I have been trying to implement some code in Julia JuMP. The idea of my code is that I have a for loop inside my while loop that runs S times. In each of these loops I solve a subproblem and get some variables as well as opt=1 if the subproblem was optimal or opt=0 if it was not optimal. Depending on the value of opt, I have two types of constraints, either optimality cuts (if opt=1) or feasibility cuts (if opt=0). So the intention with my code is that I only add all of the optimality cuts if there are no feasibility cuts for s=1:S (i.e. we get opt=1 in every iteration from 1:S).
What I am looking for is a better way to save the values of ubar, vbar and wbar. Currently I am saving them one at a time with the for-loop, which is quite expensive.
So the problem is that my values of ubar,vbar and wbar are sparse axis arrays. I have tried to save them in other ways like making a 3d sparse axis array, which I could not get to work, since I couldn't figure out how to initialize it.
The below code works (with the correct code inserted inside my <>'s of course), but does not perform as well as I wish. So if there is some way to save the values of 2d sparse axis arrays more efficiently, I would love to know it! Thank you in advance!
ubar2=zeros(nV,nV,S)
vbar2=zeros(nV,nV,S)
wbar2=zeros(nV,nV,S)
while <some condition>
    opts=0
    for s=1:S
        <solve a subproblem, get new ubar,vbar,wbar and opt=1 if optimal or 0 if not>
        opts+=opt
        if opt==1
            # Add opt cut Constraints
            for i=1:nV
                for k=1:nV
                    if i!=k
                        ubar2[i,k,s]=ubar[i,k]
                    end
                end
                for j=i:nV
                    if links[i,j]==1
                        vbar2[i,j,s]=vbar[i,j]
                        wbar2[i,j,s]=wbar[i,j]
                    end
                end
            end
        else
            # Add feas cut Constraints
            #constraint(mas, <constraint from ubar,vbar,wbar> <= 0)
            break
        end
    end
    if opts==S
        for s=1:S
            #constraint(mas, <constraint from ubar2,vbar2,wbar2> <= <some variable>)
        end
    end
end
A SparseAxisArray is simply a thin wrapper on top of a Dict.
It was defined so that when the user creates a container in a JuMP macro, whether they get an Array, a DenseAxisArray or a SparseAxisArray, the containers behave as similarly as possible, so the user does not need to care which one they obtained for most operations.
For this reason we could not just create a Dict, as it behaves differently from an array. For instance, you cannot do getindex with multiple indices as in x[2, 2].
Here you can use either a Dict or a SparseAxisArray, as you prefer.
Both of them have O(1) complexity for setting and getting new elements and a sparse storage which seems to be adequate for what you need.
If you choose SparseAxisArray, you can initialize it with
ubar2 = JuMP.Containers.SparseAxisArray(Dict{Tuple{Int,Int,Int},Float64}())
and set it with
ubar2[i,k,s]=ubar[i,k]
If you choose Dict, you can initialize it with
ubar2 = Dict{Tuple{Int,Int,Int},Float64}()
and set it with
ubar2[(i,k,s)]=ubar[i,k]
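The same idea, sketched in Python for illustration only: a hash map keyed by an (i, k, s) tuple stores just the entries that exist, and copying a subproblem's result means looping over its stored entries rather than over all nV x nV index pairs. The names record, ubar and ubar2 mirror the question but the data here is made up.

```python
# Hypothetical stand-in for one subproblem's sparse result, keyed by (i, k).
ubar = {(1, 2): 0.5, (2, 1): 1.5}

# Sparse three-index store keyed by (i, k, s); nothing is preallocated.
ubar2 = {}


def record(store, values, s):
    """Copy only the stored entries of `values` into `store` under scenario s."""
    for (i, k), value in values.items():
        store[(i, k, s)] = value


record(ubar2, ubar, s=1)
record(ubar2, {(1, 2): 2.0}, s=2)
print(len(ubar2))  # 3 stored entries in total
```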
I have a 3 million x 9 million sparse matrix with several billion non-zero entries. R and Python do not allow sparse matrices with more than MAXINT non-zero entries, which is why I found myself using Julia.
While scaling this data with the standard deviation is trivial, demeaning is of course a no-go in a naive manner as that would create a dense, 200+ terabyte matrix.
The relevant code for doing SVD in Julia can be found at https://github.com/JuliaLang/julia/blob/343b7f56fcc84b20cd1a9566fd548130bb883505/base/linalg/arnoldi.jl#L398
From my reading, a key element of this code is the AtA_or_AAt struct and several of the functions around those, specifically A_mul_B!. Copied below for your convenience
struct AtA_or_AAt{T,S} <: AbstractArray{T, 2}
    A::S
    buffer::Vector{T}
end
function AtA_or_AAt(A::AbstractMatrix{T}) where T
    Tnew = typeof(zero(T)/sqrt(one(T)))
    Anew = convert(AbstractMatrix{Tnew}, A)
    AtA_or_AAt{Tnew,typeof(Anew)}(Anew, Vector{Tnew}(max(size(A)...)))
end
function A_mul_B!(y::StridedVector{T}, A::AtA_or_AAt{T}, x::StridedVector{T}) where T
    if size(A.A, 1) >= size(A.A, 2)
        A_mul_B!(A.buffer, A.A, x)
        return Ac_mul_B!(y, A.A, A.buffer)
    else
        Ac_mul_B!(A.buffer, A.A, x)
        return A_mul_B!(y, A.A, A.buffer)
    end
end
size(A::AtA_or_AAt) = ntuple(i -> min(size(A.A)...), Val(2))
ishermitian(s::AtA_or_AAt) = true
This is passed into the eigs function, where some magic happens, and the output is then processed into the relevant components for SVD.
I think the best way to make this work for a 'centering on the fly' type setup is to do something like subclass AtA_or_AAt with an AtA_or_AAt_centered version that more or less mimics the behavior but also stores the column means, and redefines the A_mul_B! function appropriately.
However, I do not use Julia very much and have already run into some difficulty modifying things. Before I try to dive into this again, I was wondering if I could get feedback on whether this would be considered an appropriate plan of attack, or if there is simply a much easier way of doing SVD on such a large matrix (I haven't seen it, but I may have missed something).
edit: Instead of modifying base Julia, I've tried writing a "Centered Sparse Matrix" package that keeps the sparsity structure of the input sparse matrix, but enters the column means where appropriate in various computations. It's limited in what it has implemented, and it works. Unfortunately, it is still too slow, despite some pretty extensive efforts to try to optimize things.
After much fiddling with the sparse matrix algorithm, I realized that distributing the multiplication over the subtraction was dramatically more efficient:
Suppose our centered matrix Ac is formed from the original n x m matrix A and its m x 1 vector of column means M, using an n x 1 vector of ones that I will just call 1. We are multiplying by an m x k matrix X:
Ac := A - 1M'
AcX = (A - 1M')X
    = AX - 1M'X
And we are basically done. Stupidly simple, actually.
AX can be carried out with the usual sparse matrix multiplication function, M'X is a dense vector-matrix product that yields a single 1 x k row, and the vector of 1's "broadcasts" (to use Julia's terminology) that row to each row of the AX intermediate result. Most languages have a way of doing that broadcasting without materializing the extra memory allocation.
This is what I've implemented in my package for AcX and Ac'X. The resulting object can then be passed to algorithms, such as the svds function, which only depend on matrix multiplication and transpose multiplication.
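A small pure-Python sketch of the AX - 1M'X identity, for illustration only. It is not the package's code: the dict-of-coordinates sparse format, the function names and the tiny example matrix are all assumptions chosen so the result can be checked by hand.

```python
def column_means(entries, n_rows, n_cols):
    """Column means of a sparse matrix given as {(row, col): value}."""
    sums = [0.0] * n_cols
    for (_, j), value in entries.items():
        sums[j] += value
    return [total / n_rows for total in sums]


def centered_matmul(entries, n_rows, n_cols, X):
    """Return (A - 1 M') X without ever forming the dense centered matrix."""
    k = len(X[0])
    means = column_means(entries, n_rows, n_cols)
    # Sparse product AX: only the stored non-zeros of A are touched.
    result = [[0.0] * k for _ in range(n_rows)]
    for (i, j), value in entries.items():
        for c in range(k):
            result[i][c] += value * X[j][c]
    # Dense 1 x k row M'X, computed once.
    shift = [sum(means[j] * X[j][c] for j in range(n_cols)) for c in range(k)]
    # Subtract that row from every row of AX (the broadcast of the ones vector).
    for row in result:
        for c in range(k):
            row[c] -= shift[c]
    return result


# 3 x 2 matrix with two non-zeros; its column means are 1.0 and 2.0.
A = {(0, 0): 3.0, (2, 1): 6.0}
X = [[1.0], [1.0]]
product = centered_matmul(A, 3, 2, X)
print(product)  # [[0.0], [-3.0], [3.0]]
```

The dense centered matrix here would be [[2, -2], [-1, -2], [-1, 4]], and multiplying it by X gives the same three values, but the sketch only ever stores the two non-zeros plus one k-length row.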
I am trying R package apcluster on a set of objects that I want to cluster, but I'm running into performance/memory problems, and I suspect I'm not doing it right. I'd like to hear your opinion, please.
In short: I have a set of about 13000 objects. Each object is associated with a set of 2 to 5 'features'. The similarity (by which I want to cluster, eventually) between any two objects i and j is equal to the number of features they have in common divided by the total number of distinct features they 'span'. E.g. if i = {a,b,c} and j = {c,d}, then sim[i,j] = 1/4 = 0.25, because they have only 1 feature in common ({c}) and in total they describe 4 distinct features ({a,b,c,d}).
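As a worked instance of that definition, a short Python sketch (the function name is made up, and returning 0.0 for two empty feature sets is an assumption the original does not cover):

```python
def similarity(features_i, features_j):
    """Shared features divided by the distinct features the two objects span."""
    shared = len(features_i & features_j)
    spanned = len(features_i | features_j)
    return shared / spanned if spanned else 0.0


sim = similarity({"a", "b", "c"}, {"c", "d"})
print(sim)  # 0.25
```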
Calculating my NxN similarity matrix is not a problem in theory: it can be done using set operations if each object's features are stored as a list; or features can be pivoted to a matrix of 1's and 0's, where each column is a feature, and then R's function dist with method="binary" does the trick.
In practice however, the first problem is that such similarity calculations are extremely slow. For 13 K objects, there are about 84.5 M similarities to calculate, but this doesn't sound so bad for a modern computer. I don't understand why it should take a few hours to do that. And the set operation version, that should be quicker as far as I can tell, is actually much slower than dist. [Another package called fingerprint is supposed to deal with such cases more efficiently, but so far I haven't been able to make it work, it gives a lot of errors when trying to make what they call 'featvec' objects].
The other thing to consider is that the 2-5 features per object are not very repetitive. There may be a group of 100 or so objects with at least one feature in common between them, but then none of the other 12.9 K objects has any feature in common with these 100 objects. The consequence is that the pivoted feature matrix is very sparse (if we consider 0's as empty). There are about 4000 columns in the pivoted matrix, and each row has at most 5 1's. I wonder if this is negatively impacting the performance of dist, in that it has to multiply through a lot of 0's that could instead be ignored.
Does it seem normal to you that it should take a few hours to apply dist to a matrix like the one I described? Can you suggest a different way to calculate the similarity that takes advantage of the sparseness of the matrix?
Anyway, I managed to get the output from dist, which however had class 'dist', and was a distance matrix, not a similarity one, so I had to use 1 - as.matrix(distance_matrix) to be able to make the similarity matrix apcluster needs as input.
That's when I got the first 'memory' problem. R said the vector could not be allocated due to its size. I tried the usual tricks, but in the end I could not get more than 4 GB, and my matrices are (apparently) bigger.
I overcame this by assigning each new matrix back to its old 'self', overwriting the previous object instead of keeping both in memory.
And then when I submitted this painstakingly put together similarity matrix to apcluster, again the vector size error popped up, as if the first thing apcluster did was create some other large object from what I had fed it.
I had a look at as.Sparse... in apcluster, but it does not seem to help a lot, considering that you have to calculate the full matrix first anyway.
In the end the only thing that worked a little bit was 'leveraged affinity propagation' by apclusterL, which however is an approximation.
Does anybody know if and how I could do this better? E.g. is it wise to pivot the data first, or should I stick to list and set operations? Or, can the fact that the initial matrix is sparse be used to compute directly a sparse similarity matrix, rather than compute it fully and reduce it to sparse later?
Any advice would be greatly appreciated. Thanks!
BTW, yes, I saw this thread: "Cluster Analysis in R on large sparse matrix", which does not seem to have been answered conclusively.
The R interpreter is really slow.
So you should use R mostly to "drive" your program, but implement all the computation-heavy stuff in C or FORTRAN.
You didn't show the code you are using, but I guess it involves nested for loops? Try to rewrite it without any for loops in R, or rewrite it in C.
But no matter what, AP clustering will always remain very slow. It involves many passes over O(n²) matrices, i.e. it scales very badly.
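On the question of computing a sparse similarity matrix directly: one possible approach, sketched here in Python rather than R and only as an assumption about what would fit the data described, is to group objects by feature first. Only pairs that share at least one feature can have a non-zero similarity, so all other pairs are never visited. The function name and the sample objects are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def sparse_similarities(objects):
    """objects: {name: set of features}. Returns {(a, b): sim} for sim > 0 only."""
    # Inverted index: feature -> names of the objects that carry it.
    by_feature = defaultdict(list)
    for name, features in objects.items():
        for feature in features:
            by_feature[feature].append(name)
    # Count shared features, visiting only pairs that co-occur under a feature.
    shared = defaultdict(int)
    for names in by_feature.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    # |i and j| / |i or j|, with the union size derived from the set sizes.
    return {
        (a, b): count / (len(objects[a]) + len(objects[b]) - count)
        for (a, b), count in shared.items()
    }


objects = {"i": {"a", "b", "c"}, "j": {"c", "d"}, "k": {"x"}}
sims = sparse_similarities(objects)
print(sims)  # {('i', 'j'): 0.25}
```

With 2 to 5 features per object and little overlap between groups, the number of visited pairs would be far below the full N x N count, though whether that is fast enough in practice would need benchmarking.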
I'm new to OpenCL and I'm trying to implement the power iteration method (described over here) for matrix sizes over 100000x100000!
Actually I have no idea how to implement this.
That's because work-groups are restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one work-group with 1000000 work-items).
But on each step of the iteration I need to synchronize and normalize the vector.
1) So is it possible to make all calculations inside one kernel? (I think the answer is no if the matrix size is more than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I make a "while" loop in the host code? And is it still profitable to use the GPU in this case?
something like:
while (condition)
{
    kernel calling
    synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (Matrix Multiplication and Vector Normalization) I'd personally start with two different kernels that are invoked from the host in a while-loop.
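The structure of that host-side loop can be sketched in plain Python, with two ordinary functions standing in for the two kernels. This only illustrates the control flow, not OpenCL code; the function names, the tolerance, the all-ones start vector and the example matrix are assumptions.

```python
import math


def matvec(matrix, vector):
    """Stand-in for the matrix-vector multiplication kernel."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Stand-in for the normalization kernel; returns the unit vector and its norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


def power_iteration(matrix, tol=1e-12, max_iters=1000):
    vector = [1.0] * len(matrix)   # assumed start vector
    eigenvalue = 0.0
    for _ in range(max_iters):     # the loop that lives in host code
        product = matvec(matrix, vector)         # "kernel" 1
        vector, estimate = normalize(product)    # "kernel" 2
        converged = abs(estimate - eigenvalue) < tol
        eigenvalue = estimate
        if converged:
            break
    return eigenvalue, vector


eigenvalue, vector = power_iteration([[2.0, 0.0], [0.0, 1.0]])
print(round(eigenvalue, 6))  # 2.0
```

Each pass of the loop corresponds to two kernel launches plus a convergence check on the host, so the synchronization between steps happens naturally between launches.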
1: Since a dense 100000x100000 matrix of float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ) because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: Good luck ;-) and ... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually unlimited. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256, and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL take care of how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): The size of the work groups thus implicitly defines how large your chunks of local memory may be.
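The arithmetic behind those numbers, as a small Python sketch. The helper name is made up, and rounding the global size up to a multiple of the work-group size is an assumption about how you would launch the kernel when the element count does not divide evenly (the kernel would then need to ignore the padding work-items).

```python
def launch_dimensions(n_items, work_group_size):
    """Number of work groups and padded global size needed to cover n_items."""
    groups = -(-n_items // work_group_size)   # ceiling division
    return groups, groups * work_group_size


groups, global_size = launch_dimensions(10_000_000_000, 256)
print(groups)        # 39062500 work groups of 256 work-items each

# Dense 100000 x 100000 matrix of 4-byte floats.
dense_bytes = 100_000 * 100_000 * 4
print(dense_bytes)   # 40000000000 bytes, i.e. 40 GB
```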
One thing I want to do all the time in my R code is to test whether certain conditions hold for a vector, such as whether it contains any or all values equal to some specified value. The R-ish way to do this is to create a boolean vector and use any or all, for example:
any(is.na(my_big_vector))
all(my_big_vector == my_big_vector[[1]])
...
It seems really inefficient to me to allocate a big vector and fill it with values, just to throw it away (especially if the any() or all() call can be short-circuited after testing only a couple of the values). Is there a better way to do this, or should I just give up my desire to write code that is both efficient and succinct when working in R?
"Cheap, fast, reliable: pick any two" is a dry way of saying that you sometimes need to order your priorities when building or designing systems.
It is rather similar here: the cost of the concise expression is the fact that memory gets allocated behind the scenes. If that really is a problem, then you can always write a (compiled?) routine that runs (quickly) along the vector and uses only a pair of values at a time.
You can trade off memory usage versus performance versus expressiveness, but it is difficult to hit all three at the same time.
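What such a routine does can be sketched in Python (as an illustration of the idea, not R code): walk along the values one at a time and stop at the first hit, so no intermediate boolean vector is ever built. The function name and the test data are made up.

```python
import math


def first_match_index(values, predicate):
    """Index of the first element satisfying predicate, or -1; stops at the first hit."""
    for index, value in enumerate(values):
        if predicate(value):
            return index
    return -1


data = [1.0, float("nan")] + [0.0] * 1_000_000
hit = first_match_index(data, math.isnan)
print(hit)  # 1, so only two elements were examined
```

The answer to "does any element match" is then simply whether the returned index is different from -1.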
which(is.na(my_big_vector))
which(my_big_vector == 5)
which(my_big_vector < 3)
And if you want to count them...
length(which(is.na(my_big_vector)))
I think it is not a good idea: R is a very high-level language, so what you should do is follow the standard idioms. This way R developers know what to optimize. You should also remember that because R is a functional and lazy language, it is even possible that a statement like
any(is.na(a))
can be recognized and executed as something like
.Internal(is_any_na,a)