Are there any software tools for performing arithmetic on very large numbers in parallel? What I mean by parallel is that I want to use all available cores on my computer for this.
The constraints are wide open for me. I don't mind trying any language or tech.
Please and thanks.
It seems like you are either dividing really huge numbers or using a suboptimal algorithm. Parallelizing across a fixed number of cores only improves the constant factors; it has no effect on the asymptotic behavior of your operation. And if you're talking about hours for a single division, asymptotic behavior is what matters most. So I suggest you first make sure your asymptotic complexity is as good as it can be, and then start looking for ways to improve the constants, perhaps by parallelizing.
Wikipedia suggests Barrett division, and GMP has a variant of that. I'm not sure whether what you've tried so far is on a similar level, but unless you are sure that it is, I'd give GMP a try.
See also Parallel Modular Multiplication on Multi-core Processors for recent research. Haven't read into that myself, though.
The only effort I am aware of is a CUDA library called CUMP. However, the library only provides support for addition, subtraction and multiplication. Anyway, you can use multiplication to perform the division on the GPU and check if the quality of the result is enough for your particular problem.
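To illustrate how a division can be carried out with multiplications only, here is a minimal Python sketch. It approximates the reciprocal of the divisor by Newton iteration and then fixes up the quotient with a few additions and subtractions. This is not what GMP or CUMP do internally; the function name divmod_by_multiplication and the choice of the working precision k are assumptions made for this example.

```python
def divmod_by_multiplication(n, d):
    """Return (n // d, n % d) for n >= 0, d > 0 without using the division operator."""
    if n < 0 or d <= 0:
        raise ValueError("expected n >= 0 and d > 0")
    # Working precision in bits: x will approximate 2**k / d.
    k = max(n.bit_length(), d.bit_length()) + 1
    # Initial guess: a power of two that is within a factor of 2 of 2**k / d.
    x = 1 << (k - d.bit_length())
    # Newton step for the reciprocal: x <- x * (2 - d*x / 2**k).
    # The relative error roughly squares each step, so about log2(k) steps suffice.
    for _ in range(k.bit_length() + 1):
        x = 2 * x - ((d * x * x) >> k)
    # Estimate the quotient with one multiplication and a shift.
    q = (n * x) >> k
    r = n - q * d
    # The estimate can be slightly off; correct it with additions/subtractions.
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r


if __name__ == "__main__":
    n = 3 ** 2000 + 12345
    d = 7 ** 300 + 1
    print(divmod_by_multiplication(n, d) == divmod(n, d))
```

Almost all of the time is spent in the big multiplications inside the loop, so any fast (or parallel) multiplication routine speeds up the whole division.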
I am working with a very large dataset, typically dealing with a few million combinations.
I want to solve the assignment problem (maximise the sum).
I tried solving it on a small test set using adagio::assignment and clue::solve_LSAP.
I wasn't able to install the "lpSolve" package on my system; it threw a segmentation fault.
I would like to know which of these is faster, or whether any other method does it faster.
Thanks....
An LP formulation is not a good way to solve the assignment problem, whichever library you use. You have to use the Hungarian algorithm, and it looks like solve_LSAP does exactly that.
No need to try anything else IMHO.
EDIT: An efficient implementation of the Hungarian method should be O(n^3), which is extremely fast for any optimization algorithm. If solve_LSAP is not fast enough for your problem (assuming it is implemented correctly), it is very unlikely that any exact method will work.
You will have to use some sort of heuristic to approximate the solution.
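For reference, a compact O(n^3) Hungarian-style solver can be sketched in Python as follows. This is an illustrative sketch, not the code behind clue::solve_LSAP; the function name solve_assignment_max is hypothetical, the score matrix is assumed to be square, and maximisation is handled by negating the scores and minimising.

```python
def solve_assignment_max(score):
    """Maximise sum(score[i][assignment[i]]) over all one-to-one assignments.

    Returns (total, assignment) where assignment[i] is the column given to row i.
    """
    n = len(score)
    cost = [[-value for value in row] for row in score]  # maximise = minimise negation
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials (1-based)
    v = [0] * (n + 1)      # column potentials (1-based)
    p = [0] * (n + 1)      # p[j] = row currently matched to column j, 0 = free
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            # Find the unused column with the smallest reduced cost.
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Update potentials so that at least one new edge becomes tight.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    total = sum(score[i][assignment[i]] for i in range(n))
    return total, assignment


if __name__ == "__main__":
    print(solve_assignment_max([[4, 1, 3], [2, 0, 5], [3, 2, 2]]))
```

The three nested loops (rows, augmentation steps, columns) are where the cubic running time comes from.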
I need to solve thousands of SMALL linear systems of the type Ax=b. Here A is a matrix no smaller than 3x3 and no larger than 8x8. I am aware of this http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ so I don't think it is smart to invert the matrix even if the matrices are small, right? So what is the most efficient way to do that? I am programming in Fortran, so I should probably use the LAPACK library, right? My matrices are full and in general non-symmetric.
Thanks
A.
Caveat: I didn't look into this extensively, but I have some experience I am happy to share.
In my experience, the fastest way to solve a 3x3 system is to basically use Cramer's rule. If you need to solve multiple systems with the same matrix A, it pays to pre-compute the inverse of A. This is only true for 2x2 and 3x3.
If you have to solve multiple 4x4 systems with the same matrix, then again using the inverse is noticeably faster than the forward- and back-substitution of LU. I seem to remember that it uses fewer operations, and in practice the difference is even larger (again, in my experience). As the matrix size grows, the difference shrinks, and asymptotically it disappears. If you are solving systems with different matrices, then I don't think there is an advantage in computing the inverse.
In all cases, solving the system with the inverse can be much less accurate than using the LU decomposition if A is fairly ill-conditioned. So if accuracy is an issue, then LU factorization is definitely the way to go.
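As a concrete picture of the 3x3 case, here is a minimal Python sketch of Cramer's rule. The helper names det3 and solve3_cramer are made up for this example, and no attempt is made to guard against ill-conditioning, which is exactly the accuracy caveat mentioned above.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3_cramer(A, b):
    """Solve the 3x3 system A x = b with Cramer's rule."""
    det = det3(A)
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    x = []
    for col in range(3):
        # Replace column `col` of A by the right-hand side b.
        Ai = [[b[r] if c == col else A[r][c] for c in range(3)] for r in range(3)]
        x.append(det3(Ai) / det)
    return x


if __name__ == "__main__":
    A = [[2, 1, 1], [1, 3, 2], [1, 0, 0]]
    print(solve3_cramer(A, [4, 5, 6]))
```

For a fixed size like this the loops can be unrolled by hand, which is where most of the speed advantage over a general-purpose routine would come from.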
The LU factorization sounds like just the ticket for you. The LAPACK routine dgetrf will compute it, after which you can use dgetrs to solve the linear system. LAPACK has been optimized to the gills over the years, so in all likelihood you are better off using it than writing any of this code yourself.
The computational cost of computing the matrix inverse and then multiplying that by the right-hand side vector is the same if not more than computing the LU-factorization of the matrix and then forward- and back-solving to find your answer. Moreover, computing the inverse exhibits even more bizarre pathological behavior than computing the LU-factorization, the stability of which is still a fairly subtle issue. It can be useful to know the inverse for small matrices, but it sounds like you don't need that for your purpose, so why do it?
Moreover, provided there are no loop-carried dependencies, you can parallelize this using OpenMP without too much trouble.
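The factor-then-solve split described above can be sketched in plain Python as follows. This only mirrors the division of labour between dgetrf and dgetrs; it is not LAPACK code, and the names lu_factor and lu_solve are chosen for this example. Partial pivoting is included because it is what keeps the elimination reasonably stable.

```python
def lu_factor(A):
    """LU factorization with partial pivoting. Returns (LU, piv).

    L (unit lower triangle) and U (upper triangle) are stored together in LU;
    piv[i] is the index of the original row that ended up in position i.
    """
    n = len(A)
    LU = [[float(value) for value in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot row.
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                  # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]   # eliminate below the pivot
    return LU, piv


def lu_solve(LU, piv, b):
    """Solve A x = b using the factors returned by lu_factor."""
    n = len(LU)
    x = [float(b[piv[i]]) for i in range(n)]      # apply the row permutation
    for i in range(n):                            # forward substitution (L y = P b)
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):                  # back substitution (U x = y)
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


if __name__ == "__main__":
    A = [[2, 1, 1], [1, 3, 2], [1, 0, 0]]
    LU, piv = lu_factor(A)                        # factor once
    for rhs in ([4, 5, 6], [1, 0, 0]):            # reuse for several right-hand sides
        print(lu_solve(LU, piv, rhs))
```

Each system is independent of the others, so a loop over thousands of such small solves has no loop-carried dependency and can be split across threads.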
I have a linear algebra code that I am trying to get to run faster. It's an iterative algorithm with a loop and matrix-vector multiplications within it.
So far, I have used MATMUL (Fortran Lib.) and DGEMV, and tried writing my own MV code in OpenMP, but the algorithm is doing no better in terms of scalability. Speedups are barely 3.5 - 4 irrespective of how many processors I allot to it (I have tried up to 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking the OpenMP implementation of the code (including the matrix-vector product), but it has not helped. Will it help to code in MPI? I am not a pro at MPI, but the ability to fine-tune the message communication might help a bit, though I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multicore environment and using that for your matrix-vector multiplication. The Atlas package, gotoblas (if you have a Nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for Intel CPUs, ACML for AMD, or VecLib for Apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full-time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, just because the vector is smaller than another matrix and so there's less work; but you can still do quite well, and you'll find you get much better performance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.
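One rough way to put numbers on "less work" is to count floating-point operations per matrix or vector element touched. The sketch below does that for square n x n problems; the function names are hypothetical, and the counts assume the idealised case where every operand is read or written exactly once.

```python
def gemv_intensity(n):
    """Flops per element moved for y = A x with an n x n matrix A."""
    flops = 2 * n * n          # one multiply and one add per entry of A
    words = n * n + 2 * n      # read A and x, write y
    return flops / words


def gemm_intensity(n):
    """Flops per element moved for C = A B with n x n matrices."""
    flops = 2 * n ** 3         # n multiply-adds for each of the n*n entries of C
    words = 3 * n * n          # read A and B, write C
    return flops / words


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, gemv_intensity(n), gemm_intensity(n))
```

Under these assumptions the matrix-vector ratio stays below 2 whatever the size, while the matrix-matrix ratio grows linearly with n. That suggests a matrix-vector product tends to be limited by how fast the matrix can be streamed from memory rather than by the number of cores, which would be consistent with a speedup that flattens out early.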
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.
This is a fairly general question about the future of R: Any hope to see a merger of compiler and Rllvm (from Omegahat) or another JIT compilation scheme for R (I know there is Ra, but it has not been updated recently)?
In my tests the speed gain from compiler are marginal for "complicated" functions...
What matters isn't how complicated a function is but what kinds of computations it performs. The compiler will make the most difference for functions dominated by interpreter overhead, such as ones that perform mostly simple operations on scalar or other small data. In cases like that I have seen a factor of 3 for artificial examples and a bit better than a factor of 2 for some production code. Functions that spend most of their time in operations implemented in native code, like linear algebra operations, will see little benefit.
This is just the first release of the compiler and it will evolve over time. LLVM is one of several possible directions we will look at, but probably not for a while. In any case, I would expect using something like LLVM to provide further improvements in cases where the current compiler already makes a difference, but not to add much in cases where it does not.
(Moving from a comment to an answer ...)
This sounds more like a question for the R development mailing list. Based on my general impressions I would say "probably not". Are your complicated functions already based on heavily vectorized (and hence efficient) functions? I think a more promising direction for not-so-easily-automatically-optimized situations is the increased simplicity of embedding C++ etc. (i.e. Rcpp), inline if necessary.
Many numerical algorithms tend to run on 32/64bit floating points.
However, what if you had access to lower precision (and less power hungry) co-processors? How can they be utilized in numerical algorithms?
Does anyone know of good books/articles that address these issues?
Thanks!
Numerical analysis theory uses methods to predict the precision error of operations, independent of the machine they are running on. There are always cases where even on the most advanced processor operations may lose accuracy.
Some books to read about it:
Accuracy and Stability of Numerical Algorithms by N.J. Higham
An Introduction to Numerical Analysis by E. Süli and D. Mayers
If you can't find them or are too lazy to read them, tell me and I will try to explain some things to you. (Well, I'm no expert in this because I'm a Computer Scientist, but I think I can explain the basics to you.)
I hope you understand what I wrote (my English is not the best).
Most of what you are likely to find will be about doing floating-point arithmetic on computers irrespective of the size of the representation of the numbers themselves. The basic issues surrounding f-p arithmetic apply whatever the number of bits. Off the top of my head these basic issues will be:
range and accuracy of numbers that are represented;
careful selection of algorithms which are robust and reliable on f-p numbers rather than on real numbers;
the perils and pitfalls of iterative and lengthy calculations in which you run the risk of losing precision and accuracy.
In general, the fewer bits you have the sooner you run into problems, but just as there are algorithms which are useful in 32 bits, there are algorithms which are useful in 8 bits. Sometimes the same algorithm is useful however many bits you use.
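To make the point about long calculations and robust algorithms concrete, here is a small Python experiment that emulates 16-bit (IEEE 754 half-precision) arithmetic by rounding after every operation. It compares a plain running sum with compensated (Kahan) summation, which is just one example of an algorithm chosen for robustness and is my choice for illustration, not something prescribed above. The helper names are hypothetical, and emulating half precision through the struct module is an assumption of this sketch rather than how a real co-processor would be programmed.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def naive_sum_half(values):
    """Running sum where every addition is rounded to half precision."""
    total = 0.0
    for value in values:
        total = to_half(total + to_half(value))
    return total


def kahan_sum_half(values):
    """Compensated (Kahan) summation, every operation rounded to half precision."""
    total = 0.0
    comp = 0.0                                   # running estimate of the lost low-order part
    for value in values:
        y = to_half(to_half(value) - comp)       # re-inject what was lost so far
        t = to_half(total + y)                   # low-order bits of y may be lost here
        comp = to_half(to_half(t - total) - y)   # recover what was just lost
        total = t
    return total


if __name__ == "__main__":
    values = [0.1] * 10000                       # exact sum is 1000
    print("naive:      ", naive_sum_half(values))
    print("compensated:", kahan_sum_half(values))
```

In the naive loop the running total eventually becomes so large that adding 0.1 is less than half the gap between neighbouring half-precision values, so the total stops changing; the compensated version carries the lost part along and ends up much closer to 1000.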
As @George suggested, you should probably start with a basic text on numerical analysis, though I think the Higham book is not a basic text.
Regards
Mark