mpi4py not giving speedup on 2D block decomposition

I have been fiddling with distributed multiprocessing in Python and have whipped up a 2D heat equation solver that uses a block decomposition of the domain. The code produces the same (correct) answers as MPI implementations in C and Fortran.
However, on my machine with a 256x256 grid and 4 processes, I see no speedup at all. In comparison, the C implementation (compiled with mpicc) gives me a nice speedup as expected. The curious thing is that when I run this code on a different machine (with the same number of processes), I get the expected speedup. The mpi4py and serial versions of this code may be found here. I thought it might be due to cache size limitations, but I see similar trends at smaller grid sizes as well. Is there some requirement for mpi4py that I'm missing that's causing the slowdown?

Possible issue in presentation of N² diagram with indexed inputs/outputs?

When visualizing the structure of the circuit tutorial via an N² diagram, I noticed that implicit components with indexed inputs/outputs labelled using the pattern x:y (e.g. I_out:0 of n1) do not display connections into the output of the block (in this case V of n1).
I understand that it computes the residuals from the inputs and some initial "guess" to produce the output, so is this by design for ImplicitComponent, since the connections are implicit? I tend to use the diagrams for debugging, and seeing no connections into the output makes it unclear whether it is actually connected, even though the inputs are fed into it and the code processes it correctly via the residual equation.
This is a known bug in OpenMDAO 2.9.1, but it has already been fixed on OpenMDAO master, so the next release (2.10), due out before the end of Feb 2020, should have the issue fixed.

Numerical method produces platform dependent results

I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with the rugarch package, which is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, both general-purpose nonlinear optimizers) and works fine. I'm testing my method with testthat against a benchmark solution, which was obtained previously by manually running the code under Windows (the main platform on which the package will be used).
Now, I initially had some issues with the solution not being consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (the default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. So I fixed the random seed and disabled parallelization (both "legitimate" sources of variation), and the issue was solved: I get identical results every time under Windows, R CMD CHECK runs, and testthat succeeds.
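For reference, a minimal sketch of the kind of deterministic setup described above, using the standard rugarch entry points (the returns vector here is a hypothetical stand-in for the actual data):

library(rugarch)
set.seed(1)  # pin down any randomized restarts in the solver chain
spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(0, 0))
)
fit <- ugarchfit(spec, data = returns, solver = "hybrid")
coef(fit)  # identical across runs once the seed is fixed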
After that I decided to automate things a little, and the build process is now controlled by Travis. To my surprise, the result under Linux differs from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that the solution is also consistent under Linux. As a temporary fix, I'm setting a tolerance limit that depends on the platform, and the test passes (see latest builds).
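The platform-dependent tolerance itself is a one-liner in the testthat test; a sketch (the 1e-8 and 1e-6 values are illustrative, not the actual ones used):

tol <- if (.Platform$OS.type == "windows") 1e-8 else 1e-6
expect_equal(read_sequence(file_out), read_sequence(file_benchmark), tolerance = tol)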
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs are different and are not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan a public release, so this is not a big deal for my package per se. But I'm bringing it to attention since there may be an issue with one of the solvers, which are quite widely used.
And no, I'm not asking you to fix my code: a platform-dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Or is it possible that I'm expecting too much?
Should I care a lot about this? Is this a common situation?

Parallel differential evolution

I've been playing around with the differential evolution library in R, and I was wondering: is this an algorithm that it makes sense to parallelize? It seems to me that you could split the optimization interval into several segments, run the algorithm on each segment, and then compare the results of each segment and return the minimum.
Yes, it should parallelize. It's not hard to find numerous Google hits on the topic, and the GAUL project on SourceForge even has some code (though it is not ported to R in any way).
Back to R and its DE variants: the best approach would be at the compiled level. I had a go at it using OpenMP in an 'RcppParDE' variant of my RcppDE port of DEoptim, but didn't get it finished.
I understand that the next (current?) DEoptim version has a variant that uses a foreach loop at the R level, which is not ideal but still better than a serial-only approach.
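As a concrete sketch, later DEoptim releases expose this through the parallelType argument of DEoptim.control (check your installed version, since the argument only appeared in newer releases and its accepted values have changed over time):

library(DEoptim)
rosenbrock <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
ctrl <- DEoptim.control(itermax = 200, parallelType = 1)  # 1 = use the 'parallel' package
res <- DEoptim(rosenbrock, lower = c(-5, -5), upper = c(5, 5), control = ctrl)
res$optim$bestmem  # best parameter vector found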

Parallelize Solve() for Ax=b?

Cross-posted on stats.SE, since this problem could straddle both stats.SE and SO:
https://stats.stackexchange.com/questions/17712/parallelize-solve-for-ax-b
I have some extremely large sparse matrices created using the spMatrix function from the Matrix package.
Using the solve() function works for my Ax=b problem, but it takes a very long time: several days.
I noticed that RScaLAPACK (http://cran.r-project.org/web/packages/RScaLAPACK/RScaLAPACK.pdf) appears to have a function that can parallelize solve(); however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed. So:
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My searches on RScaLAPACK suggested that people have had a lot of issues with it.
Thanks.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using them to solve for alpha centrality (http://en.wikipedia.org/wiki/Alpha_centrality). I was originally using the alpha centrality function in the igraph package, but it would crash R.
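For reference, alpha centrality amounts to the sparse linear solve x = (I - alpha * t(A))^{-1} e per the definition on that Wikipedia page, which is roughly what I'm computing; a sketch, with A a sparse adjacency matrix of class dgCMatrix:

library(Matrix)
# Solve (I - alpha * t(A)) x = e for the alpha centrality vector x
alpha_centrality_solve <- function(A, alpha = 0.5, e = rep(1, nrow(A))) {
  M <- Diagonal(nrow(A)) - alpha * t(A)
  as.vector(solve(M, e))  # dispatches to a sparse factorization
}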
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating the condition number and density will take a while; I will post them as they become available.
Will crosspost on stat.SE and will add a link back to here
[Update 1: For those just tuning in: the original question involved parallelizing computations to solve a regression problem; given that the underlying problem is related to alpha centrality, some of the issues, such as bagging and regularized regression, may not be as immediately applicable, though that leads down the path of further statistical discussion.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
Infrastructure
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers: you can replace R's BLAS/LAPACK library with one that supports multithreaded computations, such as ATLAS, Goto BLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating because the number of cores is fixed at compile time rather than at run time. MKL is commercial. Goto is no longer well supported, but it is often the fastest, if only by a slight margin; it's under a BSD license. You can also look at Revolution Analytics' R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away with a simple back-end change. This could give you a 12x speedup (because of the number of cores) or potentially much more (because of a better implementation). If that brings the time down to an acceptable range, then you're done. :) But changing the statistical methods could be even better.
You've not mentioned the amount of RAM available (or its distribution per core or machine), but a sparse solver should be fairly smart about managing RAM access and not try to chew on too much data at once. Nonetheless, if it's on one machine and things are done naively, you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others: biglm addresses solving linear equations (or GLMs, rather) in limited memory, while bigmemory and ff address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found in the CRAN Task View for HPC.
A semi-statistical, semi-computational issue is visualization of your matrix. Try sorting by the support per row and column (identical if the graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how it is shaped, as that affects which computational and statistical methods one could try.
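A sketch of that visualization (assuming A is a sparse matrix from the Matrix package, which supplies an image() method for its sparse classes):

library(Matrix)
ord_r <- order(rowSums(A != 0), decreasing = TRUE)  # support per row
ord_c <- order(colSums(A != 0), decreasing = TRUE)  # support per column
image(A[ord_r, ord_c])  # sparsity pattern after reordering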
Another suggestion: can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. (No need to migrate to EC2 if you already have 96GB of RAM and 12 cores: distributing could speed things up, but that's not the issue here. Just getting 100% utilization out of this machine would be a good improvement.)
Statistical
Next up are several straightforward statistical issues:
BAGGING You could consider sampling subsets of your data to fit the models and then bagging the models. This can give you a speedup, and it allows you to distribute the computation across as many machines and cores as you have available. You can use SNOW along with foreach.
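A minimal sketch of that pattern, using snow as the foreach backend via doSNOW (fit_model, X, and y are hypothetical stand-ins for your fitting function and data):

library(snow); library(doSNOW); library(foreach)
cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)  # register the cluster as the foreach backend
bags <- foreach(i = 1:50) %dopar% {
  idx <- sample(nrow(X), size = floor(0.5 * nrow(X)))  # resample rows
  fit_model(X[idx, ], y[idx])                          # fit on the subsample
}
stopCluster(cl)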
REGULARIZATION The glmnet package supports sparse matrices and is very fast. You would be wise to test it out. Be careful with ill-conditioned matrices and very small values of lambda.
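A sketch of that test, assuming the design matrix A is a sparse dgCMatrix and y is a hypothetical response vector (glmnet accepts sparse matrices directly, no densification needed):

library(glmnet)
fit <- glmnet(A, y, alpha = 0)  # alpha = 0 is ridge; alpha = 1 is the lasso
plot(fit, xvar = "lambda")      # inspect the small-lambda end for instability
cvfit <- cv.glmnet(A, y)        # cross-validated choice of lambda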
RANK Your matrices are sparse, but are they full rank? If not, that could be part of the problem: when matrices are singular or very nearly so, you're in trouble. Check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare (if there's a steep drop-off, e.g. between eval1 and eval2, ..., eval10, ..., you're in trouble). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables that are either collinear or have very low support.
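For the conditioning check, a sketch (condest() in the Matrix package gives a cheap estimate of the 1-norm condition number for a sparse matrix; the dense eigen() call is only feasible on a small submatrix or test case):

library(Matrix)
condest(A)$est  # estimated 1-norm condition number; huge values mean trouble
ev <- eigen(as.matrix(A[1:1000, 1:1000]), only.values = TRUE)$values
plot(abs(ev[1:10]), type = "b")  # look for a steep drop-off after the 1st eigenvalue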
BOUNDING Can you reduce the bandwidth of your matrix? If you can block-diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as upper-bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out reverse Cuthill-McKee, or simply look at how you'd convert the matrix into rectangles, often relating to cliques or much smaller groups). If you have multiple disconnected components, then by all means separate the data into separate matrices.
ALTERNATIVES Are you wedded to alpha centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value and that could be calculated more cheaply, or at least implemented quite efficiently. If one of those will work, your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place for that discussion.
For more statistical perspectives, appropriate Q&A should occur on stats.stackexchange.com (Cross Validated).
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems over the long term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of options for both solvers and preconditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to a user directory; you need not have a package installed in the system-wide directory. If you need to execute something as a user other than yourself, you could also download a package to scratch or temporary space (if you're running within just one R instance but using multiple cores, check out tempdir()).
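A sketch of the user-directory install (the ~/Rlibs path and package name are arbitrary placeholders):

dir.create("~/Rlibs", showWarnings = FALSE)
install.packages("somePackage", lib = "~/Rlibs")  # no admin rights needed
.libPaths(c("~/Rlibs", .libPaths()))              # make it visible in this session
library(somePackage, lib.loc = "~/Rlibs")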

How do programs like mathematica draw graphs and how can I make such a program?

I've been wondering how programs like Mathematica and MATLAB plot graphs of functions so gracefully and quickly. Can anyone explain to me how they do this and, furthermore, how I can do it? Is it related to some area or course in computer programming or math? Which one?
Well, with some encouragement from belisarius, here's my comment as an answer: try looking at matplotlib. From the home page:
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.
It was originally inspired by MATLAB's plotting capabilities, though it's grown a lot since then. It's solid software - and it's open source, under a BSD license, so not only can you read the source, you can hack on it and use it in whatever you like.
Another place you could look is gnuplot. Its license is not one of the common open-source licenses, but it's certainly open source, with some permissions to modify and such.
Gnuplot is a portable command-line driven graphing utility for linux, OS/2, MS Windows, OSX, VMS, and many other platforms. The source code is copyrighted but freely distributed (i.e., you don't have to pay for it). It was originally created to allow scientists and students to visualize mathematical functions and data interactively, but has grown to support many non-interactive uses such as web scripting. It is also used as a plotting engine by third-party applications like Octave. Gnuplot has been supported and under active development since 1986.
It does 3D plotting as well, which matplotlib doesn't, and it's been around a lot longer. The reason I thought of matplotlib first is that it's intended as a library for a higher-level language, not a stand-alone application, so I'm guessing it might be a bit easier for you to read.
One other suggestion, just to get an idea of the sorts of things Mathematica is doing under the hood, is to look at the documentation for Plot. In particular, if you look at the available options, you can deduce things.
MaxRecursion (default: Automatic): the maximum number of recursive subdivisions allowed
Method (default: Automatic): the method to use for refining curves
PerformanceGoal (default: $PerformanceGoal): aspects of performance to try to optimize
PlotPoints (default: Automatic): initial number of sample points
From MaxRecursion and PlotPoints, you can see that it does an initial sampling and then somehow decides which regions need to be subdivided (resampled) to get an accurate view of the plot. From there on, it's magic: there is some Method for this, and a PerformanceGoal to guide it...
For MATLAB, because of its cross-platform requirement, there is no real alternative to using OpenGL. The MATLAB runtime is written in C++, and the non-axis GUI uses Java Swing, so MATLAB's Plot is probably a C++/OpenGL/Swing mixture.
In reality, MATLAB graphics are much less complex than video game graphics. I think it is easier to find tutorials on video game graphics and then "downsize" them to MATLAB functionality, like drawing a single line in a single color.
The most important concept is probably the transformation matrix.
Basically, most programs that plot any type of graph (particularly graphs of reasonable complexity) use some kind of third-party library.
The specific library depends on the programming language being used.
For example:
For a .NET application you might use Crystal Reports: http://en.wikipedia.org/wiki/Crystal_Reports
For Java you might use JFreeChart. http://www.jfree.org/jfreechart/
And so on...
You will likely find numerous libraries for whatever language you decide to code in.
If you want this functionality in your own project, I suggest using a library, especially if you are a beginner. The internal complexities of how these graph libraries are implemented are significant because of issues such as cross-platform compatibility, graphics rendering optimizations (i.e., making sure the graphics render quickly and "prettily"), the math associated with positioning elements on the graph, and so forth.
Lastly, I doubt you will find (or need) specific courses on this subject, as, excluding very specific cases, programmers will almost always use libraries that already exist.
Why code it yourself when someone has already solved the problem for you?
A good place to start is to understand that there is a grammar to graphics, and that what you want to construct upon receiving a plot command is a symbolic representation of the graph. For Mathematica, you can do something like
FullForm[Plot[Sin[x], {x, 0, 2 Pi}]]
to see the internal representation Mathematica uses. Basically, you need to describe the line segments (2D) or meshes (3D) you want to draw in terms of their color and coordinates. There also needs to be information about the scale of the graph and how to draw tick marks, label axes, etc.
This leads us to the heart of the question: how do you determine the line segments to draw from a function and a range? If you dig around in the help file for Plot, you see a few things. First, there are the PlotPoints and MaxRecursion options. This leads me to believe (this is just an educated guess, but it is how I would do it) that Mathematica samples the initial number of points evenly over the range to get starting values. The next step is to identify regions where the change exceeds some threshold and sample more points there, until the "change" between any two points in the line segment is below a threshold. Mathematica does this recursively, hence the MaxRecursion option.
So far I have been pretty vague about defining rate of change. A more useful way to describe change is to take three points on your line segment. Assume a linear relationship between the 1st and 3rd points and, under this assumption, predict the value of the 2nd point. If the prediction error is sufficiently low, consider the next group of three points; if it is above a threshold, sample more points in that region until the threshold is met. This way you need relatively few points where the curve is relatively straight, and more at the "interesting" parts where it bends in new directions. The smoothness of the curve you draw will be proportional to the error you are willing to tolerate in the linear prediction of points.
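To make that concrete, here is a toy R implementation of the guessed refinement rule (a sketch of the idea above, not what Mathematica actually does; the tolerance and depth limit are arbitrary):

# Recursively subdivide [a, b]: predict the midpoint linearly from the
# endpoints and refine wherever the prediction error exceeds tol.
adaptive_sample <- function(f, a, b, tol = 1e-3, max_depth = 10) {
  refine <- function(x1, x3, depth) {
    x2 <- (x1 + x3) / 2
    err <- abs(f(x2) - (f(x1) + f(x3)) / 2)  # linear-prediction error at the midpoint
    if (depth >= max_depth || err < tol)
      return(data.frame(x = c(x1, x2), y = c(f(x1), f(x2))))
    rbind(refine(x1, x2, depth + 1), refine(x2, x3, depth + 1))
  }
  rbind(refine(a, b, 0), data.frame(x = b, y = f(b)))
}
pts <- adaptive_sample(sin, 0, 2 * pi)
plot(pts, type = "l")  # few points on straight runs, many near the bends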
