I was trying to understand how Ignite scales with the number of nodes in a grid. Has anyone run tests? How many nodes can participate in a grid before performance degrades significantly?
Ignite is all about scalability; however, it's hard to run any kind of generic test, because the results always depend on the use case. Ignite is proven to provide linear or nearly linear performance improvement when the application code is implemented correctly. As for cluster sizes, there is no theoretical limit: you can add as many nodes as you need. There are examples of clusters of up to 1,500 nodes.
Is there an example of an OpenMDAO optimization where each iteration of the optimization is designed to run in parallel? The examples I saw seemed focused on the design-of-experiment drivers.
Once you get into parallelism within the model, there are potentially many different ways to split up a problem. However, the simplest case, and the one most relevant to something like running multiple directional cases at the same time, is a multi-point-like setup. This is done with a parallel group, as shown in the multi-point doc.
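For what it's worth, here is a minimal sketch of that multi-point idea in Python, assuming OpenMDAO's om.ParallelGroup, om.ExecComp and om.ScipyOptimizeDriver; the (x - i)**2 per-case expression is just a placeholder for a real analysis, and running the script under MPI (e.g. mpirun -n 3 python script.py) is what actually spreads the cases over processes:

    # Multi-point sketch: each "case" is a component inside a ParallelGroup, so
    # the cases can evaluate concurrently under MPI. The per-case expression and
    # the summed objective are illustrative placeholders.
    import openmdao.api as om

    prob = om.Problem()
    model = prob.model

    n_cases = 3
    parallel = model.add_subsystem('cases', om.ParallelGroup(),
                                   promotes_inputs=['x'])
    for i in range(n_cases):
        # one hypothetical analysis per case; in practice this would be your solver
        parallel.add_subsystem(f'case_{i}',
                               om.ExecComp(f'f = (x - {i})**2'),
                               promotes_inputs=['x'])

    # combine the per-case results into a single objective
    model.add_subsystem('summary',
                        om.ExecComp('f_total = ' + ' + '.join(f'f_{i}' for i in range(n_cases))))
    for i in range(n_cases):
        model.connect(f'cases.case_{i}.f', f'summary.f_{i}')

    model.add_design_var('x', lower=-10.0, upper=10.0)
    model.add_objective('summary.f_total')
    prob.driver = om.ScipyOptimizeDriver()
    prob.driver.options['optimizer'] = 'SLSQP'

    prob.setup()
    prob.set_val('x', 4.0)
    prob.run_driver()
    print(prob.get_val('summary.f_total'))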
There have been mentions of using custom partitioning algorithms for Giraph applications. However, this is not clearly documented anywhere. As Castagna pointed out in "How to partition graph for Pregel to maximize processing speed?", there may not even be a need for such partitioning, since the HashPartitioner will by itself be very good in most cases.
The problem of partitioning a graph 'intelligently' in order to minimize execution time is an interesting one; however, it's not simple, and it depends on your data and your algorithm. You might also find that, in practice, it's not necessary and that a random partitioning is sufficiently good.
For example, if you are interested in exploring Pregel-like approaches, you can have a look at Apache Giraph and experiment with different partitioning techniques.
However, for the purpose of learning, it would be good to see live examples, and I haven't found any so far. For example, the usual k-way partitioning algorithm (Kernighan-Lin) being executed in Giraph, or at least the direction in which I should implement it.
All the Google results were from the Apache Giraph pages, which only contain definitions of the functions and the various options for using them.
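For illustration only, here is roughly what hash partitioning boils down to, sketched in plain Python rather than against the Giraph API; the toy graph and the hand-made "smart" assignment are made up, and a real custom partitioner (Kernighan-Lin or otherwise) would aim to beat the edge cut of this hash baseline:

    # Conceptual illustration of hash partitioning: a vertex goes to partition
    # hash(vertex_id) % num_partitions. This is not Giraph code; it just shows
    # why the edge cut of a hash/random assignment is the baseline to beat.

    def hash_partition(vertex_id, num_partitions):
        return hash(vertex_id) % num_partitions

    def edge_cut(edges, assignment):
        """Count edges whose endpoints land in different partitions."""
        return sum(1 for u, v in edges if assignment[u] != assignment[v])

    # toy graph: two triangles joined by a single edge
    edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
    vertices = {u for e in edges for u in e}

    num_partitions = 2
    hash_assignment = {v: hash_partition(v, num_partitions) for v in vertices}
    # a hand-made "intelligent" assignment keeping each triangle together
    smart_assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

    print("hash partitioning edge cut :", edge_cut(edges, hash_assignment))
    print("smart partitioning edge cut:", edge_cut(edges, smart_assignment))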
I know almost nothing about parallel computing, so this question might be very naive, and maybe what I would like to do is impossible.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data that exceeds the available memory (around 64 GB). So my problem isn't a lack of computational power but rather a memory limitation.
My question is whether it is even possible to use some R package (like doSnow) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or whether I would have to rewrite the script from the ground up to make it explicitly parallel.
Sorry if my question is naive; any suggestions are welcome.
Thanks,
Simon
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is try to divide the problem into multiple pieces manually, and then parallelize, either using R or another tool.
An alternative would be to keep some of the data on disk instead of loading all of it into memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in memory is used for a reasonable amount of computation before loading another part.
Whether it is worth (or even possible) doing either of these depends completely on your application.
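To make the on-disk/chunked idea concrete, here is a sketch in Python with pandas (the file name and the 'value' column are placeholders); R has analogous options, e.g. reading a file in pieces or packages like ff/bigmemory:

    # Chunked aggregation sketch: process a file too large for RAM one piece at
    # a time, keeping only running aggregates in memory. 'huge_data.csv' and
    # the 'value' column are placeholders for the real data.
    import pandas as pd

    total = 0.0
    count = 0
    for chunk in pd.read_csv('huge_data.csv', chunksize=1_000_000):
        total += chunk['value'].sum()
        count += len(chunk)

    print('mean of value column:', total / count)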
Btw., a good list of high-performance computing tools for R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
For future inquiry:
You may want to have a look at two packages "snow" and "parallel".
Library "snow" extends the functionality of apply/lapply/sapply... to work on more than one core and/or one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
However, in several situations you may want to consider using Python instead of R, where parallelism can be handled much more efficiently by "Dask" workers.
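For instance, a minimal Dask sketch of an out-of-core computation (the array sizes, chunk sizes, and the commented-out scheduler address are arbitrary placeholders; it assumes the dask package is installed):

    # Dask array sketch: the data is split into chunks, and only the chunks
    # needed at any moment have to fit in memory. Pointing a dask.distributed
    # Client at a scheduler would spread the same work over several nodes.
    import dask.array as da
    # from dask.distributed import Client
    # client = Client('tcp://scheduler-host:8786')  # hypothetical scheduler address

    # ~80 GB of float64 if materialised at once, held as ~800 MB chunks instead
    x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
    col_means = x.mean(axis=0)        # lazy: nothing is computed yet
    print(col_means[:5].compute())    # only the chunks actually needed are built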
You might want to take a look at TidalScale, which allows you to aggregate the nodes of your cluster to run a single instance of Linux with the collective resources of the underlying nodes (www.tidalscale.com). Though the R application may be inherently single-threaded, you'll be able to provide it with a single, coherent memory space across the nodes that is transparent to the application.
Good luck with your project!
I have written an MPI version and a GPU version of a diffusion-equation solver.
In the MPI version, I compute the next values by decomposing the grid, with each process handling a sub-grid.
In the GPU/OpenCL version, I compute the next values by flattening the 2D grid to 1D and looping over the global index of this 1D grid to update the whole grid.
Now, I would like to know whether it is possible to combine the two versions, i.e. to assign a sub-grid to each MPI process and, within each sub-grid, compute the values with GPU/OpenCL.
I think this is only feasible if the GPU is able to share its resources between different MPI processes (I have only one GPU device).
Could anyone tell me whether this is actually possible?
thanks
Sure, the GPU can be shared between multiple processes. It's still just one resource, though, so if you had it reasonably well utilized with one process, don't expect much scaling, since your processes are now competing for a single resource. In the worst case, performance actually gets worse if you oversubscribe the GPU. Another issue to watch out for is GPU memory usage.
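As a rough sketch of that layout, the following uses mpi4py and pyopencl instead of the asker's C/MPI code (so the package choice, the 1-D decomposition, and the constants are all assumptions): each rank owns a strip of the grid, opens its own OpenCL context on the same GPU device, and runs the update kernel on its strip. The halo exchange between neighbouring ranks, which a real diffusion solver needs, is omitted for brevity.

    # Each MPI rank takes a horizontal strip of the grid and runs the diffusion
    # update for its strip on the (single, shared) GPU via its own OpenCL
    # context. Run with e.g. `mpirun -n 2 python diffusion_hybrid.py`.
    # Halo exchange with MPI.Sendrecv between neighbouring ranks is omitted.
    import numpy as np
    import pyopencl as cl
    from mpi4py import MPI

    KERNEL_SRC = """
    __kernel void step(__global const float *u, __global float *u_new,
                       const int nx, const int ny, const float alpha)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = i * ny + j;
            u_new[idx] = u[idx] + alpha * (u[idx - ny] + u[idx + ny]
                                           + u[idx - 1] + u[idx + 1]
                                           - 4.0f * u[idx]);
        }
    }
    """

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # split the global rows among the ranks (1-D decomposition, no halos shown)
    global_nx, ny = 1024, 1024
    nx = global_nx // size
    u = np.random.rand(nx, ny).astype(np.float32)

    # every rank opens its own context/queue on the same GPU device
    device = cl.get_platforms()[0].get_devices(device_type=cl.device_type.GPU)[0]
    ctx = cl.Context([device])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, KERNEL_SRC).build()

    mf = cl.mem_flags
    buf_u = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u)
    buf_new = cl.Buffer(ctx, mf.READ_WRITE, size=u.nbytes)

    for _ in range(100):
        prg.step(queue, (nx, ny), None, buf_u, buf_new,
                 np.int32(nx), np.int32(ny), np.float32(0.1))
        buf_u, buf_new = buf_new, buf_u  # ping-pong the buffers

    cl.enqueue_copy(queue, u, buf_u)
    queue.finish()
    print(f"rank {rank}/{size}: local mean = {u.mean():.6f}")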
FlexUnit is quite an impressive framework for testing, and with the new integration in Flash Builder 4 it's a no-brainer to use. However, I'm not sure why it should be limited to unit testing; in my opinion, the tools are great candidates for performance testing as well.
It should also be mentioned that by performance testing I'm not talking about testing whole systems. Rather, I'm more interested in testing actual units in a library. For instance, stress testing data structures in order to determine scalability issues.
Is this being done or is there any reference material out there that touches on this subject?
In order to further clarify the question, let me describe a possible scenario.
Let's say we're creating a library of data structures, for instance collections. These structures are meant to focus on efficiency rather than features. While they certainly need testing in live or as-close-to-live scenarios, I can imagine that some bottlenecks may be easily caught before going to acceptance testing.
So the question is, what's considered best practice for stress testing individual units? Is unit testing useful for stress testing individual units, or is the data collected from such tests insignificant, making them just a waste of time and energy?
"Is unit testing useful for stress testing individual units"
Why not have one of your unit tests create a few thousand instances, exercise them a bit, and then destroy them? Time the whole sequence and fail the test if it takes too long.
At least then you are putting bounds on it.
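A sketch of that pattern, written in Python's unittest purely to show the shape of such a test (SomeCollection, the instance count, and the 0.5-second budget are all made up; the same structure carries over to a FlexUnit test method):

    # Time-bounded stress test: build and exercise a few thousand instances and
    # fail if the whole sequence exceeds a budget.
    import time
    import unittest


    class SomeCollection:
        """Stand-in for the data structure under test."""
        def __init__(self):
            self._items = []

        def add(self, item):
            self._items.append(item)

        def contains(self, item):
            return item in self._items


    class CollectionStressTest(unittest.TestCase):
        def test_add_and_lookup_under_budget(self):
            start = time.perf_counter()

            instances = [SomeCollection() for _ in range(5_000)]
            for n, coll in enumerate(instances):
                coll.add(n)
                self.assertTrue(coll.contains(n))
            del instances

            elapsed = time.perf_counter() - start
            self.assertLess(elapsed, 0.5, f"stress sequence took {elapsed:.3f}s")


    if __name__ == "__main__":
        unittest.main()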