Distributed physics simulation help/advice - MPI

I'm working in a distributed-memory environment. My task is to simulate big 3D objects made of particles tied together by springs, by dividing each object into smaller pieces so that each piece gets simulated on a different computer. I'm using a 3rd-party physics engine to achieve the simulation. The problem I am facing is how to transmit the particle information at the extremities where the object is divided; this information is needed to compute the forces between interacting particles. The line in the image shows where the cut has been made. Because the number of particles is big, the communication overhead will be big as well. Is there a good way to transmit such information, or is there another value I could transmit that would help me determine the information I need? Any help is much appreciated. Thank you.
PS: by particle information I mean the new positions, from which I compute a resulting force to be applied to the particles simulated on the local machine.

"Big" means lots of things. Here the number of points with data being communicated may be "big" in that it's much more than one, but if you have say a million particles in a lattice, and are dividing it between 4 processors (say) by cutting it into squares, you're only communicating 500 particles across each boundary; big compared to one but very small compared to 1,000,000.
A library very commonly used for these sorts of distributed-memory computations (which is somewhat different from distributed computing, which suggests nodes scattered all over the internet; this sort of tightly-coupled computation is usually best done on a series of nearby computers in a lab or in a cluster) is MPI. This pattern of communication is very common, and is called "halo exchange" or "guardcell exchange" or "ghost-zone exchange" or some combination; you should be able to find lots of examples of such things by searching for those terms. (There are a few questions on this site on the topic, but they're typically focused on very specific implementation questions.) A minimal sketch of the idea follows.
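To make that concrete, here is a minimal 1D halo-exchange sketch in C with MPI. Everything here (the array name, the particle count, one ghost slot per side) is made up for illustration; in 3D you would exchange whole boundary planes, but the pattern is the same:

```c
/* Minimal 1D halo-exchange sketch. Each rank owns `n` particle positions
   plus one ghost slot on each side; names and sizes are hypothetical. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000;                               /* locally owned particles */
    double *x = malloc((n + 2) * sizeof(double));     /* x[0], x[n+1] are ghosts */
    for (int i = 1; i <= n; i++) x[i] = rank * n + i; /* dummy positions */

    /* Boundary ranks talk to MPI_PROC_NULL, which turns the call into a no-op. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my rightmost owned value right; receive left neighbour's into x[0]. */
    MPI_Sendrecv(&x[n], 1, MPI_DOUBLE, right, 0,
                 &x[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my leftmost owned value left; receive right neighbour's into x[n+1]. */
    MPI_Sendrecv(&x[1],     1, MPI_DOUBLE, left,  0,
                 &x[n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... compute spring forces using x[0..n+1], including the ghosts ... */
    free(x);
    MPI_Finalize();
    return 0;
}
```

The MPI_Sendrecv pairing matters: each rank sends to one neighbour while receiving from the other, so the exchange cannot deadlock.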


How does OpenMPI's gather work?

I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) works in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start sending its own?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are mid-size to large
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that try to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
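To make that concrete, here is an illustrative sketch of the basic linear algorithm written with plain point-to-point calls. This is not the actual Open MPI source (see ompi/mca/coll/tuned for that), and it assumes a contiguous datatype:

```c
/* Sketch of a "basic linear" gather: the root receives from each rank in
   strict rank order. Illustrative only, not the Open MPI implementation. */
#include <mpi.h>
#include <string.h>

void gather_basic_linear(const void *sendbuf, int count, MPI_Datatype dtype,
                         void *recvbuf, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Aint lb, extent;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_get_extent(dtype, &lb, &extent);

    if (rank != root) {
        /* Non-root ranks just send their chunk to the root. */
        MPI_Send(sendbuf, count, dtype, root, 0, comm);
        return;
    }

    /* The root copies its own chunk, then receives from every other rank in
       rank order: rank 2's message is not consumed until rank 1's receive
       has completed. */
    memcpy((char *)recvbuf + (size_t)root * count * extent, sendbuf,
           (size_t)count * extent);
    for (int r = 0; r < size; r++) {
        if (r == root) continue;
        MPI_Recv((char *)recvbuf + (size_t)r * count * extent, count, dtype,
                 r, 0, comm, MPI_STATUS_IGNORE);
    }
}
```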
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to answer it specifically without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirements, grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should use an MPI performance analysis tool. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it: but to do so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing the communication altogether or by using a clever reduction instead. If you still need to stick with the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.

Estimating distance to an iBeacon with AVR

I want to ask about iBeacon advertising, especially Tx Power.
I used two BLE modules, an HM10 and an HM11. I made the HM10 act as an iBeacon, and the other one is used to connect and listen to the HM10's broadcasts.
I used an ATmega32 AVR MCU tied to the HM11, and I used the scanf function to read the broadcast. I want to extract the last byte (Tx Power) and measure the distance in AVR code.
Could you tell me the algorithm?
The formula Apple uses to calculate a distance estimate to an iBeacon is not published. There are a number of alternative formulas, including this one based on a best-fit power curve, that we wrote for the Android Beacon Library.
Further research we have done shows that the formula above basically works, but it has two main imperfections:
It does not work well for weaker beacon transmitters. With weaker broadcasts, the distance is underestimated.
It does not account for varying signal gains in receivers. Different receivers have different antennas and radio front-ends, which measure the same signal differently.
There is an ongoing discussion of the best formula here.
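For reference, here is a C transcription of that best-fit power-curve formula. The constants come from a curve fitted against one particular Android device, so treat them as starting values to recalibrate for your own receiver rather than universal truth:

```c
#include <math.h>

/* Distance estimate from a measured RSSI and the calibrated Tx Power byte
   (the expected RSSI at 1 m). Curve constants fitted for one specific
   Android device; recalibrate for your own hardware. */
double estimate_distance(int tx_power, double rssi)
{
    if (rssi == 0.0)
        return -1.0;                      /* distance cannot be determined */
    double ratio = rssi / (double)tx_power;
    if (ratio < 1.0)
        return pow(ratio, 10.0);          /* very close to the beacon */
    return 0.89976 * pow(ratio, 7.7095) + 0.111;
}
```

The tx_power argument here is exactly the calibration byte the question asks about: the last byte of the iBeacon advertisement.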
A bit late, but hopefully useful to others. I have given up on Apple's "Accuracy" number; as @davidyoung points out, different devices will have different signal gains. Now I am not an engineer but more of a math and statistics person, so I have gone down the route of "fingerprinting" an indoor space instead.

Essentially I read the RSSI of all beacons installed in a certain "venue". Some might not be within reach, and in such cases I just assume an RSSI of -95 dBm (which seems to be the floor past which a signal is not read any more). The resulting array has the same beacons in the same positions at all times (even across app launches). I compute a 5-second moving average for each beacon (so I use 5 arrays to do that). The resulting average array is then shifted up by 95 units and normalised so that the sum of all of its values is one.

If you want to tag an indoor "point", you collect many of these normalised average arrays on that specific spot, and I go ahead and construct a database of "spots". To forecast your proximity to any spot in the database, you simply compute the quadratic distance between your current reading and all of the fingerprints in the database; a sketch of these two steps follows.
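A minimal sketch of the shift-and-normalise step and the quadratic-distance comparison described above, assuming a hypothetical fixed beacon count and the -95 dBm floor:

```c
#define NUM_BEACONS 8        /* hypothetical fixed beacon layout for one venue */
#define RSSI_FLOOR  -95.0    /* assumed floor for out-of-range beacons */

/* Shift a moving-average RSSI vector up by 95 and normalise it to sum to 1. */
void normalise(const double rssi_avg[NUM_BEACONS], double out[NUM_BEACONS])
{
    double sum = 0.0;
    for (int i = 0; i < NUM_BEACONS; i++) {
        out[i] = rssi_avg[i] - RSSI_FLOOR;    /* -95 dBm maps to 0 */
        sum += out[i];
    }
    for (int i = 0; i < NUM_BEACONS; i++)
        out[i] = (sum > 0.0) ? out[i] / sum : 0.0;
}

/* Quadratic (squared) distance between the current reading and one stored
   fingerprint; the closest fingerprint in the database is the best match. */
double fingerprint_distance(const double a[NUM_BEACONS],
                            const double b[NUM_BEACONS])
{
    double d = 0.0;
    for (int i = 0; i < NUM_BEACONS; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}
```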
Which beacons should you use? At least class 2 in power. How many? At least a couple per room (put them in two adjacent corners, on the ceiling or high up).
The last step you need to do is match the fingerprints with an x,y coordinate on your map. I never did this step, because I am mainly interested in proximity applications rather than fully fingerprinting an indoor space.
Perhaps the discussion above will serve as guidance on a technique that is used by many indoor-location companies.
Disclosure: I have recently open sourced my code doing the above calculations.

Estimating the heat generated by a process or job

Is it possible to estimate the heat generated by an individual process at runtime?
Temperature readings of the processor are easily accessible, but what I need is process-specific information.
Is it possible to map information such as CPU utilization, I/O, running time, memory usage, etc. to get some kind of an estimate?
I'm going to say no, because the overall temperature of your system components isn't a simple mathematical equation either, with everything that's moving and switching.
Heat generated by and inside a computer depends on many external factors: the hardware setup, the ambient temperature of the room, possibly the age of the components, whether there is dust on them or in the fans, whether the cooling paste was correctly applied on the CPU or elsewhere, where heat sinks are present, how heat is being dissipated, etc. In short, again: no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensor data for individual components (as you can see to some extent in the BIOS), interpolating one single process's share of the generated heat from the total is, well, impossible.
At the lowest levels (gate networks, control signalling, etc.), an external observer no longer has any means to observe or measure what's going on, but there as well things are in a changing state: a variable amount of electricity is being used, and thus a variable amount of heat is generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step towards what you're looking for. Take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems, but a power estimate should provide at least some relative information about the heat generated, assuming the processes you are comparing are running in similar manners on similar hardware. In reality there are just too many factors to predict power, much less heat, effectively, but you may be able to get an idea of the usage. A rough sketch of this approach follows.
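As one coarse example of that idea on Linux: with Intel RAPL exposed through the powercap sysfs interface, you can sample the package energy counter around an interval of interest. Note this gives whole-package energy, not per-process heat, and the sysfs path is an assumption that varies by machine:

```c
/* Sample the package energy counter before and after an interval. Assumes
   Linux with Intel RAPL via powercap; the path below may differ per system. */
#include <stdio.h>
#include <unistd.h>

static long long read_energy_uj(void)
{
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

int main(void)
{
    long long before = read_energy_uj();
    sleep(10);                 /* run or wait on the workload of interest */
    long long after = read_energy_uj();
    if (before >= 0 && after >= 0)
        printf("package energy over interval: %lld uJ\n", after - before);
    /* Caveat: this is whole-package energy, not per-process, and the
       counter can wrap, which a robust tool would handle. */
    return 0;
}
```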

Modeling communication costs in MPI

Does anyone know of any papers that discuss communication costs in MPI programs? I am trying to predict the time taken by (say) the communication step in two-phase I/O. That would depend on the number of processes, the size and number of messages sent/received, the network interconnect and architecture, etc. It would be helpful for us to come up with a formula to assess the time taken by communication alone. I have read some papers, but none of them handle the case where multiple processes are communicating at the same time.
The most critical elements in any time estimate will be the total data to be sent, and the speed of the interconnect. That should give you an effective "minimum" time for the message transfers.
After that, you can measure the actual time taken and use it to determine a rough efficiency rating for the MPI implementation. As the amount of data scales up, the time required will also scale up by roughly that factor. This is a very rough way to get an estimate. Keep in mind that as the data size crosses certain interesting thresholds (e.g. page size, cache size, and so on), the scale factor will likely need to be revised. A sketch of the standard first-order cost model is below.
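As a starting point, the standard first-order "alpha-beta" (Hockney) model prices a single message as a fixed latency plus size over bandwidth. The parameters below are placeholders you would measure with a ping-pong benchmark on your own interconnect:

```c
/* First-order alpha-beta (Hockney) point-to-point cost model:
   T(m) = alpha + m / beta, where alpha is latency in seconds and beta is
   bandwidth in bytes/second. Parameter values are placeholders to measure. */
double message_time_s(double message_bytes, double alpha_s, double beta_Bps)
{
    return alpha_s + message_bytes / beta_Bps;
}

/* Example: a 1 MiB message over a link with 2 us latency and 10 GB/s
   bandwidth: message_time_s(1048576, 2e-6, 10e9) ~= 1.07e-4 s. */
```

Note that this deliberately ignores contention between simultaneously communicating processes, which is exactly the hard part the question raises; measured efficiency factors are one pragmatic way to fold that back in.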

What is an FFP machine?

In R. Kent Dybvig's paper "Three Implementation Models for Scheme" he speaks of "FFP languages" and "FFP machines". Apparently there is some correlation between FFP machines and string reduction on multiple processors.
Googling doesn't really uncover much in terms of explanations or examples.
Can anyone shed some light on this topic?
Thanks.
Kent Dybvig's advisor, Gyula A. Mago, published a detailed description: "The FFP Machine", Technical Report 87-014 (Mago and Stanat, 1987).
As of this writing, the PDF is freely available at:
http://www.cs.unc.edu/techreports/87-014.pdf
The FFP Machine is a very fine-grained parallel computer architecture: each processor holds a single symbol / atom / value.
It uses a string-reduction model of computation in which innermost function applications are found and replaced by their equivalent result (eager evaluation). Where a result is used in several places, it tends to be re-evaluated instead of incurring the costs of accessing some global store (but see Mago's paper on "Copying Operands vs Copying Results", or better yet Mago's "Data Sharing in an FFP Machine" in the 1982 Functional Programming Languages and Computer Architecture conference).
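As a toy illustration of the innermost (string) reduction discipline, and emphatically not of real FFP syntax, here is a sketch that repeatedly rewrites the textually innermost application in a small arithmetic expression:

```c
/* Toy string reduction: repeatedly find an innermost application "(+ a b)"
   or "(* a b)" in the expression text and replace it with its value, in the
   spirit of eager innermost reduction. Illustrative only. */
#include <stdio.h>
#include <string.h>

/* Find an innermost "( ... )" span, i.e. one containing no nested "(". */
static int find_innermost(const char *s, int *start, int *end)
{
    int open = -1;
    for (int i = 0; s[i]; i++) {
        if (s[i] == '(') open = i;
        else if (s[i] == ')') { *start = open; *end = i; return open >= 0; }
    }
    return 0;
}

int main(void)
{
    char expr[256] = "(* (+ 1 2) (+ 3 4))";
    int start, end;
    while (find_innermost(expr, &start, &end)) {
        char op; long a, b, v;
        if (sscanf(expr + start, "(%c %ld %ld)", &op, &a, &b) != 3) break;
        v = (op == '+') ? a + b : a * b;
        char result[32], rest[256];
        snprintf(result, sizeof result, "%ld", v);
        strcpy(rest, expr + end + 1);          /* save the tail after the redex */
        snprintf(expr + start, sizeof expr - start, "%s%s", result, rest);
        printf("-> %s\n", expr);               /* show each reduction step */
    }
    return 0;
}
```

On a real FFP machine, all innermost applications can be reduced simultaneously, each by its own area machine; here the steps are sequential only because a single CPU does the scanning.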
The L cells holding the FFP expression being reduced communicate through a tree-structured arrangement of T cells.
Note that ICs are basically two-dimensional, and with wiring, circuits can move towards being three-dimensional in physical space. Interconnection networks that occupy higher dimensions (such as the Hypercube, Omega, Banyan, Star, etc. networks) will eventually be unable to perform near their theoretical limit.
This communication network is circuit-switched rather than packet-switched. Data packets contain no addresses and do not need routing. Packets from distinct reductions cannot meet, cannot conflict, and cannot experience congestion with each other.
The configuring activity (called "partitioning") is performed in a single sweep upwards in the tree, using a handful of logic operations on 3-bit messages, leaving "area machines" in its wake, each created to advance at most a single reducible application. While it is technically logarithmic in time, the resulting area machines can begin communicating in a pipelined fashion behind the partitioning wave, practically costing only a constant time penalty. (The dismantling of area machines remains a logarithmic cost in time.)
Packets within a single reduction should, and must, meet, and thus provide an often-useful synchronization. Sequences of packets are sorted and combined as they rise within an area, to be broadcast from the root of the area machine. Parallel-prefix and parallel-suffix operations are provided to reduce area traffic, since there remains a potential bottleneck within an individual reducible application.
This is accomplished without the need, exhibited in the Ultracomputer (Jacob Schwartz at NYU), for a separate logarithmic-sized cache memory in each communication node. Each T cell (internal tree node) only needs a FIFO buffer (for efficiency) of size greater than the pipeline path to the top of the tree and back down. (This latter is a conjecture of mine, but it seems reasonable.)
Since the tree maintains the left-to-right order of data (unlike some other combining networks), the system enables cells to rotate their data in logarithmic rather than linear time, avoiding the plausible congestion at the root of the area machine.
It's worth noting again that the parallelism within an area machine is independent of the simultaneous parallelism in other area machines, and has available to it a number of processors proportional to the quantity of data in the operand.
Have you come across this yet? Compiling APL for parallel execution on an FFP machine
Formal FP, similar to FP but with a regular, sugarless syntax intended for machine execution, is all I can offer you. See Wikipedia's FP page.
