I'm writing an application using MPI (mpi4py, actually). The application may spawn new processes using MPI_Comm_spawn() (collectively, on all current processes), and some processes from the parent group/communicator may send data to some processes in the child group/communicator and vice versa. (Note that MPI_Comm_spawn() and the data sending/receiving happen in different threads, both for functionality [there are other features not directly relevant to this question, so I won't describe them] and for performance.)
Because MPI_Comm_spawn() may be called several times and I expect all processes to be able to communicate with each other, I currently plan to use MPI_Intercomm_merge() to merge the two groups (parent and child) into one intracommunicator, then send data through the new intracommunicator (and the next MPI_Comm_spawn() will happen on the new intracommunicator).
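In mpi4py, that plan looks roughly like this (a minimal sketch; the child program name worker.py and maxprocs=1 are assumptions):

    from mpi4py import MPI

    comm1 = MPI.COMM_WORLD
    # Collective over comm1: every current process must make this call.
    intercomm = comm1.Spawn('python', args=['worker.py'], maxprocs=1, root=0)
    # Merge the parent and child groups into one intracommunicator (comm2).
    # high=False on the parent side orders parent ranks before child ranks.
    comm2 = intercomm.Merge(high=False)
    intercomm.Free()

    # The child side (in worker.py) would mirror this:
    #   intercomm = MPI.Comm.Get_parent()
    #   comm2 = intercomm.Merge(high=True)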
However, because the spawn-and-merge happens while the program is running, some data will already have been sent through the old communicator (but may not yet have been received by the destination). How can I safely switch from the old communicator to the new one (e.g., be able to delete the old communicator[s] at some point) while losing the least performance? MPI_Intercomm_merge() is the only way I know to guarantee that all processes can send data to each other (because if we don't merge, then after the next MPI_Comm_spawn() some processes won't be able to send data to each other directly), and I don't mind changing it to another method as long as it works well.
For example, in the following scenario, processes A, B, and C are the initial processes (mpiexec -np 3) and D is a spawned process:
A and B send a continuous stream of data to C; while that sending is in progress, D is spawned; then C sends data to D. Suppose the old communicator that A, B, and C use is comm1 and the merged intracommunicator is comm2.
What I want to achieve is to send data through comm1 initially and have all processes switch to comm2 after D is spawned. What is missing is a mechanism for C to know when it can safely switch from comm1 to comm2 for receiving data from A and/or B, at which point I can safely call MPI_Comm_free(comm1).
Simply sending a special tag through comm1 at switch time would be my last resort, because C doesn't know how many processes will send data to it. It does know how many groups of processes will send data to it, so this could be achieved by introducing local leaders (but I'd like to know about other options).
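For reference, the local-leader variant could look like this in mpi4py (a sketch only; it assumes from mpi4py import MPI, and SENTINEL_TAG, n_groups, and process() are assumed names):

    SENTINEL_TAG = 42  # assumed to be unused by regular data traffic

    def drain_and_free(comm1, n_groups):
        # C keeps receiving on comm1 until it has seen one sentinel per
        # sending group; after that, no more traffic can arrive on comm1.
        done = 0
        while done < n_groups:
            status = MPI.Status()
            msg = comm1.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == SENTINEL_TAG:
                done += 1
            else:
                process(msg)  # hypothetical handler for a regular data message
        comm1.Free()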
Because A, B, and C run in parallel, and send/recv and spawn happen in different threads, we can't guarantee there is no pending data when MPI_Comm_spawn() is called. For example, if A and B send and C receives at the same rate, then at the moment they call MPI_Comm_spawn(), C has only received half of the data from A and B, so C can't drop comm1 yet; it has to wait until it has received all pending data on comm1 (an unknown number of messages).
Are there any mechanisms provided by MPI or mpi4py (e.g. error codes or exceptions) to achieve this?
By the way, if my approach is clearly bad, or if I misunderstand what MPI_Comm_free() does, please point that out.
(My understanding is that MPI_Comm_free() is not a collective call; after calling MPI_Comm_free(comm1), no more send/recv calls on comm1 are allowed on the process that called MPI_Comm_free(comm1).)
So basically, C invokes MPI_Comm_spawn(..., MPI_COMM_SELF, ...).
Why don't you have {A,B,C} invoke MPI_Comm_spawn(..., comm1, ...) instead?
MPI_Intercomm_merge() is a collective operation, so you need to "synchronize" your tasks somehow anyway; why not "synchronize" them before MPI_Comm_spawn() instead?
Then switching to the new communicator is trivial.
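In mpi4py terms, the suggestion amounts to something like this (a sketch; the quiescing step is whatever the application uses to stop traffic on comm1 first, and worker.py is an assumed name):

    # All of A, B, C agree to stop sending and drain anything pending on
    # comm1 (application-specific), then:
    comm1.Barrier()                        # every process has quiesced
    intercomm = comm1.Spawn('python', args=['worker.py'], maxprocs=1, root=0)
    comm2 = intercomm.Merge(high=False)
    intercomm.Free()
    comm1.Free()                           # comm2 is now the only communicator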
Related
I want to use MPI_Gatherv to collect data onto a single process. The thing is, I don't need data from all other processes, just a subset of them. The only communicator in the code is MPI_COMM_WORLD. So, does every process in MPI_COMM_WORLD have to call MPI_Gatherv? I have looked at the MPI standard and can't really make out what it is saying. If all processes must make the call, can some of the values in MPI_Gatherv's "recvcounts" array be zero? Or would there be some other way to signal to the root process which processes should be ignored? I guess I could introduce another communicator, but for this particular problem that would be coding overkill.
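For concreteness, the zero-count variant would look something like this in mpi4py with NumPy buffers (a sketch; the "even ranks contribute" rule is just an assumption for illustration):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    contributes = (rank % 2 == 0)                 # hypothetical subset rule
    sendbuf = np.full(4 if contributes else 0, rank, dtype='i')

    counts = [4 if r % 2 == 0 else 0 for r in range(size)]  # zeros are legal
    displs = [sum(counts[:r]) for r in range(size)]
    recvbuf = np.empty(sum(counts), dtype='i') if rank == 0 else None

    # Every rank in the communicator must make the call, even with count 0.
    if rank == 0:
        comm.Gatherv(sendbuf, [recvbuf, counts, displs, MPI.INT], root=0)
    else:
        comm.Gatherv(sendbuf, None, root=0)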
In an ideal async program, every event loop is always occupied, with zero downtime between receiving data and executing the corresponding action.
My program listens on an array of ports; the polling and the movement of data into a queue happen on a single async core (A). Another async core (B) takes the data from that queue and processes it, and a third async core (C) runs background subroutines. A, B, and C all run on different threads.
Now suppose there is a massive load of streaming data and core B becomes overloaded with pending work (effectively "lag" for the end user). What are the common ways to detect this overload, and if an overload is detected, should I add another async core (D) joined with B?
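One common signal is the depth of the queue between A and B: if it stays above a threshold (or keeps growing), B is falling behind. A sketch with asyncio (the queue bound and high-water mark are assumptions):

    import asyncio

    QUEUE_MAX = 10_000
    work_queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)

    async def overload_monitor(high_water=0.8, interval=1.0):
        # Periodically check the backlog feeding core B; a sustained
        # depth above the high-water mark means B is overloaded.
        while True:
            depth = work_queue.qsize()
            if depth > high_water * QUEUE_MAX:
                print(f"core B overloaded: {depth}/{QUEUE_MAX} items pending")
                # React here: shed load, apply backpressure, or start core D.
            await asyncio.sleep(interval)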
While searching for a template to test a paper-trading strategy, I stumbled on IBPy. I have gone through the initial setup and can connect and receive updates from the server. What I would like to do is:
a) Gather ticks from 1..n symbols when new prices (bid/asks) are published
b) Store these temporarily in a vector (I guess with vector.append((bid, ask)))
c) Once the vector reaches its computational max (I need 30 seconds' worth, or a certain number of ticks), compute some values on vector[] and decide whether an entry is appropriate
d) If not, pop(0) and keep collecting (see the rolling-window sketch after this list)
e) Exit on a stop-loss or trailing profit
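Steps b-d map naturally onto a fixed-size rolling window, e.g. with collections.deque (a sketch; the window size and the helper names entry_looks_good() and place_entry() are assumptions):

    from collections import deque

    WINDOW = 120                          # assumed maximum number of ticks
    ticks = deque(maxlen=WINDOW)          # old ticks fall off automatically

    def on_tick(bid, ask):
        ticks.append((bid, ask))
        if len(ticks) == WINDOW:          # computational max reached
            if entry_looks_good(ticks):   # hypothetical analytics
                place_entry()             # hypothetical order routine
        # no explicit pop(0) needed: deque(maxlen=...) drops the oldest tick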
My questions are:
i) I have read that updates arrive every 250 ms, which is fine for my analytics, but can the program/system keep up? Different symbols update at different times, so even though symbolA updates every 250 ms, with 10 symbols the combined updates may be very frequent.
ii) When I stop to make a calculation, haven't I lost updates?
If there is skeleton code for this, it would be great to mess around with it.
Thanks for listening!
If you need to handle hundreds of stock symbols, you should have multiple (at least 2) threads. One thread pulls the incoming data from the socket, sorts the messages by message type, and pushes the data onto queues. The other threads wait on their respective queues and process the incoming data.
The idea is that the dispatcher thread ensures that all incoming data gets pulled from the socket as fast as possible.
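A bare-bones version of that dispatcher/worker layout (a sketch; sock, read_message(), msg_type(), and handle_tick() stand in for your IBPy plumbing):

    import queue
    import threading

    tick_queue = queue.Queue()

    def dispatcher(sock):
        # Pull everything off the socket as fast as possible, sort by type.
        while True:
            msg = read_message(sock)       # hypothetical IB message reader
            if msg_type(msg) == 'tick':    # hypothetical type dispatch
                tick_queue.put(msg)

    def tick_worker():
        while True:
            msg = tick_queue.get()         # blocks until the dispatcher delivers
            handle_tick(msg)               # hypothetical analytics hook

    def start_pipeline(sock):
        threading.Thread(target=dispatcher, args=(sock,), daemon=True).start()
        threading.Thread(target=tick_worker, daemon=True).start()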
Generally, your PC will be able to handle anything IB is willing to send you. If your processing does not take too much time (no locks, no calls to sleep(), no file operations), you can do everything in a single thread.
I was going through the RPC semantics at-least-once and at-most-once; how do they work?
I couldn't quite grasp how they are implemented.
In both cases, the goal is to invoke the function once. The difference is in their failure modes: with "at-least-once", the system retries on failure until it knows the function was successfully invoked, while "at-most-once" will not retry (or will ensure there is a negative acknowledgement of the invocation before retrying).
As to how these are implemented, this can vary, but the pseudo-code might look like this:
At least once:

    request_received = false
    while not request_received:
        send RPC
        wait for acknowledgement with timeout
        if acknowledgement received and acknowledgement.is_successful:
            request_received = true
At most once:

    request_sent = false
    while not request_sent:
        send RPC
        request_sent = true
        wait for acknowledgement with timeout
        if acknowledgement received and not acknowledgement.is_successful:
            request_sent = false
An example where you want "at-most-once" is payments (you wouldn't want to accidentally bill someone's credit card twice); an example where "at-least-once" is fine is updating a database with a particular value (writing the same value twice in a row has no additional effect). You almost always want "at-least-once" for idempotent operations, i.e. those where applying the operation more than once has the same effect as applying it once (which includes all non-mutating operations); by contrast, most mutating operations (or at least those that incrementally mutate state, and thus depend on the current/prior state when the mutation is applied) need "at-most-once".
I should add that it is fairly common to implement "at-most-once" semantics on top of an "at-least-once" system by including an identifier in the body of the RPC that uniquely identifies it, and by ensuring on the server that each ID it sees is processed only once. You can think of the sequence numbers in TCP packets (which ensure that packets are delivered once and in order) as a special case of this pattern. This approach, however, can be challenging to implement correctly in distributed systems where retries of the same RPC could arrive at two separate computers running the same server software. (One technique for dealing with this is to record the transaction where the RPC is received, but then to aggregate and deduplicate these records in a centralized system before redistributing the requests for further processing; another is to process the RPC opportunistically but reconcile/restore/roll back state when synchronization between the servers eventually detects the duplication. The latter would probably not fly for payments, but it can be useful in other situations, like forum posts.)
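A sketch of that ID-based deduplication (request_id and apply_mutation() are assumed names, and a real server would also bound and expire the cache):

    import uuid

    processed = {}  # request_id -> cached result

    def handle_rpc(request_id, payload):
        # Server side: a retried duplicate replays the stored result
        # instead of applying the mutation a second time.
        if request_id in processed:
            return processed[request_id]
        result = apply_mutation(payload)   # hypothetical state change
        processed[request_id] = result
        return result

    # Client side: generate one ID per logical call and reuse it for retries.
    request_id = str(uuid.uuid4())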
In a cluster running MPI code, is a copy of all the declared variables sent to all nodes, so that all nodes can access them locally and never perform remote memory accesses?
No, MPI itself can't do this for you in a single call.
Every MPI process has its own memory state, and any value may differ from one MPI process to another.
The only way of sending/receiving data is through explicit MPI calls, like Send or Recv. You can pack most of your data into some memory area and send that area to every MPI process, but that area will not contain 'every declared variable', only the variables you manually placed into it.
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank and obtained from the MPI_Comm_rank function). So every variable exists in N copies, one set per MPI process. Every process sees variables, but only the set it owns; the values of variables are not synchronized automatically.
So it is the programmer's task to synchronize the values of variables between nodes (MPI processes).
E.g., here is a small MPI program to compute Pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all the others (MPI_Bcast), and after the calculation every process sends its own 'mypi' into the 'pi' variable of the first process (the individual values are summed via the MPI_Reduce function).
Only the first process is able to read N from the user (via scanf), and that code is executed conditionally based on the rank of the process; the other processes must get N from the first one because they didn't read it from the user directly.
Update 2 (sorry for the late answer):
This is the syntax of MPI_Bcast. The programmer passes the address of a variable to this function. Each MPI process passes the address of its own 'n' variable (the addresses can differ). MPI_Bcast then checks the rank of the current process and compares it with another argument, the rank of the "broadcaster".
If the current process is the broadcaster, MPI_Bcast reads the value stored in memory at the given address (i.e., the value of the 'n' variable on the broadcaster); the value is then sent over the network.
Otherwise, if the current process is not the broadcaster, it is a "receiver". MPI_Bcast on a receiver gets the value from the broadcaster (via the network, using MPI library internals) and stores it in the memory of the current process at the given address.
So the address is passed to this function because on some nodes the function will write to the variable; only the value is sent over the network.
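In mpi4py, the same MPI_Bcast behavior looks like this (a minimal sketch using a NumPy buffer):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    n = np.zeros(1, dtype='i')    # every rank has its own 'n' buffer
    if comm.Get_rank() == 0:
        n[0] = 100                # only the broadcaster's value is read

    # Rank 0 reads from its buffer and sends; every other rank receives
    # the value and has it written into the buffer it passed in.
    comm.Bcast(n, root=0)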