I am writing an application in C++ with MPI (Open MPI). My problem is that when I send data with
MPI_Isend(&interbest, 1, MPI_INT, i, 2, MPI_COMM_WORLD, &request);
to other slave nodes and try to receive it with
MPI_Irecv(&localbest, 1, MPI_INT, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &request);
the data are simply not transferred (both of the buffer variables are OK).
I am new to MPI, so what could have gone wrong? I have to use the non-blocking versions, since the calls are used to share a best solution that has just been found with the other slaves. The other slaves periodically check for a new best solution, but even though the solution is clearly sent, no slave ever receives it.
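For reference, a minimal sketch of the polling pattern being described (the variable names and surrounding loop are placeholders, not the actual code); note that every request returned by MPI_Isend/MPI_Irecv has to be completed with MPI_Test or MPI_Wait at some point:

int localbest = 0, flag = 0, searching = 1;
MPI_Request recv_req;

/* post the receive once, then poll it while working */
MPI_Irecv(&localbest, 1, MPI_INT, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &recv_req);
while (searching) {
    /* ... advance the local search ... */
    MPI_Test(&recv_req, &flag, MPI_STATUS_IGNORE);
    if (flag) {
        /* a new best value arrived; use localbest, then re-post */
        MPI_Irecv(&localbest, 1, MPI_INT, MPI_ANY_SOURCE, 2,
                  MPI_COMM_WORLD, &recv_req);
    }
}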
The application is running on an MPI cluster built from 3 servers communicating over InfiniBand.
In case of any further questions, please don't hesitate to ask me.
Thank you very much
I keep having to go back to a modem I have and send AT commands to it, and I need to do this programmatically. Sending AT commands works fine when using Minicom, but with any kind of programmatic method it's just super unreliable. I've tried echoes and redirection in bash, the atinout program, and the pyserial module in Python, but no matter what, sending and receiving commands is iffy at best. It is very rare that I run the same AT command twice and get consistent output back. I'll get the complete response one time, but then a partial response the next, or maybe no response at all.
Admittedly I don't know much about serial, so maybe it's my hardware, or maybe the protocol for reading and writing to serial is just unreliable. Can someone please explain how, in general, reading and writing over a serial port may be unreliable, and suggest good techniques or libraries that help guarantee a stable flow of reads and writes?
There was another service on my system, called ModemManager, that was consuming the serial device at the same time I was running through my commands. Once that was disabled, all of my programmatic efforts started producing reliable I/O to the device.
I am trying to implement an MPI program in which a server node assigns task pieces to client nodes, but I am a freshman and don't know how to manage the client list. Can anyone help me?
Let me describe it more specifically:
Server node:
MPI_Comm clients[4]; // store client communicators, but I am not sure whether this is correct!
Several clients connect to the server, each launched with mpirun -np 1 ./mpiclient run more than once, rather than one launch with more than one process.
The reason I want to do this is that I want to send each client a different job to calculate.
Question 2: How can I get the attributes of the client?
For example: MPI_Comm_accept(portname, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
How can I get the client's name or IP?
I don't know that there's a best practice here, but there are a few options.
Know your list of IP addresses ahead of time.
Most of the time, people have a cluster set up with a static pool of IP addresses. That means it's easy to predict who will be connecting, so you can call MPI_COMM_ACCEPT for each IP address and the clients will already know the address of the "server".
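A hypothetical sketch of that accept loop (NCLIENTS and the port plumbing are assumptions; in a real setup the port string has to reach the clients somehow, e.g. on the command line):

char port[MPI_MAX_PORT_NAME];
MPI_Comm clients[NCLIENTS];    /* note the type: MPI_Comm, not MPI_COMM */

MPI_Open_port(MPI_INFO_NULL, port);
printf("server listening on %s\n", port);   /* the clients need this string */
for (int i = 0; i < NCLIENTS; i++)
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients[i]);
/* clients[i] is an intercommunicator; each client is rank 0 on its remote side */
MPI_Close_port(port);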
Don't use Connect/Accept directly
It may not be necessary to go through the pain of managing all of your connections directly. You might be able to do something else like MPI_COMM_SPAWN(_MULTIPLE) and spawn your children directly from your master. This simplifies managing connections, though you still have to deal with some of the weirdness of dynamic process management in MPI, specifically intercommunicators.
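For illustration, a sketch of spawning four workers from the master (the executable name ./mpiclient is borrowed from the question; everything else is an assumption):

MPI_Comm workers;   /* intercommunicator to the spawned children */
int errcodes[4], job = 0;

MPI_Comm_spawn("./mpiclient", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
               0, MPI_COMM_SELF, &workers, errcodes);
/* child i is addressed as rank i of the remote group of "workers" */
for (int i = 0; i < 4; i++) {
    job = i;   /* a different job id per child */
    MPI_Send(&job, 1, MPI_INT, i, 0, workers);
}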
Don't use dynamic processes at all
Many times, people coming to distributed programming, and specifically MPI, for the first time still have a sockets frame of reference. That is, they expect to have to set up all of their own connections and communication management. In reality, MPI and other communication libraries are designed to be slightly higher level than that and let you ignore some of the mundane communication management and get straight to passing data around. Usually in an MPI job, you will use a single binary for your program and have each process decide what it will be doing based on its rank. For example:
mpirun -np 5 ./my_prog
...
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    /* Distribute work */
} else {
    /* Get work from rank 0 */
}
...
It's also possible to run multiple binaries this way if you want to separate out your codes for different parts of the program. How you run this can vary from implementation to implementation, but with MPICH, it works like this:
mpiexec -n 1 ./my_prog1 : -n 4 ./my_prog2
Then my_prog1 would distribute the work to all of the other processes running my_prog2. In this model, all processes still end up in the same MPI_COMM_WORLD so they can just check their rank at the beginning of the program and get working.
I have a small communication problem that has consumed hours of searching. I am using MPICH2 to communicate between different workers. At some points in my program a process needs to multicast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_Bcast. However, this seems to be impossible!
I have tried MPI_Comm_create, but the program simply hangs because it requires every worker to call MPI_Comm_create. I cannot use MPI_Comm_split either, because I do not know the ranks of the recipient workers in advance and hence cannot assign colors to them.
Could you please help me?
Why do you need to create a new communicator at all?
Your description of what you actually want to achieve, and what the constraints are, is a little lacking, but here are some hints that might be applicable to your problem.
Sticking to classical two-sided communication, you need at some point a communication that involves all processes in order to identify the recipients, I guess. You could, for example, broadcast the recipient list to everybody and subsequently send the actual message to those ranks with point-to-point communication, roughly as sketched below (if this relation is going to change over time, I would not bother with creating a new communicator each time).
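A sketch of this, where rank, root, recipients[], nrecv, payload, and TAG are all assumed placeholders:

/* step 1: everyone learns who the recipients are */
MPI_Bcast(&nrecv, 1, MPI_INT, root, MPI_COMM_WORLD);
MPI_Bcast(recipients, nrecv, MPI_INT, root, MPI_COMM_WORLD);

/* step 2: the payload travels point-to-point, only to those ranks */
if (rank == root) {
    for (int i = 0; i < nrecv; i++)
        if (recipients[i] != root)
            MPI_Send(&payload, 1, MPI_INT, recipients[i], TAG, MPI_COMM_WORLD);
} else {
    for (int i = 0; i < nrecv; i++)
        if (recipients[i] == rank)
            MPI_Recv(&payload, 1, MPI_INT, root, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
}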
You could use MPI's one-sided communication concepts and simply write messages from the broadcasting rank into dedicated memory areas on the receiving ranks, along the lines of the sketch below. However, one-sided communication is often considered somewhat awkward, and not necessarily a win on the performance side.
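A rough sketch of the one-sided variant (same placeholder names as above; the window creation is collective and happens once, but the transfer itself involves only the sender):

int slot = -1, msg = 42;   /* slot: local mailbox, -1 == "no message yet" */
MPI_Win win;
MPI_Win_create(&slot, sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

if (rank == root) {
    for (int i = 0; i < nrecv; i++) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, recipients[i], 0, win);
        MPI_Put(&msg, 1, MPI_INT, recipients[i], 0, 1, MPI_INT, win);
        MPI_Win_unlock(recipients[i], win);   /* the Put is complete here */
    }
}
/* the other ranks periodically lock their own window and inspect slot */
MPI_Win_free(&win);   /* collective, at teardown */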
With MPI-3 you could make use of a non-blocking barrier (MPI_Ibarrier). All processes that are not the broadcasting rank open the barrier right away, post a non-blocking receive for any source, and regularly test both for completion of the barrier and of the receive, otherwise proceeding as usual. The broadcasting rank first sends its message to the actual recipients and, once those sends have completed, enters the barrier and waits for it to complete (it must not enter the barrier before its sends are done, or the barrier could complete before the messages arrive). Once the barrier completes, all processes know the distribution is over and can stop listening; those that did not get a message can simply send a message to themselves to properly close the pending receive, and then proceed with their computation.
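A sketch of this scheme, with root, recipients[], nrecv, MAX_RECV, and TAG again as assumed placeholders. Two deviations from the prose above, named plainly: MPI_Issend is used so that send completion implies the matching receive has been posted, and MPI_Cancel replaces the self-send trick, since it also covers the corner case where a message was matched but not yet flagged complete:

MPI_Request bar, rcv;
MPI_Status st;
int msg = 0, bar_done = 0, got = 0, cancelled = 0;

if (rank == root) {
    MPI_Request sends[MAX_RECV];
    for (int i = 0; i < nrecv; i++)
        MPI_Issend(&msg, 1, MPI_INT, recipients[i], TAG, MPI_COMM_WORLD,
                   &sends[i]);
    MPI_Waitall(nrecv, sends, MPI_STATUSES_IGNORE);   /* all receives matched */
    MPI_Ibarrier(MPI_COMM_WORLD, &bar);
    MPI_Wait(&bar, MPI_STATUS_IGNORE);
} else {
    MPI_Irecv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &rcv);
    MPI_Ibarrier(MPI_COMM_WORLD, &bar);
    while (!bar_done) {
        MPI_Test(&bar, &bar_done, MPI_STATUS_IGNORE);
        if (!got) MPI_Test(&rcv, &got, MPI_STATUS_IGNORE);
        /* ... interleave normal work here ... */
    }
    if (!got) {                     /* tidy up the wildcard receive */
        MPI_Cancel(&rcv);
        MPI_Wait(&rcv, &st);
        MPI_Test_cancelled(&st, &cancelled);
        if (!cancelled) got = 1;    /* the message had arrived after all */
    }
}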
How is the Global Arrays (GA) library (built on top of ARMCI) used for communication between two processes located on different remote machines?
Is it something similar to TCP socket programming, where one process waits for data and the other transfers it?
From the documentation I see that ga_put() and ga_get() are two operations used for inter-process communication. Until now I have only been able to come up with a program running on a single machine with a shared-memory architecture (I used ga_put() and ga_get() to put data into a Global Array and to get it back, respectively).
Now I want to use this program for communicating data (basically performing one-sided communication) between two remote processes. Obviously, simply putting the program I run on a single machine onto the remote side will not be enough by itself; it needs some way to tell which machine we should access to get the right data. And here is where I need your help: how can I do this? (What is the GA equivalent of TCP/IP's listen, accept, and connect?)
Or is it the case that GA also uses TCP/IP sockets underneath?
Can someone please explain this to me? Sample code of two remote processes communicating would also be appreciated.
thanks,
I am answering my own question after all. Maybe it will help someone looking into the same issue.
The GA library is implemented to work with MPI, so we have something like:
MPI_Init(..)
GA_Initialize()
MA_Init(..)
// ... do something here
GA_Terminate()
MPI_Finalize()
The answer to my question is:
MPI has the following primitives to support client-server communication:
// server side
MPI_Open_port()
MPI_Comm_accept()
// ... MPI_Send() or MPI_Recv() ...
MPI_Close_port()

// client side
MPI_Comm_connect()
// ... MPI_Recv() or MPI_Send() ...
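A slightly fuller sketch of the same skeleton in C (error handling omitted; "value" is an assumed payload, and the port string printed by the server has to reach the client out of band, e.g. via a file or MPI_Publish_name/MPI_Lookup_name):

// server side
char port[MPI_MAX_PORT_NAME];
int value = 42;                         // assumed payload
MPI_Comm client;
MPI_Open_port(MPI_INFO_NULL, port);
printf("port: %s\n", port);             // hand this string to the client
MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
MPI_Send(&value, 1, MPI_INT, 0, 0, client);
MPI_Comm_disconnect(&client);
MPI_Close_port(port);

// client side
char port[MPI_MAX_PORT_NAME];           // filled in from the server's output
int value;
MPI_Comm server;
MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
MPI_Recv(&value, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
MPI_Comm_disconnect(&server);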
Depending on the hardware support and the MPI implementation used, MPI might use sockets or other mechanisms (e.g. a SAN, system area network).
In general, most MPI implementations use sockets for TCP-based communication.
So yes, GA also uses sockets underneath (depending, of course, on the MPI implementation used).
cheers,
A bit of history: we have an application which was originally written many years ago (1998 is the first date in PVCS, but the app is about 5 years older than that, as it originally was a DOS program). This application communicates with a piece of hardware via serial. When we got to Windows XP we started receiving reports of the app dying after a short time of running. It seemed that the serial comms just 'died' and the app was left in a stuck state; the only way to recover was to restart the application.
The only information I can find regarding this problem says that apparently the Windows message system would miss that information had been received, the buffer would fill, and the system would get stuck. This snippet of information was left in an old Word document, but there's no evidence to back it up. It also mentions that this is only prevalent at high baud rates (115200+).
The solution was to provide customers with USB->Serial converters along with the hardware.
Today: we are working on a new version of the hardware that will run across a network as well as serial ports. To let me work on the network code without the actual hardware, we are using a VSCOM NetCom113 device, which also installs a virtual COM port on the user's (i.e. my) machine.
Now that I have the network code integrated with the app, it appears that the NetCom device exhibits the same behaviour as a physical COM port. This is undesirable, as I need the app to run longer than ~30 seconds.
Google turns up nothing on the problem we are experiencing.
I was wondering:
Has anyone experienced this before? If so what did you do to fix/workaround the problem?
Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?
Unfortunately I can't post code, as the serial code is tightly coupled with the rest of the system, though I'm happy to answer questions about it.
Updates:
The code is written using the Win32 comm routines, so I am using CreateFile and ReadFile. There are also judicious calls to GetOverlappedResult.
It's not hanging per se; it's just that the comms stop. You can access the menus and click the buttons, but nothing can interact with the connected hardware. Using RealTerm you can see that no data is coming in or going out.
I think the reference to the Windows message means that the problem is internal to Windows: data has arrived, but the kernel has missed it and thus not told the rest of the system about it.
Flow control is not used.
Writing a 'simple' test is difficult due to the fact that the code is tightly coupled and the underlying protocol is quite complex, so it would require a lot of work.
Are you using DOS-style serial code, or the Win32 CreateFile approach?
If the former, be very suspicious: if at all possible I'd convert to the latter.
If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)
You might also check into the queue size, which you can set with the SetupComm function.
I don't buy the "Windows message system" stuff -- it sounds fishy; you can write good Win32 serial I/O code that never uses Windows messages.
edit: does your overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped I/O and restarting another read, along the lines sketched below. (I assume the problem is in the read half, not the write half?)
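To illustrate the cancel-and-restart idea, a sketch only (hPort is assumed to be a handle opened with FILE_FLAG_OVERLAPPED, and the timeout is arbitrary); on timeout the pending read is cancelled and reaped so it can safely be reissued:

#include <windows.h>

BOOL ReadWithTimeout(HANDLE hPort, BYTE *buf, DWORD len,
                     DWORD *got, DWORD timeoutMs)
{
    OVERLAPPED ov = {0};
    BOOL ok = FALSE;

    ov.hEvent = CreateEvent(NULL, TRUE /* manual-reset */, FALSE, NULL);
    if (ReadFile(hPort, buf, len, got, &ov)) {
        ok = TRUE;                                   /* completed at once */
    } else if (GetLastError() == ERROR_IO_PENDING) {
        if (WaitForSingleObject(ov.hEvent, timeoutMs) == WAIT_OBJECT_0) {
            ok = GetOverlappedResult(hPort, &ov, got, FALSE);
        } else {
            CancelIo(hPort);                         /* timed out: cancel */
            GetOverlappedResult(hPort, &ov, got, TRUE);  /* reap the op */
        }
    }
    CloseHandle(ov.hEvent);
    return ok;
}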
edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)
Are you sure you have your flow control set up correctly? DTR, RTS, etc...
-Adam
I have written apps that use USB / Bluetooth serial ports and have never had an issue. With Bluetooth I have seen sustained bit rates of 800,000 bps for long periods of time. Most people don't properly implement the port.
Not sure if this is a possibility for you, but if you could rewrite the code in C#/.NET you'd have access to the SerialPort class there, which might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail on XP due to timing (I had a small bit of experience with MIDI).
In addition, I don't know whether you can use the Win32 method of serial port access on Vista, so that might shut future MS OSes out of being able to run your code.