I am trying to send data(forces) across 2 processes, using MPI_SendRecv. Usually the data will be over written in the received buffer, I do not want to overwrite the data in the received buffer instead I want to add the data it received.
I can do the following. Store the data in the previous time step to a different array and then add it after receiving. But I have huge number of nodes and I do not want to have memory allocated for its storage every time step. (or overwrite the same)
My question is there a way to add the received data directly to the buffer and store it in the received memory using MPI?
Any help in this direction would be really thankful.
I am sure collective communication calls (MPI Reduce)cannot be worked out here. Are there any other commands that can do this?
In short: no, but you should be able to do this.
In long: Your suggestion makes a great deal of sense and the MPI Forum is currently considering new features that would enable essentially what you want.
It is incorrect to suggest that the data must be received before it can be accumulated. MPI_Accumulate does a remote accumulation in a one-sided fashion. You want MPI_Sendrecv_accumulate rather than MPI_Sendrecv_replace. This makes perfect sense and an implementation can internally do much better than you can because it can buffer on a per-packet basis, for example.
For suszterpatt, MPI internally buffers in the eager protocol and in the rendezvous protocol can setup a pipeline to minimize buffering.
The implementation of MPI_Recv_accumulate (for simplicity, as the MPI_Send part need not be considered) looks like this:
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, MPI_Op op, int source, int tag, MPI_Comm comm, MPI_Status *status)
{
if (eager)
MPI_Reduce_local(_eager_buffer, buf, count, datatype, op);
else /* rendezvous */
{
malloc _buffer
while (mycount<count)
{
receive part of the incoming data into _buffer
reduce_local from _buffer into buf
}
}
In short: no.
In long: your suggestion doesn't really make sense. The machine can't perform any operations on your received value without first putting it into local memory somewhere. You'll need a buffer to receive the newest value, and a separate sum that you will increment by the buffer's content after every receive.
Related
I am coding an offline, battery-powered esp32 to take periodic sensor readings and store them until a hotspot is found, in which it connects and pushes the data elsewhere. I am relatively new to esp32 and ask for suggestions on the best way to do this.
I was thinking of storing the reading and DateTime in SPIFFS memory and running a webserver that starts when a network is found, checking every minute or so. Since it is battery-powered, I would also like to deep sleep the board to save power. Does the setup() function run again when the board comes out of deep sleep or would I need to have my connectToWiFi function inside the loop?
Is this viable? And are there any better routes to take? I've seen things on asynchronous servers and using the esp32 as an access point that could maybe work. Is it best to download the file through a web server or send the file line by line through a free online database?
Deep sleep on the ESP32 is almost the equivalent of being power cycled - the CPU restarts, and any dynamic memory will have lost its contents. An Arduino program will enter setup() after deep sleep and will have to completely reinitialize everything the program needs to run.
There is a very small area (8Kbytes) of static memory associated with the real time clock (RTC) which is retained during deep sleep. You can directly reference variables stored there using a special decorator (RTC_DATA_ATTR) when you declare the variable.
For instance, you could use a variable stored in this area to count the number of times the CPU has slept and woken up.
RTC_DATA_ATTR uint64_t sleep_counter = 0;
void setup() {
sleep_counter++;
Serial.begin(115200);
Serial.print("ESP32 has woken up ");
Serial.print(sleep_counter);
Serial.println(" times");
}
Beware that it's generally not safe to store objects in this area - you don't necessarily know whether they've allocated memory that won't persist during deep sleep. So storing a String in this memory won't work. Also storing a struct with pointers generally won't work as the pointers won't point to storage in this area.
Also beware that if the ESP32 loses power, RTC_DATA_ATTR will be wiped out.
The RTC static RAM also has the advantage of not costing as much power to write to as SPIFFS.
If you need more storage than this, SPIFFS is certainly an option. Beware that ESP32's generally use cheap NAND flash memory which is rated for a maximum of maybe 100,000 writes.
SPIFFS performs wear-leveling, which will help avoid writing to the same location in flash over and over again, but eventually it will still wear out. This isn't a problem for most projects but suppose you're writing to SPIFFS once a minute for two years - that's over a million writes. So if you're looking for persistent storage that's frequently written to over a very long time you might want to use a better quality of flash storage like an external SD card.
If an SD card is not an option (beware you don’t pull out the SD while writing!) I would write to SPIFFS or a direct write with esp_partition_write().
For the latter: if you use predefined structs with you sensor data (plus time etc) and you start the partition with a small table with the startvalue which has to be updated next time until the other (a mini fat) it’s easy to retrieve data (no fuzz with reading lines). Keep in mind that every time you wipe the flash the wear counts for that whole block! So if you accept old data to be ignored but present, this could dramatically reduce wear.
For example say you write:
Struct:
Uint8_t day;
Uint8_t month;
Uint8_t year; //year minus 2000, max 256
Uint8_t hour;
Uint8_t minutes;
Uint8_t seconds;
Uint8_t sensorMSB;
Uint8_t sensorLSB;
That’s 8 bytes.
The first struct (call it the mini fat):
Uint8_t firstToProcessMSB;
Uint8_t firstToProcessLSB;
Uint8_t amountToProcessMSB;
Uint8_t amountToProcessLSB;
Uint8_t ID0;
Uint8_t ID1;
Uint8_t ID2;
Uint8_t ID3;
Also eight bytes. For the ID you can use some values to recognize a well written partition. But that’s up to you. It’s only a suggestion!
In an 65.536 byte sized partition you can add 8192 (-1) bytes before you have to erase. Max 100.000 times…
When your device makes contact you read out the first bytes. Check if it’s ok, read the start byte. With fseek you step 8 bytes every hop to the end position and read all values in one time until you reached the end. If succesfull, you change the startposition to the end + 1 and only erase when things go tricky. (Not enough space)
It’s wise to do before the longest suspected amount of time to run out. Otherwise you will also lose data. Or you just make the partition bigger.
But if in this case you could write every minute for 5 days.
Usaully, one would have to define a new type and register it with MPI to use it. I am wondering if using protobuf to serialize a object and send it over using MPI as byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
MPI_Get_count(&status, MPI_BYTE, &size);
MPI_Recv(buf, size, MPI_BYTE, ... , &status);
MyObject obj;
obj.ParseFromArray(buf, size);
...
}
Generally you can do that. Your code sketch looks also fine (except for the omitted buf allocation on the receiver side). As Gilles points out, makes sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not MPI_*_ANY.
However, there are some performance limitations.
Protobuf isn't very fast, particularly due to en-/decoding. It very much depends what your performance expectations are. If you run on a high performance network, assume a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead and thus always posting the receive after the send also has performance implications. This means the actual transmission will likely start later, which may or may not have an impact on the senders side since you are using non-blocking sends. There could be cases, where you run into some practical limitations regarding number of unexpected messages. That is not a general correctness issues, but might require some configuration tuning.
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
I searched for bytesToWrite in doc and that what I found "For buffered devices, this function returns the number of bytes waiting to be written. For devices with no buffer, this function returns 0."
First what does mean buffered devices. And can anyone please explain to me what exactly this function does and where or how can I use it.
Many IO devices are buffered, which means that data isn't sent straight away, but it is accumulated to be sent in bulk when there is a sufficient amount.
This is done essentially to have better performance, as sending data normally has some fixed overhead (at the very least the syscall overhead), which is well amortized when sending data in bulk, but would have to be paid for each write if no buffering would be used.
(notice that here we are only talking about QIODevice buffers, normally there are also all kinds of kernel-mode buffers and buffers internal to hardware devices themselves)
bytesToWrite tells you how much stuff is in the QIODevice write buffer, i.e. how many bytes you wrote that are waiting to be actually written (as in, given to the OS to write).
I never actually had to use that member, but I suppose it could be useful e.g. to in a producer-consumer scenario (=if the write buffer is lower than something, then you have to actually calculate the next chunk of data to send), to manually handle buffering in some places or even just for debugging/logging purposes.
it's actually very usefull when you're using an asynchronous API.
you can for example, use it inside a bytesWritten() slot to tell wether the buffer is empty and the data has been fully written or not.
I want to send and receive more than 2 GB of data using MPI and I came across a lot of articles like the ones cited below:
http://blogs.cisco.com/performance/can-we-count-on-mpi-to-handle-large-datasets,
http://blogs.cisco.com/performance/new-things-in-mpi-3-mpi_count
talking about changes that are made starting with MPI 3.0 allowing to send and receive bigger chunks of data.
Most of the functions now are receiving as parameter an MPI_Count object instead of int, but not all of them.
How can I replace
int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
int *size)
in order to get the size of a larger buffer? (because here the size can only be at most 2GB)
The MPI_Pack routines (MPI_Pack, MPI_Unpack, MPI_Pack_size, MPI_Pack_external) are, as you see, unable to support more than 32 bits worth of data, due to the integer pointer used as a return value. I don't know why the standard did not provide MPI_Pack_x, MPI_Unpack_x, MPI_Pack_size_x, and MPI_Pack_external_x -- presumably an oversight? As Jeff suggests, it might have been done so because packing multiple gigs of data is unlikely to provide much benefit. Still, it breaks orthogonality not to have those...
A quality implementation (I do not know if MPICH is one of those) should return an error about the type being too big, allowing you to pack a smaller amount of data.
// Open the ethernet adapter
handle = pcap_open_live("eth0", 65356, 1, 0, errbuf);
// Make sure it opens correctly
if(handle == NULL)
{
printf("Couldn't open device : %s\n", errbuf);
exit(1);
}
// Compile filter
if(pcap_compile(handle, &bpf, "udp", 0, PCAP_NETMASK_UNKNOWN))
{
printf("pcap_compile(): %s\n", pcap_geterr(handle));
exit(1);
}
// Set Filter
if(pcap_setfilter(handle, &bpf) < 0)
{
printf("pcap_setfilter(): %s\n", pcap_geterr(handle));
exit(1);
}
// Set signals
signal(SIGINT, bailout);
signal(SIGTERM, bailout);
signal(SIGQUIT, bailout);
// Setup callback to process the packet
pcap_loop(handle, -1, process_packet, NULL);
The process_packet function gets rid of header and does a bit of processing on the data. However when it takes too long, i think it is dropping packets.
How can i use pcap to listen for udp packets and be able to do some processing on the data without losing packets?
Well, you don't have infinite storage so, if you continuously run slower than the packets arrive, you will lose data at some point.
If course, if you have a decent amount of storage and, on average, you don't run behind (for example, you may run slow during bursts buth there are quiet times where you can catch up), that would alleviate the problem.
Some network sniffers do this, simply writing the raw data to a file for later analysis.
It's a trick you too can use though not necessarily with a file. It's possible to use a massive in-memory structure like a circular buffer where one thread (the capture thread) writes raw data and another thread (analysis) reads and interprets. And, because each thread only handles one end of the buffer, you can even architect it without locks (or with very short locks).
That also makes it easy to detect if you've run out of buffer and raise an error of some sort rather than just losing data at your application level.
Of course, this all hinges on your "simple and quick as possible" capture thread being able to keep up with the traffic.
Clarifying what I mean, modify your process_packet function so that it does nothing but write the raw packet to a massive circular buffer (detecting overflow and acting accordingly). That should make it as fast as possible, avoiding pcap itself dropping packets.
Then, have an analysis thread that takes stuff off the queue and does the work formerly done in process_packet (the "gets rid of header and does a bit of processing on the data" bit).
Another possible solution is to bump up the pcap internal buffer size. As per the man page:
Packets that arrive for a capture are stored in a buffer, so that they do not have to be read by the application as soon as they arrive.
On some platforms, the buffer's size can be set; a size that's too small could mean that, if too many packets are being captured and the snapshot length doesn't limit the amount of data that's buffered, packets could be dropped if the buffer fills up before the application can read packets from it, while a size that's too large could use more non-pageable operating system memory than is necessary to prevent packets from being dropped.
The buffer size is set with pcap_set_buffer_size().
The only other possibility that springs to mind is to ensure that the processing you do on each packet is as optimised as it can be.
The splitting of processing into collection and analysis should alleviate a problem of not keeping up but it still relies on quiet time to catch up. If your network traffic is consistently more than your analysis can handle, all you're doing is delaying the problem. Optimising the analysis may be the only way to guarantee you'll never lose data.