I don't understand what exactly does the function bytesToWrite() Qt - qt

I searched for bytesToWrite in doc and that what I found "For buffered devices, this function returns the number of bytes waiting to be written. For devices with no buffer, this function returns 0."
First what does mean buffered devices. And can anyone please explain to me what exactly this function does and where or how can I use it.

Many IO devices are buffered, which means that data isn't sent straight away, but it is accumulated to be sent in bulk when there is a sufficient amount.
This is done essentially to have better performance, as sending data normally has some fixed overhead (at the very least the syscall overhead), which is well amortized when sending data in bulk, but would have to be paid for each write if no buffering would be used.
(notice that here we are only talking about QIODevice buffers, normally there are also all kinds of kernel-mode buffers and buffers internal to hardware devices themselves)
bytesToWrite tells you how much stuff is in the QIODevice write buffer, i.e. how many bytes you wrote that are waiting to be actually written (as in, given to the OS to write).
I never actually had to use that member, but I suppose it could be useful e.g. to in a producer-consumer scenario (=if the write buffer is lower than something, then you have to actually calculate the next chunk of data to send), to manually handle buffering in some places or even just for debugging/logging purposes.

it's actually very usefull when you're using an asynchronous API.
you can for example, use it inside a bytesWritten() slot to tell wether the buffer is empty and the data has been fully written or not.

Related

MPI_Probe() for determining message size

A common usage of MPI_Probe is in determining the size of an incoming message so that enough memory is allocated for the receive buffer. But this can also be done with a separate pair of MPI_Send-MPI_Recv calls, i.e. the sender process sends the message size to the receiver in a different message. Can it be assumed that MPI_Probe is in general the faster option? Why? We can perform some tests and compare the walltimes, but the results may be implementation-dependent.
For short messages the latency is more important than the size of the message, so probing for a small message is probably faster.
Probing makes it easier to deal with MPI_ANY_SOURCE as a sender: otherwise you'd have to first determine where the size msg comes from, and then do a specific receive from that source.
Instead of MPI_Probe, people often do MPI_Iprobe which tells you if there is a message at all. Yes, you can emulate that with multiple Irecvs, but why would you make your code so complicated?

Is this an advantage of MPI_PACK over derived datatype?

Suppose a process is going to send a number of arrays of different sizes but of the same type to another process in a single communication, so that the receiver builds the same arrays in its memory. Prior to the communication the receiver doesn't know the number of arrays and their sizes. So it seems to me that though the task can be done quite easily with MPI_Pack and MPI_Unpack, it cannot be done by creating a new datatype because the receiver doesn't know enough. Can this be regarded as an advantage of MPI_PACK over derived datatypes?
There is some passage in the official document of MPI which may be referring to this:
The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part.
You are absolutely right. The way I phrase it is that with MPI_Pack I can make "self-documenting messages". First you store an integer that says how many elements are coming up, then you pack those elements. The receiver inspects that first int, then unpacks the elements. The only catch is that the receiver needs to know an upper bound on the number of bytes in the pack buffer, but you can do that with a separate message, or a MPI_Probe.
There is of course the matter that unpacking a packed message is way slower than straight copying out of a buffer.
Another advantage to packing is that it makes heterogeneous data much easier to handle. The MPI_Type_struct is rather a bother.

MPI_Lock_win / passive synchronization usage confusion

I'm trying to convert an application from using standard point-to-point MPI calls (e.g., MPI_Isend, MPI_Irecv) to using MPI-3's one-sided calls. My goal is to improve performance on my hardware, which is a system that has Infiniband hardware support and an MPI implementation that's optimized for RDMA calls. I've been told that the hardware performs particularly well with passive synchronization mode, as opposed to active synchronization (i.e., Post-Start-Complete-Wait).
However, even after reading through the MPI standard documentation and examples, I'm confused on how to actually use the calls. For context, my program has a setup phase where I will know the communication pattern and even the buffers of the send data and ultimate buffer of the receiver. So, it's straightforward to set up a window and use it.
Specifically, with passive synchronization, I'm confused about when the "receiver" knows the data in the window has been written by the sender. What I want to do is have the sender produce the message data, then call MPI_Win_lock on the window and then do an MPI_Put and then wait for completion with a MPI_Win_Unlock. But, what is an efficient / recommended way for the "receiver" (window target) of the data to know when the message data has been written? Similarly, given that the communication pattern is iterated and the same receive buffer (the target's buffer) is used multiple times, how do I know that the receiver is done consuming the buffer and it can be reused?
I can envision a couple of approaches:
I can use an MPI_Barrier after the MPI_Win_unlock and before the receiver accesses the data. (This seems that it would work but I'm skeptical that this would yield better performance than active synchronization.)
I can possibly use MPI_Lock and MPI_Unlock on the receiver (target), locking the window when the target is actually using the data so the access epoch can't start on the origin (but, is that the way it works? I've read that lock and unlock don't create critical sections in the traditional sense).
Some sort of home-grown approach where the receiver polls for some sort of a nonce to be written, knowing the data is available when that happens.
Docs for MPI_Win_lock: https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_lock.3.php
In general, how does a programmer synchronize with MPI_Lock and 'MPI_Unlock` in a way that's any more efficient than the active synchronization approach? It does feel like I need to just use post-start-complete-wait, but I'm hoping you can help me find a way to try passive synchronization as well.

Handling messages over TCP

I'm trying to send and receive messages over TCP using a size of each message appended before the it starts.
Say, First three bytes will be the length and later will the message:
As a small example:
005Hello003Hey002Hi
I'll be using this method to do large messages, but because the buffer size will be a constant integer say, 200 Bytes. So, there is a chance that a complete message may not be received e.g. instead of 005Hello I get 005He nor a complete length may be received e.g. I get 2 bytes of length in message.
So, to get over this problem, I'll need to wait for next message and append it to the incomplete message etc.
My question is: Am I the only one having these difficulties to appending messages to each other, appending lengths etc.. to make them complete Or is this really usually how we need to handle the individual messages on TCP? Or, if there is a better way?
What you're seeing is 100% normal TCP behavior. It is completely expected that you'll loop receiving bytes until you get a "message" (whatever that means in your context). It's part of the work of going from a low-level TCP byte stream to a higher-level concept like "message".
And "usr" is right above. There are higher level abstractions that you may have available. If they're appropriate, use them to avoid reinventing the wheel.
So, there is a chance that a complete message may not be received e.g.
instead of 005Hello I get 005He nor a complete length may be received
e.g. I get 2 bytes of length in message.
Yes. TCP gives you at least one byte per read, that's all.
Or is this really usually how we need to handle the individual messages on TCP? Or, if there is a better way?
Try using higher-level primitives. For example, BinaryReader allows you to read exactly N bytes (it will internally loop). StreamReader lets you forget this peculiarity of TCP as well.
Even better is using even more higher-level abstractions such as HTTP (request/response pattern - very common), protobuf as a serialization format or web services which automate pretty much all transport layer concerns.
Don't do TCP if you can avoid it.
So, to get over this problem, I'll need to wait for next message and append it to the incomplete message etc.
Yep, this is how things are done at the socket level code. For each socket you would like to allocate a buffer of at least the same size as kernel socket receive buffer, so that you can read the entire kernel buffer in one read/recv/resvmsg call. Reading from the socket in a loop may starve other sockets in your application (this is why they changed epoll to be level-triggered by default, because the default edge-triggered forced application writers to read in a loop).
The first incomplete message is always kept in the beginning of the buffer, reading the socket continues at the next free byte in the buffer, so that it automatically appends to the incomplete message.
Once reading is done, normally a higher level callback is called with the pointers to all read data in the buffer. That callback should consume all complete messages in the buffer and return how many bytes it has consumed (may be 0 if there is only an incomplete message). The buffer management code should memmove the remaining unconsumed bytes (if any) to the beginning of the buffer. Alternatively, a ring-buffer can be used to avoid moving those unconsumed bytes, but in this case the higher level code should be able to cope with ring-buffer iterators, which it may be not ready to. Hence keeping the buffer linear may be the most convenient option.

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query current status of a device. Those are exposed by the vendor of your particular implementation (like the GPUPerfAPI from AMD or the Graphics Performance analyzer from Intel).
Hope this helps.
What I did to be able to determine the free memory at runtime is write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through said wrapper.
At the begin of the program, I query the total device memory (CL_DEVICE__GLOBAL_MEM_SIZE) and when buffers are allocated I store their addresses and sizes in a vector so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can assign callback calls to the buffers, which are called when the buffer is destroyed (clSetMemObjectDestructorCallback). So I use those to clean up when the buffer is released. Hint: the cl_mem parameter with which the callback is called is NOT a valid mem object. It may have already been destroyed so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I can always know, how much memory is left on the device.

Resources