Calling cudaDeviceSynchronize() only for a particular kernel (asynchronous kernels)

I launch kernels KerA and KerB asynchronously. Kernel KerC depends on KerB finishing but is independent of KerA. How can I call cudaDeviceSynchronize() (or something else) so that KerC waits for KerB to finish, but not for KerA?
Time -------------------------->
| KerA ------------------------>
| KerB ------> | KerC --------->

You could achieve this with CUDA streams.
If you do not use any stream, the default stream (aka stream '0') is used, and you get no concurrency (as if cudaDeviceSynchronize() is inserted before and after every CUDA operation, cf. these slides).
However, if KerA runs in one stream and KerB and KerC run in another, you get what you want: KerB and KerC execute in issue order within their stream (so KerC waits for KerB), while the whole chain is asynchronous with respect to KerA. Prefer two non-default streams (or create them with the cudaStreamNonBlocking flag), because the legacy default stream implicitly synchronizes with other blocking streams. You can use cudaStreamSynchronize(stream) to make the host wait on a specific stream only.
Time ------------------------------------>
| Stream 1: KerA ------------------------>
| Stream 2: KerB ------> | KerC --------->
Examples are available in the slides I linked. You can also check the simpleStreams or concurrentKernels samples of the SDK.
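For concreteness, here is a minimal sketch of that arrangement; the kernel bodies, launch configuration, and omitted error handling are placeholder assumptions rather than part of the question:

__global__ void KerA() { /* ... */ }
__global__ void KerB() { /* ... */ }
__global__ void KerC() { /* ... */ }

int main() {
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    KerA<<<128, 256, 0, sA>>>();   // independent work in its own stream
    KerB<<<128, 256, 0, sB>>>();
    KerC<<<128, 256, 0, sB>>>();   // same stream as KerB, so it starts only after KerB finishes

    cudaStreamSynchronize(sB);     // host waits for the KerB -> KerC chain, not for KerA
    cudaStreamSynchronize(sA);     // wait for KerA later, when its results are actually needed

    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    return 0;
}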

Asynchronous data exchange between tasks

The documentation in the "Tasks in Toit" section indicates that the language has facilities for asynchronous data exchange between tasks. If I understand correctly, two classes from the monitor package, Channel and Mailbox, provide this capability. Unfortunately, I could not find examples of using these classes, so I would appreciate at least a minimal example of two tasks:
1. One of the tasks is a message generator: it sends, for example, integers or strings to the second task, and the second task receives those numbers or strings. Presumably the Channel class should be used in this case.
2. Each of the two tasks is both a generator and a receiver of messages. That is, the first task sends messages to the second task and, in turn, asynchronously receives the messages generated by the second task. Judging by the description, the Mailbox class should be used in this case.
Thanks in advance,
MK
Here's an example of the first part, using Channel. This class is useful if you have a stream of messages for another task.
import monitor

main:
  // A channel with a backlog of 5 items. Once the reader is 5 items behind, the
  // writer will block when trying to send. This helps avoid unbounded memory
  // use by the in-flight messages if messages are being generated faster than
  // they are being consumed. Decreasing this will tend to reduce throughput,
  // increasing it will increase memory use.
  channel := monitor.Channel 5
  task:: message_generating_task channel
  task:: message_receiving_task channel

/// Normally this could be looking at IO ports, GPIO pins, or perhaps a socket
/// connection. It could block for some unknown time while doing this. In this
/// case we just sleep a little to illustrate that the data arrives at random
/// times.
generate_message:
  milliseconds := random 1000
  sleep --ms=milliseconds
  // The message is just a string, but could be any object.
  return "Message creation took $(milliseconds)ms"

message_generating_task channel/monitor.Channel:
  10.repeat:
    message := generate_message
    channel.send message
  channel.send null  // We are done.

/// Normally this could be looking at IO ports, GPIO pins, or perhaps a socket
/// connection. It could block for some unknown time while doing this. In this
/// case we just sleep a little to illustrate that the data takes a random
/// amount of time to process.
process_message message:
  milliseconds := random 1000
  sleep --ms=milliseconds
  print message

message_receiving_task channel/monitor.Channel:
  while message := channel.receive:
    process_message message
Here is an example of using Mailbox. This class is useful if you have a task processing requests and giving responses to other tasks.
import monitor

main:
  mailbox := monitor.Mailbox
  task:: client_task 1 mailbox
  task:: client_task 2 mailbox
  task --background:: factorization_task mailbox

/// Normally this could be looking at IO ports, GPIO pins, or perhaps a socket
/// connection. It could block for some unknown time while doing this. For
/// this example we just sleep a little to illustrate that the data arrives at
/// random times.
generate_huge_number:
  milliseconds := random 1000
  sleep --ms=milliseconds
  return (random 100) + 1  // Not actually so huge.

client_task task_number mailbox/monitor.Mailbox:
  10.repeat:
    huge := generate_huge_number
    factor := mailbox.send huge  // Send number, wait for result.
    other_factor := huge / factor
    print "task $task_number: $factor * $other_factor == $huge"

// Factorize a number using the quantum computing port.
factorize_number number:
  // TODO: Use actual quantum computing instead of brute-force search.
  for i := number.sqrt.round; i > 1; i--:
    factor := number / i
    if factor * i == number:
      return factor
    // This will yield so the other tasks can run. In a real application it
    // would be waiting on an IO pin connected to the quantum computing unit.
    sleep --ms=1
  return 1  // 1 is sort-of a factor of all integers.

factorization_task mailbox/monitor.Mailbox:
  // Because this task was started as a background task (see 'main' function),
  // the program does not wait for it to exit so this loop does not need a real
  // exit condition.
  while number := mailbox.receive:
    result := factorize_number number
    mailbox.reply result
I'm pretty sure the Mailbox example worked great at the end of March. I decided to check it now and got the error:
In the Toit console:
./web.toit:8:3: error: Argument mismatch: 'task'
task --background:: factorization_task mailbox
^~~~
Compilation failed.
In the terminal:
micrcx@micrcx-desktop:~/toit_apps/Hsm/communication$ toit execute mailbox_sample.toit
mailbox_sample.toit:8:3: error: Argument mismatch: 'task'
task --background:: factorization_task mailbox
^~~~
Compilation failed.
Perhaps this is due to the latest SDK update. Just in case:
Toit CLI: v1.0.0 (2021-03-29)

When to use MPI_BUFFER_ATTACH?

As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error-prone and I strongly encourage you not to use them.
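To make the distinction concrete, here is a minimal sketch of the explicitly buffered pattern; the 20-int payload and the rank-0-to-rank-1 exchange are invented for illustration. Plain MPI_Send/MPI_Recv would need none of the buffer management shown here.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Size the attached buffer with MPI_Pack_size + MPI_BSEND_OVERHEAD. */
    int packed_size, buf_size;
    MPI_Pack_size(20, MPI_INT, MPI_COMM_WORLD, &packed_size);
    buf_size = packed_size + MPI_BSEND_OVERHEAD;
    void *buf = malloc(buf_size);
    MPI_Buffer_attach(buf, buf_size);

    int payload[20];
    if (rank == 0) {
        for (int i = 0; i < 20; i++) payload[i] = i;
        MPI_Bsend(payload, 20, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns once copied into buf */
    } else if (rank == 1) {
        MPI_Recv(payload, 20, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d..%d\n", payload[0], payload[19]);
    }

    /* Detach blocks until all buffered messages have been delivered. */
    MPI_Buffer_detach(&buf, &buf_size);
    free(buf);
    MPI_Finalize();
    return 0;
}

For reference, the MPI_Buffer_attach man page: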
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer - initial buffer address (choice)
size   - buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
The MPI_BSEND_OVERHEAD gives the maximum amount of space that may be used in the buffer for use by the BSEND routines in using the buffer. This value is in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarantee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs@mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage

Controlling the number of spawned futures to create backpressure

I am using a futures-rs powered version of the Rusoto AWS Kinesis library. I need to spawn a deep pipeline of AWS Kinesis requests to achieve high-throughput because Kinesis has a limit of 500 records per HTTP request. Combined with the 50ms latency of sending a request, I need to start generating many concurrent requests. I am looking to create somewhere on the order of 100 in-flight requests.
The Rusoto put_records function signature looks like this:
fn put_records(
    &self,
    input: &PutRecordsInput,
) -> RusotoFuture<PutRecordsOutput, PutRecordsError>
The RusotoFuture is a wrapper defined like this:
/// Future that is returned from all rusoto service APIs.
pub struct RusotoFuture<T, E> {
    inner: Box<Future<Item = T, Error = E> + 'static>,
}
The inner Future is wrapped, but RusotoFuture still implements Future::poll(), so I believe it is compatible with the futures-rs ecosystem. RusotoFuture also provides a synchronization call:
impl<T, E> RusotoFuture<T, E> {
    /// Blocks the current thread until the future has resolved.
    ///
    /// This is meant to provide a simple way for non-async consumers
    /// to work with rusoto.
    pub fn sync(self) -> Result<T, E> {
        self.wait()
    }
}
I can issue a request and sync() it, getting the result from AWS. I would like to create many requests, put them in some kind of queue/list, and gather finished requests. If the request errored I need to reissue the request (this is somewhat normal in Kinesis, especially when hitting limits on your shard throughput). If the request is completed successfully I should issue a request with new data. I could spawn a thread for each request and sync it but that seems inefficient when I have the async IO thread running.
I have tried using futures::sync::mpsc::channel from my application thread (not running from inside the Tokio reactor) but whenever I clone the tx it generates its own buffer, eliminating any kind of backpressure on send:
fn kinesis_pipeline(client: DefaultKinesisClient, stream_name: String, num_puts: usize, puts_size: usize) {
    use futures::sync::mpsc::{ channel, spawn };
    use futures::{ Sink, Future, Stream };
    use futures::stream::Sender;
    use rusoto_core::reactor::DEFAULT_REACTOR;

    let client = Arc::new(KinesisClient::simple(Region::UsWest2));
    let data = FauxData::new(); // a data generator for testing

    let (mut tx, mut rx) = channel(1);

    for rec in data {
        tx.clone().send(rec);
    }
}
Without the clone, I have the error:
error[E0382]: use of moved value: `tx`
--> src/main.rs:150:9
|
150 | tx.send(rec);
| ^^ value moved here in previous iteration of loop
|
= note: move occurs because `tx` has type `futures::sync::mpsc::Sender<rusoto_kinesis::PutRecordsRequestEntry>`, which does not implement the `Copy` trait
I have also looked at futures::sync::mpsc::spawn based on recommendations, but it takes ownership of the rx (as a Stream) and does not solve my problem with the clones of tx causing unbounded behavior.
I'm hoping if I can get the channel/spawn usage working, I will have a system which takes RusotoFutures, waits for them to complete, and then provides me an easy way to grab completion results from my application thread.
As far as I can tell, your problem with channel is not that a single clone of the Sender increases the capacity by one; it is that you clone the Sender for every item you're trying to send.
The error you're seeing without clone comes from your incorrect usage of the Sink::send interface. With clone you actually should see the warning:
warning: unused `futures::sink::Send` which must be used: futures do nothing unless polled
That is: your current code doesn't actually ever send anything!
In order to apply backpressure you need to chain those send calls: each one should wait until the previous one has finished (and you need to wait for the last one too!); on success you get the Sender back. The best way to do this is to generate a Stream from your iterator by using iter_ok and to pass it to send_all.
Now you have one SendAll future that you need to "drive". If you ignore the result and panic on error (.then(|r| { r.unwrap(); Ok::<(), ()>(()) })) you could spawn it as a separate task, but maybe you want to integrate it into your main application (i.e. return it in a Box).
// this returns a `Box<Future<Item = (), Error = ()>>`. you may
// want to use a different error type
Box::new(tx.send_all(iter_ok(data)).map(|_| ()).map_err(|_| ()))
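For orientation, here is how the pieces might fit together as a standalone sketch, using the futures 0.1 and tokio 0.1 APIs from the era of the question; the data source, channel bound, and the consumer body are placeholders (in the real application the consumer is where the Kinesis put_records calls would be issued):

extern crate futures;
extern crate tokio;

use futures::stream::iter_ok;
use futures::sync::mpsc::channel;
use futures::{Future, Sink, Stream};

fn main() {
    // Stand-in for the question's FauxData generator.
    let data = 0..1_000u32;

    // A small bound gives strong backpressure: the producer can only run
    // ahead of the consumer by roughly this many items (plus one per Sender).
    let (tx, rx) = channel::<u32>(1);

    // Producer: one SendAll future that feeds the whole iterator into the
    // channel, waiting whenever the channel is full.
    let producer = tx
        .sink_map_err(|_| ())              // unify error types for the join below
        .send_all(iter_ok::<_, ()>(data))
        .map(|_| ());                      // dropping the Sender closes the channel

    // Consumer: just prints here; in the real application each item (or batch)
    // would become a Kinesis request.
    let consumer = rx.for_each(|item| {
        println!("got {}", item);
        Ok(())
    });

    tokio::run(producer.join(consumer).map(|_| ()));
}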
RusotoFuture::sync and Future::wait
Don't use Future::wait: it is already deprecated in a branch, and it usually won't do what you are actually looking for. I doubt RusotoFuture is aware of the problems, so I recommend avoiding RusotoFuture::sync.
Cloning Sender increases channel capacity
As you correctly stated, cloning a Sender increases the capacity by one.
This seems to be done to improve performance: a Sender starts in the unblocked ("unparked") state; if a Sender isn't blocked, it can send an item without blocking. But if the number of items in the queue hits the configured limit when a Sender sends an item, that Sender becomes blocked ("parked"). (Removing items from the queue will unblock the Sender at some point.)
This means that after the inner queue hits the limit, each Sender can still send one item, which leads to the documented effect of increased capacity, but only if all the Senders are actually sending items; unused Senders don't increase the observed capacity.
The performance boost comes from the fact that as long as you don't hit the limit it doesn't need to park and notify tasks (which is quite heavy).
The private documentation at the top of the mpsc module describes more of the details.

Why is MPI_Win_unlock so slow?

My application uses one-sided communications (MPI_Rget, MPI_Raccumulate) with synchronization primitives like MPI_Win_lock and MPI_Win_unlock for its passive target synchronization.
I profiled my application and found that most of the time is being spent in the MPI_Win_unlock function (not MPI_Win_lock), and I cannot understand why.
(1) Does anyone know why the MPI_Win_unlock function takes so much time? (Maybe it's an implementation issue.)
(2) Can this situation get better if I move to the S/C/P/W (start/complete/post/wait) synchronization model?
I just need to be sure that the one-sided operations do not overlap concurrently.
I am using the Intel MPI Library, version 5.1, which implements MPI 3.
I have appended some snippets of my code (actually, that's all of it :D).
Each MPI process runs 'Run()'
Run ()
  // Join
  For each Target_Proc i in MPI_COMM_WORLD:
    RequestDataFrom ((i + k) % nprocs)   // requests the k-step-away neighbor's data asynchronously
    ConsumeDataFrom (i)
    JoinWithMyData (my_rank, i)
    WriteBackDataTo (i)
  goto the above 'For' loop again if the termination condition does not hold
  MPI_Barrier (MPI_COMM_WORLD)
  // Update data in window
  UpdateMyWindow (my_rank)

RequestDataFrom (target_rank_id)
  MPI_Win_lock (MPI_LOCK_SHARED, target_rank_id, win)
  MPI_Rget (from target_rank_id, win, &requests[target_rank_id])
  MPI_Win_unlock (target_rank_id, win)

ConsumeDataFrom (target_rank_id)
  MPI_Wait (&requests[target_rank_id])
  GetPointerToBuffer (target_rank_id)

WriteBackDataTo (target_rank_id)
  MPI_Win_lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
  MPI_Rput (to target_rank_id, win, &requests[target_rank_id])
  MPI_Win_unlock (target_rank_id, win)

UpdateMyWindow (my_rank)
  MPI_Win_lock (MPI_LOCK_EXCLUSIVE, my_rank, win)
  Update ()
  MPI_Win_unlock (my_rank, win)
The function MPI_Win_unlock will block until all RMA operations of the access epoch have been completed.
As such it is no surprise that your profiler shows this function taking the majority of the time: it blocks until the MPI implementation has completed all one-sided communication operations that were posted since the corresponding MPI_Win_lock.
Note that the one-sided operations (Put, Get, etc.) merely dispatch the operation and do not block until it has completed. These operations are therefore very similar to the non-blocking communication functions (MPI_Isend/MPI_Irecv), just without the MPI_Request object. To continue the analogy, MPI_Win_unlock waits for all operations to complete, similar to MPI_Waitall.
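As a concrete illustration, here is a minimal, self-contained sketch (the one-double window and the neighbor choice are made up, not taken from the code above) showing where each call completes: MPI_Wait gives local completion of the single Rget, while MPI_Win_unlock closes the whole epoch, which is where the remaining synchronization cost tends to show up in a profile.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = (double)rank;   /* each rank exposes one double */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % nprocs;
    double remote;
    MPI_Request req;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Rget(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* local completion of this one Rget */
    MPI_Win_unlock(target, win);        /* completes the whole access epoch */

    printf("rank %d read %.0f from rank %d\n", rank, remote, target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}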

Confusion over the time taken by write() and read() sys calls

The below code simply calculates the time taken to write a file.
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fp;
    time_t a, b;
    const char *str = "Life is like that only";

    fp = open("tmp.txt", O_WRONLY | O_CREAT, 0666);
    time(&a);
    write(fp, str, strlen(str));
    time(&b);
    /* (b - a) should be the time taken to write
     * the file tmp.txt.
     */
    close(fp);
    return 0;
}
My question is: if we have a single CPU, will the time taken (b-a) be exact, or can it be affected by other processes running in parallel?
Some posts here mention that write() and read() can be treated almost like atomic syscalls: if they do not succeed, errno is set to EINTR, which simply means "try again". But does that mean that, if the call is successful, all other processes are on hold for the duration of its execution?
Other processes (that are not using I/O or that are using I/O on different devices) can run while your process is waiting for the write to complete, and your process may not immediately get the CPU back after it completes.
In practice, for a small write to a regular file, your write() will probably return immediately after copying your data into a kernel-space buffer, rather than waiting for it to go all the way to the disk.
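To make that buffering visible, here is an illustrative sketch (not from the original post) that times the write() call itself and then write() plus an explicit fsync(), using a clock with better than one-second resolution:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec s, struct timespec e)
{
    return (e.tv_sec - s.tv_sec) * 1e3 + (e.tv_nsec - s.tv_nsec) / 1e6;
}

int main(void)
{
    const char *str = "Life is like that only";
    int fd = open("tmp.txt", O_WRONLY | O_CREAT, 0666);
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    write(fd, str, strlen(str));          /* usually just copies into the page cache */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fsync(fd);                            /* forces the data out to the device */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("write(): %.3f ms, write()+fsync(): %.3f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t0, t2));
    close(fd);
    return 0;
}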
