Suppose I want to perform an async memcpy host to device in CUDA, then immediately run the kernel. How can I test in the kernel whether the async transfer has completed?
Sequencing your asynchronous copy and kernel launch on a CUDA "stream" ensures that the kernel executes only after the asynchronous transfer has completed. The following code example demonstrates this:
#include <stdio.h>

__global__ void kernel(const int *ptr)
{
    printf("Hello, %d\n", *ptr);
}

int main()
{
    int *h_ptr = 0;

    // allocate pinned host memory with cudaMallocHost;
    // pinned memory is required for a truly asynchronous copy
    cudaMallocHost(&h_ptr, sizeof(int));

    // look for thirteen in the output
    *h_ptr = 13;

    // allocate device memory
    int *d_ptr = 0;
    cudaMalloc(&d_ptr, sizeof(int));

    // create a stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // sequence the asynchronous copy on our stream
    cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

    // sequence the kernel on our stream after the copy;
    // the kernel will execute after the copy has completed
    kernel<<<1,1,0,stream>>>(d_ptr);

    // wait for the stream's work (and the kernel's printf) to finish
    cudaStreamSynchronize(stream);

    // clean up after ourselves
    cudaStreamDestroy(stream);
    cudaFree(d_ptr);
    cudaFreeHost(h_ptr);
}
And the output:
$ nvcc -arch=sm_20 async.cu -run
Hello, 13
I don't believe there's any supported way to test, from within a kernel, whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently of other threads of execution.
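If what you actually need is to test for completion from the host rather than from within the kernel, a minimal sketch using cudaEventQuery could look like this; the event name copyDone is illustrative, and d_ptr, h_ptr, and stream are from the example above:
// record an event after the async copy and poll it without blocking
cudaEvent_t copyDone;
cudaEventCreate(&copyDone);
cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);
cudaEventRecord(copyDone, stream);

// cudaSuccess means the copy has finished; cudaErrorNotReady means not yet
while (cudaEventQuery(copyDone) == cudaErrorNotReady) {
    // do other host-side work while the transfer is in flight
}
cudaEventDestroy(copyDone);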
Related
The kernel needs a few minutes to finish.
I want to display a moving progress bar in the console while the kernel is running on the GPU.
Normally, the function clEnqueueNDRangeKernel enqueues the kernels, and when they are finished, the CPU continues with the following operations, such as clWaitForEvents and clReleaseMemObject.
However, I want the CPU to print a progress bar continuously after clEnqueueNDRangeKernel but before the kernels finish.
Is there any way to do that?
Create one thread that handles the GPU queue and one thread that handles the console output.
You can share information between the two threads by allocating global variables.
#include <iostream>
#include <thread>
#include <atomic>
using namespace std;

// shared state; std::atomic avoids a data race between the two threads
atomic<bool> not_finished(true);
atomic<float> progress(0.0f);

void do_console_output() {
    while (not_finished) {
        // do console output ...
        cout << progress << endl;
    }
}

void do_opencl_stuff() {
    while (not_finished) {
        // do OpenCL stuff ...
        progress = progress + 0.01f; // single writer, so load+store is fine
    }
}

int main() {
    thread console_thread(do_console_output); // launch a separate thread
    do_opencl_stuff();                        // execute this in the main thread
    console_thread.join();
    return 0;
}
I have two pointers in memory and I want to swap them atomically, but atomic operations in CUDA support only integer types. Is there a way to do the following swap?
classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
When writing device-side code...
While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.
To perform this swap, you will need to "protect" access to both these values with something like a mutex, and have whoever wants to write to them hold the mutex for the duration of the critical section (like C++'s host-side std::lock_guard). This can be done using CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:
Implementing a critical section in CUDA
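For reference, a minimal sketch of such a device-side lock built on atomicCAS; the names are illustrative, and (see the linked question) only one thread per block or warp should contend for the lock to avoid intra-warp deadlock:
// hedged sketch of a spinlock via atomicCAS
__device__ int lock = 0;

__device__ void acquire()
{
    // spin until we atomically change lock from 0 (free) to 1 (held)
    while (atomicCAS(&lock, 0, 1) != 0) { }
    __threadfence(); // make the previous holder's writes visible
}

__device__ void release()
{
    __threadfence(); // publish our writes before releasing
    atomicExch(&lock, 0);
}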
A caveat to the above is mentioned by @RobertCrovella: if you can make do with, say, a pair of 32-bit offsets rather than two 64-bit pointers, you could store them in a single 64-bit-aligned struct and use a 64-bit compare-and-exchange (or exchange) on the whole struct to swap both values atomically.
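A minimal sketch of that idea, assuming two 32-bit offsets packed into one 64-bit word (the union and names are illustrative):
#include <cstdio>

// two 32-bit offsets viewed as one 64-bit word for the atomic
union Pair {
    unsigned long long u64;
    struct { unsigned a, b; } off;
};

__device__ unsigned long long g_pair; // holds both offsets

__global__ void swapPair(unsigned newA, unsigned newB)
{
    Pair p;
    p.off.a = newA;
    p.off.b = newB;
    // one 64-bit atomic replaces both 32-bit offsets at once
    Pair old;
    old.u64 = atomicExch(&g_pair, p.u64);
    printf("old offsets: %u %u\n", old.off.a, old.off.b);
}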
... but is it really device-side code?
Your code actually doesn't look like something one would run on the device: memory allocation is usually (though not always) done from the host side before you launch your kernel and do the actual work. If you can make sure these pointer updates only happen on the host side (think CUDA events and callbacks), and that device-side code will not be interfered with by them, you can just use your plain vanilla C++ facilities for concurrent programming (like the lock_guard I mentioned above).
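A minimal host-side sketch of that route, assuming the question's classA type (the mutex name is illustrative):
#include <mutex>
#include <utility>

struct classA;        // the question's type

std::mutex ptr_mutex; // protects both pointers
classA* a1;
classA* a2;

void swapPointers()
{
    std::lock_guard<std::mutex> guard(ptr_mutex); // held for the whole swap
    std::swap(a1, a2);
}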
I managed to get the behaviour I needed; it is not an atomic swap, but it is still safe. The context was a monotonic linked list working on both CPU and GPU:
template<typename T>
union readablePointer
{
    T* ptr;
    unsigned long long int address;
};

template<typename T>
struct LinkedList
{
    struct Node
    {
        T value;
        readablePointer<Node> previous;
    };

    Node start;
    Node end;
    int size;

    __host__ __device__ void initialize()
    {
        size = 0;
        start.previous.ptr = nullptr;
        end.previous.ptr = &start;
    }

    __host__ __device__ void push_back(T value)
    {
        // malloc returns the pointer; it does not take one by address
        Node* node = (Node*)malloc(sizeof(Node));
        readablePointer<Node> nodePtr;
        nodePtr.ptr = node;
        nodePtr.ptr->value = value;
#ifdef __CUDA_ARCH__
        nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
        atomicAdd(&size, 1);
#else
        nodePtr.ptr->previous.address = end.previous.address;
        end.previous.address = nodePtr.address;
        size += 1;
#endif
    }

    __host__ __device__ T pop_back()
    {
        assert(end.previous.ptr != &start);
        readablePointer<Node> lastNodePtr;
        lastNodePtr.ptr = nullptr;
#ifdef __CUDA_ARCH__
        lastNodePtr.address = atomicExch(&end.previous.address, end.previous.ptr->previous.address);
        atomicSub(&size, 1);
#else
        lastNodePtr.address = end.previous.address;
        end.previous.address = end.previous.ptr->previous.address;
        size -= 1;
#endif
        T toReturn = lastNodePtr.ptr->value;
        free(lastNodePtr.ptr);
        return toReturn;
    }

    __host__ __device__ void clear()
    {
        while (size > 0)
        {
            pop_back();
        }
    }
};
I have been working on convolution using OpenCL in Eclipse. It gives a segmentation fault after enqueueNDRangeKernel.
Here is my host code.
I have taken the input image using OpenCV, and then:
const int width = image.size().width;
const int height = image.size().height;
std::cout << "width: \t" << width << "\t height: " << height << std::endl;

std::size_t in_imagesize = (width * height) * sizeof(float);
std::vector<float> ptr(width * height, 0);

const float filter[3] = {1, 2, 3};
std::size_t filter_size = 3 * sizeof(float); // a byte count, so size_t rather than float
const int FilterRadius = 1;

cv::Mat result_image = cv::Mat(cv::Size(width, height), CV_32FC1);
std::size_t out_imagesize = sizeof(float) * (width * height);
std::vector<float> read_buffer(width * height, 0);
Then the context, command queue, and kernel program, and after that:
cl::Buffer input_dev, filter_kernel, output_dev;

input_dev = cl::Buffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, in_imagesize, image.data, &err);
if (err != CL_SUCCESS) {
    std::cout << "Input Buffer Failed" << std::endl;
}

output_dev = cl::Buffer(ctx, CL_MEM_READ_WRITE, out_imagesize, NULL, &err);
if (err != CL_SUCCESS) {
    std::cout << "Output Buffer Failed" << std::endl;
}

filter_kernel = cl::Buffer(ctx, CL_MEM_READ_ONLY, filter_size, NULL, &err);
if (err != CL_SUCCESS) {
    std::cout << "Filter Buffer Failed" << std::endl;
}

std::cout << "filter_kernel write buffer" << std::endl;
queue.enqueueWriteBuffer(filter_kernel, CL_TRUE, 0, 3 * sizeof(float), filter, NULL, NULL);

// Create Kernel
std::cout << "Now try to create kernel objects..." << std::endl;
cl::Kernel kernel(prg, "ConvH_naive", &err);
if (err != CL_SUCCESS)
{
    std::cout << "create Kernel_naive failed\n" << std::endl;
}
Then the kernel arguments, and after that:
cl::NDRange globalsize(width, height);
cl::NDRange localsize(1, 1);
cl::NDRange offset(0, 0);

std::cout << "Enqueuing the Kernel" << std::endl;
if (queue.enqueueNDRangeKernel(kernel, offset, globalsize, localsize, NULL, NULL) != CL_SUCCESS)
{
    std::cout << "Failed enqueuing the Kernel" << std::endl;
}
queue.finish();
After this come the read buffer and imshow, but the code stops after this statement with a segmentation fault.
Can anyone help? Is it possible that the problem is in the kernel code? Shall I add that too?
A local size of (1,1) is typically a very bad choice.
What platform are you running on? What device (e.g. CPU, GPU)?
It could be that you are segfaulting because you are not handling boundary conditions and are accessing a buffer out of bounds.
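Since the kernel wasn't posted, here is a hypothetical ConvH_naive sketch showing the kind of boundary handling the comment refers to; the signature and argument order are assumptions:
__kernel void ConvH_naive(__global const float* in, __global float* out,
                          __constant float* filt, int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height)
        return;                          // guard against out-of-range work-items

    float sum = 0.0f;
    for (int k = -1; k <= 1; ++k) {
        int xx = clamp(x + k, 0, width - 1); // clamp reads at image borders
        sum += filt[k + 1] * in[y * width + xx];
    }
    out[y * width + x] = sum;
}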
I'm relatively new to both Qt and pthreads, but I'm trying to use a pthread to work in the background of a basic test app I'm making. I'm aware of the Qt framework's own threading facilities, but there's a lot of complaint surrounding them, so I'd like to use pthreads if possible. The code is below.
#include "drawwindow.h"
#include "ui_drawwindow.h"
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include "QThread"
pthread_t th1;
DrawWindow::DrawWindow(QWidget *parent) :
QMainWindow(parent),
ui(new Ui::DrawWindow)
{
ui->setupUi(this);
}
DrawWindow::~DrawWindow()
{
delete ui;
}
void DrawWindow::on_pushButton_clicked()
{
pthread_create(&th1, NULL, &DrawWindow::alter_text, NULL);
}
void DrawWindow::alter_text()
{
while(1)
{
ui->pushButton->setText("1");
QThread::sleep(1);
ui->pushButton->setText("one");
QThread::sleep(1);
}
}
With the header
#ifndef DRAWWINDOW_H
#define DRAWWINDOW_H

#include <QMainWindow>

namespace Ui {
class DrawWindow;
}

class DrawWindow : public QMainWindow
{
    Q_OBJECT

public:
    explicit DrawWindow(QWidget *parent = 0);
    ~DrawWindow();
    void alter_text();

private slots:
    void on_pushButton_clicked();

private:
    Ui::DrawWindow *ui;
};

#endif // DRAWWINDOW_H
And I'm getting the error
error: cannot convert 'void (DrawWindow::*)()' to 'void* (*)(void*)' for argument '3' to 'int pthread_create(pthread_t*, const pthread_attr_t*, void* (*)(void*), void*)'
pthread_create(&th1, NULL, &DrawWindow::alter_text, NULL);
^
Does anyone know what is wrong?
TL;DR: The way you're using pthreads is precisely the discouraged way of using QThread. Just because you use a different api doesn't mean that what you're doing is OK.
There's absolutely no problem with either QThread or std::thread. Forget about pthreads: they are not portable, their API is C and thus abhorrent from a C++ programmer's perspective, and you'll be making your life miserable for no reason by sticking to pthreads.
Your real issue is that you've not understood the concerns with QThread. There are two:
Neither QThread nor std::thread is destructible at all times, whereas good C++ design mandates that a class be destructible at any time.
You cannot destroy a running QThread or std::thread. You must first ensure it has stopped, by calling QThread::wait() or std::thread::join(), respectively. It wouldn't have been a big stretch to have their destructors do that, and also stop the event loop in the case of QThread.
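A minimal sketch of what safe destruction requires (the RAII Thread class near the end of this answer wraps the same pattern):
QThread thread;
thread.start();
// ... set up work, move objects to the thread, etc. ...
thread.quit();   // ask the thread's event loop to exit
thread.wait();   // block until run() returns; only now is destruction safe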
Way too often, people use QThread by reimplementing the run method, or they use std::thread by running a functor on it. This is, of course, precisely how you use pthreads: you run some function in a dedicated thread. The way you're using pthreads is just as bad as the discouraged way of using QThread!
There are many ways of doing multithreading in Qt, and you should understand the pros and cons of each of them.
Thus, how do you do threading in C++/Qt?
First, keep in mind that threads are expensive resources, and you should ideally have no more threads in your application than the number of available CPU cores. There are some situations when you're forced to have more threads, but we'll discuss when it's the case.
Use a QThread without subclassing it. The default implementation of run() simply spins an event loop that allows the objects to run their timers and receive events and queued slot calls. Start the thread, then move some QObject instances to it. The instances will run in that thread, and can do whatever work they need done, away from the main thread. Of course, everything that the objects do should be short, run-to-completion code that doesn't block the thread.
The downside of this method is that you're unlikely to exploit all the cores in the system, as the number of threads is fixed. For any given system, you might have exactly as many as needed, but more likely you'll have too few or too many. You also have no control over how busy the threads are. Ideally, they should all be "equally" busy.
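A minimal sketch of this worker-object pattern; the Worker class and slot name are illustrative:
#include <QCoreApplication>
#include <QThread>
#include <QDebug>

class Worker : public QObject {
    Q_OBJECT
public:
    // slots run in whatever thread the object lives in
    Q_SLOT void doWork() {
        qDebug() << "working in" << QThread::currentThread();
    }
};

int main(int argc, char ** argv) {
    QCoreApplication app(argc, argv);
    QThread thread;
    thread.start();               // default run() spins an event loop

    Worker worker;
    worker.moveToThread(&thread); // doWork() will now execute in 'thread'
    QMetaObject::invokeMethod(&worker, "doWork"); // queued into the thread
    return app.exec();
}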
Use QtConcurrent::run. This is similar to Apple's GCD. There is a global QThreadPool. When you run a functor, one thread from the pool will be woken up and will execute the functor. The number of threads in the pool is limited to the number of cores available on the system. Using more threads than that will decrease performance.
The functors you pass to run will do self-contained tasks that would otherwise block the GUI leading to usability problems. For example, use it to load or save an image, perform a chunk of computations, etc.
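For instance, a minimal sketch (the file name is illustrative):
QtConcurrent::run([]{
    QImage image;
    image.load("photo.png"); // blocking I/O runs on a pooled thread
    // hand the result back to the GUI thread via a signal or QFutureWatcher
});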
Suppose you wish to have a responsive GUI that loads a multitude of images. A Loader class could do the job without blocking the GUI:
class Loader : public QObject {
    Q_OBJECT
public:
    Q_SIGNAL void hasImage(const QImage &, const QString & path);
    explicit Loader(const QStringList & imagePaths, QObject * parent = 0) :
        QObject(parent) {
        QtConcurrent::map(imagePaths, [this](const QString & path){
            QImage image;
            image.load(path);
            emit hasImage(image, path);
        });
    }
};
If you wish to run a short-lived QObject in a thread from the thread pool, the functor can spin the event loop as follows:
auto foo = QSharedPointer<Object>(new Object); // Object inherits QObject
foo->moveToThread(0); // prepares the object to be moved to any thread
QtConcurrent::run([foo]{
    foo->moveToThread(QThread::currentThread());
    QEventLoop loop;
    QObject::connect(foo.data(), &Object::finished, &loop, &QEventLoop::quit);
    loop.exec();
});
This should only be done when the object is not expected to take long to finish what it's doing. It should not use timers, for example, since as long as the object is not done, it occupies an entire thread from the pool.
Use a dedicated thread to run a functor or a method. The difference between QThread and std::thread is mostly in that std::thread lets you use functors, whereas QThread requires subclassing. The pthread API is similar to std::thread, except of course that it is C and is awfully unsafe compared to the C++ APIs.
// QThread
int main() {
    class MyThread : public QThread {
        void run() { qDebug() << "Hello from other thread"; }
    } thread;
    thread.start();
    thread.wait();
    return 0;
}

// std::thread
int main() {
    // C++98-style functor
    class Functor {
    public:
        void operator()() { qDebug() << "Hello from another thread"; }
    } functor;
    std::thread thread98(functor);
    thread98.join();

    // C++11
    std::thread thread11([]{ qDebug() << "Hello from another thread"; });
    thread11.join();
    return 0;
}

// pthread
extern "C" void* functor(void*) {
    qDebug() << "Hello from another thread";
    return NULL; // pthread start routines must return a value
}
int main()
{
    pthread_t thread;
    pthread_create(&thread, NULL, &functor, NULL);
    void* result;
    pthread_join(thread, &result);
    return 0;
}
So, what is this good for? Sometimes, you have no choice but to use a blocking API. Most database drivers, for example, have blocking-only APIs. They expose no way for your code to get notified when a query has finished. The only way to use them is to run a blocking query function/method that doesn't return until the query is done. Suppose now that you're using a database in a GUI application that you wish to remain responsive. If you're running the queries from the main thread, the GUI will block each time a database query runs. Given long-running queries, a congested network, a dev server with a flaky cable that makes TCP perform on par with sneakernet... you're facing huge usability issues.
Thus you have no choice but to run the database connection, and execute the database queries, on a dedicated thread that can block as much as necessary.
Even then, it may still be helpful to use a QObject on that thread and spin an event loop, since this lets you queue the database requests without writing your own thread-safe queue. Qt's event loop already implements a nice, thread-safe event queue, so you might as well use it. Note that Qt's SQL module can only be used from one thread, so you can't prepare a QSqlQuery in the main thread :(
Note that this example is very simplistic; you'd likely want to provide a thread-safe way of iterating the query results instead of pushing the entire query's worth of data at once.
class DBWorker : public QObject {
    Q_OBJECT
    QScopedPointer<QSqlDatabase> m_db;
    QScopedPointer<QSqlQuery> m_qBooks, m_qCars;
    Q_SLOT void init() {
        m_db.reset(new QSqlDatabase(QSqlDatabase::addDatabase("QSQLITE")));
        m_db->setDatabaseName(":memory:");
        if (!m_db->open()) { emit openFailed(); return; }
        m_qBooks.reset(new QSqlQuery(*m_db));
        m_qBooks->prepare("SELECT * FROM Books");
        m_qCars.reset(new QSqlQuery(*m_db));
        m_qCars->prepare("SELECT * FROM Cars");
    }
    QList<QVariantList> read(QSqlQuery * query) {
        QList<QVariantList> result;
        result.reserve(query->size());
        while (query->next()) {
            QVariantList row;
            auto record = query->record();
            row.reserve(record.count());
            for (int i = 0; i < record.count(); ++i)
                row << query->value(i);
            result << row;
        }
        return result;
    }
public:
    typedef QList<QVariantList> Books, Cars;
    DBWorker(QObject * parent = 0) : QObject(parent) {
        // queue a call to init(); it runs once this object's thread spins its event loop
        QObject src;
        connect(&src, &QObject::destroyed, this, &DBWorker::init, Qt::QueuedConnection);
    }
    Q_SIGNAL void openFailed();
    Q_SIGNAL void gotBooks(const DBWorker::Books &);
    Q_SIGNAL void gotCars(const DBWorker::Cars &);
    Q_SLOT void getBooks() {
        Q_ASSERT(QThread::currentThread() == thread());
        m_qBooks->exec();
        emit gotBooks(read(m_qBooks.data()));
    }
    Q_SLOT void getCars() {
        Q_ASSERT(QThread::currentThread() == thread());
        m_qCars->exec();
        emit gotCars(read(m_qCars.data()));
    }
};
Q_DECLARE_METATYPE(DBWorker::Books) // Books and Cars are the same type
// also call qRegisterMetaType<DBWorker::Books>() if these cross threads via queued connections

// True C++ RAII thread: stops itself on destruction.
class Thread : public QThread {
    using QThread::run;
public:
    ~Thread() { quit(); wait(); }
};
int main(int argc, char ** argv) {
    QCoreApplication app(argc, argv);
    qRegisterMetaType<DBWorker::Cars>();
    Thread thread;
    DBWorker worker;
    worker.moveToThread(&thread);
    QObject::connect(&worker, &DBWorker::gotCars, [](const DBWorker::Cars & cars){
        qDebug() << "got cars:" << cars;
        qApp->quit();
    });
    thread.start();
    ...
    QMetaObject::invokeMethod(&worker, "getCars"); // safely invoke `getCars`, whose result is connected above
    return app.exec();
}
Make alter_text a static member (or a free function) with the signature void* alter_text(void*), pass this as the last argument of pthread_create, and return NULL (or call pthread_exit(NULL)) at the end. A plain non-static member function cannot be converted to the void* (*)(void*) that pthread_create expects.
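A minimal sketch of that change, passing the object through the void* argument; names follow the question's code, and note this only fixes the compile error, not the separate problem that Qt widgets must not be touched from a non-GUI thread:
class DrawWindow : public QMainWindow {
    // ...
    static void* alter_text(void* arg); // static, so it matches void* (*)(void*)
};

void* DrawWindow::alter_text(void* arg)
{
    DrawWindow* self = static_cast<DrawWindow*>(arg);
    // ... loop using self->ui as before ...
    return NULL;
}

// in on_pushButton_clicked():
pthread_create(&th1, NULL, &DrawWindow::alter_text, this);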
I am writing a program using MPI where the master allocates tasks to slave nodes. Every slave node executes its task locally and sends the result (an int array of size 100000) to the master node.
Though I am getting correct results, memory use is not linear. I found that the master node takes up N*m memory, where N is the number of nodes and m is the memory typically used by a slave node.
Does anyone have an idea why this is happening, and is there any way to reduce memory use on the master node?
Here is the sample code in which the slave nodes send data/results to the master node; I want to know why memory use at the master node is N*m. I checked memory use with the Linux command top.
#include <iostream>
#include <stdlib.h>
#include <mpi.h>
using namespace std;

int main(int argc, char *argv[])
{
    int rank, size, master_rank = 0, i = 0;
    int jc = 0, jpt, jobsperthread = 0, exjpt = 0;
    int ii = 0, index = 0, remaining = 0, tobesent = 0, tobereceived = 0;
    int totsendreceivesize = 100000, k = 0;
    int innodes = 11;
    MPI_Status status;
    int *arr_anti_net = (int*)malloc(sizeof(int) * (totsendreceivesize + 100));
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < totsendreceivesize; i++)
        arr_anti_net[i] = i;
    if (rank != master_rank)
    {
        remaining = totsendreceivesize;
        tobesent = 256;
        k = 0;
        while (remaining != 0)
        {
            if (remaining < 256)
                tobesent = remaining;
            MPI_Send(&arr_anti_net[k], tobesent, MPI_INT, 0, 11, MPI_COMM_WORLD);
            k += tobesent;
            remaining -= tobesent;
        }
    }
    else
    {
        ii = 0;
        index = 0;
        for (ii = 1; ii < size; ii++)
        {
            jc = 0;
            jpt = 0;
            jobsperthread = innodes / size;
            jpt = innodes / size;
            exjpt = 0;
            if (innodes % size != 0)
            {
                if (ii < innodes % size)
                {
                    jobsperthread += 1;
                    exjpt = ii;
                }
                else
                    exjpt = innodes % size;
            }
            remaining = 256; //totsendreceivesize;
            tobereceived = 256;
            k = 0;
            while (remaining != 0)
            {
                if (remaining < 256)
                    tobereceived = remaining;
                MPI_Recv(&arr_anti_net[k], tobereceived, MPI_INT, ii, 11, MPI_COMM_WORLD, &status);
                k += tobereceived;
                remaining -= tobereceived;
            }
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
Thank you very much