Related
I'm fairly new to TCP messaging (and programming in general) and I am trying to develop a simple ROUTER/DEALER message pair with ZeroMQ but am struggling in getting the router to receive a message from the dealer and send one back.
I can do a simple REQ/REP pattern with no problem, where I can send one message from my machine to my VM.
However, when trying to develop a ROUTER/DEALER pair, I can't seem to get the ROUTER-instance to receive the message (ROUTER on VM, DEALER on main box). I have had some success where I could spam 50 messages in a while(){...} loop, but can't send a single message and have the ROUTER send one back.
So from what I have read, a TCP message in a ROUTER/DEALER pair are sent with a delimiter of 0 at the beginning, and this 0 must be sent first to the ROUTER to register an incoming message.
So I just want to send the message "ROUTER_TEST" to my server, and for my server to respond with "RECEIVED".
DEALER
#include <cstdlib>
#include <iostream>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include "zmq.h"
const char connection[] = "tcp://10.0.10.76:5555";
int main(void)
{
int major, minor, patch;
zmq_version(&major, &minor, &patch);
printf("\nInstalled ZeroMQ version: %d.%d.%d\n", major, minor, patch);
printf("Connecting to: %s\n", connection);
void *context = zmq_ctx_new();
void *requester = zmq_socket(context, ZMQ_DEALER);
int zc = zmq_connect(requester, connection);
std::cout << "zmq_connect = " << zc << std::endl;
int sm = zmq_socket_monitor(requester, connection, ZMQ_EVENT_ALL);
std::cout << "zmq_socket_monitor = " << sm << std::endl;
char messageSend[] = "ROUTER_TEST";
int request_nbr;
int n = zmq_send(requester, NULL, 0, ZMQ_DONTWAIT|ZMQ_SNDMORE );
int ii = 0;
if(n==0) {
std::cout << "n = " << n << std::endl;
while (ii < 50)
{
n = zmq_send(requester, messageSend, 31, ZMQ_DONTWAIT);
ii++;
}
}
return 0;
}
ROUTER
// SERVER
#include <cstdlib>
#include <iostream>
#include <string.h>
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include "zmq.h"
int main(void)
{
void *context = zmq_ctx_new();
void *responder = zmq_socket(context, ZMQ_ROUTER);
printf("THIS IS WORKING - ROUTER\n");
int rc = zmq_bind(responder, "tcp://*:5555");
assert(rc == 0);
zmq_pollitem_t pollItems[] = {
{responder, 0, ZMQ_POLLIN, -1}};
int sm = zmq_socket_monitor(responder, "tcp://*:5555", ZMQ_EVENT_LISTENING);
std::cout << "zmq_socket_monitor = " << sm << std::endl;
uint8_t buffer[15];
while (1)
{
int rc = zmq_recv(responder, buffer, 5, ZMQ_DONTWAIT);
if (rc == 0)
{
std::cout << "zmq_recv = " << rc << std::endl;
zmq_send(responder, "RECIEVED", 9,0);
}
zmq_poll(pollItems, sizeof(pollItems), -1);
}
return 0;
}
Your code calls, on the DEALER-side a series of:
void *requester = zmq_socket( context,
ZMQ_DEALER // <-- .STO <ZMQ_DEALER>, *requester
);
...
int n = zmq_send( requester, // <~~ <~~ <~~ <~~ <~~ <~~ .STO 0, n
NULL, // NULL,sizeof(NULL)== 0
0, // explicitly declared 0
ZMQ_DONTWAIT // _DONTWAIT flag
| ZMQ_SNDMORE //---- 1x ZMQ_SNDMORE flag==
); // 1.Frame in 1st MSG
int ii = 0; // MSG-under-CONSTRUCTION
if ( n == 0 ) // ...until complete, not yet sent
{
std::cout << "PREVIOUS[" << ii << ".] CALL of zmq_send() has returned n = " << n << std::endl;
while ( ii < 50 )
{ ii++;
n = zmq_send( requester, //---------//---- 1x ZMQ_SNDMORE following
messageSend, // // 2.Frame in 1st MSG
31, // // MSG-under-CONSTRUCTION, but
ZMQ_DONTWAIT // // NOW complete & will get sent
); //---------//----49x monoFrame MSGs follow
}
}
...
What happens on the opposite side, the ROUTER-side code ?
...
while (1)
{
int rc = zmq_recv( responder, //----------------- 1st .recv()
buffer,
5,
ZMQ_DONTWAIT
);
if ( rc == 0 )
{
std::cout << "zmq_recv = " << rc << std::endl;
zmq_send( responder, // _____________________ void *socket
"RECEIVED", // _____________________ void *buffer
9, // _____________________________ size_t len
0 // _____________________________ int flags
);
}
zmq_poll( pollItems,
sizeof( pollItems ),
-1 // __________________________________ block ( forever )
);// till ( if ever ) ...?
}
Here, most probably, the rc == 0 but once, if not missed, but never more
Kindly notice, that your code does not detect in any way if a .recv()-call is also being flagged by a ZMQ_RECVMORE - signaling a need to first also .recv()-all-the-rest multi-Frame parts of the first message, before becoming able to .send()-any-answer...
An application that processes multi-part messages must use the ZMQ_RCVMORE zmq_getsockopt(3) option after calling zmq_recv() to determine if there are further parts to receive.
Next, the buffer and messageSend message-"payloads" are a kind of fragile entities and ought be re-composed ( for details best read again all details about how to carefully initialise, work with and safely-touch any zmq_msg_t object(s) ), as after a successful .send()/.recv() the low level API ( since 2.11.x+ ) considers 'em disposed-off, not re-useable. Also note, that messageSend is not (as imperatively put into the code) a 31-char[]-long, was it? Was there any particular intention to do this?
The zmq_send() function shall return number of bytes in the message if successful. Otherwise it shall return -1 and set errno to one of the values defined below. { EAGAIN, ENOTSUP, EINVAL, EFSM, ETERM, ENOTSOCK, EINTR, EHOSTUNREACH }
Not testing error-state means knowing nothing about the actual state ( see EFSM and other potential trouble explainers ) of REQ/REP and DEALER/ROUTER (extended) .send()/.recv()/.send()/.recv()/... mandatory dFSA's order of these steps
"So from what I have read, a TCP message in a ROUTER/DEALER pair are sent with a delimiter of 0 at the beginning, and this 0 must be sent first to the ROUTER to register an incoming message."
This seems to be a misunderstood part. The app-side is free to compose any number of monoframe or multi-frame messages, yet the "trick" of a ROUTER prepended identity-frame is performed without users assistance ( message-labelling is performed automatically, before any ( now, principally all ) multi-frame(d) messages get delivered to the app-side ( using the receiver's side .recv()-method ). Due handling of multi-frame messages was noted above.
I have a qTextEdit that I grab the text from (QString) and convert to a char* with this code:
QString msgQText = ui->textMsg->toPlainText();
size_t textSize = (size_t)msgQText.size();
if (textSize > 139) {
textSize = 139;
}
unsigned char * msgText = (unsigned char *)malloc(textSize);
memcpy(msgText, msgQText.toLocal8Bit().data(), textSize);
msgText[textSize] = '\0';
if (textSize > 0) {
Msg * newTextMsg = new Msg;
newTextMsg->type = 1; // text message type
newTextMsg->bitrate = 0;
newTextMsg->samplerate = 0;
newTextMsg->bufSize = (int)textSize;
newTextMsg->len = 0;
newTextMsg->buf = (char *)malloc(textSize);
memcpy((char *)newTextMsg->buf, (char *)msgText, textSize);
lPushToEnd(sendMsgList, newTextMsg, sizeof(Msg));
ui->sendRecList->addItem((char *)newTextMsg->buf);
ui->textMsg->clear();
}
I put the text into a qListBox, but it shows up like
However, the character array, if I print it out, does not have the extra characters.
I have tried checking the "compile using UTF-8" option, but it doesn't make a difference.
Also, I send the text using RS232, and the receiver side also displays the extra characters.
The receiver code is here:
m_serial->waitForReadyRead(200);
const QByteArray data = m_serial->readAll();
if (data.size() > 0) {
qDebug() << "New serial data: " << data;
QString str = QString(data);
if (str.contains("0x6F8C32E90A")) {
qDebug() << "TEST SUCCESSFUL!";
}
return data.data();
} else {
return NULL;
}
There is a difference between the size of a QString and the size of the QByteArray returned by toLocal8Bit(). A QString contains unicode text stored as UTF-16, while a QByteArray is "just" a char[].
A QByteArray is null-terminated, so you do not need to add it manually.
As #GM pointed out: msgText[textSize] = '\0'; is undefined behavior. You are writing to the textSize + 1 position of the msgText array.
This position may be owned by something else and may be overwritten, so you end up with a non null terminated string.
This should work:
QByteArray bytes = msgQText.toLocal8Bit();
size_t textSize = (size_t)bytes.size() + 1; // Add 1 for the final '\0'
unsigned char * msgText = (unsigned char *) malloc(textSize);
memcpy(msgText, bytes.constData(), textSize);
Additional tips:
Prefer using const functions on Qt types that are copy-on-write, e.g. use QBytearray::constData() instead of QByteArray::data(). The non-const functions can cause a deep-copy of the object.
Do not use malloc() and other C-style functions if possible. Here you could do:
unsigned char * msgText = new unsigned char[textSize]; and later delete[] msgText;.
Prefer using C++ casts (static_cast, reinterpret_cast, etc.) instead of C-style casts.
You are making 2 copies of the text (2 calls to memcpy), given your code only 1 seem to be enough.
Consider the following program for enqueueing some work on a non-blocking GPU stream:
#include <iostream>
using clock_value_t = long long;
__device__ void gpu_sleep(clock_value_t sleep_cycles) {
clock_value_t start = clock64();
clock_value_t cycles_elapsed;
do { cycles_elapsed = clock64() - start; }
while (cycles_elapsed < sleep_cycles);
}
void callback(cudaStream_t, cudaError_t, void *ptr) {
*(reinterpret_cast<bool *>(ptr)) = true;
}
__global__ void dummy(clock_value_t sleep_cycles) { gpu_sleep(sleep_cycles); }
int main() {
const clock_value_t duration_in_clocks = 1e6;
const size_t buffer_size = 1e7;
bool callback_executed = false;
cudaStream_t stream;
auto host_ptr = std::unique_ptr<char[]>(new char[buffer_size]);
char* device_ptr;
cudaMalloc(&device_ptr, buffer_size);
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaMemcpyAsync(device_ptr, host_ptr.get(), buffer_size, cudaMemcpyDefault, stream);
dummy<<<128, 128, 0, stream>>>(duration_in_clocks);
cudaMemcpyAsync(host_ptr.get(), device_ptr, buffer_size, cudaMemcpyDefault, stream);
cudaStreamAddCallback(
stream, callback, &callback_executed, 0 /* fixed and meaningless */);
snapshot = callback_executed;
std::cout << "Right after we finished enqueuing work, the stream has "
<< (snapshot ? "" : "not ") << "concluded execution." << std::endl;
cudaStreamSynchronize(stream);
snapshot = callback_executed;
std::cout << "After cudaStreamSynchronize, the stream has "
<< (snapshot ? "" : "not ") << "concluded execution." << std::endl;
}
The size of the buffers and the length of the kernel sleep in cycles are high enough, that as they execute in parallel with the CPU thread, it should finish the enqueueing well before they've concluded (8ms+8ms for copying and 20 ms for the kernel).
And yet, looking at the trace below, it seems the two cudaMemcpyAsync() are actually synchronous, i.e. they block until the (non-blocking) stream has actually concluded the copying. Is this intended behavior? It seems to contradict the relevant section of the CUDA Runtime API documentation. How does that make sense?
Trace: (numbered lines, time in useconds):
1 "Start" "Duration" "Grid X" "Grid Y" "Grid Z" "Block X" "Block Y" "Block Z"
104 14102.830000 59264.347000 "cudaMalloc"
105 73368.351000 19.886000 "cudaStreamCreateWithFlags"
106 73388.and 20 ms for the kernel).
And yet, looking at the trace below, it seems the two cudaMemcpyAsync()'s are actually synchronous, i.e. they block until the (non-blocking) stream has actually concluded the copying. Is this intended behavior? It seems to contradict the relevant section of the CUDA Runtime API documentation. How does it make sense?
850000 8330.257000 "cudaMemcpyAsync"
107 73565.702000 8334.265000 47.683716 5.587311 "Pageable" "Device" "GeForce GTX 650 Ti BOOST (0)" "1"
108 81721.124000 2.394000 "cudaConfigureCall"
109 81723.865000 3.585000 "cudaSetupArgument"
110 81729.332000 30.742000 "cudaLaunch (dummy(__int64) [107])"
111 81760.604000 39589.422000 "cudaMemcpyAsync"
112 81906.303000 20157.648000 128 1 1 128 1 1
113 102073.103000 18736.208000 47.683716 2.485355 "Device" "Pageable" "GeForce GTX 650 Ti BOOST (0)" "1"
114 121351.936000 5.560000 "cudaStreamSynchronize"
This seemed weird, so I contacted someone from the CUDA driver team, who confirmed the documentation is correct. I was also able to confirm it:
#include <iostream>
#include <memory>
using clock_value_t = long long;
__device__ void gpu_sleep(clock_value_t sleep_cycles) {
clock_value_t start = clock64();
clock_value_t cycles_elapsed;
do { cycles_elapsed = clock64() - start; }
while (cycles_elapsed < sleep_cycles);
}
void callback(cudaStream_t, cudaError_t, void *ptr) {
*(reinterpret_cast<bool *>(ptr)) = true;
}
__global__ void dummy(clock_value_t sleep_cycles) { gpu_sleep(sleep_cycles); }
int main(int argc, char* argv[]) {
cudaFree(0);
struct timespec start, stop;
const clock_value_t duration_in_clocks = 1e6;
const size_t buffer_size = 2 * 1024 * 1024 * (size_t)1024;
bool callback_executed = false;
cudaStream_t stream;
void* host_ptr;
if (argc == 1){
host_ptr = malloc(buffer_size);
}
else {
cudaMallocHost(&host_ptr, buffer_size, 0);
}
char* device_ptr;
cudaMalloc(&device_ptr, buffer_size);
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
cudaMemcpyAsync(device_ptr, host_ptr, buffer_size, cudaMemcpyDefault, stream);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);
double result = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_nsec - start.tv_nsec) / 1e3;
std::cout << "Elapsed: " << result / 1000 / 1000<< std::endl;
dummy<<<128, 128, 0, stream>>>(duration_in_clocks);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
cudaMemcpyAsync(host_ptr, device_ptr, buffer_size, cudaMemcpyDefault, stream);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);
result = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_nsec - start.tv_nsec) / 1e3;
std::cout << "Elapsed: " << result / 1000 / 1000 << std::endl;
cudaStreamAddCallback(
stream, callback, &callback_executed, 0 /* fixed and meaningless */);
auto snapshot = callback_executed;
std::cout << "Right after we finished enqueuing work, the stream has "
<< (snapshot ? "" : "not ") << "concluded execution." << std::endl;
cudaStreamSynchronize(stream);
snapshot = callback_executed;
std::cout << "After cudaStreamSynchronize, the stream has "
<< (snapshot ? "" : "not ") << "concluded execution." << std::endl;
}
This is basically your code, with a few modifications:
Time measurement
A switch to allocate from pageable or pinned memory
A buffer size of 2 GiB to ensure a measurable copy time
cudaFree(0) to force CUDA lazy initialisation.
Here are the results:
$ nvcc -std=c++11 main.cu -lrt
$ ./a.out # using pageable memory
Elapsed: 0.360828 # (memcpyDtoH pageable -> device, fully async)
Elapsed: 5.20288 # (memcpyHtoD device -> pageable, sync)
$ ./a.out 1 # using pinned memory
Elapsed: 4.412e-06 # (memcpyDtoH pinned -> device, fully async)
Elapsed: 7.127e-06 # (memcpyDtoH device -> pinned, fully async)
It is slower when copying from pageable to device, but it is really async.
I'm sorry for my mistake. I deleted my previous comments to avoid confusing people.
It so happens that CUDA memory copies are only asynchronous under strict conditions, as #RobinThoni has kindly indicated. For the code in question, the issue is mostly the use of unpinned (that is, paged) host memory.
To quote from a separate section of the Runtime API documentation (emphasis mine):
2. API synchronization behavior
The API provides memcpy/memset functions in both synchronous and
asynchronous forms, the latter having an "Async" suffix. This is a
misnomer as each function may exhibit synchronous or asynchronous
behavior depending on the arguments passed to the function.
...
Asynchronous
For transfers from device memory to pageable host memory, the function will return only once the copy has completed.
and that's just the half of it! It's actually true that
For transfers from pageable host memory to device memory, the data will first be staged in pinned host memory, then copied to the device; and the function will return only after the staging has occurred.
According to the manual page of the truncate function in R, on some platforms including Windows:
... it will not work for large (> 2Gb) files
After some experimentation, I managed to make up a toy example showing that it is possible to do this for large files (quite easily) with visual c++:
// ConsoleApplication1.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <tlhelp32.h>
#include <tchar.h>
#include <iostream>
#include <string>
// Forward declarations:
void append(LPCTSTR, LPCVOID, DWORD);
void readTail(LPCTSTR, LPVOID, DWORD);
void truncateTail(LPCTSTR, long);
int main()
{
LPCTSTR fn = L"C:/kaiyin/kybig.out";
char buf[] = "helloWorld";
append(fn, buf, 10);
BYTE buf1[10] = {0};
readTail(fn, buf1, 5);
std::cout << (char*) buf1 << std::endl;
//truncateTail(fn, 5);
//for (int i = 0; i < 10; i++) {
// buf1[i] = 0;
//}
//readTail(fn, buf1, 5);
//std::cout << (char*) buf1 << std::endl;
printf("End of program\n");
std::string s = "";
std::getline(std::cin, s);
return 0;
}
void append(LPCTSTR filename, LPCVOID buf, DWORD writeSize) {
LARGE_INTEGER size;
size.QuadPart = 0;
HANDLE fh = CreateFile(filename, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
GetFileSizeEx(fh, &size);
SetFilePointerEx(fh, size, NULL, FILE_BEGIN);
WriteFile(fh, buf, writeSize, NULL, NULL);
CloseHandle(fh);
}
void readTail(LPCTSTR filename, LPVOID buf, DWORD readSize) {
LARGE_INTEGER size;
size.QuadPart = 0;
HANDLE fh = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
GetFileSizeEx(fh, &size);
size.QuadPart -= readSize;
SetFilePointerEx(fh, size, NULL, FILE_BEGIN);
ReadFile(fh, buf, readSize, NULL, NULL);
CloseHandle(fh);
}
void truncateTail(LPCTSTR filename, long truncateSize) {
LARGE_INTEGER size;
size.QuadPart = 0;
HANDLE fh = CreateFile(filename, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (fh == INVALID_HANDLE_VALUE) {
std::cerr << GetLastError();
return;
}
GetFileSizeEx(fh, &size);
size.QuadPart -= truncateSize;
SetFilePointerEx(fh, size, NULL, FILE_BEGIN);
if (SetEndOfFile(fh) == 0) {
std::cerr << GetLastError();
return;
}
CloseHandle(fh);
}
This will append "helloWorld" to the file "C:/kaiyin/kybig.out", and then truncate "World". In the console it should print "World" (tail before truncating), then "hello" (tail after truncating).
There seems to be no problem at all in truncating the tail of a file larger than 2GB -- in fact, I have tested with a 4e9 byte file and the program still behaves correctly.
Am I missing something, or is it true that the truncate function can indeed be reliably (and easily) implemented on Windows?
Update
Following #hrbrmstr's reference to this R bugzilla link, I tried some R code to verify whether the truncate function works properly on Windows 8.1:
filename = "C:/kaiyin/kybig.out"
f = file(filename, "w")
seek(f, 5L, "end")
truncate(f)
file.info(filename)$size
Results:
> filename = "C:/kaiyin/kybig.out"
> f = file(filename, "w")
> seek(f, 5L, "end")
[1] 0
> truncate(f)
NULL
> file.info(filename)$size
[1] 0
Apparently truncate just trashes everything despite the seeking to near the end.
Am I missing something, or is it true that the truncate function can indeed be reliably (and easily) implemented on Windows?
The likely explanation is that the problem is nothing to do with Windows but all to do with the implementation of the R function. On Windows it likely uses a signed 32 bit integer to specify the truncated file size, hence the limitation.
It's also plausible that the documentation could be out of date, and that the R developers have now managed to work out how to implement this function correctly on Windows.
R does not seem to use the Windows API to access files, instead it looks like it uses a POSIX (or POSIX-like) layer, at least according to the source code I mentioned in the linked question. So, while truncating huge files probably works on Windows using the Windows API (as shown in your code), it may well be that this POSIX(-like) layer that R uses does not (yet) fully support this (again, see source).
I am continuously getting an Access Violation Error with a all my kernels which I am trying to build. Other kernels which I take from books seem to work fine.
https://github.com/ssarangi/VideoCL - This is where the code is.
Something seems to be missing in this. Could someone help me with this.
Thanks so much.
[James] - Thanks for the suggestion and you are right. I am doing it on Win 7 with a AMD Redwood card. I have the Catalyst 11.7 drivers with AMD APP SDK 2.5. I am posting the code below.
#include <iostream>
#include "bmpfuncs.h"
#include "CLManager.h"
void main()
{
float theta = 3.14159f/6.0f;
int W ;
int H ;
const char* inputFile = "input.bmp";
const char* outputFile = "output.bmp";
float* ip = readImage(inputFile, &W, &H);
float *op = new float[W*H];
//We assume that the input image is the array “ip”
//and the angle of rotation is theta
float cos_theta = cos(theta);
float sin_theta = sin(theta);
try
{
CLManager* clMgr = new CLManager();
// Build the Source
unsigned int pgmID = clMgr->buildSource("rotation.cl");
// Create the kernel
cl::Kernel* kernel = clMgr->makeKernel(pgmID, "img_rotate");
// Create the memory Buffers
cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));
// Get the command Queue
cl::CommandQueue* queue = clMgr->getCmdQueue();
queue->enqueueWriteBuffer(*clIp, CL_TRUE, 0, W*H*sizeof(float), ip);
// Set the arguments to the kernel
kernel->setArg(0, clOp);
kernel->setArg(1, clIp);
kernel->setArg(2, W);
kernel->setArg(3, H);
kernel->setArg(4, sin_theta);
kernel->setArg(5, cos_theta);
// Run the kernel on specific NDRange
cl::NDRange globalws(W, H);
queue->enqueueNDRangeKernel(*kernel, cl::NullRange, globalws, cl::NullRange);
queue->enqueueReadBuffer(*clOp, CL_TRUE, 0, W*H*sizeof(float), op);
storeImage(op, outputFile, H, W, inputFile);
}
catch(cl::Error error)
{
std::cout << error.what() << "(" << error.err() << ")" << std::endl;
}
}
I am getting the error at the queue->enqueueNDRangeKernel line.
I have the queue and the kernel stored in a class.
CLManager::CLManager()
: m_programIDs(-1)
{
// Initialize the Platform
cl::Platform::get(&m_platforms);
// Create a Context
cl_context_properties cps[3] = {
CL_CONTEXT_PLATFORM,
(cl_context_properties)(m_platforms[0])(),
0
};
m_context = cl::Context(CL_DEVICE_TYPE_GPU, cps);
// Get a list of devices on this platform
m_devices = m_context.getInfo<CL_CONTEXT_DEVICES>();
cl_int err;
m_queue = new cl::CommandQueue(m_context, m_devices[0], 0, &err);
}
cl::Kernel* CLManager::makeKernel(unsigned int programID, std::string kernelName)
{
cl::CommandQueue queue = cl::CommandQueue(m_context, m_devices[0]);
cl::Kernel* kernel = new cl::Kernel(*(m_programs[programID]), kernelName.c_str());
m_kernels.push_back(kernel);
return kernel;
}
I checked your code. I'm on Linux though. At runtime I'm getting Error -38, which means CL_INVALID_MEM_OBJECT. So I went and checked your buffers.
cl::Buffer* clIp = clMgr->createBuffer(CL_MEM_READ_ONLY, W*H*sizeof(float));
cl::Buffer* clOp = clMgr->createBuffer(CL_MEM_READ_WRITE, W*H*sizeof(float));
Then you pass the buffers as a Pointer:
kernel->setArg(0, clOp);
kernel->setArg(1, clIp);
But setArg is expecting a value, so the buffer pointers should be dereferenced:
kernel->setArg(0, *clOp);
kernel->setArg(1, *clIp);
After those changes the cat rotates ;)