Why would JOCL CL.clEnqueueReadBuffer never return? - opencl

Currently I have a kernel that is supposed to take a flattened array of bytes and transform render some image, I have all of this implemented, however, I have a line of code that never returns
ret = CL.clEnqueueReadBuffer(this.commandQueue, this.finalImageBuffer, CL.CL_TRUE, 0, this.outputSize, Pointer.to(this.outArray), 0, null, null);
why would it never return, that's my question.
Thanks in advance

Related

How to traverse character elements of *const char pointer in Rust?

I'm new to Rust programing and I have a bit of difficulty when this language is different from C Example, I have a C function as follows:
bool check(char* data, int size){
int i;
for(i = 0; i < size; i++){
if( data[i] != 0x00){
return false;
}
}
return true;
}
How can I convert this function to Rust? I tried it like C, but it has Errors :((
First off, I assume that you want to use as little unsafe code as possible. Otherwise there really isn't any reason to use Rust in the first place, as you forfeit all the advantages it brings you.
Depending on what data represents, there are multiple ways to transfer this to Rust.
First off: Using pointer and length as two separate arguments is not possible in Rust without unsafe. It has the same concept, though; it's called slices. A slice is exactly the same as a pointer-size combination, just that the compiler understands it and checks it for correctness at compile time.
That said, a char* in C could actually be one of four things. Each of those things map to different types in Rust:
Binary data whose deallocation is taken care of somewhere else (in Rust terms: borrowed data)
maps to &[u8], a slice. The actual content of the slice is:
the address of the data as *u8 (hidden from the user)
the length of the data as usize
Binary data that has to be deallocated within this function after using it (in Rust terms: owned data)
maps to Vec<u8>; as soon as it goes out of scope the data is deleted
actual content is:
the address of the data as *u8 (hidden from the user)
the length of the data as usize
the size of the allocation as usize. This allows for efficient push()/pop() operations. It is guaranteed that the length of the data does not exceed the size of the allocation.
A string whose deallocation is taken care of somewhere else (in Rust terms: a borrowed string)
maps to &str, a so called string slice.
This is identical to &[u8] with the additional compile time guarantee that it contains valid UTF-8 data.
A string that has to be deallocated within this function after using it (in Rust terms: an owned string)
maps to String
same as Vec<u8> with the additional compile time guarantee that it contains valid UTF-8 data.
You can create &[u8] references from Vec<u8>'s and &str references from Strings.
Now this is the point where I have to make an assumption. Because the function that you posted checks if all of the elements of data are zero, and returns false if if finds a non-zero element, I assume the content of data is binary data. And because your function does not contain a free call, I assume it is borrowed data.
With that knowledge, this is how the given function would translate to Rust:
fn check(data: &[u8]) -> bool {
for d in data {
if *d != 0x00 {
return false;
}
}
true
}
fn main() {
let x = vec![0, 0, 0];
println!("Check {:?}: {}", x, check(&x));
let y = vec![0, 1, 0];
println!("Check {:?}: {}", y, check(&y));
}
Check [0, 0, 0]: true
Check [0, 1, 0]: false
This is quite a direct translation; it's not really idiomatic to use for loops a lot in Rust. Good Rust code is mostly iterator based; iterators are most of the time zero-cost abstraction that can get compiled very efficiently.
This is how your code would look like if rewritten based on iterators:
fn check(data: &[u8]) -> bool {
data.iter().all(|el| *el == 0x00)
}
fn main() {
let x = vec![0, 0, 0];
println!("Check {:?}: {}", x, check(&x));
let y = vec![0, 1, 0];
println!("Check {:?}: {}", y, check(&y));
}
Check [0, 0, 0]: true
Check [0, 1, 0]: false
The reason this is more idiomatic is that it's a lot easier to read for someone who hasn't written it. It clearly says "return true if all elements are equal to zero". The for based code needs a second to think about to understand if its "all elements are zero", "any element is zero", "all elements are non-zero" or "any element is non-zero".
Note that both versions compile to the exact same bytecode.
Also note that, unlike the C version, the Rust borrow checker guarantees at compile time that data is valid. It's impossible in Rust (without unsafe) to produce a double free, a use-after-free, an out-of-bounds array access or any other kind of undefined behaviour that would cause memory corruption.
This is also the reason why Rust doesn't do pointers without unsafe - it needs the length of the data to check out-of-bounds errors at runtime. That means, accessing data via [] operator is a little more costly in Rust (as it does perform an out-of-bounds check every time), which is the reason why iterator based programming is a thing. Iterators can iterate over data a lot more efficient than directly accessing it via [] operators.

broadcasting structs without knowing attributes

I'm trying to use MPI_Bcast to share an instance of cudaIpcMemHandler_t, but I cannot figure out how to create the corresponding MPI_Datatype needed for Bcast. I do not know the underlying structure of the cuda type, hence methods like this one don't seem to work. Am I missing something ?
Following up from the useful comment , I have a solution that seems to work, though its only been tested in a toy program:
// Broadcast the memory handle
cudaIpcMemHandler_t memHandler[1];
if (rank==0){
// set the handler content, e.g. call cudpaIpcGetMemHandle
}
MPI_Barrier(MPI_COMM_WORLD);
// share the size of the handler with other objects
int hand_size[1];
if (rank==0)
hand_size[0]= sizeof(memHand[0]);
MPI_Bcast(&hand_size[0], 1, MPI_INT, 0, MPI_COMM_WORLD);
// broadcase the handler as byte array
char memHand_C[hand_size[0]];
if (rank==0)
memcpy(&memHand_C, &memHand[0], hand_size[0]);
MPI_Bcast(&memHand_C, hand_size[0], MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank >0)
memcpy(&memHand[0], &memHand_C, hand_size[0]);

Is it possibile in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?

Let's say there are two Command Queues and I want to synchronize them using events. It can be done like this:
cl_event event1;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
cl_event event2;
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is there a possibility to achive similar results but with a different order of clEnqueueNDRangeKernel calls? For example:
cl_event event1;
cl_event event2;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL); //it fails here because event2 does not exist
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is it possibile in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?
Not really, but there's a different approach possible.
You could create a user event (clCreateUserEvent) and use the returned userEvent instead of event2 argument in the enqueue call. Then, after enqueueing a kernel that creates event2, you add a callback (clSetEventCallback) on event2, and from that callback you call clSetUserEventStatus(userEvent, CL_COMPLETE).
There are only two problems with this, 1) even if the most common OpenCL implementations weren't horrible WRT user events, you're introducing unnecessary userspace trips (= slowdown), 2) they are horrible WRT user events. By which i mean, the callback will be called... at some point. It's not unusual to see it called with 10-200ms delay after the event has actually finished.
You could get more useful answers if you said what is the problem you're trying to solve.

Is there any reason why clEnqueueNDRangeKernel may block?

I am developing an application making use of OpenCL, targeted to 1.2 versión. I use DX11 interoperability to display the kernel results. I try my code in Intel (iGPU) and Nvidia platforms, in both I recon the same behaviour.
My call to clEnqueueNDRangeKernel is blocking the CPU thread. I have checked the documentation and I can not find an statement declaring the situations in which a kernel call may block. I have read in some forums that those things happens sometimes with some OpenCL implementations. The code is working properly and outputting valid results. The API does not return any error at any given point, all seems smooth.
I can not paste the full source but I will paste the in-loop part:
size_t local = 64;
size_t global = ctx->dec_in_host->horizontal_blocks * ctx->dec_in_host->vertical_blocks * local;
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->blocks_gpu, CL_TRUE, 0, sizeof(block_input) * TOTAL_BLOCKS, ctx->blocks_host, 0, NULL, &ctx->blocks_copy_status), "copying data");
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->dec_in_gpu, CL_TRUE, 0, sizeof(decoder_input), ctx->dec_in_host, 0, NULL, &ctx->frame_copy_status), "copying data");
if (ctx->mode == nv_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Adquring texture");
else if (ctx->mode == khr_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Adquring texture");
t1 = clock();
print_if_error(clEnqueueNDRangeKernel(ctx->queue, ctx->kernel, 1, NULL, &global, &local, 0, NULL, &ctx->kernel_status), "kernel launch");
t2 = clock();
if (ctx->mode == nv_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
else if (ctx->mode == khr_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
printf("Elapsed time %lf ms\n", (double)(t2 - t1)*1000 / CLOCKS_PER_SEC);
So my question is:
¿Do you know any reason why the clEnqueueNDRangeKernel would block?
¿Do you know if the Dx11 interop might cause this?
¿Do you know if some OpenCL configuration can create a syncronous kernel launch?
Thank you :)
EDIT 1:
Thanks to doqtor comment I realize that commenting out parts of the kernel the kernel launch becomes asyncronous. The result is not Ok but I have some hint to work out the answer.

Does TcpClient NetworkStream.Write have concept of end of stream?

In a TcpClient/TcpListener set up, is there any difference from the receiving end point of view between:
// Will sending a prefixed length before the data...
client.GetStream().Write(data, 0, 4); // Int32 payload size = 80000
client.GetStream().Write(data, 0, 80000); // payload
// Appear as 80004 bytes in the stream?
// i.e. there is no end of stream to demarcate the first Write() from the second?
client.GetStream().Write(data, 0, 80004);
// Which means I can potentially read more than 4 bytes on the first read
var read = client.GetStream().Read(buffer, 0, 4082); // read could be any value from 0 to 4082?
I noticed that DataAvailable and return value of GetStream().Read() does not reliably tell whether there are incoming data on the way. Do I always need to write a Read() loop to exactly read the first 4 bytes?
// Read() loop
var ms = new MemoryStream()
while(ms.length < 4)
{
read = client.GetStream().Read(buffer, 0, 4 - ms.length);
if(read > 0)
ms.Write(buffer, 0, read);
}
The answer seems to be yes, we have to always be responsible for reading the same number of bytes that was sent. In other words, there has to be an application level protocol to read exactly what was written on to the underlying stream because it does not know when a new message start or ends.

Resources