OpenCL: Page flipping / ping-pong buffer with image3D?

OpenCL: Page flipping / ping-pong buffer with image3D? - opencl

I want to implement an algorithm in openCL which needs to apply a certain transformation on a 3D grayscale image several times. I have an input and an output image for my kernel. Now I would like to simply swap the input and output image and apply the kernel again. However, one image was created with read_only and the other one with write_only. Does this mean I have to use conventional buffers, or is there some trick, how to flip the two images, without first copying them from the device back to the host and back to the device again?

You say: "However, one image was created with read_only and the other one with write_only". The obvious answer is: don't do that, and you'll be fine.
The less obvious subtext is: There's a difference between creating an image with writeonly/readonly flags (which is done on the host-side via clCreateImage(...,CL_MEM_WRITE_ONLY/CL_MEM_READ_ONLY)) and the access-type inside a particular kernel (which is specified with the __read_only/__write_only qualifiers in the kernel's arguments definition).
Unless I'm totally mistaken, you can safely create your image with no restrictions (i.e. CL_MEM_READ_WRITE), then use it as a kernel's input parameter, and for the next kernel run, use it as the output parameter. You just can't mix read/write accesses during a single kernel run.

Related

How are you supposed to update a texture per frame in Vulkan?

I'm trying to work with 2D in vulkan along with 3D. So right now testing out updating a texture for every frame as whatever 2D is going on. I've gotten something of a texture updater working, the problem is that it's very slow and probably not the way it's supposed to be done. Is there any better way of getting this done? The code is based on the https://vulkan-tutorial.com/ code.
https://vulkan-tutorial.com/code/26_depth_buffering.cpp
void UpdateTexture()
{
vkDeviceWaitIdle(device);
vkFreeMemory(device, textureImageMemory, nullptr);
VkBuffer stagingBuffer;
VkDeviceMemory stagingBufferMemory;
createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBuffer, stagingBufferMemory);
void* data;
vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &data);
memcpy(data, pixel2.data(), static_cast<size_t>(imageSize));
vkUnmapMemory(device, stagingBufferMemory);
createImage(texWidth, texHeight, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, textureImage, textureImageMemory);
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
copyBufferToImage(stagingBuffer, textureImage, static_cast<uint32_t>(texWidth), static_cast<uint32_t>(texHeight));
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
vkDestroyBuffer(device, stagingBuffer, nullptr);
vkFreeMemory(device, stagingBufferMemory, nullptr);
createTextureImageView();
createDescriptorPool();
createDescriptorSets();
createCommandBuffers();
}

This code looks like a direct translation of some OpenGL code, and not particularly good/modern OpenGL code at that.
There's a lot wrong in this code, but most of it boils down to over-synchronization.
First, you should always view any call to vkDeviceWaitIdle as the wrong thing to do. The only exception would be when you are preparing to destroy the VkDevice itself. There is no other reason to do a full CPU/GPU sync like that.
Presumably, this synchronization exists so that you can be sure the GPU is finished using the image before modifying it. This is the wrong thing to do. You should instead employ multiple-buffering. That is, you should have two images that you use. One is currently being used in a rendering process, while the other is being transferred into.
Instead of doing a full device sync, you instead synchronize with the batch you sent two frames ago. That is, if you're wanting to transfer data for use by frame 10, then you must first do a fence-sync operation with the batch you sent in frame 8. Frame 9 is still being processed, but frame 8 is probably done by now. So the synchronization shouldn't hurt too much.
Second, never allocate memory in the middle of an operation like this. Memory gets allocated early in your application, and you leave it allocated until it's time to destroy your application. If you need a staging buffer, then keep it around and reuse it in subsequent frames. Make sure to allocate sufficient storage up-front.
Whatever your createBuffer call is doing, it seems very much like a bad idea. Vulkan is not OpenGL; Vulkan separated memory from buffers/textures that use it for a reason. Creating APIs that hide this separation basically throws all of that away.
Similarly, never unmap memory, unless you're about to destroy that memory object. There's no problem in Vulkan (or OpenGL) with leaving a piece of memory mapped indefinitely. Just map the entire memory's range and leave it mapped. Indeed, you could just pass the mapped pointer directly to your image loader, depending on how the memory get written by the image loading code (if it tries to read data from this pointer, they could be trouble).
Lastly, the commands doing the transfer need to be synchronized with the commands that consume the image. How this happens depends on which queues are being used to do the transfer.
And of course, if you want optimal performance, you may want to check to see if your implementation can read from linear images in your shader. If it can, then you may not need staging at all; you can just write the data directly to the memory in Vulkan's image format, and use it directly.
Employing all of the above is going to add a lot of complexity to your application. But that's how it's supposed to work.

A naive way consists in using the CPU to define the update depending on the time or data and then update the data for the shader, such as a MVP transformation matrix. But this is inefficient with lots of syncing and too low refresh rates, and also overloading the cpu in a loop.
So people recommend using many buffers sometimes mentioning old drivers. If someone can clarify it, that would be nice. I have a naive and probably wrong guess. If they know exactly the frame rate, then they can calculate the time for each frame and dispatch several frames in advance. But it confuses me because the frame rate is dynamic, especially for new screens with the FreeSync functionality that have dynamic refresh rates.
I have thought of a third possibility. One can use the clock directly in the shader. GL_EXT_shader_realtime_clock provides clockRealtimeEXT. It has no defined unit, and will wrap when exceeding the maximum value. But it is said "globally coherent by all invocations on the GPU". During initialization, you can measure its rate using a uniform buffer, and then assume the rate will be constant. And also manage the wrapping.
Then if you can write your shaders as a function of time, for example in a translation, that would be efficient. You just need the initial data. Remember that one must avoid if conditions in shaders.

Can the __LINKEDIT segment of a Mach-O executable be moved

In a Mach-O executable, I am trying to increase the size of the __LLVM segment that precedes the __LINKEDIT segment (with a home-grown tool). I am considering two strategies: (a) move the __LLVM segment to after the __LINKEDIT segment, producing a file that is not what ld would create (now with a gap and section addresses out of order), and (b) move the __LINKEDIT segment to allow resizing of the __LLVM segment that precedes it. I need the result to be accepted for downstream processing, e.g. generating an .ipa file or sending to the App Store.
This question is about my assumptions and the viability of these approaches. Specifically, what are the potential pitfalls of each that might lead them to fail?
I implemented the first approach (a) is understood by segedit's -extract option, but its -replace option complains that the segments are out of order. I append a new segment to the file and update the address and length values in the corresponding load command to refer to this new segment data (both in the file and the destination memory). This might be fine, as long as the other downstream processing will accept the result (still to check; e.g. any local signature is likely invalidated).
The second approach (b) would seem cleaner, as long as there are no references into the __LINKEDIT segment, which I guess contains linking information (symbol tables etc., rather than code). I have not tried this yet, though it seems to be a foregone conclusion that segedit will be happy with the result, which may suggest other processing might also be happier. Are there likely to be any references that are invalidated due to simply moving this segment? I am guessing that I will have to update further load commands (they seem to reference into the __LINKEDIT segment), which I have not examined, but this should be fairly straightforward.
EDIT: Replaced my confused use of "section" with "segment" (mentioned in answer).
ADDED: Context is where I have no control of generating the original executable. I need to post-process it, essentially performing a 'segedit -replace' process, wherein the a section in the segment is to be replaced with a section that is larger than space previously allocated for the segment.
RUN-ON clarifying question: It seems from the answer that moving the __LINKEDIT segment will break it. Can this be fixed by adjusting load commands only (e.g. LC_DYLD_INFO_ONLY, LC_LOAD_DYLINKER, LC_LOAD_DYLIB), not data in any segments? I am not yet familiar with these load commands, and would like to know whether to pursue this.

So basically the segments and sections describe how the physical file maps onto virtual memory.
As I mentioned in my previous iteration of the answer there are limitations on the segments order:
__TEXT section must start at executable physical file offset 0
__LINKEDIT section must not start at physical file offset 0
__LINKEDIT's File Offset + File Size should be equal to physical executable size (this implies __LINKEDIT being the last segment). Otherwise code signing won't work.
__DYLD_INFO_ONLY contains file offsets to dyld loading bind opcodes for:
rebase
bind at load
weak bind
lazy bind
export
For each kind there is file offset and size entry in __DYLD_INFO_ONLY describing the data in file that matches __LINKEDIT (in a "regular" ld linked executable). __DYLD_INFO_ONLY does not use any segment & section information from __LINKEDIT directly, the file offsets and sizes are enough.
EDIT also as mentioned in #kirelagin answer here
"Apparently, the new version of dyld from 10.12 Sierra performs a check that previous versions did not perform: it makes sure that the LC_SYMTAB symbols table is entirely within the __LINKEDIT segment."
I assume since you want to inflate the size of the preceding __LLVM segment you would also want some extra data in the file itself. Typically data described by __LINKEDIT (i.e. not the segment & sections themselves, but the actual data) won't use 100% of it's space so it could be modified to start "later" and occupy less space.
A tool called jtool by Jonathan Levin could probably do it for you.

I know this is an old question, but I solved this problem while solving another problem.
define the slide amount, this must be page-aligned, so I choose 0x4000.
add the slide amount to the relevant load commands, this includes but is not limited to:
__LINKEDIT segment (duh)
dyld_info_command
symtab_command
dysymtab_command
linkedit_data_commands
physically move the __LINKEDIT in the file.

How can the processor discern a far return from a near return?

Reading Intel's big manual, I see that if you want to return from a far call, that is, a call to a procedure in another code segment, you simply issue a return instruction (possibly with an immediate argument that moves the stack pointer up n bytes after the pointer popping).
This, apparently, if I'm interpreting things correctly, is enough for the hardware to pop both the segment selector and offset into the correct registers.
But, how does the system know that the return should be a far return and that both an offset AND a selector need to be popped?
If the hardware just pops the offset pointer and not the selector after it, then you'll be pointing to the right offset but wrong segment.
There is nothing special about the far return command compared to the near return version.
They both look identical as far as I can tell.
I assume then that the processor, perhaps at the micro-architecture level, keeps track of which calls are far and which are close so that when they're returned from, the system knows how many bytes to pop and where to pop them (pointer registers and segment selector registers).
Is my assumption correct?
What do you guys know about this mechanism?

The processor doesn't track whether or not a call should be far or near; the compiler decides how to encode the function call and return using either far or near opcodes.
As it is, FAR calls have no use on modern processors because you don't need to change any segment register values; that's the point of a flat memory model. Segment registers still exist, but the OS sets them up with base=0 and limit=0xffffffff so just a plain 32-bit pointer can access all memory. Everything is NEAR, if you need to put a name on it.
Normally you just don't even think about segmentation so you don't actually call it either. But the manual still describes the call/ret opcodes we use for normal code as the NEAR versions.
FAR and NEAR were used on old 86 processors, which used a segmented memory model. Programs at that time needed to choose what kind of architecture they wished to support, ranging from "tiny" to "large". If your program was small enough to fit in a single segment, then it could be compiled using NEAR calls and returns exclusively. If it was "large", the opposite was true. For anything in between, you had power to choose whether local functions needed to be able to be either callable/returnable from code in another segment.
Most modern programs (besides bootloaders and the like) run on a different construct: they expect a flat memory model. Behind the scenes the OS will swap out memory as needed (with paging not segmentation), but as far as the program is concerned, it has its virtual address space all to itself.
But, to answer your question, the difference in the call/return is the opcode used; the processor obeys the command given to it. If you mistake (say, give it a FAR return opcode when in flat mode), it'll fail.

OpenCL 1.2 read/write image data

I'm creating an Image2d object on the host using the flag CL_MEM_READ_WRITE. This image is the output of one kernel and I want it to be used as an input to a different kernel. I'm also using cl_image_format = {CL_INTENSITY, CL_FLOAT};
Is this possible in OpenCL 1.2? I've read nowhere that says you can't do this, yet when I try my second kernel returns all zeros, but no error.
I've also tried using clEnqueueCopyImage to copy the output of the first kernel to a different Image2d (also created using CL_MEM_READ_WRITE) and using that as input to the second kernel, but that also does not work.
I've verified the output of my first kernel is correct.
Thanks for any insight.

Yes, the output image from one kernel can be used as input to a subsequent kernel.
As long as the image is CL_MEM_READ_WRITE it can either read __read_only or __write_only in a kernel in OpenCL 1.x.
OpenCL 2.0 further allows images to be __read_write but special rules must be followed (such as barriers) to get correct results.
For more information on read/write image, please see https://software.intel.com/en-us/articles/using-opencl-20-read-write-images
Don't try to cheat (OpenCL - Pass image2d_t twice to get both read and write from kernel?)

OpenCL Copy-Once Share a lot

I am implementing a solution using OpenCL and I want to do the following thing, say for example you have a large array of data that you want to copy in the GPU once and have many kernels process batches of it and store the results in their specific output buffers.
The actual question is here which way is faster? En-queue each kernel with the portion of the array it needs to have or pass out the whole array before hand an let each kernel (in the same context) process the required batch, since they would have the same address space and could each map the array concurrently. Of course the said array is read-only but is not constant as it changes every time I execute the kernel(s)... (so I could cache it using a global memory buffer).
Also if the second way is actually faster could you point me with direction on how this could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.

I use the second memory normally. Sharing the memory is easy. Just pass the same buffer to each kernel. I do this in my real-time ray-tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings it looks something like this
cl_input_mem = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_uchar4)*npixels, NULL, &err);
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory you can pass an offset value to the kernel arguments and add that to e.g. the global memory pointer for each kernel.
I would use the first method if the array (actually the sum of each buffer - including output) does not fit in memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example I have one GTX 580 render the upper half of the screen and the other GTX 580 rendering the lower half (actually I do this dynamically so one device may render 30% while the other 70% but that's besides the point). I have each device only render it's fraction of the output and then I assemble the output on the CPU. With PCI 3.0 the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate even for 1920x1080 images.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex