OpenGL ES 3.0: zero-copy CPU rendering to texture?

I am working on a pure 2D project where the screen is rendered by the CPU, and I want to display it as a texture, but I do not want to upload the whole image for each frame. I cannot tell which parts have changed, so I assume the whole image is invalid.
I need to do some post-processing (adding another layer with some blits), and my early tests showed that doing it purely on the CPU may be a performance problem, which is why I need GPU acceleration (a 2D API would be much better, but that is not common/portable these days...).
My target platform is an embedded system (ARM) where the GPU and the CPU share memory, so theoretically I could do this without any copying. The chosen platform supports OpenGL 2.1 and OpenGL ES 3.0.
I understand that Buffer Objects can be mapped with glMapBufferRange(). I have looked into the possibilities below:
UNIFORM_BUFFER: too small to store a full-screen image
SHADER_STORAGE_BUFFER: supported only from ES 3.1
Shader Image Load Store: supported only from ES 3.1
Buffer Textures: not supported in ES 3.0 (supported from ??)
Pixel Buffer Objects: I cannot access them from shaders, and it seems that updating a texture from one still performs a copy. I don't know whether that is faster than copying from client memory (given that everything resides on the same RAM chip :))
plain textures: not buffer objects, so they cannot be mapped into client process memory
Am I missing something? Is there really no way in ES 3.0 to share a buffer with the GPU that holds a large amount of data which I can write from the CPU and read from fragment shaders?

You can update the texture partially with GLES20.glTexSubImage2D.
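A hedged sketch of how that partial update could be combined with a pixel unpack buffer under ES 3.0 (variable names such as pbo, tex, WIDTH and HEIGHT are assumptions, and whether the driver actually avoids a copy on unified-memory hardware is implementation-dependent):
// Let the CPU renderer write directly into a mapped pixel unpack buffer.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, WIDTH * HEIGHT * 4, NULL, GL_STREAM_DRAW);
void *ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, WIDTH * HEIGHT * 4,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
render_frame_into(ptr);   // assumed CPU rasterizer writing RGBA8 pixels
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
// With an unpack PBO bound, the last argument of glTexSubImage2D is an offset
// into the PBO rather than a client-memory pointer.
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);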

Related

OpenCL: know local work group size in advance?

I'm working on optimizing a separable image downscaler. My next step is to reduce repeated (nearest) sampling of the same texel by reading all necessary texels into local memory once. Here begins the fun...
The downscaler is versatile, so it can downscale anything larger into anything smaller, and it can even take sections of an image and downscale them into a destination image. Thus the final resolution divisor is never a whole number; most of the time it will be something around 3.97 or so. This means I do not know the required size of that local array at compile time.
To me that means: before enqueuing a task, I'll have to create a local mem object of the required size.
How do I know what workgroup sizes OpenCL will select?
If there is no way, is there a "best practice" to overcome this problem?
P.S.: I'm writing for OpenCL 1.1 compatibility.
Since you are using images, the texture cache can be relied upon instead of using shared local memory.
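If you do still want a dynamically sized local buffer, one option under OpenCL 1.1 is to choose the work-group size yourself (after checking the kernel's limit) rather than letting the runtime pick it; a rough sketch, where the argument index and variable names are assumptions:
size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
size_t local_size = 64;  /* chosen by you; must not exceed max_wg */
size_t global_size = ((npixels + local_size - 1) / local_size) * local_size;
/* Dynamically sized local memory: pass the byte count with a NULL pointer. */
clSetKernelArg(kernel, 2, local_size * sizeof(cl_float4), NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);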

Potential problems using OpenGL buffer object with multiple targets?

I am developing a library for Qt that extends its OpenGL functionality to support modern OpenGL (3+) features like texture buffers, image load-store textures, shader storage buffers, atomic counter buffers, etc. In order to fully support features like transform feedback, I allow users to bind a buffer object to different targets at different times (regardless of what data they allocated the buffer with). Consider the following scenario where I use transform feedback to advance vertex data ONCE, and then bind it to a separate program for rendering (used for the rest of the application runtime):
// I attach a (previously allocated) buffer to the transform feedback target so that I
// can capture data advanced in a shader program.
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, someBufferID);
// Then I execute the shader program...
// and release the buffer from the transform feedback target.
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, 0);
// Then, I bind the same buffer containing data advanced via transform feedback
// to the array buffer target for use in a separate shader program.
glBindBuffer(GL_ARRAY_BUFFER, someBufferID);
// Then, I render something using this buffer as the data source for a vertex attrib.
// After, I release this buffer from the array buffer target.
glBindBuffer(GL_ARRAY_BUFFER, 0);
In this scenario, being able to bind a buffer object to multiple targets is useful. However, I am uncertain if there are situations in which this capability would cause problems given the OpenGL specification. Essentially, should I allow a single buffer object to be bound to multiple targets or force a target (like the standard Qt buffer wrappers) during instantiation?
Edit:
I have found that there is a problem mixing creation/binding targets with TEXTURE objects as the OpenGL documentation for glBindTexture states:
GL_INVALID_OPERATION is generated if texture was previously created
with a target that doesn't match that of target.
However, the documentation for glBindBuffer mentions no such restriction.
Well, there can always be faulty drivers (and if you don't have "green" hardware nothing can be guaranteed, anyway). But rest assured that it is perfectly legal to bind buffers to any target you want, disregarding the target they were created with.
There might be certain subtleties, like certain targets needing to be used in an indexed way via glBindBufferRange/Base (for example uniform buffers or transform feedback buffers). But enforcing a buffer to be used with a single target only, like Qt does, is too rigid to be useful in modern OpenGL, as your very common example already shows. Maybe a good compromise would be a per-class (like Qt does) or per-buffer default target (which will be sufficient in most simple situations, when the buffer is always used with a single target, and can be used for class-internal binding operations if necessary), but provide the option to bind it to something else.
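A minimal sketch of that compromise (the class and method names are mine, not Qt API): the buffer remembers a default target, but callers may override it for an individual bind.
class GLBuffer {
public:
    explicit GLBuffer(GLenum defaultTarget) : m_id(0), m_defaultTarget(defaultTarget) {
        glGenBuffers(1, &m_id);
    }
    void bind() const { glBindBuffer(m_defaultTarget, m_id); }      // usual case
    void bind(GLenum target) const { glBindBuffer(target, m_id); }  // explicit override
    GLuint id() const { return m_id; }
private:
    GLuint m_id;
    GLenum m_defaultTarget;
};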

OpenCL Copy-Once Share a lot

I am implementing a solution using OpenCL, and I want to do the following: say you have a large array of data that you want to copy to the GPU once and then have many kernels process batches of it, each storing the results in its own output buffer.
The actual question is: which way is faster? Enqueue each kernel with just the portion of the array it needs, or pass the whole array beforehand and let each kernel (in the same context) process its required batch, since they share the same address space and could each map the array concurrently? Of course the array is read-only, but it is not constant, as it changes every time I execute the kernel(s)... (so I could cache it using a global memory buffer).
Also, if the second way is actually faster, could you point me in the right direction on how this could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I normally use the second method. Sharing the memory is easy: just pass the same buffer to each kernel. I do this in my real-time ray tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings, it looks something like this:
// The same buffer is written by the render kernel and read by the post-process kernel,
// so it needs read/write access (CL_MEM_READ_WRITE rather than CL_MEM_WRITE_ONLY).
cl_input_mem = cl::Buffer(context, CL_MEM_READ_WRITE, sizeof(cl_uchar4)*npixels, NULL, &err);
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory you can pass an offset value to the kernel arguments and add that to e.g. the global memory pointer for each kernel.
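For instance, a hedged sketch with the C++ bindings (the argument index, the batch variables and the queue name are assumptions):
cl_int offset = batch_index * batch_size;  // element offset into the shared buffer
kernel_postprocess.setArg(1, offset);      // the kernel adds this to its global pointer
queue.enqueueNDRangeKernel(kernel_postprocess, cl::NullRange,
                           cl::NDRange(batch_size), cl::NullRange);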
I would use the first method if the array (actually the sum of each buffer, including output) does not fit in memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example, I have one GTX 580 render the upper half of the screen and the other GTX 580 render the lower half (actually I do this dynamically, so one device may render 30% while the other renders 70%, but that's beside the point). I have each device render only its fraction of the output, and then I assemble the output on the CPU. With PCIe 3.0, the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate, even for 1920x1080 images.

Qt4/OpenGL bindTexture in a separate thread

I am trying to implement a CoverFlow-like effect using a QGLWidget; the problem is the texture loading process.
I have a worker (QThread) for loading images from disk, and the main thread checks for newly loaded images; if it finds any, it uses bindTexture to load them into the QGLContext. While the texture is being bound, the main thread is blocked, so I get an FPS drop.
What is the right way to do this?
I have found that the default behaviour of bindTexture in Qt4 is extremely slow:
bindTexture(image, target, format, LinearFilteringBindOption | InvertedYBindOption | MipmapBindOption)
Using only LinearFilteringBindOption in the binding options speeds things up a lot; this is my current call:
bindTexture(image, GL_TEXTURE_2D, GL_RGBA, QGLContext::LinearFilteringBindOption);
More info: the load time for a 3800x2850 BMP file dropped from 2 seconds to 34 milliseconds.
Of course, if you need mipmapping, this is not the solution. In this case, I think that the way to go is Pixel Buffer Objects.
Binding in the main thread (single QGLWidget solution):
decide on a maximum texture size. You could base it on the maximum possible widget size, for example. Say you know that the widget can be at most (approximately) 800x600 pixels and the largest visible cover has 30-pixel margins up and down and a 1:2 aspect ratio -> 600-2*30 = 540 -> the maximum size of the cover is 270x540, e.g. stored in m_maxCoverSize.
scale the incoming images to that size in the loader thread. It doesn't make sense to bind larger textures, and the larger a texture is, the longer it takes to upload to the graphics card. Use QImage::scaled(m_maxCoverSize, Qt::KeepAspectRatio) to scale the loaded image and pass it to the main thread.
limit the number of textures, or better, the time spent binding them per frame. I.e. remember the time at which you started binding textures (e.g. QTime bindStartTime;) and after binding each texture do:
if (bindStartTime.elapsed() > BIND_TIME_LIMIT)
break;
BIND_TIME_LIMIT would depend on the frame rate you want to keep. But of course, if binding even a single texture takes much longer than BIND_TIME_LIMIT, you haven't solved anything. A fuller sketch of such a loop:
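This is only a sketch, assuming a QQueue<QImage> of already scaled covers filled by the loader thread and a texture cache keyed by QImage::cacheKey() (both assumptions):
QTime bindStartTime;
bindStartTime.start();
while (!pendingImages.isEmpty()) {
    const QImage image = pendingImages.dequeue();
    GLuint texId = bindTexture(image, GL_TEXTURE_2D, GL_RGBA,
                               QGLContext::LinearFilteringBindOption);
    textureCache.insert(image.cacheKey(), texId);
    if (bindStartTime.elapsed() > BIND_TIME_LIMIT)  // give up and continue next frame
        break;
}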
You might still experience a framerate drop while loading images on slower machines / graphics cards, though. The rest of the code should be prepared to live with it (e.g. use the actual elapsed time to drive the animation).
An alternative solution is to bind in a separate thread (using a second, invisible QGLWidget; see the documentation):
2. Texture uploading in a thread.
Doing texture uploads in a thread may be very useful for applications handling large amounts of images that needs to be displayed, like for instance a photo gallery application. This is supported in Qt through the existing bindTexture() API. A simple way of doing this is to create two sharing QGLWidgets. One is made current in the main GUI thread, while the other is made current in the texture upload thread. The widget in the uploading thread is never shown, it is only used for sharing textures with the main thread. For each texture that is bound via bindTexture(), notify the main thread so that it can start using the texture.
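A hedged sketch of that setup (object names and the notification mechanism are assumptions):
// The second widget is never shown; it only exists to share textures with the main one.
QGLWidget *mainWidget   = new QGLWidget;                 // shown, renders the cover flow
QGLWidget *uploadWidget = new QGLWidget(0, mainWidget);  // shares the context with mainWidget
// In the upload thread:
uploadWidget->makeCurrent();
GLuint texId = uploadWidget->bindTexture(image, GL_TEXTURE_2D, GL_RGBA,
                                         QGLContext::LinearFilteringBindOption);
uploadWidget->doneCurrent();
// ...then notify the main thread (e.g. via a queued signal) that texId is ready to use.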

How to render a SlimDX scene directly to a GDI bitmap

Is there a way to set the render target to a GDI bitmap in SlimDX so that as soon as the scene is rendered I can immediately BitBlt the render out of there for processing in another thread and continue rendering?
Is it necessary to render to a texture and then copy the contents out to the bitmap? I would like to be able to do this without any unnecessary copying. I'm going to need every speedup I can get.
Sorry, you do need to render to a RenderTarget, then copy that resource into a Texture2D; then you can map the data and get the pixels into your bitmap.
The memory for RenderTargets is marked for a special kind of use by the graphics card and cannot be read from directly.
The memory for Textures can be marked so that it can be read, but only through the API, as it is still held on the graphics card (there are some exceptions, but DirectX has to go with the lowest common denominator).
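For illustration, here is roughly what that copy-and-map flow looks like in native Direct3D 11 C++ terms (an assumption on my part: SlimDX wraps several Direct3D versions and the question does not say which one is in use, so treat this as a sketch of the idea rather than the exact calls):
// Create a CPU-readable staging copy of the render target texture (assumed names).
D3D11_TEXTURE2D_DESC desc;
renderTargetTex->GetDesc(&desc);
desc.Usage = D3D11_USAGE_STAGING;
desc.BindFlags = 0;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.MiscFlags = 0;
ID3D11Texture2D *staging = NULL;
device->CreateTexture2D(&desc, NULL, &staging);
// Copy GPU -> staging, then map and read the rows into the GDI bitmap's pixel buffer.
context->CopyResource(staging, renderTargetTex);
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
// mapped.pData points at the pixels; mapped.RowPitch is the stride per row.
context->Unmap(staging, 0);
staging->Release();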
If you need the extra speed, reuse the same bitmap, or have an array of prepared bitmaps ready to fill and keep them in rotation.
And as ever, measure how much time these things are consuming with a profiler so that you can quantify bottlenecks.
