Blur QImage alpha channel - Qt

I'm trying to blur a QImage's alpha channel. My current implementation uses the deprecated alphaChannel() method and is slow.
QImage blurImage(const QImage & image, double radius)
{
    QImage newImage = image.convertToFormat(QImage::Format_ARGB32);
    QImage alpha = newImage.alphaChannel();
    QImage blurredAlpha = alpha;
    for (int x = 0; x < alpha.width(); x++)
    {
        for (int y = 0; y < alpha.height(); y++)
        {
            uint color = calculateAverageAlpha(x, y, alpha, radius);
            blurredAlpha.setPixel(x, y, color);
        }
    }
    newImage.setAlphaChannel(blurredAlpha);
    return newImage;
}
I also tried to implement it using QGraphicsBlurEffect, but it doesn't affect the alpha channel.
What is the proper way to blur a QImage's alpha channel?

I have faced a similar question about pixel read/write access:
Invert your loops. An image is laid out in memory as a succession of rows, so you should iterate first over the height, then over the width.
Use QImage::scanLine to access the data rather than the expensive QImage::pixel and QImage::setPixel. Pixels in a scanline (a.k.a. row) are guaranteed to be consecutive.
Your code will look like:
const int depth = 4; // bytes per pixel in Format_ARGB32
for (int ii = 0; ii < image.height(); ii++) {
    uchar* scan = image.scanLine(ii);
    for (int jj = 0; jj < image.width(); jj++) {
        // it is in fact an RGBA quadruplet
        QRgb* rgbpixel = reinterpret_cast<QRgb*>(scan + jj * depth);
        QColor color(*rgbpixel);
        int alpha = calculateAverageAlpha(ii, jj, color, image);
        color.setAlpha(alpha);
        // write back
        *rgbpixel = color.rgba();
    }
}
You can go further and optimize the computation of the alpha average. Let's look at the sum of the pixels within a radius. Call s(x,y) the sum of the alpha values in the window of radius r centered at (x,y). When you move one pixel in either direction, a single line of pixels enters the window while another leaves it. Say you move horizontally: if l(x,y) is the sum of the vertical line of length 2r + 1 centered at (x,y), you have

s(x + 1, y) = s(x, y) + l(x + r + 1, y) - l(x - r, y)

This allows you to efficiently compute the matrix of sums (then of averages, by dividing by the number of pixels in the window) in a first pass.
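For illustration, here is a minimal sketch of that sliding-window idea along a single row (plain C++, names are ours, clamping at the borders):

#include <algorithm>
#include <vector>

// For each x, the sum of alpha[x-r .. x+r]; when the window moves right by
// one, add the value entering the window and subtract the value leaving it.
std::vector<int> slidingRowSums(const std::vector<int>& alpha, int r)
{
    const int n = static_cast<int>(alpha.size());
    std::vector<int> sums(n);
    int s = 0;
    for (int x = -r; x <= r; ++x)               // initial window around x = 0
        s += alpha[std::clamp(x, 0, n - 1)];
    sums[0] = s;
    for (int x = 1; x < n; ++x) {
        s += alpha[std::min(x + r, n - 1)];     // value entering the window
        s -= alpha[std::max(x - r - 1, 0)];     // value leaving the window
        sums[x] = s;
    }
    return sums;
}

Run this once per row, then once per column over the row sums, and you have the full 2D window sums in O(width*height), independent of the radius.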
I suspect this kind of optimization is already implemented in a much better way in libraries such as OpenCV, so I would encourage you to use existing OpenCV functions if you wish to save time.
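If you do go the OpenCV route, a rough sketch (assuming you already handle the QImage/cv::Mat conversion, which is not shown here) could be:

#include <opencv2/imgproc.hpp>
#include <vector>

// Box-blur only the alpha plane of an 8-bit, 4-channel image.
cv::Mat blurAlpha(const cv::Mat& bgra, int radius)
{
    std::vector<cv::Mat> planes;
    cv::split(bgra, planes);                         // B, G, R, A planes
    const int k = 2 * radius + 1;                    // kernel edge length
    cv::blur(planes[3], planes[3], cv::Size(k, k));  // average filter on alpha
    cv::Mat result;
    cv::merge(planes, result);
    return result;
}

cv::blur is a plain box average; swap in cv::GaussianBlur if you want a Gaussian falloff instead.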


Receiving denormalized output texture coordinates in Frag shader

Update
See rationale at the end of my question below
Using WebGL2 I can access a texel by its denormalized coordinates (sorry, I don't know the right lingo for this). That means I don't have to scale them down to 0-1 as I do in texture2D().
However, the inputs to the fragment shader still arrive as vec2/vec3 values normalized to 0-1.
Is there a way to declare the in/out variables in the vertex and fragment shaders so that I don't have to scale the coordinates?
somewhere in vertex shader:
...
out vec2 TextureCoordinates;
somewhere in frag shader:
...
in vec2 TextureCoordinates;
I would like for TextureCoordinates to be ivec2 and already scaled.
This question, and all my other questions on WebGL, relate to general-purpose computing with WebGL. We are trying to do tensor (multi-dimensional matrix) operations using WebGL.
We map our data to a texture in a few ways. The simplest approach we follow is, assuming we can access our data as a flat array, to lay it out along the texture's width and go up the texture's height until we're done.
Since our thinking, logic, and calculations are all based on tensor/matrix indices, inside the fragment shader we have to map back and forth between X-Y texture coordinates and those indices. The intermediate step is to calculate the flat offset for a given texel position; from that offset we can then calculate the matrix indices using the tensor's strides. A sketch of that bookkeeping follows.
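In C++ terms the bookkeeping looks like this (our names, not a library API); the same arithmetic is what the shader functions below implement:

#include <array>

constexpr int RANK = 3;

// Flat offset -> tensor indices, given strides sorted largest-first
// (e.g. strides {20, 5, 1} for a 4x4x5 tensor).
std::array<int, RANK> offsetToIndices(int offset, const std::array<int, RANK>& strides)
{
    std::array<int, RANK> idx{};
    for (int d = 0; d < RANK; ++d) {
        idx[d] = offset / strides[d];
        offset %= strides[d];
    }
    return idx;
}

// Texel position -> flat offset, for data laid out along the texture width.
int texelToOffset(int x, int y, int texWidth)
{
    return y * texWidth + x;
}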
Calculating an offset in WebGL 1 for very large textures seems to take much longer than in WebGL 2 with integer coordinates. See below:
WebGL 1 offset calculation
int coordsToOffset(vec2 coords, int width, int height) {
    float s = coords.s * float(width);
    float t = coords.t * float(height);
    int offset = int(t) * width + int(s);
    return offset;
}

vec2 offsetToCoords(int offset, int width, int height) {
    int t = offset / width;
    int s = offset - t*width;
    vec2 coords = (vec2(s, t) + vec2(0.5, 0.5)) / vec2(width, height);
    return coords;
}
WebGL 2 offset calculation in the presence of int coords
int coordsToOffset(ivec2 coords, int width) {
    return coords.t * width + coords.s;
}

ivec2 offsetToCoords(int offset, int width) {
    int t = offset / width;
    int s = offset - t*width;
    return ivec2(s, t);
}
It should be clear that for a series of large texture operations we're saving hundreds of thousands of operations just on the offset/coords calculation.
It's not clear why you want to do what you're trying to do. It would be better to ask something like "I'm trying to draw an image / implement post-processing glow / do ray tracing / ... and to do that I want to use un-normalized texture coordinates because ___", and then we can tell you whether your solution will work and how else to solve it.
In any case, passing int, unsigned int, ivec2/3/4 or uvec2/3/4 as a varying is supported, but not interpolated: you have to declare them as flat.
Still, you can pass un-normalized values as float or vec2/3/4 and then convert to int or ivec2/3/4 in the fragment shader.
The other issue is that you get no sampling with texelFetch, the function that takes texel coordinates instead of normalized texture coordinates. It just returns the exact value of a single pixel; it does not support filtering like the normal texture function.
Example:
function main() {
  const gl = document.querySelector('canvas').getContext('webgl2');
  if (!gl) {
    return alert("need webgl2");
  }
  const vs = `#version 300 es
  in vec4 position;
  in ivec2 texelcoord;
  out vec2 v_texcoord;
  void main() {
    v_texcoord = vec2(texelcoord);
    gl_Position = position;
  }
  `;
  const fs = `#version 300 es
  precision mediump float;
  in vec2 v_texcoord;
  out vec4 outColor;
  uniform sampler2D tex;
  void main() {
    outColor = texelFetch(tex, ivec2(v_texcoord), 0);
  }
  `;
  // compile shaders, link program, look up locations
  const programInfo = twgl.createProgramInfo(gl, [vs, fs]);
  // create buffers via gl.createBuffer, gl.bindBuffer, gl.bufferData
  const bufferInfo = twgl.createBufferInfoFromArrays(gl, {
    position: {
      numComponents: 2,
      data: [
        -.5, -.5,
         .5, -.5,
          0,  .5,
      ],
    },
    texelcoord: {
      numComponents: 2,
      data: new Int32Array([
         0,  0,
        15,  0,
         8, 15,
      ]),
    }
  });
  // make a 16x16 texture
  const ctx = document.createElement('canvas').getContext('2d');
  ctx.canvas.width = 16;
  ctx.canvas.height = 16;
  for (let i = 23; i > 0; --i) {
    ctx.fillStyle = `hsl(${i / 23 * 360 | 0}, 100%, ${i % 2 ? 25 : 75}%)`;
    ctx.beginPath();
    ctx.arc(8, 15, i, 0, Math.PI * 2, false);
    ctx.fill();
  }
  const tex = twgl.createTexture(gl, { src: ctx.canvas });

  gl.useProgram(programInfo.program);
  twgl.setBuffersAndAttributes(gl, programInfo, bufferInfo);
  // no need to set uniforms since they default to 0
  // and the only texture is already on texture unit 0
  gl.drawArrays(gl.TRIANGLES, 0, 3);
}
main();
<canvas></canvas>
<script src="https://twgljs.org/dist/4.x/twgl-full.min.js"></script>
So, in response to your updated question: it's still not clear what you want to do. Why do you want to pass varyings to the fragment shader? Can't you just do whatever math you want in the fragment shader itself?
Example:
uniform sampler2D tex;
out float result;

// sum all the values in the texture
vec4 sum4 = vec4(0);
ivec2 texDim = textureSize(tex, 0);
for (int y = 0; y < texDim.y; ++y) {
  for (int x = 0; x < texDim.x; ++x) {
    sum4 += texelFetch(tex, ivec2(x, y), 0);
  }
}
result = sum4.x + sum4.y + sum4.z + sum4.w;
Example2
uniform isampler2D indices;
uniform sampler2D data;
out float result;

// sum only the values in data pointed to by indices
vec4 sum4 = vec4(0);
ivec2 texDim = textureSize(indices, 0);
for (int y = 0; y < texDim.y; ++y) {
  for (int x = 0; x < texDim.x; ++x) {
    ivec2 index = texelFetch(indices, ivec2(x, y), 0).xy;
    sum4 += texelFetch(data, index, 0);
  }
}
result = sum4.x + sum4.y + sum4.z + sum4.w;
Note that I'm also not an expert in GPGPU, but I have a hunch the code above is not the fastest way, because I believe parallelization happens based on the outputs, and the code above has only one output, so there would be no parallelization. It would be easy to change it so that it takes a block ID, tile ID, or area ID as input and computes just the sum for that area. Then you'd write out a larger texture with the sum of each block, and finally sum the block sums; a sketch of the two-pass idea follows.
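In C++ terms, the two-pass shape of that idea looks like this (our sketch, not WebGL code); on the GPU each pass-1 iteration would be one fragment of a smaller output texture:

#include <algorithm>
#include <numeric>
#include <vector>

float twoPassSum(const std::vector<float>& data, std::size_t blockSize)
{
    // Pass 1: one partial sum per block -- this is the part that parallelizes.
    std::vector<float> blockSums((data.size() + blockSize - 1) / blockSize);
    for (std::size_t b = 0; b < blockSums.size(); ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, data.size());
        blockSums[b] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0f);
    }
    // Pass 2: reduce the much smaller array of block sums.
    return std::accumulate(blockSums.begin(), blockSums.end(), 0.0f);
}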
Also, dependent and non-uniform texture reads are a known perf issue. The first example reads the texture in order; that's cache friendly. The second example reads the texture in a random order (specified by indices); that's not cache friendly.

Best way to copy 2D or 3D data to Local Memory

I am starting to do a lot of work in 3D for my OpenCL kernels for filtering. Is there an optimum way to copy a 2D or 3D subset from global memory into local or private memory?
A typical use is to take a 3D dataset and apply a 3D kernel (or operate on the space occupied by the 3D kernel): each thread looks at one pixel, crops the data around that pixel in 3 dimensions to the size of the kernel (say 1, 3, 5, etc.), copies this subset to local or private memory, and then computes, for example, the standard deviation of the subset.
The easiest and least efficient way is just by brute force:
__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
                               int KernelSize){
    //Note: KernelSize is always ODD
    int k = get_global_id(0); //also z
    int j = get_global_id(1); //also y
    int i = get_global_id(2); //also x
    //Convert 3D to 1D
    int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;
    //private memory (NB: a private array size must really be a compile-time
    //constant in OpenCL C; the answer below addresses this)
    float Subset[KernelSize*KernelSize*KernelSize];
    int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius
    for(int z = -HalfKernel; z <= HalfKernel; z++){
        for(int y = -HalfKernel; y <= HalfKernel; y++){
            for(int x = -HalfKernel; x <= HalfKernel; x++){
                int index = (i + x) + get_global_size(0)*(j + y) +
                            get_global_size(0)*get_global_size(1)*(k + z);
                Subset[(x + HalfKernel) + (y + HalfKernel)*KernelSize + (z + HalfKernel)*KernelSize*KernelSize] = Data_3D_In[index];
            }
        }
    }
    //Filter subset here
}
This is horribly inefficient, since so many calls are made to global memory. Is there a way to improve this?
My first thought is to use vload to reduce the number of loops, such as:
__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
                               int KernelSize){
    //Note: KernelSize is always ODD
    int k = get_global_id(0); //also z
    int j = get_global_id(1); //also y
    int i = get_global_id(2); //also x
    //Convert 3D to 1D
    int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;
    //private memory
    float Subset[KernelSize*KernelSize*KernelSize];
    int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius
    for(int z = -HalfKernel; z <= HalfKernel; z++){
        for(int y = -HalfKernel; y <= HalfKernel; y++){
            //##TODO##
            //Automatically determine which vload to use based on Kernel Size
            //for now, use vload3
            int index = (i - HalfKernel) + get_global_size(0)*(j + y) +
                        get_global_size(0)*get_global_size(1)*(k + z);
            int subset_index = (y + HalfKernel) + (z + HalfKernel)*KernelSize;
            float3 temp = vload3(index, Data_3D_In);
            vstore3(temp, subset_index, Subset);
        }
    }
    //Filter subset here
}
Is there an even better way?
Thanks in Advance!
First off, you need to unroll those loops. You will have to make several copies of the function, do string replacement before you compile, or unroll the loops by hand; but just as a test, do:
#define HALF_KERNEL_SIZE 2
#pragma unroll HALF_KERNEL_SIZE * 2 + 1
for(int z = -HALF_KERNEL_SIZE; z <= HALF_KERNEL_SIZE; z++){
    #pragma unroll HALF_KERNEL_SIZE * 2 + 1
    for(int y = -HALF_KERNEL_SIZE; y <= HALF_KERNEL_SIZE; y++){
For the GPU you should read the data into local memory, especially for the 5x5x5 kernels, because you are going back to global memory A LOT for data you have already read once, and you don't want to keep going back for it. (This matters for the GPU; for the CPU it is not as big an issue.)
So do this exactly as you would for 2D convolution, but with an extra dimension:
1. Read a block (or cube) of memory into local memory for a number of threads.
2. Create a barrier to make sure all data is read before you continue.
3. Sample from local memory, using your local ID as an offset.
4. Test various local work-group sizes until you get the best performance.
Everything else is the same. For larger kernels with a bigger overlap this will be orders of magnitude faster. A sketch of such a kernel follows.
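To make those steps concrete, here is a hedged sketch of such a kernel (OpenCL C; the names are ours). It assumes a cubic work-group of TILE^3 work-items, a fixed 3x3x3 box filter, and clamp-at-border addressing; TILE = 8 means 512 work-items per group, so shrink it if your device's limit is lower:

#define HALF_K 1                       // radius of a 3x3x3 filter
#define TILE   8                       // work-group edge length
#define PAD    (TILE + 2*HALF_K)       // tile edge including the halo

__kernel void Filter_3D_Local(__global const float *src,
                              __global float *dst,
                              int W, int H, int D)
{
    __local float tile[PAD*PAD*PAD];

    const int lx = get_local_id(0), ly = get_local_id(1), lz = get_local_id(2);
    const int gx = get_global_id(0), gy = get_global_id(1), gz = get_global_id(2);
    // global coordinate of the first (halo) voxel this work-group needs
    const int ox = get_group_id(0)*TILE - HALF_K;
    const int oy = get_group_id(1)*TILE - HALF_K;
    const int oz = get_group_id(2)*TILE - HALF_K;

    // Steps 1+2: cooperative load -- the TILE^3 work-items stride over the
    // PAD^3 tile, clamping coordinates at the volume border, then synchronize.
    for (int t = lx + TILE*(ly + TILE*lz); t < PAD*PAD*PAD; t += TILE*TILE*TILE) {
        int tx = t % PAD, ty = (t / PAD) % PAD, tz = t / (PAD*PAD);
        int sx = clamp(ox + tx, 0, W - 1);
        int sy = clamp(oy + ty, 0, H - 1);
        int sz = clamp(oz + tz, 0, D - 1);
        tile[t] = src[sx + W*(sy + H*sz)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 3: every read below now hits local memory, not global memory.
    float sum = 0.0f;
    for (int z = 0; z <= 2*HALF_K; z++)
        for (int y = 0; y <= 2*HALF_K; y++)
            for (int x = 0; x <= 2*HALF_K; x++)
                sum += tile[(lx + x) + PAD*((ly + y) + PAD*(lz + z))];
    dst[gx + W*(gy + H*gz)] = sum / (float)((2*HALF_K + 1)*(2*HALF_K + 1)*(2*HALF_K + 1));
}

Step 4 is then just trying different TILE values (4, 8, ...) and keeping whichever is fastest on your device.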

Change of Steepness, how to do

How would you go about changing the steepness as the for loops progress? Essentially I've made a terrain with vertices which form a valley. The data for these vertices is created here:
// Divides it to a sensible height
const int DIVISOR_NUMBER = 40;
for (int x = 0; x < TerrainWidth; x++)
{
    float height = Math.Abs(((float)x - ((float)TerrainWidth / 2)) / (float)DIVISOR_NUMBER);
    for (int y = 0; y < TerrainHeight; y++)
    {
        float copyOfHeight = height;
        float randomValue = random.Next(0, 3);
        copyOfHeight += randomValue / 10;
        HeightData[x, y] = copyOfHeight;
    }
}
This works fine, but I now want to make the sides of the valley steeper at the start and end of the outer loop, with the valley flattening the closer it gets to the center. I'm having a bit of a mental block and can't think of a good way of doing it. Any help would be appreciated.
You can use a squared (aka quadratic) curve for that. Try:
float offset = (float)x - (float)TerrainWidth/2;
float height = offset*offset*SCALE_FACTOR;
If you still want a "crease" at the bottom of the valley, you can make your height a weighted sum:
float height = Math.Abs(offset) * ABS_FACTOR + offset*offset * QUADRATIC_FACTOR;
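For instance, slotted into the original loop it might look like this (a sketch; SCALE_FACTOR is a made-up tuning constant playing the role the divisor played before):

const float SCALE_FACTOR = 0.002f; // tune until the rim is as steep as you like
for (int x = 0; x < TerrainWidth; x++)
{
    float offset = (float)x - ((float)TerrainWidth / 2);
    float height = offset * offset * SCALE_FACTOR; // steep at the rim, flat at the centre
    for (int y = 0; y < TerrainHeight; y++)
    {
        HeightData[x, y] = height + random.Next(0, 3) / 10f;
    }
}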

How do I optimize displaying a large number of quads in OpenGL?

I am trying to display a mathematical surface f(x,y), defined on a regular XY mesh, using OpenGL and C++ in an efficient manner:
struct XYRegularSurface {
    double x0, y0;
    double dx, dy;
    int nx, ny;

    XYRegularSurface(int nx_, int ny_) : nx(nx_), ny(ny_) {
        z = new float[nx*ny];
    }
    ~XYRegularSurface() {
        delete [] z;
    }
    float& operator()(int ix, int iy) {
        return z[ix*ny + iy];
    }
    float x(int ix, int iy) {
        return x0 + ix*dx;
    }
    float y(int ix, int iy) {
        return y0 + iy*dy;
    }
    float zmin();
    float zmax();
    float* z;
};
Here is my OpenGL paint code so far:
void color(QColor & col) {
    float r = col.red()/255.0f;
    float g = col.green()/255.0f;
    float b = col.blue()/255.0f;
    glColor3f(r, g, b);
}

void paintGL_XYRegularSurface(XYRegularSurface &surface, float zmin, float zmax) {
    float x, y, z;
    QColor col;
    glBegin(GL_QUADS);
    for (int ix = 0; ix < surface.nx - 1; ix++) {
        for (int iy = 0; iy < surface.ny - 1; iy++) {
            x = surface.x(ix, iy);
            y = surface.y(ix, iy);
            z = surface(ix, iy);
            col = rainbow(zmin, zmax, z);
            color(col);
            glVertex3f(x, y, z);

            x = surface.x(ix + 1, iy);
            y = surface.y(ix + 1, iy);
            z = surface(ix + 1, iy);
            col = rainbow(zmin, zmax, z);
            color(col);
            glVertex3f(x, y, z);

            x = surface.x(ix + 1, iy + 1);
            y = surface.y(ix + 1, iy + 1);
            z = surface(ix + 1, iy + 1);
            col = rainbow(zmin, zmax, z);
            color(col);
            glVertex3f(x, y, z);

            x = surface.x(ix, iy + 1);
            y = surface.y(ix, iy + 1);
            z = surface(ix, iy + 1);
            col = rainbow(zmin, zmax, z);
            color(col);
            glVertex3f(x, y, z);
        }
    }
    glEnd();
}
The problem is that this is slow: with nx = ny = 1000, fps ~= 1.
How do I optimize this to be faster?
EDIT: following your suggestion (thanks!) regarding VBO
I added:
float* XYRegularSurface::xyz() {
    float* data = new float[3*nx*ny];
    long i = 0;
    for (int ix = 0; ix < nx; ix++) {
        for (int iy = 0; iy < ny; iy++) {
            data[i++] = x(ix, iy);
            data[i++] = y(ix, iy);
            data[i++] = z[ix*ny + iy]; // note: i runs three times faster than the z index
        }
    }
    return data;
}
I think I understand how I can create a VBO, initialize it from xyz(), and send it to the GPU in one go, but how do I use the VBO when drawing? I understand that this can be done either in the vertex shader or with glDrawElements? I assume the latter is easier? If so: I do not see any QUAD mode in the documentation for glDrawElements!?
Edit2:
So I can loop through all the quads and draw each one with:
GLuint indices[4];
// ... set indices
glDrawElements(GL_QUADS, 4, GL_UNSIGNED_INT, indices);
?
1/. Use display lists to cache GL commands, avoiding recalculation of the vertices and the expensive per-vertex call overhead. If the data is updated, you need to look at client-side vertex arrays (not to be confused with VAOs). Now ignore this option...
2/. Use vertex buffer objects. Available as of GL 1.5.
Since you need VBOs for the core profile anyway (i.e., modern GL), you may as well get to grips with them first.
Well, you've asked a rather open-ended question. I'd suggest using modern (3.0+) OpenGL for everything; the point of just about any new OpenGL feature is to provide a faster way to do things.

Like everyone else is suggesting, use array (vertex) buffer objects and vertex array objects. Use an element array (index) buffer object too. Most GPUs have a 'post-transform cache' which stores the last few transformed vertices, but this can only be used when you call the glDraw*Elements family of functions.

I also suggest you store a flat mesh in your VBO, where y = 0 for each vertex, and sample y from a heightmap texture in your vertex shader. If you do this, whenever the surface changes you will only need to update the heightmap texture, which is easier than updating the VBO. Use one of the floating-point or integer texture formats for the heightmap so you aren't restricted to values between 0 and 1.
If so: I do not see any QUAD mode in the documentation for glDrawElements!?
If you want quads, make sure you're looking at the GL 2.1-era docs, not the new stuff.
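For completeness, here is a sketch of building an index buffer that draws the grid as triangles (two per cell) instead of quads; the vertex numbering matches the z[ix*ny + iy] layout of XYRegularSurface:

#include <cstdint>
#include <vector>

std::vector<std::uint32_t> gridIndices(int nx, int ny)
{
    std::vector<std::uint32_t> idx;
    idx.reserve(static_cast<std::size_t>(nx - 1) * (ny - 1) * 6);
    for (int ix = 0; ix < nx - 1; ++ix) {
        for (int iy = 0; iy < ny - 1; ++iy) {
            // the four corners of the cell
            std::uint32_t v0 = ix * ny + iy;
            std::uint32_t v1 = (ix + 1) * ny + iy;
            std::uint32_t v2 = (ix + 1) * ny + (iy + 1);
            std::uint32_t v3 = ix * ny + (iy + 1);
            idx.insert(idx.end(), {v0, v1, v2, v0, v2, v3});
        }
    }
    return idx;
}

With these indices uploaded to a GL_ELEMENT_ARRAY_BUFFER you can draw the whole surface with a single glDrawElements(GL_TRIANGLES, ...) call instead of one call per quad.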

RGB color bitmap and HSV bilinearly interpolated colors

I want to use the algorithm for converting an HSV color value into RGB to compute a 200 x 200 RGB bitmap with the colors red, green, blue, and white in its corners and bilinearly interpolated HSV colors elsewhere, and to also compute the bitmap with bilinearly interpolated RGB colors. I have found the formula on Wikipedia, but I am confused about how to do this.
Any help would be appreciated.
OK, I think I know what you are getting at. I haven't tried to run any of this, so it probably has some bugs...
First you need to work out the HSV values for red, green, blue and white. Call these, clockwise, a, b, c, d; for example, white would be [0, 0, 1], or something scaled like that.
For a position (x, y) in the grid, with 0 <= x, y <= 1, the interpolation bit is something like the following, putting the values into the array out:
for(int i=0; i<3; i++){
    out[i] = y*((x*a[i]) + ((1-x)*b[i])) + (1-y)*((x*d[i]) + ((1-x)*c[i]));
}
This works because the value linearly interpolated a fraction x of the way between A and B is given by x*A + (1-x)*B; just do it once for each direction.
Then just convert the result to RGB, using the convention from the Wikipedia article:
void HSVtoRGB(double H, double S, double V, double[] out){
    double C = S*V;
    double H_prime = H/60; // a number in [0,6)
    double X = C*(1 - Math.abs((H_prime % 2) - 1));
    // the big "if" table from the article
    switch((int)H_prime){
        case 0: out[0] = C; out[1] = X; out[2] = 0; break;
        case 1: out[0] = X; out[1] = C; out[2] = 0; break;
        case 2: out[0] = 0; out[1] = C; out[2] = X; break;
        case 3: out[0] = 0; out[1] = X; out[2] = C; break;
        case 4: out[0] = X; out[1] = 0; out[2] = C; break;
        case 5: out[0] = C; out[1] = 0; out[2] = X; break;
    }
    double m = V - C;
    for(int i=0; i<3; i++){
        out[i] += m;
    }
}
That should do it, give or take; well, it should give you a rough idea at least.
