I wrote a Processing sketch that scrambles the pixels around the cursor. It works by replacing each pixel with a randomly chosen pixel whose index lies between 0 and the index of the pixel the loop is currently on, which I compute as (y*width+x)-1. The problem is that this samples pixels from the entire screen. I want it to sample only from a 40-pixel square around the mouse coordinates instead. How can I do this?
import processing.video.*;
Capture video;
void setup() {
size(640, 480);
video = new Capture(this, 640, 480);
video.start();
}
void draw() {
loadPixels();
if (video.available()){
video.read();
video.loadPixels();
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
pixels[y*width+x] = video.pixels[y*video.width+(width-x-1)];
// the code should only be applied 20 pixels around the mouse
if (dist(mouseX, mouseY, x, y) < 20){
int d = int(random(0, y*width+x-1));
pixels[y*width+x] = video.pixels[d];
}
}
}
}
updatePixels();
}
You don't need to iterate through all the pixels to only change a few.
Luckily your sketch is the same size as the webcam feed, so you're on the right track using the x + (y * width) arithmetic to convert from a 2D pixel coordinate to the 1D pixels[] index. Remember that you're currently sampling from a 1D range (random(0, index)). Even if you update the start/end index, that range will still span several full image rows, which means it includes pixels to the left and right of the effect area. I recommend picking random x, y coordinates in 2D, then converting them to a 1D index (as opposed to picking a single random index from the 1D array).
Here's what I mean:
import processing.video.*;
Capture video;
void setup() {
size(640, 480);
video = new Capture(this, 640, 480);
video.start();
}
void draw() {
loadPixels();
if (video.available()) {
video.read();
video.loadPixels();
//for (int y = 0; y < height; y++) {
// for (int x = 0; x < width; x++) {
// pixels[y*width+x] = video.pixels[y*video.width+(width-x-1)];
// // the code should only be applied 20 pixels around the mouse
// if (dist(mouseX, mouseY, x, y) < 20) {
// int d = int(random(0, y*width+x-1));
// pixels[y*width+x] = video.pixels[d];
// }
// }
//}
// mouse x, y shorthand
int mx = mouseX;
int my = mouseY;
// random pixels effect size
int size = 40;
// half of size
int hsize = size / 2;
// 2D pixel coordinates of the effect's bounding box
int minX = mx - hsize;
int maxX = mx + hsize;
int minY = my - hsize;
int maxY = my + hsize;
// apply the effect only where the whole bounding box fits inside the image
// i.e. skip a border (of hsize) around the edges of the image
if (mx >= hsize && mx < width - hsize &&
my >= hsize && my < height - hsize) {
for(int y = minY; y < maxY; y++){
for(int x = minX; x < maxX; x++){
// pick random x,y coordinates to sample a pixel from
int rx = (int)random(minX, maxX);
int ry = (int)random(minY, maxY);
// convert the 2D random coordinates to a 1D pixel[] index
int ri = rx + (ry * width);
// replace current pixel with randomly sampled pixel (within effect bbox)
pixels[x + (y * width)] = video.pixels[ri];
}
}
}
}
updatePixels();
}
(Note that the above isn't tested, but hopefully the point gets across)
I'm trying to blur the alpha channel of a QImage. My current implementation uses the deprecated alphaChannel() method and is slow.
QImage blurImage(const QImage & image, double radius)
{
QImage newImage = image.convertToFormat(QImage::Format_ARGB32);
QImage alpha = newImage.alphaChannel();
QImage blurredAlpha = alpha;
for (int x = 0; x < alpha.width(); x++)
{
for (int y = 0; y < alpha.height(); y++)
{
uint color = calculateAverageAlpha(x, y, alpha, radius);
blurredAlpha.setPixel(x, y, color);
}
}
newImage.setAlphaChannel(blurredAlpha);
return newImage;
}
I was also trying to implement it using QGraphicsBlurEffect, but it doesn't affect alpha.
What is the proper way to blur a QImage's alpha channel?
I faced a similar problem with pixel read/write access:
Invert your loops. An image is laid out in memory as a succession of rows, so you should iterate over the height first, then over the width.
Use QImage::scanLine to access the data, rather than the expensive QImage::pixel and QImage::setPixel. Pixels in a scan line (i.e. a row) are guaranteed to be consecutive.
Your code will look like this:
for (int ii = 0; ii < image.height(); ii++) {
    uchar* scan = image.scanLine(ii);
    int depth = 4;
    for (int jj = 0; jj < image.width(); jj++) {
        // it is in fact an rgba
        QRgb* rgbpixel = reinterpret_cast<QRgb*>(scan + jj * depth);
        QColor color(*rgbpixel);
        int alpha = calculateAverageAlpha(ii, jj, color, image);
        color.setAlpha(alpha);
        // write
        *rgbpixel = color.rgba();
    }
}
You can go further and optimize the computation of the alpha average. Let's look at the sum of the pixels within a radius. Call s(x,y) the sum of the alpha values in the window of radius r centered at (x,y). When you move the window by one pixel in either direction, a single line of pixels is added and a single line is removed. Say you move horizontally: if l(x,y) is the sum of the vertical line of length 2*r + 1 centered at (x,y), you have
s(x + 1, y) = s(x, y) + l(x + r + 1, y) - l(x - r, y)
This allows you to efficiently compute a matrix of sums (and then averages, by dividing by the number of pixels) in a first pass.
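As a rough illustration of the running-sum idea, here is a minimal C++ sketch that works on a plain 8-bit alpha plane rather than a QImage. The names slidingSums, boxSums and blurAlpha are made up for this example, and the edge handling (windows are simply cut off at the image border) is an assumption, not something the recurrence above prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of v[i-r .. i+r] for every i, treating out-of-range entries as 0.
// Uses the recurrence s(i+1) = s(i) + v(i+r+1) - v(i-r).
std::vector<long> slidingSums(const std::vector<long>& v, int r) {
    assert(r >= 0);
    const int n = static_cast<int>(v.size());
    std::vector<long> s(v.size(), 0);
    if (n == 0) return s;
    long acc = 0;
    for (int i = 0; i <= r && i < n; ++i) acc += v[i];  // window around i = 0
    s[0] = acc;
    for (int i = 0; i + 1 < n; ++i) {
        if (i + r + 1 < n) acc += v[i + r + 1];  // entry entering the window
        if (i - r >= 0) acc -= v[i - r];         // entry leaving the window
        s[i + 1] = acc;
    }
    return s;
}

// Window sums over a row-major plane (index = y * width + x):
// first the vertical line sums l(x, y), then the window sums s(x, y).
std::vector<long> boxSums(std::vector<long> plane, int width, int height, int r) {
    std::vector<long> line;
    for (int x = 0; x < width; ++x) {
        line.assign(height, 0);
        for (int y = 0; y < height; ++y) line[y] = plane[y * width + x];
        line = slidingSums(line, r);
        for (int y = 0; y < height; ++y) plane[y * width + x] = line[y];
    }
    for (int y = 0; y < height; ++y) {
        line.assign(plane.begin() + y * width, plane.begin() + (y + 1) * width);
        line = slidingSums(line, r);
        std::copy(line.begin(), line.end(), plane.begin() + y * width);
    }
    return plane;
}

// Average alpha in a (2r+1) x (2r+1) window, truncated at the image border.
std::vector<std::uint8_t> blurAlpha(const std::vector<std::uint8_t>& alpha,
                                    int width, int height, int r) {
    assert(width >= 0 && height >= 0);
    assert(alpha.size() == static_cast<std::size_t>(width) * height);
    const std::vector<long> sums =
        boxSums(std::vector<long>(alpha.begin(), alpha.end()), width, height, r);
    // Running the same sums over a plane of ones gives the pixel count per window.
    const std::vector<long> counts =
        boxSums(std::vector<long>(alpha.size(), 1), width, height, r);
    std::vector<std::uint8_t> out(alpha.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<std::uint8_t>(sums[i] / counts[i]);
    return out;
}
```

The cost per pixel no longer depends on the radius, which is the whole point of the recurrence. With a QImage you would fill the input plane from the alpha bytes of each scan line and write the result back the same way.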
I suspect this kind of optimization is already implemented, and much better, in libraries such as OpenCV, so I would encourage you to use the existing OpenCV functions if you wish to save time.
I want to draw a plane like the one in the picture.
Right now I use a vertex buffer and call DrawPrimitive with D3DPT_LINESTRIP, but that does not give the effect I want.
Is there a more effective way to do this?
Please give me some advice. Thank you.
This could be an option. It is not optimal, but it would achieve that grid:
void DrawGrid (float32 Size, CColor Color, int32 GridX, int32 GridZ)
{
// Check if the size of the grid is null
if( Size <= 0 )
return;
// Calculate the data
DWORD grid_color_aux = Color.GetUint32Argb();
float32 GridXStep = Size / GridX;
float32 GridZStep = Size / GridZ;
float32 halfSize = Size * 0.5f;
// Set the attributes to the paint device
m_pD3DDevice->SetTexture(0,NULL);
m_pD3DDevice->SetFVF(CUSTOMVERTEX::getFlags());
// Draw the lines of the X axis
for( float32 i = -halfSize; i <= halfSize ; i+= GridXStep )
{
CUSTOMVERTEX v[] =
{ {i, 0.0f, -halfSize, grid_color_aux}, {i, 0.0f, halfSize, grid_color_aux} };
m_pD3DDevice->DrawPrimitiveUP( D3DPT_LINELIST,1, v,sizeof(CUSTOMVERTEX));
}
// Draw the lines of the Z axis
for( float32 i = -halfSize; i <= halfSize ; i+= GridZStep )
{
CUSTOMVERTEX v[] =
{ {-halfSize, 0.0f, i, grid_color_aux}, {halfSize, 0.0f, i, grid_color_aux} };
m_pD3DDevice->DrawPrimitiveUP( D3DPT_LINELIST,1, v,sizeof(CUSTOMVERTEX));
}
}
The CUSTOMVERTEX struct:
struct CUSTOMVERTEX
{
float32 x, y, z;
DWORD color;
static unsigned int getFlags()
{
return D3DFVF_CUSTOMVERTEX;
}
};
Note: this only draws a grid of lines, so you also need to draw a solid plane to get a result that looks like what you want.
You can use DrawPrimitive with D3DPT_TRIANGLESTRIP for a plane. Then draw the indexed lines after with D3DPT_LINELIST with a depth bias. This way, even if they lie on the plane, you won't get any z-fighting.
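To make the plane-plus-lines idea a bit more concrete, here is a small C++ sketch that only builds the data: one shared vertex array, four corner indices in triangle-strip order for the plane, and index pairs in line-list order for the grid lines. GridVertex, GridMesh and buildGrid are hypothetical names, and nothing here touches the Direct3D API; the winding of the strip and the depth bias value are left for you to settle in your own renderer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GridVertex { float x, y, z; };

struct GridMesh {
    std::vector<GridVertex> vertices;        // (cells + 1)^2 points in the XZ plane
    std::vector<std::uint16_t> planeStrip;   // 4 corner indices, triangle-strip order
    std::vector<std::uint16_t> lineIndices;  // 2 indices per grid line, line-list order
};

GridMesh buildGrid(float size, int cells) {
    assert(size > 0.0f && cells > 0 && cells < 255);  // keeps every index within 16 bits
    GridMesh mesh;
    const int n = cells + 1;  // points per side
    const float step = size / cells;
    const float half = size * 0.5f;

    // Vertex (ix, iz) is stored at index iz * n + ix.
    for (int iz = 0; iz < n; ++iz)
        for (int ix = 0; ix < n; ++ix)
            mesh.vertices.push_back({-half + ix * step, 0.0f, -half + iz * step});

    auto id = [n](int ix, int iz) { return static_cast<std::uint16_t>(iz * n + ix); };

    // The solid plane only needs the four corners.
    mesh.planeStrip = {id(0, 0), id(n - 1, 0), id(0, n - 1), id(n - 1, n - 1)};

    // One segment per grid line, running from edge to edge.
    for (int i = 0; i < n; ++i) {
        mesh.lineIndices.push_back(id(i, 0));      // line parallel to the Z axis
        mesh.lineIndices.push_back(id(i, n - 1));
        mesh.lineIndices.push_back(id(0, i));      // line parallel to the X axis
        mesh.lineIndices.push_back(id(n - 1, i));
    }
    return mesh;
}
```

The plane would then be drawn as a D3DPT_TRIANGLESTRIP with 2 primitives, and the lines as a D3DPT_LINELIST with lineIndices.size() / 2 primitives, both reading from the same vertex buffer.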
I would recommend the book Introduction to 3D programming with DirectX; it covers how to do this in great detail in Chapter 8, Section 4.
Background
I've implemented this algorithm from Microsoft Research for a radix-2 FFT (Stockham auto sort) using OpenCL.
I use floating point textures (256 cols x N rows) for input and output in the kernel, because I will need to sample at non-integral points and I thought it better to delegate that to the texture sampling hardware. Note that my FFTs are always of 256-point sequences (every row in my texture). At this point, my N is 16384 or 32768, depending on the GPU I'm using and the maximum 2D texture size allowed.
I also need to perform the FFT of 4 real-valued sequences at once, so the kernel performs the FFT(a, b, c, d) as FFT(a + ib, c + id) from which I can extract the 4 complex sequences out later using an O(n) algorithm. I can elaborate on this if someone wishes - but I don't believe it falls in the scope of this question.
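For readers who have not seen the packing trick, here is a CPU-side C++ sketch of the O(n) extraction step for one pair of real sequences. It relies on the fact that the DFT of a real sequence is conjugate-symmetric. naiveDft and splitPackedDft are names invented for this illustration, and the O(n^2) DFT is only there to keep the example self-contained; it is not how the kernel works.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Reference O(n^2) DFT, included only so the sketch can be run on its own.
std::vector<cd> naiveDft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            X[k] += x[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return X;
}

// Given Z = DFT(a + i*b) with a and b real, recover A = DFT(a) and B = DFT(b).
// Since A[n-k] = conj(A[k]) and B[n-k] = conj(B[k]):
//   Z[k]           = A[k] + i*B[k]
//   conj(Z[n-k])   = A[k] - i*B[k]
void splitPackedDft(const std::vector<cd>& Z, std::vector<cd>& A, std::vector<cd>& B) {
    const std::size_t n = Z.size();
    A.assign(n, cd());
    B.assign(n, cd());
    for (std::size_t k = 0; k < n; ++k) {
        const cd zc = std::conj(Z[(n - k) % n]);
        A[k] = 0.5 * (Z[k] + zc);            // (Z + conj) / 2
        B[k] = cd(0.0, -0.5) * (Z[k] - zc);  // (Z - conj) / (2i)
    }
}
```

In the float4 layout described above, the same split would be applied once to the (a, b) pair in .xy and once to the (c, d) pair in .zw.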
Kernel Source
const sampler_t fftSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
__kernel void FFT_Stockham(read_only image2d_t input, write_only image2d_t output, int fftSize, int size)
{
int x = get_global_id(0);
int y = get_global_id(1);
int b = floor(x / convert_float(fftSize)) * (fftSize / 2);
int offset = x % (fftSize / 2);
int x0 = b + offset;
int x1 = x0 + (size / 2);
float4 val0 = read_imagef(input, fftSampler, (int2)(x0, y));
float4 val1 = read_imagef(input, fftSampler, (int2)(x1, y));
float angle = -6.283185f * (convert_float(x) / convert_float(fftSize));
// TODO: Convert the two calculations below into lookups from a __constant buffer
float tA = native_cos(angle);
float tB = native_sin(angle);
float4 coeffs1 = (float4)(tA, tB, tA, tB);
float4 coeffs2 = (float4)(-tB, tA, -tB, tA);
float4 result = val0 + coeffs1 * val1.xxzz + coeffs2 * val1.yyww;
write_imagef(output, (int2)(x, y), result);
}
The host code simply invokes this kernel log2(256) times, ping-ponging the input and output textures.
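For reference, the ping-pong structure looks like this on the CPU. This is a textbook decimation-in-frequency Stockham formulation written in C++, so its index arithmetic is not claimed to match the kernel above line for line; stockhamFft is a name chosen for this sketch. Each iteration of the outer loop corresponds to one kernel launch, and the swap plays the role of exchanging the input and output textures.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cf = std::complex<double>;

// Radix-2 Stockham autosort FFT, in place from the caller's point of view.
// Returns the number of passes, i.e. log2(n).
int stockhamFft(std::vector<cf>& x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of two
    const double pi = std::acos(-1.0);
    std::vector<cf> y(n);                 // second buffer for ping-ponging
    int passes = 0;
    for (std::size_t len = n, stride = 1; len > 1; len /= 2, stride *= 2) {
        const std::size_t half = len / 2;
        for (std::size_t p = 0; p < half; ++p) {
            const cf w = std::polar(1.0, -2.0 * pi * double(p) / double(len));
            for (std::size_t q = 0; q < stride; ++q) {
                const cf a = x[q + stride * p];
                const cf b = x[q + stride * (p + half)];
                y[q + stride * (2 * p)] = a + b;            // even outputs
                y[q + stride * (2 * p + 1)] = (a - b) * w;  // odd outputs, twiddled
            }
        }
        std::swap(x, y);  // the output of this pass is the input of the next
        ++passes;
    }
    return passes;        // after the last swap the result is in x
}
```

For a 256-point sequence this performs 8 passes, and every pass reads and writes the whole buffer, which is why a higher radix reduces memory traffic.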
Note: I tried removing the native_cos and native_sin to see if that impacted timing, but it doesn't seem to change things by very much. Not the factor I'm looking for, in any case.
Access pattern
Knowing that I am probably memory-bandwidth bound, here is the memory access pattern (per-row) for my radix-2 FFT.
X0 - element 1 to combine (read)
X1 - element 2 to combine (read)
X - element to write to (write)
Question
So my question is - can someone help me with/point me toward a higher-radix formulation for this algorithm? I ask because most FFTs are optimized for large cases and single real/complex valued sequences. Their kernel generators are also very case dependent and break down quickly when I try to muck with their internals.
Are there other options better than simply going to a radix-8 or 16 kernel?
Some of my constraints are - I have to use OpenCL (no cuFFT). I also cannot use clAmdFft from ACML for this purpose. It would be nice to also talk about CPU optimizations (this kernel SUCKS big time on the CPU) - but getting it to run in fewer iterations on the GPU is my main use-case.
Thanks in advance for reading through all this and trying to help!
I tried several versions, but the one with the best performance on CPU and GPU was a radix-16 kernel for my specific case.
Here is the kernel for reference. It was taken from Eric Bainville's (most excellent) website and used with full attribution.
// #define M_PI 3.14159265358979f
//Global size is x.Length/2, Scale = 1 for direct, 1/N to inverse (iFFT)
__kernel void ConjugateAndScale(__global float4* x, const float Scale)
{
int i = get_global_id(0);
float temp = Scale;
float4 t = (float4)(temp, -temp, temp, -temp);
x[i] *= t;
}
// Return a*EXP(-I*PI*1/2) = a*(-I)
float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
// Return a^2
float2 sqr_1(float2 a)
{ return (float2)(a.x*a.x-a.y*a.y,2.0f*a.x*a.y); }
// Return the 2x DFT2 of the four complex numbers in A
// If A=(a,b,c,d) then return (a',b',c',d') where (a',c')=DFT2(a,c)
// and (b',d')=DFT2(b,d).
float8 dft2_4(float8 a) { return (float8)(a.lo+a.hi,a.lo-a.hi); }
// Return the DFT of 4 complex numbers in A
float8 dft4_4(float8 a)
{
// 2x DFT2
float8 x = dft2_4(a);
// Shuffle, twiddle, and 2x DFT2
return dft2_4((float8)(x.lo.lo,x.hi.lo,x.lo.hi,mul_p1q2(x.hi.hi)));
}
// Complex product, multiply vectors of complex numbers
#define MUL_RE(a,b) (a.even*b.even - a.odd*b.odd)
#define MUL_IM(a,b) (a.even*b.odd + a.odd*b.even)
float2 mul_1(float2 a, float2 b)
{ float2 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_1_F4(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_2(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
// Return the DFT2 of the two complex numbers in vector A
float4 dft2_2(float4 a) { return (float4)(a.lo+a.hi,a.lo-a.hi); }
// Return cos(alpha)+I*sin(alpha) (3 variants)
float2 exp_alpha_1(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
//cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float2)(cs,sn);
}
// Return cos(alpha)+I*sin(alpha) (3 variants)
float4 exp_alpha_1_F4(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
// cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float4)(cs,sn,cs,sn);
}
// mul_p*q*(a) returns a*EXP(-I*PI*P/Q)
#define mul_p0q1(a) (a)
#define mul_p0q2 mul_p0q1
//float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
__constant float SQRT_1_2 = 0.707106781186548; // cos(Pi/4)
#define mul_p0q4 mul_p0q2
float2 mul_p1q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(a.x+a.y,-a.x+a.y); }
#define mul_p2q4 mul_p1q2
float2 mul_p3q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(-a.x+a.y,-a.x-a.y); }
__constant float COS_8 = 0.923879532511287; // cos(Pi/8)
__constant float SIN_8 = 0.382683432365089; // sin(Pi/8)
#define mul_p0q8 mul_p0q4
float2 mul_p1q8(float2 a) { return mul_1((float2)(COS_8,-SIN_8),a); }
#define mul_p2q8 mul_p1q4
float2 mul_p3q8(float2 a) { return mul_1((float2)(SIN_8,-COS_8),a); }
#define mul_p4q8 mul_p2q4
float2 mul_p5q8(float2 a) { return mul_1((float2)(-SIN_8,-COS_8),a); }
#define mul_p6q8 mul_p3q4
float2 mul_p7q8(float2 a) { return mul_1((float2)(-COS_8,-SIN_8),a); }
// Compute in-place DFT2 and twiddle
#define DFT2_TWIDDLE(a,b,t) { float2 tmp = t(a-b); a += b; b = tmp; }
// T = N/16 = number of threads.
// P is the length of input sub-sequences, 1,16,256,...,N/16.
__kernel void FFT_Radix16(__global const float4 * x, __global float4 * y, int pp)
{
int p = pp;
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
////// y[i] = 2*x[i];
////// return;
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,..,15}*T
x += i;
// Output indices are J+{0,..,15}*P, where
// J is I with four 0 bits inserted at bit log2(P)
y += ((i-k)<<4) + k;
// Load
float4 u[16];
for (int m=0;m<16;m++) u[m] = x[m*t];
// Twiddle, twiddling factors are exp(_I*PI*{0,..,15}*K/4P)
float alpha = -M_PI*(float)k/(float)(8*p);
for (int m=1;m<16;m++) u[m] = mul_1_F4(exp_alpha_1_F4(m * alpha), u[m]);
// 8x in-place DFT2 and twiddle (1)
DFT2_TWIDDLE(u[0].lo,u[8].lo,mul_p0q8);
DFT2_TWIDDLE(u[0].hi,u[8].hi,mul_p0q8);
DFT2_TWIDDLE(u[1].lo,u[9].lo,mul_p1q8);
DFT2_TWIDDLE(u[1].hi,u[9].hi,mul_p1q8);
DFT2_TWIDDLE(u[2].lo,u[10].lo,mul_p2q8);
DFT2_TWIDDLE(u[2].hi,u[10].hi,mul_p2q8);
DFT2_TWIDDLE(u[3].lo,u[11].lo,mul_p3q8);
DFT2_TWIDDLE(u[3].hi,u[11].hi,mul_p3q8);
DFT2_TWIDDLE(u[4].lo,u[12].lo,mul_p4q8);
DFT2_TWIDDLE(u[4].hi,u[12].hi,mul_p4q8);
DFT2_TWIDDLE(u[5].lo,u[13].lo,mul_p5q8);
DFT2_TWIDDLE(u[5].hi,u[13].hi,mul_p5q8);
DFT2_TWIDDLE(u[6].lo,u[14].lo,mul_p6q8);
DFT2_TWIDDLE(u[6].hi,u[14].hi,mul_p6q8);
DFT2_TWIDDLE(u[7].lo,u[15].lo,mul_p7q8);
DFT2_TWIDDLE(u[7].hi,u[15].hi,mul_p7q8);
// 8x in-place DFT2 and twiddle (2)
DFT2_TWIDDLE(u[0].lo,u[4].lo,mul_p0q4);
DFT2_TWIDDLE(u[0].hi,u[4].hi,mul_p0q4);
DFT2_TWIDDLE(u[1].lo,u[5].lo,mul_p1q4);
DFT2_TWIDDLE(u[1].hi,u[5].hi,mul_p1q4);
DFT2_TWIDDLE(u[2].lo,u[6].lo,mul_p2q4);
DFT2_TWIDDLE(u[2].hi,u[6].hi,mul_p2q4);
DFT2_TWIDDLE(u[3].lo,u[7].lo,mul_p3q4);
DFT2_TWIDDLE(u[3].hi,u[7].hi,mul_p3q4);
DFT2_TWIDDLE(u[8].lo,u[12].lo,mul_p0q4);
DFT2_TWIDDLE(u[8].hi,u[12].hi,mul_p0q4);
DFT2_TWIDDLE(u[9].lo,u[13].lo,mul_p1q4);
DFT2_TWIDDLE(u[9].hi,u[13].hi,mul_p1q4);
DFT2_TWIDDLE(u[10].lo,u[14].lo,mul_p2q4);
DFT2_TWIDDLE(u[10].hi,u[14].hi,mul_p2q4);
DFT2_TWIDDLE(u[11].lo,u[15].lo,mul_p3q4);
DFT2_TWIDDLE(u[11].hi,u[15].hi,mul_p3q4);
// 8x in-place DFT2 and twiddle (3)
DFT2_TWIDDLE(u[0].lo,u[2].lo,mul_p0q2);
DFT2_TWIDDLE(u[0].hi,u[2].hi,mul_p0q2);
DFT2_TWIDDLE(u[1].lo,u[3].lo,mul_p1q2);
DFT2_TWIDDLE(u[1].hi,u[3].hi,mul_p1q2);
DFT2_TWIDDLE(u[4].lo,u[6].lo,mul_p0q2);
DFT2_TWIDDLE(u[4].hi,u[6].hi,mul_p0q2);
DFT2_TWIDDLE(u[5].lo,u[7].lo,mul_p1q2);
DFT2_TWIDDLE(u[5].hi,u[7].hi,mul_p1q2);
DFT2_TWIDDLE(u[8].lo,u[10].lo,mul_p0q2);
DFT2_TWIDDLE(u[8].hi,u[10].hi,mul_p0q2);
DFT2_TWIDDLE(u[9].lo,u[11].lo,mul_p1q2);
DFT2_TWIDDLE(u[9].hi,u[11].hi,mul_p1q2);
DFT2_TWIDDLE(u[12].lo,u[14].lo,mul_p0q2);
DFT2_TWIDDLE(u[12].hi,u[14].hi,mul_p0q2);
DFT2_TWIDDLE(u[13].lo,u[15].lo,mul_p1q2);
DFT2_TWIDDLE(u[13].hi,u[15].hi,mul_p1q2);
// 8x DFT2 and store (reverse binary permutation)
y[0] = u[0] + u[1];
y[p] = u[8] + u[9];
y[2*p] = u[4] + u[5];
y[3*p] = u[12] + u[13];
y[4*p] = u[2] + u[3];
y[5*p] = u[10] + u[11];
y[6*p] = u[6] + u[7];
y[7*p] = u[14] + u[15];
y[8*p] = u[0] - u[1];
y[9*p] = u[8] - u[9];
y[10*p] = u[4] - u[5];
y[11*p] = u[12] - u[13];
y[12*p] = u[2] - u[3];
y[13*p] = u[10] - u[11];
y[14*p] = u[6] - u[7];
y[15*p] = u[14] - u[15];
}
Note that I have modified the kernel to perform the FFT of 2 complex-valued sequences at once instead of one. Also, since I only need the FFT of 256 elements at a time in a much larger sequence, I perform only 2 runs of this kernel, which leaves me with 256-length DFTs in the larger array.
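To spell out where the "2 runs" come from, here is a tiny C++ sketch of the bookkeeping, assuming the DFT length is a power of 16. radix16Passes and outputBase are hypothetical helper names; the second one simply mirrors the k and y offset arithmetic of the kernel above so the index mapping can be checked on the CPU.

```cpp
#include <cassert>
#include <vector>

// Values of the kernel argument P (sub-sequence length before each pass)
// for a radix-16 FFT that builds DFTs of length dftLength.
std::vector<int> radix16Passes(int dftLength) {
    assert(dftLength >= 16);
    std::vector<int> p;
    int len = 1;
    for (; len < dftLength; len *= 16) p.push_back(len);
    assert(len == dftLength);  // only powers of 16 are handled in this sketch
    return p;
}

// Output offset used by work item i in a pass with sub-sequence length p:
// i with four zero bits inserted at bit log2(p), as in the kernel.
int outputBase(int i, int p) {
    const int k = i & (p - 1);  // index in the input sub-sequence, 0..p-1
    return ((i - k) << 4) + k;
}
```

For a length of 256 this yields P = 1 and then P = 16, which matches the host loop below multiplying fftSize by 16 after each launch.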
Here's some of the relevant host code as well.
var ev = new[] { new Cl.Event() };
var pEv = new[] { new Cl.Event() };
int fftSize = 1;
int iter = 0;
int n = distributionSize >> 5;
while (fftSize <= n)
{
Cl.SetKernelArg(fftKernel, 0, memA);
Cl.SetKernelArg(fftKernel, 1, memB);
Cl.SetKernelArg(fftKernel, 2, fftSize);
Cl.EnqueueNDRangeKernel(commandQueue, fftKernel, 1, null, globalWorkgroupSize, localWorkgroupSize,
(uint)(iter == 0 ? 0 : 1),
iter == 0 ? null : pEv,
out ev[0]).Check();
if (iter > 0)
pEv[0].Dispose();
Swap(ref ev, ref pEv);
Swap(ref memA, ref memB); // ping-pong
fftSize = fftSize << 4;
iter++;
Cl.Finish(commandQueue);
}
Swap(ref memA, ref memB);
Hope this helps someone!
I am trying to display a mathematical surface f(x,y) defined on an XY regular mesh using OpenGL and C++ in an efficient manner:
struct XYRegularSurface {
double x0, y0;
double dx, dy;
int nx, ny;
XYRegularSurface(int nx_, int ny_) : nx(nx_), ny(ny_) {
z = new float[nx*ny];
}
~XYRegularSurface() {
delete [] z;
}
float& operator()(int ix, int iy) {
return z[ix*ny + iy];
}
float x(int ix, int iy) {
return x0 + ix*dx;
}
float y(int ix, int iy) {
return y0 + iy*dy;
}
float zmin();
float zmax();
float* z;
};
Here is my OpenGL paint code so far:
void color(QColor & col) {
float r = col.red()/255.0f;
float g = col.green()/255.0f;
float b = col.blue()/255.0f;
glColor3f(r,g,b);
}
void paintGL_XYRegularSurface(XYRegularSurface &surface, float zmin, float zmax) {
float x, y, z;
QColor col;
glBegin(GL_QUADS);
for(int ix = 0; ix < surface.nx - 1; ix++) {
for(int iy = 0; iy < surface.ny - 1; iy++) {
x = surface.x(ix,iy);
y = surface.y(ix,iy);
z = surface(ix,iy);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix + 1, iy);
y = surface.y(ix + 1, iy);
z = surface(ix + 1,iy);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix + 1, iy + 1);
y = surface.y(ix + 1, iy + 1);
z = surface(ix + 1,iy + 1);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix, iy + 1);
y = surface.y(ix, iy + 1);
z = surface(ix,iy + 1);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
}
}
glEnd();
}
The problem is that this is slow, nx=ny=1000 and fps ~= 1.
How do I optimize this to be faster?
EDIT: following your suggestion (thanks!) regarding VBO
I added:
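float* XYRegularSurface::xyz() {
    // interleaved x, y, z per vertex; the caller owns the returned array
    float* data = new float[3*nx*ny];
    long i = 0;
    for(int ix = 0; ix < nx; ix++) {
        for(int iy = 0; iy < ny; iy++) {
            data[i++] = x(ix,iy);
            data[i++] = y(ix,iy);
            // z holds one value per vertex, so index it by vertex, not by i
            data[i++] = z[ix*ny + iy];
        }
    }
    return data;
}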
I think I understand how I can create a VBO, initialize it to xyz() and send it to the GPU in one go, but how do I use the VBO when drawing? I understand that this can be done either in the vertex shader or with glDrawElements. I assume the latter is easier? If so: I do not see any QUAD mode in the documentation for glDrawElements!?
Edit2:
So I can loop through all nx*ny quads and draw each by:
GLuint indices[4]; // GL_UNSIGNED_INT is an enum value, not a type
// ... set indices
glDrawElements(GL_QUADS, 4, GL_UNSIGNED_INT, indices); // count = number of indices
?
1/. Use display lists, to cache GL commands - avoiding recalculation of the vertices and the expensive per-vertex call overhead. If the data is updated, you need to look at client-side vertex arrays (not to be confused with VAOs). Now ignore this option...
2/. Use vertex buffer objects. Available as of GL 1.5.
Since you need VBOs for core profile anyway (i.e., modern GL), you can at least get to grips with this first.
Well, you've asked a rather open ended question. I'd suggest using modern (3.0+) OpenGL for everything. The point of just about any new OpenGL feature is to provide a faster way to do things. Like everyone else is suggesting, use array (vertex) buffer objects and vertex array objects. Use an element array (index) buffer object too. Most GPUs have a 'post-transform cache', which stores the last few transformed vertices, but this can only be used when you call the glDraw*Elements family of functions. I also suggest you store a flat mesh in your VBO, where y=0 for each vertex. Sample the y from a heightmap texture in your vertex shader. If you do this, whenever the surface changes you will only need to update the heightmap texture, which is easier than updating the VBO. Use one of the floating point or integer texture formats for a heightmap, so you aren't restricted to having your values be between 0 and 1.
If so: I do not see any QUAD mode in the documentation for glDrawElements!?
If you want quads make sure you're looking at the GL 2.1-era docs, not the new stuff.
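If you would rather stay away from GL_QUADS (it is not available in the core profile), the usual route is to split each cell into two triangles and draw the whole surface with a single indexed call. Below is a C++ sketch of the index generation only, assuming the vertices are stored in the same ix*ny + iy order as the surface above. gridTriangleIndices is a made-up name, and the winding order is an arbitrary choice you may need to flip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two triangles per grid cell; vertex (ix, iy) is assumed to live at ix * ny + iy.
std::vector<std::uint32_t> gridTriangleIndices(int nx, int ny) {
    assert(nx >= 2 && ny >= 2);
    std::vector<std::uint32_t> idx;
    idx.reserve(static_cast<std::size_t>(nx - 1) * (ny - 1) * 6);
    for (int ix = 0; ix + 1 < nx; ++ix) {
        for (int iy = 0; iy + 1 < ny; ++iy) {
            const std::uint32_t v00 = static_cast<std::uint32_t>(ix * ny + iy);
            const std::uint32_t v10 = v00 + static_cast<std::uint32_t>(ny);  // (ix+1, iy)
            const std::uint32_t v11 = v10 + 1;                               // (ix+1, iy+1)
            const std::uint32_t v01 = v00 + 1;                               // (ix,   iy+1)
            idx.insert(idx.end(), {v00, v10, v11, v00, v11, v01});
        }
    }
    return idx;
}
```

The resulting array would go into an element array buffer once and then be drawn with glDrawElements using GL_TRIANGLES and GL_UNSIGNED_INT, with the array size as the count, instead of issuing one call per quad.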