Updating Uniform Buffer Data in WebGL 2?

Unlike OpenGL ES 3, WebGL 2 has no gl.mapBufferRange (though gl.bufferSubData does exist). What is the efficient way to update uniform buffer data in WebGL 2?
For example, a PerDraw uniform block:
uniform PerDraw
{
    mat4 P;
    mat4 MV;
    mat3 MNormal;
} u_perDraw;

gl.bufferSubData exists, so it would seem you create a buffer and a parallel typed array, update the typed array, then call gl.bufferSubData to copy it into the buffer and gl.bindBufferRange to use it.
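A minimal sketch of that flow (assuming program is already linked; binding point 0 is an arbitrary choice):

// setup: query the block's size, create the buffer and a parallel typed array
const blockIndex = gl.getUniformBlockIndex(program, "PerDraw");
const blockSize = gl.getActiveUniformBlockParameter(
    program, blockIndex, gl.UNIFORM_BLOCK_DATA_SIZE);
gl.uniformBlockBinding(program, blockIndex, 0);
const uboBuffer = gl.createBuffer();
gl.bindBuffer(gl.UNIFORM_BUFFER, uboBuffer);
gl.bufferData(gl.UNIFORM_BUFFER, blockSize, gl.DYNAMIC_DRAW);
const cpuData = new Float32Array(blockSize / 4);

// per draw: update the typed array, then one copy + one bind
// ... write P, MV, MNormal values into cpuData ...
gl.bindBuffer(gl.UNIFORM_BUFFER, uboBuffer);
gl.bufferSubData(gl.UNIFORM_BUFFER, 0, cpuData);
gl.bindBufferRange(gl.UNIFORM_BUFFER, 0, uboBuffer, 0, blockSize);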
That's probably still very fast. First of all, value manipulation stays in JavaScript, so there's less overhead from calling into WebGL. If you have 10 uniforms to update, that means you're making 2 calls into WebGL instead of 10.
In TWGL.js I generate ArrayBufferViews for all uniforms into a single typed array, so for example, given your uniform block above you can do
ubo.MV[12] = tx;
ubo.MV[13] = ty;
ubo.MV[14] = tz;
Or, as another example, if you have a math library that takes an array/typed array as a destination parameter, you can do things like
var dest = ubo.P;
m4.perspective(fov, aspect, zNear, zFar, dest);
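Those views can be built by hand from the offsets WebGL 2 reports. A sketch, reusing cpuData from the snippet above (MNormal gets 12 floats because std140 pads each mat3 column to a vec4):

const names = ["PerDraw.P", "PerDraw.MV", "PerDraw.MNormal"];
const indices = gl.getUniformIndices(program, names);
const offsets = gl.getActiveUniforms(program, indices, gl.UNIFORM_OFFSET);
const ubo = {
  P:       new Float32Array(cpuData.buffer, offsets[0], 16),
  MV:      new Float32Array(cpuData.buffer, offsets[1], 16),
  MNormal: new Float32Array(cpuData.buffer, offsets[2], 12),
};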
The one issue I have is dealing with uniform optimization. If I edit a shader (say I'm debugging and I just insert output = vec4(1,0,0,1); return; at the top of a fragment shader) and some uniform block gets optimized out, the code is going to break. I don't know what the standard way of dealing with this is in C/C++ projects. I guess in C++ you'd declare a structure
struct PerDraw {
    float P[16];
    float MV[16];
    float MNormal[9];
};
So the problem kind of goes away. In twgl.js I'm effectively generating that structure at runtime, which means if your code expects it to exist but it doesn't get generated, because it's been optimized out, then the code breaks.
In twgl I made a function that copies from a JavaScript object to the typed array, so I can skip any optimized-out uniform blocks, which unfortunately adds some overhead. You're free to modify the typed-array views directly and deal with the breakage when debugging, or to use the structured copy function (twgl.setBlockUniforms).
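For example, a sketch from memory of twgl's uniform-block helpers (check the twgl docs for exact signatures; programInfo is a twgl program info object):

const uboInfo = twgl.createUniformBlockInfo(gl, programInfo, "PerDraw");
twgl.setBlockUniforms(uboInfo, {
  P: m4.perspective(fov, aspect, zNear, zFar),
  MV: modelView,  // a mat4 computed elsewhere
});
twgl.setUniformBlock(gl, programInfo, uboInfo);  // uploads the typed array and binds the range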
Maybe I should let you specify a structure from JavaScript in twgl and generate it, and it would be up to you to make it match the uniform block object. That would make it more like C++, remove one copy, and be easier to deal with when debugging optimizations remove blocks.

Related

How to create a RAWSXP vector from C char* ptr without reallocation

Is there a way of creating a RAWSXP vector that is backed by an existing C char* ptr?
Below I show my current working version, which needs to reallocate and copy the bytes, and a second imagined version that doesn't exist.
// My current slow solution that uses lots of memory
SEXP getData() {
    // has size and data
    Response resp = expensive_call();
    // copy over byte by byte
    SEXP respVec = Rf_allocVector(RAWSXP, resp.size);
    Rbyte* ptr = RAW(respVec);
    memcpy(ptr, resp.data, resp.size);
    // free the original memory
    free(resp.data);
    return respVec;
}
// My imagined solution
SEXP getDataFast() {
    // has size and data
    Response resp = expensive_call();
    // reuse the ptr
    SEXP respVec = Rf_allocVectorViaPtr(RAWSXP, resp.data, resp.size);
    return respVec;
}
I also noticed Rf_allocVector3, which seems to give control over memory allocation of the vector, but I couldn't get it to work. This is my first time writing an R extension, so I imagine I must be doing something stupid. I'm trying to avoid the copy, as the data will be around a GB (very large, though sparse, matrices).
Copying 1 GB takes well under a second. If your call is expensive, the copy may be a marginal cost; profile to see if it's really a bottleneck.
The way you are trying to do things is probably not possible, because how would R know how to garbage collect the data?
But assuming you are using STL containers, one neat trick I've recently seen is to use the second template argument of STL containers -- the allocator.
template<
    class T,
    class Allocator = std::allocator<T>
> class vector;
The general outline of the strategy is like this:
Create a custom allocator using R-memory that meets all the requirements (essentially you just need allocate and deallocate)
Every time you need to return data to R from an STL container, make sure you initialize it with your custom allocator
On returning the data, pull out the underlying R data created by your R-memory allocator -- no copy
This approach gives you all the flexibility of STL containers while using only memory R is aware of.
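A minimal sketch of such an allocator, assuming C++ with Rinternals.h available. RAllocator and its backing member are illustrative names, and GC handling is deliberately simplified (a production version must release objects and cope with vector reallocation):

#include <Rinternals.h>
#include <vector>
#include <cstddef>

template <class T>
struct RAllocator {
    using value_type = T;
    SEXP backing = R_NilValue;  // R vector owning the most recent allocation

    RAllocator() = default;
    template <class U> RAllocator(const RAllocator<U>&) {}

    T* allocate(std::size_t n) {
        SEXP s = Rf_allocVector(RAWSXP, n * sizeof(T));
        R_PreserveObject(s);  // keep it alive across garbage collections
        backing = s;
        return reinterpret_cast<T*>(RAW(s));
    }
    void deallocate(T*, std::size_t) {
        // a real version would R_ReleaseObject the matching SEXP here
    }
    template <class U> bool operator==(const RAllocator<U>&) const { return true; }
    template <class U> bool operator!=(const RAllocator<U>&) const { return false; }
};

// usage sketch: reserve once so there is exactly one backing RAWSXP,
// fill it, then hand that SEXP back to R without copying
// std::vector<Rbyte, RAllocator<Rbyte>> buf;
// buf.reserve(resp.size);
// ... fill buf ...
// return buf.get_allocator().backing;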

Is it ok to create big array of AVX/SSE values

I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I calculate a column in a matrix where each cell is a structure of AVX2 registers (__m256i). I use values from the previous matrix column as input for calculating the current column. Columns can be big, so I have an array of structures (on the stack), where each structure has two __m256i elements.
Structure:
struct Cell {
    __m256i first;
    __m256i second;
};
And then I have an array like this: Cell prevColumn[N]. N will typically be a few hundred.
I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (the number of AVX registers). Is it good practice to create such an array, or is there some better approach I should use when storing a lot of __m256i values that are going to be reused soon?
Also, is there any aligning I should be doing with these structures? I have read a lot about aligning, but I am still not sure how and when to do it exactly.
It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to be too bad for performance, but don't do that unless you need to.
I think it's more typical to write your code with buffers of int[] type, rather than __m256i type, but I'm not sure. Either way works, and should get the compiler to generate efficient code. But the int[] way means less code has to be different for the SSE, AVX2, and AVX512 versions. And it might make it easier to examine things with a debugger, since your data is in an array with a type the debugger will format nicely.
As I understand it, the load/store intrinsics are partly there as a cast between __m256i and int[], since AVX doesn't fault on unaligned access, it just slows down across cache-line boundaries. Assigning to/from an array of __m256i should work fine, generating load/store instructions where needed, or vector instructions with memory source operands (for more compact code and fewer fused-domain uops).
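A minimal sketch of that pattern (the arithmetic is a placeholder; alignment of stack arrays is automatic because the __m256i members give Cell 32-byte alignment):

#include <immintrin.h>

struct Cell {
    __m256i first;
    __m256i second;
};  // alignof(Cell) == 32, so Cell prevColumn[N] on the stack is aligned

void step(const Cell* prev, Cell* cur, int n) {
    for (int i = 0; i < n; ++i) {
        // plain assignments compile to vector loads/stores, or to vector
        // instructions with memory source operands; only a handful of
        // values occupy the 16 YMM registers at any one time
        __m256i a = prev[i].first;
        __m256i b = prev[i].second;
        cur[i].first  = _mm256_add_epi32(a, b);  // placeholder math
        cur[i].second = _mm256_sub_epi32(a, b);
    }
}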

Options to structure compile-time constants in OpenCL?

I need to pass a bunch of constants into my OpenCL kernel. Luckily, these are mostly known at compile-time, meaning: kernel compile-time. And therefore I can pass them in as a bunch of defines like -D leftOuterMargin=3 -D rightOuterMargin=2 -D leftInnerMargin=1 -D rightInnerMargin=2 ....
But this gets a bit unwieldy, and makes it hard to write re-usable functions inside the kernel. I'm looking for something a bit more structured, like, say, structs. However, structs would seem to be stored either in constant space (if created via a global constant instantiation, probably through an appropriate #define) or in private space (if created inside the kernel function, again probably through an appropriate #define).
What options are available for structuring constant data that is known at compile time in a kernel? Some things I hope for:
structured, can just pass a single pointer-like thing into reusable methods, rather than having 8-16 clumsily-named #defines, or 8 parameters into each non-kernel method
the values load quickly when used, just like normal #defined values
the values can be used by the compiler for optimizations, eg if I use one as a loop upper bound, that loop can be fully unrolled at compile time
won't increase register pressure
relatively standard, will work generally, across different gpu platforms/manufacturers
This:
the values load quickly when used, just like normal #defined values
the values can be used by the compiler for optimizations, eg if I use one as a loop upper bound, that loop can be fully unrolled at compile time
won't increase register pressure
Is incompatible with this:
can just pass a single pointer-like thing into reusable methods [...] or 8 parameters into each non-kernel method
If you want the values to be constant expressions (in other words, "known to the compiler"), then the only options are #define and global-scope const. No passing values dynamically by parameter or by indirection.
I suggest you make a struct with your different options:
typedef struct {
    int leftOuterMargin;
    int rightOuterMargin;
    int leftInnerMargin;
    int rightInnerMargin;
    // and so on ...
} Options;
Then you can define a header included in all translation units where the constants are required:
// constants.h
static const Options constants = {
    .leftOuterMargin = 3,
    .rightOuterMargin = 2,
    .leftInnerMargin = 1,
    .rightInnerMargin = 2
};
The compiler should be able to optimize your code just as well as if you had used #define.
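A hedged usage sketch (apply_margins and run are illustrative names; note that some OpenCL versions require program-scope variables to be declared in the __constant address space):

// kernel.cl
#include "constants.h"

int apply_margins(int x, const Options opt) {
    // opt's fields are compile-time constants here, so the compiler
    // can fold them and fully unroll any loops they bound
    return x + opt.leftOuterMargin - opt.rightOuterMargin;
}

kernel void run(global int* data) {
    int i = get_global_id(0);
    data[i] = apply_margins(data[i], constants);
}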

Variable vs Constant

I was wondering about the differences between variables and constants, as I see different declarations of variables/constants in the code written by ex-colleagues.
I know that a variable is something that can be changed throughout the code, while the value of a constant is fixed and can't be changed. So far I've written everything as variables (even when the value will not change). Is my practice incorrect? Perhaps my code is not complicated, which is why I use variables all the time.
Anyhow, if my understanding is proven wrong, please enlighten me with the correct guidelines on this matter.
It is a good code practice to use constants whenever possible.
At compile time it is known that only read operations can be performed on those values, so the compiler can apply certain optimizations automatically, which can improve performance.
Another difference is that constants may be stored in a separate, preallocated section of your binary (compiler dependent, but on most compilers this is what happens), which makes them easy to access, and they don't get allocated/deallocated all the time (another performance benefit).
And finally, constants can be evaluated at compile time.
For example, if you have an expression made only of constants, something like the following:
float a = const1 * const2 / const3 + const4;
Then the whole expression will be evaluated at compile time, saving cycles at runtime (since the value will always be the same).
Some popular constants used for this sort of optimization are PI, PI/2, PI/4, and 1/PI.
const int const_a = 10;
int static_a = 70;

public void sample()
{
    static_a = const_a + 10; // This is correct
    // const_a = 88;         // This is wrong
}
In the above example, since const_a is declared as const, we can't assign to it anywhere, but we can still read and use its value.

Common subexpression elimination OpenCL

I have a very large kernel which uses ~1000 temporary variables to compute ~1000 equations. So it is safe to assume that all of the temporary variables will be put in private off-chip memory, i.e. what CUDA calls "local" memory (I know it's bad, but there is no other way).
My question is whether common subexpressions are eliminated between neighboring lines like these:
const float t747 = t472*t28*t26*t715*t30*t11;
const float t748 = t472*t28*t717*t26*t30*t11;
As you can see, the only difference is the variable t717 versus t715. The question is whether those two lines translate into 7 or 12 global loads: 12 if each of the 6 operands per line is loaded separately, 7 if the 5 shared operands are loaded once and reused.
Because if the target compiler (an Nvidia Kepler GPU in my case) does not use registers to cache common subexpressions between lines, I'm going to need to implement it myself.
Note: All code is generated automatically, so manual tuning won't be possible.
EDIT: All t0-t999 variables are declared as "const float".
The compiler translates all the global reads as direct reads. So, in your case, 12 reads.
This is because global memory is considered volatile, so caching is not possible. However, if you simply do this (I think you know it, but anyway...):
const float temp = t472*t28*t26*t30*t11;
const float t747 = temp*t715;
const float t748 = temp*t717;
The compiler will translate that into 7 global reads.
NOTE: At least this was valid with old architectures; I don't know if there is some newer compiler/architecture that can cleverly detect these cases and optimize them.
