I need to pass a bunch of constants into my OpenCL kernel. Luckily, these are mostly known at compile-time, meaning: kernel compile-time. And therefore I can pass them in as a bunch of defines like -D leftOuterMargin=3 -D rightOuterMargin=2 -D leftInnerMargin=1 -D rightInnerMargin=2 ....
But this gets a bit unwieldy, and makes it hard to write reusable functions inside the kernel. I'm looking for something a bit more structured, like, say, structs. However, structs would seem to be stored either in constant space (if created via a global constant instantiation, probably through an appropriate #define), or private space (if created inside the kernel function, again probably through an appropriate #define)?
What options are available for structuring constant data in a kernel, that is known at compile-time? Some things I hope for:
structured, can just pass a single pointer-like thing into reusable methods, rather than having 8-16 clumsily-named #defines, or 8 parameters into each non-kernel method
the values load quickly when used, just like normal #defined values
the values can be used by the compiler for optimizations, eg if I use one for a loop upper-bound, that loop can be fully unwrapped at compile time
won't increase register pressure
relatively standard, will work generally, across different gpu platforms/manufacturers
This:
the values load quickly when used, just like normal #defined values
the values can be used by the compiler for optimizations, eg if I use one for a loop upper-bound, that loop can be fully unwrapped at compile time
won't increase register pressure
Is incompatible with this:
can just pass a single pointer-like thing into reusable methods [...] or 8 parameters into each non-kernel method
If you want the values to be constant expressions (in other words, "known to the compiler"), then the only options are #define and global-scope const. No passing values dynamically by parameter or by indirection.
I suggest you make a struct with your different options:
struct Options {
int leftOuterMargin;
int rightOuterMargin;
int leftInnerMargin;
int rightInnerMargin;
// and so on ...
};
Then you can define a header included in all translation units where the constants are required:
// constants.h
static const struct Options constants = {
.leftOuterMargin = 3,
.rightOuterMargin = 2,
.leftInnerMargin = 1,
.rightInnerMargin = 2
};
The compiler should be able to optimize your code just as well as if you had used #define.
Unlike OpenGL ES 3, WebGL 2 lacks gl.mapBufferRange (though gl.bufferSubData exists), so what is the efficient way to update uniform buffer data in WebGL 2?
For example, a PerDraw Uniform block
uniform PerDraw
{
mat4 P;
mat4 MV;
mat3 MNormal;
} u_perDraw;
Since gl.bufferSubData exists, it would seem you create a buffer and a parallel typed array, update the typed array, then call
gl.bufferSubData to copy it into the buffer to do the update, and gl.bindBufferRange to use it.
That's probably still very fast. First of all, value manipulation stays in JavaScript, so there's less overhead from calling into WebGL. If you have 10 uniforms to update, it means you're making 2 calls into WebGL instead of 10.
In TWGL.js I generate ArrayBufferViews for all uniforms into a single typed array, so, for example, given your uniform block above you can do
ubo.MV[12] = tx;
ubo.MV[13] = ty;
ubo.MV[14] = tz;
Or, as another example, if you have a math library that takes an array/typed array as a destination parameter, you can do stuff like
var dest = ubo.P;
m4.perspective(fov, aspect, zNear, zFar, dest);
The one issue I have is dealing with uniform optimization. If I edit a shader while debugging, say by inserting output = vec4(1,0,0,1); return; at the top of a fragment shader, and some uniform block gets optimized out, the code is going to break. I don't know what the standard way of dealing with this is in C/C++ projects. I guess in C++ you'd declare a structure
struct PerDraw {
float P[16];
float MV[16];
float MNormal[9];
};
So the problem kind of goes away. In twgl.js I'm effectively generating that structure at runtime, which means that if your code expects it to exist but it doesn't get generated because it's been optimized out, then the code breaks.
In twgl I made a function that copies from a JavaScript object to the typed array, so I can skip any optimized-out uniform blocks, which unfortunately adds some overhead. You're free to modify the typed-array views directly and deal with the breakage when debugging, or to use the structured copy function (twgl.setBlockUniforms).
Maybe I should let you specify a structure from JavaScript in twgl and have it generated, leaving it up to you to make it match the uniform block object. That would make it more like C++, remove one copy, and be easier to deal with when optimization removes blocks during debugging.
I've got a symbol that represents the name of a function to be called:
julia> func_sym = :tanh
I can use that symbol to get the tanh function and call it using:
julia> eval(func_sym)(2)
0.9640275800758169
But I'd rather avoid the 'eval' there as it will be called many times and it's expensive (and func_sym can have several different values depending on context).
IIRC in Ruby you can say something like:
obj.send(func_sym, args)
Is there something similar in Julia?
EDIT: some more details on why I have functions represented by symbols:
I have a type (from a neural network) that includes the activation function; originally I included it as a function:
type NeuralLayer
weights::Matrix{Float32}
biases::Vector{Float32}
a_func::Function
end
However, I needed to serialize these things to files using JLD, but it's not possible to serialize a Function, so I went with a symbol:
type NeuralLayer
weights::Matrix{Float32}
biases::Vector{Float32}
a_func::Symbol
end
And currently I use the eval approach above to call the activation function. There are collections of NeuralLayers and each can have its own activation function.
@Isaiah's answer is spot-on; perhaps even more so after the edit to the original question. To elaborate and make this more specific to your case, I'd change your NeuralLayer type to be parametric:
type NeuralLayer{func_type}
weights::Matrix{Float32}
biases::Vector{Float32}
end
Since func_type doesn't appear in the types of the fields, the constructor will require you to explicitly specify it: layer = NeuralLayer{:excitatory}(w, b). One restriction here is that you cannot modify a type parameter.
Now, func_type could be a symbol (like you're doing now) or it could be a more functionally relevant parameter (or parameters) that tunes your activation function. Then you define your activation functions like this:
# If you define your NeuralLayer with just one parameter:
activation(layer::NeuralLayer{:inhibitory}) = …
activation(layer::NeuralLayer{:excitatory}) = …
# Or if you want to use several physiological parameters instead:
activation{g_K,g_Na,g_l}(layer::NeuralLayer{g_K,g_Na,g_l}) = f(g_K, g_Na, g_l)
The key point is that functions and behavior are external to the data. Use type definitions and abstract type hierarchies to define behavior, as is coded in the external functions… but only store data itself in the types. This is dramatically different from Python or other strongly object-oriented paradigms, and it takes some getting used to.
But I'd rather avoid the 'eval' there as it will be called many times and it's expensive (and func_sym can have several different values depending on context).
This sort of dynamic dispatch is possible in Julia, but not recommended. Changing the value of 'func_sym' based on context defeats type inference as well as method specialization and inlining. Instead, the recommended approach is to use multiple dispatch, as detailed in the Methods section of the manual.
I was wondering about the differences between variables and constants, as I see different declarations of variables/constants in the code written by ex-colleagues.
I know that a variable is something that can be changed throughout the code, while the value of a constant is fixed and can't be changed. So far I've written everything as variables (even when the variable will not be changed). Is my practice incorrect? Perhaps my code is not complicated, therefore I use variables all the time.
Anyhow, if my understanding is proven wrong, please enlighten me with the correct guidelines on this matter.
It is a good code practice to use constants whenever possible.
At compile time it is known that only read operations can be done on those values, so some access optimizations can be applied to the code automatically, which can improve performance.
Another difference is that constants are stored in a separate preallocated section of your binary (compiler dependent, but on most compilers this is what happens), which makes them easier to access, and they don't get allocated/deallocated all the time (another performance optimization).
And finally, constants can be evaluated at compile time.
For example, if you have an expression made up of constants, something like the following:
float a = const1 * const2 / const3 + const4;
Then the whole expression will be evaluated at compile time, saving cycles at runtime (since the value will always be the same).
Some popular constants that benefit from this sort of optimization are PI, PI/2, PI/4, and 1/PI.
const int const_a = 10;
int static_a = 70;
public void sample()
{
static_a = const_a+10; //This is correct
// const_a=88; //It is wrong
}
In the above example, since const_a is declared as const, we can't assign a value to it anywhere else, but we can still use its value.
I have a very large kernel which uses ~1000 temporary variables to compute ~1000 equations. So it is safe to assume that all of the temporary variables will be spilled to private off-chip memory, i.e. what CUDA calls "local" memory (I know it's bad, but there is no other way).
My question is if common subexpressions are eliminated between neighboring lines like these:
const float t747 = t472*t28*t26*t715*t30*t11;
const float t748 = t472*t28*t717*t26*t30*t11;
As you can see the only difference is variable t717 versus t715. The question is if those two lines translate into 7 or 12 global loads?
Because if the target compiler (an Nvidia Kepler GPU in my case) does not use registers to cache common subexpressions between lines, I'm going to need to implement it myself.
Note: All code is generated automatically, so manual tuning won't be possible.
EDIT: All t0-t999 variables are declared as "const float".
The compiler translates all the global reads as direct reads. So, in your case, 12 reads.
This is because global memory is treated as volatile, so caching is not possible. However, if you simply do this (I think you know it, but anyway...):
const float temp = t472*t28*t26*t30*t11;
const float t747 = temp*t715;
const float t748 = temp*t717;
The compiler will translate that into 7 global reads.
NOTE: At least this was valid with old architectures; I don't know if there is some newer compiler/architecture that can cleverly detect these cases and optimize them.
I was assigned to a project written by someone else. They passed parameters as variables (I mean, copied to the stack when a method is called) and I'd like them converted to pointers. It runs significantly faster when only 32-bit or 64-bit pointers are passed to the subroutines. I have almost 600 methods to be converted.
An example method is defined as:
bool insideWindow(tsPoint Point, tsWindow Window)
When I change the type tsWindow into psWindow (defined as tsWindow *), I need to change all dots (.) to arrows (->) in order to imply a pointer operation.
Is there any easy way to change these in Qt Creator? To put it another way, I want to change the type to a pointer type and have Qt Creator automatically change dots into ->.
Thanks
Well, it is easily solved by passing variables as references. All I need to do is modify the function prototype (both in the .h and the .cpp files).
bool insideWindow(tsPoint &Point, tsWindow &Window)
This way the code still uses a dot (meaning I won't have to change the bodies, replacing dots with -> operators), yet the arguments are in fact passed as pointers.
http://www.cprogramming.com/tutorial/references.html