Do preprocessor directives affect OpenCL kernel performance? - opencl

If I use preprocessor directives like #if, #elif, etc., in my kernel, will it affect the performance in any way? I'm assuming that these conditions are resolved at compile time itself.

If you do live compilation then it will probably have some effect on the compile time, but on the actual execution of the kernel it won't have any effect, since the directives are resolved at compile time just as you said. The potential slowdown at compile time should be far smaller than doing all those checks at run time.
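To make this concrete, here is a minimal sketch (the USE_FAST_PATH macro is invented for illustration and is not part of any real kernel). The #ifdef is resolved when clBuildProgram compiles the source, so the finished binary contains only one of the two paths and nothing is branched on at run time:

    // The branch below is chosen at build time, not per work-item.
    __kernel void scale(__global float *data, const float factor)
    {
        size_t gid = get_global_id(0);
    #ifdef USE_FAST_PATH
        data[gid] = native_sin(data[gid]) * factor;   // fast, lower-precision built-in
    #else
        data[gid] = sin(data[gid]) * factor;          // full-precision fallback
    #endif
    }

The macro would be injected from the host through the options argument, e.g. clBuildProgram(program, 1, &device, "-DUSE_FAST_PATH", NULL, NULL), so switching paths costs a rebuild rather than any per-work-item check.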

Related

How to increase CPU usage in parallel processing in R

I am currently using the future package in R for some heavy parallel processing tasks.
When I examined the CPU usage while the script was running, I noticed that each parallel section is using only 2.3% of the CPU power on the machine (see below). Is there a way to increase the usage to a higher number (say 5% or 10%)?
Sorry if I missed anything obvious from the package documentation.
Your script (or any process) will only use what is needed.
While I am not familiar with the exact workings of future, unless you can define a threshold for CPU usage, there is no direct way to increase it arbitrarily.
If your script is still slow then you need to look at other (or additional) ways of speeding it up. You should also check whether the overhead of parallelisation is causing unnecessary work: try with fewer cores/workers, see if that increases the CPU usage, and evaluate it against a benchmark (e.g. time to completion).

julia workflow with JIT compiler

I've recently picked up Julia as a neat way to implement some computationally heavy projects. So far I'm quite impressed by both the speed and the convenience. However, there's one thing I rather dislike: once the codebase becomes fairly large, running scripts takes an increasing amount of time, since the JIT compiler needs to compile all files time and time again (not only the modified ones, as, e.g., in C++ with CMake). This slows down my development workflow. What's the most Julian/best-practice way to speed this up so that I avoid the (sometimes excessive) waiting?
Besides the workflow outlined in the comments above (keep the REPL open and use Revise.jl), this package might be helpful for you:
https://github.com/dmolina/DaemonMode.jl

clBuildProgram failed with error code -5 - opencl

I ran into a problem when using clBuildProgram() on a GTX 750: the kernel failed to build with error code -5 (CL_OUT_OF_RESOURCES) and an empty build log.
One possible workaround is to add '-cl-nv-verbose' as a build option to clBuildProgram(); however, it doesn't work for all kernels.
Based on that, I tried another option, '-cl-opt-disable'. It also works for only some of the kernels.
This leaves me confused:
I cannot find the real cause of the error.
Why do different build options matter for some kernels but not for others?
The error also seems architecture independent: the same OpenCL code that builds successfully on the GTX 750 fails on a Tesla P100.
Does anyone have any ideas?
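For illustration, a minimal host-side sketch of how such a build call and the build-log query might look (program and device are assumed to already exist, this is not the asker's actual code, and the headers <CL/cl.h>, <stdio.h> and <stdlib.h> are required):

    const char *options = "-cl-nv-verbose -cl-opt-disable";
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {                       /* -5 == CL_OUT_OF_RESOURCES */
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size + 1);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        log[log_size] = '\0';
        printf("clBuildProgram returned %d, build log:\n%s\n", err, log);
        free(log);
    }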
Possible reasons I can think of:
Running out of registers. This happens if you have a lot of (private) variables in your kernel code, especially arrays. Each core only has a certain number of registers available (architecture dependent), and it may not be possible for the compiler to "spill" them to global memory. If this is the problem, you can try to rearrange your code so your variables have more limited scope, or you can try to move some arrays to local memory (bearing in mind this is shared between work items in a group, and also limited in size); see the sketch after this list. A good GPU profiler/code analysis tool should be able to tell you how much register pressure there is, so if you've got the kernel working on some hardware, you should be able to find out register pressure for that, and draw conclusions for other hardware too.
Code size itself. I didn't think this should be much of a problem anymore on modern GPUs, but it might be possible if you have truly gigantic kernels.
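As a rough sketch of the local-memory suggestion above (the histogram kernel is hypothetical and not taken from the question): a per-work-item array such as uint bins[256] would sit in private memory and is a classic source of register/spill pressure, whereas one shared copy per work-group in __local memory removes that pressure at the cost of atomics and barriers.

    #define HIST_BINS 256

    __kernel void histogram_local(__global const uchar *in, __global uint *out, uint n)
    {
        __local uint bins[HIST_BINS];                     // one copy per work-group
        for (uint i = get_local_id(0); i < HIST_BINS; i += get_local_size(0))
            bins[i] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint i = get_global_id(0); i < n; i += get_global_size(0))
            atomic_inc(&bins[in[i]]);                     // local atomics (OpenCL 1.1+)
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint i = get_local_id(0); i < HIST_BINS; i += get_local_size(0))
            atomic_add(&out[i], bins[i]);                 // merge into the global result
    }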

Compile Ceylon Js Code with Google Closure Compiler in Advanced Mode

What is the recommended way to optimally minimize the size of ceylon javascript code?
I passed the language module through a minifier once (can't remember which one, some online service) and some tests failed after that. But this was before splitting the model from the code, and IIRC the errors were related to the model being minified; so maybe it can work for you if you only minify the code, but leave the model alone.

Just in Time compilation always faster?

Greetings to all the compiler designers here on Stack Overflow.
I am currently working on a project that focuses on developing a new scripting language for use with high-performance computing. The source code is first compiled into a byte code representation. The byte code is then loaded by the runtime, which performs aggressive (and possibly time-consuming) optimizations on it (which go much further than what even most "ahead-of-time" compilers do; after all, that's the whole point of the project). Keep in mind the result of this process is still byte code.
The byte code is then run on a virtual machine. Currently, this virtual machine is implemented using a straight-forward jump table and a message pump. The virtual machine runs over the byte code with a pointer, loads the instruction under the pointer, looks up an instruction handler in the jump table and jumps into it. The instruction handler carries out the appropriate actions and finally returns control to the message loop. The virtual machine's instruction pointer is incremented and the whole process starts over again. The performance I am able to achieve with this approach is actually quite amazing. Of course, the code of the actual instruction handlers is again fine-tuned by hand.
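For readers who have not seen the technique, a dispatch loop of that kind might look roughly like the following C sketch (the opcodes, handlers and VM struct are invented for illustration and are not the actual instruction set described above):

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT, NUM_OPS };

    typedef struct { const int *code; int ip; int stack[64]; int sp; int running; } VM;

    static void op_push (VM *vm) { vm->stack[vm->sp++] = vm->code[++vm->ip]; }
    static void op_add  (VM *vm) { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
    static void op_print(VM *vm) { printf("%d\n", vm->stack[vm->sp - 1]); }
    static void op_halt (VM *vm) { vm->running = 0; }

    /* The jump table: one handler per opcode. */
    static void (*const handlers[NUM_OPS])(VM *) = { op_push, op_add, op_print, op_halt };

    static void run(VM *vm)                       /* the "message pump" */
    {
        while (vm->running) {
            handlers[vm->code[vm->ip]](vm);       /* look up the handler and jump into it */
            vm->ip++;                             /* advance the instruction pointer */
        }
    }

    int main(void)
    {
        const int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        VM vm = { code, 0, {0}, 0, 1 };
        run(&vm);                                 /* prints 5 */
        return 0;
    }

Every instruction pays for one table lookup and one indirect call; the question below is essentially whether removing that per-instruction overhead with a JIT is worth the compilation cost.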
Now most "professional" runtime environments (like Java, .NET, etc.) use Just-in-Time compilation to translate the byte code into native code before execution. A VM using a JIT usually has much better performance than a byte code interpreter. Now the question is: since all an interpreter basically does is load an instruction and look up a jump target in a jump table (remember, the instruction handler itself is statically compiled into the interpreter, so it is already native code), will the use of Just-in-Time compilation result in a performance gain, or will it actually degrade performance? I cannot really imagine the interpreter's jump table degrading performance so much that it makes up for the time spent compiling the byte code with a JITer. I understand that a JITer can perform additional optimization on the code, but in my case very aggressive optimization is already performed at the byte code level prior to execution. Do you think I could gain more speed by replacing the interpreter with a JIT compiler? If so, why?
I understand that implementing both approaches and benchmarking will provide the most accurate answer to this question, but it might not be worth the time if there is a clear-cut answer.
Thanks.
The answer lies in the ratio of single-byte-code-instruction complexity to jump table overheads. If you're modelling high level operations like large matrix multiplications, then a little overhead will be insignificant. If you're incrementing a single integer, then of course that's being dramatically impacted by the jump table. Overall, the balance will depend upon the nature of the more time-critical tasks the language is used for. If it's meant to be a general purpose language, then it's more useful for everything to have minimal overhead as you don't know what will be used in a tight loop. To quickly quantify the potential improvement, simply benchmark some nested loops doing some simple operations (but ones that can't be optimised away) versus an equivalent C or C++ program.
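For instance, a hypothetical baseline along these lines could be timed in C and then re-expressed in the scripting language; the volatile sink is only there so the compiler cannot optimise the loops away:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        volatile long sink = 0;
        clock_t start = clock();
        for (long i = 0; i < 10000; ++i)
            for (long j = 0; j < 10000; ++j)
                sink += (i ^ j) & 7;              /* cheap work that cannot be folded away */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("sink=%ld, %.3f s\n", (long)sink, secs);
        return 0;
    }

The ratio between this native time and the interpreter's time on equivalent byte code gives a rough upper bound on what a JIT could buy.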
When you use an interpreter, the code cache in your processor caches the interpreter's code, not the byte code (which may be cached in the data cache). Since code caches are, IIRC, two to three times faster than data caches, you may see a performance boost if you JIT compile. Also, the native code you are executing is probably position-independent code (PIC), something which can be avoided for JITted code.
Everything else depends on how optimized the byte code is, IMHO.
JIT can theoretically optimize better, since it has information not available at compile time (especially about typical runtime behavior). So it can, for example, do better branch prediction, unroll loops as needed, etc.
I am sure your jump-table approach is OK, but I still think it would perform rather poorly compared to straight C code, don't you think?
