Very odd OpenCL CL_OUT_OF_RESOURCES behavior

I am writing a rather large OpenCL program with lots of function calls. I've been having problems with CL_OUT_OF_RESOURCES errors, but I managed to fix the problem with a simple printf statement. This is the code fragment in question:
...
const float color = raytrace(depthMap, triangles, ...tonMoreParameters...);
if (i == 1234) {
    printf("hello\n");
}
outImage[i] = color;
...
This works fine, but if I remove the printf function, the program crashes. If I keep it in, it doesn't.
When it crashes, it gives a CL_OUT_OF_RESOURCES error. Can anyone explain why adding printf makes the program not run out of resources? How can I make this work without this useless printf?
Relevant specs:
OpenCL 1.2
NVIDIA GTX 660
Using Java JOCL as host code
EDIT:
I've noticed that putting printf statements in other places changes the way the code operates. Some printf statements cause the program to output different numeric results, while others cause it to crash.
Even changing code that is never executed hugely changes the calculations.
It's as if changing any code randomizes the way it executes.
Is this a sign of a faulty graphics card?
Or perhaps a bug in the OpenCL compiler?
EDIT 2
As it turns out, recursion is not the problem. I removed all recursive calls, but printfs and other harmless changes still change the way the code runs depending on where they are put.
This definitely looks like a problem rooted at compile time.

large OpenCL program with lots of function calls and recursion
OpenCL C 2.2 pdf, page 46:
Recursion is not supported.
I have no idea why printf changes things, but your program relies on a feature that's explicitly not supported.

I found the solution to my own problem.
The problem was caused by a simple array out-of-bounds error. Apparently, OpenCL does not catch these kinds of errors, so any attempt to read or write out of bounds can silently corrupt memory, as it did in my case. The memory that got corrupted was the program's own instructions, hence the seemingly random execution results.
The problem was also partially caused by the illegal use of recursion, as mogu mentioned. Again, the OpenCL compiler lets this silently corrupt program memory.
So be careful, OpenCL developers.
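For anyone who hits the same symptom: the usual defensive pattern is to guard every global-buffer access with an explicit bounds check, because the global work size is often rounded up past the real element count. A minimal sketch, assuming a kernel indexed by get_global_id and a numPixels parameter (the names are illustrative, not from the raytracer above):

__kernel void shade(__global float *outImage, const int numPixels)
{
    int i = get_global_id(0);
    if (i >= numPixels) {
        return;              /* this work item is past the end of the buffer: do nothing */
    }
    outImage[i] = 0.0f;      /* safe: i is a valid index here */
}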

Common Lisp - 100% CPU usage forever after finishing non-threaded computation?

UPDATE
To clarify: my question is not whether I wrote the code correctly; after profiling, I already understood that I hadn't.
The question is: is SBCL expected to sit at 100% CPU after a program finishes, regardless of whether the program was written well or badly? And is this something you have seen happen before, i.e. a known bug?
I'd give a reproducible example if I could, but this CPU hogging only happens sometimes (and I've never used multithreading constructs anywhere).
Sorry for not being more clear the first time around :)
-----
Bug?
I'm having occasional issues with Lisp using 100% CPU for long periods of time after running programs.
Update: just now it used 100% CPU for 40 minutes after the program had finished its computation.
Environment: SBCL, Roswell, Emacs + SLIME
My question is whether this is a known bug in Common Lisp that I'm not aware of, possibly related to the GC?
Context
This is not the first time it has happened "randomly", but computationally heavy programs that do a lot of memory allocation sometimes end up using 100% CPU for a long time (40 minutes in this case) after the program has finished.
The routine is single-threaded, thus there's no possibility of some task still running in the background.
I don't believe it's normal for SBCL to spend 40 minutes at 100% CPU after a program has finished. I'm afraid this might be related to some bug in the GC?
I then profiled the program in SLIME. The program was very slow (~20 min execution) and did a lot of allocation. I then changed one line, and it now takes 2 s to run, simply because I had been formatting a debug string to an empty stream (thus generating a new string representation of a list of 100k integers on every call):
(https://github.com/AlbertoEAF/advent_of_code_2019/commit/b37797df772c12c2d409b1c3356cf5b690c8f928)
That is not my point, though. Even though this case is extremely ill-posed, the task I'm doing is very simple, so the particular program is irrelevant; the concern is the instability of the platform in scenarios of sustained heavy computation and allocation. Are there reports of any issues like this with SLIME/SBCL, or is there something else I'm not aware of?
Thank you!
The reason your change improves performance is that debug-stream is NIL.
In the old code you evaluate:
(format nil ...)
When you give nil as the stream to format, it prints to a string so you are doing the formatting work and allocating a big string you throw away.
In the new code you do:
(when nil ...)
Which costs approximately 0.
Note that nil does not mean do nothing when you pass it to format. In general if you want to do nothing you should do nothing instead of calling functions that do things.

OpenCL: clBuildProgram failed with error code -5

I ran into a problem when using clBuildProgram() on a GTX 750. The kernel failed to build with error code -5 (CL_OUT_OF_RESOURCES) and an empty build log.
One possible workaround is adding '-cl-nv-verbose' as a build option to clBuildProgram(). However, it doesn't work for all kernels.
Based on that, I tried the optimization option '-cl-opt-disable'. It also works only for some kernels.
Then I got confused.
I cannot find the real cause of the error.
Why do different build options help for some kernels but not for others?
The error also seems to be architecture dependent, since the same OpenCL code runs successfully on the GTX 750 but fails on the Tesla P100.
Does anyone have ideas?
Possible reasons I can think of:
Running out of registers. This happens if you have a lot of (private) variables in your kernel code, especially arrays. Each core only has a certain number of registers available (architecture dependent), and it may not be possible for the compiler to "spill" them to global memory. If this is the problem, you can try to rearrange your code so your variables have more limited scope, or you can try to move some arrays to local memory (bearing in mind this is shared between work items in a group, and also limited in size); a sketch of that is shown after these points. A good GPU profiler/code analysis tool should be able to tell you how much register pressure there is, so if you've got the kernel working on some hardware, you should be able to find out the register pressure for that, and draw conclusions for other hardware too.
Code size itself. I don't think this should be much of a problem on modern GPUs, but it might be possible if you have truly gigantic kernels.
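As a rough illustration of the local-memory suggestion in the first point above: the kernel, the 64-float slice per work item, and the host-side clSetKernelArg call mentioned in the comment are assumptions, not taken from the question.

__kernel void block_average(__global const float *in, __global float *out,
                            const int n, __local float *scratch)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* Each work item uses its own 64-float slice of the work group's local
       buffer instead of declaring "float scratch[64];" in private memory,
       which would add to register/spill pressure. The host sizes the buffer
       with something like:
       clSetKernelArg(kernel, 3, work_group_size * 64 * sizeof(float), NULL); */
    __local float *mine = scratch + lid * 64;

    float acc = 0.0f;
    for (int k = 0; k < 64; ++k) {
        int idx = gid * 64 + k;
        mine[k] = (idx < n) ? in[idx] : 0.0f;   /* stage this work item's chunk */
        acc += mine[k];
    }
    if (gid * 64 < n) {
        out[gid] = acc / 64.0f;                 /* average of the 64 staged values */
    }
}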

MVAPICH2 buffer aliasing

I launched an MPI program with MVAPICH2 and got this error:
Fatal error in PMPI_Gather:
Invalid buffer pointer, error stack:
PMPI_Gather(923): MPI_Gather() failed
PMPI_Gather(857): Buffers must not be aliased
There are two ways I think I could solve this:
Rewrite my MPI program (use different buffers)
Disable checking buffer aliasing
Does someone know how I could do the latter with MVAPICH2? Some compiler option, parameter, environment variable, etc.?
Something like MV2_NO_BUFFER_ALIAS_CHECK, but it does not work.
What you're doing is an incorrect program, and you should rewrite your code to use separate buffers.
Alternatively, you might be able to use MPI_IN_PLACE if you want to use the same buffer as both the input and output values of your MPI_GATHER. Without seeing your code, I can't tell you how you could do that. You can check out some documentation about MPI_GATHER and read more about how MPI_IN_PLACE works and see if that solves your problem.
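For reference, a hedged sketch of the MPI_IN_PLACE variant of MPI_Gather (the per-rank data and counts are made up; only the call shape matters): at the root, MPI_IN_PLACE replaces the send buffer, and the root's own contribution is assumed to already sit in its slot of the receive buffer.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = rank * 10;   /* one int contributed per rank (made-up data) */
    int recvbuf[64];         /* assumes size <= 64, for brevity */

    if (rank == 0) {
        recvbuf[0] = value;  /* root's contribution already in place at its slot */
        MPI_Gather(MPI_IN_PLACE, 1, MPI_INT,
                   recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        MPI_Gather(&value, 1, MPI_INT,
                   NULL, 0, MPI_INT, 0, MPI_COMM_WORLD);  /* recv args ignored off-root */
    }

    if (rank == 0)
        for (int i = 0; i < size; ++i)
            printf("rank %d contributed %d\n", i, recvbuf[i]);

    MPI_Finalize();
    return 0;
}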

Is Just-in-Time compilation always faster?

Greetings to all the compiler designers here on Stack Overflow.
I am currently working on a project focused on developing a new scripting language for high-performance computing. The source code is first compiled into a byte code representation. The byte code is then loaded by the runtime, which performs aggressive (and possibly time-consuming) optimizations on it (which go much further than what even most "ahead-of-time" compilers do; after all, that's the whole point of the project). Keep in mind that the result of this process is still byte code.
The byte code is then run on a virtual machine. Currently, this virtual machine is implemented using a straightforward jump table and a message pump. The virtual machine runs over the byte code with a pointer, loads the instruction under the pointer, looks up an instruction handler in the jump table and jumps into it. The instruction handler carries out the appropriate actions and finally returns control to the message loop. The virtual machine's instruction pointer is incremented and the whole process starts over again. The performance I am able to achieve with this approach is actually quite amazing. Of course, the code of the actual instruction handlers is again fine-tuned by hand.
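For concreteness, a jump-table dispatcher of the kind described above looks roughly like this; the opcodes, handlers, and accumulator are invented for illustration and are not the actual VM:

#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t acc; const uint8_t *ip; int running; } VM;

typedef void (*Handler)(VM *);   /* one handler per opcode */

static void op_halt(VM *vm) { vm->running = 0; }
static void op_inc (VM *vm) { vm->acc += 1; vm->ip += 1; }
static void op_dbl (VM *vm) { vm->acc *= 2; vm->ip += 1; }

/* The jump table: opcode value -> handler. */
static const Handler table[] = { op_halt, op_inc, op_dbl };

static int64_t run(const uint8_t *code)
{
    VM vm = { 0, code, 1 };
    while (vm.running) {
        table[*vm.ip](&vm);      /* load opcode, look up handler, dispatch */
    }
    return vm.acc;
}

int main(void)
{
    const uint8_t program[] = { 1, 1, 2, 0 };    /* inc, inc, dbl, halt */
    printf("%lld\n", (long long)run(program));   /* prints 4 */
    return 0;
}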
Now most "professional" runtime environments (like Java, .NET, etc.) use Just-in-Time compilation to translate the byte code into native code before execution. A VM using a JIT usually has much better performance than a byte code interpreter. Now the question is: since all an interpreter basically does is load an instruction and look up a jump target in a jump table (remember, the instruction handler itself is statically compiled into the interpreter, so it is already native code), will the use of Just-in-Time compilation result in a performance gain, or will it actually degrade performance? I cannot really imagine the interpreter's jump table degrading performance so much that the time spent JIT-compiling the code would pay for itself. I understand that a JIT can perform additional optimization on the code, but in my case very aggressive optimization is already performed at the byte code level prior to execution. Do you think I could gain more speed by replacing the interpreter with a JIT compiler? If so, why?
I understand that implementing both approaches and benchmarking will provide the most accurate answer to this question, but it might not be worth the time if there is a clear-cut answer.
Thanks.
The answer lies in the ratio of single-byte-code-instruction complexity to jump table overheads. If you're modelling high level operations like large matrix multiplications, then a little overhead will be insignificant. If you're incrementing a single integer, then of course that's being dramatically impacted by the jump table. Overall, the balance will depend upon the nature of the more time-critical tasks the language is used for. If it's meant to be a general purpose language, then it's more useful for everything to have minimal overhead as you don't know what will be used in a tight loop. To quickly quantify the potential improvement, simply benchmark some nested loops doing some simple operations (but ones that can't be optimised away) versus an equivalent C or C++ program.
When you use an interpreter, the code cache in your processor caches the interpreter code, not the byte code (which may be cached in the data cache). Since code caches are 2 to 3 times faster than data caches (IIRC), you may see a performance boost if you JIT compile. Also, the native, real code you are executing is probably PIC, something which can be avoided for JITted code.
Everything else depends on how optimized the byte code is, IMHO.
A JIT can theoretically optimize better, since it has information not available at compile time (especially about typical runtime behavior). So it can, for example, do better branch prediction, unroll loops as needed, etc.
I am sure your jump-table approach is OK, but I still think it would perform rather poorly compared to straight C code, don't you think?

to throw, to return or to errno?

I am creating a system. What I want to know is: if a msg is unsupported, what should it do? Should I throw an exception saying the msg is unsupported? Should I return 0 or -1? Or should I set an errno (base->errno_)? For some messages I wouldn't care if there was an error (such as setBorderColour); for others I would (addText, or perhaps save if I create a save cmd).
I want to know which method is best for 1) coding quickly, 2) debugging, 3) extending and maintaining. I may put debugging third; it's hard to debug at the moment, but that's because there is a lot of missing code which I didn't fill in. Actual bugs aren't hard to correct. What's the best way to let the user know there is an error?
The system works something like this, but not exactly the same. This is C style, and my code has a bunch of inline functions such as setText(const char *text) that wrap calls like msg(this, esettext, text):
Base base2, base;
base = get_root();
base2 = msg(base, create, BASE_TYPE);
msg(base2, setText, "my text");
const char *p = (const char *)msg(base2, getText);
Generally, if it's C++, prefer exceptions unless performance is critical or unless you may be running in an environment (e.g. an embedded platform) that does not support exceptions. Exceptions are by far the best choice for debugging because an exception that occurs and goes unhandled is very noticeable. Further, exceptions are self-documenting: they have their own type name and usually a contained message that explains the error. Return codes and errno require separate code definitions and some kind of out-of-band way of communicating what the codes mean in any given context (e.g. man pages, comments).
For coding quickly, return codes are probably easier since they don't involve potentially defining your own exception types, and often the error checking code is not as verbose as with exceptions. But of course the big risk is that it is much easier to silently ignore error return codes, leading to problems that may not be noticed until well after they occur, making debugging and maintenance a nightmare.
Try to avoid ever using errno, since it's very error-prone itself. It's a global, so you never know who is resetting it, and it is most definitely not thread safe.
Edit: I just realized you meant an errno member variable and not the C-style errno. That's better in that it's not global, but you still need additional constructs to make it thread safe (if your app is multi-threaded), and it retains all the problems of a return code.
Returning an error code requires discipline because the error code must be explicitly checked and then passed up. We wrote a large C-based system that used this approach and it took a while to remove all the "lost" errors. We eventually developed some techniques to catch this problem (such as storing the error code in a thread-global location and checking at the top level to see that the returned error code matched the saved error code).
Exception handling is easier to code quickly because if you're writing code and you're not sure how to handle the error, you can just let it propagate upwards (assuming you're not using Java, where you have to deal with checked exceptions). It's better for debugging because you can get a stack trace of where the exception occurred (and because you can build in top-level exception handlers to catch problems that should have been caught elsewhere). It's better for maintenance because, if you've done things right, you'll be notified of problems faster.
However, there are some design issues with exception handling, and if you get it wrong, you'll be worse off. In short, if you're writing code where you don't know how to handle an exception, you should let it propagate up. Too many coders trap errors just to convert the exception and then rethrow it, resulting in spaghetti exception code that sometimes loses information about the original cause of the problem. This assumes that you have exception handlers at the top level (entry points).
Personally, when it comes to output graphics, I feel a silent failure is fine. It just makes your picture wrong.
Graphical errors are super easy to spot anyway.
Personally, I would add an errno to your Base struct if it is pure 'C'. If this is C++ I'd throw an exception.
It depends on how 'fatal' these errors are. Does the user really need to see the error, or is it for other developers' edification?
For maintainability you need to clearly document the errors that can occur and include clear examples of error handling.
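If you do go the errno-member route in plain C, a minimal sketch might look like this; Base, the setText-style call, and the error codes are illustrative stand-ins for the message API in the question, not taken from it:

#include <stdio.h>

enum { ERR_NONE = 0, ERR_UNSUPPORTED_MSG = 1 };

typedef struct Base {
    int errno_;              /* last error, reset at the start of each call */
    char text[256];
} Base;

static const char *base_set_text(Base *b, const char *text)
{
    b->errno_ = ERR_NONE;
    if (text == NULL) {      /* treat a missing argument as an unsupported msg */
        b->errno_ = ERR_UNSUPPORTED_MSG;
        return NULL;
    }
    snprintf(b->text, sizeof b->text, "%s", text);
    return b->text;
}

int main(void)
{
    Base base = { ERR_NONE, "" };
    base_set_text(&base, "my text");
    if (base.errno_ != ERR_NONE)
        fprintf(stderr, "setText failed with code %d\n", base.errno_);
    return 0;
}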
