cudaEventSynchronize behavior under different circumstances - asynchronous

I have a question regarding the CUDA call cudaEventSynchronize.
AFAIK, it actively polls the event, thus consuming CPU cycles. If I would like to make it block, so the CPU can be yielded as it can with kernel executions, how could I do it?
More specifically, what would be the expected behaviour under:
using the CUDA_LAUNCH_BLOCKING=1 environment variable.
using cudaDeviceScheduleBlockingSync
using cudaDeviceScheduleYield
I have been seeing strange behaviour and need some help elucidating this. Nvidia is very reluctant to share information on specific technical aspects like this... I suppose implementation details must be kept secret.
Thanks in advance,
Jose.

If you want cudaEventSynchronize to use blocking synchronization, then you will need to create your event using
cudaError_t cudaEventCreateWithFlags(cudaEvent_t *event, unsigned int flags)
and pass cudaEventBlockingSync as the flag.
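For example, a minimal sketch (error checking omitted):

#include <cuda_runtime.h>

int main(void)
{
    cudaEvent_t ev;
    /* cudaEventBlockingSync makes cudaEventSynchronize() block the calling
       thread (yielding the CPU) instead of spin-polling on the event. */
    cudaEventCreateWithFlags(&ev, cudaEventBlockingSync);

    /* ... launch asynchronous work here ... */

    cudaEventRecord(ev, 0);    /* record on the default stream */
    cudaEventSynchronize(ev);  /* now sleeps rather than burning CPU */

    cudaEventDestroy(ev);
    return 0;
}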

Related

OpenCL: clBuildProgram failed with error code -5

I ran into a problem when using clBuildProgram() on a GTX 750. The kernel failed to build with error code -5 (CL_OUT_OF_RESOURCES) and an empty build log.
One possible workaround is adding '-cl-nv-verbose' as a build option to clBuildProgram(). However, it doesn't work for all kernels.
Based on that, I tried the optimization option '-cl-opt-disable'. It also works only for some kernels.
Then I got confused:
I cannot find the real cause of the error.
Why do different build options help for some kernels but not others?
The error seems to be architecture dependent, since the same OpenCL code builds successfully on the GTX 750 but fails on the Tesla P100.
Does anyone have any ideas?
Possible reasons I can think of:
Running out of registers. This happens if you have a lot of (private) variables in your kernel code, especially arrays. Each core only has a certain number of registers available (architecture dependent), and it may not be possible for the compiler to "spill" them to global memory. If this is the problem, you can try to rearrange your code so your variables have more limited scope, or you can try to move some arrays to local memory (bearing in mind this is shared between work items in a group, and also limited in size; see the sketch after this list). A good GPU profiler/code analysis tool should be able to tell you how much register pressure there is, so if you've got the kernel working on some hardware, you should be able to find out the register pressure there and draw conclusions for other hardware too.
Code size itself. I didn't think this should be much of a problem anymore on modern GPUs, but it might be possible if you have truly gigantic kernels.
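To make the register-pressure point concrete, here is a sketch (kernel names and sizes are invented for illustration) of moving a private array into local memory:

/* Before: a 64-element private array per work item, which drives up
   register pressure and may force spills. */
__kernel void filter_private(__global const float *in, __global float *out)
{
    float window[64];
    size_t gid = get_global_id(0);
    for (int i = 0; i < 64; ++i)
        window[i] = in[gid] * i;
    out[gid] = window[63];
}

/* After: one scratch buffer per work group in local memory; the host sizes
   it with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL). */
__kernel void filter_local(__global const float *in, __global float *out,
                           __local float *scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    scratch[lid] = in[gid];            /* each work item fills one slot  */
    barrier(CLK_LOCAL_MEM_FENCE);      /* local memory is shared, so sync */
    out[gid] = scratch[lid];
}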

Is a preemptive multitasking OS possible on the interruptless DCPU-16?

I am looking into various OS designs in the hopes of writing a simple multitasking OS for the DCPU-16. However, everything I read about implementation of preemptive multitasking is centered around interrupts. It sounds like in the era of 16-bit hardware and software, cooperative multitasking was more common, but that requires every program to be written with multitasking in mind.
Is there any way to implement preemptive multitasking on an interruptless architecture? All I can think of is an interpreter which would dynamically switch tasks, but that would have a huge performance hit (possibly on the order of 10-20x+ if it had to parse every operation and didn't let anything run natively, I'm imagining).
Preemptive multitasking is normally implemented by having interrupt routines post status changes/interesting events to a scheduler, which decides which tasks to suspend, and which new tasks to start/continue based on priority. However, other interesting events can occur when a running task makes a call to an OS routine, which may have the same effect.
But all that matters is that some event is noted somewhere and the scheduler decides who to run, so you can make all such event signalling/scheduling occur only on OS calls.
You can add gratuitous calls to the scheduler at "convenient" points in various task application code to make your system switch more often. Whether it just switches, or uses some background information such as elapsed time since the last call, is a scheduler detail.
Your system won't be as responsive as one driven by interrupts, but you've already given that up by choosing the CPU you did.
Actually, yes. The most effective method is to simply patch run-times in the loader. Kernel/daemon stuff can have custom patches for better responsiveness. Even better, if you have access to all the source, you can patch in the compiler.
The patch can consist of a distributed scheduler of sorts. Each program can be patched to have a very low-latency timer; on load, it will set the timer, and on each return from the scheduler, it will reset it. A simplistic method would allow code to simply do a check like
if (timer() - start_time > quantum) yield_to_scheduler();
which doesn't incur too big a performance hit. The main trouble is finding good points to pop these checks in. Between function calls is a good start, and detecting loops and inserting them there is primitive but effective if you really need to preempt responsively.
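As a rough illustration (every name here - dcpu_timer(), scheduler_yield(), QUANTUM_TICKS - is hypothetical, and this is C rather than DCPU-16 assembly), the injected check could look like:

#include <stdint.h>

#define QUANTUM_TICKS 1000          /* how long a task may run unpreempted */

static uint32_t start_ticks;        /* reset by the scheduler on each dispatch */

extern uint32_t dcpu_timer(void);   /* cheap read of a free-running timer */
extern void scheduler_yield(void);  /* save context, pick the next task */

static inline void preempt_check(void)
{
    /* On the fast path this is just a comparison and a branch. */
    if (dcpu_timer() - start_ticks > QUANTUM_TICKS)
        scheduler_yield();
}

The loader would splice a call to preempt_check() between function calls and on loop back-edges.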
It's not perfect, but it'll work.
The main issue is making sure that reading the timer is low latency; that way the check is just a comparison and a branch. You also need to handle exceptions - errors in the code that cause, say, infinite loops - in some way. You can technically use a fairly simple hardware watchdog timer and assert a reset on the CPU without clearing any of the RAM; the RESET vector would point to an in-RAM routine, which would inspect and unwind the stack back to the program call (thus crashing the program but preserving everything else). It's a sort of brute-force, if-all-else-fails way to crash the program. Or you could potentially use it to multitask this way, treating RESET as an interrupt, but that is much more difficult.
So... yes, it's possible but complicated, using techniques from JIT compilers and dynamic translators (emulators use them).
This is a bit of a muddled explanation, I know, but I am very tired. If it's not clear enough I can come back and clear it up tomorrow.
By the way, asserting reset on a CPU mid-program sounds crazy, but it is a time-honored and proven technique. Early versions of Windows even did it to run compatibility mode properly on, I believe, the 286, because there was no way to switch back from protected mode to real (16-bit) mode. Other processors and OSes have done it too.
EDIT: So I did some research on what the DCPU is, haha. It's not a real CPU. I have no idea if you can assert reset in Notch's emulator, I would ask him. Handy technique, that is.
I think your assessment is correct. Preemptive multitasking occurs if the scheduler can interrupt (in the non-inflected, dictionary sense) a running task and switch to another autonomously. So there has to be some sort of actor that prompts the scheduler to action. If there are no interrupting devices (in the inflected, technical sense) then there's little you can do in general.
However, rather than switching to a full interpreter, one idea that occurs is just dynamically reprogramming the supplied program code. So before entry into a process, the scheduler knows the full process state, including what program counter value it's going to enter at. It can then scan forward from there, substituting, say, either the twentieth instruction or the next jump instruction that isn't immediately at the program counter with a jump back into the scheduler. When the process returns, the scheduler puts the original instruction back in. If it's a jump (conditional or otherwise), then it also effects the jump appropriately.
Of course, this scheme works only if the program code doesn't dynamically modify itself. And in that case you can preprocess it so that you know in advance where the jumps are without a linear search. You could technically allow well-written self-modifying code if it were willing to nominate all addresses that may be modified, so that your scheduler's dynamic modifications could definitely avoid those addresses.
You'd end up sort of running an interpreter, but only for jumps.
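A hedged sketch of that substitution in C (the opcode value and helpers are invented; real DCPU-16 details would differ):

#include <stdint.h>

#define TRAP_OPCODE 0xFFFFu            /* made-up "jump into scheduler" word */

typedef struct {
    uint16_t *patch_addr;              /* where the trap was planted */
    uint16_t  saved_word;              /* the original instruction word */
} trap_state;

/* Hypothetical: finds, say, the twentieth instruction or the next jump. */
extern uint16_t *find_patch_point(uint16_t *pc);

static void arm_trap(trap_state *t, uint16_t *pc)
{
    t->patch_addr  = find_patch_point(pc);
    t->saved_word  = *t->patch_addr;   /* remember what gets overwritten */
    *t->patch_addr = TRAP_OPCODE;      /* process now falls back into the scheduler */
}

static void disarm_trap(const trap_state *t)
{
    *t->patch_addr = t->saved_word;    /* restore the original instruction */
    /* If the saved word was a (conditional) jump, the scheduler must also
       effect that jump before resuming, as described above. */
}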
Another way is to keep to small tasks based on an event queue (like current GUI apps).
This is also cooperative, but has the advantage of not needing OS calls: you just return from the task, and the loop goes on to the next task.
If you then need to continue a task, you pass the next "function" and a pointer to the data it needs to the task queue.
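A minimal sketch of that task queue in C (all names invented):

#include <stddef.h>

typedef void (*task_fn)(void *data);
typedef struct { task_fn fn; void *data; } task;

#define MAX_TASKS 64
static task   queue[MAX_TASKS];
static size_t head, tail;

void post_task(task_fn fn, void *data)   /* enqueue a task or continuation */
{
    queue[tail++ % MAX_TASKS] = (task){ fn, data };
}

void run_loop(void)                      /* the whole "OS" */
{
    while (head != tail) {
        task t = queue[head++ % MAX_TASKS];
        t.fn(t.data);   /* runs to completion; to continue later, the task
                           posts a follow-up function and its data */
    }
}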

Fault tolerance in MPICH/OpenMPI

I have two questions-
Q1. Is there a more efficient way to handle error situations in MPI, other than checkpoint/rollback? I see that if a node "dies", the program halts abruptly. Is there any way to go ahead with the execution after a node dies? (No issues if it comes at the cost of accuracy.)
Q2. I read in "http://stackoverflow.com/questions/144309/what-is-the-best-mpi-implementation" that OpenMPI has better fault tolerance, and that recently MPICH-2 has also come up with similar features. Does anybody know what they are and how to use them? Is it a "mode"? Can they help in the situation stated in Q1?
Kindly reply. Thank you.
MPI - all implementations - have had the ability to continue after an error for a while. The default is to die - that is, the default error handler is MPI_ERRORS_ARE_FATAL - but that can be changed (e.g., see the discussion here). But the standard doesn't currently go much beyond that; that is, it's hard to recover and continue after such an error. If your program is sufficiently simple - some sort of master-worker setup - it may be possible to continue this way.
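For example (a minimal sketch using the standard MPI-2 call; the recovery logic itself is up to the application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so failed calls
       return an error code instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI error: %s\n", msg);
        /* application-specific recovery would go here */
    }

    MPI_Finalize();
    return 0;
}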
The MPI forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard extensions. FT-MPI was a project that developed a very robust MPI, but unfortunately it's based on MPI 1.2, a very early version of the standard. The claim is that they're now working with OpenMPI, but I don't know what's become of that. There's MPICH-V, based on MPI-2, but that's more checkpoint-restart based than what I think you're looking for.
Updated to add: The fault tolerance didn't make it into MPI-3, but the working group continues its work and the expectation is that something will result from that before too long.

Rewriting network packets on the fly using libnetfilter_queue

I am attempting to write a userspace application that can hook into an OS's network stack, sniff packets flying past, and edit the ones it's interested in.
After much Googling, it appears to me that the simplest (yet reasonably robust) method of doing so (on any platform) is Linux's libnetfilter_queue project. However, I'm having trouble finding any reasonable documentation for the project outside of the limited official documentation. Its main features (as stated by the first link) are:
receiving queued packets from the kernel nfnetlink_queue subsystem
issuing verdicts and/or reinjecting altered packets to the kernel nfnetlink_queue subsystem
Emphasis is my own. How exactly am I meant to go about this? I've tried modifying the sample code provided, but perhaps I am misunderstanding something. The code is operating in NFQNL_COPY_PACKET mode, so I am receiving the whole packet -- but my modifications to it seem to be restricted to my own application -- as one would expect, given the "copy" semantics.
My feeling is that I am meant to make use of NF_QUEUE somehow, but I haven't quite grokked it. Any pointers?
(If there is a simpler mechanism for doing this, which is also cross-platform, I'd love to hear about it!)
I can't believe I missed this previously. As reluctant as I am to post questions on SO, I thought I would never work this one out myself. :)
I didn't look at the function prototype properly. It turns out in the "verdict" function (outlined below),
int nfq_set_verdict(struct nfq_q_handle *qh,
u_int32_t id,
u_int32_t verdict,
u_int32_t data_len,
const unsigned char *buf
)
The last two parameters are for the data to be returned to the network stack. Obvious in hindsight, but I missed it completely as the print_pkt function doesn't take the packet data as a parameter, but extracts it from the struct nfq_data.
The key is to NF_ACCEPT the packet and pass the suitably modified packet back to the kernel.
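A sketch of a queue callback doing that (the actual mangling and the IP/TCP checksum fix-up are omitted; on older library versions nfq_get_payload takes a char ** instead):

#include <stdint.h>
#include <netinet/in.h>
#include <linux/netfilter.h>   /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Callback registered with nfq_create_queue(). The last two arguments of
   nfq_set_verdict() hand the (possibly modified) packet back to the stack. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);

    if (len > 0) {
        /* ... mangle payload in place here, then fix up the checksums,
           or the stack will drop the packet ... */
        return nfq_set_verdict(qh, id, NF_ACCEPT, len, payload);
    }
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}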
Just a wild guess from digging around the source code: try explicitly adding the mangled payload using nfnl_addattr_l(…, NFQA_PAYLOAD, …)?

Real-time control of Windows Console game

Another quick question: I want to make a simple console-based game, nothing too fancy, just a weekend project to get more familiar with C. Basically I want to make Tetris, but I've run into one problem:
How do I let the game engine run and at the same time wait for input? Obviously cin or scanf is useless for me.
You're looking for a library such as ncurses.
Many Rogue-like games are written using ncurses or similar.
There are two ways to do it:
The first is to run two threads; one waits for input and updates state accordingly while the other runs the game.
The other (more common in game development) way is to write the game as one big loop that executes many times a second, updating game state, redrawing the screen, and checking for input.
But instead of blocking when you get key input, you check for the presence of pending keypresses, and if nothing has happened, you just continue through your loop. If you have multiple input sources (keyboard, network, etc.), they all go in the loop, checked one after another.
Yes, it's called polling. No, it's not efficient. But high-end games are usually all about pulling the maximum performance and framerates out of the computer, not running cool.
For added efficiency, you can optionally block with a timeout -- saying "wait for a keypress, but no longer than 300 milliseconds" so you can continue on with your loop.
select() comes to mind, but there are other ways of waiting or checking for input as well.
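On POSIX systems, for instance, the "wait for a keypress, but no longer than 300 milliseconds" idea might look like this (assuming stdin has already been put into raw/non-canonical mode):

#include <sys/select.h>
#include <unistd.h>

/* Returns nonzero if stdin has data within timeout_ms, 0 on timeout. */
static int key_within(int timeout_ms)
{
    fd_set fds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };

    FD_ZERO(&fds);
    FD_SET(STDIN_FILENO, &fds);
    return select(STDIN_FILENO + 1, &fds, NULL, NULL, &tv) > 0;
}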
You could work out how to change stdin to non-blocking, which would enable you to write something like Tetris, but the game might be more directly expressed in an event-driven paradigm. Maybe it's a good excuse to learn Windows programming.
Anyway, if you want to go the console route: if you are using the Microsoft compiler, then you should have kbhit() available (via conio.h), which can tell you whether a call to fgetc on stdin would block.
I should mention that the MinGW gcc compiler (3.4.5) also supports kbhit().
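Putting that together, a minimal sketch of a non-blocking input loop with conio.h (Windows/MinGW; 'q' to quit is just for illustration):

#include <conio.h>
#include <windows.h>   /* Sleep() */

int main(void)
{
    int running = 1;
    while (running) {
        if (kbhit()) {           /* is a keypress pending? (never blocks) */
            int c = getch();     /* fetch it without echoing */
            if (c == 'q')
                running = 0;
            /* ... handle left/right/rotate/drop here ... */
        }
        /* ... advance the game state and redraw the board ... */
        Sleep(16);               /* roughly 60 updates per second */
    }
    return 0;
}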
