OpenCL atomic function performance

I understand that atomic functions in OpenCL can slow a program down significantly. That is clearly true when there IS a race on the memory location being written.
I am curious what happens if there IS NOT a race but I still use an atomic function to update the memory. As I understand it, the update will be slower anyway, since the threads are forced to execute one after another. Am I right about this?
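To make the scenario concrete, here is a minimal (hypothetical) kernel sketch of what I mean: every work-item updates its own element, so there is no actual race, but the update is still issued with an atomic function:
__kernel void increment(__global int *data)
{
    int gid = get_global_id(0);
    // No two work-items touch the same element, so a plain store would do:
    //     data[gid] = data[gid] + 1;
    // ...but the update is still performed atomically:
    atomic_add(&data[gid], 1);
}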

Common Lisp - 100% CPU usage forever after finishing non-threaded computation?

UPDATE
To clarify: my question is not whether I wrote the code correctly; after profiling I already understood that I hadn't.
The question is: are you supposed to see SBCL using 100% CPU after running a program, regardless of whether the code was good or bad? And is this something you have seen happen before, i.e. a known bug?
I'd give a reproducible example if I could, but this CPU hogging only happens sometimes (and I've never used multithreading constructs anywhere).
Sorry for not being more clear the first time around :)
-----
Bug?
I'm having occasional issues with Lisp using 100% CPU for long periods of time after running programs.
Update: just now it used 100% CPU for 40 minutes after the program had finished its computation.
Environment: SBCL, Roswell, Emacs + SLIME
My question is whether this is a known bug in Common Lisp that I'm not aware of, possibly related to GC.
Context
This isn't the first time it has happened "randomly", but it tends to happen with more computationally heavy programs that do a lot of memory allocation: they end up using 100% CPU for a long time (40 minutes in this case) after the program has finished.
The routine is single-threaded, thus there's no possibility of some task still running in the background.
I don't believe it's normal for SBCL to spend 40 minutes at 100% CPU after a program has run. I'm afraid this might be related to some bug in the GC.
I then profiled the program in SLIME. The program was super slow (~20 min execution) and did a lot of allocations; then I changed one line and it now takes 2 s to run, just because I had been formatting a debug string to an empty stream (thus generating a new string representation of a list of 100k integers on each call):
(https://github.com/AlbertoEAF/advent_of_code_2019/commit/b37797df772c12c2d409b1c3356cf5b690c8f928)
That is not my point, though. Even though this case is extremely ill-posed, the task I'm doing is very simple, so the particular program is irrelevant; the concern is the instability of the platform under sustained heavy computation and allocation. Are there reports of issues like this with SLIME/SBCL, or is there something else I'm not aware of?
Thank you!
The reason your change improves performance is that debug-stream is NIL.
In the old code you evaluate:
(format nil ...)
When you give nil as the stream to format, it prints to a string, so you are doing all the formatting work and allocating a big string that you then throw away.
In the new code you do:
(when nil ...)
Which costs approximately nothing.
Note that nil does not mean "do nothing" when you pass it to format. In general, if you want to do nothing, you should do nothing rather than call functions that do things.
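A minimal sketch of the difference, using hypothetical names (*debug-stream*, log-state) rather than your actual code:
(defparameter *debug-stream* nil)   ; nil means debugging is off

;; Old pattern: with a nil destination, format still does all the formatting
;; work and returns (then discards) a freshly allocated string on every call.
(defun log-state-old (big-list)
  (format *debug-stream* "state: ~a~%" big-list))

;; New pattern: guard the call, so nothing is formatted when debugging is off.
(defun log-state-new (big-list)
  (when *debug-stream*
    (format *debug-stream* "state: ~a~%" big-list)))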

Limit GPU Memory in Julia using CuArrays

I'm fairly new to Julia and I'm currently trying out some deep convolutional networks with recurrent structures. I'm training the networks on a GPU using CuArrays (CUDA version 9.0).
Having two separate GPUs, I started two instances with different datasets.
Soon after some training, both Julia instances had allocated all available memory (2 x 11 GB), and I couldn't even start another instance of my own using CuArrays (memory allocation error). This became quite a problem, since this is running on a server that is shared among many people.
I assume it is normal behavior to use all available memory in order to train as fast as possible. But under these circumstances I would like to limit the memory that can be allocated, so that two instances can run at the same time without blocking me or other people from using the GPUs.
To my surprise I found only very, very little information about this.
I'm aware of the CUDA_VISIBLE_DEVICES option, but that does not help since I want to train on both devices simultaneously.
Another suggestion was to call GC.gc() and CuArrays.clearpool().
The second call throws an unknown-function error and no longer seems to be part of the CuArrays package. The first one I'm currently testing, but it's not exactly what I need. Is there any possibility of limiting the amount of GPU memory that CuArrays can allocate from Julia?
Thanks in advance
My batch size is 100 and one batch should be less than 1 MB...
There is currently no such functionality. I quickly whipped something up, see https://github.com/JuliaGPU/CuArrays.jl/pull/379; you can use it to define CUARRAYS_MEMORY_LIMIT and set it to the number of bytes that the allocator will not go beyond. Note that this might significantly increase memory pressure, a situation for which the CuArrays.jl memory allocator is currently not optimized (though it is one of my top priorities for the Julia GPU infrastructure).
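A hypothetical usage sketch, assuming the limit from that PR is read from an environment variable when CuArrays initializes (check the PR for the exact semantics before relying on it):
# Set the limit (here ~4 GiB, in bytes) before CuArrays is loaded.
ENV["CUARRAYS_MEMORY_LIMIT"] = string(4 * 1024^3)

using CuArrays

# Allocations should now stay under the limit instead of consuming the whole
# device, leaving memory free for other processes on the shared GPU.
x = CuArrays.cu(rand(Float32, 1024, 1024))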

OpenCL as a general purpose C runtime compiler (Why is kernel on CPU slower than direct C?)

I have been playing with OpenCL as a general-purpose means of executing C code at run time. While I'm interested in eventually getting code to run on the GPU, I am currently looking at the overhead of OpenCL on the CPU compared to running straight compiled C.
Obviously there is overhead in the prepping and compilation of the kernel in OpenCL. But even when I just time the final execution and buffer read:
clEnqueueNDRangeKernel(...);
clFinish();
clEnqueueReadBuffer(...);
The overhead for a simple calculation is significant compared to the straight C code (a factor of 30), even with 1000 loops, which gives it an opportunity to parallelize. Obviously there is overhead in reading back a buffer from the OpenCL code... but it is sitting on the same CPU, so the overhead can't be that large.
Does this sound right?
The call to clEnqueueReadBuffer() is most likely performing a memcpy-like operation which is costly and non-optimal.
If you want to avoid this, you should allocate your data on the host normally (e.g. with malloc() or new) and pass the CL_MEM_USE_HOST_PTR flag to clCreateBuffer() when you create the OpenCL buffer object. Then use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() to pass the data to/from OpenCL without actually performing a copy. This should be close to optimal w.r.t. minimizing OpenCL overhead.
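A minimal sketch of that pattern (hypothetical variable names; it assumes a context, queue, and kernel have already been set up, and omits error checking and includes):
size_t size = n * sizeof(float);
float *host_data = malloc(size);   // normal host allocation, filled elsewhere

cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            size, host_data, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t global = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

// Map instead of reading back: on a CPU device this typically avoids a copy.
float *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                   0, size, 0, NULL, NULL, &err);
/* ... use 'mapped' on the host ... */
clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);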
For a higher-level abstraction implementing this technique, take a look at the mapped_view<T> class in Boost.Compute (simple example here).

Forcing garbage collection to run in R with the gc() command

Periodically I program sloppily. Ok, I program sloppily all the time, but sometimes that catches up with me in the form of out of memory errors. I start exercising a little discipline in deleting objects with the rm() command and things get better. I see mixed messages online about whether I should explicitly call gc() after deleting large data objects. Some say that before R returns a memory error it will run gc() while others say that manually forcing gc is a good idea.
Should I run gc() after deleting large objects in order to ensure maximum memory availability?
"Probably." I do it too, and often even in a loop as in
cleanMem <- function(n=10) { for (i in 1:n) gc() }
Yet that does not, in my experience, restore memory to a pristine state.
So what I usually do is to keep the tasks at hand in script files and execute those using the 'r' frontend (on Unix, and from the 'littler' package). Rscript is an alternative on that other OS.
That workflow happens to agree with
workflow-for-statistical-analysis-and-report-writing
tricks-to-manage-the-available-memory-in-an-r-session
which we covered here before.
From the help page on gc:
A call of 'gc' causes a garbage collection to take place. This will also take place automatically without user intervention, and the primary purpose of calling 'gc' is for the report on memory usage.
However, it can be useful to call 'gc' after a large object has been removed, as this may prompt R to return memory to the operating system.
So it can be useful to do, but mostly you shouldn't have to. My personal opinion is that it is code of last resort - you shouldn't be littering your code with gc() statements as a matter of course, but if your machine keeps falling over, and you've tried everything else, then it might be helpful.
By everything else, I mean things like
Writing functions rather than raw scripts, so variables go out of scope.
Emptying your workspace if you go from one problem to another unrelated one.
Discarding data/variables that you aren't interested in. (I frequently receive spreadsheets with dozens of uninteresting columns.)
Supposedly R uses only RAM. That's just not true on a Mac (and I suspect it's not true on Windows either). If it runs out of RAM, it will start using virtual memory. Sometimes, but not always, processes will 'recognize' that they need to run gc() and free up memory. When they do not, you can see it in the Activity Monitor app: all the RAM is occupied and disk access jumps up. I find that when I am doing large Cox regression runs, I can avoid spilling over into virtual memory (with its slow disk access) by preceding the calls with gc(); cph(...)
A bit late to the party, but:
Explicitly calling gc will free some memory "now". ...so if other processes need the memory, it might be a good idea. For example before calling system or similar. Or perhaps when you're "done" with the script and R will sit idle for a while until the next job arrives - again, so that other processes get more memory.
If you just want your script to run faster, it won't matter since R will call it later if it needs to. It might even be slower since the normal GC cycle might never have needed to call it.
...but if you want to measure time for instance, it is typically a good idea to do a GC before running your test. This is what system.time does by default.
UPDATE: As @DWin points out, R (or C#, or Java, etc.) doesn't always know when memory is low and the GC needs to run, so you may sometimes need to run gc() as a workaround for deficiencies in the memory system.
No. If there is not enough memory available for an operation, R will run gc() automatically.
"Maybe." I don't really have a definitive answer. But the help file suggests that there are really only two reasons to call gc():
You want a report of memory usage.
After removing a large object, "it may prompt R to return memory to the operating system."
Since it can slow down a large simulation with repeated calls, I have tended to only do it after removing something large. In other words, I don't think that it makes sense to systematically call it all the time unless you have good reason to.
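As a small illustrative sketch of the "after removing something large" case (the object and its size here are arbitrary):
gcinfo(TRUE)                        # print a message at each garbage collection
x <- matrix(rnorm(5e7), ncol = 100) # roughly 400 MB of doubles
rm(x)                               # the binding is gone, but memory may linger
gc()                                # explicit collection; may return memory to the OS
gc(verbose = TRUE)                  # verbose = TRUE prints extra details about the collection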
