Limit Julia memory consumption - julia

Is there any particular command line flag to use for restricting memory consumption by Julia scripts? (Specifically for Genie apps)
Or any similar way to limit such consumption?

Related

Is there a way to limit the number of R processes running

I use the doMC that uses the package multicore. It happened (several times) that when I was debugging (in the console) it went sideways and fork-bombed.
Does R have the setrlimit() syscall?
In pyhton for this i would use resource.RLIMIT_NPROC
Ideally I'd like to restrict the number of R processes running to a number
EDIT: OS is linux CentOS 6
There should be several choices. Here is the relevant section from Writing R Extensions, Section 1.2.1.1
Packages are not standard-alone programs, and an R process could
contain more than one OpenMP-enabled package as well as other components
(for example, an optimized BLAS) making use of OpenMP. So careful
consideration needs to be given to resource usage. OpenMP works with
parallel regions, and for most implementations the default is to use as
many threads as 'CPUs' for such regions. Parallel regions can be
nested, although it is common to use only a single thread below the
first level. The correctness of the detected number of 'CPUs' and the
assumption that the R process is entitled to use them all are both
dubious assumptions. The best way to limit resources is to limit the
overall number of threads available to OpenMP in the R process: this can
be done via environment variable 'OMP_THREAD_LIMIT', where
implemented.(4) Alternatively, the number of threads per region can be
limited by the environment variable 'OMP_NUM_THREADS' or API call
'omp_set_num_threads', or, better, for the regions in your code as part
of their specification. E.g. R uses
#pragma omp parallel for num_threads(nthreads) ...
That way you only control your own code and not that of other OpenMP
users.
One of my favourite tools is a package controlling this: RhpcBLASctl. Here is its Description:
Control the number of threads on 'BLAS' (Aka 'GotoBLAS', 'ACML' and
'MKL'). and possible to control the number of threads in 'OpenMP'. get
a number of logical cores and physical cores if feasible.
After all you need to control the number of parallel session as well as the number of BLAS cores allocated to each of the parallel threads. There is a reason the parallel package has a default of 2 threads per session...
All of this should be largely independent of the flavour of Linux or Unix you are running. Well, apart from the fact that OS X of course (still !!) does not give you OpenMP.
And the very outer level you can control from doMC and friends.
You can use registerDoMC (see the doc here)
registerDoMC(cores=<some number>)
Another option is to use the ulimit command before running the R script:
ulimit -u <some number>
to limit the number of processes R will be able to spawn.
If you want to limit the total number of CPUs several R processes use at the same time, you will need to use cgroups or cpusets and attach the R processes to the cgroup or cpuset. They will then be confined to the physical CPUS defined in the cgroup or cpuset. cgroups allow more control (for instance also memory) but are more complex to setup.

Limit CPU & Memory for *nix Process

Is it possible to limit CPU & Memory for the *nix Process?
The CPU limit may look like "use no more than 10% of one core".
The memory limit may look like "use no more than 100Mb", the OS may limit it or kill the process if it try to exceed the limit, both ways are fine.
Any *nix that could do that would be fine.
It seems it is possible to implement it with virtual machines, but it is not acceptable because the overhead is too huge.
If you happen to use Solaris, the ability to limit resource usage is a native feature.
Memory (RAM) usage can be capped per process using the rcap.max-rss setting while CPU usage can be limited per project using the project.cpu-caps.
Note that Solaris also allows OS level virtualization (a.k.a. zones) which have no significant overhead, if any, compared to a bare metal OS instance.
Resource capping is part of Solaris zones configuration.
Try CPULimit
cpulimit is a simple program which attempts to limit the cpu usage of a process (expressed in percentage, not in cpu time). This is useful to control batch jobs, when you don't want them to eat too much cpu. It does not act on the nice value or other scheduling priority stuff, but on the real cpu usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.

Estimating OpenCL memory access performance for algorithm design

I have a task which I need to achieve using one of several possible algorithm.
Each algorithm, has its own opportunities for local-memory optimization, and I would like to estimate which algorithm will perform best, based on counting compute operations and memory access.
For the purpose of comparing different number of local memory access operations vs. global memory access operations, I would like to estimate the price (in cycles?) of local memory access (read / write) vs the price of global memory access.
How many cycles does it take (on a modern, consumer GPU) to perform each of these:
read from local memory
write to local memory
read from global memory
write to global memory
Note: I use "local memory" and "global memory" in their meaning in OpenCL.
Usually, access to local memory tooks couple of GPU cycles. Access to global memory tooks tens of cycles. From one video card to another numbers differ significantly. So that are very general numbers, which only show difference of magnitude.
As I understand, you're concerned about low-level optimization. If that's right, than you can use software, which is usually shipped with SDK by GPU vendor. Many of them (AMD, ARM, etc) provides offline compilers, which allows export of clProgramm's compiled binaries assembler with instructions-per-cycle information. Then you will get most definite numbers.

How to modify a code acceding to a function?

I will try to explain my problem. There are 365 (global map)files in two directories dir1 and dir2, which have the same format ,byte,extend,etc. I computed the bias between two datasets using the function and code given below as follows:
How can I solve this problem?please
I suspect this is due to memory limitations on a 32-bit system. You want to allocate an array of 933M doubles, that requires 7.6Gb of continuous memory. I suggest you to read ?Memory and ?"Memory-limits" for more details. In particular, the latter says:
Error messages beginning ‘cannot allocate vector of size’ indicate
a failure to obtain memory, either because the size exceeded the
address-space limit for a process or, more likely, because the
system was unable to provide the memory. Note that on a 32-bit
build there may well be enough free memory available, but not a
large enough contiguous block of address space into which to map
it.
If this is indeed your problem, you may look into bigmemory package (http://cran.r-project.org/web/packages/bigmemory/index.html) which allows to manage massive matrixes with shared and file-based memory. There are also other strategies (e.g. using an SQLite database) to manage data that doesn't fit in memory all at once.
Update. Here is an excerpt from Memory-limit for Windows:
The address-space limit is 2Gb under 32-bit Windows unless the OS's default has been changed to allow more (up to 3Gb). See http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx and http://msdn.microsoft.com/en-us/library/bb613473(VS.85).aspx. Under most 64-bit versions of Windows the limit for a 32-bit build of R is 4Gb: for the oldest ones it is 2Gb. The limit for a 64-bit build of R (imposed by the OS) is 8Tb.
It is not normally possible to allocate as much as 2Gb to a single vector in a 32-bit build of R even on 64-bit Windows because of preallocations by Windows in the middle of the address space.
Under Windows, R imposes limits on the total memory allocation available to a single session as the OS provides no way to do so: see memory.size and memory.limit.

OpenCL shared memory optimisation

I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory really runs faster than your local-memory version (assuming both are equally optimized depending on the memory space you're using), maybe this paper could answer your question.
Here's a summary of what it says:
Usage of local memory in a kernel add another constraint to the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVidia-parlance, each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-time to proceed, but your GPU will be completely busy because it is running more of them concurrently.
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.
Using shared memory (memory shared with CPU) isn't always going to be faster. Using a modern graphics card It would only be faster in the situation that the GPU/CPU are both performing oepratoins on the same data, and needed to share information with each-other, as memory wouldn't have to be copied from the card to the system and vice-versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively since the GPU's memory will not only likely be much faster than your systems, there will not be any latency caused by reading memory over the PCI-E lane.
Think of the Graphics Card's memory as a type of "l3 cache" and your system's memory a resource shared by the entire system, you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, I've never even written Hello World in these applications. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).

Resources