I'm using R 2.15.3 on Ubuntu 12.04 (precise) 64-bit.
If I run R in valgrind:
R -d "valgrind" --vanilla
I then exit the program using q() and I get the following report:
==7167== HEAP SUMMARY:
==7167== in use at exit: 28,239,464 bytes in 12,512 blocks
==7167== total heap usage: 28,780 allocs, 16,268 frees, 46,316,337 bytes allocated
==7167==
==7167== LEAK SUMMARY:
==7167== definitely lost: 120 bytes in 2 blocks
==7167== indirectly lost: 480 bytes in 20 blocks
==7167== possibly lost: 0 bytes in 0 blocks
==7167== still reachable: 28,238,864 bytes in 12,490 blocks
==7167== suppressed: 0 bytes in 0 blocks
==7167== Rerun with --leak-check=full to see details of leaked memory
==7167==
==7167== For counts of detected and suppressed errors, rerun with: -v
==7167== Use --track-origins=yes to see where uninitialised values come from
==7167== ERROR SUMMARY: 385 errors from 5 contexts (suppressed: 2 from 2)
Lately R has been crashing quite often, especially when I call C++ functions through Rcpp. Could this be the reason?
Thanks!
You may be misreading the valgrind output. Most likely there is no (obvious) leak here, as R is a pretty well-studied system. But R is also a dynamically typed language that has, of course, made allocations; the "definitely lost: 120 bytes" is essentially measurement error -- see the valgrind docs.
If you want to see a leak, create one, e.g., with a file like this:
library(Rcpp)
cppFunction('int leak(int N) { double *ptr = (double*) malloc(N * sizeof(double)); /* never freed */ return 0; }')
leak(10000)
which allocates memory explicitly outside of R's reach, never releases it, and then exits. Here we get:
$ R -d "valgrind" -f /tmp/leak.R
[...]
R> leak(10000)
[1] 0
R>
==4479==
==4479== HEAP SUMMARY:
==4479== in use at exit: 35,612,126 bytes in 15,998 blocks
==4479== total heap usage: 47,607 allocs, 31,609 frees, 176,941,927 bytes allocated
==4479==
==4479== LEAK SUMMARY:
==4479== definitely lost: 120 bytes in 2 blocks
==4479== indirectly lost: 480 bytes in 20 blocks
==4479== possibly lost: 0 bytes in 0 blocks
==4479== still reachable: 35,611,526 bytes in 15,976 blocks
==4479== suppressed: 0 bytes in 0 blocks
==4479== Rerun with --leak-check=full to see details of leaked memory
==4479==
==4479== For counts of detected and suppressed errors, rerun with: -v
==4479== Use --track-origins=yes to see where uninitialised values come from
==4479== ERROR SUMMARY: 31 errors from 10 contexts (suppressed: 2 from 2)
$
Now there is a bit more of a leak (though it is still not as easily readable as one would hope). If you add the suggested flags, it will eventually point to the malloc() call we made.
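For example, rerunning the same file with the extra options passed straight through to valgrind:
R -d "valgrind --leak-check=full --track-origins=yes" -f /tmp/leak.R
The "definitely lost" entries in the leak summary should then carry a backtrace ending in the malloc() call inside leak().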
Also, I have a worked example of an actual leak in an earlier version of a CRAN package in one of my 'Intro to HPC with R' slide sets. If and when there is a leak, this approach helps; when there is none, it is harder to see through the noise.
So in short: if your code crashes, it is probably your code's fault. Trying to construct a minimal reproducible example is the (good) standard advice.
Related
I have been testing various compression algorithms with Parquet files and have settled on Zstd.
As far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it begins with an empty one. However, with the dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.
The file size without a dictionary is considerably smaller than with the adaptive one (the number at the end of the name is the compression level):
Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030
And for comparison the log from using the adaptive dictionary:
Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779
Can you help me understand the significance of this? Shouldn't the output with an empty dictionary perform at least as well as Zstd compression with the dictionary disabled?
I am trying to stress an Ubuntu container's memory. Typing free in my terminal gives the following result:
free -m
total used free shared buff/cache available
Mem: 7958 585 6246 401 1126 6743
Swap: 2048 0 2048
I want to stress exactly 10% of the total available memory. Per stress-ng manual:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writing to the allocated
memory. Note that this can cause systems to trip the kernel OOM killer on Linux
systems if not enough physical memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can specify the size as % of
total available memory or in units of Bytes, KBytes, MBytes and GBytes using the
suffix b, k, m or g.
Now, on my target container I run two memory stressors to occupy 10% of my memory:
stress-ng -vm 2 --vm-bytes 10% -t 10
However, the memory usage of the container never reaches 10%, no matter how many times I run it. I tried different timeout values, with no result. The closest it gets is 8.9%; it never reaches 10%. I inspect memory usage on my container this way:
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 199.01% 638.4MiB / 7.772GiB 8.02% 1.45kB / 0B 0B / 0B 7
In an attempt to understand this behaviour, I tried running the same command with an exact number of bytes. In my case I opted for 800 MB, since 7958 MB * 0.1 = 795.8 ~ 800 MB.
stress-ng -vm 2 --vm-bytes 800m -t 15
And, I get 10%!
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 198.51% 815.2MiB / 7.772GiB 10.24% 1.45kB / 0B 0B / 0B 7
Can someone explain why this is happening?
Another question, is it possible for stress-ng to stress memory usage to 100%?
stress-ng --vm-bytes 10% will use sysconf(_SC_AVPHYS_PAGES) to determine the available memory. This sysconf() call returns the number of pages the application can use without hindering any other process, which is approximately what the free command reports as free memory. That explains the gap you saw: 10% of the roughly 6.2-6.7 GB that was free/available on your system is about 620-670 MB, i.e. roughly 8% of the total 7958 MB that docker stats uses as its denominator.
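You can inspect the number stress-ng starts from directly in the shell (a quick check using glibc's getconf; multiplying the two values gives the available memory in bytes):
getconf _AVPHYS_PAGES
getconf PAGE_SIZE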
Note that stress-ng allocates the memory with mmap, so some of the mmap'd pages may not be physically backed at the time you check how much real memory is being used.
It may be worth also using the --vm-populate option, which tries to ensure the pages of the mmap'd memory that stress-ng is exercising are physically populated. Also try --vm-madvise willneed, which uses the madvise(2) system call to hint that the pages will be required fairly soon.
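For example, combining these (using only options from the stress-ng manual; untested on your particular container):
stress-ng --vm 2 --vm-bytes 10% --vm-populate --vm-madvise willneed -t 10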
I want to calculate space left on my embedded target.
The Arduino IDE shows this in the output window:
Sketch uses 9544 bytes (3%) of program storage space. Maximum is 262144 bytes.
avr-size has a -C option that shows usage as a percentage ("xx% Full"):
$ avr-size -C --mcu=atmega32u4 build/myproject.hex
AVR Memory Usage
----------------
Device: atmega32u4
Program: 8392 bytes (25.6% Full)
(.text + .data + .bootloader)
Data: 2196 bytes (85.8% Full)
(.data + .bss + .noinit)
However, I'm actually writing a CMake file to develop code for an Arduino board with an Arm Cortex-M0 CPU, so I use arm-none-eabi-size, which shows the code size like this:
[100%] Built target hex
text data bss dec hex filename
8184 208 1988 10380 288c build/myproject
[100%] Built target size
*** Finished ***
Is there a way to calculate the program and data space left on the device? Or do I need to regex the output and compute the percentage against a hard-coded total?
If you are using the arm-none-eabi toolchain, you can add the linker option -Wl,--print-memory-usage, which prints RAM and flash usage as percentages. The output looks like this:
Memory region Used Size Region Size %age Used
RAM: 8968 B 20 KB 43.79%
FLASH: 34604 B 128 KB 26.40%
I am using a Makefile generated by CubeMX; to enable this printout I added the option at the end of the LDFLAGS line. For CMake, this thread might be useful.
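For a CMake build, a minimal sketch (assuming CMake 3.13 or later and a hypothetical target named myproject) would be:
target_link_options(myproject PRIVATE "-Wl,--print-memory-usage")
Note that the linker only prints percentages for regions declared with a MEMORY command in the linker script, as the CubeMX-generated scripts do for RAM and FLASH.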
I have written some recursive code in R.
Before invoking R, I set the stack size to 96 MB at the shell with:
ulimit -s 96000
I invoked R with a maximum pointer protection stack size of 500000:
R --max-ppsize 500000
And I changed the maximum recursion depth to 500000:
options(expression = 500000)
I used both the binary R package from the Arch Linux repositories (without memory profiling) and a binary I compiled myself with the memory profiling option. Both are version 3.4.2.
I used two versions of the code, with and without gc().
The problem is that R exits with a "node stack overflow" error while only 16 MB of the 93 MB of total available stack is in use, and the depth is just below one percent of the expressions option of 5e5:
size current direction eval_depth
93388800 16284704 1 4958
Error: node stack overflow
The change in current stack usage between the last two iterations was around 10 KB. The only object passed and saved is a numeric vector of 19 elements.
The recursive portion of the code is below:
network_recursive <- function(called)
{
print(Cstack_info())
callers <- list_caller[[called + 1]] # get the callers of the called
callers <- callers[!bool[callers + 1]] # subset for nofriends - new friends
new_friend_no <- length(callers) # number of new friends
print(list(called, callers) )
if (new_friend_no > 0) # if1 still new friends
{
friends <<- friends + new_friend_no # increment friend no
print(friends)
bool[callers + 1] <<- T # toggle friends
sapply(callers, network_recursive) # recurse network control
} # close if1
print("end of recursion")
}
What may be the reason for this stack overflow?
Some notes on the R source code, related to the issue.
The portion of the code that triggers the error is lines 5987-5988 of src/main/eval.c:
5975 #ifdef USE_BINDING_CACHE
5976 if (useCache) {
5977 R_len_t n = LENGTH(constants);
5978 # ifdef CACHE_MAX
5979 if (n > CACHE_MAX) {
5980 n = CACHE_MAX;
5981 smallcache = FALSE;
5982 }
5983 # endif
5984 # ifdef CACHE_ON_STACK
5985 /* initialize binding cache on the stack */
5986 vcache = R_BCNodeStackTop;
5987 if (R_BCNodeStackTop + n > R_BCNodeStackEnd)
5988 nodeStackOverflow();
5989 while (n > 0) {
5990 SETSTACK(0, R_NilValue);
5991 R_BCNodeStackTop++;
5992 n--;
5993 }
5994 # else
5995 /* allocate binding cache and protect on stack */
5996 vcache = allocVector(VECSXP, n);
5997 BCNPUSH(vcache);
5998 # endif
5999 }
6000 #endif
Off the top of my head: I see that you used options(expression = 500000), but the field in the list returned by options() is called 'expressions' (with an s). If you typed it the way you described in your question, the 'expressions' field remained at its default of 5000, not the 500000 you intended to set. This might be why you maxed out while using only what you thought was 1% of the allowed depth.
The node stack has its own limit, which is fixed (R_BCNODESTACKSIZE, defined in Defn.h). If you have a real example where the limit is too small, please submit a bug report; we could increase it or add a command-line option for it. The "node stack" is used by the byte-code interpreter, which interprets byte-code produced by the byte-code compiler. Cstack_info() does not display the node stack usage; the node stack is not allocated on the C stack.
Programs based on deep recursion will be very slow in R anyway, as function calls are quite expensive. For practical purposes, when a limit related to recursion depth is hit, it may be better to rewrite the program to avoid recursion rather than increasing the limits -- for example with an explicit worklist, as sketched below.
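A minimal sketch of such a rewrite of the function in the question (assuming list_caller, bool and friends are defined as in the original code; the worklist visits the same nodes, just not in depth-first order):
network_iterative <- function(called)
{
  worklist <- called                        # nodes whose callers still need processing
  while (length(worklist) > 0) {
    current <- worklist[[1]]                # take one node off the worklist
    worklist <- worklist[-1]
    callers <- list_caller[[current + 1]]   # get the callers of the current node
    callers <- callers[!bool[callers + 1]]  # keep only the new friends
    if (length(callers) > 0) {
      friends <<- friends + length(callers) # increment friend count
      bool[callers + 1] <<- TRUE            # mark them as seen
      worklist <- c(worklist, callers)      # queue them for later processing
    }
  }
}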
Just as an experiment, one might disable the just-in-time compiler and thereby reduce the stress on the node stack. The stress won't be eliminated completely, because some packages are compiled at installation by default, including the base and recommended packages, so e.g. sapply is compiled. This might on the other hand increase the pressure on the other limits involved (such as the expressions option), and the program will run even slower.
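For example (enableJIT() lives in the compiler package shipped with base R; setting the environment variable R_ENABLE_JIT to 0 before starting R has the same effect):
compiler::enableJIT(0)  # disable just-in-time compilation for this session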
My GPU seems to allow 562% use of global memory and 133% use of local memory for a simple PyOpenCL matrix addition kernel. Here is what my script prints:
GPU: GeForce GTX 670
Global Memory - Total: 2 GB
Global Memory - One Buffer: 3.750000 GB
Number of Global Buffers: 3
Global Memory - All Buffers: 11.250000 GB
Global Memory - Usage: 562.585844 %
Local Memory - Total: 48 KB
Local Memory - One Array: 32.000000 KB
Number of Local Arrays: 2
Local Memory - All Arrays: 64.000000 KB
Local Memory - Usage: 133.333333 %
If I increase global memory use much above this point, I get the error: mem object allocation failure
If I increase local memory use above this point, I get the error: invalid work group size
Why doesn't my script fail immediately when memory use of local or global exceeds 100%?
The sizes are being multiplied by 32, and that is the error: 32 is the size of a float32 in bits, not bytes. A float32 occupies 4 bytes, so each element of the a and b arrays takes 4 bytes, not 32, and every figure in the printout is overstated by a factor of 8. For example, 3.750000 GB / 8 = 0.468750 GB per buffer, 3 * 0.468750 GB = 1.406250 GB in total, and 1.406250 / 2 = 70.3125 % of the 2 GB of global memory.
So the proper results for you would be:
Global Memory - Total: 2 GB
Global Memory - One Buffer: 0.468750 GB
Number of Global Buffers: 3
Global Memory - All Buffers: 1.406250 GB
Global Memory - Usage: 70.312500 %
Local Memory - Total: 48 KB
Local Memory - One Array: 4.000000 KB
Number of Local Arrays: 2
Local Memory - All Arrays: 8.000000 KB
Local Memory - Usage: 16.666667 %