I'm having an issue where an R function (NbClust) crashes R, but at different points on different runs with the same data. According to journalctl, the crashes are all due to memory issues. For example:
Sep 04 02:00:56 Q35 kernel: [ 7608] 1000 7608 11071962 10836497 87408640 0 0 rsession
Sep 04 02:00:56 Q35 kernel: Out of memory: Kill process 7608 (rsession) score 655 or sacrifice child
Sep 04 02:00:56 Q35 kernel: Killed process 7608 (rsession) total-vm:44287848kB, anon-rss:43345988kB, file-rss:0kB, shmem-rss:0kB
Sep 04 02:00:56 Q35 kernel: oom_reaper: reaped process 7608 (rsession), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
I have been testing my code to figure out which lines are causing the memory errors, and it turns out that it varies, even with the same data. Aside from wanting to solve it, I am confused as to why this is an intermittent problem. If an object is too big to fit in memory, it should be a problem every time I run it given the same resources, right?
The amount of memory being used by other processes was not dramatically different between runs, and I always started from a clean environment. When I look at top I always have memory to spare (although I am rarely looking at the exact moment of the crash). I've tried reducing the memory load by removing unneeded objects and running regular garbage collection, but this has had no discernible effect.
For example, when running NbClust, sometimes the crash occurs while running
length(eigen(TT)$value)
other times it happens during a call to hclust. Sometimes it doesn't crash at all and exits with a comparatively graceful "cannot allocate vector of size ..." error.
Aside from any suggestions about reducing memory load, I want to know why I am running out of memory some times but not others.
Edit: After changing all uses of hclust to hclust.vector, I have not had any more crashes during the hierarchical clustering steps. However, there are still crashes occurring at varying places (often during calls to eigen()).
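For reference, a minimal sketch (with made-up data) of the kind of swap described above: fastcluster's hclust.vector() works directly on the data matrix, so the full n x n distance matrix never has to be held in memory.

library(fastcluster)  # provides hclust.vector()

X  <- matrix(rnorm(2000 * 10), ncol = 10)               # toy data
h1 <- stats::hclust(dist(X), method = "ward.D")         # builds the full distance matrix first
h2 <- fastcluster::hclust.vector(X, method = "ward")    # clusters the points directly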
If I could reliably predict (within a margin of error) how much memory each line of my code was going to use, that would be great.
Modern memory management is far less deterministic than you seem to think it is.
If you want more reproducible results, make sure to get rid of any garbage collection, any parallelism (in particular garbage collection running in parallel with your program!), and make sure that the process is limited in memory to a value much less than your free system memory.
The kernel OOM killer is a measure of last resort when the kernel has overcommitted memory (you may want to read up on what that means), is completely out of swap space, and cannot fulfill its promises.
The kernel can allocate memory that doesn't need to exist until it is first accessed. Hence, the OOM kill can be triggered not at allocation time, but when a page is actually touched.
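As a concrete illustration of the last point about limiting the process: one option is an address-space cap well below the free system memory, so allocations fail with an R-level "cannot allocate vector of size ..." error instead of the OOM killer terminating rsession. A minimal sketch, assuming the unix package (Linux/macOS) is installed; the 16 GB figure is an arbitrary example.

library(unix)                  # provides rlimit_as()
rlimit_as(cur = 16 * 1024^3)   # soft-limit the address space to ~16 GB (value in bytes)

# The same effect can be had without any package by starting R from a shell
# with e.g. `ulimit -v 16000000` (value in kB).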
In R I have a task that I'm trying to parallelize. Part of this is comparing run-times and peak memory usage for different implementations of the task at hand. I'm using the peakRAM library to determine peak memory, which I think just uses gc() under the hood, since if I do it manually I get the same peak memory results.
The problem is that the results from peakRAM are different from the computer's task manager (or top on Linux). If I run on a single core, these numbers are in the same ballpark, but even with 2 cores they are very different.
I'm parallelizing using pblapply in a manner similar to this:

library(peakRAM)   # peak memory measurement
library(pbapply)   # pblapply
library(parallel)  # makeCluster
library(magrittr)  # %>%

times_parallel = peakRAM(
  pblapply(X = 1:10,
           FUN = \(x) data[iteration == x] %>% parallel_task(),
           cl = makeCluster(numcores, type = "FORK"))
)
With a single core, this process requires a peak of 30G of memory. But with 2 cores, peakRAM reports only about 3G of memory. Looking at top, however, shows that each of the 2 worker processes is using around 20-30G of memory at a time.
The only thing I can think of is that peakRAM is only reporting the memory of the main R process, but I see nothing in the gc() details that suggests this is happening.
The time reported by peakRAM seems appropriate: sub-linear gains at different core counts.
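One way to test the "only the main process is measured" hypothesis is to collect gc() statistics inside each worker and compare them with what peakRAM reports in the parent. A minimal sketch, reusing the assumed data, iteration, parallel_task and numcores from the snippet above:

library(parallel)

cl <- makeCluster(numcores, type = "FORK")
worker_peaks <- parLapply(cl, 1:10, function(x) {
  gc(reset = TRUE)                                  # reset this worker's "max used" counters
  res <- data[iteration == x] %>% parallel_task()   # the same per-iteration task
  list(result = res, peak_mb = sum(gc()[, 6]))      # column 6 = "max used" in Mb
})
stopCluster(cl)

sapply(worker_peaks, `[[`, "peak_mb")   # per-worker peaks that the parent's gc() never saw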
I have ~200 .Rds datasets that I perform various operations on in a pipeline of multiple scripts. In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects vary considerably in size (the original post included a histogram of object sizes in MB).
So if I optimise core number usage (I have a 12-core/16 GB RAM machine at the office and a 16-core/32 GB RAM machine at home), it'll whip through the first 90 without incident, but then the larger files bunch up and max out the total RAM allocation (remember that .Rds files are compressed, so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash, typically leaving me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core number based on the number and RAM size of the workers and the total available RAM, and figured others might have been in a similar dilemma and developed strategies to address it. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening & closing them, and use object.size() to get their sizes in RAM; sort largest to smallest, cut halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. Two workers (or any even multiple) should therefore be working at the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then move on to job 3, the 2nd smallest, likely finish that really quickly too, then take job 4, the second largest, while worker 2 is still on the largest, meaning that by job 4 this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until the total available RAM would be exceeded. Run foreach on that batch. Repeat (a rough sketch appears at the end of this post). This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time?). Also, if there are a lot of tasks which won't exceed the RAM (per my example), the batching process will mean all 12 or 16 cores have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small to large as in approach 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance the remaining workers can continue. Conceptually this means cores-1 tasks will fail and need to be re-run, but the code is easy and should run fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach; however, I should anticipate further losses and therefore further reruns unless I lower the core number.
Approach 4: as 3, but somehow close the worker (reduce the core count) BEFORE the task is assigned, so the task doesn't have to trigger a RAM overrun and fail in order to reduce the worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do this all for me but beggars can't be choosers! Conceptually this would be similar to 4, above: for each worker, don't start the next task until there's sufficient RAM available.
Any thoughts appreciated from folks who've run into similar issues. Cheers!
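For concreteness, a rough sketch of what Approach 2's cumulative-size batching could look like; process_one(), the directory names and the 12 GB budget are hypothetical placeholders:

library(foreach)
library(doParallel)
library(parallel)

files <- list.files("data_dir", pattern = "\\.Rds$", full.names = TRUE)
sizes <- vapply(files, function(f) as.numeric(object.size(readRDS(f))), numeric(1))
ord   <- order(sizes)                                # small to large
files <- files[ord]; sizes <- sizes[ord]

ram_budget <- 12 * 1024^3                            # bytes; leave headroom below total RAM

# Greedy batching: keep adding files until the batch would exceed the budget.
batch <- integer(length(files)); b <- 1L; acc <- 0
for (i in seq_along(files)) {
  if (acc > 0 && acc + sizes[i] > ram_budget) { b <- b + 1L; acc <- 0 }
  batch[i] <- b
  acc <- acc + sizes[i]
}

cl <- makeCluster(12); registerDoParallel(cl)        # 12 cores on the office machine
for (bt in seq_len(b)) {                             # for-loop wrapper around the foreach
  res <- foreach(f = files[batch == bt], .errorhandling = "pass") %dopar%
    process_one(readRDS(f))
  saveRDS(res, file.path("out_dir", paste0("batch_", bt, ".Rds")))
}
stopCluster(cl)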
I've thought a bit about this too.
My problem is a bit different: I don't get any crashes, but rather slowdowns due to swapping when there isn't enough RAM.
Things that may work:
randomize the iterations so that the load is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, have some barriers (workers waiting in a while loop with Sys.sleep()) while there is not enough memory (e.g. determined via the {memuse} package); a sketch appears at the end of this answer.
Things I do in practice:
always store the results of iterations in foreach loops and test if already computed (RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using fewer cores
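A minimal sketch of the barrier idea from the first list (hypothetical thresholds; reads /proc/meminfo directly on Linux, with {memuse}'s Sys.meminfo() as a portable alternative):

wait_for_ram <- function(need_kb, poll_sec = 30) {
  repeat {
    info  <- readLines("/proc/meminfo")
    avail <- as.numeric(sub("\\D*(\\d+).*", "\\1",
                            grep("^MemAvailable:", info, value = TRUE)))
    if (avail > need_kb) return(invisible(TRUE))
    Sys.sleep(poll_sec)   # barrier: wait until enough RAM is free
  }
}

# Inside the foreach body, before reading a big .Rds file:
# wait_for_ram(need_kb = 3 * file.size(f) / 1024)   # .Rds is compressed, so pad generously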
My code eats up to 3GB of memory at a single time. I figured it out using gc():
gc1 <- gc(reset = TRUE)   # reset the "max used" counters
graf(...)                 # the code under test
gc2 <- gc()
# column 6 = "max used" (Mb) since the reset; column 2 = "used" (Mb) at the reset
cat(sprintf("mem: %.1fMb.\n", sum(gc2[,6] - gc1[,2])))
# mem: 3151.7Mb.
Which I guess means that there is a single point in time when 3151.7 MB are allocated at once.
My goal is to minimize the maximum memory allocated at any single time. How do I figure out which part of my code is responsible for the maximum usage of those 3 GB of memory? I.e., the place where those 3 GB are allocated at once.
I tried memory profiling with Rprof and profvis, but both seem to show different information (which seems undocumented, see my other question). Maybe I need to use them with different parameters (or use different tool?).
I've been looking at Rprofmem... but:
in the profmem vignette they wrote: "with utils::Rprofmem() it is not possible to quantify the total memory usage at a given time because it only logs allocations and does therefore not reflect deallocations done by the garbage collector."
how to output the result of Rprofmem? This source speaks for itself: "Summary functions for this output are still being designed".
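For what it's worth, a minimal sketch of getting something readable out of utils::Rprofmem() by hand: it writes one line per allocation, which can be read back and summed (this gives total allocations, not the peak). The threshold and file name are arbitrary examples.

Rprofmem("allocs.out", threshold = 1024)   # log every allocation of at least 1 KB
graf(...)                                  # the code under test
Rprofmem("")                               # stop logging

lines <- readLines("allocs.out")
alloc <- lines[grepl("^[0-9]", lines)]     # skip "new page:" entries
bytes <- as.numeric(sub("^([0-9]+).*", "\\1", alloc))
sum(bytes) / 1024^2                        # total MB allocated during the call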
My code eats up to 3GB of memory at a single time.
While it looks like your code is consuming a lot of RAM at once by calling one function, you can break the memory consumption down into the implementation details of that function (and its sub-calls) by using RStudio's built-in profiling (based on profvis) to see execution time and rough memory consumption. E.g., using my demo code:
# graf code taken from the tutorial at
# https://rawgit.com/goldingn/intecol2013/master/tutorial/graf_workshop.html
library(dismo)  # install.packages("dismo")
library(GRaF)   # install_github('goldingn/GRaF')

data(Anguilla_train)

# loop to call the code under test several times to get better profiling results
for (i in 1:5) {
  # keep the first n records of SegSumT, SegTSeas and Method as covariates
  covs <- Anguilla_train[, c("SegSumT", "SegTSeas", "Method")]
  # use the presence/absence status to fit a simple model
  m1 <- graf(Anguilla_train$Angaus, covs)
}
Start profiling with the Profile > Start Profiling menu item, source the above code, and stop the profiling via the same menu.
After Profile > Stop Profiling, RStudio shows the result as a Flame Graph, but what you are looking for is hidden in the Data tab of the profile result (unfold the function calls that show heavy memory consumption).
The numbers in the memory column indicate the memory allocated (positive) and deallocated (negative numbers) for each called function, and the values should include the sum of the whole sub-call tree plus the memory directly used in the function.
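For completeness, the same profile can be produced without the menus by wrapping the demo loop in profvis() directly (a sketch that assumes the demo code's libraries and Anguilla_train are already loaded):

library(profvis)

p <- profvis({
  for (i in 1:5) {
    covs <- Anguilla_train[, c("SegSumT", "SegTSeas", "Method")]
    m1 <- graf(Anguilla_train$Angaus, covs)
  }
})
print(p)   # opens the same Flame Graph / Data view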
My goal is to minimize the maximum memory allocated at any single time.
Why do you want to do that? Do you run out-of-memory or do you suspect that repeated memory allocation is causing long execution times?
High memory consumption (or repeated allocations/deallocations) often comes together with slow execution performance, since copying memory costs time.
So look at the Memory or Time column depending on your optimization goals to find function calls with high values.
If you look into the source code of the GRaF package you can find a loop in the graf.fit.laplace function (up to 50 "newton iterations") that calls "slow" R-internal functions like chol, backsolve, forwardsolve but also slow functions implemented in the package itself (like cov.SE.d1).
Now you can try to find faster (or less memory consuming) replacements for these functions... (sorry, I can't help here).
PS: profvis uses Rprof internally, so the profiling data is collected by probing the current memory consumption at regular time intervals and attributing it to the currently active function (call stack).
Rprof has limitations: the result is not exact, since the garbage collector triggers at non-deterministic times and the freed memory is attributed to whichever function the next probing interval happens to stop in, and it does not recognize memory allocated directly from the OS via C/C++ code or libraries that bypass R's memory management API.
Still, it is the easiest and normally a good enough indicator of memory and performance problems...
For an introduction to profvis see: https://rstudio.github.io/profvis/
I have been using filebacked.big.matrix to store a very large matrix (~1 million x 20 thousand). I am working on a cluster with very high memory, but not quite that much. I have previously used the ff package, which worked great and kept the memory usage consistent despite the matrix size, but it died when I surpassed 2^31 items in the matrix (the R community really needs to fix that problem). The filebacked.big.matrix initially seemed to work very well and generally runs without problems, but when I check on the memory usage it sometimes spikes into the hundreds of GBs. I am careful to only read/write a relatively few rows of the matrix at a time, so I think there should not be much in memory at any given time.
Does it do some sort of automatic memory caching or something that is driving the memory usage up? If so can this caching be disabled or limited? The high memory usage is causing some nasty side effects on the cluster so I need a way to do this that is memory neutral. I have checked the filebacked.big.matrix help page, but can't find any pertinent information there.
Thanks!
UPDATE:
I am also using bigmemoryExtras.
I was wrong earlier: the problem happens when I loop through the entire matrix, reading it into a different, smaller file-backed matrix like this:
tmpGeno = fileBackedMatrix(rowIndex - 1, numMarkers, 'double', tmpDir)
front = 1
back  = 40000
# large matrix must be copied in chunks to avoid integer.max errors
while (front < rowIndex - 1) {
  if (back > rowIndex - 1) back = rowIndex - 1
  tmpGeno[front:back, 1:numMarkers] = genotypeMatrix[front:back, 1:numMarkers, drop = FALSE]
  front = front + 40000
  back  = back + 40000
}
The physical memory usage is initially very low (with virtual memory very high). But while running this loop, and even after it has finished, it seems to just keep most of the matrix in physical memory. I need it to only keep one small chunk of the matrix in memory at a time.
UPDATE 2:
It is a bit confusing to me: the cluster metrics and the top command say that it is using tons of memory (~80 GB), but the gc() command says that memory usage never went over 2 GB. The free command says that all the memory is used, but the -/+ buffers/cache line says only 7 GB are being used in total.
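A sketch of how that discrepancy can be seen from inside R (Linux-specific, reading /proc): gc() only reports R's own heap, while the process's VmRSS also counts resident pages of the memory-mapped backing file, which the OS keeps cached and can drop again under memory pressure.

r_heap_mb <- sum(gc()[, 2])                              # R's own heap usage, in MB
status    <- readLines("/proc/self/status")
rss_kb    <- as.numeric(sub("\\D*(\\d+).*", "\\1",
                            grep("^VmRSS:", status, value = TRUE)))
cat(sprintf("R heap: %.0f MB, process RSS: %.0f MB\n", r_heap_mb, rss_kb / 1024))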
For an array X in global memory, I need to write two values in every kernel execution.
X[p]=mul1+mul2;
X[p+a]=mul1-mul2;
Here 'a' can range from 0 to very high values. I observed that these two writes slow down my kernel to a great extent.
What is the best way to improve the memory write performance in OpenCL?
Are Coalesced memory writes possible only for intra Kernel writes?
Assuming p depends linearly on your thread ID, you are doing things the right way. You could try to pass X+a as a second argument Y to your kernel and do Y[p]=mul1-mul2; instead of X[p+a]=mul1-mul2;, but I doubt it will be much faster.
Concerning your second question, if you are thinking of having two kernels, one performing the addition and the other the subtraction, and launching them concurrently, you cannot be sure they will run side-by-side in parallel. Once again, I doubt it will be faster in the end.