Complete R Session Size - r

Because I constantly hit the memory limit in my R session (8 GB Windows PC), I have started removing the big objects I loaded. However, once I reach the limit, removing objects does not seem to help.
So I was wondering if there is a way to get the size of the R session. I know it is possible to retrieve the size of individual objects (I saw that in this thread). Is there a way to measure the size of the complete R session, though (loaded packages, objects, etc.)?
Thank you!

I personally use this function to get the available memory:
getAvailMem <- function(format = TRUE) {
  gc()
  if (Sys.info()[["sysname"]] == "Windows") {
    memfree <- 1024^2 * (utils::memory.limit() - utils::memory.size())
  } else {
    # http://stackoverflow.com/a/6457769/6103040
    memfree <- 1024 * as.numeric(
      system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
  }
  `if`(format, format(structure(memfree, class = "object_size"),
                      units = "auto"), memfree)
}

To get the total memory used by R, you can try mem_used() from the pryr package. Unlike memory.size(), it is not OS dependent, because it uses the R function gc() underneath. Have a look at the function body, and also at pryr:::node_size and pryr:::show_bytes.
pryr::mem_used()
The help file ?pryr::mem_used says:
R breaks down memory usage into Vcells (memory used by vectors) and
Ncells (memory used by everything else). However, neither this
distinction nor the "gc trigger" and "max used" columns are typically
important. What we're usually most interested in is the the first
column: the total memory used. This function wraps around gc() to
return the total amount of memory (in megabytes) currently used by R.
You can also use pryr::mem_change to track the size of the memory used by the R code. Try the example in its documentation page.
The numbers 28L and 56L that pryr:::node_size uses for the node size come from the help file ?gc, which says:
gc returns a matrix with rows "Ncells" (cons cells), usually 28 bytes
each on 32-bit systems and 56 bytes on 64-bit systems, and "Vcells"
(vector cells, 8 bytes each),
After removing a large object, run gc() to free the memory.
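As a rough illustration of the gc()-based approach described above, here is a minimal base-R sketch that does not need pryr. The helper name session_mem_mb is made up for this example, and it assumes the second column of the gc() matrix is the "(Mb)" figure next to "used", as in the gc() output format quoted above.

```r
# Sum the "used (Mb)" column of gc() over the Ncells and Vcells rows.
# session_mem_mb is a hypothetical helper name, not part of base R or pryr.
session_mem_mb <- function() {
  g <- gc()          # matrix with rows "Ncells" and "Vcells"
  sum(g[, 2])        # column 2 holds the Mb currently in use
}

before <- session_mem_mb()
x <- numeric(5e6)    # 5e6 doubles, roughly 38 Mb
during <- session_mem_mb()
rm(x)                # drop the only reference to the vector
after <- session_mem_mb()   # gc() runs again inside the helper

c(before = before, during = during, after = after)
```

The middle value should be noticeably larger than the other two. Note that this only covers memory managed by R's allocator, so it is not necessarily what the operating system reports for the process.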

Related

Optimization in R: Levenberg-Marquardt using nls.lm in minpack.lm: resetting `maxiter' to 1024

I am trying to learn how to work with nls.lm in the R library minpack.lm by using the Rosenbrock function to see if the algorithm converges to the global minimum at (x, y) = (1, 1). I do so both with and without the analytic Jacobian. In both instances, I get a warning telling me that the algorithm has reset the maximum number of iterations specified in the call to nls.lm to 1024:
Warning messages:
1: In nls.lm(par = initpar, fn = objective_rosenbrock, jac = gradient_rosenbrock, :
resetting `maxiter' to 1024!
2: In nls.lm(par = initpar, fn = objective_rosenbrock, jac = gradient_rosenbrock, :
lmder: info = -1. Number of iterations has reached `maxiter' == 1024.
As a result, the algorithm never quite reaches (1,1) from my initial guess of (-1.2, 1.0). I found the source code for the library on GitHub, and the following lines of code are pertinent here:
https://github.com/cran/minpack.lm/blob/master/src/nls_lm.c
OS->maxiter = INTEGER_VALUE(getListElement(control, "maxiter"));
if(OS->maxiter > 1024) {
    OS->maxiter = 1024;
    warning("resetting `maxiter' to 1024!");
}
Is there any logic to why the maximum number of iterations is capped to 1024? Something with bits and 2^10? I would like to use the library for a different application, but this cap on iterations might prevent that. Any insight would be appreciated.
Git blame says that this code limiting the max iterations was introduced in version 1.1-0, in 2008. The NEWS file for the package only goes back as far as version 1.1-6. I can't find the code in any public repo other than the one you point to (which is only a CRAN mirror; it doesn't contain any comments/commit messages/etc. from developers that might give us clues.)
Other than contacting the maintainer I think it's going to be hard to figure out what the rationale is for this limit.
I do have some guesses though.
The only places where maxiter is actually used in the code are here and here, in R code rather than Fortran or C code, so it seems extremely unlikely that we are dealing with something like a 10-bit unsigned integer type (which seems an unlikely choice in any case). I think the limitation is there because we also have a buffer defined for holding trace information here:
double rsstrace[1024];
which, as you can see, is hard-coded to a length of 1024. Presumably bad things would happen if we tried to stuff 1025 iterations'-worth of tracing information into this array ...
My suggestions:
Change all instances of '1024' in the code to something larger and see what happens. There are only four:
$ find . -type f -exec grep -Hn 1024 {} \;
./src/nls_lm.c:141: if(OS->maxiter > 1024) {
./src/nls_lm.c:142: OS->maxiter = 1024;
./src/nls_lm.c:143: warning("resetting `maxiter' to 1024!");
./src/minpack_lm.h:20: double rsstrace[1024];
It would be best to #define MAXITER 2048 (or whatever) in src/minpack_lm.h and use that instead of the literal value.
Contact the maintainer (maintainer("minpack.lm")) and ask them about this issue.
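For anyone who wants to reproduce the setup, here is a minimal sketch of the Rosenbrock problem written as a residual vector, which is the form nls.lm minimises (the sum of squared residuals). The asker's objective_rosenbrock and gradient_rosenbrock are not shown in the question, so the function names and the residual formulation below are assumptions for illustration only. The nls.lm call is guarded so the snippet still runs when minpack.lm is not installed.

```r
# Rosenbrock as residuals: sum(r^2) = 100 * (y - x^2)^2 + (1 - x)^2
# (hypothetical helper names; not the asker's actual functions)
rosenbrock_resid <- function(par) {
  c(10 * (par[2] - par[1]^2), 1 - par[1])
}

# Analytic Jacobian: one row per residual, one column per parameter
rosenbrock_jac <- function(par) {
  rbind(c(-20 * par[1], 10),
        c(-1, 0))
}

initpar <- c(-1.2, 1.0)

if (requireNamespace("minpack.lm", quietly = TRUE)) {
  fit <- minpack.lm::nls.lm(
    par = initpar,
    fn = rosenbrock_resid,
    jac = rosenbrock_jac,
    # stay at or below the hard-coded cap to avoid the "resetting" warning
    control = minpack.lm::nls.lm.control(maxiter = 1024)
  )
  print(fit$par)    # estimated (x, y)
  print(fit$niter)  # iterations actually used
}
```

Requesting maxiter at or below 1024 avoids the first warning. Whether the iteration count is then sufficient depends on how the objective is formulated.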

Available stack size is not used by R, returning "Error: node stack overflow"

I have written a recursive function in R.
Before invoking R, I set the stack size to 96 MB at the shell with:
ulimit -s 96000
I invoked R with maximum protection pointer stack size of 500000 with:
R --max-ppsize 500000
And I changed the maximum recursion depth to 500000:
options(expression = 500000)
I used both the binary R package from the Arch Linux repositories (without memory profiling) and a binary I compiled myself with the memory profiling option. Both are version 3.4.2.
I used two versions of the code with and without gc().
The problem is that R exits the code with "node stack overflow" error while only 16 MB of the 93 MB of total available stack is used and depth is just below one percent of the expressions option of 5e5:
size current direction eval_depth
93388800 16284704 1 4958
Error: node stack overflow
The change in current stack usage between the last two iterations was around 10K. The only object passed and saved is a numeric vector of 19 items.
The recursive portion of the code is below:
network_recursive <- function(called)
{
  print(Cstack_info())
  callers <- list_caller[[called + 1]]      # get the callers of the called
  callers <- callers[!bool[callers + 1]]    # subset for nofriends - new friends
  new_friend_no <- length(callers)          # number of new friends
  print(list(called, callers))
  if (new_friend_no > 0)                    # if1 still new friends
  {
    friends <<- friends + new_friend_no     # increment friend no
    print(friends)
    bool[callers + 1] <<- T                 # toggle friends
    sapply(callers, network_recursive)      # recurse network control
  }                                         # close if1
  print("end of recursion")
}
What may be the reason for this stack overflow?
Some notes on the R source code, related to the issue.
The portion of the code that triggers the error is lines 5987-5988 from src/main/eval.c:
5975 #ifdef USE_BINDING_CACHE
5976 if (useCache) {
5977 R_len_t n = LENGTH(constants);
5978 # ifdef CACHE_MAX
5979 if (n > CACHE_MAX) {
5980 n = CACHE_MAX;
5981 smallcache = FALSE;
5982 }
5983 # endif
5984 # ifdef CACHE_ON_STACK
5985 /* initialize binding cache on the stack */
5986 vcache = R_BCNodeStackTop;
5987 if (R_BCNodeStackTop + n > R_BCNodeStackEnd)
5988 nodeStackOverflow();
5989 while (n > 0) {
5990 SETSTACK(0, R_NilValue);
5991 R_BCNodeStackTop++;
5992 n--;
5993 }
5994 # else
5995 /* allocate binding cache and protect on stack */
5996 vcache = allocVector(VECSXP, n);
5997 BCNPUSH(vcache);
5998 # endif
5999 }
6000 #endif
Off the top of my head, I see that you used options(expression = 500000), but the field in the list returned by "options()" is called 'expressions' (with an s). If you typed it in the way you described in your question, then the 'expressions' field remained at 5000, not the 500000 you intended to set it as. So this might be why you maxed out while only using what you thought was 1% of the stack depth.
The node stack has its own limit, which is fixed (defined in Defn.h, R_BCNODESTACKSIZE). If you have a real example where the limit is too small, please submit a bug report, we could increase it or also add a command line option for it. The "node stack" is used by the byte-code interpreter, which interprets byte-code produced by the byte-code compiler. Cstack_info() does not display the node stack usage. The node stack is not allocated on the C stack.
Programs based on deep recursion will be very slow in R anyway, as function calls are quite expensive. For practical purposes, when a limit related to recursion depth is hit, it might be better to rewrite the program to avoid recursion rather than increasing the limits.
Just as an experiment one might disable the just-in-time compiler and by that reduce the stress on the node stack. It won't be completely eliminated, because some packages are already compiled at installation by default, including base and recommended packages, so e.g. sapply is compiled. Also, this might on the other hand increase the stress on the recursively eliminated expressions, and the program will run even slower.
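To make the "rewrite to avoid recursion" suggestion concrete, here is a minimal sketch that walks the same kind of caller network with an explicit stack instead of recursive calls. The data in list_caller is made up, count_network is a hypothetical name, and the sketch assumes (as in the question) that ids are 0-based and that list_caller[[i + 1]] holds the callers of id i. It also assumes the starting id counts as already seen, which the question does not state.

```r
# Hypothetical data: list_caller[[i + 1]] holds the ids that called id i
list_caller <- list(c(1, 2), c(2, 3), integer(0), c(0))

count_network <- function(start, list_caller) {
  seen <- logical(length(list_caller))  # plays the role of `bool`
  seen[start + 1] <- TRUE               # assumption: start is already known
  friends <- 0
  stack <- start                        # explicit stack of ids still to visit
  while (length(stack) > 0) {
    called <- stack[length(stack)]      # pop the last id
    stack <- stack[-length(stack)]
    callers <- list_caller[[called + 1]]
    callers <- unique(callers[!seen[callers + 1]])  # keep only new friends
    if (length(callers) > 0) {
      friends <- friends + length(callers)
      seen[callers + 1] <- TRUE
      stack <- c(stack, callers)        # push the new friends
    }
  }
  friends
}

count_network(0, list_caller)
```

Because the loop never nests function calls, neither the expressions limit nor the node stack depth grows with the size of the network. The visiting order differs from the recursive version, but the set of ids reached is the same.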

fread protection stack overflow error

I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files.
The file in question has 313 rows and ~6.6 million columns of numeric data and is around 12 GB. This is a CentOS 6.4 machine with 512 GB of RAM.
When I attempt to read in the file:
g=fread('final.results',header=T,sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow
I tried starting R with --max-ppsize 500000, which is the maximum, but got the same error.
I also tried setting the stack size to unlimited via
ulimit -s unlimited
Virtual memory was already set to unlimited.
Am I being unrealistic with a file of this size? Did I miss something fairly obvious?
Now fixed in v1.8.9 on R-Forge.
An unintended 50,000 column limit has been removed in fread. Thanks to mpmorley for reporting. Test added.
The reason was I got this part wrong in the fread.c source :
// *********************************************************************
// Allocate columns for known nrow
// *********************************************************************
ans=PROTECT(allocVector(VECSXP,ncol));
protecti++;
setAttrib(ans,R_NamesSymbol,names);
for (i=0; i<ncol; i++) {
    thistype = TypeSxp[ type[i] ];
    thiscol = PROTECT(allocVector(thistype,nrow)); // ** HERE **
    protecti++;
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
    SET_VECTOR_ELT(ans,i,thiscol);
}
According to R-exts section 5.9.1, that PROTECT inside the loop isn't needed :
In some cases it is necessary to keep better track of whether protection is really needed. Be
particularly aware of situations where a large number of objects are generated. The pointer
protection stack has a fixed size (default 10,000) and can become full. It is not a good idea
then to just PROTECT everything in sight and UNPROTECT several thousand objects at the end. It
will almost invariably be possible to either assign the objects as part of another object (which
automatically protects them) or unprotect them immediately after use.
So that PROTECT is now removed and all is well. (It seems that the pointer protection stack limit has been raised to 50,000 since that text was written; Defn.h contains #define R_PPSSIZE 50000L.) I've checked all other PROTECTs in the data.table C source for anything similar, and found and fixed one in assign.c too (when adding more than 50,000 columns by reference); there were no others.
Thanks for reporting!

R - OSX Mountain Lion and CPU Limit

I am collecting data via SQL queries through R. I have a loop that pulls small chunks of a large table, saves each chunk and drops it, repeating for an hour or so until the whole table is in flat files in my RSQL directory.
However, R shoots a Cputime limit exceeded: 24 error every so often.
I am running Mountain Lion.
I have tried
nice -19n R CMD BATCH myscript.R
and the OS continues to kill the process at odd intervals. I do not believe the script to be getting stuck on a particular operation, it just takes a while to plough through the loop.
The loop looks like so..
for (i in 1:64) {
  foobyte <- NULL
  for (j in 0:7) {
    max_id <- 1000000
    rows <- 1e5
    to <- max_id * (rows * j) - (i * 7 * rows)
    from <- max_id * (rows * (j - 1)) - (1 * 7 * rows)
    foobit <- queryDB(paste("SELECT * FROM foobar where id <= ", to, " and id > ", from, ";"))
    foobyte <- rbind(foobit, foobyte)
  }
  # one file per outer iteration
  filename <- paste("/my/data/dir/foobyte", i, ".csv", sep = "")
  write.table(foobyte, filename)
}
It runs for 30-90 minutes before crashing. I will try firing up R from a shell script calling ulimit in only that terminal session, and see how this works.
Tried ulimit... Appears I do not have access, even via sudo. I get the same output from
ulimit -a -H
before and after giving
ulimit -t 12000 # sets cputime limit to 12000 seconds from 600 seconds
SOLVED via Debian Virtual Machine. If someone has a Mountain Lionic solution, please let us know.
A cursory google search for "Cputime limit exceeded: 24" shows me that this is not an R specific error.
Based on the loop you've posted, I'm guessing it's exceeding the CPU time limit on the queryDB call, due to the size of the chunks you're retrieving from the database.
I'm not sure your from and to math checks out: at rows = 1e5, you're loading 1e11 ids; if you reduce it so rows = 1, you're loading 1e6 ids from the table.
Either way, try reducing the size of the chunks you're loading from the database, and see if that helps.
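One way to make the chunk size explicit is to compute the id ranges up front and build the queries from them. This is only a sketch: chunk_bounds is a hypothetical helper, and it assumes ids run from 1 to max_id, which the question does not confirm. The table name foobar and the max_id value are taken from the question.

```r
# Build contiguous, non-overlapping (from, to] id ranges of a fixed size.
# chunk_bounds is a hypothetical helper; assumes ids run from 1 to max_id.
chunk_bounds <- function(max_id, chunk_size) {
  from <- seq(0, max_id - 1, by = chunk_size)  # exclusive lower bounds
  to <- pmin(from + chunk_size, max_id)        # inclusive upper bounds
  data.frame(from = from, to = to)
}

bounds <- chunk_bounds(max_id = 1e6, chunk_size = 1e5)

# %.0f avoids scientific notation such as 1e+05 in the SQL text
queries <- sprintf("SELECT * FROM foobar WHERE id > %.0f AND id <= %.0f;",
                   bounds$from, bounds$to)
queries[1]
```

Each query then covers exactly chunk_size ids, so shrinking the chunks is a one-argument change.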

Looking for algorithm to do long pair wise nucleotide alignments

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome (the raw reads are not available). I am using R/Bioconductor and the pairwiseAlignment function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align a 56 kbp scaffold, with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,
: cannot allocate memory block of size 17179869183.7 Gb
I am not sure whether this is a bug. I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is O(n*m), which I thought would put the computational demand on the order of 3.1E9 operations (56k * 56k ~= 3.1E9). The Needleman-Wunsch similarity matrix should likewise take on the order of 3.1 GB of memory. Either I am misremembering big-O notation, or that really is the memory needed to build the alignment given the overhead of the R scripting environment.
Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident in BLAST's reliability for correctly placing indels, and I have not yet been able to find an API as good as the one provided by Biostrings for parsing the raw BLAST alignments.
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scafseq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from the initial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases).
I have not yet found the length cutoff at which the error starts happening.
I use Clustal as an alignment tool. I am not sure about its specific performance, but it has never given me issues when doing multiple sequence alignments on large numbers of sequences. Here is a script that goes through a whole directory of .fasta files and aligns them. You can modify the flags on the system call to suit your input/output needs; just look at the Clustal documentation. This is in Perl, as I don't use R much for alignments. You need to edit the executable path in the script to match where Clustal is on your computer.
#!/usr/bin/perl
use warnings;
print "Please type the list file name of protein fasta files to align (end the directory path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;
opendir (DIR,$directory) or die $!;
my @file = readdir DIR;
closedir DIR;
my $add="_align.fasta";
foreach $file (@file) {
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile="$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}
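On the memory question raised in the original post, a quick back-of-the-envelope check in R may help. The sequence lengths come from the question. The bytes-per-cell figures are assumptions for illustration, since the internal layout used by pairwiseAlignment is not stated in this thread.

```r
# Size of a full n x m dynamic-programming matrix for the two sequences
n <- 56170   # scaffold subsequence length (from the question)
m <- 56168   # genomic subsequence length (from the question)
cells <- n * m

gib <- function(bytes) bytes / 1024^3

# Assumed cell widths: 1 byte (traceback only), 4 bytes (int), 8 bytes (double)
sizes_gib <- c(one_byte = gib(cells),
               int4     = gib(cells * 4),
               dbl8     = gib(cells * 8))
round(sizes_gib, 1)

# The cell count does not fit in a signed 32-bit integer
cells > .Machine$integer.max
```

So a full matrix needs several GiB at minimum. Since the cell count exceeds the 32-bit integer range, an overflow somewhere in the size computation could possibly explain the implausible figure in the error message, but that is a guess and has not been checked against the Biostrings source.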
