How to adjust future.global.maxSize? - r

Here's a useless example that shows my point
library(future)
a = 1:200000000
object.size(a)
test %<-% head(a)
I get the following error:
Error in getGlobalsAndPackages(expr, envir = envir, persistent =
persistent, : The total size of all global objects that need to be
exported for the future expression (‘head(a)’) is 762.95 MiB. This
exceeds the maximum allowed size of 500.00 MiB (option
'future.global.maxSize'). There are two globals: ‘a’ (762.94 MiB of
class ‘numeric’) and ‘head’ (10.05 KiB of class ‘function’).
Can anyone help me understand how to adjust that future.global.maxSize option? I tried options(future.global.maxSize = 1500000) but that didn't work.

Got it figured out, and learned how you can set options for any package.
This is the line that I used (edit: the change was from 'global' to 'globals'):
options(future.globals.maxSize= 891289600)
If you want to customize your limit, I saw in the package source how the limit is calculated, and this is how you would compute the value for an 850 MiB limit:
850*1024^2 = 891289600
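Putting it together, a minimal sketch (the 850 MiB figure is just an example limit):
library(future)

# allow up to 850 MiB of globals to be exported (850 * 1024^2 bytes)
options(future.globals.maxSize = 850 * 1024^2)

a <- 1:200000000    # object.size(a) is about 762.94 MiB
test %<-% head(a)   # should no longer trip the size check, since 762.94 MiB < 850 MiB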
Thanks!

Related

Optimization in R: Levenberg-Marquardt using nls.lm in minpack.lm: resetting `maxiter' to 1024

I am trying to learn how to work with nls.lm in the R library minpack.lm by using the Rosenbrock function to see if the algorithm converges to the global minimum at (x, y) = (1, 1). I do so both with and without the analytic Jacobian. In both instances, I get a warning telling me that the algorithm has reset the maximum number of iterations specified in the call to nls.lm to 1024:
Warning messages:
1: In nls.lm(par = initpar, fn = objective_rosenbrock, jac = gradient_rosenbrock, :
resetting `maxiter' to 1024!
2: In nls.lm(par = initpar, fn = objective_rosenbrock, jac = gradient_rosenbrock, :
lmder: info = -1. Number of iterations has reached `maxiter' == 1024.
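For reference, a minimal sketch of the kind of call that produces these warnings (the residual form of the Rosenbrock function and the maxiter value are my reconstruction; only the function and argument names come from the warnings above):
library(minpack.lm)

# Rosenbrock in residual form: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
objective_rosenbrock <- function(par) {
  c(1 - par[1], 10 * (par[2] - par[1]^2))
}

# analytic Jacobian of the residual vector
gradient_rosenbrock <- function(par) {
  rbind(c(-1, 0),
        c(-20 * par[1], 10))
}

initpar <- c(-1.2, 1.0)

# asking for more than 1024 iterations triggers the cap in nls_lm.c
fit <- nls.lm(par = initpar, fn = objective_rosenbrock, jac = gradient_rosenbrock,
              control = nls.lm.control(maxiter = 2000))
fit$par  # ideally approaches c(1, 1)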
As a result, given my initial guess of (-1.2, 1.0), the algorithm never quite reaches (1, 1). I found the source code for the library on GitHub, and the following lines are pertinent here:
https://github.com/cran/minpack.lm/blob/master/src/nls_lm.c
OS->maxiter = INTEGER_VALUE(getListElement(control, "maxiter"));
if (OS->maxiter > 1024) {
    OS->maxiter = 1024;
    warning("resetting `maxiter' to 1024!");
}
Is there any logic to why the maximum number of iterations is capped at 1024? Something to do with bits and 2^10? I would like to use the library for a different application, but this cap on iterations might prevent that. Any insight would be appreciated.
Git blame says that this code limiting the max iterations was introduced in version 1.1-0, in 2008. The NEWS file for the package only goes back as far as version 1.1-6. I can't find the code in any public repo other than the one you point to (which is only a CRAN mirror; it doesn't contain any comments/commit messages/etc. from developers that might give us clues.)
Other than contacting the maintainer I think it's going to be hard to figure out what the rationale is for this limit.
I do have some guesses though.
The only places that maxiter is actually used in the code are here and here - in R code, not Fortran or C code, so it seems extremely unlikely that we are dealing with something like a 10-bit unsigned integer type (which seems an unlikely choice in any case). I think the limitation is there because we also have a buffer defined for holding trace information here:
double rsstrace[1024];
which, as you can see, is hard-coded to a length of 1024. Presumably bad things would happen if we tried to stuff 1025 iterations' worth of tracing information into this array ...
My suggestions:
1. Change all instances of '1024' in the code to something larger and see what happens. There are only four:
$ find . -type f -exec grep -Hn 1024 {} \;
./src/nls_lm.c:141: if(OS->maxiter > 1024) {
./src/nls_lm.c:142: OS->maxiter = 1024;
./src/nls_lm.c:143: warning("resetting `maxiter' to 1024!");
./src/minpack_lm.h:20: double rsstrace[1024];
It would be best to #define MAXITER 2048 (or whatever) in src/minpack_lm.h and use that instead of the numerical value.
2. Contact the maintainer (maintainer("minpack.lm")) and ask them about this issue.

Complete R Session Size

Because I constantly reach the memory limit in my R session (8 GB Windows PC), I have started removing the big objects I load in. However, once I reach this limit, removing objects no longer seems to work.
So, I was wondering if there's a way to get the R session size. I know that it's possible to retrieve an object's size (as seen in this thread). I want to know, though, if there's a way to measure the complete R session size (loaded packages, objects, etc.).
Thank you!
I personally use this function to get the available memory:
getAvailMem <- function(format = TRUE) {
  gc()
  if (Sys.info()[["sysname"]] == "Windows") {
    memfree <- 1024^2 * (utils::memory.limit() - utils::memory.size())
  } else {
    # http://stackoverflow.com/a/6457769/6103040
    memfree <- 1024 * as.numeric(
      system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
  }
  `if`(format, format(structure(memfree, class = "object_size"),
                      units = "auto"), memfree)
}
To get the total memory used by R, you may try mem_used() from the pryr package. Unlike memory.size, it is not OS-dependent, because it uses the R function gc() underneath. Have a look at the function body, and also at pryr:::node_size and pryr:::show_bytes.
pryr::mem_used()
The help file ?pryr::mem_used describes
R breaks down memory usage into Vcells (memory used by vectors) and
Ncells (memory used by everything else). However, neither this
distinction nor the "gc trigger" and "max used" columns are typically
important. What we're usually most interested in is the first
column: the total memory used. This function wraps around gc() to
return the total amount of memory (in megabytes) currently used by R.
You can also use pryr::mem_change to track how the memory used changes as your R code runs. Try the example in its documentation page.
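A minimal sketch of both (assuming the pryr package is installed):
library(pryr)
mem_used()               # total memory currently used by the R session
mem_change(x <- 1:1e6)   # memory change caused by creating an object
mem_change(rm(x))        # memory released when the object is removed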
The numbers such as 28L and 56L used for the node size in pryr:::node_size come from the help file ?gc, which describes:
gc returns a matrix with rows "Ncells" (cons cells), usually 28 bytes
each on 32-bit systems and 56 bytes on 64-bit systems, and "Vcells"
(vector cells, 8 bytes each),
After removing a large object, run gc() to free memory.
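For example (big_object is just a placeholder name):
rm(big_object)  # drop the reference to the large object
gc()            # prompt R to release the freed memory back to the OS where possible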

TensorFlow: Can't invoke streaming_sparse_precision_at_k

Upon trying to calculate precision@k, I get an exception. What follows is a simple piece of code that reproduces the problem.
First the code defines the variable scope:
initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=1234)
with tf.variable_scope("model", reuse=None, initializer=initializer):
Then it executes these lines:
predictions = tf.Variable(tf.ones([2, 10], tf.int64))
labels = tf.Variable(tf.ones([2, 1], tf.int64))
precision = tf.contrib.metrics.streaming_sparse_precision_at_k(predictions, labels, 5)
tf.initialize_all_variables().run()
(I know this code is meaningless, and tries to calculate the precision given 2 fixed matrices...)
Then I get the following exception:
W tensorflow/core/framework/op_kernel.cc:936] Failed precondition: Attempting to use uninitialized value model/precision_at_5/false_positive_at_5
[[Node: model/precision_at_5/false_positive_at_5/read = Identity[T=DT_DOUBLE, _class=["loc:@model/precision_at_5/false_positive_at_5"], _device="/job:localhost/replica:0/task:0/gpu:0"]]]
The same goes when I tried to invoke streaming_sparse_recall_at_k instead of streaming_sparse_precision_at_k.
The installed version is r0.10 on linux with python 2.7.
Please help... Thanks in advance :)
Unfortunately, tf.initialize_all_variables() doesn't initialize "local" variables (which tend to be internal implementation details for ops like tf.contrib.metrics.streaming_sparse_precision_at_k() and tf.train.string_input_producer(), as opposed to variables used as model weights).
You'll need to add a line to your program that runs tf.initialize_local_variables() before running the evaluation op:
sess.run(tf.initialize_local_variables()) # or `tf.initialize_local_variables().run()`

Looking for algorithm to do long pair wise nucleotide alignments

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome (the raw reads are not available). I am using R/Bioconductor and the pairwiseAlignment function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align a 56 kbp scaffold, with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,
: cannot allocate memory block of size 17179869183.7 Gb
I am not sure if this is a bug or not; I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is O(n*m), which I thought would put the computational demand on the order of 3.1E9 operations (56k * 56k ~= 3.1E9). It seems the Needleman-Wunsch similarity matrix should likewise take up on the order of 3.1 gigs of memory. I'm not sure if I'm misremembering big-O notation, or if that really is the memory overhead needed to build the alignment given the overhead of the R scripting environment.
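As a rough back-of-the-envelope check (a sketch only; the 8 bytes per cell is my assumption about how the matrix is stored, not something taken from the Biostrings source):
n <- 56170                   # length of the scaffold subsequence
m <- 56168                   # length of the genomic subsequence
cells <- as.numeric(n) * m   # ~3.16e9 matrix cells, matching the operation-count estimate
cells * 8 / 1024^3           # ~23.5 GiB if each cell holds an 8-byte double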
Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align against. I am not entirely confident in BLAST's reliability for correctly placing indels, and I have not yet been able to find an API as good as the one provided by Biostrings for parsing the raw BLAST alignments.
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases).
I have not yet found the length cutoff where the error starts happening.
I use Clustal as an alignment tool. I'm not sure about its specific performance, but it has never given me issues when doing multiple sequence alignments of large quantity. Here is a script that runs over a whole directory of .fasta files and aligns them. You can modify the flags on the system call to suit your input/output needs; just look at the Clustal documentation. This is in Perl, as I don't use R much for alignments. You will need to edit the executable path in the script to match where clustalw2 is on your computer.
#!/usr/bin/perl
use warnings;

print "Please type the list file name of protein fasta files to align (end the directory path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;

opendir (DIR, $directory) or die $!;
my @file = readdir DIR;
closedir DIR;

my $add = "_align.fasta";

foreach $file (@file) {
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile = "$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}

System Invocation Error when Running Cspade - R

I am trying to run the cspade function in the arulesSequences package in R. After I successfully read in my transactions using read_baskets, I try to execute the cspade function against the transactions object I read in.
However, when I execute the command, I obtain an error: system invocation failed. Specifically, here is the output.
preprocessing ... 1 partition(s), 1.2 MB [0.23s]
mining transactions ...Error in cspade(table, parameter = list(support = 0.1), control = list (verbose = TRUE)) :
system invocation failed
The presence of "mining transactions" indicates that the following function call in the cspade code is failing.
if (system2(file.path(exe, "spade"), args = c("-i", file,
    "-s", parameter@support, opt, "-e", nop, "-o"), stdout = out))
    stop("system invocation failed")
For reference, I can successfully generate sequences using the example zaki dataset.
Does anyone have any idea why that command might be failing?
Thanks,
Stewart
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE#Caveats
See the link above, in particular the caveats section, and lower your support threshold.
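For instance, a minimal sketch based on the call shown in the question (the 0.01 value is only an illustration):
seqs <- cspade(table, parameter = list(support = 0.01),  # lowered from 0.1
               control = list(verbose = TRUE))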
I had the same problem. I decreased the support (even set it to 0) and it still didn't work, but I came up with another solution: I reduced the data size (fewer rows) and it worked.
So if, for example, you divide your data set into two parts of half the size and make transaction data of each part as x_1 and x_2, you can mine each part separately:
frequent_pattern_1 <- cspade(x_1, parameter = list(support = 0))
frequent_pattern_2 <- cspade(x_2, parameter = list(support = 0))
Then get the absolute support from analysing each part so the results can be combined over the whole data set.
Get the absolute support values:
support_x_1<-support(frequent_pattern_1 ,x_1, type= c("absolute"), control = NULL)
support_x_2<-support(frequent_pattern_2 ,x_2, type= c("absolute"), control = NULL)
Then find the matching sequences and sum the supports of the matched ones.
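A hedged sketch of that last step (the coercion to data.frame and the merge are my own suggestion, not something taken from the original answer or the arulesSequences documentation):
# label each mined sequence and attach its absolute support
df_1 <- as(frequent_pattern_1, "data.frame")  # columns: sequence, support
df_1$support <- support_x_1
df_2 <- as(frequent_pattern_2, "data.frame")
df_2$support <- support_x_2

# match sequences appearing in either half and sum their absolute supports
combined <- merge(df_1, df_2, by = "sequence", all = TRUE, suffixes = c("_1", "_2"))
combined[is.na(combined)] <- 0
combined$total_support <- combined$support_1 + combined$support_2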
