Hi all,
I was trying to load a number of Affymetrix CEL files with the standard Bioconductor command (R 2.8.1 on 64-bit Linux, 72 GB of RAM):
abatch<-ReadAffy()
But I keep getting this message:
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, :
allocMatrix: too many elements specified
What's the general meaning of this allocMatrix error? Is there some way to increase its maximum size?
Thank you
The problem is that all the core functions use INTs instead of LONGs for generating R objects. For example, your error message comes from array.c in /src/main
if ((double)nr * (double)nc > INT_MAX)
error(_("too many elements specified"));
where nr and nc are integers generated before, standing for the number of rows and columns of your matrix:
nr = asInteger(snr);
nc = asInteger(snc);
So, to cut it short, everything in the source code should be changed to LONG, possibly not only in array.c but in most core functions, and that would require some rewriting. Sorry for not being more helpful, but I guess this is the only solution. Alternatively, you may wait for R 3.x next year, and hopefully they will implement this...
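For the record, you can see the limit from R itself (a sketch; the dimensions below are just illustrative). In R 2.x any matrix with more than 2^31 - 1 cells hit this check; R 3.0.0 later lifted it with long vectors:

```r
# 2^15 * 2^16 = 2^31 elements, one past INT_MAX (2^31 - 1), so on R 2.x
# this stopped with "too many elements specified"; on R >= 3.0.0 it is
# allowed, provided you have the ~16 GB of RAM it needs
res <- try(matrix(0, nrow = 2^15, ncol = 2^16), silent = TRUE)
2^15 * 2^16 > .Machine$integer.max   # TRUE: over the 32-bit limit
```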
If you're trying to work on huge affymetrix datasets, you might have better luck using packages from aroma.affymetrix.
Also, Bioconductor is a particularly fast-moving project, and you'll typically be asked to upgrade to the latest version of R in order to get any continued "support" (help on the BioC mailing list). I see that Thrawn also mentions having a similar problem with R 2.10, but you still might think about upgrading anyway.
I bumped into this thread by chance. No, the aroma.* framework is not limited by the int-based limitation of allocMatrix(), because it does not address data through the regular address space alone - instead it also subsets via the file system. It never holds the complete data set in memory at any time. Basically the file system sets the limit, not the RAM nor the address space of your OS.
/Henrik
(author of aroma.*)
Related
I'm trying to use package hash, which I understand is the most commonly adopted implementation (other than directly using environments).
If I try to create and store hashes larger than ~20MB, I start getting protect(): protection stack overflow errors.
pryr::object_size(hash::hash(1:120000, 1:120000)) # * (see end of post)
#> 21.5 MB
h <- hash::hash(1:120000, 1:120000)
#> Error: protect(): protection stack overflow
If I run the h <- ... command once, the error only appears once. If I run it twice, I get an infinite loop of errors appearing in the console, freezing Rstudio and forcing me to restart it from the Task Manager.
From multiple other SO questions, I understand this means I'm creating more pointers than R can protect. This makes sense to me, since hashes are actually just environments (which themselves are just hash tables), so I assume R needs to keep track of each value in the hash table as a separate pointer.
The common solution I've seen for the protect() error is to use rstudio.exe --max-ppsize=500000 (which I assume propagates that option to R itself), but it doesn't help in this case; the error remains. This is somewhat surprising, since the hash in the example above is only 120,000 keys/pointers long, much smaller than the given ppsize of 500,000.
So, how can I use large hashes in R? I'm assuming changing to pure environments won't help, since hash is really just a wrapper around environments.
* For the record, the given hash::hash() call above will create hashes with non-syntactic names, but that's irrelevant: my real case has simple character keys and integer values and shows the same behavior.
This is a bug in RStudio, not a limitation in R. The bug happens when it tries to examine the h object for display in the environment pane. The bug is on their issue list as https://github.com/rstudio/rstudio/issues/5546 .
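A quick sanity check (a sketch, assuming the hash package is installed): run the same construction in a plain terminal R session, where nothing inspects h for display, and it completes without error:

```r
# Run with R --vanilla in a terminal rather than inside RStudio; the
# assignment itself is fine, it is the environment-pane inspection that
# blows the protection stack
library(hash)
h <- hash(1:120000, 1:120000)
length(h)   # number of keys stored (120000)
```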
Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as say svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo <- function(bigm, inTrain, moreParamsList) {
  parsList <- c(list(data = bigm[inTrain, ]), moreParamsList)
  do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Are there any functions or heuristics for tracking the memory location and consumption of objects created in the different cores?
Thanks.
I am just going to put in here what I found from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm, based on this excerpt from the multicore manual:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
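That copy-on-write behaviour can be sketched with the parallel package (which later absorbed multicore); this is Unix-only, and the object names and sizes here are my own illustration:

```r
# Fork-based parallelism: the children see the parent's bigv through
# copy-on-write pages, so no copy is made as long as they only read it
library(parallel)
bigv <- rnorm(1e7)                          # ~80 MB, allocated once in the parent
res <- mclapply(1:4, function(i) {
  sum(bigv[seq(i, length(bigv), by = 4)])   # read-only access, no copy of bigv
}, mc.cores = 2)
```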
For the first part of your question, you can use tracemem:
This function marks an object so that a message is printed whenever the internal code copies the object
Here is an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very convenient; see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing
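To see that a sub-matrix really does own new memory (R has no view semantics for matrix subsets), a minimal check with object.size() works (a sketch; bigm and inTrain here just mimic the question's setup):

```r
bigm <- matrix(0, nrow = 1000, ncol = 1000)
inTrain <- sample(nrow(bigm), 100)
sub <- bigm[inTrain, ]
object.size(sub)   # roughly 1/10 of object.size(bigm): freshly allocated, not a view
```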
I have some big, big files that I work with and I use several different I/O functions to access them. The most common one is the bigmemory package.
When writing to the files, I've learned the hard way to flush output buffers, otherwise all bets are off on whether the data was saved. However, this can lead to some very long wait times while bigmemory does its thing (many minutes). I don't know why this happens - it doesn't always occur and it's not easily reproduced.
Is there some way to determine whether or not I/O buffers have been flushed in R, especially for bigmemory? If the operating system matters, then feel free to constrain the answer in that way.
If an answer can be generalized beyond bigmemory, that would be great, as I sometimes rely on other memory mapping functions or I/O streams.
If there are no good solutions to checking whether buffers have been flushed, are there cases in which it can be assumed that buffers have been flushed? I.e. besides using flush().
Update: I should clarify that these are all binary connections. @RichieCotton suggested isIncomplete(), though the help documentation only mentions text connections, so it's not clear whether it is usable for binary connections.
Is this more convincing that isIncomplete() works with binary files?
# R process 1
zz <- file("~/test", "wb")
writeBin(c(1:100000),con=zz)
close(zz)
# R process 2
zz2 <- file("~/test", "rb")
inpp <- readBin(con=zz2, integer(), 10000)
while(isIncomplete(zz2)) {Sys.sleep(1); inpp <- c(inpp, readBin(con=zz2, integer(), 10000))}
close(zz2)
(Modified from the help(connections) file.)
I'll put forward my own answer, but I welcome anything that is clearer.
From what I've seen so far, the various connection functions, e.g. file, open, close, flush, isOpen, and isIncomplete (among others), are based on specific connection types, e.g. files, pipes, URLs, and a few other things.
In contrast, bigmemory has its own connection type and the bigmemory object is an S4 object with a slot for a memory address for operating system buffers. Once placed there, the OS is in charge of flushing those buffers. Since it's an OS responsibility, then getting information on "dirty" buffers requires interacting with the OS, not with R.
Thus, the answer for bigmemory is "no" as the data is stored in the kernel buffer, though it may be "yes" for other connections that are handled through STDIO (i.e. stored in "user space").
For more insight on the OS / kernel side of things, see this question on SO; I am investigating a couple of programs (not just R + bigmemory) that are producing buffer flushing curiosities, and that thread helped to enlighten me about the kernel side of things.
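For ordinary file connections (unlike bigmemory's memory-mapped backing files), a minimal sketch of the distinction: flush() empties R's own buffer into the kernel, and whether the kernel has reached the disk is then the OS's responsibility, with no direct fsync() equivalent at the R level:

```r
f <- tempfile()
zz <- file(f, "wb")
writeBin(1:1000, zz)
flush(zz)            # R/stdio buffer handed to the kernel
close(zz)            # close() flushes as well
file.size(f)         # 4000 bytes: 1000 four-byte integers
```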
I have started using the doMC package for R as the parallel backend for parallelised plyr routines.
The parallelisation itself seems to be working fine (though I have yet to properly benchmark the speedup); my problem is that the logging is now asynchronous and messages from different cores get mixed in together. I could create different logfiles for each core, but I think a neater solution is to simply add a different label for each core. I am currently using the log4r package for my logging needs.
I remember when using MPI that each processor got a rank, which was a way of distinguishing each process from one another, so is there a way to do this with doMC? I did have the idea of extracting the PID, but this does seem messy and will change for every iteration.
I am open to ideas though, so any suggestions are welcome.
EDIT (2011-04-08): Going with the suggestion of one answer, I still have the issue of correctly identifying which subprocess I am currently inside, as I would either need separate closures for each log() call so that it writes to the correct file, or I would have a single log() function, but have some logic inside it determining which logfile to append to. In either case, I would still need some way of labelling the current subprocess, but I am not sure how to do this.
Is there an equivalent of the mpi_rank() function in the MPI library?
I think having multiple processes write to the same file is a recipe for disaster (it's just a log though, so maybe "disaster" is a bit strong).
Oftentimes I parallelize work over chromosomes. Here is an example of what I'd do (I've mostly been using foreach/doMC):
foreach(chr=chromosomes, ...) %dopar% {
  cat("+++", chr, "+++\n")
  ## ... some undoubtedly amazing code would then follow ...
}
And it wouldn't be unusual to get output that tramples over each other ... something like (not exactly) this:
+++chr1+++
+++chr2+++
++++chr3++chr4+++
... you get the idea ...
If I were in your shoes, I think I'd split the logs for each process and set their respective filenames to be unique with respect to something happening in that process's loop (like chr in my case above). Collate them later if you must ... i.e. map/reduce your log files :-)
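The split-then-collate idea might look like this (a sketch; it assumes foreach/doMC are installed, and the file naming is my own):

```r
library(foreach)
library(doMC)
registerDoMC(2)

chromosomes <- paste0("chr", 1:4)
logdir <- tempdir()

foreach(chr = chromosomes) %dopar% {
  # one logfile per task, keyed by chr, so no two workers ever share a file
  logfile <- file.path(logdir, paste0(chr, ".log"))
  cat("+++", chr, "+++\n", file = logfile, append = TRUE)
  ## ... some undoubtedly amazing code would then follow ...
}

# collate afterwards, in a deterministic order
logs <- unlist(lapply(chromosomes, function(chr)
  readLines(file.path(logdir, paste0(chr, ".log")))))
```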
I'm working with large numbers that I can't have rounded off. Using Lua's standard math library, there seems to be no convenient way to preserve precision past some internal limit. I also see there are several libraries that can be loaded to work with big numbers:
http://oss.digirati.com.br/luabignum/
http://www.tc.umn.edu/~ringx004/mapm-main.html
http://lua-users.org/lists/lua-l/2002-02/msg00312.html (might be identical to #2)
http://www.gammon.com.au/scripts/doc.php?general=lua_bc (but I can't find any source)
Further, there are many C libraries that could be called from Lua, if the bindings were established.
Have you had any experience with one or more of these libraries?
Using lbc instead of lmapm would be easier because lbc is self-contained.
local bc = require"bc"
local s = bc.pow(2, 1000):tostring()
local z = 0
for i = 1, #s do
  z = z + s:byte(i) - ("0"):byte(1)
end
print(z)
I used Norman Ramsey's suggestion to solve Project Euler problem #16. I don't think it's a spoiler to say that the crux of the problem is calculating a 302-digit integer accurately.
Here are the steps I needed to install and use the library:
Lua needs to be built with dynamic loading enabled. I use Cygwin, but I changed PLAT in src/Makefile to be linux. The default, none, doesn't enable dynamic loading.
The MAPM library needs to be built and installed somewhere that your C compiler can find it. I put libmapm.a in /usr/local/lib/. Next, m_apm.h and m_apm_lc.h went to /usr/local/include/.
The makefile for lmapm needs to be altered to point to the correct locations of the Lua and MAPM libraries. For me, that meant uncommenting the second declaration of LUA, LUAINC, LUALIB, and LUABIN and editing the declaration of MAPM.
Finally, mapm.so needs to be placed somewhere that Lua will find it. I put it at /usr/local/lib/lua/5.1/.
Thank you all for the suggestions!
The lmapm library by Luiz Henrique de Figueiredo, one of the authors of the Lua language.
I can't really answer, but I will add LGMP, a GMP binding, to the list. I haven't used it myself.
Not my field of expertise, but I would expect the GNU multiple precision arithmetic library to be quite a standard here, no?
Though not arbitrary precision, Lua decNumber, a Lua 5.1 wrapper for IBM decNumber, implements the proposed General Decimal Arithmetic standard IEEE 754r. It has the Lua 5.1 arithmetic operators and more, full control over rounding modes, and working precision up to 69 decimal digits.
There are several libraries for this problem, each with its advantages and disadvantages; the best choice depends on your requirements. I would say lbc is a good first pick if it fulfills your requirements, or any other library by Luiz Henrique de Figueiredo. The most efficient is probably anything using GMP bindings, as GMP is a standard C library for dealing with large integers and is very well optimized.
Nevertheless, in case you are looking for a pure Lua option, the lua-bint library could be a choice for dealing with big integers. I wouldn't say it's the best, because there are more efficient and complete ones such as those mentioned above, but they usually require compiling C code or can be troublesome to set up. Among pure Lua big integer libraries, and depending on your use case, it could be an efficient choice. The library is documented, the code is fully covered by tests, and there are many examples. But take this recommendation with a grain of salt, because I am the library's author.
To install, you can use luarocks if you already have it on your computer, or simply drop the
bint.lua
file into your project, as it has no dependencies other than Lua 5.3+.
Here is a small example using it to solve the problem #16 from Project Euler
(mentioned in previous answers):
local bint = require 'bint'(1024)
local n = bint(1) << 1000
local digits = tostring(n)
local sum = 0
for i = 1, #digits do
  sum = sum + tonumber(digits:sub(i, i))
end
print(sum) -- should output 1366