Some references:
This is a follow-up to Why is processing a sorted array faster than processing an unsorted array?
The only post in the r tag that I found somewhat related to branch prediction was Why sampling matrix row is very slow?
Explanation of the problem:
I was investigating whether processing a sorted array is faster than processing an unsorted one (the same problem tested in Java and C in the first link), to see whether branch prediction affects R in the same way.
See the benchmark examples below:
set.seed(128)
# or use a smaller vector, e.g. 1e7 elements
myvec <- rnorm(1e8, 128, 128)
myvecsorted <- sort(myvec)
mysumU = 0
mysumS = 0
SvU <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec)) {
    if (myvec[i] > 128) {
      mysumU = mysumU + myvec[i]
    }
  },
  Sorted = for (i in 1:length(myvecsorted)) {
    if (myvecsorted[i] > 128) {
      mysumS = mysumS + myvecsorted[i]
    }
  },
  times = 10)
ggplot2::autoplot(SvU)
Question:
First, why is the "Sorted" vector not always the fastest, and why is the speedup not of the same magnitude as in the Java example?
Second, why does the sorted execution time have a higher variance than the unsorted one?
N.B. My CPU is an i7-6820HQ @ 2.70GHz Skylake, quad-core with hyperthreading.
Update:
To investigate the variation part, I did the microbenchmark with the vector of 100 million elements (n=1e8) and repeated the benchmark 100 times (times=100). Here's the associated plot with that benchmark.
Here's my sessioninfo:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods base
other attached packages:
[1] rstudioapi_0.10 reprex_0.3.0 cli_1.1.0 pkgconfig_2.0.3 evaluate_0.14 rlang_0.4.0
[7] Rcpp_1.0.2 microbenchmark_1.4-7 ggplot2_3.2.1
Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.
R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)
The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.
Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.
Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.
The classic interpreter loop is based around an indirect branch that dispatches from a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know if R uses a single indirect branch like that or if it tries to be fancier, like having the end of each interpreter block dispatch to the next one instead of returning to a main dispatch loop.
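To make that structure concrete, here is a minimal sketch of such a dispatch loop, written in R purely for illustration (this is my own toy example, not R's actual implementation): each "opcode" indexes into a table of handler functions, and the single indirect call inside the loop is the branch the CPU's indirect-branch predictor has to guess on every iteration.
# Toy table-driven interpreter: handlers is the "table of function pointers".
handlers <- list(
  INC = function(state) { state$acc <- state$acc + 1; state },
  DBL = function(state) { state$acc <- state$acc * 2; state },
  NOP = function(state) state
)
run <- function(program) {
  state <- list(acc = 0)
  for (op in program) {
    state <- handlers[[op]](state)  # indirect dispatch: the target depends on the data
  }
  state$acc
}
run(c("INC", "INC", "DBL", "NOP"))  # returns 4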
Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken state of previous branches along the path of execution are used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.
Branch Prediction and the Performance of Interpreters - Don't Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what dispatch strategy R actually uses; various tricks were useful on older predictors.
X86 prefetching optimizations: "computed goto" threaded code has more links.
https://comparch.net/2013/06/30/why-tage-is-the-best/
https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.
The if() being taken does result in needing to do different operations, so it does actually still make some branching in the R interpreter more or less predictable depending on data. But of course as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.
So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.
Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.
If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.
Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.
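As a practical aside (my addition, not part of the original answer): the same conditional sum can be written with R's vectorized primitives, which pushes the per-element loop and the comparison into compiled code; that usually dwarfs any branch-prediction effect visible from the interpreted loop.
# Vectorized version of the benchmarked loop: the comparison and the sum run
# in compiled code, so the interpreter dispatch overhead discussed above is gone.
mysumV <- sum(myvec[myvec > 128])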
Related
I am running into issues trying to use large objects in R. For example:
> memory.limit(4000)
> a = matrix(NA, 1500000, 60)
> a = matrix(NA, 2500000, 60)
> a = matrix(NA, 3500000, 60)
Error: cannot allocate vector of size 801.1 Mb
> a = matrix(NA, 2500000, 60)
Error: cannot allocate vector of size 572.2 Mb # Can't go smaller anymore
> rm(list=ls(all=TRUE))
> a = matrix(NA, 3500000, 60) # Now it works
> b = matrix(NA, 3500000, 60)
Error: cannot allocate vector of size 801.1 Mb # But that is all there is room for
I understand that this is related to the difficulty of obtaining contiguous blocks of memory (from here):
Error messages beginning cannot
allocate vector of size indicate a
failure to obtain memory, either
because the size exceeded the
address-space limit for a process or,
more likely, because the system was
unable to provide the memory. Note
that on a 32-bit build there may well
be enough free memory available, but
not a large enough contiguous block of
address space into which to map it.
How can I get around this? My main difficulty is that I get to a certain point in my script and R can't allocate 200-300 Mb for an object... I can't really pre-allocate the block because I need the memory for other processing. This happens even when I dilligently remove unneeded objects.
EDIT: Yes, sorry: Windows XP SP3, 4Gb RAM, R 2.12.0:
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_Caribbean.1252 LC_CTYPE=English_Caribbean.1252
[3] LC_MONETARY=English_Caribbean.1252 LC_NUMERIC=C
[5] LC_TIME=English_Caribbean.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
Consider whether you really need all this data explicitly, or can the matrix be sparse? There is good support in R (see e.g. the Matrix package) for sparse matrices.
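A hedged sketch of the sparse approach (the 3,500,000 x 60 shape comes from the question; sparsity only pays off if most entries really are zero):
library(Matrix)
# A sparse matrix stores only its non-zero cells, instead of the ~800 MB
# a dense matrix of this shape needs.
m <- Matrix(0, nrow = 3500000, ncol = 60, sparse = TRUE)
m[1, 1] <- 2.5
print(object.size(m), units = "auto")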
Keep all other processes and objects in R to a minimum when you need to make objects of this size. Use gc() to clear now-unused memory, or, better, only create the objects you need in one session.
If the above cannot help, get a 64-bit machine with as much RAM as you can afford, and install 64-bit R.
If you cannot do that there are many online services for remote computing.
If you cannot do that the memory-mapping tools like package ff (or bigmemory as Sascha mentions) will help you build a new solution. In my limited experience ff is the more advanced package, but you should read the High Performance Computing topic on CRAN Task Views.
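A hedged sketch of the file-backed approach with bigmemory (the file names and dimensions are placeholders; ff offers an analogous ff() constructor):
library(bigmemory)
# The data live in a memory-mapped file on disk, so the matrix is not limited
# by what the 32-bit R process can hold in RAM at once.
a <- filebacked.big.matrix(nrow = 3500000, ncol = 60, type = "double",
                           backingfile = "a.bin", descriptorfile = "a.desc")
a[1, 1] <- 42
a[1, 1:5]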
For Windows users, the following helped me a lot to understand some memory limitations:
before opening R, open the Windows Resource Monitor (Ctrl-Alt-Delete / Start Task Manager / Performance tab / click on bottom button 'Resource Monitor' / Memory tab)
you will see how much RAM is already used before you open R, and by which applications. In my case, 1.6 GB of the total 4 GB are used, so I will only be able to get 2.4 GB for R; but now comes the worst part...
open R and create a data set of 1.5 GB, then reduce its size to 0.5 GB; the Resource Monitor shows my RAM being used at nearly 95%.
use gc() to do garbage collection => it works: I can see the memory use go down to 2 GB
Additional advice that works on my machine:
prepare the features, save them as an RData file, close R, re-open R, and load the train features (a minimal sketch of this follows the list). The Resource Monitor typically shows a lower memory usage, which means that even gc() does not recover all possible memory; closing and re-opening R works best to start with the maximum memory available.
the other trick is to only load the train set for training (do not load the test set, which can typically be half the size of the train set). The training phase can use memory to the maximum (100%), so anything available is useful. Take all this with a grain of salt, as I am still experimenting with R's memory limits.
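A hedged sketch of the save/restart/reload workflow mentioned above ("train" is a placeholder object name):
# Save the prepared features, quit and restart R so the process returns its
# memory to the OS, then reload only what the training step needs.
save(train, file = "train_features.RData")
# ... close and re-open R ...
load("train_features.RData")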
I followed the help page for memory.limit and found out that on my computer R can by default use up to ~1.5 GB of RAM, and that the user can increase this limit. Using the following code,
> memory.limit()
[1] 1535.875
> memory.limit(size=1800)
helped me to solve my problem.
Here is a presentation on this topic that you might find interesting:
http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
I haven't tried the discussed things myself, but the bigmemory package seems very useful
The simplest way to sidestep this limitation is to switch to 64 bit R.
I encountered a similar problem, and I used 2 flash drives as 'ReadyBoost'. The two drives gave an additional 8 GB of memory (for caching), which solved the problem and also increased the speed of the system as a whole.
To use ReadyBoost, right-click on the drive, go to Properties, select the 'ReadyBoost' tab, choose the 'Use this device' radio button, and click Apply or OK to configure it.
If you are running your script in a Linux environment with the LSF job scheduler, you can use this command:
bsub -q server_name -R "rusage[mem=requested_memory]" "Rscript script_name.R"
and the server will allocate the requested memory for you (within the server's limits; on a well-provisioned server, very large files can be used)
One option is to run gc() ("garbage collection") before and after the command that causes high memory consumption; this frees up memory for your analyses, in addition to using the memory.limit() command.
Example:
gc()
memory.limit(9999999999)
fit <- lm(Y ~ X)
gc()
The save/load method mentioned above works for me. I am not sure how/if gc() defrags the memory but this seems to work.
# defrag memory
save.image(file="temp.RData")
rm(list=ls())
load(file="temp.RData")
The number is 112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869
My code is below
import math

def getfactors(number):
    factors = []
    # Note: math.sqrt converts to float, which is inexact for integers this
    # large; math.isqrt(number) would give the exact integer square root.
    for potentialFactor in range(1, int(math.sqrt(number)) + 1):
        if number % potentialFactor == 0:
            factors.append(potentialFactor)
    return factors
and the input is
getfactors(112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869)
The program has been running for at least 3 hours and I still have no results from it yet. The code works with other numbers too. Is there any algorithm or method that I could use to speed this up?
Your method will take far too long to factor the given number, since it is an RSA-style modulus whose two prime factors are close to each other in size. Even sieving with the Sieve of Eratosthenes won't help: the number has 326 bits, and there is no way to sieve all the way up to its ~163-bit square root. It is only slightly smaller than the first RSA challenge, RSA-100, which has 330 bits.
Use existing libraries like the
CADO-NFS ; http://cado-nfs.gforge.inria.fr/
NFS factoring: http://gilchrist.ca/jeff/factoring/nfs_beginners_guide.html
factoring as a service https://seclab.upenn.edu/projects/faas/
The experiments
I tried Pollard's p-1 algorithm; it ran for a day and a half without producing a result. That is expected, since the bound B would need to be around 2^55 to succeed with probability 1/27. I stopped the experiment after CADO-NFS succeeded. This was a self-implemented Pollard's p-1; an optimized version can be found in GMP-ECM.
I tried CADO-NFS. The stable version may not compile easily on new machines, so prefer the actively developed version from GitLab.
After ~6 hours with 4 cores, CADO-NFS produced the result. As expected, this is an RSA CTF challenge. Since I don't want to spoil the fun, here are the hash commitments of the two primes, SHA-512 computed with OpenSSL:
echo -n "prime x" | openssl sha512
27c64b5b944146aa1e40b35bd09307d04afa8d5fa2a93df9c5e13dc19ab032980ad6d564ab23bfe9484f64c4c43a993c09360f62f6d70a5759dfeabf59f18386
faebc6b3645d45f76c1944c6bd0c51f4e0d276ca750b6b5bc82c162e1e9364e01aab42a85245658d0053af526ba718ec006774b7084235d166e93015fac7733d
Details of the experiment
CPU : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
RAM : 32GB - doesn't require much RAM, at least during polynomial selection and sieving.
Dedicated cores : 4
Test machine Ubuntu 20.04.1 LTS
CUDA - NO
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake version 3.16.3
external libraries: Nothing out of Ubuntu's canonicals
CADO-NFS version : GitLab development version, cloned on 23-01-2021
The bit sizes:
n has 326 bits (the RSA-100 challenge has 330 bits and was broken by Lenstra in 1991)
p has 165 bits
q has 162 bits
cado-nfs-2.3.0 did not compile, giving errors about HWLOC (HWLOC_TOPOLOGY_FLAG_IO_DEVICES). I asked a friend to test the compile and it worked for them, on an older Linux version, so I decided to use the GitLab version.
What do you know about this number? If it is an RSA public key then it only has two large prime factors. If it is a random number then it will very probably have small prime factors. The type of number will determine how you want to approach factorising it.
Two ancillary functions will also be useful. First a Sieve of Eratosthenes to build a list of primes up to, say 50,000 or some convenient limit. Second a large number prime test, such as Miller-Rabin, to check if the residue is prime or not.
Use the Sieve of Eratosthenes to give you all the small primes up to a convenient limit. Test each prime in turn up to the square root of the target number. When you find a prime that divides the test number, divide it out to make the test number smaller; a prime may divide it more than once. Reset the prime limit to the square root of the smaller number once all the divisions are finished. In Python, that inner step looks roughly like:
if numToTest % trialFactor == 0:
    while numToTest % trialFactor == 0:    # a prime may divide more than once
        listOfFactors.append(trialFactor)
        numToTest //= trialFactor
    primeLimit = math.isqrt(numToTest)     # shrink the limit after each reduction
Once the number you are testing has been reduced to 1, you have completely factored it.
If you run out of primes before completely factoring the target, it is worth running the Miller-Rabin test 64 times to see if the remainder is itself prime; potentially that can save you a lot of work trying to find non-existent factors of a large prime. If the remainder is composite then you can either try again with a larger sieve or use one of the heavy duty factoring methods: Quadratic Sieve, Elliptic Curve etc.
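As a hedged aside in R (my suggestion, since the other answers in this thread already use R-based tooling): the gmp package exposes such a probabilistic primality test, so checking whether a large remainder is prime is a one-liner.
library(gmp)
n <- as.bigz("112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869")
# isprime runs reps rounds of a probabilistic (Miller-Rabin style) test:
# 0 = composite, 1 = probably prime, 2 = provably prime.
isprime(n, reps = 64)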
I have authored a library for the R programming language called RcppBigIntAlgos that can factor these types of numbers in a reasonable amount of time (not as fast as using the excellent CADO-NFS library as used in @kelalaka's answer).
As others have pointed out, your number is going to be very difficult to factor by any program you roll yourself unless you invest a considerable amount of time into it. For me personally, I have invested thousands of hours. Your mileage may vary.
Here is my test run in a vanilla R console (no special IDE):
numDig99 <- "112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869"
## install.packages("RcppBigIntAlgos") if necessary
library(RcppBigIntAlgos)
prime_fac <- quadraticSieve(numDig99, showStats=TRUE, nThreads=8)
Summary Statistics for Factoring:
112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869
| MPQS Time | Complete | Polynomials | Smooths | Partials |
|--------------------|----------|-------------|------------|------------|
| 11h 13m 25s 121ms | 100% | 11591331 | 8768 | 15707 |
| Mat Algebra Time | Mat Dimension |
|--------------------|--------------------|
| 1m 39s 519ms | 24393 x 24475 |
| Total Time |
|--------------------|
| 11h 15m 12s 573ms |
And just as @kelalaka, we obtain the same hashed values (Again, ran right in the R console):
system(sprintf("printf %s | openssl sha512", prime_fac[1]))
faebc6b3645d45f76c1944c6bd0c51f4e0d276ca750b6b5bc82c162e1e9364e01aab42a85245658d0053af526ba718ec006774b7084235d166e93015fac7733d
system(sprintf("printf %s | openssl sha512", prime_fac[2]))
27c64b5b944146aa1e40b35bd09307d04afa8d5fa2a93df9c5e13dc19ab032980ad6d564ab23bfe9484f64c4c43a993c09360f62f6d70a5759dfeabf59f18386
The algorithm implemented in RcppBigIntAlgos::quadraticSieve is the Multiple Polynomial Quadratic Sieve. There is a more efficient version of the quadratic sieve known as the Self Initializing Quadratic Sieve; however, there isn't as much literature available on it.
Here are my machine specs:
MacBook Pro (15-inch, 2017)
Processor: 2.8 GHz Quad-Core Intel Core i7
Memory: 16 GB 2133 MHz LPDDR3
And here is my R info:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RcppBigIntAlgos_1.0.1 gmp_0.6-0
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3 Rcpp_1.0.5
Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6G are probably beyond some hard limit that cannot be extended? In any case, the system has 16GB memory and the same amount of swap. By "Gb", R seems to mean Gigabyte, not Gigabit, so 7.6Gb puts us already dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway. However, this seems to hit the limit. The system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest using more memory-efficient methods of computation.
This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
Here, the idea to split the data set and compute distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/ of which the system by default (but depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12Gb. (I do not know why, maybe the memory management configuration is inconsistent then). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before as described in the original question, but with a matrix of NA's instead of rnorm random variates. Reassigning one of the values in the matrix as float (memorytestmatrix[1,1]<-0.5) still works and recasts the matrix as a float matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function
euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^.5)
}
mx = nrow(ctab2)
distMatrixE <- vec2dist(0.0, size=mx)
for (coli in 1:(mx-1)) {
  for (rowi in (coli+1):mx) {
    # Element indices in dist objects count down the rows column by column, from left
    # to right, in a lower triangular matrix without the main diagonal.
    # From row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2-mx)/2 - ((mx-coli+1)^2 - (mx-coli+1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi,], ctab2[coli,])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up by using vectorized operations (e.g. via sapply or rowSums) to compute more than one value at a time.
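A hedged sketch of that idea, assuming ctab2 is a plain numeric matrix: compute all distances from series coli to every series below it in one vectorized call instead of one element at a time.
# All Euclidean distances from row coli to rows coli+1 .. mx in one shot;
# sweep() subtracts the reference series from each remaining row.
dist_to_rest <- function(ctab2, coli) {
  rows <- (coli + 1):nrow(ctab2)
  sqrt(rowSums(sweep(ctab2[rows, , drop = FALSE], 2, ctab2[coli, ])^2))
}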
There exist good algorithms that do not need a full distance matrix in memory.
For example, SLINK and DBSCAN and OPTICS.
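As a hedged illustration (the dbscan package is my suggestion, not something this answer names): its DBSCAN and OPTICS implementations work directly on the data matrix, so the full 45,000 x 45,000 distance matrix never has to be materialized.
library(dbscan)
# Cluster straight from the 45000 x 9 data matrix; eps and minPts are
# placeholder values that would need tuning for the real data.
cl <- dbscan(ctab2, eps = 0.5, minPts = 5)
table(cl$cluster)   # cluster 0 holds the points labelled as noise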
Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as, say, svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function (bigm, inTrain, moreParamsList) {
parsList = c(list(data=bigm[inTrain, ]), moreParamsList)
do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Any functions and heuristics of tracking memory location and consumption of objects being made in different cores?
Thanks.
I am just going to put in here what I find from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm based on this from the manual for multicore:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel to the master (parent) process. At the point of forking both processes share exactly the same state including the workspace, global options, loaded packages etc. Forking is relatively cheap in modern operating systems and no real copy of the used memory is created, instead both processes share the same memory and only modified parts are copied. This makes fork an ideal tool for parallel processing since there is no need to setup the parallel working environment, data and code is shared automatically from the start.
For your first part of the question, you can use tracemem :
This function marks an object so that a message is printed whenever the internal code copies the object
Here an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
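A hedged sketch of that rewrite (paramGrid is a placeholder list of parameter sets, svm is assumed to be e1071::svm as in the question, and the cluster functions come from the base parallel package):
library(parallel)

trainm <- bigm[inTrain, ]            # subset once in the master process
cl <- makeCluster(4)
clusterEvalQ(cl, library(e1071))     # load svm() on each worker
clusterExport(cl, "trainm")          # ship only the small training matrix
fits <- parLapply(cl, paramGrid, function(p) {
  do.call(svm, c(list(data = trainm), p))
})
stopCluster(cl)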
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very convenient, see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing