I have the following script, which makes an ffdf object:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x <- cbind(rnorm(1e8), rnorm(1e8), 1:1e8)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=1000, next.rows=10000,levels=NULL))
Now I want to increase column #1 of x by 5.
To perform this operation I use the add() method from the ff package:
add(x[,1],5)
The output is OK (column #1 is increased by 5), but the extra RAM allocation is disastrous: it looks as if I am operating on the entire data frame in RAM rather than on an ffdf object.
So my question is about the correct way to deal with elements of an ffdf object without drastic extra RAM allocation.
You can just do the following:
require(ffbase)
x <- ff(1:10)
y <- x + 5
x
y
ffbase has implemented all the arithmetic operations; see help("+.ff_vector").
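The same approach carries over to an ffdf column. A minimal sketch with toy data (the column name a and the sizes are my own; with the question's CSV the column names would come from the header):
library(ff)
library(ffbase)
df  <- data.frame(a = rnorm(1000), b = 1:1000)
fdf <- as.ffdf(df)    # each column is an ff vector stored on disk
fdf$a <- fdf$a + 5    # "+.ff_vector" from ffbase builds a new on-disk vector chunk-wise
fdf$a[1:5]            # inspect the first few updated values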
I have used a chunk approach to make arithmetic calculations without extra RAM overhead (see the initial script in the question section):
chunk_size <- 100
chunks <- chunk(x, length.out = chunk_size)   # split the rows into chunk_size chunks
system.time(
  for (i in seq_along(chunks)) {
    # bring one chunk into RAM, add 5 to column 1, write it back to disk
    x[chunks[[i]], ][[1]] <- x[chunks[[i]], ][[1]] + 5
  }
)
x
Now I have increased each element of column #1 of the x object by 5 without significant RAM allocations.
The chunk_size value also controls the number of chunks: the more chunks are used, the smaller the RAM overhead, though processing time may increase.
A brief example and explanation of chunks in ffdf can be found here:
https://github.com/demydd/R-for-Big-Data/blob/master/09-ff.Rmd
Anyway, it would be nice to hear about alternative approaches.
Related
On my machine,
m1 = matrix( runif(5*10^7), ncol=10000, nrow=5000 )
uses up about 380 MB. I need to work with many such matrices in memory at the same time (e.g. add or multiply them, or apply functions to them). All in all, my code uses up 4 GB of RAM due to the multiple matrices stored in memory. I am contemplating options to store the data more efficiently (i.e. in a way that uses less RAM).
I have seen the R package bigmemory being recommended. However:
library(bigmemory)
m2 = big.matrix( init = 0, ncol=10000, nrow=5000 )
m2[1:5000,1:10000] <- runif( 5*10^7 )
makes R use about the same amount of memory, as I verified using the Windows Task Manager. So I anticipate no big gain; or am I wrong, and should I use big.matrix in a different way?
The solution is to work with matrices stored in files, i.e. setting backingfile to something other than NULL in the call to big.matrix().
Working with filebacked big.matrix from package bigmemory is a good solution.
However, assigning the whole matrix with runif(5*10^7) makes you create this large temporary vector in memory first. Yet, if you use gc(reset = TRUE), you will see that this memory usage disappears.
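For example, a small sketch combining the file-backed suggestion with that check (the backing file names are made up):
library(bigmemory)
m2 <- big.matrix(init = 0, ncol = 10000, nrow = 5000,
                 backingfile = "m2.bin", descriptorfile = "m2.desc")
m2[1:5000, 1:10000] <- runif(5 * 10^7)   # runif() still builds a large temporary vector in RAM
gc(reset = TRUE)                         # the temporary is collected and the reported peak is reset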
If you want to initialize your matrix by blocks (say, blocks of 500 columns), you could use the package bigstatsr. It uses objects similar to filebacked big.matrix objects (called FBM) and stores them in your temporary directory by default. You could do:
library(bigstatsr)
m1 <- FBM(1e4, 5e3)
big_apply(m1, a.FUN = function(X, ind) {
  X[, ind] <- runif(nrow(X) * length(ind))
  NULL
}, a.combine = 'c', block.size = 500)
Depending on the makeup of your dataset, a sparse matrix could be your best way forward. This is a common and extremely useful way to improve space and time efficiency; in fact, many R packages require that you use sparse matrices.
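For instance, a hedged sketch with the Matrix package (my own choice of package and of the number of non-zero entries, matching the 5000 x 10000 shape above and assuming most entries are zero):
library(Matrix)
set.seed(1)
nnz <- 1e5                                   # number of non-zero entries
i <- sample(5000,  nnz, replace = TRUE)      # row indices of the non-zero entries
j <- sample(10000, nnz, replace = TRUE)      # column indices of the non-zero entries
m_sparse <- sparseMatrix(i, j, x = runif(nnz), dims = c(5000, 10000))
print(object.size(m_sparse), units = "MB")   # a few MB instead of ~380 MB
m_twice <- m_sparse * 2 + m_sparse           # arithmetic keeps the sparse representation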
I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 columns. I'm splitting/partitioning by two columns, both of character type. It looks like there are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code is matched. The resulting number of partitions is ~250K. Here is the split() line:
system.time(pop_part <- split(pop, list(pop$ID, pop$code)))
The partitions will then be fed into parLapply() as follows:
cl <- makeCluster(detectCores())
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)
I've let the split() code alone run for almost an hour and it doesn't complete. Splitting by ID alone takes ~10 minutes. Additionally, RStudio and the worker threads are consuming ~6 GB of RAM.
The reason I know the resulting number of partitions is that I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" code). I'm not hoping for that kind of performance from R, but for something that perhaps completes in 10-15 minutes in the worst case.
The main question: is there a better alternative to split()? I've also tried ddply() with .parallel = TRUE, but it also ran for over an hour and never completed.
Split indices into pop:
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
Split is not slow, e.g.,
> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
   user  system elapsed
  1.056   0.000   1.058
so if yours is slow, I guess there's some aspect of your data that slows things down; e.g., if ID and code are both factors with many levels, then their complete interaction, rather than just the level combinations appearing in your data set, is calculated:
> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10
or perhaps you're running out of memory.
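If that is what is happening, one workaround is to build a single character key with paste() so that split() only creates the ID/code combinations that actually occur. A sketch with made-up data standing in for pop (the column names ID and code are taken from the question):
set.seed(1)
pop <- data.frame(ID   = sample(sprintf("id%05d", 1:1000), 50000, replace = TRUE),
                  code = sample(sprintf("c%04d",  1:500),  50000, replace = TRUE),
                  stringsAsFactors = FALSE)
key <- paste(pop$ID, pop$code, sep = ".")
idx <- split(seq_len(nrow(pop)), key)   # one element per combination that actually occurs
length(idx)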
Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i,]), pop, func)
Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
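For completeness, a tiny sketch of pvec (non-Windows only; the square-root example is made up just to show the shape of the call):
library(parallel)
v <- runif(1e7)
res <- pvec(seq_along(v), function(i) sqrt(v[i]), mc.cores = 4)   # chunks of indices go to separate cores
all.equal(res, sqrt(v))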
Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed at all. Maybe you've written PDI code instead of R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
split(x, f) is slow if x is a factor AND f contains many different elements.
So this code is fast:
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
But, this is very slow:
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
And this is fast again, because there are only 25 groups:
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))
I've been posting about an issue over the last few days where I need to create a 7000x7000 distance matrix. Doing it all in memory was giving me the "cannot allocate vector" error. I'm using Windows XP SP3 with 3 GB of RAM on a 32-bit system. I originally wanted to use the bigmemory library, but it appears that it is not available for Windows. I've done some reading on the ff package, and this is what I came up with so far:
require(ff)
ffmat <- ff(vmode="double", dim=c(7000,7000))
ffmat <- as.matrix(dist(data[1:7000, ], diag=TRUE, upper=TRUE))
The problem is that I still get a vector allocation error. Note that dim(data) is 7000x182 (lots of variables).
Running gc() post-mortem brings memory.size() back down to normal levels. It's as if R is storing the result in memory prior to writing it to the ff object that was created. Is there any way around this?
You are probably going to need to break up the task into pieces and assign the individual pieces to the matrix instead of doing it all in one step.
The dist and as.matrix functions do not know that the result will be an ff object; they just try to do their part in memory.
Since the dist function does not compute distances between different sets of data, it may be easiest to just calculate the distances by hand, though there may be a function in a package that will compute the off-diagonal distances.
"It's as if R is storing the results in memory prior to writing to the ff that was created. Is there any way around this?"
That's exactly what R is doing. The way your code is written, it does two things: it creates an ff object, and then it overwrites that object with a traditional in-memory matrix created by as.matrix.
You could potentially extend the dist function to work with ff objects, or write your own implementation of dist that uses ff.
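A hedged sketch of that idea, computing Euclidean distances by hand block by block and writing each block straight into the file-backed matrix (toy sizes; the real data is 7000x182, and daisy()'s Gower measure would need a different formula):
library(ff)
set.seed(1)
dat   <- matrix(rnorm(700 * 18), nrow = 700)     # stand-in for the real data
n     <- nrow(dat)
ffmat <- ff(vmode = "double", dim = c(n, n))
bs     <- 100                                    # rows per block
starts <- seq(1, n, by = bs)
for (i in starts) {
  ri <- i:min(i + bs - 1, n)
  for (j in starts) {
    rj <- j:min(j + bs - 1, n)
    # squared Euclidean cross-distances between the two row blocks
    d2 <- outer(rowSums(dat[ri, ]^2), rowSums(dat[rj, ]^2), "+") -
          2 * tcrossprod(dat[ri, ], dat[rj, ])
    ffmat[ri, rj] <- sqrt(pmax(d2, 0))           # only this block is ever held in RAM
  }
}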
Many thanks to jwijffels for steering me in the right direction, and to http://rmazing.wordpress.com/2013/02/22/bigcor-large-correlation-matrices-in-r/ for getting me started.
Assume a 7000x180 data matrix called training.data. The goal is to create a symmetric distance matrix of dimension 7000x7000. In reality, daisy() creates a dissimilarity measure, but the logic is similar.
distff <- function(training.data, nblocks=5, verbose=TRUE) {
  require(ff)
  require(cluster)
  # file-backed 7000x7000 matrix; pass filename= if so desired
  ffmat <- ff(vmode="single", dim=c(7000,7000))
  nro <- nrow(training.data)
  ### This could be changed to handle row counts that have
  ### modulus(nro/nblocks) != 0
  splt <- split(1:nro, rep(1:nblocks, each = nro/nblocks))
  COMBS <- expand.grid(1:length(splt), 1:length(splt))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i,]
    ### Since g1 and g2 get appended below, it wouldn't make sense to
    ### append the same group to itself
    if (COMB[1] != COMB[2]) {
      g1 <- splt[[COMB[1]]]
      g2 <- splt[[COMB[2]]]
      slj <- as.matrix(daisy(training.data[c(g1,g2),], metric="gower",
                             stand=FALSE))
      ffmat[c(g1,g2), c(g1,g2)] <- slj
      rm(slj)
      gc()
    }
  }
  ffmat
}
That's it. I realize there are some inefficiencies (for example, several of the blocks get written multiple times). I'm okay with that, since it works. Like I said, the bulk of this code was borrowed from and tailored to my needs from the website cited above.
I have the following R code:
data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)
Some more statistics of my testfile.data:
> ncol(data)
[1] 75713
> nrow(data)
[1] 44771
Since this is a large dataset, I am using Amazon EC2 with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the 1st line works).
But the as.matrix transformation (2nd line) throws the following error:
resulting vector exceeds vector length limit in 'AnswerType'
Any clue what might be the issue?
As noted, the development version of R supports vectors longer than 2^31 - 1. This is more or less transparent; for instance:
> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555
This is with
> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"
Large objects consume a lot of memory (62.5% of my 16 GB, for my example), and doing anything useful with them requires several times that amount. Further, even simple operations on large data can take appreciable time. And many operations on long vectors are not yet supported:
> sum(m)
Error: long vectors not supported yet:
/home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100
So it often makes sense to process data in smaller chunks by iterating through a larger file. This gives full access to R's routines, and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience.
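As an illustration of the chunked approach, here is a sketch of my own (it assumes an all-numeric CSV with a header) that accumulates column sums without ever holding the whole file in memory:
con   <- file("testfile.data", open = "r")
first <- read.csv(con, nrows = 1000)             # first chunk; keeps the header
sums  <- colSums(first)
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1000, header = FALSE, col.names = names(first)),
    error = function(e) NULL)                    # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  sums <- sums + colSums(chunk)
}
close(con)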
Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.
R> 75713*44771 > 2^31-1
[1] TRUE
See ?"Memory-limits" for more details.
I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table is your friend!
Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2) :
Same as do.call("rbind",l), but much faster.
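A minimal usage sketch (toy frames standing in for df and df.extension):
library(data.table)
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 4:6, b = letters[4:6])
combined <- rbindlist(list(df1, df2))   # a data.table, which also inherits from data.frame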
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One inadvisable method of saving quite a bit of memory is to pretend your data frames are lists, coerce them together column by column using a for loop (apply will eat memory like hell), and then make R believe the result actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough and, if possible, avoid this approach altogether.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf,"row.names") <- 1:(n1+n2)
attr(dtf,"class") <- "data.frame"
This erases the row names you actually had (you can reconstruct them, but check for duplicate row names!). It also doesn't carry out all the other checks included in rbind.
It saves you about half of the memory in my tests, and dtfcomb and dtf end up equal. (In the memory-usage plot from my tests, the red box is rbind and the yellow one is my list-based approach.)
Test script :
n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1,dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf,"row.names") <- 1:(n1+n2)
attr(dtf,"class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()
For now I have worked out the following solution:
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to ensure unique row names
row.names(df) <- 1:nrow(df)
Now I don't run out of memory. I think it's because I store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension).
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solves my problem for now, but I feel that there is a more advanced way to do a memory-efficient rbind. I'd appreciate any comments on this solution.
This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:
It's OK to use the hard drive: memory-mapping to disk is much faster than practically any other access option, so you may not see any slowdowns. At times I rely upon more than 1 TB of memory-mapped matrices, though most are between 6 and 50 GB. Moreover, as the object is a matrix, using it requires no real overhead of rewriting code.
Whether you use a file-backed matrix or not, you can use separated = TRUE to store the columns separately. I haven't used this much, because of my 3rd tip:
You can over-allocate space on disk to allow for a larger potential matrix size, but load only the submatrix of interest. This way there is no need to do rbind; a sketch follows below.
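A hedged sketch of that over-allocation idea (the file names, sizes, and the append_block() helper are my own invention, not bigmemory API):
library(bigmemory)
# Reserve room on disk for up to 5 million rows, even though only a fraction is filled for now.
bm <- filebacked.big.matrix(nrow = 5e6, ncol = 20, type = "double",
                            backingfile = "pop.bin", descriptorfile = "pop.desc")
n_used <- 0
append_block <- function(block) {
  idx <- n_used + seq_len(nrow(block))
  bm[idx, ] <- block                    # write the new rows into the reserved space
  n_used <<- n_used + nrow(block)
}
append_block(matrix(rnorm(1000 * 20), ncol = 20))
active <- bm[seq_len(n_used), ]         # only the filled submatrix comes into RAM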
Note: although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a data frame, if that's really necessary; a small sketch follows.
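A small sketch of that combination step (the column layout and names are made up):
library(bigmemory)
num_bm <- big.matrix(nrow = 1e5, ncol = 2, type = "double")
num_bm[, 1] <- rnorm(1e5)
num_bm[, 2] <- runif(1e5)
labels <- sample(letters, 1e5, replace = TRUE)   # character data kept outside bigmemory
rows <- 1:1000                                   # pull only the slice of interest into RAM
df_slice <- data.frame(x = num_bm[rows, 1],
                       y = num_bm[rows, 2],
                       label = labels[rows])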