Writing a large matrix using R ff

I've been posting about an issue over the last few days where I need to create a 7000x7000 distance matrix. Doing it all in memory was giving me the "cannot allocate vector" error. I'm using Windows XP SP3, 3GB RAM, 32-bit system. I originally wanted to use the bigmemory library, but it appears that it is not available for Windows. I've done some reading on the ff package, so this is what I came up with so far:
require(ff)
ffmat <- ff(vmode="double", dim=c(7000,7000))
ffmat <- as.matrix(dist(data[1:7000, ], diag=TRUE, upper=TRUE))
The problem is that I still get a vector allocation error. Note that dim(data) is 7000x182 (lots of variables).
Running gc() post-mortem brings the memory.size() back down to normal levels. It's as if R is storing the results in memory prior to writing to the ff that was created. Is there any way around this?

You are probably going to need to break up the task into pieces and assign the individual pieces to the matrix instead of doing it all in one step.
The dist and as.matrix functions do not know that the result will be an ff object, they just try to do their part in memory.
Since the dist function does not compute distances between two different sets of data, it may be easiest to just calculate the distances by hand, though there may be a function in a package that will compute the off-diagonal distances for you.
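For the default Euclidean metric, the cross-block distances are easy to compute by hand with a couple of matrix operations. A minimal sketch (crossdist is just an illustrative helper; other metrics would need their own formula):
crossdist <- function(x1, x2) {
  # squared distances via |x|^2 + |y|^2 - 2*x.y for every row of x1 against every row of x2
  d2 <- outer(rowSums(x1^2), rowSums(x2^2), "+") - 2 * tcrossprod(x1, x2)
  sqrt(pmax(d2, 0))  # pmax guards against tiny negative values from rounding
}
Each call returns an nrow(x1) by nrow(x2) block that can be written straight into the corresponding rows and columns of an ff matrix.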

"It's as if R is storing the results in memory prior to writing to the ff that was created. Is there any way around this?"
That's exactly what R is doing. The way your code is written does two things: it creates an ff object, and then it overwrites that with a traditional matrix created by as.matrix.
You could potentially extend the dist function to work with ff objects, or write your own implementation of dist that uses ff.
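Either way, the key is to write into the existing ff object block by block rather than reassigning the name. A sketch for a single block, assuming one block fits in memory at a time:
require(ff)
ffmat <- ff(vmode = "double", dim = c(7000, 7000))
idx <- 1:1000  # hypothetical first block of rows
ffmat[idx, idx] <- as.matrix(dist(data[idx, ], diag = TRUE, upper = TRUE))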

Many thanks to jwijffels for steering me in the right direction, and to http://rmazing.wordpress.com/2013/02/22/bigcor-large-correlation-matrices-in-r/ for getting me started.
Assume a 7000x180 data matrix called training.data. The goal is to create a symmetric distance matrix of dimension 7000x7000. In reality, using daisy() creates a dissimilarity measure, but it's similar logic.
distff <- function(training.data, nblocks=5, verbose=TRUE) {
  require(ff)
  require(cluster)
  ffmat <- ff(vmode="single", dim=c(7000,7000), filename="if so desired")
  nro <- nrow(training.data)
  ### This could be changed to handle rowcounts that have
  ### modulus(nro/nblocks) != 0
  splt <- split(1:nro, rep(1:nblocks, each = nro/nblocks))
  COMBS <- expand.grid(1:length(splt), 1:length(splt))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i,]
    ### Since g1 and g2 get appended below, it wouldn't make sense to append the
    ### same group to itself
    if (COMB[1] != COMB[2]) {
      g1 <- splt[[COMB[1]]]
      g2 <- splt[[COMB[2]]]
      slj <- as.matrix(daisy(training.data[c(g1,g2),], metric="gower",
                             stand=FALSE))
      ffmat[c(g1,g2), c(g1,g2)] <- slj
      rm(slj)
      gc()
    }
  }
  ffmat  # return the ff object so the caller keeps a handle to it
}
That's it. I realize there are some inefficiencies (like writing several of the groups multiple times). I'm okay with that, since it works. Like I said, the bulk of this code was borrowed and tailored from the website cited above.
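For reference, since the function above returns the ff object, the call is simply:
ffdist <- distff(training.data, nblocks = 5)
and any block can then be read back on demand with ffdist[g1, g2] without pulling the full 7000x7000 matrix into RAM.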

Related

Parallel recursive function in R?

I've been wracking my brain around this problem all week and could really use an outside perspective. Basically, I've built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I've generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
  # Terminate on fifth loop.
  message(paste("Started loop", loop))
  if (loop == 5) {return(TRUE)}
  # Create large matrix and do some operation.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)
  # Aggregate matrix and save output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = FALSE)
  # Split matrix into submatrices with 100 columns each.
  ind <- ceiling(seq_along(1:ncol(bigmx)) / 100)
  lapply(unique(ind), function(i) {
    submx <- bigmx[, ind == i]
    # Pass each submatrix to the subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # submatrix is used to generate big matrix in subsequent call (not shown)
  })
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call of the function you pass it works in its own environment, so assignments inside it only change a local copy. That's why you can do this:
x <- 5
lapply(1:10, function(i) {
  x <- x + 1
  x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each call inside lapply was manipulating its own copy of x. That's not good if, say, x were actually a large data frame with 10,000 variables. A for loop, on the other hand, lets you overwrite the variable on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
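Concretely, the pattern is to parallelize only once, at the top level. A sketch (start_points stands in for the 16 starting submatrices, which the toy example doesn't show):
library(parallel)
start_points <- replicate(16, matrix(rnorm(100), 10, 10), simplify = FALSE)  # toy stand-ins
# One fork per top-level branch; no mclapply inside tree_search itself.
results <- mclapply(start_points, function(sp) tree_search(sp, loop = 0), mc.cores = 16)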
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule=FALSE, each core only works on one part of the job at a time. With mc.cores set to two, for example, the remaining 8 parts wait until one of the two running parts finishes before a new one is started. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
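A minimal sketch of that throttling pattern (parts and process_part are hypothetical stand-ins for the real workload):
library(parallel)
parts <- split(1:100, rep(1:10, each = 10))          # 10 chunks of a hypothetical workload
process_part <- function(chunk) sapply(chunk, sqrt)  # stand-in for the real per-chunk work
# With mc.preschedule = FALSE, only mc.cores chunks run (and occupy memory) at once;
# the remaining chunks wait until a worker frees up.
out <- mclapply(parts, process_part, mc.cores = 2, mc.preschedule = FALSE)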
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.

What is the fastest way to compute the correlation between a vector and a matrix in R?

I am trying to find a fast way to calculate the correlation between a vector of values and a matrix. I have a data frame with 200 rows and 400,000 columns after transposing the data, and I need to find the correlation between each column and every other column.
My code is below, but it is too slow. Can anyone come up with a faster way?
for (i in 1:400000) {
  x <- cor(trainDataNew[, i], trainDataNew[, -i])
}
You don't need my data to do this. You can create random data like below.
norm1 <- rnorm(1000)
norm2 <- rnorm(1000)
norm3 <- rnorm(1000)
as.data.frame(cbind(norm1,norm2,norm3))
What's wrong with
cc <- cor(trainDataNew)
?
If you only want the lower triangle you can then use
cc2 <- cc[lower.tri(cc,diag=FALSE)]
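For example, with the toy data generated in the question:
dd <- data.frame(norm1, norm2, norm3)
cc <- cor(dd)                            # full 3 x 3 correlation matrix
cc2 <- cc[lower.tri(cc, diag = FALSE)]   # the 3 distinct pairwise correlations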
This blog post claims to have done a similar-sized (slightly smaller) problem in about a minute. Their approach is implemented in HiClimR::fastCor.
library(HiClimR)
system.time(cc <- fastCor(dd, nSplit = 10,
                          upperTri = TRUE, verbose = TRUE,
                          optBLAS = TRUE))
I haven't gotten this working yet (keep running out of memory), but you may have better luck. You should also look into linking R to an optimized BLAS, e.g. see here for MacOS.
Someone here reports a parallelized version (code is here, along with some forked versions)

Faster way to sum up raster values based on polygon extent in R

I am looking for a way to improve the speed and lower the memory-usage of the following lines:
export <- raster(paste0(catch_dir,'/export_streams.rst'))
catchm_polyg <- readOGR(dsn = catch_dir, layer = 'catchment')
Model_10 <- extract(export, catchm_polyg, fun = sum, na.rm = TRUE )
This gives me the sum of all values from export_streams.rst, with catchm_polyg as an extent.
I want to do this many times for different input data, so the code is part of a function, which is then used in a foreach loop. That all works fine up to a point. The code doesn't work with larger input data, though, as I apparently don't have enough memory (32GB, 64-bit R version), and the calculation time is very high. Any suggestions on how to improve the code?
A couple of things might speed this up and lower memory use:
Ask yourself: can I first aggregate my raster to a coarser resolution using the sum function? (See the sketch after the links below.)
Memory: don't always write to memory when using functions from the raster package. Instead, try to write externally when possible, or you will get memory errors.
If you have a multi-part polygon (a SpatialPolygonsDataFrame object), just run the extract function once, then unlist the result and apply your summary function:
# quickly summarise across multiple polygons
allmyvals <- extract(myrast, myploys)
myploys$sum_in_poly <- unlist(lapply(allmyvals , function(x) if (!is.null(x)) sum(x, na.rm=TRUE) else NA ))
Take an alternative approach outside the raster package, or try something with getValues. See these threads:
https://gis.stackexchange.com/questions/130522/increasing-speed-of-crop-mask-extract-raster-by-many-polygons-in-r
https://gis.stackexchange.com/questions/156663/really-slow-extraction-from-raster-even-after-using-crop
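As a sketch of the first suggestion above (the aggregation factor of 10 is just an illustration, and the result is only approximate near polygon edges):
library(raster)
# sum 10 x 10 blocks of fine cells into one coarse cell before extracting
export_coarse <- aggregate(export, fact = 10, fun = sum, na.rm = TRUE)
Model_10_approx <- extract(export_coarse, catchm_polyg, fun = sum, na.rm = TRUE)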

Memory efficient alternative to rbind - in-place rbind?

I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table is your friend!
C.f. http://www.mail-archive.com/r-help#r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2) :
Same as do.call("rbind",l), but much faster.
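So, roughly (a sketch; rbindlist still builds one new data.table rather than binding truly in place, but it avoids the intermediate copies that rbind makes):
library(data.table)
df <- rbindlist(list(df, df.extension))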
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One method I can't really advise, but which saves quite a bit of memory, is to treat your data frames as lists: build the combined columns with a for loop (apply will eat memory like hell) and then make R believe the result actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough, and avoid it if you can.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
It erases rownames you actually had (you can reconstruct them, but check for duplicate rownames!). It also doesn't carry out all the other tests included in rbind.
Saves you about half of the memory in my tests, and in those tests dtfcomb and dtf come out equal. In the memory profile, the red box is rbind; the yellow one is my list-based approach.
Test script:
n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()
For now I have worked out the following solution:
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to ensure unique row names
row.names(df) <- 1:nrow(df)
Now I don't run out of memory. I think it's because I store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension).
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solved my problem for now, but I feel that there is a more advanced way to do a memory efficient rbind. I appreciate any comments on this solution.
This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:
It's OK to use the HD: Memory mapping to the HD is much faster than practically any other access, so you may not see any slowdowns. At times I rely upon > 1TB of memory-mapped matrices, though most are between 6 and 50GB. Moreover, as the object is a matrix, this requires no real overhead of rewriting code in order to use the object.
Whether you use a file-backed matrix or not, you can use separated = TRUE to make the columns separate. I haven't used this much, because of my 3rd tip:
You can over-allocate the HD space to allow for a larger potential matrix size, but only load the submatrix of interest. This way there is no need to do rbind.
Note: Although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a dataframe, if it's really necessary.
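A minimal sketch of the over-allocation idea from the third point, assuming df and df.extension are all numeric (as the note above says, mixed types would need separate matrices); sizes and file names are placeholders:
library(bigmemory)
# over-allocate rows up front in a file-backed matrix
bm <- filebacked.big.matrix(nrow = 5e6, ncol = ncol(df), type = "double",
                            backingfile = "big.bin", descriptorfile = "big.desc")
n <- nrow(df)
bm[1:n, ] <- as.matrix(df)                                          # initial data
bm[(n + 1):(n + nrow(df.extension)), ] <- as.matrix(df.extension)   # "appended" rows, no copy of df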

What's the higher-performance alternative to for-loops for subsetting data by group-id?

A recurring analysis paradigm I encounter in my research is the need to subset my data by each distinct group id value, perform a statistical analysis on each group in turn, and put the results into an output matrix for further processing/summarizing.
How I typically do this in R is something like the following:
data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) # Assume there are then 100 unique groups
results <- matrix(rep("NA",300), ncol=3, nrow=100)
for (i in 1:100) {
  tempmat <- subset(data.mat, ID==groupids[i])
  # Run various stats on tempmat (correlations, regressions, etc), checking to
  # make sure this specific group doesn't have NAs in the variables I'm using
  # and assign results to x, y, and z, for example.
  results[i,1] <- x
  results[i,2] <- y
  results[i,3] <- z
}
This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.
Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
Edit:
Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. Below is some code and output using the same object names as above:
system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
Update:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):
data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)
results <- matrix(rep("NA", (length(groupids) * 3)), ncol=3, nrow=length(groupids))
for (i in 1:length(groupids)) {
  tempmat <- data.mat[data.mat$ID==groupids[i], ]
  results[i,1] <- groupids[i]
  results[i,2] <- cor(tempmat$Var1, tempmat$Var2, use="pairwise.complete.obs")
  results[i,3] <- cor(tempmat$Var1, tempmat$Var3, use="pairwise.complete.obs")
}
I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.
Restructuring my code to use data.table....
library(data.table)
data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)
testfunc <- function(x, y, z) {
  temp1 <- cor(x, y, use="pairwise.complete.obs")
  temp2 <- cor(x, z, use="pairwise.complete.obs")
  res <- list(temp1, temp2)
  res
}
system.time(test <- data.mat[, testfunc(Var1, Var2, Var3), by="ID"])
   user  system elapsed
  16.41    0.05   17.44
Comparing the results from data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
Update 2:
Running the code using subsets finally finished up again:
user system elapsed
17575.79 4247.41 23477.00
Update 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.
Using the same variables and setup as before...
library(plyr)
data.mat <- read.csv("//home....")
system.time(hmm <- ddply(data.mat, "ID", function(df)
  c(cor(df$Var1, df$Var2, use="pairwise.complete.obs"),
    cor(df$Var1, df$Var3, use="pairwise.complete.obs"))))
   user  system elapsed
 250.25    7.35  272.09
This is pretty much exactly what the plyr package is designed to make easier. However it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know if it will give you any performance benefit.
foreach does, however, give you a fairly simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; check Getting Started with doMC and foreach for details), in case you have been avoiding parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
Personally, I find plyr not very easy to understand. I prefer data.table, which is also faster. For instance, say you want to compute the standard deviation of the column my_column for each ID.
dt <- data.table(df)                        # one-time operation: convert df to a data.table
result.sd <- dt[, sd(my_column), by="ID"]   # result: each ID with its SD in the second column
Three statements of this kind and a cbind at the end - that is all you need.
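In fact you can often skip the cbind: a single j expression can return several statistics at once (col1 and col2 are hypothetical column names):
result <- dt[, list(sd1 = sd(col1), sd2 = sd(col2), n = .N), by = "ID"]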
You can also use dt to do something for a single ID, without a subset command, in the new syntax:
result.sd.oneID <- dt[ID == "oneID", sd(my_column)]
The first argument refers to rows (i), the second to columns (j).
I find it easier to read than plyr, and it is more flexible, as you can also work on sub-domains within a "subset"...
The documentation describes it in SQL-like terms; for instance, by is pretty much GROUP BY in SQL. If you know SQL you can probably do even more with it, but SQL knowledge is not necessary to make use of the package.
Finally, it is extremely fast, because each operation grabs only the data needed for the calculation, whereas subset keeps the whole object around and drags it through memory.
You have already suggested vectorizing and avoiding making unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and just assume that vectorizing will always give you a performance boost (like it does in other languages, e.g., Python + NumPy, MATLAB).
An example:
# small function to time the results:
time_this = function(...) {
  start.time = Sys.time()
  eval(..., sys.frame(sys.parent(sys.parent())))
  end.time = Sys.time()
  print(end.time - start.time)
}
# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)
# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )
print(x) # returns => 0.5312099
print(y) # returns => 0.661242
The 'apply' version is actually slower than the 'for' version. (According to the Inferno author, if you are doing this you are not vectorizing, you are 'loop hiding'.)
But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function rowMeans:
z = time_this(rowMeans(a))
print(z) # returns => 0.03679609
An order of magnitude improvement versus the 'for' loop (and the vectorized version).
The other members of the apply family are not just wrappers over a native 'for' loop.
a = abs(floor(10*rnorm(1e6)))
time_this(sapply(a, sqrt))
# returns => 6.64 secs
time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs
'sapply' is about 5x slower compared with a 'for' loop.
Finally, w/r/t vectorized versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function: the latter usually takes fewer keystrokes and it's a more natural way (for me) to code, which is a different kind of performance boost, I suppose.
