How is it possible that storing data into an H2O frame is so much slower than into a data.table?
# Packages used: h2o and data.table
library(h2o)
library(data.table)
h2o.init(nthreads = -1)
# Create the matrices
matrix1 <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))
matrix2 <- h2o.createFrame(1000, 1000)
# data.table element-wise store
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}
# H2O frame element-wise store
for (i in 1:1000) {
  matrix2[i, 1] <- 3
}
Thanks!
H2O has a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)
So what you've shown is a very inefficient way to populate an H2O frame in H2O memory: every single-cell write turns into a network call. You almost certainly don't want this.
For your example, since the data isn't large, a reasonable approach is to do the initial assignment to a local data frame (or data.table) and then push it to H2O with as.h2o():
h2o_frame = as.h2o(matrix1)
head(h2o_frame)
This pushes an R data frame from the R client into an H2O frame in H2O server memory. (And you can do as.data.table() to do the opposite.)
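As a small, hedged example of the reverse direction (reusing h2o_frame from above; this pulls the whole frame into R client memory, so only do it for small data):
dt_back <- as.data.table(h2o_frame)  # pull the H2O frame back into an R data.table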
data.table Tips:
For data.table, prefer the in-place := syntax. This avoids copies. So, for example:
matrix1[i, 3 := 42]
H2O Tips:
The fastest way to read data into H2O is the pull approach: ingest the data with h2o.importFile(), which reads it in parallel and distributes it across the H2O cluster.
The as.h2o() trick shown above works well for small datasets that easily fit in the memory of one host.
If you want to watch the network messages between R and H2O, call h2o.startLogging().
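As a hedged sketch of those two tips together (the file path is only a placeholder for your own data):
h2o.startLogging()                                # log the REST traffic between R and H2O
big_frame <- h2o.importFile("/path/to/data.csv")  # parallel, distributed ingest by the H2O server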
I can't answer your question definitively because I don't know h2o. However, I can make a guess.
Your code to fill the data.table is slow because of copy-on-modify semantics. If you update your table by reference, you will dramatically speed up your code.
# Slow: copy-on-modify assignment (what the question does)
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}
# Fast: update by reference with set()
for (i in 1:1000) {
  set(matrix1, i, 1L, 3)
}
With set() my loop takes about 3 milliseconds, while your loop takes 18 seconds (roughly 6000 times longer).
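If you want to reproduce the comparison yourself, a quick sketch with system.time() (reusing matrix1 from the question) is:
system.time(for (i in 1:1000) matrix1[i, 1] <- 3)      # copy-on-modify assignment: seconds
system.time(for (i in 1:1000) set(matrix1, i, 1L, 3))  # update by reference: milliseconds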
I suppose h2o works the same way, but with some extra work done because it is a special object. Maybe some message-passing communication with the H2O cluster?
I have a function written as:
setlower <- function(df) {
  for (j in seq_along(df)) {
    data.table::set(df, j = j, value = stringi::stri_trans_tolower(df[[j]]))
  }
  invisible(df)
}
I have a much larger package that calls this function on multiple data.tables inside an lapply. The profile shows that the majority of the time for each data.table I am processing is spent in the setlower function.
Is there any way to speed this up? I know both data.table and stringi are supposed to be about as fast as you can get for data frame and string operations, but I didn't know if there is a potentially faster way to ensure that any given data frame is converted entirely to lower case.
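For context, a minimal sketch of the calling pattern described above (list_of_tables is a hypothetical name for the collection of data.tables being processed):
# setlower() modifies each table in place by reference and returns it invisibly
results <- lapply(list_of_tables, setlower)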
I am working with a very large dataset and I would like to keep the data in H2O as much as possible without bringing it into R.
I noticed that whenever I pass an H2O Frame to a function, any modification I make to the Frame is not reflected outside of the function. Is there a way to pass the Frame by reference?
If not, what's the best way to modify the original frame inside a function without copying the whole Frame?
Another related question: does passing a Frame to other functions (read-only) make extra copies on the H2O side? My datasets are 30GB to 100GB, so I want to make sure that passing them around does not cause memory issues.
mod = function(fdx) {
  fdx[, "x"] = -1
}
d = data.frame(x = rnorm(100), y = rnorm(100))
dx = as.h2o(d)
dx[1, ]
mod(dx)
dx[1, ] # does not change the original value of x
> dx[1,]
x y
1 0.3114706 0.9523058
> dx[1,]
x y
1 0.3114706 0.9523058
Thanks!
H2O does a classic copy-on-write optimization. Thus:
No true copy is made unless you mutate the dataset.
Only changed or added columns are truly copied; all other columns are passed by reference.
Frames in R are pass-by-value, which H2O mimics
Frames in Python are pass-by-reference, which H2O mimics
In short, do as you would in R, and you're fine.
No extra copies.
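Because frames mimic R's pass-by-value semantics, the usual idiom is simply to return the modified frame and reassign it. A minimal sketch reusing mod() and dx from the question:
mod <- function(fdx) {
  fdx[, "x"] <- -1
  fdx                # return the mutated frame
}
dx <- mod(dx)        # copy-on-write: only the changed column is materialized
dx[1, ]              # now reflects x = -1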
It seems that clusterMap in the snow package doesn't support dynamic (load-balanced) scheduling. I'd like to do parallel computing over pairs of parameters stored in a data frame, but the elapsed times of the jobs vary a great deal. If the jobs are not scheduled dynamically, it will be time consuming.
e.g.
library(snow)
cl2 <- makeCluster(3, type = "SOCK")
df_t <- data.frame(type = c(rep('a', 3), rep('b', 3)), value = c(rep('1', 3), rep('2', 3)))
clusterExport(cl2, "df_t")
clusterMap(cl2, function(x, y) { paste(x, y) },
           df_t$type, df_t$value)
It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.
In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})
This was common before clusterMap was added to the snow package.
Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:
clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})
In this case, df_t must be exported to the cluster workers since the worker function references it. However, it is generally less efficient since each worker only needs a fraction of the entire data frame.
I found that clusterMap in the parallel package supports load balancing, but it is less efficient than the clusterApplyLB-plus-lapply approach implemented with snow. I tried to find the source code to figure out why, but clusterMap is not shown when I click the 'source' and 'R code' links.
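For reference, a hedged sketch of the load-balanced call using parallel::clusterMap and its .scheduling argument (reusing df_t from above):
library(parallel)
cl2 <- makeCluster(3)
res <- clusterMap(cl2, function(x, y) paste(x, y),
                  df_t$type, df_t$value,
                  .scheduling = "dynamic")   # dynamic = load-balanced scheduling
stopCluster(cl2)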
I know that the R package bigmemory works great for dealing with large matrices and data frames. However, I was wondering if there is any package, or any way, to efficiently work with a large list.
Specifically, I created a list whose elements are vectors. I have a for loop, and during each iteration multiple values are appended to a selected element of that list (a vector). At first it runs fast, but after maybe 10,000 iterations it slows down gradually (one iteration takes about a second). I'm going to go through about 70,000 to 80,000 iterations, and the list will be very large after that.
So I was just wondering if there is something like a big.list, analogous to big.matrix in the bigmemory package, that could speed up this whole process.
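A hypothetical sketch of the append-in-a-loop pattern described above (names and sizes are made up); each append reallocates and copies the growing vector, which is one likely source of the slowdown:
big_list <- vector("list", 100)
for (i in 1:80000) {
  k <- (i %% 100) + 1                          # pick an element of the list
  big_list[[k]] <- c(big_list[[k]], rnorm(5))  # append: reallocates and copies the vector
}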
Thanks!
I'm not really sure if this is a helpful answer, but you can interactively work with lists on disk using the filehash package.
For example, here is some code that creates a database on disk, assigns a preallocated empty list to the database, and then runs a loop (getting the current time) that fills the list in the database.
library(filehash)
# how many items in the list?
n <- 100000
# set up the database on disk
dbCreate("testDB")
db <- dbInit("testDB")
# preallocate the list in the database
db$time <- vector("list", length = n)
# fill the list using the disk object; each assignment writes to disk
for (i in 1:n) db$time[[i]] <- Sys.time()
There is hardly any use of RAM during this process; however, it is VERY slow (two orders of magnitude slower than doing it in RAM in some of my tests) due to constant disk I/O. So I'm not sure this method is a good answer to the question of how to speed up working on big objects.
The DSL package might help. Its DList object works like a drop-in replacement for R's list. Further, it provides a distributed-list-like facility too.
I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:
objects <- mclapply(files, function(x) {
  on.exit(message(sprintf("Completed: %s", x)))
  message(sprintf("Started: '%s'", x))
  readBamGappedAlignments(x)
}, mc.cores = 10)
However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet it seems I can read the same files in on their own without this error. I've found mention of this issue here, without resolution.
Any parallel solution suggestions would be appreciated; this has to be done in parallel. I could look towards snow, but I have a very powerful server with 15 processors, 8 cores each, and 256GB of memory that I can do this on. I'd rather just do it on this machine across cores than use one of our clusters.
The integer limit is rumored to be addressed very soon in R. In my experience that limit can be hit even by datasets with under 2 billion cells (the maximum integer is around 2.1 billion), because low-level functions like sendMaster in the multicore package rely on passing raw vectors. I had around 1 million processes representing about 400 million rows of data and 800 million cells in data.table format, and when mclapply was sending the results back it ran into this limit.
A divide-and-conquer strategy is not that hard, and it works. I realize this is a hack, and one should be able to rely on mclapply.
Instead of one big list, create a list of lists. Each sub-list is smaller than the version that broke, and you feed them into mclapply split by split. Call this file_map. The results are a list of lists, so you can then flatten them with a double do.call("c", ...). That way, each time mclapply finishes, the serialized raw vector is of a manageable size.
Just loop over the smaller pieces:
collector = vector("list", length(file_map)) # preallocated for speed
for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores = 10)
  collector[[index]] = reduced_set
}
output = do.call("c", do.call("c", collector)) # double concatenation of the list of lists
Alternatively, save the output to a database as you go, such as SQLite.
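A hedged sketch of that SQLite approach using DBI and RSQLite (the table name and the coercion of each alignment object to a data.frame are assumptions, not part of the original code):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "alignments.sqlite")
for (index in seq_along(file_map)) {
  reduced_set <- mclapply(file_map[[index]], readBamGappedAlignments, mc.cores = 10)
  for (aln in reduced_set) {
    dbWriteTable(con, "alignments", as.data.frame(aln), append = TRUE)  # write each chunk as it arrives
  }
}
dbDisconnect(con)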