Does clusterMap in Snow support dynamic processing? - r

It seems clusterMap in Snow doesn't support dynamic processing. I'd like to run parallel computations over pairs of parameters stored in a data frame, but the elapsed time varies a lot from job to job. If the jobs are not scheduled dynamically, the whole run will be time-consuming.
e.g.
library(snow)
cl2 <- makeCluster(3, type = "SOCK")
df_t <- data.frame (type=c(rep('a',3),rep('b',3)), value=c(rep('1',3),rep('2',3)))
clusterExport(cl2,"df_t")
clusterMap(cl2, function(x,y){paste(x,y)},
df_t$type,df_t$value)

It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.
In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})
This was common before clusterMap was added to the snow package.
Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:
clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})
In this case, df_t must be exported to the cluster workers since the worker function references it. However, it is generally less efficient since each worker only needs a fraction of the entire data frame.
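To illustrate why load balancing matters when job times are uneven, here is a minimal sketch. It uses the parallel package (which provides the same clusterApply and clusterApplyLB functions as snow) so that it runs with base R only; the delays vector and the work function are made up for the illustration.

```r
library(parallel)

cl <- makeCluster(2)

# Hypothetical workload: one slow job followed by several quick ones.
delays <- c(0.4, 0.1, 0.1, 0.1, 0.1)
work <- function(d) {
  Sys.sleep(d)  # stand-in for real computation
  d
}

# Static scheduling: jobs are dealt out in fixed rounds, one job per worker
# per round, so a slow job holds up its whole round.
t_static <- system.time(r_static <- clusterApply(cl, delays, work))[["elapsed"]]

# Load balancing: a worker gets the next job as soon as it finishes one.
t_lb <- system.time(r_lb <- clusterApplyLB(cl, delays, work))[["elapsed"]]

stopCluster(cl)

# Both calls return the results in input order; only the wall-clock time
# (t_static vs. t_lb) is expected to differ.
same_results <- identical(r_static, r_lb)
```

Comparing t_static and t_lb on your own workload should show whether the dynamic version is worth it; the exact timings depend on the machine.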

I found that clusterMap in the parallel package supports load balancing, but it is less efficient than combining clusterApplyLB with lapply as in snow. I tried to read the source code to figure out why, but clusterMap is not available when I click the 'source' and 'R code' links.
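For reference, the load-balancing option of clusterMap in the parallel package is selected with the .scheduling argument. A minimal sketch, reusing the df_t example from the question (stringsAsFactors and USE.NAMES are set here only to keep the output simple):

```r
library(parallel)

cl <- makeCluster(2)

df_t <- data.frame(type  = c(rep("a", 3), rep("b", 3)),
                   value = c(rep("1", 3), rep("2", 3)),
                   stringsAsFactors = FALSE)

# .scheduling = "dynamic" hands out one job at a time to whichever worker
# is free, instead of splitting the jobs up front.
res <- clusterMap(cl, function(x, y) paste(x, y),
                  df_t$type, df_t$value,
                  USE.NAMES = FALSE,
                  .scheduling = "dynamic")

stopCluster(cl)

# Results come back as a list in the original row order.
res_vec <- unlist(res)
```

No export of df_t is needed here because the worker function only sees the values passed in as x and y.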

Related

Send chunks of large dataset to specific cores for R parallel foreach

I am attempting to scale a script that applies a feature extraction function to an image file generated from each row of an R matrix. For faster computation I split the matrix into equally sized chunks and run each chunk in parallel using the foreach construct. To do this, I have to send the entire matrix to each core using clusterExport and subset it to the desired chunk within the foreach loop.
I am hoping to find a way to export only the matrix chunks to each core, instead of passing the full matrix to each core and then subsetting. I was only able to find one thread close to what I was looking for: this answer by Steve Weston, who sent individual chunks to each core using clusterCall (code pasted below):
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
clusterCall(cl[i], function(d) {
assign('mydata', d, pos=.GlobalEnv)
NULL # don't return any data to the master
}, df[ix[[i]],,drop=FALSE])
}
This answer worked as advertised; however, the cores in this example run in sequence instead of in parallel. My attempt to parallelize this using foreach instead of for was hamstrung by having to use clusterExport to transfer the dataset variable, which is the issue I'm trying to avoid:
clusterExport(cl,c("df","ix"))
foreach() %dopar% {etc}
Is there a way to pass chunks of a variable to each core, and operate on them in parallel? A foreach solution would be nice, or a parallelized adaptation to Steve Weston's structure. Note that I am developing for Windows, so forking is not an option for me.
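One possible approach, sketched here as an assumption rather than a confirmed answer: build the list of chunks on the master and hand it to clusterApply. When the list has exactly one element per worker, element k goes to worker k, the master issues all the sends before waiting for any result, and no worker ever receives the full data set. This uses PSOCK workers, so it does not rely on forking.

```r
library(parallel)

cl <- makeCluster(2)

df <- data.frame(a = 1:10, b = 1:10)
ix <- splitIndices(nrow(df), length(cl))

# One chunk per worker, built on the master.
chunks <- lapply(ix, function(i) df[i, , drop = FALSE])

# Each worker receives only its own chunk and stores it as 'mydata'.
invisible(clusterApply(cl, chunks, function(d) {
  assign("mydata", d, pos = .GlobalEnv)
  NULL  # don't return any data to the master
}))

# Later calls can operate on the stored chunks in parallel.
sums <- clusterEvalQ(cl, sum(mydata$a))

stopCluster(cl)
```

With a foreach backend registered on the same cluster, the loop body could then refer to mydata directly instead of exporting df.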

Is there a way of "chunking" drake outputs to speed up plan verification and display?

I'm conducting simulations over a range of models and parameter values. At this point in time my drake workflow involves over 3,000 simulated data.frames and corresponding stanfit objects.
Trying to run make currently incurs a delay of ~2 minutes before plan execution begins. I assume that this is because drake is going through its cache to verify which steps in the plan will need updating. I would like to have some way of letting it know that it can represent all of these models as a single monolithic chunk of output. What I could do is make a function that writes all my output objects as a side-effect and then outputs a hash of sorts so that drake is "fooled" as to what needs to be checked but I can't restructure my code at this point in time given an upcoming deadline and the processing time involved.
Similarly, for purposes of using the dependency graph, having 3k+ objects show up makes it unusable. It would be nice to be able to collapse certain objects under a single "output type" group.
Great question. I know what you are saying, and I think about this problem all the time. In fact, trying to get rid of the delay is one of my top two priorities for drake for 2019.
Unfortunately, drake does not have a solution right now that will allow you to keep your targets up to date. The long-term solution will probably be speed improvements + https://github.com/ropensci/drake/issues/304 + https://github.com/ropensci/drake/issues/233. These are important areas of development, but also huge undertakings.
For new projects, you could have each target be a list of fitted stan models.
drake_plan(
  data1 = generate_data(...),
  data2 = generate_data(...),
  models_data1 = fit_models(data1),
  models_data2 = fit_models(data2)
)
fit_models <- function(data){
list(
run_stan(data, "normal_priors"),
run_stan(data, "t_priors")
)
}
And for the graph visualizations, there is support for target clusters. See https://ropenscilabs.github.io/drake-manual/vis.html#clusters
EDIT: parallel computing and verbosity
If you run make(jobs = c(imports = 4, targets = 6)), drake will use 4 processes on your local machine to do the preprocessing. And make(verbose = 4) shows more progress messages than with the default setting.

Cluster job management in R via future package

I want to use the R package future (supports asynchronous calculations) to make a cluster-jobserver that can dynamically add/remove jobs to/from a queue.
One specific functionality that I would like to add to my jobserver is to distribute memory-demanding jobs to the more powerful machines in my cluster. However, since I have no experience with the package, I am not quite sure whether my approach (given below) has any pitfalls. Specifically, do the subsequent calls of plan have any side effects that might mess things up? Please see the comments in the code for more details.
Thanks in advance!
library(parallel)
library(future)
slaveIPs=c("172.16.2.10","172.16.2.21")
masterIP="172.16.2.33"
workers=makePSOCKcluster(slaveIPs,master=masterIP)
#check whether PSOCK cluster was correctly set up
unlist(clusterCall(workers,function() unname(Sys.info()["nodename"])))
#[1] "ip-172-16-2-10" "ip-172-16-2-21"
#now the first important part that I am not sure about
#as you can see, I only use workers[1] for the first task
#is it OK to use workers[1] like that?
plan(cluster,workers=workers[1])
f=future({
#do memory-hungry work
unname(Sys.info()["nodename"])
})
message(value(f))
#ip-172-16-2-10
#now I am only using workers[2] for the second task
#Is this ok? Does the previous call to 'plan' need some cleaning before?
plan(cluster,workers=workers[2])
f=future({
#do low-memory work
unname(Sys.info()["nodename"])
})
message(value(f))
#ip-172-16-2-21
stopCluster(workers)
Author of future here:
Yes, it is all right to change future strategies like that, i.e. by using plan(). An alternative is to use:
f <- cluster({
#do low-memory work
unname(Sys.info()["nodename"])
}, workers = workers[2])
which is basically what is happening internally.
The downside of explicitly specifying future strategies like this is that your code will be hard coded to use cluster futures.
FYI, I'm planning to add some kind of mechanism for specifying preferred or required "resources" per future. This is just conceptual for now and will not exist anytime soon, but I'm thinking of something along the lines of:
f <- future({ ... }, needs = "himem")
where one can query workers for the himem tag / property, e.g. attr(workers[2], "provides") <- c("himem", "superfast"). I'm sharing these thoughts just so you know that I'm aware of needs like yours. Again, it will be quite some time before such mechanisms are available, so in the meanwhile, you need to explicitly specify the future strategy as above.
BTW, instead of:
slaveIPs=c("172.16.2.10","172.16.2.21")
masterIP="172.16.2.33"
workers=makePSOCKcluster(slaveIPs,master=masterIP)
you can try:
slaveIPs <- c("172.16.2.10", "172.16.2.21")
workers <- makeClusterPSOCK(slaveIPs)
provided by the future package - this avoids having to know/specify the IP address of master.
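The plan-switching pattern from the question can be tried locally with a sketch like the one below. It assumes two local PSOCK workers standing in for the two remote machines, and it uses process IDs instead of node names to check where each future ran.

```r
library(parallel)
library(future)

# Two local workers stand in for the remote machines (assumption for
# illustration only; the question uses remote IPs).
workers <- makePSOCKcluster(2)
pids <- unlist(clusterCall(workers, Sys.getpid))

# Route the first task to worker 1 only.
plan(cluster, workers = workers[1])
f1 <- future(Sys.getpid())
pid1 <- value(f1)

# Switch the plan so the second task goes to worker 2 only.
plan(cluster, workers = workers[2])
f2 <- future(Sys.getpid())
pid2 <- value(f2)

# Stop using the cluster before shutting it down.
plan(sequential)
stopCluster(workers)
```

Here pid1 should match pids[1] and pid2 should match pids[2], which is the same check the question performs with node names.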

H2O running slower than data.table R

How is it possible that storing data into an H2O frame is slower than storing it into a data.table?
#Packages used "H2O" and "data.table"
library(h2o)
library(data.table)
#start the H2O server before creating any H2O frames
h2o.init(nthreads=-1)
#create the matrix
matrix1<-data.table(matrix(rnorm(1000*1000),ncol=1000,nrow=1000))
matrix2<-h2o.createFrame(1000,1000)
#Data.table variable store
for(i in 1:1000){
matrix1[i,1]<-3
}
#H2O Matrix Frame store
for(i in 1:1000){
matrix2[i,1]<-3
}
Thanks!
H2O is a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)
So what you've shown is a very inefficient way to fill an H2O frame in H2O memory. Every write turns into a network call. You almost certainly don't want this.
For your example, since the data isn't large, a reasonable thing to do would be to do the initial assignment to a local data frame (or data.table) and then push it with as.h2o().
h2o_frame = as.h2o(matrix1)
head(h2o_frame)
This pushes an R data frame from the R client into an H2O frame in H2O server memory. (And you can do as.data.table() to do the opposite.)
data.table Tips:
For data.table, prefer the in-place := syntax. This avoids copies. So, for example:
matrix1[i, 3 := 42]
H2O Tips:
The fastest way to read data into H2O is by ingesting it using the pull method in h2o.importFile(). This is parallel and distributed.
The as.h2o() trick shown above works well for small datasets that easily fit in memory of one host.
If you want to watch the network messages between R and H2O, call h2o.startLogging().
I can't answer your question because I don't know h2o. However, I can make a guess.
Your code to fill the data.table is slow because of "copy-on-modify" semantics. If you update your table by reference, you will speed up your code enormously.
# original loop: each assignment goes through copy-on-modify
for(i in 1:1000){
matrix1[i,1]<-3
}
# same update by reference with set()
for(i in 1:1000){
set(matrix1, i, 1L, 3)
}
With set my loop takes 3 milliseconds, while your loop takes 18 seconds (6000 times longer).
I suppose h2o to work the same way but with some extra stuff done because this is a special object. Maybe some message passing communication to the H2O cluster?
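As a further sketch of the by-reference idea (assuming the data.table package is installed, and using a smaller table than the question's), the loop can be dropped entirely because set() and := both accept a whole vector of rows:

```r
library(data.table)

# Smaller stand-in for the 1000 x 1000 table in the question.
dt <- data.table(matrix(rnorm(100 * 10), nrow = 100, ncol = 10))

# One vectorised, by-reference update of rows 1..100 in column 1.
set(dt, i = 1:100, j = 1L, value = 3)

# The := form does the same kind of in-place update, addressed by name.
dt[, V2 := 42]
```

Both calls modify dt in place, so there is no per-row copy and no per-row function call overhead.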

What is the best way to avoid passing a data frame around?

I have 12 data.frames to work with. They are similar and I have to do the same processing to each one, so I wrote a function that takes a data.frame, processes it, and then returns a data.frame. This works. But I am afraid that I am passing around a very big structure. I may be making temporary copies (am I?) This can't be efficient. What is the best way to avoid passing a data.frame around?
doSomething <- function(df) {
# do something with the data frame, df
return(df)
}
You are, indeed, passing the object around and using some memory. But I don't think you can do an operation on an object in R without passing the object around. Even if you didn't create a function and did your operations outside of the function, R would behave basically the same.
The best way to see this is to set up an example. If you are in Windows open Windows Task Manager. If you are in Linux open a terminal window and run the top command. I'm going to assume Windows in this example. In R run the following:
col1<-rnorm(1000000,0,1)
col2<-rnorm(1000000,1,2)
myframe<-data.frame(col1,col2)
rm(col1)
rm(col2)
gc()
this creates a couple of vectors called col1 and col2 then combines them into a data frame called myframe. It then drops the vectors and forces garbage collection to run. Watch in your windows task manager at the mem usage for the Rgui.exe task. When I start R it uses about 19 meg of mem. After I run the above commands my machine is using just under 35 meg for R.
Now try this:
myframe<-myframe+1
your memory usage for R should jump to over 144 meg. If you force garbage collection using gc() you will see it drop back to around 35 meg. To try this using a function, you can do the following:
doSomething <- function(df) {
df<-df+1-1
return(df)
}
myframe<-doSomething(myframe)
when you run the code above, memory usage will jump up to 160 meg or so. Running gc() will drop it back to 35 meg.
So what to make of all this? Well, doing an operation outside of a function is not that much more efficient (in terms of memory) than doing it in a function. Garbage collection cleans things up real nice. Should you force gc() to run? Probably not as it will run automatically as needed, I just ran it above to show how it impacts memory usage.
I hope that helps!
I'm no R expert, but most languages use a reference counting scheme for big objects. A copy of the object data will not be made until you modify the copy of the object. If your functions only read the data (i.e. for analysis) then no copy should be made.
I came across this question looking for something else, and it's old - so I'll just provide a brief answer for now (leave a comment if you'd like more explanation).
You can pass around environments in R which contain anywhere from 1 to all of your variables. But probably you don't need to worry about it.
[You might also be able to do something similar with classes. I only currently understand how to use classes for polymorphic functions - and note there's more than 1 class system kicking around.]
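A minimal sketch of the environment approach mentioned above, using only base R. The names store and process_in_place are made up for the illustration. Environments are passed by reference, so the function changes the caller's data without returning it:

```r
# Hypothetical helper: updates the data frame held in 'env' in place.
process_in_place <- function(env) {
  env$df$col1 <- env$df$col1 + 1
  invisible(NULL)
}

store <- new.env()
store$df <- data.frame(col1 = c(1, 2, 3))

process_in_place(store)

# The caller's copy has changed even though nothing was returned.
result <- store$df$col1
```

Whether this actually saves memory depends on what the function does inside; as the answers above suggest, it is often not worth the loss of clarity.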

Resources