I'm trying to tune a model using sparklyr. Using for loops to tune the parameters is not parallelising the work as expected, and it is taking a lot of time.
My Question:
Is there any alternative that I can use to parallelize the work?
id_wss <- NA
for (i in 2:8)
{
id_cluster <- ml_kmeans(id_ip4, centers = i, seed = 1234, features_col = colnames(id_ip4))
id_wss[i] <- id_cluster$cost
}
There is nothing specifically wrong with your code when it comes to concurrency:
The distributed and parallel part is model fitting process ml_kmeans(...). For loop doesn't affect that. Each model will be trained using resources available on your cluster as expected.
The outer loop is driver code. Under normal conditions we use standard single-threaded code (not that multithreading at this level is really an option in R) when working with Spark.
In general (Scala, Python, Java) it is possible to use separate threads to submit multiple Spark jobs at the same time, but in practice it requires a lot of tuning and access to the low-level API. Even there it is rarely worth the fuss, unless you have a significantly overpowered cluster at your disposal.
That being said, please keep in mind that if you compare Spark K-Means to local implementations on data that fits in memory, things will be relatively slow. Using randomized initialization might help speed things up:
ml_kmeans(id_ip4, centers = i, init_mode = "random",
seed = 1234, features_col = colnames(id_ip4))
On a side note, with algorithms that can be easily evaluated with one of the available evaluators (ml_binary_classification_evaluator, ml_multiclass_classification_evaluator, ml_regression_evaluator) you can use ml_cross_validator / ml_train_validation_split instead of manual loops (see for example How to train a ML model in sparklyr and predict new values on another dataframe?).
Related
I have a (large) neural net being trained by the nnet package in R. I want to be able to simulate predictions from this neural net, and do so in a parallelised fashion using something like foreach, which I've used before with success (all on a Windows machine).
My code is essentially of the form
library(nnet)
data = data.frame(out=c(0, 0.1, 0.4, 0.6),
in1=c(1, 2, 3, 4),
in2=c(10, 4, 2, 6))
net = nnet(out ~ in1 + in2, data=data, size=5)
library(doParallel)
registerDoParallel(cores=detectCores()-2)
results = foreach(test=1:10, .combine=rbind, .packages=c("nnet")) %dopar% {
result = predict(net, newdata = data.frame(in1=test, in2=5))
return(result)
}
except with a much larger NN being fit and predicted from; it's around 300MB.
The code above runs fine when using a traditional for loop, or when using %do%, but when using %dopar%, everything gets loaded into memory for each core being used - around 700MB each. If I run it for long enough, everything eventually explodes.
Having looked up similar problems, I still have no idea what is causing this. Omitting the 'predict' part has everything run smoothly.
How can I have each core lookup the unchanging 'net' rather than having it loaded into memory? Or is it not possible?
When you start new parallel workers, you're essentially creating a new environment, which means that whatever operations you perform in that new environment will require access to the relevant variables/functions.
For instance, you have to specify .packages=c("nnet") because you require the nnet package within each new worker (environment), and this is how you "clone" or "export" from the global environment to each worker env.
Because you require the trained neural network to make predictions, you will need to export it to each worker as well, and I don't see a way around the memory blowup you're experiencing. If you're still interested in parallelization but are running out of memory, my only advice is to look into doMPI.
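To make the "export to each worker" point concrete, here is a minimal sketch using only the base parallel package. A small lm fit stands in for the large nnet object (that substitution, and the names fit, cl and preds, are mine, not from the question). It does not avoid the copies; it only makes them explicit, so each worker still holds its own copy of the model:

```r
library(parallel)

# Small stand-in for the large trained network (assumption: any model
# object with a predict() method behaves the same way here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two independent background R sessions (PSOCK is the default cluster type).
cl <- makeCluster(2)

# Copy the model into each worker's global environment.
clusterExport(cl, "fit", envir = environment())

# Each task predicts from the worker's own copy of `fit`.
preds <- parLapply(cl, 1:4, function(w) {
  predict(fit, newdata = data.frame(wt = w, hp = 110))
})
stopCluster(cl)

preds <- unlist(preds)
```

Memory use grows with the number of workers, so with a 300MB object the practical knob is to use fewer workers.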
How can I have each core lookup the unchanging 'net' rather than having it loaded into memory? Or is it not possible?
CPak's reply explains what's going on; you're effectively running multiple copies (=workers) of the main script in separate R sessions. Since you're on Windows, calling
registerDoParallel(cores = n)
expands to:
cl <- parallel::makeCluster(n, type = "PSOCK")
registerDoParallel(cl)
which is what sets up n independent background R workers, each with its own independent memory address space.
Now, if you'd been on a Unix-like system, it would instead have corresponded to using n forked R workers, cf. parallel::mclapply(). Forked processes are not supported by R on Windows. With forked processing, you would effectively get what you're asking for, because forked child processes will share the objects already allocated by the main process (as long as such objects are not modified), e.g. net.
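For comparison, a rough sketch of the forked variant with parallel::mclapply(). Again a small lm fit stands in for net (my assumption, as are the variable names). On a Unix-like system the children read fit from the parent's memory without an explicit export; on Windows mc.cores has to be 1, so the sketch falls back to sequential there:

```r
library(parallel)

# Stand-in for the large, unchanging model object.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# No export step: forked children see `fit` as allocated by the parent,
# and the pages stay shared as long as nobody modifies the object.
preds <- mclapply(1:4, function(w) {
  predict(fit, newdata = data.frame(wt = w, hp = 110))
}, mc.cores = n_cores)

preds <- unlist(preds)
```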
I notice that programming in R constantly runs into speed problems, especially when the code involves growing a list. It's just very unintuitive what is slowing down the program so dramatically during the looping.
Specifically, I'm using caret to train a gbm model. After getting the tuned hyperparameters, I need to do LOOCV to obtain the test error, which requires training the model n times (n = number of samples). All I store in the list is my prediction result. Yet the list grows more slowly as the loop progresses.
Can you offer some general advice for testing the memory issues related to R programming?
First create an empty list/vector of size n, so that R does not have to reallocate and copy it every time it adds one more value.
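A minimal sketch of that advice, with a trivial computation standing in for the per-sample model fit (n, preds and preds_vec are illustrative names):

```r
n <- 1000

# Allocate the full-length list once, up front.
preds <- vector("list", n)

for (i in seq_len(n)) {
  # Placeholder for "train on all samples but i, predict sample i".
  preds[[i]] <- i^2
}

# When every result is a single number, vapply() allocates the
# output vector for you and checks the type of each result.
preds_vec <- vapply(seq_len(n), function(i) i^2, numeric(1))
```

Filling slots by index avoids repeatedly extending the object, which is the pattern that gets slower as the loop progresses.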
Suppose that I want to do bootstrap procedure 1000 times on each of 100 different simulated data set.
At top level, I can set up foreach backend to distribute the 100 jobs to different CPUs. Then at the lower level, by using function boot from R package boot I can also invoke parallel computing by specifying 'parallel' option in the function.
The pseudo code may look like following.
library(doParallel)
registerDoParallel(cores=4)
foreach(i=seq(100, 5, length.out = 100), .combine=cbind) %dopar% {
sim.dat <- simulateData(i)
boot.res <- boot(sim.dat, mean, R=1000, parallel = 'multicore', ...)
## then extract results and combine
...
}
I am curious to know how the parallel computing really works in this case.
Would the two different levels of parallel computing work at the same time? how would they affect (interact? interrupt? disable?) each other?
More generally, I guess there are now more and more R functions that provide parallel computing option like boot for intensive simulation. In that situation, is there a need to specify the lower-level parallel provided the top level? Or vice versa?
What are the pros and cons, if any, for this two-level parallel setup?
Thanks for any clarification.
EDIT:
I should have explained the problem more clearly. After boot.res is returned, additional calculations are done on it to get the final summary statistics from boot.res. That means the whole computation is not just a set of mutually independent bootstrap procedures. In this case, only the outer parallel loop would mess up the results. So if I understand correctly, the best way would be to use a nested foreach parallel backend, but suppress the 'parallel' option of boot.
Anyone please correct me if I am wrong. Regards.
END EDIT
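One way to picture the setup described in the edit (parallel at the outer level, boot kept sequential) is the sketch below. It uses parallel::mclapply() instead of foreach so that it only needs packages shipped with R; boot_mean, one_run and the rnorm() data are placeholders I made up for the simulation step. Note that boot() expects a statistic of the form function(data, indices), so the plain mean is wrapped accordingly:

```r
library(parallel)
library(boot)

# boot() calls the statistic with the data and a vector of resampled indices.
boot_mean <- function(d, idx) mean(d[idx])

# One outer task: simulate, bootstrap sequentially, then summarise.
one_run <- function(n) {
  sim_dat <- rnorm(n)                                      # placeholder simulation
  b <- boot(sim_dat, boot_mean, R = 200, parallel = "no")  # inner level stays serial
  c(n = n, estimate = b$t0, se = sd(b$t[, 1]))
}

# Only the outer level is parallel (forking is unavailable on Windows).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(c(20, 50, 100), one_run, mc.cores = n_cores)

summary_tab <- do.call(rbind, res)
```

The post-processing of each bootstrap result happens inside the same outer task, so every task is self-contained and the cores are not oversubscribed by a second layer of workers.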
I've been struggling to perform this sort of analysis and posted on the stats site about whether I was taking things in the right direction, but as I've been investigating I've also found that my lovely beefy processor (linux OS, i7) is only actually using 1 of its cores. Turns out this is default behaviour, but I have a fairly large dataset and between 40 and 50 variables to select from.
A stepAIC function that is checking various different models seems like the ideal sort of thing for parallelizing, but I'm a relative newb with R and I only have sketchy notions about parallel computing.
I've taken a look at the documentation for the packages parallel and snowfall, but these seem to have some built-in list functions for parallelisation and I'm not sure how to morph stepAIC into a form that can be run in parallel using these packages.
Does anyone know 1) whether this is a feasible exercise, 2) how to do what I'm looking to do and can give me a sort of basic structure/list of keywords I'll need?
Thanks in advance,
Steph
I think that a process in which each step depends on the last (as in stepwise selection) is not trivial to do in parallel.
The simplest way to do something in parallel I know is:
library(doMC)
registerDoMC()
l <- foreach(i=1:X) %dopar% { fun(...) }
In my limited understanding of stepwise selection, one removes variables from a model (or adds them, forward/backward) and measures the fit at each step. If removing a variable gives the best model fit, you keep that model, for example. In the foreach parallel function each iteration is blind to the other iterations, so maybe you could write your own function to perform this task, as in
http://beckmw.wordpress.com/tag/stepwise-selection/
I looked at this code, and it seems to me that you could use parallel computing with the vif_func function...
I think you should also check optimized code for this task, as in the package leaps
http://cran.r-project.org/web/packages/leaps/index.html
hope this helps...
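As a rough illustration of where the parallelism can go: the steps themselves are sequential, but within a single step the candidate models are independent of each other. The sketch below scores one forward step on mtcars with base R only; it is not what stepAIC does internally, and candidates, base_terms and score_candidate are names I made up:

```r
library(parallel)

candidates <- c("wt", "hp", "qsec", "disp")  # variables not yet in the model
base_terms <- "1"                            # current model: intercept only

# Fit the current model plus one candidate and return its AIC.
score_candidate <- function(v) {
  f <- reformulate(c(base_terms, v), response = "mpg")
  AIC(lm(f, data = mtcars))
}

# The candidate fits are independent, so they can run side by side
# (forking is unavailable on Windows, where mc.cores must be 1).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
aics <- unlist(mclapply(candidates, score_candidate, mc.cores = n_cores))
names(aics) <- candidates

# The winner would be added to base_terms before the next step.
best <- names(which.min(aics))
```

For small models the overhead of starting workers can easily outweigh the gain, so this only pays off when each fit is expensive.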
I am using the 'R' library "glmulti" and performing an exhaustive search.
relevant code:
local1.model <- glmulti(est, # use the fitted model as a starting point
level = 1, # just look at main effects
method = "h",
crit="aicc") # use AICc because it works better than AIC for small sample sizes
The variable "est" is a fitted GLM that informs glmulti.
If I were writing a Java-based program that had to do the same thing several hundred thousand times, then I would use more than one core.
My glmulti is not using my cores efficiently.
Is there a way to switch it to make use of more of my system?
Note: when I use 'h2o' it can max out the CPU and make a strong hit on the memory.
R is single-threaded (unless the function is built on a library with its own threading). You can manually add parallelization to your code using the parallel package (which is part of core R): http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
I would class it as non-trivial to use. It is a bit of a hack on top of R, so it does lots of memory copying, and you need to think about what is going on if you care about efficiency.
glmulti looks like it ought to be parallel (i.e. each combination of parameters could be done in parallel, even if using a genetic algorithm). My guess is they intended to add it, but development stopped (no updates since Sep 2009).
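To show what "manually adding parallelization" could look like for an exhaustive main-effects search, here is a small sketch that bypasses glmulti entirely. It enumerates every non-empty subset of three mtcars predictors, fits each with glm() and ranks by plain AIC (not AICc); the data set and the names predictors, subsets and fit_one are illustrative assumptions:

```r
library(parallel)

predictors <- c("wt", "hp", "qsec")

# All non-empty subsets of the predictors (main effects only).
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one candidate model and return its AIC.
fit_one <- function(vars) {
  f <- reformulate(vars, response = "mpg")
  AIC(glm(f, data = mtcars))
}

# Each candidate is independent, so the fits can be spread over cores
# (forking is unavailable on Windows, where mc.cores must be 1).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
aics <- unlist(mclapply(subsets, fit_one, mc.cores = n_cores))

best <- subsets[[which.min(aics)]]
```

The number of subsets doubles with every extra predictor, so this brute-force approach is only realistic for a modest number of candidate terms.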