How to manage R packages in a Spark cluster

I'm working with a small Spark cluster (5 DataNodes and 2 NameNodes) running version 1.3.1. I read this entry on a Berkeley blog:
https://amplab.cs.berkeley.edu/large-scale-data-analysis-made-easier-with-sparkr/
It details how to implement gradient descent using SparkR, running a user-defined gradient function in parallel through the SparkR method "lapplyPartition". If lapplyPartition executes the user-defined gradient function on every node, I guess that all the methods used inside that function must be available on every node too. That means R and all of its packages would have to be installed on every node. Did I understand that correctly?
If so, is there any way of managing the R packages? My cluster is small for now, so we can do it manually, but I guess people with large clusters don't do it that way. Any suggestions?
Thanks a lot!
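To illustrate the setup I have in mind, here is a rough sketch using the AMPLab SparkR API (sparkR.init, parallelize, lapplyPartition, collect); computeGradient is just a placeholder for the user-defined gradient function:

library(SparkR)   # the AMPLab SparkR package discussed in the blog post

sc <- sparkR.init(master = "spark://namenode:7077")

# points: a list of numeric matrices, one chunk of training data per element
rdd <- parallelize(sc, points, numSlices = 5)

partialGradients <- lapplyPartition(rdd, function(part) {
  # This body runs on a worker node, so the user-defined computeGradient
  # (and any package it loads via library()) must be available on every DataNode.
  list(Reduce(`+`, lapply(part, computeGradient)))
})

gradient <- Reduce(`+`, collect(partialGradients))

As far as I can tell, everything referenced inside the function passed to lapplyPartition has to be installed on each worker, not just on the driver.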

Related

How to set up a cluster for use with the R parallel package

I have several i7 desktops which I would like to use to speed up the computation time of the genoud function in the rgenoud package. In genoud, you can assign a cluster that was generated with the parallel package (and, I assume, with snow as well).
What kind of clustering software would you recommend for this? I have tried Beowulf clusters, but their documentation is mostly outdated, so I am looking for an up-to-date guide that shows me how to do it.
This cluster is going to run over a LAN, and I have already assigned IPs to the nodes.
Thanks
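A minimal sketch of what the question describes, assuming the cluster argument of genoud and made-up LAN IP addresses; a plain socket cluster from the parallel package is usually enough here, with no Beowulf setup required:

library(parallel)
library(rgenoud)

# Sketch only: substitute the IP addresses you assigned to your nodes.
# makePSOCKcluster() launches R workers on the listed hosts (over SSH),
# which avoids the need for a full Beowulf-style setup.
cl <- makePSOCKcluster(c("192.168.1.101", "192.168.1.102", "192.168.1.103"))

# genoud() accepts a snow/parallel cluster object through its 'cluster' argument
fit <- genoud(fn = function(x) sum((x - 3)^2),   # toy objective, purely for illustration
              nvars = 2,
              cluster = cl)

stopCluster(cl)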

Using R Parallel with other R packages

I am working on a very time-intensive analysis using the lqmm package in R. I set the model running on Thursday; it is now Monday and it is still running. I am confident in the model itself (tested as a standard MLM), and I am confident in my lqmm code (I have run several other very similar LQMMs with the same dataset, and they all took over a day to run). But I'd really like to figure out how to make this run faster, if possible, using the parallel-processing capabilities of the machines I have access to (note that all of them run Microsoft Windows).
I have read through several tutorials on using parallel, but I have yet to find one that shows how to use the parallel package in concert with other R packages. Am I overthinking this, or is it just not possible?
Here is the code that I am running using the R package LQMM:
install.packages("lqmm")
library(lqmm)
g1.lqmm <- lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd + IEP*IEPZ + x*pm + x*sd + x*IEPZ,
                random = ~1 + x + IEP + pm + sd + IEPZ,
                group = peers,
                tau = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                na.action = na.omit,
                data = g1data)
The dataset has 122433 observations on 58 variables. All variables are z-scored or dummy coded.
The dependent libraries will need to be loaded on all your nodes. The parallel package provides the clusterEvalQ function for this purpose. You might also need to export some of your data to the global environments of your worker nodes; for this you can use the clusterExport function. Also see this page for more info on other relevant functions that might be useful to you.
In general, to speed up your application by using multiple cores you will have to split your problem into multiple subpieces that can be processed in parallel on different cores. To achieve this in R, you first need to create a cluster and assign a particular number of cores to it. Next, you have to register the cluster, export the required variables to the nodes and then evaluate the necessary libraries on each of your worker nodes. The exact way you set up your cluster and launch the nodes depends on the libraries and functions you will use. As an example, your cluster setup might look like this when you choose to utilize the doParallel package (and most of the other parallelisation libraries/functions):
library(doParallel)
nrCores <- detectCores()
cl <- makeCluster(nrCores)
registerDoParallel(cl)
clusterExport(cl, c("g1data"), envir = environment())
clusterEvalQ(cl, library("lqmm"))
The cluster is now prepared. You can assign subparts of the global task to each individual node in your cluster. In the general example below, each node will process subpart i of the global task. Here we use the foreach %dopar% functionality provided by the doParallel package:
"The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later."
The subresults are automatically collected in resultList. Finally, when all subprocesses are finished, we merge the results:
resultList <- foreach(i = 1:nrCores) %dopar%
{
#process part i of your data.
}
stopCluster(cl)
#merge data..
Since your question was not specifically on how to split up your data I will let you figure out the details of this part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.
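For completeness, a generic and purely illustrative sketch of that split/process/merge pattern (run before stopCluster(cl), with the cluster registered as above) could split the rows of g1data into one chunk per core and bind the partial results back together; note that this does not parallelize a single lqmm() call:

chunks <- split(seq_len(nrow(g1data)),
                cut(seq_len(nrow(g1data)), nrCores, labels = FALSE))

resultList <- foreach(idx = chunks, .packages = "lqmm") %dopar% {
  partData <- g1data[idx, ]
  # ...process partData here and return the partial result...
  partData
}

combined <- do.call(rbind, resultList)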
It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:
Split the one call of lqmm into multiple function calls;
Parallelize a loop inside lqmm.
Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
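For example, a randomForest fit can be split over the ntree argument along the lines of the following sketch (iris is used only as a stand-in dataset):

library(doParallel)
library(randomForest)

cl <- makeCluster(4)
registerDoParallel(cl)

# Grow 500 trees as four forests of 125 trees each and merge them afterwards
rf <- foreach(ntree = rep(125, 4),
              .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

stopCluster(cl)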
But many times, in order to parallelize a function you have to modify it. That may actually be easier, because you may not have to figure out how to split up the problem and combine the partial results. You may only need to convert an lapply call into a parallel lapply, or convert a for loop into a foreach loop. However, it is often time-consuming to understand the code. It is also a good idea to profile the code, so that your parallelization really does speed up the function call.
I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.

Running two commands in parallel in R on Windows

I have tried reading around on the net about using parallel computing in R.
My problem is that I want to utilize all the cores on my PC, and after reading various resources I am not sure whether I need packages like multicore for my purposes, which unfortunately does not work on Windows.
Can I simply split my very large data sets into several sub-datasets and run the same function on each, with each call running on a different core? They don't need to talk to each other, and I just need the output from each. Is that really impossible to do on Windows?
Suppose I have a function called timeanalysis() and two datasets, 1 and 2. Can't I call the same function twice and tell it to use a different core each time?
timeanalysis(1)
timeanalysis(2)
I have found the snowfall package to be the easiest to use on Windows for parallel tasks.
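A minimal sketch of that approach, assuming the timeanalysis() function and the two datasets from the question (here called dataset1 and dataset2):

library(snowfall)

sfInit(parallel = TRUE, cpus = 2)    # socket cluster; works on Windows
sfExport("timeanalysis")             # make the function available on the workers
results <- sfLapply(list(dataset1, dataset2), timeanalysis)
sfStop()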

R bindings for Mapnik?

I frequently find myself doing some analysis in R and then wanting to make a quick map. The standard plot() function does a reasonable job of quick maps, but I quickly find that I need to go to ggplot2 when I want something that looks nice or has more complex symbology requirements. ggplot2 is great, but it is sometimes cumbersome to convert a SpatialPolygonsDataFrame into the format ggplot2 requires. ggplot2 can also be a tad slow when dealing with large maps that require specific projections.
It seems like I should be able to use Mapnik to plot spatial objects directly from R, but after exhausting my Google-fu, I cannot find any evidence of bindings. Rather than assume that such a thing doesn't exist, I thought I'd check here to see if anyone knows of an R - Mapnik binding.
The Mapnik FAQ explicitly mentions Python bindings -- as does the wiki -- with no mention of R, so I think you are correct that no (Mapnik-sponsored, at least) R bindings currently exist for Mapnik.
You might get a more satisfying (or at least more detailed) answer by asking on the Mapnik users list. They will know for certain if any projects exist to make R bindings for Mapnik, and if not, your interest may incite someone to investigate the possibility of generating bindings for R.
I would write the SpatialWotsitDataFrames to Shapefiles and then launch a Python Mapnik script. You could even use R to generate the Python script (the 'brew' package is handy for making files from templates and inserting values from R).
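A rough sketch of that workflow, assuming rgdal for the shapefile export and a hypothetical render.py.brew template containing <%= ... %> placeholders:

library(rgdal)   # writeOGR() for the shapefile export
library(brew)    # fills text templates with values from R

# my_polys stands in for your SpatialPolygonsDataFrame
writeOGR(my_polys, dsn = "out", layer = "my_polys", driver = "ESRI Shapefile")

# Generate the Python/Mapnik script from the template, then run it
brew("render.py.brew", "render.py")
system("python render.py")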

Making sense of R-Hive, Elastic MapReduce, RHIPE and Distributed Text Mining with R

After having learned about MapReduce for solving a computer vision problem for my recent internship at Google, I felt like an enlightened person. I had been using R for text mining already. I wanted to use R for large scale text processing and for experiments with topic modeling. I started reading tutorials and working on some of those. I will now put down my understanding of each of the tools:
1) R text mining toolbox: Meant for local (client side) text processing and it uses the XML library
2) Hive: Hadoop interactive; it provides the framework to call map/reduce and also provides the DFS interface for storing files on the DFS.
3) RHIPE: R Hadoop integrated environment
4) Elastic MapReduce with R: a MapReduce framework for those who do not have their own clusters
5) Distributed Text Mining with R: an attempt to make a seamless move from local to server-side processing, from R-tm to R-distributed-tm
I have the following questions and confusions about the above packages:
1) Hive, RHIPE and the distributed text mining toolbox need you to have your own cluster. Right?
2) If I have just one computer, how would DFS work in the case of Hive?
3) Are we facing the problem of duplication of effort with the above packages?
I am hoping to get insights on the above questions in the next few days
(1) Well, Hive and RHIPE don't need a multi-node cluster; you can run them on a single-node cluster.
RHIPE is basically a framework (in R terms, a package) that integrates R and Hadoop so that you can leverage the power of R on Hadoop. To use RHIPE you don't need a cluster; you can run it either way, i.e. in cluster mode or in pseudo-distributed mode. Even if you have a Hadoop cluster of more than 2 nodes, you can still use RHIPE in local mode by setting the property mapred.job.tracker='local'.
You can go to my site (search for "Bangalore R user groups") and see how I have tried solving problems using RHIPE; I hope that gives you a fair idea.
(2) By Hive, do you mean the hive package in R? The name of that package is somewhat misleading, given Hive (the Hadoop data warehouse).
The hive package in R is similar to RHIPE, only with some additional functionalities (I have not gone through it fully). When I first saw the hive package I thought they had integrated R with Hive (the warehouse), but after looking at the functionality it was not like that.
Hadoop's data warehouse, Hive, is basically for when you are interested in some subset of results that should be computed over a subset of the data, which you would normally do using SQL queries. Queries in Hive are also very similar to SQL queries.
To give you a very simple example: let's say you have 1 TB of stock data for various stocks over the last 10 years. The first thing you do is store it on HDFS and then create a Hive table on top of it. That's it... now fire whatever query you wish. You might also want to do a more complex calculation, like finding a simple moving average (SMA); in that case you can write your own UDF (user-defined function). Besides that, you can also use a UDTF (user-defined table-generating function).
(3) If you have one system, that means you are running Hadoop in pseudo-distributed mode. Moreover, you need not worry about whether Hadoop is running in pseudo mode or cluster mode, since Hive needs to be installed only on the NameNode, not on the DataNodes. Once it is properly configured, Hive will take care of submitting the jobs to the cluster.
Unlike Hive, you need to install R and RHIPE on all the DataNodes as well as the NameNode. But at any point in time, if you want to run a job only on the NameNode, you can do so as I mentioned above.
(4) One more thing: RHIPE is meant only for batch jobs, which means the MapReduce job will run on the whole data set, whereas with Hive you can run on a subset of the data.
(5) I would like to understand what exactly you are doing in text mining. Are you trying to do some kind of NLP, such as named entity recognition using HMMs (Hidden Markov Models), CRFs (conditional random fields), feature vectors or SVMs (support vector machines)?
Or are you simply trying to do document clustering, indexing, etc.?
Well, there are packages like tm, openNLP, HMM, SVM, etc.
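If it is the latter, the local tm workflow mentioned above starts out something like this (toy documents only):

library(tm)

docs <- Corpus(VectorSource(c("text mining with R",
                              "topic modeling at scale")))
dtm <- DocumentTermMatrix(docs, control = list(tolower = TRUE))
inspect(dtm)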
I'm not familiar with the distributed text mining with R application, but Hive can run on a local cluster or on a single-node cluster. This can be done for experimenting or in practice, but it does defeat the purpose of having a distributed file system for serious work. As for duplication of effort, Hive is meant to be a complete SQL implementation on top of Hadoop, so there is duplication insofar as both SQL and R can work with text data, but not insofar as both are specific tools with different strengths.
