How to call Spark MLlib algorithms using R or SparkR?

I am trying to use SparkR and R as a front end to develop machine learning models.
I want to make use of Spark's MLlib, which works on distributed data frames. Is there any way to call Spark MLlib algorithms from R?

Unfortunately, no. We will have to wait for Apache Spark 1.5 for SparkR-MLlib bindings.
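For reference, once those bindings arrive, calling an MLlib-backed model from SparkR looks roughly like the following (a minimal sketch against the Spark 1.5 SparkR API, assuming a local Spark installation; the iris columns are only illustrative):

    library(SparkR)
    sc <- sparkR.init()                  # Spark 1.5-era entry points
    sqlContext <- sparkRSQL.init(sc)

    # Promote a local data.frame to a distributed DataFrame
    # (SparkR replaces '.' in column names with '_')
    df <- createDataFrame(sqlContext, iris)

    # glm() on a DataFrame is backed by MLlib's linear models
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
    summary(model)

    predictions <- predict(model, newData = df)
    head(select(predictions, "Sepal_Length", "prediction"))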

Related

How to create and use an R function in SAS 9.4?

I have defined some R functions in RStudio that involve some complicated scripts and a lot of readline calls. I can run them successfully in RStudio. Is there any way, like macros, to transfer these user-defined functions to SAS 9.4 and use them there? I am not very familiar with SAS programming, so it would be best to just copy the R functions into SAS and use them directly. I am trying to figure out how to do the transformation. Thank you!
You can't natively run R code in SAS, and you probably wouldn't want to. R and SAS are entirely different concepts: SAS is closer to a database language, while R is a matrix language. Efficient R approaches are terrible in SAS, and vice versa. (Try a simple loop and you'll find SAS is orders of magnitude faster; try matrix algebra, on the other hand, and R wins.)
You can call R from SAS, though. You need to be in PROC IML, SAS's matrix language (which may be a separate license from your SAS installation); once there, you use submit / R to send the enclosed code to R. You need the RLANG system option to be set, you may need some additional configuration on your SAS box so it can see your R installation, and you need R 3.0+. You also need to be running SAS 9.22 or newer.
If you don't have R available through IML, you can use x or call system, if those are enabled and you have access to R through the command line. Alternatively, you can run R by hand separately from SAS. Either way, you would use a CSV or similar file format to transfer data back and forth.
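If you go the file-exchange route, the R side is just ordinary CSV reads and writes. A minimal sketch (the file names and my_complicated_function are made up; the SAS side would export and import the same files):

    # R side of a CSV hand-off with SAS (illustrative names only)
    input <- read.csv("from_sas.csv", stringsAsFactors = FALSE)   # data exported by SAS

    result <- my_complicated_function(input)                      # your existing R code

    write.csv(result, "to_sas.csv", row.names = FALSE)            # for SAS to read back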
Finally, I recommend seeing if there's a better approach in SAS for the same problem you solved in R. There usually is, and it's often quite fast.

Using CRAN packages inside SparkR

If I wanted to use a standard R package like MXNet inside SparkR, is this possible? Can standard CRAN packages be used inside the Spark distributed environment without having to think about a local data.frame vs. a Spark DataFrame? Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle the DataFrame down, and then convert it to a local data.frame to use the standard CRAN package? Is there another strategy that I'm not aware of?
Thanks
Can standard CRAN packages be used inside the Spark distributed environment without having to think about a local data.frame vs. a Spark DataFrame?
No, they cannot.
Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle the DataFrame down, and then convert it to a local data.frame?
Sadly, most of the time this is what you do.
Is there another strategy that I'm not aware of ?
The dapply and gapply functions in Spark 2.0 can apply arbitrary R code to the partitions or groups of a Spark DataFrame.
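A minimal sketch of the dapply route (Spark 2.0+, assuming a running SparkR session): inside the applied function you are handed an ordinary local data.frame, so arbitrary R or CRAN code can run there, provided the package is installed on every worker.

    library(SparkR)
    sparkR.session()

    df <- createDataFrame(iris)    # '.' in column names becomes '_'

    # The output schema of the applied function must be declared up front
    schema <- structType(structField("Sepal_Length", "double"),
                         structField("ratio", "double"))

    out <- dapply(df,
                  function(part) {
                    # 'part' is a plain data.frame holding one partition;
                    # any local R code is fine here
                    data.frame(Sepal_Length = part$Sepal_Length,
                               ratio = part$Sepal_Length / part$Sepal_Width)
                  },
                  schema)

    head(collect(out))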

Why is collect in SparkR so slow?

I have a 500K-row Spark DataFrame that lives in a Parquet file. I'm using Spark 2.0.0 and the SparkR package included with Spark (with RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the Spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
@Will
I don't know whether the following actually answers your question or not, but Spark evaluates lazily. The transformations done in Spark (or SparkR) don't really create any data; they just build a logical plan to follow.
When you run actions like collect, Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not that large and can be handled by local R easily, then there is no need to go with SparkR. Another solution is to cache your data for frequent use.
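For example, caching once before repeated actions avoids re-reading (and re-decompressing) the source every time. A sketch, with a made-up file path:

    library(SparkR)
    sparkR.session()

    sdf <- read.df("my_data.parquet", source = "parquet")   # hypothetical path

    cache(sdf)                 # keep the data in executor memory after the first action
    nrow(sdf)                  # first action materializes and caches it

    local_df <- collect(sdf)   # later actions, including collect, reuse the cache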
Short answer: serialization/deserialization is very slow.
See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.

RandomForest algorithm in SparkR?

I have implemented the randomForest algorithm in R and am trying to implement the same using SparkR (from Apache Spark 2.0.0).
But I found only linear-model functions like glm() implemented in SparkR:
https://www.codementor.io/spark/tutorial/linear-models-apache-spark-1-5-uses-present-limitations
and couldn't find any RandomForest (decision tree algorithm) examples.
There is RandomForest in Spark's MLlib, but I can't find the R bindings for MLlib either.
Kindly let me know whether SparkR (2.0.0) supports RandomForest. If not, is it possible to connect SparkR with MLlib to use RandomForest?
If that isn't possible either, how can we achieve this using SparkR?
True, it's not available in SparkR as of now.
A possible option is to build random forests on distributed chunks of the data and combine your trees later; a rough sketch follows the link below.
Anyway, it's all about randomness.
A good link: https://groups.google.com/forum/#!topic/sparkr-dev/3N6LK7k4NB0
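A rough sketch of that chunk-and-combine idea using the CRAN randomForest package (here chunks is assumed to be a list of local data.frames, e.g. collected group by group from the Spark DataFrame, and Species ~ . is only an illustrative formula; this is not an MLlib binding):

    library(randomForest)

    # 'chunks' is assumed: a list of local data.frames, one per piece of the data
    forests <- lapply(chunks, function(d) {
      randomForest(Species ~ ., data = d, ntree = 100)   # use your own response variable
    })

    # randomForest::combine merges several forests into one larger ensemble
    big_forest <- do.call(randomForest::combine, forests)
    print(big_forest)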

How to manage R packages in spark cluster

I'm working with a small Spark cluster (5 DataNodes and 2 NameNodes) running version 1.3.1. I read this entry on the Berkeley AMPLab blog:
https://amplab.cs.berkeley.edu/large-scale-data-analysis-made-easier-with-sparkr/
where it's detailed how to implement gradient descent using SparkR: running a user-defined gradient function in parallel through the SparkR method "lapplyPartition". If lapplyPartition makes the user-defined gradient function execute on every node, I guess that all the methods used inside that function should be available on every node too. That means R and all the required packages should be installed on every node. Did I understand that correctly?
If so, is there any way of managing the R packages? Right now my cluster is small, so we can do it manually, but I guess people with large clusters don't do it like this. Any suggestions?
Thanks a lot!
