Is it possible to use a standard R package like MXNet inside SparkR? Can standard CRAN packages be used inside the Spark distributed environment without worrying about whether a data frame is local or a Spark DataFrame? When working with large data sets in R and Spark, is the strategy to use a Spark DataFrame, whittle it down, and then convert it to a local data.frame so a standard CRAN package can be used? Is there another strategy that I'm not aware of?
Thanks
Can standard CRAN packages be used inside the Spark distributed environment without worrying about whether a data frame is local or a Spark DataFrame?
No, they cannot.
Is the strategy in working with large data sets in R and Spark to use a Spark DataFrame, whittle down the DataFrame, and then convert it to a local data.frame?
Sadly, most of the time this is what you do.
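A minimal sketch of that whittle-down-then-collect pattern in SparkR (Spark 2.x); the parquet path and the column names (id, value, year) are hypothetical:
library(SparkR)
sparkR.session()

# Read the full data set as a distributed SparkDataFrame
sdf <- read.df("hdfs:///data/events.parquet", source = "parquet")

# Do the heavy filtering and column selection on the cluster...
small_sdf <- select(filter(sdf, sdf$year == 2016), "id", "value")

# ...then pull only the reduced result into a local data.frame
local_df <- collect(small_sdf)

# From here on, any CRAN package that accepts a data.frame works as usual
summary(local_df$value)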
Is there another strategy that I'm not aware of ?
The dapply and gapply functions in Spark 2.0 can apply arbitrary R code to the partitions or groups of a SparkDataFrame.
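A minimal sketch of both, assuming hypothetical column names (id, value, group_id); inside the applied function the data arrives as an ordinary data.frame, so standard CRAN packages can be used there:
library(SparkR)

# Output schema of the function applied to each partition
schema <- structType(structField("id", "integer"), structField("score", "double"))

# dapply: run arbitrary R code on each partition of the SparkDataFrame
scored <- dapply(sdf, function(pdf) {
  # pdf is a plain data.frame; any CRAN code that returns a data.frame works here
  data.frame(id = pdf$id, score = pdf$value * 2)
}, schema)

# gapply: the same idea, but applied per group of a key column
agg_schema <- structType(structField("group_id", "string"),
                         structField("avg_value", "double"))
by_group <- gapply(sdf, "group_id", function(key, pdf) {
  data.frame(group_id = key[[1]], avg_value = mean(pdf$value),
             stringsAsFactors = FALSE)
}, agg_schema)

head(collect(by_group))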
Related
I am currently working on Databricks using R notebooks. I would love to combine functionality from the two R interfaces to Spark, namely SparkR and sparklyr. Therefore, I need to make use of SparkR functions on sparklyr Spark DataFrames (SDFs) and vice versa.
I know that generally it is not possible to do so in a straightforward way. However, I also know about workarounds to make use of PySpark SDFs in SparkR, which basically means creating a temp view of the PySpark SDF with
%py
spark_df.createOrReplaceTempView("PySparkSDF")
and subsequently to get it into SparkR via
%r
SparkR::sql("REFRESH TABLE PySparkSDF ")
sparkR_df <- SparkR::sql("SELECT * FROM PySparkSDF ")
Is it by any means possible to combine sparklyr and SparkR in a similar way? An explanation would also be highly appreciated!
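For what it's worth, a hedged sketch of what "a similar way" could look like, assuming SparkR and sparklyr are attached to the same Spark session (as on recent Databricks runtimes); sc, tbl_sparklyr, and the view names are hypothetical:
library(sparklyr)
library(dplyr)

# sparklyr -> SparkR: register the sparklyr SDF as a temp view ...
sdf_register(tbl_sparklyr, "SparklyrSDF")
# ... and read it back into SparkR via SQL
sparkR_df <- SparkR::sql("SELECT * FROM SparklyrSDF")

# SparkR -> sparklyr: publish the SparkR DataFrame as a temp view ...
SparkR::createOrReplaceTempView(sparkR_df, "SparkRSDF")
# ... and pick it up as a sparklyr/dplyr table
tbl_back <- dplyr::tbl(sc, "SparkRSDF")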
I have code in R which needs to be scaled to use big data. I am using Spark for this, and the package that seemed most convenient was sparklyr. However, I am unable to create a term-document matrix from a Spark DataFrame. Any help would be great.
input_key is the data frame with the following schema.
ID Keywords
1 A,B,C
2 D,L,K
3 P,O,L
My code in R was the following.
library(tm)  # Corpus, VectorSource, TermDocumentMatrix
mycorpus <- input_key                              # local data.frame
corpus <- Corpus(VectorSource(mycorpus$Keywords))  # one document per Keywords entry
path_matrix <- TermDocumentMatrix(corpus)          # term-document matrix
Such a direct attempt won't work. sparklyr tables are just views of the underlying JVM objects and are not compatible with generic R packages.
While there is some capability to invoke arbitrary R code through sparklyr::spark_apply, the input and the output have to be data frames, and it is unlikely to translate to your particular use case.
If you are committed to using Spark / sparklyr, you should rather consider rewriting your pipeline using the built-in ML transformers, as well as third-party Spark packages like the Spark CoreNLP interface or John Snow Labs Spark NLP.
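A hedged sketch of the ML-transformer route in sparklyr, assuming input_key is a tbl_spark with the Keywords column shown above; the result is a distributed table of sparse term-count vectors rather than a tm-style TermDocumentMatrix:
library(sparklyr)
library(dplyr)

term_counts <- input_key %>%
  # split the comma-separated Keywords column into an array of tokens
  ft_regex_tokenizer(input_col = "Keywords", output_col = "tokens", pattern = ",") %>%
  # turn the token arrays into sparse term-count vectors, still on the cluster
  ft_count_vectorizer(input_col = "tokens", output_col = "term_counts")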
I have a 500K-row Spark DataFrame that lives in a parquet file. I'm using Spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
@Will
I don't know whether the following comment actually answers your question or not, but Spark does lazy evaluation. The transformations done in Spark (or SparkR) don't really create any data; they just build a logical plan to follow.
When you run actions like collect(), Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not large and can be handled by local R easily, then there is no need to go with SparkR. Another solution can be to cache your data for frequent use.
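A minimal sketch of the caching suggestion in SparkR; the file path and column names are hypothetical:
library(SparkR)

df <- read.df("data.parquet", source = "parquet")

cache(df)           # keep the DataFrame in memory once the first action computes it
nrows <- count(df)  # first action pays the read/decompression cost once

# later queries reuse the cached data, and collect() only pulls the reduced result
local_subset <- collect(filter(df, df$value > 100))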
Short answer: serialization/deserialization is very slow.
See, for example, the post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.
I've some experience in R and am learning Spark 1.6.1 by first exploring the implementation of R in Spark.
I noticed that the syntax for the R sample() command is different in Spark:
base R: sample(x, size, replace)
SparkR: sample(DataFrame, withReplacement, fraction)
Calling base::sample(x, size, replace) explicitly still works, but the unqualified sample() is masked by the SparkR version.
Does anyone know why this is, when most commands are identical between the two?
Are there use cases that I should use one versus the other?
Has anyone found an authoritative list of differences between SparkR and base R?
Thanks!
If you have a SparkR DataFrame, you'll need to use the SparkR API for sampling. If you have an R data.frame, you'll need to use the base R sampling function. SparkR is not R, and the function calls are not identical.
The issue you are having is one of masking.
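A minimal sketch of working with the masking explicitly; sdf is a hypothetical SparkR DataFrame:
library(SparkR)  # after this, the unqualified sample() dispatches on SparkR DataFrames

# Sample a SparkR DataFrame: note the (withReplacement, fraction) arguments
sdf_sample <- sample(sdf, withReplacement = FALSE, fraction = 0.1)

# Sample an ordinary R vector: qualify the call to bypass the mask
local_sample <- base::sample(x = 1:100, size = 10, replace = FALSE)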
To address the second part of the question, for the benefit of others who follow, I found that the Spark documentation does in fact list the R functions that are masked:
R Function Name Conflicts
I am trying to use SparkR and R as a front end to develop machine learning models.
I want to make use of Spark's MLlib, which works on distributed data frames. Is there any way to call a Spark MLlib algorithm from R?
Unfortunately, no. We will have to wait for Apache Spark 1.5 for SparkR-MLlib bindings.
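For reference, a rough sketch of the glm binding that those SparkR-MLlib bindings introduced in Spark 1.5; the formula, column names, and DataFrame names are hypothetical:
library(SparkR)

# Fit a generalized linear model on a distributed SparkR DataFrame
model <- glm(label ~ feature1 + feature2, family = "gaussian", data = training_sdf)
summary(model)

# Predictions come back as another distributed DataFrame
predictions <- predict(model, newData = test_sdf)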