Converting a Spark dataframe to Term document matrix in R using sparklyr

I have code in R that needs to be scaled to work with big data. I am using Spark for this, and the package that seemed most convenient was sparklyr. However, I am unable to create a term-document matrix from a Spark dataframe. Any help would be great.
input_key is the dataframe with the following schema:
ID Keywords
1 A,B,C
2 D,L,K
3 P,O,L
My R code was the following:
library(tm)  # Corpus, VectorSource and TermDocumentMatrix come from the tm package
mycorpus <- input_key
corpus <- Corpus(VectorSource(mycorpus$Keywords))
path_matrix <- TermDocumentMatrix(corpus)

Such a direct attempt won't work. sparklyr tables are just views of underlying JVM objects and are not compatible with generic R packages.
While there is some capability to invoke arbitrary R code through sparklyr::spark_apply, the input and the output have to be data frames, so it is unlikely to translate to your particular use case.
If you are committed to using Spark / sparklyr, you should instead consider rewriting your pipeline using the built-in ML transformers, or third-party Spark packages such as the Spark CoreNLP interface or John Snow Labs Spark NLP.
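For example, term counts can be computed with sparklyr's built-in feature transformers. This is a minimal sketch, assuming sc is an open Spark connection; the names other than Keywords (input_key_tbl, terms, term_counts) are placeholders.
library(sparklyr)
library(dplyr)
## Copy the local dataframe to Spark (skip if it already lives there)
input_key_tbl <- copy_to(sc, input_key, "input_key", overwrite = TRUE)
term_counts <- input_key_tbl %>%
  ## Split the comma-separated Keywords column into an array of terms
  ft_regex_tokenizer(input_col = "Keywords", output_col = "terms", pattern = ",") %>%
  ## Encode each document's terms as a sparse term-count vector
  ft_count_vectorizer(input_col = "terms", output_col = "term_counts")
The result has one row per document with a sparse vector of term counts, which plays the same role as the term-document matrix produced by tm, just with documents as rows rather than columns.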

Related

How to import/send an R List of dataframes to Spark

I am an R user with one day of experience in Spark. I am testing different packages like sparklyr and SparkR.
I have a big R list of dataframes and I would like to try to upload/pass it to local Spark. I know that sparklyr::copy_to() copies an R data.frame to Spark, and I am wondering if there is any command that copies an R list to Spark. Thank you.
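As far as I know there is no single sparklyr call that copies a whole list, but a minimal sketch is to loop over the list and copy each element to its own Spark table. This assumes sc is an open Spark connection and df_list is a named list of data frames (both placeholder names).
library(sparklyr)
## Copy every element of the list to Spark as its own table
sdf_list <- lapply(names(df_list), function(nm) {
  copy_to(sc, df_list[[nm]], name = nm, overwrite = TRUE)
})
names(sdf_list) <- names(df_list)
The result is an R list of sparklyr table references, one per original data frame.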

How to run supervised ML models on a large dataset (15GB) in R?

I have a dataset (15 GB): 72 million records and 26 features. I would like to compare 7 supervised ML models (classification problem): SVM, random forest, decision tree, naive Bayes, ANN, KNN and XGBoost. I created a sample set of 7.2 million records (10% of the entire set). Running models on the sample set (even feature selection) is already an issue; it has a very long processing time. I am only using RStudio at the moment.
I've been looking for an answer to my questions for days. I tried the following things:
- data.table - still not sufficient to reduce the processing time
- sparklyr - can't copy my dataset, because it's too large
I am looking for a costless solution to my problem. Can someone please help me?
If you have access to Spark, you can use sparklyr to read the CSV file directly.
install.packages('sparklyr')
library(sparklyr)
## You'll have to connect to your Spark cluster, this is just a placeholder example
sc <- spark_connect(master = "spark://HOST:PORT")
## Read large CSV into Spark
sdf <- spark_read_csv(sc,
                      name = "my_spark_table",
                      path = "/path/to/my_large_file.csv")
## Take a look
head(sdf)
You can use dplyr functions to manipulate the data, and for machine learning you'll need sparklyr's interface to Spark ML (the ml_* functions). You should be able to find almost all of what you want in sparklyr.
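For example, here is a minimal random-forest sketch, assuming sdf is the Spark table read above, the target is a binary column named "label", and all other columns are features (these names are placeholders).
library(dplyr)  # sparklyr is already loaded above; dplyr provides %>%
## Split the Spark table into training and test sets
partitions <- sdf_random_split(sdf, training = 0.8, test = 0.2, seed = 42)
## Fit a random forest classifier entirely inside Spark
rf_model <- partitions$training %>%
  ml_random_forest(label ~ ., type = "classification")
## Score the held-out data and evaluate
pred <- ml_predict(rf_model, partitions$test)
ml_binary_classification_evaluator(pred)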
Try Google Colab. It can help you run your dataset more easily.
You should look into the disk.frame package.
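As a rough sketch of that route (the file path is a placeholder and disk.frame must be installed):
library(disk.frame)
setup_disk.frame()                      # start multiple local workers
options(future.globals.maxSize = Inf)   # allow large chunks to be shipped to workers
## Convert the large CSV into an on-disk, chunked data frame
df <- csv_to_disk.frame("/path/to/my_large_file.csv")
disk.frame supports a large subset of the dplyr verbs, so you can filter and summarise chunk by chunk without holding all 15 GB in memory.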

Combining SparkR and sparklyr packages

I am currently working on Databricks using R notebooks. I would love to combine functionality from the two R interfaces to Spark, namely SparkR and sparklyr. Therefore I need to make use of SparkR functions on sparklyr Spark DataFrames (SDFs) and vice versa.
I know that generally it is not possible to do so in a straightforward way. However, I also know about a workaround for using PySpark SDFs in SparkR, which is basically to create a temp view of the PySpark SDF with
%py
spark_df.createOrReplaceTempView("PySparkSDF")
and subsequently to get it into SparkR via
%r
SparkR::sql("REFRESH TABLE PySparkSDF ")
sparkR_df <- SparkR::sql("SELECT * FROM PySparkSDF ")
Is it by any means possible to combine sparklyr and SparkR in a similar way? An explanation would also be highly appreciated!
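One hedged sketch of the analogous workaround in the sparklyr-to-SparkR direction is to register the sparklyr table as a temporary view and read it back through SparkR::sql. This assumes both interfaces are attached to the same Spark session (as on Databricks); sparklyr_tbl and sparklyr_view are placeholder names.
library(sparklyr)
## Expose the sparklyr table as a temporary view in the shared session
sdf_register(sparklyr_tbl, "sparklyr_view")
## Read the same view back as a SparkR SparkDataFrame
sparkR_df <- SparkR::sql("SELECT * FROM sparklyr_view")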

Using CRAN packages inside SparkR

If I wanted to use a standard R package like MXNet inside SparkR, is this possible? Can standard CRAN packages be used inside the Spark distributed environment without considering local vs. Spark DataFrames? Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle the DataFrame down, and then convert it to a local data.frame to use the standard CRAN package? Is there another strategy that I'm not aware of?
Thanks
Can standard CRAN packages be used inside the Spark distributed environment without considering local vs. Spark DataFrames?
No, they cannot.
Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle the DataFrame down, and then convert it to a local data.frame?
Sadly, most of the time this is what you do.
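A minimal sketch of that "whittle down, then collect" pattern, assuming sdf is a SparkDataFrame and the column name and threshold are placeholders:
library(SparkR)
## Reduce the data inside Spark first
small_sdf <- filter(sdf, sdf$score > 0.9)
## Then pull the (now small) result to the driver as an ordinary data.frame
local_df <- collect(small_sdf)
## local_df can be handed to any standard CRAN package from here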
Is there another strategy that I'm not aware of ?
The dapply and gapply functions introduced in Spark 2.0 can apply arbitrary R code to the partitions or groups of a SparkDataFrame.
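For instance, a hedged dapply sketch, assuming sdf is an existing SparkDataFrame with numeric columns x and y (placeholder names):
library(SparkR)
## Declare the schema of the data frame each partition will return
out_schema <- structType(structField("x", "double"),
                         structField("y_scaled", "double"))
scaled <- dapply(sdf, function(pdf) {
  ## pdf is an ordinary R data.frame holding one partition; any R code can run here
  data.frame(x = pdf$x, y_scaled = as.numeric(scale(pdf$y)))
}, out_schema)
head(scaled)
gapply works the same way, except that the function receives a grouping key and the rows for one group instead of a whole partition.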

How to convert Spark R dataframe into R list

This is my first time trying SparkR, on Databricks Cloud Community Edition, to do the same work I did with RStudio, and I have met some weird problems.
It seems that SparkR does support packages like ggplot2 and plyr, but the data has to be in R list format. I could generate this type of list in RStudio when using train <- read.csv("R_basics_train.csv"); the variable train here is a list when you check typeof(train).
However, in SparkR, when I read the same csv data as train, it is converted into a dataframe. This is not the Spark Python DataFrame we have used before, since I cannot use the collect() function to convert it into a list. When you check typeof(train), it shows the type is "S4", but in fact the type is a dataframe.
So, is there any way in SparkR to convert a dataframe into an R list so that I can use methods from ggplot2 and plyr?
You can find the original .csv training data here: train
Later I found that r_df <- collect(spark_df) will convert a Spark DataFrame into an R dataframe. Although you cannot use R's summary() on the Spark DataFrame, with an R dataframe you can do many R operations.
It looks like they changed SparkR, so you now need to use
r_df <- as.data.frame(spark_df)
Not sure if you would call this a drawback of SparkR, but in order to leverage many of the good functionalities R has to offer, such as data exploration and the ggplot libraries, you need to convert your Spark data frame into a normal data frame by calling collect:
df <- collect(df)
