This is my first time trying SparkR on Databricks Cloud Community Edition to do the same work I previously did in RStudio, and I have run into some odd problems.
It seems that SparkR does support packages like ggplot2 and plyr, but the data has to be in R list format. I can get that type of list in RStudio when I use train <- read.csv("R_basics_train.csv"); the variable train there is a list when you check it with typeof(train).
However, in SparkR, when I read the same csv data as "train", it is converted into a DataFrame. This is not the Spark Python DataFrame we have used before, since I cannot use the collect() function to convert it into a list. typeof(train) reports "S4", but in fact the type is a DataFrame.
So, is there any way in SparkR to convert the DataFrame into an R list so that I can use the methods in ggplot2 and plyr?
You can find the original .csv training data here:
train
Later I found that r_df <- collect(spark_df) will convert a Spark DataFrame into an R data frame. Although you cannot use R's summary() on the Spark DataFrame itself, once you have an R data frame you can do many of the usual R operations.
It looks like SparkR has changed, so you now need to use
r_df <- as.data.frame(spark_df)
Not sure whether you would call this a drawback of SparkR, but in order to leverage many of the good functionalities R has to offer, such as data exploration and the ggplot libraries, you need to convert your Spark data frame into a normal R data frame by calling collect:
df <- collect(df)
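Putting the pieces together, a minimal end-to-end sketch might look like the following (the DBFS path is only illustrative, and the exact read.df signature varies slightly across Spark versions):

library(SparkR)
library(ggplot2)

# Read the csv into a distributed Spark DataFrame
spark_train <- read.df("/FileStore/tables/R_basics_train.csv",
                       source = "csv", header = "true", inferSchema = "true")

# collect() (or as.data.frame()) pulls the data back to the driver
# as an ordinary R data.frame
train <- collect(spark_train)

str(train)      # plain R functions work again
summary(train)
# ggplot(train, aes(x = some_column)) + geom_histogram()  # some_column is a placeholder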
Related
The base data was generated using a SQL query, and the intention is to process it in R. However, when importing from a .csv or .xlsx file, R imports the numbers as characters in spite of changing the data type in the built-in import tool. Furthermore, basic arithmetic operations then fail with errors such as:
In Ops.factor((data$A), (data$B)) : '/' not meaningful for factors
Is there a simple way to solve this?
The data set was analysed using the str() function, which revealed that R had imported the columns in question as factors.
Used the varhandle package and its unfactor() function to unfactorize the data
Used as.numeric() for some columns that were read as characters instead of factors
Tried changing the data types in Excel before importing
data$A <- unfactor(data$A)
data$B <- unfactor(data$B)
data$PERCENTAGE <- (data$B)/(data$A)*100
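For reference, the unfactor() step is roughly equivalent to the following base R idiom, which needs no extra package (a sketch using the same column names as above):

# Convert factor levels back to their character labels first,
# then parse those labels as numbers
data$A <- as.numeric(as.character(data$A))
data$B <- as.numeric(as.character(data$B))
data$PERCENTAGE <- data$B / data$A * 100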
How can I make R import the data with the specified data types?
Thank you for the help in advance!
For csv files I would recommend read_csv() from the readr package in Hadley Wickham's excellent tidyverse. It has intelligent defaults that cope with most things I throw at it.
For .xlsx there is read_excel() from the readxl package, which is part of the same tidyverse family (other packages are available too).
Or, alternatively just export a .csv from within Excel and use read_csv.
[Note that the tidyverse readers import these files as a "tibble", which is essentially a data frame on steroids without some of the headaches, and which is easily converted to a data.frame if you prefer.]
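To address the "import as per specified data types" part directly, read_csv() accepts a col_types specification, so you can force particular columns to be parsed as numbers up front. A sketch, assuming a file name and the columns A and B from the question (adjust to your real data):

library(readr)

# Force A and B to be read as doubles; the remaining columns are guessed as usual
data <- read_csv("my_data.csv",
                 col_types = cols(A = col_double(), B = col_double()))

data$PERCENTAGE <- data$B / data$A * 100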
My current approach is to save my sparklyr data frame as a parquet file in a tmp folder and then read it back with SparkR. I am wondering if there is a more elegant way.
Another approach would be to stay with sparklyr and plain R only, but that is a separate discussion.
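The parquet round trip described above could look roughly like this (a sketch: my_tbl is assumed to be an existing sparklyr table and the /tmp path is just an example):

library(sparklyr)
library(SparkR)

# Write the sparklyr table out as parquet files...
spark_write_parquet(my_tbl, path = "/tmp/my_tbl_parquet")

# ...and read those files back in as a SparkR DataFrame.
# The SparkR:: prefix avoids name clashes when both packages are attached.
sdf <- SparkR::read.df("/tmp/my_tbl_parquet", source = "parquet")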
I have code in R that needs to be scaled to handle big data. I am using Spark for this, and the package that seemed most convenient was sparklyr. However, I am unable to create a term-document matrix from a Spark dataframe. Any help would be great.
input_key is the dataframe with the following schema.
ID Keywords
1 A,B,C
2 D,L,K
3 P,O,L
My code in R was the following.
library(tm)  # Corpus, VectorSource and TermDocumentMatrix come from the tm package

mycorpus <- input_key
corpus <- Corpus(VectorSource(mycorpus$Keywords))
path_matrix <- TermDocumentMatrix(corpus)
Such a direct attempt won't work. sparklyr tables are just views of underlying JVM objects and are not compatible with generic R packages.
While there is some capability to invoke arbitrary R code through sparklyr::spark_apply, the input and the output have to be data frames, and it is unlikely to translate to your particular use case.
If you are committed to using Spark / sparklyr, you should rather consider rewriting your pipeline using the built-in ML feature transformers, as well as third-party Spark packages such as the Spark CoreNLP interface or John Snow Labs' Spark NLP.
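For the comma-separated Keywords column above, the feature-transformer route could look roughly like this (a sketch: input_key_tbl is assumed to be the table already copied to Spark over an open sparklyr connection; all names are illustrative):

library(sparklyr)
library(dplyr)

term_counts <- input_key_tbl %>%
  # split the comma-separated keywords into an array column
  ft_regex_tokenizer(input_col = "Keywords", output_col = "tokens",
                     pattern = ",") %>%
  # turn the token arrays into sparse term-count vectors
  ft_count_vectorizer(input_col = "tokens", output_col = "term_counts")

The result is a column of term-count vectors rather than a tm-style TermDocumentMatrix, but it plays the same role for downstream Spark ML steps.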
I'm getting the error, ggplot2 does not know how to deal with data of class tbl_sqlite/tbl_sql/tbl_lazy/tbl. Does this mean I need to reduce the size of my data or convert it to another format before I can plot it?
dplyr holds off on actually collecting the data into R until it is explicitly told to. When using dplyr to query a database, be sure to call collect() at the end of the chain to pull the query result into R as a data frame.
Addendum: the dbplot package now provides helper functions that work with dplyr and dbplyr to do some plotting, or to compute intermediate plotting steps, in-database.
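A short sketch of both routes (the table and column names are placeholders):

library(dplyr)
library(ggplot2)

# Route 1: collect the (ideally already filtered/aggregated) result into R,
# then plot it as a normal data frame
plot_data <- my_lazy_tbl %>%
  filter(value > 0) %>%
  collect()

ggplot(plot_data, aes(x = value)) + geom_histogram()

# Route 2: let dbplot push the binning into the database
# library(dbplot)
# my_lazy_tbl %>% dbplot_histogram(value)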
I have created a 300000 x 7 numeric matrix in R and I want to work with it in both R and Matlab. However, I am not able to produce a file that Matlab can read properly.
When I use the save() command with file=xx.csv, it recognizes 5 columns instead of 7; with a .txt extension, all the data is opened in a single column instead.
I have also tried the ff and ffdf packages to manage this big data set (I guess the problem of R identifying rows and columns when saving is somehow related to this), but I don't know how to save the result in a format that Matlab can read afterwards.
An example of this dataset would be:
output <- matrix(runif(2100000, 1, 1000), ncol=7, nrow=300000)
If you want to work with both R and Matlab, and you have a matrix as big as yours, I'd suggest using the R.matlab package. The package provides the methods readMat and writeMat; both read/write the binary .mat format that is understood by Matlab (and, through R.matlab, also by R).
Install the package by typing
install.packages("R.matlab")
Subsequently, don't forget to load the package, e.g. by
library(R.matlab)
The documentation of readMat and writeMat, accessible through ?readMat and ?writeMat, contains easy usage examples.
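For the example matrix above, usage would look roughly like this (the file name is just an example):

library(R.matlab)

output <- matrix(runif(2100000, 1, 1000), ncol = 7, nrow = 300000)

# Write the matrix to a Matlab binary file; the argument name ("output")
# becomes the variable name inside the .mat file
writeMat("output.mat", output = output)

# In Matlab: load('output.mat') makes the variable 'output' available.
# Reading it back into R returns a named list:
restored <- readMat("output.mat")
str(restored$output)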