My current approach would be to save my sparklyr data frame as a Parquet file in a tmp folder and then use SparkR to read it back. I am wondering if there is a more elegant way.
Another approach would be to stay with sparklyr and plain R only, but that is a separate discussion.
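For reference, the tmp-folder handoff described above can be sketched roughly like this (the path is a placeholder, and this assumes both sparklyr and SparkR are attached to the same Spark installation):

```r
library(sparklyr)

# Write the sparklyr table out to a temporary Parquet location
tmp_path <- file.path(tempdir(), "handoff_parquet")  # placeholder path
spark_write_parquet(sparklyr_tbl, path = tmp_path, mode = "overwrite")

# In SparkR, read the same files back as a SparkDataFrame
library(SparkR)
sdf <- read.parquet(tmp_path)
```

Since both APIs speak Parquet natively, the round trip preserves column types, which a CSV handoff would not.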
I have a Spark data frame. I want to save this data frame as a table in Cassandra and also save it as Parquet in S3. Everywhere I look there are Python, Java, and Scala examples; I cannot find the right solution in R.
I have found the RCassandra package; can I use this to do the same?
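One hedged sketch in SparkR, assuming the DataStax spark-cassandra-connector is on the Spark classpath and S3 credentials are configured (the keyspace, table, and bucket names below are placeholders, not anything from the question):

```r
library(SparkR)

# Write to Cassandra through the spark-cassandra-connector data source
write.df(df,
         source = "org.apache.spark.sql.cassandra",
         mode = "append",
         keyspace = "my_keyspace",   # placeholder
         table = "my_table")         # placeholder

# Write the same data frame to S3 as Parquet
write.parquet(df, "s3a://my-bucket/path/to/output")  # placeholder bucket
```

This keeps both writes on the Spark side, so nothing needs to be collected into the R driver; RCassandra, by contrast, is a client-side driver and would not scale to a large Spark data frame.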
I am very new to R. What I am trying to achieve: I have a dataset in CSV format stored in MongoDB. I have already linked RStudio and MongoDB, and the data is successfully imported into RStudio. Now I want to do some visualization of the data: bar graphs, pie charts, heat maps, etc. But all the tutorials I have seen use data frames in ggplot. How do I convert the imported data from the CSV file into a data frame? I know I might sound stupid, but I'm a beginner; any help would be appreciated. The dataset I'm using is the 2017 CSV file from this link: https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page
What is the structure of the "csv" you've imported from the database? You could try converting it to a data.frame using as.data.frame(). If class(x) returns more than one class, e.g. tibble and data.frame, each generic will dispatch on the class it is designed to handle; so if your object is both a tibble and a data.frame, ggplot will know what to do with it.
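A minimal sketch of that conversion (the object and column names below are stand-ins, not the actual stop-and-frisk data):

```r
# Suppose 'x' is the object returned by your MongoDB import
x <- list(borough = c("BRONX", "QUEENS", "BRONX"),
          year = c(2017, 2017, 2017))  # stand-in for the imported data

df <- as.data.frame(x, stringsAsFactors = FALSE)
class(df)  # "data.frame"

# Once it is a data.frame, base graphics or ggplot2 can use it, e.g.:
barplot(table(df$borough))
```
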
This question already has an answer here:
Requirements for converting Spark dataframe to Pandas/R dataframe
(1 answer)
Closed 4 years ago.
I use R on Zeppelin at work to develop machine learning models. I extract the data from Hive tables using %sparkr and sql(Constring, 'select * from table'), which by default generates a Spark data frame with 94 million records.
However, I cannot perform all my R data-munging tasks on this Spark df, so I try to convert it to an R data frame using collect() and as.data.frame(), but I run into memory-node and time-out issues.
I was wondering whether the Stack Overflow community is aware of any other way to convert a Spark df to an R df while avoiding time-out issues?
Did you try caching your Spark dataframe first? Caching the data first may speed up the collect, since the data is already in RAM, and that could get rid of the timeout problem. At the same time, it only increases your RAM requirements. I too have seen those timeout issues when serializing or deserializing certain data types, or just large amounts of data, between R and Spark. Serialization and deserialization of large data sets is far from a "bullet-proof" operation with R and Spark. Moreover, 94M records may simply be too much for your driver node to handle in the first place, especially if your dataset has a lot of dimensionality.
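A rough sketch of the caching suggestion in SparkR (assumes an active Spark session; `df` stands in for the Spark data frame from the question):

```r
library(SparkR)

df <- sql("select * from table")  # as in the question
df <- cache(df)   # mark the SparkDataFrame for in-memory caching
nrow(df)          # an action, which forces materialization into the cache

r_df <- collect(df)  # the collect now reads from cached partitions
```

Note that cache() is lazy: without an action such as nrow() or count() in between, the first collect() still pays the full computation cost.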
One workaround I've used, but am not proud of, is to have Spark write the dataframe out as a CSV and then have R read that CSV file back in on the next line of the script. Oddly enough, in a few of the cases where I did this, the write-a-file-then-read-the-file method actually ended up being faster than a simple collect operation. A lot faster.
Word of advice: watch out for partitioning when writing out CSV files with Spark. You'll get a bunch of CSV files and have to do something like tmp <- lapply(list_of_csv_files_from_spark, function(x){read.csv(x)}) to read in each CSV file individually, and then maybe df <- do.call("rbind", tmp). It would probably be best to use fread in place of read.csv as well.
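A hedged sketch of that workaround, assuming SparkR for the write and data.table for the fast read (the output path is a placeholder):

```r
library(SparkR)
library(data.table)

out_dir <- "/tmp/spark_csv_handoff"  # placeholder path
write.df(df, path = out_dir, source = "csv",  # Spark writes one file per partition
         mode = "overwrite", header = "true")

# Read every part-file back in and stack them into one R data frame
csv_files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
tmp <- lapply(csv_files, fread)
r_df <- rbindlist(tmp)  # data.table's faster equivalent of do.call("rbind", tmp)
```

rbindlist() avoids the repeated copying that do.call("rbind", ...) incurs, which matters when there are many partition files.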
Perhaps the better question is: what other data-munging tasks are you unable to do in Spark that you need R for?
Good luck. I hope this was helpful. -nate
This is the first time I've dealt with MATLAB files in R.
The rationale for saving the information in a .mat file was the length: the dataset contains 226,518 rows, and we were worried that Excel (and then a CSV) would not take them.
I can upload the original file if necessary.
So I have my MATLAB file, and when I open it in MATLAB all is good.
There are various arrays, and the one I want is called "allPoints".
I can open it and see that it contains values around 0.something.
What I want to do is to extract the same data in R.
library(R.matlab)
df <- readMat("170314_Col_HD_R20_339-381um_DNNhalf_PPP1-EN_CellWallThickness.mat")
str(df)
And here I get stuck. How do I pull "allPoints" out of it? $ does not seem to work.
I will have multiple files that need to be put together into one single dataframe in R, so the plan is to mutate each extracted df, generating a new column for sample, and then rbind them together.
Could anybody help?
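For what it's worth, readMat() returns a named list, so list indexing with [[ ]] is usually the way in. A minimal sketch with a stand-in structure in place of the real .mat file (your variable names and dimensions will differ):

```r
# readMat() returns a named list; suppose the result looked like this:
df <- list(allPoints = matrix(c(0.12, 0.34, 0.56, 0.78), ncol = 2))

# Pull the array out by name with [[ ]] (works even when $ is awkward)
all_points <- df[["allPoints"]]

# Turn it into a data frame and tag it with a sample column for a later rbind()
sample_df <- as.data.frame(all_points)
sample_df$sample <- "170314_Col_HD_R20"  # placeholder sample label
```
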
This is my first time trying SparkR, on Databricks Cloud Community Edition, to do the same work I did with RStudio, but I have met some weird problems.
It seems that SparkR does support packages like ggplot2 and plyr, but the data has to be in R list format. I could generate this type of list in RStudio when using train <- read.csv("R_basics_train.csv"); the variable train here is a list when you check typeof(train).
However, in SparkR, when I read the same CSV data in as "train", it gets converted into a dataframe, and this is not the Spark Python DataFrame we have used before, since I cannot use the collect() function to convert it into a list... When you use typeof(train), it shows the type is "S4", but in fact the type is a dataframe...
So, is there any way in SparkR to convert a dataframe into an R list so that I can use the methods in ggplot2 and plyr?
You can find the original .csv training data here: train
Later I found that using r_df <- collect(spark_df) will convert a Spark DataFrame into an R dataframe. Although we cannot use R's summary() on the Spark dataframe, once we have an R dataframe we can do many R operations.
It looks like they changed SparkR, so you now need to use
r_df <- as.data.frame(spark_df)
Not sure if you would call this a drawback of SparkR, but in order to leverage the many good capabilities R has to offer, such as data exploration and the ggplot libraries, you need to convert your Spark data frame into a normal R data frame by calling collect:
df <- collect(df)