Convert SparkR DataFrame to H2O Frame - r

Using SparkR, is it possible to convert a Spark DataFrame into an H2O frame?
I have seen examples of converting R data.frames to H2O frames, but that is not a viable option here because of the data size.
I know it is possible to create an H2O frame with sparklyr and rsparkling, but I am not using Hive, Hadoop, sparklyr, or rsparkling.
Instead, my goal is to convert the sdf from this:
set.seed(123)
df<- data.frame(ColA=rep(c("dog", "cat", "fish", "shark"), 4), ColB=rnorm(16), ColC=rep(seq(1:8),2))
sdf<- SparkR::createDataFrame(df)
into this:
as.h2o(sdf, destination_frame = "hsdf") # fails, came from Spark (SparkR)
as.h2o(df, destination_frame = "hdf") # succeeds, but this is a regular R data.frame
Hopefully, someone has figured out a way to do this using what SparkR can provide. I think it would be a huge boon to R users.

There is no native support for converting between H2O and Spark frames in either the h2o or the SparkR package. You would have to use rsparkling (which depends on sparklyr) or go Spark DataFrame -> R data.frame -> H2O Frame.
You mentioned Hadoop and Hive; just to clarify, neither of those is a requirement for using rsparkling::as_h2o_frame().
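A minimal sketch of both routes, assuming an H2O cluster has already been started with h2o.init() and that sdf is the SparkDataFrame from the question; the rsparkling line additionally assumes a sparklyr connection sc and a sparklyr table reference sdf_tbl rather than a SparkR object:
library(h2o)
h2o.init()
# Route 1: SparkDataFrame -> R data.frame -> H2OFrame (only feasible when the data fits in driver memory)
rdf <- SparkR::collect(sdf)
hsdf <- as.h2o(rdf, destination_frame = "hsdf")
# Route 2: via rsparkling/sparklyr (sc = sparklyr connection, sdf_tbl = sparklyr table reference)
# hsdf <- rsparkling::as_h2o_frame(sc, sdf_tbl)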

Since none of the above worked for me, my workaround was:
Save the Spark DataFrame as CSV (Spark writes a folder of part files).
Use lapply to read each CSV part file with the rio package:
tmp <- lapply(list.files("data/csvfolder.csv", pattern = "\\.csv$"),   # pattern skips non-CSV files such as Spark's _SUCCESS marker
              function(x) rio::import(paste0("data/csvfolder.csv/", x)))
df00 <- do.call("rbind", tmp)
Then use df00 as a regular R data frame however you wish.
Hope that works for you; collect() and as.data.frame() can struggle depending on the size and type of the data being used.
Cheers
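For completeness, a sketch of the write step this answer assumes, with illustrative path and options and assuming Spark 2.x+ where the built-in csv source is available (sdf is the SparkDataFrame from the question):
SparkR::write.df(sdf, path = "data/csvfolder.csv", source = "csv", mode = "overwrite", header = "true")
# After this, data/csvfolder.csv/ contains the part-*.csv files that the lapply() above reads back.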

Related

Is there a UDF implementation available in SparkR?

I am new to Spark in R.
Are UDF functions implemented in SparkR, and do they work with the Arrow encoder?
The sparklyr UDF (spark_apply) is not optimized with Arrow on Databricks or elsewhere.
We thought we would have to write a SparkR UDF instead, running on the same dataframe with the Arrow encoder enabled via library(arrow).
Databricks only supports SparkR; sparklyr is vendor specific (RStudio) and not supported or optimized there.
Any advice for this scenario? My code is below, but the R UDF could be anything that operates on a dataframe read from a path.
library(sparklyr)
library(arrow)
sc <- spark_connect(method = "databricks")
sdf <- spark_read_parquet(sc, "dbfs:/FileStore/tmp/udf_test_fake")
identity <- function(rdf) {
  worker_log(paste0("Processing ", digest::digest(rdf), " partition with ", nrow(rdf), " rows"))
  return(rdf)
}
result <- spark_apply(sdf, identity)
print(sdf_nrow(result))
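For what it's worth, SparkR does ship UDF functions (dapply, gapply, spark.lapply), and Spark 3.x can use Arrow for them when spark.sql.execution.arrow.sparkr.enabled is set. A hedged sketch of the same identity UDF in SparkR, assuming Spark 3.x and the same parquet path (on Databricks the config can also be set at the cluster level):
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))
sdf <- read.parquet("dbfs:/FileStore/tmp/udf_test_fake")
# dapply runs the R function on each partition; the output schema must be declared (here it is unchanged)
result <- dapply(sdf, function(rdf) rdf, schema(sdf))
print(count(result))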

When using SparkR to process data, where does the program actually run?

I am new to Spark and SparkR, and my question is below.
When I wrote the code below:
1). Set up the environment and start a Spark session:
sparkR.session(master = "my/spark/master/on/one/server/standaloneMode",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.sql.warehouse.dir = "my/hadoop_home/bin"),
               sparkPackages = "com.databricks:spark-avro_2.11:3.0.1")
Then I wrote:
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
df <- as.DataFrame(rund)
Here is the thing:
1). Where does the program do the 'splitting': on my local machine or on the server?
2). Also, could anyone tell me where exactly the program runs as.DataFrame(): on my computer or on the server that was set up as Spark's standalone master?
SparkR is an interface for Spark. This means that some R functions are overridden by the SparkR package to provide a user experience similar to what you already know from R. You should probably have a look at the documentation to see which Spark functions are available: https://spark.apache.org/docs/latest/api/R/index.html
Those functions typically ingest SparkDataFrames, which you can create, for example, with the as.DataFrame function. A SparkDataFrame is a reference to a data frame that lives in your Spark cluster.
In your example you've created a local R data frame rund. The runif calls were also executed locally, in your R instance.
# executed in your local R instance
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
The df object, however, is a SparkDataFrame, which is created in your Spark cluster. as.DataFrame is executed in R, but the actual data only exists in your cluster.
df <- as.DataFrame(rund)
To easily distinguish between R and Spark data frames, you can use the class function:
> class(rund)
[1] "data.frame"
> class(df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
In general, a SparkDataFrame can be used as input for the various functions the SparkR package has to offer, for example to group or sort your SparkDataFrame in Spark. The Spark operations are executed when a Spark action is called. An example of such an action is collect. It triggers the transformations in Spark, retrieves the computed data from your Spark cluster, and creates a corresponding R data frame in your local R instance. If you have a look at the documentation you can see whether a function can ingest a SparkDataFrame:
##S4 method for signature 'SparkDataFrame'
collect(x, stringsAsFactors = FALSE)
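For instance, with the df from above, nothing is shipped back to your R session until you call collect:
localdf <- collect(df)   # action: Spark executes the plan and returns the rows to the driver
class(localdf)           # a plain "data.frame" again, in your local R instance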
Moreover it is possible to execute custom R code in your Spark cluster by using user-defined functions: https://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function.

Error: could not find function "includePackage"

I am trying to run the random forest algorithm on SparkR, with Spark 1.5.1 installed. I don't have a clear idea of why I am getting the error
Error: could not find function "includePackage"
Further, even if I use the mapPartitions function in my code, I get the error
Error: could not find function "mapPartitions"
Please find the code below:
rdd <- SparkR:::textFile(sc, "http://localhost:50070/explorer.html#/Datasets/Datasets/iris.csv", 5)
includePackage(sc, randomForest)
rf <- mapPartitions(rdd, function(input) {
  ## my function code for RF
})
This is more of a comment and a cross-question than an answer (I'm not allowed to comment because of reputation), but just to take this further: if we are using the collect method to convert the RDD back to an R data frame, isn't that counterproductive? If the data is too large, it would take too long to process in R.
Also, does that mean we could use any R package, say markovchain or neuralnet, with the same methodology?
Kindly check the functions that can be used in SparkR: http://spark.apache.org/docs/latest/api/R/index.html
These do not include mapPartitions() or includePackage().
# For reading a csv in sparkR
sparkRdf <- read.df(sqlContext, "./nycflights13.csv",
                    "com.databricks.spark.csv", header = "true")
# A possible way to use `randomForest` is to convert the sparkR data frame to an R data frame
Rdf <- collect(sparkRdf)
# compute as usual in R code
install.packages("randomForest")
library(randomForest)
......
# convert back to a sparkR data frame
sparkRdf <- createDataFrame(sqlContext, Rdf)
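As a side note (not available in Spark 1.5.1): SparkR 2.0+ exposes UDF-style functions such as spark.lapply and dapply that can distribute plain R code, including randomForest, across the cluster. A hedged sketch, assuming SparkR 2.0 or later:
library(SparkR)
# fit one randomForest per mtry value, each call running on a Spark worker
models <- spark.lapply(c(2, 3, 4), function(m) {
  library(randomForest)
  randomForest(Species ~ ., data = iris, mtry = m)
})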

SparkR doubts and Broken pipe exception

Hi, I'm working with SparkR in distributed mode on a YARN cluster.
I have two questions:
1) If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R code too?
This is the script. I read a CSV and take just the first 100k records.
I clean it (with plain R functions), zeroing out NA values, and create a SparkR dataframe.
This is what it does: for each LINESET, take each TimeInterval in which that LINESET appears, sum some numeric attributes, and put them all into a matrix.
This is the script with R and SparkR code. It takes 7 hours in standalone mode and around 60 hours in distributed mode (killed by java.net.SocketException: Broken Pipe).
LineSmsInt<-fread("/home/sentiment/Scrivania/LineSmsInt.csv")
Short<-LineSmsInt[1:100000,]
Short[is.na(Short)] <- 0
Short$TimeInterval<-Short$TimeInterval/1000
ShortDF<-createDataFrame(sqlContext,Short)
UniqueLineSet<-unique(Short$LINESET)
UniqueTime<-unique(Short$TimeInterval)
UniqueTime<-as.numeric(UniqueTime)
Row<-length(UniqueLineSet)*length(UniqueTime)
IntTemp<-matrix(nrow =Row,ncol=7)
k<-1
colnames(IntTemp)<-c("LINESET","TimeInterval","SmsIN","SmsOut","CallIn","CallOut","Internet")
Sys.time()
for(i in 1:length(UniqueLineSet)){
  SubSetID <- filter(ShortDF, ShortDF$LINESET == UniqueLineSet[i])
  for(j in 1:length(UniqueTime)){
    SubTime <- filter(SubSetID, SubSetID$TimeInterval == UniqueTime[j])
    IntTemp[k,1] <- UniqueLineSet[i]
    IntTemp[k,2] <- as.numeric(UniqueTime[j])
    k3 <- collect(select(SubTime, sum(SubTime$SmsIn)))
    IntTemp[k,3] <- k3[1,1]
    k4 <- collect(select(SubTime, sum(SubTime$SmsOut)))
    IntTemp[k,4] <- k4[1,1]
    k5 <- collect(select(SubTime, sum(SubTime$CallIn)))
    IntTemp[k,5] <- k5[1,1]
    k6 <- collect(select(SubTime, sum(SubTime$CallOut)))
    IntTemp[k,6] <- k6[1,1]
    k7 <- collect(select(SubTime, sum(SubTime$Internet)))
    IntTemp[k,7] <- k7[1,1]
    k <- k+1
  }
  print(UniqueLineSet[i])
  print(i)
}
This is the plain R script; the only thing that changes is the subset function and, of course, it works on a normal R data.frame, not a SparkR dataframe.
It takes 1.30 minutes in standalone mode.
Why is it so fast in plain R and so slow in SparkR?
for(i in 1:length(UniqueLineSet)){
  SubSetID <- subset.data.frame(LineSmsInt, LINESET == UniqueLineSet[i])
  for(j in 1:length(UniqueTime)){
    SubTime <- subset.data.frame(SubSetID, TimeInterval == UniqueTime[j])
    IntTemp[k,1] <- UniqueLineSet[i]
    IntTemp[k,2] <- as.numeric(UniqueTime[j])
    IntTemp[k,3] <- sum(SubTime$SmsIn, na.rm = TRUE)
    IntTemp[k,4] <- sum(SubTime$SmsOut, na.rm = TRUE)
    IntTemp[k,5] <- sum(SubTime$CallIn, na.rm = TRUE)
    IntTemp[k,6] <- sum(SubTime$CallOut, na.rm = TRUE)
    IntTemp[k,7] <- sum(SubTime$Internet, na.rm = TRUE)
    k <- k+1
  }
  print(UniqueLineSet[i])
  print(i)
}
2) The first script, in distributed mode, was killed by:
java.net.SocketException: Broken Pipe
and sometimes this appears too:
java.net.SocketTimeoutException: Accept timed out
Could it come from a bad configuration? Any suggestions?
Thanks.
Don't take this the wrong way, but it is not a particularly well-written piece of code. It is already inefficient using core R, and adding SparkR to the equation makes it even worse.
If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R code too?
Unless you're using distributed data structures and functions that operate on those structures, it is just plain R code executed in a single thread on the driver.
Why is it so fast in plain R and so slow in SparkR?
For starters, you execute a separate Spark job for each combination of LINESET, TimeInterval, and column. Each time, Spark has to scan all the records and fetch the data to the driver.
Moreover, using Spark to handle data that easily fits in the memory of a single machine simply doesn't make sense. The cost of scheduling the jobs in a case like this is usually much higher than the cost of the actual processing.
Any suggestions?
If you really want to use SparkR, just use groupBy and agg:
library(magrittr)  # provides %>%
group_by(ShortDF, ShortDF$LINESET, ShortDF$TimeInterval) %>%
  agg(sum(ShortDF$SmsIn), sum(ShortDF$SmsOut), sum(ShortDF$CallIn),
      sum(ShortDF$CallOut), sum(ShortDF$Internet))
If you care about missing (LINESET, TimeInterval) pairs, fill these in using either join or unionAll, for example:
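A hedged sketch of that fill step, assuming a reasonably recent SparkR (crossJoin is available in newer releases) and that the groupBy/agg result above has been stored in a variable aggDF (an illustrative name):
allPairs <- crossJoin(distinct(select(ShortDF, "LINESET")),
                      distinct(select(ShortDF, "TimeInterval")))
full <- merge(allPairs, aggDF, by = c("LINESET", "TimeInterval"), all.x = TRUE)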
In practice, I would simply stick with data.table and aggregate locally:
Short[, lapply(.SD, sum, na.rm=TRUE), by=.(LINESET, TimeInterval)]

Making a connection using rmongodb in R on Mac OS X

I've been assigned to analyze some data that are stored in MongoDB. I'm a complete newbie to MongoDB, but I can manage if I can read the data and convert it to an R data table or data frame. If possible, I'd like to do just enough to get the MongoDB data into R.
I'm trying to get access to the data using the rmongodb package in R version 3.1.2 on Mac OS X Yosemite via RStudio 0.98.953. I've tried this so far:
install.packages("rmongodb")
library(rmongodb)
#up to here, it works
mongo <- mongo.create(host='localhost')
mongo <- mongo.create(host='127.0.0.1')
mongo <- mongo.create()
Each of the mongo <- assignment statements results in the same error:
Unable to connect to localhost:27017, error code = 2
and
mongo.is.connected(mongo)
returns FALSE.
If this is an essential part of the answer, we can use "db=test" as the database. For what it's worth, the datasets are stored in "~/Desktop/MyExample" and consist of four files with the extension "bson" and their analogues ending with ".metadata.json", as well as a "system.indexes.bson" file.
Any ideas? Thanks in advance!
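For context, and hedged since it depends on the local setup: error code 2 from mongo.create() generally means the connection to mongod failed, i.e. nothing is listening on localhost:27017. The .bson / .metadata.json files in ~/Desktop/MyExample look like a mongodump backup, which has to be loaded into a running mongod (e.g. with mongorestore) before rmongodb can query it. Once a server is up, a minimal read might look like the sketch below; "test.mycollection" is a hypothetical namespace to replace with one of the restored collections:
library(rmongodb)
mongo <- mongo.create(host = "127.0.0.1")
if (mongo.is.connected(mongo)) {
  docs <- mongo.find.all(mongo, "test.mycollection")   # list of documents
  df <- do.call(rbind, lapply(docs, as.data.frame))    # flatten flat documents into an R data frame
}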
