SparkR and Packages

How does one call packages from Spark to be used for data operations with R?
For example, I am trying to access my test.csv in HDFS as below:
Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header="true")
but I am getting the error below:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I tried loading the CSV package with the option below:
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')
but I get the error below when creating the sqlContext:
Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky/backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b
Any help will be highly appreciated.

So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track this issue (and I'll try and add a fix if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506.
Note: another option would be to use the sparkr command with --packages com.databricks:spark-csv_2.10:1.0.3, since that should work.
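For reference, a minimal sketch of the first workaround, appending sparkr-shell to SPARKR_SUBMIT_ARGS before initializing SparkR (assuming Spark 1.4 and the spark-csv 1.0.3 package from the question):
# Keep sparkr-shell as the last token so SparkR can still launch its backend
Sys.setenv('SPARKR_SUBMIT_ARGS' = '--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell')
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
# The csv data source should now resolve when reading from HDFS
flights <- read.df(sqlContext,
                   "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                   "com.databricks.spark.csv",
                   header = "true")
Without the trailing sparkr-shell, spark-submit appears to treat the temporary backend file as the application to run, which is what produces the "Cannot load main class from JAR" error above.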

Related

Why are some jsonl files failing to load in SparklyR

I am currently trying to read in some jsonl files using sparklyr 1.3.1 with Spark 2.3.3. While some files read in fine, I am struggling with others, using exactly the same code. Longish details below, including error messages and the packages/code being used.
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
june1 <- spark_read_json(sc, "june1-aa.jsonl")
june <- spark_read_json(sc, "janetweets_june24.jsonl")
Error: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
The first file appears to read in OK, but any attempt to view it meets with the following error, and "no tables" is displayed in the connections window, as opposed to the file structure shown for similar files that read in fine.
Error in value[[3L]](cond) :
Failed to fetch data: java.lang.NullPointerException
at sparklyr.Collectors$.collectLongArr(collectors.scala:87)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$ColumnCtx.collect(collectors.scala:183)
at sparklyr.Utils$.sparklyr$Utils$$collectRows(utils.scala:90)
at sparklyr.Utils$.collect(utils.scala:114)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:147)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(h
Owing to the size of these files, I have put the first few lines of each file, as viewed in the terminal, in a pastebin: https://pastebin.com/y3Zevnpv.
I have tried updating my packages to the latest versions and deleting and reinstalling, and these files are definitely in the JSON Lines format. The files were pulled from Twitter using the twarc command-line tool on a Windows machine. Please note: I have removed some URLs from the sample data owing to Stack Overflow guidelines. Thanks!
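Given that the error above points at the _corrupt_record column, one way to sanity-check a failing file before involving Spark is to parse a few of its lines locally; this is only a diagnostic sketch using jsonlite, with the file name taken from the question:
library(jsonlite)
# Read the first few raw lines and try to parse each one as JSON;
# any line that fails here is a likely source of _corrupt_record.
first_lines <- readLines("janetweets_june24.jsonl", n = 5, warn = FALSE)
parsed <- lapply(first_lines, function(line) {
  tryCatch(fromJSON(line), error = function(e) conditionMessage(e))
})
str(parsed, max.level = 1)
If every line parses cleanly, the file encoding is worth checking too, since files written on Windows can carry a BOM or non-UTF-8 encoding that Spark's JSON reader treats as corrupt records.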

How to import data via Spark connection into R environment from cluster?

I followed this link to make a connection between Spark and my R server:
Connection b/w R studio server pro and hive on GCP
I can see my dataframe but cannot bring it into the R environment to run analysis on it. Can anyone please suggest the correct way?
library(sparklyr)
library(dplyr)
sparklyr::spark_install()
#config
Sys.setenv(SPARK_HOME="/usr/lib/spark")
config <- spark_config()
#connect
sc <- spark_connect(master="yarn-client",config = config,version="2.2.1")
I can see my table "rdt", but when I call it, I get an object-not-found error.
This is what I tried:
data <- rdt
which gives an error like this: Error: object 'rdt' not found
The only workaround was to put the file directly onto the cluster and set the working directory to read it (which defeats the purpose). I want to call it the way we would usually import a data frame, in this case from the sparklyr connection:
setwd("~/Directory")
data2 <- read.csv("rdt.csv",header = TRUE)
str(data2)
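With sparklyr, a table that lives on the cluster is referenced through the connection rather than appearing as an ordinary R object, so something along these lines may be what was intended; a minimal sketch, assuming the table is registered in the Spark catalog under the name "rdt" from the question:
library(sparklyr)
library(dplyr)
# Lazy reference to the Spark table; no data is moved into R yet
rdt_spark <- tbl(sc, "rdt")
# Push what work you can into Spark, then collect() only the result into R
rdt_local <- rdt_spark %>%
  head(1000) %>%    # keep the pull small; drop this line to collect everything
  collect()
str(rdt_local)
collect() is the step that actually copies rows from the cluster into an R data frame, so it is best applied after filtering or aggregating in Spark.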

How to run SparkR script using spark-submit or sparkR on an EMR cluster?

I have written a SparkR script and am wondering if I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkScript.r, etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in the sparkr-shell by pasting it directly or by using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR Spark documentation
R on Spark
Executing-existing-r-scripts-from-spark-rutger-de-graaf
Github
I was able to get this running via Rscript. There are a few things you need to do, and this may be a bit process intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyr build. Via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to first create a cluster with the EMR template and bootstrap script you will create by following Step 1. (Make sure to put the EMR template and rstudio_sparkr_emrlyr_blah_blah.sh scripts into an S3 bucket.)
Place your R code into a single file and put this in another S3 bucket...the sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading in data from S3, adding a value to it, then writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around); a minimal skeleton of such a file is sketched after this list.
Create another .sh file that copies the R file from your S3 bucket to the cluster and then executes it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster-creation call, and use the CUSTOM JAR RUNNER provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code.
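For reference, a minimal skeleton of the R file from step 3 might look like the following; the S3 paths and the added column are placeholders rather than anything from the original question:
# Load SparkR from the EMR install and start a session
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
.libPaths(c(.libPaths(), "/usr/lib/spark/R/lib"))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(appName = "emr-rscript-example")
# Step 3: read data from S3, add a value, and write it back out
df <- read.df("s3://my-bucket/input/data.csv", source = "csv", header = "true")    # hypothetical bucket
df <- withColumn(df, "processed_flag", lit(1L))
write.df(df, path = "s3://my-bucket/output/", source = "csv", mode = "overwrite")  # hypothetical bucket
# Step 6: stop the Spark session when the script is done
sparkR.session.stop()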
An example of the AWS CLI command might look like this (I'm using the us-east-1 zone on Amazon in my example, and throwing a 100GB disk on each worker in the cluster...just put your zone in wherever you see 'us-east-1' and pick whatever size disk you want instead)
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
Good luck! Cheers, Nate

SparkR job(R script) submit using spark-submit fails in BigInsights Hadoop cluster

I have created an IBM BigInsights service with a Hadoop cluster of 5 nodes (including Apache Spark with SparkR). I am trying to use SparkR to connect to a Cloudant DB, get some data, and do some processing.
I created a SparkR script and ran the following:
-bash-4.1$ spark-submit --master local[2] test_sparkr.R
16/08/07 17:43:40 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
Error: could not find function "sparkR.init"
Execution halted
-bash-4.1$
Content of test_sparkr.R file is:
# Creating a SparkContext and connecting to the Cloudant DB
sc <- sparkR.init(sparkEnv = list("cloudant.host"="<<cloudant-host-name>>", "cloudant.username"="<<cloudant-user-name>>", "cloudant.password"="<<cloudant-password>>", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "testdata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "testdata" Cloudant DB
testDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')
How do I install the spark-cloudant connector in IBM BigInsights and resolve the issue? Any help would be much appreciated.
I believe that the spark-cloudant connector isn’t for R yet.
Hopefully I can update this answer when it is!

Is it possible to run a SparkR program in Spark without R interpreter installed?

My question is about the feasibility of running a SparkR program in Spark without an R dependency.
In other words, can I run the following program in Spark when there is no R interpreter installed on the machine?
#set env var
Sys.setenv(SPARK_HOME="/home/fazlann/Downloads/spark-1.5.0-bin-hadoop2.6")
#Tell R where to find sparkR package
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
#load sparkR into this environment
library(SparkR)
#create the sparkcontext
sc <- sparkR.init(master = "local")
#to work with DataFrames we will need a SQLContext, which can be created from the SparkContext
sqlContext <- sparkRSQL.init(sc)
name <- c("Nimal","Kamal","Ashen","lan","Harin","Vishwa","Malin")
age <- c(23,24,12,25,31,22,43)
child <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE)
localdf <- data.frame(name,age,child)
#convert R dataframe into spark DataFrame
sparkdf <- createDataFrame(sqlContext,localdf);
#since we are passing a spark DataFrame into head function, the method gets executed in spark
head(sparkdf)
No, you can't. You'll need to install R and also the needed packages; otherwise your machine won't be able to interpret R.
Don't try to ship the R interpreter inside the application you are submitting; the resulting uber application would be excessively heavy to distribute across your cluster.
You'll need a configuration management system that allows you to define the state of your IT infrastructure and then automatically enforces the correct state.
No. SparkR works by having an R process communicate with the Spark JVM backend, so you will still need R installed on your machine, just as you need a JVM installed.
