Running SparkR through RStudio - r

I used the link below to learn how to run SparkR through RStudio:
http://blog.danielemaasit.com/2015/07/26/installing-and-starting-sparkr-locally-on-windows-8-1-and-rstudio/
I am having trouble with section 4.5.
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "C:/Apache/spark-2.0.0")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "1g"))
library(SparkR)
sc<-sparkR.session(master = "local")
sqlContext <- sparkRSQL.init(sc)
DF <- createDataFrame(sqlContext, faithful)
Error comes up when I run the DF function:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.a
In addition: Warning message:
'createDataFrame(sqlContext...)' is deprecated.
Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
See help("Deprecated")
I can't really tell what the error is and any help would be greatly appreciated.
Thanks!

Try this
Sys.setenv(SPARK_HOME = "C://Apache/spark-2.0.0")
You need to use "//" above.

Related

Not able to to convert R data frame to Spark DataFrame

When I try to convert my local dataframe in R to Spark DataFrame using:
raw.data <- as.DataFrame(sc,raw.data)
I get this error:
17/01/24 08:02:04 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.api.r.SQLUtils.getJavaSparkContext. Candidates are:
17/01/24 08:02:04 WARN RBackendHandler: getJavaSparkContext(class org.apache.spark.sql.SQLContext)
17/01/24 08:02:04 ERROR RBackendHandler: getJavaSparkContext on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
The question is similar to
sparkR on AWS: Unable to load native-hadoop library and
Don't need to use sc if you are using the latest version of Spark. I am using SparkR package having version 2.0.0 in RStudio. Please go through following code (that is used to connect R session with SparkR session):
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "path-to-spark home/spark-2.0.0-bin-hadoop2.7")
}
library(SparkR)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R","lib")))
sparkR.session(enableHiveSupport = FALSE,master = "spark://master url:7077", sparkConfig = list(spark.driver.memory = "2g"))
Following is the output of R console:
> data<-as.data.frame(iris)
> class(data)
[1] "data.frame"
> data.df<-as.DataFrame(data)
> class(data.df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
use this example code :
library(SparkR)
library(readr)
sc <- sparkR.init(appName = "data")
sqlContext <- sparkRSQL.init(sc)
old_df<-read_csv("/home/mx/data.csv")
old_df<-data.frame(old_df)
new_df <- createDataFrame( sqlContext, old_df)

SparkR from Rstudio - gives Error in invokeJava(isStatic = TRUE, className, methodName, ...) :

I am using RStudio.
After creating session if i try to create dataframe using R data it gives error.
Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(localDF)
ERROR :
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.a
>
TIA.
All many thanks for your help.
I had to do was set hadoop_home path in PATH variables
(winutils/bin). This should have your winutils.exe file. So when it
creates metastore for hive default derby) it is able to call hive
classes.
Also i had set hive support as False as i am not using it.
Sys.setenv(SPARK_HOME='E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7',HADOOP_HOME='E:/winutils')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'),.libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
library(rJava)
sparkR.session(enableHiveSupport = FALSE,master = "local[*]", sparkConfig = list(spark.driver.memory = "1g",spark.sql.warehouse.dir="E:/winutils/bin/"))
df <- as.DataFrame(iris)
If you have not used SparkR library but you're using Spark,
I recommend 'sparklyr' library made by RStudio.
Install the preview version of RStudio.
Install the library:
install.packages("devtools")
devtools::install_github('rstudio/sparklyr')
Load library and install spark.
library(sparklyr)
spark_install('1.6.2')
You can see a vignette in http://spark.rstudio.com/
These are the steps that I did in RStudio and it worked for me:
Sys.setenv(SPARK_HOME="C:\\spark-1.6.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)

SparkR Null Pointer Exception when trying to create a data frame

When trying to create a data frame in sparkR, I get an error regarding a Null Pointer Exception. I have pasted my code, and the error message below. Do I need to install any more packages in order for this code to run?
CODE
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
library(SparkR, lib.loc = "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\R\\lib")
library(SparkR)
library(rJava)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
ERROR:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
You need to point library SparkR to the directory where the local SparkR code is, specified in the lib.loc parameter (if you downloaded a Spark binary, the SPARK_HOME/R/lib will be already populated for you):
`library(SparkR, lib.loc = "/home/kris/spark/spark-1.5.2-bin-hadoop2.6/R/lib")`
See also this tutorial on R-bloggers on how to run Spark from Rstudio: http://www.r-bloggers.com/sparkr-with-rstudio-in-ubuntu-12-04/

SparkR can't load data into R

I followed the exact the same steps from other posts like this one to create a spark dataframe in R.
Sys.setenv(SPARK_HOME = "E:/spark-1.5.0-bin-hadoop2.6")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jre1.8.0_60")
library(rJava)
library(SparkR, lib.loc = "E:/spark-1.5.0-bin-hadoop2.6/R/lib/")
sc <- sparkR.init(master = "local", sparkHome = "E:/spark-1.5.0-bin-hadoop2.6")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, iris)
However, it keeps giving me the error at the very last step:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7

Parallel computing problems

I seem to have run on a problem when trying to run a parallel computing in R
library(parallel)
library(foreach)
library(doParallel)
library(snow)
cl <- makeCluster(detectCores())
Loading required package: Rmpi
Error : .onLoad failed in loadNamespace() for 'Rmpi', details:
call: inDL(x, as.logical(local), as.logical(now), ...)
error: unable to load shared object 'C:/Users/PCCasa/Documents/R/win- library/3.2/Rmpi/libs/x64/Rmpi.dll':
LoadLibrary failure: The system is unable to find the package specified.
Error in makeMPIcluster(spec, ...) :
the `Rmpi' package is needed for MPI clusters.
registerDoParallel(cl)
Error in registerDoParallel(cl) : Object 'cl' not found
windowsproduces an error which advises to either repair or reinstall msmpi.dll. Could you kindly let me know what the best prodecure would be as to solve this issue
None, the RMPI spawn function is not implemented for Windows. Here is the excerpt of the RMPI code.
if (.Platform$OS=="windows"){
#stop("Spawning is not implemented. Please use mpiexec with Rprofile.")
workdrive <- unlist(strsplit(getwd(),":"))[1]
workdir <- unlist(strsplit(getwd(),"/"))
if (length(workdir) > 1)
workdir <-paste(workdir, collapse="\\")
else
workdir <- paste(workdir,"\\")
localhost <- Sys.getenv("COMPUTERNAME")
networkdrive <-NULL #.Call("RegQuery", as.integer(2),paste("NETWORK\\",workdrive,sep=""),
#PACKAGE="Rmpi")
remotepath <-networkdrive[which(networkdrive=="RemotePath")+1]
mapdrive <- as.logical(mapdrive && !is.null(remotepath))
arg <- c(Rscript, R.home(), workdrive, workdir, localhost, mapdrive, remotepath)
if (.Platform$r_arch == "i386")
realns <- mpi.comm.spawn(slave = system.file("Rslaves32.cmd",
package = "Rmpi"), slavearg = arg, nslaves = nslaves,
info = 0, root = root, intercomm = intercomm, quiet = quiet)
else
realns <- mpi.comm.spawn(slave = system.file("Rslaves64.cmd",
package = "Rmpi"), slavearg = arg, nslaves = nslaves,
info = 0, root = root, intercomm = intercomm, quiet = quiet)
}

Resources