When trying to create a data frame in SparkR, I get an error regarding a NullPointerException. I have pasted my code and the error message below. Do I need to install any more packages for this code to run?
CODE
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
library(SparkR, lib.loc = "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\R\\lib")
library(SparkR)
library(rJava)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
ERROR:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
You need to point the SparkR library to the directory where the local SparkR code is, via the lib.loc parameter (if you downloaded a Spark binary, SPARK_HOME/R/lib will already be populated for you):
`library(SparkR, lib.loc = "/home/kris/spark/spark-1.5.2-bin-hadoop2.6/R/lib")`
See also this tutorial on R-bloggers on how to run Spark from RStudio: http://www.r-bloggers.com/sparkr-with-rstudio-in-ubuntu-12-04/
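For the Windows layout in the question, the same idea looks roughly like this (paths copied from the question; note this only addresses how SparkR is loaded, not the Hadoop/winutils setup discussed further down):
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
# Load the SparkR package that ships inside the Spark binary
library(SparkR, lib.loc = file.path(SPARK_HOME, "R", "lib"))
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)
sqlContext <- sparkRSQL.init(sc)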
Related
The spark_write_csv function doesn't work anymore, possibly since I upgraded the Spark version. Could someone please help?
Here is the code example, with the error message below:
library(sparklyr)
library(dplyr)
spark_conn <- spark_connect(master = "local")
iris <- copy_to(spark_conn, iris, overwrite = TRUE)
spark_write_csv(iris, path = "iris.csv")
Error: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
I am using RStudio.
After creating the session, if I try to create a DataFrame from R data, it gives an error.
Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(localDF)
ERROR:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.a
TIA.
Many thanks to all for your help. All I had to do was add the HADOOP_HOME path (winutils/bin) to the PATH variable. That directory should contain your winutils.exe file, so that when Spark creates the metastore for Hive (Derby by default) it is able to call the Hive classes.
I also set Hive support to FALSE, since I am not using it.
# Point R at the Spark binary and the winutils installation
Sys.setenv(SPARK_HOME='E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7', HADOOP_HOME='E:/winutils')
# Make the SparkR package bundled with Spark visible to R
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
library(rJava)
# Start a local session with Hive support disabled
sparkR.session(enableHiveSupport = FALSE, master = "local[*]",
               sparkConfig = list(spark.driver.memory = "1g", spark.sql.warehouse.dir = "E:/winutils/bin/"))
df <- as.DataFrame(iris)
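The text above mentions putting winutils/bin on PATH, which the snippet itself does not do; a minimal sketch of setting that from R (to run before sparkR.session(), and assuming the same E:/winutils layout) would be:
# Make winutils.exe visible by prepending its bin directory to PATH (Windows uses ';' as the separator)
Sys.setenv(PATH = paste("E:/winutils/bin", Sys.getenv("PATH"), sep = ";"))
Once the session is up, head(df) should return the first rows of iris as a local R data.frame.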
If you have not used the SparkR library but you are using Spark, I recommend the 'sparklyr' library made by RStudio.
Install the preview version of RStudio.
Install the library:
install.packages("devtools")
devtools::install_github('rstudio/sparklyr')
Load the library and install Spark:
library(sparklyr)
spark_install('1.6.2')
You can see a vignette at http://spark.rstudio.com/
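As a rough sketch of what the equivalent workflow looks like with sparklyr (local master assumed; the version matches the spark_install() call above, so adjust it if you installed something else):
library(sparklyr)
library(dplyr)
# Connect to the locally installed Spark
sc <- spark_connect(master = "local", version = "1.6.2")
# Copy an R data frame into Spark and query it with dplyr verbs
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
iris_tbl %>% count(Species)
spark_disconnect(sc)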
These are the steps that I did in RStudio and it worked for me:
Sys.setenv(SPARK_HOME="C:\\spark-1.6.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
I'm trying to run the sample DataFrame example using RStudio.
I have the following code:
Sys.setenv(SPARK_HOME = "C:\\Users\\himanshu.babbar\\Desktop\\Babbar\\Softwares\\spark-1.6.0-bin")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME, sparkEnvir = list(spark.driver.memory="512m"))
sqlContext <- sparkRSQL.init(sc)
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)
On doing this, I'm getting the following exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:406)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:404)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:396)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:396)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I'm able to run similar code on my colleague's machine, so this could be a configuration issue that I may be missing. Any pointers here?
You appear to have a Hadoop version and/or library mismatch. This may not be "simple" to fix.
Get someone who understands Hadoop installations to take a look at your setup, and then ensure that the version of Spark you are using supports the Hadoop version on your machine (or the cluster you are connecting to).
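One quick (and admittedly rough) way to see which Hadoop version a pre-built Spark 1.x binary was compiled against is to look at the assembly jar under SPARK_HOME, whose file name encodes the Hadoop version:
# e.g. "spark-assembly-1.6.0-hadoop2.6.0.jar" means the binary bundles Hadoop 2.6
list.files(file.path(Sys.getenv("SPARK_HOME"), "lib"), pattern = "spark-assembly")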
I have the following code:
setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014")
Sys.setenv(SPARK_HOME = "C:\\Users\\Anonymous\\Desktop\\Spark-1.4.1\\spark-1.6.0-bin-hadoop2.6\\spark-1.6.0-bin-hadoop2.6")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
library(magrittr)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
When I run the following:
data <- read.df(sqlContext, "Test.csv", "com.databricks.spark.csv", header="true")
I get the following error:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
Test.csv is only a 3 x 2 table.
You will find more details about the cause of this error at the link below: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
You have to give the full path of the CSV file if the file is not in your present working directory. I am not completely sure of the cause, but below is the code I use; it works fine for me, so you can try it.
Sys.setenv(SPARK_HOME='/home/jayashree/spark-1.5.0') # path of the Spark home dir; change it to match your Spark home path
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.2.0")
sqlContext <- sparkRSQL.init(sc)
data <- read.df(sqlContext, "/full_path/to_your/datafile.csv", "com.databricks.spark.csv", header="true")
Are you working on Windows?
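If so, a version of the read.df() call with a full Windows path (built from the setwd() directory in the question; adjust if Test.csv lives elsewhere) would be:
# Forward slashes work fine in R on Windows
data <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Test.csv",
                "com.databricks.spark.csv", header = "true")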
I followed exactly the same steps from other posts like this one to create a Spark DataFrame in R.
Sys.setenv(SPARK_HOME = "E:/spark-1.5.0-bin-hadoop2.6")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jre1.8.0_60")
library(rJava)
library(SparkR, lib.loc = "E:/spark-1.5.0-bin-hadoop2.6/R/lib/")
sc <- sparkR.init(master = "local", sparkHome = "E:/spark-1.5.0-bin-hadoop2.6")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, iris)
However, it keeps giving me the error at the very last step:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7