SparkR can't load data into R

I followed exactly the same steps as other posts like this one to create a Spark DataFrame in R.
Sys.setenv(SPARK_HOME = "E:/spark-1.5.0-bin-hadoop2.6")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jre1.8.0_60")
library(rJava)
library(SparkR, lib.loc = "E:/spark-1.5.0-bin-hadoop2.6/R/lib/")
sc <- sparkR.init(master = "local", sparkHome = "E:/spark-1.5.0-bin-hadoop2.6")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, iris)
However, it keeps giving me an error at the very last step:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7

Related

sparklyr feature transformation functions result in error

I have some problems using the ft_* functions from the sparklyr R package. ft_binarizer works, but ft_normalizer and ft_min_max_scaler do not. Here is an example:
library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local", version = "2.1.0")
x = flights %>% select(dep_delay)
x_tbl <- sdf_copy_to(sc, x)
# works fine
ft_binarizer(x=x_tbl, input.col = "dep_delay", output.col = "delayed", threshold = 0)
# error
ft_normalizer(x= x_tbl, input.col = "dep_delay", output.col = "delayed_norm")
# error
ft_min_max_scaler(x= x_tbl,input.col = "dep_delay",output.col = "delayed_min_max")
The normalizer returns:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (double) => vector)"
The min_max_scaler returns:
"Error: java.lang.IllegalArgumentException: requirement failed: Column dep_delay must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually DoubleType."
I think it is a problem with the data type, but I don't know how to solve it. Does anybody have an idea what to do?
Many thanks in advance!
ft_normalizer operates on Vector columns, so you have to use ft_vector_assembler first:
ft_vector_assembler(x_tbl, input_cols="dep_delay", output_col="dep_delay_v") %>%
ft_normalizer(input.col = "dep_delay_v", output.col = "delayed_v_norm")
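The same pattern should apply to ft_min_max_scaler, which raised the VectorUDT error above. A minimal sketch, assuming the same (older) sparklyr argument names used in the question; the output column names are illustrative:
# Assemble the double column into a Vector column first, then scale it
ft_vector_assembler(x_tbl, input_cols = "dep_delay", output_col = "dep_delay_v") %>%
  ft_min_max_scaler(input.col = "dep_delay_v", output.col = "delayed_min_max")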

Running SparkR through RStudio

I used the link below to learn how to run SparkR through RStudio:
http://blog.danielemaasit.com/2015/07/26/installing-and-starting-sparkr-locally-on-windows-8-1-and-rstudio/
I am having trouble with section 4.5.
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "C:/Apache/spark-2.0.0")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "1g"))
library(SparkR)
sc<-sparkR.session(master = "local")
sqlContext <- sparkRSQL.init(sc)
DF <- createDataFrame(sqlContext, faithful)
The error comes up when I run the line that creates DF:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.a
In addition: Warning message:
'createDataFrame(sqlContext...)' is deprecated.
Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
See help("Deprecated")
I can't really tell what the error is and any help would be greatly appreciated.
Thanks!
Try this
Sys.setenv(SPARK_HOME = "C://Apache/spark-2.0.0")
You need to use "//" above.
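Separately, the deprecation warning points at the Spark 2.x API change: sparkR.session replaces sparkR.init, and createDataFrame no longer takes a sqlContext. A minimal sketch of the 2.x style, reusing the paths from the question:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "C:/Apache/spark-2.0.0")
}
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# The session is the single entry point in Spark 2.x; no separate sqlContext is needed
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "1g"))
DF <- createDataFrame(faithful)  # 2.x signature: createDataFrame(data, schema = NULL, samplingRatio = 1.0)
head(DF)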

SparkR: SparkException on createDataFrame

I'm trying to run the sample dataframe example using RStudio.
I have the following code:
Sys.setenv(SPARK_HOME = "C:\\Users\\himanshu.babbar\\Desktop\\Babbar\\Softwares\\spark-1.6.0-bin")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME, sparkEnvir = list(spark.driver.memory="512m"))
sqlContext <- sparkRSQL.init(sc)
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)
On doing this, I'm getting the following exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:406)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:404)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:396)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:396)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I'm able to run similar code on my colleague's machine, so this could be a configuration issue I may be missing. Any pointers here?
You appear to have a Hadoop version and/or libraries mismatch. This may not be "simple" to fix.
Get someone who understands Hadoop installations to take a look at your setup and then ensure the version of Spark you are using supports the Hadoop version on your machine (or the cluster you are connecting to).
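If you want to confirm which build you are actually launching, one rough check (assuming spark-submit lives under SPARK_HOME/bin; on Windows the script is spark-submit.cmd) is to print its version banner from R:
# Prints the Spark and Scala versions of the launched build; the Hadoop version a
# binary download was built against is visible in its package name, e.g. spark-1.6.0-bin-hadoop2.6
spark_submit <- file.path(Sys.getenv("SPARK_HOME"), "bin", "spark-submit")
system(paste(shQuote(spark_submit), "--version"))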

Stage Failure Error

I have the following code:
setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014")
Sys.setenv(SPARK_HOME = "C:\\Users\\Anonymous\\Desktop\\Spark-1.4.1\\spark-1.6.0-bin-hadoop2.6\\spark-1.6.0-bin-hadoop2.6")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
library(magrittr)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
When I run the following:
data <- read.df(sqlContext, "Test.csv", "com.databricks.spark.csv", header="true")
I get the following error:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
Test.csv is only a 3 x 2 table.
You will find more details about the cause of the error you are getting at the link below: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
You have to give the full path of the CSV file if the file is not in your present working directory. I am not completely sure, but below I have pasted code that works fine for me; you can try it.
Sys.setenv(SPARK_HOME='/home/jayashree/spark-1.5.0') # the path of spark_home dir. Please change it according to your spark home path
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.2.0")
sqlContext <- sparkRSQL.init(sc)
data <- read.df(sqlContext, "/full_path/to_your/datafile.csv", "com.databricks.spark.csv", header="true")
Are you working on Windows?
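If you are on Windows, the same full-path approach applies; a minimal sketch using the working directory set in the question (forward slashes avoid backslash-escaping issues):
# Full path to the file, built from the setwd() call in the question
data <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Test.csv",
                "com.databricks.spark.csv", header = "true")
head(data)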

SparkR Null Pointer Exception when trying to create a data frame

When trying to create a data frame in SparkR, I get a NullPointerException. I have pasted my code and the error message below. Do I need to install any more packages in order for this code to run?
CODE
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
library(SparkR, lib.loc = "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\R\\lib")
library(SparkR)
library(rJava)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
ERROR:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
You need to point library(SparkR) to the directory where the local SparkR code lives, via the lib.loc parameter (if you downloaded a Spark binary, SPARK_HOME/R/lib will already be populated for you):
library(SparkR, lib.loc = "/home/kris/spark/spark-1.5.2-bin-hadoop2.6/R/lib")
See also this tutorial on R-bloggers on how to run Spark from RStudio: http://www.r-bloggers.com/sparkr-with-rstudio-in-ubuntu-12-04/
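Applied to the SPARK_HOME from the question, that would look roughly like this (a sketch only; note that lib.loc is derived from SPARK_HOME itself, unlike the original library() call):
# Point lib.loc at the R/lib directory of the same Spark install used for sparkHome
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
library(SparkR, lib.loc = file.path(SPARK_HOME, "R", "lib"))
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)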
