Not able to convert R data frame to Spark DataFrame - r

When I try to convert my local data frame in R to a Spark DataFrame using:
raw.data <- as.DataFrame(sc, raw.data)
I get this error:
17/01/24 08:02:04 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.api.r.SQLUtils.getJavaSparkContext. Candidates are:
17/01/24 08:02:04 WARN RBackendHandler: getJavaSparkContext(class org.apache.spark.sql.SQLContext)
17/01/24 08:02:04 ERROR RBackendHandler: getJavaSparkContext on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
This question is similar to sparkR on AWS: Unable to load native-hadoop library.

You don't need to pass sc if you are using a recent version of Spark. I am using the SparkR package, version 2.0.0, in RStudio. Please go through the following code, which connects the R session to a Spark session:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "path-to-spark-home/spark-2.0.0-bin-hadoop2.7")
}
library(SparkR)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(enableHiveSupport = FALSE,
               master = "spark://master-url:7077",
               sparkConfig = list(spark.driver.memory = "2g"))
The following is the output of the R console:
> data<-as.data.frame(iris)
> class(data)
[1] "data.frame"
> data.df<-as.DataFrame(data)
> class(data.df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

Alternatively, use this example code (this uses the pre-2.0 sparkR.init API):
library(SparkR)
library(readr)
sc <- sparkR.init(appName = "data")    # pre-2.0 entry point, deprecated in Spark 2.x
sqlContext <- sparkRSQL.init(sc)
old_df <- read_csv("/home/mx/data.csv")
old_df <- data.frame(old_df)           # readr returns a tibble; coerce to a plain data.frame
new_df <- createDataFrame(sqlContext, old_df)
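For Spark 2.x, the same flow can be sketched with the session API instead of the deprecated init functions (assuming a CSV at the same path):
library(SparkR)
sparkR.session(appName = "data")          # replaces sparkR.init/sparkRSQL.init
old_df <- read.csv("/home/mx/data.csv")   # base R reader; readr works too after coercion
new_df <- createDataFrame(old_df)         # no sqlContext argument needed in 2.x
class(new_df)                             # "SparkDataFrame"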

Related

spark_write_csv doesn't work anymore (with Sparklyr)

The spark_write_csv function doesn't work anymore, possibly since I upgraded my Spark version. Could someone help, please?
Here is a code example, with the error message below:
library(sparklyr)
library(dplyr)
spark_conn <- spark_connect(master = "local")
iris <- copy_to(spark_conn, iris, overwrite = TRUE)
spark_write_csv(iris, path = "iris.csv")
Error: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
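spark_write_csv writes a directory of part files rather than a single file, so one thing worth checking, sketched below under the assumption that the connection itself is healthy, is that the path is a writable folder and that the Spark version actually running is the one you expect:
library(sparklyr)
library(dplyr)
spark_conn <- spark_connect(master = "local")
spark_version(spark_conn)                  # confirm which Spark actually launched
iris_tbl <- copy_to(spark_conn, iris, overwrite = TRUE)
# Write to a directory path; mode = "overwrite" replaces any previous output
spark_write_csv(iris_tbl, path = "file:///tmp/iris_csv", mode = "overwrite")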

How to make SparkR run

I have been trying to make SparkR work, without success.
I have read previous questions and blog posts, yet I haven't been able to make it work.
First I had issues installing SparkR; I think I finally installed it, but then I could not make it run.
Here is my detailed code with the different options I tried to make it run.
I am currently using RStudio with R version 3.6.0.
Any help will be appreciated!
#***************************#
#Installing Spark Option 1
#***************************#
install.packages("SparkR")
# Does not work
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
#***************************#
#Installing Spark Option 2
#***************************#
#Find Spark Versions
jsonlite::fromJSON("https://api.github.com/repos/apache/spark/tags")$name
if (!require('devtools')) install.packages('devtools')
devtools::install_github('apache/spark@v2.4.6', subdir = 'R/pkg')  # '@' selects a tag; '#' refers to a pull request number
Sys.setenv(SPARK_HOME = 'D:/spark-2.3.1-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
# Installation didn't work
#***************************#
#Installation Spark Option 3
#***************************#
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.1")
install.packages("https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_2.3.0.tar.gz", repos = NULL, type="source")
library(SparkR)
# One of the two installations worked
#***************************#
#Starting Spark Option 1
#***************************#
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R","lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
# Console output:
# Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd --driver-memory "2g" sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a0263f43f5
# Error in if (len > 0) { : argument has length zero
#***************************#
#Starting Spark Option 2
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
sc <- sparkR.init(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",  # note: the inner quotes become part of the path
                  sparkEnvir = sparkEnvir)
# Error in sparkR.sparkContext(master, appName, sparkHome, convertNamedListToEnv(sparkEnvir), :
#   JVM is not ready after 10 seconds
# In addition: Warning message:
#   sparkR.init is deprecated.
#   Use sparkR.session instead.
#   See help("Deprecated")
#***************************#
#Starting Spark Option 3
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
sc <- sparkR.session(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",  # same inner-quote problem as above
                     sparkEnvir = sparkEnvir)
# Spark not found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a082b15e1
# Error in if (len > 0) { : argument has length zero
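The repeated "Error in if (len > 0)" failure is commonly reported when the Java found on the PATH is newer than the Java 8 that Spark 2.3.x supports. A minimal diagnostic sketch, assuming the JDK path used earlier in the post:
# Point both environment variables at known-good locations, then retry
Sys.setenv(JAVA_HOME = "D:/Program Files/Java/jdk1.8.0_181")
system("java -version")                 # should report a 1.8.x JVM
Sys.setenv(SPARK_HOME = "D:/spark-2.3.1-bin-hadoop2.7")  # plain path, no inner quotes
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]")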

Setting up sparklyr

I am setting up sparklyr with R, but I keep getting an error message. Essentially, I have typed in:
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.1.0")
sc <- spark_connect(master = "local")
However, when I get to creating my Spark connection, I receive the following error message:
Using Spark: 2.1.0
Error in if (a[k] > b[k]) return(1) else if (a[k] < b[k]) return(-1L) :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: running command '"C:\WINDOWS\SYSTEM32\java.exe" -version' had status 2
2: In compareVersion(parsedVersion, "1.7") : NAs introduced by coercion
Any thoughts?
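The warnings point at '"...java.exe" -version' failing, so a hedged first step is to confirm that R can find a working JDK before connecting (the JAVA_HOME path below is hypothetical):
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_181")  # hypothetical install path
system("java -version")    # should print a version string, not fail with status 2
library(sparklyr)
sc <- spark_connect(master = "local")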

SparkR Null Pointer Exception when trying to create a data frame

When trying to create a data frame in SparkR, I get a NullPointerException error. I have pasted my code and the error message below. Do I need to install any more packages for this code to run?
CODE
SPARK_HOME <- "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\spark-1.5.2-bin-hadoop2.4"
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
library(SparkR, lib.loc = "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\R\\lib")
library(SparkR)
library(rJava)
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
ERROR:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
You need to point library(SparkR) to the directory containing the local SparkR code via the lib.loc parameter (if you downloaded a Spark binary, SPARK_HOME/R/lib will already be populated for you):
`library(SparkR, lib.loc = "/home/kris/spark/spark-1.5.2-bin-hadoop2.6/R/lib")`
See also this R-bloggers tutorial on running Spark from RStudio: http://www.r-bloggers.com/sparkr-with-rstudio-in-ubuntu-12-04/
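On Windows, a NullPointerException thrown from org.apache.hadoop.util.Shell (as in the trace above) often indicates that Hadoop's winutils.exe helper is missing. A hedged setup sketch, assuming winutils.exe has been placed in C:\hadoop\bin (a hypothetical location):
# Tell Hadoop where winutils.exe lives before initializing Spark
Sys.setenv(HADOOP_HOME = "C:\\hadoop")
library(SparkR, lib.loc = "C:\\Users\\erer\\Downloads\\spark-1.5.2-bin-hadoop2.4\\R\\lib")
sc <- sparkR.init(master = "local", sparkHome = SPARK_HOME)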

Problems installing R packages

I'm setting up a new laptop running Gentoo and wish to install R (as I do on all of my computers!).
However, I've hit a bit of a problem when it comes to installing packages.
I first tried:
> install.packages(c("ggplot2", "plyr", "reshape2"))
And it duly downloaded all of the packages and their dependencies. However, they didn't install, reporting:
Error in library(data.table) : there is no package called ‘data.table’
Calls: .First -> library
Execution halted
(the same three-line error is repeated once for each package being installed)
Not a problem, I thought, I'll just install the data.table package. Unfortunately...
> install.packages("data.table")
trying URL 'http://cran.uk.r-project.org/src/contrib/data.table_1.8.2.tar.gz'
Content type 'application/x-gzip' length 818198 bytes (799 Kb)
opened URL
==================================================
downloaded 799 Kb
Error in library(data.table) : there is no package called ‘data.table’
Calls: .First -> library
Execution halted
The downloaded source packages are in
‘/tmp/RtmpbQtALj/downloaded_packages’
Updating HTML index of packages in '.Library'
Making packages.html ... done
Warning message:
In install.packages("data.table") :
installation of package ‘data.table’ had non-zero exit status
And there is no indication of why the installation failed, so I've no idea how to go about solving this. A traceback() isn't available either.
GCC is installed and configured as the output of gcc-config shows (and the fact that I can install other software from source no problem).
# gcc-config -l
[1] x86_64-pc-linux-gnu-4.6.3 *
I'm stumped as to how to go about solving this one. Any thoughts or ideas on how to get more information out of install.packages() are welcome.
EDIT: contents of .First, as requested...
> .First
function ()
{
library(data.table)
library(foreign)
library(ggplot2)
library(Hmisc)
library(lattice)
library(plyr)
library(rms)
library(xtable)
cat("\nWelcome at", date(), "\n")
}
EDIT 2: There is no Rprofile.site, but there is /usr/lib64/R/library/base/R/Rprofile, which contains...
# cat /usr/lib64/R/library/base/R/Rprofile
### This is the system Rprofile file. It is always run on startup.
### Additional commands can be placed in site or user Rprofile files
### (see ?Rprofile).
### Notice that it is a bad idea to use this file as a template for
### personal startup files, since things will be executed twice and in
### the wrong environment (user profiles are run in .GlobalEnv).
.GlobalEnv <- globalenv()
attach(NULL, name = "Autoloads")
.AutoloadEnv <- as.environment(2)
assign(".Autoloaded", NULL, envir = .AutoloadEnv)
T <- TRUE
F <- FALSE
R.version <- structure(R.Version(), class = "simple.list")
version <- R.version # for S compatibility
## for backwards compatibility only
R.version.string <- R.version$version.string
## NOTA BENE: options() for non-base package functionality are in places like
## --------- ../utils/R/zzz.R
options(keep.source = interactive())
options(warn = 0)
# options(repos = c(CRAN="#CRAN#"))
# options(BIOC = "http://www.bioconductor.org")
options(timeout = 60)
options(encoding = "native.enc")
options(show.error.messages = TRUE)
## keep in sync with PrintDefaults() in ../../main/print.c :
options(scipen = 0)
options(max.print = 99999)# max. #{entries} in internal printMatrix()
options(add.smooth = TRUE)# currently only used in 'plot.lm'
options(stringsAsFactors = TRUE)
if(!interactive() && is.null(getOption("showErrorCalls")))
options(showErrorCalls = TRUE)
local({dp <- Sys.getenv("R_DEFAULT_PACKAGES")
if(identical(dp, "")) # marginally faster to do methods last
dp <- c("datasets", "utils", "grDevices", "graphics",
"stats", "methods")
else if(identical(dp, "NULL")) dp <- character(0)
else dp <- strsplit(dp, ",")[[1]]
dp <- sub("[[:blank:]]*([[:alnum:]]+)", "\\1", dp) # strip whitespace
options(defaultPackages = dp)
})
## Expand R_LIBS_* environment variables.
Sys.setenv(R_LIBS_SITE =
.expand_R_libs_env_var(Sys.getenv("R_LIBS_SITE")))
Sys.setenv(R_LIBS_USER =
.expand_R_libs_env_var(Sys.getenv("R_LIBS_USER")))
.First.sys <- function()
{
for(pkg in getOption("defaultPackages")) {
res <- require(pkg, quietly = TRUE, warn.conflicts = FALSE,
character.only = TRUE)
if(!res)
warning(gettextf('package %s in options("defaultPackages") was not found', sQuote(pkg)),
call.=FALSE, domain = NA)
}
}
.OptRequireMethods <- function()
{
if("methods" %in% getOption("defaultPackages")) {
res <- require("methods", quietly = TRUE, warn.conflicts = FALSE,
character.only = TRUE)
if(!res)
warning('package "methods" in options("defaultPackages") was not found', call.=FALSE)
}
}
if(nzchar(Sys.getenv("R_BATCH"))) {
.Last.sys <- function()
{
cat("> proc.time()\n")
print(proc.time())
}
## avoid passing on to spawned R processes
## A system has been reported without Sys.unsetenv, so try this
try(Sys.setenv(R_BATCH=""))
}
###-*- R -*- Unix Specific ----
.Library <- file.path(R.home(), "library")
.Library.site <- Sys.getenv("R_LIBS_SITE")
.Library.site <- if(!nchar(.Library.site)) file.path(R.home(), "site-library") else unlist(strsplit(.Library.site, ":"))
.Library.site <- .Library.site[file.exists(.Library.site)]
invisible(.libPaths(c(unlist(strsplit(Sys.getenv("R_LIBS"), ":")),
unlist(strsplit(Sys.getenv("R_LIBS_USER"), ":")
))))
local({
## we distinguish between R_PAPERSIZE as set by the user and by configure
papersize <- Sys.getenv("R_PAPERSIZE_USER")
if(!nchar(papersize)) {
lcpaper <- Sys.getlocale("LC_PAPER") # might be null: OK as nchar is 0
papersize <- if(nchar(lcpaper))
if(length(grep("(_US|_CA)", lcpaper))) "letter" else "a4"
else Sys.getenv("R_PAPERSIZE")
}
options(papersize = papersize,
printcmd = Sys.getenv("R_PRINTCMD"),
dvipscmd = Sys.getenv("DVIPS", "dvips"),
texi2dvi = Sys.getenv("R_TEXI2DVICMD"),
browser = Sys.getenv("R_BROWSER"),
pager = file.path(R.home(), "bin", "pager"),
pdfviewer = Sys.getenv("R_PDFVIEWER"),
useFancyQuotes = TRUE)
})
## non standard settings for the R.app GUI of the Mac OS X port
if(.Platform$GUI == "AQUA") {
## this is set to let RAqua use both X11 device and X11/TclTk
if (Sys.getenv("DISPLAY") == "")
Sys.setenv("DISPLAY" = ":0")
## this is to allow gfortran compiler to work
Sys.setenv("PATH" = paste(Sys.getenv("PATH"),":/usr/local/bin",sep = ""))
}## end "Aqua"
local({
tests_startup <- Sys.getenv("R_TESTS")
if(nzchar(tests_startup)) source(tests_startup)
})
It looks like data.table is not installed for the user running the install.packages command. Wrapping the body of that .First function in if (interactive()) { } would be a good idea in general. Otherwise, you need to install data.table and any other packages loaded at startup, because the R sessions that install.packages spawns also run the .Rprofile file when they start.
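A sketch of that change to the user's ~/.Rprofile, using the .First shown in the question:
.First <- function() {
  if (interactive()) {
    library(data.table)
    library(foreign)
    library(ggplot2)
    library(Hmisc)
    library(lattice)
    library(plyr)
    library(rms)
    library(xtable)
    cat("\nWelcome at", date(), "\n")
  }
}
With this guard, non-interactive sessions (like the ones install.packages spawns) skip the library calls entirely.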
