How to write to JDBC source with SparkR 1.6.0? - r

With SparkR 1.6.0 I can read from a JDBC source with the following code,
jdbc_url <- "jdbc:mysql://localhost:3306/dashboard?user=<username>&password=<password>"
df <- sqlContext %>%
loadDF(source = "jdbc",
url = jdbc_url,
driver = "com.mysql.jdbc.Driver",
dbtable = "db.table_name")
But after performing a calculation, when I try to write the data back to the database I've hit a roadblock as attempting...
write.df(df = df,
path = "NULL",
source = "jdbc",
url = jdbc_url,
driver = "com.mysql.jdbc.Driver",
dbtable = "db.table_name",
mode = "append")
...returns...
ERROR RBackendHandler: save on 55 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:2066)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelIn
Looking around the web I found this which tells me that a patch for this error was included as of version 2.0.0; and we also get the functions read.jdbc and write.jdbc.
For this question, though, assume I'm stuck with SparkR v1.6.0. Is there a way to write to JDBC sources (i.e. is there a workaround that would allow me to use DataFrameWriter.jdbc() from SparkR)?

The short answer is, no, the JDBC write method was not supported by SparkR until version 2.0.0.

Related

R targets and dataRetrieval return a connection error

I am attempting to use a targets workflow in my R project. I am attempting to download water quality data using the dataRetrieval package. In a fresh R session this works:
dataRetrieval::readWQPdata(siteid="USGS-04024315",characteristicName="pH")
To use this in targets, I have the following _targets.R file:
library(targets)
tar_option_set(packages = c("dataRetrieval"))
list(
tar_target(
name = wqp_data,
command = readWQPdata(siteid="USGS-04024315",characteristicName="pH"),
format = "feather",
cue = tar_cue(mode = "never")
)
)
when I run tar_make() the following is returned:
* start target wqp_data
No internet connection.
The following url returned no data:
https://www.waterqualitydata.us/data/Result/search?siteid=USGS-04024315&characteristicName=pH&zip=yes&mimeType=tsv
x error target wqp_data
* end pipeline
Error : attempt to set an attribute on NULL
Error: callr subprocess failed: attempt to set an attribute on NULL
Visit https://books.ropensci.org/targets/debugging.html for debugging advice.
Run `rlang::last_error()` to see where the error occurred.
I have attempted debugging using tar_option_set(debug = "wqp_data") or tar_option_set(workspace_on_error = TRUE) but outside of isolating the error to readWQPdata() didn't get anywhere.
I also had success using curl directly in targets so I do not think it is my actual internet connection:
list(
tar_target(
name = wqp_data,
command = {con <- curl::curl("https://httpbin.org/get")
readLines(con)
close(con)}
)
)
tar_make()
* start target wqp_data
* built target wqp_data
* end pipeline
Any advice on how to diagnose the connection issue when using these two packages?

NoSuchMethodException when calling sendKeys on object of class org.openqa.selenium.remote.RemoteWebElement via R package rJava

I am trying to use the selenium webdriver API directly from R using rJava. I am subject to a fairly restrictive IT environment, so I can't access a remote driver currently (hence why I'm not currently using the Rselenium package), and I don't have either Chrome or Firefox availaible--just phantomjs. I am able to get this working okay from the Scala REPL. I used sbt to get all the dependenices--build.sbt contains, for example:
retrieveManaged := true
libraryDependencies ++= Seq (
"org.seleniumhq.selenium" % "selenium-java" % "3.9.1",
"com.codeborne" % "phantomjsdriver" % "1.4.4"
)
(Note that I have phantomjs installed as /usr/local/bin/phantomjs, and it is
version 2.1.1).
I then copied all the jar files to a single-level folder via cp jars/*/*/*.jar alljars/ containing the following:
animal-sniffer-annotations-1.14.jar httpcore-4.4.6.jar selenium-api-3.9.1.jar
byte-buddy-1.7.9.jar j2objc-annotations-1.1.jar selenium-chrome-driver-3.9.1.jar
checker-compat-qual-2.0.0.jar jline-2.14.5.jar selenium-edge-driver-3.9.1.jar
commons-codec-1.10.jar jsr305-1.3.9.jar selenium-firefox-driver-3.9.1.jar
commons-exec-1.3.jar okhttp-3.9.1.jar selenium-ie-driver-3.9.1.jar
commons-logging-1.2.jar okio-1.13.0.jar selenium-java-3.9.1.jar
error_prone_annotations-2.1.3.jar phantomjsdriver-1.4.4.jar selenium-opera-driver-3.9.1.jar
gson-2.8.2.jar scala-compiler-2.12.4.jar selenium-remote-driver-3.9.1.jar
guava-23.6-jre.jar scala-library-2.12.4.jar selenium-safari-driver-3.9.1.jar
httpclient-4.5.3.jar scala-reflect-2.12.4.jar selenium-support-3.9.1.jar
I start Scala via scala -cp "alljars/*" and can the do following:
val drv = new org.openqa.selenium.phantomjs.PhantomJSDriver
drv.get("https://www.google.com")
val q = drv.findElementByName("q")
q.sendKeys("rJava selenium")
q.submit
drv.getTitle
I think the following is roughly the same thing in R using rJava:
library(rJava)
.jinit()
jars <- dir("alljars", pattern = "*.jar", full.names = TRUE)
.jaddClassPath(jars)
drv <- .jnew('org/openqa/selenium/phantomjs/PhantomJSDriver')
drv$get("https://www.google.com")
q <- drv$findElementByName("q")
q$sendKeys("rJava selenium")
q$submit()
drv$getTitle()
This fails at the point q$sendKeys("rJava selenium") with the following error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.NoSuchMethodException: No suitable method for the given parameters
In RStudio, if I type q$ and press TAB, sendKeys is definitely in the list of available methods. I tried to be explicit about this, and tried:
keys <- .jnew("java/lang/String", "rJava selenium")
keys <- .jcast(keys, "java/lang/CharSequence", check = TRUE)
q <- .jcast(q, "org/openqa/selenium/WebElement", check = TRUE)
.jcall(q, "V", "sendKeys", keys)
which resulted in the following error:
Error in .jcall(q, "V", "sendKeys", keys) :
method sendKeys with signature (Ljava/lang/CharSequence;)V not found
q has class org/openqa/selenium/remote/RemoteWebElement in R, and org/openqa/selenium/WebElement in Scala; but in both cases the return is void and the required argument is CharSequence according to the javadocs. I tried a few variations of this--java.lang.String instead of CharSequence, RemoteWebElement instead of WebElement, etc., but no joy.
I doubt this is a problem with rJava, but I'm stumped nonetheless and need help!
Oh good grief. I didn't know about .jmethods. Running this:
> .jmethods(q, "sendKeys")
[1] "public void org.openqa.selenium.remote.RemoteWebElement.sendKeys(java.lang.CharSequence[])"
So, basically, my problem was that I was passing String instead of String[]. That is, instead of:
q$sendKeys("rJava selenium")
I can use:
q$sendKeys(.jarray("rJava selenium"))
The more you know...

Internal error in the mapping processor: java.lang.NullPointerException

I'm trying to my map local pojo to an autogenerated domain objects using mapstruct. Expect for a specific complex structure everything else seems to map and the mapper implementation class gets generation. Below is the error that I get.
My mapper class is:
#Mappings({
#Mapping(source = "sourcefile", target = "sourceFILE"),
#Mapping(source = "id", target = "ID"),
#Mapping(source = "reg", target = "regID"),
#Mapping(source = "itemDetailsType", target = "ItemDetailsType") //This is the structure that does not map
})
AutoGenDomainType map(LocalPojo localPojo);
#Mappings({
#Mapping(source = "line", target = "LINE"),
#Mapping(source = type", target = "TYPE")
})
ItemDetailsType map(ItemDetailsTypes itemDetailsType);
Error:
Internal error in the mapping processor: java.lang.NullPointerException at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.hasCompatibleCopyConstructor(MappingResolverImpl.java:547) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.isPropertyMappable(MappingResolverImpl.java:522) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.getTargetAssignment(MappingResolverImpl.java:202) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.access$100(MappingResolverImpl.java:153) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl.getTargetAssignment(MappingResolverImpl.java:121) at
.....
.....
[ERROR]
[ERROR] Found 1 error and 16 warnings.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project uwo-services: Compilation failure
The target object ItemDetailsType does have other properties that need not be mapped. The error says compilation issue, but I dont find any. Also I have tried adding have tried the unmappedTargetPolicy = ReportingPolicy.IGNORE at my mapper class level just to avoid if this is caused by the unmapped properties, but still no solution.
This is a known bug in MapStruct. The bug is reported in #729, it has been fixed in 1.1.0.Final. You are using 1.0.0.Final. I would highly suggest switching to the either 1.1.0.Final or 1.2.0.Beta2.
Once you update you will see a better error message and you will know exactly what the problem in the mapping is.
By looking at this first it looks like that target in #Mapping(source = "itemDetailsType", target = "ItemDetailsType") is wrong. Are you sure that you need a capital letter there?

Can't create dplyr src backed by SparkSQL in dplyr.spark.hive package

Recently I found out about great dplyr.spark.hive package that enables dplyr frontend operations with spark or hive backend .
There is an information on how to install this package in package's README :
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
and there are also many examples on how to work with dplyr.spark.hive when one is already connected to hiveServer - check this.
But I am not able to connect to hiveServer, so I can not benefit from the great power of this package...
I've tried such commands, but they did not work out. Does anyone have any solution or comment on what am I doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
EDIT 2
I have set more paths to system variables like this but now I receive a warning telling me that some kind of Java logging-configuration is not specified bu I think it is
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty.
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
EDIT 3
My exact call to the scr_SparkSQL() is
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the proces does not stop (never).
Where those settings work for beeline with such params:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
but I am not able to pass user name and dbname to src_SparkSQL()
so I have tried to manual use the code from inside that function but I receive the sam problem that the below code also does not finish
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
# function(dr, url, retry){
# if(retry > 0)
# tryCatch(
# dbConnect(drv = dr, url = url),
# error =
# function(e) {
# Sys.sleep(0.1)
# dbConnect_retry(dr = dr, url = url, retry - 1)})
# else dbConnect(drv = dr, url = url)}
#################
##con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url should containg also /loghost - the dbname?
I now see that you tried multiple things with multiple errors. Let me comment error by error.
my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
The RJDBC object could not be created. Unless we solve this, nothing else will work, workarounds or not. Have you set HADOOP_JAR with, for instance,
Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar"). Sorry I seem to have skipped this in the instructions. Will fix.
my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Same problem. Please note host port argument do not accept URL syntax, just host and port. URL is formed internally.
my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Stop thriftserver first or connect to existing one, but you still have to fix the class not found problem.
my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Same as above.
Plan:
Set HADOOP_JAR. Find host and port of running thriftserver, if not default. Try src_SparkSQL with start.server = FALSE. If happy quit, else goto step 2
Stop existing thriftserver. Try again src_SparkSQL with start.server = TRUE
Let me know how things go.
There was a problem that I did't specify the proper classPath that was needed inside JDBC function that created a driver. Parameters to classPath in dplyr.spark.hive package are passed via HADOOP_JAR global variable.
To use JDBC as a driver to hiveServer2 (through the Thrift protocol) one need to add at least those 3 .jars with Java classes to create a proper driver
hive-jdbc-1.0.0-standalone.jar
hadoop/common/lib/commons-configuration-1.6.jar
hadoop/common/hadoop-common-2.4.1.jar
versions are arbitrary and should be compatible with the installed version of local hive, hadoop and hiveServer2.
They need to be set with the .Platform$path.sep (as described here)
classPath = c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
"system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
"system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR= paste0(classPath, collapse=.Platform$path.sep)
Then when HADOOP_JAR is set one have to be carefull with hiveServer2 url. In my case it had to be
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
and finally the proper connection with hiveServer2 using RJDBC package is
Sys.setenv(HADOOP_HOME="/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = '/opt/hive/lib/')
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass = "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
dr2 = JDBC(driverclass,
classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
#"/opt/hive/lib/commons-configuration-1.6.jar",
"/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
"/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
identifier.quote = "`")
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
dbConnect(dr2, url, username = "mkosinski") -> cont

SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database

I am using Apache Spark 1.5.1 and trying to connect to a local SQLite database named clinton.db. Creating a data frame from a table of the database works fine but when I do some operations on the created object, I get the error below which says "SQL error or missing database (Connection is closed)". Funny thing is that I get the result of the operation nevertheless. Any idea what I can do to solve the problem, i.e., avoid the error?
Start command for spark-shell:
../spark/bin/spark-shell --master local[8] --jars ../libraries/sqlite-jdbc-3.8.11.1.jar --classpath ../libraries/sqlite-jdbc-3.8.11.1.jar
Reading from the database:
val emails = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlite:../data/clinton.sqlite", "dbtable" -> "Emails")).load()
Simple count (fails):
emails.count
Error:
15/09/30 09:06:39 WARN JDBCRDD: Exception closing statement
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (Connection is closed)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109)
at org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1$$anonfun$8.apply(JDBCRDD.scala:358)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1$$anonfun$8.apply(JDBCRDD.scala:358)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
res1: Long = 7945
I got the same error today, and the important line is just before the exception:
15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection
15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (Connection is closed)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109)
at org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
So Spark succeeded to close the JDBC connection, and then it fails to close the JDBC statement
Looking at the source, close() is called twice:
Line 358 (org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, Spark 1.5.1)
context.addTaskCompletionListener{ context => close() }
Line 469
override def hasNext: Boolean = {
if (!finished) {
if (!gotNext) {
nextValue = getNext()
if (finished) {
close()
}
gotNext = true
}
}
!finished
}
If you look at the close() method (line 443)
def close() {
if (closed) return
you can see that it checks the variable closed, but that value is never set to true.
If I see it correctly, this bug is still in the master. I have filed a bug report.
Source: JDBCRDD.scala (lines numbers differ slightly)

Resources