How do I register a custom JDBC dialect in RStudio?

I'm trying to analyze BigQuery data in RStudio Server running on a Google Dataproc cluster. However, due to the memory limitations of RStudio, I intend to run queries on this data in sparklyr, but I haven't had any success importing the data directly into the Spark cluster from BigQuery.
I'm using Google's official JDBC connectivity driver:
ODBC and JDBC drivers for BigQuery
I also have the following software versions running:
Google Dataproc: 2.0-Debian 10
Sparklyr: Spark 3.2.1 Hadoop 3.2
R version 4.2.1
I also had to replace the following Spark jars with the versions used by the JDBC driver above, or add them where they were missing:
failureaccess-1.0.1 was added
protobuf-java-3.19.4 replaced 2.5.0
guava 31.1-jre replaced 14.0.1
Below is my code using the spark_read_jdbc function to retrieve a dataset from BigQuery:
conStr <- "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=3;AllowLargeResults=1;"
spark_read_jdbc(sc = spkc,
                name = "events_220210",
                memory = FALSE,
                options = list(url = conStr,
                               driver = "com.simba.googlebigquery.jdbc.Driver",
                               user = "rstudio",
                               password = "xxxxxx",
                               dbtable = "dataset.table"))
The table gets imported into the Spark cluster, but when I try to preview it, I receive the following error message:
ERROR sparklyr: Gateway (551) failed calling collect on sparklyr.Utils: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (faucluster1-w-0.europe-west2-c.c.ga4-warehouse-342410.internal executor 2): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long.
at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.simba.googlebigquery.utilities.conversion.TypeConverter.toLong(Unknown Source)
at com.simba.googlebigquery.jdbc.common.SForwardResultSet.getLong(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9(JdbcUtils.scala:446)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9$adapted(JdbcUtils.scala:445)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:367)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:349)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
When I try to import the data via an SQL query, e.g.
SELECT date, name, age FROM dataset.tablename
I end up with a table looking like this:
date   name   age
----   ----   ---
date   name   age
date   name   age
date   name   age
I've read in several posts that the solution to this is to register a custom JDBC dialect, but I'm not sure how to do this in practice, what platform to do it on, or whether it's possible from within RStudio. Links to any materials that would help me solve this problem would be appreciated.
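For what it's worth, here is a rough sketch of what I think the registration might look like from the sparklyr side, assuming a hypothetical pre-compiled JdbcDialect subclass com.example.BigQueryDialect (one that overrides quoteIdentifier to use backticks instead of double quotes) packaged as bigquery-dialect.jar; the class name and jar paths are placeholders and I haven't been able to confirm this works:
library(sparklyr)

# Put the (hypothetical) dialect jar and the Simba driver on the classpath
# before connecting; both paths are placeholders.
conf <- spark_config()
conf$sparklyr.jars.default <- c("/path/to/bigquery-dialect.jar",
                                "/path/to/GoogleBigQueryJDBC42.jar")
spkc <- spark_connect(master = "yarn", config = conf)

# Register the dialect with Spark's JdbcDialects registry via the JVM,
# then call spark_read_jdbc() as before.
dialect <- invoke_new(spkc, "com.example.BigQueryDialect")
invoke_static(spkc, "org.apache.spark.sql.jdbc.JdbcDialects",
              "registerDialect", dialect)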

Related

Missing letters of database objects being returned in DBI SQL Server ODBC connection

Unfortunately, I will not be able to create a good repro for this issue without sharing confidential creds to the database I am having issues with. Hopefully I have enough information below to flag any obvious problems that ODBC experts will understand.
Background
I am running a MacBook Pro with the following specs:
Model Name: MacBook Pro
Model Identifier: MacBookPro15,1
Processor Name: 6-Core Intel Core i7
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 9 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
Boot ROM Version: 1037.0.78.0.0 (iBridge: 17.16.10572.0.0,0)
My ODBC connection is set using FreeTDS as specified here.
The relevant portion of freetds.conf is as follows:
# The POC SQL Server
[POC]
host = 172.22.238.154
port = 1433
tds version = 7.3
My odbcinst.ini file is as follows:
[FreeTDS]
Description=FreeTDS Driver for Linux & SQL Server
Driver=/usr/local/lib/libtdsodbc.so
Setup=/usr/local/lib/libtdsodbc.so
UsageCount=1
My odbc.ini file is specified as follows:
[POC]
Description = Connection to Partners for our children SQL Server
Driver = FreeTDS
Servername = POC
I am trying to make a connection to a SQL Server 2012 database (via VPN) using the following connection information in R:
con <- DBI::dbConnect(odbc::odbc(),
                      dsn = "POC",
                      uid = Sys.getenv("MSSQL_UN"),
                      database = "CA_ODS",
                      pwd = Sys.getenv("MSSQL_PW"))
This generates the following connection object:
> con
<OdbcConnection> POC2
Database: CA_ODS
Microsoft SQL Server Version: 11.00.7001
In general, this connection works as expected. I can query the database using DBI::dbGetQuery(con, "select * from MyTable"), dplyr::tbl(con, MyTable), etc. without issue.
Problem
RStudio, however, is only displaying every other letter of the database objects, and truncating the object names after the first several letters. The following screenshot should illustrate the issue well:
The database I am trying to connect to is called CA_ODS. However, the RStudio object browser is only displaying every other letter of the database name (i.e. the DB is listed as C_D).
This does not appear to be limited to RStudio per se either. While the results of the actual database queries work fine as described above, the returned names from the INFORMATION_SCHEMA appear to match the information in the object browser. Below, when run directly from SQL Server Management Studio, the returned TABLE_CATALOG is CA_ODS, TABLE_SCHEMA is ndacan, etc. When run via the DB connection, however, I get the following.
> DBI::dbGetQuery(con, "SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE
TABLE_SCHEMA='ndacan'")
TABLE_CATALOG TABLE_SCHEMA TABLE_NAME TABLE_TYPE
1 C_D naa f21v BASE TABLE
Question
Any suggestions as to how I can respecify my ODBC connection in R or in my FreeTDS configs to get the full name of database objects returned?
As noted in @r2evans' comments, this appears to be an issue with odbc, running in R 3.6.0, on a Mac.
In general, it appears that this can be fixed by reinstalling odbc from source: install.packages("odbc", type = 'source').
As also noted in the comments, I had recently upgraded my Mac to Catalina. Prior to installing odbc from source I first needed to reinstall the Xcode command line tools with xcode-select --install from my terminal.
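In script form, the fix amounted to the following (the first step runs in the terminal, not in R):
# In the terminal, reinstall the command line tools after the Catalina upgrade:
#   xcode-select --install
# Then, back in R, rebuild odbc from source:
install.packages("odbc", type = "source")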
After doing this, I am now getting the full object names displayed from the odbc connection.

how to load an RDBMS driver for h2o in a Jupyter notebook?

I'd like to create a self-contained Jupyter notebook that uses h2o to import and model data that resides in a relational database. The docs show an example where h2o is launched with the JDBC driver in the classpath, e.g.
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
I'd prefer to start h2o from a notebook that's a standalone, reproducible artifact rather than have special steps to prep the environment prior to running the notebook. If I run the following snippet:
import h2o
h2o.init()
connection_url = "jdbc:mysql://mysql.woolford.io/mydb"
select_query = "SELECT description, price FROM mytable"
username = "myuser"
password = "b#dp#ss"
mytable_data = h2o.import_sql_select(connection_url, select_query, username, password)
... the import_sql_select method fails because the driver isn't loaded:
Server error java.lang.RuntimeException:
Error: SQLException: No suitable driver found for jdbc:mysql://mysql.woolford.io/mydb
Is there a way to load the driver when the h2o.init() call is made? Or a best practice for this?
h2o.init() takes a parameter called extra_classpath. You can use this parameter to provide the path to the JDBC driver and H2O will launch with the driver.
This option is designed exactly for the purpose of not having to start H2O outside of the notebook interface.
Example:
import h2o
h2o.init(extra_classpath=["/Users/michal/Downloads/apache-hive-2.2.0-bin/jdbc/hive-jdbc-2.2.0-standalone.jar"])
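If you are working from R instead of Python, the R package's h2o.init() accepts the same extra_classpath argument, so a roughly equivalent sketch (the MySQL Connector/J jar path is a placeholder) would be:
library(h2o)

# Launch H2O with the JDBC driver on its classpath (placeholder path):
h2o.init(extra_classpath = c("/path/to/mysql-connector-java.jar"))

connection_url <- "jdbc:mysql://mysql.woolford.io/mydb"
mytable_data <- h2o.import_sql_select(connection_url,
                                      "SELECT description, price FROM mytable",
                                      "myuser", "b#dp#ss")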

How to inspect task logs in Spark local mode

I'm using a local Spark instance through the sparklyr R package on a 64G RAM 40 core machine.
I have to process thousands of text files and parse email addresses in them. The goal is to have a data frame with columns such as user name, top level domain, domain, subdomain(s). The data frame is then saved as Parquet file. I split up the files in batches of 2.5G each and process each batch separately.
Most of the batches work fine; however, from time to time a task fails and the whole batch is "gone". This is the output from the log in such a case:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 2260.0 failed 1 times, most recent failure: Lost task 47.0 in stage 2260.0 (TID 112228, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.project_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:57)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at java.util.regex.Pattern.split(Pattern.java:1206)
at java.util.regex.Pattern.split(Pattern.java:1273)
at scala.util.matching.Regex.split(Regex.scala:526)
at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:144)
at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:141)
... 22 more
I use ft_regex_tokenizer quite often, e.g. here to split email addresses into username and domain:
spark_tbls_separated %<>%
  ft_regex_tokenizer(input_col = "email",
                     output_col = "email_split",
                     pattern = "#",
                     to_lower_case = FALSE) %>%
  sdf_separate_column(column = "email_split",
                      into = c("email_user", "email_domain")) %>%
  select(-email_split, -email)
So now I'd like to know which Spark transformation exactly causes the error, and for which kind of input data it fails, so I can actually debug the cause. I guess the only way to figure this out is to look at the task logs (do they even exist?). Ideally I could look at the log of task 47 and get much more detailed logging information. How can I access these on a local machine?
Here are my configuration options, set up so that e.g. a history server is ready to run:
spark_config <- spark_config()
spark_config$`sparklyr.shell.driver-memory` <- "64G"
spark_config$spark.memory.fraction <- 0.75
spark_config$spark.speculation <- TRUE
spark_config$spark.speculation.multiplier <- 2
spark_config$spark.speculation.quantile <- 0.5
spark_config$sparklyr.backend.timeout <- 3600 * 2 # two-hour timeout
spark_config$spark.eventLog.enabled <- TRUE
spark_config$spark.eventLog.dir <- "file:///tmp/spark-events"
spark_config$spark.history.fs.logDirectory <- "file:///tmp/spark-events"
sc <- spark_connect(master = "local", config = spark_config)
Note that this question is not about the actual error seen here, it is about the possibility of inspecting the task log to figure out at which line my sparklyr script fails.
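For reference, in local mode all tasks run inside the single driver JVM, so there are no separate per-executor log directories; the driver log, the Spark UI, and the event logs configured above are the places to look. A minimal sketch using standard sparklyr helpers (the history-server command assumes a regular Spark installation outside R):
# Tail the driver log that sparklyr captures; failed-task stack traces like
# the one above end up here in local mode:
spark_log(sc, n = 200)

# Open the Spark UI and drill into the failed stage/task for per-task details:
spark_web(sc)

# After the session ends, the event logs written to spark.eventLog.dir can be
# replayed with a history server started outside R, e.g.:
#   $SPARK_HOME/sbin/start-history-server.sh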

database protocol 'sqlite' not supported - Failed to initialize zdb connection pool()

I am using libzdb, the Database Connection Pool Library, with an SQLite database. I am getting the following exception:
Failed to start connection pool - database protocol 'sqlite' not supported
After ConnectionPool_start(), execution goes into static int _fillPool(T P), where it fails at the following statement:
Connection_T con = Connection_new(P, &P->error);
My connection URL is as follows:
sqlite:///home/ZDB_TESTING/zdb-test/testDb.db
Kindly help me with this problem.
This means that the SQLite library is not compiled into libzdb. If you are installing from a distribution, make sure you pick a libzdb build with SQLite support. If you built libzdb yourself from source, make sure the ./configure output says SQLite3: ENABLED. Otherwise you need to install SQLite on your system first.

SparkR job(R script) submit using spark-submit fails in BigInsights Hadoop cluster

I have created an IBM BigInsights service with a Hadoop cluster of 5 nodes (including Apache Spark with SparkR). I am trying to use SparkR to connect to a Cloudant DB, fetch some data, and do some processing, but submitting the SparkR job (R script) with spark-submit fails on the BigInsights Hadoop cluster.
I created the SparkR script and ran the following:
-bash-4.1$ spark-submit --master local[2] test_sparkr.R
16/08/07 17:43:40 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
Error: could not find function "sparkR.init"
Execution halted
-bash-4.1$
Content of test_sparkr.R file is:
# Creating SparkConext and connecting to Cloudant DB
sc <- sparkR.init(sparkEnv = list("cloudant.host"="<<cloudant-host-name>>", "cloudant.username"="<<cloudant-user-name>>", "cloudant.password"="<<cloudant-password>>", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "testdata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "testdata" Cloudant DB
testDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark", inferSchema='true')
How do I install the spark-cloudant connector in IBM BigInsights and resolve this issue? Help would be much appreciated.
I believe that the spark-cloudant connector isn’t for R yet.
Hopefully I can update this answer when it is!
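That said, the Error: could not find function "sparkR.init" itself usually just means the SparkR package was never attached in the script. A minimal sketch, assuming a standard Spark installation layout for lib.loc:
# Attach SparkR from the Spark installation's R library before calling
# sparkR.init() (the lib.loc path is an assumption):
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sc <- sparkR.init(appName = "cloudant-test")
sqlContext <- sparkRSQL.init(sc)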
