How to escape Athena database.table using pool package?

I'm trying to connect to Amazon Athena via JDBC and pool:
What has worked so far:
library(RJDBC)
library(DBI)
library(pool)
library(dplyr)
library(dbplyr)

drv <- RJDBC::JDBC('com.amazonaws.athena.jdbc.AthenaDriver', '/opt/jdbc/AthenaJDBC41-1.1.0.jar')

pool_instance <- dbPool(
  drv = drv,
  url = "jdbc:awsathena://athena.us-west-2.amazonaws.com:443/",
  user = "me",
  s3_staging_dir = "s3://somedir",
  password = "pwd"
)

mydata <- DBI::dbGetQuery(pool_instance, "SELECT *
                                          FROM myDB.myTable
                                          LIMIT 10")
mydata
---> Works fine. Correct data is being returned.
That does not work:
pool_instance %>% tbl("myDB.myTable") %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz2"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
The problem here is that Athena expects the following syntax as SQL:
Either:
SELECT *
FROM "myDB"."myTable"
Or:
SELECT *
FROM myDB.myTable
So basically, by passing the string "myDB.myTable":
pool_instance %>% tbl("myDB.myTable") %>% head(10)
the whole string is quoted as a single identifier, producing:
SELECT *
FROM "myDB.myTable"
which results in the following error, since no such table exists:
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz6"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
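As an aside (not from the original post): tbl() treats a plain string as one table identifier. DBI's ANSI dialect illustrates the same quoting behavior, where a plain string is escaped as a whole while a pre-escaped SQL() value passes through untouched:

library(DBI)
dbQuoteIdentifier(ANSI(), "myDB.myTable")           # <SQL> "myDB.myTable" -- one identifier
dbQuoteIdentifier(ANSI(), SQL('"myDB"."myTable"'))  # already-escaped SQL passes through as-is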
What I have tried:
I have therefore tried, unsuccessfully, to pass either "myDB"."myTable" or myDB.myTable to tbl():
I have tried to use capture.output(cat('\"myDB\".\"myTable\"')):
pool_instance %>% tbl(capture.output(cat('\"myDB\".\"myTable\"'))) %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM """myDB"".""myTable""" AS "zzz4"
# WHERE (0 = 1) ( Table ""myDB"".""myTable"" not found. Please check your query.)
pool_instance %>% tbl(noquote('"myDB"."myTable"')) %>% head(10)
# Error in UseMethod("as.sql") :
# no applicable method for 'as.sql' applied to an object of class "noquote"

You can use dbplyr::in_schema:
pool_instance %>% tbl(in_schema("myDB", "myTable")) %>% head(10)
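As a quick check (a sketch, not from the original answer), show_query() confirms the rendering; depending on the dbplyr version this comes out as "myDB"."myTable" or as unquoted myDB.myTable, both of which Athena accepts:

pool_instance %>%
  tbl(in_schema("myDB", "myTable")) %>%
  show_query()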

Related

dplyr: use a custom function in summarize() when connected to external database

Is there a way to use custom functions within a summarize() statement when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive) {
  if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
          sum(true_positive) / sum(all_positive, na.rm = TRUE))
}

database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  collect
This is the error:

Error in postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist
To understand the error message, you could use show_query instead of collect to see the SQL code sent to the database:
database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects getPrecision function to be available on the server, which is not the case.
A potential solution is to collect table data first, before applying this function in the R client:
database_data %>%
  collect %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible because the table is too big, you'll have to implement the function in SQL on the server:

SELECT
  "product",
  CASE WHEN sum(all_positive) = 0 THEN 0
       ELSE sum(true_positive) / sum(all_positive)
  END AS "precision"
FROM "database_table"
GROUP BY "product"
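Alternatively (a sketch, not part of the original answer), you can often avoid writing raw SQL by inlining the function body into summarize(), since dbplyr knows how to translate if_else() and sum() into the CASE WHEN / SUM() form shown above:

database_data %>%
  group_by(product) %>%
  # na.rm = TRUE added on both sums (a slight change from the original function)
  summarize(precision = if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
                                sum(true_positive, na.rm = TRUE) /
                                  sum(all_positive, na.rm = TRUE))) %>%
  collect()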

Can I run a BigQuery SQL query and then continue wrangling the data using dbplyr?

In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "path-to-driver",
                      S3OutputLocation = "location",
                      AwsRegion = "eu-west-1",
                      AuthenticationType = "IAM Profile",
                      AWSProfile = "profile",
                      Schema = "prod")

tbl(con,
    # Run SQL query
    sql('SELECT *
         FROM TABLE')) %>%
  # Without having collected the data, I could further wrangle the data
  # inside the database using dplyr code
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error:

con <- DBI::dbConnect(bigrquery::bigquery(),
                      project = "project")

tbl(con,
    sql('SELECT *
         FROM TABLE'))
Error: dataset is not a string (a length one character vector).
Any idea if what I'm trying to do is simply not possible with BigQuery?
I'm not a BigQuery user, so I can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead, it appears BigQuery does not support receiving a tbl() whose second argument is an sql string.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(). The problem could be caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string)
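One more thing worth trying (an untested sketch; bigrquery's dbConnect() does accept a dataset argument, and the error complains about a missing dataset): supply the dataset when connecting and reference the table by name, then keep wrangling with dplyr verbs:

con <- DBI::dbConnect(bigrquery::bigquery(),
                      project = "project",
                      dataset = "my_dataset")  # hypothetical dataset name

tbl(con, "TABLE") %>%
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)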

How to "arrange" aggregate variable in dbplyr?

The following dbplyr statement fails:
foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(-total) %>%
  head(3) %>%
  collect()
with this error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "total" does not exist
LINE 4: ORDER BY -"total"
^
)
I can see the problem in the query: PostgreSQL lets ORDER BY refer to a column alias on its own, but not inside an expression like -"total".
Here's the generated query:
> print(show_query(foo))
<SQL>
SELECT "pk", COUNT(*) AS "total"
FROM "activity"
GROUP BY "pk"
ORDER BY -"total"
LIMIT 3
I need ORDER BY -COUNT(*).
How do I get dbplyr to execute this query?
dbplyr can translate desc() but not the unary minus:
library(dplyr)
library(dbplyr)

mtcars2 <- src_memdb() %>%
  copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)
mtcars2 %>% arrange(desc(cyl)) %>% show_query()
<SQL>
SELECT *
FROM `mtcars2-cc`
ORDER BY `cyl` DESC
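Applied to the original query, that suggests (a sketch, assuming the same activity_viewed table):

foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(desc(total)) %>%   # desc() renders as ORDER BY "total" DESC
  head(3) %>%
  collect()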

Hive ODBC connection with dbplyr Invalid table alias or column reference

I'm connected to Hive using dbplyr and odbc.
A table I would like to connect to is called "pros_year_month":
library(odbc)
library(tidyverse)
library(dbplyr)

con <- dbConnect(odbc::odbc(), "HiveProd")

prosym <- tbl(con, in_schema("my_schema_name", "pros_year_month"))
Table pros_year_month has several fields, two of which are "country" and "year_month".
This appears to work without any problem:
pros_nov <- prosym %>% filter(country == "United States") %>% collect()
However this does not:
pros_nov <- prosym %>%
  filter(year_month == ymd(as.character(paste0(year_month, "01")))) %>%
  collect()
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or
semantic analysis error thrown in server while executing query. Error
message from server: Error while compiling statement: FAILED:
SemanticException [Error 10004]: Line 1:7 Invalid table alias or
column reference 'zzz1.year_month': (possible column names are:
year_month, country, ...
It looks like the field name year_month has somehow become zzz1.year_month? Not sure what this is or how to get around it.
How can I apply a filter for country then year_month before calling collect on a dbplyr object?

Distinct in R while connecting to PostgreSQL using DBI Package

The below code prints:
SELECT "district_code" FROM sd_stage.table1 GROUP BY "district_code"
but I am expecting:
select distinct(district_code) from sd_stage.table1
Code:
library(DBI)
library(tidyverse)
library(dbplyr)
conn_obj <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                           host = "127.0.0.1",
                           user = "testingdb",
                           password = "admin#123")
on.exit(DBI::dbDisconnect(conn_obj))

tbl_oil_root_segment <- dplyr::tbl(conn_obj,
                                   dbplyr::in_schema('sd_stage', 'table1'))
tbl_oil_root_segment %>% distinct(oil_district) %>% show_query()
The output is correct, but the generated query doesn't seem quite right. Is there any way I can control the query that is generated?
tbl_oil_root_segment %>% select(oil_district) %>% distinct %>% show_query()
will create the query you expect.
However, note that in SQL select distinct a from t is equivalent to select a from t group by a, so the query dbplyr generates returns the same result.
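For reference (a sketch, not verified against this database), the reworked pipeline should render roughly as:

tbl_oil_root_segment %>% select(oil_district) %>% distinct() %>% show_query()
<SQL>
SELECT DISTINCT "oil_district"
FROM sd_stage.table1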
