How to escape Athena database.table using pool package?

I'm trying to connect to Amazon Athena via JDBC and pool:
What has worked so far:
library(RJDBC)
library(DBI)
library(pool)
library(dplyr)
library(dbplyr)

drv <- RJDBC::JDBC('com.amazonaws.athena.jdbc.AthenaDriver', '/opt/jdbc/AthenaJDBC41-1.1.0.jar')

pool_instance <- dbPool(
  drv = drv,
  url = "jdbc:awsathena://athena.us-west-2.amazonaws.com:443/",
  user = "me",
  s3_staging_dir = "s3://somedir",
  password = "pwd"
)

mydata <- DBI::dbGetQuery(pool_instance, "SELECT *
                                          FROM myDB.myTable
                                          LIMIT 10")
mydata
---> Works fine. Correct data is being returned.
That does not work:
pool_instance %>% tbl("myDB.myTable") %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz2"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
The problem here is that Athena expects the following syntax as SQL:
Either:
SELECT *
FROM "myDB"."myTable"
Or:
SELECT *
FROM myDB.myTable
So basically, by passing the string "myDB.myTable":
pool_instance %>% tbl("myDB.myTable") %>% head(10)
the whole string is quoted as a single identifier, producing:
SELECT *
FROM "myDB.myTable"
which results in the following error, since no such table exists:
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz6"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
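As an aside (not from the original post): tbl() treats a plain string as one table identifier. DBI's ANSI dialect illustrates the same quoting behavior, where a plain string is escaped as a whole while a pre-escaped SQL() value passes through untouched:

library(DBI)
dbQuoteIdentifier(ANSI(), "myDB.myTable")           # <SQL> "myDB.myTable" -- one identifier
dbQuoteIdentifier(ANSI(), SQL('"myDB"."myTable"'))  # already-escaped SQL passes through as-is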
What I have tried:
I have therefore tried, unsuccessfully, to pass either "myDB"."myTable" or myDB.myTable to tbl():
I have tried to use capture.output(cat('\"myDB\".\"myTable\"')):
pool_instance %>% tbl(capture.output(cat('\"myDB\".\"myTable\"'))) %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM """myDB"".""myTable""" AS "zzz4"
# WHERE (0 = 1) ( Table ""myDB"".""myTable"" not found. Please check your query.)
pool_instance %>% tbl(noquote('"myDB"."myTable"')) %>% head(10)
# Error in UseMethod("as.sql") :
# no applicable method for 'as.sql' applied to an object of class "noquote"

You can use dbplyr::in_schema:
pool_instance %>% tbl(in_schema("myDB", "myTable")) %>% head(10)
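As a quick check (a sketch, not from the original answer), show_query() confirms the rendering; depending on the dbplyr version this comes out as "myDB"."myTable" or as unquoted myDB.myTable, both of which Athena accepts:

pool_instance %>%
  tbl(in_schema("myDB", "myTable")) %>%
  show_query()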

Related

dplyr: use a custom function in summarize() when connected to external database

Is there a way to use custom functions within a summarize() statement when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive) {
  if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
          sum(true_positive) / sum(all_positive, na.rm = TRUE))
}

database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  collect
This is the error:

Error in postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist
To understand the error message, you could use show_query instead of collect to see the SQL code sent to the database:
database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects getPrecision function to be available on the server, which is not the case.
A potential solution is to collect table data first, before applying this function in the R client:
database_data %>%
  collect %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible because the table is too big, you'll have to implement the function in SQL on the server:

SELECT
  "product",
  CASE WHEN sum(all_positive) = 0 THEN 0
       ELSE sum(true_positive) / sum(all_positive)
  END AS "precision"
FROM "database_table"
GROUP BY "product"
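Alternatively (a sketch, not part of the original answer), you can often avoid writing raw SQL by inlining the function body into summarize(), since dbplyr knows how to translate if_else() and sum() into the CASE WHEN / SUM() form shown above:

database_data %>%
  group_by(product) %>%
  # na.rm = TRUE added on both sums (a slight change from the original function)
  summarize(precision = if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
                                sum(true_positive, na.rm = TRUE) /
                                  sum(all_positive, na.rm = TRUE))) %>%
  collect()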

Can I run a BigQuery SQL query and then continue wrangling the data using dbplyr?

In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "path-to-driver",
                      S3OutputLocation = "location",
                      AwsRegion = "eu-west-1",
                      AuthenticationType = "IAM Profile",
                      AWSProfile = "profile",
                      Schema = "prod")

tbl(con,
    # Run SQL query
    sql('SELECT *
         FROM TABLE')) %>%
  # Without having collected the data, I could further wrangle the data
  # inside the database using dplyr code
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error:

con <- DBI::dbConnect(bigrquery::bigquery(),
                      project = "project")

tbl(con,
    sql('SELECT *
         FROM TABLE'))
Error: dataset is not a string (a length one character vector).
Any idea if what I'm trying to do is simply not possible with BigQuery?
I'm not a BigQuery user, so I can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead, it appears BigQuery does not support receiving a tbl() whose second argument is an sql string.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(). The problem could be caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string)
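One more thing worth trying (an untested sketch; bigrquery's dbConnect() does accept a dataset argument, and the error complains about a missing dataset): supply the dataset when connecting and reference the table by name, then keep wrangling with dplyr verbs:

con <- DBI::dbConnect(bigrquery::bigquery(),
                      project = "project",
                      dataset = "my_dataset")  # hypothetical dataset name

tbl(con, "TABLE") %>%
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)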

How to "arrange" aggregate variable in dbplyr?

The following dbplyr statement fails:
foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(-total) %>%
  head(3) %>%
  collect()
with this error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "total" does not exist
LINE 4: ORDER BY -"total"
^
)
I can see the problem in the query: PostgreSQL lets ORDER BY refer to a column alias on its own, but not inside an expression like -"total".
Here's the generated query:
> print(show_query(foo))
<SQL>
SELECT "pk", COUNT(*) AS "total"
FROM "activity"
GROUP BY "pk"
ORDER BY -"total"
LIMIT 3
I need ORDER BY -COUNT(*).
How do I get dbplyr to execute this query?
dbplyr can translate desc() but not the unary minus:
library(dplyr)
library(dbplyr)

mtcars2 <- src_memdb() %>%
  copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)
mtcars2 %>% arrange(desc(cyl)) %>% show_query()
<SQL>
SELECT *
FROM `mtcars2-cc`
ORDER BY `cyl` DESC
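Applied to the original query, that suggests (a sketch, assuming the same activity_viewed table):

foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(desc(total)) %>%   # desc() renders as ORDER BY "total" DESC
  head(3) %>%
  collect()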

Hive ODBC connection with dbplyr Invalid table alias or column reference

I'm connected to Hive using dbplyr and odbc.
A table I would like to connect to is called "pros_year_month":
library(odbc)
library(tidyverse)
library(dbplyr)

con <- dbConnect(odbc::odbc(), "HiveProd")

prosym <- tbl(con, in_schema("my_schema_name", "pros_year_month"))
Table pros_year_month has several fields, two of which are "country" and "year_month".
This appears to work without any problem:
pros_nov <- prosym %>% filter(country == "United States") %>% collect()
However this does not:
pros_nov <- prosym %>%
  filter(year_month == ymd(as.character(paste0(year_month, "01")))) %>%
  collect()
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or
semantic analysis error thrown in server while executing query. Error
message from server: Error while compiling statement: FAILED:
SemanticException [Error 10004]: Line 1:7 Invalid table alias or
column reference 'zzz1.year_month': (possible column names are:
year_month, country, ...
It looks like the field name year_month has somehow become zzz1.year_month? Not sure what this is or how to get around it.
How can I apply a filter for country then year_month before calling collect on a dbplyr object?

Distinct in R while connecting to PostgreSQL using DBI Package

The below code prints:
SELECT "district_code" FROM sd_stage.table1 GROUP BY "district_code"
but I am expecting:
select distinct(district_code) from sd_stage.table1
Code:
library(DBI)
library(tidyverse)
library(dbplyr)
conn_obj <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                           host = "127.0.0.1",
                           user = "testingdb",
                           password = "admin#123")
on.exit(DBI::dbDisconnect(conn_obj))

tbl_oil_root_segment <- dplyr::tbl(conn_obj,
                                   dbplyr::in_schema('sd_stage', 'table1'))
tbl_oil_root_segment %>% distinct(oil_district) %>% show_query()
The output is correct, but the generated query doesn't seem quite right. Is there any way I can control the query that is generated?
tbl_oil_root_segment %>% select(oil_district) %>% distinct %>% show_query()
will create the query you expect.
However, note that in SQL select distinct a from t is equivalent to select a from t group by a, so the query dbplyr generates returns the same result.
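For reference (a sketch, not verified against this database), the reworked pipeline should render roughly as:

tbl_oil_root_segment %>% select(oil_district) %>% distinct() %>% show_query()
<SQL>
SELECT DISTINCT "oil_district"
FROM sd_stage.table1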
