How to "arrange" aggregate variable in dbplyr? - r

The following dbplyr statement fails:
foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(-total) %>%
  head(3) %>%
  collect()
with this error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "total" does not exist
LINE 4: ORDER BY -"total"
^
)
I can see the problem in the query: PostgreSQL resolves a bare column alias in ORDER BY, but not an alias used inside an expression such as -"total".
Here's the generated query:
> print(show_query(foo))
<SQL>
SELECT "pk", COUNT(*) AS "total"
FROM "activity"
GROUP BY "pk"
ORDER BY -"total"
LIMIT 3
I need ORDER BY -COUNT(*).
How do I get dbplyr to execute this query?

dbplyr can translate desc() but not the unary minus:
library(dplyr)
library(dbplyr)

mtcars2 <- src_memdb() %>%
  copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)

mtcars2 %>% arrange(desc(cyl)) %>% show_query()
<SQL>
SELECT *
FROM `mtcars2-cc`
ORDER BY `cyl` DESC
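Applied to the original pipeline, the same fix looks like this (a sketch, assuming the activity_viewed table from the question):
foo <- activity_viewed %>%
  group_by(pk) %>%
  summarize(total = n()) %>%
  arrange(desc(total)) %>%  # desc() translates to ORDER BY "total" DESC, which PostgreSQL accepts
  head(3) %>%
  collect()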

Related

R: row to column error while writing to DB

I'm using the statement below to convert row names to a column:
library(tidyverse)
names(res) <- names(dt)[]
final <- imap(res, ~ .x %>%
  as.data.frame %>%
  select(!! .y := `Point Forecast`) %>%
  rownames_to_column("Month_year")) %>%
  reduce(inner_join, by = "Month_year")
When I try to write the output to a database with
dbWriteTable(mycon, value = final, Database = 'mydb', name = "Rpredict", append = TRUE)
I receive the error below:
Error in result_insert_dataframe(rs@ptr, values) : 
  nanodbc/nanodbc.cpp:1587: 42S22: [Microsoft][ODBC SQL Server Driver][SQL Server]Invalid column name 'Month_year'
How do I fix this?
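No answer is recorded here, but the 42S22 error usually means the existing Rpredict table was created without a Month_year column, so appending rows that include one fails. A minimal sketch of how one might check and fix that (assuming mycon is a live ODBC connection; note that Database is not a standard DBI::dbWriteTable() argument, and overwrite = TRUE drops the existing table):
library(DBI)

# Inspect the columns of the existing target table
dbListFields(mycon, "Rpredict")

# If Month_year is missing, recreate the table so its schema matches the data frame
dbWriteTable(mycon, name = "Rpredict", value = final, overwrite = TRUE)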

How to use character vector in filter on a database connection in R?

EDIT: I found my error in the example below. I made a typo in stored_group in filter. It works as expected.
I want to use a character value to filter a database table. I use dplyr functions directly on the connection object. See my steps below.
I connected to my MariaDB database:
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = mariadb.database,
                 user = mariadb.username,
                 password = mariadb.password,
                 host = mariadb.host,
                 port = mariadb.port)
Then I want to use a filter on a table in the database, by using dplyr code directly on the connection above:
stored_group <- "some_group"
con %>%
  tbl("Table") %>%
  select(id, group) %>%
  filter(group == stored_group) %>%
  collect()
I got an error saying Unknown column 'stored_group' in 'where clause', so I used show_query() like this:
stored_group <- "some_group"
con %>%
  tbl("Table") %>%
  select(id, group) %>%
  filter(group == stored_group) %>%
  show_query()
And I got:
<SQL>
SELECT `id`, `group`
FROM `Table`
WHERE (`group` = `stored_group`)
In the translation, stored_group is treated as a column name rather than as a value from R. How do I prevent this?
On ordinary data frames in R this works:
stored_group <- "some_group"

data %>%
  select(id, group) %>%
  filter(group == stored_group)
I just tested the solution below, and it works. But my database table will grow. I want to filter directly on the database before collecting.
stored_group <- "some_group"

con %>%
  tbl("Table") %>%
  select(id, group) %>%
  collect() %>%
  filter(group == stored_group)
Any suggestions?
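As an aside, the EDIT above notes the original filter works once the typo is fixed; to make it explicit that stored_group is a local value rather than a column, you can unquote it with !! (a sketch against the same Table):
stored_group <- "some_group"

con %>%
  tbl("Table") %>%
  select(id, group) %>%
  filter(group == !!stored_group) %>%  # !! inlines the local value into the generated SQL
  collect()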

How to escape Athena database.table using pool package?

I'm trying to connect to Amazon Athena via JDBC and pool:
What has worked so far:
library(RJDBC)
library(DBI)
library(pool)
library(dplyr)
library(dbplyr)
drv <- RJDBC::JDBC('com.amazonaws.athena.jdbc.AthenaDriver',
                   '/opt/jdbc/AthenaJDBC41-1.1.0.jar')

pool_instance <- dbPool(
  drv = drv,
  url = "jdbc:awsathena://athena.us-west-2.amazonaws.com:443/",
  user = "me",
  s3_staging_dir = "s3://somedir",
  password = "pwd"
)
mydata <- DBI::dbGetQuery(pool_instance, "SELECT *
                                          FROM myDB.myTable
                                          LIMIT 10")
mydata
This works fine; the correct data is returned.
That does not work:
pool_instance %>% tbl("myDB.myTable") %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz2"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
The problem here is that Athena expects the following syntax as SQL:
Either:
SELECT *
FROM "myDB"."myTable"
Or:
SELECT *
FROM myDB.myTable
So basically, by passing the string "myDB.myTable":
pool_instance %>% tbl("myDB.myTable") %>% head(10)
The following syntax is being used:
SELECT *
FROM "myDB.myTable"
which results in the following error, since no such table exists:
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz6"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
What I have tried:
I have therefore tried to pass either "myDB"."myTable" or myDB.myTable to tbl(), unsuccessfully.
I have tried to use capture.output(cat('\"myDB\".\"myTable\"')):
pool_instance %>% tbl(capture.output(cat('\"myDB\".\"myTable\"'))) %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM """myDB"".""myTable""" AS "zzz4"
# WHERE (0 = 1) ( Table ""myDB"".""myTable"" not found. Please check your query.)
pool_instance %>% tbl(noquote('"myDB"."myTable"')) %>% head(10)
# Error in UseMethod("as.sql") :
# no applicable method for 'as.sql' applied to an object of class "noquote"
You can use dbplyr::in_schema:
pool_instance %>% tbl(in_schema("myDB", "myTable")) %>% head(10)
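With in_schema() the two identifiers are quoted separately, which matches the syntax Athena accepts (output sketched; exact quoting depends on the driver's SQL translation):
pool_instance %>%
  tbl(in_schema("myDB", "myTable")) %>%
  show_query()
# <SQL>
# SELECT *
# FROM "myDB"."myTable"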

Modify dplyr database query

I'm using dplyr to execute a Redshift query via the database connection src. lag works a little bit differently in Redshift (see https://github.com/tidyverse/dplyr/issues/962), so I'm wondering if it's possible to modify the query that's generated from the dplyr chain to remove the third parameter (NULL) in LAG. Example:
res <- tbl(src, 'table_name') %>%
  group_by(groupid) %>%
  filter(value != lag(value)) %>%
  collect()
gives
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: Default
parameter not be supported for window function lag)
I can see the translated SQL:
translated <- dbplyr::translate_sql(
  tbl(src, 'table_name') %>%
    group_by(groupid) %>%
    filter(value != lag(value)) %>%
    collect()
)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1, NULL) OVER ()))
And I can modify it to remove the NULL parameter, which I think will solve the problem:
sub("(LAG\\(.*), NULL\\)", "\\1)", translated)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1) OVER ()))
How can I execute this modified query?
You should be able to use DBI::dbGetQuery(con, sub("(LAG\\(.*), NULL\\)", "\\1)", translated)) to run the new query.
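A fuller sketch of that approach, assuming src$con is the underlying DBI connection and using dbplyr::sql_render() to obtain the SQL that would actually be sent (translate_sql() above only translates the R expression literally):
library(dplyr)
library(dbplyr)

q <- tbl(src, 'table_name') %>%
  group_by(groupid) %>%
  filter(value != lag(value))

rendered <- sql_render(q)                              # the SQL dbplyr would send
fixed <- sub("(LAG\\(.*), NULL\\)", "\\1)", rendered)  # drop LAG's default-value argument
res <- DBI::dbGetQuery(src$con, fixed)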

Distinct in R while connecting to PostgreSQL using DBI Package

The code below prints:
SELECT "district_code" FROM sd_stage.table1 GROUP BY "district_code"
but I am expecting:
select distinct(district_code) from sd_stage.table1
Code:
library(DBI)
library(tidyverse)
library(dbplyr)
conn_obj <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                           host = "127.0.0.1",
                           user = "testingdb",
                           password = "admin#123")
on.exit(DBI::dbDisconnect(conn_obj))

tbl_oil_root_segment <- dplyr::tbl(conn_obj,
                                   dbplyr::in_schema('sd_stage', 'table1'))
tbl_oil_root_segment %>% distinct(district_code) %>% show_query()
The output is correct, but the generated query is not quite what I expect. Is there any way to get the query I want?
tbl_oil_root_segment %>% select(district_code) %>% distinct() %>% show_query()
will create the query you expect.
However, note that in SQL, select distinct a from t is equivalent to select a from t group by a.
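For completeness, the rewritten pipeline renders roughly as follows (output sketched; formatting varies by dbplyr version):
tbl_oil_root_segment %>%
  select(district_code) %>%
  distinct() %>%
  show_query()
# <SQL>
# SELECT DISTINCT "district_code"
# FROM sd_stage.table1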
