Modify dplyr database query - r

I'm using dplyr to execute a Redshift query via the database connection src. lag works a little bit differently in Redshift (see https://github.com/tidyverse/dplyr/issues/962), so I'm wondering if it's possible to modify the query that's generated from the dplyr chain to remove the third parameter (NULL) in LAG. Example:
res <- tbl(src, 'table_name') %>%
group_by(groupid) %>%
filter(value != lag(value)) %>%
collect()
gives
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: Default
parameter not be supported for window function lag)
I can see the translated sql:
translated <- dbplyr::translate_sql(
tbl(src, 'table_name') %>%
group_by(groupid) %>%
filter(value != lag(value)) %>%
collect()
)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1, NULL) OVER ()))
And I can modify it to remove the NULL parameter, which I think will solve the problem:
sub("(LAG\\(.*), NULL), "\\1", translated)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1) OVER ()))
How can I execute this modified query?

you should be able to useDBI::dbGetQuery(con, sub("(LAG\\(.*), NULL), "\\1", translated)) to run the new query.

Related

dplyr lazy query mutate column using str_extract

I try to query a table table_a, and I like to mutate a column substr_col based on an existing column col with stringr::str_extract while it is in a lazy query state. I encountered an error message complaining col does not exist.
object 'col' not found
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}")) %>%
dplyr::collect()
But this code works when I collect the data first and then call stringr::str_extract
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::collect() %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}"))
I like to use the substring_col as a filter condition while the query is lazy, how should I do that?
As #IceCreanToucan states, str_extract is not on dbplyr's list of translations. Hence it will not be able to execute this code on the database. (I assume you are using dbplyr as it is the main package for having dplyr commands translated into SQL).
We can test this as follows:
library(dbplyr)
library(dplyr)
library(stringr)
data(starwars)
# pick your simulated connection type (there are many options, not just what I have shown here)
remote_df = tbl_lazy(starwars, con = simulate_mssql())
remote_df = tbl_lazy(starwars, con = simulate_mysql())
remote_df = tbl_lazy(starwars, con = simulate_postgres())
remote_df %>%
mutate(substring_col = str_extract(name, "Luke")) %>%
show_query()
show_query() should return the SQL that our mutate has been translated into. But instead I receive a clear message: "Error: str_extract() is not available in this SQL variant". This makes it clear translation is not defined.
However, there is a translation defined for grep and grepl (etc.) so the following should work:
remote_df %>%
mutate(substring_col = grepl("Luke", name)) %>%
show_query()
But it will return you slightly different output.

dplyr: use a custom function in summarize() when connected to external database

Is there a way to use custom functions within a summaries statement when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive){
if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
(sum(true_positive) / sum(all_positive , na.rm = TRUE)))
}
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>% collect
This is the error: Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist
To understand the error message, you could use show_query instead of collect to see the SQL code sent to the database :
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>%
show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects getPrecision function to be available on the server, which is not the case.
A potential solution is to collect table data first, before applying this function in the R client:
database_data %>%
collect %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible, because the table is too big, you'll have to implement the function in SQL on the server :
SELECT
"product",
CASE WHEN sum(all_positive)=0 THEN 0 ELSE sum(true_positive)/sum(all_positive) END AS "precision"
FROM "database_table"
GROUP BY "product"

How to "arrange" aggregate variable in dbplyr?

The following dbplyr statement fails:
foo <- activity_viewed %>% group_by(pk) %>% summarize(total = n()) %>%
arrange(-total) %>% head(3) %>% collect()
with this error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "total" does not exist
LINE 4: ORDER BY -"total"
^
)
I can see the problem in the query: SQL doesn't allow the ORDER BY to use column aliases.
Here's the generated query:
> print(show_query(foo))
<SQL>
SELECT "pk", COUNT(*) AS "total"
FROM "activity"
GROUP BY "pk"
ORDER BY -"total"
LIMIT 3
I need ORDER BY -COUNT(*).
How do I get dbplyr to execute this query?
dbplyr can translate desc but not -
library(dplyr)
library(dbplyr)
mtcars2 <- src_memdb() %>%
copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)
mtcars2 %>% arrange(desc(cyl)) %>% show_query()
<SQL>
SELECT *
FROM `mtcars2-cc`
ORDER BY `cyl` DESC

R - dbplyr - `lang_name()` is deprecated

I just updated dblyr and since that moment I started to saw warnings
Warning messages: 1: lang_name() is deprecated as of rlang 0.2.0.
Please use call_name() instead. This warning is displayed once per
session. 2: lang() is deprecated as of rlang 0.2.0. Please use
call2() instead. This warning is displayed once per session.
I have no clue what sould I do since my code looks like this
df <- tbl(conn, in_schema("schema", "table")) %>%
filter(status!= "CLOSED" | is.na(status)) %>%
group_by(customer_id) %>%
filter(created == min(created, na.rm = T)) %>%
ungroup() %>%
select(
contract_number,
customer_id,
approved_date = created
) %>%
collect()
There is no call_name() or lang_name() in my code. Do you guys know whats wrong? I know that my code works even with this warnings, but I don't want to see it.
As you already mentioned there is nothing wrong and your code works fine as this is a warning. The window function in dbplyr still uses the lang_name() function call. The window function is called within your filter( ... == min(...)) statements. There is already an issue on Github open for this link.
If you do not want to see the warning you can suppress it like this:
suppressWarnings(df <- tbl(conn, in_schema("schema", "table")) %>%
filter(status!= "CLOSED" | is.na(status)) %>%
group_by(customer_id) %>%
filter(created == min(created, na.rm = T)) %>%
ungroup() %>%
select(
contract_number,
customer_id,
approved_date = created
) %>%
collect())

How to use character vector in filter on a database connection in R?

EDIT: I found my error in the example below. I made a typo in stored_group in filter. It works as expected.
I want to use a character value to filter a database table. I use dplyr functions directly on the connection object. See my steps below.
I connected to my MariaDB database:
con <- dbConnect(RMariaDB::MariaDB(),
dbname = mariadb.database,
user = mariadb.username,
password = mariadb.password,
host = mariadb.host,
port = mariadb.port)
Then I want to use a filter on a table in the database, by using dplyr code directly on the connection above:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
collect()
I got a error saying Unknown column 'stored_group' in 'where clause'. So I used show_query() like this:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
show_query()
And I got:
<SQL>
SELECT `id`, `group`
FROM `Table`
WHERE (`group` = `stored_group`)
In translation, stored_group is seen as a column name instead of value in R. How do I prevent this?
On normal data.frames in R this works. Like:
stored_group <- "some_group"
data %>%
select(id, group) %>%
filter(group == stored_group)
I just tested the solution below, and it works. But my database table will grow. I want to filter directly on the database before collecting.
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
collect() %>%
filter(group == stored_group)
Any suggestions?

Resources