dplyr: use a custom function in summarize() when connected to external database

Is there a way to use custom functions within a summarize() call when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive){
if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
(sum(true_positive) / sum(all_positive , na.rm = TRUE)))
}
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>% collect
This is the error: Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist

To understand the error message, use show_query() instead of collect() to see the SQL sent to the database:
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>%
show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects a getPrecision function to exist on the server, which is not the case: dbplyr passes R functions it does not recognise through to SQL unchanged.
A potential solution is to collect table data first, before applying this function in the R client:
database_data %>%
collect %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible because the table is too big, you'll have to implement the function in SQL on the server. Because both columns are integers, cast before dividing so that PostgreSQL does not perform integer division:
SELECT
"product",
CASE WHEN sum(all_positive) = 0 THEN 0 ELSE sum(true_positive)::numeric / sum(all_positive) END AS "precision"
FROM "database_table"
GROUP BY "product"
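Another option that may avoid hand-written SQL is to put the aggregate expression directly inside summarise(), since dbplyr can translate sum() and if_else() even though it cannot translate a user-defined R function. The sketch below is only an illustration: the simulated PostgreSQL connection and the three-row stand-in table are assumptions, so it renders SQL without contacting a real server.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for database_data: a lazy table on a simulated
# PostgreSQL connection, so no real database is required.
database_data <- tbl_lazy(
  data.frame(
    product = c("a", "a", "b"),
    true_positive = c(1L, 2L, 0L),
    all_positive = c(2L, 4L, 0L)
  ),
  con = simulate_postgres()
)

precision_query <- database_data %>%
  group_by(product) %>%
  summarise(
    precision = if_else(
      sum(all_positive, na.rm = TRUE) == 0,
      0,
      # Multiply by 1.0 so integer columns are not integer-divided.
      sum(true_positive, na.rm = TRUE) * 1.0 / sum(all_positive, na.rm = TRUE)
    )
  )

# Inspect the SQL that would be sent to the server.
show_query(precision_query)
```

On a real connection the same pipeline would end in collect(), and only one row per product would travel back to R.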

Related

dplyr lazy query mutate column using str_extract

I am querying a table table_a, and I would like to mutate a column substring_col based on an existing column col with stringr::str_extract while the query is still lazy. I get an error message complaining that col does not exist.
object 'col' not found
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}")) %>%
dplyr::collect()
But this code works when I collect the data first and then call stringr::str_extract:
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::collect() %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}"))
I would like to use substring_col as a filter condition while the query is still lazy. How should I do that?

As @IceCreamToucan states, str_extract is not on dbplyr's list of translations, so this code cannot be executed on the database. (I assume you are using dbplyr, as it is the main package for translating dplyr commands into SQL.)
We can test this as follows:
library(dbplyr)
library(dplyr)
library(stringr)
data(starwars)
# pick one simulated connection type (there are many options beyond those shown here)
remote_df = tbl_lazy(starwars, con = simulate_postgres())
# remote_df = tbl_lazy(starwars, con = simulate_mssql())
# remote_df = tbl_lazy(starwars, con = simulate_mysql())
remote_df %>%
mutate(substring_col = str_extract(name, "Luke")) %>%
show_query()
show_query() should return the SQL that our mutate has been translated into. Instead I receive a clear message: "Error: str_extract() is not available in this SQL variant". This confirms that no translation is defined.
However, there is a translation defined for grep and grepl (etc.) so the following should work:
remote_df %>%
mutate(substring_col = grepl("Luke", name)) %>%
show_query()
Note that the output differs slightly: grepl returns a logical match indicator rather than the extracted substring.
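Since the goal in the question is to filter while the query is lazy, the match indicator may be all that is needed. A minimal sketch, assuming a simulated PostgreSQL backend and a made-up one-column table standing in for table_a (whether the regular expression is accepted depends on the real database's regex dialect):

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for table_a with a single character column named col.
table_a <- tbl_lazy(
  data.frame(col = c("A-123 widget", "no code here"), stringsAsFactors = FALSE),
  con = simulate_postgres()
)

# grepl() has a translation, so the filter stays on the database side.
filtered <- table_a %>%
  filter(grepl("^[A-Z]-\\d{3}", col))

show_query(filtered)
```

The rows that pass the filter can then be collected, and str_extract applied locally to the much smaller result.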

Can I run a BigQuery SQL query and then continue wrangling the data using dbplyr?

In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
S3OutputLocation = "location",
AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
AWSProfile = "profile", Schema = "prod")
tbl(con,
# Run SQL query
sql('SELECT *
FROM TABLE')) %>%
# Without having collected the data, I could further wrangle the data inside the database
# using dplyr code
select(var1, var2) %>%
mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Is it simply not possible to do this with BigQuery?

Not a BigQuery user, so can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead it appears BigQuery does not support receiving a tbl with an sql string as the second argument.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(). The problem could be caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string)

Can dplyr function work connected with SQL server?

I have a table in a SQL Server database, and I want to manipulate it with dbplyr/dplyr in R.
library(odbc)
library(DBI)
library(tidyverse)
con <- DBI::dbConnect(odbc::odbc(),
Driver = "SQL Server",
Server = "xx.xxx.xxx.xxx",
Database = "stock",
UID = "userid",
PWD = "userpassword")
startday = 20150101
day = tbl(con, in_schema("dbo", "LogDay"))
I tried this simple dplyr pipeline after connecting to the remote database, but it failed with error messages.
day %>%
mutate(ovnprofit = ifelse(stockCode == lead(stockCode,1),lead(priceOpen,1)/priceClose, NA)) %>%
select(logDate,stockCode, ovnprofit)
How can I solve this problem?
P.S. When I convert 'day' into a tibble first and then apply the dplyr functions, it works. However, I want to apply them directly, without converting to a tibble, because that is too time consuming and memory intensive.

The problem is most likely with the lead function. In R a data set has an ordering, but in SQL datasets are unordered and the order needs to be specified explicitly.
Note that the SQL code in the error message contains:
LEAD("stockCode", 1.0, NULL) OVER ()
The empty brackets after OVER suggest that SQL expects an ordering to be specified there.
Two ways you can resolve this:
By using arrange before the mutate
By specifying the order_by argument of lead
# approach 1:
day %>%
arrange(logDate) %>%
mutate(ovnprofit = ifelse(stockCode == lead(stockCode,1),
lead(priceOpen,1)/priceClose,
NA)
) %>%
select(logDate,stockCode, ovnprofit)
# approach 2:
day %>%
mutate(ovnprofit = ifelse(stockCode == lead(stockCode,1, order_by = logDate),
lead(priceOpen,1, order_by = logDate)/priceClose,
NA)
) %>%
select(logDate,stockCode, ovnprofit)
However, it also looks like you are only wanting to lead within each stockCode. This can be done by group_by. I would recommend the following:
output = day %>%
group_by(stockCode) %>%
arrange(logDate) %>%
mutate(next_priceOpen = lead(priceOpen, 1)) %>%
mutate(ovnprofit = next_priceOpen / priceClose) %>%
select(logDate,stockCode, ovnprofit)
If you view the generated SQL with show_query(output) you should see the SQL OVER clause similar to the following:
LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)

How to use character vector in filter on a database connection in R?

EDIT: I found my error in the example below. I made a typo in stored_group in filter. It works as expected.
I want to use a character value to filter a database table. I use dplyr functions directly on the connection object. See my steps below.
I connected to my MariaDB database:
con <- dbConnect(RMariaDB::MariaDB(),
dbname = mariadb.database,
user = mariadb.username,
password = mariadb.password,
host = mariadb.host,
port = mariadb.port)
Then I want to use a filter on a table in the database, by using dplyr code directly on the connection above:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
collect()
I got an error saying Unknown column 'stored_group' in 'where clause'. So I used show_query() like this:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
show_query()
And I got:
<SQL>
SELECT `id`, `group`
FROM `Table`
WHERE (`group` = `stored_group`)
In translation, stored_group is seen as a column name instead of value in R. How do I prevent this?
On normal data.frames in R this works. Like:
stored_group <- "some_group"
data %>%
select(id, group) %>%
filter(group == stored_group)
I just tested the solution below, and it works. But my database table will grow. I want to filter directly on the database before collecting.
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
collect() %>%
filter(group == stored_group)
Any suggestions?
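One way to make the intent explicit is to unquote the local variable with !!, which inserts its value into the expression before dbplyr translates it. This is a sketch rather than a tested fix for the setup above: the two-row table and the simulated MySQL/MariaDB connection are assumptions for illustration.

```r
library(dplyr)
library(dbplyr)

stored_group <- "some_group"

# Hypothetical stand-in for the remote "Table" on a simulated MySQL backend.
remote_table <- tbl_lazy(
  data.frame(
    id = 1:2,
    group = c("some_group", "other_group"),
    stringsAsFactors = FALSE
  ),
  con = simulate_mysql()
)

query <- remote_table %>%
  select(id, group) %>%
  # !! evaluates stored_group in R, so its value is sent as a string literal.
  filter(group == !!stored_group)

show_query(query)
```

The WHERE clause should then compare the column against the quoted value 'some_group' instead of an identifier, so the filtering happens on the database before collect().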

Modify dplyr database query

I'm using dplyr to execute a Redshift query via the database connection src. lag works a little bit differently in Redshift (see https://github.com/tidyverse/dplyr/issues/962), so I'm wondering if it's possible to modify the query that's generated from the dplyr chain to remove the third parameter (NULL) in LAG. Example:
res <- tbl(src, 'table_name') %>%
group_by(groupid) %>%
filter(value != lag(value)) %>%
collect()
gives
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: Default
parameter not be supported for window function lag)
I can see the translated sql:
translated <- dbplyr::translate_sql(
tbl(src, 'table_name') %>%
group_by(groupid) %>%
filter(value != lag(value)) %>%
collect()
)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1, NULL) OVER ()))
And I can modify it to remove the NULL parameter, which I think will solve the problem:
sub("(LAG\\(.*), NULL\\)", "\\1)", translated)
# <SQL> COLLECT(FILTER(GROUP_BY(TBL("src", 'table_name'), "groupid"), "value" != LAG("value", 1) OVER ()))
How can I execute this modified query?

You should be able to use DBI::dbGetQuery(con, sub("(LAG\\(.*), NULL\\)", "\\1)", translated)) to run the new query.
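For this to work, the string being patched has to be real SQL. A minimal sketch of one way to get it, assuming sql_render() from dbplyr is applied to the lazy query (without collect()) and using an invented table on a simulated PostgreSQL connection; the exact SQL text depends on the dbplyr version, so the substitution is written to be harmless if the NULL argument is absent:

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for tbl(src, 'table_name').
src_table <- tbl_lazy(
  data.frame(id = 1:3, groupid = c(1L, 1L, 2L), value = c(5, 5, 7)),
  con = simulate_postgres()
)

lazy_query <- src_table %>%
  group_by(groupid) %>%
  window_order(id) %>%           # explicit ordering for the window function
  filter(value != lag(value))    # note: no collect() here

# Render the lazy query to a plain SQL string.
sql_text <- as.character(sql_render(lazy_query))

# Drop the default-value argument that Redshift rejects, if present.
patched_sql <- gsub(", NULL) OVER", ") OVER", sql_text, fixed = TRUE)
cat(patched_sql)

# With a real connection, the patched statement could then be run with
# DBI::dbGetQuery(con, patched_sql)
```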
