dplyr lazy query mutate column using str_extract - r

I'm trying to query a table table_a, and I'd like to mutate a column substring_col based on an existing column col with stringr::str_extract while the query is still lazy. I get an error message complaining that col does not exist:
object 'col' not found
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}")) %>%
dplyr::collect()
But the code works when I collect the data first and then call stringr::str_extract:
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::collect() %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}"))
I'd like to use substring_col as a filter condition while the query is still lazy. How should I do that?

As @IceCreamToucan states, str_extract is not on dbplyr's list of translations, so dbplyr will not be able to execute this code on the database. (I assume you are using dbplyr, as it is the main package for translating dplyr commands into SQL.)
We can test this as follows:
library(dbplyr)
library(dplyr)
library(stringr)
data(starwars)
# pick your simulated connection type (there are many options, not just what I have shown here)
remote_df = tbl_lazy(starwars, con = simulate_mssql())
remote_df = tbl_lazy(starwars, con = simulate_mysql())
remote_df = tbl_lazy(starwars, con = simulate_postgres())
remote_df %>%
mutate(substring_col = str_extract(name, "Luke")) %>%
show_query()
show_query() should return the SQL that our mutate has been translated into. Instead we receive a clear message: "Error: str_extract() is not available in this SQL variant". This makes it clear that no translation is defined.
However, there is a translation defined for grepl (and related functions such as grep), so the following should work:
remote_df %>%
mutate(substring_col = grepl("Luke", name)) %>%
show_query()
But note that it returns slightly different output: a logical flag indicating whether the pattern matched, rather than the extracted substring.
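Since the original goal was to filter on the pattern while the query is still lazy, a minimal sketch (assuming your backend has a grepl translation, as tested above) is to move the pattern match into filter():

remote_df %>%
  # keep only rows whose name matches the pattern; grepl is translated to the backend's regex/LIKE syntax
  filter(grepl("^[A-Z]-\\d{3}", name)) %>%
  show_query()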

Related

dplyr: use a custom function in summarize() when connected to external database

Is there a way to use custom functions within a summarize statement when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive){
  if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
          (sum(true_positive) / sum(all_positive, na.rm = TRUE)))
}
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>% collect
This is the error: Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist
To understand the error message, you can use show_query instead of collect to see the SQL code sent to the database:
database_data %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive)) %>%
show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects a getPrecision function to be available on the server, which is not the case.
A potential solution is to collect table data first, before applying this function in the R client:
database_data %>%
collect %>%
group_by(product) %>%
summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible because the table is too big, you'll have to implement the function in SQL on the server:
SELECT
"product",
CASE WHEN sum(all_positive)=0 THEN 0 ELSE sum(true_positive)/sum(all_positive) END AS "precision"
FROM "database_table"
GROUP BY "product"
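As a middle ground, since if_else() and sum() both have dbplyr translations, you can inline the calculation directly in summarize() instead of wrapping it in an R function; a sketch of that approach (not tested against your database):

database_data %>%
  group_by(product) %>%
  # write the expression inline so dbplyr can translate if_else() and sum() into CASE WHEN / SUM
  summarize(precision = if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
                                sum(true_positive) / sum(all_positive, na.rm = TRUE))) %>%
  show_query()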

Can dplyr function work connected with SQL server?

I have a table in a SQL Server database, and I want to manipulate it with the dplyr/dbplyr packages in R.
library(odbc)
library(DBI)
library(tidyverse)
con <- DBI::dbConnect(odbc::odbc(),
Driver = "SQL Server",
Server = "xx.xxx.xxx.xxx",
Database = "stock",
UID = "userid",
PWD = "userpassword")
startday = 20150101
day = tbl(con, in_schema("dbo", "LogDay"))
I tried this simple dplyr pipeline after connecting to the remote database, but it fails with error messages.
day %>%
mutate(ovnprofit = ifelse(stockCode == lead(stockCode,1),lead(priceOpen,1)/priceClose, NA)) %>%
select(logDate,stockCode, ovnprofit)
How can I solve this problem?
P.S. When I apply the dplyr functions after converting 'day' into a tibble first, it works. However, I want to apply the dplyr functions directly, without converting to a tibble, because that is too time consuming and memory intensive.
The problem is most likely with the lead function. In R a data set has an ordering, but in SQL datasets are unordered and the order needs to be specified explicitly.
Note that the SQL code in the error message contains:
LEAD("stockCode", 1.0, NULL) OVER ()
The fact that there is nothing in the brackets after OVER suggests that SQL expects an ordering to be specified here.
Two ways you can resolve this:
By using arrange before the mutate
By specifying the order_by argument of lead
# approach 1:
day %>%
  arrange(logDate) %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1),
                            lead(priceOpen, 1) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
# approach 2:
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1, order_by = logDate),
                            lead(priceOpen, 1, order_by = logDate) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
However, it also looks like you only want to lead within each stockCode. This can be done with group_by. I would recommend the following:
output = day %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(next_priceOpen = lead(priceOpen, 1)) %>%
  mutate(ovnprofit = next_priceOpen / priceClose) %>%
  select(logDate, stockCode, ovnprofit)
If you view the generated SQL with show_query(output), you should see an OVER clause similar to the following:
LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)
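As an aside, dbplyr also provides window_order() to declare the ordering used by window functions explicitly, instead of relying on arrange(); a minimal sketch of the same calculation (untested against your server):

day %>%
  group_by(stockCode) %>%
  dbplyr::window_order(logDate) %>%   # ordering applied to window functions such as lead()
  mutate(ovnprofit = lead(priceOpen, 1) / priceClose) %>%
  select(logDate, stockCode, ovnprofit)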

R: dbplyr using eval()

I have a question on how to use eval(parse(text=...)) in dbplyr SQL translation.
The following code does exactly what I want with dplyr, using eval(parse(text = eval_text)):
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
implode <- function(..., sep='|') {
paste(..., collapse=sep)
}
eval_text <- implode(text)
mtcars %>% dplyr::filter(eval(parse(text=eval_text)))
But when I run it against the database it returns an error message. I am looking for any solution that allows me to dynamically set the column names and filter with the OR operator.
db <- tbl(con, "mtcars") %>%
dplyr::filter(eval(parse(text = eval_text)))
db <- collect(db)
Thanks!
Right approach, but dbplyr tends to work better with something that can receive the !! ('bang-bang') operator. At one point dplyr had *_ versions of functions (e.g. filter_) that accepted text inputs; this is now done using NSE (non-standard evaluation).
A couple of references: shiptech and r-bloggers (sorry couldn't find the official dplyr reference).
For your purposes you should find the following works:
library(rlang)
df %>% dplyr::filter(!!parse_expr(eval_text))
Full working example:
library(dplyr)
library(dbplyr)
library(rlang)
data(mtcars)
df = tbl_lazy(mtcars, con = simulate_mssql()) # simulated database connection
implode <- function(..., sep='|') { paste(..., collapse=sep) }
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
eval_text <- implode(text)
df %>% dplyr::filter(eval(parse(text = eval_text))) # returns clearly wrong SQL
df %>% dplyr::filter(!!parse_expr(eval_text)) # returns valid & correct SQL
df %>% dplyr::filter(!!!parse_exprs(text)) # passes filters as a list --> AND (instead of OR)
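To check what will actually be sent to the database, pipe the filtered table into show_query(); for example:

df %>%
  dplyr::filter(!!parse_expr(eval_text)) %>%
  show_query()   # the WHERE clause should combine the two conditions with OR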

How to use character vector in filter on a database connection in R?

EDIT: I found my error in the example below. I made a typo in stored_group inside the filter. It now works as expected.
I want to use a character value to filter a database table. I use dplyr functions directly on the connection object. See my steps below.
I connected to my MariaDB database:
con <- dbConnect(RMariaDB::MariaDB(),
dbname = mariadb.database,
user = mariadb.username,
password = mariadb.password,
host = mariadb.host,
port = mariadb.port)
Then I want to use a filter on a table in the database, by using dplyr code directly on the connection above:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
collect()
I got an error saying Unknown column 'stored_group' in 'where clause'. So I used show_query() like this:
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
filter(group == stored_group) %>%
show_query()
And I got:
<SQL>
SELECT `id`, `group`
FROM `Table`
WHERE (`group` = `stored_group`)
In the translation, stored_group is treated as a column name instead of a value from R. How do I prevent this?
On a normal data.frame in R this works:
stored_group <- "some_group"
data %>%
select(id, group) %>%
filter(group == stored_group)
I just tested the solution below, and it works. But my database table will grow, so I want to filter directly on the database before collecting.
stored_group <- "some_group"
con %>%
tbl("Table") %>%
select(id, group) %>%
collect() %>%
filter(group == stored_group)
Any suggestions?
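Aside from fixing the typo, you can also make the intent explicit with the !! (bang-bang) operator mentioned above, so that stored_group is always inlined as a value rather than read as a column; a minimal sketch:

stored_group <- "some_group"
con %>%
  tbl("Table") %>%
  select(id, group) %>%
  filter(group == !!stored_group) %>%   # !! forces local evaluation of stored_group before translation
  show_query()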

Distinct in R while connecting to PostgreSQL using DBI Package

The below code prints:
SELECT "district_code" FROM sd_stage.table1 GROUP BY "district_code"
but I am expecting:
select distinct(district_code) from sd_stage.table1
Code:
library(DBI)
library(tidyverse)
library(dbplyr)
conn_obj <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
host = "127.0.0.1",
user = "testingdb",
password = "admin#123")
on.exit(DBI::dbDisconnect(conn_obj))
tbl_oil_root_segment <- dplyr::tbl(conn_obj,
dbplyr::in_schema('sd_stage','table1'))
tbl_oil_root_segment %>% distinct(oil_district) %>% show_query()
The output is correct, but the generated query does not seem quite right. Is there any way I can get the query I expect?
tbl_oil_root_segment %>% select(oil_district) %>% distinct %>% show_query()
will create the query you expect.
However, note that in SQL select distinct a from t is the same as select a from t group by a (see this question).
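If you specifically need the DISTINCT form, you can also send hand-written SQL through DBI; a minimal sketch using the connection from the question:

DBI::dbGetQuery(conn_obj,
                'SELECT DISTINCT "district_code" FROM sd_stage.table1')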
