I have a table in SQL server database, and I want to manipulate this table with dbplyr/dplyr in R packages.
library(odbc)
library(DBI)
library(tidyverse)
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server",
                      Server = "xx.xxx.xxx.xxx",
                      Database = "stock",
                      UID = "userid",
                      PWD = "userpassword")
startday = 20150101
day = tbl(con, in_schema("dbo", "LogDay"))
I tried this simple dplyr pipeline after connecting to the remote database, but it fails with error messages.
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1), lead(priceOpen, 1) / priceClose, NA)) %>%
  select(logDate, stockCode, ovnprofit)
How can I solve this problem?
p.s. When I apply the dplyr functions after converting 'day' into a tibble first, it works. However, I want to apply the dplyr functions directly on the remote table rather than convert it into a tibble, because that is too time-consuming and memory intensive.
The problem is most likely with the lead function. In R a data set has an inherent row order, but in SQL tables are unordered and the order needs to be specified explicitly.
Note that the SQL code in the error message contains:
LEAD("stockCode", 1.0, NULL) OVER ()
The fact that there is nothing in the brackets after OVER suggests that SQL expects something there.
Two ways you can resolve this:
By using arrange before the mutate
By specifying the order_by argument of lead
# approach 1:
day %>%
  arrange(logDate) %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1),
                            lead(priceOpen, 1) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
# approach 2: note that order_by takes an unquoted column name, not a string
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1, order_by = logDate),
                            lead(priceOpen, 1, order_by = logDate) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
However, it also looks like you only want to lead within each stockCode. This can be done with group_by. I would recommend the following:
output = day %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(next_priceOpen = lead(priceOpen, 1)) %>%
  mutate(ovnprofit = next_priceOpen / priceClose) %>%
  select(logDate, stockCode, ovnprofit)
If you view the generated SQL with show_query(output), you should see an SQL OVER clause similar to the following:
LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)
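If you want to sanity check this without a live connection, dbplyr can simulate the translation locally. A minimal sketch (the empty data frame and simulate_mssql() are stand-ins for the real table and connection; column names mirror the question):
library(dbplyr)
library(dplyr)
fake_day <- tbl_lazy(
  data.frame(logDate = integer(), stockCode = character(),
             priceOpen = numeric(), priceClose = numeric()),
  con = simulate_mssql()
)
fake_day %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(ovnprofit = lead(priceOpen, 1) / priceClose) %>%
  show_query()
# the rendered SQL should contain an OVER clause like:
# LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)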
I am trying to query a table table_a, and I'd like to mutate a column substring_col based on an existing column col with stringr::str_extract while the query is still lazy. I get an error message complaining that col does not exist:
object 'col' not found
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}")) %>%
dplyr::collect()
But this code works when I collect the data first and then call stringr::str_extract
conn <- DBI::dbConnect(...)
dplyr::tbl(conn, table_a) %>%
dplyr::collect() %>%
dplyr::mutate(substring_col = stringr::str_extract(col, "^[A-Z]-\\d{3}"))
I'd like to use substring_col as a filter condition while the query is lazy. How should I do that?
As @IceCreamToucan states, str_extract is not on dbplyr's list of translations, so it cannot be executed on the database. (I assume you are using dbplyr, as it is the main package for translating dplyr commands into SQL.)
We can test this as follows:
library(dbplyr)
library(dplyr)
library(stringr)
data(starwars)
# pick your simulated connection type (there are many options, not just what I have shown here)
remote_df = tbl_lazy(starwars, con = simulate_mssql())
remote_df = tbl_lazy(starwars, con = simulate_mysql())
remote_df = tbl_lazy(starwars, con = simulate_postgres())
remote_df %>%
  mutate(substring_col = str_extract(name, "Luke")) %>%
  show_query()
show_query() should return the SQL that our mutate has been translated into. But instead I receive a clear message: "Error: str_extract() is not available in this SQL variant". This makes it clear that no translation is defined.
However, there is a translation defined for grep and grepl (etc.) so the following should work:
remote_df %>%
  mutate(substring_col = grepl("Luke", name)) %>%
  show_query()
But note that it returns slightly different output: a logical value indicating whether the pattern matched, rather than the extracted substring.
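If you need the extracted substring itself while the query is still lazy, one workaround is to pass the database's own regex function through as literal SQL with dplyr::sql(), so it is evaluated server-side. A minimal sketch, assuming a PostgreSQL backend (where SUBSTRING(col FROM 'pattern') returns the first regex match; the pattern below is only illustrative):
library(dbplyr)
library(dplyr)
remote_df <- tbl_lazy(starwars, con = simulate_postgres())
remote_df %>%
  # literal SQL is passed through untranslated and runs on the database
  mutate(substring_col = sql("SUBSTRING(\"name\" FROM '[A-Z][a-z]+')")) %>%
  filter(!is.na(substring_col)) %>%
  show_query()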
Is there a way to use custom functions within a summarise statement when using dplyr to pull data from an external database?
I can’t make usable dummy data because this is specific to databases, but imagine you have a table with three fields: product, true_positive, and all_positive. This is the code I want to use:
getPrecision <- function(true_positive, all_positive){
  if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
          sum(true_positive) / sum(all_positive, na.rm = TRUE))
}
database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  collect
This is the error: Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: function getprecision(integer, integer) does not exist
To understand the error message, you could use show_query instead of collect to see the SQL code sent to the database:
database_data %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive)) %>%
  show_query
<SQL>
SELECT "product", getPrecision("true_positive", "all_positive") AS "precision"
FROM "database_table"
GROUP BY "product"
As you can see, this SQL expects a getPrecision function to be available on the server, which is not the case.
A potential solution is to collect the table data first, before applying this function on the R client side:
database_data %>%
  collect %>%
  group_by(product) %>%
  summarize(precision = getPrecision(true_positive, all_positive))
If this isn't possible, because the table is too big, you'll have to implement the function in SQL on the server:
SELECT
"product",
CASE WHEN sum(all_positive)=0 THEN 0 ELSE sum(true_positive)/sum(all_positive) END AS "precision"
FROM "database_table"
GROUP BY "product"
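Alternatively, because sum() and if_else() both have dbplyr translations, you may be able to write the same logic inline in summarize() and let dbplyr generate that CASE WHEN expression for you. A sketch (untested against your database, so treat it as a starting point):
database_data %>%
  group_by(product) %>%
  summarize(precision = if_else(sum(all_positive, na.rm = TRUE) == 0, 0,
                                sum(true_positive, na.rm = TRUE) /
                                  sum(all_positive, na.rm = TRUE))) %>%
  show_query()   # check the generated SQL before collecting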
I'm trying to join tables from two different datasets in the same project. How can I do this?
library(tidyverse)
library(bigrquery)
con1 <-
  dbConnect(
    drv = bigrquery::bigquery(),
    project = PROJECT,
    dataset = "dataset_1"
  )
con2 <-
  dbConnect(
    drv = bigrquery::bigquery(),
    project = PROJECT,
    dataset = "dataset_2"
  )
A <- con1 %>% tbl("A")
B <- con2 %>% tbl("B")
inner_join(A, B,
           by = "key",
           copy = TRUE) %>%
  collect()
Then I get the error: Error: BigQuery does not support temporary tables
The problem is most likely that you are using two different connections for the two tables. When you attempt this, R tries to copy data from one source into a temporary table on the other source.
See this question and the copy parameter in this documentation (it's a different package, but the principle is the same).
The solution is to only use a single connection for all your tables. Something like this:
con <-
  dbConnect(
    drv = bigrquery::bigquery(),
    project = PROJECT,
    dataset = "dataset_1"
  )
A <- con %>% tbl("A")
B <- con %>% tbl("B")
inner_join(A, B,
           by = "key") %>%
  collect()
You may need to leave the dataset parameter blank in your connection string, or use in_schema to include the dataset name along with the table when you connect to a remote table. It's hard to be sure without knowing more about the structure of your database(s).
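For example, something along these lines may work (a sketch, assuming both datasets live in the same project and your bigrquery version allows connecting without a default dataset):
library(DBI)
library(dplyr)
library(dbplyr)
con <-
  dbConnect(
    drv = bigrquery::bigquery(),
    project = PROJECT   # no default dataset; tables are qualified below
  )
A <- tbl(con, in_schema("dataset_1", "A"))
B <- tbl(con, in_schema("dataset_2", "B"))
inner_join(A, B, by = "key") %>%   # single connection, so no copy is needed
  collect()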
The below code prints:
SELECT "district_code" FROM sd_stage.table1 GROUP BY "district_code"
but I am expecting:
select distinct(district_code) from sd_stage.table1
Code:
library(DBI)
library(tidyverse)
library(dbplyr)
conn_obj <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                           host = "127.0.0.1",
                           user = "testingdb",
                           password = "admin#123")
on.exit(DBI::dbDisconnect(conn_obj))
tbl_oil_root_segment <- dplyr::tbl(conn_obj,
                                   dbplyr::in_schema('sd_stage', 'table1'))
tbl_oil_root_segment %>% distinct(oil_district) %>% show_query()
The output is correct, but the query that is generated doesn't look quite right. Is there any way to make it generate the query I expect?
tbl_oil_root_segment %>% select(oil_district) %>% distinct %>% show_query()
will create the query you expect.
However, note that in SQL select distinct a from t is the same as select a from t group by a (see this question).
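You can confirm this without a live connection using dbplyr's simulated translations. A small sketch (the one-column data frame is just a stand-in for the remote table):
library(dbplyr)
library(dplyr)
lazy_tbl <- tbl_lazy(data.frame(oil_district = character()),
                     con = simulate_postgres())
lazy_tbl %>% select(oil_district) %>% distinct() %>% show_query()
# should render something like:
# SELECT DISTINCT "oil_district"
# FROM "df"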
I'm working with a large database (via dplyrimpaladb) and dplyr. I need to filter by date, but the dates are stored as Unix timestamps. While I can convert them locally with
time_t = as.Date(as.POSIXct(time_t/1000, origin = '1970-01-01', tz = 'UTC'))
this does not work when communicating with the database; I need to translate the following SQL into dplyr.
dau <- bb %>%
tbl(sql("SELECT
device_token_s,
to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) AS dte
FROM bb.sys_app_open
WHERE
build_type_n = 1
AND to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) >= '2016-02-26'
GROUP BY
device_token_s,
to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))")) %>%
collect()
The closest I could get was:
dau.df <- bb %>%
  tbl('sys_app_open') %>%
  select(device_token_s,
         sql('to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) AS dte')) %>%
  filter(build_type_n == 1,
         sql("to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) >= '2016-02-26' ")) %>%
  # mutate(collector_date_t = sql('to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))')) %>%
  group_by(device_token_s, sql('to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))')) %>%
  collect()
But I receive an error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* sql("to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) as dte")
Post a sample data frame; I had the same problem, and if I can see the data frame I can tell you how to do this using dplyr.
If you are unable to do that quickly, I would suggest using dbGetQuery(connection, "YOUR_SQL_QUERY") to get the data.
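That fallback would look roughly like this, reusing the SQL from the question (bb_con here is a hypothetical name for the underlying DBI connection behind bb):
# run the original SQL directly and get an ordinary data frame back
dau <- DBI::dbGetQuery(bb_con, "
  SELECT
    device_token_s,
    to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) AS dte
  FROM bb.sys_app_open
  WHERE build_type_n = 1
    AND to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) >= '2016-02-26'
  GROUP BY
    device_token_s,
    to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))
")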
The error comes from the way you are using the select function. You are trying to send a 'literal' SQL instruction via select, when you should be doing this via the mutate function.
This should work for you:
dau.df <- bb %>%
  tbl('sys_app_open') %>%
  select(device_token_s, build_type_n, collector_date_t) %>%
  mutate(dte = sql("to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))")) %>%
  filter(build_type_n == 1, dte > '2016-02-26') %>%
  group_by(device_token_s, dte) %>%
  collect
I recommend you use the function dbplyr::sql_render() to view the query that dplyr is creating. For example, run
bb %>%
  tbl('sys_app_open') %>%
  select(device_token_s, build_type_n, collector_date_t) %>%
  mutate(dte = sql("to_date(from_unixtime(cast(collector_date_t/1000 as bigint)))")) %>%
  filter(build_type_n == 1, dte > '2016-02-26') %>%
  dbplyr::sql_render()
to see the query that is created:
<SQL> SELECT *
FROM (SELECT "device_token_s", "build_type_n", "collector_date_t", to_date(from_unixtime(cast(collector_date_t/1000 as bigint))) AS "dte"
FROM (SELECT "device_token_s", "build_type_n", "collector_date_t"
FROM "sys_app_open") "fgyyfaqrwp") "nmmczsfuid"
WHERE (("build_type_n" = 1) AND ("dte" > '2016-02-26'))