Dynamic queries with BigQuery from R

I want to subset my data based on a list of items I'm looping through, and add the item that is currently indexed to the query I'm sending to BigQuery from R.
For example, given a list of item codes:
Item
001
002
003
when i = 2 the current item is '002'. Instead of hard-coding '002', I want to be able to construct the following query:
sql_string <- "SELECT * FROM MAN WHERE item_code = item[i]"
But as written I get an argument type mismatch. Could someone show me how this is done with the bigrquery package?

The paste function did the job:
sql_string <- paste("SELECT * FROM MAN WHERE item_code =", item[i] )

Pasting the SQL together may work, but for security reasons it's an unsafe practice. This is described in https://db.rstudio.com/best-practices/run-queries-safely/ and in almost any resource discussing SQL injection.
This code is adapted from https://db.rstudio.com/databases/big-query/, using your query of the man table.
library(bigrquery)

con <- DBI::dbConnect(
  bigrquery::dbi_driver(),   # in newer bigrquery versions this is bigrquery::bigquery()
  dataset = "noaa_gsod",
  project = "bigquery-public-data",
  billing = project          # `project` should hold your own billing project ID
)
# `item` is the vector of item codes, defined somewhere before this snippet.
results <- vector("list", length(item))
for (i in seq_along(item)) {
  ds_man <- DBI::dbSendQuery(con, "SELECT * FROM MAN WHERE item_code = ?")
  DBI::dbBind(ds_man, list(item[i]))
  results[[i]] <- DBI::dbFetch(ds_man)
  DBI::dbClearResult(ds_man)
}
DBI::dbDisconnect(con)
I've never run against BigQuery, so post a comment if this doesn't work.
(Even if SQL injection is never an issue for your specific application, some people following this example would otherwise inherit the vulnerability.)
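A more compact variant of the same idea is to let dbGetQuery() bind the value through its params= argument (assuming the backend supports parameter binding); this is only a sketch using the same con and item vector:
results <- lapply(item, function(code) {
  # the item code is bound as a parameter, never pasted into the SQL text
  DBI::dbGetQuery(con, "SELECT * FROM MAN WHERE item_code = ?", params = list(code))
})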

Related

Which DBI function for statements like `create table <tabX> as select * from <tabY>` in R?

I am using DBI/ROracle.
drv <- dbDriver("Oracle")
conn <- dbConnect(drv, ...)
I need to create a table from a select query in another table (i.e. a statement like create table <tabX> as select * from <tabY>).
There seem to be several functions that can perform this task, e.g.:
dbSendQuery(conn, "create table tab1 as select * from bigtable")
# Statement: create table tab1 as select * from bigtable
# Rows affected: 28196
# Row count: 0
# Select statement: FALSE
# Statement completed: TRUE
# OCI prefetch: FALSE
# Bulk read: 1000
# Bulk write: 1000
Or:
dbExecute(conn, "create table tab2 as select * from bigtable")
# [1] 28196
Or even:
tab3 <- dbGetQuery(conn, "select * from bigtable")
dbWriteTable(conn = conn, "TAB3", tab3)
# [1] TRUE
Each method seems to work, but I guess there are differences in performance and best practice. What is the best/most efficient way to run statements like create table <tabX> as select * from <tabY>?
I did not find any hint in the DBI and ROracle help pages.
Up front: use dbExecute for this; don't use dbSendQuery, since that function suggests the expectation of returned data (though it still works).
dbSendQuery should only be used when you expect data in return; most connections will do just fine even if you misuse it, but that is its design. Instead, use dbSendStatement/dbClearResult, or better yet just dbExecute.
The following are pairs of perfectly-equivalent pathways:
To retrieve data:
dat <- dbGetQuery(con, qry)
res <- dbSendQuery(con, qry); dat <- dbFetch(res); dbClearResult(res)
To send a statement (that does not return data, e.g. UPDATE or INSERT):
dbExecute(con, stmt)
res <- dbSendStatement(con, stmt); dbClearResult(res)
(sloppy) res <- dbSendQuery(con, stmt); dbClearResult(res) (I think some DBs complain about this method)
If you choose dbSend*, always call dbClearResult when done with the statement/fetch. (R will often clean up after you, but if something goes wrong here -- and I have hit this a few times over the last few years -- the connection locks up and you must recreate it. This can also leave orphan connections on the database.)
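One way to make that cleanup automatic is a small wrapper around the send/clear cycle; a minimal sketch (the helper name run_stmt is made up here):
run_stmt <- function(con, stmt) {
  res <- DBI::dbSendStatement(con, stmt)
  on.exit(DBI::dbClearResult(res), add = TRUE)  # runs even if the statement errors
  DBI::dbGetRowsAffected(res)
}
run_stmt(con, "create table tab1 as select * from bigtable")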
I think most use-cases are a single-query-and-out, meaning dbGetQuery and dbExecute are the easiest to use. However, there are times when you may want to repeat a query. An example from ?dbSendQuery:
# Pass multiple sets of values with dbBind():
rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = ?")
dbBind(rs, list(6L))
dbFetch(rs)
dbBind(rs, list(8L))
dbFetch(rs)
dbClearResult(rs)
(I think it's a little hasty in that documentation to dbFetch without capturing the data ... I would expect dat <- dbFetch(..), discarding the return value here seems counter-productive.)
One advantage to doing this multi-step (requiring dbClearResult) is with more complex queries: database servers in general tend to "compile" or optimize a query based on its execution engine. This is not always a very expensive step for the server to execute, and it can pay huge dividends on data retrieval. The server often caches this optimized query, and when it sees the same query it uses the already-optimized version of the query. This is one case where using parameter-binding can really help, as the query is identical in repeated use and therefore never needs to be re-optimized.
FYI, parameter-binding can be done repeatedly as shown above using dbBind, or it can be done with dbGetQuery via the params= argument. For instance, this equivalent set of expressions will return the same results as above:
qry <- "SELECT * FROM mtcars WHERE cyl = ?"
dat6 <- dbGetQuery(con, qry, params = list(6L))
dat8 <- dbGetQuery(con, qry, params = list(8L))
As for dbWriteTable, for me it's mostly a matter of convenience for quick work. There are times when the DBI/ODBC connection uses the wrong datatype on the server (e.g., SQL Server's DATETIME instead of DATETIMEOFFSET, or NVARCHAR(32) versus VARCHAR(MAX)), so if I need something quickly, I'll use dbWriteTable; otherwise I formally define the table with the server datatypes that I know I want, as in dbExecute(con, "create table quux (...)"). This is by no means a "best practice"; it is heavily rooted in preference and convenience. For data that is easy (float/integer/string) and where the server default datatypes are acceptable, dbWriteTable is perfectly fine. One can also use dbCreateTable (which creates the table without uploading data), which lets you specify the fields with a bit more control.
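A minimal sketch of that last option, with made-up column names and types, followed by uploading an existing data frame quux_df:
DBI::dbCreateTable(con, "quux", fields = c(id = "integer", label = "varchar(64)"))
DBI::dbAppendTable(con, "quux", quux_df)  # append the data without redefining the types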

How to SELECT a single record in table X with the largest value for X.a WHERE values for fields X.b & X.c are specified

I am using the following query to obtain the current component serial number (tr_sim_sn) installed on the host device (tr_host_sn) from the most recent record in a transaction history table (PUB.tr_hist)
SELECT tr_sim_sn FROM PUB.tr_hist
WHERE tr_trnsactn_nbr = (SELECT max(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_lot = '99524136'
AND tr_part = '6684112-001')
The actual table has ~190 million records. The excerpt below contains only a few sample records, and only fields relevant to the search to illustrate the query above:
tr_sim_sn |tr_host_sn* |tr_host_pn |tr_domain |tr_trnsactn_nbr |tr_qty_loc
_______________|____________|_______________|___________|________________|___________
... |
356136072015140|99524135 |6684112-000 |vattal_us |178415271 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178424418 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178628048 |1.0000000000
356136072015050|99524136 |6684112-001 |vattal_us |178628051 |-1.0000000000
356136072015836|99524137 |6684112-005 |vattal_us |178645337 |-1.0000000000
...
* = key field
The excerpt illustrates multiple occurrences of tr_trnsactn_nbr for a single value of tr_host_sn. The largest value for tr_trnsactn_nbr corresponds to the current tr_sim_sn installed within tr_host_sn.
This query works, but it is very slow: ~8 minutes.
I would appreciate suggestions to improve or refactor this query to improve its speed.
Check with your admins to determine when they last updated the SQL statistics. If the answer is "we don't know" or "never", then you might want to ask them to run the following 4GL program, which will generate a SQL script to accomplish that:
/* genUpdateSQL.p
 *
 * mpro dbName -p util/genUpdateSQL.p -param "tmp/updSQLstats.sql"
 *
 * sqlexp -user userName -password passWord -db dbName -S servicePort
 *        -infile tmp/updSQLstats.sql -outfile tmp/updSQLstats.log
 *
 */

output to value( ( if session:parameter <> "" then session:parameter else "updSQLstats.sql" )).

for each _file no-lock where _hidden = no:

  put unformatted
    "UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB."
    '"' _file._file-name '"' ";"
    skip
  .

  put unformatted "commit work;" skip.

end.

output close.

return.
This will generate a script that updates statistics for all tables and all indexes. You could edit the output to update only the tables and indexes that are part of this query if you want.
Also, if the admins are nervous they could, of course, try this on a test db or a restored backup before implementing in a production environment.
I am posting this as a response to my request for an improved query.
As it turns out, the following syntax includes two distinct changes that greatly improved the speed of the query. One is to include the tr_domain search criterion in both the main and nested portions of the query. The second is to narrow the search by increasing the number of search criteria, all of which appear in the nested section of the syntax:
SELECT tr_sim_sn
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
  AND tr_trnsactn_nbr IN (
      SELECT MAX(tr_trnsactn_nbr)
      FROM PUB.tr_hist
      WHERE tr_domain = 'vattal_us'
        AND tr_part = '6684112-001'
        AND tr_lot = '99524136'
        AND tr_type = 'ISS-WO'
        AND tr_qty_loc < 0)
This syntax results in ~0.5s response time. (credit to my colleague, Daniel V.)
To be fair, this query uses criteria beyond the parameters stated in the original post, making it difficult or impossible for others to attempt a reasonable answer. The omission was not on purpose, of course; it came from being fairly new to the fundamentals of good query design. This query is partly the result of learning that when too few or non-indexed fields are used as search criteria against a large table, it is sometimes helpful to narrow the search by adding criteria. The original had 3; this one has 5.

R - SQL query stored as an object name does not work with dbGetQuery

Need a little help with the following R code. I've got quite a lot of data to load from a Microsoft SQL database. I tried a few things to make the SQL queries manageable:
1) Stored the queries as objects whose names share a unique prefix
2) Used a search on the prefix to return a vector of those object names
3) Used a for loop to iterate over the vector to load the data <- this part didn't work
library(odbc)
library(tidyverse)
library(stringr)

# Setting up the DB connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name',
                     Database = 'Database_name', UID = 'User_ID', trusted_connection = 'yes')

# Defining the SQL queries
Sql_query1 <- "select * from db1"
Sql_query2 <- "select top 100 * from db2"

# Store the SQL query object names in a vector by searching for object names with the prefix Sql_
Sql_list <- ls()[str_detect(ls(), regex("sql_", ignore_case = TRUE))]

# This is the part where the code didn't work
for (i in Sql_list) { i <- dbGetQuery(db, i) }
The error I get is: "Error: 'Sql_query1' nanodb.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'Sql_query1'"
However, if I don't use the loop, no error occurs! That would be feasible if I only had 2-3 queries to manage... unfortunately I have 20 of them!
dbGetQuery(db, Sql_query1)
Can anyone help? Thank you!
# Rohit's solution written down:
The first part, from your side, is fine:
# Setting up the DB connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name',
                     Database = 'Database_name', UID = 'User_ID', trusted_connection = 'yes')
But then it would be more convenient to do something like this:
A more verbose version:
sqlqry_lst <- vector(mode = 'list', length = 2)  # create a list to hold the queries; in real life length = 20
names(sqlqry_lst) <- paste0('Sql_query', 1:2)    # assign names to your list; use 1:20 in your real example
# Put the SQL code into the list elements
sqlqry_lst[['Sql_query1']] <- "select * from db1"
sqlqry_lst[['Sql_query2']] <- "select top 100 * from db2"
# If you really want to use for loops
res <- vector(mode = 'list', length(sqlqry_lst))  # result list
for (i in seq_along(sqlqry_lst)) res[[i]] <- dbGetQuery(db, sqlqry_lst[[i]])
Or, as a two-liner, a bit more R-stylish and IMHO more elegant:
sqlqry_lst <- list(Sql_query1="select * from db1", Sql_query2="select top 100 * from db2")
res <- lapply(sqlqry_lst, FUN = dbGetQuery, conn=db)
I suggest you mix and match the two as it suits you best: the verbose version for creating (or, more precisely, naming) the query list, and the short version for running the queries against the database.
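If you would rather keep the queries as individual objects, as in the original post, mget() can collect them into a named list using the same prefix search; a minimal sketch, assuming the Sql_query* objects live in the global environment:
Sql_list <- ls(pattern = "^Sql_query")                # the object names, not the queries themselves
res <- lapply(mget(Sql_list), dbGetQuery, conn = db)  # mget() retrieves the query strings by name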

Can I run an SQL update statement using only dplyr syntax in R

I need to update column values conditionally on other columns in a PostgreSQL database table. I managed to do it by writing an SQL statement in R and executing it with dbExecute from the DBI package.
library(dplyr)
library(DBI)
# Establish connection with database
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "myDb",
host="localhost", port= 5432, user="me",password = myPwd)
# Write SQL update statement
request <- paste("UPDATE table_to_update",
"SET var_to_change = 'new value' ",
"WHERE filter_var = 'filter' ")
# Back-end execution
con %>% dbExecute(request)
Is it possible to do so using only dplyr syntax? Out of curiosity, I tried
con %>% tbl("table_to_update") %>%
mutate(var_to_change = if (filter_var == 'filter') 'new value' else var_to_change)
which works in R but obviously does nothing in the database, since it generates a SELECT statement. copy_to allows only append and overwrite options, so I can't see how to use it short of deleting and then re-appending the filtered observations...
Current dplyr 0.7.1 (with dbplyr 1.1.0) doesn't support this, because it assumes that all data sources are immutable. Issuing an UPDATE via dbExecute() seems to be the best bet.
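For the record, the UPDATE itself can also be issued with bound parameters instead of pasted values; a minimal sketch, assuming a driver that supports parameter binding (e.g. RPostgres, which uses $1-style placeholders):
dbExecute(con,
          "UPDATE table_to_update SET var_to_change = $1 WHERE filter_var = $2",
          params = list("new value", "filter"))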
For replacing a larger chunk in a table, you could also:
Write the data frame to a temporary table in the database via copy_to().
Start a transaction.
Issue a DELETE FROM ... WHERE id IN (SELECT id FROM <temporary table>)
Issue an INSERT INTO ... SELECT * FROM <temporary table>
Commit the transaction
Depending on your schema, you might be able to do a single INSERT INTO ... ON CONFLICT DO UPDATE instead of DELETE and then INSERT.
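A minimal sketch of the temp-table route described above, assuming updates is a data frame of replacement rows keyed by an id column (the table and column names are illustrative):
copy_to(con, updates, "tmp_updates", temporary = TRUE)  # stage the replacement rows
dbBegin(con)
dbExecute(con, "DELETE FROM table_to_update WHERE id IN (SELECT id FROM tmp_updates)")
dbExecute(con, "INSERT INTO table_to_update SELECT * FROM tmp_updates")
dbCommit(con)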

Same query, different results. Possible causes?

For testing purposes, I am querying the same table from the same database using two different GUIs (RStudio and SquirreLSQL).
The query in the SquirreLSQL console looks like this:
select count(distinct idstr) from fact_table where date::date='2014-10-30' and (w>0 or x>0 or y>0)
And in RStudio, I have the following code:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,"databaseconnectionstring",user ="usr",password ="pwd",dbname = "db")
res <- dbSendQuery(con, "select count(distinct idstr) from fact_table where date::date='2014-10-30' and (w>0 or x>0 or y>0)")
The query run in SquirreLSQL returns a count almost twice as large as the one run in RStudio. What could cause the exact same query to return different values? The table and its contents do not change.
Thanks to Jakub's response, I realized that the GUIs were in different timezones. To fix this, I ran the following line of SQL in SquirreLSQL to find the correct timezone:
SELECT current_setting('TIMEZONE')
It returned "America/New_York", so I then ran the following line in R to get the two programs in the same timezone:
dbGetQuery(con, "SET TIMEZONE TO 'America/New_York'")
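To confirm the two sessions now agree, the same check can also be run from R over the existing connection:
dbGetQuery(con, "SELECT current_setting('TIMEZONE')")  # should now return "America/New_York"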
