Same query, different results. Possible causes?

For testing purposes, I am querying the same table from the same database using two different GUIs (RStudio and SquirreLSQL).
The query in the SquirreLSQL console looks like this:
select count(distinct idstr) from fact_table where date::date='2014-10-30' and (w>0 or x>0 or y>0)
And in RStudio, I have the following code:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,"databaseconnectionstring",user ="usr",password ="pwd",dbname = "db")
res <- dbSendQuery(con, "select count(distinct idstr) from fact_table where date::date='2014-10-30' and (w>0 or x>0 or y>0)")
The query run in SquirreLSQL returns a count almost twice as high as the one run in RStudio. What could cause the exact same query to return different values? The table and its contents do not change.

Thanks to Jakub's response, I realized that the GUIs were in different timezones. To fix this, I ran the following line of SQL in SquirreLSQL to find the correct timezone:
SELECT current_setting('TIMEZONE')
It returned "America/New_York", so I then ran the following line in R to get the two programs in the same timezone:
dbGetQuery(con, "SET TIMEZONE TO 'America/New_York'")

Related

Which DBI function for statements like `create table <tabX> as select * from <tabY>` in R?

I am using DBI/ROracle.
drv <- dbDriver("Oracle")
conn <- dbConnect(drv, ...)
I need to create a table from a select query in another table (i.e. a statement like create table <tabX> as select * from <tabY>).
There seems to be several functions that can perform this task, e.g.:
dbSendQuery(conn, "create table tab1 as select * from bigtable")
# Statement: create table tab1 as select * from bigtable
# Rows affected: 28196
# Row count: 0
# Select statement: FALSE
# Statement completed: TRUE
# OCI prefetch: FALSE
# Bulk read: 1000
# Bulk write: 1000
Or:
dbExecute(conn, "create table tab2 as select * from bigtable")
# [1] 28196
Or even:
tab3 <- dbGetQuery(conn, "select * from bigtable")
dbWriteTable(conn = conn, "TAB3", tab3)
# [1] TRUE
Each method seems to work, but I guess there are differences in performance/best practice. What is the best/most efficient way to run statements like create table <tabX> as select * from <tabY>?
I did not find any hint in the DBI and ROracle help pages.
Up front: use dbExecute for this; don't use dbSendQuery, as that function suggests the expectation of returned data (though it still works).
dbSendQuery should only be used when you expect data in return; most connections will do just fine even if you misuse it, but that's the design of it. Instead, use dbSendStatement/dbClearResult or, better yet, just dbExecute.
The following are pairs of perfectly-equivalent pathways:
To retrieve data:
dat <- dbGetQuery(con, qry)
res <- dbSendQuery(con, qry); dat <- dbFetch(res); dbClearResult(res)
To send a statement (that does not return data, e.g. UPDATE or INSERT):
dbExecute(con, stmt)
res <- dbSendStatement(con, stmt); dbClearResult(res)
(sloppy) res <- dbSendQuery(con, stmt); dbClearResult(res) (I think some DBs complain about this method)
If you use dbSend*, always call dbClearResult when done with the statement/fetch. (R will often clean up after you, but if something goes wrong here -- and I have hit this a few times over the last few years -- the connection locks up and you must recreate it. This can leave orphan connections on the database as well.)
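One defensive pattern (a sketch, not something from the DBI docs) is to wrap the dbSendQuery/dbFetch/dbClearResult sequence in a small helper with on.exit(), so the result set is released even if the fetch fails; fetch_all is a hypothetical name:

# Hypothetical helper: guarantees dbClearResult() runs even if dbFetch() errors,
# so the connection is never left holding an open result set.
fetch_all <- function(con, qry) {
  res <- dbSendQuery(con, qry)
  on.exit(dbClearResult(res), add = TRUE)
  dbFetch(res)
}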
I think most use-cases are a single-query-and-out, meaning dbGetQuery and dbExecute are the easiest to use. However, there are times when you may want to repeat a query. An example from ?dbSendQuery:
# Pass multiple sets of values with dbBind():
rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = ?")
dbBind(rs, list(6L))
dbFetch(rs)
dbBind(rs, list(8L))
dbFetch(rs)
dbClearResult(rs)
(I think it's a little hasty in that documentation to call dbFetch without capturing the data ... I would expect dat <- dbFetch(..); discarding the return value here seems counter-productive.)
One advantage to doing this multi-step (requiring dbClearResult) is with more complex queries: database servers in general tend to "compile" or optimize a query based on its execution engine. This is not always a very expensive step for the server to execute, and it can pay huge dividends on data retrieval. The server often caches this optimized query, and when it sees the same query it uses the already-optimized version of the query. This is one case where using parameter-binding can really help, as the query is identical in repeated use and therefore never needs to be re-optimized.
FYI, parameter-binding can be done repeatedly as shown above using dbBind, or it can be done with dbGetQuery via the params= argument. For instance, this equivalent set of expressions will return the same results as above:
qry <- "SELECT * FROM mtcars WHERE cyl = ?"
dat6 <- dbGetQuery(con, qry, params = list(6L))
dat8 <- dbGetQuery(con, qry, params = list(8L))
As for dbWriteTable, for me it's mostly a matter of convenience for quick work. There are times when the DBI/ODBC connection uses the wrong datatype on the server (e.g., SQL Server's DATETIME instead of DATETIMEOFFSET, or NVARCHAR(32) versus varchar(max)), so if I need something quickly, I'll use dbWriteTable; otherwise I formally define the table with the server datatypes that I know I want, as in dbExecute(con, "create table quux (...)"). This is by no means a "best practice"; it is heavily rooted in preference and convenience. For data that is easy (float/integer/string) and where the server default datatypes are acceptable, dbWriteTable is perfectly fine. One can also use dbCreateTable (which creates the table without uploading data), which allows you to specify the fields with a bit more control.
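For example, a minimal sketch with dbCreateTable and dbAppendTable (the table name quux and the column types here are just illustrative, not taken from the answer above):

# Define the table with explicit server-side types, then upload the data.
dbCreateTable(con, "quux", fields = c(id = "integer", label = "varchar(32)", ts = "timestamp"))
dbAppendTable(con, "quux", quux_df)   # quux_df is an existing data frame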

Join works in Azure SQL but fails with R DBI connection: The multi-part identifier could not be found

I have a query that works perfectly in SSMS. But when running the query in R using the DBI package, I receive several multi-part identifier errors: "rt.secondary_id" could not be bound, "rt.third_id" could not be bound, and "t2.important" could not be bound.
select t1.[main_id]
,rt.secondary_id
,rt.third_id
,t1.[date_col]
,t2.important
from t1
inner join rt on t1.main_id = rt.main_id
inner join t2 on rt.main_id = t2.main_id
inner join (select t1.main_id, max(t1.date_col) as upload_time from t1 group by t1.main_id) AS ag ON t1.main_id = ag.main_id AND t1.date_col = ag.upload_time
The unique identifier in t1 is the combination of main_id and date_col, and this query finds the most recent entry in t1 for a given main_id.
Not exactly sure if my query is structured in a poor way or this is an R issue. I've tried adding SET NOCOUNT ON to the query based on what I thought might be related issues elsewhere on stackoverflow, but no dice.
I found out what my issue was: a silly (but time-consuming) mistake on my part. Essentially, I was bringing my SQL query into R via paste(scan(...), collapse = " "). I had a comment in my SQL query, --, which could not be read correctly by R. Deleting the comment OR switching the comment to /* ... */ syntax fixes the problem.
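In other words, collapsing with a single space flattens the whole file onto one line, so everything after the first -- becomes part of the comment. A safer sketch (the file name and con are placeholders) is to preserve the line breaks when reading the query:

# Keep newlines so a "--" comment only masks the rest of its own line.
qry <- paste(readLines("my_query.sql"), collapse = "\n")
res <- DBI::dbGetQuery(con, qry)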

Dynamic queries with BigQuery from R

I want to subset my data based on a list of items I'm looping through, and add the item that is currently indexed to the query I'm sending to BigQuery from R.
ex.
Item
001
002
003
i=2 => item '002'
Instead of having to manually put '002', I want to be able to construct the following query:
sql_string <- "SELECT * FROM MAN WHERE item_code = item[i]"
But currently, I have an argument type mismatch. Could someone show me how this is done through the bigrquery package?
The paste function did the job:
sql_string <- paste("SELECT * FROM MAN WHERE item_code =", item[i] )
Pasting the SQL together may work, but for security reasons it's an unsafe practice. This is described in https://db.rstudio.com/best-practices/run-queries-safely/ and in almost any resource discussing "SQL injection".
This code is adapted from https://db.rstudio.com/databases/big-query/, using your query of the man table.
library(bigrquery)
con <- DBI::dbConnect(
  bigrquery::dbi_driver(),
  dataset = "noaa_gsod",
  project = "bigquery-public-data",
  billing = project   # billing project id, defined elsewhere
)

# big_vector and item are defined somewhere before this snippet.
for (i in big_vector) {
  ds_man <- DBI::dbSendQuery(con, "SELECT * FROM MAN WHERE item_code = ?")
  DBI::dbBind(ds_man, list(item[i]))
  dat <- DBI::dbFetch(ds_man)   # capture the result instead of discarding it
  DBI::dbClearResult(ds_man)
}
DBI::dbDisconnect(con)
I've never run against BigQuery, so post a comment if this doesn't work.
(Even if SQL injection is never an issue for your specific application, some people following this example will have this vulnerability.)

Can I run an SQL update statement using only dplyr syntax in R

I need to update column values conditionally on other columns in some PostgreSQL database table. I managed to do it by writing an SQL statement in R and executing it with dbExecute from the DBI package.
library(dplyr)
library(DBI)
# Establish connection with database
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "myDb",
                 host = "localhost", port = 5432, user = "me", password = myPwd)
# Write SQL update statement
request <- paste("UPDATE table_to_update",
                 "SET var_to_change = 'new value' ",
                 "WHERE filter_var = 'filter' ")
# Back-end execution
con %>% dbExecute(request)
Is it possible to do so using only dplyr syntax? I tried, out of curiosity,
con %>% tbl("table_to_update") %>%
mutate(var_to_change = if (filter_var == 'filter') 'new value' else var_to_change)
which works in R but obviously does nothing in the db since it uses a select statement. copy_to allows only for append and overwrite options, so I can't see how to use it unless deleting then appending the filtered observations...
Current dplyr 0.7.1 (with dbplyr 1.1.0) doesn't support this, because it assumes that all data sources are immutable. Issuing an UPDATE via dbExecute() seems to be the best bet.
For replacing a larger chunk in a table, you could also do the following (a sketch of these steps appears after the list):
1. Write the data frame to a temporary table in the database via copy_to().
2. Start a transaction.
3. Issue a DELETE FROM ... WHERE id IN (SELECT id FROM <temporary table>).
4. Issue an INSERT INTO ... SELECT * FROM <temporary table>.
5. Commit the transaction.
Depending on your schema, you might be able to do a single INSERT INTO ... ON CONFLICT DO UPDATE instead of DELETE and then INSERT.
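A minimal sketch of those steps with DBI and dplyr, assuming a target table target_table, a key column id, and a data frame updated_rows holding the replacement rows (all three names are placeholders):

library(DBI)
library(dplyr)

# Stage the replacement rows in a temporary table on the server.
copy_to(con, updated_rows, name = "tmp_updates", temporary = TRUE)

dbWithTransaction(con, {
  # Remove the rows that are about to be replaced ...
  dbExecute(con, "DELETE FROM target_table WHERE id IN (SELECT id FROM tmp_updates)")
  # ... and insert the new versions from the temporary table.
  dbExecute(con, "INSERT INTO target_table SELECT * FROM tmp_updates")
})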

RODBC connection - limited rows

I set up an ODBC connection to a Netezza (SQL) database. The connection is fine. However, R only pulls 256 rows by default and restricts the number of rows it returns.
If I run the query in Netezza itself, it returns the full result (about 300k rows). I expected the same number of rows in R, but it returned only 256, quite a bit short of 300k.
The driver I am using is NetezzaSQL version 7.00.02 (NSQLODBC.DLL).
I tried to change the pre-fetch count to zero in the "Drivers Option" from
Control Panel > Administrative Tools > Data Sources (ODBC) > System DSN
It didn't work. Any ideas?
I think RODBC acts poorly with Netezza. A solution is described at http://datamining.togaware.com/survivor/Database_Connection.html:
Just add believeNRows=FALSE to either your sqlQuery or odbcConnect call (use the latter if you also use sqlFetch).
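For instance, a quick sketch (the DSN, credentials, and table name are placeholders):

library(RODBC)

# believeNRows = FALSE tells RODBC not to trust the row count the Netezza
# driver reports, so the full result set is fetched.
ch  <- odbcConnect("NetezzaDSN", uid = "user", pwd = "pwd", believeNRows = FALSE)
dat <- sqlQuery(ch, "select * from mytable")
odbcClose(ch)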
You can also try using JDBC driver:
library(RJDBC)
drv <- JDBC("org.netezza.Driver", "nzjdbc.jar", "'")
conn <- dbConnect(drv, "jdbc:netezza://host:5480/database", "user", "password")
res <- dbSendQuery(conn, "select * from mytable")
dat <- fetch(res, -1)   # retrieve all rows and
dbClearResult(res)      # release the result set
That way you don't have to deal with DSNs, etc.
I know this is kind of outdated, but the problem is not with the RODBC package. The problem lies in how you set up the ODBC connection: if you configure the connection in Windows, you'll see a last tab in the settings where you can specify the number of rows it fetches, and the default is 256.
