I'm writing an R script that runs several SQL queries using the DBI package to create reports. To make this work, I need to be able to declare a variable in R (such as a Period End Date) that is then referenced from within the SQL query. When I run my query, I get an error.
If I simply use the variable name (PeriodEndDate), I get the following error:
Error in (function (classes, fdef, mtable) : unable to find an
inherited method for function ‘dbGetQuery’ for signature ‘"Microsoft
SQL Server", "character"’
If I use # to access the field name (#PeriodEndDate), I get the following error:
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server
Driver][SQL Server]Must declare the scalar variable "#PeriodEndDate".
[Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not
be prepared.
An example query might look like this:
library(DBI) # Used for connecting to SQL server and submitting SQL queries.
library(tidyverse) # Used for data manipulation and creating/saving CSV files.
library(lubridate) # Used to calculate end of month, start of month in queries
# Define time periods for queries.
PeriodEndDate <<- ceiling_date(as.Date('2021-10-31'),'month') # Enter Period End Date on this line.
PeriodStartDate <<- floor_date(PeriodEndDate, 'month')
# Connect to SQL Server.
con <- dbConnect(
odbc::odbc(),
driver = "SQL Server",
server = "SERVERNAME",
trusted_connection = TRUE,
timeout = 5,
encoding = "Latin1")
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= #PeriodEndDate
")
I believe one way might be to use the paste function, like this:
samplequery <- dbGetQuery(con, paste("
SELECT * FROM [TableName]
WHERE OrderDate <=", PeriodEndDate))
However, that can get unwieldy if it involves several variables being referenced outside the query or in several places within the query.
Is there a relatively straightforward way to do this?
Thanks in advance for any thoughts you might have!
The mechanism in most DBI-based connections is to use ?-placeholders in the query and params= in the call to DBI::dbGetQuery or DBI::dbExecute.
Perhaps this:
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
", params = list(PeriodEndDate))
In general, the mechanisms for including an R object as a data-item are enumerated well in https://db.rstudio.com/best-practices/run-queries-safely/. In order of my recommendation:
1. Parameterized queries (as shown above);
2. glue::glue_sql;
3. sqlInterpolate (which uses the same ?-placeholders as #1).
The link also mentions "manual escaping" using dbQuoteString.
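For completeness, here is a minimal sketch of sqlInterpolate and of "manual escaping" with dbQuoteString, assuming the con and PeriodEndDate objects from the question:
# sqlInterpolate with a named ?-placeholder
sql <- DBI::sqlInterpolate(con,
  "SELECT * FROM [TableName] WHERE OrderDate <= ?ped",
  ped = as.character(PeriodEndDate))
samplequery <- DBI::dbGetQuery(con, sql)
# "Manual escaping" with dbQuoteString (last resort)
sql <- paste0("SELECT * FROM [TableName] WHERE OrderDate <= ",
              DBI::dbQuoteString(con, as.character(PeriodEndDate)))
samplequery <- DBI::dbGetQuery(con, sql)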
Anything else is, in my mind, riskier due to inadvertent SQL corruption or injection.
I've seen many questions here on SO that try to use one of the following techniques: paste and/or sprintf using sQuote or hard-coded paste0("'", PeriodEndDate, "'"). These are too fragile in my mind and should be avoided.
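To illustrate the fragility, consider a hypothetical value containing a single quote (the Company column here is made up):
company <- "O'Brien & Co"
paste0("SELECT * FROM [TableName] WHERE Company = '", company, "'")
# "SELECT * FROM [TableName] WHERE Company = 'O'Brien & Co'"   <-- broken SQL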
My preference for parameterized queries extends beyond usability: they can also have a non-trivial impact on repeated use of the same query, since DBMSes tend to analyze/optimize the query and cache that work for the next use. Consider this:
### parameterized queries
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-02"))
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-03"))
### glue_sql
PeriodEndDate <- as.Date("2020-02-02")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2020-02-02'
DBI::dbGetQuery(con, qry)
PeriodEndDate <- as.Date("2021-12-22")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2021-12-22'
DBI::dbGetQuery(con, qry)
In the case of parameterized queries, the "query" itself never changes, so its optimized query (internal to the server) can be reused.
In the case of the glue_sql queries, the query itself changes (albeit by just a handful of characters), so most (all?) DBMSes will re-analyze and re-optimize the query. While they tend to do it quickly, and most analysts' queries are not complex, it is still unnecessary overhead, and it misses an opportunity in cases where your query and/or the indices require a little more work to optimize well.
Notes:
? is used by most DBMSes but not all. Others use $name or $1 or such. With odbc::odbc(), however, it is always ? (no name, no number), regardless of the actual DBMS. (See the sketch after these notes.)
Not sure if you are using this elsewhere, but the use of <<- (instead of <- or =) can encourage bad habits and/or lead to unreliable/unexpected results.
It is not uncommon to use the same variable multiple times in a query. Unfortunately, you will need to include the value multiple times in params=, and order is important. For example,
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
or (SomethingElse = ? and OrderDate > ?)
", params = list(PeriodEndDate, 99, PeriodEndDate))
If you have a list/vector of values and want to use SQL's IN operator, then you have two options, my preference being the first (for the reasons stated above):
Create a string of question marks and paste it into the query. (Yes, this is pasting into the query, but we are only pasting placeholders, not data, so we are not dealing with the risk of incorrectly single-quoting or double-quoting. Since DBI does not support any other mechanism, this is what we have.)
MyDates <- c(..., ...)
qmarks <- paste(rep("?", length(MyDates)), collapse=",")
samplequery <- dbGetQuery(con, sprintf("
SELECT * FROM [TableName]
WHERE OrderDate IN (%s)
", qmarks), params = as.list(MyDates))
glue_sql supports expanding a vector internally:
MyDates <- c(..., ...)
qry <- glue::glue_sql("
SELECT * FROM [TableName]
WHERE OrderDate IN ({MyDates*})", .con=con)
DBI::dbGetQuery(con, qry)
Related
I have a database called "db" with a table called "company" which has a column named "name".
I am trying to look up a company name in db using the following query:
dbGetQuery(db, 'SELECT name,registered_address FROM company WHERE LOWER(name) LIKE LOWER("%APPLE%")')
This gives me the following correct result:
name
1 Apple
My problem is that I have a bunch of companies to look up and their names are in the following data frame
df <- as.data.frame(c("apple", "microsoft","facebook"))
I have tried the following method to get the company name from my df and insert it into the query:
sqlcomp <- paste0("'SELECT name, ","registered_address FROM company WHERE LOWER(name) LIKE LOWER(",'"', df[1,1],'"', ")'")
dbGetQuery(db,sqlcomp)
However this gives me the following error:
tinyformat: Too many conversion specifiers in format string
I've tried several other methods but I cannot get it to work.
Any help would be appreciated.
This code should work:
df <- as.data.frame(c("apple", "microsoft","facebook"))
comparer <- paste(paste0(" LOWER(name) LIKE LOWER('%",df[,1],"%')"),collapse=" OR ")
sqlcomp <- sprintf("SELECT name, registered_address FROM company WHERE %s",comparer)
dbGetQuery(db,sqlcomp)
Hope this helps you move on.
Please vote my solution if it is helpful.
Using paste to splice data into a query is generally a bad idea, due to SQL injection (whether true injection or just accidental spoiling of the query). It's also better to keep the query free of "raw data" because DBMSes tend to optimize a query once and reuse that optimized plan every time they see the same query; if you encode data in it, it's a new query each time, so the optimization is defeated.
It's generally better to use parameterized queries; see https://db.rstudio.com/best-practices/run-queries-safely/#parameterized-queries.
For you, I suggest the following:
df <- data.frame(names = c("apple", "microsoft","facebook"))
qmarks <- paste(rep("?", nrow(df)), collapse = ",")
qmarks
# [1] "?,?,?"
dbGetQuery(con, sprintf("select name, registered_address from company where lower(name) in (%s)", qmarks),
params = tolower(df$names))
This takes advantage of three things:
1. the SQL IN operator, which takes a list (vector in R) of values and conditions on "set membership";
2. optimized queries: if you subsequently run this query again (with three arguments), then it will reuse the query (granted, if you run it with other than three companies, then it will have to re-optimize, so this is a limited gain);
3. no need to deal with quoting/escaping your data values; for instance, if it is feasible that your company names might include single or double quotes (perhaps typos on user entry), then adding the value to the query itself is either going to cause the query to fail, or you will have to jump through some hoops to ensure that all quotes are escaped properly for the DBMS to see them as the correct strings.
I've been trying to query data from a PostgreSQL database (via pgAdmin) into R for analysis. Most of the queries work except when I try to write a condition specifically to filter out most of the rows. Please find the code below.
dbGetQuery(con, 'select * from "db_name"."User" where "db_name"."User"."FirstName" = "Mani" ')
Error in result_create(conn@ptr, statement) :
Failed to prepare query: ERROR: column "Mani" does not exist
LINE 1: ...from "db_name"."User" where "db_name"."User"."FirstName" = "Mani"
^
This is the error I get. Why is it treating Mani as a column when it is just a value? Someone please assist me.
String literals in Postgres (and most flavors of SQL) take single quotes. This, combined with a few other simplifications to your code, leaves us with this:
sql <- "select * from db_name.User u where u.FirstName = 'Mani'"
dbGetQuery(con, sql)
Note that I introduced a table alias, u, for the User table, so that we don't have to repeat the fully qualified name in the WHERE clause.
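If you would rather not embed the literal in the query at all, a parameterized version is also possible. This is only a sketch, and it assumes the RPostgres driver, which uses $1-style placeholders:
sql <- "select * from db_name.User u where u.FirstName = $1"
dbGetQuery(con, sql, params = list("Mani"))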
I have a .sql file that I use with SQL Server Management Studio. I use the same file within my R script to pull the data directly into R (as below) and this works well.
query <- paste(readLines("SQL_FILE.sql"), collapse="\n") # read sql query from hard drive
con <- odbcConnect(dsn ="DATABASE_NAME") # connect to database
dt <- sqlQuery(con, query, rows_at_time = 1, stringsAsFactors = FALSE)
What I need is to insert an additional condition, whose values are generated in the R environment, at the beginning of the WHERE clause in the .sql file. I solved this with the approach below:
queryStart <- "SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN("
filteringValuesForX <- c("FILTERING_VALUE_X1", "FILTERING_VALUE_X2")
queryEnd <- ") AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"
query <- paste0(queryStart,
                toString(paste("'", filteringValuesForX, "'", sep = "")),
                queryEnd)
query
And the output is:
"SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"
However, I am looking for a better solution for the following reasons:
It is not dynamic; when I update the .sql file using SQL Server Management Studio, I need to manually update the queryStart and queryEnd variables as well.
The actual SQL script is very long, and I don't want to see all the SQL code in the R script.
Note: There are other WHERE clauses in the original .sql file. However, I want this update only for one specific WHERE clause. To mark this specific one, I added the condition "1=1".
Any suggestions?
I think I found a way!
But before accepting my solution as an answer, I will wait for one more day to see if there is a better answer. So, please, feel free to suggest other approaches.
replaceWithThis <- paste(" A.COLUMN_X IN(", toString(paste("'",filteringValuesForX ,"'", sep='')), ")", " ", collapse = "\n")
query <- sub(x = query, pattern = "1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') ", replacement = replaceWithThis, fixed = TRUE)
query
For this example, to avoid a duplicate WHERE condition, I've used pattern = "1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') ". In the original data it is enough to replace 1=1.
The output is:
"SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"
In Perl/Python, DBI-style APIs have a mechanism to safely interpolate parameters into an SQL query. For example, in Python I would do:
cursor.execute("SELECT * FROM table WHERE value > ?", (5,))
Here, the second argument to the execute method is a tuple of parameters to substitute into the SQL query.
Is there a similar mechanism for R's DBI-compliant APIs? The examples I've seen never show parameters passed to the query. If not, what is the safest way to interpolate parameters into a query? I'm specifically looking at using RPostgreSQL.
Just for completeness, I'll add an answer based on Hadley's comment. The DBI package now has the function sqlInterpolate, which can also do this. It requires the parameters to be named in the SQL query, each prefixed with ?. Excerpt from the DBI manual below:
sql <- "SELECT * FROM X WHERE name = ?name"
sqlInterpolate(ANSI(), sql, name = "Hadley")
# This is safe because the single quote has been double escaped
sqlInterpolate(ANSI(), sql, name = "H'); DROP TABLE--;")
Indeed, the use of bind variables is not really well documented. Anyway, the database commands in R work differently for different databases. One possibility for Postgres would be like this:
res <- postgresqlExecStatement(con, "SELECT * FROM table WHERE value > $1", c(5))
postgresqlFetch(res)
postgresqlCloseResult(res)
Hope it helps.
I can't figure out how to update an existing DB2 database in R or update a single value in it.
I can't find much information on this topic online other than very general information, but no specific examples.
library(RJDBC)
teachersalaries=data.frame(name=c("bob"), earnings=c(100))
dbSendUpdate(conn, "UPDATE test1 salary",teachersalaries[1,2])
AND
teachersalaries=data.frame(name=c("bob",'sally'), earnings=c(100,200))
dbSendUpdate(conn, "INSERT INTO test1 salary", teachersalaries[which(teachersalaries$earnings>200,] )
Have you tried passing a regular SQL statement like you would in other languages?
dbSendUpdate(conn, "UPDATE test1 set salary=? where id=?", teachersalary, teacherid)
or
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
Basically you specify the regular SQL DML statement using parameter markers (those question marks) and provide a list of values as comma-separated parameters.
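For example, pulling the values out of the question's data frame might look like this (only a sketch; it assumes test1 has name and salary columns, and it relies on RJDBC's dbSendUpdate accepting the parameter values as additional arguments):
teachersalaries <- data.frame(name = c("bob", "sally"), earnings = c(100, 200))
# Update bob's salary
dbSendUpdate(conn, "UPDATE test1 SET salary = ? WHERE name = ?",
             teachersalaries$earnings[1], teachersalaries$name[1])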
Try this; it worked well for me.
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
You just need to pass a regular SQL statement in the same way you do in any programming language. Try it out.
To update multiple rows at the same time, I have built the following function.
I have tested it with batches of up to 10,000 rows and it works perfectly.
# Libraries
library(RJDBC)
library(dplyr)
# Function upload data into database
db_write_table <- function(conn, table, df){
  # Format data to write
  batch <- apply(df, 1, FUN = function(x) paste0("'", trimws(x), "'", collapse = ",")) %>%
    paste0("(", ., ")", collapse = ",\n")
  # Build query
  query <- paste("INSERT INTO", table, "VALUES", batch)
  # Send update
  dbSendUpdate(conn, query)
}
# Push data
db_write_table(conn,"schema.mytable",mydataframe)
Thanks to the other authors.