Insert additional WHERE condition into a .sql file using R

I have a .sql file that I use with SQL Server Management Studio. I use the same file within my R script to pull the data directly into R (as below) and this works well.
library(RODBC)  # provides odbcConnect() and sqlQuery()

query <- paste(readLines("SQL_FILE.sql"), collapse = "\n")  # read SQL query from disk
con <- odbcConnect(dsn = "DATABASE_NAME")  # connect to database
dt <- sqlQuery(con, query, rows_at_time = 1, stringsAsFactors = FALSE)
What I need is to insert an additional condition, whose values are generated in the R environment, at the beginning of the WHERE clause in the .sql file, using R. I solved this with the approach below:
queryStart <- "SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN("
filteringValuesForX <- c("FILTERING_VALUE_X1", "FILTERING_VALUE_X2")
queryEnd <- ") AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"
query <- paste0(queryStart,
                toString(paste("'", filteringValuesForX, "'", sep = "")),
                queryEnd)
query
And the output is:
"SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"
However, I am looking for a better solution, for the following reasons:
It is not dynamic; when I update the .sql file using SQL Server Management Studio, I need to update queryStart and queryEnd variables manually as well.
The actual SQL script is very long, and I don't want to see all the SQL code in the R script.
Note: There are other WHERE clauses in the original .sql file. However, I want this update only for one specific WHERE clause. To mark that specific one, I added the condition "1=1".
Any suggestions?

I think I found a way!
But before accepting my solution as an answer, I will wait for one more day to see if there is a better answer. So, please, feel free to suggest other approaches.
replaceWithThis <- paste(" A.COLUMN_X IN(", toString(paste("'",filteringValuesForX ,"'", sep='')), ")", " ", collapse = "\n")
# fixed = TRUE so the parentheses in the pattern are matched literally rather than as regex groups
query <- sub(x = query, pattern = "1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') ", replacement = replaceWithThis, fixed = TRUE)
query
For this example, to avoid duplicating the WHERE condition, I've used pattern = "1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') ". In the original .sql file it is enough to replace "1=1" (see the sketch after the output).
The output is:
"SELECT * FROM ANALYSIS_VIEW A WHERE 1=1 AND A.COLUMN_X IN('FILTERING_VALUE_X1', 'FILTERING_VALUE_X2') AND A.COLUMN_Y = 'FILTERING_VALUE_Y1';"

Related

Declaring variable in R for DBI query to MS SQL

I'm writing an R script that runs several SQL queries using the DBI package to create reports. To make this work, I need to be able to declare a variable in R (such as a Period End Date) that is then called from within the SQL query.
If I simply use the field name (PeriodEndDate), I get the following error:
Error in (function (classes, fdef, mtable) : unable to find an
inherited method for function ‘dbGetQuery’ for signature ‘"Microsoft
SQL Server", "character"’
If I use @ to access the field name (@PeriodEndDate), I get the following error:
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server
Driver][SQL Server]Must declare the scalar variable "@PeriodEndDate".
[Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not
be prepared.
An example query might look like this:
library(DBI) # Used for connecting to SQL server and submitting SQL queries.
library(tidyverse) # Used for data manipulation and creating/saving CSV files.
library(lubridate) # Used to calculate end of month, start of month in queries
# Define time periods for queries.
PeriodEndDate <<- ceiling_date(as.Date('2021-10-31'),'month') # Enter Period End Date on this line.
PeriodStartDate <<- floor_date(PeriodEndDate, 'month')
# Connect to SQL Server.
con <- dbConnect(
  odbc::odbc(),
  driver = "SQL Server",
  server = "SERVERNAME",
  trusted_connection = TRUE,
  timeout = 5,
  encoding = "Latin1")
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= @PeriodEndDate
")
I believe one way might be to use the paste function, like this:
samplequery <- dbGetQuery(con, paste("
SELECT * FROM [TableName]
WHERE OrderDate <=", PeriodEndDate))
However, that can get unwieldy if it involves several variables being referenced outside the query or in several places within the query.
Is there a relatively straightforward way to do this?
Thanks in advance for any thoughts you might have!
The mechanism in most DBI-based connections is to use ?-placeholders[1] in the query and params= in the call to DBI::dbGetQuery or DBI::dbExecute.
Perhaps this:
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
", params = list(PeriodEndDate))
In general the mechanisms for including an R object as a data-item are enumerated well in https://db.rstudio.com/best-practices/run-queries-safely/. In the order of my recommendation,
Parameterized queries (as shown above);
glue::glue_sql;
sqlInterpolate (which uses the same ?-placeholders as #1);
The link also mentions "manual escaping" using dbQuoteString.
Anything else is in my mind more risky due to inadvertent SQL corruption/injection.
I've seen many questions here on SO that try to use one of the following techniques: paste and/or sprintf using sQuote or hard-coded paste0("'", PeriodEndDate, "'"). These are too fragile in my mind and should be avoided.
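For illustration, a minimal sketch of option 3 (sqlInterpolate), reusing the [TableName]/OrderDate placeholders from the question; it splices a safely quoted literal into the query string before execution:
# ?ped is a named placeholder; sqlInterpolate quotes the value safely
sql <- DBI::sqlInterpolate(con,
                           "SELECT * FROM [TableName] WHERE OrderDate <= ?ped",
                           ped = as.character(PeriodEndDate))
sql
# <SQL> SELECT * FROM [TableName] WHERE OrderDate <= '2021-11-01'
samplequery <- DBI::dbGetQuery(con, sql)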
My preference for parameterized queries extends beyond usability: they can also have non-trivial performance benefits on repeated use of the same query, since DBMSes tend to analyze/optimize a query and cache the result for the next use. Consider this:
### parameterized queries
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-02"))
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-03"))
### glue_sql
PeriodEndDate <- as.Date("2020-02-02")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2020-02-02'
DBI::dbGetQuery(con, qry)
PeriodEndDate <- as.Date("2021-12-22")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2021-12-22'
DBI::dbGetQuery(con, qry)
In the case of parameterized queries, the "query" itself never changes, so its optimized query (internal to the server) can be reused.
In the case of the glue_sql queries, the query itself changes (albeit by just a handful of characters), so most (all?) DBMSes will re-analyze and re-optimize the query. While they tend to do this quickly, and most analysts' queries are not complex, it is still unnecessary overhead, and it misses an opportunity in cases where your query and/or the indices require a little more work to optimize well.
Notes:
? is used by most DBMSes but not all. Others use $name or $1 or such. With odbc::odbc(), however, it is always ? (no name, no number), regardless of the actual DBMS.
Not sure if you are using this elsewhere, but the use of <<- (vice <- or =) can encourage bad habits and/or unreliable/unexpected results.
It is not uncommon to use the same variable multiple times in a query. Unfortunately, you will need to include the variable multiple times, and order is important. For example,
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
or (SomethingElse = ? and OrderDate > ?)
", params = list(PeriodEndDate, 99, PeriodEndDate))
If you have a list/vector of values and want to use SQL's IN operator, then you have two options, my preference being the first (for the reasons stated above):
Create a string of question marks and paste it into the query. (Yes, this is pasting into the query, but we are not dealing with the risk of incorrect single- or double-quoting. Since DBI does not support any other mechanism, this is what we have.)
MyDates <- c(..., ...)
qmarks <- paste(rep("?", length(MyDates)), collapse=",")
samplequery <- dbGetQuery(con, sprintf("
SELECT * FROM [TableName]
WHERE OrderDate IN (%s)
", qmarks), params = as.list(MyDates))
glue_sql supports expanding internally:
MyDates <- c(..., ...)
qry <- glue::glue_sql("
SELECT * FROM [TableName]
WHERE OrderDate IN ({MyDates*})", .con=con)
DBI::dbGetQuery(con, qry)

Can't find a way to gather data and upload it again to a different SQL Server without breaking the encoding (R/dbplyr/DBI)

The basic setup is that I connect to database A, get some data back into R, and write it to another connection, database B.
The database collation is SQL_Latin1_General_CP1_CI_AS, and I'm using encoding = "windows-1252" on connections A and B.
Display in RStudio is fine, special characters show as they should.
When I try to write the data, I get a "Cannot insert the value NULL into column" error.
I narrowed it down to at least one offending field: a cell with a PHI symbol, which causes the error.
How do I make it so the PHI symbol and presumably other special characters are kept the same from source to destination?
library(DBI)
library(odbc)

conA <- dbConnect(odbc(),
                  Driver = "ODBC Driver 17 for SQL Server",
                  Server = "DB",
                  Database = "serverA",
                  Trusted_connection = "yes",
                  encoding = "1252")
# conB is set up the same way, pointing at database B

dbWriteTable(conB, SQL("schema.table"), failing_row, append = TRUE)
# This causes the "cannot insert null value" error
I suggest working around this problem without dbplyr. As the overwhelming majority of encoding questions have nothing to do with dbplyr (the encoding tag has 23k questions, while the dbplyr tag has <400), this may be a non-trivial problem to resolve within dbplyr.
Here are two work-arounds to consider:
Use a text file as an intermediate step
R will have no problem writing an in-memory table out to a text file/csv. And SQL Server has standard ways of reading in a text file/csv. This gives you the added advantage of validating the contents of the text file before loading it into SQL.
Documentation for SQL Server BULK INSERT can be found here. This answer gives instructions for using UTF-8 encoding: CODEPAGE = '65001'. And this documentation gives instructions for Unicode: DATAFILETYPE = 'widechar'.
If you want to take this approach entirely within R, it will likely look something like:
write.csv(failing_row, "output_file.csv")
query_for_creating_table = "CREATE TABLE schema_name.table_name (
col1 INT,
col2 NCHAR(10),
)"
# each column listed along with suitable data types
query_for_bulk_insert = "BULK INSERT schema_name.table_name
FROM 'output_file.csv'
WITH
(
DATAFILETYPE = 'widechar',
FIRSTROW = 2,
ROWTERMINATOR = '\n'
)"
DBI::dbExecute(con, query_for_creating_table)
DBI::dbExecute(con, query_for_bulk_insert)
Load all the non-error rows and append the final row after
I have had some success in the past using the INSERT INTO syntax, so I would recommend loading the failing row using this approach.
Something like the following:
failing_row = local_df %>%
  filter(condition_to_get_just_the_failing_row)
non_failing_rows = local_df %>%
  filter(! condition_to_get_just_the_failing_row)
# write non-failing rows
dbWriteTable(con, SQL("schema.table"), non_failing_rows, append = T)
# prep for insert of the failing row: quote strings, leave numbers bare
values = c()
for(col in colnames(failing_row)){
  value = failing_row[[col]]
  if(is.numeric(value)){
    values = c(values, value)
  } else {
    values = c(values, paste0("'", value, "'"))
  }
}
insert_query = paste0("INSERT INTO schema.table VALUES (",
                      paste(values, collapse = ", "),
                      ");")
# insert failing row
dbExecute(con, insert_query)
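As an alternative to building the literal by hand, a hedged sketch of a parameterized insert, assuming the odbc driver's ?-placeholders and the schema.table name from above; the placeholder count must match the number of columns in failing_row:
# one ? per column; params supplies the values in column order
qmarks <- paste(rep("?", ncol(failing_row)), collapse = ", ")
DBI::dbExecute(con,
               paste0("INSERT INTO schema.table VALUES (", qmarks, ")"),
               params = as.list(failing_row))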
other resources
If you have not seen them already, here are several related Q&A that might assist: Arabic characters, reading UTF-8, encoding to MySQL, and non-Latin characters as question marks. Though some of these are for reading data into R.

RSQLite dbGetQuery with input from Data Frame

I have a database called "db" with a table called "company" which has a column named "name".
I am trying to look up a company name in db using the following query:
dbGetQuery(db, 'SELECT name,registered_address FROM company WHERE LOWER(name) LIKE LOWER("%APPLE%")')
This gives me the following correct result:
name
1 Apple
My problem is that I have a bunch of companies to look up, and their names are in the following data frame:
df <- as.data.frame(c("apple", "microsoft","facebook"))
I have tried the following method to get the company name from my df and insert it into the query:
sqlcomp <- paste0("'SELECT name, ","registered_address FROM company WHERE LOWER(name) LIKE LOWER(",'"', df[1,1],'"', ")'")
dbGetQuery(db,sqlcomp)
However this gives me the following error:
tinyformat: Too many conversion specifiers in format string
I've tried several other methods but I cannot get it to work.
Any help would be appreciated.
This code should work:
df <- as.data.frame(c("apple", "microsoft","facebook"))
comparer <- paste(paste0(" LOWER(name) LIKE LOWER('%",df[,1],"%')"),collapse=" OR ")
sqlcomp <- sprintf("SELECT name, registered_address FROM company WHERE %s",comparer)
dbGetQuery(db,sqlcomp)
Hope this helps you move on.
Using paste to splice data into a query is generally a bad idea, due to SQL injection (whether true injection or just accidental corruption of the query). It's also better to keep the query free of "raw data" because DBMSes tend to optimize a query once and reuse that optimized plan every time they see the same query; if you encode data in it, it's a new query each time, so the optimization is defeated.
It's generally better to use parameterized queries; see https://db.rstudio.com/best-practices/run-queries-safely/#parameterized-queries.
For you, I suggest the following:
df <- data.frame(names = c("apple", "microsoft","facebook"))
qmarks <- paste(rep("?", nrow(df)), collapse = ",")
qmarks
# [1] "?,?,?"
dbGetQuery(con,
           sprintf("select name, registered_address from company where lower(name) in (%s)", qmarks),
           params = as.list(tolower(df$names)))
This takes advantage of three things:
the SQL IN operator, which takes a list (vector in R) of values and conditions on "set membership";
optimized queries; if you subsequently run this query again (with three arguments), then it will reuse the query. (Granted, if you run with other than three companies, then it will have to reoptimize, so this is limited gain);
no need to deal with quoting/escaping your data values; for instance, if it is feasible that your company names might include single or double quotes (perhaps typos on user-entry), then adding the value to the query itself is either going to cause the query to fail, or you will have to jump through some hoops to ensure that all quotes are escaped properly for the DBMS to see it as the correct strings.
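Note that IN does exact matching, while your original query used LIKE with % wildcards. If you need the substring behavior, a hedged sketch combining placeholders with OR-ed LIKE clauses, reusing the df defined above:
# one "lower(name) like ?" clause per company, OR-ed together
likes <- paste(rep("lower(name) like ?", nrow(df)), collapse = " or ")
dbGetQuery(con,
           sprintf("select name, registered_address from company where %s", likes),
           params = as.list(paste0("%", tolower(df$names), "%")))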

R with PostgreSQL database

I've been trying to query data from a PostgreSQL database (via pgAdmin) into R for analysis. Most of the queries work, except when I try to write a condition that specifically filters out most of the rows. Please find the code below.
dbGetQuery(con, 'select * from "db_name"."User" where "db_name"."User"."FirstName" = "Mani" ')
Error in result_create(conn@ptr, statement) :
Failed to prepare query: ERROR: column "Mani" does not exist
LINE 1: ...from "db_name"."User" where "db_name"."User"."FirstName" = "Mani"
^
This is the error I get. Why is it treating Mani as a column when it is just a value? Could someone please assist me?
String literals in Postgres (and most flavors of SQL) take single quotes. This, combined with a few other cleanups of your code, leaves us with this:
sql <- "select * from db_name.User u where u.FirstName = 'Mani'"
dbGetQuery(con, sql)
Note that I introduced a table alias, u, for the User table, so that we don't have to repeat the fully qualified name in the WHERE clause.
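If you would rather not embed the literal in the query string at all, a hedged sketch of a parameterized version, assuming the RPostgres driver (whose placeholder style is $1, $2, ...; with odbc it would be ?). The double-quoted identifiers from your original query are kept here, since Postgres treats quoted mixed-case names as case-sensitive:
# $1 is bound to the first element of params
sql <- 'select * from db_name."User" u where u."FirstName" = $1'
dbGetQuery(con, sql, params = list("Mani"))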

Writing and Updating DB2 tables with R

I can't figure out how to update an existing DB2 database in R or update a single value in it.
I can't find much information on this topic online, other than very general information with no specific examples.
library(RJDBC)
teachersalaries=data.frame(name=c("bob"), earnings=c(100))
dbSendUpdate(conn, "UPDATE test1 salary",teachersalaries[1,2])
AND
teachersalaries=data.frame(name=c("bob",'sally'), earnings=c(100,200))
dbSendUpdate(conn, "INSERT INTO test1 salary", teachersalaries[which(teachersalaries$earnings>200,] )
Have you tried passing a regular SQL statement like you would in other languages?
dbSendUpdate(conn, "UPDATE test1 set salary=? where id=?", teachersalary, teacherid)
or
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
Basically you specify the regular SQL DML statement using parameter markers (those question marks) and provide a list of values as comma-separated parameters.
Try this; it worked well for me.
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
You just need to pass a regular SQL statement the same way you would in any programming language. Try it out.
To update multiple rows at the same time, I have built the following function.
I have tested it with batches of up to 10,000 rows and it works perfectly.
# Libraries
library(RJDBC)
library(dplyr)
# Function upload data into database
db_write_table <- function(conn, table, df){
  # Format data to write: quote each value, wrap each row in parentheses
  batch <- apply(df, 1, FUN = function(x) paste0("'", trimws(x), "'", collapse = ",")) %>%
    paste0("(", ., ")", collapse = ",\n")
  # Build query
  query <- paste("INSERT INTO", table, "VALUES", batch)
  # Send update
  dbSendUpdate(conn, query)
}
# Push data
db_write_table(conn,"schema.mytable",mydataframe)
Thanks to the other authors.
