R with PostgreSQL database

I've been trying to query data from a PostgreSQL database (managed with pgAdmin) into R for analysis. Most of the queries work, except when I add a condition to filter out most of the rows. Please find the code below:
dbGetQuery(con, 'select * from "db_name"."User" where "db_name"."User"."FirstName" = "Mani" ')
Error in result_create(conn@ptr, statement) :
Failed to prepare query: ERROR: column "Mani" does not exist
LINE 1: ...from "db_name"."User" where "db_name"."User"."FirstName" = "Mani"
^
This is the error I get. Why is it treating Mani as a column when it is just a value? Could someone please assist me?

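String literals in Postgres (and most flavors of SQL) take single quotes; double quotes are reserved for identifiers such as table and column names, which is why "Mani" was read as a column. Fixing that, and tidying the query a little, leaves us with this:
sql <- "select * from \"db_name\".\"User\" u where u.\"FirstName\" = 'Mani'"
dbGetQuery(con, sql)
Note that I introduced a table alias, u, for the User table, so that we don't have to repeat the fully qualified name in the WHERE clause. The identifiers stay double-quoted because they are mixed case, and Postgres folds unquoted identifiers to lower case.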
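As a small illustration of the two quoting rules, DBI's own quoting helpers can assemble the same statement. This is only a sketch and not part of the original answer: DBI::ANSI() is used as a stand-in connection so the snippet runs without a database, and with a live connection you would pass con instead.

```r
library(DBI)

# DBI::ANSI() is a dummy connection that applies ANSI SQL quoting rules.
ansi <- ANSI()

# Identifiers (schema, table, column) get double quotes...
tbl <- paste(dbQuoteIdentifier(ansi, "db_name"),
             dbQuoteIdentifier(ansi, "User"), sep = ".")
col <- dbQuoteIdentifier(ansi, "FirstName")

# ...while string literals get single quotes.
val <- dbQuoteString(ansi, "Mani")

sql <- paste0("select * from ", tbl, " u where u.", col, " = ", val)
cat(sql, "\n")
```

For values that come from user input, a parameterized query is usually the safer route than building the literal yourself.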

Related

Putting the first table column (ID) last, without specifying the other table columns

Background
I am using RStudio to connect R to Microsoft SQL Server Management Studio. I am reading tables into R as follows:
library(sqldf)
library(DBI)
library(odbc)
library(data.table)
TableX <- dbGetQuery(con, statement = "SELECT * FROM [dim1].[dimA].[TableX]")
This works fine for some tables. However, for most tables, which have a binary ID variable, the following happens:
TableA <- dbGetQuery(con, statement = "SELECT * FROM [dim1].[dimA].[TableA]")
Error in result_fetch(res@ptr, n) :
nanodbc/nanodbc.cpp:xxx: xxxxx: [Microsoft][ODBC SQL Server Driver]Invalid Descriptor Index
Warning message:
In dbClearResult(rs) : Result already cleared
I figured out that the problem is caused by the first column, which I can select like this:
TableA <- dbGetQuery(con, statement = "SELECT ID FROM [dim1].[dimA].[TableA]")
and looks as follows:
AlwaysLearning mentioned in the comments that this is a recurring problem (1, 2, 3). The query only works when ID is selected last:
TableA <- dbGetQuery(con, statement = "SELECT AEE, ID FROM [dim1].[dimA].[TableA]")
Updated Question
The question is essentially how I can read in the table with the ID variable last, without specifying all table variables each time (because this would be unworkable).
Possible Workaround
I thought a workaround could be to select ID as an integer:
TableA <- dbGetQuery(con, statement = "SELECT CAST(ID AS int), COL2 FROM [dim1].[dimA].[TableA]")
However, how do I select the whole table in this case?
I am an SQL beginner, but I thought I could solve it with something like this (from this link):
TableA <- dbGetQuery(con, statement = "SELECT * EXCEPT(ID), SELECT CAST(ID AS int) FROM [dim1].[dimA].[TableA]")
Here I select everything but the ID column, and then the ID column last. However, the solution I suggest is not accepted syntax.
Other links
A similar problem for java can be found here.
I believe I have found a workaround that meets your requirements using a table alias.
By assigning the alias T to the table I want to query, it allows me to select both a specific column ([ID]) as well as all columns in the aliased table without the need to explicitly specify them all by name.
This returns all columns of the table (including the ID column) as well as a copy of the ID column at the end of the table.
I then remove the first column (the original ID column) from the resulting table.
This leaves you with the desired result: all columns of a table in the order that they appear with the exception of the ID column that is placed at the end.
PS: For the sake of completeness, I have provided a template of my own DBIConnection object. You can substitute this with the specifics of your own DBIConnection object.
library(sqldf)
library(DBI)
library(odbc)
library(data.table)
con <- dbConnect(odbc::odbc(),
.connection_string = 'driver={YourDriver};
server=YourServer;
database=YourDatabase;
Trusted_Connection=yes'
)
dataframe <- dbGetQuery(con, statement= 'SELECT T.*, T.[ID] FROM [SCHEMA_NAME].[TABLE_NAME] AS T')
dataframe_scoped <- dataframe[,-1]
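The same idea can be wrapped in two small helpers. This is a sketch under assumptions: build_id_last_query and drop_first_id are hypothetical names, and the mock data frame stands in for what dbGetQuery might return, assuming the driver keeps the duplicate column name ID (if it renames the duplicate, the lookup needs adjusting).

```r
# Build "SELECT T.*, T.[ID] ..." for a table, so ID is fetched again at the end.
build_id_last_query <- function(table, id = "ID") {
  sprintf("SELECT T.*, T.[%s] FROM %s AS T", id, table)
}

# Drop the first occurrence of the ID column by name rather than by position.
drop_first_id <- function(df, id = "ID") {
  first <- match(id, names(df))
  if (is.na(first)) return(df)
  df[-first]
}

sql <- build_id_last_query("[SCHEMA_NAME].[TABLE_NAME]")

# Mock result with ID both first and last (made-up values).
mock <- data.frame(ID = 1:2, AEE = c("a", "b"), ID = 1:2, check.names = FALSE)
result <- drop_first_id(mock)
print(names(result))
```

Looking the column up by name avoids silently dropping the wrong column if a table does not have ID in the first position.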

Declaring variable in R for DBI query to MS SQL

I'm writing an R script that runs several SQL queries using the DBI package to create reports. To make this work, I need to be able to declare a variable in R (such as a Period End Date) that is then referenced from within the SQL query. Depending on how I reference the variable, I get one of two errors.
If I simply use the field name (PeriodEndDate), I get the following error:
Error in (function (classes, fdef, mtable) : unable to find an
inherited method for function ‘dbGetQuery’ for signature ‘"Microsoft
SQL Server", "character"’
If I use @ to access the field name (@PeriodEndDate), I get the following error:
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server
Driver][SQL Server]Must declare the scalar variable "@PeriodEndDate".
[Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not
be prepared.
An example query might look like this:
library(DBI) # Used for connecting to SQL server and submitting SQL queries.
library(tidyverse) # Used for data manipulation and creating/saving CSV files.
library(lubridate) # Used to calculate end of month, start of month in queries
# Define time periods for queries.
PeriodEndDate <<- ceiling_date(as.Date('2021-10-31'),'month') # Enter Period End Date on this line.
PeriodStartDate <<- floor_date(PeriodEndDate, 'month')
# Connect to SQL Server.
con <- dbConnect(
odbc::odbc(),
driver = "SQL Server",
server = "SERVERNAME",
trusted_connection = TRUE,
timeout = 5,
encoding = "Latin1")
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= @PeriodEndDate
")
I believe one way might be to use the paste function, like this:
samplequery <- dbGetQuery(con, paste("
SELECT * FROM [TableName]
WHERE OrderDate <=", PeriodEndDate))
However, that can get unwieldy if it involves several variables being referenced outside the query or in several places within the query.
Is there a relatively straightforward way to do this?
Thanks in advance for any thoughts you might have!
The mechanism in most DBI-based connections is to use ?-placeholders[1] in the query and params= in the call to DBI::dbGetQuery or DBI::dbExecute.
Perhaps this:
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
", params = list(PeriodEndDate))
In general, the mechanisms for including an R object as a data item are enumerated well in https://db.rstudio.com/best-practices/run-queries-safely/. In order of my recommendation:
1. Parameterized queries (as shown above);
2. glue::glue_sql;
3. sqlInterpolate (which uses the same ?-placeholders as #1);
4. The link also mentions "manual escaping" using dbQuoteString.
Anything else is in my mind more risky due to inadvertent SQL corruption/injection.
I've seen many questions here on SO that try to use one of the following techniques: paste and/or sprintf using sQuote or hard-coded paste0("'", PeriodEndDate, "'"). These are too fragile in my mind and should be avoided.
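To make the fragility concrete, here is a minimal sketch contrasting hand-built quoting with sqlInterpolate. The Orders table, the Customer column and the value are hypothetical, and DBI::ANSI() is used as a stand-in connection so the snippet runs without a server; the exact escaping produced with a real connection depends on the driver.

```r
library(DBI)

# Stand-in connection for quoting only; use your real `con` in practice.
ansi <- ANSI()

customer <- "O'Brien"  # hypothetical value containing a single quote

# Fragile: hand-built quoting leaves the embedded quote unescaped.
fragile <- paste0("SELECT * FROM Orders WHERE Customer = '", customer, "'")

# Safer: sqlInterpolate() escapes the value before substituting it
# for the named placeholder ?customer.
safe <- sqlInterpolate(ansi,
  "SELECT * FROM Orders WHERE Customer = ?customer",
  customer = customer)

cat(fragile, "\n")
cat(as.character(safe), "\n")
```

The first statement ends its string literal early at the apostrophe, which is exactly the kind of corruption the hard-coded approaches invite.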
My preference for parameterized queries goes beyond usability: they can also have a non-trivial impact on repeated use of the same query, since DBMSes tend to analyze and optimize a query and cache the result for the next use. Consider this:
### parameterized queries
DBI::dbGetQuery(con, "select ... where OrderDate >= ?", params=list("2020-02-02"))
DBI::dbGetQuery(con, "select ... where OrderDate >= ?", params=list("2020-02-03"))
### glue_sql
PeriodEndDate <- as.Date("2020-02-02")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2020-02-02'
DBI::dbGetQuery(con, qry)
PeriodEndDate <- as.Date("2021-12-22")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2021-12-22'
DBI::dbGetQuery(con, qry)
In the case of parameterized queries, the "query" itself never changes, so its optimized query (internal to the server) can be reused.
In the case of the glue_sql queries, the query itself changes (albeit by just a handful of characters), so most (all?) DBMSes will re-analyze and re-optimize the query. While they tend to do it quickly, and most analysts' queries are not complex, it is still unnecessary overhead, and a missed opportunity in cases where your query and/or the indices require a little more work to optimize well.
Notes:
? is used by most DBMSes but not all. Others use $name or $1 or such. With odbc::odbc(), however, it is always ? (no name, no number), regardless of the actual DBMS.
Not sure if you are using this elsewhere, but the use of <<- (vice <- or =) can encourage bad habits and/or unreliable/unexpected results.
It is not uncommon to use the same variable multiple times in a query. Unfortunately, you will need to include the variable multiple times, and order is important. For example,
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
or (SomethingElse = ? and OrderDate > ?)
", params = list(PeriodEndDate, 99, PeriodEndDate))
If you have a list/vector of values and want to use SQL's IN operator, then you have two options, my preference being the first (for the reasons stated above):
Create a string of question marks and paste it into the query. (Yes, this is pasting into the query, but we are not dealing with the risk of incorrectly single-quoting or double-quoting. Since DBI does not support any other mechanism, this is what we have.)
MyDates <- c(..., ...)
qmarks <- paste(rep("?", length(MyDates)), collapse=",")
samplequery <- dbGetQuery(con, sprintf("
SELECT * FROM [TableName]
WHERE OrderDate IN (%s)
", qmarks), params = as.list(MyDates))
glue_sql supports expanding internally:
MyDates <- c(..., ...)
qry <- glue::glue_sql("
SELECT * FROM [TableName]
WHERE OrderDate IN ({MyDates*})", .con=con)
DBI::dbGetQuery(con, qry)
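To see what each option actually sends to the server, the following sketch renders both statements for a concrete vector. The dates are example values, and DBI::ANSI() is assumed as a stand-in for con so that glue_sql can run without a database.

```r
library(DBI)

MyDates <- c("2021-10-01", "2021-10-15", "2021-10-31")  # example values

# Option 1: one "?" per value; the values travel separately via params=.
qmarks <- paste(rep("?", length(MyDates)), collapse = ",")
sql_params <- sprintf("SELECT * FROM [TableName] WHERE OrderDate IN (%s)", qmarks)

# Option 2: glue_sql() quotes each value and expands the vector inline.
sql_glue <- glue::glue_sql(
  "SELECT * FROM [TableName] WHERE OrderDate IN ({MyDates*})",
  .con = ANSI())

cat(sql_params, "\n")
cat(as.character(sql_glue), "\n")
```

With option 1 the statement text only changes when the number of values changes, which keeps most of the caching benefit described above.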

RODBC gives proper row count but yields empty query

Using R-3.5.0 and RODBC v. 1.3-15 on Windows.
I am trying to query data from a remote database. I can connect fine, and if I run a query to count the rows, the answer comes out correctly. But if I replace select count(*) with select * to actually get the data, I get an empty result (with some rather strange headers). Only two of the column names come out correctly and the rest are question marks and a number (as shown below). I can use SQL Developer to query the data with no problem.
I include the simplest version of the code below but I get the same results if I try to limit to just a few rows or certain conditions, etc. Sorry I cannot create a reproducible example but as this is a remote db and I have no idea what the problem is, I'm not sure how I could even do that.
I can query other tables from different schemas within the same odbc connection, so I don't think it is that. I have tried with and without the believeNRows and the rows_at_time.
Thank you for any thoughts.
channel <- odbcConnect("mydb", uid="myuser", pwd="mypass", believeNRows=FALSE,rows_at_time = 1)
myquery <- paste("select count(*) from MYSCHEMA.MYTABLE")
sqlQuery(channel, myquery)
COUNT(*)
1 149712361
myquery <- paste("select * from MYSCHEMA.MYTABLE")
sqlQuery(channel, myquery)
[1] ID FMC_IN_ID ? ?.1 ?.2 ?.3 ?.4 ?.5 ?.6 ?.7 ?.8 ?.9 ?.10 ?.11 ?.12 ?.13 ?.14 ?.15
<0 rows> (or 0-length row.names)
I would try the following:
add a simple limit 100 to your query to see if you can get some data back
add the believeNRows option to the sqlQuery call -- in my experience it is needed at that level
In case it helps others, the problem was that the database contained an Oracle spatial field (MDSYS.SDO_GEOMETRY). R did not know what to do with it. I assumed it would just convert it to a character but instead it just got confused. By omitting the spatial field, the query worked fine.
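One way to omit such a field without typing every other column is to build the select list from column metadata. The sketch below is an assumption-laden illustration: select_without_types is a hypothetical helper, the cols data frame only mimics the COLUMN_NAME and TYPE_NAME columns that RODBC::sqlColumns() returns, and the SHAPE column and the type names are made up, so check what your driver actually reports.

```r
# Build a SELECT that skips columns whose type R cannot handle.
select_without_types <- function(cols, table, skip_types) {
  keep <- cols$COLUMN_NAME[!cols$TYPE_NAME %in% skip_types]
  sprintf("select %s from %s", paste(keep, collapse = ", "), table)
}

# Made-up metadata for illustration; with a live channel this would come
# from something like sqlColumns(channel, "MYTABLE", schema = "MYSCHEMA").
cols <- data.frame(
  COLUMN_NAME = c("ID", "FMC_IN_ID", "SHAPE"),
  TYPE_NAME   = c("NUMBER", "NUMBER", "SDO_GEOMETRY"),
  stringsAsFactors = FALSE
)

myquery <- select_without_types(cols, "MYSCHEMA.MYTABLE", "SDO_GEOMETRY")
print(myquery)
```

The resulting string can then be passed to sqlQuery(channel, myquery) as in the question.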

Querying mixed case columns in SQL with R

I have a mixed case column in my_table that can only be queried using double quotes in psql. For example:
select "mixedCase" from my_table limit 5; would be the correct way to write the query in psql, and this returns records successfully
However, I am unable to replicate this query in R:
I have tried the following:
dbGetQuery(con, "SELECT '\"mixedCase\"' from my_table limit 5;")
which throws: RS-DBI driver warning: (unrecognized PostgreSQL field type unknown (id:705) in column 0)
dbGetQuery(con, "SELECT 'mixedCase' from my_table limit 5;")
which throws: RS-DBI driver warning: (unrecognized PostgreSQL field type unknown (id:705) in column 0)
dbGetQuery(con, "SELECT "mixedCase" from my_table limit 5;")
which throws Error: unexpected symbol in "dbGetQuery(con, "SELECT "mixedCase"
What is the solution for mixed case columns with the RPostgreSQL package?
You seem to understand the problem, yet you never actually tried just using the literal correct query in R. Just escape the double quotes in the query string and it should work:
dbGetQuery(con, "SELECT \"mixedCase\" from my_table limit 5;")
Your first two attempts would have failed because you are passing in mixedCase as a string literal, not as a column name. And the third attempt would fail on the R side because you are passing in a broken string/code.
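If the backslashes get hard to read, R offers other ways to write the same string. The sketch below shows three equivalent spellings; the raw-string form assumes R 4.0 or later.

```r
# Three ways to write the same SQL text in R.
q_escaped <- "SELECT \"mixedCase\" from my_table limit 5;"   # escaped double quotes
q_single  <- 'SELECT "mixedCase" from my_table limit 5;'     # single-quoted R string
q_raw     <- r"(SELECT "mixedCase" from my_table limit 5;)"  # raw string, R >= 4.0

# All three hold identical characters.
stopifnot(identical(q_escaped, q_single), identical(q_escaped, q_raw))

# cat() shows the text as the database will see it.
cat(q_escaped, "\n")
```

Any of the three can be passed to dbGetQuery(con, ...) unchanged.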

How do I run a SQL update statement in RODBC?

When trying to run an update with a SQL statement with the sqlQuery function in RODBC, it brings up an error
"[RODBC] ERROR: Could not SQLExecDirect '.
How do you run a direct update statement with R?
You cannot use a plain SQL update statement with the sqlQuery function as-is, because the statement needs to return a result set. For example, the following statement won't work:
sql="update mytable set column=value where column=value"
cn <-odbcDriverConnect(connection="yourconnectionstring")
resultset <- sqlQuery(cn,sql)
But if you add an output clause, the sqlQuery function will work fine. For example:
sql="update mytable set column=value output inserted.column where column=value"
cn <-odbcDriverConnect(connection="yourconnectionstring")
resultset <- sqlQuery(cn,sql)
I just added a function to make it easy to take your raw sql and quickly turn it into an update statement.
setUpdateSql <- function(updatesql, wheresql, output = "inserted.*") {
  sql <- paste(updatesql, " output ", output, wheresql)
  sql <- gsub("\n", " ", sql)  # remove new lines if they appear in sql
  return(sql)
}
So now I just need to split the SQL statement and it will run. I could also add an "inserted.columnname" if I didn't want to return the whole thing.
sql=setUpdateSql("update mytable set column=value","where column=value","inserted.column")#last parameter is optional
cn <-odbcDriverConnect(connection="yourconnectionstring")
resultset <- sqlQuery(cn,sql)
The other advantage with this method is you can find out what has changed in the resultset.
