R: Data Frame is removing every other character from nvarchar/unicode fields

I have connected R to my database using the code below:
con <- DBI::dbConnect(odbc::odbc(),
Driver = "/usr/local/lib/libmsodbcsql.17.dylib",
Server = "my server",
Database = "my db",
UID = "my uid",
PWD = "my pw",
Port = 1433)
but for every string (either table names or field values), every even character has been removed.
Example 1:
dbListTables(con)
Returns:
SLn
However, the actual table name is 'SOLine'
Example 2 - Running a query using:
query<-paste0("SELECT SOOrder.AddressLine1, SOOrder.OrderDate, SOOrder.OrderTotal FROM SOOrder WHERE SOOrder.OrderNbr=1")
test_query<-dbGetQuery(con,query)
test_query
Returns:
AddressLine1 OrderDate OrderTotal
Ts drs ie1 2019-10-28 100.00
Running the same query in SSMS returns:
AddressLine1 OrderDate OrderTotal
Test Address Line 1 2019-10-28 100.00
Therefore, integers and datetimes are not affected; it appears to be solely strings, more specifically nvarchar (varchar types are not affected). Looking into the schema, the affected columns are Unicode fields.

I have not found a fix, but a good workaround is casting each affected column in the query:
CAST(Table.Column AS varchar) AS Column
This allows the complete column to pull through successfully.
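For instance, a sketch applying that workaround to the query from Example 2 (note that CAST(... AS varchar) without a length defaults to 30 characters in SQL Server, so give it an explicit length; varchar(255) below is an assumed size, adjust to fit your data):
# Cast the nvarchar column to varchar so the characters survive the transfer
query <- paste0("SELECT CAST(SOOrder.AddressLine1 AS varchar(255)) AS AddressLine1, ",
                "SOOrder.OrderDate, SOOrder.OrderTotal ",
                "FROM SOOrder WHERE SOOrder.OrderNbr=1")
test_query <- dbGetQuery(con, query)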

Related

String length limitation using dbBind of DBI in R

I want to use DBI::dbBind to run some parameterized queries to update a SQL Server database, but the string values I write get truncated at 256 characters. With the code below I should see a string of 500 "n" characters in the database, but I only see 256.
conn <- DBI::dbConnect(odbc::odbc(), Driver = "xxx", Server = "serverx", Database = "dbx", UID = "pathx", PWD = "passwd", PORT = 1234)
query <- "UPDATE tableA SET fieldA = ? WHERE rowID = ?"
para <- list(strrep("n", 500), "id12345")
sentQuery <- dbSendQuery(conn, query)
dbBind(sentQuery, para)
dbClearResult(sentQuery)
I also tried writing the 500 "n" characters without using dbBind, and the result is fine: I see all 500. I guess this rules out some culprits, like the connection and the field definition in the database. This is the code that works.
query <- (paste0("UPDATE tableA SET fieldA = '", strrep("n", 500), "' WHERE rowID = 'id12345'"))
dbExecute(conn, query)
I found one similar question without an answer (Truncated updated string with R DBI package). However, that question didn't single out dbBind, so I am posting this for greater specificity.
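One workaround while the binding issue is unresolved is to skip dbBind and build the statement with safely quoted literals instead; a minimal sketch using DBI::sqlInterpolate, assuming the same conn, table, and columns as above:
# sqlInterpolate() escapes the values for us, avoiding manual paste0() quoting,
# and dbExecute() runs the statement without a separate bind step
query <- DBI::sqlInterpolate(conn,
                             "UPDATE tableA SET fieldA = ?value WHERE rowID = ?id",
                             value = strrep("n", 500), id = "id12345")
DBI::dbExecute(conn, query)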

How do I find the schema of a table in an ODBC connection by name?

I'm using the odbc package to connect to an MS SQL Server:
con <- dbConnect(odbc::odbc(),
Driver = "ODBC Driver 13 for SQL Server",
Server = "server",
Database = "database",
UID = "user",
PWD = "pass",
Port = 1111)
This server has many tables, so I'm using dbListTables(con) to search for the ones containing a certain substring. But once I find them I need to discover which schema they are in to be able to query them. I'm currently doing this manually (looking for the name of the table in each schema), but is there any way I can get the schema of all tables that match a string?
Consider running an SQL query with a LIKE search against the built-in INFORMATION_SCHEMA.TABLES metadata view, which lists every table alongside its schema, assuming your user has sufficient privileges.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%some string%'
Call the above from R odbc with a parameterized query for the wildcard search:
# PREPARED STATEMENT
strSQL <- paste("SELECT TABLE_SCHEMA, TABLE_NAME",
                "FROM INFORMATION_SCHEMA.TABLES",
                "WHERE TABLE_NAME LIKE ?SEARCH")
# SAFELY INTERPOLATED QUERY
query <- sqlInterpolate(con, strSQL, SEARCH = '%some string%')
# DATA FRAME BUILT FROM THE RESULT SET
schema_names_df <- dbGetQuery(con, query)
I found a workaround using the RODBC package:
library('RODBC')
# First connect to the DB
dbconn <- odbcDriverConnect(paste0("driver={ODBC Driver xx for SQL Server};",
                                   "server=server;",
                                   "database=database;",
                                   "uid=username;",
                                   "pwd=password"))
# Now fetch the DB tables
sqlTables(dbconn)
For my specific DB I get:
names(sqlTables(dbconn))
[1] "TABLE_CAT" "TABLE_SCHEM" "TABLE_NAME" "TABLE_TYPE" "REMARKS"

Add dictionary or list to Sqlite3 - Gives error operation parameter must be str

OK, I am new to SQLite and Python in general, so please be nice =)
I have a simple dictionary -
time = data[0]['timestamp']
price = data[0]['price']
myprice = {'Date':time,'price':price}
myprice looks like this (time is a timestamp) -
{'Date': 1553549093, 'price': 1.7686}
I now want to add the data to an sqlite3 database, so I created this -
#Create database if not exist...
db_filename = 'mydb_test.db'
connection = sqlite3.connect(db_filename)
#Get a SQL cursor to be able to execute SQL commands...
cursor = connection.cursor()
#Create table
sql = '''CREATE TABLE IF NOT EXISTS TEST
(PID INTEGER PRIMARY KEY AUTOINCREMENT,
DATE TIMESTAMP,
PRICE FLOAT)'''
#Now lets execute the above SQL
cursor.execute(sql)
#Insert data in sql
sql2 = ("INSERT INTO GBPCAD VALUES (?,?)", [(myprice['Date'],myprice['price'])])
cursor.execute(sql2)
cursor.commit()
connection.close()
But when executing this code I get ValueError: operation parameter must be str
What am I doing wrong?
Pass the arguments of the insert statement in execute():
sql2 = "INSERT INTO GBPCAD (DATE, PRICE) VALUES (?,?)"
cursor.execute(sql2, (myprice['Date'], myprice['price']))
Also include the names of the columns in the insert statement. Two further fixes: your CREATE TABLE statement creates TEST while the insert targets GBPCAD, so use the same table name in both places; and commit on the connection, not the cursor (connection.commit()), since cursor objects have no commit() method.

Amazon Redshift - table columns declared as varchar(max) but forced as varchar(255)

I'm coding a data extraction tool to load data from Google Search Console (GSC from now on) and store it in an Amazon Redshift (AR from now on) database. I coded a function that parses the elements of the data frame coming from GSC to determine the field structure when creating tables in AR.
This is the R function I created:
get_table_fields <- function (d) {
  r <- FALSE
  if (is.data.frame(d)) {
    r <- vector()
    # inspect only the first row to infer each column's type
    t <- d[1, ]
    c <- colnames(t)
    for (k in c) {
      v <- t[, k]
      if (is.character(v)) {
        # character columns become unbounded text fields
        r[k] <- "nvarchar(max)"
      } else if (!is.na(as.Date(as.character(v), format = "%Y-%m-%d"))) {
        # values parseable as YYYY-MM-DD become dates
        r[k] <- "date"
      } else if (is.numeric(v)) {
        # a decimal point in the printed value marks a real, otherwise an integer
        r[k] <- ifelse(grepl(".", v, fixed = TRUE), "real", "integer")
      }
    }
  }
  return(r)
}
So far, so good. I pass the full data frame and the function extracts all relevant information from the first row, giving me the structure needed to create a table on AR.
This is the code I use to extract data from GSC and write it onto AR:
# retrieve the table fields schema
s_fields <- get_table_fields(data)
# compose the table creation definition out of the fields schema
d_fields <- paste(toString(sapply(names(s_fields), function (x) {
return(sprintf('"%s" %s', x, s_fields[x]))
})))
# compose the table creation query
c_query <- sprintf("CREATE TABLE IF NOT EXISTS %s (%s);", t_table_name, d_fields)
if (nrow(data) > 0) {
# create the table if it doesn't exist
dbSendUpdate(db, c_query)
# delete previous saved records for the specified date
dbSendUpdate(db, sprintf("DELETE FROM %s WHERE date = '%s' AND gsc_domain = '%s';", t_table_name, date_range[d], config.gsc.domain))
# upload the Google Search Console (GSC) data to Amazon Redshift (AR)
dbWriteTable(db, t_table_name, data, append = TRUE, row.names = FALSE)
}
db is the database connection object, declared like this:
# initialize the Amazon Redshift JDBC driver
driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "drivers/RedshiftJDBC42-1.2.16.1027.jar", identifier.quote = "`")
# connect to the Amazon Redshift database instance
db <- dbConnect(driver, sprintf("jdbc:redshift://%s:%s/%s?user=%s&password=%s", config.ar.host, config.ar.port, config.ar.database, config.ar.user, config.ar.password))
t_table_name is a string concatenating the dimensions of the GSC extraction definition, prefixed with gsc_by and joined with underscores; so, if we wanted to extract date, page, and device, the table name would be gsc_by_date_page_device.
So, basically, this code gathers a data frame from GSC and ensures the table for the specified extraction exists, creating it if necessary; it then removes any previously saved data for the date (so a re-launched extraction doesn't duplicate entries) and stores the new data in AR.
The problem is that either the AR database or the Amazon Redshift JDBC driver seems to be forcing my column definitions to varchar(255) instead of the nvarchar(max) or varchar(max) I'm trying to write. I've tried different combinations, but the result is always the same:
<simpleError in .local(conn, statement, ...): execute JDBC update query failed in dbSendUpdate ([Amazon](500310) Invalid operation: Value too long for character type
Details:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(255)
query: 116225
location: funcs_string.hpp:395
process: padbmaster [pid=29705]
-----------------------------------------------;)>
If I print the c_query variable (the table creation query) before sending the query, it prints out correctly:
CREATE TABLE IF NOT EXISTS gsc_by_date_query_device ("date" date, "query" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_query_country_device ("date" date, "query" nvarchar(max), "country" nvarchar(max), "device" nvarchar(max), "countryName" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_page_device ("date" date, "page" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" real, "gsc_domain" nvarchar(max));
If I execute this in SQLWorkbench/J (the tool I'm using for checking), it creates the table correctly, and even so, what fails is the data insertion.
Can you give me a hint on what I'm doing wrong, or on how I can define the text columns as bigger than 256 characters? I'm having a nightmare with this, and I think I've tried everything I could.
I've written an extensive blog post explaining many of the nuances of reading and writing data to and from Amazon Redshift: https://auth0.com/blog/a-comprehensive-guide-for-connecting-with-r-to-redshift/
In particular, the best way to read data with R is using the RPostgres library, and to write data I recommend using the R package I created: https://github.com/sicarul/redshiftTools
Notably, it does not have the issue you are reporting: varchars are sized based on the length of the strings, using the calculateCharSize function: https://github.com/sicarul/redshiftTools/blob/master/R/table_definition.R#L2
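For a rough idea of that approach (a sketch only, not the package's actual implementation), you could size each varchar from the longest observed value and cap it at Redshift's 65535-byte VARCHAR maximum:
# Hypothetical helper: derive a varchar length from the data itself,
# padding for safety and capping at Redshift's VARCHAR limit of 65535 bytes
char_size <- function(x, pad = 1.5) {
  longest <- max(nchar(as.character(x)), na.rm = TRUE)
  min(ceiling(longest * pad), 65535)
}
sprintf("varchar(%d)", char_size(data$query))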
That said, as a best practice: unless it's a temporary or staging table, try to always create the table yourself, so you can control sort keys, dist keys, and compression, which are very important for performance in Amazon Redshift.
If you already have created the table, you can do something like:
rs_replace_table(data, dbcon=db, table_name=t_table_name, bucket="mybucket", split_files=4)
If you haven't created the table yet, you can do practically the same thing with rs_create_table, as sketched below.
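Assuming rs_create_table takes the same arguments as rs_replace_table above (check the package documentation to confirm), that would be:
rs_create_table(data, dbcon=db, table_name=t_table_name, bucket="mybucket", split_files=4)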
You'll need an S3 bucket and AWS keys with access to it, since this package uploads the data to S3 and then points Redshift at that bucket; that is the fastest way to bulk-upload data.

How to select database column name with a dot in it in R?

The Vertica database table I'm using has a column called: incident.date
I connect to it ok:
install.packages("RJDBC",dep=TRUE)
library(RJDBC)
vDriver <- JDBC(driverClass="com.vertica.jdbc.Driver", classPath="C:/Vertica/vertica jar/vertica-jdbc-7.0.1-0.jar")
vertica <- dbConnect(vDriver, "jdbc:vertica://127.0.0.1:5433/dir", "name", "pass")
I can pull a regular query from it:
myframe = dbGetQuery(vertica, "Select * from output_servers")
but if I want a specific column with a dot in its name, I get an error.
myframe = dbGetQuery(vertica, "Select product, incident, incident.date from output_servers")
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for Select product, incident, incident.date from output_servers ([Vertica][VJDBC](4566) ERROR: Relation "incident" does not exist)
I've tried square brackets, backticks, single and double quotes, and backslashes around the column name. I'm pretty sure it's simple, but what am I missing? Thanks!
I found it:
myframe = dbGetQuery(vertica, "Select product, incident, \"incident.date\" from output_servers")
Apparently it's Vertica that cares, not R.
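If you build queries programmatically, DBI can generate that quoting for you; a small sketch, assuming the default DBI quoting method applies to this JDBC connection:
# dbQuoteIdentifier() wraps the name in double quotes and escapes any
# embedded quotes, which is what Vertica expects for such identifiers
col <- DBI::dbQuoteIdentifier(vertica, "incident.date")
query <- paste("SELECT product, incident,", col, "FROM output_servers")
myframe <- dbGetQuery(vertica, query)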
