I'm trying to write Unicode strings from R to SQL, and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to work when I load the table back into R, and not when I view the table in SSMS or Power BI.
require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
.connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- data_frame(Characters = "❤")
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters
Successfully yields:
> result$Characters
[1] "❤"
However, when I pull that table in SSMS:
SELECT * FROM TestUnicode
I get two different characters:
Characters
~~~~~~~~~~
â¤
Those characters are also what appear in Power BI. How do I correctly pull the heart character outside of R?
It turns out this is a bug somewhere in R/DBI/the ODBC driver. The issue is that R stores strings as UTF-8, while SQL Server stores them as UTF-16LE. On top of that, when dbWriteTable creates a table it creates a VARCHAR column for strings by default, which cannot hold Unicode characters at all. You therefore need to do both of the following:
Change the column in the R data frame from a character column to a list column of UTF-16LE raw bytes.
When calling dbWriteTable, specify the field type as NVARCHAR(MAX).
This seems like something that should still be handled by either DBI or ODBC or something though.
require(odbc)
require(DBI)

# This function takes a string vector and turns it into a list of raw UTF-16LE bytes.
# These will be needed to load into SQL Server
convertToUTF16 <- function(s){
  lapply(s, function(x) unlist(iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)))
}

# create a connection to a sql table
connectionString <- "[YOUR CONNECTION STRING]"
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = connectionString)

# our example data
testData <- data.frame(ID = c(1,2,3), Char = c("I", "❤", "Apples"), stringsAsFactors = FALSE)

# we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
testData$Char <- convertToUTF16(testData$Char)

# write the table to the database, specifying the field type
dbWriteTable(con,
             "UnicodeExample",
             testData,
             append = TRUE,
             field.types = c(Char = "NVARCHAR(MAX)"))

dbDisconnect(con)
Inspired by the last answer and the GitHub issue r-dbi/DBI#215: Storing unicode characters in SQL Server.
I followed the field.types = c(Char = "NVARCHAR(MAX)") approach, but built the vector and computed the maximum length per column instead, because nvarchar(max) triggered the "dbReadTable/dbGetQuery returns Invalid Descriptor Index" error:
vector_nvarchar <- c(Filter(Negate(is.null),
  lapply(testData, function(x) {
    if (is.character(x)) c(
      names(x),
      paste0("NVARCHAR(",
             max(
               # nvarchar(max) gave the "dbReadTable/dbGetQuery returns Invalid Descriptor Index"
               # error on SQL Server (https://github.com/r-dbi/odbc/issues/112),
               # so we compute the maximum length instead
               nchar(
                 # nchar() doesn't work reliably on UTF-8 (see help(nchar)), so convert to
                 # ASCII first, substituting non-convertible bytes with "x"
                 iconv(Filter(Negate(is.null), x), "UTF-8", "ASCII", sub = "x")
               ),
               na.rm = TRUE),
             ")")
    )
  })
))
con <- DBI::dbConnect(odbc::odbc(), .connection_string = xxxxt, encoding = 'UTF-8')
DBI::dbWriteTable(con, "UnicodeExample", testData, overwrite = TRUE, append = FALSE, field.types = vector_nvarchar)
DBI::dbGetQuery(con, iconv('select * from UnicodeExample'))
Inspired by the last answer, I also tried to find an automated way of writing data frames to SQL Server. I cannot confirm the nvarchar(max) errors, so I ended up with these functions:
library(rlist)  # provides list.cbind()

convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"],
                  list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
                  }))
  )[colnames(df)]
  return(output)
}

field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}

DBI::dbWriteTable(odbc_connect,
                  name = SQL("database.schema.table"),
                  value = convertToUTF16_df(df),
                  overwrite = TRUE,
                  row.names = FALSE,
                  field.types = field_types(df))
I found the previous answer very useful but ran into problems with character vectors that had another encoding such as 'latin1' instead of UTF-8. This resulted in random NULLs in the database column due to special characters such as non-breaking spaces.
In order to avoid these encoding issues, I've made the following modifications to detect the character vector encoding or otherwise default back to UTF-8 before conversion to UTF-16LE:
library(rlist)
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"],
                  list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) {
                      if (Encoding(y) == "unknown") {
                        unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
                      } else {
                        unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
                      }
                    }))
                  }))
  )[colnames(df)]
  return(output)
}

field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}

DBI::dbWriteTable(odbc_connect,
                  name = SQL("database.schema.table"),
                  value = convertToUTF16_df(df),
                  overwrite = TRUE,
                  row.names = FALSE,
                  field.types = field_types(df))
Ideally, I'd still modify this to remove the rlist dependency but it seems to work now.
You could consider using the package RODBC instead of odbc/DBI. I have used RODBC with SQL Server and with Microsoft Access as permanent data storage systems and never had trouble with German umlauts (e.g. Ä, ä, ..., ß).
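For reference, a minimal RODBC sketch of the same round trip (the connection string is a placeholder for your own setup, and forcing nvarchar(max) via varTypes is my assumption, following the same NVARCHAR reasoning as the answers above):
library(RODBC)

# Placeholder connection string; adjust driver/server/database to your environment
ch <- odbcDriverConnect("DRIVER={ODBC Driver 13 for SQL Server};SERVER=myserver;Database=Test;trusted_connection=yes;")

testData <- data.frame(Characters = "\u2764", stringsAsFactors = FALSE)  # the heart character

# varTypes asks sqlSave to create an NVARCHAR column so the table can hold Unicode at all
sqlSave(ch, testData, tablename = "TestUnicodeRODBC", rownames = FALSE,
        varTypes = c(Characters = "nvarchar(max)"))

sqlQuery(ch, "SELECT * FROM TestUnicodeRODBC")
odbcClose(ch)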
I wonder whether using iconv is an appealing alternative, as there seem to be some '\x00' issues (e.g. https://www.r-bloggers.com/2010/06/more-powerful-iconv-in-r/).
I am posting this answer as an extension to the top answer, because some people might find it useful.
If you need Unicode strings in SQL statements such as INSERT or UPDATE, where you cannot use dbWriteTable(), you can construct your query with dbBind() like this:
x <- "äöü"
x <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)

q <- "
update foobar
set umlauts = ?
where id = 1
"

query <- DBI::dbSendStatement(con, q)
DBI::dbBind(query, list(x))
DBI::dbClearResult(query)
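The same pattern works for an INSERT; a minimal sketch reusing the foobar/umlauts names from the statement above (the id value is just an example):
x <- iconv("äöü", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)

q <- "insert into foobar (id, umlauts) values (?, ?)"

stmt <- DBI::dbSendStatement(con, q)
DBI::dbBind(stmt, list(2L, x))   # one list element per placeholder, in order
DBI::dbClearResult(stmt)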
Related
I'm just starting my journey with R, so I'm a complete newbie and I can't find anything that will help me solve this.
I have a CSV table (random integers in each column) with 9 columns. I read 8 of them and want to append them to a SQL table with 8 fields (Col1 ... Col8, all ints). After loading the CSV into RStudio, it looks right and has only 8 columns.
The code I'm using is:
# Libraries
library(DBI)
library(odbc)
library(tidyverse)

# CSV Files
df = head(
  read_delim(
    "C:/Data/test.txt",
    " ",
    trim_ws = TRUE,
    skip = 1,
    skip_empty_rows = TRUE,
    col_types = cols('X7' = col_skip())
  ),
  -1
)
# Add Column Headers
col_headings <- c('Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8')
names(df) <- col_headings
# Connect to SQL Server
con <- dbConnect(odbc(), "SQL", timeout = 10)
# Append data
dbAppendTable(conn = con,
schema = "tmp",
name = "test",
value = df,
row.names = NULL)
I'm getting this error message:
> Error in result_describe_parameters(rs@ptr, fieldDetails) :
>   Query requires '8' params; '18' supplied.
I ran into this issue also. I agree with Hayward Oblad: the dbAppendTable function appears to be finding another table of the same name and throwing the error. Our solution was to specify the name parameter as an Id() (from DBI::Id()).
So taking your example above:
# Append data
dbAppendTable(conn = con,
name = Id(schema = "tmp", table = "test"),
value = df,
row.names = NULL)
Ran into this issue...
Error in result_describe_parameters(rs@ptr, fieldDetails) :
  Query requires '6' params; '18' supplied.
when saving to a snowflake database and couldn't find any good information on the error.
Turns out that there was a test schema where the tables within the schema had exactly the same names as in the prod schema. DBI::dbAppendTable() doesn't differentiate between the schemas, so until those tables in the test schema were renamed to unique table names, the params error persisted.
Hope this saves someone the 10 hours I spent trying to figure out why DBI was throwing the error.
See here for more on this.
ODBC/DBI in R will not write to a table with a non-default schema.
Add name = Id(schema = "my_schema", table = "table_name") to DBI::dbAppendTable(),
or, in my case, to DBI::dbWriteTable().
Not sure why the function is not using the schema from my connection object though; it seems redundant.
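A minimal sketch of that fix, assuming an existing connection con and data frame df (my_schema and table_name are placeholders, as above):
# Fully qualify the target table so the connection's default schema is not used
DBI::dbWriteTable(con,
                  name = DBI::Id(schema = "my_schema", table = "table_name"),
                  value = df,
                  append = TRUE,
                  row.names = FALSE)

# The same name = Id(...) argument works for appends
DBI::dbAppendTable(con,
                   name = DBI::Id(schema = "my_schema", table = "table_name"),
                   value = df)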
I want to read a table from my Postgres database (default encoding UTF8), and since it is actually a PostGIS table I am using the sf package with st_read.
test_sf <- st_read(con, stringsAsFactors = FALSE, layer = "test_df")
Running the command returns the message "type is 146", for which I couldn't find an explanation. According to the ISO standard there is no geometry type with code 146, but that's another story.
Looking at the data read into test_sf I can see that the character encoding goes wrong. I have strings with letters such as 'ø'. The particular 'ø' is shown in RStudio as 'Ã¸'. Trying to solve that, I looked at the encoding of the particular column in the data frame:
Encoding(test_sf[["status"]])
but the result shows only unknown. Setting the encoding with Encoding(test_sf[["status"]]) <- "latin1" does show an encoding of latin1 for all the strings with special characters, but all those without special characters stay at unknown. Even worse, View(test_sf) still shows 'Ã¸' instead of 'ø'.
When I look at the database table with DBeaver the character encoding is correct. Reading the data in Python, the encoding is also correct. Since I want to show the data in a plot in Shiny, I want to use R.
How can I get the correct character encoding for my data?
I only just saw your question, because I recently faced the same problem. So I wrote this function, which converts the character columns (because those were the incorrect columns in my case) to the encoding that you choose. In my case I use UTF-8, but you can change it.
readable <- function(df){
  x <- sapply(df, class)
  if (class(df)[1] == "sf") {      # if it's an sf object
    lst <- list()
    for (i in names(x)) {
      if (x[i] == "character") {   # only the columns of character type need fixing
        lst[[i]] <- (df[i])[1][[1]]
        Encoding(lst[[i]]) <- "UTF-8"
        df[i] <- lst[[i]]
      }
    }
  } else {                         # if it's a data.frame
    for (i in names(x)) {
      if (x[i] == "character") {
        Encoding(df[, i]) <- "UTF-8"
      }
    }
  }
  return(df)
}
Afterwards you just have to run
test_sf <- readable(test_sf) to convert the encoding.
As RJDBC is the only package I have been able to make work on Ubuntu, I am trying to use it to INSERT a CSV-file into a database.
I can make the following work:
# Connecting to database
library(RJDBC)
drv <- JDBC('com.microsoft.sqlserver.jdbc.SQLServerDriver', 'drivers/sqljdbc42.jar', identifier.quote="'")
connection_string <- "jdbc:sqlserver://blablaserver;databaseName=testdatabase"
ch <- dbConnect(drv, connection_string, "username", "password")
# Inserting a row
dbSendQuery(ch, "INSERT INTO cpr_esben.CPR000_Startrecord (SORTFELT_10,OPGAVENR,PRODDTO,PRODDTOFORRIG,opretdato) VALUES ('TEST', 123, '2012-01-01', '2012-01-01', '2012-01-01')")
The INSERT works. Next I try to INSERT a CSV file with the same data, separated by the default "tab" delimiter; I am working on Windows.
# Creating csv
df <- data.frame(matrix(c('TEST', 123, '2012-01-01', '2012-01-01', '2012-01-01'), nrow = 1), stringsAsFactors = F)
colnames(df) <- c("SORTFELT_10","OPGAVENR","PRODDTO","PRODDTOFORRIG","opretdato")
class(df$SORTFELT_10) <- "character"
class(df$OPGAVENR) <- "character"
class(df$PRODDTO) <- "character"
class(df$PRODDTOFORRIG) <- "character"
class(df$opretdato) <- "character"
write.table(df, file = "test.csv", col.names = FALSE, quote = FALSE)
# Inserting CSV to database
dbSendQuery(ch, "INSERT cpr_esben.CPR000_Startrecord FROM 'test.csv'")
Unable to retrieve JDBC result set for INSERT cpr_esben.CPR000_Startrecord FROM 'test.csv' (Incorrect syntax near the keyword 'FROM'.)
Do you have any suggestions as to what I am doing wrong when trying to insert the CSV file? I do not understand the Incorrect syntax near the keyword 'FROM' error.
What if you create a statement from your data? Something like:
# Data from your example
df <- data.frame(matrix(c('TEST', 123, '2012-01-01', '2012-01-01', '2012-01-01'), nrow = 1), stringsAsFactors = F)
colnames(df) <- c("SORTFELT_10","OPGAVENR","PRODDTO","PRODDTOFORRIG","opretdato")
class(df$SORTFELT_10) <- "character"
class(df$OPGAVENR) <- "character"
class(df$PRODDTO) <- "character"
class(df$PRODDTOFORRIG) <- "character"
class(df$opretdato) <- "character"
# Formatting rows to insert into the SQL statement
# (single quotes around the values, which SQL Server expects for string literals)
rows <- apply(df, 1, function(x){ paste0("'", x, "'", collapse = ', ') })
rows <- paste0('(', rows, ')')

# SQL statement
statement <- paste0(
  "INSERT INTO cpr_esben.CPR000_Startrecord (",
  paste0(colnames(df), collapse = ', '),
  ')',
  ' VALUES ',
  paste0(rows, collapse = ', ')
)

dbSendQuery(ch, statement)
This should work for any number of rows in your df
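For the single-row df above, the generated statement comes out roughly like this (assuming single-quoted literals, which is what SQL Server expects for strings):
cat(statement)
#> INSERT INTO cpr_esben.CPR000_Startrecord (SORTFELT_10, OPGAVENR, PRODDTO, PRODDTOFORRIG, opretdato) VALUES ('TEST', '123', '2012-01-01', '2012-01-01', '2012-01-01')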
RJDBC is built on DBI, which has many useful functions to do tasks like this. What you want is dbWriteTable. Syntax would be:
dbWriteTable(ch, 'cpr_esben.CPR000_Startrecord', df, append = TRUE)
and would replace your write.table line.
I am not that familiar with RJDBC specifically, but I think the issue with your dbSendQuery is that you are referencing test.csv inside your SQL statement: that statement is executed by the database server, which does not look in your R working directory, so it cannot locate the file you created with write.table.
Have you tried loading the file directly into the database, as below?
library(RJDBC)
drv <- JDBC("connections")
conn <- dbConnect(drv,"...")
query = "LOAD DATA INFILE 'test.csv' INTO TABLE test"
dbSendUpdate(conn, query)
You can also add other clauses at the end, such as a field delimiter: for example '|' for a .txt file or ',' for a CSV file.
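A sketch of what such a statement could look like. Note that LOAD DATA INFILE is MySQL syntax; for SQL Server, which this question targets, the rough equivalent is BULK INSERT. In either case the path must be readable by the database server itself, not just by your R session, and the terminators must match how the file was written (the paths and delimiters below are placeholders):
# MySQL-style, with an explicit field delimiter
dbSendUpdate(conn, "LOAD DATA INFILE 'test.csv' INTO TABLE test FIELDS TERMINATED BY ','")

# SQL Server-style equivalent (server-side path)
dbSendUpdate(conn, "BULK INSERT cpr_esben.CPR000_Startrecord
                    FROM 'C:\\data\\test.csv'
                    WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')")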
I have a user-defined function:
library(splitstackshape)  # for concat.split.multiple()

xml2csv <- function(inputFile, outputFile) {
  X <- read.table(inputFile, header = FALSE, fill = TRUE, sep = " ")
  # step 1: separate cells by one or more spaces
  Y <- concat.split.multiple(X, as.vector(colnames(X)), "")
  # step 2: separate cells by ":"
  Z <- concat.split.multiple(Y, as.vector(colnames(Y))[-c(1:2)], ":")
  # delete repeated rows
  U <- Z[!Z[, 1] == "__REPEAT__", ]
  # convert factor columns to character
  V <- data.frame(lapply(U, as.character), stringsAsFactors = FALSE)
  W <- V
  W[is.na(W)] <- 0
  write.csv(W, outputFile, quote = FALSE, row.names = FALSE)
}
It works perfectly fine for a small inputFile; when the input file is big (>2000 KB), the following error appears:
Error in textConnection(text, encoding = "UTF-8") : all connections are in use
Any suggestions?
It's hard to tell what's going on exactly (your error is not reproducible!), but my money is on it being a bug/limitation of concat.split.multiple when you have too many columns in X, since
it calls concat.split on each of the columns in lapply,
which uses textConnection internally,
which creates a temporary file,
which uses a file handler,
which is a limited resource in most operating systems :) (see the quick demonstration after this list).
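A quick way to see that limit for yourself; this is just a demonstration of R's cap on open connections (typically 128), not the original code path:
# Open text connections until R refuses to create more, then release them all
cons <- lapply(1:200, function(i) try(textConnection("x"), silent = TRUE))
sum(sapply(cons, inherits, "try-error"))  # the attempts beyond the limit fail with
                                          # "all connections are in use"
closeAllConnections()                     # free the handles again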
This is driving me mad.
I have a csv file "hello.csv"
a,b
"drivingme,mad",1
I just want to convert this into a SQLite database from within R (I need to do this because the actual file is 10 GB and won't fit into a data.frame, so I will use SQLite as an intermediate datastore).
dbWriteTable(conn = dbConnect(SQLite(), dbname = "c:/temp/data.sqlite3"),
             name = "data",
             value = "c:/temp/hello.csv",
             row.names = FALSE, header = TRUE)
The above code failed with error
Error in try({ :
RS-DBI driver: (RS_sqlite_import: c:/temp/hello.csv line 2 expected 2 columns of data but found 3)
In addition: Warning message:
In read.table(fn, sep = sep, header = header, skip = skip, nrows = nrows, :
incomplete final line found by readTableHeader on 'c:/temp/hello.csv'
How do I tell it that a comma (,) inside quotes "" should be treated as part of the string and not as a separator?
I tried adding the argument
quote="\""
but it didn't work. Help!! read.csv works just fine; it only fails when reading large files.
Update
A much better way now is to use readr's chunked functions, e.g.:
#setting up sqlite
con_data = dbConnect(SQLite(), dbname="yoursqlitefile")
readr::read_delim_chunked(file, function(chunk) {
dbWriteTable(con_data, chunk, name="data", append=TRUE )) #write to sqlite
})
Original, more cumbersome way
One way to do this is to read the file in chunks, since read.csv works but just cannot load the whole dataset into memory.
n <- 100000                  # experiment with this number
f <- file(csv)
con <- open(f)               # open a connection to the file
data <- read.csv(f, nrows = n, header = TRUE)
var.names <- names(data)

# setting up sqlite
con_data <- dbConnect(SQLite(), dbname = "yoursqlitefile")

while (nrow(data) == n) {    # keep going while we have not reached the end of the file
  dbWriteTable(con_data, data, name = "data", append = TRUE)   # write to sqlite
  data <- read.csv(f, nrows = n, header = FALSE)
  names(data) <- var.names
}
close(f)

if (nrow(data) != 0) {
  dbWriteTable(con_data, data, name = "data", append = TRUE)
}
Improving the proposed answer:
data_full_path <- paste0(data_folder, data_file)
con_data <- dbConnect(SQLite(),
dbname=":memory:") # you can also store in a .sqlite file if you prefer
readr::read_delim_chunked(file = data_full_path,
callback =function(chunk,
dummyVar # https://stackoverflow.com/a/42826461/9071968
) {
dbWriteTable(con_data, chunk, name="data", append=TRUE ) #write to sqlite
},
delim = ";",
quote = "\""
)
(The other, current answer with readr does not work: parentheses are not balanced, the chunk function requires two parameters, see https://stackoverflow.com/a/42826461/9071968)
You make a parser to parse it.
string = yourline[i];
if (string.equals(",")) string = "%40";
yourline[i] = string;
or something of that nature. You could also use:
string.split(",");
and rebuild your string that way. That's how I would do it.
Keep in mind that you'll have to "de-parse" it when you want to get the values back. Commas in SQL separate columns, so stray commas can really screw things up, not to mention JSONArrays or JSONObjects.
Also keep in mind that this might be very costly for 10 GB of data, so you might want to parse the input before it even gets to the CSV, if possible.