Read .sql into R with Spanish characters (á, é, í, ó, ú, ñ, etc.)

So, I've been struggling with this for a while now and can't seem to google my way out of it. I'm trying to read a .sql file into R; I always do that to avoid putting 100+ lines of SQL in my R scripts. I usually do this:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- read_file("path/to/script.sql")
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
However, this time my sql script has some Spanish characters in it. Say something like this:
select tree_id, tree
from forest.trees
where species = 'árbol'
When I read this script into R and run the query, it just doesn't return anything, but if I copy and paste the SQL into an R string, it works! So it seems the problem is in the line where I read the script into R.
I tried changing the string's encoding in a couple of ways:
# none of these work
query <- read_file("path/to/script.sql")
Encoding(query) <- "latin1"
query <- readLines("path/to/script.sql", encoding = "latin1")
query <- paste0(query, collapse = " ")
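In hindsight, a useful first diagnostic would have been readr's encoding guesser, which ranks likely encodings for the file's raw bytes (a debugging sketch, not one of my original attempts):
library(readr)
# Returns candidate encodings with confidence scores
guess_encoding("path/to/script.sql")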
Unfortunately I don't have a public database to offer to anyone reading this. I'm connecting to a PostgreSQL 11 database.
--- UPDATE ---
I'm on a Windows 10 machine with a US locale.
When I use the read_file function, the contents of query look fine and the Spanish characters print like they should, but when I pass the string to dbGetQuery it just doesn't fetch anything.
I tried forcing the "latin1" encoding because I read online that this often fixes Spanish characters in R. When I did that, the Spanish characters printed out wrong, so I didn't expect it to work, and it didn't.
The character values in my database are UTF-8 encoded.
Just to be completely clear, all my attempts to read the .sql script haven't worked, however this does work:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
# df actually has results
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)

The encoding argument of readLines() only declares the encoding the strings are assumed to have; it does not re-encode them. To have R actually convert the text as it is read, open the file with a file() connection and specify the encoding there. Try this instead:
filetext <- readLines(file("path/to/script.sql", encoding = "latin1"))
See this answer for more details: R: can't read unicode text files even when specifying the encoding
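If you'd rather stay in readr, an equivalent sketch (assuming the file really is Latin-1 encoded) is to pass a locale to read_file, which converts the contents while reading:
library(readr)
# locale(encoding = ...) re-encodes the file contents to UTF-8 on read
query <- read_file("path/to/script.sql", locale = locale(encoding = "latin1"))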

So after some time thinking about it, I wondered why the solution proposed by MrFlick didn't work. I checked the encoding of the file created by this chunk:
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
write_lines(query, "test.sql")
When I checked what encoding test.sql had, it turned out to be ANSI, which didn't look right to me. So I manually changed my original script.sql's encoding to ANSI, and after that it worked totally fine.
This solution, however, didn't work when I cloned my repo in an Ubuntu environment. On Ubuntu there was no problem with the original UTF-8 encoding.
Hope this helps anyone dealing with this on Windows.
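If manually re-saving every file isn't practical (for example when the same repo has to run on both Windows and Ubuntu), a more portable sketch of the same idea is to convert the query string after reading it, so it matches the session's native encoding. This is my generalization of the fix above, not something I tested against that exact database; con is the connection from the first snippet:
query <- read_file("path/to/script.sql")
# enc2native() converts UTF-8 to the native encoding (ANSI on Windows);
# on Ubuntu the native encoding is already UTF-8, so this is a no-op
query <- enc2native(query)
df <- as_tibble(dbGetQuery(con, query))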

Related

R script with Cyrillic symbols in Task Scheduler. Encoding issue

I have an R script that contains Cyrillic symbols (as filtering terms) in it. It is saved with UTF-8 encoding, and the locale option is set with Sys.setlocale('LC_ALL', "Ukrainian"). It works perfectly when I run it manually. I want to run this script through the Windows Task Scheduler. Generally, it works (and produces the resulting dataset with non-distorted symbols), but it does not filter by the terms written in Cyrillic.
I was wondering how this issue can be resolved?
The script actually looks like this:
library(tidyRSS)
library(googlesheets4)
library(dplyr)
library(plyr)
library(adapr)
library(devtools)
library(xlsx)
#devtools::install_github("cran/adapr")
Sys.setlocale('LC_ALL', "Ukrainian")
my_feed1 <- tidyfeed("https://www.vugledar-rada.gov.ua/index.php?format=feed&type=rss")
my_feed2 <- tidyfeed("https://ugledar.info/feed")
to_filter <- rbind.fill(my_feed1, my_feed2)
term <- which(grepl("город", to_filter$item_title) | grepl("город", to_filter$item_description) | grepl("місто", to_filter$item_title) | grepl("місто", to_filter$item_description))
filtered <- to_filter[c(term),]
d <- Sys.Date()
t <- Sys.time()
print("saving to the disk...")
setwd("C:\\Users\\user\\Desktop\\Hanna K\\Newsfeed")
write.xlsx(filtered, paste0("check_news__", d, ".xlsx"))
It is difficult to guess the exact source of your problem, but I strongly suspect that it has something to do with the encoding.
Try checking the encoding both in the console and in the Task Scheduler run, by printing
options("encoding")
If the encodings differ, you can set the encoding at the top of your script with:
options(encoding = "myencoding")
Confirm that the encoding did indeed change with the first command.
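For example, if the console reports UTF-8 but the scheduled run reports the native encoding, the top of the script might look like this (the value "UTF-8" is an assumption; use whatever the working console session reports):
options(encoding = "UTF-8")
# Fail fast if the option didn't take effect
stopifnot(identical(getOption("encoding"), "UTF-8"))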

Roracle special characters on ShinyServer (Linux)

There seem to be plenty of similar issues and answers to them, yet so far I haven't found a solution that works for me.
Issue in short: special characters (Scandinavian letters, the ohm symbol, the Celsius degree symbol, etc.) get all scrambled up when the application is run on a Linux Shiny Server. The data shown on the Shiny dashboard is queried from an Oracle database. The column name in these examples is "NAME" and its type is VARCHAR2. When I run similar code on the Linux server in R, or in my local Windows RStudio, all characters look fine.
What I've tried so far: characters started to look fine in Linux R after setting NLS_LANG=AMERICAN_AMERICA.AL32UTF8 in /etc/environment. I figured these are the correct NLS_LANG settings by running SELECT * FROM V$NLS_PARAMETERS and SELECT * FROM NLS_SESSION_PARAMETERS in R on Linux. This didn't fix the issue on the Shiny Server side, though.
Also, I've played around with the dbConnect encoding parameter, with no luck.
Somewhat reproducible example (sorry, I can't grant access to my Oracle server ;-) ):
library(ROracle)
ORAdrv <- dbDriver("Oracle", unicode_as_utf8 = TRUE, ora.attributes = TRUE) #doesn't matter if I have these two latter attributes or not
ORAconnect.string <- paste(
  "(DESCRIPTION=",
  "(ADDRESS=(PROTOCOL=tcp)(HOST=xx.xx.xx.xx)(PORT=xxxx))",
  "(CONNECT_DATA=(SID=...)))", sep = ""
)
query2 <- ("select NAME, DATA_FIELD from TABLE where DATA_FIELD in ('ID7018789', 'ID7025838', 'ID7021380')")
ORAcon <- dbConnect(ORAdrv, username = "...", password = "...", dbname = ORAconnect.string, encoding = "UTF-8") #doesn't matter if encoding is defined or not
res <- dbSendQuery(ORAcon, query2, 'set character set "utf8"') #doesn't matter if the last attribute is defined or not
df <- fetch(res)
dbDisconnect(ORAcon)
print(df)
What the end result looks like:
If I run the code in R on Linux, the result is as expected: the ohm and Celsius symbols and the Scandinavian characters look good.
If I run the same code and render the data frame as a datatable in the Shiny Server app, the ohm and Celsius symbols are replaced with question marks, and the Scandinavian characters äö turn into ao.
Any help on getting the encoding right on the Shiny Server application side is highly appreciated =)
I was able to solve it finally. If someone else is struggling with this: cast the column to NVARCHAR already in the query.
In my case, to_nchar(NAME) did the trick.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions187.htm
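Applied to the example above, the fix is only a change in the query text (a sketch; the table and column names are the question's placeholders):
# Cast NAME to a national character set type inside the query
query2 <- "select to_nchar(NAME) as NAME, DATA_FIELD from TABLE where DATA_FIELD in ('ID7018789', 'ID7025838', 'ID7021380')"
res <- dbSendQuery(ORAcon, query2)
df <- fetch(res)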

How to do encoding in R, and why ’ appears instead of an apostrophe (') and how to resolve it

Hi, I am trying to do text mining in R version 3.4.2.
I am trying to import .txt files from a local drive using the VCorpus command.
But after running the following code
cname <- file.path("C:", "texts")
cname
dir(cname)
library(readr)
library(tm)
docs <- VCorpus(DirSource(cname))
summary(docs)
inspect(docs[1])
writeLines(as.character(docs[1]))
Output:
Well, the election, it came out really well. Next time we’ll triple the number and so on
The ’ is originally an apostrophe ('). How can I convert it and get the original text back in RStudio?
I would appreciate it if someone could help me.
Thanks in advance.
Encoding issues are not easy to solve, since they depend on various factors (file encoding, encoding settings during loading, etc.). As a first step you might try the following line; if we are lucky, it solves your problem.
Encoding(your_text) <- "UTF-8"
Otherwise, other solutions have to be checked, e.g., using stri_trans from the stringi package, or replacing wrong symbols by brute force via gsub(falsecharacter, desiredcharacter, your_text, fixed = TRUE) (there are debugging tables for these byte sequences, e.g., on i18nqa.com).
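A sketch of that brute-force route, using the exact mojibake from the question (’ is what a right single quote's UTF-8 bytes look like when misread as Windows-1252; the function name is mine):
fix_mojibake <- function(x) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  gsub("’", "'", x, fixed = TRUE)
}
fix_mojibake("we’ll")  # "we'll"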
I solved this a different way.
I found that apostrophes that looked like this: ' would render properly, while ones that looked slightly different, like this: ’ would not.
So, for any text that I was printing, I converted ’ to ' like this:
mytext <- gsub("’", "'", mytext )
Tada... no more issues with "’".

ROracle connect and pull utf8 characters

I am connecting to an Oracle database from R using ROracle. The problem is that for every special UTF-8 character it returns a question mark; some Chinese values return a solid string of question marks. I believe this is relevant because I haven't found any other question on this site (or others) that answers this for the ROracle package.
Some questions that were the most promising include an answer for MySQL: Fetching UTF-8 text from MySQL in R returns "????", but I was unable to make this work for ROracle. This site also provided some useful information: https://docs.oracle.com/cd/E17952_01/mysql-5.5-en/charset-connection.html. Before this I was using RODBC and was easily able to configure the UTF-8 encoding.
Here is some sample code... I am sorry that unless you have an Oracle database with UTF-8 characters it may be impossible to duplicate... I also changed the host number and the SID for data privacy reasons...
library(ROracle)
drv <- dbDriver("Oracle")
# Create the connection string
host <- "10.00.000.86"
port <- 1521
sid <- "f110"
connect.string <- paste(
  "(DESCRIPTION=",
  "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
  "(CONNECT_DATA=(SID=", sid, ")))", sep = "")
con <- dbConnect(drv, username = "XXXXXXXXX",
                 password = "xxxxxxxxx", dbname = connect.string)
my.table <- dbReadTable(con, "DASH_D_PROJECT_INFO")
my.table[40, 1:3]
PROJECT_ID DATE_INPUT PROJECT_NAME
211625 2012-07-01 ??????, ?????????????????? ????? ??????, 1869?1917 [????? 3]
Any help is appreciated. I have read the entire documentation of the ROracle package, and it seemed to have a solution for writing UTF-8 characters, but not for reading them.
Okay, after several weeks I found my own answer. I hope that it will be of value to someone else.
My question is largely answered by how Oracle stores the data. If you want UTF-8 characters preserved, you need the column in the table to be an NVARCHAR, not just a VARCHAR. At that point regular data pulling and encoding work in R as expected. I was looking for the error in the wrong place.
I also want to mention one hang-up on how to write UTF-8 data from R to Oracle.
When writing, I had some values that would not convert to UTF-8. So I split the data into two parts and wrote them to an Oracle table in two steps. The results worked perfectly.
# Mark the strings as UTF-8, then split rows by whether the marking took hold
Encoding(my.data1$Project.Name) <- "UTF-8"
my.data1.1 <- my.data1[Encoding(my.data1$Project.Name) == "UTF-8", ]
my.data1.2 <- my.data1[Encoding(my.data1$Project.Name) != "UTF-8", ]
# Tell ROracle to treat this column as UTF-8 when writing
attr(my.data1.1$Project.Name, "ora.encoding") <- "UTF-8"
If you found this insightful give it an up vote so more can find it.
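A sketch of the two-step write described above ("MY_TABLE" is a placeholder name, con is the connection from the question's snippet, and this is my reading of the answer rather than tested code):
# Write the UTF-8-marked rows and the remaining rows separately
dbWriteTable(con, "MY_TABLE", my.data1.1, append = TRUE)
dbWriteTable(con, "MY_TABLE", my.data1.2, append = TRUE)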

How to convert special symbols in web scraping with R?

I am learning how to scrape the web with the XML and RCurl packages. All goes well except for one thing: special characters like ö or č are read into R incorrectly. For instance, í is read in as í. I assume the latter is some sort of HTML encoding of the former.
I have been looking for a way to convert these characters but have not found one. I am sure other people have stumbled upon this problem as well, and I suspect there must be some sort of function to convert them. Does anyone know the solution? Thanks in advance.
Here is an example of the code; sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing the page first while specifying the encoding, then reading the table, as in this answer: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") #this will preserve characters
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)  # returns a list of tables; an as.data.frame() wrapper would break tables[[6]] below
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
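If the parse worked, the non-ASCII strings should come back marked as UTF-8; a quick check (my addition, not part of the original answer):
Encoding(Sec[, 2])  # non-ASCII entries should report "UTF-8"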
