R character encoding goes wrong (English - Spanish)

I'm trying to load a dataset into R using an API that lets me run a query and returns the data I need (I can't configure anything on the server side).
I know it has something to do with encoding. When I check the string from my data frame in R, it is marked UTF-8 and prints as "Cosmética".
When I copy the source string "Cosmética", it is marked latin1.
How can I get the UTF-8 string properly formatted like the latin1 one?
I've tried this below:
Sys.setlocale("LC_ALL", "Spanish")
and tried directly on the string:
Encoding(Description) <- "latin1"
Unfortunately I can't get it to work. Any ideas are welcome! Thanks.

You can use iconv to change the encoding of the string:
iconv(mystring, to = "ISO-8859-1")
# [1] "Cosmética"
ISO 8859-1 (Latin-1) is the common character encoding in Western Europe.
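To be explicit about both sides of the conversion and verify the result, a minimal sketch (using to = "latin1" so R also marks the result; x stands in for the question's string):
x <- "Cosmética"
Encoding(x)                               # e.g. "UTF-8"
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)                               # "latin1"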

Related

I used FROM_GeoJson() to get data but the letters are broken

I used FROM_GeoJson() to get GeoJSON data from a URL.
The content is in Korean, and I set the default encoding in the RStudio global options to "UTF-8".
But the text comes out broken, like "湲몄뙂怨듭썝", which should read "길쌈공원".
Do you have any idea how to solve this problem?
(I'm not sure whether it's an encoding problem.)
url="https://icloudgis.incheon.go.kr/server/rest/services/ParkScore_Dynamic/MapServer/1/query?outFields=*&where=1%3D1&f=geojson"
file_json=FROM_GeoJson(url_file_string = url)
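One workaround to try (a sketch, not verified against this service): download the text yourself with a declared UTF-8 encoding, then pass the JSON string rather than the URL, since url_file_string also accepts a valid JSON character string:
library(geojsonR)
raw_txt <- readLines(url, encoding = "UTF-8", warn = FALSE)  # mark the download as UTF-8
file_json <- FROM_GeoJson(url_file_string = paste(raw_txt, collapse = ""))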

Read .sql into R with Spanish characters (á, é, í, ó, ú, ñ, etc.)

So, I've been struggling with this for a while now and can't seem to google my way out of it. I'm trying to read a .sql file into R; I always do that to avoid putting 100+ lines of SQL in my R scripts. I usually do this:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- read_file("path/to/script.sql")
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
However, this time my SQL script has some Spanish characters in it. Say something like this:
select tree_id, tree
from forest.trees
where species = 'árbol'
When I read this script into R and run the query, it just doesn't return anything, but if I copy and paste the SQL into an R string, it works! So it seems the problem is in the line where I read the script into R.
I tried changing the string's encoding in a couple of ways:
# none of these work
query <- read_file("path/to/script.sql")
Encoding(query) <- "latin1"
query <- readLines("path/to/script.sql", encoding = "latin1")
query <- paste0(query, collapse = " ")
Unfortunately I don't have a public database to offer to anyone reading this. I'm connecting to a PostgreSQL 11 database.
--- UPDATE ---
I'm on a Windows 10 machine with a US locale.
When I use the read_file function, the contents of query look fine; the Spanish characters print the way they should, but when I pass the query to dbGetQuery it just doesn't fetch anything.
I tried forcing the "latin1" encoding because I read online that this tends to fix Spanish characters in R. When I did, the Spanish characters printed out wrong, so I didn't expect it to work, and it didn't.
The character values in my database are encoded as UTF-8.
Just to be completely clear: none of my attempts to read the .sql script have worked, but this does work:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
# df actually has results
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
The encoding argument only tells R how to mark the strings it reads; it does not re-encode the file's contents. To have R re-encode while reading, set the encoding on the connection instead:
filetext <- readLines(file("path/to/script.sql", encoding = "latin1"))
See this answer for more details: R: can't read unicode text files even when specifying the encoding
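The distinction, as a quick sketch (the path is the placeholder from the question): readLines(path, encoding = "latin1") only declares how the returned strings are marked, while a connection created with file(path, encoding = "latin1") re-encodes the bytes to your native encoding as they are read:
con_sql <- file("path/to/script.sql", open = "rt", encoding = "latin1")
query <- paste(readLines(con_sql), collapse = "\n")
close(con_sql)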
So after some time thinking about it, I wondered why the solution proposed by MrFlick didn't work. I checked the encoding of the file created by this chunk:
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
write_lines(query, "test.sql")
After checking what encoding test.sql had, it turned out to be ANSI, which didn't look right. So I manually changed my original script.sql's encoding to ANSI. After that it worked totally fine.
This solution, however, didn't work when I cloned my repo in an Ubuntu environment. On Ubuntu there was no problem with the original UTF-8 encoding.
Hope this helps anyone dealing with this on Windows.
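A more portable sketch, not verified against the database above: read the script with an explicit encoding and tell the server which client encoding to expect, rather than re-saving the file as ANSI. client_encoding is a standard PostgreSQL setting; the path and con are the placeholders from the question:
library(tidyverse)
library(DBI)
query <- read_file("path/to/script.sql", locale = locale(encoding = "UTF-8"))
dbExecute(con, "SET client_encoding = 'UTF8'")   # assumes a PostgreSQL connection
df <- as_tibble(dbGetQuery(con, query))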

Entering and viewing Cyrillic strings in R

How to handle Cyrillic strings in R?
Sys.setlocale("LC_ALL","Polish")
dataset <- data.frame( ProductName = c('ąęćśżźół','тест') )
#Encoding(dataset) <- "UTF-8" #this line does not change anything
View(dataset)
The code above displays the second string as a sequence of <U+number> escapes. But I would like to see what I typed, тест, instead. Is there any way to do that?
This works for me, and I see the Cyrillic text in my data frame.
I think you should check what your locale is (with sessionInfo()) and whether it supports UTF-8.
Also check this link and try changing the encoding of your column:
Encoding(dataset$ProductName) <- "UTF-8"
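Note that Encoding()<- only marks the strings; enc2utf8() actually converts them from the native encoding, which may be what you want here. A minimal sketch, assuming the script is saved as UTF-8 (the column name is the one from the question):
dataset <- data.frame(ProductName = c('ąęćśżźół', 'тест'), stringsAsFactors = FALSE)
dataset$ProductName <- enc2utf8(dataset$ProductName)
Encoding(dataset$ProductName)   # should report "UTF-8" for both values
View(dataset)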

R string encoding cyrillic

It seems that I have some Cyrillic strings stored as UTF-8 in my database, but I need to restore them as Cyrillic in R.
For example, a value is stored in the database as "õÆ¿ª®Ï". What I need is Москва.
I tried a few things with iconv, but I'm not sure whether I need to double-convert the string first:
1. iconv(x, "UTF-8", "CP1251") # I get NA
2. iconv(x, "CP1251", "UTF-8") # I get ûûû \"òƸл°¸»ª¿-õƸƺ±Ð\"
I assumed I need to restore the string from UTF-8 to Cyrillic first, but I get NA.
Help appreciated.
enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.
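For reference, a hedged sketch of the usual repair when the bytes are really CP1251 but were mislabeled; the garbled sample above doesn't pin down exactly how the data was mangled, so the from value here is an assumption:
x <- "\xcc\xee\xf1\xea\xe2\xe0"           # the raw CP1251 bytes for "Москва" (illustrative)
iconv(x, from = "CP1251", to = "UTF-8")
# [1] "Москва"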

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem: I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice code point representation that my dictionary can match against disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great; it gives output like \U0001f595 that can be matched with dictionary entries when it prints in the console:
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View, I don't see \U0001f595; I see actual emojis. I think this is why FindReplace isn't working: when the output gets saved to an object, it is just represented as emojis again, so the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do that for all ~2,000 or so emojis... I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
It looks like you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like:
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"), keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
utf8::utf8_encode should really only be used if you have UTF-8 text and are trying to print it to the screen.
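A hypothetical usage sketch of the result (emojimovie, textOriginal, and the Name column are the names used in the question, not verified here): once codes holds real UTF-8 emoji strings, you can count matches in the comments directly, with no utf8_encode step:
emojimovie$test <- as.character(emojimovie$textOriginal)
hits <- vapply(codes, function(e) sum(grepl(e, emojimovie$test, fixed = TRUE)), integer(1))
names(hits) <- emojis$Name
head(sort(hits, decreasing = TRUE))   # most common emojis across the comments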
