Handling special characters (e.g. accents) in R

I am doing some web scraping of names into a dataframe.
For a name such as "Tomáš Rosický", I get the result "TomÃ¡Å¡ RosickÃ½".
I tried
Encoding("TomÃ¡Å¡ RosickÃ½") # returns "latin1"
but I was not sure where to go from there to get the original name with accents back. I played around with iconv without success.
I would be satisfied with (and might even prefer) an output of "Tomas Rosicky".

You've read in a page encoded in UTF-8. If x is your column of names, use Encoding(x) <- "UTF-8".
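A minimal sketch of both fixes, assuming a UTF-8 locale (x stands in for the scraped column; the first assignment just simulates the scraping mishap):
x <- "Tomáš Rosický"
Encoding(x) <- "latin1"  # simulate the scrape: UTF-8 bytes mislabeled, prints "TomÃ¡Å¡ RosickÃ½"
Encoding(x) <- "UTF-8"   # re-mark the same bytes correctly, prints "Tomáš Rosický"
iconv(x, "UTF-8", "ASCII//TRANSLIT")  # "Tomas Rosicky" (TRANSLIT behavior varies by platform)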

You should use this:
df$colname <- iconv(df$colname, from="UTF-8", to="LATIN1")

To read the file correctly, use the scan function:
namb <- scan(file = 'g:/testcodering.txt', fileEncoding = 'UTF-8',
             what = character(), sep = '\n', allowEscapes = TRUE)
cat(namb)
This also works:
con <- file('g:/testcodering.txt', "r", encoding = 'UTF-8')
namc <- readLines(con)
close(con)
cat(namc)
This will read the file with the correct accents.

A way to export accents correctly:
enc2utf8(as(dataframe$columnname, "character"))
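To control the encoding of the output file itself, write.csv's fileEncoding argument may also help; a sketch with hypothetical names:
write.csv(dataframe, "names.csv", fileEncoding = "UTF-8", row.names = FALSE)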

Related

iconv() returns NA when given a string with a specific special character

I am trying to convert some strings in an input file from UTF-8 to ASCII. For most of the strings I give it, iconv() works perfectly fine. However, on some of them it returns NA. While manually fixing the issue in the file seems like the simplest option, that is unfortunately not available to me at the moment.
I have made a reproducible example of my problem, but assume that I have to find a way for iconv() to convert the string in s1 without getting NA.
Here is the reproducible example:
s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing
s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')
I get the following output:
> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"
After playing around with the code, I figured that something was wrong in the entry "Besançon" that is now copied exactly from the input file. When I input it manually myself, the problem is solved. Since I can't modify the input file at all, what do you think is the exact issue and would you have any idea on how to solve it?
Edit:
After closer inspection, there is something odd in the characters of the first line. It seems to be taken away by SO's formatting.
The best I could do to reproduce it was two screenshots. The first places my cursor just before the #; the second is after pressing delete, which should delete the whitespace... it turns out it deletes the " instead. So there is definitely something weird there.
It turns out that using sub='' actually solved the issue, although I am quite unsure why.
iconv(s1, to='ASCII//TRANSLIT', sub='')
From the documentation for sub:
character string. If not NA it is used to replace any non-convertible
bytes in the input. (This would normally be a single character, but
can be more.) If "byte", the indication is "<xx>" with the hex code of
the byte. If "Unicode" and converting from UTF-8, the Unicode point in
the form "<U+xxxx>".
So I eventually figured out that there was a character I couldn't convert (nor see) in the string and using sub was a way to eliminate it. I am still not sure what this character is though. But the problem is solved.
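If you want to see what the invisible character was instead of just dropping it, the same sub argument can reveal it; a small sketch on the s1 from above:
iconv(s1, to = "ASCII//TRANSLIT", sub = "byte")  # shows the hex code of each bad byte as "<xx>"
charToRaw(s1)                                    # or inspect the raw bytes of the string directly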
There is probably a latin1 (or other encoding) character in your supposedly utf8 file. For example:
> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
You can, for example, make one new vector or data column with the result of each candidate encoding, and if one result is NA, fall back to the next, e.g.
utf8 = iconv(x, 'utf8', 'ascii//translit')
latin1 = iconv(x, 'latin1', 'ascii//translit')
win1250 = iconv(x, 'Windows-1250', 'ascii//translit')
result = ifelse(
  is.na(utf8),
  ifelse(
    is.na(latin1),
    win1250,
    latin1
  ),
  utf8
)
If these encodings don't work, make a file with just the problem word, then use the unix/linux file command to detect the encoding, or else try some likely encodings.
In the past I have listed all of iconv's supported encodings, tried them all with lapply, and then used whichever result worked on each string. Some "from" encodings will return a non-NA but incorrect result, however, so it's best to try this on each unique character in your data in order to decide which subset of iconv's encodings to use, and in which order.
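A sketch of that brute-force approach, assuming x is your character vector:
# try every encoding iconv knows about, keeping NA for the ones that error
encs <- iconvlist()
tries <- lapply(encs, function(enc)
  tryCatch(iconv(x, enc, 'ascii//translit'), error = function(e) NA))
names(tries) <- encs
# candidate encodings: those that converted every element without NA
Filter(function(r) all(!is.na(r)), tries)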

How to recode unicode characters like "\xe9" and "<e9>" to "é" in R?

I read "csv" file where one field has values like "J\xe9rome" or "Jrome" at the same time. How to read this file to have values like "Jérome" or make characters transformation then?
I tried to use
df <- fread(file_name, encoding = "UTF-8")
but it does not work.
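No answer was recorded here, but since "\xe9" is the single-byte latin1 (Windows-1252) code for "é", one plausible fix, untested against the asker's file, is to tell fread the input is Latin-1 rather than UTF-8:
library(data.table)
df <- fread(file_name, encoding = "Latin-1")  # file_name as in the question
# or repair an already-read column by declaring its bytes latin1 (hypothetical column):
# df$field <- iconv(df$field, from = "latin1", to = "UTF-8")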

How to change the default UTF-8 encoding to LATIN1

First time caller.
I just want to change string encoding from UTF-8 to LATIN1. I use XPath to retrieve the data from the web:
>library(RCurl)
>library(rvest)
>library(XML)
>library(httr)
>library(reshape2)
>library(reshape)
>response <- GET(paste0("http://www.visalietuva.lt/imone/jogminda-uab-telsiai-muziejaus-g-35"))
>doc <- content(response,type="text/html")
>base <- xpathSApply(doc, "//ul//li//span",xmlValue)[5]
As a result I get the following:
>base
[1] "El. paÅ¡tas"
When I check the encoding I have UTF-8:
>Encoding(base)
[1] "UTF-8"
I suspect I need LATIN1 encoding. So that the result would be "El. paštas", instead of "El. paÅ¡tas".
Although when I specify the LATIN1 encoding I get the following:
>latin <- iconv(base, from = "UTF-8", to = "LATIN1")
[1] "El. paÅ¡tas"
i.e. the same result as with UTF-8. Changing the encoding does not help to get "El. paštas".
Moreover, I need the correct LATIN1 encoding of the string when saving the data to a .csv file. I tried to save the data to .csv:
write.table(latin,file = "test.csv")
and get the same strange characters as mentioned above: "El. paÅ¡tas".
Any advice on how to change the encoding would be more than welcome. Thank you.
Try
doc <- content(response,type="text/html", encoding = "UTF-8")
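For the .csv side of the question, write.table's fileEncoding argument sets the encoding of the output file; a sketch reusing the question's objects:
write.table(latin, file = "test.csv", fileEncoding = "latin1")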

Trying to convert the character encoding in a dataset

I would like to convert the character encoding of one of my variables in a dataset, but I don't understand why the iconv command is not working.
My gist file is here : https://gist.github.com/pachevalier/5850660
A few ideas come to mind:
1) You have a simple problem and are asking the wrong question -- the final line in your Gist is
Encoding(tab$fiche_communale$Nom)
Are you actually wanting:
Encoding(tab$fiche_communale$name)
2) readHTMLTable may not be reading in the character encoding correctly, in which case you could set it explicitly with Encoding(tab$fiche_communale$Nom) <- "latin1"
3) Try relying on iconv to detect the local encoding:
iconv(tab$fiche_communale$Nom, from="", to="UTF-8")

Error in tolower() invalid multibyte string

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
It seems to be French company names that are the problem with the É character. Although I have not investigated all of them (also not possible to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties and identified the encoding type.
After which I used the iconv() function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every column of the data.frame from the imported CSV. Important to note that I set stringsAsFactors=FALSE in my read.csv() call.
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")
I was getting the same error. However, in my case it wasn't when I was reading the file, but a bit later when processing it. I realised that I was getting the error, because the file wasn't read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
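Note that read.csv has two similar-looking arguments: fileEncoding re-encodes the file as it is read in, while encoding only declares the encoding of the strings it produces. If encoding = "UTF-8" alone does not help, this variant may (the path is a placeholder):
dat <- read.csv("input.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)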
library(tidyverse)
data_clean = data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.
com$uppervar<-toupper(str_trim(com$var))
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
My solution to this issue:
library(dplyr)   # pipes
library(stringi) # for stri_enc_isutf8

# Read in csv data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")

# despite specifying utf-8, the below column is not utf8:
all(stri_enc_isutf8(old_data$problem_column))

# The below code uses regular expressions to cleanse. May need to tinker
# with the last portion that selects the grammar to retain
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", old_data$problem_column)) %>%
  rename(solved_problem = problem_column)

# this column is now utf8.
all(stri_enc_isutf8(utf_eight_data$solved_problem))
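Since stringi is already loaded, a gentler alternative to deleting the offending characters is to transliterate them; a sketch on the same hypothetical column:
old_data$problem_column <- stri_trans_general(old_data$problem_column, "Latin-ASCII")
all(stri_enc_isutf8(old_data$problem_column))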
