Text encoding issues in R

I'm doing text mining in R with Spanish documents and I keep running into encoding issues, as well as inconsistencies in the workarounds I came up with. I have searched through different topics but can't find a clear solution, and the fact that things behave differently every time probably means that I don't really understand the problem.
I extracted text data from a PDF using pdf_text (package pdftools), and the accented characters are shown as Unicode escapes, e.g. "<U+00ED>". However, when I try to substitute these with the normal characters using gsub (or find them with grepl), R doesn't find anything. The output looks something like this:
> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
However, if you enter these strings manually, R does find them and substitutions are possible:
> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] TRUE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
Why is this happening? What is R actually reading, so that it sees the two as different?
In the past I used to write an adapt function that basically substitutes these sequences with the normal characters, but I'm finding that in some cases it doesn't work, and I don't really understand why. Also, when the data comes from OCR, the mess is even bigger, and the translation to UTF-8 differs constantly, etc.
Does anyone know of any general approach that solves this stuff? I'll be working extensively with this in the future.
Thanks a lot.
P.S.:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] pdftools_1.4 stringr_1.2.0
loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 Rcpp_0.12.12
[5] stringi_1.1.5

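The problem here is that your locale is set to C, so R will refuse to print non-ASCII characters and shows them as escapes such as <U+00F3> instead. The string itself still contains the real character, not the literal text "<U+00F3>", which is why grepl and gsub find nothing when you search for that text. If you change your locale to one that allows printing Unicode, then you will see the characters that you expect.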
txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"
Sys.setlocale("LC_CTYPE", "C") # switch character type locale to "C"
## [1] "C"
print(txt)
## [1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
Sys.setlocale("LC_CTYPE", "") # switch to native locale
## [1] "en_US.UTF-8"
print(txt)
## [1] "Comunidad Autónoma: País Vasco"
Here's how to replace the "o with acute accent" character:
gsub("\u00F3", "o", txt)
## [1] "Comunidad Autonoma: País Vasco"
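To check what the string really contains, independent of how print() renders it, something along these lines may help. It is only a sketch: strip_acute is a hypothetical helper name (not part of any package), and it covers just the five lowercase acute vowels.

```r
txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"

# "<U+00F3>" is only how print() renders the character in a C locale;
# that literal ASCII text is not part of the string itself.
has_escape_text <- grepl("<U+00F3>", txt, fixed = TRUE)
has_real_char   <- grepl("\u00F3", txt, fixed = TRUE)

# Code points of the characters actually stored in the string
code_points <- utf8ToInt(txt)

# Hypothetical helper: map lowercase acute vowels to plain ASCII vowels
strip_acute <- function(x) {
  chartr("\u00E1\u00E9\u00ED\u00F3\u00FA", "aeiou", x)
}
plain <- strip_acute(txt)
```

Writing the accented characters as \u escapes keeps the script itself pure ASCII, so it should not depend on the encoding the source file is saved in.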

Related

Broken encoding UTF-8 when use Encoding() and tokens()

I've got a rather strange problem with encoding. When I run Encoding(txt) <- "UTF-8", the encoding breaks and the strings look like "\xe7\xed\xe0\xfe\xf2".
txt <- c("привет", "пока")
Encoding(txt) # I get "unknown" "unknown"
Encoding(txt) <- "UTF-8"
Encoding(txt) # I get "UTF-8" "UTF-8", but strange symbols in vector
Plus, when I run l10n_info(), I get
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] FALSE
I know that I can use enc2utf8() with strings, but I work with quanteda and get the same issue as here: https://github.com/quanteda/quanteda/issues/1387 (but reinstalling the package from GitHub didn't help). I think that the problem is with the encoding on the server.
P.S. A data frame loaded from Excel is displayed correctly, and when I save the tokens object into a new xlsx file, all the strings are displayed in Cyrillic.
Here is my session info:
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tsne_0.1-3 stringi_1.5.3 tm_0.7-8 NLP_0.2-1 DataCombine_0.2.21 emo_0.0.0.9000 ggplot2_3.3.3 reshape2_1.4.4
[9] topicmodels_0.2-12 ldatuning_1.0.2 writexl_1.3.1 plyr_1.8.6 quanteda_2.9.9000 stringr_1.4.0 readxl_1.3.1
Thank you in advance!
Encoding issues are tricky, especially on Windows systems. It looks like your native encoding system is Windows-1251, an 8-bit encoding for Cyrillic. So when you input your string, it is input in that encoding. You can convert it to Unicode, but it still won't necessarily display correctly if you use the print method.
Here's the result of me trying to simulate the problem on my macOS platform.
> stringi::stri_info()$Charset.native[1:2]
$Name.friendly
[1] "UTF-8"
$Name.ICU
[1] "UTF-8"
My guess is that your system will show something different, but I cannot be sure.
> # on macOS 10.15.7
> txt <- c("привет", "пока")
> txt
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> Encoding(txt)
[1] "UTF-8" "UTF-8"
So that produces the same output that you are seeing, but it's encoded as UTF-8. To simulate what that would look like if the system encoded it as Windows-1251, we can convert it:
> # convert to Windows-1251
> txt_1251 <- iconv(txt, from = "UTF-8", to = "WINDOWS-1251")
> print(txt_1251)
[1] "\xef\xf0\xe8\xe2\xe5\xf2" "\xef\xee\xea\xe0"
> cat(txt_1251)
������ ����
> Encoding(txt_1251)
[1] "unknown" "unknown"
Is that what you are seeing?
You can try fixing it this way:
> txt_from1251 <- stringi::stri_conv(txt_1251, from = "windows-1251", to = "utf-8")
> print(txt_from1251)
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> cat(txt_from1251)
привет пока
> Encoding(txt_from1251)
[1] "UTF-8" "UTF-8"
So while it still does not print(), it shows correctly from cat(), and has the correct Encoding bit set.
I could be wrong about this since my understanding of Unicode and character sets in R is incomplete, and it seems to be quite platform and locale dependent. I would happily see another response that improves this answer, or to hear your success or not with trying some of the fixes suggested above.
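One point that may help here: Encoding(x) <- "UTF-8" only changes the declared encoding mark and leaves the bytes untouched, whereas iconv() actually converts them. The sketch below rebuilds the Windows-1251 bytes of the first word by hand so that it runs on any platform. It assumes the local iconv knows the name "WINDOWS-1251", as in the conversion above.

```r
# Bytes of the first example word in Windows-1251 (same bytes as printed above)
bytes_1251 <- as.raw(c(0xef, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2))
x <- rawToChar(bytes_1251)

# Only re-labels the bytes: they are still Windows-1251, now wrongly
# declared as UTF-8
relabelled <- x
Encoding(relabelled) <- "UTF-8"
relabelled_ok <- validUTF8(relabelled)

# Actually converts the bytes from Windows-1251 to UTF-8
converted <- iconv(x, from = "WINDOWS-1251", to = "UTF-8")
converted_ok <- validUTF8(converted)
```

Here relabelled_ok should come out FALSE and converted_ok TRUE, which would be consistent with the broken strings seen after Encoding(txt) <- "UTF-8".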

How to resolve Devanagari script encoding error in R

I have a .csv file that was saved with UTF-8 encoding. The data in this file is in Devanagari script. I am able to see the words in the csv file in Excel:
में
लिए
किया
गया
हैं
नहीं
सिंह
पुलिस
दिया
करने
कहा
रहे
बाद
करें
साथ
रहा
But when I open the file in R, the words are not decoded correctly. The output of print() looks like this:
word
सारे_खतरों_को
जानते_हà¥\u0081à¤\u008f_भी
विवेक_ने
टीवी
How can I resolve this? I have tried Sys.setlocale() and read.delim("wordlist.csv", encoding = "UTF-8") but neither worked.
Too long for comment (sorry, I'm a greenhorn in R):
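print(sessionInfo())
library(stringi)
library(magrittr)
x <- read.delim("D:\\bat\\SO\\64497248_devangari.csv", encoding = "UTF-8")
print('=== print(x)')
print(x)
# Iterating over a data frame visits its columns; y keeps the last one processed
for (line in x) {
  y <- line %>%
    # rewrite any literal "<U+XXXX>" text as a "\uXXXX" escape ...
    stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
    # ... then turn those escapes into real characters
    stri_unescape_unicode() %>%
    stri_enc_toutf8()
}
print('=== print(y)')
print(y)
print('=== for (i in y) {print(i)}')
for (i in y) {print(i)}
print('=== print(z)')
# x['word'] is a one-column data frame, not a character vector, so stringi
# coerces it to a single string (hence the warning in the output below)
z <- x['word'] %>%
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
  stri_unescape_unicode() %>%
  stri_enc_toutf8()
print(z)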
Output (in Rgui.exe console):
> source ( 'D:\\bat\\SO\\64497248.r' )
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.1
[1] "=== print(x)"
word
1 <U+092E><U+0947><U+0902>
2 <U+0932><U+093F><U+090F>
3 <U+0915><U+093F><U+092F><U+093E>
4 <U+0917><U+092F><U+093E>
5 <U+0939><U+0948><U+0902>
6 <U+0928><U+0939><U+0940><U+0902>
7 <U+0938><U+093F><U+0902><U+0939>
8 <U+092A><U+0941><U+0932><U+093F><U+0938>
9 <U+0926><U+093F><U+092F><U+093E>
10 <U+0915><U+0930><U+0928><U+0947>
11 <U+0915><U+0939><U+093E>
12 <U+0930><U+0939><U+0947>
13 <U+092C><U+093E><U+0926>
14 <U+0915><U+0930><U+0947><U+0902>
15 <U+0938><U+093E><U+0925>
16 <U+0930><U+0939><U+093E>
[1] "=== print(y)"
[1] "में" "लिए" "किया" "गया" "हैं" "नहीं" "सिंह" "पुलिस" "दिया" "करने" "कहा" "रहे" "बाद" "करें" "साथ" "रहा"
[1] "=== for (i in y) {print(i)}"
[1] "में"
[1] "लिए"
[1] "किया"
[1] "गया"
[1] "हैं"
[1] "नहीं"
[1] "सिंह"
[1] "पुलिस"
[1] "दिया"
[1] "करने"
[1] "कहा"
[1] "रहे"
[1] "बाद"
[1] "करें"
[1] "साथ"
[1] "रहा"
[1] "=== print(z)"
[1] "c(\"में\", \"लिए\", \"किया\", \"गया\", \"हैं\", \"नहीं\", \"सिंह\", \"पुलिस\", \"दिया\", \"करने\", \"कहा\", \"रहे\", \"बाद\", \"करें\", \"साथ\", \"रहा\"\n)"
Warning messages:
1: package ‘magrittr’ was built under R version 4.0.2
2: In stri_replace_all_regex(., "<U\\+([[:alnum:]]+)>", "\\\\u$1") :
argument is not an atomic vector; coercing
>
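The difference between y and z above comes down to how the column is extracted. A minimal sketch, assuming a small UTF-8 file written to a temporary path (the path and the two sample words are stand-ins for the real wordlist):

```r
# Write a small UTF-8 file so the sketch is self-contained
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(c("word", "\u092e\u0947\u0902", "\u0932\u093f\u090f"), con,
           useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8 as they are read
x <- read.delim(path, encoding = "UTF-8", stringsAsFactors = FALSE)

as_frame <- x["word"]    # single brackets: still a data frame
words    <- x[["word"]]  # double brackets: the character vector itself

# Code points of the first word, independent of how the console prints it
first_word_points <- utf8ToInt(words[1])
```

Passing words (rather than as_frame) to string functions avoids the "argument is not an atomic vector" coercion. Checking code points with utf8ToInt() is one way to confirm the data was read correctly even when print() shows <U+....> escapes.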

Encoding issue when reading latin characters

I need to read some txt tables in my shiny app, and some of their columns are text. When I read a table with Latin characters (á,â,ó,é,ô,ç...), they appear as �. I think I need to read it using UTF-8 as the encoding. I tried the following code, but it does not work for me.
test<-read.delim("test.txt",sep="\t",dec=",",header=TRUE, encoding="UTF-8")
or
test<-read.table(file = "test.txt", dec = ",", fileEncoding = "utf-8")
Here a short script to reproduce my problem:
x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
write.table(x,"test.txt",sep="\t")
test<-read.delim("test.txt",sep="\t",header=TRUE)
I'm using Windows and have this info:
localeToCharset()
[1] "ISO8859-1"
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
Thanks in advance!!
Works for me:
> Sys.getlocale()
[1] "LC_COLLATE=English_Ireland.1252;LC_CTYPE=English_Ireland.1252;LC_MONETARY=English_Ireland.1252;LC_NUMERIC=C;LC_TIME=English_Ireland.1252"
> x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
> write.table(x,"test.txt",sep="\t")
> test<-read.delim("test.txt",sep="\t",header=TRUE)
> test
V1 V2
1 áeiou 1
2 aêiou 2
3 aeíou 3
4 aeiõu 4
5 aeiouç 5
> sessionInfo()$platform
[1] "x86_64-w64-mingw32/x64 (64-bit)"
Try
Sys.setlocale(category = "LC_ALL", locale = "English_Ireland.1252")
or some similar locale.

Difficulties with `agrep(..., fixed=F)`

In ?agrep (grep with fuzzy matching) it mentions that I can set the argument fixed=FALSE to let my pattern be interpreted as a regular expression.
However, I can't get it to work!
agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)
The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.
To confirm:
grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep
And even more confusingly, adist correctly gives the distance between the pattern and string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).
adist('(asdf|fdsa)', 'asdf', fixed=F)
# [,1]
# [1,] 0
Why is this not working? Is there something I don't understand? A workaround?
I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.
(yes, I'm stuck on an old computer that can't do better than R 2.15.2)
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_AU.utf8 LC_NUMERIC=C
[3] LC_TIME=en_AU.utf8 LC_COLLATE=en_AU.utf8
[5] LC_MONETARY=en_AU.utf8 LC_MESSAGES=en_AU.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.
Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).
To elaborate, note that
adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)') # 11
If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.
On that note:
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)
These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.
So it looks like agrep(pattern, x, fixed=F) does not work, i.e. it actually treats fixed as TRUE for this sort of pattern.
As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.
EDIT: Workaround (thanks @Arun)
The function aregexec appears to work.
> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
This has (finally) been fixed in the R sources (the "trunk", i.e. R-devel, and R-patched, which will become R 3.5.1 in early July 2018).
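For R versions without the fix, the aregexec workaround could be wrapped so that it returns indices the way agrep does. This is only a sketch: fuzzy_which is a hypothetical helper name, and it relies on the documented behaviour that aregexec returns -1 for elements without a match.

```r
# Hypothetical helper: indices of the elements of x that approximately match
# a regular expression, using aregexec() instead of agrep()
fuzzy_which <- function(pattern, x, max.distance = 0.1) {
  m <- aregexec(pattern, x, max.distance = max.distance, fixed = FALSE)
  # aregexec() returns -1 for elements without a match
  which(vapply(m, function(pos) pos[1] != -1, logical(1)))
}

hits <- fuzzy_which("(asdf|fdsa)", c("asdf", "qqqqqqqq", "fdsa"))
```

Note that aregexec, like agrep, looks for an approximate match anywhere inside each element, so this is not the same as thresholding the whole-string distance from adist.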

How to read.table with "Hebrew" column names (in R)?

I am trying to read a .txt file, with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest to me what to try and change to allow me to load the file correctly ?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Gets me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is the > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
I would try passing parameter fileEncoding to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. Hebrew should be covered by part 8 of ISO 8859.
I've tried @George Donats's answer, but couldn't make it work. So I wanted to suggest another possibility for future reference.
I couldn't find the file online, so I've recreated a txt file like yours, using TAB as a separator. You can load it into R with the Hebrew text using a connection, as demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Then you can load it into R with your code, using the con variable as the file input:
data<-read.table(con,sep="\t",header=TRUE)
Inspecting the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3
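When it is unclear which encoding a file is really in, one way to test a guess is to read the header line as undeclared bytes and convert it explicitly with iconv(). The sketch below is an illustration under assumptions: it builds its own two-column file in ISO-8859-8 at a temporary path (not the original aa.txt), and it assumes the local iconv knows the name "ISO-8859-8".

```r
# Build a tiny tab-separated file whose first column name is three Hebrew
# letters stored as ISO-8859-8 bytes (0xE0, 0xE7, 0xFA)
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xE0, 0xE7, 0xFA)), charToRaw("\tB\n12\t97\n")), path)

# Read the lines without declaring an encoding, then convert explicitly
lines_raw  <- readLines(path)
lines_utf8 <- iconv(lines_raw, from = "ISO-8859-8", to = "UTF-8")

# Split the header into column names and inspect the first name's code points
col_names <- strsplit(lines_utf8[1], "\t", fixed = TRUE)[[1]]
first_name_points <- utf8ToInt(col_names[1])
```

If the guess is right, the code points fall in the Hebrew block (U+0590 to U+05FF). If iconv() returns NA, the file is probably in a different encoding.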
