Broken UTF-8 encoding when using Encoding() and tokens() - r

I've got a quite strange problem with encoding. When I run Encoding(txt) <- "UTF-8", the encoding breaks and the strings look like "\xe7\xed\xe0\xfe\xf2".
txt <- c("привет", "пока")
Encoding(txt) # I get "unknown" "unknown"
Encoding(txt) <- "UTF-8"
Encoding(txt) # I get "UTF-8" "UTF-8", but strange symbols in vector
Plus, when I run l10n_info(), I get
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] FALSE
I know that I can use enc2utf8() with strings, but I work with quanteda and get the same issue as here: https://github.com/quanteda/quanteda/issues/1387 (but reinstalling the package from GitHub didn't help). I think the problem is with the encoding on the server.
P.S. A dataframe loaded from Excel is displayed correctly, and when I save the tokens object into a new xlsx file, all the strings are displayed in Cyrillic.
Here is my session info:
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tsne_0.1-3 stringi_1.5.3 tm_0.7-8 NLP_0.2-1 DataCombine_0.2.21 emo_0.0.0.9000 ggplot2_3.3.3 reshape2_1.4.4
[9] topicmodels_0.2-12 ldatuning_1.0.2 writexl_1.3.1 plyr_1.8.6 quanteda_2.9.9000 stringr_1.4.0 readxl_1.3.1
Thank you in advance!

Encoding issues are tricky, especially on Windows systems. It looks like your system's native encoding is Windows-1251, an 8-bit encoding for Cyrillic. So when you input your string, it is stored in that encoding. You can convert it to UTF-8, but it still won't necessarily display correctly if you use the print method.
Here's the result of me trying to simulate the problem on my macOS platform.
> stringi::stri_info()$Charset.native[1:2]
$Name.friendly
[1] "UTF-8"
$Name.ICU
[1] "UTF-8"
My guess is that your system will show something different, but I cannot be sure.
> # on macOS 10.15.7
> txt <- c("привет", "пока")
> txt
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> Encoding(txt)
[1] "UTF-8" "UTF-8"
So that produces the same output that you are seeing, but it's encoded as UTF-8. To simulate what that would look like if the system encoded it as Windows-1251, we can convert it:
> # convert to Windows-1251
> txt_1251 <- iconv(txt, from = "UTF-8", to = "WINDOWS-1251")
> print(txt_1251)
[1] "\xef\xf0\xe8\xe2\xe5\xf2" "\xef\xee\xea\xe0"
> cat(txt_1251)
������ ����> Encoding(txt_1251)
[1] "unknown" "unknown"
Is that what you are seeing?
You can try fixing it this way:
> txt_from1251 <- stringi::stri_conv(txt_1251, from = "windows-1251", to = "utf-8")
> print(txt_from1251)
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> cat(txt_from1251)
привет пока> Encoding(txt_from1251)
[1] "UTF-8" "UTF-8"
So while print() still shows escape sequences, cat() displays the text correctly, and the strings carry the correct UTF-8 encoding mark.
I could be wrong about this, since my understanding of Unicode and character sets in R is incomplete and the behavior seems to be quite platform- and locale-dependent. I would happily see another response that improves this answer, or hear whether any of the fixes suggested above worked for you.
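A minimal sketch of why Encoding(txt) <- "UTF-8" can break the strings in the question: the assignment only changes the declared encoding mark and leaves the bytes untouched, whereas iconv() actually re-encodes them. The bytes below are the Windows-1251 bytes for "привет" shown in the transcript above; the variable names are only illustrative, and "CP1251" is assumed to be an encoding name that the local iconv accepts.

```r
# The six Windows-1251 bytes of "привет", built from raw values so the
# example does not depend on the session's locale.
x <- rawToChar(as.raw(c(0xef, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2)))

# Declaring the string to be UTF-8 changes only the mark, not the bytes,
# so the result is not valid UTF-8.
mislabeled <- x
Encoding(mislabeled) <- "UTF-8"
identical(charToRaw(mislabeled), charToRaw(x))  # TRUE
validUTF8(mislabeled)                           # FALSE

# Converting re-encodes the bytes and marks the result as UTF-8.
converted <- iconv(x, from = "CP1251", to = "UTF-8")
validUTF8(converted)                            # TRUE
```

If this matches the situation on the server, converting with iconv() or stringi::stri_conv() is the safer route than assigning to Encoding().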

Related

Text encoding issues in R

I'm doing text mining in R with Spanish documents and I'm encountering constant issues with the encoding, and with the different solutions I came up with to solve them. I have been searching through different topics, but can't really find a clear solution. And the fact that things work differently every time probably means that I don't really understand the problem.
I extracted text data from a PDF using pdf_text (package pdftools), and the accented characters are shown as Unicode escapes, e.g. "<U+00ED>". However, when I try to substitute these with the normal characters using gsub (or find them with grepl), R doesn't find anything. The output looks something like this:
> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
However, if you enter these strings manually, R does find them and substitutions are possible:
> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] TRUE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
Why is this happening? What is R actually reading, so that it sees the two things as different?
In the past I used to write an adapter function that basically substitutes these sequences with the normal characters, but I'm finding that in some cases it doesn't work, and I don't really understand why. Also, when the data comes from OCR, the mess is even bigger, and the translation to UTF-8 differs constantly, etc.
Does anyone know of any general approach that solves this stuff? I'll be working extensively with this in the future.
Thanks a lot.
P.S.:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] pdftools_1.4 stringr_1.2.0
loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 Rcpp_0.12.12
[5] stringi_1.1.5
The problem here is that your locale is set to C, so R will refuse to print non-ASCII characters. If you change your locale to one that allows printing Unicode, then you will see the characters that you expect.
txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"
Sys.setlocale("LC_CTYPE", "C") # switch character type locale to "C"
## "C"
print(txt)
## [1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
Sys.setlocale("LC_CTYPE", "") # switch to native locale
## [1] "en_US.UTF-8"
print(txt)
## [1] "Comunidad Autónoma: País Vasco"
Here's how to replace the "o with acute accent" character:
gsub("\u00F3", "o", txt)
## [1] "Comunidad Autonoma: País Vasco"
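To make the distinction concrete, here is a small sketch (the variable names are illustrative) showing that "<U+00F3>" is only how print() renders the character in a C locale. The string itself holds the single code point U+00F3, so a pattern spelled with angle brackets does not match it, while the \u escape does.

```r
txt <- "Comunidad Aut\u00F3noma"

# The literal text "<U+00F3>" is not in the data, whatever print() displays.
has_literal <- grepl("<U+00F3>", txt, fixed = TRUE)

# The code point U+00F3 is in the data.
has_codepoint <- grepl("\u00F3", txt, fixed = TRUE)

# Replace the code point itself.
plain <- gsub("\u00F3", "o", txt, fixed = TRUE)
```

A string typed by hand as "Aut<U+00F3>noma" really does contain the angle brackets, which would explain why the manual example in the question matches.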

R encoding UTF-8: U+0080-U+009F

I am struggling with some encoding issues. I have many textfiles that contain rows in the following format:
https://dl.dropboxusercontent.com/u/94114397/example.txt
According to Notepad++, these are all encoded in UTF-8 and most non-ASCII characters are displayed correctly, as you can see in lines 1 and 2. However, I have problems with some characters that seem to be wrongly interpreted(?). In my example file, this is the case in line 3 in the word "Lakuic", where there should be an "š" between the "u" and the "i". There actually is a character between those two letters, which can be seen by copy-pasting the word into the Google Chrome address bar.
Now when I read the data in R, it displays "Laku<U+009A>ic". How can I resolve this?
Try converting from UTF-8 to latin1:
df <- read.table("http://dl.dropboxusercontent.com/u/94114397/example.txt", sep = "\t", row.names = 1, stringsAsFactors = FALSE, encoding="UTF-8")
iconv(df[, 1], from = "UTF-8", to = "latin1")
# [1] "Trichocentrum<->longifolium<-><->(Lindl.) R.Jiménez, Acta Bot. Mex. 97: 54 (2011)."
# [2] "Salvia<->× hegelmaieri<->nothosubsp. accidentalis<->(Sánchez-Gómez & R.Morales)."
# [3] "Edraianthus<->tarae<-><->Lakušic, Bilten Drustva Ekologa BiH, Ser. A 4: 108 (1987)."
My sessionInfo():
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
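One possible reading of why this conversion helps, sketched under the assumption that the file was originally Windows-1252 text whose bytes were converted to UTF-8 as if they were Latin-1: the code point U+009A is a C1 control character, but the byte 0x9A is "š" in Windows-1252. Converting to latin1 restores the byte 0x9A, and reinterpreting that byte as Windows-1252 yields U+0161. The encoding name "CP1252" is assumed to be accepted by the local iconv.

```r
x <- "Laku\u009aic"                        # what R reads from the file

# Step 1: back to single bytes; U+009A becomes the byte 0x9A.
bytes <- iconv(x, from = "UTF-8", to = "latin1")

# Step 2: reinterpret those bytes as Windows-1252, where 0x9A is "š".
fixed <- iconv(bytes, from = "CP1252", to = "UTF-8")

utf8ToInt(fixed)[5] == 0x161               # TRUE: U+0161 is "š"
```

On a Windows session whose locale is already code page 1252, step 1 alone displays correctly, which is consistent with the output shown above.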
This works for me:
file1 <- "https://dl.dropboxusercontent.com/u/94114397/example.txt"
result <- read.table(file1, header=F, sep="\t", quote="\"",encoding="windows-1252")

Encoding issue when reading latin characters

I need to read some txt tables in my shiny app, and some of their columns are text. When I read a table with Latin characters (á,â,ó,é,ô,ç...) they appear as �. I think I need to read it using UTF-8 as the encoding. I tried the following code, but it does not work for me.
test<-read.delim("test.txt",sep="\t",dec=",",header=TRUE, encoding="UTF-8")
or
test<-read.table(file = "test.txt", dec = ",", fileEncoding = "utf-8")
Here a short script to reproduce my problem:
x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
write.table(x,"test.txt",sep="\t")
test<-read.delim("test.txt",sep="\t",header=TRUE)
I'm using Windows and I have this info:
localeToCharset()
[1] "ISO8859-1"
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
Thanks in advance!!
Works for me:
> Sys.getlocale()
[1] "LC_COLLATE=English_Ireland.1252;LC_CTYPE=English_Ireland.1252;LC_MONETARY=English_Ireland.1252;LC_NUMERIC=C;LC_TIME=English_Ireland.1252"
> x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
> write.table(x,"test.txt",sep="\t")
> test<-read.delim("test.txt",sep="\t",header=TRUE)
> test
V1 V2
1 áeiou 1
2 aêiou 2
3 aeíou 3
4 aeiõu 4
5 aeiouç 5
> sessionInfo()$platform
[1] "x86_64-w64-mingw32/x64 (64-bit)"
Try
Sys.setlocale(category = "LC_ALL", locale = "English_Ireland.1252")
or some similar locale.
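As a side note on the two arguments tried in the question: in read.table() and read.delim(), encoding only marks the strings that were read as being in that encoding, while fileEncoding re-encodes the file's contents into the session's native encoding. Which one applies depends on how test.txt was actually written. The sketch below assumes a file that really is UTF-8; it writes the bytes explicitly so the result does not depend on the locale, and the file and column names are made up for illustration.

```r
# Write a small tab-separated file as exact UTF-8 bytes.
lines <- c("word\tn", "\u00e1eiou\t1", "a\u00eaiou\t2")
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# encoding = "UTF-8" keeps the bytes and marks the strings as UTF-8.
test <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(test$word)

# fileEncoding = "UTF-8" would instead convert to the native encoding,
# which only works for characters that the native encoding can represent.
```

If the file was written by write.table() in a Windows-1252 session without fileEncoding, it is in the native encoding rather than UTF-8, and reading it back without either argument is expected to work, as the answer above shows.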

Find whether a particular date is an Option Expiration Friday - problem with timeDate package

I am trying to write a simple function that (should) return true if the parameter date(s) is an Op-Ex Friday.
require(timeDate)
require(quantmod)
getSymbols("^GSPC", adjust=TRUE, from="1960-01-01")
assign("SPX", GSPC, envir=.GlobalEnv)
names(SPX) <- c("SPX.Open", "SPX.High", "SPX.Low", "SPX.Close",
"SPX.Volume", "SPX.Adjusted")
dates <- last(index(SPX), n=10)
from <- as.numeric(format(as.Date(min(dates)), "%Y"))
to <- as.numeric(format(as.Date(max(dates)), "%Y"))
isOpExFriday <- ifelse(
  isBizday(
    timeDate(as.Date(dates)),
    holidayNYSE(from:to)) & (as.Date(dates) == as.Date(
      format(timeNthNdayInMonth(timeFirstDayInMonth(dates), nday=5, nth=3)))
  ), TRUE, FALSE)
Now, the result should be [1] "2011-09-16". But instead I get [1] "2011-09-15":
dates[isOpExFriday]
[1] "2011-09-15"
Am I doing something wrong, am I expecting something that the timeDate package does not do by design, or is there a bug in timeDate?
I am guessing it's a timezone problem. What happens if you use this:
format(dates[isOpExFriday], tz="UTC")
On second look, you probably need to put the 'tz=' argument inside the format call inside the as.Date(format(...)) call. The format function "freezes" that date value as text.
EDIT: On testing, however, I think you are right about it being a bug. (And I sent a bug report to the maintainer with this response.) Even after trying to insert various timezone specs and setting myFinCenter in RmetricsOptions, I still get the wrong date, which stems from this error deep inside your choice of functions:
timeNthNdayInMonth(as.Date("2011-09-01"), nday=5, nth=3)
America/New_York
[1] [2011-09-15]
I suspect it is because of this code since as I understand it Julian dates are not adjusted for timezones or daylight savings times:
ct = 24 * 3600 * (as.integer(julian.POSIXt(lt)) +
(nth - 1) * 7 + (nday - lt1$wday)%%7)
class(ct) = "POSIXct"
The ct value in seconds is then converted to POSIXct from seconds since "origin" simply by coercion of class. If I change the code to:
ct=as.POSIXct(ct, origin="1970-01-01") # correct results come back
My quantmod and timeDate versions are both current per CRAN. Running Mac with R 2.13.1 in 64 bit mode with a US locale. I have not yet tried to reproduce with a minimal session so there could still be some collision or hijacking with other packages:
> sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] quantmod_0.3-17 TTR_0.20-3 xts_0.8-2
[4] Defaults_1.1-1 timeDate_2130.93 zoo_1.7-4
[7] gplots_2.10.1 KernSmooth_2.23-6 caTools_1.12
[10] bitops_1.0-4.1 gdata_2.8.1 gtools_2.6.2
[13] wordnet_0.1-8 ggplot2_0.8.9 proto_0.3-9.2
[16] reshape_0.8.4 plyr_1.6 rattle_2.6.10
[19] RGtk2_2.20.17 rms_3.3-1 Hmisc_3.8-3
[22] survival_2.36-9 sos_1.3-0 brew_1.0-6
[25] lattice_0.19-30
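As a cross-check that avoids the time zone question altogether, the third Friday of a month can be computed in base R on Date objects. This is only a sketch with a hypothetical helper name (third_friday), not part of timeDate, and it does not account for exchange holidays.

```r
# Third Friday of the month containing each date in d.
third_friday <- function(d) {
  d <- as.Date(d)
  first <- as.Date(format(d, "%Y-%m-01"))   # first day of the month
  wday <- as.POSIXlt(first)$wday            # 0 = Sunday, ..., 5 = Friday
  first + (5 - wday) %% 7 + 14              # first Friday plus two weeks
}

third_friday("2011-09-01")                  # 2011-09-16
```

Comparing as.Date(dates) with third_friday(dates) gives the same kind of logical vector as the isOpExFriday expression in the question, before the business-day check is applied.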

How to read.table with "Hebrew" column names (in R)?

I am trying to read a .txt file, with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest what to try or change so that the file loads correctly?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Gets me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is the > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
I would try passing parameter fileEncoding to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. As I saw here Hebrew must be part 8 of ISO 8859.
I've tried George Donats's answer, but couldn't make it work. So I wanted to suggest another possibility for future reference.
I couldn't find the file online, so I've recreated a txt file like yours, using TAB as a separator. You can load it into R with the Hebrew text using a connection, as demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Then you can load it into R with your code, using the con variable as the file input:
data<-read.table(con,sep="\t",header=TRUE)
Browsing into the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3
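To see what the iso8859-8 setting expects, here is a small sketch of the byte-level round trip for the first column name. It assumes the local iconv knows the name "ISO-8859-8" (iconvlist() shows what is available), and the variable names are illustrative. If a file's bytes do not look like this, for example because it was saved as UTF-8 or Windows-1255, the "invalid input" warning from the question would be the expected outcome.

```r
hebrew <- "\u05d0\u05d7\u05ea"   # "אחת", the first column name

# In ISO-8859-8 each Hebrew letter is a single byte in the range 0xE0-0xFA.
bytes <- iconv(hebrew, from = "UTF-8", to = "ISO-8859-8", toRaw = TRUE)[[1]]
bytes                            # e0 e7 fa

# Decoding those bytes with the same encoding restores the text.
back <- iconv(list(bytes), from = "ISO-8859-8", to = "UTF-8")
```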

Resources