How to read.table with "Hebrew" column names (in R)? - r

I am trying to read a .txt file, with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns me with:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest to me what to try and change to allow me to load the file correctly ?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Get's me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is the > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal

I would try passing parameter fileEncoding to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. As I saw here Hebrew must be part 8 of ISO 8859.

I've tried #George Donats answer, but couldn't make it work. So I wanted to suggest another possibility for future reference.
I couldn't find the file online, so I've recreated a txt file like your using TAB as a seperator. You can load it into R with the Hebrew text using a connection. It is demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Than you can load it into R with your code, using con variable as the file input, code described here:
data<-read.table(con,sep="\t",header=TRUE)
Browsing into the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3

Related

Broken encoding UTF-8 when use Encoding() and tokens()

I've got quite strange problem with encoding. When I run Encoding(txt) <- "UTF-8", I get encoding broken and strings look like "\xe7\xed\xe0\xfe\xf2".
txt <- c("привет", "пока")
Encoding(txt) # I get "unknown" "unknown"
Encoding(txt) <- "UTF-8"
Encoding(txt) # I get "UTF-8" "UTF-8", but strange symbols in vector
Plus, when I run l10n_info(), I get
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] FALSE
I know, that I can use enc2utf8() with strings, but I work with quanteda and get the same issue as here: https://github.com/quanteda/quanteda/issues/1387 (but reinsalling package from github didn't help). I think, that problrm is with encoding on server.
P.S. dataframe loaded from excel is displayed correctly + when I save tokens object into new xslx, all the strings displayed in cyrillic.
Here is my session info:
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tsne_0.1-3 stringi_1.5.3 tm_0.7-8 NLP_0.2-1 DataCombine_0.2.21 emo_0.0.0.9000 ggplot2_3.3.3 reshape2_1.4.4
[9] topicmodels_0.2-12 ldatuning_1.0.2 writexl_1.3.1 plyr_1.8.6 quanteda_2.9.9000 stringr_1.4.0 readxl_1.3.1
Thank you in advance!
Encoding issues are tricky, especially on Windows systems. It looks like your native encoding system is Windows-1251, an 8-bit encoding for Cyrillic. So when you input your string, it is input in that encoding. You can convert it to Unicode, but it still won't necessarily display correctly if you use the print method.
Here's the result of me trying to simulate the problem on my macOS platform.
> stringi::stri_info()$Charset.native[1:2]
$Name.friendly
[1] "UTF-8"
$Name.ICU
[1] "UTF-8"
My guess is that your system will show something different, but I cannot be sure.\
> # on macOS 10.15.7
> txt <- c("привет", "пока")
> txt
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> Encoding(txt)
[1] "UTF-8" "UTF-8"
So that produces the same output that you are seeing, but it's encoded as UTF-8. To simulate what that would look like if the system encoded it as Windows-1251, we can convert it:
> # convert to Windows-1251
> txt_1251 <- iconv(txt, from = "UTF-8", to = "WINDOWS-1251")
> print(txt_1251)
[1] "\xef\xf0\xe8\xe2\xe5\xf2" "\xef\xee\xea\xe0"
> cat(txt_1251)
������ ����> Encoding(txt_1251)
[1] "unknown" "unknown"
Is that what you are seeing?
You can try fixing it this way:
> txt_from1251 <- stringi::stri_conv(txt_1251, from = "windows-1251", to = "utf-8")
> print(txt_from1251)
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> cat(txt_from1251)
привет пока> Encoding(txt_from1251)
[1] "UTF-8" "UTF-8"
So while it still does not print(), it shows correctly from cat(), and has the correct Encoding bit set.
I could be wrong about this since my understanding of Unicode and character sets in R is incomplete, and it seems to be quite platform and locale dependent. I would happily see another response that improves this answer, or to hear your success or not with trying some of the fixes suggested above.

How to resolve Devanagari script encoding error in R

I have a .csv file which is UTF-8 encoded when I saved it. The script is Devanagari of the data in this file. I am able to see the words in csv file in excel
में
लिए
किया
गया
हैं
नहीं
सिंह
पुलिस
दिया
करने
कहा
रहे
बाद
करें
साथ
रहा
But when I open that in R, the words do not get encoded correctly. the output for print() is like this:
word
सारे_खतरों_को
जानते_हà¥\u0081à¤\u008f_भी
विवेक_ने
टीवी
How can I resolve this? I have tried Sys.setlocale() and read.delim(wordlist.csv, encoding = "UTF-8") but neither worked.
Too long for comment (sorry, I'm a greenhorn in R):
print( sessionInfo())
library(stringi)
library(magrittr)
x <- read.delim("D:\\bat\\SO\\64497248_devangari.csv", encoding = "UTF-8")
print('=== print(x)')
print(x)
for (line in x){
y <- line %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
}
print('=== print(y)')
print(y)
print('=== for (i in y) {print(i)}')
for (i in y) {print(i)}
print('=== print(z)')
z <- x['word'] %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
print(z)
Output (in Rgui.exe console):
> source ( 'D:\\bat\\SO\\64497248.r' )
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.1
[1] "=== print(x)"
word
1 <U+092E><U+0947><U+0902>
2 <U+0932><U+093F><U+090F>
3 <U+0915><U+093F><U+092F><U+093E>
4 <U+0917><U+092F><U+093E>
5 <U+0939><U+0948><U+0902>
6 <U+0928><U+0939><U+0940><U+0902>
7 <U+0938><U+093F><U+0902><U+0939>
8 <U+092A><U+0941><U+0932><U+093F><U+0938>
9 <U+0926><U+093F><U+092F><U+093E>
10 <U+0915><U+0930><U+0928><U+0947>
11 <U+0915><U+0939><U+093E>
12 <U+0930><U+0939><U+0947>
13 <U+092C><U+093E><U+0926>
14 <U+0915><U+0930><U+0947><U+0902>
15 <U+0938><U+093E><U+0925>
16 <U+0930><U+0939><U+093E>
[1] "=== print(y)"
[1] "में" "लिए" "किया" "गया" "हैं" "नहीं" "सिंह" "पुलिस" "दिया" "करने" "कहा" "रहे" "बाद" "करें" "साथ" "रहा"
[1] "=== for (i in y) {print(i)}"
[1] "में"
[1] "लिए"
[1] "किया"
[1] "गया"
[1] "हैं"
[1] "नहीं"
[1] "सिंह"
[1] "पुलिस"
[1] "दिया"
[1] "करने"
[1] "कहा"
[1] "रहे"
[1] "बाद"
[1] "करें"
[1] "साथ"
[1] "रहा"
[1] "=== print(z)"
[1] "c(\"में\", \"लिए\", \"किया\", \"गया\", \"हैं\", \"नहीं\", \"सिंह\", \"पुलिस\", \"दिया\", \"करने\", \"कहा\", \"रहे\", \"बाद\", \"करें\", \"साथ\", \"रहा\"\n)"
Warning messages:
1: package ‘magrittr’ was built under R version 4.0.2
2: In stri_replace_all_regex(., "<U\\+([[:alnum:]]+)>", "\\\\u$1") :
argument is not an atomic vector; coercing
>

How to fix Perl warning message when loading gdata package?

I've updated Strawberry Perl 64-bit 5.30.2001 and the gdata package. Now, when loading library(gdata) I always get this warning messages which appear to be related to Perl.
suppressPackageStartupMessages(library(gdata))
# Warning messages:
# 1: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
# 2: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
However, read.xls, the function I need, seems to run well, except that the warning is repeated every time I use it.
read.xls("http://file-examples-com.github.io/uploads/2017/02/file_example_XLS_10.xls")
# trying URL 'http://file-examples-com.github.io/uploads/2017/02/file_example_XLS_10.xls'
# Content type 'application/vnd.ms-excel' length 8704 bytes
# downloaded 8704 bytes
# X0 First.Name Last.Name Gender Country Age Date Id
# 1 1 Dulce Abril Female United States 32 15/10/2017 1562
# 2 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
# 3 3 Philip Gent Male France 36 21/05/2015 2587
# 4 4 Kathleen Hanner Female United States 25 15/10/2017 3549
# 5 5 Nereida Magwood Female United States 58 16/08/2016 2468
# 6 6 Gaston Brumm Male United States 24 21/05/2015 2554
# 7 7 Etta Hurn Female Great Britain 56 15/10/2017 3598
# 8 8 Earlean Melgar Female United States 27 16/08/2016 2456
# 9 9 Vincenza Weiland Female United States 40 21/05/2015 6548
# Warning messages:
# 1: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
# 2: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
I'm not sure how to deal with this warning because it says nothing to me, I could probably just ignore it and wrap a suppressWarnings() around it.
Nevertheless, does anybody know a way to fix this? I couldn't find anything by googling and don't know where to start and what's actually wrong.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdata_2.18.0
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2 gtools_3.8.2
I had the same issue with a freshly installed version of R, gdata and Strawberry Perl. I finally found this answer to a different (but related) question. Adapting the suggestion there, I ran the following on an elevated command promt:
FTYPE perl=C:\Strawberry\perl\bin\perl.exe %1 %*
This solved the issue for me – however: I am not sure if setting the FTYPE like this might have any unwanted side effects. So be careful.
Update: The command above did suppress the warning "ftype perl' had status 2" for me, but gdata still had issues:
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLSX' (Excel 2007+) files.
gdata: Run the function 'installXLSXsupport()'
gdata: to automatically download and install the perl
gdata: libaries needed to support Excel XLS and XLSX formats.
However, installXLSXsupport() failed with an unspecific error message.
I then ran
Sys.which("perl")
perl
"C:\\rtools40\\usr\\bin\\perl.exe"
and realized that the Perl version from RTools takes precedence over my Strawberry Perl installation – and apparently gdata does not "like" that Perl version.
Therefore, I decided to give Strawberry Perl precedence over RTools by changing my .Renviron file (usethis::edit_r_environ()):
PATH="${RTOOLS40_HOME}\usr\bin;${PATH}" # old
PATH="${PATH};${RTOOLS40_HOME}\usr\bin" # new
Again, I'm not entirely sure what ramifications this might have, but it fixed gdata for me.
Maybe adjusting the PATH alone would also have done the trick (without the ftype stunt I made first), but I cannot test this anymore.
What I recommend:
Adjust the PATH first.
If gdata still complains about the ftype, set the ftype.

Text encoding issues in R

I'm doing text mining in R with Spanish documents and I'm encountering constant issues with the encoding, and the different solutions I came up with to solve them. I have been searching for different topics, but can't really find a clear solution. And the fact that things work different every time probably means that I don't really understand the problem.
I extracted text data from a PDF using pdf_text (package pdftools), and the characters with tildes are translated into Unicode, e.g. "<U+00ED>". However, when I try to substitute these with the normal characters using gsub(or finding them with grepl), R doesn't find anything. The output looks something like this:
> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
However, if you introduce these stringrs manually, R does find them and substitutions are possible:
> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] TRUE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
Why is this happening? What is actually reading R so that it sees both things as different?
In the past I use to build an adapt function that basically substitutes these stuff for the normal characters, but I'm findings that in some cases it doesn't work, and I don't really understand why. Also, when these data comes from OCR, the mess is even bigger, and the translation to UTF-8 differs constantly, etc.
Does anyone know of any general approach that solves this stuff? I'll be working extensively with this in the future.
Thanks a lot.
P.S.:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] pdftools_1.4 stringr_1.2.0
loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 Rcpp_0.12.12
[5] stringi_1.1.5
The problem here is that your locale is set to C, so R will refuse to print non-ASCII characters. If you change your locale to one that allows printing Unicode, then you will see the characters that you expect.
txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"
Sys.setlocale("LC_CTYPE", "C") # switch character type locale to "C"
## "C"
print(txt)
## [1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
Sys.setlocale("LC_CTYPE", "") # switch to native locale
## [1] "en_US.UTF-8"
print(txt)
## [1] "Comunidad Autónoma: País Vasco"
Here's how to replace the "o with acute accent" character:
gsub("\u00F3", "o", txt)
## [1] "Comunidad Autonoma: País Vasco"

Encoding issue when reading latin characters

I need to read some txt tables in my shiny app and some columns of it are text. When I read a table with latin characters (á,â,ó,é,ô,ç...) it appears �. I think I need to read it using UTF-8 as encoding, I tried the following codes, but it does not work for me.
test<-read.delim("test.txt",sep="\t",dec=",",header=TRUE, encoding="UTF-8") or
test<-read.table(file = "test.txt", dec = ",", fileEncoding = "utf-8")
Here a short script to reproduce my problem:
x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
write.table(x,"test.txt",sep="\t")
test<-read.delim("test.txt",sep="\t",header=TRUE)
I´m using a Windows OS and I have this infos:
localeToCharset()
[1] "ISO8859-1"
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
Thanks in advance!!
Works for me:
> Sys.getlocale()
[1] "LC_COLLATE=English_Ireland.1252;LC_CTYPE=English_Ireland.1252;LC_MONETARY=English_Ireland.1252;LC_NUMERIC=C;LC_TIME=English_Ireland.1252"
> x<- matrix(c('áeiou','aêiou','aeíou','aeiõu','aeiouç',1:5),ncol=2)
> write.table(x,"test.txt",sep="\t")
> test<-read.delim("test.txt",sep="\t",header=TRUE)
> test
V1 V2
1 áeiou 1
2 aêiou 2
3 aeíou 3
4 aeiõu 4
5 aeiouç 5
> sessionInfo()$platform
[1] "x86_64-w64-mingw32/x64 (64-bit)"
try
Sys.setlocale(category = "LC_ALL", locale = "English_Ireland.1252")
or some similar locale.

Resources