Read file using EUC-KR text encoding in R

Has anyone had experience reading a Korean-language file using EUC-KR as the text encoding?
I used the fread function, since it reads this file structure perfectly. Below is the sample code:
test <- fread("KoreanTest.txt", encoding = "EUC-KR")
Then I got the error: "Error in fread("KoreanTest.txt", encoding = "EUC-KR") : Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'".
Initially I was using UTF-8 as the text encoding, but the output characters were not displayed correctly in Korean. I have been looking for another solution, but nothing seems to work so far.
I'd appreciate it if someone could share ideas. Thanks.

read.table allows an explicit encoding parameter. This common usage works well:
read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "EUC-KR")
or you can try it with RStudio:
File -> Import Dataset -> From text
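If you prefer a faster reader than base read.table, a hedged alternative (a sketch assuming the readr package and a tab-delimited file; adjust delim to match your data) is to let readr re-encode the file through locale():

library(readr)

# sketch: readr converts the file from the locale's encoding (EUC-KR)
# to UTF-8 as it reads, so Korean text displays correctly afterwards
test <- read_delim("KoreanTest.txt",
                   delim = "\t",   # assumption: tab-delimited
                   locale = locale(encoding = "EUC-KR"))

Base R's read.table() also has a fileEncoding argument (fileEncoding = "EUC-KR") that re-encodes the file as it is read.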

Related

Write.csv in Japanese from R to Excel

When I use write.csv for my Japanese text, I get gibberish in Excel (which normally handles Japanese fine). I've searched this site for solutions, but am coming up empty-handed. Is there an encoding command to add to write.csv to enable Excel to import the Japanese properly from R? Any help appreciated!
Thanks!
I just ran into this exact same problem - I used what I saw online:
write.csv(merch_df, file = "merch.reduced.csv", fileEncoding = "UTF-8")
and indeed, when opening the resulting file in Excel, I got <U+30BB><U+30D6><U+30F3>, etc. Odd and disappointing.
A little Google and I found this awesome blog post by Kevin Ushey which explains it all... https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
Using the function he proposes:
write_utf8 <- function(text, f = tempfile()) {
  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(f, open = "w+", encoding = "native.enc")
  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  writeLines(utf8, con = con, useBytes = TRUE)
  # close our connection
  close(con)
  # read back from the file just to confirm
  # everything looks as expected
  readLines(f, encoding = "UTF-8")
}
works magic. Thank you Kevin!
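To write a whole data frame rather than a character vector, one possible pattern (an untested sketch, reusing merch_df from the question) is to render the CSV to text first and then push the lines through write_utf8():

# render the data frame as CSV text in memory
# (write.csv prints to the console when no file is given)
csv_lines <- capture.output(write.csv(merch_df, row.names = FALSE))
# write the UTF-8 lines without translation to the native encoding
write_utf8(csv_lines, "merch.reduced.csv")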
As a workaround (and a diagnostic), have you tried saving as .txt and then both opening the file in Excel and pasting the data into Excel from a text editor?
I ran into the same problem as tchevrier: Japanese text was not displayed correctly in either Excel or a text editor when exporting with write.csv. I found that using:
readr::write_excel_csv(df, "filename.csv")
corrected the issue. (write_excel_csv() prepends a UTF-8 byte-order mark, which signals to Excel that the file is UTF-8 encoded.)

R/RStudio changes name of column when read from csv

I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv has LOCATION as its first column header. However, when it's read into R, that column name comes out as ï..LOCATION for seemingly no reason.
I tried adding check.names = FALSE, but this only made it worse, as LOCATION is now replaced with ï»¿LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM at the beginning of the file. Try reading as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.

Converting tab delimited text file of unknown encoding to R-compatible file encoding in Python

I have many text files of unknown encoding that I wasn't able to open at all in R, where I would like to work with them. I ended up being able to open them in Python with the help of codecs, as UTF-16:
f = codecs.open(input, "rb", "utf-16")
for line in f:
    print repr(line)
One line in my files now looks like this when printed in python:
u'06/28/2016\t14:00:00\t0,000\t\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\t00000000\t6,000000\t0,000000\t144,600000\t12,050000
\t8,660000\t-120,100000\t-0,040000\t-0,110000\t1,340000\t5,360000
\t-1,140000\t-1,140000\t24,523000\t269,300000\t271,800000\t0,130000
\t272,000000\t177,000000\t0,765000\t0,539000\t\r\n'
The "u" in the beginning tells me that this in unicode, but now I don't really know what do with it. My goal is to convert the textfiles to something I can use in R, e.g. properly encoded csv, but I have failed using unicodecsv:
in_txt = unicodecsv.reader(f, delimiter = '\t', encoding = 'utf-8')
out_csv = unicodecsv.writer(open(output), 'wb', encoding = 'utf-8')
out_csv.writerows(in_txt)
Can anybody tell me what the principal mistake in my approach is?
You can try guess_encoding(y) from the readr package in R. It is not 100% bulletproof, but it has worked for me in the past and should at least get you pointed in the right direction:
guess_encoding(y)
#>     encoding confidence
#> 1 ISO-8859-2        0.4
#> 2 ISO-8859-1        0.3
Try using read_tsv() to read in your files and then try guess_encoding().
Hope it helps.
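A possible end-to-end sketch in R (the file name is a placeholder; the detected encoding and the comma decimal mark are assumptions based on the sample line shown above):

library(readr)

# guess the encoding from the raw bytes of the file
guess_encoding("datafile.txt")

# then read it with the most likely encoding; the sample line suggests
# tab-separated values with a comma as the decimal mark
df <- read_tsv("datafile.txt",
               locale = locale(encoding = "UTF-16LE", decimal_mark = ","))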

Output accented characters for use with LaTeX

I'm trying to use R to create the content of a tex file. The content contains many accented letters, and I am not able to write them to the tex file correctly.
Here is a short minimal example of what I would like to perform:
I have a file texinput.tex, which already exists and is encoded as UTF-8 without BOM. When I manually write é in Notepad++ and save this file, it compiles correctly in LaTeX and the output is as expected.
Then I tried to do this in R:
str.to.write <- "é"
cat(str.to.write, file = "tex_list.tex", append=TRUE)
As a result, the escaped byte \xe9 appears in the tex file. LaTeX throws this error when trying to compile:
! File ended while scanning use of \UTFviii#three#octets.<inserted text>\par \include{texinput}
I then tried all of the following things before the cat command:
Encoding(str.to.write) <- "latin1"
-> same output and error as above
str.to.write <- enc2utf8(str.to.write)
-> same output and error as above
Encoding(str.to.write) <- "UTF-8"
-> this appears in the tex file: \xe9. LaTeX throws this error: ! Undefined control sequence. \xe
Encoding(str.to.write) <- "bytes"
-> this appears in the tex file: \\xe9. LaTeX compiles without error, and the output is xe9
I know that I could replace é by \'{e}, but I would like to have an automatic method, because the real content is very long and contains words from 3 different Latin languages, so it has lots of different accented characters.
However, I would also be happy with a function that automatically sanitizes R output for use with LaTeX. I tried using xtable with sanitize.text.function, but it appears that it doesn't accept character vectors as input.
After quite a bit of searching and trial-and-error, I found something that worked for me:
# create output function
writeTex <- function(x) {
  write.table(x, "tex_list.tex",
              append = TRUE, row.names = FALSE,
              col.names = FALSE, quote = FALSE,
              fileEncoding = "UTF-8")
}
writeTex("é")
Output is as expected (é), and it compiles perfectly well in LaTeX.
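An alternative sketch, closer to the original cat() approach (assuming the same file name): open an explicit UTF-8 connection so that R converts the text on write.

con <- file("tex_list.tex", open = "a", encoding = "UTF-8")
cat("é", file = con)  # text is converted to UTF-8 as it is written
close(con)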
Use TIPA for processing International Phonetic Alphabet (IPA) symbols in LaTeX. It has become standard in the linguistics field.

Set encoding for output in R globally

I'd like R to render its whole output (both to the console and to files) in UTF-8. Is there a way to set the encoding for R output globally?
I think what you're getting at comes out of options. For example, here's part of the help page for file:
file(description = "", open = "", blocking = TRUE,
     encoding = getOption("encoding"), raw = FALSE)
So if you investigate setting options(encoding = your_choice_here), you may be all set.
Edit: If you haven't already, be sure to set your locale to the language desired.
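A minimal sketch of that suggestion (the locale string is an assumption; it differs by operating system):

# make UTF-8 the default for connections opened without an explicit
# 'encoding' argument (e.g. via file())
options(encoding = "UTF-8")

# console rendering also depends on the locale; a UTF-8 locale such as
# this one is typical on Linux/macOS
Sys.setlocale("LC_ALL", "en_US.UTF-8")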
