write.csv in Japanese from R to Excel

When I use write.csv for my Japanese text, I get gibberish in Excel (which normally handles Japanese fine). I've searched this site for solutions, but am coming up empty-handed. Is there an encoding command to add to write.csv to enable Excel to import the Japanese properly from R? Any help appreciated!
Thanks!

I just ran into this exact same problem - I used what I saw online:
write.csv(merch_df, file = "merch.reduced.csv", fileEncoding = "UTF-8")
and indeed, when I opened the resulting file in Excel I got <U+30BB><U+30D6><U+30F3> and the like. Odd and disappointing.
A little googling led me to this excellent blog post by Kevin Ushey, which explains it all: https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
Using the function he proposes:
write_utf8 <- function(text, f = tempfile()) {
  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(f, open = "w+", encoding = "native.enc")
  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  writeLines(utf8, con = con, useBytes = TRUE)
  # close our connection
  close(con)
  # read back from the file just to confirm
  # everything looks as expected
  readLines(f, encoding = "UTF-8")
}
works magic. Thank you Kevin!
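A quick usage sketch, with a couple of Japanese strings standing in for real data (to export a whole data frame this way you would first need to render it to CSV text, for example with capture.output(write.csv(...))):
# minimal usage sketch for write_utf8(); the file name is just an example
jp <- c("セブン", "イレブン")
write_utf8(jp, f = "japanese_utf8.txt")
# the function returns readLines(f, encoding = "UTF-8"), so the round trip
# is echoed back in the console for a quick visual check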

As a workaround (and a diagnostic), have you tried saving the data as .txt and then both opening that file in Excel and pasting the data into Excel from a text editor?

I ran into the same problem as tchevrier. Japanese text was not displayed correctly in either Excel or a text editor when exporting with write.csv. I found that using:
readr::write_excel_csv(df, "filename.csv")
corrected the issue.
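The reason this helps is that write_excel_csv() writes a UTF-8 byte order mark at the start of the file, which is the hint Excel uses to detect the encoding. A small sketch (the data frame and file name below are made up for illustration):
library(readr)
df <- data.frame(item = c("セブン", "サンプル"), n = c(1, 2))
write_excel_csv(df, "japanese_for_excel.csv")  # UTF-8 with BOM, opens cleanly in Excel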


Can someone help resolve the error reading csv file for herbarium data? [duplicate]

I am trying to import a csv that is in Japanese. This code:
url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)
returns the following error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>񓙂̏󋵁#(<8f>T<8e><9f><81>E<8e>w<92><e8><95>񍐋#<8a>փx<81>[<83>X<81>j'
I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?
Encoding() sets the declared encoding of a character string. It doesn't set the encoding of the file that the string names, which is what you want here.
This worked for me, after trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
              fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])  # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1],  # convert to numbers
  function(d) type.convert(gsub(d, pattern=",", replace=""))))
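If you would rather avoid the garbled characters in the first place, it may be worth trying a Japanese encoding directly; Japanese ministry CSVs are typically Shift-JIS (cp932) rather than latin1. A sketch, not verified against the current file:
# sketch: declare the likely Shift-JIS encoding instead of cleaning up afterwards
x <- read.csv(url, header = FALSE, stringsAsFactors = FALSE,
              fileEncoding = "cp932", skip = 16)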
You may have encountered this issue because of an incompatible system locale.
Try setting the system locale with: Sys.setlocale("LC_ALL", "C")
The readr package from the tidyverse might help.
You can set the encoding via the locale argument of read_csv(), using the locale() function and its encoding argument:
read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "latin1"))
The simplest solution I found for this issue, without losing any data or special characters (for example, with fileEncoding="latin1" characters like the Euro sign € are lost), is to open the file first in a text editor such as Sublime Text and use "Save with encoding - UTF-8".
R can then import the file with no issue and no character loss.
I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!
I came across this error (invalid multibyte string 1) recently, but my problem was a bit different:
We had forgotten to save a csv.gz file with an extension, and tried to use read_csv() to read it. Adding the extension solved the problem.
For those using Rattle with this issue, here is how I solved it:
First make sure to quit Rattle so that you're at the R command prompt:
> library(rattle)   # if not already loaded
> crv$csv.encoding="latin1"
> rattle()
You should now be able to carry on, i.e. import your csv > Execute > Model > Execute, etc.
That worked for me; hopefully it helps a weary traveller.
I had a similar problem with scientific articles and found a good solution here:
http://tm.r-forge.r-project.org/faq.html
By using the following line of code:
tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
you convert the multibyte strings into hex code.
I hope this helps.
If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV; that fixed this error for me when importing into R.

RStudio will not write "UTF-8" encoding with emoji data all of a sudden

I am working on a project that uses text files with emojis, and I started having issues writing a data frame of emojis to a csv file. I have been working with these files for some time now, and so far I've been able to save the text data using write.csv(x, "filename") and view it with read.csv("filename", encoding = "UTF-8") without any problems. Yesterday, quite suddenly, that stopped working. All the files that I previously saved will still display emojis using the read.csv() function, but I cannot write and read any new files. For example, if I have:
x <- c("😂","😃","😄")
View(x)
write.csv(x, "testemoji.csv")
x2 <- read.csv("testemoji.csv", encoding = "UTF-8")
View(x2)
x displays the emoji correctly in the viewer, while x2 shows garbled characters instead.
I am using R version 3.6.3 and Windows 10.
What I have tried so far:
write.csv(x, "filename", fileEncoding = "UTF-8")
write.table(x, "filename", fileEncoding = "UTF-8")
write.csv2(x, "filename", fileEncoding = "UTF-8")
x2<- read.csv2("filename", encoding = "UTF-8")
I've tried every option under Tools -> Global Options -> Code -> Saving -> Default text encoding.
I've also tried messing with the locale language on the computer and the Windows beta "Use Unicode UTF-8" option.
when I check the encoding with Encoding(x$v1) it returns "UTF-8", "UTF-8", "UTF-8" but when I check Encoding(x2$x) it returns "unknown", "unknown", "unknown".
trying to change the encoding with Encoding(x2$x)<- "UTF-8" does not change the outcome.
I have been working on this project for 3 months now with no issues. I can't understand why it would come on so suddenly. To my recollection, I have not changed any preferences or settings in R, RStudio, or my computer before this happened. The deadline for this project is coming up in a week and I am getting desperate for answers. If anyone could please help, I would greatly appreciate it. Thank you
It might be worth switching to the readr library, which has better encoding support. This worked for me:
readr::write_csv(data.frame(x),'testemoji.csv')
x2<- readr::read_csv("testemoji.csv")
View(x2)
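A quick way to confirm the fix (a sketch; the column name x comes from the data.frame(x) construction above) is to check the declared encoding after the round trip:
Encoding(x2$x)  # expected "UTF-8" "UTF-8" "UTF-8", unlike the base read.csv() round trip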

Read file using EUC-KR text encoding in R

Has anyone had experience reading a Korean-language file using EUC-KR as the text encoding?
I used the fread function, as it reads that file structure perfectly. Below is the sample code:
test <- fread("KoreanTest.txt", encoding = "EUC-KR")
Then I got error, "Error in fread("KoreanTest.txt", encoding = "EUC-KR") : Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'".
Initially I was using UTF-8 as the text encoding, but the output characters were not displayed correctly in Korean. I have been looking for another solution, but nothing seems to work at this time.
Appreciate if someone could share ideas. Thanks.
read.table() allows an explicit encoding parameter. This common usage works well:
read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "EUC-KR")
Or you can try it in RStudio:
File -> Import Dataset -> From Text
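If you want to stay with a tidyverse-style reader, readr accepts arbitrary iconv encodings through its locale, so something like the following sketch (file name taken from the question) should also work:
library(readr)
test <- read_tsv("KoreanTest.txt", locale = locale(encoding = "EUC-KR"))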

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Do any of you have suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() made it work.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still some parsing problems, but I'm not sure they should be dealt with in this SO question.
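For reference, the download.file() route also works once the arguments are named (and mode = "wb" matters on Windows so the zip isn't corrupted); a sketch with the same URL and a hypothetical destination path:
download.file(
  url      = "https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017",
  destfile = "./data/rte_data.zip",
  mode     = "wb"
)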

Converting tab delimited text file of unknown encoding to R-compatible file encoding in Python

I have many text files of unknown encoding that I wasn't able to open at all in R, which is where I would like to work with them. I ended up being able to open them in Python with the help of codecs, using UTF-16:
import codecs

f = codecs.open(input, "rb", "utf-16")
for line in f:
    print repr(line)
One line in my files now looks like this when printed in python:
u'06/28/2016\t14:00:00\t0,000\t\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\t00000000\t6,000000\t0,000000\t144,600000\t12,050000
\t8,660000\t-120,100000\t-0,040000\t-0,110000\t1,340000\t5,360000
\t-1,140000\t-1,140000\t24,523000\t269,300000\t271,800000\t0,130000
\t272,000000\t177,000000\t0,765000\t0,539000\t\r\n'
The "u" in the beginning tells me that this in unicode, but now I don't really know what do with it. My goal is to convert the textfiles to something I can use in R, e.g. properly encoded csv, but I have failed using unicodecsv:
in_txt = unicodecsv.reader(f, delimiter = '\t', encoding = 'utf-8')
out_csv = unicodecsv.writer(open(output), 'wb', encoding = 'utf-8')
out_csv.writerows(in_txt)
Can anybody tell me what the principal mistake in my approach is?
You can try guess_encoding(y) from the readr package in R. It is not 100% bulletproof, but it has worked for me in the past and should at least get you pointed in the right direction:
guess_encoding(y)
#>     encoding confidence
#> 1 ISO-8859-2        0.4
#> 2 ISO-8859-1        0.3
Try using read_tsv() to read in your files and then try guess_encoding().
Hope it helps.
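Putting the two suggestions together, a sketch (the file name is a placeholder, and "ISO-8859-1" stands in for whatever encoding guess_encoding() reports with the highest confidence for your file):
library(readr)
guess_encoding("datafile.txt")
dat <- read_tsv("datafile.txt", locale = locale(encoding = "ISO-8859-1"))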
