Using Chinese characters without changing locale in R

I can use Chinese characters in R: I can put them in strings inside a data.frame, substitute them with gsub, and they display normally on screen. I can save them to a file using write.table, but I can't read them back with read.table! I'm using fileEncoding="UTF-8" for both write.table and read.table, but the latter gives me:
invalid multibyte string at ...
I've read about changing the locale, but since the Chinese characters work everywhere else, I would rather not mess with the locale (my machine uses a mix of English and Portuguese locales). Is that possible?
I'm using RKWard in Ubuntu 14.10.
EDIT: Chinese characters work perfectly everywhere in the files; they just produce errors when used for quoting...
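For reference, here is a minimal sketch of the round-trip being described (the data frame and file name are hypothetical); on the setup described above it is the read.table call that raises the "invalid multibyte string" error:
df <- data.frame(word = c("\u4f60\u597d", "\u518d\u89c1"), n = 1:2, stringsAsFactors = FALSE)
write.table(df, "chinese.txt", sep = "\t", row.names = FALSE, fileEncoding = "UTF-8")
df2 <- read.table("chinese.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE, fileEncoding = "UTF-8")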

Sorry, I arrived too late. I am using Ubuntu 20.04 and the following worked for my file:
library(readr)
lists <- read_delim("LISTS.csv", ";", escape_double = FALSE, locale = locale(encoding = "ISO-8859-1"), trim_ws = TRUE)
Good luck

Can someone help resolve the error reading csv file for herbarium data? [duplicate]

I am trying to import a csv that is in Japanese. This code:
url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)
returns the following error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>񓙂̏󋵁#(<8f>T<8e><9f><81>E<8e>w<92><e8><95>񍐋#<8a>փx<81>[<83>X<81>j'
I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?
Encoding sets the encoding of a character string. It doesn't set the encoding of the file represented by the character string, which is what you want.
This worked for me, after trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1]) # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1], # convert to numbers
function(d) type.convert(gsub(d, pattern=",", replace=""))))
You may have encountered this issue because of an incompatible system locale.
Try setting the system locale with this code: Sys.setlocale("LC_ALL", "C")
The readr package from the tidyverse might help.
You can set the encoding via the locale argument of the read_csv() function, using the locale() function and its encoding argument:
read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
skip = 14,
locale = locale(encoding = "latin1"))
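If you are unsure which encoding to pass, readr also provides guess_encoding(), which inspects the raw bytes and suggests likely candidates; treat the result as a hint rather than a guarantee (a sketch, downloading to a temporary file first):
library(readr)
tmp <- tempfile(fileext = ".csv")
download.file("http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv", tmp)
guess_encoding(tmp) # lists candidate encodings with confidence scores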
The simplest solution I found for this issue, without losing any data or special characters (for example, with fileEncoding="latin1" characters like the euro sign € are lost), is to open the file first in a text editor like Sublime Text and use "Save with encoding - UTF-8".
Then R can import the file with no issue and no character loss.
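The same conversion can also be done from R itself (a sketch; the file names are hypothetical and it assumes the file is really Windows-1252, which, unlike latin1, does contain the € sign):
con <- file("data_cp1252.csv", encoding = "CP1252") # decode as Windows-1252 while reading
txt <- readLines(con); close(con)
out <- file("data_utf8.csv", encoding = "UTF-8") # re-encode as UTF-8 while writing
writeLines(txt, out); close(out)
x <- read.csv("data_utf8.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)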
I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!
I came across this error (invalid multibyte string 1) recently, but my problem was a bit different:
We had forgotten to save a csv.gz file with an extension, and tried to use read_csv() to read it. Adding the extension solved the problem.
For those using Rattle with this issue, here is how I solved it:
First make sure to quit Rattle so you're at the R command prompt.
> library(rattle) (if not done so already)
> crv$csv.encoding="latin1"
> rattle()
You should now be able to carry on, i.e. import your csv > Execute > Model > Execute, etc.
That worked for me; hopefully it helps a weary traveller.
I had a similar problem with scientific articles and found a good solution here:
http://tm.r-forge.r-project.org/faq.html
By using the following line of code:
tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
you convert the multibyte strings into hex code.
I hope this helps.
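In that line, yourCorpus is a tm corpus; here is a minimal sketch of plugging the recipe into one (the documents are just hypothetical placeholders):
library(tm)
docs <- VCorpus(VectorSource(c("first document", "second document"))) # stand-in for yourCorpus
docs <- tm_map(docs, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))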
If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV; that fixed this error for me when importing into R.

R/RStudio changes name of column when read from csv

I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv looks something like this:
However, when it's read into R, I get:
As you can see, LOCATION is changed to ï..LOCATION for seemingly no reason.
I tried adding check.names = FALSE but this only made it worse, as LOCATION is now replaced with ï»¿LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM at the beginning of the file. Try reading as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.

Unescape LaTeX to UTF-8 or ASCII

I use the R packages RefManageR and bibtex to read in a BibTeX file I exported from Mendeley (my reference manager). Sometimes authors are listed with accents in their name (López), but in BibTeX these are escaped to "L{\\'{o}}pez". However, in another reference this name is spelled without the accent (Lopez).
How can I parse the "L{\\'{o}}pez" to López or Lopez so I can compare them?
I googled, but that only shows how to escape (while I want to unescape) or how to make PDFs from R.
I tried this and it worked for me, but I still think there must be a better solution:
deTeX <- function(x) {
  # strip accent markup such as {\'{o}} down to the bare letter it wraps
  gsub("\\{\\\\.+?\\{([a-z]*)\\}\\}", "\\1", x, fixed = FALSE, perl = TRUE, ignore.case = TRUE)
}
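Applied to the example from the question, it strips the accent markup and returns the unaccented spelling, so both forms can be compared:
deTeX("L{\\'{o}}pez") # returns "Lopez"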

R, Windows and foreign language characters

This has been a longstanding problem with R: it can read non-Latin characters on Unix, but I cannot read them on Windows. I've reproduced this problem on several English-edition Windows machines over the years. I've tried changing the localisation settings in Windows and numerous other settings, to no effect. Has anyone actually been able to read a foreign-language text file on Windows? I think being able to read/write/display Unicode is a pretty nifty feature for a program.
Environment:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
The problem can be reproduced as follows:
Create a simple file in a language like Russian or Arabic in a text editor and save it as UTF-8 without a BOM.
> test_df <- read.table("test2.txt",header=FALSE, sep=";", quote="",fill=FALSE,
encoding="UTF-8",comment.char="",dec=",")
......Warning message:
......In read.table("test2.txt", header = FALSE, sep = ";", quote = "", :
......incomplete final line found by readTableHeader on 'test2.txt'
> test_df
...... V1 V2
......1 <U+043E><U+0439>!yes 9
Using read.csv() yields the same results, minus the warning. I realize that the "<U+xxxx>" escapes are both searchable and can be converted to readable characters by an external program. But I want to see actual Cyrillic text in charts, tables, output etc., like I can in every other program I've used.
So I've had this problem for a few years, consistently. Then one morning, yesterday, I tried the following:
test_df <- read.table("items.txt",header=FALSE, sep=";",quote="",fill=FALSE,
encoding="bytes",comment.char="",dec=",")
And encoding="bytes" worked! I saw cyrillic in the console. I then had to reinstall R (same version, same computer, same everything), the solution evaporated. I've literally retraced all my steps, and it seems like magic. Now encoding="bytes", just produces the same garbage (РєРѕРЅСЊСЏРє) as encoding="pizza" would (the param is ignored).
There is also a fileEncoding param for read.table. I am not sure what it does, but it doesn't work either and cannot read even English text.
Can you read a non-ASCII text file on your Windows PC? How on earth do you do it?
Try setting the locale. For example,
Sys.setlocale(locale = "Russian")
See ?Sys.setlocale for more information.
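A sketch of how that combines with the read.table call from the question (Windows accepts the short locale name "Russian"; exact behaviour can vary by Windows version):
Sys.setlocale("LC_CTYPE", "Russian") # only the character-type category needs to change
test_df <- read.table("test2.txt", header = FALSE, sep = ";", quote = "", fill = FALSE, encoding = "UTF-8", comment.char = "", dec = ",")
test_df # Cyrillic should now display instead of <U+043E>-style escapes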

Getting rid of BOM between SAS and R

I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:
read.table(myfile, header =TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way: numeric values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm now assuming is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header =TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header =FALSE, sep = "\t")
read.table(myfile, header =FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below does seem to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data: the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". I have no idea why this fixes the data corruption issue -- it could be the topic of a future question.
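A sketch of that workaround, using myfile as in the question (the type.convert step is my addition, to turn the character columns back into numbers afterwards):
dat <- read.table(myfile, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t", colClasses = "character")
dat[] <- lapply(dat, type.convert, as.is = TRUE) # convert columns back to numeric where possible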
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
Note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the R documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
Or you can use the SAS system option NOBOMFILE to write a UTF-8 file without the BOM.
