Reading Chinese characters in R

I am using read.xlsx to load a spreadsheet that has several columns containing Chinese characters.
slides <- read.xlsx("test1.xlsx", sheetName = "Sheet1", encoding = "UTF-8", stringsAsFactors = FALSE)
I have tried with and without specifying the encoding, and I have tried reading from a text file, a CSV file, etc. No matter the approach, the result is always:
樊志强 -> é‡åº†æ–°æ¡¥åŒ»é™¢
Is there any package/format/sys.locale setup that might help me get the right information into R?
Any help would be greatly appreciated.

You can try the readr package on the text/CSV version of your data, using the following code:
library(readr)
loc <- locale(encoding = "UTF-8")
slides <- read_table("filename", locale = loc)
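Note that readr reads delimited text files, so the above applies to the CSV/text versions you tried. For the original workbook itself, the readxl package may be worth a try, since it returns strings as UTF-8 by default. A minimal sketch, assuming the sheet name from your question:
library(readxl)
# read_excel() returns character columns encoded as UTF-8
slides <- read_excel("test1.xlsx", sheet = "Sheet1")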

Related

Can someone help resolve the error reading csv file for herbarium data?

I am trying to import a csv that is in Japanese. This code:
url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)
returns the following error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>񓙂̏󋵁#(<8f>T<8e><9f><81>E<8e>w<92><e8><95>񍐋#<8a>փx<81>[<83>X<81>j'
I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?
Encoding sets the encoding of a character string. It doesn't set the encoding of the file represented by the character string, which is what you want.
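A minimal illustration of that distinction:
u <- "week.csv"         # a character string holding a file name
Encoding(u)             # "unknown": the declared encoding of the string itself
Encoding(u) <- "UTF-8"  # relabels the string, but does nothing to the bytes
                        # inside the file it names; for that, pass fileEncoding=
                        # to read.csv(), as below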
This worked for me, after trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header = FALSE, stringsAsFactors = FALSE,
              fileEncoding = "latin1", skip = 16)
# get started with the clean-up
x[, 1] <- gsub("\u0081|`", "", x[, 1])    # get rid of odd characters
x[, -1] <- as.data.frame(lapply(x[, -1],  # convert columns to numbers
  function(d) type.convert(gsub(",", "", d))))
You may have encountered this issue because of an incompatible system locale.
Try setting the system locale with Sys.setlocale("LC_ALL", "C").
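If you go this route, it can be worth saving the current setting so you can restore it afterwards. A sketch (it restores LC_CTYPE individually, since composite LC_ALL values cannot always be set back):
old_ctype <- Sys.getlocale("LC_CTYPE")  # remember the current setting
Sys.setlocale("LC_ALL", "C")            # fall back to the C locale
x <- read.csv(url, header = FALSE, stringsAsFactors = FALSE)
Sys.setlocale("LC_CTYPE", old_ctype)    # restore when done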
The readr package from the tidyverse might help.
You can set the encoding via the locale argument of the read_csv() function, using the locale() function and its encoding argument:
read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "latin1"))
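Since the file comes from a Japanese source, if latin1 still leaves the text garbled, the real encoding may be CP932 (Windows Shift-JIS, common for Japanese government data). That is an assumption, but easy to test:
read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "CP932"))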
The simplest solution I found for this issue, without losing any data or special characters (with fileEncoding="latin1", for example, characters like the Euro sign € are lost), is to open the file first in a text editor such as Sublime Text and choose "Save with encoding - UTF-8".
R can then import the file with no issues and no character loss.
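If you'd rather stay inside R than round-trip through an editor, base iconv() can do the same conversion. A sketch, assuming the source file (here a hypothetical data.csv) really is latin1:
# read the raw lines as latin1, convert to UTF-8, write back out
lines <- readLines("data.csv", encoding = "latin1")
writeLines(iconv(lines, from = "latin1", to = "UTF-8"),
           "data_utf8.csv", useBytes = TRUE)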
I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!
I came across this error (invalid multibyte string 1) recently, but my problem was a bit different:
we had saved a csv.gz file without its extension and tried to use read_csv() to read it. Adding the extension solved the problem.
For those using Rattle with this issue, here is how I solved it:
First make sure to quit rattle so you're at the R command prompt:
> library(rattle)   # if not done so already
> crv$csv.encoding <- "latin1"
> rattle()
You should now be able to carry on, i.e. import your csv > Execute > Model > Execute, etc.
That worked for me; hopefully it helps a weary traveller.
I had a similar problem with scientific articles and found a good solution here:
http://tm.r-forge.r-project.org/faq.html
Using the following line of code:
tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
any bytes that cannot be converted are replaced with their hex codes.
I hope this helps.
If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV; that fixed this error for me when importing into R.

Why can't R read this table while Excel can?

I am trying to read a specific file that I have copied from an SFTP location. The file is pipe-delimited. I can read the file in Excel, but R reads it as null values and duplicates the column names. I don't understand whether this is an encoding issue. I am trying to create a bash script to automate this process. Any help? Below is the link for the data.
Here's the file!
I have tried changing the encoding, but without knowing which encoding the file uses, I am struggling. I have tried read_delim, read_table, read.table, read_csv and read.csv, but with no luck.
This is the code I have used to read the file:
read_delim("./Engagement_Level.txt", delim = "|")
I would like to read it as a data frame.
The issue is that the file encoding is UTF-16LE, which read_delim cannot read at present.
You could use the base read.delim and file() to specify the encoding:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|")
That will convert all the quoted numbers to numeric. If you'd rather they were type character, to deal with later:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|",
           colClasses = "character")
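If you are unsure what encoding a mystery file uses, readr's guess_encoding() can often identify it before you commit to one:
readr::guess_encoding("Engagement_Level.txt")
# returns a small table of candidate encodings with confidence scores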
I also recommend using Excel to build a CSV file via Data > Text to Columns. It is not really appropriate in this context (you want to automate), but it is nearly infallible and quick.
Then use read.csv(file, sep = ",").

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I had reversed the args of download.file()... Repeat after me: always use named args...)
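For reference, a named-args download.file() call along those lines (a sketch, not retested here; mode = "wb" matters on Windows for binary files such as zips):
download.file(
  url      = "https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017",
  destfile = "./data/rte_data.zip",
  mode     = "wb"  # binary mode; important on Windows for zip archives
)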
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() made it work.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still parsing problems, but I'm not sure they should be dealt with in this SO question.

exporting data frame from R to excel

I have a large data frame (df) that I want to export as an Excel file. I am using the WriteXLS function from the WriteXLS library. Everything is fine except for some non-English characters: for "İ", "ı" and some other characters, a strange symbol ("�") is printed in the Excel sheet. I guess it is an encoding issue. The code used is:
WriteXLS("df", "C:/Users/ozgur/Desktop/df.xlsx",Encoding = "UTF-8", col.names = TRUE,perl = "perl")
The base exporting function
write.table(df, "C:/Users/ozgur/Desktop/df.txt", sep = "\t", col.names = TRUE)
does not work well either; it does not preserve data types as in R.
How can I overcome this problem? I would be very glad for any help. Thanks a lot.
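A hedged alternative worth sketching (openxlsx is not mentioned in this thread, so treat this as an assumption): the openxlsx package writes .xlsx files directly, handles UTF-8 strings, and needs no Perl installation.
library(openxlsx)
# write.xlsx() preserves column types and UTF-8 strings
write.xlsx(df, "C:/Users/ozgur/Desktop/df.xlsx", colNames = TRUE)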

How to convert special symbols in web scraping with R?

I am learning how to scrape the web with the XML and RCurl packages. All goes well except for one thing: special characters like ö or č are read into R differently. For instance, í is read in as í. I assume the latter is some sort of HTML encoding of the former.
I have been looking for a way to convert these characters but have not found one. I am sure other people have stumbled upon this problem as well, and I suspect there must be some function to convert them. Does anyone know the solution? Thanks in advance.
Here is an example of the code, sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing it first while specifying the encoding, then reading the table, as here: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") #this will preserve characters
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
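As an aside, if strings like í have already made it into your data, that pattern is usually UTF-8 bytes that were mis-read as latin1, and it can often be reversed in place. A sketch:
# "\u00c3\u00ad" is "í", i.e. the two UTF-8 bytes of "í" decoded as latin1;
# re-encode the characters to latin1 bytes, then declare those bytes UTF-8
fix_mojibake <- function(s) {
  bytes <- iconv(s, from = "UTF-8", to = "latin1")
  Encoding(bytes) <- "UTF-8"
  bytes
}
fix_mojibake("\u00c3\u00ad")  # "í"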
