Import dataset with Spanish special characters into R

I'm new to R and I've imported a dataset, created in Excel as a CSV file, into my project using the "Import Dataset from Text File" function. However, the dataset displays the Spanish special characters (á, é, í, ó, ú, ñ) as the � symbol, as shown below:
Nombre Direccion Beneficiado
Mu�oz B�rbara H�medo
...
I then tried the following code to make R display the Spanish special characters:
Encoding(dataset) <- "UTF-8"
and received the following error:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
a character vector argument expected
So far I haven't been able to find a solution to this.
I'm working in RStudio version 0.98.1083 on Windows 7.
Thanks in advance for your help.
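Not part of the original thread, but a minimal sketch of two common fixes, assuming the CSV was saved by Excel in the Windows-1252 ("latin1") code page, which is typical on Windows; the file name dataset.csv is a placeholder. Note that Encoding() expects a character vector, which is why assigning it to a whole data frame fails as above.
# Option 1: declare the file encoding at import time.
dataset <- read.csv("dataset.csv", fileEncoding = "latin1",
                    stringsAsFactors = FALSE)

# Option 2: if the data are already imported as character columns,
# re-declare the encoding column by column.
char_cols <- sapply(dataset, is.character)
dataset[char_cols] <- lapply(dataset[char_cols], function(x) {
  Encoding(x) <- "latin1"
  x
})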

Related

Showing Unicode characters in R

Using the read_excel function, I read an Excel sheet which has a column that contains data in both English and Arabic.
English is shown normally in R, but the Arabic text is shown like this: <U+0627><U+0644><U+0639><U+0645><U+0644>
dataset <- read_excel("Dataset_Draft v1.xlsx",skip = 1 )
dataset %>% select(description)
I tried Sys.setlocale("LC_ALL", "en_US.UTF-8"), but with no success.
I want to display the Arabic text normally, and I want to filter on this column by an Arabic value.
Thank you.
You could try the read.xlsx() function from the xlsx library.
Here you can specify an encoding.
data <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding = "UTF-8")
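If read.xlsx() is not an option, the <U+....> output is often only a console rendering artifact on Windows; the underlying strings are usually valid UTF-8, so filtering can still work. A minimal sketch, assuming the description column from the question and building the Arabic value from its Unicode escapes:
library(readxl)
library(dplyr)

dataset <- read_excel("Dataset_Draft v1.xlsx", skip = 1)

# The console may still print <U+...> escapes, but comparisons are made
# on the underlying UTF-8 strings, so the filter works regardless.
target <- "\u0627\u0644\u0639\u0645\u0644"  # the word printed as <U+0627><U+0644><U+0639><U+0645><U+0644>
filtered <- dataset %>% filter(description == target)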

Keeping Turkish characters with the text mining package for R

Let me start by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make the Turkish characters İ, ı, ğ, Ğ, ş and Ş display correctly. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot represent these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file in ANSI encoding, without the Turkish characters.
This does not seem to be an issue with the output file only.
writeLines(as.character(docs[[1]]))
for example, yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7 with the most recent versions of R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Here is how I keep the Turkish characters intact:
Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
Copy and Paste your text containing Turkish characters.
Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus, e.g. starting from VectorSource() in the tm package, as in the sketch below.
The Turkish characters will appear as they should.
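A minimal sketch of the full sequence, assuming the yourdocument.Rmd file created in the steps above:
library(tm)

# Read the UTF-8 encoded .Rmd file and collapse it into a single string.
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")

# Build the corpus from the in-memory string rather than from DirSource,
# so the UTF-8 Turkish characters are carried through unchanged.
docs <- VCorpus(VectorSource(yourdocument),
                readerControl = list(language = "tur"))

# useBytes = TRUE writes the bytes as-is instead of translating them
# into the native Windows code page.
writeLines(as.character(docs[[1]]), con = "documents.txt", useBytes = TRUE)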

Accented characters with R

I am processing text with R (text classification) and I have a problem with some words in a French text, for example this:
Charg\u00e9 d'\u00e9tude
How can I resolve this problem?
Thank you.
I got the method from this answer: "Print unicode character string in R". It looks like R is supposed to handle accents, but maybe something is missing in the original file and R is not recognizing the text as Unicode.
library(stringi)
stri_unescape_unicode("Charg\u00e9 d'\u00e9tude")
[1] "Chargé d'étude"

utf-8 characters get lost when converting from list to data.frame in R

I am using R 3.2.0 with RStudio 0.98.1103 on Windows 7 64-bit. The Windows "regional and language settings" of my computer is English (United States).
For some reason, the following code replaces the Czech characters "č" and "ř" with "c" and "r" in the text "Koryčany nad přehradou" when I read an XML file in UTF-8 encoding from the web, parse it to a list, and convert the list to a data.frame.
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
#this still displays correctly "Koryčany nad přehradou"
print(siteName)
#make a data.frame from the list item. I suspect here is the problem.
df <- data.frame(name=siteName, id=1)
#now the Czech characters are lost. I see only "Korycany nad prehradou"
View(df)
write.csv(df,"test.csv")
#the test.csv file also contains "Korycany nad prehradou"
#instead of "Koryčany nad přehradou"
What is the problem? How do I make R show my data.frame correctly with all the UTF-8 special characters and save the .csv file without losing the Czech characters "č" and "ř"?
This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and to make the example so that my R script produces the same results on both the Windows and the Linux platform:
(1) Get XML data in UTF-8 from the Internet
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
(2) Print out the text from the Internet. The encoding is UTF-8, and the display in the R console is correct using both the Czech and the English locale on Windows:
> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
>
(3) Try to create and view a data.frame. This has a problem. The data.frame displays incorrectly both in the RStudio view and in the console:
df <- data.frame(name=siteName, id=1)
df
name id
1 Korycany nad prehradou 1
(4) Try to use a matrix instead. Surprisingly, the matrix displays correctly in the R console.
m <- as.matrix(df)
View(m) #this shows incorrectly in RStudio
m #however, this shows correctly in the R console.
name id
[1,] "Koryčany nad přehradou" "1"
(5) Change the locale. If I'm on Windows, set the locale to Czech. If I'm on Unix or Mac, set the locale to UTF-8. NOTE: this has some problems when I run the script in RStudio; apparently RStudio doesn't always react immediately to the Sys.setlocale command.
#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")
#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)
(6) Write the data to a text file. IMPORTANT: don't use write.csv; instead use write.table. When my locale is Czech on my English Windows, I must use fileEncoding="UTF-8" in write.table. Now the text file shows up correctly in Notepad++ and also in Excel.
write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
(7) Set the locale back to the original.
Sys.setlocale("LC_CTYPE", original.locale)
(8) Try to read the text file back into R. NOTE: when reading the file, I had to set the encoding parameter (NOT fileEncoding!). The display of a data.frame read from the file is still incorrect, but when I convert the data.frame to a matrix, the Czech UTF-8 characters are preserved:
data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
name id
1 Korycany nad prehradou 1
#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
name id
1 "Koryčany nad přehradou" "1"
So the lesson learnt is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. When I write the file, I must make sure fileEncoding is set to "UTF-8". On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".
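A condensed sketch of that round trip, assuming df is the data.frame built above and the output file name is a placeholder:
# Write: convert to a matrix, switch to a locale that covers the
# characters, and force UTF-8 output.
m <- as.matrix(df)
original.locale <- Sys.getlocale(category = "LC_CTYPE")
new.locale <- ifelse(.Platform$OS.type == "windows",
                     "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)
write.table(m, "test-czech-utf8.txt", sep = "\t", fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", original.locale)

# Read back in the original locale, declaring encoding (not fileEncoding),
# and convert to a matrix for correct display.
matrix.from.file <- as.matrix(read.table("test-czech-utf8.txt", sep = "\t",
                                         encoding = "UTF-8"))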
If anybody has a better solution, I'll welcome your suggestions.

Reading csv file with Japanese characters into R

I am struggling to have R read in a CSV file which has some columns in standard English characters, some numerical columns, and some fields in Japanese characters. Here is what the data looks like:
category,desc,otherdesc,volume
UPC - 31401 Age Itameabura,かどや製油 純白ごま油,OIL_OTHERS_SML_ECO,83.0
UPC - 31401 Age Itameabura,オレインリッチ,OIL_OTHERS_MED,137.0
UPC - 31401 Age Itameabura,TVキャノーラ油,OIL_CANOLA_OTHERS_LRG,3026.0
Keeping R's language setting as English, the Japanese characters are converted into gibberish. When I change the language setting in R to Japanese with Sys.setlocale("LC_CTYPE", "japanese"), the file is not read in at all. R gives an error saying:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at 'サcategory'
I have no clue what's wrong with my CSV file or the header names. Can you guide me on how to read this CSV file into R so that everything is displayed just as it is in the CSV file?
Thanks!
Vish
For Japanese, the following works for me:
df <- read.csv("your_file.csv", fileEncoding="cp932")
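An alternative, if the readr package is available, is to declare the same encoding through a locale object; cp932 is the Windows variant of Shift-JIS that Japanese Excel typically produces:
library(readr)

# File name as above; cp932 covers the Japanese characters in the sample.
df <- read_csv("your_file.csv", locale = locale(encoding = "cp932"))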
