I am processing text in R (text classification) and I have a problem with some words in a French text, for example:
Charg\u00e9 d'\u00e9tude
How can I resolve this problem?
Thank you
I got the method from this answer: "Print unicode character string in R". R is supposed to handle accents, but perhaps something is missing in the original file and R is not recognizing the text as Unicode.
library(stringi)
stri_unescape_unicode("Charg\u00e9 d'\u00e9tude")
[1] "Chargé d'étude"
Using the read_excel function, I read an Excel sheet that has a column containing data in both English and Arabic.
English is shown normally in R, but the Arabic text is shown like this: <U+0627><U+0644><U+0639><U+0645><U+0644>
dataset <- read_excel("Dataset_Draft v1.xlsx", skip = 1)
dataset %>% select(description)
I tried Sys.setlocale("LC_ALL", "en_US.UTF-8") but with no success.
I want to display the Arabic text normally, and I want to filter on this column by Arabic values.
Thank you.
You could try the read.xlsx() function from the xlsx package, which lets you specify an encoding.
data <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding = "UTF-8")
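For the filtering part of the question, a short follow-up sketch, assuming the Arabic column is named description as in the asker's code and using the word shown as <U+0627><U+0644><U+0639><U+0645><U+0644> above:
library(dplyr)
# Keep rows whose description equals an Arabic value
# ("\u0627\u0644\u0639\u0645\u0644" is the escaped form of the word from the question)
data %>% filter(description == "\u0627\u0644\u0639\u0645\u0644")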
I'm trying to load a dataset into R using an API that lets me run a query and returns the data I need (I can't configure anything on the server side).
I know it has something to do with encoding. When I check the string from my data frame in R, it gives me ENC: UTF-8 "Cosmética".
When I copy the source string "Cosmética", it gives me latin1.
How can I get the UTF-8 string properly formatted like the latin1 one?
I've tried this:
Sys.setlocale("LC_ALL","Spanish")
and tried directly on the string:
Encoding(Description) <- "latin1"
Unfortunately I can't get it to work. Any ideas are welcome! Thanks.
You can use iconv to change the encoding of the string:
iconv(mystring, to = "ISO-8859-1")
# [1] "Cosmética"
ISO 8859-1 (latin1) is the common character encoding in Western Europe.
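The same conversion works on a whole column; a minimal sketch, with df and Description as assumed names taken from the question's snippet:
# Convert every value in the column from UTF-8 to latin1 (ISO 8859-1)
# (df$Description is an assumed name, not from the original answer)
df$Description <- iconv(df$Description, from = "UTF-8", to = "ISO-8859-1")
# Check what R now reports as the declared encoding
unique(Encoding(df$Description))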
Let me start by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot display these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file without the characters in ANSI encoding.
This seems to not only be an issue with the output file.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7 with the most recent versions of both R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Here is how I keep the Turkish characters intact:
Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
Copy and Paste your text containing Turkish characters.
Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus, e.g. starting from VectorSource() in the tm package (see the sketch below).
Turkish characters will appear as they should.
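A minimal sketch of that last step, reusing the file name from above and assuming the tm package and the Turkish language tag from the question:
library(tm)
# Read the UTF-8-saved .Rmd file and collapse it into a single string
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
# Build the corpus from the in-memory string instead of reading files from disk
docs <- VCorpus(VectorSource(yourdocument), readerControl = list(language = "tur"))
writeLines(as.character(docs[[1]]))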
I'm new to R and I've imported a dataset in CSV format, created in Excel, into my project using the "Import Dataset from Text File" function. However the dataset displays the Spanish special characters (á, é, í, ó, ú, ñ) with the � symbol, as below:
Nombre Direccion Beneficiado
Mu�oz B�rbara H�medo
...
I then tried this code to make R display the Spanish special characters:
Encoding(dataset) <- "UTF-8"
and received the following error:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
a character vector argument expected
So far I haven't been able to find a solution to this.
I'm working in RStudio version 0.98.1083 on Windows 7.
Thanks in advance for your help.
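For context on the error above: Encoding() works on character vectors, not whole data frames, so it has to be applied per column, and it only relabels the declared encoding rather than converting bytes. A minimal sketch, with the column name Nombre taken from the preview and a placeholder file name; the guess that the Excel-written CSV is latin1/Windows-1252 is an assumption:
# Encoding() expects a character vector, so target one column at a time
Encoding(dataset$Nombre) <- "UTF-8"
# If the CSV was written by Excel as latin1/Windows-1252 (an assumption),
# re-importing with fileEncoding converts the text on read
dataset <- read.csv("dataset.csv", fileEncoding = "latin1", stringsAsFactors = FALSE)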
I am struggling to have R read in a CSV file that has some columns in standard English characters, some numeric, and some fields in Japanese characters. Here is what the data looks like:
category,desc,otherdesc,volume
UPC - 31401 Age Itameabura,かどや製油 純白ごま油,OIL_OTHERS_SML_ECO,83.0
UPC - 31401 Age Itameabura,オレインリッチ,OIL_OTHERS_MED,137.0
UPC - 31401 Age Itameabura,TVキャノーラ油,OIL_CANOLA_OTHERS_LRG,3026.0
With R's language setting left as English, the Japanese characters are converted into gibberish. When I change the language setting in R to Japanese with Sys.setlocale("LC_CTYPE", "japanese"), the file is not read in at all. R gives an error saying:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at 'サcategory'
I have no clue what's wrong with my CSV file or the header names. Can you guide me on how to read this CSV file into R so that everything is displayed just as it is in the CSV file?
Thanks!
Vish
For Japanese, the below works for me:
df <- read.csv("your_file.csv", fileEncoding="cp932")