I have a CSV file with Hindi and English words. Currently, R is not reading the Hindi words. How do I enable support for other languages alongside English in RStudio?
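A minimal sketch of one common approach, assuming the CSV is saved as UTF-8 (the question does not say which encoding the file uses). The file, column names, and sample word below are made up for illustration; the only relevant part is the encoding argument of read.csv().

```r
# Build a tiny UTF-8 CSV with one Hindi word (illustrative data only).
hindi_word <- "\u0928\u092e\u0938\u094d\u0924\u0947"
csv_path <- tempfile(fileext = ".csv")
csv_lines <- enc2utf8(c("hindi,english", paste0(hindi_word, ",hello")))
# useBytes = TRUE writes the UTF-8 bytes unchanged, whatever the locale.
writeLines(csv_lines, csv_path, useBytes = TRUE)

# encoding = "UTF-8" tells R that the strings it reads are UTF-8.
words <- read.csv(csv_path, encoding = "UTF-8", stringsAsFactors = FALSE)
print(words)
```

If the Hindi text still looks wrong after this, the file is probably not UTF-8, and it would need to be re-saved or converted first.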
Related
I want to download all the words in Spanish from A-Z. No definitions or any of that, just the words.
When I search "download Spanish dictionary" I just get either an English to Spanish dictionary or some app that has the words.
I need to study a corpus of various .txt files in R with the tm package. My files have to be stored in a single folder called "english" since their text is in English, but the file names are sometimes in Chinese, Korean, Russian or Arabic.
I have tried using the code below to create my corpus.
setwd("C:/Users/etomilina/Documents/M2/mails/english")
data <- Corpus(DirSource(encoding="UTF-8"))
However, I get the following error message: "non-existent or non-readable files : ./??.txt, ./?????.txt" (and so on, with question marks in place of the file names, which should be, for example, 陳國.txt or 정극지생명과학연구부.txt).
Currently Sys.getlocale() returns "LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252", but even after calling Sys.setlocale("English") the error persists.
If I only had Chinese filenames I would simply switch my system locale to Chinese, but here I have some in Korean, Arabic and Russian as well besides some English ones.
Is there any way to handle all these languages at once?
I am using RStudio 1.4.1106 and my R version is 4.2.1.
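A small diagnostic sketch that may help narrow this down; it is not a fix. It assumes the question marks mean that R cannot represent those file names in the session's native encoding. The helper name check_txt_files is made up for illustration and uses base R only.

```r
# List the .txt files R can see in a directory, with the encoding R has
# recorded for each name and whether R can actually read the file.
check_txt_files <- function(dir = ".") {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  data.frame(
    file = basename(paths),
    name_encoding = Encoding(basename(paths)),
    readable = unname(file.access(paths, mode = 4) == 0),
    stringsAsFactors = FALSE
  )
}

# TRUE when the session's native encoding is UTF-8; in that case names in
# several scripts can coexist without switching locales.
utf8_session <- isTRUE(l10n_info()[["UTF-8"]])
print(utf8_session)
```

Running check_txt_files("C:/Users/etomilina/Documents/M2/mails/english") would show which names come back mangled or unreadable before tm is involved at all.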
I need help reading Korean characters into the R environment.
This one is my file.
Here is the error
I am using Windows 7 Home Premium and RStudio 0.99.896.
I have a CSV file containing a column with text in several different languages, e.g. English, European languages, Korean, Simplified Chinese, Traditional Chinese, Greek, Japanese, etc.
I read it into R using
table<-read.csv("broker.csv",stringsAsFactors =F, encoding="UTF-8")
so that all the text is readable in its language.
Most of the text is in a column named "content". In the console, when I have a look with
toplines<-head(table$content,10)
I can see all the languages as they are, but when I write to a CSV file and open it in Excel, I can no longer see them. I typed in
write.csv(toplines,file="toplines.csv",fileEncoding="UTF-8")
then I opened toplines.csv in Excel 2013 and it looked like this
1 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
2 [<U+4E2D><U+56FD><U+6C11><U+822A><U+51C6.....
3 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
and so forth
Would anyone be able to tell me how I can write to a CSV or Excel file so that the languages can be read as they are in Excel 2013? Thank you very much.
write_excel_csv() from the readr package, as suggested by @PhiSeu in the comments, has solved it for me.
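For context, write_excel_csv() writes UTF-8 and also prepends a byte order mark (BOM), which is how Excel recognises the encoding. Below is a minimal base-R sketch of the same idea for a character vector such as toplines. The helper name write_utf8_bom_csv, its header argument, and the sample strings are made up for illustration; readr's function remains the simpler choice.

```r
# Write a character vector as a one-column CSV: UTF-8 BOM first, then
# UTF-8 bytes, with CRLF line endings.
write_utf8_bom_csv <- function(text, path, header = "content") {
  text <- enc2utf8(text)
  # Quote every field and double any embedded quotes (standard CSV escaping).
  quoted <- paste0('"', gsub('"', '""', text, fixed = TRUE), '"')
  con <- file(path, open = "wb")
  on.exit(close(con))
  # The three BOM bytes that signal UTF-8 to Excel.
  writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
  # useBytes = TRUE stops R from translating the text to the native encoding.
  writeLines(c(header, quoted), con, sep = "\r\n", useBytes = TRUE)
  invisible(path)
}

toplines <- c("\u5916\u5a92", 'plain "quoted" text')
out_file <- tempfile(fileext = ".csv")
write_utf8_bom_csv(toplines, out_file)
```

Writing the bytes directly avoids a round trip through the native encoding, which is the likely source of the <U+5916> escapes on a non-UTF-8 Windows session.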
This problem has troubled me for a year. R has trouble opening my CSV file, which contains Simplified Chinese characters. I believe the data is encoded as GBK. I have three computers with different languages and operating systems, and I get mixed results when opening the same Chinese CSV file. Could someone tell me why the results are different?
(1) Windows + English OS + English R and RStudio: UNABLE to read my CSV, even when I specify the encoding as UTF-8, GBK, or any other Chinese encoding.
(2) Mac + English OS + English R: ABLE to read the Chinese CSV without forcing the encoding (update: after I reinstalled the operating system as El Capitan, it could no longer open my CSV correctly).
(3) Windows + Chinese OS + Chinese R: ABLE to read the CSV without forcing the encoding or specifying GBK.
(4) Windows + English OS + Chinese R: UNABLE.
(5) Ubuntu + English OS + English R: ABLE.
In the Windows cases (English and Chinese OS), Notepad can open the CSV correctly, but Excel cannot in the English case. Whenever I cannot open my CSV with Excel, R cannot read it either.
If I convert the CSV with Google Sheets, Excel can open it, but R still cannot.
How does encoding work in R, and why do the results change with the OS language?
read.csv(...,encoding=)
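One detail that may explain part of the confusion: in read.csv(), the encoding argument only marks the strings as being in a given encoding and does not convert them, while fileEncoding re-encodes the file as it is read. A minimal sketch of that difference on an in-memory string follows. The bytes are "中文" in GBK; the variable names are illustrative, and the example assumes the platform's iconv supports GBK.

```r
# The two characters of "中文" as GBK bytes (D6 D0 CE C4).
gbk_text <- rawToChar(as.raw(c(0xD6, 0xD0, 0xCE, 0xC4)))

# Marking: only the label changes, the bytes stay GBK, so the result is not
# valid UTF-8. This is the situation read.csv(..., encoding = "UTF-8")
# creates for a GBK file.
mislabelled <- gbk_text
Encoding(mislabelled) <- "UTF-8"
print(validUTF8(mislabelled))

# Converting: iconv() rewrites the bytes from GBK to UTF-8, which is what
# read.csv(..., fileEncoding = "GBK") does while reading.
converted <- iconv(gbk_text, from = "GBK", to = "UTF-8")
print(validUTF8(converted))
```

When no conversion is requested, R interprets the bytes in the session's native encoding, which depends on the OS language settings, so the same GBK file can read correctly on a Chinese Windows and incorrectly on an English one.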
It could be related to how Excel handles CSV encoding. If your Windows operating system is in English, Excel might not open the CSV correctly. A workaround is to use Google Sheets, or a spreadsheet application installed on Ubuntu, to convert the file to CSV again and then try opening it in R.
I have figured out how to solve it. The approach works for large files (under 800 MB) containing Simplified Chinese characters. The key is to know the default Chinese encoding of your operating system.
Ubuntu uses UTF-8 as its default encoding for Chinese, so there you should encode the file as UTF-8 instead of GB18030 or another GB-family encoding.
(1) Download OpenOffice (it is free and fast to install, and it handles larger files than Calc in Ubuntu).
(2) Detect your CSV's encoding. Simply open the CSV in OpenOffice and choose the encoding that displays your Chinese characters correctly.
(3) Save your CSV in the encoding that matches your operating system. The default for Chinese on Windows is GBK, and on Ubuntu it is UTF-8.
This should solve both your file size problem and your encoding problem. You do not even need to force the encoding; a plain read.csv will work.
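The re-saving in step (3) can also be done inside R rather than in OpenOffice. This is a minimal sketch of that alternative, assuming the source encoding is already known (GBK here) and that the platform's iconv supports it. The helper name convert_to_utf8 and the sample file are made up for illustration.

```r
# Re-save a text file as UTF-8, given its current encoding.
convert_to_utf8 <- function(input, output, from = "GBK") {
  # Read the lines as raw bytes, without any re-encoding.
  lines <- readLines(input, warn = FALSE)
  utf8_lines <- iconv(lines, from = from, to = "UTF-8")
  # iconv() returns NA for lines that are not valid in the 'from' encoding.
  if (anyNA(utf8_lines)) stop("some lines are not valid ", from)
  con <- file(output, open = "wb")
  on.exit(close(con))
  writeLines(utf8_lines, con, sep = "\n", useBytes = TRUE)
  invisible(output)
}

# Illustrative input: a header line plus "中文" stored as GBK bytes.
gbk_file <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("word\n"), as.raw(c(0xD6, 0xD0, 0xCE, 0xC4)), charToRaw("\n")), gbk_file)

utf8_file <- convert_to_utf8(gbk_file, tempfile(fileext = ".csv"))
df <- read.csv(utf8_file, encoding = "UTF-8", stringsAsFactors = FALSE)
```

A wrong guess for the source encoding usually stops with the error above, although some wrong guesses convert without complaint and produce garbled text, so it is worth checking a few lines of the result.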