I need to study a corpus of various .txt files in R with the tm package. My files have to be stored in a single folder called "english" since their text is in English, but the file names are sometimes in Chinese, Korean, Russian or Arabic.
I have tried using the code below to create my corpus.
setwd("C:/Users/etomilina/Documents/M2/mails/english")
data <- Corpus(DirSource(encoding="UTF-8"))
However, I get the following error message: "non-existent or non-readable files : ./??.txt, ./?????.txt" (and so on, with question marks instead of the file names, which should be 陳國.txt or 정극지생명과학연구부.txt, for example).
Currently Sys.getlocale() returns "LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252", but even after calling Sys.setlocale(locale = "English") the error persists.
If I only had Chinese file names, I would simply switch my system locale to Chinese, but here I have some in Korean, Arabic and Russian as well, besides some English ones.
Is there any way to handle all these languages at once?
I am using RStudio 1.4.1106 and my R version is 4.2.1.
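One workaround that may be worth trying (a sketch, not a tested answer: it assumes R 4.2+, which uses UTF-8 as the native encoding on Windows, and it enumerates the files explicitly instead of letting DirSource scan the working directory):

library(tm)
# list.files() returns the mixed-script file names as UTF-8 strings
files <- list.files("C:/Users/etomilina/Documents/M2/mails/english",
                    pattern = "\\.txt$", full.names = TRUE)
# URISource() accepts an explicit vector of file paths
data <- VCorpus(URISource(files, encoding = "UTF-8"),
                readerControl = list(language = "en"))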
I am working with a dataset that contains data in multiple languages.
Is there a way to export my work as a CSV file and have R maintain the use of characters in a foreign language instead of replacing them with gibberish English symbols?
Update for anyone who reaches this by Google:
It looks like R only pretends to screw up foreign languages. When you use write_csv, it actually does create a .csv that uses the correct foreign characters.
However, you'll only see them if you open them in Notepad. If you open it in Excel, Excel will screw it up, and if you open it with read_csv, R will screw it up (but will still export it correctly when you use write_csv again).
write_excel_csv() from the readr package seems to work without messing up the foreign characters when opening with Excel.
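A minimal sketch of that round trip (the data frame df and the file names are placeholders):

library(readr)
# write_excel_csv() prepends a UTF-8 byte-order mark (BOM), which Excel
# uses to detect the encoding instead of guessing from the locale
write_excel_csv(df, "multilingual.csv")
# when reading back into R, declare the encoding explicitly
df2 <- read_csv("multilingual.csv", locale = locale(encoding = "UTF-8"))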
Our users have RStudio installed on their local machines and use Shiny to filter data and export dataframes to an .xlsx file.
This works really well for most characters, but not for the Japanese and Mandarin ones: for those, they see ??????? instead of the actual text.
The data resides in a SQL database and we're using RODBC to connect to it.
RODBC doesn't seem to like reading these Japanese and Mandarin characters. Is there a way to get around this?
Any help is much appreciated!
Thanks
I had a similar problem with the French language the other day. Maybe these options can help you:
In RStudio, try going to Tools > Global Options > Code > Saving and then choose the right encoding for Japanese and Mandarin. The UTF-8 encoding might work for you.
The blog post Escaping from character encoding hell in R on Windows explains how to set the encoding to import external documents. It should work with data imported with RODBC as well. The author uses Japanese characters in his examples.
In the odbcDriverConnect() function of the RODBC package, the argument DBMSencoding="UTF-8" might work for you.
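A sketch of that last option (the connection string is illustrative; substitute your own driver, server and database):

library(RODBC)
# DBMSencoding tells RODBC how to interpret strings coming back from the DBMS
con <- odbcDriverConnect(
  connection = "Driver={SQL Server};Server=myserver;Database=mydb;Trusted_Connection=yes;",
  DBMSencoding = "UTF-8")
df <- sqlQuery(con, "SELECT * FROM mytable")
odbcClose(con)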
I scraped a site that includes the names of many different cities from around the world using R's rvest package. Some of these names contain German umlauts and characters from almost every other major language, which do not display properly in the .csv file I used to output the text. Is there a way to make Excel display these names properly? I'm using Excel 2011 on Mac. Here are some examples of how the names appear in my csv file:
"MÔÈhldorf am Inn" instead of "Mühldorf am Inn"
"PÔ_rnu" instead of "Pärnu"
I did not use any kind of encoding when outputting the text as a CSV and don't have access to the original scraped object in R.
write.csv(data, "master_me_data.csv")
Any help would be appreciated.
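If the .csv on disk is actually valid UTF-8 that Excel merely misreads, one possible repair is to re-import and re-export it with a byte-order mark, which Excel uses to detect the encoding (a sketch using readr; the output file name is made up):

library(readr)
# re-read the file, declaring the encoding it was written in
cities <- read.csv("master_me_data.csv", fileEncoding = "UTF-8")
# re-export with a UTF-8 BOM so Excel picks the right encoding
write_excel_csv(cities, "master_me_data_excel.csv")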
Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. The age of the original paper documents, and perhaps imperfections in the original paper, typing and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the pdf files to MSWord seems to be the most successful at correctly translating the tables. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord, the resulting files are very complex and contain numerous errors. I thought that reading the MSWord files into R might be the most efficient way to edit and correct them.
I am aware of the tm package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text file (that is, a plain ASCII text file).
The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy on Linux; no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain ASCII text files (not Word files) - they should open and display correctly in Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
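A minimal sketch of readPDF() usage, assuming pdftotext is installed and on the PATH (readPDF() returns a reader function, which is then applied to a file):

library(tm)
# "-layout" asks pdftotext to preserve the table layout
pdf_reader <- readPDF(control = list(text = "-layout"))
doc <- pdf_reader(elem = list(uri = "c:/path/to/scanned_table.pdf"),
                  language = "en", id = "scanned_table.pdf")
head(content(doc))  # first few lines of the extracted text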
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: there is now a package dedicated specifically to reading text data, readtext (https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html), which handles Word files, including the newer .docx format.
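Usage is essentially a one-liner (sketch; the path is the one from the question above):

library(readtext)
# readtext() detects the format from the file extension
dat <- readtext("c:/users/mark w miller/simple R programs/test_for_r.docx")
dat$text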
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. Convert the pdf to MSWord with Acrobat X Pro.
2. Delete the vertical lines from the MSWord file. The original tables had solid vertical lines separating the columns, and it turns out these lines were disrupting the format of the data whenever an MSWord file was converted to a text file.
3. Convert the MSWord file (with the vertical lines from Step 2 now deleted) to a text file.
The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, and I will not have to re-enter all the data from the pdfs by hand, which saves many hours of work.
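Once the plain text file exists, something like this should pull the whitespace-separated columns into a data frame (a sketch; the .txt file name is an assumption):

# read.table() splits fields on runs of whitespace by default
my.data <- read.table("c:/users/mark w miller/simple R programs/test_for_r.txt",
                      header = FALSE, stringsAsFactors = FALSE)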
You can do this very easily with RDCOMClient.
That said, some characters will not read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())
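It is also worth tidying up afterwards so the Word process does not linger in the background (Close and Quit are standard Word COM methods):

# Close the document and quit Word when finished
doc$Close()
wordApp$Quit()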