Proper encoding for umlauts in .csv file - r

I scraped a site that includes the names of many different cities from around the world using R's rvest package. Some of these names contain German umlauts and characters from almost every other major language, and they are not displaying properly in the .csv file I used to output the text. Is there a way to make Excel display these names correctly? I'm using Excel 2011 on Mac. Here are some examples of how the names appear in my csv file:
"MÔÈhldorf am Inn" instead of "Mühldorf am Inn"
"PÔ_rnu" instead of "Pärnu"
I did not specify any encoding when writing the text out as a csv, and I no longer have access to the original scraped object in R.
write.csv(data, "master_me_data.csv")
Any help would be appreciated.
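One common fix, offered here as a sketch rather than a confirmed answer (it assumes the scrape can be re-run so that data, the data frame from the question, is available again): write the CSV as UTF-8, ideally with a byte-order mark so Excel can detect the encoding. Older Mac versions of Excel may still need the Text Import Wizard even then.
# A sketch, not the poster's original code: `data` stands in for the scraped data frame.
library(readr)
# write_excel_csv() writes UTF-8 with a byte-order mark, which helps Excel detect the encoding
write_excel_csv(data, "master_me_data.csv")
# Base R alternative: declare the encoding explicitly when writing
write.csv(data, "master_me_data.csv", fileEncoding = "UTF-8")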

Related

Why do Unicode characters get converted when loading an Excel file into an R data frame?

I have an Excel file with Unicode characters (Korean and Japanese, for example). When I load it from the Excel file into an R data frame, the characters are converted to escape codes. For example:
Excel Source Column
values=KR|512207456|투비씨엔씨(주)
After loading the Excel file into R:
DF = KR|512207456|<U+D22C><U+BE44><U+C528><U+C5D4><U+C528>(<U+C8FC>)
I did a lot of googling to find a solution, but unfortunately was not able to find one. Any help would be highly appreciated.
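A hedged side note rather than an answer from the thread: on Windows, R prints characters that the native locale cannot represent as <U+XXXX> escapes, but the underlying strings are usually intact UTF-8. A quick way to check (the file name below is a placeholder):
library(readxl)
df <- read_xlsx("source.xlsx", col_types = "text")   # placeholder file name
Encoding(df[[1]])   # typically reports "UTF-8" for the affected cells
# Writing the column to a UTF-8 file and opening it in a Unicode-aware editor
# usually shows the original Korean/Japanese text even when the console prints <U+...>
writeLines(df[[1]], con = file("check.txt", encoding = "UTF-8"))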

Exporting Chinese characters from Excel to R

I have a file in Excel which has a column with Simplified Chinese characters. When I read the corresponding CSV file into R, I only get ?'s.
I'm afraid the problem occurs when exporting from Excel to CSV, because when I open the CSV file in a text editor I also get ?'s.
How can I get around this?
The best way to preserve your Chinese/Unicode characters is to read the file directly from .xlsx:
library(readxl)
read_xlsx("yourfilepath.xlsx", col_types = "text")
If your file is too big to read from .xlsx, then the best approach is to open it in Excel and split it manually into multiple files.
(In my experience, on a laptop with 8 GB of RAM, splitting files into 250,000 rows x 106 columns works.)
If you need to read from .csv, all your Windows settings/localization need to match the file's encoding, and even that does not guarantee the integrity of all your Unicode characters (e.g. emojis).
(If you also need a .csv for something else, you can use the R function write.csv after you have read the data from .xlsx into R.)
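As a sketch of that round trip, reusing the file name from the answer above: read from .xlsx and then export. write_excel_csv() from readr writes UTF-8 with a byte-order mark, which Excel reopens without mangling the characters.
library(readxl)
library(readr)
dat <- read_xlsx("yourfilepath.xlsx", col_types = "text")
# Base R export, declaring UTF-8 explicitly
write.csv(dat, "yourfilepath.csv", fileEncoding = "UTF-8", row.names = FALSE)
# Or write UTF-8 with a byte-order mark so Excel reopens it cleanly
write_excel_csv(dat, "yourfilepath.csv")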

Reading Chinese characters in R using readChar()

I notice that it is easy to load CSV files containing Chinese characters in R with read.csv("mydata.csv", encoding="UTF-8").
I have text files (.txt) which I want to read using readChar(). However, readChar() does not allow me to specify an encoding.
What can I do?
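One common workaround, offered as a sketch rather than the thread's accepted answer: read the whole file as bytes and then declare the encoding on the resulting string (the file name is a placeholder).
# Read the entire file in one call, byte-for-byte, without any re-encoding
raw_txt <- readChar("mydata.txt", nchars = file.info("mydata.txt")$size, useBytes = TRUE)
# Declare that the bytes are UTF-8 so the Chinese characters print correctly
Encoding(raw_txt) <- "UTF-8"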

write.csv with encoding UTF8

I am using Windows 7 Home Premium and RStudio 0.99.896.
I have a csv file containing a column with text in several different languages, e.g. English, other European languages, Korean, Simplified Chinese, Traditional Chinese, Greek, Japanese, etc.
I read it into R using
table<-read.csv("broker.csv",stringsAsFactors =F, encoding="UTF-8")
so that all the text is readable in its own language.
Most of the text is within a column named "content". Within the console, when I have a look using
toplines<-head(table$content,10)
I can see all the languages as they are, but when I write to a csv file and open it in Excel, I can no longer see the languages. I typed
write.csv(toplines,file="toplines.csv",fileEncoding="UTF-8")
Then I opened toplines.csv in Excel 2013 and it looked like this:
1 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
2 [<U+4E2D><U+56FD><U+6C11><U+822A><U+51C6.....
3 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
and so forth
Would anyone be able to tell me how I can write to a csv or Excel file so that the languages can be read as they are in Excel 2013? Thank you very much.
write_excel_csv() from the readr package, as suggested by #PhiSeu in the comments, has solved it for me.
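For completeness, a sketch of that call using the toplines object from the question; since write_excel_csv() expects a data frame, the character vector is wrapped first.
library(readr)
# write_excel_csv() writes UTF-8 with a byte-order mark, which Excel 2013 detects automatically
write_excel_csv(data.frame(content = toplines), "toplines.csv")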

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. The age of the original paper documents, and perhaps imperfections in the paper, typing and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the pdf files to MSWord seems to be the most successful way of correctly translating the tables. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord, the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and correct them.
I am aware of the tm package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text (that is, plain ASCII text) file.
The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work on newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html - it appears there is now a package, readtext, dedicated specifically to reading text data, including Word files (including the newer .docx format).
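A minimal sketch of that package's interface, reusing the file path from the question; readtext() returns a data frame with a doc_id column and a text column.
library(readtext)
# Read the Word document; the extracted text ends up in the text column
doc <- readtext("c:/users/mark w miller/simple R programs/test_for_r.docx")
cat(doc$text)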
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. I converted a pdf to MSWord with Acrobat X Pro.
2. The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from the MSWord file before creating a text file.
3. I converted the MSWord file to a text file after deleting the vertical lines in Step 2.
The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, and I will not have to re-enter all the data in the pdfs by hand, saving many hours of work.
You can do this with RDCOMClient very easily.
That said, some characters may not read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())
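A small, hedged addition that was not part of the original answer: once the text has been read, the document and the Word instance can be closed so a Word process is not left running.
# Close the document without saving changes, then quit the Word instance
doc$Close(FALSE)
wordApp$Quit()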
