read an MSWord file into R - r

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. Age of the original paper document and perhaps imperfections in the original paper, typing and/or scanning process has resulted in some letters and numbers not being very clear. So far converting the pdf files to MSWord seems to be the most successful at correctly translating the tables. Converting the MSWord files to Excel or rich text, etc, has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought if I could read the MSWord files into R that might be the most efficient way to edit and correct them.
I am aware of 'package tm' that I guess can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.

First, readLines() is not the correct solution, since a Word file is not a text (that is plain, ASCII text) file.
The Word-related function in the tm package is called readDOC() but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work using newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.

In case it helps anyone else, https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html, it appears there's a new package dedicated specifically to reading text data, including Word files (also new .docx format).

I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
I converted a pdf to MSWord with Acrobat X Pro
The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from an MSWord file before creating a text file.
Convert the MSWord file to a text file after deleting vertical lines in Step 2.
Resulting text files still require extensive editing, but at least the data are largely present in a format R can read and I will not have to re-enter all data in the pdfs by hand, saving many hours of work.

You can do this with RDCOMClient very easily.
In saying so, some characters will not read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())

Related

Is there a way to accelerate formatted table writing from R to excel?

I have a 174603 rows and 178 column dataframe, which I'm importing to Excel using openxlsx::saveWorkbook, (Using this package to obtain the aforementioned format of cells, with colors, header styles and so on). But the process is extremely slow, (depending on the amount of memory used by the machine it can take from 7 to 17 minutes!!) and I need a way to reduce this significantly (Doesn't need to be seconds, but anything bellow 5 min would be OK)
I've already searched other questions but they all seem to focus either in exporting to R (I have no problem with this) or writing non-formatted files to R (using write.csv and other options of the like)
Apparently I can't use xlsx package because of the settings on my computer (industrial computer, Check comments on This question)
Any suggestions regarding packages or other functionalities inside this package to make this run faster would be highly appreciated.
This question has some time ,but I had the same problem as you and came up with a solution worth mentioning.
There is package called writexl that has implemented a way to export a data frame to Excel using the C library libxlsxwriter. You can export to excel using the next code:
library(writexl)
writexl::write_xlsx(df, "Excel.xlsx",format_headers = TRUE)
The parameter format_headers only apply centered and bold titles, but I had edited the C code of the its source in github writexl library made by ropensci.
You can download it or clone it. Inside src folder you can edit write_xlsx.c file.
For example in the part that he is inserting the header format
//how to format headers (bold + center)
lxw_format * title = workbook_add_format(workbook);
format_set_bold(title);
format_set_align(title, LXW_ALIGN_CENTER);
you can add this lines to add background color to the header
format_set_pattern (title, LXW_PATTERN_SOLID);
format_set_bg_color(title, 0x8DC4E4);
There are lots of formating you can do searching in the libxlsxwriter library
When you have finished editing that file and given you have the source code in a folder called writexl, you can build and install the edited package by
shell("R CMD build writexl")
install.packages("writexl_1.2.tar.gz", repos = NULL)
Exporting again using the first chunk of code will generate the Excel with formats and faster than any other library I know about.
Hope this helps.
Have you tried ;
write.table(GroupsAlldata, file = 'Groupsalldata.txt')
in order to obtain it in txt format.
Then on Excel, you can simply transfer you can 'text to column' to put your data into a table
good luck

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[2]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kind of files (In particular Excel and h5) are binary rather than text files. This needs an added setting to warn R that it is a binary file and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, for use downloader, and ty something like this.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.csv", mode="wb")

How to view workspace in R

I'm trying to learn R (via this video) and immediately ran into problems. As directed, I created a dataset in Excel with column A being numbers 1 through 10 and column B being random integers. Saved as .xlsx and .csv.
Next I tried to read the data in R with
> data1 <- read.table(file.choose(), header=TRUE, sep="\t")
and that's as far as I got. There's no Workspace like in the video, or an option anywhere to view it. There are many windows in the video, but I only have "R Console".
So, how do I get the workspace?
You may be looking for "R Studio." It's a user-friendly shell that sits on top of R... It shows you your current work space, etc.
http://www.rstudio.com/
Also, you want to use sep="," not sep="\t" if you have a CSV. \t is tab-delimited...
I think you are using the R basic program (not basic in core functionality, just basic in terms of user interface features) that you probably downloaded from http://www.r-project.org/
The video you are watching is running a productive user interface called RStudio. You can download it for free from here: http://www.rstudio.com/ Works the same for all your purposes.

Importing the contents of a word document into R

I am new to R and have worked for a while as follows. I have the code writen in a word document, then I copy and paste the document with the code into R as to have the code run which works fine, however when the code is long (hundred pages) it takes a significant amount of time in R to start making the code run. This seems rather not a very effective working procedure and I am sure there are other forms to compile the R code.
On another hand one of then that comes to my mind is to import the content of word into R which I am unsure how to do. Have tried with read.table but it does not work, have look on internet as to how to import data, however most explanations are all for data tables etc or internet files in the form of data tables and similar. I have tried saving the document into csv. however word does not include csv have tried with Rich text format and XML package but again the instructions from the packages are for importing tables and similars. I am wondering if there is an effective way for R to import a word document as is in the word document.
Thank you
It's hard to say what the easiest solution would be, without examining the word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As, and use 'plain text' under 'Save as type'.
Then edit the filename extension to .R from .txt, download a proper text editor (I can recommend RStudio for R), and open your code in it. Then you will be able to run the code from inside the editor without using copy / paste.
No, read table won't do it.
Microsoft Word has its own format, which includes a lot of meta data over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this R module:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Flex: Write data to file to open in Excel

I have a Flex application with a couple of DataGrids with data. I'd like to save the data to a file so that the user can keep working with them in Excel, OpenOffice or Numbers.
I'm currently writing a csv file straight off, which opens well in OpenOffice or Numbers, but not in Excel. The problem is with the Swedish characters ÅÄÖ, which turn up as other characters when opening in Excel. Converting (in Notepad++) the csv-file to ANSI encoding makes the ÅÄÖ show up correctly in Excel.
Is there any way to write ANSI-encoded files straight from Flex?
Any other options for writing a file that can be opened in Excel and OpenOffice?
(I've looked at the as3xls library, but according to the comments those files cannot be opened in OpenOffice)
Using the writeMultiByte function from the ByteArray class allows you to specify a character set. See :
http://www.adobe.com/livedocs/flash/9.0/ActionScriptLangRefV3/flash/utils/ByteArray.html#writeMultiByte%28%29
There is also the option of the as3xls package at http://code.google.com/p/as3xls/. I like this as it comes out as a straight excel file that can also be easily opened in open office as well.

Resources