Converting Word2007 into R - r

I'm just starting to use R and I can't figure out how to import files from other programs into R. I tried a basic example of going from Word to R. I used this website as an example of how to do this: http://www.mayin.org/ajayshah/KB/R/html/r1.html. Here is what I typed:
A<-read.table("C:\\Users\\anr28\\Desktop\\x.docx", sep=",", col.names=c("year", "my1", "my2"))
I had a document named "x" in Microsoft Word which, according to the properties menu on my computer, ends with docx. I followed exactly what they did in the example and it didn't work. These were the error messages printed out, but I don't know how to interpret them.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 3 elements
In addition: Warning message:
In read.table("C:\\Users\\anr28\\Desktop\\x.docx", sep = ",", col.names = c("year", :
incomplete final line found by readTableHeader on 'C:\Users\anr28\Desktop\x.docx'
Please Help, I'm trying to learn this on my own and it's very frustrating not being able to bring files in to actually learn the crux of the program, which is what I'm really after. Thanks

The read.table function (and its relatives) expects a plain text file. Word uses its own file format (hence the .docx extension), which is not plain text: it stores your data (compressed, inside a zip archive) along with information about fonts, colors, sizes, and a bunch of other things in a way that R does not understand.
The best approach is to open your file in Word, then save it again as a plain text file (try clicking the circle in the upper left corner, then choose "Save As", then choose "Other Formats", then in the dialog box choose the "Plain text (.txt)" option for "Save as type"). Then read the text file into R following the example.
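One way to see this from within R is to look at the first bytes of the file. A .docx file is a zip archive, and zip archives start with the two bytes "PK". The following is a minimal sketch; the helper name looks_like_zip is made up for illustration, and the demo writes its own small stand-in files instead of touching a real document.

```r
# Hypothetical helper: TRUE if the file starts with the zip signature "PK"
looks_like_zip <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 2)
  identical(first_bytes, charToRaw("PK"))
}

# Stand-in for a real .docx: only the first four bytes of a zip archive
fake_docx <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), fake_docx)

# Stand-in for a plain text data file
plain_txt <- tempfile(fileext = ".txt")
writeLines("1997,3.1,4", plain_txt)

zip_check_docx <- looks_like_zip(fake_docx)
zip_check_txt <- looks_like_zip(plain_txt)
```

If the check is TRUE for a file you are about to pass to read.table, the file is a container format and needs to be saved as plain text first.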

The link you posted is about a file that looks like this:
1997,3.1,4
1998,7.2,19
1999,1.7,2
2000,1.1,13
By "looks like" I mean that if you open this file in a plain text editor like Notepad, this is what you see. A Word file is not plain text. A plain text file is a file (often with .txt as extension, but this is not necessary) that only contains text. A Word file is a file that can be opened and read by Word and contains information on the text, but also typesetting, fonts, etcetera, stored in a binary format that is not human-readable. You can see the difference by opening the Word document in Notepad.
As said in other answers, you can save your Word file as a plain text file with "Save As". You can also save data from Excel as a plain text file, which can easily be read into R.
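Once the data is in a plain text file, the call from the question works as intended. A small self-contained sketch: the temporary file below stands in for the file saved from Word, and it holds the four example lines shown above.

```r
# Write the four example lines to a temporary plain text file
txt_file <- tempfile(fileext = ".txt")
writeLines(c("1997,3.1,4", "1998,7.2,19", "1999,1.7,2", "2000,1.1,13"), txt_file)

# Same call as in the question, but pointed at a plain text file
A <- read.table(txt_file, sep = ",", col.names = c("year", "my1", "my2"))
str(A)
```

For a real file on Windows, forward slashes in the path (for example "C:/Users/anr28/Desktop/x.txt") avoid having to double every backslash.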

You might want to use a plain text editor (not a word processor) for typing in simple data files. Try Notepad++, which is as easy to use as Notepad but has a lot more functionality.
Google and download it, then enter some comma-separated numbers, save, and read into R.
There is also a basic text editor built into R for Windows that you can use to type R functions and data files.

It makes no sense to read data into R from a proprietary windows format. R will happily accept any plain text format. In your case, just save as plain text and read it in.

Related

R write fixed width columns to csv (in the plain text file)

I write data frames to csv files using write.csv(). When this is done, the output when viewed in a plain text editor, in particular vi or notepad++, shows no spacing between the column content and the commas, resulting in it being relatively hard to read. For example, the columns are not lined up down the page.
I have negative interest in using excel to view the csv files. I am definitely not looking for a suggestion for a csv viewer. Nor do I want instructions on how to modify the plain text file afterward. Padding needs to be spaces not tabs.
I am interested in how to get R to line up the columns in the plain text csv file so that they are easier to read using a non specialized plain text editor.
I could (and might) write my own routine that converts everything to some fixed width string format and print that. But, I would prefer to find that this is an option within write.csv() or similar common output library call.
[I just this moment found out about sprintf in R, and that might be the best answer to this conundrum].
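As far as I know, write.csv() itself has no padding option, so one possible approach is to pad the values before writing. The sketch below is only an illustration under assumptions: the data frame is made up, format() is used to pad each column (header included) with spaces to a common width, and no field is assumed to contain a comma or a quote (write.csv's quoting is bypassed here).

```r
# Made-up example data
df <- data.frame(name = c("a", "bbb"), value = c(1.5, 22.25))

# Pad each column, header included, to a common width using spaces
padded <- vapply(
  names(df),
  function(nm) format(c(nm, as.character(df[[nm]]))),
  character(nrow(df) + 1)
)

# One string per output line, fields separated by ", "
aligned_lines <- apply(padded, 1, paste, collapse = ", ")

csv_file <- tempfile(fileext = ".csv")
writeLines(aligned_lines, csv_file)

# Reading the file back: strip.white = TRUE removes the padding again
round_trip <- read.csv(csv_file, strip.white = TRUE)
```

Every line in the file then has the same length, so the columns line up in vi or Notepad++, while the file stays readable as an ordinary comma-separated file.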

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written on a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written on a computer configured in English, then I do get the codes (even though when opening the .txt file in TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines only marks the strings it returns as being in that encoding; it does not convert the bytes of the file. Instead, we need to manually open a file connection with the right encoding:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
contents = readLines(con)
close(con) # Always make sure to close connections! (Inside a function, on.exit(close(con)) does this for you.)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
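An alternative sketch is to read the lines unconverted and re-encode them afterwards with iconv(). The demo below is an assumption-laden illustration: it writes the four bytes that MacRoman uses for "niño" (0x96 is "ñ", matching the <96> code from the question) to a temporary file, and it relies on the platform's iconv knowing the name "macintosh", which carries the same portability caveat as above.

```r
# Bytes for "niño" as MacRoman stores it (0x96 is "ñ" in that encoding)
mac_bytes <- as.raw(c(0x6e, 0x69, 0x96, 0x6f))
mac_file <- tempfile(fileext = ".txt")
writeBin(mac_bytes, mac_file)

# Read the lines without any conversion, then re-encode them explicitly
raw_lines <- readLines(mac_file, warn = FALSE)
utf8_lines <- iconv(raw_lines, from = "macintosh", to = "UTF-8")
print(utf8_lines)
```

Running iconvlist() shows which encoding names the local iconv implementation accepts.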

How to read csv file with unknown formatting and unknown encoding in R Program? (example file provided)

I have tried my best to read a CSV file in r but failed. I have provided a sample of the file in the following Gdrive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file is read in Excel without issues. But when I try to read it in R using the "readr" package or the base R functions, it fails. Not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting this file in Excel to a comma-delimited file to read it in R. But I am sure there must be something that I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how excel is reading the file without any parameters provided? Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading in files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question, could we build the same logic in R to guess encoding, delimiters and read in many types of file without us explicitly saying what the encoding and delimiter is - yes, this would certainly be possible. Whether it is desirable I'm not sure.
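As a rough illustration of what such guessing could look like, the sketch below inspects the byte-order mark (BOM) at the start of a file. This is not how Excel or any particular tool works, only an assumption about one plausible heuristic; the helper name guess_bom_encoding is made up, and files without a BOM simply yield NA.

```r
# Hypothetical helper: guess an encoding name from the byte-order mark, if any
guess_bom_encoding <- function(path) {
  b <- as.integer(readBin(path, what = "raw", n = 3))
  if (length(b) >= 3 && all(b[1:3] == c(0xEF, 0xBB, 0xBF))) return("UTF-8-BOM")
  if (length(b) >= 2 && all(b[1:2] == c(0xFF, 0xFE))) return("UTF-16LE")
  if (length(b) >= 2 && all(b[1:2] == c(0xFE, 0xFF))) return("UTF-16BE")
  NA_character_
}

# Demo: BOM FF FE followed by "a<TAB>b" stored as two-byte little-endian units
bom_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xff, 0xfe, 0x61, 0x00, 0x09, 0x00, 0x62, 0x00)), bom_file)

# Demo: a plain ASCII file without any BOM
ascii_file <- tempfile(fileext = ".txt")
writeLines("a\tb", ascii_file)

bom_guess <- guess_bom_encoding(bom_file)
plain_guess <- guess_bom_encoding(ascii_file)
```

A non-NA guess could then be passed as the encoding argument of file(), in the same way as in the read.delim call above. Guessing the delimiter would need a separate heuristic.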

How do I read in a CSV file into R?

I'm having some technical issues with loading a CSV file into R. When I inspect the csv file in RStudio's Source pane, all the characters are surrounded by weird red circles or dots. When I inspect another self-made CSV file, the characters appear perfectly fine, without any of the red circles.
What is this issue/symptom, and what would be the best way to fixing this for about 40 similar CSV files?
When I try to run readfile <- read.csv("filename.csv", sep="", collapse=NULL) I get the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file
My guess would be that you ran into some encoding issue.
Especially on Windows you can run into all sorts of problems with that.
Try opening the CSV file with a text editor that can save files in various encodings (e.g. Notepad++), then change the encoding to e.g. UTF-8 (which is the preferred encoding of RStudio, most other editors, and R itself), save the file, and try to run the import again.
Just make sure that you don't lose characters: special characters in particular tend to get lost during encoding changes.
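For about 40 files, the same conversion can be scripted instead of done by hand in an editor. The sketch below is only an illustration: the helper name convert_to_utf8 is made up, and the source encoding "UTF-16LE" is purely an assumption, since the actual encoding of the files in the question is not known.

```r
# Hypothetical helper: re-encode one file to UTF-8 and write the result to `out`
convert_to_utf8 <- function(path, from, out) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- iconv(list(bytes), from = from, to = "UTF-8")
  if (is.na(text)) stop("could not convert ", path, " from ", from)
  writeLines(text, out, sep = "", useBytes = TRUE)
  invisible(out)
}

# Demo: a tiny UTF-16LE file standing in for one of the real CSV files
src <- tempfile(fileext = ".csv")
writeBin(iconv("x,y\n1,2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], src)

dst <- tempfile(fileext = ".csv")
convert_to_utf8(src, from = "UTF-16LE", out = dst)
converted <- read.csv(dst)
```

The helper can be called in a loop over list.files(pattern = "\\.csv$"), writing to a separate output folder so the originals stay untouched. If the source files start with a byte-order mark, reading the converted file with fileEncoding = "UTF-8-BOM" drops it.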
Greetings ...

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. Age of the original paper document and perhaps imperfections in the original paper, typing and/or scanning process has resulted in some letters and numbers not being very clear. So far converting the pdf files to MSWord seems to be the most successful at correctly translating the tables. Converting the MSWord files to Excel or rich text, etc, has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought if I could read the MSWord files into R that might be the most efficient way to edit and correct them.
I am aware of 'package tm' that I guess can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text (that is plain, ASCII text) file.
The Word-related function in the tm package is called readDOC() but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work using newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: it appears there is a newer package, readtext, dedicated specifically to reading text data, including Word files (also the newer .docx format). See https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.
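For readers who prefer to stay in base R: a .docx file is a zip archive whose main text lives in the entry word/document.xml, with paragraphs in w:p elements and text runs in w:t elements. The following is a rough sketch, not a replacement for a proper XML parser; both function names are made up, the regex-based extraction ignores tables, footnotes and XML entities, and the demo runs on a hand-written fragment instead of a real document.

```r
# Hypothetical helper: pull paragraph text out of the XML of a .docx body.
# Regex-based to stay in base R; entities such as &amp; are left as they are.
extract_docx_text <- function(xml) {
  pieces <- strsplit(xml, "</w:p>", fixed = TRUE)[[1]]
  pieces <- pieces[grepl("<w:p[ >]", pieces)]  # keep pieces that open a paragraph
  vapply(pieces, function(p) {
    # Text that sits directly before a closing </w:t> tag
    runs <- regmatches(p, gregexpr("(?<=>)[^<]*(?=</w:t>)", p, perl = TRUE))[[1]]
    paste(runs, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical wrapper for a real file: unzip the main part, then extract its text
read_docx_text <- function(path) {
  xml_path <- unzip(path, files = "word/document.xml", exdir = tempdir())
  xml <- paste(readLines(xml_path, warn = FALSE, encoding = "UTF-8"), collapse = "")
  extract_docx_text(xml)
}

# Demo on a hand-written fragment shaped like word/document.xml
demo_xml <- paste0(
  "<w:document><w:body>",
  "<w:p><w:r><w:t>A 20 1000 AA</w:t></w:r></w:p>",
  "<w:p><w:r><w:t xml:space=\"preserve\">B 30 </w:t></w:r>",
  "<w:r><w:t>1001 BB</w:t></w:r></w:p>",
  "</w:body></w:document>"
)
docx_paragraphs <- extract_docx_text(demo_xml)
```

For a real document, read_docx_text("test_for_r.docx") would return one string per paragraph, which can then be cleaned up with the usual string tools.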
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. I converted a pdf to MSWord with Acrobat X Pro.
2. The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from an MSWord file before creating a text file.
3. Convert the MSWord file to a text file after deleting vertical lines in Step 2.
Resulting text files still require extensive editing, but at least the data are largely present in a format R can read and I will not have to re-enter all data in the pdfs by hand, saving many hours of work.
You can do this with RDCOMClient very easily.
That said, some characters will not be read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())

Resources