Extracting origin properties (metadata) from a word document in R

I have hundreds of Word documents (.docx) that were sent to me for analysis, and I'm struggling to find a way to extract the Origin metadata. I would like to know the author's name, the content-created date and the date last saved for each document. Please note that these are the properties shown under the "Details" tab when you right-click the Word document and select "Properties". I would like to have a dataframe, a list or something similar with:
[1] "document.name" "authors.name" "content.created" "date.lastsaved"
[2] "document.name2" "authors.name2" "content.created2" "date.lastsaved2"
Does anyone have a nice solution for the problem?
Thank you!

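Perhaps something like this (note that file.info() reports file-system details such as size and modification time, not the properties stored inside the document):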
setwd("/Desktop/docs") # set directory to wherever your docx files are
file.info(list.files(getwd())) # obtain file info for docs in folder
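To get at the Word properties themselves, here is a minimal sketch. It relies on the fact that a .docx file is a zip archive whose core properties (creator, created, modified) live in docProps/core.xml. The helper names get_tag, parse_core_xml and read_docx_properties are hypothetical, as is the "docs" folder. The regex-based parsing is deliberately crude (it does not decode XML entities), so a real XML parser would be the more robust choice.

```r
# Pull the text of one element out of the core.xml string; NA if the tag is absent.
get_tag <- function(xml, tag) {
  pattern <- sprintf("<%s[^>]*>([^<]*)</%s>", tag, tag)
  regmatches(xml, regexec(pattern, xml))[[1]][2]
}

# Turn the contents of docProps/core.xml into a one-row data frame.
parse_core_xml <- function(xml) {
  data.frame(
    authors.name    = get_tag(xml, "dc:creator"),
    content.created = get_tag(xml, "dcterms:created"),
    date.lastsaved  = get_tag(xml, "dcterms:modified"),
    stringsAsFactors = FALSE
  )
}

# Read core.xml straight out of the .docx archive without unpacking it to disk.
read_docx_properties <- function(path) {
  con <- unz(path, "docProps/core.xml")
  on.exit(close(con))
  xml <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"), collapse = "")
  data.frame(document.name = basename(path), parse_core_xml(xml),
             stringsAsFactors = FALSE)
}

# "docs" is a placeholder folder; one row per document is stacked into a data frame.
files <- list.files("docs", pattern = "\\.docx$", full.names = TRUE)
props <- do.call(rbind, lapply(files, read_docx_properties))
```

The dates come back as ISO 8601 strings, which as.POSIXct() can convert if you need real date-times.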

Related

BlueSky Statistics - File naming conventions

When I open the file 'TestFile.RData' in BlueSky Statistics, it is opened under this name with Dataset3 appended. The tab shows: TestFile.RData(Dataset3)
I would like to use my original name when using R code in the R command editor, but from what I see BlueSky wants me to use the Dataset3 name.
Please clarify this file name issue for me.
If my original name is changed I see issues with reproducing things - as the given name of Dataset3 is not controllable.
Regards

Your observation is correct. Whenever a file that is not an R data file is opened in BlueSky Statistics, we create a dataframe object in R. We name these objects sequentially: Dataset1, Dataset2, Dataset3, etc. We could have used the name of the original file, but we went with Dataset1, Dataset2, Dataset3 for compatibility with SPSS. Many of our users come from SPSS and that is exactly what SPSS does. There is a simple workaround, see below.
To work around this, you need to change the default code we use to open the dataset. To see the code in the output window, go to the top-level menu Tools -> Configuration settings, select the Output tab, and select the checkbox next to the text "Show syntax in output window".
The code you will see when you open a dataset in the output Window is
BSkyloadDataset(fullpathfilename='C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv', filetype='CSV', worksheetName='',load.missing=FALSE, character.to.factor=FALSE, csvHeader=TRUE, isBasketData=FALSE, trimSPSStrailing=FALSE, sepChar=',', deciChar='.', datasetName='Dataset2')
All you need to do is change the datasetName parameter to the name you want to use.
I will also add an enhancement so that, by default, a dataset opened from a file is named after that file. This is easy to do.
With R datasets this is not a problem because we load all dataframe objects into the grid. The name of the dataset in the grid continues to be the name of the dataframe object.

BlueSky is one of the few packages that use R and allow you to open and work on multiple data files at once. This naming approach is its way of allowing that while using files that have not yet been stored as R data files (.RData). After importing data from a non-R file, simply use "File> Save as" and save it as an R Object (.RData). The next time you open that file, it will maintain the name you've given it.
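The reason an .RData file keeps the name is that R stores objects under their own names in such files. A small base-R sketch of that behaviour, outside BlueSky (the data frame name survey_results and the temporary file are made up for illustration):

```r
# 'survey_results' is a hypothetical data frame used only for illustration.
survey_results <- data.frame(id = 1:3, score = c(4.5, 3.0, 5.0))

# Save the object to an .RData file in the session's temporary directory.
rdata_file <- file.path(tempdir(), "survey_results.RData")
save(survey_results, file = rdata_file)

# Remove it, then load it back: the object returns under its original name.
rm(survey_results)
loaded_names <- load(rdata_file)  # load() returns the names of the restored objects
print(loaded_names)
```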

regex search expression to find chunked word text using officer in r

I would like to be able to add charts and tables at specific locations in a MS Word document based on a template, using Officer's body_add function.
Since text in MS Word can get chunked, is there a way to search for the expected text using regex, taking into account possible chunking?
For example, I am expecting:
xyzChart 1
But the actual chunk in my template (shown by docx_show_chunk) seems to be:
3 text nodes found at this cursor.
<w:t>: 'xyz'
<w:t>: 'Chart'
<w:t>: ' 1'
Is there a way to search for something that might look like this in the Word template?
I realize that a possible solution is to make sure that the template document never contains chunks like this, but it seems more robust to me to assume that things like this can happen and to account for it in my code.
If there is a better way to provide a reproducible example for this, please let me know. Thanks in advance.
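One idea, sketched in base R only: if the text of the chunks can be obtained as a character vector (how to pull that out of the document is not shown here and is an assumption), the pieces can be pasted back together before the regex is applied, so the chunk boundaries no longer matter.

```r
# The three text nodes reported for the example above.
chunks <- c("xyz", "Chart", " 1")

# Rejoin the chunks into the text as the reader sees it.
full_text <- paste(chunks, collapse = "")

# Search the rejoined text; \\s* tolerates missing or extra spaces before the number.
pattern <- "xyzChart\\s*1"
found <- grepl(pattern, full_text)
```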

r download url file with partial name

I am programming in R. I need to download a set of files from an http: address. The naming format of the file refers to a date/time period but also contains additional numbering that is not recognizable. For example for the file below the first set of numbers refers to the date 2014/10/24 at 05:10am but the second batch of numbers is not recognizable. All files on the webpage follow this standard format.
http://www.nemweb.com.au/REPORTS/CURRENT/MCCDispatch/PUBLIC_MCCDISPATCH_201410240510_0000000258279329.zip
My question is: How do I download the file with only partial name information?
For example if I wanted to download the file relating to the 6:30 time period I know that the url prefix is as below, but would not know the numbers that follow after: http://www.nemweb.com.au/REPORTS/CURRENT/MCCDispatch/PUBLIC_MCCDISPATCH_201410240630_??????????????.zip

You're actually in luck, because you have a directory listing. Essentially, you have to download the list of links and then grep them. Here's how you would go about doing that.
library(XML)
url <- "http://www.nemweb.com.au/REPORTS/CURRENT/MCCDispatch/"
# parse the directory listing page
parsed <- htmlParse(url)
# collect the href attribute of every link on the page
links <- xpathSApply(parsed, "//a/@href")
Now you have a list of URLs that you can search through and choose the one that's appropriate.
Hint: grep("pattern",links)
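To expand on the hint, a small sketch of picking out the 6:30 file. The links vector below is a stand-in for what the listing might return: the second entry's trailing digits are made up, and the assumption that the hrefs are paths relative to the site root is not verified. The helper name find_link is hypothetical.

```r
# Hypothetical sample of 'links'; the trailing digits of the second entry are made up.
links <- c(
  "/REPORTS/CURRENT/MCCDispatch/PUBLIC_MCCDISPATCH_201410240510_0000000258279329.zip",
  "/REPORTS/CURRENT/MCCDispatch/PUBLIC_MCCDISPATCH_201410240630_0000000000000000.zip"
)

# Return the links whose file name carries the given date/time stamp,
# whatever digits follow it.
find_link <- function(links, stamp) {
  pattern <- paste0("PUBLIC_MCCDISPATCH_", stamp, "_[0-9]+\\.zip$")
  grep(pattern, links, value = TRUE)
}

hit <- find_link(links, "201410240630")

# Not run here: fetch the match, assuming the links are relative to the site root.
# download.file(paste0("http://www.nemweb.com.au", hit), basename(hit), mode = "wb")
```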

Request user to identify file location and auto-extract variable name from file location in R

I am EXTREMELY new to R, and programming in general, so thank you for your patience.
I am trying to write a script which reads values from a .txt file and after some manipulation plots the results. I have two questions which are somewhat coupled.
First, is there a function which asks the user to identify the location of a file? i.e. User runs script. Script opens up file navigation prompt and requests user to navigate to and select relevant file.
Currently, I have to manually identify the file and location in R. e.g.
spectra.raw <- read.table("C:/Users/.../file1.txt", row.names=NULL, header = TRUE)
I'd rather have the user identify the file location each time the script is run. This will be used by non-tech people, and I don't trust them to copy/paste file locations into R.
The second question I've been struggling with is, is it possible to create a variable name based off the file selected? For example, if the user selects "file1.txt" I'd like R to assign the output of read.table() to a variable named "file1.raw" much like the above "spectra.raw"
If it helps, all the file names will have the exact same number of characters, so if it's possible to select the last say 5 characters from the file location, that would work.
Thank you very much, and please excuse my ignorance.

See file.choose. Though I believe it behaves slightly differently on different platforms, so beware of that.
See assign, i.e. assign("fileName",value). You'll want to parse the file path that file.choose spits back using string manipulation functions like substr or strsplit.
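Putting the two together, a rough sketch. The function name read_into_named_variable is made up, and deriving the variable name from the file name without its extension is just one possible choice.

```r
# Read a table and store it in a variable named after the file,
# e.g. "file1.txt" becomes the variable file1.raw.
read_into_named_variable <- function(path, envir = globalenv()) {
  var_name <- paste0(tools::file_path_sans_ext(basename(path)), ".raw")
  assign(var_name, read.table(path, row.names = NULL, header = TRUE), envir = envir)
  invisible(var_name)  # hand back the name so the caller can look it up with get()
}

# Only prompt for a file when R is running interactively.
if (interactive()) {
  created <- read_into_named_variable(file.choose())
  print(head(get(created)))
}
```

Storing the tables in a named list instead of using assign is often easier to work with later, but the version above matches what was asked.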

Try
file.choose
I think it can do what you want.
For example,
myfile <- file.choose()
Enter file name: adataset.Rdata
load(myfile)
myfile contains the name of the file so you don't have to do anything special.

Retrieving GWAS information with R

I am trying to get specific disease-related information from the GWAS catalog. This can be done directly from the website via a spreadsheet download. But I was wondering if I could possibly do it programmatically in R. Any suggestions will be greatly appreciated.
Thanks.
Avoks

Check out the function download.file() and the package RCurl (http://cran.r-project.org/web/packages/RCurl/index.html). This should do what you are looking for.

You will have to download .tsv file(s) first and manually edit them.
This is because GWAS Catalog files contain HTML symbols, like &#x000A7 in "Behçet's disease" (defining that special fourth letter). By default read.table treats the # in these symbols as the start of a comment and ignores the rest of the line, thus you will get an error message, e.g.:
line 2028 did not have 34 elements
So you download it first, open it in a plain text editor, replace every # with nothing, and only then load it into R with:
read.table("gwas_catalog_v1.0-associations_e91_r2018-02-21.tsv", sep = "\t", header = TRUE, stringsAsFactors = FALSE, quote = "")
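As a possible alternative to editing the file by hand, read.table has a comment.char argument, and setting it to "" switches comment handling off so the # characters are read as ordinary text. A small sketch on a made-up two-column file (the column names and the row are illustrative, not taken from the real catalog):

```r
# Write a tiny tab-separated file whose first field contains a '#'.
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("TRAIT\tPVALUE", "Beh&#x000A7;et's disease\t1e-8"), tsv_file)

# comment.char = "" stops '#' from starting a comment;
# quote = "" keeps the apostrophe from being read as a quote character.
gwas <- read.table(tsv_file, sep = "\t", header = TRUE, stringsAsFactors = FALSE,
                   quote = "", comment.char = "")
```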
