Import Combined docx Document to R with the officer Package

I want to import and modify a docx file in R with the officer package. However, when I use the suggested functions, R imports only an empty data.frame. Consider the following example:
# Packages
library("magrittr")
library("officer")
# Create example docx on computer
my_doc1 <- read_docx() %>%
  body_add_par('aaa', style = 'Normal')
my_doc2 <- read_docx() %>%
  body_add_par('bbb', style = 'Normal')
my_doc3 <- read_docx() %>%
  body_add_par('ccc', style = 'Normal')
print(my_doc1, target = 'C:/your-path/aaa.docx')
print(my_doc2, target = 'C:/your-path/bbb.docx')
print(my_doc3, target = 'C:/your-path/ccc.docx')
# Combine all docx
my_doc_all <- read_docx() %>%
  body_add_docx(src = 'C:/your-path/aaa.docx') %>%
  body_add_docx(src = 'C:/your-path/bbb.docx') %>%
  body_add_docx(src = 'C:/your-path/ccc.docx')
# Print combined docx to computer
print(my_doc_all, target = 'C:/your-path/all.docx')
This is my current situation. Please note that this situation is given; the previous steps cannot be changed.
Now, I want to import the combined docx file and modify it in R. According to the documentation of officer (p. 26) and this thread (answer by David Gohel), I should be able to do it with the following code:
# Read combined docx file
read_my_doc_all <- read_docx("C:/your-path/all.docx")
# Return dataset representing the docx document
docx_summary(read_my_doc_all)
However, the output is a data.frame with one empty row:
###   doc_index content_type style_name text level num_id
### 1         1    paragraph         NA   NA    NA     NA
I researched this problem myself and found that everything works fine as long as no docx documents are combined beforehand (as demonstrated at the beginning of this example): if we create a single file in R and export/import it, the summary is returned correctly.
How could I import a combined docx document to R with the officer package? If possible, I would like to stick to the officer package to keep the R syntax coherent.
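For reference, the working single-file round trip described in the question can be sketched as follows (file names are illustrative; files are written to the working directory):

```r
library(magrittr)
library(officer)

# Create and export a single, non-combined docx ...
read_docx() %>%
  body_add_par("aaa", style = "Normal") %>%
  print(target = "single.docx")

# ... and re-import it: docx_summary() now returns the paragraph content
doc <- read_docx("single.docx")
docx_summary(doc)
```

A likely explanation for the difference: body_add_docx() embeds the sub-documents as so-called altChunks, which are only merged into the main body when Word itself opens (and resaves) the file, which would be consistent with the single empty row in the summary above.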

using subscripts in gtsummary in R

Context
I am making a table and saving it to a Microsoft Word .docx file.
The table has a variable named PM2.5, and I want the 2.5 to appear as a subscript.
According to this answer, I can use the syntax 'PM~2.5~' with as_kable() to get a subscript in PM2.5.
But when I save the result (tab), the .docx file is blank.
Question
How can I use subscripts in gtsummary and save it into .docx file?
Reproducible code
library(dplyr)
library(gtsummary)
df = data.frame(PM2.5 = 1)
tab = # make a table using gtsummary
  df %>%
  tbl_summary(label = PM2.5 ~ 'PM~2.5~') %>% # subscript in main table
  modify_table_styling(columns = label,
                       rows = label == 'PM~2.5~',
                       footnote = 'PM~2.5~ in footnote') %>% # subscript in footnote
  as_kable()
tab %>% flextable::save_as_docx(path = 'test.docx') # a blank .docx file
The reason it is blank is that you're using flextable::save_as_docx() to save the table. That function only works with flextable objects, not knitr::kable() tables.
You can put this table in an R Markdown or Quarto document with output type Word, and the table will appear.
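One way around this (a sketch, assuming gtsummary and flextable are installed) is to convert with as_flex_table() instead of as_kable(), which produces a flextable object that save_as_docx() can write. Note that the 'PM~2.5~' markdown is only interpreted in kable/markdown output; in a flextable the tildes would print literally, so the subscript itself would need separate handling (e.g. flextable::compose() with as_sub()).

```r
library(dplyr)
library(gtsummary)

df <- data.frame(PM2.5 = 1)
tab <- df %>%
  tbl_summary() %>%
  as_flex_table()  # a flextable, which save_as_docx() understands
flextable::save_as_docx(tab, path = "test.docx")  # no longer blank
```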

Creating multiple Word file reports in a for loop in R Markdown

I'm trying to generate multiple reports automatically with R Markdown. I have an MS Word file that I import into R with the officer library. In this MS Word file I want to substitute the word alpha with the A, B, and C names that I defined in a varNames vector. I then want to generate a report for each of the varNames in a separate MS Word file. I try to use the code below:
library(officer)
library(magrittr)
library(rmarkdown)
my_doc <- read_docx('Word_file.docx')
varNames <- c("A", "B", "C")
for (i in 1:length(varNames)) {
  doc2 <- body_replace_all_text(my_doc, old_value = "alpha", new_value = varNames[i],
                                only_at_cursor = FALSE, ignore.case = FALSE)
}
doc2 <- cursor_backward(doc2)
docx_show_chunk(doc2)
my_doc2 <- print(doc2, target = "/Users/majerus/Desktop/R/auto_reporting/my_doc2.docx")
But the code only generates one report, for the varname A. Could you please help me figure out what is wrong with the code? Generating the reports in .pdf or .html format would also be fine. Thanks!
Ok, so I think the best solution would be this:
# Remember to set your working directory with setwd() accordingly - all reads
# and writes will be in that dir - or specify the path to the file every time,
# if you prefer it that way.
# setwd("xyz")
library(officer)
# I think the pipes %>% are very useful, ESPECIALLY with officer, so:
library(dplyr)
# To make it a fully reproducible example, let's create a Word file
# containing only the text "alpha".
read_docx() %>%
  body_add_par("alpha") %>%
  print("Word_file.docx")
# now, let's create the vector for the loop to take in
varNames <- c("A", "B", "C")
### Whole docx creation script should be inside the for loop
for (i in 1:length(varNames)) {
  # first, read the file
  read_docx("Word_file.docx") %>%
    # then, replace the text according to varNames
    body_replace_all_text(old_value = "alpha",
                          new_value = varNames[i],
                          only_at_cursor = FALSE,
                          ignore.case = FALSE) %>%
    # then, print the outputs. The output name is generated dynamically:
    # every report (for every i) needs to have a different name.
    print(target = paste0("Output_", i, ".docx"))
}
# After running your script, in your working directory should be 4 files:
# Word_file.docx "alpha"
# Output_1.docx "A"
# Output_2.docx "B"
# Output_3.docx "C"
Your whole bit with cursor_backward() and docx_show_chunk() seems to be pointless. From my experience, it's best not to use the cursor functionality too much.
Best practice may be to specify in the template the specific places to replace the text (as in your example and my solution), or just build the whole document dynamically in R (you can first load an empty template with predefined styles if you want to use custom ones).
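The second option, building each document dynamically in R, could be sketched like this. The style names used here ("heading 1", "Normal") exist in officer's built-in default template; to use custom styles you would read a template of your own instead, e.g. a hypothetical "styles_template.docx":

```r
library(officer)
library(magrittr)

varNames <- c("A", "B", "C")
for (i in seq_along(varNames)) {
  read_docx() %>%  # or read_docx("styles_template.docx") for custom styles
    body_add_par(paste("Report for", varNames[i]), style = "heading 1") %>%
    body_add_par("Body text built entirely in R.", style = "Normal") %>%
    print(target = paste0("Report_", i, ".docx"))
}
```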

write docx output after a loop

I have a data frame with several pieces of information about patients. I created a loop in R to process each row and write it to a docx file using ReporteRs, but with this loop I obtain as many docx files as there are subjects, whereas I would like one single docx with all the information one after the other.
This is the df:
Surname Name Born Subject Place
Halls Ben 09/08/2019 3387502 S.Jeorge
Beck David 12/08/2019 1319735 S.Jeorge
Essimy Daniel 12/08/2019 3387789 S.Jeorge
Rich Maria 12/08/2019 3307988 S.Agatha
And this is the code I have written:
dfY2 <- read.table("file.txt",header=T)
for (i in 1:nrow(dfY2)) {
  my_title <- pot('Exam', textProperties(font.weight = "bold", font.size = 12,
                                         font.family = "Times New Roman"))
  row1 <- pot("Surname and Name", textProperties(font.weight = "bold")) + " " +
    pot(dfY2[i, 1]) + " " + pot(dfY2[i, 2]) + " " +
    pot("Born", textProperties(font.weight = "bold")) + pot(dfY2[i, 3])
  row2 <- pot("SubjectID", textProperties(font.weight = "bold")) + " " + pot(dfY2[i, 4]) +
    pot("Place", textProperties(font.weight = "bold")) + " " + pot(dfY2[i, 5])
  doc <- docx("Template.docx") %>%
    addParagraph(my_title, par.properties = parProperties(text.align = "center")) %>%
    addParagraph(c("")) %>%
    addParagraph(row1) %>%
    addParagraph(row2) %>%
    writeDoc(doc, file = paste0(dfY2[i, 1], "output.docx"))
}
So, in this way, I obtain several outputs, while I would like to write all the rows one after the other for each subject in only a single doc.
What can I do?
thanks
First of all, I would recommend using the newer package officer from the same author, because ReporteRs is no longer maintained.
To your question: you need to create the 'docx' object before the loop and save it after the loop (you may want to add the title before the loop as well):
doc <- docx("Template.docx")
for (i in 1:nrow(dfY2)) {
  ...
  doc <- doc %>%
    addParagraph(my_title, par.properties = parProperties(text.align = "center")) %>%
    addParagraph(c("")) %>%
    addParagraph(row1) %>%
    addParagraph(row2)
}
writeDoc(doc, file = "output.docx")
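Since ReporteRs has been removed from CRAN, an officer translation of the same pattern might look like the sketch below (the data frame is recreated from the question; read_docx() uses officer's default template, or pass your "Template.docx"):

```r
library(officer)
library(magrittr)

# data frame from the question (abbreviated)
dfY2 <- data.frame(Surname = c("Halls", "Beck"),
                   Name    = c("Ben", "David"),
                   Born    = c("09/08/2019", "12/08/2019"),
                   Subject = c("3387502", "1319735"),
                   Place   = c("S.Jeorge", "S.Jeorge"))

bold  <- fp_text(bold = TRUE)
plain <- fp_text()
title <- fp_text(bold = TRUE, font.size = 12, font.family = "Times New Roman")

doc <- read_docx()  # create the doc object once, before the loop
for (i in 1:nrow(dfY2)) {
  doc <- doc %>%
    body_add_fpar(fpar(ftext("Exam", title),
                       fp_p = fp_par(text.align = "center"))) %>%
    body_add_fpar(fpar(ftext("Surname and Name ", bold),
                       ftext(paste(dfY2[i, 1], dfY2[i, 2], ""), plain),
                       ftext("Born ", bold),
                       ftext(dfY2[i, 3], plain))) %>%
    body_add_fpar(fpar(ftext("SubjectID ", bold), ftext(dfY2[i, 4], plain),
                       ftext(" Place ", bold),   ftext(dfY2[i, 5], plain)))
}
print(doc, target = "output.docx")  # save once, after the loop
```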

Linked text in LaTeX table with knitr/kable

I have the following dataframe:
site_name | site_url
--------------------| ------------------------------------
3D Printing | https://3dprinting.stackexchange.com
Academia | https://academia.stackexchange.com
Amateur Radio | https://ham.stackexchange.com
I want to generate a third column with the link integrated with the text. In HTML I came up with the following pseudo code:
df$url_name <- "[content of site_name](content of site_url)"
resulting in the following working program code:
if (knitr::is_html_output()) {
  df <- df %>% dplyr::mutate(url_name = paste0("[", df[[1]], "](", df[[2]], ")"))
  knitr::kable(df)
}
Is there a way to do this in LaTeX with knitr as well?
(I am preferring a solution compatible with kableExtra, but if this is not possible I am ready to learn whatever table package can do this.)
*** ADDED: I just noticed that the above code works within a normal .Rmd document with the yaml header output: pdf_document. But not in my bookdown project.
The problem is with knitr::kable. It doesn't recognize that the bookdown project needs Markdown output, so you need to tell it that explicitly:
df <- df %>% dplyr::mutate(url_name = paste0("[", df[[1]], "](", df[[2]], ")"))
knitr::kable(df, format = "markdown")
This will work for any kind of Markdown output: html_document, pdf_document, bookdown::pdf_book, etc.
Alternatively, if you need LaTeX output for some other part of the table, you could write the LaTeX equivalent. This won't work for HTML output, of course, but should be okay for the PDF targets:
df <- df %>% dplyr::mutate(urlName = paste0("\\href{", df[[2]], "}{", df[[1]], "}"))
knitr::kable(df, format = "latex", escape = FALSE)
For this one I had to change the column name; underscores are special in LaTeX. You could probably get away without doing that if you left it as format = "markdown", but then you'd probably be better off using the first solution.
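For completeness, the two branches can be combined in a single chunk, following the is_html_output() pattern from the question (a sketch; the data frame is recreated from the question, and knitr::is_latex_output() returns FALSE outside a LaTeX render, so the Markdown branch runs by default):

```r
library(dplyr)

df <- data.frame(site_name = c("3D Printing", "Academia"),
                 site_url  = c("https://3dprinting.stackexchange.com",
                               "https://academia.stackexchange.com"))

if (knitr::is_latex_output()) {
  # LaTeX targets: \href, with escaping disabled
  df <- df %>% mutate(urlName = paste0("\\href{", site_url, "}{", site_name, "}"))
  knitr::kable(df, format = "latex", escape = FALSE)
} else {
  # everything else: Markdown links work across output formats
  df <- df %>% mutate(url_name = paste0("[", site_name, "](", site_url, ")"))
  knitr::kable(df, format = "markdown")
}
```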

Parsing non-nested XML tags in R

I am trying to parse a number of documents using the excellent xml2 R library. As an example, consider the following XML file:
pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")
It contains a number of <speech> tags which are separated by, though not nested within, a number of <minor-heading> and <major-heading> tags. I would like to process this document into a data.frame with the following structure:
major_heading_id speech_text
heading_id_1 text1
heading_id_1 text2
heading_id_2 text3
heading_id_2 text4
Unfortunately, because the tags are not nested, I cannot figure out how to do this! I have code that successfully recovers the relevant information (see below), but matching the speech tags to their respective major-headings is beyond me.
My intuition is that it would probably be best to split the XML document at the heading tags, and then process each as an individual document, but I couldn't find a function in the xml2 package that would let me do this!
Any help would be great.
Where I have got to so far:
speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))
heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")
You can do this as follows:
require(xml2)
require(tidyverse)
doc <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")
# Get the headings
heading_recs <- xml_find_all(doc, "//major-heading")
# path creates the structure you want
# so the speech nodes that have exactly n headings above them.
path <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]",
                seq_along(heading_recs))
# Get the text of the speech nodes
map(path, ~ xml_text(xml_find_all(doc, .x))) %>%
  # Combine it with the id of the headings
  map2_df(xml_attr(heading_recs, "id"),
          ~ tibble(major_heading_id = .y, speech_text = .x))
This results in a tibble with one row per speech, pairing each speech's text with the id of its preceding major heading.