How to extract images from word using media_extract in r?


I am working in R Markdown to produce a report that extracts and displays images from a Word document.
To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.
For Word documents, I am struggling to locate the image because the summary data frame has no media_file column.
The media_file value is passed as the path argument of media_extract to locate the image. See the example code from the package documentation below:
library(officer)
example_pptx <- system.file(package = "officer",
  "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
# keep only the rows that describe images
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
The file path comes from either docx_summary or pptx_summary, depending on the file type; each creates a data frame summarising the file's content. The pptx_summary data frame includes a media_file column, which holds a file path for each image. The docx_summary data frame doesn't include this column. Another Stack Overflow post proposed a solution using the word/media/ subdirectory, which seemed to work, but I'm not sure what this means or how to use it.
How do I extract an image from a Word document, using the word/media/ subdirectory as the media path?

I have continued to research this and found an answer, so I thought I would share!
The difficulty I was having extracting images from docx was due to the absence of a media_file column in the summary data frame (produced using docx_summary), which is used to locate the desired image. This column is present in the data frame produced for pptx by pptx_summary and is used in the example code from the package documentation.
In the absence of this column you instead need to locate the image by its path inside the document package (a .docx file is a zip archive of XML files and media), which looks like:
media_path <- "/word/media/image3.png"
If you want to see what this structure looks like, you can right-click your document > 7-Zip > Extract files... and a folder containing the document contents will be created. Otherwise, just change the image number to select the desired image.
Note: sometimes images have names that do not follow the image.png format so you may need to extract the files to find the name of the desired image.
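As an alternative to unpacking the file with 7-Zip, the archive entries can be listed from R. This is a minimal sketch that assumes only base R's unzip(); the helper name list_docx_media is hypothetical and not part of officer.

```r
# A .docx is a zip archive, so its entries can be listed without extracting.
# `list_docx_media` is a hypothetical helper name, not an officer function.
list_docx_media <- function(docx_path) {
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  # Keep only the embedded media files, e.g. "word/media/image3.png"
  entries[startsWith(entries, "word/media/")]
}
```

Calling list_docx_media() on your own document should show the actual image names, including any that do not follow the usual numbering.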
Example using media_extract with docx.
# extracting an image from a Word document using the officer package
library(officer)
report <- read_docx("/Users/user.name/Documents/mydoc.docx")
png_file <- tempfile(fileext = ".png")
media_file <- "/word/media/image3.png"
media_extract(report, path = media_file, target = png_file)
The output you are looking for is TRUE.
The image can then be included in a report using knitr (or another method).
knitr::include_graphics(png_file)
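To pull out every embedded image rather than one named file, the same call can be repeated over the archive listing. This is only a sketch: extract_docx_images is a hypothetical helper, and it assumes media_extract() accepts the archive-relative names that unzip() reports (for example word/media/image3.png).

```r
library(officer)

# Hypothetical helper: copy every file under word/media/ into `out_dir`.
extract_docx_images <- function(docx_path, out_dir = tempfile("docx_media_")) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  doc <- read_docx(docx_path)
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  media <- entries[startsWith(entries, "word/media/")]
  targets <- file.path(out_dir, basename(media))
  # media_extract() returns TRUE for each file it manages to copy
  copied <- vapply(seq_along(media), function(i) {
    media_extract(doc, path = media[i], target = targets[i])
  }, logical(1))
  # Return the paths of the files that were written
  targets[copied]
}
```

The returned paths can then be passed to knitr::include_graphics() one at a time or as a vector.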

Related

How to extract images from word and powerpoint using media_extract in r?

I am working in R Markdown to produce a report that extracts and displays images from Word and PowerPoint files.
To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.
I have two issues:
How to view or use the image after I have located it.
In Word, how to locate the image without the media_file column.
I have been able to locate an image in pptx using this function: the pptx_summary function creates a data frame with a media_file column, which holds a file path for image elements. The media_file value is then passed as the path argument of media_extract to locate the image. See the example code from the package documentation below:
library(officer)
example_pptx <- system.file(package = "officer",
  "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
# keep only the rows that describe images
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
However, when I run media_extract it returns 'TRUE', which is the example output, but I am unsure how to add the image to my report. I've tried assigning the result of media_extract to a variable, e.g.
image <- media_extract(doc, path = media_file, target = png_file)
but this returns 'FALSE'.
How do I include the image as an image in my report?
The second issue I'm having is how to locate an image in Word. The documentation for media_extract says it can be used to extract images from both .docx and .pptx, but I have only managed to get it to work for the latter. I haven't been able to create a file path for .docx.
The file path comes from either docx_summary or pptx_summary, depending on the file type; each creates a data frame summarising the file's content. The pptx_summary data frame includes a media_file column, which holds a file path for each image. The docx_summary data frame doesn't include this column. Another Stack Overflow post proposed a solution using the word/media/ subdirectory, which seemed to work, but I'm not sure what this means or how to use it.
How do I extract an image from a Word document, using the word/media/ subdirectory as the media path?

media_extract() is a function that copies the media file to the location you specify. We can show the extracted images in R Markdown with at least 3 methods:
knitr::include_graphics()
regular markdown
magick::image_read()
They are illustrated below:
---
title: "media_extract usage"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(officer)
library(flextable)
example_pptx <- system.file(package = "officer",
"doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
```
## include_graphics
```{r out.width="200px"}
knitr::include_graphics(png_file)
```
## markdown
You can't use `tempfile()` here - path is better when defined as relative.
Let's write it to "./file.png".
```{r results='hide'}
media_extract(doc, path = media_file, target = "file.png")
```
![](file.png){style="width:200px;"}
## magick
```{r out.width="200px"}
magick::image_read(png_file)
```
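On the TRUE/FALSE return value from the question: one possible explanation, assuming media_extract() copies the file with base R's file.copy() (which does not overwrite an existing file by default), is that the second call finds png_file already present. The behaviour of file.copy() itself can be checked without officer:

```r
# Stand-in source file; any existing file would do.
src <- tempfile(fileext = ".txt")
writeLines("demo", src)
dest <- tempfile(fileext = ".txt")

first  <- file.copy(src, dest)  # destination does not exist yet
second <- file.copy(src, dest)  # destination now exists and overwrite = FALSE
print(c(first = first, second = second))
```

Here first is TRUE and second is FALSE. In either case the value to pass to include_graphics() is the target path (png_file), not the logical result.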


Bind or merge multiple powerpoints in r

I have been using the officer package to create several PowerPoint decks. I would now like to merge/bind them all into one slide deck, but I have not been able to figure out how. Can someone tell me whether there is a package that helps to merge multiple PowerPoint decks into one?

I believe there are currently no functions or packages that do this in R, so I'll suggest a few possible solutions that come to mind.
1: I believe you could use read_pptx() to read, say, a deck1 and a deck2 files. Then, loop through the slide indexes of deck2, and use those values to add_slide() into deck1. I think there's a function in officer called pptx_summary(), which converts a pptx R object into a tibble, but I'm not sure you could convert a tibble back to a pptx R object.
2: You could convert pptx files into pdf files, and use pdftools to join them.
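For option 2, the joining step might look like the sketch below. It assumes the decks have already been exported to PDF by some other means (the conversion is not shown), and join_decks_as_pdf is a hypothetical helper name; pdf_combine() is the pdftools function that does the merge.

```r
library(pdftools)

# Join PDFs that were already converted from the decks, in the order given.
# `join_decks_as_pdf` is a hypothetical helper name.
join_decks_as_pdf <- function(pdf_paths, output) {
  stopifnot(length(pdf_paths) >= 1, all(file.exists(pdf_paths)))
  pdf_combine(input = pdf_paths, output = output)
}
```

Note that the result is a PDF, so this route gives up the ability to edit the slides afterwards.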

When creating PowerPoint slides automatically via R (for example by using the PowerPoint export of R Markdown), merging them with pre-made fixed slides (for example explanations with elaborate visuals) may well become necessary. As there seems to be no single-line solution so far, here's an incomplete answer to a 3-year-old question.
A look into the sources of officer shows that the package works with a data structure and a temporary folder in the background that contains the XML files that are zipped in the PPTX file.
Copying slides therefore requires both updating the structure and copying the XML files and, eventually, other resources. Here is a very rough draft of how merging two PowerPoint files can work, based on the officer classes.
library(officer)
merge_pptx = function(source, target, filename) {
  # go through the slides of source
  for (index in seq_len(source$slide$length())) {
    # We need a new filename in the target's slide directory
    new_slidename <- target$slide$get_new_slidename()
    xml_file <- file.path(target$package_dir, "ppt/slides", new_slidename)
    # Copy XML from source to new filename
    orgFilename = source$slide$get_metadata()$filename[index]
    file.copy(orgFilename, xml_file)
    # Not sure yet, what exactly this does
    slide_info <- target$slideLayouts$get_metadata()[1,] # Use first best layout at the moment
    layout_obj <- target$slideLayouts$collection_get(slide_info$filename)
    layout_obj$write_template(xml_file)
    # update presentation elements
    target$presentation$add_slide(target = file.path("slides", new_slidename))
    target$content_type$add_slide(partname = file.path("/ppt/slides", new_slidename))
    # Add the slide to the collection
    target$slide$add_slide(xml_file, target$slideLayouts$get_xfrm_data())
    target$cursor <- target$slide$length()
  }
  # write the merged deck to disk
  print(target, target=filename)
}
source = read_pptx("One.pptx")
target = read_pptx("Two.pptx")
merge_pptx(source, target, "Combined.pptx")
Please note that this is only a draft. It does not yet respect the different layouts for slides, let alone different masters. Embedded files (images) are not yet copied.
The bulk of this function is inspired by the add_slide() function in the dir_slide class, see https://github.com/davidgohel/officer/blob/master/R/ppt_class_dir_collection.R

Extracting one text files from multiple zip archives in R

I am trying to extract one text file from each of the zip files located in one folder. Then I want to combine those text files into one dataframe.
The folder has multiple Zip files:
pf_0915.zip
pf_0914.zip
pf_0913.zip
.....
Inside those zip files are multiple text files. I am only interested in the one called abc.txt. This is a fixed-width format file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they are all read they should be combined into a large file called abcCombined.txt.
Or, as each new abc.txt file is read, we could add it to abcCombined.txt.
I have tried various versions of unzip() and unz() without much success, and so far without looping through all the zip files. Finally, this directory contains many zip files. Is there a way to read only some of them using pattern matching, like grep? For example, I would be interested in reading only the September files, those .._09...txt.
Any hints would be appreciated.

The following:
Creates a vector of the files in a directory
Uses the list parameter to unzip() to see the metadata for the contents
Builds a regular expression to find only the target file (I did that in the event your use-case generalizes to a broader pattern)
Tests if any of the files meet your criteria
Keeps only those files into a resultant vector
Iterates over that vector and
Extracts only the target file into a temporary directory
Reads it into a data.frame
Ultimately binds the individual data.frames into one big one
You can write out the resultant combined data.frame however you wish.
library(purrr)
target_dir <- "so"
extract_file <- "abc.txt"
list.files(target_dir, full.names=TRUE) %>%
  keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>%
  map_df(function(x) {
    td <- tempdir()
    read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))
  }) -> combined_df
The version below just expands some of the shortcuts in the one above:
only_files_with_this_name <- function(zip_path, name) {
  zip_contents <- unzip(zip_path, list=TRUE)
  look_for <- sprintf("^%s$", name)
  any(grepl(look_for, zip_contents$Name))
}
list.files(target_dir, full.names=TRUE) %>%
  keep(only_files_with_this_name, name=extract_file) %>%
  map_df(function(x) {
    td <- tempdir()
    file_in_zip <- unzip(x, extract_file, exdir=td)
    df <- read.fwf(file_in_zip, widths=c(4,1,4,2))
    # remove the extracted copy, then return the data (not unlink()'s status)
    unlink(file_in_zip)
    df
  }) -> combined_df

Can't comment because of my low reputation, so although this is a partial answer:
If you know the file name within the various zips the syntax to get just that file would be something like the following:
my_data<-read.csv(unz("pf_0915.zip","abc.txt"))
This is the code for a csv obviously, not a fixed width text, but if you already have that set up, it'll be something like
my_data<-read_fwf(unz("pf_0915.zip","abc.txt") ... )
with all your other parameters in the ...
You can do this in a loop if you have many zips, and accumulate them in a data frame, data table, whatever structure floats your boat...
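The loop, together with the pattern matching asked about in the question, could be sketched as follows. The function names are hypothetical, the file-name pattern assumes archives named like pf_0915.zip, and the widths are the placeholder values from the other answer, not the real layout of abc.txt.

```r
# Find only the September archives, e.g. pf_0913.zip, pf_0914.zip, ...
september_zips <- function(dir) {
  list.files(dir, pattern = "^pf_09[0-9]{2}\\.zip$", full.names = TRUE)
}

# Read one member of each archive through a connection (nothing is extracted
# to disk) and stack the results, tagging each row with its archive name.
read_member_from_zips <- function(zip_paths, member = "abc.txt",
                                  widths = c(4, 1, 4, 2)) {
  pieces <- lapply(zip_paths, function(z) {
    df <- read.fwf(unz(z, member), widths = widths)
    df$archive <- basename(z)
    df
  })
  do.call(rbind, pieces)
}
```

Something like read_member_from_zips(september_zips("so")) would then give one combined data frame, where the archive column takes the place of renaming each abc.txt.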

RMarkdown Inline Code Format

I am reading ISL at the moment, which is related to machine learning in R.
I really like how the book is laid out, specifically where the authors reference code or libraries inline, for example library(MASS).
Does anyone know if the same effect can be achieved using R Markdown, i.e. making the MASS keyword above brown when I reference it in a paper? I want to color-code columns in data frames when I talk about them in the R Markdown document. When you knit it as an HTML document it provides pretty good formatting, but when I knit it to MS Word it seems to just change the font type.
Thanks

I've come up with a solution that I think might address your issue. Essentially, because inline source code gets the same style label as code chunks, any change you make to SourceCode will be applied to both chunks and inline code, which I don't think is what you want. Instead, there needs to be a way to target just the inline code, which doesn't seem to be possible from within rmarkdown. What I've opted to do is take the .docx file that is produced, rename it to a .zip file, and then modify the .xml file inside that has all the data. This applies a new style to the inline source code text, which can then be modified in your MS Word template. Here is the code:
format_inline_code = function(fpath) {
  if (tools::file_ext(fpath) != "docx") stop("File must be a .docx file...")
  # Resolve to an absolute path so it stays valid after setwd()
  fpath = normalizePath(fpath)
  cur_dir = getwd()
  # Restore the working directory even if a later step fails
  on.exit(setwd(cur_dir), add = TRUE)
  setwd(dirname(fpath))
  out = sub("docx$", "zip", fpath)
  # Convert to zip file
  file.rename(fpath, out)
  # Extract files
  unzip(out, exdir=".")
  # Read in document.xml
  xml = readr::read_lines("word/document.xml")
  # Replace styling
  # VerbatimChar didn't appear to be the style that was applied in Word, nor was
  # it present to be styled. VerbatimStringTok was though.
  xml = gsub("VerbatimChar", "VerbatimStringTok", xml, fixed = TRUE)
  # Save document.xml
  readr::write_lines(xml, "word/document.xml")
  # Zip files
  .files = c("_rels", "docProps", "word", "[Content_Types].xml")
  zip(zipfile=out, files=.files)
  # Convert to docx
  file.rename(out, fpath)
  # Remove the folders extracted from zip; expand = FALSE (R >= 4.0) keeps
  # the brackets in "[Content_Types].xml" from being read as a wildcard
  unlink(.files, recursive=TRUE, expand=FALSE)
}
The style that you'll want to modify in your MS Word template is VerbatimStringTok. Hope that helps!
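Before editing the template, it may help to check which character styles the knitted document actually references. The sketch below uses only base R; both function names are hypothetical, and the regular expression assumes the styles appear as `<w:rStyle w:val="..."/>` elements with exactly that attribute layout.

```r
# Pull the character-style names referenced by <w:rStyle w:val="..."/> elements.
run_styles_in_xml <- function(xml_text) {
  hits <- regmatches(xml_text, gregexpr('<w:rStyle w:val="[^"]+"', xml_text))
  styles <- sub('^<w:rStyle w:val="', "", unlist(hits))
  sort(unique(sub('"$', "", styles)))
}

# Read word/document.xml straight out of the .docx through a zip connection.
docx_run_styles <- function(docx_path) {
  con <- unz(docx_path, "word/document.xml")
  on.exit(close(con))
  run_styles_in_xml(paste(readLines(con, warn = FALSE), collapse = ""))
}
```

Running docx_run_styles() before and after format_inline_code() should show VerbatimChar being replaced by VerbatimStringTok.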

Appending r output in a single sheet of xlsx file

How can I append my R outputs to a single sheet of an xlsx file? I am currently working on web crawling, where I need to scrape the user reviews from a website and save them on my desktop in xlsx format. Each time, I need to change the website URL in my code (as the user reviews are on different pages) and save the output in one sheet of the xlsx file.
Can you please help me with the code for appending outputs to a single sheet of an xlsx file? Below is the code I am using. Each time, I change the website URL, run the same code, and want to save the corresponding output in a single sheet of mydata.xlsx.
library("rvest")
htmlpage <- read_html("http://www.glassdoor.com/GD/Reviews/Symphony-Teleca-Reviews-E28614_P2.htm?sort.sortType=RD&sort.ascending=false&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME&filter.employmentStatus=UNKNOWN")
proshtml <- html_nodes(htmlpage, ".pros")
pros <- html_text(proshtml)
pros
data=data.frame(pros)
library(xlsx)
write.xlsx(data, "D:/mydata.xlsx", append=TRUE)

A trivial, but super-slow way:
If you only need to add (a few) row(s) to an existing Excel file, and it only has one sheet to which you want to append, you can just do a simple read => overwrite step:
SHEET.NAME <- '...' # fill in with yours
existing.data <- read.xlsx(file, sheetName = SHEET.NAME)
new.data <- rbind(existing.data, data)
write.xlsx(new.data, file, sheetName = SHEET.NAME, row.names = F, append = F)
Note:
It's quite slow in general, will work only for small scale
read.xlsx is a slow function. Try read.xlsx2 to make it much faster (see the difference in the docs)
If your R process is run once and keeps working for a long time, obviously don't do it this way (reading and overwriting a file is ridiculous in that case)
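For the long-running case, one option is to collect every page in memory first and write the sheet once at the end. In this sketch, collect_pages is a hypothetical helper and scrape_page stands in for whatever turns one URL into a data frame (for example the rvest code from the question).

```r
# `scrape_page` is a placeholder: any function that takes a URL and returns
# a data frame with the same columns for every page.
collect_pages <- function(urls, scrape_page) {
  pieces <- lapply(urls, scrape_page)
  # Stack the per-page data frames into one
  do.call(rbind, pieces)
}
```

The combined result can then be written with a single write.xlsx() call, which avoids re-reading the file for every page.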

Look at the xlsx package.
?write.xlsx will show you what you want. append=TRUE is the key.
========= EDIT TO CORRECT =========
As @Jakub pointed out, append=TRUE adds another worksheet to the file.
========= EDIT TO ADD: ANOTHER METHOD ==========
Another method is to save the data to a .csv file, which can easily be opened in Excel. In this case, append=T works as expected (adding to the existing sheet):
write.table(df,"D:/MyFile.csv",append=T,sep=",")
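One caveat with the CSV route is that write.table() writes the column header again on every call unless col.names is switched off. A small sketch of one way around that, with append_csv as a hypothetical helper name:

```r
# Append rows to a CSV, writing the header only when the file is first created.
append_csv <- function(df, path) {
  new_file <- !file.exists(path)
  write.table(df, path, append = !new_file, sep = ",",
              col.names = new_file, row.names = FALSE, qmethod = "double")
}
```

Calling append_csv(data, "D:/mydata.csv") after each page would then grow a single table that Excel can open directly.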
