I am working in rmarkdown to produce a report that extracts and displays images extracted from word.
To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.
In word, I am struggling to locate the image without the media_path column.
The media_path is used as an argument in the media_extract function to locate the image. See example code from package documentation below:
example_pptx <- system.file(package = "officer",
"doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
The file path is generated using either; docx_summary or pptx_summary, depending on the file type, which create a data frame summary of the files. The pptx_summary includes a column media_path, which displays a file path for the image. The docx_summary data frame doesn't include this column. Another stackoverflow post posed a solution for this using word/media/ subdir which seemed to work, however I'm not sure what this means or how to use it?
How do I extract an image from a word doc, using word/media/ subdir as the media path?
I have continued to research this and found an answer, so I thought I would share!
The difficultly I was having extracting images from docx was due to the absence of a media_file column in the summary data frame (produced using docx_summary), which is used to locate the desired image. This column is present in the data frame produced for pptx pptx_summary and is used in the example code from the package documentation.
In the absence of this column you instead need to locate the image using the document subdirectory (file path when the docx is in XML format), which looks like:
media_path <- "/word/media/image3.png"
If you want see what this structure looks like you can right click on your document >7-Zip>Extract files.. and a folder containing the document contents will be created, otherwise just change the image number to select the desired image.
Note: sometimes images have names that do not follow the image.png format so you may need to extract the files to find the name of the desired image.
Example using media_extract with docx.
#extracting image from word doc using officer package
report <- read_docx("/Users/user.name/Documents/mydoc.docx")
png_file <- tempfile(fileext = ".png")
media_file <- "/word/media/image3.png"
media_extract(report, path = media_file, target = png_file)
The output you are looking for is TRUE.
The image can then be included in a report using knitr (or another method).
include_graphics(png_file)
Using expss package I am creating cross tabs by reading SPSS files in R. This actually works perfectly but the process takes lots of time to load. I have a folder which contains various SPSS files(usually 3 files only) and through R script I am fetching the last modified file among the three.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything is perfect and works absolutely fine but it generally takes too much time to load large files(112MB,48MB for e.g) which isn't good.
Is there a way I can make it more time-efficient and takes less time to create the table. The dropdowns are created using PHP.
I have searched for this and found another library called 'haven' but I am not sure whether that can give me significance as well. Can anyone help me with this? I would really appreciate that. Thanks in advance.
As written in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html) you can use in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)
I am reading ISL at the moment which is related to machine learning in R
I really like how the book is laid out specifically where the authors reference code inline or libraries for example library(MASS).
Does anyone know if the same effect can be achieved using R Markdown i.e. making the MASS keyword above brown when i reference it in a paper? I want to color code columns in data frames when i talk about them in the R Markdown document. When you knit it as a HTML document it provides pretty good formatting but when i Knit it to MS Word it seems to just change the font type
Thanks
I've come up with a solution that I think might address your issue. Essentially, because inline source code gets the same style label as code chunks, any change you make to SourceCode will be applied to both chunks, which I don't think is what you want. Instead, there needs to be a way to target just the inline code, which doesn't seem to be possible from within rmarkdown. Instead, what I've opted to do is take the .docx file that is produced, convert it to a .zip file, and then modify the .xml file inside that has all the data. It applies a new style to the inline source code text, which can then be modified in your MS Word template. Here is the code:
format_inline_code = function(fpath) {
if (!tools::file_ext(fpath) == "docx") stop("File must be a .docx file...")
cur_dir = getwd()
.dir = dirname(fpath)
setwd(.dir)
out = gsub("docx$", "zip", fpath)
# Convert to zip file
file.rename(fpath, out)
# Extract files
unzip(out, exdir=".")
# Read in document.xml
xml = readr::read_lines("word/document.xml")
# Replace styling
# VerbatimChar didn't appear to the style that was applied in Word, nor was
# it present to be styled. VerbatimStringTok was though.
xml = sapply(xml, function(line) gsub("VerbatimChar", "VerbatimStringTok", line))
# Save document.xml
readr::write_lines(xml, "word/document.xml")
# Zip files
.files = c("_rels", "docProps", "word", "[Content_Types].xml")
zip(zipfile=out, files=.files)
# Convert to docx
file.rename(out, fpath)
# Remove the folders extracted from zip
sapply(.files, unlink, recursive=TRUE)
setwd(cur_dir)
}
The style that you'll want to modify in you MS Word template is VerbatimStringTok. Hope that helps!
I am writing an R script that reads in a template .R file, a list of dates, and creates a bunch of folders corresponding to the dates and containing copes of the .R wherein text substitution has been performed in R to customize each script for the given date.
I'm stuck on the part where I write out the .R file though, because the formatting and/or character representation keeps getting screwed up.
Here's a minimal, reproducible example:
RMapsDemo <- readLines("https://raw.githubusercontent.com/hack-r/RMapsDemo/master/RMapsDemo.R")
RMapsDemo <- gsub("## File: RMapsDemo.R", "## File: RMapsDemo.R ####", RMapsDemo)
save(RMapsDemo, file = "RMapsDemo.R") # Doesn't work right
save(RMapsDemo, file = "RMapsDemo.R", ascii = T) # Doesn't work right
dput(RMapsDemo, file = "RMapsDemo.R") # Close, but no cigar
dput(RMapsDemo, file = "RMapsDemo.R", control = c("keepNA", "keepInteger")) # Close, but no cigar
Ricardo Saporta pointed out the solution in the comments -- use writeLines.
I feel stupid for not thinking of this myself. It works beautifully.
writeLines(RMapsDemo, con = "RMapsDemo.R")
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (class(t.sf) == 'try-error')
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like it's predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is package developed by RStudio called pins that might address this problem.