create pdf in addition to word docx using officer - r

I am using officer (I used to use ReporteRs) within a loop to create 150 unique documents. However, I need these documents to be exported from R as both Word .docx files and PDFs.
Is there a way to export the document created with officer to a pdf?

That's possible, but the solution I have depends on LibreOffice. Here is the code I am using; I hope it helps. I've hard-coded the LibreOffice path, so you will probably have to adapt or improve the code for the variable cmd_.
The code converts a PPTX or DOCX file to PDF.
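library(pdftools)
office_shot <- function( file, wd = getwd() ){
  # Build the headless LibreOffice command; shQuote() protects paths containing spaces
  cmd_ <- sprintf(
    "/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
    shQuote(wd), shQuote(file) )
  system(cmd_)
  # LibreOffice writes <name>.pdf into wd; return that file name
  pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
  pdf_file
}
office_shot(file = "your_presentation.pptx")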
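For the original use case (150 documents created in a loop), the same idea can be wrapped in a small helper. This is only a sketch: soffice_args() and convert_all() are hypothetical names, and it assumes the soffice executable is on the PATH (otherwise pass its full path). It uses system2() with quoted arguments so that paths containing spaces survive.

```r
# Hypothetical helper: argument vector for one headless LibreOffice conversion.
soffice_args <- function(file, outdir) {
  c("--headless", "--convert-to", "pdf",
    "--outdir", shQuote(outdir), shQuote(file))
}

# Hypothetical helper: convert every file and return the expected PDF paths.
# Assumes `soffice` points at a working LibreOffice executable.
convert_all <- function(files, outdir = getwd(), soffice = Sys.which("soffice")) {
  stopifnot(nzchar(soffice))
  for (f in files) {
    system2(soffice, soffice_args(f, outdir))
  }
  file.path(outdir, sub("\\.(docx|pptx)$", ".pdf", basename(files)))
}

# Example call (not run here because it needs LibreOffice):
# convert_all(list.files("reports", pattern = "\\.docx$", full.names = TRUE))
```

The function returns the paths where the PDFs are expected, which makes it easy to check afterwards with file.exists() whether every conversion succeeded.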

I've been using RDCOMClient to convert the docx files I create with officer to PDFs.
library(RDCOMClient)
file <- "C:/path/to your/doc.docx"
wordApp <- COMCreate("Word.Application") # creates COM object
wordApp[["Documents"]]$Open(Filename = file) # opens your docx in wordApp
wordApp[["ActiveDocument"]]$SaveAs("C:/path/to your/doc.pdf", FileFormat = 17) # saves as PDF
wordApp$Quit() # quits the COM Word application
I found the FileFormat=17 bit here https://learn.microsoft.com/en-us/office/vba/api/word.wdexportformat
I've been able to put the above in a loop to convert multiple docx's to PDFs quickly, too.
Hope this helps!
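The loop mentioned above might look roughly like the following. This is an untested sketch that assumes Windows, an installed copy of Word, and the RDCOMClient package; pdf_target() and docx_to_pdf_com() are hypothetical names, and the Close(SaveChanges = 0) call is my assumption about how to close each document without saving.

```r
# Hypothetical helper: derive the PDF path from a .docx path.
pdf_target <- function(docx) sub("\\.docx$", ".pdf", docx, ignore.case = TRUE)

# Sketch of the loop; Windows only, requires Word and RDCOMClient.
docx_to_pdf_com <- function(files) {
  # Absolute paths are safer, because Word has its own working directory
  files <- normalizePath(files, winslash = "\\")
  word <- RDCOMClient::COMCreate("Word.Application")
  on.exit(word$Quit(), add = TRUE)  # quit Word even if one file fails
  for (f in files) {
    word[["Documents"]]$Open(Filename = f)
    word[["ActiveDocument"]]$SaveAs(pdf_target(f), FileFormat = 17)  # 17 = PDF
    word[["ActiveDocument"]]$Close(SaveChanges = 0)  # 0 = do not save changes
  }
  invisible(pdf_target(files))
}
```

Creating the Word COM object once and reusing it for all files is what makes the loop quick; starting Word for every document would be much slower.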

There is a way to convert your docx to PDF: the convert_to_pdf function from the docxtractr package.
Note that this function uses LibreOffice to convert docx to pdf, so you have to install LibreOffice first and provide the path to soffice.exe. Read more about paths for different OS here.
Here is a simple example of how to convert several docx documents to PDF on a Windows machine. I have Windows 10 and LibreOffice 6.4 installed. Imagine that you have X Word documents stored in the data folder and you want to create the same number of PDFs in the data/pdf folder (you have to create the pdf folder beforehand).
library(purrr)
library(docxtractr)
# Point docxtractr at the LibreOffice executable first
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
# 1) List of Word documents (pattern is a regular expression, not a glob)
words <- list.files("data",
                    pattern = "\\.docx$",
                    full.names = TRUE)
# 2) Custom function
word2pdf <- function(path){
  # Extract the file name without its directory and extension
  name <- tools::file_path_sans_ext(basename(path))
  convert_to_pdf(path,
                 pdf_file = paste0("data/pdf/",
                                   name,
                                   ".pdf"))
}
# 3) Convert
words %>%
  map(~word2pdf(.x))
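The output folder does not have to be created by hand; base R can do it. A minimal sketch, where pdf_path_for() is a hypothetical helper and the folder under tempdir() only stands in for data/pdf:

```r
# Hypothetical helper: where the PDF for a given .docx should be written.
pdf_path_for <- function(docx, out_dir) {
  file.path(out_dir, paste0(tools::file_path_sans_ext(basename(docx)), ".pdf"))
}

# Stand-in for "data/pdf"; dir.create() does nothing if the folder already exists
out_dir <- file.path(tempdir(), "pdf")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

target <- pdf_path_for("data/report one.docx", out_dir)
basename(target)
```

The resulting path could then be passed as the pdf_file argument of convert_to_pdf().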

Related

Saving pptx as pdf in R

I have created PowerPoint files using the officer package, and I would also like to save them as PDF from R (I don't want to manually open each file and save it as PDF). Is this possible?
You can save the edited PowerPoint object using the code posted here: create pdf in addition to word docx using officer.
You will first need to install pdftools and LibreOffice.
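library(pdftools)
office_shot <- function( file, wd = getwd() ){
  # Build the headless LibreOffice command; shQuote() protects paths containing spaces
  cmd_ <- sprintf(
    "/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
    shQuote(wd), shQuote(file) )
  system(cmd_)
  # LibreOffice writes <name>.pdf into wd; return that file name
  pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
  pdf_file
}
office_shot(file = "your_presentation.pptx")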
Note that the author of the officer package is the one who referred someone to this response.
Note that the answer from Corey Pembleton has the LibreOffice macOS path (which I personally didn't notice at first). The Windows path would be something like "C:/Program Files/LibreOffice/program/soffice.exe".
Since the initial answer provided by Corey, an example using docxtractr::convert_to_pdf can now be found here.
The package and function are the ones John M mentioned in a comment on Corey's initial answer.
An easy solution to this question is to use the convert_to_pdf function from the docxtractr package. Note: this solution requires downloading LibreOffice from here. I used the following steps.
First, I set the path to LibreOffice's soffice.exe.
library(docxtractr)
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
Second, I set the path of the PowerPoint document I want to convert to pdf.
pptx_path <- "G:/My Drive/Courses/Aysem/Certifications/September17_Part2.pptx"
Third, I convert it using the convert_to_pdf function.
pdf <- convert_to_pdf(pptx_path, pdf_file = tempfile(fileext = ".pdf"))
Be careful here. The converted pdf file is saved in a local temporary folder. Here is mine to give you an idea. Just go and copy it from the temporary folder.
"C:\\Users\\MEHMET~1\\AppData\\Local\\Temp\\RtmpqAaudc\\file3eec51d77d18.pdf"
EDIT: A quick solution to find where the converted pdf is saved. Just replace the third step with the following line of code. You can set the path where you want to save. You don't need to look for the weird local temp folder.
pdf <- convert_to_pdf(pptx_path, pdf_file = sub("[.]pptx", ".pdf", pptx_path))
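One small caveat about the sub() call: the pattern is not anchored, so it replaces the first ".pptx" it finds anywhere in the path. A short illustration with a made-up path (the path is purely hypothetical):

```r
pptx_path <- "slides.pptx.old/deck.pptx"  # hypothetical path for illustration

# Unanchored: the first match is inside the folder name
unanchored <- sub("[.]pptx", ".pdf", pptx_path)
unanchored  # "slides.pdf.old/deck.pptx"

# Anchored with $: only the trailing extension is replaced
anchored <- sub("[.]pptx$", ".pdf", pptx_path)
anchored    # "slides.pptx.old/deck.pdf"
```

For ordinary paths both versions give the same result; anchoring with $ just guards against the unusual case.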

How do I pull in multiple pdfs into pdf_convert using r and pdftools package?

How do I import multiple pdf files into the pdf_convert command of the pdftools package?
I have a directory with multiple pdf files. I'm using the pdftools package with the pdf_convert command to render jpegs from the pdf documents. However, there is no pattern argument for selecting the documents.
I've tried:
for(i in length(dir(folder))){
pdf_convert("C:/folder/*.pdf", format = "jpeg")
}
However that throws an error that says:
Error in normalizePath(path.expand(path), winslash, mustWork) :
path[1]="C:/folder/*.pdf": The filename, directory name, or volume label syntax is incorrect
When I don't use the *.pdf and instead use the actual file name, it works.
How do I get the command to read multiple files?
I'm sorry I don't have a reproducible example. I'm not sure how I would post a directory with multiple pdf files and access to it on SO.
This will do the trick and no need for a loop.
library(pdftools)
directory <- "C:/folder"
file.list <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
lapply(file.list, FUN = function(files) {
pdf_convert(files, format = "jpeg")
})
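The reason the "*.pdf" attempt fails is that pdf_convert() takes one file path and does not expand wildcards, while the pattern argument of list.files() is a regular expression rather than a glob. A self-contained sketch using only base R and a throwaway folder (the file names are invented for the demonstration):

```r
# Throwaway folder with a few empty files
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.pdf", "b.pdf", "notes.txt", "pdf_index.csv")))

# Regular expression: names ending in ".pdf"
by_regex <- list.files(demo_dir, pattern = "\\.pdf$", full.names = TRUE)

# Shell-style wildcard, expanded by R itself
by_glob <- Sys.glob(file.path(demo_dir, "*.pdf"))

# glob2rx() translates a wildcard into the equivalent regular expression
rx <- glob2rx("*.pdf")

basename(by_regex)
```

Either vector of paths can then be handed to lapply() together with pdf_convert().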

OCR on PDF with Tesseract in R, writing TIFF - error

For a small project I am trying to read some data from scanned PDF files that do not contain the data.
Following the instructions of the Tesseract package, the code below should work.
Unfortunately it triggers an error.
Error in tiff::writeTIFF(bitmap, "page.tiff") :
INTEGER() can only be applied to a 'integer', not a 'raw'
Any clue on how this can be resolved?
library(pdftools)
library(tiff)
library(tesseract)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
cat(out)
Perhaps using pdf_convert() instead of pdf_render_page(), i.e.:
library(pdftools)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
pdf_convert(news, format = "tiff")
This generates one TIFF per page in the working directory, so you should add code that reads and processes them one by one.
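Processing the generated files could look like the sketch below. It relies on pdf_convert() returning the names of the files it wrote; ocr_pdf() is a hypothetical helper name, and the function needs the pdftools and tesseract packages when it is actually called. As far as I know, the original pdf_render_page() approach should also work if numeric = TRUE is passed, because the default is a raw bitmap, which writeTIFF() rejects.

```r
# Hypothetical helper: OCR every page of a scanned PDF, one string per page.
ocr_pdf <- function(pdf, dpi = 300) {
  # pdf_convert() writes one image per page and returns the file names
  images <- pdftools::pdf_convert(pdf, format = "tiff", dpi = dpi)
  # Run tesseract on each image in turn
  vapply(images, function(img) tesseract::ocr(img), character(1), USE.NAMES = FALSE)
}

# Example call (not run here; needs pdftools, tesseract and a PDF file):
# pages <- ocr_pdf(news)
# cat(pages[1])
```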

Print/save Excel (.xlsx) sheet to PDF using R

I want to print an Excel file to a pdf file after manipulating it. For the manipulation I used the xlsx package, which works fine. There is a function printSetup, but I cannot find a function to start the printing. Is there a solution for this?
library(xlsx)
file <- "test.xlsx"
wb <- loadWorkbook(file)
sheets <- getSheets(wb) # get all sheets
sheet <- sheets[[1]] # get first sheet
# HERE: MAGIC TO SAVE THIS SHEET TO PDF
There may be a solution using DCOM through the RDCOMClient package, though I would prefer a platform-independent solution (e.g. using xlsx) as I work on macOS. Any ideas?
Below is a solution using the DCOM interface via RDCOMClient. This is not my preferred solution as it only works on Windows. A platform-independent solution would still be appreciated.
library(RDCOMClient)
library(R.utils)
file <- "file.xlsx" # relative path to Excel file
ex <- COMCreate("Excel.Application") # create COM object
file <- getAbsolutePath(file) # convert to absolute path
book <- ex$workbooks()$Open(file) # open Excel file
sheet <- book$Worksheets()$Item(1) # pointer to first worksheet
sheet$Select() # select first worksheet
ex[["ActiveSheet"]]$ExportAsFixedFormat(Type=0, # export as PDF
Filename="my.pdf",
IgnorePrintAreas=FALSE)
ex[["ActiveWorkbook"]]$Save() # save workbook
ex$Quit() # close Excel
An open-source and cross-platform way to do this would be with LibreOffice, like so:
library("XLConnect")
x <- rnorm(100)
y <- x ^ 2
writeWorksheetToFile("test.xlsx", data.frame(x = x, y = y), "Data")
tmpDir <- file.path(tempdir(), "LOConv")
system2("libreoffice", c(paste0("-env:UserInstallation=file://", tmpDir), "--headless", "--convert-to pdf",
"--outdir", getwd(), file.path(getwd(),"test.xlsx")))
Ideally you'd then remove the folder referenced by tmpDir but that would be platform specific.
Note this assumes libreoffice is in your path. If it isn't, then the command would need to be altered to include the full path to the libreoffice executable.
The reason for the env bit is that headless libreoffice will only do anything otherwise if it isn't already running in GUI mode. See http://ask.libreoffice.org/en/question/1686/how-to-not-connect-to-a-running-instance/ for more info.
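Removing the temporary profile folder does not have to be platform specific, since unlink() is part of base R. The sketch below also builds the file:// URL in a way that should cope with Windows drive letters; file_url() is a hypothetical helper and does not percent-encode spaces or other special characters.

```r
# Hypothetical helper: turn a local path into a file:// URL.
file_url <- function(path) {
  p <- normalizePath(path, winslash = "/", mustWork = FALSE)
  if (!startsWith(p, "/")) p <- paste0("/", p)  # Windows: C:/... becomes /C:/...
  paste0("file://", p)
}

tmp_dir <- file.path(tempdir(), "LOConv")
dir.create(tmp_dir, showWarnings = FALSE)
profile_arg <- paste0("-env:UserInstallation=", file_url(tmp_dir))

# ... run the system2() conversion here, passing profile_arg ...

# Remove the profile folder again; works the same on every platform
unlink(tmp_dir, recursive = TRUE)
```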
You could use the pdf function:
pdf(file="myfile.pdf", width=8.5, height=11)
print(firstsheet)
grid.newpage()
print(secondsheet)
grid.newpage()
print(thirdsheet)
dev.off()

Saving graphs in both PDF and PNG format but using PDF files in the final document

I'm using knitr for my analysis. I can save graphs in PDF format with \SweaveOpts{dev=pdf} and in PNG format with \SweaveOpts{dev=png}. I'm interested to save graphs both in PDF and PNG format in one run but to use the PDF in the final documents interactively.
How can I do this?
Here comes the real solution:
Knitr 0.3.9 starts to support multiple devices per chunk (for now, you have to install from GitHub); in your case, you can set the chunk option dev=c('pdf', 'png') to get both PDF and PNG files.
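To apply this to every chunk instead of setting it chunk by chunk, the option can be set globally. A minimal sketch, assuming a knitr version with multi-device support; as far as I know, the first device listed is the one whose files are used in the final document:

```r
library(knitr)

# Set once, e.g. in the first chunk of the document.
# Every plot is written as both PDF and PNG; the PDF comes first,
# so it is the one included in the output.
opts_chunk$set(dev = c("pdf", "png"))
```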
Here is a solution that uses ImageMagick to convert PDF files to PNG. Of course you have to install ImageMagick first, and make sure its bin directory is in PATH:
knit_hooks$set(convert = function(before, options, envir) {
  # quit if before a chunk or no figures in this chunk
  if (before || (n <- options$fig.num) == 0L) return()
  # only convert pdf files
  if (options$fig.ext != 'pdf') return()
  # use ImageMagick to convert all pdf to png
  name = fig_path() # figure filename
  owd = setwd(dirname(name)); on.exit(setwd(owd))
  files = paste(basename(name), if (n == 1L) '' else seq(n), sep = '')
  lapply(files, function(f) {
    system(sprintf('convert %s.pdf %s.png', f, f))
  })
  NULL
})
Basically, this hook is executed after a chunk and runs convert foo.pdf foo.png on all PDF figures. You can use it like this:
<<test-png, convert=TRUE>>=
plot(1); plot(2)
@
Or if you put all figures in a separate directory, you can run convert directly in that directory (i.e. do not have to call system() in R).
This is not an ideal solution but should work. To make use of R's native png() device, you need to answer my question in the above comment first.

Resources