I have created PowerPoint files using the officer package and I would also like to save them as PDF from R (I don't want to manually open each file and save it as PDF). Is this possible?
You can convert the saved PowerPoint file using the code posted here: create pdf in addition to word docx using officer.
You will first need to install pdftools and LibreOffice.
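library(pdftools)
office_shot <- function(file, wd = getwd()) {
  # The soffice path is hard-coded for macOS; adapt it to your system.
  # shQuote() protects paths that contain spaces.
  cmd_ <- sprintf(
    "/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
    shQuote(wd), shQuote(file))
  system(cmd_)
  # Return the name of the PDF that LibreOffice writes into wd
  pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
  pdf_file
}
office_shot(file = "your_presentation.pptx")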
Note that the author of the officer package is the one who referred someone to this response.
Note that the answer from Corey Pembleton uses the LibreOffice path for macOS (which I personally didn't notice at first). The Windows path would be something like "C:/Program Files/LibreOffice/program/soffice.exe".
Since Corey's initial answer, an example using docxtractr::convert_to_pdf can now be found here.
The package and function are the ones John M mentioned in a comment on Corey's initial answer.
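Since the soffice location differs per platform, here is a minimal sketch of how the hard-coded path could be chosen automatically. The helper names soffice_default and office_to_pdf are made up for illustration. The macOS and Windows paths are the ones quoted above; the fallback of a plain "soffice" on the PATH for other systems is an assumption.

```r
# Hypothetical helper: default soffice location for a given OS name.
# macOS and Windows paths are taken from the answers above; the
# fallback ("soffice" found on the PATH) is an assumption.
soffice_default <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Darwin  = "/Applications/LibreOffice.app/Contents/MacOS/soffice",
    Windows = "C:/Program Files/LibreOffice/program/soffice.exe",
    "soffice"
  )
}

# Hypothetical wrapper: convert one .docx/.pptx file to PDF in `outdir`.
office_to_pdf <- function(file, outdir = getwd(), soffice = soffice_default()) {
  # shQuote() keeps paths with spaces intact on the command line
  args <- c("--headless", "--convert-to", "pdf",
            "--outdir", shQuote(outdir), shQuote(file))
  status <- system2(soffice, args)
  if (status != 0) stop("LibreOffice conversion failed for ", file)
  # Path of the PDF that LibreOffice writes into `outdir`
  file.path(outdir, sub("\\.(docx|pptx)$", ".pdf", basename(file)))
}
```

Usage would look like office_to_pdf("your_presentation.pptx"); pass soffice = "..." explicitly if LibreOffice is installed somewhere else.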
An easy solution to this question is to use the convert_to_pdf function from the docxtractr package. Note: this solution requires downloading LibreOffice from here. I used the following steps.
First, I set the path to LibreOffice's soffice.exe.
library(docxtractr)
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
Second, I set the path of the PowerPoint document I want to convert to pdf.
pptx_path <- "G:/My Drive/Courses/Aysem/Certifications/September17_Part2.pptx"
Third, I convert it using the convert_to_pdf function.
pdf <- convert_to_pdf(pptx_path, pdf_file = tempfile(fileext = ".pdf"))
Be careful here. The converted pdf file is saved in a local temporary folder. Here is mine to give you an idea. Just go and copy it from the temporary folder.
"C:\\Users\\MEHMET~1\\AppData\\Local\\Temp\\RtmpqAaudc\\file3eec51d77d18.pdf"
EDIT: A quick way to control where the converted PDF is saved: just replace the third step with the following line of code, which writes the PDF next to the original file. You can set any path you like, so you don't need to look for the odd local temp folder.
pdf <- convert_to_pdf(pptx_path, pdf_file = sub("[.]pptx$", ".pdf", pptx_path))
I hope someone can help me. I use pdf_subset() from the pdftools package to select some pages from a .pdf file and save them in a new .pdf file. However, there is a problem: my path/filename contains special characters (Polish letters), which are replaced by other symbols when the file is saved. How can I stop the characters from being replaced?
Thanks!
library(pdftools)
# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
           pages = 1:3, output = "FŁŻ_6/Siłaków.pdf")
Error in 'cpp_pdf_select(input, output, pages, password)':
open C:\Users\PDF\FĹĹ»_6\SiĹ‚akĂłw.pdf: No such file or directory
Please forgive me if this is not perfect but this is my first post.
I am currently trying to convert a large number of .docx documents into .pdf.
I have found the RDCOMClient package which has done wonders. However I now need to add alt text into my charts. The code I am using is below:
library(RDCOMClient)
library(plyr)
library(tidyverse)
# this will destroy all objects in your workspace so be careful
# rm(list = ls()) # deletes all data frames
file <- "directory"
wordApp <- COMCreate("Word.Application") # create COM object
wordApp[["Visible"]] <- FALSE # set to TRUE to show the Word window
wordApp[["Documents"]]$Add() # adds a new blank document in your application
wordApp[["Documents"]]$Open(Filename = file) # opens your docx in wordApp
# THIS IS THE MAGIC
wordApp[["ActiveDocument"]]$SaveAs("Directory",
                                   FileFormat = 17) # FileFormat = 17 saves as .PDF
wordApp[["ActiveDocument"]]$Close(SaveChanges = 0) # 0 = close without saving changes
Properties such as Documents are accessed with double brackets ([[ ]]). Is there an equivalent for charts?
I have found a full list of them for excel at the link here: http://www.omegahat.net/RDCOMClient/Docs/introduction.html
However, I tried to install the SWinTypeLibs package to get the same list for Word, using the following code:
install.packages("remotes")
remotes::install_github("omegahat/SWinTypeLibs")
and I keep getting an error.
If anyone has a list for Word like the Excel one above, I would really appreciate it.
Thanks for all the help in advance.
James
I've applied a quick fix to the .Rd docs in the original package from omegahat. You can find it here; it should compile now.
karnner2/SWinTypeLibs
devtools::install_github("Karnner2/SWinTypeLibs")
I am using officer (I used to use ReporteRs) within a loop to create 150 unique documents. However, I need these documents to be exported from R as Word docx AND PDF files.
Is there a way to export the document created with officer to a pdf?
That's possible, but the solution I have depends on LibreOffice. Here is the code I am using; I hope it helps. I've hard-coded the LibreOffice path, so you will probably have to adapt or improve the code for the variable cmd_.
The code converts a PPTX or DOCX file to PDF.
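library(pdftools)
office_shot <- function(file, wd = getwd()) {
  # The soffice path is hard-coded for macOS; adapt it to your system.
  # shQuote() protects paths that contain spaces.
  cmd_ <- sprintf(
    "/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
    shQuote(wd), shQuote(file))
  system(cmd_)
  # Return the name of the PDF that LibreOffice writes into wd
  pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
  pdf_file
}
office_shot(file = "your_presentation.pptx")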
I've been using RDCOMClient to convert my officer-created docx files to PDFs.
library(RDCOMClient)
file <- "C:/path/to your/doc.docx"
wordApp <- COMCreate("Word.Application") #creates COM object
wordApp[["Documents"]]$Open(Filename=file) #opens your docx in wordApp
wordApp[["ActiveDocument"]]$SaveAs("C:/path/to your/doc.pdf", FileFormat=17) #saves as PDF
wordApp$Quit() #quits the COM Word application
I found the FileFormat=17 bit here https://learn.microsoft.com/en-us/office/vba/api/word.wdexportformat
I've also been able to put the above in a loop to convert multiple docx files to PDFs quickly.
Hope this helps!
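The loop mentioned above could look roughly like the sketch below. It has not been run here, and it assumes Windows, an installed Microsoft Word and the RDCOMClient package. The names pdf_target and docx_dir_to_pdf are hypothetical. Using the document object returned by Open() instead of ActiveDocument, and passing SaveChanges = 0 (Word's "do not save changes" option) to Close(), are choices of this sketch and not part of the answer.

```r
# Hypothetical helper: PDF path next to each .docx file.
pdf_target <- function(docx) sub("\\.docx$", ".pdf", docx, ignore.case = TRUE)

# Sketch: convert every .docx in `dir` using a single Word instance.
# Assumes Windows, Microsoft Word and the RDCOMClient package.
docx_dir_to_pdf <- function(dir) {
  # Word expects full paths, so normalise them first
  files <- normalizePath(
    list.files(dir, pattern = "\\.docx$", full.names = TRUE, ignore.case = TRUE),
    winslash = "\\"
  )
  wordApp <- RDCOMClient::COMCreate("Word.Application")
  on.exit(wordApp$Quit(), add = TRUE)  # close Word even if a file fails
  for (f in files) {
    doc <- wordApp[["Documents"]]$Open(Filename = f)
    doc$SaveAs(pdf_target(f), FileFormat = 17)  # 17 = PDF, as in the answer
    doc$Close(SaveChanges = 0)                  # assumption: 0 = do not save changes
  }
  invisible(pdf_target(files))
}
```

Opening Word once and reusing it for all files avoids the start-up cost of a new Word instance per document.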
You can convert your docx to PDF with the convert_to_pdf function from the docxtractr package.
Note that this function uses LibreOffice to convert docx to PDF, so you have to install LibreOffice first and set the path to soffice.exe. Read more about paths for different OSes here.
Here is a simple example of how to convert several docx documents into PDF on a Windows machine. I have Windows 10 and LibreOffice 6.4 installed. Imagine that you have X Word documents stored in the data folder and you want to create the same number of PDFs in the data/pdf folder (you have to create the pdf folder beforehand).
library(dplyr)
library(purrr)
library(docxtractr)
# Point docxtractr to the LibreOffice executable first
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
# 1) List of Word documents (pattern is a regular expression, not a glob)
words <- list.files("data",
                    pattern = "\\.docx$",
                    full.names = TRUE)
# 2) Custom function
word2pdf <- function(path) {
  # Extract the file name without folder and extension
  name <- tools::file_path_sans_ext(basename(path))
  convert_to_pdf(path,
                 pdf_file = paste0("data/pdf/",
                                   name,
                                   ".pdf"))
}
# 3) Convert
words %>%
  map(~ word2pdf(.x))
I would like to automatically read into R the file located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does anyone have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it does not appear to be a regular xls file, and I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() did the job.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <-
  read_delim(
    unzip("./data/rte_data", exdir = "./data/"),
    delim = "\t",
    locale = locale(encoding = "ISO-8859-1"),
    col_names = TRUE
  )
There are still parsing problems, but I'm not sure they should be dealt with in this SO question.
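For the download.file() route mentioned in the edit above, a minimal sketch with named arguments could look like this. The helper name fetch_zip is made up, and it has not been run against the RTE server here, so treat it as an outline.

```r
# Hypothetical helper: download a zip archive and return the extracted file paths.
fetch_zip <- function(url,
                      destfile = tempfile(fileext = ".zip"),
                      exdir = tempdir()) {
  # Named arguments make it impossible to swap `url` and `destfile`;
  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(url = url, destfile = destfile, mode = "wb")
  unzip(zipfile = destfile, exdir = exdir)
}

# Usage (not run):
# paths <- fetch_zip("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# rte_data <- readr::read_delim(paths[1], delim = "\t",
#                               locale = readr::locale(encoding = "ISO-8859-1"))
```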
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the lines of the PDF in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
              "56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line application packaged as a Java JAR, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get the data out of the tables it contains.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
# Set path to pdftotext.exe and convert PDF to text
library(stringr)
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
# pdfFracList (PDF file names) and reportDir (their folder) are assumed to be defined earlier
for (i in seq_along(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)  # drop the ".pdf" extension
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", Processing file ", pdfSource))
  # shQuote() protects paths that contain spaces
  system(paste(exeFile, "-table", shQuote(pdfSource), shQuote(txtDestination), sep = " "), wait = TRUE)
}