I want to highlight some text in a PDF document using R: search the document for some text and highlight it wherever it is found. I searched for packages that can do this.
pdftools and pdfsearch are packages that help with handling PDF documents, but they mainly convert PDF to text and manipulate the extracted text.
Is there a way in which we can highlight a PDF document using R?
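For context, the search half of the problem can be sketched with plain text matching. This is only a sketch: find_keyword_pages is a hypothetical helper name, and the pages vector below is made-up text standing in for what pdftools::pdf_text() returns (one string per page). It locates pages only; it does not highlight anything.

```r
# Return the page numbers whose text contains `keyword`.
# `pages` is a character vector with one element per page, which is the
# shape pdftools::pdf_text() returns.
find_keyword_pages <- function(pages, keyword, ignore_case = TRUE) {
  which(grepl(keyword, pages, ignore.case = ignore_case))
}

# Stand-in for pdftools::pdf_text("some.pdf"); the text is invented.
pages <- c("The cat is a domestic species.",
           "Dogs are covered elsewhere.",
           "A CAT can be quick.")

find_keyword_pages(pages, "cat")
```

Note that keyword is treated as a regular expression here, so special characters would need escaping.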
I was able to highlight some keywords in a PDF with the following code. There are four steps:
Save the Wikipedia page as a PDF;
Convert the PDF to a Word document using Microsoft Word (it includes OCR);
Highlight the keywords in the Word document;
Save the Word document as a PDF.
library(RDCOMClient)
library(DescTools)
library(pagedown)
#############################################
#### Step 1 : Save wikipedia page as PDF ####
#############################################
chrome_print(input = "https://en.wikipedia.org/wiki/Cat",
output = "C:\\Text_PDF_Cat.pdf")
path_PDF <- "C:\\Text_PDF_Cat.pdf"
path_Word <- "C:\\Text_PDF_Cat.docx"
################################################################
#### Step 2 : Convert PDF to word document with OCR of Word ####
################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
doc_Selection <- wordApp$Selection()
######################################################
#### Step 3 : Highlight keywords in word document ####
######################################################
move_To_Beginning_Doc <- function(doc_Selection)
{
doc_Selection$HomeKey(Unit = wdConst$wdStory) # Need DescTools for wdConst$wdStory
}
highlight_Text_Regex_Word <- function(doc,
                                      doc_Selection,
                                      words_To_Highlight,
                                      colorIndex = 7,
                                      nb_Max_Word = 100)
{
  for(i in words_To_Highlight)
  {
    move_To_Beginning_Doc(doc_Selection)
    for(j in 1 : nb_Max_Word)
    {
      # Execute selects the next match and reports whether one was found
      found <- doc_Selection$Find()$Execute(FindText = i, MatchCase = FALSE)
      if(identical(found, FALSE)) break # no more matches for this word
      doc_Selection_Range <- doc_Selection$Range()
      doc_Selection_Range[["HighlightColorIndex"]] <- colorIndex # 7 = wdYellow
    }
  }
}
highlight_Text_Regex_Word(doc, doc_Selection,
words_To_Highlight = c("cat", "domestic", "quick"),
colorIndex = 7, nb_Max_Word = 100)
###############################################
#### Step 4 : Convert word document to pdf ####
###############################################
path_PDF_Highlighted <- "C:\\Text_PDF_Cat_Highlighted.pdf"
wordApp[["ActiveDocument"]]$SaveAs(path_PDF_Highlighted, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp
library(tidyverse)
library(easyalluvial) # https://github.com/erblast/easyalluvial
library(parcats) # https://github.com/erblast/parcats
# My data
knitr::kable(head(mtcars2))
# My Alluvial
MyAlluvial <- alluvial_wide(data = mtcars2,
max_variables = 5,
fill_by = 'first_variable')
# My Nice alluvial
p <- parcats(MyAlluvial, marginal_histograms = FALSE, data = mtcars2)
p
# Saving PDF
pdf("/Users/Master/Downloads/MyAlluvial.pdf")
p
dev.off()
I'm able to save the plot as PNG/JPG using the RStudio GUI, but I cannot save it in a vector format (neither PDF nor EPS).
As far as I know, the interactive plot is generated with Plotly.
I can get a PDF by printing from the browser, but I would rather avoid that.
I was able to save your graph in a PDF file with the following code :
library(rmarkdown)
library(pagedown)
vector_RMD_Content <- c(
'---',
'title: "Untitled"',
'output: html_document',
'---',
'```{r setup, include=FALSE}',
'knitr::opts_chunk$set(echo = TRUE)',
'```',
'```{r cars}',
'library(tidyverse)',
'library(easyalluvial)',
'library(parcats)',
"MyAlluvial <- alluvial_wide(data = mtcars2, max_variables = 5, fill_by = 'first_variable')",
'p <- parcats(MyAlluvial, marginal_histograms = FALSE, data = mtcars2)',
'p',
'```')
zzfil <- tempfile(fileext = ".Rmd")
writeLines(text = vector_RMD_Content, con = zzfil)
render(input = zzfil,
output_file = "C:/stackoverflow.html")
chrome_print("C:/stackoverflow.html",
output = "C:/testpdf2.pdf")
An HTML file containing your graph is generated with R Markdown. The HTML file is then printed to PDF with the chrome_print function from the pagedown package.
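If the chunk code changes often, the Rmd lines could be produced by a small helper instead of a hand-written vector. A minimal sketch, assuming a hypothetical helper named build_rmd_lines (it is not part of rmarkdown or pagedown); the chunk fence is built with strrep so the backticks do not have to be typed literally:

```r
# Build the lines of a minimal R Markdown document around one code chunk.
# build_rmd_lines is a hypothetical helper, not a package function.
build_rmd_lines <- function(chunk_code, title = "Untitled") {
  fence <- strrep("`", 3)
  c("---",
    sprintf('title: "%s"', title),
    "output: html_document",
    "---",
    "",
    paste0(fence, "{r, echo=FALSE}"),
    chunk_code,
    fence)
}

# Write the document to a temporary .Rmd file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(build_rmd_lines(c("x <- 1:10", "plot(x)")), rmd_file)

# rmarkdown::render(rmd_file, ...) followed by pagedown::chrome_print(...)
# would then produce the HTML and the PDF.
```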
I am searching for a way to create PDF files automatically using R. I saw people suggesting the RDCOMClient option, but it doesn't work on my PC.
How can I create a PDF file from an Excel sheet in R?
Input file: file.xlsx
Output file: file.pdf
Input:
Expected Output:
I tried creating the PDF with the pdf() function from a data frame, but I only managed to save the table through the grid.table() function, and it does not produce an exact PDF copy of the sheet.
pdf("file.pdf")
grid.table(df)
dev.off()
Does anyone have better solutions?
Here are two functions that you can consider :
library(RDCOMClient)
save_Excel_As_PDF <- function(path_To_Excel_File,
                              path_To_PDF_File)
{
  xlApp <- COMCreate("Excel.Application")
  xlWbk <- xlApp$Workbooks()$Open(path_To_Excel_File)
  # Type = 0 => PDF, Type = 1 => XPS
  xlWbk$ExportAsFixedFormat(Type = 0, FileName = path_To_PDF_File)
  xlWbk$Close()
  xlApp$Quit()
}
save_Excel_Sheet_As_PDF <- function(path_To_Excel_File,
                                    path_To_PDF_File,
                                    sheet_Id)
{
  xlApp <- COMCreate("Excel.Application")
  xlWbk <- xlApp$Workbooks()$Open(path_To_Excel_File)
  sheet <- xlWbk$Worksheets()$Item(sheet_Id)
  sheet$Select()
  # Type = 0 => PDF, Type = 1 => XPS
  xlWbk[["ActiveSheet"]]$ExportAsFixedFormat(Type = 0, Filename = path_To_PDF_File,
                                             IgnorePrintAreas = FALSE)
  xlWbk$Close()
  xlApp$Quit()
}
I use these functions at my job and they work very well.
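For the file.xlsx to file.pdf case in the question, the output path can be derived from the input path. A small sketch, where excel_to_pdf_path is a hypothetical helper name and the path is only an example. As far as I know, Excel resolves relative paths against its own working directory rather than R's, so passing absolute paths (for instance via normalizePath) is the safer choice.

```r
# Derive the output PDF path from an Excel path by swapping the extension.
# excel_to_pdf_path is a hypothetical helper name.
excel_to_pdf_path <- function(path_to_excel) {
  paste0(tools::file_path_sans_ext(path_to_excel), ".pdf")
}

excel_to_pdf_path("C:/reports/file.xlsx")
```

The result could then be passed as path_To_PDF_File to either function above.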
I am new to the R language. I am trying to run the Tesseract OCR function when the uploaded file is a PDF file, but it seems to always go to the else branch. I know the error is in the if condition, but I have no idea which comparison to use.
Here is part of the code:
output$table <- renderTable({
if(is.null(input$file)) {return()}
read.table(file=input$file$datapath[input$file$name==input$Select], fill = TRUE, skipNul = TRUE)
# PDF file
if (input$file$datapath[input$file$name==input$Select] == "pdf"){
pdffile <- pdftools::pdf_convert(input$file$datapath[input$file$name==input$Select], dpi = 600)
text <- tesseract::ocr(pdffile)
}
# JPEG file
else{
eng <- tesseract("eng")
text <- tesseract::ocr(input$file$datapath[input$file$name==input$Select], engine = eng)
}
})
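One likely cause is that the condition compares the whole temporary file path with "pdf", which can never be equal. A possible check, as a sketch only, is to compare the file extension instead. is_pdf_file is a hypothetical helper name; in the app it would presumably be applied to the selected input$file$name.

```r
# TRUE when the file name ends in .pdf (case-insensitive).
# is_pdf_file is a hypothetical helper name.
is_pdf_file <- function(file_name) {
  tolower(tools::file_ext(file_name)) == "pdf"
}

is_pdf_file("scan.PDF")    # TRUE
is_pdf_file("photo.jpeg")  # FALSE
```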
Using the ReporteRs package, I'm trying to add several graphs (.png/.jpg), named e.g. test-0, test-1, test-2 etc., to a pptx file. These graphs have been extracted from a PDF named e.g. test using the im.convert function. I can add them individually, but I am not able to automate the code (graph, title, slide number, date etc.) in a loop that works out how many graphs named 'test' are in a folder and then imports them into the pptx one by one, each on a new slide, producing one final pptx file.
sample code:
library(animation)
im.convert("Test.pdf", output = "Test.png", extra.opts="-density 150")
library("ReporteRs")
doc <- pptx()
doc <- pptx(template = templateDir)
doc <- addSlide( doc, slide.layout = 'Competative Landscape' )
doc <- addTitle(doc, paste("Test-0"))
doc <- addImage(doc, "Test-0.png")
:
:
:
:
doc <- addSlide( doc, slide.layout = 'Competative Landscape' )
doc <- addTitle(doc, paste("Test-3"))
doc <- addImage(doc, "Test-3.png")
You could try using the list.files function to find the number of png files with the name Test in a folder.
sample code:
# pattern must be a single regular expression, not a vector
list_of_files <- list.files(path = "C:/output_folder",
                            pattern = "^Test-[0-9]+\\.png$",
                            full.names = TRUE)
library("ReporteRs")
doc <- pptx(template = templateDir)
for(f in list_of_files)
{
  doc <- addSlide( doc, slide.layout = 'Competative Landscape' )
  # Title is the file name without folder and extension, e.g. "Test-0"
  doc <- addTitle(doc, tools::file_path_sans_ext(basename(f)))
  doc <- addImage(doc, f)
}
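One caveat if the loop iterates over the names returned by list.files: they come back in alphabetical order, so Test-10.png sorts before Test-2.png. A sketch of ordering by the numeric suffix instead, using made-up file names and assuming the Test-N.png naming from the question:

```r
# Example names in the alphabetical order list.files would return.
files <- c("Test-0.png", "Test-1.png", "Test-10.png", "Test-2.png")

# Extract the number between "Test-" and ".png", then order by it.
idx <- as.integer(sub("^Test-([0-9]+)\\.png$", "\\1", basename(files)))
files_sorted <- files[order(idx)]
files_sorted
```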
You could also try the eoffice package:
install.packages("eoffice")
fig<-infigure("figes",savegg=T)
topptx(fig,file="test.pptx")
##or
infigure("figs",showfig=T)
topptx(fig,file="test.pptx")
I have an R Shiny app which produces various reports: a word cloud, sentiment analysis and other things. I now want a single button click to download all the generated reports in one go, collected in one PowerPoint file. For instance, it should look like:
Slide 1: Word cloud
Slide 2: Sentiment Analysis
Slide 3: Report 1 ...and so on
So far, I can download each report separately: my Shiny UI has a tab for every report, and on each tab I click "Download" and the report is downloaded via downloadHandler.
I can also download all these reports in one PDF with a single click, i.e. one document containing report 1 and so on.
This is what I have so far:
#downloadReport is my action button
#on click of this button I am expecting the ppt. to be downloaded
observeEvent(input$downloadReport, {
# Create a PowerPoint document
doc = pptx( )
# Slide 1 : Title slide
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title Slide")
doc <- addTitle(doc,"Create a PowerPoint document from R software")
doc <- addSubtitle(doc, "R and ReporteRs package")
# Slide 2 : Add Word Cloud
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc, "Bar Plot")
newData=rawInputData(); # Function which captures data (.csv file) when I have input it through R shiny
words_list = strsplit(as.character(newData$CONTENT), " ") #CONTENT is the column which contains the test data
words_per_tweet = sapply(words_list, length)
pptwordcloud<-barplot(table(words_per_tweet), border=NA,main="Distribution of words per tweet", cex.main=1,col="darkcyan")
#pptwordcloud<-barplot(table(words_per_tweet), col="darkcyan")
doc <- addPlot(doc, fun= print, x = pptwordcloud,vector.graphic =FALSE )
writeDoc(doc,'file1.pptx')
})
The pptx file is generated, but the bar plot is missing from it when I use vector.graphic = FALSE. If I remove that option, I get this error:
Warning: Unhandled error in observer:
javax.xml.bind.UnmarshalException
- with linked exception: [org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.]
observeEvent(input$downloadReport)
Can somebody point out my error?
Let's try to reproduce it =)
1) I don't have your data, so I use iris, with a selectInput to choose the second column for the table.
UI
library(shiny)
shinyUI(
# Use a fluid Bootstrap layout
fluidPage(
selectInput("sel",label = "col",choices = colnames(iris)[2:ncol(iris)]),
downloadButton('downloadData', 'Download')
)
)
Server
library(shiny)
library(DT)
library(ReporteRs)
shinyServer(function(input, output,session) {
output$downloadData <- downloadHandler(
filename = "file.pptx",
content = function(file) {
doc = pptx( )
# Slide 1 : Title slide
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title Slide")
doc <- addTitle(doc,"Create a PowerPoint document from R software")
doc <- addSubtitle(doc, "R and ReporteRs package")
# Slide 2 : Add Word Cloud
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc, "Bar Plot")
#newData=rawInputData(); # Function which captures data (.csv file) when I have input it through R shiny
#words_list = strsplit(as.character(newData$CONTENT), " ") #CONTENT is the column which contains the test data
#words_per_tweet = sapply(words_list, length)
words_per_tweet=iris
pptwordcloud<-function(){
barplot(table(words_per_tweet[,c("Sepal.Length",input$sel)]), border=NA,main="Distribution of words per tweet", cex.main=1,col="darkcyan")
}#pptwordcloud<-barplot(table(words_per_tweet), col="darkcyan")
doc <- addPlot(doc, fun= pptwordcloud,vector.graphic =FALSE )
writeDoc(doc,file)
}
)
})
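The key difference from the question's code is that addPlot is given a function that draws the plot, not the value returned by barplot. A base R sketch of that distinction, with made-up counts and no ReporteRs involved (pdf(NULL) is only there so the sketch runs without opening a window):

```r
counts <- table(c(3, 5, 5, 7, 7, 7))

pdf(NULL)                 # throwaway device, no file is created
mids <- barplot(counts)   # draws immediately and returns the bar midpoints
dev.off()

is.function(mids)         # FALSE: just a numeric vector, nothing to redraw

# Wrapping the call in a function defers drawing until it is called,
# which is what a `fun` argument needs.
draw_bars <- function() barplot(counts)
is.function(draw_bars)    # TRUE
```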