OCR on PDF with Tesseract in R, writing TIFF - error

For a small project I am trying to read some data from scanned PDF files that do not contain the data as extractable text.
Following the instructions of the Tesseract package, the code below should work.
Unfortunately it triggers an error.
Error in tiff::writeTIFF(bitmap, "page.tiff") :
INTEGER() can only be applied to a 'integer', not a 'raw'
Any clue on how this can be resolved?
library(pdftools)
library(tiff)
library(tesseract)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
cat(out)

Perhaps using pdf_convert() instead of pdf_render_page(), i.e.:
library(pdftools)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
pdf_convert(news, format = "tiff")
This generates one TIFF per page in the directory, so you should add code that reads and processes all of them one by one.
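Processing the generated TIFFs one by one could look like the sketch below. It assumes that pdf_convert() returns the names of the files it wrote and that tesseract::ocr() returns one string per image; ocr_files and its ocr_fun argument are hypothetical names introduced here for illustration.

```r
# Run an OCR function over a vector of image files, one string per file.
# `ocr_fun` defaults to tesseract::ocr; the default is only evaluated if it is used.
ocr_files <- function(files, ocr_fun = tesseract::ocr) {
  vapply(files, function(f) ocr_fun(f), character(1), USE.NAMES = FALSE)
}

# Usage (assumes pdftools and tesseract are installed and `news` is defined as above):
# tiffs <- pdftools::pdf_convert(news, format = "tiff", dpi = 300)
# pages <- ocr_files(tiffs)
# cat(pages[1])
```

Passing the OCR function as an argument keeps the loop itself easy to try out without the OCR engine installed.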

Related

Save cropped pdf image from R imagemagick to a new pdf

I'm using R with imagemagick to crop some borders from a pdf file. I'm executing the following commands:
library(magick)
pdf_total <- image_read_pdf(path = "file1.pdf")
pdf_cropped <- image_crop(pdf_total,"3000x1500")
After this process I have a perfectly cropped file, but my problem occurs when I try to save the result to a new pdf file. What is the correct procedure to save this converted pdf?

My final solution is:
library(magick)
pdf_total <- image_read_pdf(path = "file1.pdf")
pdf_cropped <- image_crop(pdf_total,"3000x1500")
# open a PDF graphics device first; the output file name is a placeholder
pdf("file1_cropped.pdf")
for(i in seq_along(pdf_cropped)){
  plot(pdf_cropped[i])
}
dev.off()
In this case I used a for loop to save all the pages. If you pass plot(pdf_cropped) instead, the result is a pdf with a single page (the first picture).
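The loop only ends up in a PDF file if a PDF graphics device is open while plot() runs. A minimal sketch of that pattern, assuming base R's pdf() device, with ordinary base plots standing in for the cropped magick pages (the file name pages.pdf is a placeholder):

```r
# Hypothetical output path; a temporary file keeps the sketch self-contained.
out_file <- file.path(tempdir(), "pages.pdf")

pdf(out_file)                 # open the PDF device
for (i in 1:3) {
  # each high-level plot() call starts a new page in the open device
  plot(seq_len(10), main = paste("page", i))
}
dev.off()                     # close the device so the file is finalised

file.exists(out_file)
```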

Convert each page of a multi-paged pdf into separate png files in R

I have seen a few questions asked involving trying to convert a pdf into a png but none of the answers show how to save each page of a multi-paged pdf as a different png file.
Starting out with an example 13-page pdf:
# example pdf
example_pdf <- "https://arxiv.org/ftp/arxiv/papers/1312/1312.2789.pdf"
How can I save each page of the pdf as a different png file?

We can create a png of each page using the image_read_pdf function from the magick package:
# install the magick package (only needed once)
install.packages("magick")
library("magick")
# creating a magick-image object with one frame for each page of the pdf
pages <- magick::image_read_pdf(example_pdf)
pages
# saving each page of the pdf as a png, however many pages there are
for (i in seq_along(pages)){
  image_write(pages[i], path = paste0("image",i,".png"), format = "png")
}
This would save each page as "image(page number).png" in your working directory.

create pdf in addition to word docx using officer

I am using officer (used to use reporters) within a loop to create 150 unique documents. I need these documents however to be exported from R as word docx AND pdfs.
Is there a way to export the document created with officer to a pdf?

That's possible, but the solution I have depends on LibreOffice. Here is the code I am using; I hope it helps. I've hard-coded the LibreOffice path, so you will probably have to adapt or improve the code for the variable cmd_.
The code is transforming a PPTX or DOCX file to PDF.
library(pdftools)
office_shot <- function( file, wd = getwd() ){
cmd_ <- sprintf(
"/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
wd, file )
system(cmd_)
pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
pdf_file
}
office_shot(file = "your_presentation.pptx")
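If file names or folders contain spaces, the command string above will break. A hedged variant is sketched below, assuming soffice is reachable under the name or path you pass in; soffice_args and office_to_pdf are hypothetical helper names, and the flags are the same ones used in the answer.

```r
# Arguments for one LibreOffice conversion; paths are quoted so spaces survive.
soffice_args <- function(file, outdir = getwd()) {
  c("--headless", "--convert-to", "pdf",
    "--outdir", shQuote(outdir), shQuote(file))
}

# Convert one DOCX/PPTX file and return the expected PDF path.
# `soffice` is an assumption: the executable's name or full path on your machine.
office_to_pdf <- function(file, outdir = getwd(), soffice = "soffice") {
  status <- system2(soffice, soffice_args(file, outdir))
  if (status != 0) warning("LibreOffice returned status ", status)
  file.path(outdir, sub("\\.(docx|pptx)$", ".pdf", basename(file)))
}

# Usage over several files (hypothetical names):
# pdfs <- vapply(c("report 1.docx", "report 2.docx"), office_to_pdf, character(1))
```

Calling the function once per file fits the loop over 150 documents described in the question.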

I've been using RDCOMClient to convert my OfficeR created docx's to PDFs.
library(RDCOMClient)
file <- "C:/path/to your/doc.docx"
wordApp <- COMCreate("Word.Application") #creates COM object
wordApp[["Documents"]]$Open(Filename=file) #opens your docx in wordApp
wordApp[["ActiveDocument"]]$SaveAs("C:/path/to your/doc.pdf", FileFormat=17) #saves as PDF
wordApp$Quit() #quits the COM Word application
I found the FileFormat=17 bit here https://learn.microsoft.com/en-us/office/vba/api/word.wdexportformat
I've been able to put the above in a loop to convert multiple docx's to PDFs quickly, too.
Hope this helps!

There is a way to convert your docx into a pdf: the function convert_to_pdf from the docxtractr package.
Note that this function uses LibreOffice to convert docx to pdf, so you have to install LibreOffice first and supply the path to soffice.exe. Read more about paths for different OS here.
Here is a simple example of how to convert several docx documents into pdf on a Windows machine. I have Windows 10 and LibreOffice 6.4 installed. Imagine that you have X Word documents stored in the data folder and you want to create the same number of PDFs in the data/pdf folder (you have to create the pdf folder beforehand).
library(dplyr)
library(purrr)
library(docxtractr)
# Point docxtractr to the LibreOffice executable first
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
# 1) List of Word documents (the pattern is a regular expression, not a glob)
words <- list.files("data",
                    pattern = "\\.docx$",
                    full.names = TRUE)
# 2) Custom function
word2pdf <- function(path){
  # Extract the file name without its directory and extension
  name <- tools::file_path_sans_ext(basename(path))
  convert_to_pdf(path,
                 pdf_file = paste0("data/pdf/",
                                   name,
                                   ".pdf"))
}
# 3) Convert
words %>%
  map(~word2pdf(.x))

Doing OCR with R

I have been trying to do OCR within R (reading PDF data that is stored as scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This is a very good post.
Effectively 3 steps:
convert pdf to ppm (an image format)
convert ppm to tif ready for tesseract (using ImageMagick for convert)
convert tif to text file
The effective code for the above 3 steps as per the link post:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
The first two steps work fine, although they take a good amount of time for the 4 pages of a PDF; I will look into scalability later, once I know whether this works at all.
While running the 3rd step, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Or Tesseract is crashing.
Any workaround or root cause analysis would be appreciated.

Using the tesseract package, I created a sample script that works. It also works for scanned PDFs.
library(tesseract)
library(pdftools)
# Render pdf to tiff image
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from tiff image
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")
I'm new to R and programming, so correct me if something is wrong. Hope this helps you.

The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.
Taking the procedure as used in the help documentation of the tesseract package your function would look something like this:
lapply(myfiles, function(i){
  # render the first page of the PDF to a TIFF file and run tesseract OCR on it
  # pdf_convert() writes the image file itself, so the raw bitmap from
  # pdf_render_page() never has to go through tiff::writeTIFF()
  tiff_file <- pdftools::pdf_convert(i, format = "tiff", pages = 1, dpi = 300,
                                     filenames = paste0(i, ".tiff"))
  # perform OCR on the .tiff file
  out <- tesseract::ocr(tiff_file)
  # delete tiff file
  file.remove(tiff_file)
  # return the recognised text
  out
})

How can one copy-paste local png files to a word document using R?

I have ~10,000 png images saved neatly in different folders on my PC. I want to write a function that goes to a particular folder and iteratively copy-pastes all the png files in that folder into a Word document. Is this possible in R?
I've looked at package R2wd but it sadly only has a function that takes RData and outputs its plot to a word document (function wdPlot).
I also have the RData saved for each and every plot, so reason would dictate that I should be able to simply load the RData associated with a particular plot and then use wdPlot . The problem is that when I generated my png's the plots were grobs and I did something as follows:
png("rp.png",width=w,height=h)
plot(rp)
#Increase size of title
grid.edit(gridTitle_Ref, gp=gpar(fontsize=20))
#Other grid.edit alterations
dev.off()
save(rp, file="rp.Rdata")
Now, when I try to get that rp into a Word document by first loading it into R, I naively do the following, and it does not output a plot to MS Word with the title enlarged or with any of the other grid.edit alterations.
load("rp.Rdata")
png("rp.png",width=w,height=h)
wdPlot(rp)
#Increase size of title
grid.edit(gridTitle_Ref, gp=gpar(fontsize=20))
#Other grid.edit alterations
dev.off()
So, to reiterate: I have all these png files. At various times I have to copy-paste a subset of them into a word document. I'm too lazy to do that manually each time and want a program to do it for me.
EDIT 1
So, as per suggestions below, I've read up on Markdown. Following this post How to set size for local image using knitr for markdown?
I wrote something along the lines of:
```{r,echo=FALSE,fig.width=100, fig.height=100}
# Generate word documents of reports
# Clear all
rm(list=ls())
library(png)
library(grid)
library(knitr)
dir<-"location\of\file"
setwd(dir)
# Output only directories:
folders<-dir()[file.info(dir())$isdir]
for(folder in folders){
currentDir<-paste(dir,folder,"\\",sep="")
setwd(currentDir)
#All files in current folder
files<-list.files()
imgs<-[A list of all the png images in this particular file that I want in the word document - the png names]
for(img in imgs){
imgRaster<-readPNG(img)
grid.raster(imgRaster)
}
}
```
The following is a screenshot of what's in the resulting word document. How might I fix this? I want the images to appear one after the other in the document as the for loop above runs.
Do note that this is the first time I've ever used Markdown so any relevant tutorials linked in the comments could also be of great help.
EDIT 2
I followed the second answer's example below. Here is the output that I obtained
As you can see there are no images, only the html tags. How do I fix this?

If you have the png's saved you can just use a little html and a for loop to save them to a .doc file.
edit 2 for windows
# Start empty word doc
cat("<body>", file="exOut.doc", sep="\n")
# select all png files in working directory
for(i in list.files(pattern="*.png"))
{
temp <- paste('<img src=', i, '>')
cat(temp, file="exOut.doc", sep="\n", append=TRUE)
}
cat("</body>", file="exOut.doc", sep="\n", append=TRUE)
# Some example plots
for(i in 1:5)
{
png(paste0("ex", i, ".png"))
plot(1:5)
title(paste("plot", i))
dev.off()
}
# Start empty word doc
cat(file="exOut.doc")
# select all png files in working directory
for(i in list.files(pattern="*.png"))
{
temp <- paste('<img src=', i, '>')
cat(temp, file="exOut.doc", sep="\n", append=TRUE)
}
You will then need to embed the figures, either using the drop-down menus or by writing a small macro that you can call with system.
EDIT : small update to show explicit paths to output and figures
cat("<body>", file="/home/daff/Desktop/exOut.doc", sep="\n")
for(i in list.files(pattern="*.png"))
{
temp <- paste0('<img src=/home/daff/', i, '>')
cat(temp, file="/home/daff/Desktop/exOut.doc", sep="\n", append=TRUE)
}
Note that I used paste0 to remove the space between the path /home/daff/ and ex*.png.
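The same idea can be wrapped in a small function that takes the image folder and the output file as arguments. This is only a sketch under the assumption that Word resolves absolute paths in the src attribute, as in the example above; write_img_doc is a hypothetical name.

```r
# Write an HTML file with one <img> tag per PNG found in `img_dir`.
# The file gets whatever name is passed as `out_file`, e.g. "exOut.doc".
write_img_doc <- function(img_dir, out_file) {
  # regular expression that matches only names ending in .png
  pngs <- list.files(img_dir, pattern = "\\.png$", full.names = TRUE)
  # absolute paths with forward slashes, so the tags do not depend on the working directory
  pngs <- normalizePath(pngs, winslash = "/")
  tags <- sprintf('<img src="%s">', pngs)
  writeLines(c("<html><body>", tags, "</body></html>"), out_file)
  invisible(pngs)
}

# Usage (hypothetical paths):
# write_img_doc("C:/plots/run1", "C:/plots/run1/exOut.doc")
```

Quoting the src value also keeps paths with spaces intact.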

Have you tried RStudio and R Markdown? You could put your code into chunks that load the files and then save the result as a Word document. http://rmarkdown.rstudio.com/word_document_format.html
