I have been trying to do OCR within R (reading PDF data that is stored as scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This is a very good post.
Effectively 3 steps:
convert pdf to ppm (an image format)
convert ppm to tif ready for tesseract (using ImageMagick for convert)
convert tif to text file
The code for the above 3 steps, as per the linked post:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
The first two steps work fine (although they take a good amount of time for a 4-page PDF; I will look into scalability later, once I know whether this works at all).
While running the 3rd step, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Or Tesseract crashes.
Any workaround or root cause analysis would be appreciated.
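One thing that may be worth ruling out is how the command line is quoted. This is only a sketch of an alternative way to build the call, not a diagnosed root cause: it quotes each path separately and passes the arguments as a vector to system2(). The helper name tesseract_call() and the example file names are hypothetical; the executable path is the one from the question.

```r
# Build the tesseract command and its arguments separately, quoting each path
# on its own instead of quoting the whole command line.
tesseract_call <- function(tif, out_base, lang = "eng",
                           exe = "F:/Tesseract-OCR/tesseract.exe") {
  list(command = exe,
       args = c(shQuote(tif), shQuote(out_base), "-l", lang))
}

call <- tesseract_call("report.pdf.tif", "report.pdf")
# Run it where tesseract is installed:
# system2(call$command, call$args)
```

Keeping the arguments in a vector also makes it easy to print the exact command before running it.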
Using the tesseract package, I created a sample script that works. It even works for scanned PDFs.
library(tesseract)
library(pdftools)
# Render pdf to tiff images (one file per page)
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from the tiff images
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")
I'm new to R and programming, so correct me if it's wrong. Hope this helps.
The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.
Following the procedure used in the help documentation of the tesseract package, your function would look something like this:
library(pdftools)
library(tesseract)
lapply(myfiles, function(i){
# convert pdf to tiff and perform tesseract OCR on the image
# render the first page of the PDF as a numeric bitmap
# (numeric = TRUE because tiff::writeTIFF does not accept raw bitmaps)
bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
# perform OCR on the .tiff file
out <- ocr(paste0(i, ".tiff"))
# delete tiff file
file.remove(paste0(i, ".tiff" ))
# return the recognised text
out
})
I want to read 12 images at a time in R.
I don't know how to do it. I am completely new to working with images in R.
How can I read several images from a folder on my system?
I am using Windows 10 with 8 GB of RAM, a Core i5 processor,
and an Intel(R) HD Graphics 620 GPU.
I am able to read only a single image in R, and that image is displayed as numeric values. I tried to convert it to raster format and print it in order to view the image, but I still get the colour codes as values rather than the image itself.
Can anyone help me on this?
Thanks a lot.
install.packages("magick")
library(magick)
install.packages("rsvg")
install.packages("jpeg")
library(jpeg)
img <- readJPEG("C:/Users/folder/Abc.jpg", native = FALSE)
img1 <- as.raster(img, interpolate = F)
print(img1)
I want to read several images at a time into the R console and view or print them.
The suggested duplicate gives you the basics for how to read in a number of files at once, but there are a few potential gotchas, and it won't help you with displaying the images.
This first bit is purely to set up the example
library(jpeg)
library(grid)
# Create a new directory and move to it
tdir <- "jpgtest"
dir.create(tdir)
setwd(tdir)
# Copy the package:jpeg test image twice, once as .jpg and once as .jpeg
# to the present working directory
file.copy(system.file("img", "Rlogo.jpg", package="jpeg"),
to=c(file.path(getwd(), "test.jpg"), file.path(getwd(), "test.jpeg")))
Then we can list the files, either using a regex match, or choose them interactively, then read and store the images in a list.
# Matches any file ending in .jpg or .jpeg
(flist <- list.files(pattern="\\.jpe?g$"))
# Alternatively, select a file interactively
# (file.choose() returns a single file and would overwrite the list above)
# flist <- file.choose()
jpglist <- lapply(flist, readJPEG)
To display the images I tend to use grid, but there are a number of alternatives.
grid.raster(jpglist[[1]], interpolate=FALSE)
Finally, remove the temporary directory:
setwd("..")
unlink(tdir, recursive = TRUE)
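To view several images at once rather than only the first, the grid.raster() call can be combined with a grid layout. This is a minimal sketch, assuming the images are already held in a list such as jpglist; show_images() is a hypothetical helper, not part of grid or jpeg.

```r
library(grid)

# Draw a list of images (matrices or arrays with values in [0, 1])
# in a grid with `ncol` columns, filling row by row.
show_images <- function(imgs, ncol = 4) {
  n_rows <- ceiling(length(imgs) / ncol)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(n_rows, ncol)))
  for (k in seq_along(imgs)) {
    row <- (k - 1) %/% ncol + 1
    col <- (k - 1) %% ncol + 1
    pushViewport(viewport(layout.pos.row = row, layout.pos.col = col))
    grid.raster(imgs[[k]], interpolate = FALSE)
    popViewport()
  }
  popViewport()
  invisible(c(rows = n_rows, cols = ncol))
}

# Usage, with the list built earlier:
# show_images(jpglist)
```

With 12 images and the default of 4 columns, this gives a 3 x 4 grid on one page.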
I'm trying to write a script to read a series of pdfs, OCR them using the tesseract package, and then do things with the text I can extract.
So far, I'm at the following:
ReportDensity <- list()
AllReports <- list.files(path = "path",pattern = "*.PDF",full.names=TRUE)
Then I needed the number of pages in each PDF so that I can read the image data:
for (i in seq(AllReports))
ReportDensity[[i]] <- pdf_info(AllReports[[i]])
ReportDensity <- lapply(ReportDensity, `[[`, 2)
Now, what I want to do is to load each page of each PDF as a separate image so that I can OCR it.
for (i in seq(AllReports))
for (j in 1:ReportDensity[[i]])
(assign(paste0("Report_",i,"_Page_",j),image_read_pdf(AllReports[[i]],pages = ReportDensity[j])))
The error message I receive is:
"Error in poppler_render_page(loadfile(pdf), page, dpi, opw, upw, antialiasing, :
Invalid page."
which I believe to be because I wrote the loop incorrectly. I have tested the code by manually putting in image/page numbers, and it loads correctly.
I'm hoping that the end result would be a series of image files of the form "Report_ReportNumber_PageNumber" that I could then process.
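A likely cause is the pages argument: ReportDensity[j] is the page count of report j (wrapped in a list), not page number j, so it can point past the end of report i. A minimal sketch of the loop with the page number passed directly, collecting the results in a named list instead of using assign(); read_report_pages() and its render parameter are hypothetical names, and render defaults to the image_read_pdf() call from the question.

```r
# Read every page of every report into one named list.
# `render` is a parameter only so the loop can be tried without a PDF at hand.
read_report_pages <- function(files, page_counts,
                              render = function(f, p) magick::image_read_pdf(f, pages = p)) {
  pages <- list()
  for (i in seq_along(files)) {
    for (j in seq_len(page_counts[[i]])) {
      # pass the page number j itself, not an element of page_counts
      pages[[paste0("Report_", i, "_Page_", j)]] <- render(files[[i]], j)
    }
  }
  pages
}

# Dry run with a stand-in renderer, to check the names that are produced
demo <- read_report_pages(c("a.pdf", "b.pdf"), list(2, 1),
                          render = function(f, p) paste(f, p))
names(demo)
```

With the real data the call would be read_report_pages(AllReports, ReportDensity), and each element can then be passed on to the OCR step.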
PDFs mostly contain text.
I usually extract text from PDFs page by page using Python's pdf2txt, run on the shell through a call to system():
i <- 1  # page number
system(paste("pdf2txt -p", i, "-o text.txt pdffile.pdf"))
You can then grep the text from each page. The -o flag can also output HTML or XML, which you can scrape with library(rvest).
pdfimages extracts the images contained in PDFs, and you can OCR those:
system(paste("pdfimages -f", i, "-l", i, "-p -png pdffile.pdf imagefile"))
That may output a lot of PNGs from a single page; they come out numbered:
system(paste0("tesseract imagefile-",i,"-006.png out6"))
Tesseract has several parameters that you may need to tune before getting a decent result.
For a small project I am trying to read some data from scanned PDF files that do not contain the data as text.
Following the instructions of the Tesseract package, the code below should work.
Unfortunately it triggers an error.
Error in tiff::writeTIFF(bitmap, "page.tiff") :
INTEGER() can only be applied to a 'integer', not a 'raw'
Any clue on how this can be resolved?
library(pdftools)
library(tiff)
library(tesseract)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
cat(out)
Perhaps using pdf_convert() instead of pdf_render_page(), i.e.:
library(pdftools)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
pdf_convert(news, format = "tiff")
This generates multiple TIFF files in the directory (one per page), so you should add code that reads and processes each of them in turn.
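A minimal sketch of that step, assuming pdftools and tesseract are installed. ocr_scanned_pdf() is a hypothetical helper; the convert and recognise parameters default to pdftools::pdf_convert() and tesseract::ocr() and exist only so the function can be tried with stand-ins.

```r
# Convert each page of a PDF to TIFF, OCR every page, then remove the TIFFs.
ocr_scanned_pdf <- function(pdf_file, dpi = 300,
                            convert = pdftools::pdf_convert,
                            recognise = tesseract::ocr) {
  # pdf_convert() returns the names of the image files it wrote
  tiffs <- convert(pdf_file, format = "tiff", dpi = dpi)
  on.exit(unlink(tiffs), add = TRUE)
  # one character string of recognised text per page
  vapply(tiffs, recognise, character(1), USE.NAMES = FALSE)
}

# Usage, with the file from the example above:
# pages <- ocr_scanned_pdf(news)
# cat(pages[1])
```

The result is a character vector with one element per page, in page order.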
I have ~10,000 png images saved neatly in different files on my PC. I want to write a function that does something like go to a particular folder and iteratively copy-pastes all the png files in that folder to a word document. Is this possible in R?
I've looked at package R2wd but it sadly only has a function that takes RData and outputs its plot to a word document (function wdPlot).
I also have the RData saved for each and every plot, so reason would dictate that I should be able to simply load the RData associated with a particular plot and then use wdPlot. The problem is that when I generated my PNGs the plots were grobs, and I did something like the following:
png("rp.png",width=w,height=h)
plot(rp)
#Increase size of title
grid.edit(gridTitle_Ref, gp=gpar(fontsize=20))
#Other grid.edit alterations
dev.off()
save(rp, file="rp.Rdata")
Now, when I try to get that rp into a Word document by first loading it into R, I naively do the following, and it does not output a plot to MS Word with the title enlarged or any of the other grid.edit alterations.
load("rp.Rdata")
png("rp.png",width=w,height=h)
wdPlot(rp)
#Increase size of title
grid.edit(gridTitle_Ref, gp=gpar(fontsize=20))
#Other grid.edit alterations
dev.off()
So, to reiterate: I have all these png files. At various times I have to copy-paste a subset of them into a word document. I'm too lazy to do that manually each time and want a program to do it for me.
EDIT 1
So, as per suggestions below, I've read up on Markdown. Following this post How to set size for local image using knitr for markdown?
I wrote something along the lines of:
```{r,echo=FALSE,fig.width=100, fig.height=100}
# Generate word documents of reports
# Clear all
rm(list=ls())
library(png)
library(grid)
library(knitr)
dir<-"location\of\file"
setwd(dir)
# Output only directories:
folders<-dir()[file.info(dir())$isdir]
for(folder in folders){
currentDir<-paste(dir,folder,"\\",sep="")
setwd(currentDir)
#All files in current folder
files<-list.files()
imgs<-[A list of all the png images in this particular file that I want in the word document - the png names]
for(img in imgs){
imgRaster<-readPNG(img)
grid.raster(imgRaster)
}
}
```
The following is a screenshot of what's in the resulting word document. How might I fix this? I want the images to appear one after the other in the document as the for loop above runs.
Do note that this is the first time I've ever used Markdown so any relevant tutorials linked in the comments could also be of great help.
EDIT 2
I followed the second answer's example below. Here is the output that I obtained
As you can see there are no images, only the html tags. How do I fix this?
If you have the png's saved you can just use a little html and a for loop to save them to a .doc file.
edit 2 for windows
# Start empty word doc
cat("<body>", file="exOut.doc", sep="\n")
# select all png files in working directory
for(i in list.files(pattern="\\.png$"))
{
temp <- paste('<img src=', i, '>')
cat(temp, file="exOut.doc", sep="\n", append=TRUE)
}
cat("</body>", file="exOut.doc", sep="\n", append=TRUE)
# Some example plots
for(i in 1:5)
{
png(paste0("ex", i, ".png"))
plot(1:5)
title(paste("plot", i))
dev.off()
}
# Start empty word doc
cat(file="exOut.doc")
# select all png files in working directory
for(i in list.files(pattern="\\.png$"))
{
temp <- paste('<img src=', i, '>')
cat(temp, file="exOut.doc", sep="\n", append=TRUE)
}
You will then need to embed the figures, either using the drop down menus or by writing a small macro that you can call with system
EDIT : small update to show explicit paths to output and figures
cat("<body>", file="/home/daff/Desktop/exOut.doc", sep="\n")
for(i in list.files(pattern="\\.png$"))
{
temp <- paste0('<img src=/home/daff/', i, '>')
cat(temp, file="/home/daff/Desktop/exOut.doc", sep="\n", append=TRUE)
}
Note that I used paste0 to remove the space between the path /home/daff/ and ex*.png.
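The loops above can be wrapped into a small helper. This is a sketch only, assuming Word opens the file as HTML; write_image_doc() is a hypothetical name. It quotes the src attribute and uses absolute paths, so file names containing spaces are less likely to break the tag, and it wraps everything in html and body tags.

```r
# Write a minimal HTML file (saved with a .doc extension) that references
# each PNG by its absolute path, one image per paragraph.
write_image_doc <- function(png_files, out_file) {
  paths <- normalizePath(png_files, winslash = "/", mustWork = FALSE)
  html <- c("<html><body>",
            sprintf('<p><img src="%s"></p>', paths),
            "</body></html>")
  writeLines(html, out_file)
  invisible(html)
}

# Usage, for all PNG files in the working directory:
# write_image_doc(list.files(pattern = "\\.png$"), "exOut.doc")
```

The images are still only linked, not embedded, so the embedding step described above remains necessary if the document is to be moved or shared.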
Have you tried RStudio and R Markdown? You could put your code into chunks that load the files and then save the result as a Word document. http://rmarkdown.rstudio.com/word_document_format.html
I'm using knitr for my analysis. I can save graphs in PDF format with \SweaveOpts{dev=pdf} and in PNG format with \SweaveOpts{dev=png}. I would like to save graphs in both PDF and PNG format in one run, but use the PDF versions in the final document.
How can I do this?
Here comes the real solution:
Knitr 0.3.9 starts to support multiple devices per chunk (for now, you have to install from GitHub); in your case, you can set the chunk option dev=c('pdf', 'png') to get both PDF and PNG files.
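With a knitr version that supports multiple devices, the option can also be set once for the whole document instead of per chunk. A minimal sketch; as far as I know, the first device listed is the one whose figures are included in the output document, which is why 'pdf' comes first here.

```r
library(knitr)

# Produce both a PDF and a PNG file for every figure in the document
opts_chunk$set(dev = c("pdf", "png"))
```

This would go in a setup chunk near the top of the document.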
Here is a solution that uses ImageMagick to convert PDF files to PNG. Of course you have to install ImageMagick first, and make sure its bin directory is in PATH:
knit_hooks$set(convert = function(before, options, envir) {
# quit if before a chunk or no figures in this chunk
if (before || (n <- options$fig.num) == 0L) return()
# only convert pdf files
if (options$fig.ext != 'pdf') return()
# use ImageMagick to convert all pdf to png
name = fig_path() # figure filename
owd = setwd(dirname(name)); on.exit(setwd(owd))
files = paste(basename(name), if (n == 1L) '' else seq(n), sep = '')
lapply(files, function(f) {
system(sprintf('convert %s.pdf %s.png', f, f))
})
NULL
})
Basically, this hook is executed after a chunk and runs convert foo.pdf foo.png on all PDF figures. You can use it like this:
<<test-png, convert=TRUE>>=
plot(1); plot(2)
@
Or if you put all figures in a separate directory, you can run convert directly in that directory (i.e. do not have to call system() in R).
This is not an ideal solution but should work. To make use of R's native png() device, you need to answer my question in the above comment first.