Is there an industry standard output format for OCR?

Is there an industry standard output format for OCR? I can't seem to find anything that is defined as an industry standard, and I'm not very experienced with OCR, so I wouldn't recognize a standard if there were one.

hOCR is an open standard that defines a data format for representing OCR output.
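For illustration only (the file name page.hocr is made up): hOCR is just HTML with classes such as ocrx_word and coordinates in the title attribute, so it can be read with ordinary HTML tooling, for example in R with the XML package.
# Minimal sketch, assuming a tesseract-style hOCR file called "page.hocr":
library(XML)
doc   <- htmlParse("page.hocr")
words <- xpathSApply(doc, "//span[@class='ocrx_word']", xmlValue)
boxes <- xpathSApply(doc, "//span[@class='ocrx_word']", xmlGetAttr, "title")  # "bbox x0 y0 x1 y1 ..."
head(cbind(word = words, box = boxes))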

There is no single such format, but there are commonly used practices and open-standard formats that will satisfy your requirements. This question is a bit like asking "what is the standard result of cooking potatoes": mashed potatoes, french fries, or baked. (Not sure where that example came from; I must be getting hungry...)
Also, an "industry standard" will depend on the specific industry. If you are in a specific vertical, then some formats will be more common (almost standard) than others. For example:
Medical - HL7 formatted text
Libraries - ALTO XML
Legal/eDiscovery - PDF Text Under Image
Integration/Automation - XML
In general, it would not be wrong to say that the most commonly used and industry-accepted formats are TXT, XML, and PDF (in several flavors). Each has unique properties and specific uses, and each can be consumed widely by other technologies because it is based on open standards.
It is better to approach this from the opposite end: thinking through the business requirements, i.e. what will happen with the data and where it needs to be absorbed, should define exactly which hand-off format you want from the OCR output.

XIEO (http://xieo.info) uses a (Maya Software) proprietary format called CML (Clix Markup Language) that efficiently encodes page, zone, line, text box, and related information. VisualText/NLP++ (available at http://www.textanalysis.com) has a special tokenizer pass to "inhale" that format and produce a ready-made parse tree. NLP++ analyzers can then build on that initial parse tree.
This workflow has been used for more than 5 years at XIEO, primarily for processing Official Records documents (deeds, mortgages, clerk of court, etc.) and extracting information from them.
In this workflow, one can clean up the OCRed text, re-zone to fix OCR errors and mis-zoning, and extract the pertinent information from the text.
Amnon Meyers, CTO, Text Analysis International, Inc amnon.meyers#textanalysis.com

Related

Extract text from a PDF (English text only) of Canadian legislation in R

I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act) and import it into R. I want to break it up into two parts: first, the table of contents (pic 1); second, the information in the Act itself (pic 2). But I do not want the French part (je suis désolé). I have tried using tabulizer's extract_area(), but I don't want to have to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).
Obviously I don't have a minimal reproducible example coded out... But the pdf is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do using either pdftools or tabulizer, I'd prefer the answer using one of those libraries (mostly for learning purposes).
I've seen some similar questions on Stack Overflow, but they're all confusingly written or designed for tables, which this is not. I am not a quant/data-science researcher by training, so an explanation would be super helpful (but not required).
Here's an option that reads in the PDF text and detects the language. You're probably going to have to do a lot of text cleanup after reading in the PDF; I assume you don't care about retaining formatting.
library(pdftools)
a = pdf_text('F-27.pdf')
# Split the text to get sentence chunks, mostly.
b = sapply(a, strsplit, '\r\n')
# Do a bunch of other text cleanup; here's an example using the third list element.
# You can expand this to cover all of b with a loop or a list function like sapply.
# Splitting on two spaces should hopefully retain most sentence-like fragments; you can get more sophisticated:
d = strsplit(b[[3]], '  ')[[1]]
library(cld3)  # language-detection tool to tell French from English
x = sapply(d, detect_language)
# Keep only English
x[x == 'en']
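If you want to go beyond the single page shown above, here is a hedged sketch (not part of the original answer; keep_english is an invented helper) that applies the same double-space split and language filter to every element of b:
# Sketch: apply the same split and language filter to all pages.
keep_english <- function(page_lines) {
  frags <- unlist(strsplit(page_lines, '  '))
  frags <- trimws(frags[nzchar(trimws(frags))])   # drop empty fragments
  frags[sapply(frags, cld3::detect_language) %in% 'en']
}
english_text <- unlist(lapply(b, keep_english))
head(english_text)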

Postscript parser - add hyperlinks to text

I need to take a list of questions in a PDF and hyperlink each question to its answer.
I have currently converted the PDF file to PostScript. However, PostScript is a very complicated language in which to programmatically hyperlink each question of the form Question #i: to a link like example.com/answers/i/. How can I accomplish this?
PostScript isn't merely complicated, it's a complete programming language. This means that the way your text is expressed in the program is entirely arbitrary.
Assuming you are using the same conversion process each time, you can probably assume that it's deterministic in its behaviour (i.e. it converts the same input to the same output every time), in which case you can probably look for the result in the output.
But basically, you're on your own here, there isn't some magic solution I can give you.
I'd suggest that you're doing it wrong anyway. PostScript isn't PDF, and it doesn't have any concept of a hyperlink. So this suggests to me that you intend to use a pdfmark extension operator, and then pass the resulting PostScript back through a Distiller-like application in order to get a PDF back out again.
Converting to PostScript and back to PDF really just confuses the issue. Assuming that the PDF is a form (again, by implication from the question and answer format) you can extract the form field readily enough from the PDF file directly. Then you can replace it with a /Link annotation.
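For reference, and assuming the example.com/answers/i scheme from the question, a URI-action /Link annotation in a page's /Annots array looks roughly like this (the rectangle coordinates are placeholders, not taken from any real file):
<< /Type /Annot
   /Subtype /Link
   /Rect [72 680 300 700]                       % placeholder rectangle over "Question #1:"
   /Border [0 0 0]
   /A << /S /URI /URI (http://example.com/answers/1/) >>
>>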
In short, don't do this by going to PostScript and back, do it all in PDF.
If there's a reason why you can't do this, then you're going to have to explain it.

Google OCR: Special Characters Affecting OCR Number Recognition

I've been playing around with Google's OCR recently, using the default tutorial, and was trying to parse numbers. I've seen previous issues dealing with numbers on license plates, but was wondering if there is a solution when special characters affect the results of OCR. Most notably, including the '#' character with a number, such as #1, #2, etc., as shown below, results in the output ##Z#T#, and it even occasionally gives me Chinese characters, even after I set the to/from language settings to English.
[Image: numbers with the pound sign]
For a similar comparison, the image below is easily read by the OCR:
[Image: numbers without the pound sign]
Is there a setting that I'm missing that can improve the results or is this just a constraint by the model?

Test Data for Time and Date Parsing (Varied Formats)

Any ideas how I can get a varied set of time / date strings to test a parser?
The idea is to see how wide a range of different formats can be parsed. Note that I am looking for different formats, so simply extracting all timestamps from a bunch of emails isn't that useful (since the format is fixed by RFC 2822).
[Also, I am not sure this is appropriate for SO, sorry, so please feel free to suggest an alternative place to ask.]
You'll probably have to create your own list. But here are some resources describing some of the various formats you might encounter:
http://www.hackcraft.net/web/datetime/
http://en.wikipedia.org/wiki/Date_format_by_country
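As a small supplement (a sketch only, not a substitute for real-world samples), you could also render a single timestamp through a spread of strptime-style format codes, e.g. in R:
# Sketch: generate differently formatted renderings of one timestamp.
fmts <- c("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%y", "%d %b %Y", "%B %d, %Y",
          "%Y-%m-%dT%H:%M:%SZ", "%d.%m.%Y %H:%M", "%a, %d %b %Y %H:%M:%S %z")
sapply(fmts, function(f) format(Sys.time(), f))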

FAQ markup to R data structure

I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it was parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command-line (ie through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from packages bibtex and fortunes, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() #surprise me!, faq(51), faq(package="ggplot2") (a rough sketch of such an interface follows this list).
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
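As a purely hypothetical sketch (the faq_db data frame, its columns, and the function body are all invented here, not part of any existing package), such an interface might look like:
# Hypothetical sketch of a fortune()-style FAQ interface; 'faq_db' is an
# assumed data frame with columns title, entry, category and package.
faq <- function(query = NULL, package = NULL) {
  db <- faq_db
  if (!is.null(package)) db <- db[db$package %in% package, , drop = FALSE]
  if (is.null(query)) return(db[sample(nrow(db), 1), ])        # faq(): surprise me!
  if (is.numeric(query)) return(db[query, , drop = FALSE])     # faq(51)
  hits <- grepl(query, paste(db$title, db$entry), ignore.case = TRUE)
  db[hits, , drop = FALSE]                                     # faq("lattice print")
}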
QUESTION
I'm not sure what the best input format is, however, either for converting the existing FAQ or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/ref class or structure), e.g.
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
Is there something else out there, maybe examples of parsing markdown files into R structures? An example of deviating Rd files away from their intended purpose?
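One possible direction, sketched here under the assumption that the commonmark and XML packages are acceptable and with a made-up file name faq.md, is to let a CommonMark parser produce XML and then walk that from R:
# Sketch: parse a Markdown FAQ into an R structure via CommonMark's XML output.
library(commonmark)
library(XML)
md  <- paste(readLines("faq.md"), collapse = "\n")   # hypothetical input file
doc <- xmlParse(markdown_xml(md), asText = TRUE)
# CommonMark XML is namespaced, hence the local-name() test.
headings <- xpathSApply(doc, "//*[local-name()='heading']", xmlValue)
headings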
To summarise
I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortune package to more general entries such as FAQ items
2- a more convenient format to enter new FAQs (rather than the current texinfo format)
3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite vast (arguably ill-posed), neither answer provides a complete solution, so I will not (for now anyway) accept an answer. As for the bounty, I'll award it to the most up-voted answer before the bounty expires, wishing there were a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  list(list(
    title    = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions,
# i.e. around lines underlined with ****** or ======.
i <- grep("[*=]{5}", doc) - 1   # title lines (the line just above each underline)
i <- c(1, i)
j <- rep(seq_along(i), diff(c(i, length(doc) + 1)))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions are in the subsections,
# we can discard the sections (whose titles are underlined with *).
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq), 1) ]], sep="\n")
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so that one can write R routines to extract information from the documentation better.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document-processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get better information about working with R. I'm not an R expert, but I think R is mainly a numerical language, and does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to a normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact, reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this doesn't change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the markup (and therefore the parse) should be? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; DMS, after parsing, will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract/do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, doing all the up front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend" but those words can hide a lot) that's more doable.
