Diacritic recognition in Handwritten Text Recognition

I'm not sure if this is the right site for this question, but I've read many papers about the handwriting recognition problem applied to different scripts and languages.
However, I still don't have a clear understanding of how diacritics (á, à, ä or ñ, ć, č) are recognized. Are they treated as entirely new characters?
Or are the diacritics separated from the base characters, so that the recognizer works with character+diacritic combinations?
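For concreteness, Unicode itself already supports both views, independent of any recognizer: a precomposed character such as á is a single code point, while the decomposed form is a base letter plus a combining mark, so an HTR label set could be defined either way. A minimal Python sketch (standard library only; no particular HTR system is assumed):

    import unicodedata

    composed = "\u00e1"     # "á" as one precomposed code point (U+00E1)
    decomposed = "a\u0301"  # "a" + U+0301 COMBINING ACUTE ACCENT

    print(len(composed), len(decomposed))  # 1 2
    print(composed == decomposed)          # False: different code point sequences

    # Unicode normalization converts between the two representations:
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True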
Thanks in advance!

Related

I have two questions about google translation api

The first problem: the source language is English, and "CORN STARCH PROCESSING LINE" is translated into Russian with the neural machine translation model. The result is "ЛИНИЯ ОБРАБОТКИ КУХНЯ", which differs considerably from the original. The second problem: the source language is English, and "SULLAIR COALESCING FILTER 02250153-324" is translated into Arabic with the neural model, but the result is "SULLAIR COALESCING FILTER 02250153-324", which is not Arabic at all. How can I solve this?
Regarding the Russian translation, the Cloud Translation API is giving as accurate a result as it currently can. Those results are constantly being improved and updated.
For the Arabic translation, there seems to be an issue with the - symbol in the numbers: if you remove it or use any other symbol, the words are translated into Arabic as expected.
I have created an issue tracker for this; you can follow this link to get updates on the fix. Keep in mind that there is no ETA on when the fix will be ready, so as a workaround for now, replace the - symbol with the _ symbol and the words will be translated into Arabic.
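A minimal sketch of that workaround using the google-cloud-translate Python client (this assumes credentials are already configured; the substitution itself is just the stopgap described above, not an official fix):

    from google.cloud import translate_v2 as translate

    client = translate.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

    text = "SULLAIR COALESCING FILTER 02250153-324"
    # Workaround: swap '-' for '_' so the hyphenated part number
    # no longer blocks translation of the surrounding words.
    safe_text = text.replace("-", "_")

    result = client.translate(safe_text, source_language="en", target_language="ar")
    print(result["translatedText"])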

How do you deal with the English text having Hindi words in between? (Text Mining in R)

I'm doing sentiment analysis in R, and I'm looking for an effective way of dealing with Hindi language words in English text.
For example: "I know, magar this can happen"
Here "magar" is a Hindi word meaning "but". How do I deal with such mixed-language text written in English script?
Thanks!
You can use a phonetic algorithm like Soundex to deal with out-of-vocabulary terms and try to match them to Hindi words. Then you translate those Hindi words into English; a sketch of the idea follows below.
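Although the question is about R, here is the idea sketched in Python for concreteness, using the jellyfish library's Soundex implementation; the mini-lexicon and vocabulary are purely illustrative:

    import jellyfish

    # Hypothetical mini-lexicon: romanized Hindi word -> English translation
    hindi_lexicon = {"magar": "but", "lekin": "but", "aur": "and", "nahi": "not"}
    english_vocab = {"i", "know", "this", "can", "happen"}

    # Precompute Soundex codes for the lexicon entries
    lexicon_codes = {jellyfish.soundex(w): t for w, t in hindi_lexicon.items()}

    def normalize(sentence):
        out = []
        for token in sentence.lower().replace(",", "").split():
            if token in english_vocab:
                out.append(token)
            else:
                # Out-of-vocabulary token: try a phonetic match against the lexicon
                out.append(lexicon_codes.get(jellyfish.soundex(token), token))
        return " ".join(out)

    print(normalize("I know, magar this can happen"))
    # -> "i know but this can happen"

Note the collision risk: distinct words can share a Soundex code, so in practice you would confirm candidate matches with an edit-distance check before substituting.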

Google OCR: Special Characters Affecting OCR Number Recognition

I've been playing around with Google's OCR recently, using the default tutorial, and was trying to parse numbers. I've seen previous issues dealing with numbers on license plates, but I was wondering whether there is a solution when special characters affect the OCR results. Most notably, including the '#' character with a number, such as #1, #2, etc., as shown below, results in the output ##Z#T#, and occasionally even gives me Chinese characters, even after I set the to/from language settings to English.
[Image: numbers with pound sign]
For a similar comparison, the image below is easily read by the OCR:
[Image: numbers without pound sign]
Is there a setting I'm missing that could improve the results, or is this simply a constraint of the model?
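For reference, the "to/from language" setting usually corresponds to the languageHints field of the image context in the Vision API. A minimal sketch with the Python client (the file name is hypothetical, and hinting the language constrains script detection but may not resolve the '#' confusion by itself):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("numbers_with_pound.png", "rb") as f:  # hypothetical file name
        image = vision.Image(content=f.read())

    # Hint that the text is English/Latin script; this steers language
    # detection but does not retrain the model's handling of symbols.
    response = client.text_detection(
        image=image,
        image_context={"language_hints": ["en"]},
    )
    print(response.full_text_annotation.text)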

What are the two symbols in the Google Translate icon

What are the two symbols used in the Google Translate icon?
I've got the "A"... said the ignorant American.
The other one is a Chinese character, 文.
In Chinese, 文 means text or writing, and its pronunciation (wén) is similar to the English word "when".

Is there an industry standard output format for OCR?

Is there an industry standard output format for OCR? I can't seem to find anything that is defined as an industry standard, nor am I experienced enough with OCR to know whether one exists.
hOCR is an open standard that defines a data format for representing OCR output; an example of producing it follows below.
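For instance, Tesseract can emit hOCR directly. A small sketch using the pytesseract wrapper (the file names are hypothetical):

    import pytesseract
    from PIL import Image

    # hOCR is XHTML with layout embedded in class/title attributes,
    # e.g. <span class="ocrx_word" title="bbox 36 92 96 116">word</span>.
    hocr_bytes = pytesseract.image_to_pdf_or_hocr(
        Image.open("scan.png"),  # hypothetical input image
        extension="hocr",
    )

    with open("scan.hocr", "wb") as f:
        f.write(hocr_bytes)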
There is no single such format, but there are commonly used practices and open-standard formats that will satisfy your requirements. This question is like asking "what is the standard result of cooking potatoes": mashed potatoes, french fries, or baked? (Not sure where that example came from; I must be getting hungry...)
Also, an "industry standard" will depend on the specific industry. If you are in a specific vertical, then some formats will be more common (almost standard) than others. For example:
Medical - HL7 formatted text
Libraries - ALTO XML
Legal/eDiscovery - PDF Text Under Image
Integration/Automation - XML
In general, I will not be wrong if I say that the most commonly used and industry-accepted formats are TXT, XML, and PDF (in several flavors). Each has unique properties and specific uses, but each can be widely consumed by other technologies thanks to open standards.
It is better to approach this from the opposite end, that is, by thinking through the business requirements: what will happen with the data, and where it needs to be absorbed, should define exactly which hand-off format you want from the OCR output.
XIEO (http://xieo.info) uses a proprietary Maya Software format called CML (Clix Markup Language) that efficiently encodes page, zone, line, text box, and related information. VisualText/NLP++ (available at http://www.textanalysis.com) has a special tokenizer pass to "inhale" that format and produce a ready-made parse tree. NLP++ analyzers can then build on that initial parse tree.
This workflow has been used for more than 5 years at XIEO, primarily for processing Official Records documents (deeds, mortgages, clerk of court filings, etc.) and extracting information from them.
In this workflow, one can clean up the OCRed text, re-zone to fix OCR errors and mis-zoning, and extract the pertinent information from the text.
Amnon Meyers, CTO, Text Analysis International, Inc. (amnon.meyers@textanalysis.com)
