Google OCR: Special Characters Affecting OCR Number Recognition - google-cloud-vision

I've been playing around with Google's OCR recently using the default tutorial and was trying to parse numbers. I've seen previous issues dealing with numbers on license plates, but I was wondering if there is a solution when special characters affect the results of OCR. Most notably, including the '#' character with a number, such as #1, #2, etc., as shown below, results in the output ##Z#T#, and occasionally even gives me Chinese characters, even after I set the to/from language settings to English.
Numbers with pound sign
For a similar comparison, the image below is easily read by the OCR:
Numbers without pound sign
Is there a setting that I'm missing that can improve the results, or is this just a constraint of the model?
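For reference, language hints can be supplied through the image context; below is a minimal sketch, assuming the google-cloud-vision Python client (the file name is illustrative, and this is not the poster's code):

from google.cloud import vision

# Ask the engine to prefer English; without a hint, ambiguous glyph
# clusters like "#1 #2" may be matched against CJK character models.
client = vision.ImageAnnotatorClient()
with open("numbers.png", "rb") as f:
    image = vision.Image(content=f.read())
response = client.text_detection(
    image=image,
    image_context={"language_hints": ["en"]},
)
print(response.full_text_annotation.text)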

Related

How to display boxed numeric characters on web page

I'm working on a PDF to HTML project. In the original .ai file, some numeric characters are displayed in a box:
Although I know the font used in the file is GothicMB101Pro DeBold-83pv-RKSJ-H, I don't have the font file on my machine (and of course the original designer is long gone). In my copy of Illustrator, it appears like this:
The 1) part is a single character, not "1" followed by ")", so at least I know it's not some form of kerning but some Unicode character. But I couldn't find any match in my search. The "enclosed numeric" characters such as ① aren't the same.
Since I'm not sure which character it is, and since I'm not very knowledgeable in Japanese (boxed digits seem to be a very common occurrence in Japanese text), I couldn't satisfy my client's requirement.
What are those characters and how do I get them onscreen?
I would guess that, since the output you are seeing without the original font installed consists of two characters, the original also consisted of two characters: the first is a regular one (in this case, the digit 1), and the second is a combining character. There is one for a combining enclosing square (U+20DE), and this is probably the one that is rendered as the closing parenthesis ")" you see in the output. Using the digit 1 followed by the enclosing square (at least in my browser, in the Stack Overflow answer editor) gives the required result, as shown below:
1⃞
If your font does not render the enclosing square, it is probably the fault of the font that is used as a fallback. But without knowing exactly which font is used as a replacement, it is hard to say whether it is possible to work around the issue.
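To illustrate the composition this answer describes, here is a small sketch in Python; whether the box actually renders around the digit depends entirely on the font doing the rendering:

# U+20DE COMBINING ENCLOSING SQUARE is drawn around the preceding character.
boxed = "1" + "\u20de"
print(boxed)       # a boxed "1", if the font supports the combining mark
print(len(boxed))  # 2: a base character plus a combining character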

I have two questions about the Google Translation API

The first problem: the source language is English, and "CORN STARCH PROCESSING LINE" is translated into Russian with the neural machine translation model. The result is "ЛИНИЯ ОБРАБОТКИ КУХНЯ", which differs considerably from the original. The second problem: the source language is again English, and "SULLAIR COALESCING FILTER 02250153-324" translated into Arabic with the neural model gives "SULLAIR COALESCING FILTER 02250153-324", which is not Arabic. How can I solve this problem?
Regarding the Russian translation, the Cloud Translation API is giving as accurate a result as is currently possible. Those results are being constantly improved and updated.
For the Arabic translation, there seems to be an issue with using the - symbol in the numbers: if you remove it or use any other symbol, the words are translated to Arabic as expected.
I have created an issue tracker for that; you can follow this link to get updates on the fix. Keep in mind that there is no ETA on when the fix will be ready, so as a workaround for now, just replace the - symbol with the _ symbol and the words will be translated to Arabic.
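A minimal sketch of that workaround, assuming the google-cloud-translate Python client (the client setup is illustrative; the answer itself only prescribes the - to _ substitution):

from google.cloud import translate_v2 as translate

client = translate.Client()
text = "SULLAIR COALESCING FILTER 02250153-324"
# Replace "-" with "_" before translating, per the workaround above.
result = client.translate(text.replace("-", "_"), target_language="ar")
print(result["translatedText"])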

Omitting Words from Spellcheck in qdap

This is my first post with StackOverflow, I apologize if I violate any rules.
I am working with the R package qdap on spellchecking very messy medical record text. The goal of this work is to identify misspellings of drug side effects in order to build a side effect misspelling dictionary. The text I am working with contains many, many misspellings, abbreviations, and other things that make a simple spellcheck difficult. After I run a spellcheck on a small doctors note, I get hundreds of words returned to me by the spellcheck program. This makes it difficult to search for the side effect misspellings that I care about.
I attempted to use the following code to create a dictionary consisting only of correctly spelled side effects, so that qdap will flag closely misspelled words as belonging to this dictionary. The problem is that with this, nearly every word in the text, properly or improperly spelled, is now returned as incorrect (e.g. "notable" is flagged as misspelled and "nausea" is the suggested replacement from my dictionary).
dictionary <- readLines("dictionary.txt")
check_spelling(text$NOTE_TEXT[3379], range = 0, dictionary = dictionary,
               assume.first.correct = FALSE)
Here "dictionary" is my self-built side-effects dictionary, and check_spelling is being run on text contained in a CSV file. Is there any way to keep words that are very far from any word in my dictionary from appearing in the check_spelling output (such as my previous example)? That way I can cut down the number of words I am seeing in the output and identify only the misspelled side effects.
As a small note, changing assume.first.correct to TRUE will not change anything, because the dictionary does not run with it set that way.
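As far as I know, qdap does not expose a distance cutoff for this directly, but the idea of the filter can be sketched language-agnostically; here it is in Python with difflib (the word lists are made up for illustration):

import difflib

side_effects = ["nausea", "vomiting", "dizziness", "fatigue"]  # stand-in dictionary
words = ["nausia", "notable", "dizzyness", "patient"]

# Report a word only if it is *close* to a known side effect; distant
# words like "notable" are dropped instead of force-matched to "nausea".
for w in words:
    match = difflib.get_close_matches(w, side_effects, n=1, cutoff=0.8)
    if match:
        print(w, "->", match[0])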

How to change the number style from Arabic to English for a font?

I'm supposed to make some modifications to a PHP web site which uses a font with Arabic-style numerals.
I've been asked to convert the numbers to the English style while keeping the same font. Is that achievable?
Arabic (red) & English (green) numbering:
In principle, it is possible to create a font that has alternate glyphs for Arabic digits, selectable with OpenType font features and looking like common (European) digits. However, I do not know any such font, and such an approach would be odd on several accounts. The Arabic digits have been encoded as separate characters, and treating the difference between them and common digits as merely a glyph difference would deviate from normal reasonable practices.
Thus, the change, if desired, should be made at the character level. The details depend on the context, but the principle is simple: common digits are U+0030...U+0039 and Arabic digits are U+0660...U+0669, both in numeric order, so at the character code level it is simply a matter of adding or subtracting a constant.
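The site in question is PHP, but the arithmetic is the same in any language; here is a sketch in Python:

# Arabic-Indic digits are U+0660-U+0669; ASCII digits are U+0030-U+0039.
def to_ascii_digits(s):
    return "".join(
        chr(ord(c) - 0x0660 + 0x0030) if 0x0660 <= ord(c) <= 0x0669 else c
        for c in s
    )

print(to_ascii_digits("٠١٢٣٤٥٦٧٨٩"))  # -> 0123456789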

Which printable ASCII characters will usually appear in an English text?

I have been trying to solve Project Euler's problem #59 for a while, and I am having trouble because some of it seems somewhat more ambiguous than previous problems.
As background, the problem says that the given text file is encrypted text with the ASCII codes saved as numbers. The encryption method is to XOR 3 lowercase letters cyclically with the plaintext (so it is reversible). The problem asks for the key that decrypts the file to English text. How should I restrict the character set of my output to find the answer, without sifting through all 26^3 candidate decryptions?
I have tried restricting to letters, spaces, and punctuation, and that did not work.
To clarify: I want to determine, out of all printable ASCII characters, which ones I can probably discard and which ones I can expect to be in the plaintext string.
Have you tried two of the most basic and common tools for analyzing this kind of cipher?
Analyze the frequency of the characters and try to match it against English letter frequency
Bruteforce using keys from a wordlist, most often common words are used as keys by "dumb" users
To analyze the frequency for this particular problem, you have to split the string at every third element, since the key has length 3. You should then be able to produce three columns:
79 59 12
2 79 35
8 28 20
2 3 68
...
You then have to analyse the frequency of each column separately, since within a column every value is encrypted with the same key byte.
OK, I actually took the time to construct the 3 complete columns, counted the frequency for each column, and got the two most frequent items for each column:
Col1  Col2  Col3
71    79    68
2     1     1
Now if you check for instance: http://en.wikipedia.org/wiki/Letter_frequency
That page lists the most frequent letters; also don't forget that there are spaces and other characters which are not present on that page, but I think you can assume that space is the most frequent character.
So now it is just a matter of XORing the most frequent values in the table I provided with the most frequent characters in the English language and seeing whether you get any lowercase characters. I found a three-letter word which I think is the answer using only this data.
Good luck and by the way, it was a nice problem!
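A sketch of this column-wise attack in Python (the file name and comma-separated format are assumptions about how the Project Euler data was saved):

from collections import Counter

data = [int(n) for n in open("p059_cipher.txt").read().split(",")]

key = []
for column in (data[0::3], data[1::3], data[2::3]):
    top = Counter(column).most_common(1)[0][0]
    key.append(top ^ ord(" "))  # assume the most frequent plaintext byte is a space

plaintext = "".join(chr(c ^ key[i % 3]) for i, c in enumerate(data))
print("key:", "".join(map(chr, key)))
print(plaintext[:100])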
A possible solution is to simply assume the presence of a given three-character sequence in the encrypted text. You can use a three-letter word, or a three-letter sequence which is likely to appear in English text (e.g. " a ": the letter 'a' enclosed between two spaces). Then simply try all possible positions of that sequence in the encrypted text. Each position allows you to recompute the key, then decrypt the whole text into a file.
Since the original text has length 1201, you get 1199 files to skim through. At that point it is only a matter of patience, but you can make it much faster by using a simple text search utility on another frequent sequence in English (e.g. "are"), for instance with the Unix tool grep.
I did just that, and got the decrypted text in less than five minutes.
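A sketch of this crib idea in Python (same assumed file format as above): slide " a " across the text, derive the three key bytes each position implies, and keep only keys made of lowercase letters whose decryption contains another frequent sequence such as "are":

data = [int(n) for n in open("p059_cipher.txt").read().split(",")]
crib = " a "

for pos in range(len(data) - 2):
    # The crib byte at offset j fixes the key byte for column (pos + j) % 3.
    key = [0, 0, 0]
    for j in range(3):
        key[(pos + j) % 3] = data[pos + j] ^ ord(crib[j])
    if all(ord("a") <= k <= ord("z") for k in key):
        plaintext = "".join(chr(c ^ key[i % 3]) for i, c in enumerate(data))
        if "are" in plaintext:
            print(pos, "".join(map(chr, key)), plaintext[:60])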
I'll admit upfront I'm not familiar with an XOR cipher.
However, it seems very similar in concept to the Vigenère cipher, especially the line where they mention that for unbreakable encryption the key length equals the message length. That screams Vernam cipher.
As mentioned in the other answer, the strategic approach to breaking a Vigenère cipher is probabilistic. I will not go into detail, because most of the theory I learned was relatively complicated, but it can be found here, keeping in mind that Vigenère is a series of Caesar ciphers.
The problem makes it easy for you, though, because you already know the key length. Because of that, as you mentioned, you can simply brute force by trying every single 3-letter combination.
Here's what I would do: take a reasonably sized chunk of the ciphertext, say 10-20 characters, and try the brute-force approach on that. Keep track of all the keys that produce understandable sequences of letters, and then use those on the whole ciphertext. That way you employ the obvious brute-force method without brute forcing the entire problem, so I don't think you'll have to worry about limiting your output.
That said, I agree that while producing the output, if you ever get a non-printable character, you can break out of your loop and move on to the next key. I wouldn't try anything more specific than that, because who knows what the original message might contain; never make assumptions about the data you're dealing with. Short-circuiting logic like that is always a good idea, especially when implementing a brute-force solution.
Split the ciphertext into 3.
Ciphertext1 comprises the 1st, 4th, 7th, 10th...numbers
Ciphertext2 comprises the 2nd, 5th, 8th, 11th...numbers
Ciphertext3 comprises the 3rd, 6th, 9th, 12th...numbers
Now you know that within each ciphertext every value is encrypted with the same key letter. Do a standard frequency analysis on each one. That should give you enough clues as to what the letter is.
I just solved this problem a few days ago. Without spoiling it for you, I want to describe my approach. Some of what I say may be redundant with what you already know, but it was part of my approach.
First I assumed that the key is exactly as described: three lowercase ASCII letters. So I began brute forcing at 'aaa' and went to 'zzz'. While decrypting, if any resulting byte had a value lower than 32 (the ASCII value of space, the lowest "printable" ASCII value) or higher than 126 (the ASCII value of the tilde '~', the highest printable ASCII character), then I assumed the key was invalid, because any value outside that range would be an invalid character for plain English text. As soon as a single byte fell outside the range, I stopped decrypting and went on to the next possible key.
Once I had decrypted the entire message with a particular key (i.e. it passed the first test of all bytes being printable characters), I needed a way to verify it as a valid decryption. I was expecting the result to be a simple list of words with no particular order or meaning. From other cryptography experience, I thought back to letter frequency, and most simply that the average English word is about 5 characters long. The file contains 1201 input bytes, so that would mean roughly 240 words on average. After decrypting, I counted the spaces in the resulting output string. Since Project Euler is anything but average, I compared the number of spaces against 200, to account for longer, more obscure words. When an output had more than 200 spaces in it, I printed out the key it was decrypted with and the output text. The one and only output with more than 200 spaces is the answer. Let me tell you, it's more than obvious when you see it.
Something to point out: the answer to the question is NOT the key. It is the sum of all the ASCII values of the output string. This approach also solves the problem well under the one-minute mark; in fact, it runs in around 3 or 4 seconds.
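A sketch of this brute force in Python (file name and format assumed as before):

from itertools import product
from string import ascii_lowercase

data = [int(n) for n in open("p059_cipher.txt").read().split(",")]

for key in product(ascii_lowercase, repeat=3):
    kb = [ord(k) for k in key]
    out = []
    for i, c in enumerate(data):
        p = c ^ kb[i % 3]
        if p < 32 or p > 126:       # outside printable ASCII: reject this key
            break
        out.append(chr(p))
    else:                           # every byte was printable
        plaintext = "".join(out)
        if plaintext.count(" ") > 200:
            print("".join(key), sum(ord(ch) for ch in plaintext))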