Version 1 of the Google Cloud Vision API (beta) permits optical character recognition via TEXT_DETECTION requests. While recognition quality is good, characters are returned without any hint of the original layout. Structured text (e.g., tables, receipts, columnar data) is therefore sometimes incorrectly ordered.
Is it possible to preserve document structure with the Google Cloud Vision API? Similar questions have been asked about tesseract and hOCR, for example [1] and [2]. There is currently no information about TEXT_DETECTION options in the documentation [3].
[1] How to preserve document structure in tesseract
[2] Tesseract - ambiguity in space and tab
[3] https://cloud.google.com/vision/
Recognizing the text structure is a more abstract concept than recognizing the text itself (letters, words, sentences). If you already have this structure information in your file metadata, you could do something like the following (a rough sketch is shown after the list):
Segment/divide your input image in subparts.
Execute your text_detection requests.
Re-order your text correctly based on your meta-data.
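A rough sketch of that approach, assuming you already know the pixel coordinates of each sub-region from your own metadata (the crop boxes and file name below are hypothetical) and using the google-cloud-vision Python client; depending on your client library version the Image type may live under vision.types instead:

# Sketch: OCR each known sub-region separately, then reassemble in your own order.
# The regions below are hypothetical; take them from your own layout metadata.
import io

from PIL import Image
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# (name, left, top, right, bottom) in pixels -- example values only
regions = [
    ("header", 0, 0, 800, 100),
    ("left_column", 0, 100, 400, 1000),
    ("right_column", 400, 100, 800, 1000),
]

page = Image.open("receipt.png")
ordered_text = []

for name, left, top, right, bottom in regions:
    crop = page.crop((left, top, right, bottom))
    buf = io.BytesIO()
    crop.save(buf, format="PNG")

    response = client.text_detection(image=vision.Image(content=buf.getvalue()))
    # The first annotation (if any) contains the full text of this crop.
    text = response.text_annotations[0].description if response.text_annotations else ""
    ordered_text.append((name, text))

# Merge the pieces in whatever order your structure metadata dictates.
for name, text in ordered_text:
    print("--- " + name + " ---")
    print(text)

One request per region costs more quota, but it keeps each block of text in a known position so you can rebuild tables or columns yourself.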
I'm not an expert in the Cloud Vision text_detection API, but the feature is called text_detection, not language_detection or text_structure_detection, which gives a small clue about the level/layer at which detection operates.
Maybe it's a feature they are planning to add in the future or describe in the documentation.
I wonder if there is a list of the possible labels returned by Google's object localization: 'human', 'dog', 'cat', etc.
Knowing all the possible labels returned by Google's object localization service can help us use the service more efficiently. For example, if we are looking in our database for images with hats, we first send our images to the API, and then we need to know all the possible hat-related labels that Google may have returned. Looking for the word "hat" in the labels alone will miss those images for which Google's object recognition returned "sombrero".
There is no extensive list available that contains all the possible labels used in Google object localization. If you feel such a list would be highly beneficial, you may post a feature request in Google's issue tracker.
In any case, notice that Google object localization results contain a machine-generated identifier (MID) corresponding to a label's Google Knowledge Graph entry. Therefore, you can call the Knowledge Graph Search API to check for similar possible results.
For example, if you perform the call for Sombrero
https://kgsearch.googleapis.com/v1/entities:search?query=sombrero&key=<yourAPIKey>&limit=5&indent=True
you will obtain the results: Sombrero, Hat, Sun Hat, Sombrero Galaxy, Straw Hat.
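For illustration, a small Python sketch of that same Knowledge Graph Search call (the API key is a placeholder; the API also accepts a languages parameter, and an ids parameter if you want to look up an MID directly):

# Sketch: query the Knowledge Graph Search API for entities related to a term.
import requests

API_KEY = "<yourAPIKey>"  # placeholder

params = {
    "query": "sombrero",
    "key": API_KEY,
    "limit": 5,
}
resp = requests.get("https://kgsearch.googleapis.com/v1/entities:search", params=params)
resp.raise_for_status()

for element in resp.json().get("itemListElement", []):
    result = element["result"]
    print(result.get("name"), "-", result.get("@id"))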
Is it possible to obtain labels from ML Kit Image Labeling in a given language?
I easily manage to get them in English...
but I need different languages... any suggestions?
In the docs I found this:
In addition to the text description of each label that ML Kit returns, it also returns the label's Google Knowledge Graph entity ID. This ID is a string that uniquely identifies the entity represented by the label, and is the same ID used by the Knowledge Graph Search API. You can use this string to identify an entity across languages, and independently of the formatting of the text description.
Maybe it is possible to use a graph entity id to translate the label?
Or what else can I do?
As Firebase support told me via email on Feb 1, 2019:
Unfortunately at the moment it is not possible to use other languages for image labeling, however I have created a feature request for our engineering team to take a look at and consider for future releases. There's no telling on when this will be ready, but you can keep an eye on the Firebase Release Notes to be informed of the latest from Firebase.
On the other hand the Knowledge Graph entity ID can be used to find entities in the Google Knowledge Graph but at the moment it is not possible to connect these results with the image labeling in order to translate the label.
I first tried to play with the Knowledge Graph entity ID in order to translate the label description... but since I used the on-device Firebase library, I obtained some IDs that the Knowledge Graph wasn't able to recognize (for instance: Label: Flower, Confidence: 0.97793585, EntityID: /m/0c9ph5).
I ended up using a free translation API service (Yandex), which is free for the first million translated characters a day.
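For reference, a rough sketch of that fallback; the Yandex Translate v1.5 endpoint and parameters below are assumptions based on the older free tier, so adapt them to whatever translation service and key you actually use:

# Sketch: translate an ML Kit label description with an external translation API.
# Endpoint and parameters are assumptions; substitute your own provider and key.
import requests

TRANSLATE_KEY = "<yourTranslateKey>"  # placeholder

def translate_label(text, target_lang="it"):
    resp = requests.get(
        "https://translate.yandex.net/api/v1.5/tr.json/translate",
        params={"key": TRANSLATE_KEY, "text": text, "lang": "en-" + target_lang},
    )
    resp.raise_for_status()
    return resp.json()["text"][0]

print(translate_label("Flower"))  # e.g. "Fiore"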
I used the computer vision api on an image. The word pizza was returned in describing the image and the only connection to pizza I can make is a pizza company logo on a napkin. The word birthday was also returned. Is there any way to figure out if the word pizza was returned because of the company logo, or it was a guess associated with the word birthday?
This depends on how much detail the API gives you back. If it allows you to observe the intermediate outputs of the classifier that is used to categorize the image, you can see which parts of the image result in high output values. The pizza company logo on the napkin, depending on how large it appears, is quite likely to cause this.
If you are using a more open API and classifier, like keras and the networks provided under keras.applications, you can use what are called "class activation maps" to see which parts of the image cause the result.
If you find the above too hard to do, one easy way to investigate the reason is to crop parts of the image in a loop and pass them to the API. I suspect that "birthday" might be related to a distributed feature and you might not be able to find where it comes from, whereas "pizza" might come from the logo or some other part of the image.
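A rough sketch of that crop-and-resubmit loop with the google-cloud-vision Python client (the tile size and file name are arbitrary; note that every crop is a separate billable request):

# Sketch: label each tile of the image separately to see roughly which
# region triggers a given label such as "pizza".
import io

from PIL import Image
from google.cloud import vision

client = vision.ImageAnnotatorClient()
photo = Image.open("photo.jpg")
width, height = photo.size
tile = 300  # arbitrary crop size in pixels

for top in range(0, height, tile):
    for left in range(0, width, tile):
        crop = photo.crop((left, top, min(left + tile, width), min(top + tile, height)))
        buf = io.BytesIO()
        crop.save(buf, format="JPEG")

        response = client.label_detection(image=vision.Image(content=buf.getvalue()))
        labels = [ann.description.lower() for ann in response.label_annotations]
        if "pizza" in labels:
            print("'pizza' reported for the crop at", (left, top))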
Say that I have images and I want to generate labels for them in Spanish: does the Google Cloud Vision API allow you to select which language to return the labels in?
Label Detection
Google Cloud Vision APIs do not allow configuring the result language for label detection. You will need to use a different API, such as the Cloud Translation API, to perform that operation instead.
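A minimal sketch of that two-step approach with the google-cloud-vision and google-cloud-translate Python clients (labels come back in English and are then translated to Spanish; the file name is a placeholder):

# Sketch: detect labels (English only), then translate them with Cloud Translation.
from google.cloud import translate_v2, vision

vision_client = vision.ImageAnnotatorClient()
translate_client = translate_v2.Client()

with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = vision_client.label_detection(image=image)

for label in response.label_annotations:
    translated = translate_client.translate(label.description, target_language="es")
    print(label.description, "->", translated["translatedText"])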
OCR (Text detection)
If you're interested in text detection in your image, Google Cloud Vision APIs support Optical Character Recognition (OCR) with automatic language detection in a broad set of supported languages.
For TEXT_DETECTION and DOCUMENT_TEXT_DETECTION requests, you can provide the languageHints parameter in the request to get better results in certain cases where the language is unknown and/or not easily detectable.
languageHints[]
string
List of languages to use for TEXT_DETECTION. In most cases, an empty value yields the best results since it enables automatic language detection. For languages based on the Latin alphabet, setting languageHints is not needed. In rare cases, when the language of the text in the image is known, setting a hint will help get better results (although it will be a significant hindrance if the hint is wrong). Text detection returns an error if one or more of the specified languages is not one of the supported languages.
The DetectedLanguage information is available in the response to identify the language along with a confidence value.
Detected language for a structural component.
JSON representation
{
  "languageCode": string,
  "confidence": number
}
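For illustration, a short sketch of a DOCUMENT_TEXT_DETECTION request that passes languageHints and then reads the detected languages back from the response (the hint value and file name are just examples):

# Sketch: pass languageHints via ImageContext and inspect detected languages per page.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("scan.png", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(
    image=image,
    image_context=vision.ImageContext(language_hints=["ja"]),  # example hint
)

for page in response.full_text_annotation.pages:
    for lang in page.property.detected_languages:
        print(lang.language_code, lang.confidence)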
As the title says, my app takes parameters from the user and displays relevant locations around them. But from what I understand, the Google Maps API terms prohibit you from doing this. Can this use be considered a listing service, where I'm simply displaying the data I got from Google? I'm also providing all relevant attributions.