Text extraction - line-by-line - google-cloud-vision


I am using the Google Vision API, primarily to extract text. It works fine, except in specific cases where I need the API to scan an entire line and return its text before moving on to the next line. It appears that the API uses some kind of logic that scans top to bottom on the left side, then moves to the right side and scans top to bottom again. I would have liked the API to read left to right, then move down, and so on.
For example, consider the image:
The API returns the text like this:
“ Name DOB Gender: Lives In John Doe 01-Jan-1970 LA ”
Whereas, I would have expected something like this:
“ Name: John Doe DOB: 01-Jan-1970 Gender: M Lives In: LA ”
I suppose there is a way to define the block size or margin setting (?) to read the image/scan line by line?
Thanks for your help.
Alex

This might be a late answer but adding it for future reference.
You can add feature hints to your JSON request to get the desired results.
{
"requests": [
{
"image": {
"source": {
"imageUri": "https://i.stack.imgur.com/TRTXo.png"
}
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
For text that is very far apart, DOCUMENT_TEXT_DETECTION does not provide proper line segmentation either.
The following code does simple line segmentation based on the character polygon coordinates.
https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
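As a rough illustration of what DOCUMENT_TEXT_DETECTION adds, the sketch below walks the fullTextAnnotation part of a JSON response (pages, blocks, paragraphs, words, symbols) and starts a new line whenever a symbol carries a line-ending detectedBreak. It assumes the response has already been parsed into a Python dict; the `_word` helper and the sample data are made up for the demonstration. It only splits lines inside each paragraph, so it does not by itself merge labels and values that the API placed in separate, far-apart blocks.

```python
LINE_END = {"EOL_SURE_SPACE", "LINE_BREAK", "HYPHEN"}
SPACE = {"SPACE", "SURE_SPACE"}


def lines_from_full_text(full_text_annotation):
    """Split a fullTextAnnotation dict (REST JSON form) into lines of text."""
    lines, current = [], []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        current.append(symbol["text"])
                        brk = (symbol.get("property", {})
                                     .get("detectedBreak", {})
                                     .get("type"))
                        if brk in SPACE:
                            current.append(" ")
                        elif brk in LINE_END:
                            lines.append("".join(current))
                            current = []
    if current:  # text after the last detected break
        lines.append("".join(current))
    return lines


def _word(text, brk):
    """Hypothetical helper: build one word whose last symbol has a break."""
    symbols = [{"text": ch} for ch in text]
    symbols[-1]["property"] = {"detectedBreak": {"type": brk}}
    return {"symbols": symbols}


sample = {"pages": [{"blocks": [{"paragraphs": [{"words": [
    _word("Name", "SPACE"), _word("DOB", "LINE_BREAK"),
    _word("John", "SPACE"), _word("Doe", "LINE_BREAK"),
]}]}]}]}

print(lines_from_full_text(sample))
```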

Here is some simple code to read line by line. It uses the y-axis to find the lines and the x-axis to order the words within each line.
items = []
lines = {}
# text_annotations[0] is the whole text, so start from the first word
for text in response.text_annotations[1:]:
    top_x_axis = text.bounding_poly.vertices[0].x
    top_y_axis = text.bounding_poly.vertices[0].y
    bottom_y_axis = text.bounding_poly.vertices[3].y
    if top_y_axis not in lines:
        lines[top_y_axis] = [(top_y_axis, bottom_y_axis), []]
    # attach the word to the first line whose bottom edge lies below the word's top
    for s_top_y_axis, s_item in lines.items():
        if top_y_axis < s_item[0][1]:
            lines[s_top_y_axis][1].append((top_x_axis, text.description))
            break
for _, item in lines.items():
    if item[1]:
        # order the words of a line from left to right
        words = sorted(item[1], key=lambda t: t[0])
        items.append((item[0], ' '.join([word for _, word in words]), words))
print(items)

You can also extract the text line by line from the bounds: use boundingPoly and concatenate the words that fall on the same line.
"boundingPoly": {
"vertices": [
{
"x": 87,
"y": 148
},
{
"x": 411,
"y": 148
},
{
"x": 411,
"y": 206
},
{
"x": 87,
"y": 206
}
]
For example, these two words are on the same "line":
"description": "you",
"boundingPoly": {
"vertices": [
{
"x": 362,
"y": 1406
},
{
"x": 433,
"y": 1406
},
{
"x": 433,
"y": 1448
},
{
"x": 362,
"y": 1448
}
]
}
},
{
"description": "start",
"boundingPoly": {
"vertices": [
{
"x": 446,
"y": 1406
},
{
"x": 540,
"y": 1406
},
{
"x": 540,
"y": 1448
},
{
"x": 446,
"y": 1448
}
]
}
}
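A minimal sketch of that idea, assuming the word annotations are available as parsed JSON dicts shaped like the fragments above. The function name, the `tolerance` parameter and the "Hello" word are illustrative and not part of the API; the coordinates for "you" and "start" are the ones shown above.

```python
def group_words_by_line(annotations, tolerance=5):
    """Group word annotations whose top and bottom edges roughly coincide."""
    rows = []  # each row: {"top": int, "bottom": int, "words": [(x, text)]}
    for ann in annotations:
        vertices = ann["boundingPoly"]["vertices"]
        # Use .get() because a coordinate may be absent from the JSON when it is 0.
        ys = [v.get("y", 0) for v in vertices]
        xs = [v.get("x", 0) for v in vertices]
        top, bottom, left = min(ys), max(ys), min(xs)
        for row in rows:
            if (abs(row["top"] - top) <= tolerance
                    and abs(row["bottom"] - bottom) <= tolerance):
                row["words"].append((left, ann["description"]))
                break
        else:
            rows.append({"top": top, "bottom": bottom,
                         "words": [(left, ann["description"])]})
    rows.sort(key=lambda r: r["top"])  # top of the page first
    # Within a row, order the words from left to right.
    return [" ".join(text for _, text in sorted(r["words"])) for r in rows]


def _box(x1, y1, x2, y2):
    """Hypothetical helper: rectangle as a boundingPoly dict."""
    return {"vertices": [{"x": x1, "y": y1}, {"x": x2, "y": y1},
                         {"x": x2, "y": y2}, {"x": x1, "y": y2}]}


words = [
    {"description": "start", "boundingPoly": _box(446, 1406, 540, 1448)},
    {"description": "Hello", "boundingPoly": _box(87, 148, 411, 206)},
    {"description": "you", "boundingPoly": _box(362, 1406, 433, 1448)},
]
print(group_words_by_line(words))
```

Matching on both edges is strict; skewed scans or mixed font sizes would probably need a larger tolerance or an overlap test instead.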

I get the max and min y and iterate over y to collect all potential lines. Here is the full code:
import io
import sys
from os import listdir

from google.cloud import vision


def read_image(image_file):
    client = vision.ImageAnnotatorClient()
    with io.open(image_file, "rb") as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    return client.document_text_detection(
        image=image,
        image_context={"language_hints": ["bg"]}
    )


def extract_paragraphs(image_file):
    response = read_image(image_file)

    # find the vertical extent of all detected text
    min_y = sys.maxsize
    max_y = -1
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        if t_min < min_y:
            min_y = t_min
        if t_max > max_y:
            max_y = t_max
    max_size = max_y - min_y

    text_boxes = []
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_x = get_poly_x(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        poly_size = t_max - t_min
        text_boxes.append({
            'min_y': t_min,
            'max_y': t_max,
            'x': t_x,
            'size': poly_size,
            'description': t.description
        })

    # scan every y between min_y and max_y and collect the boxes it crosses
    paragraphs = []
    for i in range(min_y, max_y):
        para_line = []
        for text_box in text_boxes:
            t_min = text_box['min_y']
            t_max = text_box['max_y']
            x = text_box['x']
            size = text_box['size']
            # size < max_size excludes the biggest rect
            if size < max_size * 0.9 and t_min <= i <= t_max:
                para_line.append(
                    {
                        'text': text_box['description'],
                        'x': x
                    }
                )
        # here I have to sort them by x so they don't get randomly shuffled
        para_line = sorted(para_line, key=lambda x: x['x'])
        line = " ".join(map(lambda x: x['text'], para_line))
        paragraphs.append(line)
        # if line not in paragraphs:
        #     paragraphs.append(line)
    return "\n".join(paragraphs)


def get_poly_y_range(poly):
    y_list = []
    for v in poly.vertices:
        if v.y not in y_list:
            y_list.append(v.y)
    return y_list


def get_poly_x(poly):
    return poly.vertices[0].x


# NOTE: rootPics, outputRoot and write() are not defined in this snippet
def extract_paragraphs_from_image(picName):
    print(picName)
    pic_path = rootPics + "/" + picName
    text = extract_paragraphs(pic_path)
    text_path = outputRoot + "/" + picName + ".txt"
    write(text_path, text)
This code is a work in progress.
In the end, I get the same line multiple times (in the paragraphs variable) and use post-processing to determine the exact values. Let me know if I need to clarify anything.
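The post-processing itself is not shown. One possible approach, sketched here as an assumption and not as the author's code, is to treat neighbouring scan rows as the same physical line when one is contained in the other, and keep only the longest of them. The function name and the sample rows are made up for illustration.

```python
def collapse_scan_rows(rows):
    """Collapse runs of overlapping scan rows to their longest member."""
    kept = []
    for row in rows:
        row = " ".join(row.split())  # normalise whitespace
        if not row:
            continue  # this y value crossed no word boxes
        if kept and (row in kept[-1] or kept[-1] in row):
            # Same physical line seen at another y: keep the longer text.
            if len(row) > len(kept[-1]):
                kept[-1] = row
        else:
            kept.append(row)
    return kept


scan_rows = [
    "", "Name:", "Name: John", "Name: John Doe", "John Doe", "Doe",
    "", "DOB:", "DOB: 01-Jan-1970", "01-Jan-1970",
]
print(collapse_scan_rows(scan_rows))
```

The containment test is a plain substring check, so a very short row can be absorbed by an unrelated neighbour that happens to contain it; comparing word lists instead would be stricter.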

Inspired by Borislav's answer, I just wrote something for python that also works for handwriting. It's messy and I am new to python, but I think you can get an idea of how to do this.
A class to hold some extended data for each word, for example, the average y position of a word, which I used to calculate the differences between words:
import re
from operator import attrgetter

import numpy as np


class ExtendedAnnotation:
    def __init__(self, annotation):
        self.vertex = annotation.bounding_poly.vertices
        self.text = annotation.description
        self.avg_y = (self.vertex[0].y + self.vertex[1].y + self.vertex[2].y + self.vertex[3].y) / 4
        self.height = ((self.vertex[3].y - self.vertex[1].y) + (self.vertex[2].y - self.vertex[0].y)) / 2
        self.start_x = (self.vertex[0].x + self.vertex[3].x) / 2

    def __repr__(self):
        return '{' + self.text + ', ' + str(self.avg_y) + ', ' + str(self.height) + ', ' + str(self.start_x) + '}'
Create objects with that data:
def get_extended_annotations(response):
    extended_annotations = []
    for annotation in response.text_annotations:
        extended_annotations.append(ExtendedAnnotation(annotation))
    # delete the first item, as it is the whole text I guess.
    del extended_annotations[0]
    return extended_annotations
Calculate the threshold.
First, all words are sorted by their y position, defined as the average of all 4 corners of a word. The x position is not relevant at this point.
Then, the differences between every word and their following word are calculated. For a perfectly straight line of words, you would expect the differences of the y position between every two words to be 0. Even for handwriting, it should be around 1 ~ 10.
However, whenever there is a line break, the difference between the last word of the former row and the first word of the new row is much greater than that, for example, 50 or 60.
So to decide whether there should be a line break between two words, the standard deviation of the differences is used.
def get_threshold_for_y_difference(annotations):
    annotations.sort(key=attrgetter('avg_y'))
    differences = []
    for i in range(0, len(annotations)):
        if i == 0:
            continue
        differences.append(abs(annotations[i].avg_y - annotations[i - 1].avg_y))
    return np.std(differences)
Having calculated the threshold, the list of all words gets grouped into rows accordingly.
def group_annotations(annotations, threshold):
    annotations.sort(key=attrgetter('avg_y'))
    line_index = 0
    text = [[]]
    for i in range(0, len(annotations)):
        if i == 0:
            text[line_index].append(annotations[i])
            continue
        y_difference = abs(annotations[i].avg_y - annotations[i - 1].avg_y)
        if y_difference > threshold:
            line_index = line_index + 1
            text.append([])
        text[line_index].append(annotations[i])
    return text
Finally, each row is sorted by x position to put its words into the correct order from left to right.
Then a little regex is used to remove the whitespace in front of punctuation.
def sort_and_combine_grouped_annotations(annotation_lists):
    grouped_list = []
    for annotation_group in annotation_lists:
        annotation_group.sort(key=attrgetter('start_x'))
        texts = (o.text for o in annotation_group)
        texts = ' '.join(texts)
        texts = re.sub(r'\s([-;:?.!](?:\s|$))', r'\1', texts)
        grouped_list.append(texts)
    return grouped_list
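To see the steps working together without calling the API, here is a compact, self-contained sketch of the same pipeline run on synthetic words. The `Word` tuple, the `words_to_lines` name and the coordinates are stand-ins invented for this example; `statistics.pstdev` replaces `np.std`, which also computes the population standard deviation by default.

```python
import re
import statistics
from collections import namedtuple

# Hypothetical stand-in for a word annotation: text, left edge x, average y.
Word = namedtuple("Word", "text start_x avg_y")


def words_to_lines(words):
    """Group words into rows by y gaps, then order each row by x."""
    ordered = sorted(words, key=lambda w: w.avg_y)
    gaps = [b.avg_y - a.avg_y for a, b in zip(ordered, ordered[1:])]
    threshold = statistics.pstdev(gaps) if gaps else 0.0

    rows = [[ordered[0]]] if ordered else []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.avg_y - prev.avg_y > threshold:
            rows.append([])  # gap larger than the threshold: new row
        rows[-1].append(cur)

    lines = []
    for row in rows:
        text = " ".join(w.text for w in sorted(row, key=lambda w: w.start_x))
        # Same regex as above: drop the space in front of punctuation.
        lines.append(re.sub(r"\s([-;:?.!](?:\s|$))", r"\1", text))
    return lines


words = [
    Word("Doe", 120, 12), Word("Name", 10, 10), Word(":", 50, 11),
    Word("John", 70, 12), Word("DOB", 10, 60), Word(":", 45, 61),
    Word("01-Jan-1970", 60, 60),
]
print(words_to_lines(words))
```

One caveat that follows from the heuristic: if all words sit on a single row, the gaps are nearly identical, the standard deviation is close to zero, and almost every gap exceeds it, so a minimum threshold may be needed for such inputs.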

Based on Borislav Stoilov's latest answer, I wrote the code in C# for anybody who might need it in the future. Find the code below:
public static List<TextParagraph> ExtractParagraphs(IReadOnlyList<EntityAnnotation> textAnnotations)
{
    var min_y = int.MaxValue;
    var max_y = -1;
    foreach (var item in textAnnotations)
    {
        var poly_range = Get_poly_y_range(item.BoundingPoly);
        var t_min = poly_range.Min();
        var t_max = poly_range.Max();
        if (t_min < min_y) min_y = t_min;
        if (t_max > max_y) max_y = t_max;
    }
    var max_size = max_y - min_y;
    var text_boxes = new List<TextBox>();
    foreach (var item in textAnnotations)
    {
        var poly_range = Get_poly_y_range(item.BoundingPoly);
        var t_x = Get_poly_x(item.BoundingPoly);
        var t_min = poly_range.Min();
        var t_max = poly_range.Max();
        var poly_size = t_max - t_min;
        text_boxes.Add(new TextBox
        {
            Min_y = t_min,
            Max_y = t_max,
            X = t_x,
            Size = poly_size,
            Description = item.Description
        });
    }
    var paragraphs = new List<TextParagraph>();
    for (int i = min_y; i < max_y; i++)
    {
        var para_line = new List<TextLine>();
        foreach (var text_box in text_boxes)
        {
            int t_min = text_box.Min_y;
            int t_max = text_box.Max_y;
            int x = text_box.X;
            int size = text_box.Size;
            // size < max_size excludes the biggest rect
            if (size < (max_size * 0.9) && t_min <= i && i <= t_max)
                para_line.Add(
                    new TextLine
                    {
                        Text = text_box.Description,
                        X = x
                    }
                );
        }
        // here I have to sort them by x so they don't get randomly shuffled
        para_line = para_line.OrderBy(x => x.X).ToList();
        var line = string.Join(" ", para_line.Select(x => x.Text));
        var paragraph = new TextParagraph
        {
            Order = i,
            Text = line,
            WordCount = para_line.Count,
            TextBoxes = para_line
        };
        paragraphs.Add(paragraph);
    }
    return paragraphs;
    //return string.Join("\n", paragraphs);
}

private static List<int> Get_poly_y_range(BoundingPoly poly)
{
    var y_list = new List<int>();
    foreach (var v in poly.Vertices)
    {
        if (!y_list.Contains(v.Y))
        {
            y_list.Add(v.Y);
        }
    }
    return y_list;
}

private static int Get_poly_x(BoundingPoly poly)
{
    return poly.Vertices[0].X;
}
Calling the ExtractParagraphs() method returns a list that contains duplicates of the text in the file. I also wrote some custom code to deal with that problem. If you need any help processing the duplicates, let me know and I can provide the rest of the code.
Example:
Text in picture: "I want to make this thing work 24/7!"
Code will return:
"I"
"I want"
"I want to "
"I want to make"
"I want to make this"
"I want to make this thing"
"I want to make this thing work"
"I want to make this thing work 24/7!"
"to make this thing work 24/7!"
"this thing work 24/7!"
"thing work 24/7!"
"work 24/7!"
"24/7!"
I also have an implementation that converts PDFs to PNGs, because the Google Cloud Vision API won't accept PDFs that are not stored in a Cloud Storage bucket. If needed, I can provide it.
Happy coding!


Karate; Counting # of K:V pairs within an object in a json array

For debugging purposes before writing out tests, I am looking to get the number of key:value pairs within the one object in the array.
Right now, I have this:
"items": [
{
"id": "6b0051ad-721d-blah-blah-4dab9cf39ff4",
"external_id": "blahvekmce",
"filename": "foo-text_field-XYGLVU",
"created_date": "2019-02-11T04:10:31Z",
"last_update_date": "2019-02-11T04:10:31Z",
"file_upload_date": "2019-02-11T04:10:31Z",
"deleted_date": null,
"released_and_not_expired": true,
"asset_properties": null,
"file_properties": null,
"thumbnails": null,
"embeds": null
}
]
When I write out:
* print response.items.length // returns 1
When I write out:
* print response.items[0].length
it doesn't return anything.
Any thoughts on how I can approach this?

There are multiple ways, but this should work, plus you see how to get the keys as well:
* def keys = []
* eval karate.forEach(response.items[0], function(x){ keys.add(x) })
* def count = keys.length
* match count == 12
Refer to the docs: https://github.com/intuit/karate#json-transforms

Karate now provides the karate.sizeOf() API to get the number of entries in an object.
* def object = { a: 1, b: 'hello' }
* def count = karate.sizeOf(object)
* match count == 2
Ref: https://github.com/karatelabs/karate#the-karate-object

count = 0
for (var v in response.items[0]) {
count = count + 1;
}
print(count)

Can I refer to a map property in its declaration?

For example, is it possible to do something like this (this fails):
def map = [ property: 1,
propertyPlusOne: map.property + 1]
Of course, it's possible to do it this way:
def map = [:]
map.property = 1
map.propertyPlusOne = map.property + 1
But can it all be done in the declaration?

You could use a with declaration:
def map = [ : ].with {
property = 1
propertyPlusOne = property + 1
it
}
assert map.propertyPlusOne == 2
Though something like ruby's tap (or #timyates' extension) is slightly cleaner:
def map = [ : ].tap {
property = 1
propertyPlusOne = property + 1
}
assert map.propertyPlusOne == 2

Generally not.
You have to define and initialize your map var first, to be able to set values:
def map = [ property: 1 ]
map += [ propertyPlusOne: map.property + 1]
I'm not sure what you are up to, but it might be worth checking the withDefault() method.

How can I create new map with new values but same keys from an existing map?

I have an existing map in Groovy.
I want to create a new map that has the same keys but different values in it.
Eg.:
def scores = ["vanilla":10, "chocolate":9, "papaya": 0]
//transformed into
def preference = ["vanilla":"love", "chocolate":"love", "papaya": "hate"]
Any way of doing it through some sort of closure like:
def preference = scores.collect {//something}

You can use collectEntries
scores.collectEntries { k, v ->
[ k, 'new value' ]
}
An alternative to using a map for the ranges would be to use a switch
def grade = { score ->
switch( score ) {
case 10..9: return 'love'
case 8..6: return 'like'
case 5..2: return 'meh'
case 1..0: return 'hate'
default : return 'ERR'
}
}
scores.collectEntries { k, v -> [ k, grade( v ) ] }

A nice, functional-style solution (it includes your ranges and is easy to modify):
def scores = [vanilla:10, chocolate:9, papaya: 0]
// Store somewhere
def map = [(10..9):"love", (8..6):"like", (5..2):"meh", (1..0):"hate"]
def preference = scores.collectEntries { key, score -> [key, map.find { score in it.key }.value] }
// Output: [vanilla:love, chocolate:love, papaya:hate]

def scores = ["vanilla":10, "chocolate":9, "papaya": 0]
def preference = scores.collectEntries {key, value -> ["$key":(value > 5 ? "like" : "hate")]}
Then the result would be
[vanilla:like, chocolate:like, papaya:hate]
EDIT: If you want a map, then you should use collectEntries like tim_yates said.

DC.js histogram of crossfilter dimension counts

I have a crossfilter with the following data structure being inputted.
project | subproject | cost
data = [
["PrA", "SubPr1", 100],
["PrA", "SubPr2", 150],
["PrA", "SubPr3", 100],
["PrB", "SubPr4", 300],
["PrB", "SubPr5", 500],
["PrC", "SubPr6", 450]]
I can create a barchart that has the summed cost per project:
var ndx = crossfilter(data)
var projDim = ndx.dimension(function(d){return d.project;});
var projGroup = projDim.group().reduceSum(function(d){return d.budget;});
What I want to do is create a dc.js histogram by project cost...so {450: 2, 300: 1}, etc. As far as I can tell, crossfilter can have only attributes of each row be input for a dimension. Is there a way around this?

Accepting the challenge!
It is true, crossfilter does not support this kind of double-reduction, but if you are willing to accept a slight loss of efficiency, you can create "fake dimensions" and "fake groups" with the desired behavior. Luckily, dc.js doesn't use very much of the crossfilter API, so you don't have to implement too many methods.
The first part of the trick is to duplicate the dimension and group so that the new dimension and old dimension will each observe filtering on the other.
The second part is to create the fake groups and dimensions, which walk the bins of the copied group and rebin and refilter based on the values instead of the keys.
A start of a general solution is below. For some charts it is also necessary to implement group.top(), and it is usually okay to just forward that to group.all().
function values_dimension(dim, group) {
return {
filter: function(v) {
if(v !== null)
throw new Error("don't know how to do this!");
return dim.filter(null);
},
filterFunction: function(f) {
var f2 = [];
group.all().forEach(function(kv) {
if(f(kv.value))
f2.push(kv.key);
});
dim.filterFunction(function(k) {
return f2.indexOf(k) >= 0;
});
return this;
}
};
}
function values_group(group) {
return {
all: function() {
var byv = [];
group.all().forEach(function(kv) {
if(kv.value === 0)
return;
byv[kv.value] = (byv[kv.value] || 0) + 1;
});
var all2 = [];
byv.forEach(function(d, i) {
all2.push({key: i, value: d});
});
return all2;
}
};
}
// duplicate the dimension & group so each will observe filtering on the other
var projDim2 = ndx.dimension(function(d){return d.project;});
var projGroup2 = projDim2.group().reduceSum(function(d){return d.budget;});
var countBudgetDim = values_dimension(projDim2, projGroup2),
countBudgetGroup = values_group(projGroup2);
jsfiddle here: http://jsfiddle.net/gordonwoodhull/55zf7L1L/
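For readers who want to see the target result independently of crossfilter, the double reduction can be sketched in a few lines of plain Python: first sum the budget per project, then count how many projects share each total. This is only a language-neutral illustration of what the fake group's all() computes, not a replacement for the JavaScript above; the rows use the budget figures from the other answer in this thread.

```python
from collections import Counter, defaultdict

rows = [
    ("PrA", "SubPr1", 100),
    ("PrA", "SubPr2", 150),
    ("PrA", "SubPr3", 200),
    ("PrB", "SubPr4", 300),
    ("PrB", "SubPr5", 500),
    ("PrC", "SubPr6", 450),
]


def histogram_of_totals(rows):
    """Count how many projects share each summed budget."""
    totals = defaultdict(int)
    for project, _subproject, budget in rows:
        totals[project] += budget    # first reduction: sum per project
    return Counter(totals.values())  # second reduction: count per total


print(dict(histogram_of_totals(rows)))
```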

JSFiddle link
Denormalize + map-reduce. Note that the data already includes the cost per project as the 4th column (this can easily be pre-calculated). It's a hack, but hopefully an easy one that gets dc.js and crossfilter to work without too much change.
var data = [
["PrA", "SubPr1", 100, 450],
["PrA", "SubPr2", 150, 450],
["PrA", "SubPr3", 200, 450],
["PrB", "SubPr4", 300, 800],
["PrB", "SubPr5", 500, 800],
["PrC", "SubPr6", 450, 450]
];
var newdata = data.map(function (d) {
return {
project: d[0],
subproject: d[1],
budget: d[2],
cost: d[3]
};
})
var ndx = crossfilter(newdata),
costDim = ndx.dimension(function (d) {
return d.cost;
}),
visitedProj = {},
costGroup = costDim.group().reduce(function (p, v) {
if (visitedProj[v.project]) return p;
console.info(v.project);
visitedProj[v.project] = true;
return p + 1;
}, null, function () {
return 0;
});
dc.rowChart("#costChart")
.renderLabel(true)
.dimension(costDim)
.group(costGroup)
.xAxis().ticks(2);
dc.renderAll();
Map-reduce can be very powerful, and the API can be accessed from here. JSFiddle link

Simple flot graph not updating

I'm trying to create a simple flot line graph and update it on a timer, and I only want to display the last 10 points of data. But I only ever see the axes and not the plotted line. Also, I see the x axis change with the extra data, but the y axis remains the same and does not correspond to the additional data. My code is as follows:
var dataSet = [];
var PlotData;
var x = 0;
var y = 0;
var plot = null;
function EveryOneSec()
{
if (dataSet.length == 10)
{
dataSet.shift();
}
x++;
y += 2;
dataSet("[" + x + ", " + y + "]");
PlotData = { label: "line 1", data: [ dataSet ], color: "green" };
if (plot == null)
{
plot = $.plot($("#placeholder"), [ PlotData ], { lines: {show: true}, points: {show: true}});
}
else
{
plot.setData([ PlotData ]);
plot.setupGrid();
plot.draw();
}
setTimeout(EveryOneSec, 1000);
}
I have tried with and without the call to setupGrid() but this makes no difference to the axis display or graph plot. The x axis stop changing when I get the ticks 0 to 9 plotted even though x is incrementing past that, and the y axis remains static. I believe the code is correct above in terms of passing arrays of data, so why is the graph not appearing?

OK, you have two problems here.
First, you're not appending to your dataSet correctly. I'm not sure what the syntax you've got is doing, but what you need in each slot of the array is [x,y], which you can achieve with Array.push.
This:
dataSet("[" + x + ", " + y + "]");
Should look like this:
dataSet.push([x , y]);
And when you create your series object PlotData, you don't need to store your data inside of another array, so instead of this:
PlotData = { label: "line 1", data: [ dataSet ], color: "green" };
You need this:
PlotData = { label: "line 1", data: dataSet , color: "green" };
See it working here: http://jsfiddle.net/ryleyb/qJEXH/
