Is there a good way to use crossfilter to query multiple dimensions for unique values, not aggregates? - crossfilter

I've got a big set of data loaded into crossfilter for a dc.js project I'm working on. Of course a number of my columns have repeated values in them and I'd like to be able to find the unique list of values in one column that correspond to the repeated values in another. The sample data below probably illustrates the point more clearly.
var data = [
{ state: "WA", city: "Seattle", data: "a" },
{ state: "WA", city: "Seattle", data: "b" },
{ state: "WA", city: "Tacoma", data: "c" },
{ state: "OR", city: "Portland", data: "d" },
{ state: "OR", city: "Bend", data: "e" },
{ state: "OR", city: "Bend", data: "f" }
];
I'd like to be able to filter on a particular state and then find the unique list of cities for that state. So, if the input was "WA", I'd like to get back a two-element array containing "Seattle" and "Tacoma". The code below actually does exactly that (and also provides the counts, though I really don't care about those), but having to create a second crossfilter object feels very clumsy to me. I also don't know about the performance, since I'll end up having to iterate through this several times, once for each state.
var Ndx = crossfilter(data);
var stateDim = Ndx.dimension(function (d) { return d.state; });
var cityDim = Ndx.dimension(function (d) { return d.city; });
var stateFilter = stateDim.filter("WA");
var stateRows = stateFilter.top(Infinity);
// It seems like there should be a better way than this.
var cityNdx = crossfilter(stateRows);
var cityDim2 = cityNdx.dimension(function (d) { return d.city; });
var cites = cityDim2.group().top(Infinity);
cites.forEach(function(d) {
console.log("City: " + d.key + ", Count: " + d.value);
});
/* --------------------------- *\
Log output:
City: Seattle, Count: 2
City: Tacoma, Count: 1
\* --------------------------- */
It seems like there should be a way to get this kind of result with some filtering, grouping, or reducing strategy, but after spending way too much time trying, I haven't been able to come up with one. All the examples I've seen that use multiple dimensions produce aggregates, but that's not what I need. I need values. Is there a better way to go about this?

I'd use a custom reduce function to keep an array of all city values that have appeared for a given state. Something like the following (completely untested - sorry) should work:
var Ndx = crossfilter(data);
var stateDim = Ndx.dimension(function (d) { return d.state; });
var stateGroup = stateDim.group().reduce(
  // add: called when a record enters the group
  function(p, v) {
    p.count++;
    if(p.uniques.indexOf(v.city) === -1) p.uniques.push(v.city);
    return p; // crossfilter expects the updated value to be returned
  },
  // remove: called when a record is filtered out of the group
  function(p, v) {
    p.count--;
    // Note: uniques are not filtered. You need to use a map and keep
    // count of uniques to have uniques that match your current filter
    return p;
  },
  // initial value for each group
  function() {
    return { count: 0, uniques: [] };
  }
);
stateGroup.top(Infinity).forEach( function(g) {
console.log("State " + g.key + " has count " + g.value.count);
console.log("Unique cities in " + g.key + ":");
g.value.uniques.forEach(function (c) {
console.log(c);
});
});
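The note in the remove function above suggests keeping a count per unique value. A minimal sketch of that idea follows, written as plain reducer functions so it can be tried without crossfilter. The names reduceAdd, reduceRemove, reduceInitial, uniqueCities and the cityCounts field are illustrative assumptions, not part of the original answer:

```javascript
// Hypothetical reducers: track how many rows contribute each city, so a
// city drops out of the list once all of its rows have been removed.
function reduceInitial() {
  return { count: 0, cityCounts: {} };
}

function reduceAdd(p, v) {
  p.count++;
  p.cityCounts[v.city] = (p.cityCounts[v.city] || 0) + 1;
  return p;
}

function reduceRemove(p, v) {
  p.count--;
  p.cityCounts[v.city]--;
  if (p.cityCounts[v.city] === 0) delete p.cityCounts[v.city];
  return p;
}

// Read the cities that currently have at least one row.
function uniqueCities(p) {
  return Object.keys(p.cityCounts);
}

// Simulate what would happen to the "WA" group as rows are added.
var waRows = [
  { state: "WA", city: "Seattle", data: "a" },
  { state: "WA", city: "Seattle", data: "b" },
  { state: "WA", city: "Tacoma", data: "c" }
];
var wa = waRows.reduce(reduceAdd, reduceInitial());
console.log(uniqueCities(wa));
```

With crossfilter these would presumably be passed as stateDim.group().reduce(reduceAdd, reduceRemove, reduceInitial), and uniqueCities(g.value) would replace g.value.uniques when printing.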

Related

Kotlin - group elements by a key under some conditions with new value type

I'm trying to find a way to use Kotlin collection operation to do some logic that I'm going to explain:
Let's say the type Classroom contains a list of Student as a field, e.g. classroom.getStudents() returns a list of certain students.
Now I have a list of mixed Student that I need to group by one of its fields say major, and the value of the resultant map to be Classroom.
So I need to convert List<Student> to Map<Student.major, Classroom>
Also, for some values of major, for example for all major == chemistry, I'll need to group by another criterion, say firstname, so the keys for major chemistry would be major_firstname.
Here's an example, I have a list of Student(major, firstname):
[
Student("chemistry", "rafael"),
Student("physics", "adam"),
Student("chemistry", "michael"),
Student("math", "jack"),
Student("chemistry", "rafael"),
Student("biology", "kevin")
]
I need the result to be:
{
"math" -> Classroom(Student("math", "jack")),
"physics" -> Classroom(Student("physics", "adam")),
"chemistry_michael" -> Classroom(Student("chemistry", "michael")),
"chemistry_rafael" -> Classroom(Student("chemistry", "rafael"), Student("chemistry", "rafael")),
"biology" -> Classroom(Student("biology", "kevin"))
}
I've tried groupBy, flatMapTo and associateBy, but as far as I understand none of these groups by a certain condition.
I will try to answer the 1st part as Roland posted an answer for the 2nd part (although I did not try it).
Assuming your classes are:
class Student(val major: String, val firstName: String)
class Classroom(val studentList: MutableList<Student>) {
fun getStudents(): MutableList<Student> {
return studentList
}
}
and with an initialization like:
val list = mutableListOf<Student>(
Student("chemistry", "rafael"),
Student("physics", "adam"),
Student("chemistry", "michael"),
Student("math", "jack"),
Student("chemistry", "rafael"),
Student("biology", "kevin"))
val classroom = Classroom(list)
val allStudents = classroom.getStudents()
you can have a result list:
val finalList: MutableList<Pair<String, Classroom>> = mutableListOf()
allStudents.map { it.major }.distinctBy { it }.forEach { major ->
finalList.add(major to Classroom(allStudents.filter { it.major == major }.toMutableList()))
}
so by the below code:
finalList.forEach {
println(it.first + "->")
it.second.getStudents().forEach { println(" " + it.major + ", " + it.firstName) }
}
this will be printed:
chemistry->
chemistry, rafael
chemistry, michael
chemistry, rafael
physics->
physics, adam
math->
math, jack
biology->
biology, kevin
It's actually the mixture of those methods which you require. There are also other ways to achieve it, but here is one possible example using groupBy and flatMap:
val result = students.groupBy { it.major }
.flatMap { (key, values) -> when (key) {
"chemistry" -> values.map { it.firstname }
.distinct()
.map { firstname -> "chemistry_$firstname" to ClassRoom(values.filter { it.firstname == firstname }) }
else -> listOf(key to ClassRoom(values))
}
}.toMap()
Assuming the following data classes:
data class Student(val major: String, val firstname: String)
data class ClassRoom(val students : List<Student>)
If you also want a map with all students grouped by major, the following suffices:
val studentsPerMajor = students.groupBy { it.major }
.map { (major, values) -> major to ClassRoom(values) }
If you then rather want to continue working with that map instead of recalculating everything from the source, it's also possible, e.g. the following will then return your desired map based on the studentsPerMajor:
val result = studentsPerMajor.flatMap { (key, classroom) -> when (key) {
"chemistry" -> classroom.students.map { it.firstname }
.distinct()
.map { firstname -> "chemistry_$firstname" to ClassRoom(classroom.students.filter { it.firstname == firstname }) }
else -> listOf(key to classroom)
}
}.toMap()

Text extraction - line-by-line

I am using the Google Vision API, primarily to extract text. It works fine, except in specific cases where I need the API to scan an entire line and output its text before moving to the next line. However, it appears that the API uses some kind of logic that makes it scan top to bottom on the left side, then move to the right side and do another top-to-bottom scan. I would have liked the API to read left to right, then move down, and so on.
For example, consider the image:
The API returns the text like this:
“ Name DOB Gender: Lives In John Doe 01-Jan-1970 LA ”
Whereas, I would have expected something like this:
“ Name: John Doe DOB: 01-Jan-1970 Gender: M Lives In: LA ”
I suppose there is a way to define the block size or margin setting (?) to read the image/scan line by line?
Thanks for your help.
Alex
This might be a late answer but adding it for future reference.
You can add feature hints to your JSON request to get the desired results.
{
"requests": [
{
"image": {
"source": {
"imageUri": "https://i.stack.imgur.com/TRTXo.png"
}
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
For text that is very far apart, DOCUMENT_TEXT_DETECTION also does not provide proper line segmentation.
The following code does simple line segmentation based on the character polygon coordinates.
https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
Here is some simple code to read line by line. It uses the y-axis to group words into lines and the x-axis to order the words within each line.
items = []
lines = {}

for text in response.text_annotations[1:]:
    top_x_axis = text.bounding_poly.vertices[0].x
    top_y_axis = text.bounding_poly.vertices[0].y
    bottom_y_axis = text.bounding_poly.vertices[3].y

    if top_y_axis not in lines:
        lines[top_y_axis] = [(top_y_axis, bottom_y_axis), []]

    for s_top_y_axis, s_item in lines.items():
        if top_y_axis < s_item[0][1]:
            lines[s_top_y_axis][1].append((top_x_axis, text.description))
            break

for _, item in lines.items():
    if item[1]:
        words = sorted(item[1], key=lambda t: t[0])
        items.append((item[0], ' '.join([word for _, word in words]), words))

print(items)
You can also extract the text based on the bounds per line: use boundingPoly and concatenate the text that falls on the same line.
"boundingPoly": {
"vertices": [
{
"x": 87,
"y": 148
},
{
"x": 411,
"y": 148
},
{
"x": 411,
"y": 206
},
{
"x": 87,
"y": 206
}
]
For example, these two words are on the same "line":
"description": "you",
"boundingPoly": {
"vertices": [
{
"x": 362,
"y": 1406
},
{
"x": 433,
"y": 1406
},
{
"x": 433,
"y": 1448
},
{
"x": 362,
"y": 1448
}
]
}
},
{
"description": "start",
"boundingPoly": {
"vertices": [
{
"x": 446,
"y": 1406
},
{
"x": 540,
"y": 1406
},
{
"x": 540,
"y": 1448
},
{
"x": 446,
"y": 1448
}
]
}
}
I get max and min y and iterate over y to get all potential lines, here is the full code
import io
import sys
from os import listdir

from google.cloud import vision


def read_image(image_file):
    client = vision.ImageAnnotatorClient()

    with io.open(image_file, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    return client.document_text_detection(
        image=image,
        image_context={"language_hints": ["bg"]}
    )


def extract_paragraphs(image_file):
    response = read_image(image_file)

    # find the overall vertical extent of all annotations
    min_y = sys.maxsize
    max_y = -1
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        if t_min < min_y:
            min_y = t_min
        if t_max > max_y:
            max_y = t_max
    max_size = max_y - min_y

    text_boxes = []
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_x = get_poly_x(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        poly_size = t_max - t_min
        text_boxes.append({
            'min_y': t_min,
            'max_y': t_max,
            'x': t_x,
            'size': poly_size,
            'description': t.description
        })

    paragraphs = []
    for i in range(min_y, max_y):
        para_line = []
        for text_box in text_boxes:
            t_min = text_box['min_y']
            t_max = text_box['max_y']
            x = text_box['x']
            size = text_box['size']

            # size < max_size excludes the biggest rect
            if size < max_size * 0.9 and t_min <= i <= t_max:
                para_line.append(
                    {
                        'text': text_box['description'],
                        'x': x
                    }
                )
        # here I have to sort them by x so they don't get randomly shuffled
        para_line = sorted(para_line, key=lambda x: x['x'])
        line = " ".join(map(lambda x: x['text'], para_line))
        paragraphs.append(line)
        # if line not in paragraphs:
        #     paragraphs.append(line)

    return "\n".join(paragraphs)


def get_poly_y_range(poly):
    y_list = []
    for v in poly.vertices:
        if v.y not in y_list:
            y_list.append(v.y)
    return y_list


def get_poly_x(poly):
    return poly.vertices[0].x


def extract_paragraphs_from_image(picName):
    # rootPics, outputRoot and write() are not shown in this snippet
    print(picName)
    pic_path = rootPics + "/" + picName

    text = extract_paragraphs(pic_path)

    text_path = outputRoot + "/" + picName + ".txt"
    write(text_path, text)
This code is WIP.
In the end, I get the same line multiple times (in the paragraphs variable) and do some post-processing to determine the exact values. Let me know if I need to clarify anything.
Inspired by Borislav's answer, I just wrote something for python that also works for handwriting. It's messy and I am new to python, but I think you can get an idea of how to do this.
A class to hold some extended data for each word, for example, the average y position of a word, which I used to calculate the differences between words:
import re
from operator import attrgetter

import numpy as np


class ExtendedAnnotation:
    def __init__(self, annotation):
        self.vertex = annotation.bounding_poly.vertices
        self.text = annotation.description
        self.avg_y = (self.vertex[0].y + self.vertex[1].y + self.vertex[2].y + self.vertex[3].y) / 4
        self.height = ((self.vertex[3].y - self.vertex[1].y) + (self.vertex[2].y - self.vertex[0].y)) / 2
        self.start_x = (self.vertex[0].x + self.vertex[3].x) / 2

    def __repr__(self):
        return '{' + self.text + ', ' + str(self.avg_y) + ', ' + str(self.height) + ', ' + str(self.start_x) + '}'
Create objects with that data:
def get_extended_annotations(response):
    extended_annotations = []
    for annotation in response.text_annotations:
        extended_annotations.append(ExtendedAnnotation(annotation))

    # delete the first item, as it is the whole text I guess.
    del extended_annotations[0]
    return extended_annotations
Calculate the threshold.
First, all words are sorted by their y position, defined as the average of all 4 corners of a word. The x position is not relevant at this moment.
Then, the differences between every word and their following word are calculated. For a perfectly straight line of words, you would expect the differences of the y position between every two words to be 0. Even for handwriting, it should be around 1 ~ 10.
However, whenever there is a line break, the difference between the last word of the former row and the first word of the new row is much greater than that, for example, 50 or 60.
So to decide whether there should be a line break between two words, the standard deviation of the differences is used.
def get_threshold_for_y_difference(annotations):
    annotations.sort(key=attrgetter('avg_y'))
    differences = []
    for i in range(0, len(annotations)):
        if i == 0:
            continue
        differences.append(abs(annotations[i].avg_y - annotations[i - 1].avg_y))
    return np.std(differences)
Having calculated the threshold, the list of all words gets grouped into rows accordingly.
def group_annotations(annotations, threshold):
    annotations.sort(key=attrgetter('avg_y'))
    line_index = 0
    text = [[]]
    for i in range(0, len(annotations)):
        if i == 0:
            text[line_index].append(annotations[i])
            continue
        y_difference = abs(annotations[i].avg_y - annotations[i - 1].avg_y)
        if y_difference > threshold:
            line_index = line_index + 1
            text.append([])
        text[line_index].append(annotations[i])
    return text
Finally, each row is sorted by their x position to get them into the correct order from left to right.
Then a little regex is used to remove whitespace in front of punctuation.
def sort_and_combine_grouped_annotations(annotation_lists):
    grouped_list = []
    for annotation_group in annotation_lists:
        annotation_group.sort(key=attrgetter('start_x'))
        texts = (o.text for o in annotation_group)
        texts = ' '.join(texts)
        texts = re.sub(r'\s([-;:?.!](?:\s|$))', r'\1', texts)
        grouped_list.append(texts)
    return grouped_list
Based on Borislav Stoilov's latest answer, I wrote the code in C# for anybody who might need it in the future. Find the code below:
public static List<TextParagraph> ExtractParagraphs(IReadOnlyList<EntityAnnotation> textAnnotations)
{
var min_y = int.MaxValue;
var max_y = -1;
foreach (var item in textAnnotations)
{
var poly_range = Get_poly_y_range(item.BoundingPoly);
var t_min = poly_range.Min();
var t_max = poly_range.Max();
if (t_min < min_y) min_y = t_min;
if (t_max > max_y) max_y = t_max;
}
var max_size = max_y - min_y;
var text_boxes = new List<TextBox>();
foreach (var item in textAnnotations)
{
var poly_range = Get_poly_y_range(item.BoundingPoly);
var t_x = Get_poly_x(item.BoundingPoly);
var t_min = poly_range.Min();
var t_max = poly_range.Max();
var poly_size = t_max - t_min;
text_boxes.Add(new TextBox
{
Min_y = t_min,
Max_y = t_max,
X = t_x,
Size = poly_size,
Description = item.Description
});
}
var paragraphs = new List<TextParagraph>();
for (int i = min_y; i < max_y; i++)
{
var para_line = new List<TextLine>();
foreach (var text_box in text_boxes)
{
int t_min = text_box.Min_y;
int t_max = text_box.Max_y;
int x = text_box.X;
int size = text_box.Size;
//# size < max_size excludes the biggest rect
if (size < (max_size * 0.9) && t_min <= i && i <= t_max)
para_line.Add(
new TextLine
{
Text = text_box.Description,
X = x
}
);
}
// here I have to sort them by x so they don't get randomly shuffled
para_line = para_line.OrderBy(x => x.X).ToList();
var line = string.Join(" ", para_line.Select(x => x.Text));
var paragraph = new TextParagraph
{
Order = i,
Text = line,
WordCount = para_line.Count,
TextBoxes = para_line
};
paragraphs.Add(paragraph);
}
return paragraphs;
//return string.Join("\n", paragraphs);
}
private static List<int> Get_poly_y_range(BoundingPoly poly)
{
var y_list = new List<int>();
foreach (var v in poly.Vertices)
{
if (!y_list.Contains(v.Y))
{
y_list.Add(v.Y);
}
}
return y_list;
}
private static int Get_poly_x(BoundingPoly poly)
{
return poly.Vertices[0].X;
}
Calling the ExtractParagraphs() method returns a list of TextParagraph objects whose text contains duplicates (overlapping partial lines) from the file. I also wrote some custom code to deal with that problem. If you need any help processing the duplicates, let me know and I can provide the rest of the code.
Example:
Text in picture: "I want to make this thing work 24/7!"
Code will return:
"I"
"I want"
"I want to "
"I want to make"
"I want to make this"
"I want to make this thing"
"I want to make this thing work"
"I want to make this thing work 24/7!"
"to make this thing work 24/7!"
"this thing work 24/7!"
"thing work 24/7!"
"work 24/7!"
"24/7!"
I also have an implementation for converting PDFs to PNGs, because the Google Cloud Vision API won't accept PDFs that are not stored in a Cloud bucket. If needed, I can provide it.
Happy coding!

how to get the specific format in crossfilter.js i.e distinct of distinct count

this is my format of data:
[{city:"Bhopal",id: 1},{city:"Bhopal",id: 2},{city:"Delhi",id: 3},{city:"Delhi",id:3}]
Here Delhi is repeated twice with the same id.
Now I need a count per city of the distinct ids, i.e. something like:
[{key: "Bhopal", value: 2}, {key: "Delhi", value: 1}]
where value is the count.
I got the answer using Reductio and Crossfilter.
var payments = crossfilter([
{city: "Bhopal", id: 1},{city: "Bhopal", id: 2},{city: "Delhi", id: 3},{city: "Delhi", id: 3}
]);
var dim = payments.dimension(function(d) { return d.city; });
var group = dim.group();
var reducer = reductio()
.exception(function(d) { return d.id; })
.exceptionCount(true);
reducer(group);
console.log(group.top(Infinity));
output: [ { key: 'Bhopal', value: { exceptionCount: 2 } }, { key: 'Delhi', value: { exceptionCount: 1 } } ]
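For readers who want to see what this distinct count amounts to without either library, here is a rough plain-JavaScript sketch. distinctCountBy is a hypothetical helper, not part of crossfilter or reductio, and unlike a crossfilter group it does not react to filters:

```javascript
// Hypothetical helper: for each key, count how many distinct values occur.
function distinctCountBy(rows, keyFn, valueFn) {
  var seen = {}; // key -> Set of distinct values seen for that key
  rows.forEach(function (d) {
    var k = keyFn(d);
    if (!seen[k]) seen[k] = new Set();
    seen[k].add(valueFn(d));
  });
  return Object.keys(seen).map(function (k) {
    return { key: k, value: seen[k].size };
  });
}

var cityRows = [
  { city: "Bhopal", id: 1 }, { city: "Bhopal", id: 2 },
  { city: "Delhi", id: 3 }, { city: "Delhi", id: 3 }
];

var distinctCounts = distinctCountBy(
  cityRows,
  function (d) { return d.city; },
  function (d) { return d.id; }
);
console.log(distinctCounts);
```

Because the repeated Delhi row shares its id, it only contributes once to the Delhi count.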

.pluck returning undefined in Meteor

Trying to pull a list of ratings from a collection of Reviews and then average them to come up with an aggregated average rating for a Plate. When I look at the data output from the ratings variable I get nothing but "undefined undefined undefined".
averageRating: function() {
var reviews = Reviews.findOne({plateId: this._id});
var ratings = _.pluck(reviews, 'rating');
var sum = ratings.reduce(function(pv, cv){return pv + cv;}, 0);
var avg = sum / ratings.length;
//Testing output
var test = "";
var x;
for (x in reviews) {
text += reviews[x] + ',';
}
return test;
}
Sorry if this is a super newbie question, but I've been at this for hours and cannot figure it out.
I figured out the issue. As listed above, var reviews was set to the result of findOne, which is a single document rather than an array of documents, so .pluck does not work on it as intended. By using find() and then converting the resulting cursor to an array of objects with fetch(), I was able to use .pluck. The updated code looks like this:
averageRating: function() {
var reviewsCursor = Reviews.find({plateId: this._id});
//Converts cursor to an array of objects
var reviews = reviewsCursor.fetch();
var ratings = _.pluck(reviews, 'rating');
var sum = ratings.reduce(function(pv, cv){return pv + cv;}, 0);
var avg = (sum / ratings.length).toPrecision(2);
return avg;
}
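The averaging step can be tried outside Meteor with a plain array standing in for the result of fetch(). This is only a sketch: averageOfRatings and the sample data are made up for illustration, Array.prototype.map is used in place of _.pluck, and the empty-array guard is an addition not present in the code above:

```javascript
// Hypothetical stand-alone version of the averaging logic.
function averageOfRatings(reviews) {
  if (reviews.length === 0) return null; // avoid dividing by zero
  var ratings = reviews.map(function (r) { return r.rating; });
  var sum = ratings.reduce(function (pv, cv) { return pv + cv; }, 0);
  return (sum / ratings.length).toPrecision(2);
}

// Stand-in for Reviews.find({plateId: ...}).fetch()
var sampleReviews = [{ rating: 4 }, { rating: 5 }, { rating: 3 }];
var sampleAverage = averageOfRatings(sampleReviews);
console.log(sampleAverage);
```

Note that toPrecision returns a string, so the result needs Number(...) if it is used in further arithmetic.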

DC.js histogram of crossfilter dimension counts

I have a crossfilter with the following data structure being inputted.
project | subproject | cost
data = [
["PrA", "SubPr1", 100],
["PrA", "SubPr2", 150],
["PrA", "SubPr3", 100],
["PrB", "SubPr4", 300],
["PrB", "SubPr5", 500],
["PrC", "SubPr6", 450]]
I can create a barchart that has the summed cost per project:
var ndx = crossfilter(data)
var projDim = ndx.dimension(function(d){return d.project;});
var projGroup = projDim.group().reduceSum(function(d){return d.budget;});
What I want to do is create a dc.js histogram by project cost...so {450: 2, 300: 1}, etc. As far as I can tell, crossfilter can have only attributes of each row be input for a dimension. Is there a way around this?
Accepting the challenge!
It is true, crossfilter does not support this kind of double-reduction, but if you are willing to accept a slight loss of efficiency, you can create "fake dimensions" and "fake groups" with the desired behavior. Luckily, dc.js doesn't use very much of the crossfilter API, so you don't have to implement too many methods.
The first part of the trick is to duplicate the dimension and group so that the new dimension and old dimension will each observe filtering on the other.
The second part is to create the fake groups and dimensions, which walk the bins of the copied group and rebin and refilter based on the values instead of the keys.
A start of a general solution is below. For some charts it is also necessary to implement group.top(), and it is usually okay to just forward that to group.all().
function values_dimension(dim, group) {
return {
filter: function(v) {
if(v !== null)
throw new Error("don't know how to do this!");
return dim.filter(null);
},
filterFunction: function(f) {
var f2 = [];
group.all().forEach(function(kv) {
if(f(kv.value))
f2.push(kv.key);
});
dim.filterFunction(function(k) {
return f2.indexOf(k) >= 0;
});
return this;
}
};
}
function values_group(group) {
return {
all: function() {
var byv = [];
group.all().forEach(function(kv) {
if(kv.value === 0)
return;
byv[kv.value] = (byv[kv.value] || 0) + 1;
});
var all2 = [];
byv.forEach(function(d, i) {
all2.push({key: i, value: d});
});
return all2;
}
};
}
// duplicate the dimension & group so each will observe filtering on the other
var projDim2 = ndx.dimension(function(d){return d.project;});
var projGroup2 = projDim2.group().reduceSum(function(d){return d.budget;});
var countBudgetDim = values_dimension(projDim2, projGroup2),
countBudgetGroup = values_group(projGroup2);
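To see what the fake group produces, here is a small standalone sketch that applies the same rebinning to a hand-written stand-in for group.all(). The rebinByValue name and the stub totals are assumptions for illustration, and it uses a plain object keyed by value instead of the sparse array used above:

```javascript
// Hypothetical helper: turn [{key: project, value: total}] bins into
// [{key: total, value: number of projects with that total}].
function rebinByValue(bins) {
  var counts = {};
  bins.forEach(function (kv) {
    if (kv.value === 0) return; // skip bins emptied by filters
    counts[kv.value] = (counts[kv.value] || 0) + 1;
  });
  return Object.keys(counts)
    .map(Number)
    .sort(function (a, b) { return a - b; })
    .map(function (v) { return { key: v, value: counts[v] }; });
}

// Stand-in for projGroup2.all(): summed budget per project.
var projectTotals = [
  { key: "PrA", value: 450 },
  { key: "PrB", value: 800 },
  { key: "PrC", value: 450 }
];

var histogramBins = rebinByValue(projectTotals);
console.log(histogramBins);
```

Two projects share the total 450 and one has 800, so the result has one bin per distinct total, which is the shape a dc.js bar chart expects from group.all().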
jsfiddle here: http://jsfiddle.net/gordonwoodhull/55zf7L1L/
Denormalize + map-reduce. Note that the data already includes the cost per project as the 4th column (this can easily be pre-calculated). It's a hack, but hopefully an easy one that gets dc.js and crossfilter working without too much change.
var data = [
["PrA", "SubPr1", 100, 450],
["PrA", "SubPr2", 150, 450],
["PrA", "SubPr3", 200, 450],
["PrB", "SubPr4", 300, 800],
["PrB", "SubPr5", 500, 800],
["PrC", "SubPr6", 450, 450]
];
var newdata = data.map(function (d) {
return {
project: d[0],
subproject: d[1],
budget: d[2],
cost: d[3]
};
})
var ndx = crossfilter(newdata),
costDim = ndx.dimension(function (d) {
return d.cost;
}),
visitedProj = {},
costGroup = costDim.group().reduce(function (p, v) {
if (visitedProj[v.project]) return p;
console.info(v.project);
visitedProj[v.project] = true;
return p + 1;
}, null, function () {
return 0;
});
dc.rowChart("#costChart")
.renderLabel(true)
.dimension(costDim)
.group(costGroup)
.xAxis().ticks(2);
dc.renderAll();
Map-Reduce can be very powerful and the API can be accessed from here.
