How are the graphs on SQLite made? [duplicate]

Possible Duplicate:
Tool for generating railroad diagram used on json.org
SQLite has some awesome graphs showing the grammar of the language on their website. Does anyone know how these are made?
Is there a tool for generating graphs from grammars?

This example looks a lot like a finite automaton, i.e. the graph equivalent of a regular expression. If you can represent your grammar as an RE (naturally, not all grammars will be representable as REs!), you can use Kleene's theorem to translate it into an FA graph.
Note that the alphabet in question for the REs is not single letters, but words and tokens. In the above example, the corresponding RE looks like:
DELETE FROM qualified-table-name
(WHERE expr|()) /* "WHERE expr" is optional; the alternative branch is the empty expression "()" */
(
(ORDER BY ordering-term (, ordering-term)*|()) /* ", ordering-term" may be repeated */
LIMIT expr ((OFFSET|,) expr|()) /* can use "OFFSET" or "," */
|()
)
This translates into an FA very similar to your diagram. GraphViz will do a passable job of drawing it legibly.
However that's not quite the same as the original, is it? Presenting it nicely is the next challenge. I'd suggest taking the nested RE expressions and rendering them recursively, starting at the leaves.
For example, to render (WHERE expr|()):
1. Two alternate paths. Render each separately:
   a. Render WHERE expr:
      - Render WHERE as a box.
      - Then an arrow.
      - Then render expr as a box.
   b. Render () as a single arrow.
2. Find the longest (the first one) and stretch the others to match it.
3. Create start and end nodes at each end.
4. Draw connecting edges from the start node to each subpart, then from each subpart to the end node.
Doing this graphically means keeping track of box sizes and positions, including invisible boxes. There is an invisible box round each subpart. There are three things to note about the recursive structure:
The size of a part depends on the sizes of its children.
The location of a subpart depends on the location of its parent.
Locations of everything (may) depend on the sizes of everything else.
This implies that you should first calculate the sizes of each part, starting at the bottom. Then, once you know the size of the root, you can start positioning the parts, top-down.
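The two passes can be sketched roughly as follows. This is only an illustration of the idea, not how the SQLite diagrams are actually produced: the Node class, the measure/place functions and all the size constants are made up for the example, and no drawing is done, only sizes and positions are computed.

```python
from dataclasses import dataclass, field
from typing import List

# All units and constants below are arbitrary assumptions for the sketch.
H_GAP = 2   # length of the arrow between two items of a sequence
V_GAP = 1   # vertical gap between alternative branches
BOX_H = 1   # height of a box (and of a bare arrow)
PAD = 2     # room for the start/end nodes and connecting edges of a choice


@dataclass
class Node:
    kind: str                      # "term", "empty", "seq" or "choice"
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    w: int = 0
    h: int = 0
    x: int = 0
    y: int = 0


def term(text): return Node("term", text)
def empty(): return Node("empty")
def seq(*children): return Node("seq", children=list(children))
def choice(*children): return Node("choice", children=list(children))


def measure(n: Node) -> None:
    """Pass 1, bottom-up: the size of a part depends on its children."""
    for c in n.children:
        measure(c)
    if n.kind == "term":
        n.w, n.h = len(n.text) + 2, BOX_H
    elif n.kind == "empty":        # "()" is rendered as a single arrow
        n.w, n.h = 0, BOX_H
    elif n.kind == "seq":
        n.w = sum(c.w for c in n.children) + H_GAP * (len(n.children) - 1)
        n.h = max(c.h for c in n.children)
    elif n.kind == "choice":
        widest = max(c.w for c in n.children)
        for c in n.children:       # stretch the shorter branches to match
            c.w = widest
        n.w = widest + 2 * PAD
        n.h = sum(c.h for c in n.children) + V_GAP * (len(n.children) - 1)


def place(n: Node, x: int, y: int) -> None:
    """Pass 2, top-down: the location of a part depends on its parent."""
    n.x, n.y = x, y
    if n.kind == "seq":
        cx = x
        for c in n.children:
            place(c, cx, y)
            cx += c.w + H_GAP
    elif n.kind == "choice":
        cy = y
        for c in n.children:
            place(c, x + PAD, cy)
            cy += c.h + V_GAP


# (WHERE expr|())
root = choice(seq(term("WHERE"), term("expr")), empty())
measure(root)
place(root, 0, 0)
```

After the two calls every node carries its invisible bounding box (x, y, w, h), which is what a drawing step would need to emit boxes and arrows.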

Related

python-docx get words position and attributes

I'm looking for a means to extract the position (x, y) and attributes (font / size) of every word in a document.
From the python-docx docs, I know that:
Conceptually, Word documents have two layers, a text layer and a
drawing layer. In the text layer, text objects are flowed from left to
right and from top to bottom, starting a new page when the prior one
is filled. In the drawing layer, drawing objects, called shapes, are
placed at arbitrary positions. These are sometimes referred to as
floating shapes.
A picture is a shape that can appear in either the text or drawing layer. When it appears in the text layer it is called an inline shape,
or more specifically, an inline picture.
[...] At the time of writing, python-docx only supports inline pictures.
Yet, even if it is not the gist of it, I'm wondering if something similar exists:
from docx import Document
main_file = Document("/tmp/file.docx")
for paragraph in main_file.paragraphs:
    for word in paragraph.text:  # <= Non-existing (yet wished) functionalities, IMHO
        print(word.x, word.y)  # <= Non-existing (yet wished) functionalities, IMHO
Does anybody have an idea?
Best,
Arthur
for word in paragraph.text: # <= Non-existing (yet wished) functionalities, IMHO
This functionality is provided right in the Python library as str.split(). These can be composed easily as:
for word in paragraph.text.split():
    ...
Regarding
print(word.x, word.y) # <= Non-existing (yet wished) functionalities, IMHO
I think it's safe to say this functionality will never appear in python-docx, and if it did it could not look like this.
What such a feature would be doing is asking the page renderer for the location at which the renderer was going to place those characters. python-docx has no rendering engine (because it does not render documents); it is simply a fancy XML editor that selectively modifies XML files in the WordprocessingML vocabulary.
It may be possible to get these values from Word itself, because Word does have a rendering engine (which it uses for screen display and printing).
If there were such a function, I expect it would take a paragraph and a character offset within that paragraph, or something more along those lines, like document.position(paragraph, offset=42) or perhaps paragraph.position(offset=42).
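The attribute half of the question (font / size) does not need a renderer, since character formatting is stored on runs. A minimal sketch, assuming the python-docx attributes paragraph.runs, run.text, run.font.name and run.font.size; the helper name words_with_font is made up here, and the stand-in objects only exist so the snippet runs without a .docx file:

```python
from types import SimpleNamespace


def words_with_font(paragraphs):
    """Yield (word, font_name, font_size) for every word of every run.

    Works on anything shaped like python-docx paragraphs, i.e. objects
    with .runs, where each run has .text and .font.name / .font.size.
    """
    for paragraph in paragraphs:
        for run in paragraph.runs:
            for word in run.text.split():
                yield word, run.font.name, run.font.size


# Stand-in objects (hypothetical data) mimicking the shape of a paragraph.
# With the real library this would be something like:
#   from docx import Document
#   words_with_font(Document("/tmp/file.docx").paragraphs)
bold_font = SimpleNamespace(name="Arial", size=12)
inherited = SimpleNamespace(name=None, size=None)
paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="Hello big", font=bold_font),
    SimpleNamespace(text=" world", font=inherited),
])

for word, name, size in words_with_font([paragraph]):
    print(word, name, size)
```

Two caveats: name and size are None when the value is inherited from a style rather than set on the run itself, and a word that is split across two runs comes out as two pieces.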

FontForge Interpolation Producing Jumbled Results

I'm making a multi-weight font from a Thin weight and a Heavy weight. The glyphs that were correctly interpolated look good, but the ones that weren't look jumbled and terrible. (I know it looks like Verdana, don't remind me)
I will provide the two fonts as raw .sfd files, and as .otf exports. Could you help me look into this bug?
Check the number of points, the position of the first point (you can set the first point with CTRL-1), and the direction of the path.
Interpolation becomes especially tricky when there are multiple paths (e.g. any character with an enclosed counter such as o or e) or when it contains multiple references (letters with diacritics, for instance). You need to match up not just the points in each path, but also the order of the different paths and the order of different references. You can reorder paths and references by cutting and re-pasting them; this will move them to the top of the stack.
Unfortunately, path and reference ordering are not displayed. You can number your points (View>Number Points>SVG) which helps somewhat with ordering paths, though lines vs curves get numbered differently, so numbering won't always match exactly even between glyphs that interpolate just fine; also, this numbering lasts only as long as that glyph window is open; and none of this tells you anything about the ordering of references. It's a pain.
I usually just start cutting and re-pasting and use a process of elimination until I get it right.
Make your contours compatible, i.e. the same number of points and the same start points across masters.
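If you can get the outlines out as plain coordinate lists, part of this check can be automated. The sketch below does not use the FontForge API; it assumes each glyph is a list of closed paths and each path is a list of (x, y) points, which is a made-up representation for illustration. It compares path count, point count and winding direction, but it cannot tell whether the start points or the path order correspond.

```python
def signed_area(path):
    """Shoelace formula: positive for counter-clockwise, negative for clockwise."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:] + path[:1]):
        area += x0 * y1 - x1 * y0
    return area / 2.0


def compare_glyphs(thin, heavy):
    """Return a list of human-readable incompatibilities between two masters."""
    problems = []
    if len(thin) != len(heavy):
        problems.append(f"path count differs: {len(thin)} vs {len(heavy)}")
    for i, (a, b) in enumerate(zip(thin, heavy)):
        if len(a) != len(b):
            problems.append(f"path {i}: point count differs: {len(a)} vs {len(b)}")
        if (signed_area(a) > 0) != (signed_area(b) > 0):
            problems.append(f"path {i}: direction differs")
    return problems


# Hypothetical data: the same square, once counter-clockwise, once clockwise.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(compare_glyphs([square], [square[::-1]]))
```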
I fixed the problem by pasting each path onto a new glyph, corresponding glyphs in the same order.

How can I make DOT/neato graphs more compact without introducing overlap?

My question is essentially the same as this one but the given answer doesn't work for me.
Here is a sample rendering (source) with
compound=true;
overlap=scalexy;
splines=true;
layout=neato;
There is some unnecessary overlap in the edges, but this isn't too bad; the main problem is all the wasted space.
I tried setting sep=-0.7; and here's what happens.
The spacing is much better but now there is some overlap with the nodes. I experimented with different overlap parameters and this is the only one which gives remotely acceptable results.
I tried changing to fdp layout and setting the spring constant attribute K globally but I just got stuff like this:
The source is all straightforward a--b--c sort of stuff, no fancy tricks.
What I want is for all edges to be shortened as much as possible (up to a minimum) provided that this adjustment doesn't introduce any new overlaps, which is where sep fails completely. That doesn't seem like it should be too hard for a layout engine to do. Is it possible with the graphviz suite? I don't mind changing rendering software, but I don't want to annotate the source on a per-node or per-edge basis.
My ideal result would be to minimize the deviation in edge length, considered one node at a time, i.e. each node would have edges of equal length apart from the necessary exceptions, but that's wishful thinking. The priority is to reduce the length of each edge with the constraint that this cannot introduce overlap.
I will accept partial solutions but they must be fully automatic and open source.
How can I do this? Thanks.
I found https://sites.google.com/site/kuabus/programming-by-hu/graphviz-test-tool, an interactive tool for parameterizing the many options and repeatedly rendering them. I went through the full list provided by the Java application, eventually ending up with this set of attributes:
overlap=false
maxiter=99999999
damping=9999999
voro_margin=.001
start=0.1
K=1
nodesep=999999999999
labelloc=c
defaultdist=9999999
size=20,20
sep=+1
normalize=99999999
labeljust=l
outputorder=nodesfirst
concentrate=true
mindist=2
fontsize=99999999
center=true
scale=.01
inputscale=99999999
levelsgap=9999999
epsilon=0.0001
I was not able to find a parameterization of neato that made producing the desired "moderately scaled" graph possible.
You should set
overlap = compress;
This should compress it as much as possible.
Try sep = "+1"; first (quoted, because +1 is not a valid unquoted ID in DOT), and then play with values between 0 and +1 to find the optimal setting for you.
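Since finding the right sep is trial and error, it may help to generate one DOT file per candidate value and compare the renderings side by side. A small sketch; the function names neato_source and sep_sweep are invented for this example, and it assumes node names that are valid unquoted DOT IDs:

```python
def neato_source(edges, overlap="compress", sep="+1"):
    """Build DOT source for an undirected graph laid out with neato."""
    lines = [
        "graph G {",
        "    layout=neato;",
        f"    overlap={overlap};",
        f'    sep="{sep}";',   # quoted because of the leading '+'
        "    splines=true;",
    ]
    lines += [f"    {a} -- {b};" for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


def sep_sweep(edges, values=(1.0, 0.75, 0.5, 0.25, 0.0)):
    """Return {sep value: DOT source} for each candidate between 0 and +1."""
    return {f"+{v:g}": neato_source(edges, sep=f"+{v:g}") for v in values}


edges = [("a", "b"), ("b", "c"), ("c", "a")]
for sep, source in sep_sweep(edges).items():
    print(f"// sep={sep}")
    print(source)
```

Each source can be written to a file and rendered with something like neato -Tpng graph.dot -o graph.png.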
I have a graph with 50 nodes and 68 edges (sorry, I cannot publish the whole picture, just a fragment). I found two reasonable presets (1 and 2):
digraph {
    graph [
        // 1. Fewer overlaps but less compact.
        //    This is the choice for now.
        layout=neato; overlap=prism; overlap_scaling=-3.5;
        // 2. More compact but some overlaps exist (may be adjusted by `sep`).
        // layout=neato; overlap=voronoi; sep=-0.15;
        // The following is common.
        outputorder=nodesfirst; // Nodes are drawn first, so edges are always drawn over nodes.
        splines=curved;
    ]
    node [fontname="Helvetica"];
    node [shape=box; style="filled"; penwidth="0.5"; width=0; height=0; margin="0.05,0.05"];
    edge [label=" "; color="#000080"; penwidth="0.5"; arrowhead="open"; arrowsize="0.7"];
    . . .
}

OCR and character similarity

I am currently working on some kind of OCR (Optical Character Recognition) system. I have already written a script to extract each character from the text and clean (most of the) irregularities out of it. I also know the font. The images I have now for example are:
M (http://i.imgur.com/oRfSOsJ.png (font) and http://i.imgur.com/UDEJZyV.png (scanned))
K (http://i.imgur.com/PluXtDz.png (font) and http://i.imgur.com/TRuDXSx.png (scanned))
C (http://i.imgur.com/wggsX6M.png (font) and http://i.imgur.com/GF9vClh.png (scanned))
For all of these images I already have a sort of binary matrix (1 for black, 0 for white). I was now wondering if there was some kind of mathematical projection-like formula to see the similarity between these matrices. I do not want to rely on a library, because that was not the task given to me.
I know this question may seem a bit vague and there are similar questions, but I'm looking for the method, not for a package, and so far I couldn't find any comments regarding the method. The reason this question is vague is that I really have no starting point. What I want to do is actually described here on Wikipedia:
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition".[9] This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique the early physical photocell-based OCR implemented, rather directly. (http://en.wikipedia.org/wiki/Optical_character_recognition#Character_recognition)
If anyone could help me out on this one, I would appreciate it very much.
For recognition or classification, most OCR systems use neural networks.
These must be properly configured for the desired task (number of layers, internal interconnection architecture, and so on). Another problem with neural networks is that they must be properly trained, which is hard to do well: you need to know things like the proper training dataset size (so that it contains enough information without over-training the network). If you do not have experience with neural networks, do not go this way if you need to implement it yourself!
There are also other ways to compare patterns:
vector approach
  - polygonize the image (edges or border)
  - compare the polygons' similarity (surface area, perimeter, shape, ...)
pixel approach
  You can compare images based on:
  - histogram
  - DFT/DCT spectral analysis
  - size
  - number of occupied pixels in each line
  - start position of the occupied pixels in each line (from the left)
  - end position of the occupied pixels in each line (from the right)
  - the previous three parameters can also be computed for rows
  - list of points of interest (points where something changes, like an intensity bump, an edge, ...)
You create a feature list for each tested character, compare it to the feature lists of your font, and the closest match is your character. These feature lists can also be scaled to some fixed size (like 64x64) so that the recognition becomes invariant to scaling.
Here is a sample of the features I use for OCR.
In this case the features are scaled to fit in NxN, so each character has 6 arrays of N numbers, like:
int row_pixels[N]; // 1st image
int lin_pixels[N]; // 2nd image
int row_y0[N]; // 3rd image green
int row_y1[N]; // 3rd image red
int lin_x0[N]; // 4th image green
int lin_x1[N]; // 4th image red
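For illustration, here is one way such arrays could be computed from the binary matrix in the question (1 for black, 0 for white). This is a sketch under my reading of the array names, not the answerer's actual code: lin_* is taken to mean one value per horizontal scan line (x positions), row_* one value per vertical column (y positions), and -1 marks an empty line or column. The image is assumed to be already scaled to NxN.

```python
def profile_features(img):
    """img: list of scan lines, each a list of 0/1 pixels (1 = black)."""
    height, width = len(img), len(img[0])
    columns = [[img[y][x] for y in range(height)] for x in range(width)]

    def first(pixels):
        # index of the first occupied pixel, or -1 if there is none
        return next((i for i, p in enumerate(pixels) if p), -1)

    def last(pixels):
        # index of the last occupied pixel, or -1 if there is none
        return next((i for i in range(len(pixels) - 1, -1, -1) if pixels[i]), -1)

    return {
        "row_pixels": [sum(c) for c in columns],   # occupied pixels per column
        "lin_pixels": [sum(l) for l in img],       # occupied pixels per line
        "row_y0": [first(c) for c in columns],     # first occupied y per column
        "row_y1": [last(c) for c in columns],      # last occupied y per column
        "lin_x0": [first(l) for l in img],         # first occupied x per line
        "lin_x1": [last(l) for l in img],          # last occupied x per line
    }


# A tiny 3x3 "L" as hypothetical input.
letter_l = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
]
for name, values in profile_features(letter_l).items():
    print(name, values)
```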
Now pre-compute all features for each character in your font and for each character you read. Then find the closest match in the font:
  - minimum distance between the feature vectors/arrays
  - not exceeding some threshold difference
This is partially invariant to rotation and skew, up to a point. I do OCR for filled characters, so for an outlined font it may need some tweaking.
[Notes]
For the comparison you can use a distance or a correlation coefficient.
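Both comparison measures, plus the "closest match under a threshold" rule, fit in a few lines without any library. A minimal sketch, assuming each character's features have been flattened into one list of numbers; the function names, the sample font and the threshold value are all made up for the example:

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def classify(features, font, threshold):
    """Return the font character with the smallest distance to `features`,
    or None if even the best match exceeds `threshold`."""
    best_char, best_dist = None, float("inf")
    for char, reference in font.items():
        d = distance(features, reference)
        if d < best_dist:
            best_char, best_dist = char, d
    return best_char if best_dist <= threshold else None


# Hypothetical pre-computed font features and one scanned character.
font = {"I": [1, 1, 1], "L": [1, 1, 3]}
print(classify([1, 1, 2.8], font, threshold=1.0))
print(classify([9, 9, 9], font, threshold=1.0))
```

With a correlation coefficient the rule flips: pick the largest value and reject matches below a minimum.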

Very general question about the implementation of vectors, vertices, edges, rays, lines & line segments

This is just a LARGE generalized question regarding rays (and/or line segments or edges etc) and their place in a software rendered 3d engine that is/not performing raytracing operations. I'm learning the basics and I'm the first to admit that I don't know much about this stuff so please be kind. :)
I wondered why a parameterized line is not used instead of a ray (or is it?). I have looked at a few cpp files around the internet and seen a couple of resources define a Ray.cpp object, one with a vertex and a vector, another with a point and a vector. I'm pretty sure that you can define an infinite line with only a normal or a vector and then define intersecting points along that line to create a line segment as a subset of that infinite line. Are there any current engines implementing lines in this way, or is there a better way to go about this?
To add further complication (or simplicity?), Wikipedia says that in a vector space the end points of a line segment are often vectors, notably u -> u + v. That makes a lot of sense if a line is defined by vectors in space rather than by intersecting an already defined, infinite line, but I cannot find any implementation of this either, which makes me wonder about the validity of my thoughts when applying this in a 3D engine. Even further complication arises when looking at the Flash 3D engine, Papervision: I looked at the Ray class, and it takes 6 individual number values as its parameters and then returns them as 2 different Number3D (the Papervision equivalent of a Vector) data types?!?
I'd be very interested to see an implementation of something which actually uses the CORRECT way of implementing these low level parts as per their true definitions.
I'm pretty sure that you can define an infinite line with only a normal or a vector
No, you can't. A vector would define a direction of the line, but all the parallel lines share the same direction, so to pick one, you need to pin it down using a specific point that the line passes through.
Lines are typically defined in Origin + Direction*K form, where K would take any real value, because that form is easy for other math. You could as well use two points on the line.
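To make the Origin + Direction*K form concrete, here is a small sketch. The Line class and its method names are invented for illustration and not taken from any engine. The same two values describe a line, a ray or a segment; only the allowed range of K changes: any real K for a line, K >= 0 for a ray, and 0 <= K <= 1 for the segment between the two points passed to through().

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Line:
    origin: Vec3      # a specific point the line passes through
    direction: Vec3   # direction of the line (not necessarily unit length)

    @staticmethod
    def through(p: Vec3, q: Vec3) -> "Line":
        """Two-point form: origin p, direction q - p."""
        return Line(p, tuple(b - a for a, b in zip(p, q)))

    def at(self, k: float) -> Vec3:
        """Point at parameter k: origin + direction * k."""
        return tuple(o + d * k for o, d in zip(self.origin, self.direction))


line = Line.through((0.0, 0.0, 0.0), (2.0, 4.0, 6.0))
print(line.at(0.0))   # the first point
print(line.at(1.0))   # the second point
print(line.at(0.5))   # midpoint of the segment
print(line.at(-1.0))  # still on the line, but not on the ray or the segment
```

This also matches the u -> u + v description from the question: u is the origin and v the direction, with K running from 0 to 1.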
