Convert database data into a graph with TinkerPop tools

I am looking for graph software that will create a graph from a database automatically. Exploring the TinkerPop documentation, the provided tutorials discuss querying ready-made graphs, but there is not much about creating graphs from a database. Is it possible to use any of the tools in the TinkerPop suite to automatically convert data from a database into a graph ready for querying?
Let's say we have an event stream like this:
event_type=create_file name="filename.txt" handle=1
event_type=read handle=1 data="file content"
event_type=write handle=1 data="new file content"
event_type=close handle=1
Is there a way to convert the event stream into a graph automatically by specifying which properties to follow for creating edges? For example, by selecting the "handle" property I should get:
create_file-->read-->write-->close
All the examples I could find show how to do something like
add_node create_file
add_node read
add_node write
add_node close
followed by adding all desired edges manually.
Thank you for your help.

I just came across http://neo4j-contrib.github.io/neo4j-etl-components/, which is a very interesting tool. It takes an RDBMS schema and generates a graph representation, turning foreign keys and join tables into relationships and (other) tables into nodes in the graph. It then generates CSV files to load into a graph database, for either wholesale or incremental updates.
I don't know if there's an equivalent tool for TinkerPop, but since much of the work (reading the SQL schema, mapping tables with foreign keys into vertices and edges) is already done in this open source project, it could be a good starting point.
The tool's output appears to depend on having a clean source data model, and the mapping may be naive. It does look configurable, though: when it guesses wrongly about which tables are vertices and which are edges, you can override it.
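To make the mapping concrete, here is a naive sketch in Python of the rule described above (the input structure is an assumption made up for illustration; the real tool reads the SQL schema itself):

def classify_tables(tables):
    """Naive sketch of the schema-to-graph rule described above.

    `tables` maps table name -> {"columns": [...], "foreign_keys": {col: target}}.
    Both the structure and the rule's simplicity are assumptions.
    """
    mapping = {}
    for name, t in tables.items():
        fks = t["foreign_keys"]
        data_cols = [c for c in t["columns"] if c not in fks]
        if len(fks) == 2 and not data_cols:
            # A pure join table becomes a relationship between two node labels.
            src, dst = fks.values()
            mapping[name] = ("edge", src, dst)
        else:
            # An ordinary table becomes a node label; each remaining foreign
            # key becomes an edge from this label to the referenced one.
            mapping[name] = ("node", name, list(fks.values()))
    return mapping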

What you are suggesting is simply not possible out of the box. A graph database is very different from a traditional relational database. The biggest difference is that graphs are naturally unstructured, which allows for more flexibility and manipulation; traditional relational or tabular databases are more rigidly structured, which provides less flexibility but easier control and querying.
As stated in the answer provided here, you also should not be using your original database as a frame of reference. You should instead be thinking about how to remodel your data as a graph so as to take advantage of what graphs offer.
For example, a traversal in a graph, as opposed to a query in a tabular DB, is a lot more flexible (and arguably more powerful) but harder to construct and formalise.
There is a lot of good material providing guidelines on how to approach this problem [1] [2] [3] [4]. Unfortunately, though, there is no good automated migration tool at the moment.
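That said, the event-stream example from the question is easy enough to script yourself. A minimal sketch with gremlin-python (the server URL and the "next" edge label are assumptions), chaining together events that share the chosen "handle" property:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# The parsed event stream from the question.
events = [
    {"event_type": "create_file", "name": "filename.txt", "handle": 1},
    {"event_type": "read", "handle": 1, "data": "file content"},
    {"event_type": "write", "handle": 1, "data": "new file content"},
    {"event_type": "close", "handle": 1},
]

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

last_seen = {}  # handle -> most recently created vertex
for e in events:
    v = g.addV(e["event_type"]).next()
    if e["handle"] in last_seen:
        # Follow the chosen property ("handle") to chain events with edges.
        g.V(last_seen[e["handle"]]).addE("next").to(v).iterate()
    last_seen[e["handle"]] = v

conn.close()
# The graph now contains create_file-->read-->write-->close for handle 1.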

Related

How to enforce typing in gremlin/tinkerpop?

In something like SQL, when I create a table I can add type constraints (strings with certain lengths, booleans, etc.).
How do I do that in Gremlin? I am using a JavaScript implementation, and I know I can switch to TypeScript and add a lot of type enforcement on that side, but ideally I would also like to have type constraints on the database side too.
I know I can switch to TypeScript and add a lot of type enforcement on that side
There is an open issue for TypeScript support at TINKERPOP-2027, and while there was some activity there, no one has really picked up the work.
I would also like to have type constraints on the database side too.
Constraints in the database are not a feature of TinkerPop. For 3.x, we committed long ago to letting graph providers who implement the TinkerPop interfaces supply their own functionality for this. There are a lot of historical reasons for that which I won't detail here, but the basic answer to your question is that if you want such functionality, you need to choose a graph that offers it. Perhaps take a look at JanusGraph or DS Graph, as both have a fairly robust schema language.
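For example, JanusGraph defines its schema through a management API. A minimal sketch, assuming a JanusGraph server at localhost:8182 (submitted from Python via gremlin-python here; the JavaScript driver can submit the same script):

from gremlin_python.driver import client

# Connection details are assumptions for this sketch.
c = client.Client("ws://localhost:8182/gremlin", "g")

# JanusGraph's management API enforces property types on the database side.
schema_script = """
mgmt = graph.openManagement()
mgmt.makePropertyKey('name').dataType(String.class).make()
mgmt.makePropertyKey('age').dataType(Integer.class).make()
mgmt.commit()
"""
c.submit(schema_script).all().result()
c.close()

Once the keys are defined, JanusGraph rejects writes of the wrong type for them at the database level rather than in your application code.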

TinkerPop Blueprints and Frames - How to treat data as collections

I found this question How to store and retrieve different types of Vertices with the Tinkerpop/Blueprints graph API?
But the answer there isn't clear to me.
How do I query the 10 most recent articles vs. the 10 most recently registered users for an API, for instance? Put graphing aside for a moment: this stack must be able to handle both a collection/document paradigm and a graphing paradigm (relations between elements of these TYPES of collections). I've read everything, including the source, and I'm getting closer but not quite there.
Closest document I've read is the multitenant article here: http://architects.dzone.com/articles/multitenant-graph-applications
But it focuses on using Gremlin for graph segregation and querying, which I'm hoping to avoid until I require analysis on the graphs.
I'm considering using Cassandra/Hadoop at this point with a reference to the graph id, but I can see this biting me down the road.
Indexes in TinkerPop/Blueprints only support simple lookups based on exact matches of property keys.
If you want to find the 10 most recent articles, or articles with the phrase 'Foo Bar', you may have to maintain an external index using Lucene or Elasticsearch. These technologies use inverted indexes that support term/phrase lookups, range queries, wildcard searches, and so on. You can store vertex.getId() as a field in the indexed document to link back to the vertex in the graph database.
I have implemented something like this for my application, which uses a Blueprints database (Bitsy) and a fancy index (Lucene). I have documented the high-level design to (a) keep the fancy index up to date using batch updates every few seconds and (b) ensure transactional consistency across the graph database and the fancy index. Hope this helps.
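As a rough sketch of the pattern (in Python with the Elasticsearch client; the index name, field names, and "created" timestamp are all assumptions): index each article vertex's id alongside the fields you want to search or sort on, then resolve hits back to vertices:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_article(vertex_id, title, created):
    # Store the graph element id so search hits can be resolved to vertices.
    es.index(index="articles", id=vertex_id,
             document={"title": title, "created": created})

def ten_most_recent_article_ids():
    hits = es.search(index="articles",
                     query={"match_all": {}},
                     sort=[{"created": {"order": "desc"}}],
                     size=10)["hits"]["hits"]
    # Look these ids up in the graph database to get the actual vertices.
    return [hit["_id"] for hit in hits]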
Answering my own question: the answer is indices.
https://github.com/tinkerpop/blueprints/wiki/Graph-Indices

Convert nested dictionary/xml to flat file for sqlite

I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me; not all, just most.)
Problem: trying to convert a Biopython nested dictionary (or XML) of PubMed citation data into a flat (normalized) structure, e.g., SQLite. Citation data was fetched from PubMed using Biopython and was parsed into a dictionary, but it can also be retrieved as XML if needed.
Not all citations will have all fields/keys, and not all fields/keys will have the same number of items (authors, MeSH terms, refs, etc.), and I understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that have one per paper, e.g., title, abstract, date, citation, etc., but not, say, affiliation, as that would be linked to the first author). Papers with no abstract could be filled in as null?
Then move on to, say, authors and create a separate table, again using PMID as the foreign key, and then do the same for the various other fields/keys/items in separate tables, e.g., MeSH headings, EC numbers, refs, etc.
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done and what needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated, and I do understand that you can't fit a nested structure into a flat space. I'm just looking for the least boneheaded way of going about this, and hopefully one that will allow me to make sure that everything was properly captured.
Many thanks,
chris
A quick question: if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint, and is transactional, recoverable, and highly reliable. It has HA features as well, if that is required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.
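That said, if you do want the flat SQLite structure you described, the skeleton is fairly small. A sketch with Python's sqlite3 (the table layout and dictionary keys are made up; Biopython's actual keys will differ), which pops each key as it is handled so you can see what remains:

import sqlite3

con = sqlite3.connect("pubmed.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS papers  (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT);
CREATE TABLE IF NOT EXISTS authors (pmid TEXT REFERENCES papers(pmid), pos INTEGER, name TEXT);
CREATE TABLE IF NOT EXISTS mesh    (pmid TEXT REFERENCES papers(pmid), term TEXT);
""")

def load_citation(rec):
    """Insert one parsed citation dict; pops keys as they are handled."""
    pmid = rec["PMID"]  # left in place so the remainder stays identifiable
    # A paper with no abstract simply gets NULL via the pop default.
    con.execute("INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                (pmid, rec.pop("Title", None), rec.pop("Abstract", None)))
    for pos, name in enumerate(rec.pop("Authors", [])):
        con.execute("INSERT INTO authors VALUES (?, ?, ?)", (pmid, pos, name))
    for term in rec.pop("MeshTerms", []):
        con.execute("INSERT INTO mesh VALUES (?, ?)", (pmid, term))
    con.commit()
    return rec  # whatever is left still needs its own table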

Sample Data Creation Tool (mainly for Databases)

I’m thinking through some database design concepts and believe that creating sample data simulating real-world volume of my application will help solidify some design decisions.
Does anyone know of a tool to create sample data? I'm looking for something that's database and platform neutral if possible (from MySQL to DB/2 and Windows to UNIX) so as to test the design across different systems/architectures. I'm envisioning a tool where you can:
point to a database table(s) (some configuration of the DSN, etc.)
introspect the fields and based on the field... (point-and-click or add some configuration)
have a means for expressing how to create sample data (the MySQL Sample Data Creator is the kind of thing I envision, but I think there'd be some more options, like commit frequency, so as to create very large data sets... millions or billions of rows... I don't think that tool would scale to the volume of data I want to create)
push a button and go (depending on your parameters, this may take a long time)
Any thoughts? Sure, I could write an app to do this but it seems so generic that I shouldn’t have to reinvent the wheel.
DBMonster is fine, but I prefer Databene Benerator, as I explained in this answer to a similar question.
Something like DBMonster?
This page also has a listing of many DB data generators.
I cannot help you with MySQL or DB/2 but, in case anyone gets to this answer with a need for MS SQL Server, I can recommend the Data Generator from Red Gate.
Our test data generator, Datanamic DB Data Generator, can do this for you. It works with MySQL. It uses default "generator settings" when loading your tables the first time. You can then "fine-tune" the fields and/or choose other "generators".
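If no existing tool fits, the introspect-and-generate loop the question describes is not much code. A rough sketch in Python (SQLAlchemy for the DSN and introspection, Faker for values; the type mapping and batch size are arbitrary choices), with a commit per batch so very large volumes don't build one giant transaction:

from faker import Faker
from sqlalchemy import create_engine, inspect, text

engine = create_engine("mysql+pymysql://user:pw@localhost/testdb")  # any DSN
fake = Faker()

def sample_value(col):
    # Crude type-driven generation; a real tool would make this configurable.
    t = str(col["type"]).upper()
    if "INT" in t:
        return fake.random_int(0, 10_000)
    if "DATE" in t or "TIME" in t:
        return fake.date_time()
    return fake.word()

def fill(table, rows, batch=10_000):
    cols = inspect(engine).get_columns(table)  # introspect the fields
    stmt = text(
        f"INSERT INTO {table} ({', '.join(c['name'] for c in cols)}) "
        f"VALUES ({', '.join(':' + c['name'] for c in cols)})")
    for start in range(0, rows, batch):
        chunk = [{c["name"]: sample_value(c) for c in cols}
                 for _ in range(min(batch, rows - start))]
        with engine.begin() as conn:  # one commit per batch
            conn.execute(stmt, chunk)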

Saving MFC Model as SQLite database

I am playing with a CAD application using MFC. I was thinking it would be nice to save the document (model) as an SQLite database.
Advantages:
I avoid file format changes (SQLite takes care of that)
Free query engine
Undo stack is simplified (table name, column name, new value, and so on; see the sketch below)
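To illustrate the undo idea, a rough sketch (Python's sqlite3 for brevity; the real application would go through an MFC/C++ SQLite wrapper, and the table and column names are made up):

import sqlite3

con = sqlite3.connect("model.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS points (id INTEGER PRIMARY KEY, x REAL, y REAL);
CREATE TABLE IF NOT EXISTS undo_log (
    step INTEGER, tbl TEXT, col TEXT, row_id INTEGER,
    old_value, new_value);  -- SQLite allows typeless columns
""")

def set_value(step, tbl, col, row_id, new_value):
    # Record the old value before overwriting it, keyed by undo step.
    (old_value,) = con.execute(
        f"SELECT {col} FROM {tbl} WHERE id = ?", (row_id,)).fetchone()
    con.execute("INSERT INTO undo_log VALUES (?, ?, ?, ?, ?, ?)",
                (step, tbl, col, row_id, old_value, new_value))
    con.execute(f"UPDATE {tbl} SET {col} = ? WHERE id = ?", (new_value, row_id))

def undo(step):
    # Replay the logged old values, then drop that step from the log.
    rows = con.execute(
        "SELECT tbl, col, row_id, old_value FROM undo_log WHERE step = ?",
        (step,)).fetchall()
    for tbl, col, row_id, old in rows:
        con.execute(f"UPDATE {tbl} SET {col} = ? WHERE id = ?", (old, row_id))
    con.execute("DELETE FROM undo_log WHERE step = ?", (step,))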
Opinions?
This is a fine idea. SQLite is very pleasant to work with!
But remember the old truism (I can't find an authoritative source for where it originally comes from) that storing your data in a relational database is like parking your car by driving it into the garage, disassembling it, and putting each piece into a labeled cabinet.
Geometric data, consisting of points and lines and segments that refer to each other by name, is a good candidate for storing in database tables. But when you start having composite objects, with a hierarchy of subcomponents, it might require a lot less code just to use serialization and store/load the model with a single call.
So that would be a fine idea too.
But serialization in MFC is not nearly as much of a win as it is in, say, C#, so on balance I would go ahead and use SQL.
This is a great idea but before you start I have a few recommendations:
Be careful that each database is uniquely identifiable in some way besides file name, such as by having a table that describes the file within the database.
Take a look at some of the MFC-based examples and wrappers already available before creating your own. The ones I have seen borrowed from each other to create a better result. Google: MFC SQLite Wrapper.
Using an SQLite database is also useful for maintaining state. Think ahead about how you would manage that, keeping in mind which features SQLite includes and which it lacks.
You can also think now about how you might extend your application to the web by making sure your database table structure is easily exportable to other SQL database systems, as well as easy enough to extend to a backup system.
