TinkerPop Blueprints and Frames - How to treat data as collections - graph

I found this question: How to store and retrieve different types of Vertices with the Tinkerpop/Blueprints graph API?
But it doesn't give me a clear answer.
How do I query the 10 most recent articles vs. the 10 most recently registered users for an API, for instance? Put graphing aside for a moment: this stack must be able to handle both a collection/document paradigm and a graphing paradigm (relations between elements of these types of collections). I've read everything, including the source, and am getting closer but not quite there.
Closest document I've read is the multitenant article here: http://architects.dzone.com/articles/multitenant-graph-applications
But it focuses on using gremlin for graph segregation and querying which I'm hoping to avoid until I require analysis on the graphs.
I'm considering using Cassandra/Hadoop at this point with a reference to the graph id, but I can see this biting me down the road.

Indexes in Tinkerpop/Blueprints only support simple lookups based on exact matches on a property key's value.
If you want to find the 10 most recent articles, or articles containing the phrase 'Foo Bar', you may have to maintain an external index using Lucene or Elasticsearch. These technologies use inverted indexes that support term/phrase lookups, range queries, wildcard searches, and so on. You can store vertex.getId() as a field in the indexed document to link back to the vertex in the graph database.
I have implemented something like this for my application which uses a Blueprints database (Bitsy) and a fancy index (Lucene). I have documented the high-level design to (a) keep the fancy index up-to-date using batch updates every few seconds and (b) ensure transactional consistency across the graph database and the fancy index. Hope this helps.
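A minimal sketch of that pattern, assuming Blueprints 2.x and a reasonably recent Lucene version; the ArticleIndex class and the vertexId/title/createdAt field names are illustrative, not part of either library:

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ArticleIndex {
    private final Directory dir;
    private final Graph graph;

    public ArticleIndex(Directory dir, Graph graph) {
        this.dir = dir;
        this.graph = graph;
    }

    // Index one article vertex, storing its graph id so hits can be resolved back to vertices.
    // (A real implementation would keep the writer open and commit in batches, as described above.)
    public void index(Vertex article) throws IOException {
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StoredField("vertexId", article.getId().toString()));
            doc.add(new TextField("title", (String) article.getProperty("title"), Field.Store.NO));
            long createdAt = ((Number) article.getProperty("createdAt")).longValue();
            doc.add(new NumericDocValuesField("createdAt", createdAt)); // enables sorting by date
            writer.addDocument(doc);
        }
    }

    // "10 most recent articles": sort on the createdAt doc-values field, newest first.
    public List<Vertex> tenMostRecent() throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Sort byDate = new Sort(new SortField("createdAt", SortField.Type.LONG, true));
            TopDocs top = searcher.search(new MatchAllDocsQuery(), 10, byDate);
            List<Vertex> articles = new ArrayList<>();
            for (ScoreDoc sd : top.scoreDocs) {
                String vertexId = searcher.doc(sd.doc).get("vertexId");
                articles.add(graph.getVertex(vertexId)); // back into the graph database
            }
            return articles;
        }
    }
}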

Answering my own question - the answer is indices.
https://github.com/tinkerpop/blueprints/wiki/Graph-Indices
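For anyone landing here later, a minimal sketch of the two index styles against TinkerGraph (Blueprints 2.x; the property names are made up):

import com.tinkerpop.blueprints.Index;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class IndexExample {
    public static void main(String[] args) {
        TinkerGraph graph = new TinkerGraph(); // implements both KeyIndexableGraph and IndexableGraph

        // Key index: automatic exact-match lookups on a property key.
        graph.createKeyIndex("type", Vertex.class);

        Vertex article = graph.addVertex(null);
        article.setProperty("type", "article");
        article.setProperty("title", "Hello graphs");

        // Manual index: you put/get elements under arbitrary key/value pairs yourself.
        Index<Vertex> byType = graph.createIndex("byType", Vertex.class);
        byType.put("type", "article", article);

        // Both lookups are exact matches; anything fancier needs an external index.
        for (Vertex v : graph.getVertices("type", "article")) {
            System.out.println(v.getProperty("title"));
        }
        for (Vertex v : byType.get("type", "article")) {
            System.out.println(v.getProperty("title"));
        }
    }
}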

Related

Convert database data into a graph with TinkerPop tools

I am looking for graph software that will create a graph from a database automatically. Exploring the TinkerPop documentation, the provided tutorials discuss querying ready-made graphs, but there is not much about creating graphs from a database. Is it possible to use any of the tools in the TinkerPop suite to automatically convert data from a database into a graph ready for querying?
Let's say we have an event stream like this:
event_type=create_file name="filename.txt" handle=1
event_type=read handle=1 data="file content"
event_type=write handle=1 data="new file content"
event_type=close handle=1
Is there a way to convert the event stream into a graph automatically by specifying which properties to follow for creating edges? For example, by selecting the "handle" property I should get:
create_file-->read-->write-->close
All the examples I could find show me how to do something like
add_node create_file
add_node read
add_node write
add_node close
followed by adding all desired edges manually.
Thank you for your help.
I just came across http://neo4j-contrib.github.io/neo4j-etl-components/ which is a very interesting tool. It takes an RDBMS schema and generates a graph representation, turning foreign keys and join tables into relationships, and (other) tables into nodes in the graph. And then it generates CSV files to load into a graph database, either for wholesale or incremental updates.
I don't know if there's an equivalent tool for Tinkerpop. But since much of the work (reading the SQL schema, mapping tables and foreign keys into vertices and edges) is already done in this open-source project, perhaps it would be a good starting point?
The output of the tool looks like it depends on having a clean source data model, and its guesses may be naive. It does appear to be configurable, though, so when it guesses wrongly about which tables are vertices and which are edges, I think you can override it.
What you are suggesting is simply not possible. A graph database is very different from a traditional relational database. The biggest difference is that graphs are naturally unstructured, which allows for more flexibility and manipulation. Traditional relational or tabular databases, on the other hand, are more rigidly structured, which provides less flexibility but easier control and querying.
As stated in the answer provided here, you also should not be using your original database as a frame of reference. You should instead be thinking about how to reshape your data into a graph so as to take advantage of what graphs offer.
For example, a traversal in a graph, as opposed to a query in a tabular DB, is a lot more flexible (and arguably more powerful) but harder to construct and formalise.
There is a lot of good material providing guidelines for how to approach this problem [1] [2] [3] [4]. Unfortunately, though, there is no good automated migration tool at the moment.
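That said, the specific linking the original question describes (pick a property such as handle to follow, and chain events that share its value) is small enough to hand-roll against the Blueprints API. A hypothetical sketch, with event parsing omitted and property names assumed:

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventStreamToGraph {

    // Link consecutive events that share the same value for followKey (e.g. "handle").
    static Graph build(List<Map<String, String>> events, String followKey) {
        Graph graph = new TinkerGraph();
        Map<String, Vertex> lastSeen = new HashMap<>(); // follow-value -> most recent vertex

        for (Map<String, String> event : events) {
            Vertex v = graph.addVertex(null);
            event.forEach(v::setProperty); // copy event_type, handle, data, ... onto the vertex

            String followValue = event.get(followKey);
            if (followValue != null) {
                Vertex prev = lastSeen.put(followValue, v);
                if (prev != null) {
                    graph.addEdge(null, prev, v, "next"); // create_file --> read --> write --> close
                }
            }
        }
        return graph;
    }
}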

Meteor Approach to Collections Considering Joins Aren't Supported by Core Yet?

I'm building my project's main functionality right now, so this is a big decision for the project and I want an efficient and scalable solution. I use different APIs to fetch users' products, ultimately for one collection that displays product information in a table, with a possible merge by SKU/TITLE across the different sources.
I have thought of two approaches (in both we add Meteor.userId() to every insert so each user has their own products):
1) Give each API its own collection and fetch its products into it. During or after the API query, where I insert into sourceXProducts, also run the logic that merges products by SKU and inserts only the fields I need into the main usersProducts collection. We still keep the sourceXProducts collections, so if we ever need something we didn't include in usersProducts we can query for it; we basically keep all the information possible (because it can come in handy).
source1Products = new Meteor.Collection('source1Products');
source2Products = new Meteor.Collection('source2Products');
usersProducts = new Meteor.Collection('usersProducts');
Pros: Honestly I'm not sure; it keeps things organized, and from the way I learned Meteor it seems to be a common pattern.
Cons: Meteor collection joins are not supported in core yet, so I would have to use a package such as meteor-publish-composite, which seems good but might hurt performance.
2) Create one collection and just insert everything the API response has, plus an additional apiSource field, so we can select the products of user X from API X.
usersProducts = new Meteor.Collection('usersProducts');
Pros: No joins, possibly better performance
Cons: Less organized, and it can become a very large collection; maybe that's not good for MongoDB.
3) Your ideas? :)
First, you should improve the question. You do not tell us anything precise about your schema. What entities do you have, what types of relations are there, and what types of joins do you think you will be doing? How often will you be doing them?
Second, you should rethink your schema and think in terms of a non-relational database. I see many people coming from the SQL world who then simply design their schema in the same way. Wrong. MongoDB is not SQL, and you should not try to just reuse here the things you learned there. You should start using features like subdocuments and arrays, which can solve many of the basic things you would do in SQL with joins (a sketch follows after this list). So knowing your schema would help us help you design it. For example, see this answer, and especially the comments, for a discussion of a similar question to the one you are asking here.
Third, have you evaluated the various solutions that exist out there? There are many, but you have not shown us that you tried any of them and how they worked for you. What were their pros and cons for you and your project?
Fourth, if you are too lazy to evaluate, you can just use peerlibrary:peerdb and peerlibrary:related. They are simply perfect. You should trust me. I am their author.
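To illustrate the embedded-document shape mentioned in the second point, here is a sketch using the MongoDB Java driver (the collection and field names are hypothetical; the same document shape applies from Meteor's JavaScript API): instead of a separate products collection joined to users, each user document carries its products as an array of subdocuments.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Filters.eq;

public class EmbeddedProductsExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("shop").getCollection("usersProducts");

            // One document per user; products from every API source live in an embedded array,
            // so reading a user's products needs no join at all.
            users.insertOne(new Document("userId", "u123")
                    .append("products", Arrays.asList(
                            new Document("sku", "ABC-1").append("title", "Widget")
                                    .append("apiSource", "source1"),
                            new Document("sku", "XYZ-9").append("title", "Gadget")
                                    .append("apiSource", "source2"))));

            // Fetch the user and all of their products in a single query.
            Document user = users.find(eq("userId", "u123")).first();
            System.out.println(user.toJson());
        }
    }
}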

Storing and downloading Data in iOS Applications

I am a bit new to iOS Development and I was wondering if someone could point me in the right direction regarding an application I am working on.
I am currently working on an application that will be displaying product lists and categories. The list is updated on a weekly basis (once every week).
I am now trying to decide two things:
1- What's the best method of storing this data? I am looking for a way that will allow me to replace the data in the application once every week.
2- Is it going to be beneficial to use Core Data? Note that I only have Product Category, Product, and Product Information entities.
Appreciate your support.
I would use Core Data, because I know Core Data and am used to working with it. But this is clearly very much like using a chainsaw to cut a slice of bread.
As I understand, you're not familiar with Core Data. Maybe it's not the right tool for the job considering the learning curve.
In your case I would simply use JSON files as provided by the server.
That said, if you're looking into Core Data anyway, any store type will do: atomic (binary), XML, or SQLite. The first two load the whole data set into memory, and queries are done in memory as well. SQLite provides the benefits usually associated with databases, at the cost of slightly increased complexity. A chainsaw.
I would use Core Data. If you haven't worked with Core Data before, learn it. It's a great framework.

What are the rules for deciding when a new catalog should be created?

I'd like to learn about using catalogs correctly.
I have about 30 useful content types, about 50 indexes in catalog.xml, and about 45 metadata columns. There are just three types that account for most of the site's data, and I may need millions of these. I've been reading, and there's lots to do, but I want to have the basic configuration right before I begin all that.
This page told me that any non-default indexes should not be added to the portal_catalog. I've even read people explaining how removing one or two of the default indexes makes a performance difference.
My question is: what are the rules for dividing up the indexes into different catalogs, and for selecting which catalog(s) index which type(s)?
So far I have created one additional catalog, used to catalog all indexes for my 'site-setup' objects (which I have caused to no longer be indexed in portal_catalog). The site-setup indexes are very often used, but more rarely modified than others, so I thought it was correct to separate them from objects which are reindexed more often. I'm not sure if that's the main consideration though.
Another similar question (a good example of the kind of thing I want to solve): how would you handle something like secondary workflow review_state variables? I give each workflow's review_state variable an index (and search on them quite often), but some of my workflows are only used on just a few types. (my most prolific objects have secondary workflows...)
I'd be very grateful for advice!
Campbell
This won't cover everything, but I'll bring up some points.
Anything not in the portal_catalog won't work with collections, the folder_contents view, the getFolderContents method, search, portlet collections, related items (I think), and anything else that assumes you're using the portal_catalog.
I like to use an additional catalog when I need to be able to query the data but it only affects a sub-set of the content objects.
Use collective.indexing to speed up indexing operations.
Mount the catalogs on their own mount points so you can cache them differently from the rest of the site (so you can cache the whole catalog). Then you can even serve the catalogs from a dedicated zeoserver.
Also, if your content doesn't have to be cataloged by the portal_catalog (with all the constraints listed), you may even want to think about whether you need it as a full-fledged (archetype|dexterity) type in the first place. You can use the slimmer repoze.catalog to catalog arbitrary objects (which could be very simple data) for whatever your purpose is and get even more performance. Or better yet, look into Solr for indexing it for VERY good performance.
One more thing: depending on the type of data you're storing, you could even look into using a relational database as a data store. But I don't know what kind of queries, indexes, data, etc. you have...
30 different types seems like a lot but I don't know what your use case is. Care to share? Perhaps there is a better way to do it.

Which one to use? EAV or Blobs in the database?

I am currently working to rework the data system of our application. Basically, it is designed so that people can add all the custom fields they want, with only a few constant/always-there fields.
Our current design is giving us plenty of maintenance problems. What we do is dynamically (at runtime) add a column to the database for each field. We have to have a meta table and other cruft to maintain all of these dynamic columns.
Now we are looking at EAV, but it doesn't seem much better. Basically, we have many different types of fields, so there would be a StringValues, IntegerValues, etc table... which makes things that much worse.
I am wondering if using JSON or XML blobs in the database may be a better solution, specifically because in most use cases, when we retrieve anything out of these tables, we need the entire row. The problem is that we need to be able to create reports for this data as well. No solution really makes custom queries look easy, and searching across such a blob database will surely be a performance nightmare when reports are run.
Each "row" needs to have anywhere from about 15 to 100 (possibly more) attributes/columns associated with it.
We are using SQL Server 2008, and our application interfacing with the database is a C# web application (so, ASP.NET).
What do you think? Use EAV or blobs or something else entirely? (Also, yes, I know a schema-free database like MongoDB would be awesome here, but I can't convince my boss to use it.)
What about the xml datatype? Advanced querying is possible against this type.
We've used the xml type with good success. We do most of our heavy lifting at the code level, using LINQ to parse out values. Our schema is somewhat fixed, so that may not be an option for you.
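As a rough illustration of the kind of querying the xml type allows, here is a sketch through JDBC rather than LINQ (the table, column, and XPath names are invented, and the connection details are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class XmlColumnQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=AppDb"; // placeholder connection string
        // Pull a single attribute out of the xml blob and filter on another one,
        // using SQL Server's xml methods value() and exist().
        String sql = "SELECT Id, Attributes.value('(/attrs/color)[1]', 'nvarchar(50)') AS Color "
                   + "FROM Items "
                   + "WHERE Attributes.exist('/attrs/size[. = \"XL\"]') = 1";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getLong("Id") + " -> " + rs.getString("Color"));
            }
        }
    }
}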
One interesting feature of SQL Server is the sql_variant type. It's fully supported in .NET and quite easy to use. The advantage is that you don't need to create StringValue, IntValue, etc. columns, just one Value column that can contain all the simple types.
This very specific type favors the EAV option, IMHO.
It has some drawbacks though (sorting, distinct selects, etc.). So if you want to use it, make sure you read all the documentation and understand its limits.
Create a table with your known columns plus "X" sparse columns using sequential names such as DataColumn0001, DataColumn0002, etc. When a new column is defined, just rename one of the sparse columns and start inserting data. The great advantage of sparse columns is that they are indexable.
More info at this link.
What you're doing is STUPID with a database that doesn't support your data type. You should work with a medium that meets your needs; options include NoSQL databases such as RavenDB, MongoDB, DocumentDB, and Couchbase, or Postgres on the RDBMS side, to name several.
You are inherently using the tool in a capacity it was not designed for, and one in which it specifically limits you from achieving success. NoSQL databases frequently use JSON as the underlying storage because JSON is inherently schemaless. Want to add a property? Sure, go ahead. Want to add a whole sub-collection? Sure, go ahead. NoSQL databases were created in part specifically to remove the rigid schema requirements of RDBMSs.
2015 edit: Postgres now natively supports JSON, so it is a viable option among RDBMSs. My answer is still correct in that you need to use the correct tool for the problem. It is a polyglot persistence world.
