I would like to better understand how Grakn handles super nodes - vaticle-typedb

1. Is it just sharding and sending multiple queries?
2. If so, how are relationships kept across shards?
3. Can I set or hint the sharding method?
4. Can I split the dataset and somehow hint that there is another dataset in which a "cloned" entity resides?

Grakn handles super-node sharding transparently at the schema level, and will add automatic instance sharding in upcoming versions.
Sharding is a relatively simple mechanism: a node with more than a set number of edges (configured by knowledge-base.type-shard-threshold in grakn.properties) is split into a root node, with new edges attaching to child nodes that are invisible to the user. Since the goal of this is to be hands-off, we don't currently offer a way to control the sharding mechanism.
That should cover questions 1-3, but I don't follow question 4. Splitting a dataset but inserting it into the same keyspace is no different from not splitting it in the first place. Splitting it into two keyspaces will isolate your data, and there will be no sharing at all.

Related

How to create my DynamoDB table efficiently?

If each record in my database has only two possible states (pending, appended), is it efficient to designate these two states as partition keys? Or is it more effective to index this state value?
It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.
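To make that concrete, here is a minimal sketch of the sparse-index pattern with boto3. Only tid and isPending come from the answer above; the table name, the pending-index GSI name, and the rid range key are assumptions for illustration.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Records")  # hypothetical table name

# Query the sparse GSI (hash key: tid, sort key: isPending).
# Only items that still carry the isPending attribute appear in it.
pending = table.query(
    IndexName="pending-index",  # hypothetical GSI name
    KeyConditionExpression=Key("tid").eq("tid-123"),
)["Items"]

# When an item is appended, remove isPending so it drops out of the GSI.
# STATE is a DynamoDB reserved word, hence the attribute-name placeholder.
table.update_item(
    Key={"tid": "tid-123", "rid": "rid-1"},  # rid is an assumed range key
    UpdateExpression="REMOVE isPending SET #s = :appended",
    ExpressionAttributeNames={"#s": "state"},
    ExpressionAttributeValues={":appended": "appended"},
)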
It will depend on how you search for these records!
For example, if you always search by record ID, it doesn't matter. But if you frequently search for the set of pending or appended records, you should consider using partitions.
You could also consult this best-practices guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Update:
This section of the best-practices guide recommends the following:
Keep related data together. Research on routing-table optimization 20 years ago found that "locality of reference" was the single most important factor in speeding up response time: keeping related data together in one place. This is equally true in NoSQL systems today, where keeping related data in close proximity has a major impact on cost and performance. Instead of distributing related data items across multiple tables, you should keep related items in your NoSQL system as close together as possible.
As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.
Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.
Use sort order. Related items can be grouped together and queried efficiently if their key design causes them to sort together. This is an important NoSQL design strategy.
Distribute queries. It is also important that a high volume of queries not be focused on one part of the database, where they can exceed I/O capacity. Instead, you should design data keys to distribute traffic evenly across partitions as much as possible, avoiding "hot spots."
Use global secondary indexes. By creating specific global secondary indexes, you can enable different queries than your main table can support, and that are still fast and relatively inexpensive.
I hope I could help you!

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference if you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that most applications require only one table. Some of the techniques shown in this documentation squash a reasonably complex relational model into 1 DynamoDB table and 2 GSIs while still supporting 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to avoid the overhead of doing joins on the database layer, or of going back to the database to effectively do the join on the application layer. By having the data already sliced in the format your application requires, all you really need to do is one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This makes it extremely efficient and scalable at the db level. But it also means you'll be less flexible, as you need to know the access patterns in advance and model your data accordingly.
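As a hedged illustration of the overloaded-GSI variant from the question: one GSI whose sort key encodes both the document state and the date, so a single index serves both access patterns. The Documents table, the by-owner index, and the stateDate attribute are made-up names.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Documents")  # hypothetical table name

def documents_by_user(user_id, state):
    # Hypothetical overloaded GSI: hash key "owner", sort key "stateDate"
    # holding values like "PUBLISHED#2020-01-15" or "DRAFT#2020-01-15".
    return table.query(
        IndexName="by-owner",
        KeyConditionExpression=(
            Key("owner").eq(user_id)
            & Key("stateDate").begins_with(state + "#")
        ),
        ScanIndexForward=False,  # newest first
    )["Items"]

published = documents_by_user("user-7", "PUBLISHED")
drafts = documents_by_user("user-7", "DRAFT")

With two non-overloaded GSIs you would issue the same two queries against two different indexes, so the difference is mostly index count and provisioning rather than read speed.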
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that isn't called out – which makes sense, since it's the same query and storage engine.
That being said, I think there are some practical reasons why you'd want to go with a single table, as it keeps your infrastructure somewhat simple: you don't have to track metrics and/or provisioning settings for separate tables.
In my opinion, the benefits would be the cost of storage and provisioned throughput.
Apart from that, I'm not sure it matters much, given the new limit of 20 GSIs per table.

How to deal with high redundancy in Neo4j?

I am working on an application that uses both relational and graph databases (sqlite and neo4j). I am trying to see if I can get rid of sqlite and use only neo4j, but I am confronted with a problem of redundancy.
Let's say I have nodes that represent audio tracks, and I want to store the musical genre of each track. With hundreds of thousands of nodes, I don't think repeating "South-African Psytrance" as a string property is a good idea, and I am pretty sure that creating a "South-African Psytrance" node and linking it to all concerned nodes is an even worse idea (bottleneck?).
Am I right if I say that using 1) properties takes too much space, and using 2) relationships is a bad design for this particular problem?
The current code uses the sqlite db to store a set of musical genres, and their indexes as properties in nodes (which are converted to their string representation in the application layer).
Is there a way to use only neo4j and avoid bottlenecks and redundancy?
Option 1 is definitely NOT the way to go, as it will waste space and is antithetical to good graph DB design.
Option 2 is the classic way you would do this with a graph DB. There are many examples of neo4j DBs with very large numbers of relationships per node. And neo4j currently supports up to 34 billion relationships in a DB, so there is little danger that you will exceed a capacity limit. So, I would recommend that you at least try using this approach.
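A minimal sketch of option 2 with the official neo4j Python driver (the Track and Genre labels, the HAS_GENRE type, the id property, and the credentials are all assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def tag_track_with_genre(tx, track_id, genre_name):
    # MERGE reuses the single shared Genre node instead of duplicating
    # the genre string on every track, which avoids the redundancy.
    tx.run(
        """
        MATCH (t:Track {id: $track_id})
        MERGE (g:Genre {name: $genre_name})
        MERGE (t)-[:HAS_GENRE]->(g)
        """,
        track_id=track_id, genre_name=genre_name,
    )

with driver.session() as session:
    session.execute_write(tag_track_with_genre,
                          "track-1", "South-African Psytrance")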
There are also a few blogs about people using neo4j for storing similar data. For example:
http://neo4j.com/blog/musicbrainz-in-neo4j-part-1/
http://neo4j.com/blog/fun-with-music-neo4j-and-talend/
http://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
[EDITED]
As the slides mentioned by @Pawamoy imply, there is actually a third option: you can create a specific node label for each genre, and apply the appropriate genre label (a node can have more than one) to every track node. This would allow you to avoid using relationships for genres. However, it would tend to "muddy" the label space, since labels at least feel like "node types", and a "music genre" is not an "album track". Also, neo4j supports a very limited number of labels per node, and the maximum number of labels in a DB is also relatively small. So, I would not use this approach unless there were a definite advantage to doing so and the capacity limits were not an issue.
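For comparison, a sketch of the label-per-genre option; note that Cypher cannot parameterize labels, so the label has to be interpolated into the query string (whitelist or sanitize it in real code). The Track label and id property are assumptions, as above.

def add_genre_label(tx, track_id, genre_label):
    # The genre becomes a label on the track node itself,
    # so no relationship or shared genre node is needed.
    tx.run(
        f"MATCH (t:Track {{id: $track_id}}) SET t:`{genre_label}`",
        track_id=track_id,
    )

# Fetching all tracks of a genre is then a plain label scan:
#   MATCH (t:Track:`SouthAfricanPsytrance`) RETURN t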

Neo4j - family graph design and ancestor/pedigree lookup

I just started playing around with Neo4j, so my apologies if this is a simple concept...
I'm building a relatively large database of family information (a few million nodes with about 5-15 properties per node). As of right now, all data is being stored in a mysql database using Redis as a caching layer, but I'm playing around with switching out Redis for Neo4j to help speed up some of our more expensive queries (and eventually using Neo4j as the main data store instead of mysql).
I'm playing around with storing all my nodes and their properties in Neo4j, and connecting them via HAS_FATHER and HAS_MOTHER relationships. Is this a good approach? Would it be more beneficial to use HAS_PARENT and set a parent_type property on each relationship to either father or mother? Should I also save a reverse relationship called HAS_CHILD on all parents? What are the pros and cons of my options?
Secondly, assuming that I'm using the HAS_FATHER and HAS_MOTHER relationships, what's the optimal query to grab all nodes, properties, and relationships for all direct ancestors (pedigree) 7 generations away? Here's an example query that I'm currently playing with, but I'm new to Cypher and I'm not too familiar with the bottlenecks, optimizations, etc.
MATCH tree = (c)-[:HAS_FATHER|HAS_MOTHER*0..7]->(p)
WHERE c.id = 29421
RETURN nodes(tree), relationships(tree)
Any help or tips would be appreciated. Thanks!
Having HAS_MOTHER and HAS_FATHER instead of a HAS_PARENT with a type property is definitely better. With more specific relationship types, e.g. when you query for mothers, your traversals don't need to dig into properties - they can rely solely on relationship types.
The reason this is more performant is that properties are lazily loaded on demand; see http://neo4j.com/docs/stable/performance-guide.html#_neo4j_primitives_lifecycle.
If you have semantically inverse relationships, you don't have to model them explicitly: if a is the mother of b, then b is consequently a child of a. So to query for children, just follow HAS_FATHER and HAS_MOTHER in the inverse direction.
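A hedged sketch of that inverse traversal with the Python driver (the id property and driver setup are assumptions; the relationship types come from the question):

def children_of(tx, person_id):
    # Follow HAS_FATHER / HAS_MOTHER in the incoming direction rather
    # than storing a redundant HAS_CHILD relationship.
    result = tx.run(
        """
        MATCH (child)-[:HAS_FATHER|HAS_MOTHER]->(p {id: $person_id})
        RETURN child
        """,
        person_id=person_id,
    )
    return [record["child"] for record in result]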

Keeping Neo4j graph data separated by user

I have an interesting situation. I am allowing users to provide their own data sources to be imported into neo4j. The data sources could be the same across different users, but I would like cypher queries to only query nodes and relationships specified by a particular user's sources.
I can think of several ways to do this:
Separate neo4j instances for each user
Tag nodes and relationships by user
Currently, node duplicates are prevented by indexes, so I would have to alter that approach, since nodes which already exist simply gain a new relationship. The number of relationships to a node is used in my analysis, so separating relationships by user is important.
I will have to update an existing graph database to account for these new attributes. I'm thinking that tagging relationships might be the way to go. Any thoughts pro/con on this approach? This way I can include the user tag as a relationship property.
Thoughts?
Henry
You can tag all your nodes with user labels, and even use these to tag the source:
http://docs.neo4j.org/chunked/preview/query-match.html#match-get-all-nodes-with-a-label
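A minimal sketch of that approach (all labels, types, and property names here are assumptions): each node imported from a user's sources gets a user-specific label, and each relationship carries a user property, so per-user relationship counts stay separable even when the underlying nodes are shared.

def import_edge(tx, user, src_id, dst_id):
    # Shared nodes are MERGEd on id; the user-specific label is added on
    # top. Labels cannot be parameterized, so `user` is interpolated here
    # and must be sanitized/whitelisted in real code.
    tx.run(
        f"""
        MERGE (a:Item {{id: $src}}) SET a:`{user}`
        MERGE (b:Item {{id: $dst}}) SET b:`{user}`
        CREATE (a)-[:LINKS_TO {{user: $user}}]->(b)
        """,
        src=src_id, dst=dst_id, user=user,
    )

# Counting one user's relationships then filters on the property:
#   MATCH (:`user_42`)-[r:LINKS_TO {user: "user_42"}]->() RETURN count(r)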
