Modeling Reddit style Comments in DynamoDB - amazon-dynamodb

I am looking into using DynamoDB to store comments for my application. The comments will be a nested data structure like you would find in reddit. So users can rate and reply to any comment. For example
Topic1
  Comment1
    Reply1
    Reply2
  Comment2
    Reply1
My question is: how do I model the reply relationship in DynamoDB so that I can query a topic's comments and all subsequent replies without doing a lot of grouping on the backend? This kind of data structure is obviously better suited to a graph database, but I am curious whether anyone has tried to model a tree-like data structure in DynamoDB.

With document support introduced in late 2014, you can model tree data using Map and List types. Your thread depth would be limited by the maximum depth of JSON documents, currently at 32. Alternatively, you could use the DynamoDB Storage Backend for Titan to model your message data as a graph. You get to decide how many hops you want your graph traversals to perform, so you get to decide the limit for thread depth.
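As a rough illustration of the document approach, the sketch below builds a comment thread as nested maps and lists in plain Python and checks its nesting depth against the 32-level limit mentioned above. The attribute names (`topic_id`, `comments`, `replies`, `rating`) and the helpers `nesting_depth` and `flatten_ids` are made up for this example. It does not call DynamoDB, and the way DynamoDB counts levels may differ in detail from this helper.

```python
# Hypothetical item layout: one item per topic, replies nested inside comments.
MAX_DEPTH = 32  # nesting limit quoted in the answer above

topic_item = {
    "topic_id": "Topic1",
    "comments": [
        {
            "id": "Comment1",
            "rating": 3,
            "replies": [
                {"id": "Reply1", "rating": 1, "replies": []},
                {"id": "Reply2", "rating": 0, "replies": []},
            ],
        },
        {
            "id": "Comment2",
            "rating": 5,
            "replies": [{"id": "Reply1", "rating": 2, "replies": []}],
        },
    ],
}


def nesting_depth(value):
    """Count how many map/list levels are nested inside `value`."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0  # scalars add no nesting


def flatten_ids(comments, path=()):
    """Yield the id path of every comment and reply, depth first."""
    for comment in comments:
        here = path + (comment["id"],)
        yield here
        yield from flatten_ids(comment["replies"], here)


depth = nesting_depth(topic_item)
fits = depth <= MAX_DEPTH
```

Each extra level of replies adds two levels of nesting here (a list plus a map), so the usable thread depth under this layout would be well below 32. A whole thread stored in one item is also bounded by DynamoDB's per-item size limit, which may matter before the depth limit does.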

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one, and it appears to be mostly working using a single-table design. We have, however, a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB or is this a case where full scans would be required and we would just be better off keeping the log stuff as a relational table?
Thanks in advance,
Nik
Those keywords and phrases "log" and "queried on many attributes" sound to me like DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well-known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. You could similarly use local or global secondary indexes for queries involving other attributes.
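A minimal sketch of the composite-key idea, using an in-memory list instead of a real table. The attribute names `pk` and `sk`, the single `LOG` partition value, and the helper `query_begins_with` are assumptions for illustration only; the helper just imitates a query with an exact partition key and a sort-key prefix.

```python
def composite_key(color, quantity):
    """Build a sort-key string such as COLOR#Red#QTY#25."""
    return f"COLOR#{color}#QTY#{quantity}"


# Stand-in for table items; a real design would likely spread writes
# across more than one partition key value.
items = [
    {"pk": "LOG", "sk": composite_key("Red", 25), "msg": "a"},
    {"pk": "LOG", "sk": composite_key("Red", 40), "msg": "b"},
    {"pk": "LOG", "sk": composite_key("Blue", 25), "msg": "c"},
]


def query_begins_with(items, pk, prefix):
    """Imitate a key-condition query: exact partition key, sort-key prefix."""
    return [i for i in items if i["pk"] == pk and i["sk"].startswith(prefix)]


# All red entries, regardless of quantity.
red = query_begins_with(items, "LOG", "COLOR#Red#")
# Red entries with quantity 25.
red_25 = query_begins_with(items, "LOG", composite_key("Red", 25))
```

Note that a bare prefix such as COLOR#Red#QTY#25 would also match COLOR#Red#QTY#250, so a real key format would probably need zero-padded numbers or a trailing delimiter. The order of the segments also fixes which queries are cheap: this layout can filter by color alone, but not by quantity alone.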
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test that your access patterns are fulfilled.
Access Patterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading; they allow you to group the data better and make your access patterns more efficient.
Log Data Use-case
Storing log data is a pretty common use case for DynamoDB, and many AWS customers use it for that sole purpose. But I can't overemphasize the importance of understanding your access patterns and working backwards from those to create your model.
Alternatives
If you require query capability or free text search ability, then you could use DynamoDB integrations with OpenSearch (via Lambda/EventBridge) for example, with OpenSearch providing you the flexibility for your queries.
This doesn't seem like a good use case. I have done it and wasn't at all happy with the result; I now load 'log-like' data into Elasticsearch and am much happier.
In my case, I insert the data into DynamoDB to archive it, but also feed it into ES. If I kill my ES cluster, which happens once in a while, I can reload all or some of the data from DynamoDB.

Should CosmosDB be modeled like a document database or a graph database?

I see that a CosmosDb can support both graph queries as well as more traditional SQL like queries - however I'm a bit confused about what kind of underlying schema is best at the collections level. If I were to model something in MongoDb or SQL Server, or Neo4j, I would have very different schemas. Also - it seems like I can query using more traditional SQL-like syntax - which makes it confusing about what's right or efficient underneath. Sometimes, making something easy to query does not mean that one should assume that it's an efficient query.
Is CosmosDb at its heart a document database that I should model accordingly, or is it a very different beast?
Example use case
Here's an example- let's say I have:
a user profile
multiple post types (photo, blog, question)
users can like photos
users can comment on photos, blogs, questions
With a sql database I would have tables:
profiles
photos
blogs
questions
and join tables with referential integrity to support the actions:
photoLikes
blogComments
photoComments
questionComments
With a graph database
I would just have the same core tables
profiles
photos
blogs
questions
and just create graph relationship types for like and comment, relying on the business logic in code to enforce rules such as not being able to like blogs.
With a document db like MongoDb
Again, I might have the same core tables
profiles
photos
blogs
questions
Comments would be tucked under each photo, blog or question as an embedded collection and would not have their own top-level collection. There would also be a question of whether to keep the likes as an embedded collection under each profile or under photos, and we would have to increment and sync a like count to the other collection (depending on the use case we might create a like collection as well).
So my question is this:
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
Azure Cosmos DB database engine is designed to be fully schema-agnostic.
A container (which can be a graph, a collection of documents, or a table) is a schema-agnostic container of arbitrary user-generated content, which gets automatically indexed upon ingestion. To better understand the details, I suggest reading "Schema-Agnostic Indexing with Azure DocumentDB" (http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf); the same applies to Cosmos DB.
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
When you start modeling data in Azure Cosmos DB, you need to consider questions such as: 1. Is your application read heavy or write heavy? 2. How is your application going to query and update data? As a general rule, denormalized data models can provide better read performance, while normalized models can provide better write performance.
This article explains, with examples, how to model document data for NoSQL databases, and shares some scenarios for using embedded, normalized and hybrid data models, which should be helpful.
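To make the embedded versus normalized trade-off concrete, here is a small plain-Python sketch of a hybrid layout for the photo example from the question. The document shapes, the field names (`like_count`, `photo_id`, `profile_id`) and the helper `like_photo` are hypothetical; nothing here uses a Cosmos DB SDK.

```python
# Denormalized part: comments and a like counter embedded in the photo document,
# so a single read returns everything needed to render the photo.
photo = {
    "id": "photo-1",
    "owner": "profile-7",
    "like_count": 0,
    "comments": [{"author": "profile-9", "text": "Nice shot"}],
}

# Normalized part: each like is its own small document referencing the photo,
# so writes stay small and "who liked this?" does not bloat the photo document.
likes = []


def like_photo(photo, likes, profile_id):
    """Record at most one like per profile and keep the embedded counter in sync."""
    already_liked = any(
        like["photo_id"] == photo["id"] and like["profile_id"] == profile_id
        for like in likes
    )
    if already_liked:
        return False
    likes.append({"photo_id": photo["id"], "profile_id": profile_id})
    photo["like_count"] += 1  # the denormalized copy that must be kept in sync
    return True


first = like_photo(photo, likes, "profile-9")
second = like_photo(photo, likes, "profile-9")  # duplicate, ignored
```

The counter update is the cost of denormalizing: every write path that touches likes has to remember it, which is the kind of syncing the question was hoping to avoid.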

Should bulk data be included in the graph?

I have been using ArangoDB for a while now for smaller system requirements and love it. We have recently been tasked by a client to analyze a large amount of financial data which is currently housed in SQL but I was hoping to more efficiently query the data in ArangoDB.
One of the more simplistic requirements is to rollup gl entry amounts to determine account totals across their general ledger. There are approximately 2200 accounts in their general ledger with a maximum depth of approximately 10. The number of gl entries is approximately 150 million and I was wondering what the most efficient method of aggregating account totals would be?
I plan on using a graph to manage the account hierarchy/structure, but should edges be created for 150 million gl entries, or is it more efficient to traverse the inbound relationships and run subqueries on the gl entry collections to total the amounts?
I would normally just run the tests myself but I am struggling with simply loading the data in my local instance of arango and thought I would get some insight while I work at loading the data.
Thanks in advance!
What is the benefit you're looking to gain by moving the data into a graph model? If it's to build connections between accounts, customers, GLs, and such, then it might be best to go with a hybrid model.
It's possible to build a hierarchical graph style relationship between your accounts and GL's, but then store your GL entries in a flat document collection.
This way you can use AQL style graph queries to quickly determine relationships between accounts and GLs. If you need to SUM entries in a GL, then you can have queries that identify the GL._id's and then sum the flat collections that have foreign keys that reference the GL._id they are associated with.
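The hybrid model described above can be sketched in plain Python: a small graph holds the account hierarchy, while entries live in a flat collection keyed by account. The sample accounts, the `account_id` field and the helpers `subtree` and `rollup` are illustrative assumptions, not ArangoDB API calls.

```python
from collections import defaultdict

# Edges of the account hierarchy: parent account -> child accounts.
children = {
    "assets": ["cash", "receivables"],
    "cash": ["petty-cash"],
}

# Flat collection of GL entries, each carrying a foreign key to its account.
gl_entries = [
    {"account_id": "cash", "amount": 100.0},
    {"account_id": "petty-cash", "amount": 25.0},
    {"account_id": "receivables", "amount": 300.0},
    {"account_id": "cash", "amount": -40.0},
]


def subtree(account_id):
    """Return the account and all of its descendants (the graph traversal step)."""
    found, stack = [], [account_id]
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found


def rollup(account_id):
    """Sum entry amounts for an account and everything beneath it."""
    totals = defaultdict(float)
    for entry in gl_entries:  # single pass, grouped by the foreign key
        totals[entry["account_id"]] += entry["amount"]
    return sum(totals[a] for a in subtree(account_id))


assets_total = rollup("assets")
cash_total = rollup("cash")
```

With this split the graph stays small (about 2200 accounts in the question), and the 150 million entries are only ever touched by an indexed, grouped aggregation rather than by edge traversal.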
By adding indexes on your foreign keys you will speed up queries, and by using Foxx Micro Services you can provide a layer of abstraction between a REST style query and the actual data model you are using. That way if you find you need to change your database model under the covers, by updating your Foxx MicroServices the consumer doesn't need to be aware of those changes.
I can't answer your question on performance; you'll just need to ensure your hardware is appropriately spec'ed.

Relational behavior against a NoSQL document store for ODBC support

The first assertion is that document style nosql databases such as MarkLogic and Mongo should store each piece of information in a nested/complex object.
Consider the following model
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
In the relational world this would be modeled as a patient table, claim table, and claim charge table.
Our primary desire is to simultaneously feed downstream applications with this data, but also perform analytics on it. Since we don't want to write a complex program for every measure, we should be able to put a tool on top of this. For example Tableau claims to have a native connection with MarkLogic, which is through ODBC.
When we create views using range indexes on our document model, the SQL against it in MarkLogic returns excessive repeating results. The charge numbers are also double counted with sum functions. It does not work.
The thought is that through these index, view, and possibly fragment techniques of MarkLogic, we can define a semantic layer that resembles a relational structure.
The documentation hints that you should create 1 object per table, but this seems to be against the preferred document db structure.
What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?
If the ODBC connection is always going to return bad data and not be aware of relationships, then the claims of ODBC support made by tools targeting NoSQL stores are not true.
References
https://docs.marklogic.com/guide/sql/setup
https://docs.marklogic.com/guide/sql/tableau
http://www.marklogic.com/press-releases/marklogic-and-tableau-build-connection/
https://developer.marklogic.com/learn/arch/data-model
For your question: "What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?"
The rule of thumb I use is that when I want to count "objects", I model them as separate documents. So if you want to run queries that count patients, claims, and charges, you would put them in separate documents.
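As a sketch of that rule of thumb, the snippet below splits the nested patient document from the question into one record per patient, claim and charge, so each kind of object can be counted and summed on its own. It uses only Python's standard `xml.etree.ElementTree`; the function name `split_patient` and the record layout are assumptions for illustration, and loading the resulting records into MarkLogic is not shown.

```python
import xml.etree.ElementTree as ET

PATIENT_XML = """
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
"""


def split_patient(xml_text):
    """Turn one nested patient document into flat patient, claim and charge records."""
    root = ET.fromstring(xml_text.strip())
    patient_id = root.findtext("patientid")
    patients = [{"patientid": patient_id, "firstname": root.findtext("firstname")}]
    claims, charges = [], []
    for claim in root.findall("claim"):
        claim_id = claim.findtext("claimid")
        claims.append(
            {
                "patientid": patient_id,  # reference back to the parent record
                "claimid": claim_id,
                "claimdate": claim.findtext("claimdate"),
            }
        )
        for charge in claim.findall("charge"):
            charges.append(
                {
                    "claimid": claim_id,  # reference back to the parent record
                    "amount": int(charge.findtext("amount")),
                    "code": charge.findtext("code"),
                }
            )
    return patients, claims, charges


patients, claims, charges = split_patient(PATIENT_XML)
total_charged = sum(c["amount"] for c in charges)
```

Because every charge appears exactly once in its own record, a sum over charges cannot be inflated by repeated rows from the parent patient or claim.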
That doesn't mean we're constraining MarkLogic to only relational patterns. In UML terms, a one-to-many relationship can be a composition or an aggregation. In a relational model, I have no choice but to model those as separate tables. But in a document model, I can do separate documents per object or roll them all together - the choice is usually based on how I want to query the data.
So your first assertion is partially true - in a document store, you have the option of nesting all your related data, but you don't have to. Also note that because MarkLogic is schema-agnostic, it's straightforward to transform your data as your requirements evolve (corb is a good option for this). Certain requirements may require denormalization to help searches run efficiently.
Brief example - a person can have many names (aliases, maiden name) and many addresses (different homes, work address). In a relational model, I'd need a persons table, a names table, and an addresses table. But I'd consider the names to be a composite relationship - the lifecycle of a name equals that of the person - and so I'd rather nest those names into a person document. An address OTOH has a lifecycle independent of the person, so I'd make that an address document and toss an element onto the person document for each related address. From an analytics perspective, I can now ask lots of interesting questions about persons and their names, and persons and addresses - I just can't get counts of names efficiently, because names aren't in separate documents.
I guess MarkLogic is a little atypical compared to other document stores. It works best when you don't store an entire table as one document, but one record per document. MarkLogic indexing is optimized for this approach, and handles searching across millions of documents easily that way. You will see that as soon as you store records as documents, results in Tableau will improve greatly.
Splitting documents into such small fragments also allows higher performance and a lower footprint. MarkLogic doesn't hold the data as persisted DOM trees that allow random access. Instead, it streams the data in a very efficient way, and relies on index resolution to pull relevant fragments quickly.
HTH!

Which technology is best suited to store and query a huge readonly graph?

I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process that stores the whole graph (500MB to 1GB) in memory. The clients could then query the server process, which could traverse the graph very quickly since it is held in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
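For the in-memory approach, a breadth-first search over an adjacency map is one way to find every shortest connection between two nodes. This is a minimal sketch, assuming the graph fits in a Python dict that maps each node to the nodes it points at; the function name `all_shortest_paths` and the sample graph are made up, and no server or framework is shown.

```python
from collections import deque


def all_shortest_paths(graph, start, goal):
    """Return every shortest path from start to goal in a directed graph."""
    if start == goal:
        return [[start]]
    dist = {start: 0}      # shortest known distance from start
    parents = {start: []}  # predecessors that lie on some shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if goal in dist and dist[node] >= dist[goal]:
            break  # every remaining node is at least as far away as the goal
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1 and node not in parents[nxt]:
                parents[nxt].append(node)  # another equally short route
    if goal not in dist:
        return []

    paths = []

    def build(node, suffix):
        """Walk back from the goal along recorded parents to enumerate paths."""
        if node == start:
            paths.append([start] + suffix)
            return
        for parent in parents[node]:
            build(parent, [node] + suffix)

    build(goal, [])
    return paths


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
routes = all_shortest_paths(graph, "A", "D")
```

If connections should also follow edges backwards, as the question suggests, one option would be to build a second dict with the reversed edges at load time and search the union, which avoids storing each edge twice in the database itself.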
There is also OrientDB, an open source document-graph DBMS with a commercial-friendly license (Apache 2). It offers a simple API, a SQL-like language, ACID transactions and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friend" is applied recursively up to 7 levels deep.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user of a web-based tool. But for computationally intensive batch calculations, where I am running hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
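The edge-as-document layout can be imitated with a toy inverted index to show why inbound and outbound lookups are both cheap. This is not Lucene: the class `EdgeIndex`, its methods and the field names `source` and `target` are hypothetical stand-ins for documents with two indexed fields.

```python
from collections import defaultdict


class EdgeIndex:
    """Toy inverted index in which each edge is one document with two fields."""

    def __init__(self):
        self.docs = []
        self.by_field = {"source": defaultdict(list), "target": defaultdict(list)}

    def add_edge(self, source, target):
        """Store the edge as a document and index it under both field values."""
        doc_id = len(self.docs)
        self.docs.append({"source": source, "target": target})
        self.by_field["source"][source].append(doc_id)
        self.by_field["target"][target].append(doc_id)

    def search(self, field, value):
        """Return all edge documents whose `field` equals `value`."""
        return [self.docs[i] for i in self.by_field[field].get(value, [])]


index = EdgeIndex()
for src, dst in [("A", "B"), ("A", "C"), ("C", "B")]:
    index.add_edge(src, dst)

outbound_from_a = [d["target"] for d in index.search("source", "A")]
inbound_to_b = [d["source"] for d in index.search("target", "B")]
```

Each edge is stored once but is reachable from either end, which avoids the duplicated edge rows the question describes for PostgreSQL. A path search would still need one lookup per visited node, which fits the slow batch behaviour reported above.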
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the q is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.

Resources