Serverless Titan graph stack with AWS DynamoDB and Lambda

As announced here, it is possible to use Titan with DynamoDB as its backend.
Is it possible to build a serverless Titan Graph DB stack that is accessed via AWS Lambda functions?
Theoretically there should be nothing stopping this implementation, but I couldn't find any example. There has been a discussion on the issue in the code repository, but it has not yielded anything concrete yet.

It is possible, but I have not estimated the latency involved in starting Titan in a Lambda function. For high request rates, write loads may not be appropriate, as each Lambda container will try to secure its own range of ids from the titan_ids table and you may run out of ids quickly. If your requests are read-only, one way to reduce Titan launch time is to open the graph in read-only mode. In read-only mode, Titan does not need to get an id range lease from titan_ids either.
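For reference, a minimal sketch of what a read-only Titan properties file for the DynamoDB backend might look like. The backend class name is the one used by the dynamodb-titan-storage-backend project; treat the keys as assumptions to verify against the version you deploy:

# Sketch of a read-only Titan configuration for the DynamoDB backend (keys assumed,
# verify against your dynamodb-titan-storage-backend version).
storage.backend=com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager
# Other DynamoDB client settings (credentials, endpoint, capacity) live under
# the storage.dynamodb.* namespace per the backend's documentation.
# Open the graph read-only so Titan skips acquiring an id block from titan_ids.
storage.read-only=true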

Related

Verify DynamoDB is healthy

I would like to verify in my service's /health check that I have a connection to my DynamoDB.
I am searching for something like select 1 in MySQL (it only pings the db and returns 1), but for DynamoDB.
I saw this post, but searching for a non-existent item is an expensive action.
Any ideas on how to only ping my db?
I believe the select 1 equivalent in DDB is Scan with a Limit of 1 item. You can read more here.
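For reference, a minimal boto3 sketch of that Scan-with-Limit-1 ping (the table name is a placeholder):

import boto3

dynamodb = boto3.client("dynamodb")

def ping_dynamodb(table_name="my-table"):  # placeholder table name
    """Cheap health check: read at most one item from the table."""
    # A Scan with Limit=1 touches at most one item, which is about as close
    # to "select 1" as DynamoDB gets; any connectivity, auth or throttling
    # problem surfaces as an exception for the /health handler to report.
    dynamodb.scan(TableName=table_name, Limit=1)
    return True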
DynamoDB is a managed service from AWS, and it is highly available anyway. Instead of using a query to verify the health of DynamoDB, why not set up CloudWatch metrics on your table and check for recent alarms in CloudWatch concerning DynamoDB? This will also prevent you from spending your read units.
The question is perhaps too broad to answer as stated. There are many ways you could set this up, depending on your concerns and constraints.
My recommendation would be not to over-think, or over-do it, in terms of verifying connectivity from your service host to DynamoDB: for example, just performing a periodic GetItem should be sufficient to establish basic network connectivity.
Instead of going about the problem from this angle, perhaps you might want to consider a different approach:
a) set up canary tests that exercise all your service features periodically -- these should be "fail-fast" light tests that run constantly, and in the event of consistent failure you can take action
b) set up error metrics from your service and monitor those metrics: for example, CloudWatch allows you to take action on metrics -- you will likely get more mileage out of this approach than narrowly focusing on a single failure mode (i.e. DynamoDB, which, as others have stated, is a managed service with a very good availability SLA)
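To make option (b) a bit more concrete, here is a rough boto3 sketch that creates a CloudWatch alarm on one DynamoDB table metric. The alarm name, table name, SNS topic and choice of metric are placeholders; pick whatever failure mode matters to you (note that some DynamoDB metrics, such as SystemErrors, are emitted per TableName and Operation):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Example: alarm when the table reports read throttling.
cloudwatch.put_metric_alarm(
    AlarmName="my-table-read-throttles",                        # placeholder
    Namespace="AWS/DynamoDB",
    MetricName="ReadThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],    # placeholder table
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",                            # no data means no throttles
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)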

Azure Cosmos Gremlin API: transactions and efficient graph traversal

We are experimenting with the Cosmos Gremlin API because we are building a large-scale knowledge management system which is naturally suited to a graph DB. Knowledge items are highly interconnected, and therefore a graph is much better than a relational or a document-oriented (hierarchical) structure.
We need atomic write operations (not full transaction support, just atomic writes). E.g. we need to create several vertices and edges in one atomic write operation.
After carefully reading the documentation and extensively searching for solutions, our current state of knowledge is following:
Cosmos Gremlin API stores vertices as documents and outgoing edges as part of the "outgoing document".
A Gremlin statement creating vertices and edges might be split up and executed in parallel.
There is no transaction support and there are no atomic write operations.
Write operations are not idempotent.
The two facts taken together mean: if you execute a graph write operation and an error occurs somewhere along the traversal, you have no chance whatsoever of recovering from it in a clean way. Let's say you add an edge, add some vertices, perform some side-effect steps and something goes wrong. Which vertices and edges are persisted and which are not? Since you cannot simply run the statement a second time (vertices with those ids already exist), you're kind of stuck. In addition, this is not something that can be solved at the end-user level in the UI.
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app. When you have a look at the Gremlin "data explorer" in the portal, that seems even more true. It looks like a prototype.
Since edges are stored on the "outgoing document", one should always traverse the graph using the outgoing edges, not the incoming.
This takes away a lot of the efficiency of working with a graph DB: being able to traverse both directions efficiently.
It leads to workarounds: For each outgoing edge, create an "inverse edge" on the incoming vertex.
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Write operations are not idempotent.
It is possible to write queries in an idempotent way, however it's not really done in a nice, readable and maintainable way. See an idempotent Gremlin example here: https://spin.atomicobject.com/2021/08/10/idempotent-queries-in-gremlin/
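The usual trick is the fold().coalesce(unfold(), addV(...)) upsert pattern, so re-running the statement is a no-op if the vertex already exists. A rough gremlinpython sketch against Cosmos (account, database, graph, key and the property names are placeholders; Cosmos expects string queries and the GraphSON 2.0 serializer, and depending on your container you may also need to set the partition key property):

from gremlin_python.driver import client, serializer

# Placeholders: substitute your Cosmos account, database, graph and primary key.
gremlin_client = client.Client(
    "wss://<account>.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Idempotent "upsert": only add the vertex if it does not exist yet,
# so the statement can be retried safely after a partial failure.
query = (
    "g.V().has('person','userId','alice').fold()"
    ".coalesce(unfold(), addV('person').property('userId','alice'))"
)
gremlin_client.submit(query).all().result()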
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app
This really depends on your application requirements; not all production applications require atomicity or transactions. Some systems can tolerate dropping data, or, if needed, you can do various things to ensure data integrity, though this often puts more responsibility on the application developer.
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Anecdotally, I haven't seen too many stories of it being used in production. Cosmos DB looks relatively popular, but it's hard to tell what proportion of users are running which API.

Scalable delete-all from Google Cloud Datastore

I'm trying to implement a complete backup/restore function for my Google appengine/datastore solution. I'm using the recommended https://cloud.google.com/datastore/docs/export-import-entities for periodic backup and for restore.
One thing I cannot wrap my head around is how to restore to an empty datastore. The import function won't clear the datastore before importing, so I have to implement a total wipe of the datastore myself. (And a way to clear the datastore might be a good thing for test purposes etc.)
The datastore admin is not an option since it's being phased out.
The recommended way, according to the google documentation, is to use the bulk delete: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-datastore-bulk-delete.
The problem with this method is that I will have to launch 1 dataflow job for each namespace/kind combination. And I have a multi-tenant solution with one namespace per tenant and around 20 kinds per namespace. Thus, if I have e.g. 100 tenants, that would give 2000 dataflow jobs to wipe the datastore. But the default quota is 25 simultaneous jobs... Yes, I can contact Google to get a higher quota, but the difference in numbers suggests that I'm doing it wrong.
So, any suggestions on how to wipe my entire datastore? I'm hoping for a scalable solution (that won't exceed request timeout limits etc) where I don't have to write hundreds of lines of code...
One possibility is to create a simple 1st-generation Python 2.7 GAE application (or just a service) in that project and use the ndb library (typically more efficient than the generic datastore APIs) to implement on-demand selective/total datastore wiping as desired, along the lines described in How to delete all the entries from google datastore?
This solution deletes all entries in all namespaces.
By using ndb.metadata, no model classes are needed.
And by using ndb.delete_multi_async it will be able to handle a reasonably large datastore before hitting a request time limit.
from google.appengine.api import namespace_manager
from google.appengine.ext import ndb
...
def clearDb():
    # Walk every namespace in the datastore; no model classes are needed
    # because ndb.metadata provides the namespace and kind listings.
    for namespace in ndb.metadata.get_namespaces():
        namespace_manager.set_namespace(namespace)
        for kind in ndb.metadata.get_kinds():
            # Keys-only query: much cheaper than fetching full entities.
            keys = [k for k in ndb.Query(kind=kind).iter(keys_only=True)]
            # Asynchronous deletes; see the linked answers below for collecting
            # the futures and batching once the datastore grows large.
            ndb.delete_multi_async(keys)
The solution is a combination of the answers:
GAE, delete NDB namespace
https://stackoverflow.com/a/46802370/10612548
Refer to the latter for tips on how to improve it as time limits are hit and how to avoid instance explosion.

Should I use JanusGraph as main database to store all my data for a new project?

I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
Janus can be used like any database and supports "insert", "update" and "delete" operations, so JanusGraph will write data into Cassandra or another database to store the data, right?
Where does JanusGraph store the nodes, edges, attributes etc.? It will write these into the database, right?
Should this data be loaded into memory by Janus, or will it be read from Cassandra all the time?
Must the data that JanusGraph reads be loaded into JanusGraph on every query, or will it do selects in the database to retrieve only the data I need?
Is the data retrieved from the database only what I need, or will Janus read all records in the database all the time?
Should I use JanusGraph in my project in production, or should I wait until it becomes production-ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, user blocks and also do some Elasticsearch queries; in this case, what database backend should I use?
Janus will write data into Cassandra or another database to store the data, right?
Where does Janus store the nodes, edges, attributes etc.? It will write these into the database, right?
JanusGraph will write the data into whatever storage backend you configure it to use. This includes Cassandra. It writes this data into the underlying database using the data model roughly outlined here.
Should this data be loaded into memory by Janus, or will it be read from Cassandra all the time?
Is the data retrieved from the database only what I need, or will Janus read all records in the database all the time?
JanusGraph will only load into memory the vertices and edges that you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
Should I use Janus in my project in production or should I wait until it becomes production ready?
That is entirely up to you and your use case. Janus is already being used in production, as can be seen here at the bottom of the page. Janus was forked from, and improved on, TitanDB, which is also used in several production use cases. So if you are wondering "is it ready?", then I would say yes, it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for VERY big data, and it's also only available in the cloud, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends though (all you need to do is adjust some configs and check your dependencies are in place), so during your development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each backend.
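For illustration, switching between, say, Cassandra/Elasticsearch and an embedded BerkeleyDB backend is mostly a matter of editing the graph's properties file. A rough sketch (hostnames and paths are placeholders, and key names can vary slightly between JanusGraph versions):

# Cassandra (CQL) + Elasticsearch, e.g. for the social-network use case above
storage.backend=cql
storage.hostname=127.0.0.1
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1

# ...or an embedded BerkeleyDB backend for local development
# storage.backend=berkeleyje
# storage.directory=/tmp/janusgraph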
When considering what storage backend to use for a new project, it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.

What is the best way to integrate DynamoDB stream with CloudSearch? [duplicate]

I'm using DynamoDB pretty heavily for a service I'm building. A new client request has come in that requires CloudSearch. I see that a CloudSearch domain can be created from a DynamoDB table via the AWS console.
My question is this:
Is there a way to automatically offload data from a DynamoDB table into a CloudSearch domain, via the API or otherwise, at a specified time interval?
I'd prefer this to manually offloading DynamoDB documents to CloudSearch. All help greatly appreciated!
Here are two ideas.
The official AWS way of searching DynamoDB data with CloudSearch
This approach is described pretty thoroughly in the "Synchronizing a Search Domain with a DynamoDB Table" section of http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-dynamodb-data.html.
The downside is that it sounds like a huge pain: you have to either re-create new search domains or maintain an update table in order to sync, and you'd need a cron job or something to execute the script.
The AWS Lambda way
Use the newish Lambda event-processing service. It is pretty simple to set up an event stream based on DynamoDB (see http://docs.aws.amazon.com/lambda/latest/dg/wt-ddb.html).
Your Lambda would then submit a search document to CloudSearch based on the Dynamo event. For an example of submitting a document from a Lambda, see https://gist.github.com/fzakaria/4f93a8dbf483695fb7d5
This approach is a lot nicer in my opinion as it would continuously update your search index without any involvement from you.
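To make this concrete, here is a hedged Python sketch of a handler that turns DynamoDB stream records into a CloudSearch document batch. The search endpoint, the "id" key name and the field mapping are placeholders to adapt to your own schema, and the stream is assumed to include new images (StreamViewType NEW_IMAGE or NEW_AND_OLD_IMAGES):

import json
import boto3

# Placeholder: your CloudSearch *document* endpoint from the console.
cloudsearch = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-my-domain-xxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

def handler(event, context):
    """Triggered by a DynamoDB stream; mirrors each change into CloudSearch."""
    batch = []
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # assumes a string hash key named "id"
        if record["eventName"] == "REMOVE":
            batch.append({"type": "delete", "id": doc_id})
        else:  # INSERT or MODIFY
            image = record["dynamodb"]["NewImage"]
            # Naive mapping: index every string attribute as a search field.
            fields = {k: v["S"] for k, v in image.items() if "S" in v}
            batch.append({"type": "add", "id": doc_id, "fields": fields})
    if batch:
        cloudsearch.upload_documents(
            documents=json.dumps(batch),
            contentType="application/json",
        )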
I'm not so clear on how Lambda would always keep the data in sync with the data in DynamoDB. Consider the following flow:
1. The application updates a DynamoDB table's record A (say to A1)
2. Very shortly after that, the application updates the same table's same record A (to A2)
3. The trigger for update 1 causes the Lambda for update 1 to start executing
4. The trigger for update 2 causes the Lambda for update 2 to start executing
5. Step 4 completes first, so CloudSearch sees A2
6. Now step 3 completes, so CloudSearch sees A1
Lambda triggers are not guaranteed to start ONLY after the previous invocation is complete. (Correct me if I'm wrong, and provide a link.)
As we can see, the thing goes out of sync.
The closest thing I can think of that will work is to use AWS Kinesis Streams, but with a single shard (1 MB/s ingestion limit). If that restriction works for you, then your consumer application can be written so that records are processed strictly sequentially, i.e., the next record is put into CloudSearch only after the previous record has been put.
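A very rough sketch of what such a single-shard, strictly sequential consumer could look like with boto3 (the stream name and the indexing helper are placeholders; error handling and checkpointing are omitted):

import time
import boto3

kinesis = boto3.client("kinesis")

def consume_in_order(stream_name="my-stream"):  # placeholder stream name
    """Read the single shard sequentially so records reach CloudSearch in order."""
    shard_id = kinesis.describe_stream(StreamName=stream_name)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            # Only move on once this record has been indexed into CloudSearch.
            index_into_cloudsearch(record["Data"])  # hypothetical helper
        iterator = resp["NextShardIterator"]
        time.sleep(1)  # stay under the per-shard read limits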
