Efficient way to increment a vertex counter property in JanusGraph (Gremlin)

I am using JanusGraph 0.2.0 with a Cassandra backend and Elasticsearch.
I want to store the number of views in a vertex property, and I need an efficient and scalable way to increment/store the view count without impacting read performance. The options I have considered so far:
Option 1: read the views property from the graph while fetching the vertex, and update the new view count in a separate query (won't impact read performance, but the counter is not synchronised):
g.V().has("key","keyId").valueMap(true);
g.V(id).property('views', 21);
Option 2: use sack() to carry the value 1 and add it to the views property:
g.withSack(0).V().has("key","keyId").
sack(assign).by("views").sack(sum).by(constant(1)).
property("views", sack())
Option 3: use in-memory storage (Redis) to increment the counters, and periodically persist the updates to the graph.
Is there any other, better approach?
Is there any way to use Cassandra's counter functionality in JanusGraph?

There is no way to use Cassandra counters with JanusGraph. Moreover, there is no way to mix Cassandra counters into a general Cassandra table. The logic of a Cassandra counter is designed so that updating the counter does not require a lock, which is why you get a lot of limitations in exchange for great performance.
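For reference, this is roughly what plain Cassandra counters look like outside of JanusGraph. A minimal sketch using the DataStax Python driver, assuming a local cluster; the keyspace, table, and column names are made up for illustration:

# Sketch only: plain Cassandra counter usage, outside of JanusGraph.
# Keyspace, table, and column names are illustrative assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("views_ks")

# A counter table may contain only counter columns besides its primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        vertex_key text PRIMARY KEY,
        views counter
    )
""")

# Counter updates are lock-free increments; you cannot INSERT a counter or set an absolute value.
session.execute(
    "UPDATE page_views SET views = views + 1 WHERE vertex_key = %s",
    ["keyId"],
)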
Counting views isn't that easy a task. In short, my suggestion would be to go with option 3.
I would go with Redis and periodic updates to JanusGraph when you are in a single data center and a single Redis master can handle all requests (you can of course use a hash ring to split your counters among different Redis servers, but that increases complexity and maintenance cost).
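As a rough illustration of that option, here is a minimal sketch assuming redis-py, gremlinpython, a local Redis, and a Gremlin Server at localhost; the Redis key scheme is invented:

# Sketch only: increment view counters in Redis on the hot path, and periodically
# flush the accumulated deltas into JanusGraph. Endpoints and key names are assumptions.
import redis
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

r = redis.Redis(host="localhost", port=6379)

def record_view(vertex_key):
    # Cheap, atomic increment; does not touch the graph at all.
    r.hincrby("views:pending", vertex_key, 1)

def flush_to_janusgraph():
    # Run this from a periodic job (cron, scheduler thread, etc.).
    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = traversal().withRemote(conn)
    try:
        for raw_key, delta in r.hgetall("views:pending").items():
            key = raw_key.decode()
            current = g.V().has("key", key).values("views").toList()
            new_total = (current[0] if current else 0) + int(delta)
            g.V().has("key", key).property("views", new_total).iterate()
            # Note: increments arriving between hgetall and hdel could be lost here;
            # a production job would need an atomic read-and-reset instead.
            r.hdel("views:pending", raw_key)
    finally:
        conn.close()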
If you have multiple data centers, or a single Redis master cannot handle all requests, I would go with Cassandra counters.
If you have such a large volume of view events that even Cassandra counters (with their cache) cannot keep up, because the disk is accessed too many times and you cannot scale further at acceptable cost, then the logic gets harder. I have never been in such a situation, so this is only theoretical. In that case I would have the application servers cache and group views and periodically send this batched data to RabbitMQ workers, which would update the Cassandra counters and then update the relevant vertex in JanusGraph with the total view count. That way, views for a frequently visited vertex would be grouped, so instead of updating the counter with +1 each time you would update it with +100 or +1000 views in a single update. It would decrease disk usage very much and you would have eventually consistent and fast counters. Again, this solution is only theoretical and should be tested. I believe other solutions also exist.
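A hedged sketch of that batching idea (the queue name, grouping interval, and key scheme are all assumptions; as noted above, this design is untested):

# Sketch only: group view events in the application server and publish aggregated
# deltas to RabbitMQ, so workers apply +N once instead of applying +1 N times.
import json
import threading
from collections import Counter

import pika

pending = Counter()
lock = threading.Lock()

def record_view(vertex_key):
    with lock:
        pending[vertex_key] += 1

def publish_batch():
    # Call this periodically (e.g. every few seconds) from a timer thread.
    with lock:
        batch = dict(pending)
        pending.clear()
    if not batch:
        return
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="view_counts", durable=True)
    channel.basic_publish(exchange="", routing_key="view_counts", body=json.dumps(batch))
    connection.close()
    # A worker consuming "view_counts" would add each delta to a Cassandra counter
    # and then write the new total onto the corresponding JanusGraph vertex.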

Related

ElastiCache vs DynamoDB DAX

I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables a lot (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance,
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps like yours. But be aware that DAX only helps with eventually consistent reads, so don't use it with banking apps, etc., where the information always needs to be perfectly up to date. Without further info it's hard to say more; these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is the older approach and is also used to store session state in addition to cached data.
DAX is used extensively for read-intensive, latency-sensitive applications via eventually consistent reads. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters used in Query or Scan requests.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your read calls use the item-level API (and NOT the query-level API), such as GetItem.
Why? DAX has one odd behaviour, described by AWS as follows:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a Query operation is cached, and you then perform a write that affects the result of that previously cached query, then until the cached entry expires your query cache result will be outdated.
This out-of-sync issue is also discussed here.
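A minimal sketch of that scenario, using plain boto3 to show the request shapes (table and key names are invented; with DAX you would construct the resource through the amazondax client instead of boto3 directly):

# Sketch only: why the DAX query cache can go stale after a write.
# Table, key, and attribute names are assumptions.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Messages")

# 1. Query -> through DAX, this result lands in the *query cache*.
first = table.query(KeyConditionExpression=Key("chat_id").eq("room-1"))

# 2. Write -> through DAX, this only updates the *item cache*.
table.update_item(
    Key={"chat_id": "room-1", "message_id": "42"},
    UpdateExpression="SET body = :b",
    ExpressionAttributeValues={":b": "edited"},
)

# 3. Same query again -> through DAX, it can still be served from the query cache
#    and show the pre-update result until the cached entry's TTL expires.
second = table.query(KeyConditionExpression=Key("chat_id").eq("room-1"))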
I find DAX useful only for cached queries, PutItem and GetItem. In general it is very difficult to find a use case for it.
DAX separates Query/Scan caching from CRUD on individual items. That means if you update an item and then do a Query/Scan, it will not reflect the change.
You can't invalidate the cache; entries are only invalidated when the TTL is reached or when a node's memory is full and it drops old items.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they can be out of sync.
Looking for a single item - you are left with only the primary key/default index and a GetItem request (no Query with Limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option on a Query to get the latest data - it works, but only on the primary index (see the sketch after this list).
Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
X-Ray does not work with DAX.
Use case:
You have queries where you don't really care that they are not up to date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.
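For the ConsistentRead takeaway above, a small hedged sketch (table and attribute names are invented):

# Sketch only: a strongly consistent Query on the base table's primary index.
# ConsistentRead is not supported on global secondary indexes.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Messages")

latest = table.query(
    KeyConditionExpression=Key("chat_id").eq("room-1"),
    ConsistentRead=True,
)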

DynamoDB secondary index latency for real-time updates

I am wondering if Amazon DynamoDB global secondary indexes can be used for a real-time application with very heavy writes, e.g. a chat application, where the global secondary indexes need to be updated with sub-millisecond latency as soon as the main table write/update is done. Would that be possible?
With those requirements, at this time, no. While DynamoDB can handle VERY heavy write throughput, the data currently will replicate from the table to the GSI in single digit millisecond latency, not sub-millisecond. Just make sure you get your data model correct.
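For completeness, a hedged boto3 sketch of how a GSI is declared at table creation; all names and throughput values here are invented, and writes to the base table propagate to the index asynchronously:

# Sketch only: a table with one global secondary index. Names and capacities are assumptions.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="ChatMessages",
    AttributeDefinitions=[
        {"AttributeName": "chat_id", "AttributeType": "S"},
        {"AttributeName": "message_id", "AttributeType": "S"},
        {"AttributeName": "user_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "chat_id", "KeyType": "HASH"},
        {"AttributeName": "message_id", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "by_user",
            "KeySchema": [{"AttributeName": "user_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 100, "WriteCapacityUnits": 1000},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 1000},
)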

Titan + DynamoDB backend traversal performance

We are trying to use Titan (version 1.0.0) with the DynamoDB backend as our recommendation system engine. We have a huge database of users and their relationships: about 3.5 million users and about 2 billion relationships between them.
Here is the code that we used to create the schema:
https://gist.github.com/angryTit/3b1a4125fc72bc8b9e9bb395892caf92
As you can see, we use one composite index to find the starting point of the traversal quickly, 5 edge types, and some properties.
In our case users can have a really large number of edges; each could have tens of thousands of edges.
Here is the code that we use to provide recommendations online:
https://gist.github.com/angryTit/e0d1e18c0074cc8549b053709f63efdf
The problem is that the traversal is very slow.
This one
https://gist.github.com/angryTit/e0d1e18c0074cc8549b053709f63efdf#file-reco-L28
takes 20-30 seconds when a user has about 5000-6000 edges.
Our DynamoDB tables have enough read/write capacity (we can see from CloudWatch that consumed capacity is lower than the provisioned capacity by 1000 units).
Here is our configuration of Titan:
https://gist.github.com/angryTit/904609f0c90beca5f90e94accc7199e5
We tried to run it inside Lambda functions with max memory and on a big instance (r3.8xlarge), but with the same results...
Are we doing something wrong, or is this normal in our case?
Thank you.
The general recommendation would be to use vertex-centric indexes to speed up your traversals on Titan. Also, Titan is a dead project; if you're looking for updates to the code, JanusGraph has forked the Titan codebase and continues to update it.
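As a rough, untested sketch of what a vertex-centric index definition looks like (the edge label 'friend' and property 'weight' below are placeholders, not taken from the linked gists), the management statements are the Groovy in the string; they could be pasted directly into the Titan Gremlin console, or submitted from Python if a gremlinpython-compatible Gremlin Server is available:

# Sketch only: define a vertex-centric (edge) index so traversals from a
# high-degree vertex don't have to scan all incident edges.
from gremlin_python.driver import client

groovy = """
mgmt = graph.openManagement()
friend = mgmt.getEdgeLabel('friend')        // placeholder edge label
weight = mgmt.getPropertyKey('weight')      // placeholder sort key
mgmt.buildEdgeIndex(friend, 'friendByWeight', Direction.BOTH, Order.decr, weight)
mgmt.commit()
"""

c = client.Client("ws://localhost:8182/gremlin", "g")  # assumes a compatible Gremlin Server
c.submit(groovy).all().result()
c.close()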

Is it OK to build an architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app where all messages are arranged into seasons by creation time. There could be billions of messages in each season. I have to delete the messages of old seasons. I thought of a solution which involves DynamoDB table creation/deletion, like this:
Each table contains messages of only one season
When a season becomes 'old' and its messages are no longer needed, the table is deleted
Is this a good pattern, and is it encouraged by Amazon?
PS: I'm asking because I'm wary of two things I have run into with other Amazon services:
In Amazon S3 you have to delete each object before you can fully delete a bucket. When you have billions of objects, that becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour': when using the SQS API you can behave badly towards the SQS infrastructure (for example, by not polling messages) and could be penalized for it.
Yes, this is an acceptable design pattern; it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a DynamoDB table that still contains records; if you have a large number of records that you have to delete regularly, using a rolling set of tables is actually a best practice.
Creating and deleting tables are asynchronous operations, so you do not want your application to depend on the time it takes for these operations to complete. Make sure you create tables well in advance of needing them. Under normal circumstances tables are created in a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
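A small hedged sketch of the rolling-table approach (the table naming scheme and capacity numbers are invented):

# Sketch only: create next season's table well in advance, drop the expired one.
# Table naming scheme and throughput values are assumptions.
import boto3

dynamodb = boto3.client("dynamodb")

def create_season_table(season):
    dynamodb.create_table(
        TableName=f"messages_{season}",
        AttributeDefinitions=[
            {"AttributeName": "conversation_id", "AttributeType": "S"},
            {"AttributeName": "created_at", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "conversation_id", "KeyType": "HASH"},
            {"AttributeName": "created_at", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 500, "WriteCapacityUnits": 500},
    )
    # Table creation is asynchronous; wait before routing traffic to the new table.
    dynamodb.get_waiter("table_exists").wait(TableName=f"messages_{season}")

def drop_season_table(season):
    # Deleting the table removes all of its items in a single operation.
    dynamodb.delete_table(TableName=f"messages_{season}")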
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states:
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds, 2 minutes, or 20 minutes), but as long as your solution does not depend on this sort of timing, you're fine.
In fact, the idea of sharding your data based on age has the potential to significantly improve the performance of your application and will definitely help you control your costs.

Riak and time-sorted records

I'd like to sort some records stored in Riak by a function of each record's score and "age" (current time - creation date). What is the best way to do a "time-sensitive" query in Riak? Thus far, the options I'm aware of are:
Real-time MapReduce - do the entire calculation in a MapReduce job, at query time
ETL job - periodically do the query in a background job, and store the result back into Riak
Punt it to the app layer - don't sort in Riak at all, and instead use an application-level layer to sort and cache the records
MapReduce seems the best on paper; however, I've read mixed reports about the real-world latency of Riak MapReduce.
MapReduce is quite an expensive operation and is not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode, where the number of concurrent MapReduce jobs can be controlled, and I would therefore not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice, as described in the second option, could work and would allow efficient access to the prepared data through direct key access. If you are using leveldb, the aggregation process could be based around a secondary index holding a timestamp (see the sketch below). One downside, however, is that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
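For that second option, a minimal sketch using the official Riak Python client and an integer secondary index on the creation timestamp (the bucket names, index name, and decay formula are assumptions; secondary indexes require the leveldb backend):

# Sketch only: periodic job that pulls records for a time slice via a secondary
# index, scores them by score and age, and stores the sorted key list for
# direct key access at query time.
import time

import riak

client = riak.RiakClient()
records = client.bucket("records")
summaries = client.bucket("summaries")

def build_summary(window_seconds=3600):
    now = int(time.time())
    # Range query on an integer secondary index holding the creation timestamp.
    keys = records.get_index("created_at_int", now - window_seconds, now)
    scored = []
    for key in keys:
        data = records.get(key).data
        age = now - data["created_at"]
        scored.append((data["score"] / (1.0 + age), key))  # example decay function only
    scored.sort(reverse=True)
    # Store the precomputed ranking under a well-known key.
    summaries.new("latest", data=[k for _, k in scored]).store()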
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possible, preferably through direct key access, and then perform filtering of unneeded data, as well as sorting and aggregation, on the application side.
