We are testing Apache Ignite Continuous Queries and, during the tests, we have noticed that some continuous queries miss some updates of the cache.
Our use case is the following:
we have a feeder (that is part of the server nodes) that feeds 1000 caches with data every second
we have a client program (that is part of the server nodes) that opens 1000 continuous queries, 1 continuous query per cache
So, some of these continuous queries miss, from time to time, a few updates of the cache. We were wondering whether the fact that a continuous query misses some updates of the cache is "normal".
We have also performed the test with 500, 250, and 100 caches and got the same results.
We are also wondering how many caches and continuous queries can be created. Does Apache Ignite support hundreds of thousands of caches and continuous queries?
Notifications should never be lost in continuous queries. If this actually happens, it's most likely a bug in the product or in your test. I would recommend sharing your test with the Apache Ignite community.
Got the answer on the Ignite mailing list (http://apache-ignite-users.70518.x6.nabble.com/Can-a-Continuous-Queries-miss-some-updates-order-td11620.html#a11623):
A Continuous Query (CQ) guarantees that the order of events will be preserved per entry. For example:
cache.put(key1, 100);
cache.put(key2, 100);
cache.put(key1, 200);
cache.put(key2, 200);
CQ guarantees that for key1 the events will be received in the order (key1, 100) --> (key1, 200), but you can get the event for key2 earlier than the one for key1.
Ignite doesn't have a limitation on the number of caches or CQs.
Thanks Nikolai Tikhonov!
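For reference, here is a minimal sketch of how one such continuous query can be opened with Ignite's Java API (the cache name and value types are illustrative, not taken from our actual test):

import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

public class CqSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache("cache-0"); // illustrative cache name

        ContinuousQuery<Integer, Integer> qry = new ContinuousQuery<>();

        // Local listener: invoked for every update; Ignite preserves per-key event order.
        qry.setLocalListener(events -> {
            for (CacheEntryEvent<? extends Integer, ? extends Integer> e : events)
                System.out.println("key=" + e.getKey() + " val=" + e.getValue());
        });

        // Updates keep arriving for as long as the cursor stays open.
        QueryCursor<?> cursor = cache.query(qry);
    }
}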
I have been scouring the internet for days for a solution to this problem.
That is, how to handle aggregation when there is no network connection? I have a task management app that aggregates metadata about user tasks. For example, a task can contain tags, which can be aggregated and shown in a dashboard to the user on a daily basis. This would be easy if the user were always online, because I could use a transaction or a Cloud Function to aggregate, but when the user is offline the aggregation will appear to be incorrect until the user restores their network connection.
Aggregation queries are explained here:
https://firebase.google.com/docs/firestore/solutions/aggregation
Which states a limitation:
Offline support - Client-side transactions will fail when the user's device is offline, which means you need to handle this case in your app and retry at the appropriate time.
However, there has yet to be any example or documentation on how to 'handle this case'. How would I go about addressing this problem?
Some thoughts:
I could cache the item if a transaction fails. This item would be aggregated on top of the stored aggregation. However, going down this route would mean that I can't take advantage of Firestore's "offline mode", because I'd be using my own cache on every write while offline anyway.
I could aggregate on demand, i.e. never store the aggregation. This is going to be very read-heavy, depending on how many tasks a user has. Furthermore, if the aggregation needs to be shared as insights with other users, this option will not work, because other users do not have access to the tasks.
I'm at a loss and any help would be appreciated, thanks!
After a lot of research and trial and error I found a solution that can address this problem gracefully.
FieldValue.increment to the rescue.
What FieldValue.increment does is bypass the use of a transaction while respecting Firestore's default offline cache behaviour. It requires using set or update on the field directly. The drawback is the inability to use 'withConverter' on the collection for type safety. I'm willing to live with that drawback considering how useful FieldValue.increment is.
I've done multiple tests and can confirm that the values can be incremented/decremented multiple times locally while offline. This offline value is reflected in a get or snapshot call to the cache. When the network connection is restored, the values are updated on the server.
The value itself is not stored in the cache; the cache simply stores the "difference" in the FieldValue sentinel for when it is time to update it on the server.
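For illustration, a minimal sketch using the Android Java SDK (the document path and field names are made up for this example; the equivalent FieldValue.increment call exists in the other Firestore SDKs):

import com.google.firebase.firestore.FieldValue;
import com.google.firebase.firestore.FirebaseFirestore;

public class TagCounter {
    private final FirebaseFirestore db = FirebaseFirestore.getInstance();

    // Bumps a per-tag counter; works offline and syncs once the connection is restored.
    // Assumes the dashboard document already exists; otherwise use set(..., SetOptions.merge()).
    public void incrementTagCount(String userId, String tag) {
        db.collection("dashboards").document(userId)            // hypothetical aggregation document
          .update("tagCounts." + tag, FieldValue.increment(1)); // nested map field via dot notation
    }
}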
This method only works for incrementing and decrementing values. Storing averages is not possible this way, because the true total number of items is not known at the time the average would be calculated while offline.
Instead, the total number of items is stored alongside the total value, and the average is calculated when and as needed. This way the average is always accurate from a local perspective while offline, and it is also accurate online once the total value and count have been synced.
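A sketch of that total-plus-count approach, with the same caveat that the document and field names are just placeholders:

import com.google.firebase.firestore.DocumentReference;
import com.google.firebase.firestore.FieldValue;
import com.google.firebase.firestore.FirebaseFirestore;

public class DurationStats {
    private final DocumentReference stats =
        FirebaseFirestore.getInstance().collection("dashboards").document("user-123"); // hypothetical doc

    // Record one completed task: bump the running total and the count, never the average itself.
    public void recordTask(long durationMinutes) {
        stats.update(
            "durationTotal", FieldValue.increment(durationMinutes),
            "taskCount", FieldValue.increment(1));
    }

    // Derive the average only when it is displayed; this also works against the offline cache.
    public void showAverage() {
        stats.get().addOnSuccessListener(snap -> {
            Long total = snap.getLong("durationTotal");
            Long count = snap.getLong("taskCount");
            if (total != null && count != null && count > 0)
                System.out.println("avg = " + total.doubleValue() / count);
        });
    }
}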
I have set up TTL on a DynamoDB table and enabled a stream. According to the AWS docs it can take up to 48 hours before an item is removed. I have run some experiments and I am seeing a 10-minute delay. I can live with this, but has anyone else had longer delays?
Yes,
there are instances where item removal takes more than 10 minutes. In fact, the SLA from DynamoDB is 48 hours. The time needed for the actual removal to happen depends on the activity level of the DynamoDB table.
A more pointed rephrasing of Allan's answer:
Even if no one has seen that delay (and the chance of finding one anecdotally through a Q&A site seems like a bad statistical test), Amazon says to expect the possibility of that much delay. TTL is for resource cleanup only, and a breach of the 48-hour SLA will most likely only entitle you to a refund of storage costs.
Do not depend on the absence of a given item to trigger logic within your application (e.g., user session timeout).
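A practical corollary is to treat an item whose TTL timestamp has passed as already gone, even if DynamoDB hasn't physically removed it yet. A rough sketch with the AWS SDK for Java v2 (table, key, and attribute names are placeholders):

import java.time.Instant;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class TtlAwareRead {
    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();
        long now = Instant.now().getEpochSecond();

        // Filter out items that have expired but have not been removed by the TTL process yet.
        QueryResponse resp = dynamo.query(QueryRequest.builder()
            .tableName("sessions")                            // hypothetical table
            .keyConditionExpression("pk = :pk")
            .filterExpression("expiresAt > :now")             // 'expiresAt' is the TTL attribute (epoch seconds)
            .expressionAttributeValues(Map.of(
                ":pk", AttributeValue.builder().s("user#123").build(),
                ":now", AttributeValue.builder().n(Long.toString(now)).build()))
            .build());

        System.out.println("live items: " + resp.count());
    }
}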
My application is not live yet, so I'm testing the performance of my Gremlin queries before it gets into production.
To test, I'm using a query that adds edges from one vertex to 300 other vertices. It does more things, but that's the simple description. I added this workload of 300 just for testing.
If I run the query 300 times, one after the other, it takes almost 3 minutes to finish and 90,000 edges are created (300 x 300).
I'm worried because if I have around 60,000 users using my application at the same time, they are probably going to be creating 90,000 edges in 2 minutes using this query, and 60,000 simultaneous users is not much in my case.
If I have 1 million users at the same time, I'm going to need many servers at full capacity, and that is out of my budget.
Then I noticed that while my test is executing, the CPU doesn't show much activity. I don't know why; I don't know how the database works internally.
So I thought that a more realistic scenario might be to send my queries all at the same time, because that is what's going to happen with real users. When I tried to test that, I got a ConcurrentModificationException.
As far as I understand, this error happens because an edge or vertex is being read or written by two queries at the same time. This is something that could happen a lot in my application, because all the user vertices are changing connections to the same 4 vertices all the time, so these "collisions" are going to be happening constantly.
I'm testing locally using Gremlin Server 3.4.8, connecting over sockets from Node.js. My plan is to use AWS Neptune as my database when it goes to production.
What can I do to recover hope? There must be very important things about this subject that I don't know, because I don't know how graph databases work internally.
Edit
I implemented logic to retry the query requests when receiving an error, using the "exponential backoff" approach. It fixed the ConcurrentModificationException,
but there are a lot of problems in Gremlin Server when sending multiple queries at the same time, which shows how poorly supported and unstable multi-threaded writes are there, and that we should try concurrent workloads against other Gremlin-compatible databases, as the answer says. I experienced random inconsistencies in the data returned by the queries and errors like NegativeArraySizeException and other random failures coming from the database. Be warned of this so you don't waste time thinking your own code is buggy, as happened to me.
While TinkerPop and Gremlin try to provide a vendor agnostic experience, they really only do so at their interface level. So while you might be able to run the same query in JanusGraph, Neptune, CosmosDB, etc, you will likely find that there are differences in performance depending on the nature of the query and the degree to which the graph in question is capable of optimizing that query.
For your case, consider TinkerGraph, since that is what you are running your tests against locally. TinkerGraph is an in-memory graph without transaction capability and is not proven thread-safe for writes. If you apply a heavy write workload to it, I could see a ConcurrentModificationException being easy to generate. Now consider JanusGraph. If you had tested your heavy write workload with that, you might have found yourself hit with tons of TemporaryLockingException errors if your schema required a unique property key, and you would have had to modify your code to do transactional retries with exponential backoff.
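To illustrate that retry pattern, here is a minimal sketch using the TinkerPop Java driver (the endpoint, traversal text, and retry limits are placeholders, and which exceptions are worth retrying depends on the target graph):

import java.util.List;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

public class RetrySketch {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        try {
            List<Result> results = submitWithRetry(client, "g.V().count()", 5);
            System.out.println(results.get(0).getLong());
        } finally {
            cluster.close();
        }
    }

    // Re-submits the traversal with exponentially growing waits on transient failures.
    static List<Result> submitWithRetry(Client client, String gremlin, int maxAttempts) throws Exception {
        long backoffMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return client.submit(gremlin).all().get(); // server-side errors surface here
            } catch (Exception e) {
                if (attempt >= maxAttempts)
                    throw e;
                Thread.sleep(backoffMs);
                backoffMs *= 2;
            }
        }
    }
}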
The point here is that if your target graph is Neptune and you have a traversal you've tested for correctness and now are concerned about performance, it's probably time to load test on Neptune to see if any problems arise there.
I'm worried because if I have around 60,000 users using my application at the same time, they are probably going to be creating 90,000 edges in 2 minutes using this query, and 60,000 simultaneous users is not much in my case. If I have 1 million users at the same time, I'm going to need many servers at full capacity, and that is out of my budget.
You will want to develop a realistic testing plan. Are 60,000 users all pressing "submit" to trigger this query at the exact same moment really what's going to happen? Or is it more likely that you have 100,000 users, some doing reads, and perhaps every half second three of them happen to click "submit"?
Your graph growth rate seems fairly high and the expected usage you've described here will put your graph in the category of billions of edges quite quickly (not to mention whatever other writes you might have). Have you tested your read workloads on a graph with billions of edges? Have you tested that explicitly on Neptune? Have you thought about how you will maintain a multi-billion edge graph (e.g. changing the schema when a new feature is needed, ensuring that it is growing correctly, etc.)?
All of these questions are rhetorical and just designed to make you think about your direction. Good luck!
Although there is already an accepted answer I want to suggest another approach to deal with this problem.
The idea is to figure out whether you really need to do synchronous writes on the graph. I'm suggesting that when you receive a request, you just use the attributes in the request to fetch the sub-graph / neighbors and continue with the business logic.
Simultaneously, put the event on an SQS queue or something similar so the writes are taken care of by an asynchronous system, say AWS Lambda, because with SQS + Lambda you can set the write concurrency to a level your system is comfortable with (low enough that your query does NOT cause the above exception).
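A rough sketch of handing the write off to a queue with the AWS SDK for Java v2 (the queue URL and payload are placeholders; the Lambda consumer that applies the actual Gremlin writes is not shown):

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class EnqueueGraphWrite {
    private static final String QUEUE_URL =
        "https://sqs.us-east-1.amazonaws.com/123456789012/graph-writes"; // placeholder queue

    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();

        // Enqueue the event; a Lambda consumer with limited concurrency performs the edge writes.
        sqs.sendMessage(SendMessageRequest.builder()
            .queueUrl(QUEUE_URL)
            .messageBody("{\"userId\":\"u-1\",\"relatedIds\":[\"v-1\",\"v-2\"]}") // illustrative payload
            .build());
    }
}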
A further suggestion: you have an abnormally high write blast radius; your query should NOT touch that many vertices while writing. You can try converting some of the edges into vertices in order to reduce the radius. Then, while inserting a node, you only have to create one edge to the vertex that was previously an edge, instead of creating hundreds of edges to all the vertices this node is related to. I hope this makes sense.
I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables many times (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps like yours. But be aware that DAX is only useful for eventually consistent reads, so don't use it with banking apps, etc., where the info always needs to be perfectly up to date. Without further info it's hard to say more; these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is an older approach and is used to store session state in addition to cached data.
DAX is extensively used for read-intensive workloads via eventually consistent reads and for latency-sensitive applications. DAX maintains its cache in two structures:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters used in Query or Scan requests.
Cheers!
I'd recommend using DAX with DynamoDB provided you're making more read calls through the item-level API (and NOT the query-level API), such as the GetItem API.
Why? DAX has one odd behavior, as follows. From AWS:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
Hence, to elaborate: if a query operation is cached, and you then perform a write that affects the result of the previously cached query, your query cache result will be outdated until that cache entry expires.
This out-of-sync issue is also discussed here.
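To make the out-of-sync behaviour concrete, here is a sketch that assumes 'dax' is a DAX client exposing the standard DynamoDbClient interface (table, key, and attribute names are made up):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class DaxStalenessSketch {
    // 'dax' would be built with the DAX client library; it implements the DynamoDbClient interface.
    static void demo(DynamoDbClient dax) {
        // 1. The write updates DynamoDB and the DAX *item* cache...
        dax.updateItem(UpdateItemRequest.builder()
            .tableName("t1")
            .key(Map.of("pk", AttributeValue.builder().s("user#1").build()))
            .updateExpression("SET score = :s")
            .expressionAttributeValues(Map.of(":s", AttributeValue.builder().n("42").build()))
            .build());

        // 2. ...but a previously cached query result is served from the *query* cache,
        //    so it can still return the old score until its TTL expires.
        dax.query(QueryRequest.builder()
            .tableName("t1")
            .keyConditionExpression("pk = :pk")
            .expressionAttributeValues(Map.of(":pk", AttributeValue.builder().s("user#1").build()))
            .build());
    }
}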
I find DAX useful only for cached queries, PutItem, and GetItem. In general it is very difficult to find a use case for it.
DAX keeps queries and scans in a separate cache from CRUD operations on individual items. That means if you update an item and then do a query/scan, it will not reflect the change.
You can't invalidate the cache; entries are only invalidated when their TTL is reached or when a node's memory is full and it starts dropping old items.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they can be out of sync.
Looking for a single item - you are left with only the primary key, the default index, and a GetItem request (no query with limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option when querying to get the latest data - it works, but only for the primary index (see the sketch after this list).
Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
X-Ray does not work with DAX.
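As a sketch of the ConsistentRead point above (same assumption that 'dax' implements the standard DynamoDbClient interface; names are placeholders), a strongly consistent query is passed through to DynamoDB rather than served from the DAX cache:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class ConsistentReadSketch {
    // 'dax' would be the DAX client; strongly consistent requests bypass its cache.
    static QueryResponse latest(DynamoDbClient dax, String userId) {
        return dax.query(QueryRequest.builder()
            .tableName("t1")                                   // hypothetical table
            .keyConditionExpression("pk = :pk")
            .expressionAttributeValues(Map.of(
                ":pk", AttributeValue.builder().s(userId).build()))
            .consistentRead(true)                              // passed through to DynamoDB; base table only
            .build());
    }
}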
Use Case
You have queries where you don't really care that the results are not up to date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.
I'd like to set up a scripted input in Splunk to curl the render URL API for Graphite. I imagine I could configure this input to run every minute and retrieve the last minute's worth of events.
My concern with this is that some events might be missed, or duplicated.
Has anybody done something similar to this? How could I keep track of the events from Graphite that I have already read?
If you write a modular input you can use data checkpoints. See the docs for more info: http://docs.splunk.com/Documentation/Splunk/6.2.1/AdvancedDev/ModInputsCheckpoint
My concern with this is that some events might be missed, or duplicated.
Yes, events may go missing, in two cases:
If you're pushing your Graphite server to its limits, there is a lag between the point where a datapoint is received and when it is flushed to disk. With large queues, I have seen this go up to 20 minutes (IO is the constraint here).
For example, in the case above where there is a 20-minute lag and I am storing data at 1-minute granularity, the latest 20 datapoints will show NULL against their timestamps. Of course, they will soon fill in with the next flush.
Know that these lags are indeterminate, so only rely on the very latest datapoints if you have a zero-lag deployment.
The latest datapoint may or may not be NULL at any given moment because of Graphite's flushing behaviour, even if nothing is throttling. You can use something like &from=-21m&to=-1m to make sure you never encounter this. Note: your monitoring now lags by a minute. :)
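Tying the two answers together, a rough sketch of the checkpoint-plus-lag-window idea in Java (the metric name, checkpoint path, and one-minute lag are assumptions; a Splunk modular input would use its own checkpoint mechanism rather than a local file):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

public class GraphitePoller {
    private static final Path CHECKPOINT = Path.of("/tmp/graphite.checkpoint"); // placeholder path
    private static final long LAG_SECONDS = 60; // stay one minute behind to avoid unflushed NULLs

    public static void main(String[] args) throws IOException, InterruptedException {
        long until = Instant.now().getEpochSecond() - LAG_SECONDS;
        long from = Files.exists(CHECKPOINT)
            ? Long.parseLong(Files.readString(CHECKPOINT).trim())
            : until - 60;

        // Ask Graphite's render API for the window since the last checkpoint (epoch timestamps assumed).
        String url = "http://graphite.example.com/render?target=app.requests.count"
            + "&format=json&from=" + from + "&until=" + until;

        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(url)).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // emit for Splunk to index

        // Remember where this run stopped so the next one picks up from the same point.
        Files.writeString(CHECKPOINT, Long.toString(until));
    }
}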
All said, Graphite is a great monitoring tool if your requirements aren't realtime.