DynamoDB: Time frame to avoid stale read

I am writing to DynamoDB using AWS Lambda and reading the data back through the AWS Console. But I have seen instances of stale reads, with the latest records not showing up when I pull records within a few minutes of the write. What is a safe time interval for a data pull that would ensure the latest data is available on read? Would 30 mins be a safe interval?
The below is from the AWS site; I just want to understand how recent "recent" is here: "When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data."

If you must have a strongly consistent read, you can specify that in your read statement. That way the client will always read from the leader storage node for that partition.
In order for DynamoDB to acknowledge a write, the write must be durable on the leader storage node for that partition and one other storage node for the partition.
If you do an eventually consistent read (which is the default), you have a roughly 1-in-3 chance of that read coming from the one node that was not part of the acknowledged write, and an even smaller chance that the item has not yet been updated on that third storage node.
So, if you need a strongly consistent read, ask for one and you'll get the newest version of that item. There is no real performance degradation for doing a strongly consistent read.
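For instance, with boto3 (the AWS SDK for Python) the only change is the ConsistentRead flag; the table and key names below are hypothetical:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table name

# Eventually consistent read (the default): may be served by a replica
# that has not yet received the latest write.
maybe_stale = table.get_item(Key={"order_id": "1234"})

# Strongly consistent read: served by the leader storage node for the
# partition, so it reflects every successful write that preceded it.
fresh = table.get_item(Key={"order_id": "1234"}, ConsistentRead=True)
item = fresh.get("Item")  # absent if the key does not exist
```

One caveat: strongly consistent reads consume twice the read capacity of eventually consistent ones and are not supported on global secondary indexes, so they are a per-read decision rather than a table-wide default.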

Related

How to take druid segment data backup?

I am new to Druid. In our application we use Druid for time-series data, and it can grow pretty large (10-20 TB).
Druid provides the facility of deep storage. But if the deep storage crashes or is not reachable, the result is data loss, which in turn affects the analytics the application is running.
I am thinking of taking incremental backups of Druid segment data to some secure location, such as an FTP server, so that if deep storage is unavailable, the data can be restored from that FTP server.
Is there any tool/utility available in Druid to incrementally back up/restore Druid segments?
In general it's important to take regular snapshots of the metadata storage as this is the "index" of what's in the Deep Storage. Maybe one snapshot per day, and store them for however long you like. It's good to store them for at least a couple of weeks, in case you need to roll back for some reason.
You also need to back up new segments in deep storage when they appear. It isn't important to take consistent snapshots, just to get every file eventually.
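Druid itself doesn't ship an incremental backup utility for segments, but because published segment files are immutable, a periodic sync job is enough. Below is a minimal sketch assuming deep storage is a mounted filesystem; both paths are hypothetical, and for S3/HDFS deep storage the same logic applies with the corresponding client:

```python
import shutil
from pathlib import Path

# Hypothetical locations: adjust for your deep storage and backup target.
DEEP_STORAGE = Path("/data/druid/segments")
BACKUP_ROOT = Path("/mnt/backup/druid/segments")

def incremental_backup():
    """Copy any segment file not yet present in the backup location.

    Segments are immutable once published, so a file that already exists
    in the backup never needs to be copied again.
    """
    for src in DEEP_STORAGE.rglob("*"):
        if not src.is_file():
            continue
        dst = BACKUP_ROOT / src.relative_to(DEEP_STORAGE)
        if not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)

if __name__ == "__main__":
    incremental_backup()
```

Running this from cron alongside a daily dump of the metadata store (e.g. mysqldump of the Druid metadata database) covers both halves: the segment files and the index that tells Druid what they are.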
Also see https://groups.google.com/g/druid-user/c/itfKT5vaDl8
One other note, since you mentioned data loss: Deep Storage is not queried directly - queries execute on the local segment cache in, for example, the Historical process. Deep Storage is written to at ingestion time, so you might "lose" data that cannot be ingested until Deep Storage is available again, but you will continue to get analytics capability, as the already-loaded data is on the Historicals.
I hope that helps.

Will Google Firestore always write to the server immediately in a mobile app with frequent writes?

Is it possible to limit the speed at which Google Firestore pushes writes made in an app to the online database?
I'm investigating the feasibility of using Firestore to store a data stream from an IoT device via a mobile device/bluetooth.
The main concern is battery cost - a new data packet arrives roughly every two minutes, and I'm concerned about the additional battery drain that an internet round-trip every two minutes, 24hrs a day, will cost. I would also want to limit updates to Wi-Fi connections only.
It's not important for the data to be available online in real time. However, it is possible for multiple sources to add to the same data stream in a two-way sync (i.e. the online DB and all devices end up with the merged data).
I'm currently handling that myself, but when I saw the offline capability of Firestore I hoped I could get all that functionality for free.
I know we can't directly control offline-mode in Firestore, but is there any way to prevent it from always and immediately pushing write changes online?
The only technical question I can see here has to do with how batch writes operate and, more importantly, cost. Simply put, a batch write of 100 writes costs the same as 100 individual writes; batching is not a way to avoid Firestore's write costs. The same goes for transactions, and for editing a document (that's a write too). If you really want to avoid those costs, you could buffer the values for the thirty minutes and let the client send the aggregated data in a single document. Though you mentioned you need the data to be immediate, so I'm not sure that's an option for you. Of course, this depends on what one interprets "immediate" as, relative to the timespan involved. In my opinion (I know opinions aren't really allowed here, but it's kind of part of the question), if the data is stored over months/years, 30 minutes is fairly immediate. Either way, batch writes aren't quite the solution I think you're looking for.
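To illustrate the aggregation idea, here is a rough sketch using the server-side Python SDK (google-cloud-firestore) purely for brevity - on a mobile client the same pattern applies with the platform SDK, and the collection and field names are made up:

```python
from google.cloud import firestore

db = firestore.Client()
buffer = []  # in-memory buffer of readings since the last flush

def on_new_reading(value, timestamp):
    # Cheap local append: no network traffic, no billed write.
    buffer.append({"t": timestamp, "v": value})

def flush(device_id):
    # One billed document write for the whole window, instead of one
    # write per individual reading.
    if not buffer:
        return
    doc_id = f"{device_id}-{buffer[0]['t']}"  # hypothetical ID scheme
    db.collection("reading_windows").document(doc_id).set(
        {"device": device_id, "readings": list(buffer)}
    )
    buffer.clear()
```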
EDIT: You've updated your question, so I'll update my answer. You can build a local cache system and choose how you sync, however you wish; that's completely up to you and your own code. Writes aren't automatic, so if you want to send a data packet only once an hour, you'd send it at that time. You'll likely want to do this in a transaction if multiple devices write to the same stream, so one doesn't overwrite the other when they send at the same time. Other than that, I don't see Firestore being a problem for you.
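A sketch of that transactional append, again in Python with hypothetical names - Firestore re-runs the function on contention, so two devices flushing to the same stream document at once won't clobber each other:

```python
from google.cloud import firestore

db = firestore.Client()

@firestore.transactional
def append_to_stream(transaction, stream_ref, new_readings):
    # Read the current document inside the transaction...
    snapshot = stream_ref.get(transaction=transaction)
    existing = snapshot.to_dict().get("readings", []) if snapshot.exists else []
    # ...then write the merged list. If another device committed in the
    # meantime, Firestore aborts and re-runs this function automatically.
    transaction.set(stream_ref, {"readings": existing + new_readings})

stream_ref = db.collection("streams").document("shared-stream")  # hypothetical
append_to_stream(db.transaction(), stream_ref, [{"t": 1700000000, "v": 42}])
```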

Automatically archive Cosmos DB documents with TTL

Is there a way to automatically move expired documents to blob storage via change feed?
I googled but found no solution for automatically moving expired documents to blob storage via the change feed option. Is it possible?
There is no built-in functionality for something like that, and the change feed is of no use in this case.
The change feed processor (which is what the Azure Functions trigger uses too) won't notify you of deleted documents, so you can't listen for them.
Your best bet is to write a custom application that does scheduled archiving and then deletes the archived documents.
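A minimal sketch of such a job using the azure-cosmos and azure-storage-blob Python SDKs; the account details, container names, the partition key property ('pk'), and using the _ts system timestamp as the age criterion are all assumptions:

```python
import json
from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

cosmos = CosmosClient(url="<account-url>", credential="<account-key>")
container = cosmos.get_database_client("appdb").get_container_client("items")
archive = BlobServiceClient.from_connection_string(
    "<storage-connection-string>"
).get_container_client("cosmos-archive")

def archive_then_delete(cutoff_ts):
    # _ts is the Cosmos DB system timestamp (epoch seconds of last write).
    docs = container.query_items(
        query="SELECT * FROM c WHERE c._ts < @cutoff",
        parameters=[{"name": "@cutoff", "value": cutoff_ts}],
        enable_cross_partition_query=True,
    )
    for doc in docs:
        # Copy the document to blob storage first, then delete it.
        archive.upload_blob(name=f"{doc['id']}.json", data=json.dumps(doc))
        container.delete_item(item=doc["id"], partition_key=doc["pk"])
```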
As stated in the Cosmos DB TTL documentation: when you configure TTL, the system will automatically delete the expired items based on the TTL value, unlike a delete operation that is explicitly issued by the client application.
So it is controlled by the Cosmos DB system, not the client side. You could follow and vote up this feedback item to push the progress of Cosmos DB.
To come back to this question: one way I've found that works is to make use of the built-in TTL (let Cosmos DB expire documents) and to have a backup script that queries for documents that are near the TTL, but with a safe window in case of latency - e.g. I set the window at up to 24 hours.
The main reasoning for this is that issuing deletes as a query not only uses RUs, but quite a lot of them. Even when you slim your indexes down you can still have massive RU usage, whereas letting Cosmos TTL the documents itself induces no extra RU use.
A pattern that I came up with to help is to have my backup script enable the container-level TTL when it starts (doing an initial backup first to ensure no data loss occurs instantly) and to have a try-except-finally, with the finally removing/turning off the TTL to stop it potentially removing data in case my script is down for longer than the window. I'm not yet sure of the performance hit that might occur on large containers when it has to update indexes, but in a logical sense this approach seems to be effective.
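In azure-cosmos terms, that toggle looks roughly like the following sketch; the database and container names, the partition key path, and the backup helpers are all hypothetical:

```python
from azure.cosmos import CosmosClient, PartitionKey

cosmos = CosmosClient(url="<account-url>", credential="<account-key>")
database = cosmos.get_database_client("appdb")
container = database.get_container_client("items")
PK = PartitionKey(path="/pk")  # hypothetical partition key path

def initial_backup():
    ...  # hypothetical: full backup so nothing expires unsaved

def backup_near_expiry_loop():
    ...  # hypothetical: backs up documents inside the 24h safety window

def backup_cycle():
    initial_backup()
    # Turn container-level TTL on: items expire this many seconds after
    # their last write, with no RU cost for the deletes themselves.
    database.replace_container(
        container, partition_key=PK, default_ttl=30 * 24 * 3600
    )
    try:
        backup_near_expiry_loop()
    finally:
        # Turn TTL off so nothing expires while the script is down;
        # omitting default_ttl removes the setting from the container.
        database.replace_container(container, partition_key=PK)
```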

Integration testing a DynamoDB client which uses eventually consistent reads?

Situation:
A web service with an API to read records from DynamoDB. It uses eventually consistent reads (GetItem default mode)
An integration test consisting of two steps:
create test data in DynamoDB
call the service to verify that it is returning the expected result
I worry that this test is bound to be fragile due to eventual consistency of the data.
If I attempt to verify the data immediately after writing using GetItem with ConsistentRead=true, it only guarantees that the data has been written to a majority of the DB copies, but not all of them, so the service under test still has a chance to read from a not-yet-updated copy in the next step.
Is there a way to ensure that the data has been written to all DynamoDB copies before proceeding?
The data usually reaches all of the replicas within a second. My suggestion is to wait a couple of seconds after inserting the data into the DynamoDB table (in Java terms, a short Thread.sleep) before calling the web service; that should produce the desired result. The DynamoDB documentation says:
Eventually Consistent Reads (Default) – the eventual consistency option maximizes your read throughput. However, an eventually consistent read might not reflect the results of a recently completed write. Consistency across all copies of data is usually reached within a second. Repeating a read after a short time should return the updated data.
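A slightly more robust alternative to a fixed sleep is to poll with a deadline, so the test waits only as long as replication actually takes. A sketch in Python, where call_service and the expected payload stand in for your own test harness:

```python
import time

def wait_for_consistency(call_service, expected, timeout=10.0, interval=0.2):
    """Poll the service under test until it returns the expected item.

    Fails only if the deadline passes; replication normally finishes
    well under a second, so the happy path is fast.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if call_service() == expected:
            return
        time.sleep(interval)
    raise AssertionError(f"Service did not return expected item within {timeout}s")
```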

Firebase: queries on large datasets

I'm using Firebase to store user profiles. I tried to put the minimum amount of data in each user profile (following the good practices advised in the documentation about structuring data), but as I have more than 220K user profiles, it still comes to 150MB when downloading all user profiles as JSON.
And of course, it will grow bigger and bigger as I intend to have a lot more users :)
I can't do queries on those user profiles anymore because each time I do that, I reach 100% Database I/O capacity and thus some other requests, performed by users currently using the app, end up with errors.
I understand that when using queries, Firebase needs to consider all data in the list and thus read it all from disk. And 150MB of data seems to be too much.
So is there an actual limit before reaching 100% Database I/O capacity? And what is exactly the usefulness of Firebase queries in that case?
If I simply have small amounts of data, I don't really need queries, I could easily download all data. But now that I have a lot of data, I can't use queries anymore, when I need them the most...
The core problem here isn't the query or the size of the data, it's simply the time required to warm the data into memory (i.e. load it from disk) when it's not being frequently queried. It's likely to be only a development issue, as in production this query would likely be a more frequently used asset.
But if the goal is to improve performance on initial load, the only reasonable answer here is to query on less data. 150MB is significant. Try copying a 150MB file between computers over a wireless network and you'll have some idea what it's like to send it over the internet, or to load it into memory from a file server.
A lot here depends on the use case, which you haven't included.
Assuming you have fairly standard search criteria (e.g. you search on email addresses), you can use indices to store email addresses separately to reduce the data set for your query.
/search_by_email/$user_id/<email address>
Now, rather than 50k per record, you have only the bytes to store the email address per record - a much smaller payload to warm into memory.
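With that layout, a lookup by email touches only the small index instead of the full profiles. A sketch using the firebase-admin Python SDK - the service-account path, the database URL, and the required ".indexOn": ".value" rule on /search_by_email are assumptions:

```python
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # hypothetical path
firebase_admin.initialize_app(
    cred, {"databaseURL": "https://example.firebaseio.com"}  # hypothetical URL
)

def add_to_index(user_id, email):
    # Each index entry is just user_id -> email: bytes, not a full profile.
    db.reference("search_by_email").child(user_id).set(email)

def find_users_by_email(email):
    # Requires '".indexOn": ".value"' on /search_by_email in the security
    # rules so the query is evaluated server-side instead of scanning.
    matches = db.reference("search_by_email").order_by_value().equal_to(email).get()
    return list(matches.keys()) if matches else []
```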
Assuming you're looking for robust search capabilities, the best answer is to use a real search engine. For example, enable private backups and export to BigQuery, or go with ElasticSearch (see Flashlight for an example).
