What are "other" requests in Cosmos DB insights when creating items? - azure-cosmosdb

My database is configured to auto scale from 500 to 5000RUs/s.
Once a day I'm running an import where about 2.5 million items are created. The partition key is the date in the format "2022-02-10". I'm using bulk updates and transactional batches.
I can observe two things:
I'm getting status 429 pretty quickly
There are many "other" requests showing up in Insights (see below)
What are the "other" requests I see here? I'm only creating items, nothing else queries the DB. Or could these be the transactional batches not showing up as individual creates?

After some more testing I can confirm that transactional batches end up as "other" in Insights, even if they include only create operations.
This also means it has nothing to do with the 429s.
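A minimal sketch of why batched creates are not counted one by one: each transactional batch is sent as a single request for one partition key, and Cosmos DB allows at most 100 operations per batch. The item shape, the `plan_batches` helper and the sample data are assumptions for illustration only; no SDK call is made here.

```python
from collections import defaultdict
from itertools import islice

# Cosmos DB allows at most 100 operations in one transactional batch.
MAX_OPS_PER_BATCH = 100


def plan_batches(items, key_field="date"):
    """Group items by partition key, then split each group into batches.

    Returns a list of (partition_key, items_in_batch) tuples. Each tuple
    stands for ONE request to the service, however many creates it holds.
    """
    by_key = defaultdict(list)
    for item in items:
        by_key[item[key_field]].append(item)

    batches = []
    for key, group in by_key.items():
        it = iter(group)
        # Take up to MAX_OPS_PER_BATCH items at a time until the group is empty.
        while chunk := list(islice(it, MAX_OPS_PER_BATCH)):
            batches.append((key, chunk))
    return batches


# Hypothetical sample: 250 items that all share one date partition key.
items = [{"id": str(i), "date": "2022-02-10"} for i in range(250)]
batches = plan_batches(items)
print(len(batches), [len(chunk) for _, chunk in batches])
```

With full batches, 2.5 million creates would arrive as roughly 25,000 batch requests rather than 2.5 million individual create requests.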

Related

DynamoDB Read Capacity way over the limit but only on one table

I have 6 DynamoDB tables set up for use as data sources for my iOS app. They all get called from the client very nearly the same number of times when someone uses the app, but for some reason one table is getting crazy spikes in reads, going way over its read capacity and causing a "provisioned throughput for table exceeded" error.
The app only has a very small number of users in general (below 100).
The offending table metrics look like this:
But all the other tables look like this:
Any thoughts on what might be causing this? The code for making requests client side is all the same. They are seemingly all set up the same way in the API using Lambda and API Gateway. What am I missing?

Cloud Firestore Data Structure

I am creating an application that uses cloud firestore to store data about "events" in our lab on several assets. We collected data for a few months and we are averaging about 2000 events per asset per month. Each event captures a few pieces of meta data that the user can query.
I imported all the data into firestore with a very simple layout at first.
Events (Collection of event data)
-> EventData (documents which contains a few fields for metadata)
From my understanding, even if the collection of events becomes quite large, for billing and speed of queries this won't be a problem (assuming I do some sort of pagination on the query results). The composite indexes are also very manageable with this structure.
The problem I see is that if someone goes and looks at the Firestore console and brings that collection up, our read requests go through the roof. It seems that this does a full read of the entire collection, which of course will kill us on billing as time goes on. I don't see this as a problem forever, as eventually we should get to the point where everything is stable and we won't need to go into the console very often, but what if someone does when we have a million or more records?
My next thought was to structure the database like this:
Events -> Assets -> {Asset_Name} -> {year_month} -> {Collection of documents with metadata fields}
This certainly solves the issue of the ever-growing collection of documents. The number of assets that we have is fixed, and the number of events is (effectively) capped to a maximum amount per month as well. The problem with this setup, however, is managing composite indexes. There are about 5 indexes needed for my original setup. I think this alternative setup means I would need to set up the same 5 indexes for each collection of documents, for every asset, every month.
I thought maybe there could be a way to have a cloud function manage it for me (it doesn't appear there is an API for this). I think the number of indexes per project is also capped.
So, in the end, I am looking for recommendations on how to structure this database to limit reads if using the console, as well as keeping the indexes manageable. I am pretty new to NoSQL and perhaps I am just completely off.
I recommend you keep your structure as is if that's what's working for you. You should not need to optimize for reducing console reads. Console reads do count towards your usage but the console does not load the entire collection when you open the console.
The console loads just enough documents to let you scroll a bit and then it loads more documents if you scroll down. It will only load the entire collection if you scroll through the entire collection.

How to backup datastore with filter

I have a table declared in Datastore,
such as |id|date|value1|value2|...|
I want to back up only this week's data ("date > today-7" and "date < today").
Can I back up with a filter?
(Using Datastore Admin or something similar.)
This is not possible with Datastore Admin, nor with Import/Export. Your option here is to use the Dataflow template "Datastore to Text".
Be careful with this. Your query may hit a narrow range of keys (I'm not sure whether you have 500 entities over the last 7 days, or 1 million on one of them), which may cause high latency and read timeouts when retrieving the entities, because Dataflow will try to pull them all in a massively parallel fashion. If you can break this into several queries, for example one per day or per pair of days, or otherwise distribute the reads a little better, you'll get better results.
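As a rough sketch of the "break this into queries" idea, the 7-day range can be cut into one half-open window per day, each of which becomes its own filtered query. The `daily_windows` helper and the printed filter text are assumptions for illustration; the `date` property name comes from the question, and the exact query syntax the template expects is not shown here.

```python
from datetime import date, timedelta


def daily_windows(today, days=7):
    """Return one (start, end) pair per day for the last `days` days.

    Each window is half-open: start is included, end is excluded, so the
    windows are contiguous and no entity falls into two of them.
    """
    first = today - timedelta(days=days)
    return [
        (first + timedelta(days=i), first + timedelta(days=i + 1))
        for i in range(days)
    ]


# Fixed example date so the result does not depend on when this is run.
windows = daily_windows(date(2022, 2, 10))
for start, end in windows:
    # Illustrative filter text only; adapt to the real query language in use.
    print(f"date >= {start.isoformat()} AND date < {end.isoformat()}")
```

Running one export per window keeps each read set smaller, at the cost of more jobs to schedule and combine.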

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When a season becomes 'old' and its messages are no longer needed, the table is deleted
Is this a good pattern, and is it encouraged by Amazon?
PS: I'm asking because I'm afraid of two things I have met in other Amazon services:
In Amazon S3 you have to delete each item before you can fully delete bucket. When you have billions of items, it becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour'. When using SQS api you can act badly regarding SQS infrastructure (for example not polling messages) and thus could be penalized for it.
Yes, this is an acceptable design pattern, it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a DynamoDB table that still contains records. If you have a large number of records that you have to delete regularly, using a rolling set of tables is actually a best practice.
Creating and deleting tables is an asynchronous operation so you do not want to have your application depend on the time it takes for these operations to complete. Make sure you create tables well in advance of you needing them. Under normal circumstances tables create in just a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes), but as long as your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.
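A minimal sketch of the rolling-table bookkeeping, assuming seasons are numbered with integers and tables are named with a `messages_season_` prefix (both hypothetical). It only works out which table to create ahead of time and which tables are old enough to drop; the actual create and delete calls against DynamoDB are left out.

```python
# Hypothetical naming scheme: one table per season, e.g. messages_season_0005.
PREFIX = "messages_season_"


def table_name(season):
    """Build the table name for a season number."""
    return f"{PREFIX}{season:04d}"


def expired_tables(existing, current_season, keep=2):
    """Return season tables older than the `keep` most recent seasons.

    Tables that do not follow the naming scheme are never returned, so
    unrelated tables cannot be selected for deletion by accident.
    """
    oldest_kept = current_season - keep + 1
    expired = []
    for name in existing:
        if not name.startswith(PREFIX):
            continue
        suffix = name[len(PREFIX):]
        if suffix.isdigit() and int(suffix) < oldest_kept:
            expired.append(name)
    return sorted(expired)


# Example: seasons 1 to 5 exist, plus an unrelated table.
existing = [table_name(s) for s in range(1, 6)] + ["users"]
to_delete = expired_tables(existing, current_season=5, keep=2)
next_table = table_name(6)  # create this well before season 6 starts
print(to_delete, next_table)
```

Because table creation and deletion are asynchronous, a scheduled job could run this well before each season boundary instead of on the request path.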

Alfresco CMIS different result with same query

we have a bit of a problem.
We've built a GWT application on top of our two Alfresco instances. The application should work like this:
The user searches for a document
Our web app sends the same query to both repositories, waits for both results and shows a merged result set.
This works when the search is for a specific document (by numeric ID, for example) or returns 10, 20 or 50 documents (we don't know at what point it begins to act strangely).
If the query is a large one (like all documents from last month, of which there should be about 30-60k per month), the CMIS query limit (500) obviously cuts the results off first.
BUT if the user hits "search" the first time, then after a while the result set comes back with only 2 documents. If the user hits "search" again right after that, with the same query, the result set appears almost immediately and lists 500 documents.
What the heck is wrong? Does CMIS cache results in some way? How do big CMIS queries work?
Thanks
A.
As you mentioned, you're using Apache Chemistry. Chemistry has a client-side caching mechanism:
http://chemistry.apache.org/java/how-to/how-to-tune-perfomance.html
I suspect this is not CMIS related at all but is instead due to the Alfresco Lucene "max permission check" problem. At a high-level, there is a config setting for the maximum number of permission checks that Alfresco will do against a search result set. There is also a limit to the total amount of time it will spend performing such checks. These limits are configured in the repository properties file as:
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of results to perform permission checks against
system.acl.maxPermissionChecks=1000
The first time you run a search the server begins performing these checks and hits the limit. It then returns the search results it was able to filter. Now the permission cache is populated so the next time you run the search the results come back much faster and the result set is larger.
Searches in Alfresco are non-deterministic--you cannot guarantee that, for large result sets, you will get back the exact same result set every time, regardless of how big you make those settings.
If you are able to upgrade at some point you may find that configuring Alfresco to use Solr rather than Lucene could help alleviate this, but I'm not 100% sure it will.
To disable security checks, replace the public SearchService with searchService. Public services enforce security, so by using searchService you avoid the permission checks.
