How to download many separate JSON documents from CosmosDB without deserializing? - azure-cosmosdb

Context & goal
I need to periodically create snapshots of CosmosDB partitions. That is:
Export all documents from a single CosmosDB partition (ca. 100-10k docs per partition, 1KB-200KB per doc, entire partition JSON usually <50MB).
Each document must be handled separately, with its id known.
Host the process in an Azure Function app on the consumption plan (so memory/CPU/duration matter).
Run this for thousands of partitions.
Using Microsoft.Azure.Cosmos v3 C# API.
What I've tried
I can skip deserialization using the Container.*StreamAsync() methods in the API and avoid parsing the document contents. This should notably reduce the CPU/memory needs, and it also avoids accidentally altering the exported documents through a serialization roundtrip. The tricky part is how to combine this with having 10k documents per partition.
Query individually x 10k
I could query the item ids per partition using SQL and then send separate ReadItemStreamAsync(id) requests.
This skips deserialization, I still have the ids, and I can control how many docs are in memory at a given time, etc.
It would work, but it smells too chatty: sending 1+10k requests to CosmosDB per partition adds up to millions of requests. Also, in my experience SQL-querying large documents is usually cheaper RU-wise than loading those documents by point reads, which would add up at this scale. It would be nice to be able to pull N documents with a single (SQL query) request.
Query all as stream x 1
There is Container.GetItemQueryStreamIterator(), to which I could just pass select * from c where c.partition = #key. It would be simpler and use fewer RUs, I could control the batch size with MaxItemCount, and it sends just a minimal number of requests to Cosmos (query + continuations). All is good, except ..
.. it would return a single JSON array for all documents in a batch, and I would need to deserialize it all to split it into individual documents and map them to their ids, defeating the purpose of loading them as a Stream.
Similarly, ReadManyItemsStreamAsync(..) would return the items as single response stream.
Question
Does the CosmosDB API provide a better way to download a lot of individual raw JSON documents without deserializing?
Preferably with some control over how much data is buffered in the client.

While I agree that designing the solution around streaming documents with change feed is promising and might scale better and cost less on the CosmosDB side, to answer the original question:
The chattiness of solution "Query individually x 10k" could be reduced with Bulk mode.
That is:
Prepare a bulk CosmosClient with AllowBulkExecution option
Query the document ids to export (select c.id from c where c.partition = #key)
(Optionally) split the ids to batches of desired size to limit the number of documents loaded in memory.
For each batch:
Load all documents in batch concurrently using ReadItemStreamAsync(id, partition), this avoids deserialization but retains link to id.
Write all documents to destination before starting next batch to release memory.
Since all reads go to the same partition, bulk mode will internally merge the underlying requests to CosmosDB, reducing the network "chattiness" in exchange for some internal (hidden) complexity and a slight increase in latency.
It's worth noting that:
It still performs the 1+10k requests against CosmosDB and incurs their RU cost. They are just compacted on the network.
Batching ids and waiting on batch completion is required, as otherwise Bulk would send all internal batches concurrently (see: Bulk support improvements for Azure Cosmos DB .NET SDK). Or don't, if you prefer to max out throughput and don't care about memory footprint. In this case the partitions are small enough that it does not matter much.
Bulk has a separate internal batch size. It is most likely best to use the same value. This seems to be 100, which is a rather good chunk of data to process anyway.
Bulk may add latency to requests if it waits for the internal batch to fill up before dispatching (100ms). IMHO this is largely negligible in this case and could be avoided by fully filling the internal Bulk batch bucket where possible.
This solution is not optimal, for example due to the burst load put on CosmosDB, but its main benefit is simplicity of implementation, and the logic can run on any host, on demand, with no infra setup required.
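The batching steps above can be sketched roughly as follows. This is a language-neutral illustration in Python rather than the C# SDK: read_raw and write_raw are hypothetical stand-ins for ReadItemStreamAsync(id, partition) and for the write to the destination, and the batch size of 100 simply mirrors the value mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


async def export_partition(
    ids: List[str],
    read_raw: Callable[[str], Awaitable[bytes]],          # hypothetical raw point read
    write_raw: Callable[[str, bytes], Awaitable[None]],   # hypothetical destination write
    batch_size: int = 100,
) -> int:
    """Export documents batch by batch so only one batch is held in memory."""
    exported = 0
    for batch in chunked(ids, batch_size):
        # Read the whole batch concurrently; payloads stay raw bytes.
        payloads = await asyncio.gather(*(read_raw(doc_id) for doc_id in batch))
        # Write everything out before starting the next batch to release memory.
        for doc_id, payload in zip(batch, payloads):
            await write_raw(doc_id, payload)
        exported += len(batch)
    return exported


if __name__ == "__main__":
    destination = {}

    async def fake_read(doc_id: str) -> bytes:
        return b'{"id":"%s"}' % doc_id.encode()

    async def fake_write(doc_id: str, payload: bytes) -> None:
        destination[doc_id] = payload

    count = asyncio.run(export_partition([str(i) for i in range(250)], fake_read, fake_write))
    print(count, len(destination))
```

Each payload stays paired with its id because asyncio.gather returns results in the order the reads were submitted.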

There isn't anything out of the box that provides a means of doing on-demand, per-partition batch copying of data from Cosmos to blob storage.
However, before looking at other ways to do this as a batch job, another approach you may consider is to stream your data from Cosmos to blob storage using Change Feed. The reason is that, for a database like Cosmos, throughput (and cost) is measured on a per-second basis. The more you can amortize the cost of a set of operations over time, the less expensive it is. One other major thing I should point out is that doing this on a per-partition basis means the throughput (and cost) required for the batch operation is the throughput you need multiplied by the number of physical partitions in your container. This is because when you increase throughput on a container, it is allocated evenly across all physical partitions, so if I need 10k RU/s of additional throughput to do some work on data in one partition of a container with 10 physical partitions, I will need to provision 100k RU/s to do the same work in the same amount of time.
Streaming data is often a less expensive solution when the amount of data involved is significant. Streaming effectively amortizes cost over time, reducing the amount of throughput required to move that data elsewhere. In scenarios where the data is being moved to blob storage, exactly when the data arrives is often not important, because blob storage is very cheap (0.01 USD/GB) compared to Cosmos (0.25 USD/GB).
As an example, if I have 1M (1KB) documents in a container with 10 physical partitions and I need to copy them from one container to another, the amount of RU needed to do the entire thing will be 1M RU to read the documents (about 1 RU each), then approximately 10 RU per document (with no indexes) to write them into another container, so about 11M RU in total.
Here is the breakdown for how much incremental throughput I would need and the cost for that throughput (in USD), if I ran this as a batch job over that period of time. Keep in mind that Cosmos DB charges you for the maximum throughput per hour.
Complete in 1 second = 11M RU/s, $880 USD * 10 partitions = $8,800 USD
Complete in 1 minute = 183K RU/s, $14 USD * 10 partitions = $140 USD
Complete in 10 minutes = 18.3K RU/s, $1 USD * 10 partitions = $10 USD
However, if I streamed this job over the course of a month, the incremental throughput required would be only about 4 RU/s, which can be done without any additional RU at all. Another benefit is that it is usually less complex to stream data than to handle it as a batch. Handling exceptions and dead-letter queuing are easier to manage. However, because the data arrives over time, you will need to first look up the document in blob storage and then replace it.
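The throughput figures in this example can be reproduced with a few lines of arithmetic. The sketch below only assumes the numbers already stated (1M documents, about 1 RU per read and about 10 RU per write) and a 30-day month; it does not model pricing.

```python
# Assumptions taken from the example: 1M docs, ~1 RU per read, ~10 RU per write.
DOCS = 1_000_000
READ_RU_PER_DOC = 1
WRITE_RU_PER_DOC = 10
TOTAL_RU = DOCS * (READ_RU_PER_DOC + WRITE_RU_PER_DOC)


def required_ru_per_second(total_ru: int, duration_seconds: float) -> float:
    """Throughput needed to spend `total_ru` evenly over the given duration."""
    return total_ru / duration_seconds


DURATIONS = {
    "1 second": 1,
    "1 minute": 60,
    "10 minutes": 600,
    "30 days": 30 * 24 * 3600,  # assumed length of "a month"
}

for label, seconds in DURATIONS.items():
    print(f"{label:>10}: {required_ru_per_second(TOTAL_RU, seconds):,.1f} RU/s")
```

Spreading the same 11M RU over a month brings the requirement down to a few RU/s, which is the amortization argument made above.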
There are two simple ways you can stream data from Cosmos DB to blob storage. The easiest is Azure Data Factory. However, it doesn't really give you the ability to capture costs on a per logical partition basis as you're looking to do.
To do this you'd need to write your own utility using the change feed processor. Then within the utility, as you read and write each item, you can capture the throughput used to read the data (usually 1 RU per item) and calculate the cost of writing it to blob storage based upon the per-unit cost of whatever your monthly hosting cost is for the Azure Function that hosts the process.
As I prefaced, this is only a suggestion. But given the amount of data and the fact that it is on a per-partition basis, it may be worth exploring.

Related

AWS Neptune Query gremlins slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This difference in duration can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a large amount of data, I expect the issue comes from the caching functionality of Neptune in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to get some advice on how to reduce the duration of the first call.
Context :
The Neptune database is running on a db.r5.8xlarge instance, and during execution CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly above 1.000.000.000 nodes and far more edges (probably around 10.000.000.000). Those edges are split across 10 types of labels, and most of them are not used in the current query.
Query :
// recordIds is a table of 50 ids.
g.V(recordIds).hasLabel("record")
// Convert local id to neptune id.
.out('local_id')
// Go to tree parent link. (either myself if edge come back, or real parent)
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children, this step load a big amount of nodes members of the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limitation not reached, result is between 80k and 100K nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture consists of a shared cluster "volume" (where all data is persisted and where this data is replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy will clear buffer pool cache space for any newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in objects that they think they might need in the near future.
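To make the cold versus warm behaviour concrete, here is a toy model of an LRU buffer pool in Python. It is only an illustration of the eviction policy described above, not Neptune's implementation; the BufferPool class, its capacity, and the load_from_storage callback are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BufferPool:
    """Toy LRU cache: misses go to slow storage, hits are served from memory."""

    def __init__(self, capacity: int, load_from_storage: Callable[[Hashable], Any]):
        self.capacity = capacity
        self.load_from_storage = load_from_storage  # hypothetical slow fetch
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: Hashable) -> Any:
        if key in self.pages:
            # Warm read: mark the entry as most recently used.
            self.pages.move_to_end(key)
            self.hits += 1
            return self.pages[key]
        # Cold read: fetch from storage and cache the result.
        self.misses += 1
        value = self.load_from_storage(key)
        self.pages[key] = value
        if len(self.pages) > self.capacity:
            # Evict the least recently used entry.
            self.pages.popitem(last=False)
        return value

    def prefetch(self, keys) -> None:
        """Warm the cache ahead of time, like issuing a prefetch query."""
        for key in keys:
            self.read(key)
```

Calling prefetch with the keys a later query will touch moves the misses out of the latency-sensitive call, which is the idea behind the prefetch queries mentioned above.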
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

Dynamo - Increased Read Latency During Writes

There is a DynamoDB table, Entity, which has a hash key on id and a GSI on another attribute: cardId. The GSI only has a hash key and does not have any sort key.
Whenever we get a batch of create/update requests, we first use the GSI to read existing data and then write to the main table, which also eventually updates the GSI. During this time, we may also serve some parallel read requests from the GSI.
We are seeing an issue where the latency of both the main table and the GSI increases from 200ms to 10-15 seconds during this time (batch writes + reads). I am not able to establish a correlation between consecutive reads and writes in the table. The table is set to use on-demand capacity and there is no throttling. "SuccessfulRequestLatency" is only ~300-400 ms.
It is the DDB client method that has latency in seconds. It does not do any data transformation; it just returns the DB data as is to the upper layers. Is there anything else that I should be monitoring to get to the root cause of this?
Thanks!
I don't have a full answer, but do have some directions you might want to investigate.
First, I have noticed in the past that extremely long latencies may indicate that your client gave up and retried the request. Some clients hide this retry, and it just looks like a very slow request from the outside.
Second, you're right that on-demand billing mode doesn't throttle based on provisioned throughput, but it nevertheless can do throttling - see https://aws.amazon.com/premiumsupport/knowledge-center/on-demand-table-throttling-dynamodb/. By default there are limits on the throughput that an on-demand table can have, as well as on how quickly the throughput may grow. These limits are at least partially for your protection - you wouldn't want a run-away-train application to accidentally do billions of requests and cost you a million dollars :-)
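As a rough illustration of the first point, the sketch below shows how hidden client-side retries can turn a few slow attempts into a multi-second call even when the attempt that finally succeeds is fast. The timeout, the linear backoff, and the per-attempt latencies are made-up assumptions, not the behaviour of any particular AWS SDK.

```python
from typing import Sequence, Tuple


def observed_latency(
    server_latencies_s: Sequence[float],  # hypothetical latency of each attempt
    timeout_s: float,                     # assumed client-side request timeout
    backoff_s: float,                     # assumed linear backoff step
) -> Tuple[int, float]:
    """Return (attempts, total seconds seen by the caller)."""
    total = 0.0
    for attempt, latency in enumerate(server_latencies_s, start=1):
        if latency <= timeout_s:
            # This attempt succeeds; the caller sees all earlier waits plus it.
            return attempt, total + latency
        # The client gives up after the timeout, backs off, and retries.
        total += timeout_s + backoff_s * attempt
    raise TimeoutError("all attempts timed out")


if __name__ == "__main__":
    attempts, seconds = observed_latency([9.0, 9.0, 0.3], timeout_s=5.0, backoff_s=0.5)
    print(attempts, round(seconds, 1))
```

In this made-up run the only successful request takes 0.3 s, which is all a server-side metric of successful requests would show, while the caller waits for the two timed-out attempts as well. Logging the retry count per call on the client is one way to check whether this is happening.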

Does DynamoDB latency depend on number of items per partition

Newbie to DDB here. I've been using a DDB table for a year now. Recently, I made improvements by compressing the payload using gzip (and representing it as a binary in DDB) and storing the new data in a newly created beta table. Overall compression was 3x. I expected the read latency (GetItem) to improve as well, as there is less data to transport over the wire. However, I'm seeing that the read latency has increased from ~50 ms p99.9 to ~114 ms p99.9. I'm not sure how that happened and was wondering if, because of the compression, I now have a lot more rows per partition (which I think is defined as <= 10 GB). I now have 3-4x more rows per partition. So I'm wondering: once DynamoDB determines the right partition for a partition key, how does it find the correct item within the partition? My gut feel is that this shouldn't lead to an increase in latency, as a simplified representation of the partition can be a giant hashmap, so it'd just be a simple lookup. I'd appreciate any help here.
My DDB schema:
partition-key - user-id,dataset-name
range-key - update-timestamp
payload - used to be string, now is compressed/binary.
In my GetItem requests, I specify both partition key and range key.
According to your description, your change included two unrelated parts: you compressed the payload, and you increased the number of items per partition. The first change - the compression - probably has little effect on the p99 latency (it could have a more noticeable effect on the mean latency - which, according to Little's Law, is related to throughput if your client has fixed concurrency - but I'd expect it to get lower, not higher).
Some guesses as to what might have increased the p99 latency:
More items per partition means that DynamoDB (which uses a B-tree) needs to do more disk reads to find a specific item. Since each disk access has rare delays caused by queueing, this adds to the tail latency.
You said that the change caused each partition to hold more items; I guess this means you now have fewer partitions. If you have too few of them, you can start getting unbalanced load on the different DynamoDB partitions, and more contention and latency for specific "hot" partitions.
I don't know how you measure your latency. Your client now needs (I guess) to uncompress the returned result, so maybe it is now busier, adding queueing delays in the client? Can you lower your client's concurrency (how many client threads run in parallel) and see whether the high tail latency is an artifact of the server design or of the client's design?
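The first guess (more disk reads per lookup means a fatter tail) can be quantified with a simple probability sketch. It assumes each disk read is independently slow with some small probability; the probability used below is purely illustrative, not a measured DynamoDB figure.

```python
def p_slow_lookup(p_slow_read: float, reads_per_lookup: int) -> float:
    """Probability that at least one of the reads in a lookup hits a rare delay."""
    return 1.0 - (1.0 - p_slow_read) ** reads_per_lookup


if __name__ == "__main__":
    # Illustrative assumption: 1 in 1000 disk reads is delayed by queueing.
    for reads in (1, 2, 4):
        print(reads, f"{p_slow_lookup(0.001, reads):.6f}")
```

Under this assumption the chance of a slow lookup grows almost linearly with the number of reads, so a percentile as far out as p99.9 is sensitive to even one extra read per lookup.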

peak read capacity units dynamo DB table

I need to find out the peak read capacity units consumed in the last 20 seconds in one of my DynamoDB tables. I need to find this programmatically in Java and set an auto-scaling action based on the usage.
Could you please share a sample Java program to find the peak read capacity units consumed in the last 20 seconds for a particular DynamoDB table?
Note: there are unusual spikes in the DynamoDB requests on the database, hence the need for dynamic auto-scaling.
I've tried this:
result = DYNAMODB_CLIENT.describeTable(recomtableName);
readCapacityUnits = result.getTable()
.getProvisionedThroughput().getReadCapacityUnits();
but this gives the provisioned capacity, whereas I need the capacity consumed in the last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use Cloudwatch.
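// Groovy, using the AWS SDK for Java (v1) CloudWatch model classes.
import com.amazonaws.services.cloudwatch.model.Dimension
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest

GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest()
// Time window to query.
metricStatisticsRequest.setStartTime(startDate)
metricStatisticsRequest.setEndTime(endDate)
metricStatisticsRequest.setNamespace("AWS/DynamoDB")
// The question asks about reads, so query the consumed read capacity metric.
metricStatisticsRequest.setMetricName('ConsumedReadCapacityUnits')
// Aggregation period in seconds.
metricStatisticsRequest.setPeriod(60)
metricStatisticsRequest.setStatistics([
    'SampleCount',
    'Average',
    'Sum',
    'Minimum',
    'Maximum'
])
// Restrict the metric to a single table.
List<Dimension> dimensions = []
Dimension dimension = new Dimension()
dimension.setName('TableName')
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk))
dimensions << dimension
metricStatisticsRequest.setDimensions(dimensions)
client.getMetricStatistics(metricStatisticsRequest)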
But I bet you'd get results older than 5 minutes.
Actually, the current off-the-shelf autoscaling uses CloudWatch. This has a drawback that is unacceptable for some applications.
When a load spike hits your table, it does not have enough capacity to respond with. Capacity reserved with some headroom is not enough, and the table starts throttling. If records are kept in memory while waiting for the table to respond, this can simply blow up the memory. CloudWatch, on the other hand, reacts with some delay, often when the spike is already gone. Based on our tests it was at least 5 minutes. And it raises capacity gradually, when it was needed straight up to the max.
Long story short, we created a custom solution with our own speedometers. It counts whatever it has to count and changes the table's capacity accordingly. There is still a delay because:
App itself takes a bit of time to understand what to do
Dynamo table takes ~30 sec to get updated with new capacity details.
On top of that we also have a throttling detector. So if a write/read request gets throttled, we immediately raise the capacity accordingly. Sometimes the capacity level looks all right but throttling still happens because of a hot key issue.
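A minimal sketch of such a "speedometer" is shown below, in Python for brevity. It keeps per-second totals of consumed capacity and reports the peak over the last 20 seconds. The Speedometer class is hypothetical; it assumes the application feeds it the consumed capacity of each request (for example, taken from the ConsumedCapacity information DynamoDB can return with a response) and that timestamps arrive in non-decreasing order.

```python
from collections import deque


class Speedometer:
    """Tracks consumed capacity units per second over a sliding window."""

    def __init__(self, window_seconds: int = 20):
        self.window_seconds = window_seconds
        self.buckets = deque()  # (whole second, units consumed in that second)

    def record(self, timestamp: float, units: float) -> None:
        """Add the units consumed by one request at `timestamp` (epoch seconds)."""
        second = int(timestamp)
        if self.buckets and self.buckets[-1][0] == second:
            self.buckets[-1] = (second, self.buckets[-1][1] + units)
        else:
            self.buckets.append((second, units))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop buckets that have fallen out of the window.
        cutoff = int(now) - self.window_seconds
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.buckets.popleft()

    def peak_per_second(self, now: float) -> float:
        """Highest per-second consumption seen within the window."""
        self._evict(now)
        return max((units for _, units in self.buckets), default=0.0)
```

A scaling loop could poll peak_per_second and compare it with the provisioned value, keeping in mind the delays listed above.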

AWS DynamoDB: read/write units estimation issue

I am creating an online crowd-driven game. I expect the read/write requests to fluctuate (like 50, 50, 50, 1500, 50, 50, 50) every second, and I need to process 100% of the requests with strong consistency.
I am planning to move from GAE datastore to AWS's DynamoDB for its strong consistency. I have the doubts below, for which I could not find clear answers in other discussions.
1. If the item size for a write action is just 4B, will that be rounded up to 1KB and consume a write unit?
2. Financially it is not wise to set the Provisioned Throughput Capacity around the expected peak value. Alarms can warn us, but in the case of a sudden rise the requests could already be throttled by the time we receive the alarm. Is DynamoDB really designed to handle highly fluctuating read/write loads?
3. I read about Dynamic DynamoDB, which updates the read/write throughput capacity for us. When we add some read/write units, how long will it take to allocate them? If it takes too long, what's the use of raising the bar after the tide hits?
Google App Engine bills just for the number of requests that happen in a month. If I can make AWS work like "whatever the request count is, I will expand and contract myself and charge you only for the used read/write units", I will go for AWS.
Please advise. Don't hesitate to ask if I am not being clear in parts.
Thanks,
Karthick.
1. Yes. Item sizes are rounded up and the throughput is consumed accordingly. From the Provisioned Throughput in Amazon DynamoDB documentation:
The total number of read operations necessary is the item size, rounded up to the next multiple of 4 KB, divided by 4 KB.
Writes are rounded the same way in 1 KB steps, so a 4B item consumes one full write unit.
2. It can handle some bursting, but it is generally intended to be used for uniform workloads. Here is a section from the Guidelines for Working with Tables documentation and some other helpful links about the best practices:
A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
Query and Scan guidelines for avoiding bursts of read activity
The Table Best Practices section
Use Burst Capacity Sparingly
3. This one is going to depend on how much data your table has, because DynamoDB will have to repartition the data if you are scaling up. See the Consider Workload Uniformity When Adjusting Provisioned Throughput documentation for more information about the partitioning.
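The rounding rule from the first point can be written down as a small calculation. The sketch assumes the standard unit sizes (4 KB per strongly consistent read unit, 1 KB per write unit); the function names are made up for this example.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one strongly consistent read unit covers up to 4 KB
WRITE_UNIT_BYTES = 1024      # one write unit covers up to 1 KB


def strongly_consistent_read_units(item_size_bytes: int) -> int:
    """Read units for one strongly consistent read of an item of this size."""
    return max(1, math.ceil(item_size_bytes / READ_UNIT_BYTES))


def write_units(item_size_bytes: int) -> int:
    """Write units for one write of an item of this size."""
    return max(1, math.ceil(item_size_bytes / WRITE_UNIT_BYTES))


if __name__ == "__main__":
    for size in (4, 1024, 1025, 4096, 4097):
        print(size, strongly_consistent_read_units(size), write_units(size))
```

So the 4B item from the question costs one write unit per write, the same as a 1KB item.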
