Should null elements be stored in Cosmos DB or should they be ignored? - azure-cosmosdb

Is there a good reason to serialize null elements in a Cosmos DB document or is it better to ignore them?
With the is_defined function I can query for undefined elements similar to how I query for null elements.
Does either consume less RUs? In my tests they seem to perform similarly.

If your query truly depends on filtering based on the existence of, or value of, an optional property, then do exactly that: either check for existence (or non-existence), or check that an optional property is a specific value you're looking for.
Storing null properties is an anti-pattern with document databases such as Cosmos DB. It's not required, and if you do decide to do it, you'll have to add new null properties to existing documents every time you add a new property (potentially costly, since you'd have to perform a ReplaceDocument() on every single existing document each time you add a property that can be null). The same applies when you decide to remove an optional property and need to clean up all of the extraneous nulls.
Cosmos DB doesn't require every document to have the same shape, and you'd be giving up a very big benefit by approaching data the same way as a relational store (where you do have to deal with nulls in table columns). Just imagine a shopping site with thousands of product types, each with varying properties (books, CDs, lawn mowers, coffee...). You'd end up with thousands of null properties per document (which seems like a very unmanageable scenario, not to mention the per-document size limit you'd likely exceed eventually).
Also, you will incur additional RU charges per write, since every stored property (null or not) is indexed by default and that index has to be updated for every document you write.
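For reference, a minimal sketch of the two kinds of filter, using the Cosmos DB Java SDK v4; the endpoint, key, database/container names, and the optional middleName property are all placeholders, not anything from the question:

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.fasterxml.jackson.databind.JsonNode;

public class OptionalPropertyQueries {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<your-account>.documents.azure.com:443/") // placeholder
                .key("<your-key>")                                           // placeholder
                .buildClient();
        CosmosContainer container = client.getDatabase("mydb").getContainer("people");

        // Property omitted entirely (no null stored): filter on existence.
        String byExistence = "SELECT * FROM c WHERE NOT IS_DEFINED(c.middleName)";

        // Property stored with an explicit null: filter on the null value.
        String byNull = "SELECT * FROM c WHERE IS_NULL(c.middleName)";

        for (JsonNode doc : container.queryItems(byExistence, new CosmosQueryRequestOptions(), JsonNode.class)) {
            System.out.println(doc);
        }
        for (JsonNode doc : container.queryItems(byNull, new CosmosQueryRequestOptions(), JsonNode.class)) {
            System.out.println(doc);
        }
        client.close();
    }
}
```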

Not sending keys that don't have values will save you a small number of bytes (and thus RU/s); otherwise there isn't any significant performance difference in queries.
This can matter if your keys are VERY sparse. For instance, say each document uses only 1 of 1 million possible keys, at ~7 bytes per key name. If you included all 1 million keys, with a null value for all but one, the key names alone would take ~7 MB, and a document can only be 2 MB.
It can also add up across documents at scale. If one 7-byte key is null (the much more common case) rather than undefined in each of 1 million document reads per second, it will theoretically cost an extra ~7,000 RU/s to read them. That's about $340 a month spent on a key with a null value, assuming you sustain 1M reads per second for the whole month (though that would only be about 0.8% of your cost, so other optimizations, like using the right indexes, would make a bigger difference).

Related

Surge-like inserts in GCP Firestore in Datastore Mode with one property value equal across all entities

I'm building an app that will have to accept a surge of records in short bursts, like 100,000 in 5 minutes once every 24 h. I've chosen Firestore in Datastore Mode as the DB. I've figured out how to make keys lexically different, so the key space is wide and I won't bottleneck on that. I can pre-feed the appropriate kind with dummy entities to ensure it can accept high insert traffic on the first day. The only remaining problem is that all records from a single surge need to have a property, say event_id, and it will have the same value across all of them. I need to filter by that property later on (only an equality filter), so it has to be indexed.
My concern is that this will cause hotspots in the index, but I'm not 100% sure. The docs mostly mention monotonically increasing values or narrow ranges as being a problem, not single values. Strictly speaking, however, mine is a case of a severely narrow range.
I was thinking of using a hierarchical structure, like Event/event-id/Records/my_entities_go_here, but I'm not sure whether creating a new Event and submitting entities to (an initially empty) Event/event-id/Records isn't the same as writing to an empty Kind, which is slow in the beginning.
Does anyone know a way around this?

I want to increase the number of records read using queryPage in DynamoDB

I have a requirement where I need to get only a certain attribute from the matching records when querying a DynamoDB table. I have used withSelect(Select.SPECIFIC_ATTRIBUTES).withProjectionExpression(<attribute_name>) to get that attribute. But the number of records read by the queryPage operation is the same in both cases (1. using withSelect and 2. without using withSelect). The only advantage is that with withSelect these operations are processed very quickly. But this is in turn causing a lot of DynamoDB reads. Is there any way I can read more records in a single query, thereby reducing my number of DB reads?
You are seeing the same number of reads because projection expressions are applied after each item is retrieved from the storage nodes, but before it is collected into the response object. The net benefit of projection expressions is saved network bandwidth, which in turn can reduce latency. But it will not result in consumed capacity savings.
If you want to save consumed capacity and be able to retrieve more items per request, your only options are:
create an index and project only the attributes you need to query; this can be a local secondary index or a global secondary index, depending on whether you need to change the partition key for the index
try to optimize the schema of your data stored in the table; perhaps you can compress your items, or just generally work out encodings that result in smaller documents
Some things to keep in mind if you do decide to go with an index: a local secondary index would probably work best in your example but you would need to create a new table for that (local secondary indexes can only be created when you create the table); a global secondary index would also work but only if your application can tolerate eventually consistent reads on the index (and of course, there is a higher cost associated with these).
Read more about using indexes with DynamoDB here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes.html
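As a rough sketch of the index option, assuming an existing GSI (hypothetical names: Orders table, status-index GSI, order_status key) that projects only the attributes you need; the AWS SDK for Java v1 style matches the withSelect/withProjectionExpression calls above:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.util.Map;

public class QueryProjectedIndex {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // The GSI is assumed to project only the attribute(s) you actually need,
        // so the items stored in the index are small and more of them fit into
        // each 4 KB read unit, which is where the consumed-capacity saving comes from.
        QueryRequest request = new QueryRequest()
                .withTableName("Orders")        // hypothetical table
                .withIndexName("status-index")  // hypothetical GSI
                .withKeyConditionExpression("order_status = :s")
                .withExpressionAttributeValues(
                        Map.of(":s", new AttributeValue().withS("SHIPPED")));

        QueryResult result = client.query(request);
        result.getItems().forEach(System.out::println);
    }
}
```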

What are the performance penalties, if any, of KEYS_ONLY versus ALL when projecting attributes in a DynamoDB GlobalSecondaryIndex?

If you don't project all the attributes, then when you query by that index you can, by definition, only get the key attributes back from the result, and you then have to perform another lookup using those keys to get the rest of the data.
That would be two requests just to get one item (sketched below). Naturally that doesn't make a lot of sense on its own, so there must be some reasons why it's advantageous to project only KEY attributes.
Does it speed up replication across the GSI, since there are fewer values to copy, thereby increasing the chance of a fully consistent read taking place?
Does it lower read / write costs to the table overall as a whole?
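For illustration only, a minimal sketch of the two-request pattern described above (AWS SDK for Java v1; the Orders table, customer-index KEYS_ONLY GSI, and key names are all hypothetical):

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;

import java.util.Map;

public class KeysOnlyFollowUp {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Request 1: query the KEYS_ONLY GSI. It can only return the index key
        // plus the base table's primary key (order_id here).
        Map<String, AttributeValue> indexItem = client.query(new QueryRequest()
                .withTableName("Orders")
                .withIndexName("customer-index")
                .withKeyConditionExpression("customer_id = :c")
                .withExpressionAttributeValues(
                        Map.of(":c", new AttributeValue().withS("c-123"))))
                .getItems()
                .get(0); // sketch only: assumes at least one match

        // Request 2: fetch the full item from the base table using that key.
        Map<String, AttributeValue> fullItem = client.getItem(new GetItemRequest()
                .withTableName("Orders")
                .withKey(Map.of("order_id", indexItem.get("order_id"))))
                .getItem();

        System.out.println(fullItem);
    }
}
```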

A limit clarification for the new Firestore

So in the limits section (https://firebase.google.com/docs/firestore/quotas) of the new Firestore product from Firebase it says:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
We're pretty confused as to what that actually entails.
If we have, say, a root-level collection called users with 10 million entries in it, will this limit affect the collection in such a way that only 500 users can update their data in any given second?
Can anyone clarify?
Sorry for the confusion; an example might help.
If your user documents contained a last-updated timestamp and you indexed on that timestamp, then each new write would end up clustering around the same value (now), creating a hotspot in the index.
Similarly, if you somehow assigned users a sequential value, like a place in line, this would also create a hotspot.
Incidentally, this is why generated document IDs are random strings: this evenly distributes the writes on the primary key index.
If you avoid these kinds of patterns, the sky's the limit, though during the beta you'd hit the database-wide limit.
A quick additional note: for the moment all properties are indexed by default, so if you had a last-updated timestamp it would necessarily be indexed, and you would not be able to avoid the hotspotting.
Index disablement will be available down the road though.
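To illustrate the point about random document IDs, a minimal sketch with the Java server client (the users collection and field name are just examples, not from the question):

```java
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

import java.util.HashMap;
import java.util.Map;

public class RandomIdWrite {
    public static void main(String[] args) throws Exception {
        Firestore db = FirestoreOptions.getDefaultInstance().getService();

        Map<String, Object> data = new HashMap<>();
        data.put("name", "Ada");

        // add() generates a random document ID, so writes are spread evenly
        // across the primary-key index instead of clustering in one spot.
        DocumentReference ref = db.collection("users").add(data).get();
        System.out.println("Created document " + ref.getId());

        // Note: an indexed sequential field (e.g. a last-updated timestamp)
        // would still hotspot that field's index, regardless of the document ID.
        db.close();
    }
}
```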

Retrieve all items with a column beginning with specified text on DynamoDB

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?
The use case you described is not directly supported by DynamoDB's Query operation today: DynamoDB requires you to specify a hash key and then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for use cases such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as the hash key and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter), one for each bucket_id value (0 to 99), using begins_with on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.
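A minimal sketch of that scatter-gather flow (AWS SDK for Java v1; the Items table and bucket_id-Name-index GSI names are hypothetical, and a real implementation would issue the 100 queries in parallel and page through results):

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ScatterGatherAutocomplete {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        String prefix = "foo"; // the text the user has typed so far

        List<Map<String, AttributeValue>> matches = new ArrayList<>();

        // Scatter: one query per bucket_id against the GSI (bucket_id hash key, Name range key).
        for (int bucket = 0; bucket < 100; bucket++) {
            QueryRequest request = new QueryRequest()
                    .withTableName("Items")
                    .withIndexName("bucket_id-Name-index")
                    .withKeyConditionExpression("bucket_id = :b AND begins_with(#n, :p)")
                    .withExpressionAttributeNames(Map.of("#n", "Name")) // NAME is a reserved word
                    .withExpressionAttributeValues(Map.of(
                            ":b", new AttributeValue().withN(Integer.toString(bucket)),
                            ":p", new AttributeValue().withS(prefix)));
            matches.addAll(client.query(request).getItems());
        }

        // Gather: merge the 100 result sets and re-sort by Name.
        matches.sort((a, b) -> a.get("Name").getS().compareTo(b.get("Name").getS()));
        matches.forEach(System.out::println);
    }
}
```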
The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search across all values, which doesn't work well. The "simplest" solution is to have a known hash key, so you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale, because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the Understand Partition Behavior documentation.
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.
