DynamoDB querying Global Secondary Index with non-existent ExclusiveStartKey returns unexpected results - amazon-dynamodb

I have a DynamoDB table with:
- partition key "idA", sort key "idB"
- GSI partition key "idB", sort key "idA"
I am attempting to delete all items with a specific "idB", so I query the GSI to get a list of records, but I also want to paginate the results for scale.
If I were querying against the main index, I could probably simply re-run the query with a limit and delete each record. But because I need to use the GSI, this results in records that have already been deleted showing up in subsequent queries due to the GSI's eventual consistency, i.e. a record's deletion often does not propagate to the GSI before the next query is invoked.
Another path is to use the LastEvaluatedKey from the previous query response as the ExclusiveStartKey for the next query, which should result in only new records being returned. However, there is a fair chance the record the LastEvaluatedKey points to will no longer exist, because it was deleted in the previous iteration.
When that happens, weird results are returned. I thought it should still work, because on the main index, if you send a non-existent ExclusiveStartKey, DynamoDB can still figure out where it should start retrieving records from, due to its ordering system.
But on the GSI the results often start with the next expected record, but then some records might get skipped, and often the query with the non-existent ExclusiveStartKey will not return a LastEvaluatedKey, even though not all remaining records have been returned.
I am playing with ideas to handle this strange behaviour:
- not deleting the last record returned, to decrease the chance that the ExclusiveStartKey does not exist (and deleting that record after the next iteration)
- doing an extra check when no LastEvaluatedKey is returned, to make sure no records actually remain
But they are messy workarounds.
Does anybody understand why the weird results happen, and whether they happen in any structured way?
Any other advice on how to solve the task I am performing?

DynamoDB GSIs behave just like the base table: a LastEvaluatedKey (LEK) is a pointer to a position in a partition, so the item it references does not need to exist for DynamoDB to work out where the next iteration should start.
Ensure you are not corrupting the ExclusiveStartKey (ESK) and that you pass the LEK to the next Query exactly as it was returned; it should include the GSI keys as well as the base table keys.
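As a rough sketch of that loop (AWS SDK for Java v2; the table name "MyTable" and index name "idB-idA-index" are placeholders for whatever your table actually uses):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class DeleteByIdB {
    public static void deleteAll(DynamoDbClient ddb, String idB) {
        Map<String, AttributeValue> startKey = null;
        do {
            QueryRequest.Builder query = QueryRequest.builder()
                    .tableName("MyTable")                 // placeholder table name
                    .indexName("idB-idA-index")           // placeholder GSI name
                    .keyConditionExpression("idB = :idB")
                    .expressionAttributeValues(
                            Map.of(":idB", AttributeValue.builder().s(idB).build()))
                    .limit(25);
            if (startKey != null) {
                query.exclusiveStartKey(startKey);        // the LEK, passed back untouched
            }
            QueryResponse page = ddb.query(query.build());

            for (Map<String, AttributeValue> item : page.items()) {
                ddb.deleteItem(DeleteItemRequest.builder()
                        .tableName("MyTable")
                        .key(Map.of("idA", item.get("idA"),   // deletes go against the base table keys
                                    "idB", item.get("idB")))
                        .build());
            }

            startKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
        } while (startKey != null);
    }
}

Because the GSI is eventually consistent, an item deleted in one iteration may still appear in a later page; issuing DeleteItem for it again is harmless, since deleting a key that no longer exists succeeds without error (unless a condition expression is attached).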
If you still see an issue after that, please share code.

Related

Mapping a dynamodb query result

I have a table with a composite key; there is both a partition key and a sort key. I know that the Java SDK allows me to query by just the partition key. However, if I do this, the docs say I will get back an ItemCollection<QueryOutcome> iterator. This means that to work with this data, I will have to iterate over the entire collection in order to fulfill my needs.
It would be easier if I was able to get back a Map<T, V> type where the key here would be the sort key. That way, I can quickly find rows for a particular sort key. Is this possible? I would rather not iterate over the collection just to find certain items with a certain sort key value.
If you just want an item with a certain sort key, that's a GetItem. Don't do a Query.
You may be confused by DynamoDB’s use of the word Query. That’s not the only way to query the database. It’s one way to query which happens to have the name Query.
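For illustration, a minimal GetItem with a composite key might look like the sketch below (AWS SDK for Java v2; the table name and the "partitionKey"/"sortKey" attribute names are placeholders):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

public class GetBySortKey {
    // GetItem addresses exactly one item by its full primary key (partition + sort).
    public static Map<String, AttributeValue> fetch(DynamoDbClient ddb, String pk, String sk) {
        GetItemResponse response = ddb.getItem(GetItemRequest.builder()
                .tableName("MyTable")   // placeholder table name
                .key(Map.of(
                        "partitionKey", AttributeValue.builder().s(pk).build(),
                        "sortKey", AttributeValue.builder().s(sk).build()))
                .build());
        return response.hasItem() ? response.item() : null;
    }
}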

DynamoDB Best practice to select all items from a table with pagination (Without PK)

I simply want to get a paginated list of products back from my table. The pagination part is relatively clear with last_evaluated_key; however, all the examples use a PK or SK, whereas in my case I just want to get paginated results sorted by createdAt.
My product id (a unique UUID) is not very useful in this case. Is the only solution left to scan the whole table?
Yes, you will use Scan. DynamoDB has two types of read operation, Query and Scan. You can Query for one-and-only-one Partition Key (and optionally a range of Sort Key values if your table has a compound primary key). Everything else is a Scan.
Scan operations read every item, returning at most 1 MB per request, optionally filtered. Filters are applied after the read. Results are unsorted.
The SDKs have pagination helpers like paginateScan to make life easier.
Re: cost. Ask yourself: "is Scan returning lots of data (MB) I don't actually need?" If the answer is "No", you are fine. The more you are overfetching, however, the greater the cost benefit of Query over Scan.
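paginateScan is the JavaScript SDK helper; as a sketch, the analogous helper in the AWS SDK for Java v2 is scanPaginator (the table and attribute names below are placeholders):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;

public class ListProducts {
    public static void printAll(DynamoDbClient ddb) {
        ScanRequest scan = ScanRequest.builder()
                .tableName("Products")   // placeholder table name
                .build();

        // scanPaginator follows LastEvaluatedKey across the 1 MB pages for you.
        for (Map<String, AttributeValue> item : ddb.scanPaginator(scan).items()) {
            System.out.println(item.get("createdAt"));
        }
    }
}

Since Scan results are unsorted, sorting by createdAt would still have to happen client-side (or via an index designed for that access pattern).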

What is the model for checking if a GSI key exists or not in DDB?

I have a pretty straightforward question: I want to know whether some GSI hash key exists or not.
The best I can find right now is:
DynamoDBQueryExpression<T> queryExpression = new DynamoDBQueryExpression<>();
// Logic for constructing the query
queryExpression.withIndexName(SomeIndexName);
QueryResultPage<T> queryResponse = mapper.queryPage(T.class, queryExpression, someMapperConfig);
Here the query result page contains a list of results; I can check whether that list has anything in it and conclude whether the key exists or not.
The obvious problem is the efficiency drop when matching items are present. Is there a way to avoid moving the contents of the items across the network just for the purpose of verification (i.e. a fully server-side evaluation of the predicate "does this GSI key exist")?
When writing an item with, for example, PutItem, you can add a condition specifying that the key must not exist. This way DynamoDB checks whether the provided key is already taken and returns an error when you try to put the item. Just catch the error, and then you know the key was already taken.
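A minimal sketch of that conditional write (AWS SDK for Java v2; the table name and the key attribute "pk" are placeholders, and note the condition is evaluated against the base table key):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class PutIfAbsent {
    public static boolean putIfAbsent(DynamoDbClient ddb, Map<String, AttributeValue> item) {
        try {
            ddb.putItem(PutItemRequest.builder()
                    .tableName("MyTable")                              // placeholder table name
                    .item(item)
                    .conditionExpression("attribute_not_exists(pk)")   // fail if the key already exists
                    .build());
            return true;    // key did not exist; the item was written
        } catch (ConditionalCheckFailedException e) {
            return false;   // key was already taken
        }
    }
}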

DynamoDB Global Secondary Index to increase performance

I am designing an application for tracing call activity.
Each call can be either terminated or activated. An application will query the database every minute to generate a list of active calls. There can be up to 1,000 calls per second.
How should I design my database? Should I have a "Call" table and a global secondary index on a "state" attribute that can equal "activated" or "terminated",
OR
a "Call" table and a global secondary index on an "isActive" attribute that is present for active calls only?
Problems you may face if you go with the schema you suggested in the question:
- A CallList table with 'state' as the GSI partition key: this way you will be able to query the GSI and get all the active calls, but eventually it will affect your performance, as there are very few distinct values for the partition key (I am assuming you won't be deleting the records either, so the table will grow huge in no time).
- A CallList table with a GSI on isActive: this has the same problem as above, as most of the rows will have isActive=False.
My proposed schema:
Keep a separate activeCall table that only contains entries for the active calls. This way you don't have to worry about the size of the table or the GSI, which eventually results in paying less; once a call is terminated, you can remove its entry from the table.
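As a rough sketch of that pattern (AWS SDK for Java v2; the "activeCall" table name comes from the answer, while the "callId" key attribute is a placeholder):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class ActiveCalls {
    // Insert an entry when a call becomes active.
    public static void callActivated(DynamoDbClient ddb, String callId) {
        ddb.putItem(PutItemRequest.builder()
                .tableName("activeCall")
                .item(Map.of("callId", AttributeValue.builder().s(callId).build()))
                .build());
    }

    // Remove the entry when the call terminates, so the table only ever holds active calls.
    public static void callTerminated(DynamoDbClient ddb, String callId) {
        ddb.deleteItem(DeleteItemRequest.builder()
                .tableName("activeCall")
                .key(Map.of("callId", AttributeValue.builder().s(callId).build()))
                .build());
    }
}

The per-minute report then becomes a read of this small table rather than a query over the full call history.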

How does GAE datastore index null values

I'm concerned about read performance. I want to know whether setting an indexed field's value to null is faster than giving it a value.
I have lots of items with a status field. The status can be, "pending", "invalid", "banned", etc...
My typical request is to find the status "ok" (or null). Since null fields are not saved to the datastore, it is already a win to avoid a "useless" default value that I can replace with null. So I already use less disk space.
But I was wondering: since the Datastore is NoSQL, it doesn't know about the data structure and it doesn't know there is a missing "status" column. So how does it handle the status = null check in a request?
Does it have to check all the columns of each row trying to find my column? Or is there some smarter mechanism?
For example, an index entry (null=Entity,key) when we pass a column explicitly saying it is null (and if this is the case, does Objectify respect that and keep the field in the list when passing it to the native API if it's null?).
And mainly, which request is more efficient?
The low-level API (and Objectify) stores and indexes nulls if you specify that a field/property should be indexed. For Objectify, you can specify @Ignore(IfNull.class) or @Unindex(IfNull.class) if you want to alter this behavior. You are probably confusing this with documentation for other data access APIs.
Since GAE only allows you to query for indexed fields, your question is really: Is it better to index nulls and query for them, or to query for everything and filter out non-null values?
This is purely a question of sparsity. If the overwhelming majority of your records contain null values, then you're probably better off querying for everything and filtering out the ones you don't want manually. A handful of extra entity reads are probably cheaper than updating and storing an extra index. On the other hand, if null records are a small percentage of your data, then you will certainly want the index.
This indexing dilemma is not unique to GAE. All databases present this question with respect to low-cardinality fields; it's just that they'll do the table scan (testing and skipping rows) for you.
If you really want to fine-tune this behavior, read Objectify's documentation on Partial Indexes.
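As a sketch of a partial index in Objectify (the entity and field names are made up; the idea is simply to index "status" only when it is non-null):

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.condition.IfNotNull;

@Entity
public class Item {
    @Id Long id;

    // Partial index: only create index entries for "status" when it has a value,
    // so entities with a null status stay out of the index entirely.
    @Index(IfNotNull.class)
    String status;
}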
null is also treated as a value in the Datastore, and there will be entries for null values in indexes. The Datastore documentation says, "Datastore distinguishes between an entity that does not possess a property and one that possesses the property with a null value".
Datastore will never check all columns or all records. If you have this property indexed, it will get records from the index only. If it is not indexed, you cannot query by that property.
In terms of query performance, it should be the same, but you can always profile and check.
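For illustration, a query for null-status entities might look like the sketch below (Objectify; the Record entity and its "status" field are made up, the field is plainly @Index'ed so null values get index entries too, and the entity is assumed to be registered with Objectify at startup):

import static com.googlecode.objectify.ObjectifyService.ofy;

import java.util.List;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

public class NullStatusQuery {
    @Entity
    public static class Record {
        @Id Long id;
        @Index String status;   // plain @Index: null values get index entries as well
    }

    // Fetch records whose indexed "status" property is null; Datastore serves this
    // from the index rather than scanning every record.
    public static List<Record> findNullStatus() {
        return ofy().load().type(Record.class).filter("status", null).list();
    }
}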
