How can I Scan an index in reverse in DynamoDB? - amazon-dynamodb

I am currently using DynamoDB and having a problem scanning. I am able to get paged results in forward order by using the ExclusiveStartKey. However, regardless of whether I set ScanIndexForward true or false, I get results in forward order from my scan operation. How can i get results in reverse order from a Scan in DynamoDB?

ScanIndexForward is the correct way to get items in descending order by the range key of the table or index you are querying. From the AWS API Reference:
A value that specifies ascending (true) or descending (false)
traversal of the index. DynamoDB returns results reflecting the
requested order determined by the range key. If the data type is
Number, the results are returned in numeric order. For type String,
the results are returned in order of ASCII character code values. For
type Binary, DynamoDB treats each byte of the binary data as unsigned
when it compares binary values.
Based on the docs for Scan, I conclude that there is no way to Scan in reverse. However, I would say that you are not using DynamoDB correctly if you need to do that. When designing a schema for a database like DyanmoDB you should plan the schema based on your expected queries to ensure that almost all application queries have a good index. Scans are meant more for sys admin operations or for feeding into MapReduce or analytics. "A Scan operation always scans the entire table, then filters out values to provide the desired result, essentially adding the extra step of removing data from the result set." (Query and Scan Performance) That can lead to performance problems and other issues.
Using DynamoDB is fundamentally different from working with a traditional relational database and requires a big change in the way you think about using it. You need to decide whether DynamoDB's advantages of availability in storage and performance, reliability and availability are worth accepting its limitations.

As of now the dynamoDB scan cannot return you sorted results.
You need to use a query with a new global secondary index (GSI) with a hashkey and range field. The trick is to use a hashkey which is assigned the same value for all data in your table.
I recommend making a new field for all data and calling it "Status" and set the value to "OK", or something similar.
Then your query to get all the results sorted would look like this:
{
TableName: "YourTable",
IndexName: "Status-YourRange-index",
KeyConditions: {
Status: {
ComparisonOperator: "EQ",
AttributeValueList: [
"OK"
]
}
},
ScanIndexForward: false
}
The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying

Related

Does a logical partition scan in CosmosDB always returns items in the same order?

In CosmosDB using the SQL API ( hope API might not matter ) and queries that do not use ORDER BY over an specific Logic Partition ( e.g. WHERE CustomerId = 123 ), wondering if the response will return the results always in the same order.
A use case could be something like an Audit log, where it is possible that TimeStamp _ts is not granular enough so likely to find at some point the same value twice and the source or events doesn't allow to create an sequence that can be used for ordering.
wondering if the response will return the results always in the same
order.
Based on my previous test, if you do not set any sort rules, it will be sorted as default based on the time created in the database,whatever it is partitioned or not.
In above sample documents, the sort will not be changed if I change the id,partition key(that's name) or ts.

How to get attribute from a list of partition keys in DynamoDB - is scan my only option?

I've got a list of partition keys from one table.
userId["123","456","235"]
I need to get an attribute that they all share. like "username".
What would be the best practice to get them all at once?
Is scan my only option knowing that I know all my partition keys?
Do I know the sort key? yes but only the beginning of it. Therefore I
don't think I could use batchGetItem.
Scan is only appropriate if you don't know the partition keys. Because you know the partition keys you want to search, you can achieve the desired behavior with multiple Query operations.
A Query searches all documents with the specified partition key; you can only query one partition key per request, so you'll need multiple queries, but this will still be significantly more efficient than a single Scan operation.
If you're only looking for documents with a sort key that begins with something, you can include it in your KeyConditionExpression along with the partition key.
For example, if you wanted to only return documents whose sort key begins with a certain string, you could pass something like userId = :user_id AND begins_with(#SortKey, :str) as the key condition expression.
You can efficiently achieve the result by using PartQL SELECT statement. It allows to query array of partition keys with IN operator and apply additional conditions on other attributes without causing a full table scan.
To ensure that a SELECT statement does not result in a full table
scan, the WHERE clause condition must specify a partition key. Use the
equality or IN operator.

Dynamodb query expression

Team,
I have a dynamodb with a given hashkey (userid) and sort key (ages). Lets say if we want to retrieve the elements as "per each hashkey(userid), smallest age" output, what would be the query and filter expression for the dynamo query.
Thanks!
I don't think you can do it in a query. You would need to do full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table, where you have just a hash key (userID). This table will contain record with the smallest age for given user. To achieve that, make sure that every time you update main table you also update second one if new age is less than current age in the second table. You can use conditional update for that. Update can either be done by application itself, or you can have AWS lambda listening to dynamoDB stream. Now if you need smallest age for each use, you still do full table scan of the second table, but this scan will only read relevant records, to it will be optimal.
There are two ways to achieve that:
If you don't need to get this data in realtime you can export your data into a other AWS systems, like EMR or Redshift and perform complex analytics queries there. With this you can write SQL expressions using joins and group by operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is use DynamoDB streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert of an original query you can query minimum age for an updated user and store into the MinAges table
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not has native transactions support it's dangerous to run code like that, since you may end up with inconsistent data. You can use DynamoDB transactions library, but these transactions are expensive. While if you are using streams you will have consistent data, at a very low price.
You can do it using ScanIndexForward
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
.withHashKeyValues(requestEntity)
.withConsistentRead(false);
equeryExpression.setIndexName(IndexName); // if you are using any index
queryExpression.setScanIndexForward(false);
queryExpression.setLimit(1);

How does GAE datastore index null values

I'm concerned about read performance, I want to know if putting an indexed field value as null is faster than giving it a value.
I have lots of items with a status field. The status can be, "pending", "invalid", "banned", etc...
my typical request is to find the status "ok" (or null). Since null fields are not saved to datastore, it is already a win to avoid to have a "useless" default value I can replace with null. So I already have less disk space use.
But I was wondering, since datastore is noSql, it doesn't know about the data structure and it doesn't know there is a missing column status. So how does it do the status = null request check?
Does it have to check all columns of each row trying to find my column? or is there some smarter mechanism?
For example, index (null=Entity,key) when we pass a column explicitly saying it is null (if this is the case, does Objectify respect that and keep the field in the list when passing it to the native API if it's null?)
And mainly, which request is more efficient?
The low level API (and Objectify) stores and indexes nulls if you specify that a field/property should be indexed. For Objectify, you can specify #Ignore(IfNull.class) or #Unindex(IfNull.class) if you want to alter this behavior. You are probably confusing this with documentation for other data access APIs.
Since GAE only allows you to query for indexed fields, your question is really: Is it better to index nulls and query for them, or to query for everything and filter out non-null values?
This is purely a question of sparsity. If the overwhelming majority of your records contain null values, then you're probably better off querying for everything and filtering out the ones you don't want manually. A handful of extra entity reads are probably cheaper than updating and storing an extra index. On the other hand, if null records are a small percentage of your data, then you will certainly want the index.
This indexing dilema is not unique to GAE. All databases present this question with respect to low-cardinality fields; it's just that they'll do the table scan (testing and skipping rows) for you.
If you really want to fine-tune this behavior, read Objectify's documentation on Partial Indexes.
null is also treated as a value in datastore and there will be entries for null values in indexes. Datastore doc says, "Datastore distinguishes between an entity that does not possess a property and one that possesses the property with a null value"
Datastore will never check all columns or all records. If you have this property indexed, it will get records from the index only If not indexed, you cannot query by that property.
In terms of query performance, it should be the same, but you can always profile and check.

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, if it is included in the current cached results and at what index it should be inserted into if it's within the current cached scope. The biggest downfall of this approach is it's complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters but this is all occurring on an iOS device, using the objective-c programming language.
More Info
The Query and subsequent cache is based off of user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reticence in simply recreating the cache on insertions (and edits, actually) is to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident as to what I'm trying to do.
struct Person {
std::string name;
std::string addr;
};
struct Page {
std::string key;
std::vector<Person> persons;
struct Less {
bool operator()(const Page &lhs, const Page &rhs) const {
return lhs.key.compare(rhs.key) < 0;
}
};
};
typedef std::set<Page, Page::Less> pages_t;
pages_t pages;
void insert(const Person &person) {
if (sql_insert(person)) {
pages_t::iterator drop_cache_start = pages.lower_bound(person);
//... drop this page and everything after it
}
}
You'd have to do some wrangling to get different datatypes of key to work nicely, but its possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data though. If you only fill pages from the database, then you'll have less data consistency worries.
This may be a bit off topic, you aren't re-implementing views are you? It doesn't cache per se, but it isn't clear if that is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a Query Statement is also the Count of all it's previous records. What I needed to do was 'convert' all the ORDER statements in the query to a series of WHERE statements that would return only the preceding records and take a count of those records. It's trickier than it sounds (or maybe not...it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed to have an order column in the Order Parameters that was based off of a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parenthesis are tie breakers. If, in fact, the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Mathematical operations on NULL values always return false, and yet when you sort SQLite sorts NULL values first. Therefore, in order to deal with NULL values appropriately you've gotta add another layer of complication. First, all mathematical equality operations, =, must be replace by IS. Second, all < operations must be nested with an OR IS NULL to include NULL values appropriately on the < operator. This turns the above operation into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE parameter.
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.

Resources