DynamoDB Scan Vs Query on same data - amazon-dynamodb

I have a use case where I have to return all elements of a table in Dynamo DB.
Suppose my table has a partition key (Column X) having same value in all rows say "monitor" and sort key (Column Y) with distinct elements.
Will there be any difference in execution time in the below approaches or is it the same?
Scanning whole table.
Querying data based on the partition key having "monitor".

You should use the parallel scan concept. Basically, you run multiple scans at once on different segments of the table. Watch out for higher RCU usage, though.
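A minimal sketch of the parallel scan idea, written against any callable that accepts Scan-style keyword arguments (Segment, TotalSegments, ExclusiveStartKey) and returns a dict with Items and, when more pages remain, LastEvaluatedKey. The helper names and the in-memory fake_scan are hypothetical stand-ins so the example runs without a real table.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(scan, segment, total_segments):
    """Read every page of one segment; `scan` is any Scan-like callable."""
    items = []
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(scan, total_segments=4):
    """Scan all segments concurrently and concatenate the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        parts = pool.map(
            lambda segment: scan_segment(scan, segment, total_segments),
            range(total_segments),
        )
        return [item for part in parts for item in part]


def fake_scan(Segment, TotalSegments, ExclusiveStartKey=None):
    """Hypothetical in-memory stand-in for a table scan (two items per page)."""
    data = [{"Y": n} for n in range(10) if n % TotalSegments == Segment]
    start = ExclusiveStartKey or 0
    page = {"Items": data[start:start + 2]}
    if start + 2 < len(data):
        page["LastEvaluatedKey"] = start + 2
    return page


all_items = parallel_scan(fake_scan, total_segments=3)
```

With boto3, the scan method of a Table resource could probably be passed in place of fake_scan. Each worker consumes read capacity on its own, which is where the higher RCU usage comes from.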

Avoid using scan as far as possible.
Scan will fetch all the rows from a table, and you will also have to use pagination to iterate over all of them. It is more like a select * from table; SQL operation.
Use query if you want to fetch all the rows for a given partition key. If you know which partition key you want the results for, you should use query, because it uses the key index to fetch only the rows with that specific partition key.
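Both operations return results in pages, so reading everything looks much the same for either. A minimal sketch of the pagination loop, assuming a callable that returns a dict with Items and, when more pages remain, LastEvaluatedKey (the shape of Query and Scan responses). The read_all and fake_query names are hypothetical.

```python
def read_all(fetch_page, **kwargs):
    """Yield every item from a paginated Query or Scan style call."""
    while True:
        page = fetch_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def fake_query(ExclusiveStartKey=None, **_ignored):
    """Hypothetical in-memory stand-in that returns two rows per page."""
    rows = [{"X": "monitor", "Y": n} for n in range(5)]
    start = ExclusiveStartKey or 0
    page = {"Items": rows[start:start + 2]}
    if start + 2 < len(rows):
        page["LastEvaluatedKey"] = start + 2
    return page


items = list(read_all(fake_query))
```

With boto3, fetch_page could be table.query called with KeyConditionExpression=Key("X").eq("monitor"), or table.scan with no condition at all.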

Direct answer
To the best of my knowledge, in the specific case you are describing, scan will be marginally slower (esp. in first response). This is when assuming you do not do any filtering (i.e., FilterExpression is empty).
Further thoughts
DynamoDB can potentially store huge amounts of data. By "huge" I mean "more than can fit in any machine's RAM". If you need to 'return all elements of a table' you should ask yourself: what happens if that table grows such that all elements will no longer fit in memory? You do not have to handle this right now (I believe that as of now the table is rather small), but you do need to keep in mind the possibility of going back to this code and fixing it so that it addresses this concern.
Questions I would ask myself if I were in your position:
(1) Can I somehow set a limit on the number of items I need to read (say, read only the first 1000 items)?
(2) How is this information (the list of items) used? Is it sent back to a JS application running inside a browser which displays it to a user? If the answer is yes, then what will the user do with a huge list of items?
(3) Can you work on the items one at a time (or 10 or 100 at a time)? If the answer is yes, then you only need to store one (or 10 or 100) items in memory, not the entire list of items.
In general, in DDB, scan operations are used as described in (3): read one item (or several items) at a time, do some processing, and then move on to the next item.
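A minimal sketch of this "a few items at a time" pattern. The in_batches and total_points helpers and the points attribute are hypothetical; the input can be any lazy iterator of items, so only one batch is held in memory at once.

```python
from itertools import islice


def in_batches(items, size):
    """Yield lists of at most `size` items without materialising the input."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def total_points(items, batch_size=100):
    """Example processing step: sum one attribute, one batch at a time."""
    total = 0
    for batch in in_batches(items, batch_size):
        total += sum(item["points"] for item in batch)
    return total


# A generator stands in for items arriving page by page from a scan.
stream = ({"points": n} for n in range(250))
result = total_points(stream, batch_size=100)
```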

Related

Dynamo DB, How do you query everything AND leverage sort key

I already have an index set up with the second sort key set to what I want (an integer timestamp). The API keeps complaining that I'm not giving it a KeyConditionExpression. Then if I give it one, it says id must be specified. I've tried forcing it to just give me everything using id <> null and it STILL won't do it. Is this even possible?? Maybe it's time to get rid of Dynamo if it can't do this utterly simple task.
For the love of god, all I'm trying to do is query the entire table AND have it use my sort key. I would have had this going in SQL hours ago..
First of all, DynamoDB is a NoSQL database, so it's intentionally NOT SQL. Perhaps you shouldn't expect to be able to perform the SQL-like queries you are used to, or be frustrated by the fact that these are two completely different types of databases, each with its strengths and weaknesses.
Records in DynamoDB are partitioned using the hash key, and may optionally be sorted within each partition.
The hash key should be picked so that items are as evenly distributed over partitions as possible. The use of partitions is what makes DynamoDB extremely scalable and fast. But if what you need is to scan over all your items and get them in sorted order, then you probably either are using the wrong tool for the job, or you need to sort the items on the client side.
The scan operation will simply go through all partitions, returning all items from each partition. At this point, the items can only be sorted within their respective partition.
As an example, consider a set of data being partitioned into 3 partitions:
Partition A    Partition B    Partition C
Sort key       Sort key       Sort key
A              D              C
C              E              K
P              G              L
As you can see, you can easily query each partition and get the items in it in sorted order. But if you scan, you will probably get items sorted as
[A, C, P, D, E, G, C, K, L], if the sort order is at all deterministic. At this point you would have to sort the items yourself.
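A small sketch of that client-side step, using the sort keys from the example. Because each partition is already sorted, the per-partition lists can be merged rather than fully re-sorted; the variable names are illustrative only.

```python
import heapq

# Sort keys of the three partitions from the example, each already in order.
partitions = [
    ["A", "C", "P"],
    ["D", "E", "G"],
    ["C", "K", "L"],
]

# What a scan would hand back: partition after partition.
scan_order = [key for partition in partitions for key in partition]

# Merging the sorted runs yields one globally sorted sequence.
merged = list(heapq.merge(*partitions))
```

If the per-partition grouping is not known on the client, calling sorted(scan_order) gives the same result at the cost of a full sort.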
A "trick" that is sometimes seen is to use a "dummy" hash key with an equal value for all items, like you mentioned in your own answer. This way you can query for "dummy = 1" and get the items sorted according to the sort key. However, this completely defeats the purpose of the hash key as all items will be put in the same partition, thus not making the table scale at all. But if you find yourself using DynamoDB even though you have a really small dataset, by all means it would work. But again, with a small data set and use-cases like this, you should probably be using another tool such as RDS in the first place.
Just to elaborate on @JHH's answer, though. In general, I'd say he is correct that you shouldn't need to sort all elements in DynamoDB. I also have a requirement similar to this, as I need to get the top N elements, which could all be in different partitions.
DynamoDB does have a way of doing this; it just isn't available out of the box. I don't think it's quite right to say you then need an SQL database, because by that argument you'd never use a NoSQL database, as you will always hit one of these limitations. Also, if you only ever use NoSQL for large data sets, you will always have to rework your application later.
What to do then? Well, you do have a few options, and it depends on your use case. Let's assume that you at least have sorting within your partitions, which makes it easier. We'll also assume you are looking for the max.
The simplest way would be to get the first value from every partition and find the max. If you needed, say, the top 10 values, you could still use this strategy, but it would get more complicated.
Next option is to make use of DynamoDB Streams. Say we want to keep a list of the top 100 elements. These would sit ready and waiting on their own top values partition, sorted and ready for instant retrieval. You would need to maintain this list yourself by checking when items are inserted or updated, that they are greater than the 100th element. If that is the case you would insert the element into the top values partition, and delete the last value. This I think would be the most likely way to approach this problem.
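A minimal in-memory sketch of the bookkeeping such a stream consumer would do. The TopN class is hypothetical; in practice the retained entries would live in the dedicated top-values partition rather than in a Python list, and an item whose score changes after it was already admitted would need extra handling that this sketch leaves out.

```python
import heapq


class TopN:
    """Keep the n largest (score, item_id) pairs seen so far."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap: the smallest retained entry sits at index 0

    def offer(self, score, item_id):
        """Return True if the entry made it into the top-n list."""
        entry = (score, item_id)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
            return True
        if entry > self._heap[0]:
            # Beats the current n-th element: insert it and drop the last one.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def ranked(self):
        """Return the retained entries, highest score first."""
        return sorted(self._heap, reverse=True)


top = TopN(3)
for score, item_id in [(5, "a"), (1, "b"), (9, "c"), (3, "d"), (7, "e")]:
    top.offer(score, item_id)
```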
So in NoSQL, if there is some sort of query you would love to do which is oh so easy in SQL, and you can't use your Table/GSI/LSI, then you pretty much need to compute the result manually and have it ready for consumption.
Now, if you weren't going to make use of these top values very often, you might go with the first method and scan every partition's top values until you had the list you wanted, but depending on how much the values are scattered across partitions this could take many capacity units.
Hope that helps.
Turns out, you can also add an IndexName to a scan. That helps. Furthermore, if you create an index with a sort key, all items MUST share the same partition (hash) key value for the sort to apply across them.

Model daily game ranking in DynamoDB

I have a question. I'm pretty new to DynamoDB but have been working on large-scale aggregation on SQL databases for a long time.
Suppose you have a table called GamePoints (PlayerId, GameId, Points) and would like to create a ranking table Rankings (PlayerId, Points) sorted by points.
This table needs to be updated on an hourly basis but keeping the previous version of its contents is not required. Just the current Rankings.
The query will always be "give me the ranking table" (with paging).
The GamePoints table will get very very large over time.
Questions:
Is this the best practice schema for DynamoDB ?
How would you do this kind of aggregation?
Thanks
You can enable a DynamoDB Stream on the GamePoints table. You can read stream records from the stream to maintain materialized views, including aggregations, like the Rankings table. Set StreamViewType=NEW_IMAGE on your GamePoints table, and set up a Lambda function to consume stream records from your stream and update the points per player using atomic counters (UpdateItem, HK=player_id, UpdateExpression="ADD Points :stream_record_points", ExpressionAttributeValues={":stream_record_points": [put the value from the stream record here]}). As the hash key of the Rankings table would still be the player ID, you could do a full table scan of the Rankings table every hour to get the n highest players, or fetch all the players and sort them.
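A sketch of the translation step such a Lambda function might perform, turning one stream record into keyword arguments for a low-level UpdateItem call. The update_from_record helper, the attribute names PlayerId and Points, and the assumption that GamePoints rows are only ever inserted (so each new image can be added once) are illustrative, not taken from the question.

```python
def update_from_record(record, table_name="Rankings"):
    """Translate one stream record into UpdateItem kwargs, or None to skip."""
    if record.get("eventName") != "INSERT":
        # Assumption: rows are insert-only, so other events are ignored.
        return None
    image = record["dynamodb"]["NewImage"]
    return {
        "TableName": table_name,
        "Key": {"PlayerId": image["PlayerId"]},
        # ADD is atomic: it creates the counter or increments it in place.
        "UpdateExpression": "ADD #points :delta",
        "ExpressionAttributeNames": {"#points": "Points"},
        "ExpressionAttributeValues": {":delta": image["Points"]},
    }


# Sample record in the attribute-value format that stream records use.
record = {
    "eventName": "INSERT",
    "dynamodb": {
        "NewImage": {
            "PlayerId": {"S": "p1"},
            "GameId": {"S": "g7"},
            "Points": {"N": "25"},
        }
    },
}
params = update_from_record(record)
```

Inside a handler, each entry of event["Records"] would go through this function, and any non-None result would be passed to the DynamoDB client as update_item(**params). Names prefixed with # are attribute name placeholders, while names prefixed with : are value placeholders.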
However, considering the size of fields (player_id and number of points probably do not take more than 100 bytes), an in memory cache updated by a Lambda function could equally well be used to track the descending order list of players and their total number of points in real time. Finally, if your application requires stateful processing of Stream records, you could use the Kinesis Client Library combined with the DynamoDB Streams Kinesis Adapter on your application server to achieve the same effect as subscribing a Lambda function to the Stream of the GamePoints table.
An easy way to do this is by using DynamoDB's hash key and sort key. For example, the hash key is the GameId and the sort key is the Score. You then query the table with a descending sort and a limit to get the real-time top players in O(1).
To get the rank of a given player, you can use the same technique as above: you get the top 1000 scores in O(1) and you then use binary search to find the player's rank amongst the top 1000 scores in O(log n) on your application server.
If the user is not among the top 1000, you can report the rank as 1000+. You can also obviously change 1000 to a greater number (100,000 for example).
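A minimal sketch of the binary search step, assuming the top scores have already been fetched in descending order. The rank_in_top helper is hypothetical; tied scores share the best rank among them, and None stands for the "1000+" case.

```python
from bisect import bisect_left


def rank_in_top(top_scores_desc, score):
    """Return the 1-based rank of `score`, or None if it is below the list."""
    # bisect expects ascending order, so search on the negated scores.
    ascending_keys = [-s for s in top_scores_desc]
    index = bisect_left(ascending_keys, -score)
    if index == len(top_scores_desc):
        return None  # lower than every listed score
    return index + 1


top_scores = [90, 80, 80, 70]
```

Negating the list costs O(n) on each call; if many lookups are made against the same list, the negated copy could be built once and reused.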
Hope this helps.
Henri
PutItem can be helpful for implementing the persistence logic for your use case:
PutItem Creates a new item, or replaces an old item with a new item.
If an item that has the same primary key as the new item already
exists in the specified table, the new item completely replaces the
existing item. You can perform a conditional put operation (add a new
item if one with the specified primary key doesn't exist), or replace
an existing item if it has certain attribute values. Source:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html
In terms of querying the data, if you know for sure that you are going to be reading the entire Ranking table, I would suggest doing it through several read operations with minimum acceptable page size so you can make the best use of your provisioned throughput. See the guidelines below for more details:
Instead of using a large Scan operation, you can use the following
techniques to minimize the impact of a scan on a table's provisioned
throughput.
Reduce Page Size
Because a Scan operation reads an entire page (by default, 1 MB), you
can reduce the impact of the scan operation by setting a smaller page
size. The Scan operation provides a Limit parameter that you can use
to set the page size for your request. Each Scan or Query request that
has a smaller page size uses fewer read operations and creates a
"pause" between each request. For example, if each item is 4 KB and
you set the page size to 40 items, then a Query request would consume
only 40 strongly consistent read operations or 20 eventually
consistent read operations. A larger number of smaller Scan or Query
operations would allow your other critical requests to succeed without
throttling.
Isolate Scan Operations
DynamoDB is designed for easy scalability. As a result, an application
can create tables for distinct purposes, possibly even duplicating
content across several tables. You want to perform scans on a table
that is not taking "mission-critical" traffic. Some applications
handle this load by rotating traffic hourly between two tables – one
for critical traffic, and one for bookkeeping. Other applications can
do this by performing every write on two tables: a "mission-critical"
table, and a "shadow" table.
SOURCE: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html#QueryAndScanGuidelines.BurstsOfActivity
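The arithmetic in the quoted example can be checked with a short sketch. It assumes one strongly consistent read operation covers up to 4 KB of data read and that an eventually consistent read costs half as much; the read_units helper is hypothetical and ignores any further rounding the service may apply.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # assumption: one read operation covers up to 4 KB


def read_units(item_size_bytes, item_count, strongly_consistent=True):
    """Estimate the read operations consumed by one page of a Query or Scan."""
    total_bytes = item_size_bytes * item_count
    units = math.ceil(total_bytes / READ_UNIT_BYTES)
    return units if strongly_consistent else units / 2


# The quoted example: 40 items of 4 KB each in a single page.
strong = read_units(4 * 1024, 40)
eventual = read_units(4 * 1024, 40, strongly_consistent=False)
```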
You can also segment your tables by GameId (e.g. Ranking_GameId) to distribute the data more evenly and give you more granularity in terms of provisioned throughput.

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples but I think I'm still slightly confused by the example. I've created the example data on a local dynamodb instance to get used to querying data etc. The sample data sets up 3 tables of 'Forum'->'Thread'->'Reply'
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather the only way to "select *" in dynamodb is to use a scan and I assume in this instance - where forum is very high level and might have a relatively small number of rows - that it wouldn't be that expensive or are you actually better creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this instance, maybe just a number and then specify in the query that the value has to be > 0? Or perhaps a date it was created and the query always uses a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (Greater than or equal) with an attribute value list of 'S'=>'a' but this states that any conditions on the hash key must be of type EQ which implies I couldn't do the above as I would always need to know my 'Name' values upfront
Maybe I'm still struggling, having come from an RDBMS background, especially seeing as there are many forum examples out there.
thanks
I think using Scan to get all the forums is fine. I think it is very efficient because it will not return you anything that you don't need (all of the work that scan does is necessary). Also since Scan operation is so simple it is easier to implement and more likely to be efficient

Get last N records in a DynamoDB table

Is there any way to get the last N records from a DynamoDB table? The range key I have is the timestamp, so I could use ScanIndexForward to order items chronologically.
But in order to query I need to have a hashKey condition, which I don't want to filter. Any thoughts?
DynamoDB is not designed to work this way. The items are distributed according to a hash on the HashKey in such a way that the order is not predictable.
Your options include:
grouping the items under a single hash key (not recommended: you would overload a few servers with your data, and Amazon cannot guarantee your read/write capacity)
scanning the whole table and keeping the N most recent items (something like for (item in items) { if (item newer than oldest accumulated item) accumulate item; });
partitioning your table into multiple tables (i.e., instead of a table called Events, create one called Events20130705 for today's events, Events20130706 for tomorrow's events), and scanning just like the previous option -- this way your scans are smaller
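The "scan and keep the N most recent" option can be sketched with the standard library. The most_recent helper and the timestamp attribute name are assumptions; heapq.nlargest consumes the items one by one and keeps at most n of them in memory, so it also works on a lazy stream of scan results.

```python
import heapq


def most_recent(items, n, key="timestamp"):
    """Return the n items with the largest timestamp, newest first."""
    return heapq.nlargest(n, items, key=lambda item: item[key])


# A generator stands in for items arriving from a paginated scan.
scanned = ({"id": i, "timestamp": ts} for i, ts in enumerate([5, 1, 9, 3, 7]))
latest = most_recent(scanned, 3)
```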
You could also maybe change your data model. For example, you could have one versioned entry that keeps references to the N most recent items. Or you could have something like a single counter that you'd increment, and update N other entries under hash keys such as recent-K, where K is your counter mod N.
Maybe you could even use another tool for this job. For instance, you could have a Redis server to do this. Without knowing your use case with much more detail, it is hard to make a precise suggestion -- how scalable should this be? how reliable should it be? how much maintenance are you willing to perform? how much are you willing to pay for it?
It's usually better to embrace the limitation, know your constraints and be creative.
I'm not sure this is still relevant. I'm fairly sure you can use ScanIndexForward along with a rangeKey to get the latest value.

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what's the optimal way to retrieve chunks of data from a large table. Currently, we use the fact that the table has an autoincrement PK, so we get chunks of 1000 rows until it starts to return nothing. Kind of like (in pseudo language):
i = 0
while (true)
{
    SELECT col1, col2, col3 FROM mytable WHERE pk between i and i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need to SELECT COUNT(*), which would be a big performance killer, and we just move up the pk values until we stop getting any results. This has its con: if we have a hole greater than 20,000 values somewhere in the table, it will stop indexing, assuming it has reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns would be:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires separate db index on the relevant date field(s) which should be a bit costly for 20M+ records but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.
EDIT: added few explanations as per question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms a correct Lucene Document from this moment on). After that, we can reindex things in batches (either automatically or by hand) by providing the relevant period ranges. This, to a certain extent, also applies to Lucene version changes.
Why is COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT(table_name) to get the last generated id, and use that instead of MAX() or Count() - this method should blow the other two away :)
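As an alternative to knowing the last id up front, the chunked read can use keyset pagination: each query asks for the next rows after the last pk seen, so holes in the pk sequence do not matter and an empty result really means the end of the table. This is a different technique from the fixed-range loop in the question. The sketch below uses an in-memory SQLite database purely as a stand-in for SQL Server, assumes positive pk values, and the iter_chunks helper is hypothetical.

```python
import sqlite3


def iter_chunks(conn, chunk_size=1000):
    """Yield lists of rows in pk order, chunk_size rows at a time."""
    last_pk = 0  # assumption: pk values are positive
    while True:
        rows = conn.execute(
            "SELECT pk, col1 FROM mytable WHERE pk > ? ORDER BY pk LIMIT ?",
            (last_pk, chunk_size),
        ).fetchall()
        if not rows:
            return  # no rows after last_pk: the whole table has been read
        yield rows
        last_pk = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pk INTEGER PRIMARY KEY, col1 TEXT)")
# Leave a large hole in the pk sequence to show that it is skipped cleanly.
pks = list(range(1, 6)) + list(range(50_000, 50_004))
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)", [(pk, f"row{pk}") for pk in pks]
)
chunks = list(iter_chunks(conn, chunk_size=4))
```

On SQL Server the same query would be written with TOP instead of LIMIT, and each chunk would be sent to the web service for reindexing.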
