What is the best way to query only the primary key in Amazon DynamoDB?

We have a dynamodb table with close to 5000 items. The primary key for this table is a column called "serialNumber". We have to run a scheduled job that picks up all the serial numbers and does some processing. What is the most optimized way to query for this data? We only need the serial numbers and not any other columns. Should I use scan/LSI/paginated query, or something else?

Instead of querying all your data, I recommend using a DynamoDB stream with Lambda. You will be able to get just the primary key values and do whatever you need with them.
http://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html
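A minimal sketch of what such a Lambda handler could look like, assuming the stream is attached to the table from the question and that "serialNumber" is a string (S) attribute. The function name `handler` and the returned list are illustrative choices, not something from the original answer.

```python
def handler(event, context):
    """Lambda entry point for a DynamoDB Streams event source (sketch)."""
    changes = []
    for record in event.get("Records", []):
        # Each stream record carries the item's key attributes under dynamodb.Keys.
        keys = record.get("dynamodb", {}).get("Keys", {})
        # Assumption: serialNumber is stored as a string (type descriptor "S").
        serial = keys.get("serialNumber", {}).get("S")
        if serial is not None:
            # eventName is INSERT, MODIFY, or REMOVE.
            changes.append((record.get("eventName"), serial))
    return changes
```

If only the keys are needed, the stream can be configured with the KEYS_ONLY view type so that records contain nothing but the key attributes.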

If you need to repeatedly get the list of all the serial numbers in the table, you can do a scan with a projection of only the serial number. For a table with 5000 items this will execute really fast and won't consume a ton of capacity (you're probably looking at about 20 IOPs for the scan). For trivial loads and infrequent access (i.e., once an hour or once a day), a scan is the way to go; there is no need for excessive complexity.
However, if you expect your table's contents to change all the time and you need near-real-time updates, then a DynamoDB Stream to Lambda, potentially with a cache, would be the way to go.
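As a rough sketch of the scan-with-projection approach: the function below assumes `client` is a boto3 low-level DynamoDB client (for example `boto3.client("dynamodb")`) and that "serialNumber" is a string attribute. The function name is made up for illustration.

```python
def scan_serial_numbers(client, table_name):
    """Return every serialNumber in the table, following scan pagination."""
    serials = []
    kwargs = {
        "TableName": table_name,
        # Only return the key attribute, not the rest of each item.
        "ProjectionExpression": "serialNumber",
    }
    while True:
        page = client.scan(**kwargs)
        # Assumption: serialNumber is stored as a string ("S").
        serials.extend(item["serialNumber"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return serials
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

Note that the projection shrinks the response, but a scan is still billed on the data it reads, so the capacity consumed depends on the full item sizes. For 5000 small items that should still be cheap.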

Related

Filtering a large dynamodb table for data analytics purposes

We have a request from our compliance department asking us to scan a DynamoDB table that has millions of records. We need to be able to filter all the records for approximately 1300 email addresses. The email address on this table is not the partition key, but it is in a global secondary index.
This is not a one time request and we need to be able to repeat this process with minimal effort in future. That means the table might have grown in that time or the number of requested emails might be larger.
What would be the best approach to filter the data and only take the records related to these emails?
I can only think of the following two approaches, maybe utilizing a lambda or step functions if the work needs to be done in batches but am open to any scalable alternatives:
should we export the whole table to S3 and then process that?
go through each email and call dynamodb
You say that the emails are in a GSI. If the email is in the primary key for the GSI then the easiest solution is to call DynamoDB once for each email, and you can make these calls in parallel (but you may want to do them in chunks of 1000 to avoid throttles or exhausting file handles on your host).
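Something along these lines could do the one-query-per-email approach in parallel. It is only a sketch: it assumes `client` is a boto3 low-level DynamoDB client, that the GSI's partition key is a string attribute named "email", and the function names and `max_workers` default are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def query_email(client, table_name, index_name, email):
    """Return all items for one email from the GSI, following pagination."""
    items = []
    kwargs = {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#e = :e",
        # Assumption: the GSI partition key is a string attribute called "email".
        "ExpressionAttributeNames": {"#e": "email"},
        "ExpressionAttributeValues": {":e": {"S": email}},
    }
    while True:
        page = client.query(**kwargs)
        items.extend(page.get("Items", []))
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def query_emails(client, table_name, index_name, emails, max_workers=16):
    """Query each email concurrently and return {email: [items]}."""
    emails = list(emails)
    # max_workers bounds the number of in-flight requests to limit throttling.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda e: query_email(client, table_name, index_name, e), emails
        )
        return dict(zip(emails, results))
```

The thread pool size plays the role of the chunking mentioned above: lowering it reduces the chance of throttles at the cost of a longer run.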
If the email is not in the PK, then running a scan on the GSI, returning KEYS_ONLY can be ok depending on your table size and how often you run the task. If you have 10 million records with 1KB average record size in the GSI, this will cost $0.30 USD each time it is run. You can run a parallel scan to make it run fast. You can judge if the time/money tradeoff makes sense versus another solution that takes more engineering effort, such as exporting to S3.
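To make the cost figure easier to re-check, here is a back-of-the-envelope estimate in code. The pricing is an assumption on my part ($0.25 per million on-demand read request units, with an eventually consistent read costing half a unit per 4 KB); check the current price list for your region before relying on it.

```python
def estimate_scan_cost(item_count, avg_item_kb,
                       price_per_million_units=0.25,  # assumed on-demand price, USD
                       eventually_consistent=True):
    """Rough cost in USD of one full scan; ignores per-request rounding."""
    total_kb = item_count * avg_item_kb
    # A scan is billed on the data read, in 4 KB units.
    units = total_kb / 4
    if eventually_consistent:
        # Eventually consistent reads cost half a unit per 4 KB.
        units /= 2
    return units / 1_000_000 * price_per_million_units


cost = estimate_scan_cost(10_000_000, 1.0)
print(f"${cost:.4f}")  # $0.3125
```

Under those assumptions, 10 million records of 1 KB each come out at about $0.31 per run, in line with the figure above.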

How can one get count, with a where condition in DynamoDB

Say that instead of getting the total count of a table, we want the count of records with a particular status.
We know DynamoDB is schemaless and still has to count each record one by one to get the total count.
Even so, how can we get this kind of conditional count using DynamoDB queries?
While normally "Query" or "Scan" requests return all the matching items, you can pass the Select=COUNT parameter and ask to retrieve only the number of matching items, instead of the actual items. But before you go doing that, there are a few things you should know:
DynamoDB will still be reading - and you will still be paying for - all the data, even if just for being counted. Doing a "Scan" with a filter is in almost all cases out of the question, because it will read the entire data set every time. With a "Query" you can ask to read just one partition, or a contiguous range of sort-keys in one partition, which in some cases may be reasonable enough (but please think if it is, in your use case).
Even if you're not actually reading the data, and just counting, DynamoDB still does Scan and Query with "paging", i.e., your read request will read just 1MB of data from disk, return you the partial count, and ask you to submit another request to resume the scan. Your DynamoDB library probably has a way to automate this resumption, so for example it can run thousands or whatever number of queries needed until finally finishing the scan and calculating the total sum.
In some cases, it may make sense to maintain a counter in addition to the data. Writes will be more expensive (e.g., each write adds data and increments the counter), but reads that need this counter will be hugely cheaper - so it all depends on how much of each your workload needs.
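The paged Select=COUNT query described above could be sketched like this. It assumes `client` is a boto3 low-level DynamoDB client and that "status" is the partition key of an index; the index, the function name, and the string type are all hypothetical.

```python
def count_by_status(client, table_name, index_name, status):
    """Sum the partial counts returned by a paged Query with Select=COUNT."""
    total = 0
    kwargs = {
        "TableName": table_name,
        "IndexName": index_name,  # hypothetical index keyed on status
        "Select": "COUNT",        # return only the number of matches, not the items
        "KeyConditionExpression": "#s = :s",
        # "status" is a DynamoDB reserved word, so it goes through a name placeholder.
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":s": {"S": status}},
    }
    while True:
        page = client.query(**kwargs)
        total += page["Count"]
        if "LastEvaluatedKey" not in page:
            return total
        # Each page covers at most 1 MB of data; resume from where it stopped.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

All the matching data is still read and billed; only the response is smaller.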

Give auto-suggest for drop-down on my DynamoDB hash key

My use-case is that I want to be able to provide the user an auto-suggest feature in drop-down box where user starts typing first few characters and he should be shown suggestions.
The problem is that the field I want the suggestions on is also the hash key for my DynamoDB table, and queries on a hash key have to specify its full value; they cannot match on a prefix.
Can anyone suggest a good DynamoDB pattern for this use-case?
10,000 entries with, say, 20 characters = 200K. This is totally feasible to keep in memory and would be very fast to access.
Compare this with performing a database query every time the user types a character in the drop-down box and you'll be making maybe 10 database calls as they type. Then, multiply by the number of concurrent users and you could conceivably be hitting hundreds of database accesses per second. The DynamoDB table would need to be provisioned with a high Read Capacity to support this.
It would be much more sensible to keep it in memory, or to use Amazon DynamoDB Accelerator (DAX), a fully managed in-memory cache for DynamoDB, or Amazon ElastiCache.
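For the in-memory option, a sorted list plus binary search is enough for prefix lookups at this size. This is just one possible sketch; the `PrefixIndex` class and its `limit` parameter are made up for the example, and the keys would be loaded from the table once (and refreshed as needed).

```python
import bisect


class PrefixIndex:
    """In-memory prefix lookup over a sorted list of hash key values."""

    def __init__(self, keys):
        self._keys = sorted(set(keys))

    def suggest(self, prefix, limit=10):
        # Binary search for the first key that is >= prefix.
        i = bisect.bisect_left(self._keys, prefix)
        matches = []
        # Keys sharing the prefix are contiguous in sorted order.
        while (i < len(self._keys) and len(matches) < limit
               and self._keys[i].startswith(prefix)):
            matches.append(self._keys[i])
            i += 1
        return matches


index = PrefixIndex(["apple", "apricot", "banana", "app"])
print(index.suggest("ap"))  # ['app', 'apple', 'apricot']
```

Each keystroke then costs one binary search in memory instead of a database call.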

DynamoDB table structure

We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
Most prominent search cases are,
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either / for our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable, widely used setup for such a use case?
If not kindly suggest alternatives that can be used.
Using UUID as partition key will efficiently distribute the data amongst internal partitions so you will have ability to utilize all of the provisioned capacity.
Using sortable (ISO format) timestamp as range/sort key will store the data in order so it will be possible to retrieve it in order.
However for retrieving logs by anything other than timestamp, you may have to create indexes (GSI) which are charged separately.
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
In general DynamoDB seems like a bad solution for storing logs:
It is more expensive than CloudWatch
It has poor querying capabilities, unless you start utilising global secondary indexes which will double or triple your expenses
Unless you use random UUID for hash key, you are risking creating hot partitions/keys in your db (For example, using component ID as a primary or global secondary key, might result in throttling if some component writes much more often than others)
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend:
Use JobId or Component name as hash key (one as primary, one as GSI)
Use timestamp as a sort key
If you need to search by log level often, then you can create a local secondary index (LSI) with the level as its sort key, or you can combine level and timestamp into a single sort key. If you only care about searching for ERROR level logs most of the time, then it might be better to create a sparse GSI for that.
Create a new table each day (let's call it the "hot table"), and only store that day's logs in that table. This table will have high write throughput. Once the day finishes, significantly reduce its write throughput (maybe to 0) and only leave some read capacity. This way you reduce the risk of running into the 10 GB limit per hash key that DynamoDB has (the item collection size limit, which applies to tables with local secondary indexes).
This approach also has an advantage in terms of log retention. It is very easy and cheap to remove logs older than X days this way. By keeping old table capacity very low you will also avoid very high costs. For more complicated ad-hoc analysis, use EMR.
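To illustrate the recommendations above, here is a sketch of how a log item and its daily table name might be built. Every attribute name ("component", "jobId", "levelTimestamp", "errorTimestamp") and the "app-logs" table prefix are hypothetical, and `ts` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def daily_table_name(ts, prefix="app-logs"):
    """Name of the per-day 'hot table' for a timestamp (UTC date)."""
    return f"{prefix}-{ts.astimezone(timezone.utc).strftime('%Y-%m-%d')}"


def build_log_item(component, job_id, level, ts, message):
    """Build a DynamoDB item (low-level attribute format) for one log line."""
    iso = ts.astimezone(timezone.utc).strftime(ISO_FORMAT)
    item = {
        "component": {"S": component},  # hash key of the table
        "timestamp": {"S": iso},        # sort key, sortable as a string
        "jobId": {"S": job_id},         # hash key of a GSI
        # Level and timestamp combined into one sortable key.
        "levelTimestamp": {"S": f"{level}#{iso}"},
        "message": {"S": message},
    }
    if level == "ERROR":
        # Only ERROR items carry this attribute, so a GSI keyed on it stays sparse.
        item["errorTimestamp"] = {"S": iso}
    return item
```

Dropping a whole day's table is then the retention mechanism, rather than deleting items one by one.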

Retrieving Random Single Items in Dynamo

We are trying to get our heads wrapped around a design question, which is not really easy in any DB. We have 100,000 random items (could be a lot more; we are talking a truly random key, we'll use UUIDs), and we want to hand them out one at a time. Order is not important. We are thinking that we'll create a Dynamo table of the items, then delete them out of that table as they are assigned. We can do a conditional delete to make sure that we have not already given the item away. But, when trying to find an item in the first place, if we do a scan or a query with a limit of 1, will it always hit the same first available record? I'm wondering what the ramifications are. Dynamo will shard on the UUID. We are worried about everyone trying to hit the same record all the time. The first one would of course get deleted, then they could all hit the second one, etc.
We could set up a memcache/Redis instance in ElastiCache, and keep a list of the available UUIDs in there. We can do a random select of items from this using Redis SPOP, which gets a random item and deletes it. We might have a problem where we could get out of sync between the two, but for the most part this would work.
Any thoughts on how to do this without the cache would be great. If dynamo does scans starting at different points, that would be dandy.
I have the same situation as you: a set of millions of UUIDs as keys in DynamoDB, and I need to randomly select some of them in an API call. For performance and ease of implementation, I used Redis as you suggested:
Add the UUIDs to a Set in Redis.
When the call comes, SPOP a UUID from the set.
With that UUID, delete the item in DynamoDB.
The performance of the Scan operation is bad; you should try to avoid it as much as you can.
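The three steps above, combined with the conditional delete from the question, could look roughly like this. It is a sketch under assumptions: `redis_client` is a redis-py client, `dynamo_client` is a boto3 low-level DynamoDB client, the table's key is a string attribute named "id", and the set name "available-uuids" is made up.

```python
def claim_random_item(redis_client, dynamo_client, table_name,
                      set_key="available-uuids", max_attempts=5):
    """Pop a random UUID from Redis and delete it from DynamoDB; return it or None."""
    for _ in range(max_attempts):
        # SPOP removes and returns a random member, or None if the set is empty.
        member = redis_client.spop(set_key)
        if member is None:
            return None
        item_id = member.decode() if isinstance(member, bytes) else member
        try:
            dynamo_client.delete_item(
                TableName=table_name,
                Key={"id": {"S": item_id}},  # assumption: key attribute is "id"
                # Fail if the item is already gone, i.e. the two stores drifted apart.
                ConditionExpression="attribute_exists(#k)",
                ExpressionAttributeNames={"#k": "id"},
            )
            return item_id
        except dynamo_client.exceptions.ConditionalCheckFailedException:
            # Stale entry in Redis; try another one.
            continue
    return None
```

The conditional delete keeps DynamoDB as the source of truth, so a stale entry in Redis costs one extra round trip instead of handing out the same item twice.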
