I have this rudimentary question about DynamoDB -- is it worth shortening attribute names and removing whitespace in order to save on throughput and storage?
I am planning to store millions of items that look like this:
{
"currency": "USD",
"openPrice": 0.1,
"closePrice": 0.1,
"highPrice": 0.1,
"lowPrice": 0.1
}
If I reformat this JSON fragment to look like this:
{"c":"USD","op":0.1,"cp":0.1,"hp":0.1,"lp":0.1}
would this shorter JSON result in cost savings because of lower storage and fewer throughput capacity units?
Thanks.
Yes, attribute names are factored into the total item size. This becomes important in one of two scenarios:
If your access patterns involve scans, or queries that are expected to retrieve many items: in these cases you are billed for the aggregated size (i.e. the sum of all accessed items), so small items use consumed capacity more efficiently than larger items.
If you have items with many attributes, such that the size of the attribute names plus the size of their values pushes the total item size past the 1KB boundary: this matters because write capacity is billed in 1KB increments.
I would use common sense in naming: prefer short and concise naming but try to avoid cryptic names (e.g. c is not a great name, but neither is customerDataItemIdentifier - a better choice might be custId)
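To put rough numbers on the two item shapes from the question, here is a minimal sketch of the size arithmetic. It assumes an item's size is the sum of its attribute name lengths and value sizes, that strings count as UTF-8 bytes, and that a number costs roughly one byte per two digits plus one byte. The function names are illustrative and the result is an estimate, not what DynamoDB will bill exactly.

```python
import math
from decimal import Decimal

def estimate_item_size(item):
    """Approximate item size in bytes: attribute names plus attribute values."""
    total = 0
    for name, value in item.items():
        total += len(name.encode("utf-8"))  # attribute names count toward item size
        if isinstance(value, str):
            total += len(value.encode("utf-8"))
        elif isinstance(value, (int, float, Decimal)):
            # Assumption: about 1 byte per two digits, plus 1 byte.
            digits = Decimal(str(value)).as_tuple().digits
            total += (len(digits) + 1) // 2 + 1
        else:
            raise TypeError(f"unsupported type for {name!r}")
    return total

def write_units(size_bytes):
    """Write capacity is billed in 1KB increments, with a minimum of one unit."""
    return max(1, math.ceil(size_bytes / 1024))

long_item = {"currency": "USD", "openPrice": 0.1, "closePrice": 0.1,
             "highPrice": 0.1, "lowPrice": 0.1}
short_item = {"c": "USD", "op": 0.1, "cp": 0.1, "hp": 0.1, "lp": 0.1}

long_size = estimate_item_size(long_item)
short_size = estimate_item_size(short_item)
print(long_size, short_size, write_units(long_size), write_units(short_size))
```

Under these assumptions both shapes fit well inside a single 1KB write unit, so for items this small the shorter names mainly affect storage and the aggregated size read by scans and large queries.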
I'm working on a large-scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed 2 billion individual items (probably less than 500 million).
The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.
The tokens themselves are UUIDs and, in the interest of efficiency, are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes). The downside is that the data doesn't visualise nicely in query consoles, making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity, since we implement the attributevalue.Marshaler interface to bind UUID (language) types to DynamoDB Binary attributes, and do the same for any composite attributes.
My question relates (mostly) to data size/saving. The tokens are the partition keys, and some mapping columns are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.
I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month for this. My question is really three parts:
Are the PK/SK index sizes 'reserved' (i.e. padded), so that it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible size? (I read somewhere that 100 bytes is typically reserved.)
If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once hashed PK has routed the query to the right server node/disk etc.)
Is there any observable query time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if Dynamo has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?
This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUIDs will distribute partitions but inevitably sometimes we will have some more active partitions than others.
Are there any tools that can parse bytes back into human-readable UUIDs (or that we can customise to inject behaviour to do this)?
This is a concern, because making things small and efficient is ok, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.
No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.
Sending less data certainly won't hurt your performance. Don't expect a noticeable improvement though. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes then you save yourself a Write Unit during the save.
For the "garbled" binary values I assume you're looking at the base64-encoded values. Base64 is a standard binary-to-text encoding which can be reversed by lots of tooling (now that you know the name of it).
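As an illustration, a few lines of Python are enough to turn a base64 string copied from a console back into a readable UUID. This is only a sketch: it assumes the 16 bytes were written in the standard big-endian UUID layout (which is what `uuid.UUID(bytes=...)` expects), and the helper names are made up for this example.

```python
import base64
import uuid

def binary_to_uuid(b64_value: str) -> uuid.UUID:
    """Decode a base64-encoded 16-byte Binary attribute into a UUID."""
    raw = base64.b64decode(b64_value)
    return uuid.UUID(bytes=raw)  # raises ValueError if raw is not 16 bytes

def split_composite(b64_value: str) -> list:
    """Decode a composite attribute made of concatenated 16-byte UUIDs."""
    raw = base64.b64decode(b64_value)
    if len(raw) % 16 != 0:
        raise ValueError("expected a multiple of 16 bytes")
    return [uuid.UUID(bytes=raw[i:i + 16]) for i in range(0, len(raw), 16)]

# Round trip with an arbitrary example token.
token = uuid.UUID("12345678-1234-5678-1234-567812345678")
encoded = base64.b64encode(token.bytes).decode("ascii")
print(encoded, "->", binary_to_uuid(encoded))
```

The same round trip shows the size difference from the question: `token.bytes` is 16 bytes long, while `str(token)` is 36 characters.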
I have a question...
If I have 1000 items with the same partition key in a table, and I query for this partition key with a limit of 10, does it consume read capacity units for 1000 items or for just 10 items?
Please clarify.
I couldn't find the exact statement in the DynamoDB documentation, but in my experience capacity is consumed only for the items up to the limit, which is 10 (not 1000).
You can quickly check this yourself. As the documentation says:
However, you can specify the ReturnConsumedCapacity parameter in a Query request to obtain this information.
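For example, the request could be assembled like this. It is a sketch only: the table name, key attribute name and key value are hypothetical, and the boto3 call is left in a comment because it needs AWS credentials and a real table.

```python
def build_query_request(table_name, pk_name, pk_value, limit):
    """Build Query parameters that also ask DynamoDB to report consumed capacity."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": pk_name},
        "ExpressionAttributeValues": {":pk": {"S": pk_value}},
        "Limit": limit,
        "ReturnConsumedCapacity": "TOTAL",
    }

# Hypothetical table and key names, for illustration only.
request = build_query_request("MyTable", "pk", "customer-42", limit=10)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("dynamodb").query(**request)
#   print(response["ConsumedCapacity"]["CapacityUnits"])
print(request)
```

Running the same query with and without the limit and comparing the reported capacity units answers the question empirically.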
The limit option limits the number of results returned. The capacity consumed depends on the size of the items and on how many of them are accessed to produce the results. I say accessed because, if you have filters in place, items that get filtered out still consume capacity, so more may be consumed than the returned items alone would account for.
The reason I mention this is that, for queries, each 4KB of data read consumes 1 read capacity unit (for a strongly consistent read; an eventually consistent read consumes half that).
Why is this important? Because if your items are small, then for each capacity unit consumed you could return multiple items.
For example, if each item is 200 bytes in size, you could be returning up to 20 items for each capacity unit.
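The arithmetic behind that example can be sketched as follows, assuming 1KB = 1024 bytes and that a query is charged on the total size of the items it reads, rounded up to the next 4KB. The function name is illustrative.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one read capacity unit covers up to 4KB

def query_read_units(total_bytes_read, strongly_consistent=True):
    """Read capacity units for a query that read total_bytes_read bytes."""
    units = max(1, math.ceil(total_bytes_read / READ_UNIT_BYTES))
    # Assumption: eventually consistent reads cost half as much.
    return units if strongly_consistent else units / 2

item_size = 200
items_per_unit = READ_UNIT_BYTES // item_size            # whole items per unit
limited_query = query_read_units(10 * item_size)         # query with limit 10
full_partition = query_read_units(1000 * item_size)      # all 1000 items
print(items_per_unit, limited_query, full_partition)
```

With 200-byte items, a query limited to 10 items reads 2000 bytes and fits in a single unit, whereas reading all 1000 items would need many more.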
According to the AWS documentation:
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off.
It seems to me that it means that it will not consume the capacity units for all the items with the same partition key. According to your example the consumed capacity units will be for your 10 items.
However since I did not test it I cannot be sure, but that is how I understand the documentation.
So the question came up about whether tombstones should be included when calculating the load factor of a hash table.
I thought that, given that the load factor is used to determine when to expand capacity, tombstones should not be included. An obvious example is if you almost fill and then remove every value in a hash table. Here insertions are super easy (no collisions) so I believe the load factor shouldn't include them.
But you could look at this and think that with all the tombstones lookups will be slow (potentially searching almost the entire space).
So I thought I'd ask the question. Should the load factor of a hashtable include tombstones in the calculation?
Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for a dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified, and the dynamic systems behave suboptimally. Its advantages:
Simplicity of understanding and implementation.
The hash table data structure needn't store several threshold numbers, likely only one. This is meaningful when the hash table is very small and the size of the header affects the total memory efficiency of the data structure (in bytes per stored entry).
In a certain (and common) case, the append/update-only hash table, more complex models of behaviour degenerate to the "just load factor" model; in other words, the load factor model defines relatively optimal behaviour.
See also my answer on load factor model. I prefer [min load, target load, max load] + growth factor frame model.
If you are developing a general-purpose hash table with tombstones, I think you can just pick up my results (below). I spent maybe several weeks solely developing this model. Maybe you can make some improvements or do further research; I would be glad.
Two main hash table dynamic behaviour patterns are targeted:
A growing hash table (maybe in a growing phase), with few or no removals, e.g. the initial fill of a hash table when the proper capacity was not specified (or is unknown).
A hash table that stays at the same or nearly the same size, where the number of removals is equal or nearly equal to the number of insertions, e.g. caches with an upper size bound, LRUs, tables with entry expiry.
Two thresholds are defined:
max size (i.e. number of alive entries), equal to table size * max load
min number of free slots (i.e. empty slots, holding neither an alive entry nor a tombstone), computed by a magic formula.
If the hash table size exceeds max size, we assume we are in the "growing pattern" and rehash to a table size able to store current size * growth factor entries, i.e. we choose the table size closest possible to current size * growth factor / target load.
If the number of free slots drops below the min number of free slots, we are in the "cache pattern" and rehash "to the current size", i.e. to the table size closest possible to current size / target load.
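The two rules can be sketched as a single decision function. This is not the code from the linked source: the "magic formula" for the minimum number of free slots is not given here, so it is passed in as a plain parameter, and "closest possible table size" is simplified to rounding up. All names are illustrative.

```python
import math
from typing import Optional

def next_capacity(capacity: int, alive: int, tombstones: int, *,
                  max_load: float, target_load: float,
                  growth_factor: float, min_free_slots: int) -> Optional[int]:
    """Return the capacity to rehash to, or None if no rehash is needed."""
    max_size = capacity * max_load
    free_slots = capacity - alive - tombstones  # neither alive entry nor tombstone

    if alive > max_size:
        # "Growing pattern": make room for alive * growth_factor entries.
        return math.ceil(alive * growth_factor / target_load)
    if free_slots < min_free_slots:
        # "Cache pattern": rehash to the current size to purge tombstones.
        return math.ceil(alive / target_load)
    return None

params = dict(max_load=0.75, target_load=0.5, growth_factor=2, min_free_slots=10)
growing = next_capacity(100, 80, 0, **params)    # too many alive entries
cache = next_capacity(100, 40, 55, **params)     # too few free slots
steady = next_capacity(100, 40, 10, **params)    # neither threshold crossed
print(growing, cache, steady)
```

Note that tombstones never count toward the size compared against max size; they only reduce the number of free slots, which is what triggers the purge in the second branch.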
Read the source, where all of the above logic is coded.
Also, the article "Tombstones purge from hashtable: theory and practice" sheds some light.
If you are developing a special-purpose hash table whose dynamic properties are known (or could be studied), I recommend developing your own model that fits your case. Don't rely on pure math and CS theory; evaluate your model in benchmarks.
Consider a table A with index A-index. I write around 100 items into A in batches (using PutRequest within BatchWriteItem).
If I repeat the operation with the same set of items, they will be just replacing the existing items. But how does that impact the local secondary index? Since it's a complete replace, does it replace in index also, thereby consuming throughput there too? Or does it figure out the items are exactly same and hence doesn't perform any operation, thereby resulting in no additional consumed throughput for index?
Found the answer by running a trial program and noticing the results in ConsumedCapacity attribute for table and indices.
During a replace, if there are no changes, no throughput is consumed on the index, as DynamoDB figures out the item is exactly the same. But if there are changes, throughput is calculated per item.
What constitutes an actual read in DynamoDB?
Is it reading every line in a table, or is it the data that is returned?
Is this why a scan is so expensive: you read the entire table and are charged for every table line that is read?
Can you put ElastiCache (Memcached) in front of DynamoDB to keep the cost down?
Finally, are you charged for a query that yields no results?
See this link: http://aws.amazon.com/dynamodb/faqs/
1 Write = 1 write per second for an item up to 1KB in size.
1 Read = 2 reads per second for an item up to 1KB in size, or 1 per second if you require fully consistent results.
For example, if your items are 512 bytes and you need to read 100
items per second from your table, then you need to provision 100 units
of Read Capacity.
If your items are larger than 1KB in size, then you should calculate
the number of units of Read Capacity and Write Capacity that you need.
For example, if your items are 1.5KB and you want to do 100
reads/second, then you would need to provision 100 (read per second) x
2 (1.5KB rounded up to the nearest whole number) = 200 units of Read
Capacity.
Note that the required number of units of Read Capacity is determined
by the number of items being read per second, not the number of API
calls. For example, if you need to read 500 items per second from your
table, and if your items are 1KB or less, then you need 500 units of
Read Capacity. It doesn’t matter if you do 500 individual GetItem
calls or 50 BatchGetItem calls that each return 10 items.
The above applies to all the usual methods, GET, BATCH X & QUERY.
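The provisioning arithmetic in the quoted FAQ text can be sketched as below. The 1KB unit size is taken from the quote and kept as a parameter, since the unit size depends on the DynamoDB documentation you are working from; the function name is illustrative.

```python
import math

def provisioned_read_units(items_per_second, item_size_bytes, unit_size_bytes=1024):
    """Units of Read Capacity = items per second x item size rounded up to whole units."""
    units_per_item = max(1, math.ceil(item_size_bytes / unit_size_bytes))
    return items_per_second * units_per_item

small = provisioned_read_units(100, 512)     # 512-byte items, 100 reads/second
large = provisioned_read_units(100, 1536)    # 1.5KB items, 100 reads/second
batched = provisioned_read_units(500, 1024)  # 500 items/second, however they are batched
print(small, large, batched)
```

The number of API calls does not appear anywhere in the formula, which matches the quote: 500 GetItem calls and 50 BatchGetItem calls of 10 items each need the same capacity.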
SCAN is a little different. They don't document exactly how they calculate the usage, but they do offer the following:
The Scan API will iterate through your entire dataset and apply the
filter conditions to every row. Since only 1MB of data can be scanned
at a time, you may need to do multiple round trips (using a
continuation token) to complete the scan. Further, using this API may
consume much of your provisioned read throughput. Hence, this method
has limited scaling characteristics and we do not recommend that you
use it as a part of your application’s regular behavior.
So to answer your question directly: the calculation is based on the data returned in all cases except SCAN, where there isn't really any clear indication of how they charge. A query that yields no results is not free, though: it still consumes the minimum read capacity.
You can definitely set up a caching system in front of Dynamo, and I definitely recommend you look into that if you want to keep your reads down.
Hope that helps!