I'm looking for feedback on the issues and benefits of storing many hundreds of millions of rows of memory address data.
Each address consists of 8 x UInt32 positions. I was initially going to store the data in 8 columns and put an 8-way multipart index on those columns.
CREATE TABLE address (position1 INTEGER, position2 INTEGER, position3 INTEGER, position4 INTEGER, position5 INTEGER, position6 INTEGER, position7 INTEGER, position8 INTEGER);
CREATE INDEX idx_address_positions ON address (position1, position2, position3, position4, position5, position6, position7, position8);
But then it dawned on me that I could also have a single BLOB column, just store the 8 positions as 32 bytes of packed data, and index that.
CREATE TABLE address (position BLOB);
CREATE INDEX idx_address_position ON address (position);
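For what it's worth, packing the positions into that BLOB is trivial; here is a minimal sketch in Go (my assumption: big-endian byte order, so that a bytewise comparison of the BLOB orders rows the same way the 8-column index would):

package main

import "encoding/binary"

// packPositions serialises the 8 UInt32 positions into a 32-byte key.
// Big-endian puts the most significant byte first, so SQLite's default
// bytewise BLOB comparison matches the numeric ordering of the columns.
func packPositions(p [8]uint32) []byte {
    buf := make([]byte, 32)
    for i, v := range p {
        binary.BigEndian.PutUint32(buf[i*4:], v)
    }
    return buf
}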
So I guess it becomes a question of any known efficiencies or detrimental effects of looking up via an 8-way composite index versus an index on a single, far larger binary value, both with high cardinality.
There will be duplicate, timestamped entries for the same positions, although they will be fairly infrequent.
If it helps, the table will be read from and written to constantly, with roughly a 3:1 read/write ratio.
Many thanks.
I'm working on a large-scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed 2 billion individual items (probably less than 500 million).
The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.
The tokens themselves are UUIDs and, to be efficient, are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes). The downside is that the data doesn't visualise nicely in query consoles, which makes support hard if we encounter any bugs and/or broken data. Note that there is no extra code complexity, since we implement the attributevalue.Marshaler interface to bind the language's UUID type to DynamoDB Binary attributes, and do the same for any composite attributes.
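For context, the binding is just a thin wrapper; roughly this (a sketch against the Go SDK v2 and github.com/google/uuid, where the Token type name is my own illustration):

package tokens

import (
    "fmt"

    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
    "github.com/google/uuid"
)

// Token is an illustrative wrapper; it persists as a 16-byte Binary attribute.
type Token uuid.UUID

// MarshalDynamoDBAttributeValue satisfies attributevalue.Marshaler.
func (t Token) MarshalDynamoDBAttributeValue() (types.AttributeValue, error) {
    b := make([]byte, 16)
    copy(b, t[:])
    return &types.AttributeValueMemberB{Value: b}, nil
}

// UnmarshalDynamoDBAttributeValue satisfies attributevalue.Unmarshaler.
func (t *Token) UnmarshalDynamoDBAttributeValue(av types.AttributeValue) error {
    bin, ok := av.(*types.AttributeValueMemberB)
    if !ok {
        return fmt.Errorf("expected Binary attribute, got %T", av)
    }
    u, err := uuid.FromBytes(bin.Value)
    if err != nil {
        return err
    }
    *t = Token(u)
    return nil
}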
My question relates (mostly) to data size/savings. The tokens are the partition keys, and some mapping items are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.
I want to keep really tight control over storage costs, knowing that over time we will be spending ~$0.25/GB per month for this. My question really has three parts:
Are the PK/SK index sizes 'reserved' (i.e. padded), so that it would make no difference at all to storage cost if we compressed the overall field sizes down to the minimum possible size? (I read somewhere that 100 bytes is typically reserved.)
If they ARE padded, the potential cost savings on the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once the hashed PK has routed the query to the right server node/disk etc.)
Is there any observable query-time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? I.e. if Dynamo has to read fewer pages it will work faster, but in practice are we talking microseconds at best?
This is a secondary concern, but it is worth considering if there is a lot of concurrent access to the data. UUIDs will spread load across partitions, but inevitably some partitions will be more active than others.
Are there any tools that can parse the bytes back into human-readable UUIDs (or that we can customise to inject behaviour that does this)?
This is a concern because making things small and efficient is fine, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, the DynamoDB IntelliJ plugin and AWS NoSQL Workbench all render the binary as unreadable characters.
No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.
Sending less data certainly won't hurt your performance. Don't expect a noticeable improvement though. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes then you save yourself a Write Unit during the save.
For the "garbled" binary values I assume you're looking at the base64 encoded values, which is a standard binary encoding standard which can be reversed by lots of tooling (now that you know the name of it).
I have a table with the following columns:
.create-merge table events
(
Time: datetime,
SiteId: int,
SiteCode: string,
...
)
Site ID and site code both uniquely identify a site, so in theory it should not matter which one I use unless I need a particular data type in the output. However, I see a noticeable difference in performance between these queries:
events | summarize count() by SiteCode
~300 ms on a 150M-row table
events | summarize count() by SiteId
~560 ms on the same 150M-row table
The difference is small in absolute terms, but the string query is almost twice as fast as the integer one (for consistent results, I issue the requests from a client in the same region). The site code is a string of 10-20 characters and intuitively seems to have a larger memory footprint than a 4-byte integer, so I would expect the string query to take longer, yet the opposite is true.
What could be the reason for that? Am I missing something fundamental about ADX internals?
Assuming that you are using EngineV3, you are seeing the impact of the dictionary-encoding optimization implemented in this engine: in certain cases string values are encoded as small, efficient int values, hence the better performance. As EngineV3 continues to improve, this optimization may be applied to int values as well.
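If the term is unfamiliar: dictionary encoding replaces each distinct string with a small integer code, so grouping and counting operate on cheap ints rather than full strings. A toy sketch of the idea in Go (not ADX's actual implementation):

package main

import "fmt"

// dictionaryEncode assigns each distinct string a small int code and
// returns the column as codes plus the dictionary of distinct values.
func dictionaryEncode(values []string) (codes []int, dict []string) {
    index := make(map[string]int)
    for _, v := range values {
        code, ok := index[v]
        if !ok {
            code = len(dict)
            index[v] = code
            dict = append(dict, v)
        }
        codes = append(codes, code)
    }
    return codes, dict
}

func main() {
    codes, dict := dictionaryEncode([]string{"SITE-A", "SITE-B", "SITE-A", "SITE-C"})
    fmt.Println(codes) // [0 1 0 2]
    fmt.Println(dict)  // [SITE-A SITE-B SITE-C]
}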
I am having a bit of a dilemma. I have a huge non-relational SQLite database with a table containing millions of entries, with relations between entities expressed via their "ID" (a long number). These entries form multiple hierarchies. I want to store the data for each hierarchy separately, and while playing around with temporary tables and indexes I started to wonder whether there is any difference between indexing a number as a string or as an integer.
In fewer words: does an index on "43789164238" stored as a string work faster than an index on the same number stored as an integer?
Integers are faster than strings. The reason is quite simple: an integer uses less space than a string, and comparing two integers is cheaper than comparing two strings byte by byte.
In SQLite an integer is stored in 1 to 8 bytes depending on its magnitude (43789164238 fits in 6 bytes).
A string is stored as its full text plus a small header, so "43789164238" takes 11 bytes for the characters alone.
hope this helps :)
My application requires a key-value store. Following are some details regarding the keys and values:
1) The number of keys (data type: string) can be either 256, 1024 or 4096.
2) The value for each key is a list of integers.
3) The list of integers (the value) for each key can vary in size.
4) The largest value can be around 10,000,000 integers.
5) Some keys might hold a very small list of integers.
The application needs fast access to the list of integers for a specified key. However, this step is not frequent in the application's workflow.
I need suggestions for the best key-value stores for my case: fast retrieval of values by key, where a value can be around 512 MB or more.
I checked Redis, but it requires the whole store to fit in memory. Given this scenario, I think I should look for disk-based key-value stores.
LevelDB can fit your use case very well, as you have a limited number of keys (given you have enough disk space for your requirements) and might not need a distributed solution.
One thing you need to specify is whether (and how) you wish to modify the lists once they are in the db, as LevelDB and many other general key-value stores do not offer atomic in-place list updates.
If you are looking for a distributed db, Cassandra is good, as it will also let you insert/remove individual list elements.
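To make that trade-off concrete, here is a rough sketch of the LevelDB approach from Go (using the third-party github.com/syndtr/goleveldb binding; the key name and the 4-bytes-per-integer encoding are my assumptions). The whole list is a single value, so changing it means reading, modifying and rewriting it:

package main

import (
    "encoding/binary"
    "log"

    "github.com/syndtr/goleveldb/leveldb"
)

// encodeList packs a list of integers into a flat byte slice, 4 bytes per value.
func encodeList(vals []uint32) []byte {
    buf := make([]byte, 4*len(vals))
    for i, v := range vals {
        binary.LittleEndian.PutUint32(buf[i*4:], v)
    }
    return buf
}

// decodeList is the inverse of encodeList.
func decodeList(buf []byte) []uint32 {
    vals := make([]uint32, len(buf)/4)
    for i := range vals {
        vals[i] = binary.LittleEndian.Uint32(buf[i*4:])
    }
    return vals
}

func main() {
    db, err := leveldb.OpenFile("kv.db", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Store and fetch one key's entire list in a single Put/Get.
    if err := db.Put([]byte("key-0001"), encodeList([]uint32{1, 2, 3}), nil); err != nil {
        log.Fatal(err)
    }
    raw, err := db.Get([]byte("key-0001"), nil)
    if err != nil {
        log.Fatal(err)
    }
    log.Println(decodeList(raw)) // [1 2 3]
}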
According to this FAQ, page fill factor can be adversely affected by not specifying a sorting function for binary data on little-endian systems. I understand that it will also result in cursors not returning data in the "correct" sorted order.
Other than excessive page usage, would this cause any other performance issues? For example, does a poor page fill factor adversely affect the speed of key lookups?
Furthermore, if I have data already stored in a BTREE without a sorting function, will anything break if I subsequently start using a sorting function to add new records? i.e. would a mismatch between the originally used sort order and a new sort function break key lookups?
Yes, incorrect endianness can reduce your fill factor, and as a result your database will be bigger and slower to access. Today I was inserting about 30 million records with a sequential integer key and noticed quite a poor btree fill factor (60%). I then changed the endianness of the key (using the htonl() function) and the fill factor jumped to 99%. At the same time the database size was reduced from 1.3 GB to 700 MB.
Endianness is important when your key is sequential or shows some locality (a common prefix for related data). For some keys, changing the endianness could worsen performance (I experienced this with mobile phone numbers).
BTW, you don't have to provide a sorting function - you can just convert the keys to the correct endianness when inserting and when searching by key.
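In other words, write the key bytes most-significant-first; the Go equivalent of the htonl() trick is a one-liner (a sketch assuming 32-bit integer keys):

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// keyBytes emits the key most-significant byte first (what htonl() produces on a
// little-endian machine), so the default bytewise key comparison sorts keys in
// numeric order and sequential inserts fill B-tree pages densely.
func keyBytes(id uint32) []byte {
    k := make([]byte, 4)
    binary.BigEndian.PutUint32(k, id)
    return k
}

func main() {
    // Sequential IDs now compare in the same order as their numeric values.
    fmt.Println(bytes.Compare(keyBytes(255), keyBytes(256))) // -1
}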