How many bits of integer data can be stored in a DynamoDB attribute of type Number? - amazon-dynamodb

DynamoDB's Number type supports 38 digits of decimal precision. This is not big enough to store a 128-bit integer which would require 39 digits. The max value is 340,282,366,920,938,463,463,374,607,431,768,211,455 for unsigned 128-bit ints or 170,141,183,460,469,231,731,687,303,715,884,105,727 for signed 128-bit ints. These are both 39-digit numbers.
If I can't store 128 bits, then how many bits of integer data can I store in a Number?

A DynamoDB attribute of type Number can store 126-bit signed integers (or 127-bit unsigned integers, with serious caveats).
According to Amazon's documentation:
Numbers can have up to 38 digits precision. Exceeding this results in an exception.
This means (verified by testing in the AWS console) that the largest positive and smallest negative integers that DynamoDB can store in a Number attribute are:
99,999,999,999,999,999,999,999,999,999,999,999,999 (i.e., 10^38 - 1)
-99,999,999,999,999,999,999,999,999,999,999,999,999 (i.e., -(10^38 - 1))
These limits sit just above 2^126, per this calculation:
bits = floor (ln(10^38 - 1) / ln(2))
     = floor (87.498 / 0.693)
     = floor (126.259)
     = 126
So any value whose magnitude fits in 126 bits (at most 2^126 - 1, roughly 8.5 x 10^37) is below the limit, and you can safely store a 126-bit signed int in a DynamoDB Number attribute.
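A quick way to sanity-check these bounds in Python (the helper names are mine; the 127-bit unsigned trick stores values with the top bit set as negative numbers, which is exactly what the caveats below warn about):

```python
LIMIT = 10**38 - 1  # largest magnitude a DynamoDB Number can hold

# A 126-bit two's-complement signed int spans [-2**125, 2**125 - 1];
# both endpoints are comfortably within the 38-digit limit.
assert 2**125 <= LIMIT

def to_signed_127(x: int) -> int:
    """Reinterpret a 127-bit unsigned int as a signed value for storage."""
    return x - 2**127 if x >= 2**126 else x

def to_unsigned_127(s: int) -> int:
    """Undo the reinterpretation after reading the value back."""
    return s + 2**127 if s < 0 else s

# Every representable 127-bit unsigned value round-trips and fits.
for x in (0, 1, 2**126 - 1, 2**126, 2**127 - 1):
    s = to_signed_127(x)
    assert abs(s) <= LIMIT
    assert to_unsigned_127(s) == x
```

Note that the stored values with the top bit set come back negative, which is why such numbers misbehave as sort keys unless the whole range is mapped this way.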
If you want to live dangerously, you can store a 127-bit unsigned int too, but there are some caveats:
You'd need to avoid (or at least be very careful) using such a number as a sort key, because values with a most-significant-bit of 1 will sort as negative numbers.
Your app will need to convert unsigned ints to signed ints when storing them or querying for them in DynamoDB, and will also need to convert them back to unsigned after reading data from DynamoDB.
If it were me, I wouldn't take these risks for one extra bit without a very, very good reason.
One logical question is whether 126 (or 127 given the caveats above) is good enough to store a UUID. The answer is: it depends. If you are in control of the UUID generation, then you can always shave a bit or two from the UUID and store it. If you shave from the 4 "version" bits (see format here) then you may not be losing any entropy at all if you are always generating UUIDs with the same version.
However, if someone else is generating those UUIDs AND is expecting lossless storage, then you may not be able to use a Number to store the UUID. But you may be able to store it if you restrict clients to a whitelist of 4-8 UUID versions. The largest version now is 5 out of a 0-15 range, and some of the older versions are discouraged for privacy reasons, so this limitation may be reasonable depending on your clients and whether they adhere to the version bits as defined in RFC 4122.
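If you do control UUID generation, the version-bit-shaving round trip might look like this sketch (assuming version-4 UUIDs; `uuid_to_number` and `number_to_uuid` are my names, and the bit positions follow the RFC 4122 layout, where the version nibble occupies bits 76-79 of the 128-bit value):

```python
import uuid

VER_SHIFT = 76  # the 4 version bits sit at bits 76-79 of the 128-bit value

def uuid_to_number(u: uuid.UUID) -> int:
    """Drop the 4 fixed version bits, leaving a 124-bit integer (< 10**38)."""
    low = u.int & ((1 << VER_SHIFT) - 1)   # bits 0-75
    high = u.int >> (VER_SHIFT + 4)        # bits 80-127
    return (high << VER_SHIFT) | low

def number_to_uuid(n: int, version: int) -> uuid.UUID:
    """Reinsert the known version nibble to reconstruct the original UUID."""
    low = n & ((1 << VER_SHIFT) - 1)
    high = n >> VER_SHIFT
    return uuid.UUID(int=(high << (VER_SHIFT + 4)) | (version << VER_SHIFT) | low)

u = uuid.uuid4()
n = uuid_to_number(u)
assert n < 10**38                    # fits in a DynamoDB Number
assert number_to_uuid(n, 4) == u     # lossless, because the version is known
```

This is lossless only because every stored UUID is generated with the same, known version; for externally supplied UUIDs of mixed versions you'd need to keep some of those bits, as discussed above.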
BTW, I was surprised that this bit-limit question wasn't already online... at least not in an easily-Google-able place. So contributing this Q&A pair so future searchers can find it.

Related

What is the max numeric value that can be stored on firebase?

I'm building a system that will eventually store extremely large number values (double or floating point types, not int) and I'm using firestore to store my data. Is there a cap for how big numbers can grow in a firestore database? I can't seem to find any documentation anywhere.
As per the documentation, Firestore only supports 64-bit signed integers, with a minimum value of -9,223,372,036,854,775,808 and a maximum value of 9,223,372,036,854,775,807 (inclusive).
Same for floating point: Firestore stores 64-bit double precision, IEEE 754.
I just tried adding a number over that limit, and when I requested the document back, the value had lost precision.
It'll be hard to get the precise value if you store it as a number. You may be interested in storing the number in string format if your numbers are that large. But do keep the 1 MiB (1,048,576 bytes) per-document limit in mind.
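If you do fall back to strings, one general trick (not Firestore-specific, and the width is an assumption you'd set for your own data) is to zero-pad non-negative integers to a fixed width so that lexicographic order matches numeric order, keeping ordering and range queries usable:

```python
WIDTH = 30  # assumed maximum digit count for this application

def encode(n: int) -> str:
    """Encode a non-negative integer as a fixed-width, sortable string."""
    assert 0 <= n < 10**WIDTH
    return str(n).zfill(WIDTH)

nums = [265628143862153200, 7, 10**25]
# Lexicographic order on the encodings equals numeric order on the values.
assert sorted(encode(n) for n in nums) == [encode(n) for n in sorted(nums)]
assert int(encode(7)) == 7  # round-trips losslessly
```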

storing as integer vs string size

I have checked the docs but got a bit confused. When storing a long integer such as 265628143862153200, would it be more efficient to store it as a string or as an integer?
What I need help with: is the calculation below correct?
Integer:
length of 265628143862153200 *8 ?
String:
length of 265628143862153200 +1 ?
The Firebase console is not meant to be part of administrative workflows. It's just for discovery and development. If you have production-grade procedures to follow, you should only write code for that using the provided SDKs. Typically, developers make their own admin sites to deal with data in Firestore.
Also, you should know that JavaScript numbers can't precisely represent integers beyond 2^53 - 1 (Number.MAX_SAFE_INTEGER), so they can't store data to the full size provided by Firestore. If you want to use the full size of a number field, you will have to use a system that supports Firestore's full 64-bit signed integer type.
If you must store numbers bigger than either limit, and be able to modify them by hand, consider instead storing multiple numbers, similar to the way Firestore's timestamp stores seconds and nanoseconds as separate numbers, so that the combined value can grow beyond signed 64 bits.
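On the original size question: per Firestore's storage-size documentation (as I understand it), an integer or double field value costs a flat 8 bytes, while a string costs its UTF-8 byte length plus 1. So the asker's comparison works out roughly as:

```python
n = 265628143862153200

integer_bytes = 8                 # any integer field value: 8 bytes, flat
string_bytes = len(str(n)) + 1    # UTF-8 bytes of the digits, plus 1

assert integer_bytes == 8
assert string_bytes == 19         # 18 digits + 1

# So for this value, the integer representation is smaller on disk,
# quite apart from the precision issues discussed above.
assert integer_bytes < string_bytes
```

(Field names also count toward document size, but identically in both cases.)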

Building a unique ID without collisions

I'm playing around with system design and have been reading up on url shortener. I realize there are many questions around this topic, but have some specific questions with respect to hashing and the order in which I hash + encode.
Input: https://example.com/owjpojwepofjwpoejfpwjepfojpwejfp/wefoijhwioejfiowef/weoifhwoiehjfiowef
Output: https://example.com/abr4fna
If I run this input through md5 I get the following 9e91e9c2a7ce0f0d11b475d2abfb8593. Clearly, this exceeds the length that I want, so I could truncate the substring from (0,7]. The problem is, to some degree, I can still have a collision since the prefix of the md5 is not guaranteed to be unique as the amount of urls generated increases within the service.
I do not want to have to check the database if I've already used this ID before as that would increase the amount of reads I'm doing proportional to the number of writes I'm doing. In addition, there could be concurrency issues as I grow the number of application servers doing the hash generation and storage.
I see people mentioning the use of base64 encoding the output hash, but what value does this add after the hash? Is it because I grow the amount of unique combinations by 64^n where n is the length of my hash versus md5 being only 36^n?
Thanks. Just interested in having this discussion.
edit:
As I understand it, we're doing the encoding purely to avoid transmission failures if the receiving system has issues interpreting binary data from the output hash - so it's used purely for the sake of display.
By definition, you cannot hash a large domain and expect to get a smaller domain without collisions. A hash is useful because it is one-way and would require a computationally infeasible amount of tries to find those collisions. However, with a 7 character output and a large input domain, it will be exceptionally easy to generate collisions even by chance.
You're currently using 7 hexadecimal digits. Each hexadecimal digit represents 4 bits, so you have 28 bits, or 2^28 possible values. That's around 256 million. So if you keep generating you'll get a collision soon enough. With base64 you'd have 6 bits per character instead (2^6 = 64, hence the name). That means you increase the bit size by 7 * 2 = 14 bits, giving around 16 thousand times as many values, but you'd still be pretty far from collision-free.
Actually, for any cryptographic reassurance once you take the birthday bound into account, the 16-byte output of MD5 is about the absolute minimum hash size you want for avoiding collisions. Of course, MD5 hasn't been deprecated for nothing; you'd really want to use SHA-256.
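The birthday approximation p ≈ 1 - e^(-n(n-1)/2d) makes the difference concrete (the function name is mine; 20,000 URLs is an arbitrary illustrative load):

```python
import math

def collision_probability(bits: int, n: int) -> float:
    """Approximate birthday bound: P(at least one collision) among n
    values drawn uniformly at random from a space of size 2**bits."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2.0 ** bits))

# 7 hex chars = 28 bits: 20,000 URLs already give about a coin-flip chance.
p_hex = collision_probability(28, 20_000)

# 7 base64 chars = 42 bits: ~16,000x more space, much better, but collisions
# still become likely around the 2**21 (~2 million) URL mark.
p_b64 = collision_probability(42, 20_000)

assert p_b64 < p_hex
```

This is why re-encoding the same truncated hash in base64 helps only if you keep the same character count (more bits per character), and why neither option removes the need for some collision-handling strategy.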

What is the Realm native integer size? Int vs Int8, Int16, Int32

TL;DR: I'm building a data set to share between iOS and Android. Should I tweak the integer sizes to match their actual ranges, or just make everything Integer and use Int in Swift and long in Java?
In a typical SQL database, storing a large number of 4-byte integers would take ~4x more space than a 1-byte integer[1]. However, I read in this answer that integers are stored bit-packed and in the Realm Java help that The integer types byte, short, int, and long are all mapped to the same type (long actually) within Realm. So, reading between the lines, it seems that the on disk storage will be the same regardless of what integer sub-type I use.
So, from a pure Realm / database perspective should I just use Int & long in Swift & Java respectively? (I.e. leaving aside language differences, like casting, in-memory size etc.)
If an integer field is indexed, does that make any difference to the type chosen?
PS: Many thanks to the Realm team for their great docs and good support here on SO!
[1] Yes, I know it's more complicated than that.
Your interpretation is correct: the same underlying storage type is used for all integer types in Realm, and that storage type adjusts the number of bits it uses per value based on the range of values stored. For example, if you only store values in the range 0-15 then each value will use fewer bits than if you store values in the range 0-65,535. Similarly, all indexes on integer properties use a common storage type.

Probability of collision with truncated SHA-256 hash

I have a database-driven web application where the primary keys of all data rows are obfuscated as follows: SHA256(content type + primary key + secret), truncated to the first 8 characters. The content type is a simple word, e.g. "post" or "message" and the secret is a 20-30 char ASCII constant. The result is stored in a separate indexed column for fast DB lookup.
How do I calculate the probability of a hash collision in this scenario? I am not a mathematician at all, but a friend claimed that due to the Birthday Paradox the collision probability would be ~1% for 10,000 rows with an 8-char truncation. Is there any truth to this claim?
Yes, there is a collision probability & it's probably somewhat too high. The exact probability depends on what "8 characters" means.
Does "8 characters" mean:
A) You store 8 hex characters of the hash? That would store 32 bits.
B) You store 8 characters of BASE-64? That would store 48 bits.
C) You store 8 bytes, encoded in some single-byte charset, or hacked in some broken way into a character encoding? That would store 56-64 bits, but if you don't do the encoding right you'll encounter character conversion problems.
D) You store 8 bytes, as bytes? That genuinely stores 64 bits of the hash.
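Plugging the asker's 10,000 rows into the birthday approximation p ≈ 1 - e^(-n(n-1)/2d) for each clean interpretation (a sketch; the function name is mine, and C is skipped since its bit count depends on the encoding):

```python
import math

def p_collision(bits: int, n: int = 10_000) -> float:
    """Approximate probability of at least one collision among n values
    in a space of 2**bits."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2.0 ** bits))

p32 = p_collision(32)   # A) 8 hex chars:   ~1.2% -- the friend's ~1% claim holds
p48 = p_collision(48)   # B) 8 base64 chars: on the order of 2 in 10 million
p64 = p_collision(64)   # D) 8 raw bytes:    negligible at this row count

assert p64 < p48 < p32
```

So the friend's Birthday Paradox estimate is essentially correct for interpretation A, and the probability grows roughly quadratically with the row count from there.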
Storing binary data as either A) hex or D) binary bytes, would be my preferred options. But I'd definitely recommend either reconsidering your "key obfuscation" scheme or significantly expanding the stored key-size to reduce the (currently excessive) probability of key collision.
From Wikipedia:
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).
In the most conservative understanding of your design above (reading it as A, 8 chars of hex == 32 bits), your scheme would be expected to suffer collisions once it stores on the scale of ~65,000 (2^16) rows. I would consider such an outcome unacceptable for all serious, or even toy, systems.
Transaction tables may have volumes, allowing growth for the business, from 1,000 - 100,000 transactions/day (or more). Systems should be designed to function for 100 years (36,500 days), with a 10x growth factor built in, so:
For your keying mechanism to be genuinely robust & professionally useful, you would need to be able to scale it up to potentially handle ~36 billion (2^35) rows without collision. That would imply 70+ bits of hash.
The source-control system Git, for example, stores 160 bits of SHA-1 hash (40 chars of hex == 20 bytes or 160 bits). Collisions would not be expected to be probable with less than 2^80 different file revisions stored.
A possibly better design might be, rather than hashing & pseudo-randomizing the key entirely & hoping (against hope) to avoid collisions, to prepend/ append/ fold in 8-10 bits of a hash into the key.
This would generate a larger key, containing all the uniqueness of the original key plus 8-10 bits of verification. Attempts to access keys would then be verified, and more than 3 invalid requests would be treated as an attempt to violate security by "probing" the keyspace & would trigger semi-permanent lockout.
The only major costs here, would be a modest reduction in the size of the available keyspace for a given int-size. 32-bit int to/from the browser would have 8-10 bits dedicated to security, thus leaving 22-24 for the actual key. So you'd use 64-bit ints where that was not sufficient.
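One way to sketch that fold-in (my own illustration; SECRET, CHECK_BITS, and the helper names are all hypothetical choices, not part of any established scheme):

```python
import hashlib
import hmac

SECRET = b"server-side secret"   # hypothetical; keep out of source control
CHECK_BITS = 8                   # 8-10 bits of verification, per the text

def tag_key(key: int) -> int:
    """Append CHECK_BITS of a keyed hash to an integer primary key."""
    mac = hmac.new(SECRET, key.to_bytes(8, "big"), hashlib.sha256).digest()
    check = mac[0] & ((1 << CHECK_BITS) - 1)
    return (key << CHECK_BITS) | check

def verify_key(tagged: int) -> int:
    """Recover the key, raising on a bad check value. Count these
    failures per client and lock out apparent keyspace probers."""
    key, check = tagged >> CHECK_BITS, tagged & ((1 << CHECK_BITS) - 1)
    mac = hmac.new(SECRET, key.to_bytes(8, "big"), hashlib.sha256).digest()
    if mac[0] & ((1 << CHECK_BITS) - 1) != check:
        raise ValueError("invalid key")
    return key

assert verify_key(tag_key(12345)) == 12345
```

A wrong guess has only a 1-in-2^CHECK_BITS chance of passing, so combined with a lockout after a few failures, probing the keyspace becomes impractical while the key stays sequential underneath.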