Why do DynamoDB imported size and actual table size differ so much?

I have imported a DynamoDB table from S3. Here are the dataset sizes at each step:
Compressed dataset in S3 (DynamoDB JSON format with GZIP compression) = 13.3GB
Imported table size according to the DynamoDB imports page (uncompressed) = 64.4GB
Imported item count = 376 736 126
Current table size according to the DynamoDB tables page (compressed?) = 41.5GB (less than at the import time!)
Item count = 380 528 674 (I have performed some insertions already)
Since the import time, the table was only growing.
What's the reason for the much smaller estimate of the actual table size? Is it because DynamoDB table sizes are approximations in general, or does DynamoDB apply any compression to the stored data?
The source S3 dataset should not have any duplicates: it is built using Athena by running a GROUP BY query on the DynamoDB table's key. So, I do not expect it to be a cause.
Each item has 4 attributes: the primary key is 2 long strings (blockchain addresses: 40 hex chars plus ≈2–6 extra characters), plus 1 long string (a uint256 balance as a hex string, ≤ 64 characters) and 1 numeric value. The table import format is DynamoDB JSON.

DynamoDB performs no compression.
The likely cause is how you are calculating the size of the table and the number of items. If you are relying on the table's metadata, it is only updated every 6 hours and is an approximate value, which should not be relied upon for comparisons or validity checks.
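The same approximate figures are what the DescribeTable API returns, so they can be checked outside the console as well. A minimal boto3 sketch (the table name is a placeholder):
import boto3

# TableSizeBytes and ItemCount are the same approximate values the console shows;
# they are refreshed roughly every six hours.
client = boto3.client("dynamodb")
table = client.describe_table(TableName="my-table")["Table"]  # placeholder table name
print("Approximate size (bytes):", table["TableSizeBytes"])
print("Approximate item count:", table["ItemCount"])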

The reason is the overhead of the DynamoDB JSON format. I am the author of the question, so an exact example should provide more clarity. Here is a random item I have:
{"Item":{"w":{"S":"65b88d0d0a1223eb96bccae06317a3155bc7e391"},"sk":{"S":"43ac8b882b7e06dde7c269f4f8aaadd5801bd974_"},"b":{"S":"6124fee993bc0000"},"n":{"N":"12661588"}}}
When importing from S3, the DynamoDB import functionality bills by the total uncompressed size read, which for this item is 169 bytes (168 characters + a newline).
However, once stored in DynamoDB, the item occupies only the capacity of its attributes (see the DynamoDB docs):
The size of a string is (length of attribute name) + (number of UTF-8-encoded bytes).
The size of a number is approximately (length of attribute name) + (1 byte per two significant digits) + (1 byte).
For this specific item, DynamoDB's native size estimate is:
w (string) = 1 + 40 chars
sk (string) = 2 + 41 chars
b (string) = 1 + 16 chars
n (number) = 1 + (8 significant digits / 2 = 4) + 1
The total is 107 bytes. DynamoDB's current estimate for this table is 108.95 bytes per item on average, which is pretty close (some field values vary in length; this particular example is nearly the shortest possible).
This works out to roughly a 35% size reduction (1 - 108.95 / 169 ≈ 0.355) when the data is actually stored in DynamoDB compared to the imported size, which matches the figures reported in the question: 64.4GB * 108.95 / 169 ≈ 41.5GB.
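For anyone who wants to reproduce the arithmetic, here is a rough sketch of the two size calculations (the helper below is my approximation of the documented sizing rules, not an official API):

# The example item without its DynamoDB JSON type wrappers.
item = {
    "w": "65b88d0d0a1223eb96bccae06317a3155bc7e391",
    "sk": "43ac8b882b7e06dde7c269f4f8aaadd5801bd974_",
    "b": "6124fee993bc0000",
    "n": 12661588,
}

def native_size(item):
    """Approximate stored size: attribute-name bytes plus value bytes."""
    total = 0
    for name, value in item.items():
        total += len(name.encode("utf-8"))
        if isinstance(value, str):
            total += len(value.encode("utf-8"))
        else:  # number: roughly 1 byte per two significant digits, plus 1 byte
            digits = len(str(abs(value)))
            total += (digits + 1) // 2 + 1
    return total

# The same item as one line of DynamoDB JSON, as billed by the S3 import.
ddb_json_line = (
    '{"Item":{"w":{"S":"65b88d0d0a1223eb96bccae06317a3155bc7e391"},'
    '"sk":{"S":"43ac8b882b7e06dde7c269f4f8aaadd5801bd974_"},'
    '"b":{"S":"6124fee993bc0000"},"n":{"N":"12661588"}}}'
)

print(native_size(item))       # 107 bytes stored in the table
print(len(ddb_json_line) + 1)  # 169 bytes billed by the import (including the newline)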

Related

How to unnest an integer array represented as a BLOB?

SQLite doesn't have native support for arrays. I would think that my method (outlined below) of making a custom BLOB encoding would be a fairly common workaround (yes, I need an array rather than normalizing the table). The benefit of representing an integer array as a BLOB is primarily the space savings, for example:
13,24,455,23,64789
Stored as TEXT, this takes up 18 bytes (commas included, making assumptions here). But if one were to store the above TEXT in a custom encoded BLOB format it would look like this:
0x000D001801C70017FD15
Where every number is assumed to take up 2 Bytes (0 to 65,535 or 0000 to FFFF). This BLOB, to my understanding, would then be 10 Bytes, which is nearly half the size of storing it in a delimited TEXT format. The space savings would also be magnified by the number of rows and the number of integers in the array.
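For concreteness, a quick sketch (in Python, purely for illustration) of packing the example array into that 2-bytes-per-value encoding and reading it back:

import struct

values = [13, 24, 455, 23, 64789]

# Pack each value as a big-endian unsigned 16-bit integer (0..65535).
blob = struct.pack(f">{len(values)}H", *values)
print(blob.hex().upper())   # 000D001801C70017FD15 -> 10 bytes

# Unpack: one 2-byte slice per row, mirroring the desired unnested output.
decoded = [struct.unpack(">H", blob[i:i + 2])[0] for i in range(0, len(blob), 2)]
print(decoded)              # [13, 24, 455, 23, 64789]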
Is there a good way of unnesting a BLOB by width? Say that I want to unnest the BLOB so that each row represents an integer. Can I take the above BLOB and turn it into this?
id | number
1  | 000D
2  | 0018
3  | 01C7
4  | 0017
5  | FD15
SQLite's HEX() function interprets its argument as a BLOB and returns a string which is the upper-case hexadecimal rendering of the content of that blob. After you get the blob as a string, use SUBSTR() to slice it into 4-character parts:
WITH cte(id) AS (VALUES (1), (2), (3), (4), (5))
SELECT HEX(t.col) hex_col,
       c.id,
       SUBSTR(HEX(t.col), (c.id - 1) * 4 + 1, 4) number
FROM tablename t CROSS JOIN cte c
ORDER BY t.rowid, c.id;
Or, with a recursive CTE to dynamically generate the list of values 1..n:
WITH cte(id) AS (
  SELECT 1
  UNION ALL
  SELECT id + 1
  FROM cte
  WHERE id < 5
)
SELECT HEX(t.col) hex_col,
       c.id,
       SUBSTR(HEX(t.col), (c.id - 1) * 4 + 1, 4) number
FROM tablename t CROSS JOIN cte c
ORDER BY t.rowid, c.id;
Change col to the name of your blob column.

How do I manually calculate the Firestore automatic index size?

I'm relatively new to Firebase and I'm looking to calculate how much storage I incur from automatically created single-field indexes in Firestore native mode. I understand that the rule of thumb is to go through and exempt fields, and there are guidelines on this. However, I am more interested in how much overhead is actually incurred if I don't exempt any fields.
I am looking for someone to verify my math. I used the rules outlined here: https://firebase.google.com/docs/firestore/storage-size. This is what I have:
Document size is the sum of:
The document name size = 48 bytes
The sum of the string size of each field name = average of 9 bytes * 97 attributes = 873 bytes
The sum of the size of each field value = average of 70 bytes * 97 attributes = 6,790 bytes
32 additional bytes
Document size = 48 + 873 + 6,790 + 32 = 7,743 bytes
Collection size = (doc size) * (# of docs) = 7,743 * 130 ≈ 1MB
Single-field index with collection scope is the sum of:
The document name size of the indexed document = 48 bytes
The document name size of the indexed document's parent document = 0
The string size of the indexed field name = 9 bytes
The size of the indexed field value = 70 bytes
32 additional bytes
Single-field index size on the average attribute = (48+9+70+32) = 159 bytes
Single-field index with 130 documents = 159 * 130 = 20,670 bytes
Automatically created ascending and descending indexes = 20,670 * 2 = 41,340 bytes
Single-field indexes for all 97 attributes = 41,340 * 97 ≈ 4MB
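Putting the same assumptions into a quick script (all the averages below are the estimates from the question, not values reported by Firestore):

# Assumed averages from the question.
DOC_NAME = 48       # bytes for the document name
FIELD_NAME = 9      # average field-name size in bytes
FIELD_VALUE = 70    # average field-value size in bytes
OVERHEAD = 32       # fixed additional bytes per document / index entry
FIELDS = 97
DOCS = 130

doc_size = DOC_NAME + FIELDS * (FIELD_NAME + FIELD_VALUE) + OVERHEAD
collection_size = doc_size * DOCS

# One single-field index entry per document per field, ascending + descending.
index_entry = DOC_NAME + FIELD_NAME + FIELD_VALUE + OVERHEAD
index_total = index_entry * DOCS * 2 * FIELDS

print(doc_size)         # 7743 bytes per document
print(collection_size)  # 1006590 bytes, about 1 MB
print(index_total)      # 4009980 bytes, about 4 MB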

SQLite: How to reduce byte size of integer values?

I have a SQLite table (without row ID, but that's probably irrelevant, and without any indexes) where my rows contain the following data:
2 real values, one of which is the primary key
3 integers < 100
1 more field for integers, but currently always null
According to http://www.sqlite.org/datatype3.html, integer values can take 1, 2, 3, 4, 6 or 8 bytes according to their magnitude. Therefore I'd expect each row in my table to take up about 20 bytes. In reality, sqlite3_analyzer gives me for the table
Average payload per entry......................... 25.65
Maximum payload per entry......................... 26
which is somewhere in between the minimum value of 20 and the maximum of 32 (if all integers were stored with 4 bytes). Is it possible to give the DB a "hint" to use even smaller integer types wherever possible? Or how else can the discrepancy be explained? (I don't think it's indexes because there are none for this table.)
Similarly, on a previous table I had 2 real values + 2 small integers and each entry occupied slightly more than 24 bytes (which is also more than I would have expected).
Also, there is no way to store floats in single precision with SQLite, right?
The actual record format has one integer for the header size, one integer for each column to describe the value's type, and all the data of the column values.
In this case, we have:
bytes
    1   header size
    6   column types
   16   two real values
    3   three small integers between 2 and 127
    0   NULL
   --
   26
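A quick sketch of that accounting (the per-column byte counts follow the SQLite record format described above; each serial type fits in one byte here):

# Data bytes per column: REAL -> 8, integer in 2..127 -> 1, NULL -> 0.
columns = [8, 8, 1, 1, 1, 0]       # 2 reals, 3 small integers, 1 NULL

header = 1 + len(columns)          # header-size byte + one serial-type byte per column
payload = header + sum(columns)
print(payload)                     # 26, matching sqlite3_analyzer's maximum payload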

How to compute the size of the allocated memory for a general type

I need to work with some databases read with read.table from CSV (comma-separated values) files, and I wish to know how to compute the size of the allocated memory for each type of variable.
How can I do it?
Edit: in other words, how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful: it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
In summary:
For an integer vector such as 1:n, the size in bytes is typically 40 + 8 * ceiling(n / 2). However, on my version of R and OS there is a single slight discontinuity, where it jumps to 168 bytes sooner than you would expect (see plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))

How to efficiently convert a few bytes into an integer between a range?

I'm writing something that reads bytes (just a List<int>) from a remote random number generation source that is extremely slow. For that and my personal requirements, I want to retrieve as few bytes from the source as possible.
Now I am trying to implement a method whose signature looks like:
int getRandomInteger(int min, int max)
I have two theories how I can fetch bytes from my random source, and convert them to an integer.
Approach #1 is naïve. Fetch (max - min) / 256 bytes and add them up. It works, but it's going to fetch a lot of bytes from the slow random number generator source I have. For example, if I want to get a random integer between zero and a million, it's going to fetch almost 4000 bytes... that's unacceptable.
Approach #2 sounds ideal to me, but I'm unable to come up with the algorithm. It goes like this:
Let's take min: 0, max: 1000 as an example.
Calculate ceil(rangeSize / 256) which in this case is ceil(1000 / 256) = 4. Now fetch one (1) byte from the source.
Scale this one byte from the 0-255 range to 0-3 range (or 1-4) and let it determine which group we use. E.g. if the byte was 250, we would choose the 4th group (which represents the last 250 numbers, 750-1000 in our range).
Now fetch another byte and scale from 0-255 to 0-250 and let that determine the position within the group we have. So if this second byte is e.g. 120, then our final integer is 750 + 120 = 870.
In that scenario we only needed to fetch 2 bytes in total. However, it gets much more complex when the range is, say, 0-1000000, because we then need several "groups".
How do I implement something like this? I'm okay with Java/C#/JavaScript code or pseudo code.
I'd also like to keep the result from losing entropy/randomness, so I'm slightly worried about scaling integers.
Unfortunately your Approach #1 is broken. For example, if min is 0 and max is 510, you'd add 2 bytes. There is only one way to get a 0 result: both bytes zero. The chance of this is (1/256)^2. However there are many ways to get other values, say 100 = 100+0, 99+1, 98+2... So the chance of a 100 is much larger: 101 * (1/256)^2.
The more-or-less standard way to do what you want is to:
Let R = max - min + 1 -- the number of possible random output values
Let N = 2^k >= mR, m>=1 -- a power of 2 at least as big as some multiple of R that you choose.
loop
    b = a random integer in 0..N-1 formed from k random bits
while b >= mR    -- reject b values that would bias the output
return min + floor(b/m)
This is called the method of rejection. It throws away randomly selected binary numbers that would bias the output. If max-min+1 happens to be a power of 2, then you'll have zero rejections.
If you have m=1 and max-min+1 is just one more than a biggish power of 2, then rejections will be near half. In this case you'd definitely want a bigger m.
In general, bigger m values lead to fewer rejections, but of course they require slightly more bits per number. There is a probabilistically optimal algorithm to pick m.
Some of the other solutions presented here have problems, but I'm sorry right now I don't have time to comment. Maybe in a couple of days if there is interest.
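A minimal sketch of that rejection loop (Python here, with m = 1; random_bytes is a hypothetical stand-in for the slow source):

import math
import os

def random_bytes(n):
    """Hypothetical stand-in for the slow remote source: return n random bytes."""
    return os.urandom(n)

def get_random_integer(minimum, maximum):
    """Uniform integer in [minimum, maximum] by rejection, with m = 1."""
    r = maximum - minimum + 1                # number of possible output values
    k = max(1, math.ceil(math.log2(r)))      # bits needed; N = 2**k >= r
    nbytes = (k + 7) // 8                    # whole bytes to fetch per attempt
    while True:
        raw = int.from_bytes(random_bytes(nbytes), "big")
        b = raw >> (nbytes * 8 - k)          # keep only k random bits
        if b < r:                            # reject values that would bias the result
            return minimum + b

print(get_random_integer(0, 1_000_000))      # fetches 3 bytes per attempt for this range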
3 bytes (together) give you a random integer in the range 0..16777215. You can use 20 bits from this value to get the range 0..1048575 and throw away values > 1000000.
range 1 to r
find the smallest 'a' such that 256^a >= r
get 'a' bytes from the source into array A[]
num = 0
for i = 0 to len(A)-1
    num += A[i] * 256^i
next
random number = num mod range
Your random source gives you 8 random bits per call. For an integer in the range [min,max] you would need ceil(log2(max-min+1)) bits.
Assume that you can get random bytes from the source using some function:
bool RandomBuf(BYTE* pBuf , size_t nLen); // fill buffer with nLen random bytes
Now you can use the following function to generate a random value in a given range:
// --------------------------------------------------------------------------
// produce a uniformly-distributed integral value in range [nMin, nMax]
// T is char/BYTE/short/WORD/int/UINT/LONGLONG/ULONGLONG
template <class T> T RandU(T nMin, T nMax)
{
    static_assert(std::numeric_limits<T>::is_integer, "RandU: integral type expected");
    if (nMin > nMax)
        std::swap(nMin, nMax);
    if (0 == (T)(nMax - nMin + 1))    // all range of type T
    {
        T nR;
        return RandomBuf((BYTE*)&nR, sizeof(T)) ? *(T*)&nR : nMin;
    }
    ULONGLONG nRange     = (ULONGLONG)nMax - (ULONGLONG)nMin + 1;      // number of discrete values
    UINT      nRangeBits = (UINT)ceil(log((double)nRange) / log(2.));  // bits for storing nRange discrete values
    ULONGLONG nR;
    do
    {
        if (!RandomBuf((BYTE*)&nR, sizeof(nR)))
            return nMin;
        nR = nR >> ((sizeof(nR) << 3) - nRangeBits);   // keep nRangeBits random bits
    }
    while (nR >= nRange);                              // ensure value in range [0..nRange-1]
    return nMin + (T)nR;                               // [nMin..nMax]
}
Since you are always getting a multiple of 8 bits, you can save the extra bits between calls (for example you may need only 9 bits out of 16). It requires some bit manipulation, and it is up to you to decide if it is worth the effort.
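A small sketch of such a carry-over buffer (again in Python, with a hypothetical fetch callback standing in for the slow source):

import os

class BitBuffer:
    """Keeps unused random bits between calls so whole bytes are fetched only when needed."""

    def __init__(self, fetch=lambda n: os.urandom(n)):  # fetch(n) stands in for the slow source
        self.fetch = fetch
        self.bits = 0    # cached bits, packed into an integer
        self.nbits = 0   # how many cached bits are valid

    def take(self, k):
        """Return k random bits as an integer, topping up the cache only when it runs dry."""
        while self.nbits < k:
            self.bits = (self.bits << 8) | self.fetch(1)[0]
            self.nbits += 8
        self.nbits -= k
        out = self.bits >> self.nbits
        self.bits &= (1 << self.nbits) - 1   # keep the leftover bits for the next call
        return out

buf = BitBuffer()
print(buf.take(9), buf.take(9))   # two 9-bit values cost 3 fetched bytes instead of 4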
You can save even more if you use 'half bits': let's assume that you want to generate numbers in the range [1..5]. You need log2(5) ≈ 2.32 bits for each random value, so from 32 random bits you can actually generate floor(32/2.32) = 13 random values in this range, though it requires some additional effort.
