Huffman code length - hashtable

I want to build a huffman tree and assign a code to all the 255 byte values based on their frequencies. But For my application I need a hash table to get the code for a byte in constant time. But in worst case the tree may be so unbalanced that certain bytes have a very large key (even 254 bit long) . So maintaining a hash table is being very difficult. The code requires high performance and so stroing them as a string won't work. How can I resolve the issue?

Why would you need a hash table for 256 values? Simply have a 256-entry table where you directly index the code for each byte.
Each code is at most 32 bytes long, so just have a table of 256 entries, each with a fixed number of 33 bytes per entry. 8448 bytes. The first byte of the 33 being the length of the code in bits, and the remaining bytes being the code, of which you only use the requisite number of bits for each.

Related

What are the sizes for numbers in Firestore?

I checked the storage size but I'm confused when it comes to storing numbers.
In case of Bytes, what does "Byte length" mean? If I store -128 what's the length? And in case of 12?
In case of Floating-point number and Integer, it doesn't matter if I store 325 or 9.9999999999999 it will always be 8 bytes?
In case of Array? Let' say we have ["ab", "bcd"], what's the size, (2+3=5) or (2+1)+(3+1)=7
If you store an array of bytes, the size will simply be the length of that array. An array with a single byte value of -128 is still just one byte.
Yes, all numbers occupy the same 8-byte size, even if you don't see a fractional part.
The documentation says it's the sum of the array element sizes, so I would expect 7, the sum of the two individual string sizes, each encoded in UTF-8 + 1

Finding Aes256 keys

I have a question about aes keys.
I have a binary file which contains an aes256 key (32 bytes) at an unknown offset.
Would it be somehow possible to find this key in the file? Is it somehow possible to tell whether the next 32 bytes would be a valid aes key?
Thanks in advance
EDIT:
Thanks for all of your answers,
The key is stored in the file as normal bytes.
I finally managed to create a way to get it.
I basically filter out all strings, which actually made it work.
Thanks again
Well, yes and no. AES-256 keys should consist of just 32 bytes that are indistinguishable from random. Most files do not consist of just random bytes, so it could be possible t find a sequence that is most likely random, and this could be that key you are looking for. However, it might very well be that there are other random sequences in the file, or sequences that look like random but aren't random at all (such as the binary representation of the number Pi).
It may also be that you are unlucky and that the AES key doesn't look all that random. Or that the key is stored in hexadecimals (text) rather than binary byte values. Then there is the issue of finding the exact offset that might be the problem (is that initial byte with value 0x20 indicating the size of the AES key, a space character or part of the key value)?
Most files have a specific format, so you should first have a look at that. Just looking for random sequences may give you both false positives (rather likely) or false negatives (less likely). If you expect 64 bytes of randomness (two keys) then I suggest you search for that first, as it brings down the chance of false positives by a rather large amount.
No - unless you have a way to verify the key against a known plaintext/ciphertext pair - an AES key is not distinguishable from random noise. Any set of 16, 24 or 32 bytes is a valid AES key.

Length of AES encrypted data

I have a data that needs to be stored in a database as encrypted, the maximum length of the data before encryption is 50 chars (English or Arabic), I need to encrypt the data using AES-128 bit, and store the output in the database (base64string).
How to know the length of the data after encryption?
Try it with your specified algorithm, block size, IV size, and see what size output you get :-)
First it depends on the encoding of the input text. Is it UTF8? UTF16?
Lets assume UTF8 so 1 Byte per character means 50 Bytes of input data to your encryption algorithm. (100 Bytes if UTF16)
Then you will pad to the Block Size for the algorithm. AES, regardless of key size is a block of 16 Bytes. So we will be padded out to 64 Bytes (Or 112 for UTF 16)
Then we need to store the IV and header information. So that is (usually, with default settings/IV sizes) another 16Bytes so we are at 80 Bytes (Or 128 for UTF16)
Finally we are encoding to Base64. I assume you want string length, since otherwise it is wasteful to make it into a string. So Base 64 bloats the string using the following formula: Ceil(bytes/3) * 4. So for us that is Ceil(80/3) = 27 * 4 = 108 characters (Or 172 for UTF 16)
Again this is all highly dependent on your choices of how you encrypt, what the text is encoded as, etc.
I would try it with your scenario before relying on these numbers for anything useful.

How can a file be compressed to less than its entropy?

I created a file that contains 100,000 numbers that were drawn uniformly (with probability 1/8) from the set {1,2,3,4,5,6,7,8}.
When a look at the size of this file on my hard-disk it is 293 KB (kilo-byte) which makes sense because one needs 3 bits to "identify" a number between 1 and 8 and 3*100,000 = 300 KB.
Next I compress the file using Win-zip and find that the file is reduced to only 57 KB ! How can this be since I expect that the random-number generator I used for my draws is - for all practical purposes - ideal. This means that the sequence should be truly random and the size of the file should therefore be given by its entropy ( which is 300 KB)?
I am afraid you are confused about certain concepts.
3 bits times 100,000 gives you 300,000 bits, and there are 8 bits to the byte, which corresponds to roughly 37.5 KB. That's a far cry from 300 KB.
(And in any case, if you were to create "a file that contains 100,000 numbers", there is no magic fairy sitting on your hard disk, who will figure out the min & max range of your numbers, and store them in the file using the smallest number of bits necessary to represent them all.)
So, it is very important to get it out of the way that 300 KB has absolutely nothing to do with the entropy of 100,000 single-digit numbers.
You told us absolutely nothing about how you created that file, so its file format is a mystery, but we can make some simple calculations and guesses: 293 KB times 1024 is 300,000, so what you have is a 300,000 byte file. Which means that you are writing 3 bytes per number. Which means that you have written these numbers as text, in a text file, either each digit followed by a comma, then followed by a space, or each digit followed by a carriage return and a linefeed, or something similar.
Text file formats are extremely wasteful in terms of storage space.
So, yes, this is a highly compressible file consisting mostly of identical bytes, and even the bytes that are not identical (the digits) all map to just 3 bits each, so it is no wonder that the entire file gets compressed so well.
No laws of nature were harmed during the making of this question.

randomblob(N) in sqlite

The documentation (http://www.sqlite.org/lang_corefunc.html) says that it generates a N-byte blob, and it goes on to give an example of using randomblob(16) with hex() for generating unique ids.
But isn't a randomblob(8) is more than enough for most databases. 8 bytes gives 64 bits, which would give 2^64 different possible values (which will be converted into hex format by hex(randomblob(8)). Why waste the extra 8 bytes here?
GUIDs are defined as having 128 bits.

Resources