One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).
The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.
I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.
Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.
Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker, which is described here and more detail in this thesis.
This paper discussing order-preserving dictionary-based compression also looks interesting.
There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be referred to as "encoding" but a "compression" - and lexicographically-comparable compression is a different department.
If you are solving a real-world (non-purely-academic) problem, I'd just stick with the most standard UTF8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.
To fully answer that question you need to know the frequency of the codepoints in the material.
UTF-8 is optimal for texts in English as multi-byte characters are very rare in typical English text.
To encode integers using UTF-8 as a base algorithm would entail mapping the first n integers to a 1-byte encoding, the next m to a 2-byte encoding and so on.
Whether that is an optimal encoding depends on the distribution. If the first n numbers are very frequent compared to higher numbers, then UTF-8 would be (near) optimal.
Related
I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:
They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
It should be easy to create unique pseudonyms.
They should be human readable. That means they should be easy for humans to compare, copy, and understand when read out aloud.
My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).
An variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.
Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.
Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
So my question is: Is there any widely-used standard for human-readable unique ids?
Not aware of any widely-used standard for this. Here’s a non-widely-used one:
Proquints
https://arxiv.org/html/0901.4016
https://github.com/dsw/proquint
A UUID4 (128 bit) would be converted into 8 proquints. If that’s too much, you can take the last 64 bits of the UUID4 (= just take 64 random bits). This doesn’t make it magically lose uniqueness; only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide if it’s still OK for your purposes.
Here you go UUID Readable
Generate Easy to Remember, Readable UUIDs, that are Shakespearean and Grammatically Correct Sentences
This article suggests to use the first few characters from a SHA-256 hash, similarly to what git does. UUIDs are typically based on SHA-1, so this is not all that different. The tradeoff between property (2) and (3) is in the number of characters.
With d being the number of digits, you get 2 ** (4 * d) identifiers in total, but the first collision is expected to happen after 2 ** (2 * d).
The big question is really not about the kind of identifier you use, it is how you handle collisions.
I have a question about working on very big numbers. I'm trying to run RSA algorithm and lets's pretend i have 512 bit number d and 1024 bit number n. decrypted_word = crypted_word^d mod n, isn't it? But those d and n are very large numbers! Non of standard variable types can handle my 512 bit numbers. Everywhere is written, that rsa needs 512 bit prime number at last, but how actually can i perform any mathematical operations on such a number?
And one more think. I can't use extra libraries. I generate my prime numbers with java, using BigInteger, but on my system, i have only basic variable types and STRING256 is the biggest.
Suppose your maximal integer size is 64 bit. Strings are not that useful for doing math in most languages, so disregard string types. Now choose an integer of half that size, i.e. 32 bit. An array of these can be interpreted as digits of a number in base 232. With these, you can do long addition and multiplication, just like you are used to with base 10 and pen and paper. In each elementary step, you combine two 32-bit quantities, to produce both a 32-bit result and possibly some carry. If you do the elementary operation in 64-bit arithmetic, you'll have both of these as part of a single 64-bit variable, which you'll then have to split into the 32-bit result digit (via bit mask or simple truncating cast) and the remaining carry (via bit shift).
Division is harder. But if the divisor is known, then you may get away with doing a division by constant using multiplication instead. Consider an example: division by 7. The inverse of 7 is 1/7=0.142857…. So you can multiply by that to obtain the same result. Obviously we don't want to do any floating point math here. But you can also simply multiply by 14286 then omit the last six digits of the result. This will be exactly the right result if your dividend is small enough. How small? Well, you compute x/7 as x*14286/100000, so the error will be x*(14286/100000 - 1/7)=x/350000 so you are on the safe side as long as x<350000. As long as the modulus in your RSA setup is known, i.e. as long as the key pair remains the same, you can use this approach to do integer division, and can also use that to compute the remainder. Remember to use base 232 instead of base 10, though, and check how many digits you need for the inverse constant.
There is an alternative you might want to consider, to do modulo reduction more easily, perhaps even if n is variable. Instead of expressing your remainders as numbers 0 through n-1, you could also use 21024-n through 21024-1. So if your initial number is smaller than 21024-n, you add n to convert to this new encoding. The benefit of this is that you can do the reduction step without performing any division at all. 21024 is equivalent to 21024-n in this setup, so an elementary modulo reduction would start by splitting some number into its lower 1024 bits and its higher rest. The higher rest will be right-shifted by 1024 bits (which is just a change in your array indexing), then multiplied by 21024-n and finally added to the lower part. You'll have to do this until you can be sure that the result has no more than 1024 bits. How often that is depends on n, so for fixed n you can precompute that (and for large n I'd expect it to be two reduction steps after addition but hree steps after multiplication, but please double-check that) whereas for variable n you'll have to check at runtime. At the very end, you can go back to the usual representation: if the result is not smaller than n, subtract n. All of this should work as described if n>2512. If not, i.e. if the top bit of your modulus is zero, then you might have to make further adjustments. Haven't thought this through, since I only used this approach for fixed moduli close to a power of two so far.
Now for that exponentiation. I very much suggest you do the binary approach for that. When computing xd, you start with x, x2=x*x, x4=x2*x2, x8=…, i.e. you compute all power-of-two exponents. You also maintain some intermediate result, which you initialize to one. In every step, if the corresponding bit is set in the exponent d, then you multiply the corresponding power into that intermediate result. So let's say you have d=11. Then you'd compute 1*x1*x2*x8 because d=11=1+2+8=10112. That way, you'll need only about 1024 multiplications max if your exponent has 512 bits. Half of them for the powers-of-two exponentiation, the other to combine the right powers of two. Every single multiplication in all of this should be immediately followed by a modulo reduction, to keep memory requirements low.
Note that the speed of the above exponentiation process will, in this simple form, depend on how many bits in d are actually set. So this might open up a side channel attack which might give an attacker access to information about d. But if you are worried about side channel attacks, then you really should have an expert develop your implementation, because I guess there might be more of those that I didn't think about.
You may write some macros you may execute under Microsoft for functions like +, -, x, /, modulo, x power y which work generally for any integer of less than ten or hundred thousand digits (the practical --not theoretical-- limit being the internal memory of your CPU). Please note the logic is exactly the same as the one you got at elementary school.
E.g.: p= 1819181918953471 divider of (2^8091) - 1, q = ((2^8091) - 1)/p, mod(2^8043 ; q ) = 23322504995859448929764248735216052746508873363163717902048355336760940697615990871589728765508813434665732804031928045448582775940475126837880519641309018668592622533434745187004918392715442874493425444385093718605461240482371261514886704075186619878194235490396202667733422641436251739877125473437191453772352527250063213916768204844936898278633350886662141141963562157184401647467451404036455043333801666890925659608198009284637923691723589801130623143981948238440635691182121543342187092677259674911744400973454032209502359935457437167937310250876002326101738107930637025183950650821770087660200075266862075383130669519130999029920527656234911392421991471757068187747362854148720728923205534341236146499449910896530359729077300366804846439225483086901484209333236595803263313219725469715699546041162923522784170350104589716544529751439438021914727772620391262534105599688603950923321008883179433474898034318285889129115556541479670761040388075352934137326883287245821888999474421001155721566547813970496809555996313854631137490774297564881901877687628176106771918206945434350873509679638109887831932279470631097604018939855788990542627072626049281784152807097659485238838560958316888238137237548590528450890328780080286844038796325101488977988549639523988002825055286469740227842388538751870971691617543141658142313059934326924867846151749777575279310394296562191530602817014549464614253886843832645946866466362950484629554258855714401785472987727841040805816224413657036499959117701249028435191327757276644272944743479296268749828927565559951441945143269656866355210310482235520220580213533425016298993903615753714343456014577479225435915031225863551911605117029393085632947373872635330181718820669836830147312948966028682960518225213960218867207825417830016281036121959384707391718333892849665248512802926601676251199711698978725399048954325887410317060400620412797240129787158839164969382498537742579233544463501470239575760940937130926062252501116458281610468726777710383038372260777522143500312913040987942762244940009811450966646527814576364565964518092955053720983465333258335601691477534154940549197873199633313223848155047098569827560014018412679602636286195283270106917742919383395056306107175539370483171915774381614222806960872813575048014729965930007408532959309197608469115633821869206793759322044599554551057140046156235152048507130125695763956991351137040435703946195318000567664233417843805257728.
The last step took about 0.1 sec.
wpjo (willibrord oomen on academia.edu)
This is something I have been thinking while reading programming books and in computer science class at school where we learned how to convert decimal values into hexadecimal.
Can someone please tell me what are the advantages of using hexadecimal values and why we use them in programmnig?
Thank you.
In many cases (like e.g. bit masks) you need to use binary, but binary is hard to read because of its length. Since hexadecimal values can be much easier translated to/from binary than decimals, you could look at hex values as kind of shorthand notation for binary values.
It certainly depends on what you're doing.
It comes as an extension of base 2, which you probably are familiar with as essential to computing.
Check this out for a good discussion of
several applications...
https://softwareengineering.stackexchange.com/questions/170440/why-use-other-number-bases-when-programming/
The hexadecimal digit corresponds 1:1 to a given pattern of 4 bits. With experience, you can map them from memory. E.g. 0x8 = 1000, 0xF = 1111, correspondingly, 0x8F = 10001111.
This is a convenient shorthand where the bit patterns do matter, e.g. in bit maps or when working with i/o ports. To visualize the bit pattern for 169d is in comparison more difficult.
A byte consists of 8 binary digits and is the smallest piece of data that computers normally work with. All other variables a computer works with are constructed from bytes. For example; a single character can be stored in a single byte, and a 32bit integer consists of 4 bytes.
As bytes are so fundamental we want a way to write down their value as neatly and efficiently as possible. One option would be to use binary, but then we would need a lot of digits. This takes up a lot of space and can be confusing when many numbers are written in sequence:
200 201 202 == 11001000 11001001 11001010
Using hexadecimal notation, we can write every byte using just two digits:
200 == C8
Also, as 16 is a power of 2, it is easy to convert between hexadecimal and binary representations in your head. This is useful as sometimes we are only interested in a single bit within the byte. As a simple example, if the first digit of a hexadecimal representation is 0 we know that the first four binary digits are 0.
I came across an interesting math problem that would require me to do some artithmetic with numbers that have more than 281 digits. I know that its impossible to represent a number this large with a system where there is one memory unit for each digit but wondered if there were any ways around this.
My initial thought was to use a extremely large base instead of base 10 (decimal). After some thought I believe (but can't verify) that the optimal base would be the square root of the number of digits (so for a number with 281 digits you'd use base 240ish) which is a improvement but that doesn't scale well and still isn't really practical.
So what options do I have? I know of many arbitrary precision libraries, but are there any that scale to support this sort of arithmetic?
Thanks o7
EDIT: after thinking some more i realize i may be completely wrong about the "optimal base would be the square root of the number of digits" but a) that's why im asking and b) im too tired to remember my initial reasoning for assumption.
EDIT 2: 1000,000 in base ten = F4240 in base 16 = 364110 in base 8. In base 16 you need 20 bits to store the number in base 8 you need 21 so it would seem that by increasing the base you decrees the total number of bits needed. (again this could be wrong)
This is really a compression problem pretending to be an arithmetic problem. What you can do with such a large number depends entirely on its Kolmogorov complexity. If you're required to do computations on such a large number, it's obviously not going be arrive as 2^81 decimal digits; the Kolmogorov complexity would too high in that case and you can't even finish reading the input before the sun goes out. The best way to deal with such a number is via delayed evaluation and symbolic rational types that a language like Scheme provides. This way a program may be able to answer some questions about the result of computations on the number without actually having to write out all those digits to memory.
I think you should just use scientific notation. You will lose precision, but you can not store numbers that large without losing precision, because storing 2^81 digits will require more than 10^24 bits(about thousand billion terabytes), which is much more that you can have nowadays.
that have more than 2^81 digits
Non-fractional number with 2^81 bits, will take 3*10^11 terabytes of data. Per number.
That's assuming you want every single digit and data isn't compressible.
You could attempt to compress the data storing it in some kind of sparse array that allocates memory only for non-zero elements, but that doesn't guarantee that data will be fit anywhere.
Such precision is useless and impossible to handle on modern hardware. 2^81 bits will take insane amount of time to simply walk through number (9584 trillion years, assuming 1 byte takes 1 millisecond), never mind multiplication/division. I also can't think of any problem that would require precision like that.
Your only option is to reduce precision to first N significant digits and use floating point numbers. Since data won't fit into double, you'll have to use bignum library with floating point support, that provides extremely large floating point numbers. Since you can represent 2^81 (exponent) in bits, you can store beginning of a number using very big floating point.
1000,000 in base ten
Regardless of your base, positive number will take at least floor(log2(number))+1 bits to store it. If base is not 2, then it will take more than floor(log2(number))+1 bits to store it. Numeric base won't reduce number of required bits.
if i have a hash say like this: 0d47aeda9d97686ab3da96bae2c93d078a5ab253
how do i do the math to find out the number of possibilities to try if i start with 0000000000000000000000000000000000000000 to 9999999999999999999999999999999999999999 which is the general length of a sha1.
The number of possibilities would be 2^(X) where X is the number of bits in the hash.
In the normal hexadecimal string representation of the hash value like the one you gave, each character is 4 bits, so it would be 2^(4*len) where len is the string length of the hash value. In your example, you have a 40 character SHA1 digest, which corresponds to 160 bits, or 2^160 == 1.4615016373309029182036848327163e+48 values.
An SHA-1 hash is 160 bits, so there are 2^160 possible hashes.
Your hexadecimal digit range is 0 through f.
Then it's simply 16^40 or however many characters it contains
Recall that a hash function accepts inputs of arbitrary length. A good cryptographic hash function will seem to assign a "random" hash result to any input. So if the digest is N bits long (for SHA-1, N=160), then every input will be hashed to one of 2^N possible results, in a manner we'll treat as random.
That means that the expectation for finding a preimage for your hash result is running though 2^N inputs. They don't have to be specifically the range that you suggested - any 2^N distinct inputs are fine.
This also means that 2^N inputs don't guarantee that you'll find a preimage - each try is random, so you might miss your 1-in-2^N chance in every single one of those 2^N inputs (just like flipping a coin twice doesn't guarantee you'll get heads at least once). But you can figure out how many inputs are required to find a preimage for the hash with probability p or greater - with p being as close to one as you desire (just not actually 1).
maximum variations, with repeating and with attention to the order are defined as n^k. in your case this would mean 10^40, which can't be correct for SHA1. Reading Wikipedia it sais SHA1 has a max. complexity for a collision based attack of 2^80, using different technices researches were allready successfull with 2^51 collisions, so 10^40 seems a bit much.