Dictionary Training for Different Languages

I am working on a messaging system and got the idea of storing the messages in each inbox independently. While working on that idea I asked myself: why not compress the messages? So I am looking for a good way to obtain dictionaries in different languages.
Since the messages are closely related to everyday conversation (social chatter), I need a good source and method for that.
I need some text for it, like a few million emails, books, etc. I would like to create a Huffman tree out of it, with the ability to inline and represent each message as a string within this Huffman tree, so decoding would be fast enough.
The languages I want to support are all over the place. Since newspapers and the like might not be sufficient, I need other sources.
[Update]
As I continue my research, I noticed that I actually create two dictionaries out of Wikipedia. The first is a dictionary containing typical characters with certain probabilities. I also noticed that the special characters I use for each language seem to have an even distribution among Latin-based languages (well, actually Latin is just one member of the language family), and even Russian tends to have the same distribution despite its quite different alphabet.
I also noticed that in around 15 to 18% of cases a special character (like ' ', ',', ';') follows another special character. Beyond that, the 8 most frequent words yield 10% of all words, the next 16 words yield 9%, and so on; by around 128 (160 words in total) you reach a yield of 80% of all words. So storing the next 256 and more words becomes pointless for the analysis. This leaves me with three dictionaries (characters, words, special characters) per language of around 2 to 5 KB (I use a special format with prefix compression) that save me between 30% and 60% in character count. When you also remember that in Java each character takes 16 bits, the overall reduction is even greater, reaching 1:5 to 1:10 once the characters that still have to be inserted are also compressed with the Huffman tree.
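To make the character step concrete, here is a rough Java sketch of building a Huffman code table from counted character frequencies; the class and method names are illustrative only, not the actual implementation:

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only: derive a Huffman code table from character frequencies counted over a corpus.
class CharHuffman {
    static final class Node {
        final long freq; final Character ch; final Node left, right;
        Node(long freq, Character ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
    }

    static Map<Character, String> buildCodes(Map<Character, Long> freqs) {
        PriorityQueue<Node> heap = new PriorityQueue<>((a, b) -> Long.compare(a.freq, b.freq));
        freqs.forEach((c, f) -> heap.add(new Node(f, c, null, null)));
        while (heap.size() > 1) {                      // repeatedly merge the two rarest subtrees
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node(a.freq + b.freq, null, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        assign(heap.poll(), "", codes);                // walk the tree; left edge = 0, right edge = 1
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n == null) return;
        if (n.ch != null) { codes.put(n.ch, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }
}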
Using such a system and compressing numbers as variable-length integers, one produces a byte array that can be used for string matching, loads faster, and makes checking whether a word is contained faster than character-by-character comparison, since one can check for complete words directly without needing to tokenize or recognize words in the first place.
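For the variable-length integers, a minimal Java sketch of the usual 7-bits-per-byte scheme (my assumption of the encoding, along the lines of protobuf-style varints):

// Illustrative only: write a non-negative int as a variable-length integer,
// 7 data bits per byte with the high bit marking "another byte follows".
static int writeVarInt(int value, byte[] out, int pos) {
    while ((value & ~0x7F) != 0) {
        out[pos++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
        value >>>= 7;
    }
    out[pos++] = (byte) value;                       // final byte, continuation flag clear
    return pos;                                      // next free index in the buffer
}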
It also solved the problem of supporting string keys, since I can just transform the string in each language, and that results in a set of keys I can use for lookups.

Related

Is it possible to write an Enigma encryption algorithm that can use all alphanumeric characters as input but does not output ambiguous characters?

This is about Enigma encryption; I'm guessing the number of rotors doesn't matter, but I'm using 3.
I am working with what's basically a coded version of the old mechanical Enigma-style encryption machines. The concept is rather old, but before I get too far into learning it, I was wondering whether it is possible to encrypt using all characters 0-9, a-z, and A-Z while the encrypted text itself contains only a subset of those characters. I'm trying to remove a subset of characters (around 10 total) from the encrypted output, while still being able to get back to those characters if they were part of the input.
You can disambiguate by adding a 1-to-2-character mapping for the ambiguous symbols: O -> A1; 0 -> A2; and so on for the other ambiguous symbols; A -> AA. This is basically just like escaping in strings: we usually can’t put a newline inside a string literal, so we represent it as \n, and \ is represented as \\.
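A small Java sketch of that escaping idea, assuming the mapping above (only O, 0, and the escape character A itself are shown; the table would be extended for the rest of the ambiguous symbols):

import java.util.Map;

// Illustrative only: map each ambiguous symbol to a two-character sequence introduced by 'A'.
class Disambiguate {
    private static final Map<Character, String> ESCAPES = Map.of(
            'O', "A1",
            '0', "A2",
            'A', "AA");   // the escape character must escape itself

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) out.append(ESCAPES.getOrDefault(c, String.valueOf(c)));
        return out.toString();
    }

    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != 'A') { out.append(c); continue; }
            char next = s.charAt(++i);                 // character after the escape marker
            out.append(next == '1' ? 'O' : next == '2' ? '0' : 'A');
        }
        return out.toString();
    }
}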
If you’re working with encrypted data (so the probabilities of all characters are uniformly distributed and characters cannot be predicted) then you can’t compress the ciphertext. If you can compress it, then you’ve noticed some kind of pattern in the text and partially broken the encryption.
If you want to reduce the ciphertext’s alphabet, then you must increase the length of the ciphertext, otherwise you’ve successfully compressed it.

Human readable alternative for UUIDs

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:
1. They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
2. It should be easy to create unique pseudonyms.
3. They should be human-readable, meaning they should be easy for humans to compare, copy, and understand when read aloud.
My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).
A variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.
Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.
Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
So my question is: Is there any widely-used standard for human-readable unique ids?
Not aware of any widely-used standard for this. Here’s a non-widely-used one:
Proquints
https://arxiv.org/html/0901.4016
https://github.com/dsw/proquint
A UUID4 (128 bits) would be converted into 8 proquints. If that’s too much, you can take the last 64 bits of the UUID4 (= just take 64 random bits). This doesn’t make it magically lose uniqueness; it only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide whether it’s still OK for your purposes.
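For reference, a rough Java sketch of the proquint mapping as I read the spec (16 consonants carrying 4 bits each, 4 vowels carrying 2 bits each, five letters per 16-bit word); treat it as an illustration rather than a reference implementation:

// Illustrative only: each 16-bit word becomes a five-letter consonant/vowel pattern,
// so a 64-bit value becomes four proquints.
class Proquint {
    private static final char[] CONS = "bdfghjklmnprstvz".toCharArray(); // 16 consonants -> 4 bits each
    private static final char[] VOWS = "aiou".toCharArray();             // 4 vowels -> 2 bits each

    static String encode16(int word) {                  // word treated as an unsigned 16-bit value
        return "" + CONS[(word >>> 12) & 0xF]
                  + VOWS[(word >>> 10) & 0x3]
                  + CONS[(word >>> 6)  & 0xF]
                  + VOWS[(word >>> 4)  & 0x3]
                  + CONS[ word         & 0xF];
    }

    static String encode64(long value) {
        StringBuilder out = new StringBuilder();
        for (int shift = 48; shift >= 0; shift -= 16) { // most significant 16-bit word first
            if (out.length() > 0) out.append('-');
            out.append(encode16((int) ((value >>> shift) & 0xFFFF)));
        }
        return out.toString();
    }
}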
Here you go: UUID Readable
Generate easy-to-remember, readable UUIDs that are Shakespearean and grammatically correct sentences
This article suggests using the first few characters from a SHA-256 hash, similarly to what git does. Name-based UUIDs (version 5) are based on SHA-1, so this is not all that different. The trade-off between properties (2) and (3) is in the number of characters.
With d being the number of hex digits, you get 2 ** (4 * d) identifiers in total, but the first collision is expected to happen after about 2 ** (2 * d) of them (the usual birthday bound). For example, with d = 8 that is about 4.3 billion possible identifiers, with a first collision expected after roughly 65,000.
The big question is really not about the kind of identifier you use, it is how you handle collisions.

What is the name for encoding/encrypting with noise padding?

I want code to render n bits with n + x bits, non-sequentially. I'd Google it but my Google-fu isn't working because I don't know the term for it.
For example, the input value in the first column (2 bits) might be encoded as any of the output values in the comma-delimited second column (4 bits) below:
0 1,2,7,9
1 3,8,12,13
2 0,4,6,11
3 5,10,14,15
My goal is to take a list of integer IDs, and transform them in a way they can still be used for persistent URLs, but that can't be iterated/enumerated sequentially, and where a client cannot determine programmatically if a URL in a search result set has been visited previously without visiting it again.
I would term this process "encoding". You'll see something similar done to permit the use of communications channels that have special symbols that are not permitted in data. Examples: uuencoding and base64 encoding.
That said, you still need to ensure (and at first blush appear to have ensured) that there is only one correct decode, and accept the increase in the size of the output (in the case above, the output will be double the size of the input, bit for bit).
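As an illustrative Java sketch of the table in the question (the class name and the random choice among codewords are assumptions; any selection rule works as long as the reverse mapping stays unambiguous):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative only: 2-bit inputs map to several 4-bit codewords; encoding picks one at random,
// decoding inverts the (unambiguous) mapping.
class NoisyCode {
    private static final int[][] CODES = {
            {1, 2, 7, 9}, {3, 8, 12, 13}, {0, 4, 6, 11}, {5, 10, 14, 15}};
    private static final Map<Integer, Integer> DECODE = new HashMap<>();
    static {
        for (int in = 0; in < CODES.length; in++)
            for (int code : CODES[in]) DECODE.put(code, in); // every codeword decodes to exactly one input
    }
    private static final Random RNG = new Random();

    static int encode(int twoBits)  { return CODES[twoBits][RNG.nextInt(4)]; }
    static int decode(int fourBits) { return DECODE.get(fourBits); }
}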
I think you'd be better off encrypting the number with a cheap cipher and a constant secret key stored on your server(s), adding a random character or four at the end plus a cheap checksum, and simply rejecting anything that doesn't have a valid checksum.
<encrypt(secret)>
<integer>+<random nonsense>
</encrypt>
+
<checksum()>
<integer>+<random nonsense>
</checksum>
Then decrypt the first part (remember, cheap == fast), validate it using the checksum, throw away the random nonsense, and use the integer you stored.
There are probably some cryptographic no-no's here, but let's face it, the cost of this algorithm being broken is a touch on the low side.
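To make that concrete, here is a rough Java sketch of the scheme, with AES standing in for the "cheap" cipher and CRC32 as the checksum; the all-zero key, the field sizes, and ECB mode are placeholder choices for illustration, not recommendations:

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.zip.CRC32;

// Illustrative only: pack id + random padding + CRC32, then encrypt with a server-side key.
class OpaqueId {
    private static final SecretKeySpec KEY =
            new SecretKeySpec(new byte[16], "AES");       // placeholder; load a real secret in practice

    static String encode(long id) throws Exception {
        byte[] nonce = new byte[4];
        new SecureRandom().nextBytes(nonce);              // the "random nonsense"
        byte[] body = ByteBuffer.allocate(12).putLong(id).put(nonce).array();
        CRC32 crc = new CRC32();
        crc.update(body);                                 // cheap checksum over id + nonce
        byte[] plain = ByteBuffer.allocate(16).put(body).putInt((int) crc.getValue()).array();
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, KEY);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(c.doFinal(plain));
    }

    static Long decode(String token) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, KEY);
        ByteBuffer plain = ByteBuffer.wrap(c.doFinal(Base64.getUrlDecoder().decode(token)));
        long id = plain.getLong();
        byte[] nonce = new byte[4];
        plain.get(nonce);
        int storedCrc = plain.getInt();
        CRC32 crc = new CRC32();
        crc.update(ByteBuffer.allocate(12).putLong(id).put(nonce).array());
        return ((int) crc.getValue()) == storedCrc ? id : null; // reject anything that fails the checksum
    }
}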

Program to mimic scanf() using system calls

As the title says, I am trying out last year's problem, which asks me to write a program that works the same as scanf().
Ubuntu:
Here is my code:
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    char buf[20];
    /* read() does not null-terminate, so leave room for the terminator */
    ssize_t n = read(0, buf, sizeof(buf) - 1);
    if (n < 0)
        return 1;
    buf[n] = '\0';          /* terminate before printing as a string */
    printf("%s", buf);
    return 0;
}
Now my program does not work exactly the same.
How do I make it so that both integer and character values can be stored, since my code only takes character strings?
Also, how do I make the input accept any amount of data (only 20 characters in this case)?
Doing this job thoroughly is a non-trivial exercise.
What you show does not emulate scanf("%s", buffer); very well. There are at least two problems:
You limit the input to 20 characters.
You do not stop reading at the first white space character, leaving it and other characters behind to be read next time.
Note that the system calls cannot provide an 'unget' functionality; that has to be provided by the FILE * type. With file streams, you are guaranteed one character of pushback. I recently did some empirical research on the limits, finding that the number of pushed-back characters ranged from 1 (AIX, HP-UX) to 4 (Solaris) to 'big', meaning up to 4 KiB, possibly more, on Linux and Mac OS X (BSD). Fortunately, scanf() only requires one character of pushback. (Well, that's the usual claim; I'm not sure whether that's feasible when distinguishing between "1.23e+f" and "1.23e+1"; the first needs three characters of lookahead, it seems to me, before it can tell that the e+f is not part of the number.)
If you are writing a plug-in replacement for scanf(), you are going to need to use the <stdarg.h> mechanism. Fortunately, all the arguments to scanf() after the format string are data pointers, and all data pointers are the same size. This simplifies some aspects of the code. However, you will be parsing the scan format string (a non-trivial exercise in its own right; see the recent discussion of print format string parsing) and then arranging to make the appropriate conversions and assignments.
Unless you have unusually stringent conditions imposed upon you, assume that you will use the character-level Standard I/O library functions such as getchar(), getc() and ungetc(). If you can't even use them, then write your own variants of them. Be aware that full integration with the rest of the I/O functions is tricky - things like fseek() complicate matters, and ensuring that pushed-back characters are properly consumed is also not entirely trivial.

How do programming languages handle huge number arithmetic

For a computer working with a 64-bit processor, the largest number that it can handle would be 2^64 = 18,446,744,073,709,551,616. How do programming languages, say Java, or C and C++, handle arithmetic on numbers higher than this value? No register can hold it as a single piece. How was this issue tackled?
There are lots of specialized techniques for doing calculations on numbers larger than the register size. Some of them are outlined in this Wikipedia article on arbitrary-precision arithmetic.
Low level languages, like C and C++, leave large number calculations to the library of your choice. One notable one is the GNU Multi-Precision library. High level languages like Python, and others, integrate this into the core of the language, so normal numbers and very large numbers are identical to the programmer.
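In Java, for example, arbitrary precision comes from the standard library rather than the language core; a minimal example with java.math.BigInteger:

import java.math.BigInteger;

// Arbitrary-precision arithmetic via the standard library rather than the language core.
public class BigDemo {
    public static void main(String[] args) {
        BigInteger big = BigInteger.valueOf(2).pow(64);   // 18446744073709551616, already past a 64-bit long
        System.out.println(big.multiply(big).add(BigInteger.ONE));
    }
}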
You assume the wrong thing. The biggest number it can handle in a single register is a 64-bit number. However, with some smart programming techniques, you could just combine a few dozen of those 64-bit numbers in a row to build a huge 6400-bit number and use that for further calculations. It's just not as fast as having the number fit in one register.
Even the old 8- and 16-bit processors used this trick, where they would just let the number overflow into other registers. It makes the math more complex, but it doesn't put an end to the possibilities.
However, such high-precision math is extremely unusual. Even if you want to calculate the whole national debt of the USA and store the outcome in Zimbabwean dollars, a 64-bit integer would still be big enough, I think. It's definitely big enough to contain the balance of my savings account, though.
Programming languages that handle truly massive numbers use custom number primitives that go beyond normal operations optimized for 32, 64, or 128 bit CPUs. These numbers are especially useful in computer security and mathematical research.
The GNU Multiple Precision Library is probably the most complete example of these approaches.
You can handle larger numbers by using arrays. Try it out by typing the following code into the JavaScript console of your web browser:
The point at which JavaScript fails
console.log(9999999999999998 + 1)
// expected 9999999999999999
// actual 10000000000000000 oops!
JavaScript does not handle plain integers above 9999999999999998. But writing your own number primitive to make this calculation work is simple enough. Here is an example using a custom number adder class in JavaScript.
Passing the test using a custom number class
// Require a custom number primitive class
const {Num} = require('./bases')
// Create a massive number that JavaScript will not add to (correctly)
const num = new Num(9999999999999998, 10)
// Add to the massive number
num.add(1)
// The result is correct (where plain JavaScript Math would fail)
console.log(num.val) // 9999999999999999
How it Works
You can look in the code at class Num { ... } to see details of what is happening; but here is a basic outline of the logic in use:
Classes:
The Num class contains an array of single Digit classes.
The Digit class contains the value of a single digit, and the logic to handle the Carry flag
Steps:
The chosen number is turned into a string
Each digit is turned into a Digit class and stored in the Num class as an array of digits
When the Num is incremented, the increment is applied to the first Digit in the array (the right-most digit)
If the Digit value plus the Carry flag equals the Base, then the next Digit to the left is incremented, and the current digit is reset to 0
... Repeat all the way to the left-most digit of the array
Logistically it is very similar to what is happening at the machine level, but here it is unbounded. You can read more about how digits are carried here; this can be applied to numbers of any base.
Ada actually supports this natively, but only for its typeless constants ("named numbers"). For actual variables, you need to go find an arbitrary-length package. See Arbitrary length integer in Ada
More-or-less the same way that you do. In school, you memorized single-digit addition, multiplication, subtraction, and division. Then, you learned how to do multiple-digit problems as a sequence of single-digit problems.
If you wanted to, you could multiply two twenty-digit numbers together using nothing more than knowledge of a simple algorithm, and the single-digit times tables.
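A short Java sketch of exactly that, treating each decimal digit as one single-digit problem (digit arrays are stored least significant digit first; purely illustrative):

// Illustrative only: schoolbook multiplication over decimal digit arrays (least significant digit first).
static int[] multiply(int[] a, int[] b) {
    int[] result = new int[a.length + b.length];
    for (int i = 0; i < a.length; i++) {
        int carry = 0;
        for (int j = 0; j < b.length; j++) {
            int cur = result[i + j] + a[i] * b[j] + carry;
            result[i + j] = cur % 10;   // keep one digit in this position
            carry = cur / 10;           // pass the rest along, just like on paper
        }
        result[i + b.length] += carry;
    }
    return result;                      // may contain leading zeros at the high end
}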
In general, the language itself doesn't handle high-precision, high-accuracy large number arithmetic. It's far more likely that a library is written that uses alternate numerical methods to perform the desired operations.
For example (I'm just making this up right now), such a library might emulate the actual techniques that you might use to perform that large number arithmetic by hand. Such libraries are generally much slower than using the built-in arithmetic, but occasionally the additional precision and accuracy is called for.
As a thought experiment, imagine the numbers stored as a string. With functions to add, multiply, etc these arbitrarily long numbers.
In reality these numbers are probably stored in a more space efficient manner.
Think of one machine-size number as a digit and apply the algorithm for multi-digit multiplication from primary school. Then you don't need to keep the whole numbers in registers, just the digits as they are worked on.
Most languages store them as arrays of integers. If you add/subtract two of these big numbers, the library adds/subtracts all integer elements in the array separately and handles the carries/borrows.
It's like manual addition/subtraction in school, because this is how it works internally.
Some languages use real text strings instead of integer arrays, which is less efficient but simpler to transform into a text representation.
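For illustration, a minimal Java sketch of the array-with-carry addition described above, using one decimal digit per array element (least significant digit first; real libraries use full machine words instead of single digits):

// Illustrative only: add two numbers stored as arrays of decimal digits, least significant first.
static int[] add(int[] a, int[] b) {
    int n = Math.max(a.length, b.length);
    int[] sum = new int[n + 1];
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int da = i < a.length ? a[i] : 0;
        int db = i < b.length ? b[i] : 0;
        int s = da + db + carry;
        sum[i] = s % 10;      // one digit stays in this position
        carry = s / 10;       // the rest is carried to the next position
    }
    sum[n] = carry;
    return sum;
}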

Resources