Should I use UTF-8, UTF-16 or UTF-32 for my multilingual CMS?

Besides the difference in how characters are stored, are there any special characters in any language utf-32 can display and utf-8 cannot?

All UTF encodings can represent the same range of code points (0 to 0x10FFFF). So, the same characters can be encoded by any of them.
Whether they can be "displayed" is an entirely different question. That has nothing to do with the encoding; it depends on the fonts used. I am not sure that any font has glyphs for every single Unicode code point. But I assume you meant "represented".
They do vary in how many bytes they need to represent a given string. UTF-8 is almost always the shortest for non-Asian languages. For Asian languages, UTF-16 might win (I haven't really "benchmarked" it). I can't imagine a realistic case where UTF-32 would be optimal.
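A quick way to check the size claim is to encode a few strings and count the bytes. This is only a sketch; the sample strings and the helper name `encoded_sizes` are my own choices, not anything from the question.

```python
# Sample strings chosen for illustration only.
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}

def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text needs in each UTF encoding."""
    # The "-le" codecs are used so that no byte-order mark is added.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for name, text in samples.items():
    print(name, encoded_sizes(text))
# English:  utf-8 12, utf-16 24, utf-32 48
# Russian:  utf-8 20, utf-16 22, utf-32 44
# Japanese: utf-8 21, utf-16 14, utf-32 28
```

With these samples UTF-16 only wins for the Japanese string, and UTF-32 is the largest every time.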

Is there any character one of them can't represent?
In theory: No.
All of those formats can represent all Unicode code points.
In practice: Depends.
The Windows API was originally built on UCS-2 (essentially UTF-16 limited to the Basic Multilingual Plane, with no surrogate pairs), and not every part of it handles surrogates correctly. So you might want to use UTF-16 to have your program act as "normal" as possible compared to other programs, instead of manually truncating code points above U+FFFF.
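To see what a surrogate pair looks like, here is a small sketch in Python. The helper name `utf16_code_units` and the two sample characters are assumptions picked for illustration.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for the text."""
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

bmp_char = "\u00e9"          # é, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # an emoji, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0xe9']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']
```

Code that assumes one 16-bit unit per character (the UCS-2 view) would treat the second string as two characters and could split it in the middle.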
Anything else?
Yes: Use UTF-8!
It's endian-less, so it avoids byte-order issues, which are a pain in the rear.
Of course, if you're on Windows then you need to convert your strings to UTF-16 before passing them to the API.
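The byte-order point can be illustrated with a short Python sketch. The variable names are mine, and the last step only shows the conversion itself, not any actual Windows API call.

```python
text = "A"

# UTF-16 has two possible byte orders for the same code unit.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le, be)  # b'A\x00' b'\x00A'

# UTF-8 has a single byte sequence, so there is nothing to get wrong.
utf8 = text.encode("utf-8")
print(utf8)  # b'A'

# Converting stored UTF-8 to little-endian UTF-16, the layout Windows uses.
stored = "caf\u00e9".encode("utf-8")
for_windows = stored.decode("utf-8").encode("utf-16-le")
print(len(stored), len(for_windows))  # 5 8
```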

UTF-8, UTF-16 and UTF-32 can all represent every Unicode code point. So no, there are no special characters that can be represented in UTF-32 and not in UTF-8.

1) UTF-8 is backward compatible with ASCII: plain English text is encoded byte-for-byte the same way. This is an advantage when your clients only use English characters.
2) UTF-8 saves network bandwidth when ASCII characters outnumber non-ASCII ones.
3) UTF-16 can save storage space when most of your text is in scripts that need three bytes per character in UTF-8, such as Chinese, Japanese or Korean.
I suggest using UTF-8, based on #1 above.

Related

Is it possible to write an Enigma encryption algorithm that can use all alphanumeric as input but does not output ambiguous characters?

This is about Enigma encryption, I'm guessing the number of rotors doesn't matter but I'm using 3.
I am working with what's basically a coded version of the old mechanical Enigma-style encryption machines. The concept is rather old, but before I get too far into learning it, I was wondering whether it is possible to encrypt input using all characters 0-9, a-z and A-Z while the encrypted text uses only a subset of these characters. I'm trying to exclude a subset of characters (around 10 in total) from the encrypted output, while still being able to get back to those characters if they were part of the input.
You can disambiguate by adding a one-to-two-character mapping for the ambiguous symbols: O -> A1; 0 -> A2; likewise for the other ambiguous symbols; and A -> AA. This is basically just like escaping in strings: we usually can't put a newline inside a string literal, so we represent it as \n, and \ itself is represented as \\.
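The escaping idea could be sketched like this in Python. It only covers the three mappings named above (O, 0 and the escape symbol A itself); the function names are made up for the example, and a real version would list every ambiguous symbol.

```python
ESCAPE = "A"
# Only the mappings mentioned in the answer; extend for other ambiguous symbols.
ENCODE_MAP = {"O": "A1", "0": "A2", "A": "AA"}
# Maps the character following the escape symbol back to the original symbol.
DECODE_MAP = {pair[1]: symbol for symbol, pair in ENCODE_MAP.items()}

def escape_ambiguous(text: str) -> str:
    """Replace each ambiguous symbol with its two-character escape."""
    return "".join(ENCODE_MAP.get(ch, ch) for ch in text)

def unescape_ambiguous(text: str) -> str:
    """Reverse escape_ambiguous; assumes the input is well formed."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == ESCAPE:
            out.append(DECODE_MAP[text[i + 1]])  # consume the escape pair
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

escaped = escape_ambiguous("FOO0A")
print(escaped)                      # FA1A1A2AA
print(unescape_ambiguous(escaped))  # FOO0A
```

Note that the escaped text is longer than the input, which is the price of shrinking the output alphabet.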
If you’re working with encrypted data (so the probabilities of all characters are uniformly distributed and characters cannot be predicted) then you can’t compress the ciphertext. If you can compress it, then you’ve noticed some kind of pattern in the text and partially broken the encryption.
If you want to reduce the ciphertext’s alphabet, then you must increase the length of the ciphertext, otherwise you’ve successfully compressed it.

ASP.net query encryption method that doesn't produce slash character

I have searched a lot for an encryption algorithm whose encrypted results do not include the slash character. Everything I've tested so far (like this, this and this) generates strings that include slashes, which makes ASP.NET (Web Forms) routing misinterpret the route.
Can you suggest a symmetric encryption algorithm whose encrypted output can safely be used in query strings without confusing ASP.NET routing?
Encryption algorithms generally produce random-looking bytes. These bytes can have any value. You can encode them as text, for instance using hexadecimal or base 64. Hexadecimal output contains only 0..9 and a..f (in upper or lower case). However, hexadecimal encoding is not very efficient: it doubles the size of the ciphertext.
Base 64 uses 64 characters: A..Z, a..z, 0..9, + and /, and sometimes a padding character =. It is however very easy to replace the URL unsafe + and / characters with other ones, e.g. - and _ according to RFC 4648. You can also remove any = characters at the end, although you may have to put them back (until you get a multiple of 4 base 64 characters) depending on the base 64 decoding routine. Base 64 uses 4 characters for 3 bytes, so it expands the ciphertext by 33%.
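As a rough illustration of the URL-safe variant (shown in Python rather than ASP.NET, and with helper names I made up), stripping and restoring the padding looks something like this:

```python
import base64

def to_url_token(ciphertext: bytes) -> str:
    """Encode bytes with the RFC 4648 URL-safe alphabet and drop '=' padding."""
    return base64.urlsafe_b64encode(ciphertext).rstrip(b"=").decode("ascii")

def from_url_token(token: str) -> bytes:
    """Restore the padding to a multiple of 4 characters, then decode."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding)

# Stand-in for real ciphertext; these bytes would give '+' and '/' in plain base 64.
raw = b"\xfb\xff\xfe\x00\x01"
token = to_url_token(raw)
print(token)
print(from_url_token(token) == raw)  # True
```

The token contains no `/`, `+` or `=`, so it should not confuse route parsing.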

Optimal integer encoding that still sorts

One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).
The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.
I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.
Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.
Looks like this problem has been well studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker; it is described here and in more detail in this thesis.
This paper discussing order-preserving dictionary-based compression also looks interesting.
There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be called "encoding" but "compression", and lexicographically comparable compression is a different department.
If you are solving a real-world (not purely academic) problem, I'd just stick with the most standard option, UTF-8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.
To fully answer that question you need to know the frequency of the codepoints in the material.
UTF-8 is optimal for texts in English as multi-byte characters are very rare in typical English text.
To encode integers using UTF-8 as a base algorithm would entail mapping the first n integers to a 1-byte encoding, the next m to a 2-byte encoding and so on.
Whether that is an optimal encoding depends on the distribution. If the first n numbers are very frequent compared to higher numbers, then UTF-8 would be (near) optimal.
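For illustration, here is a sketch of such a tiered encoding in Python. It is not UTF-8 itself but a scheme in the same spirit, and the layout is my own assumption: a k-byte value starts with k-1 one bits and a zero bit, followed by 7k payload bits. The name `encode_ordered` is hypothetical.

```python
def encode_ordered(n: int) -> bytes:
    """Encode a non-negative integer so that byte-wise order matches numeric order."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    offset = 0  # number of integers already covered by shorter tiers
    for k in range(1, 9):
        capacity = 1 << (7 * k)  # how many integers fit in the k-byte tier
        if n - offset < capacity:
            # k-1 leading one bits, then a zero bit, then 7k payload bits.
            prefix = ((1 << (k - 1)) - 1) << (7 * k + 1)
            return (prefix | (n - offset)).to_bytes(k, "big")
        offset += capacity
    raise ValueError("value too large for this sketch")

print(encode_ordered(127))  # b'\x7f'
print(encode_ordered(128))  # b'\x80\x00'

values = [0, 1, 127, 128, 16511, 16512, 10**6, 10**9]
encoded = [encode_ordered(v) for v in values]
print(encoded == sorted(encoded))  # True
```

Unlike UTF-8, the continuation bytes carry no tag bits, so each tier holds more values; the trade-off is that you can no longer resynchronize in the middle of a stream.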

Octal to Decimal converting with VB 6

I heard about the octal number system recently and I want to learn about it.
I asked my teacher about it, and he told me "it's no longer used, you don't need to learn it", but I'm pretty sure it is still in use, so I want to know!
Could someone explain the octal number system, show me a way to convert it to decimal (the number system we use in everyday life), and tell me where it is used in practice, so I can show my teacher that it still matters?
I want to do it in VB6, because that is what my teacher usually works with.
You can get more about Octal from Wiki - http://en.wikipedia.org/wiki/Octal
The octal, or base 8, number system is a common system used with computers. Because of its relationship with the binary system, it is useful in programming some types of computers.
Decimal, hexadecimal, and octal representations are straightforward. To read a string in these formats, use CLng.
Dim value As Long
value = CLng(Text1.Text)
Hexadecimal strings should begin with &H and octal strings should begin with &O.
To convert a value into a decimal, hexadecimal, or octal string representation, use Format$, Hex$, and Oct$ respectively. For example, Oct$(256) returns the octal representation of the value 256 (which is "400").
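To show what the conversion actually does, here is the positional arithmetic written out by hand. It is in Python rather than VB6, purely as an illustration, and `octal_to_decimal` is a made-up name; in VB6 you would normally just use CLng and Oct$ as described above.

```python
def octal_to_decimal(digits: str) -> int:
    """Convert a string of octal digits to an integer, one digit at a time."""
    value = 0
    for d in digits:
        if d not in "01234567":
            raise ValueError("not an octal digit: " + d)
        # Each step shifts the running value one octal place and adds the digit.
        value = value * 8 + int(d)
    return value

print(octal_to_decimal("400"))  # 256, because 4*64 + 0*8 + 0*1 = 256
print(int("400", 8))            # 256, the built-in equivalent
print(format(256, "o"))         # 400, decimal back to octal
```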

How to encode a large number (in an URL)?

Quite often one has to encode a big number (e.g. 128 or 160 bits) in a URL. For example, many web applications use md5(random()) for UUIDs.
If you need to put that value in a URL, the common approach is to just encode it as a hexadecimal string.
But obviously hex encoding is not a very tight encoding. What other approaches are there which fit nicely in a URL?
I would use The "URL and Filename safe" Base 64 Alphabet.
Base 64 uses two character sets.
Data: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
URLs: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
To use base 64 you pad your value to a multiple of 3 bytes (24 bits), then split each group of 24 bits into four 6-bit values. Each 6-bit value is looked up by position in one of the strings given above.
If it all goes well, your final base64 value will always be a multiple of 4 characters long and decode back to a multiple of 3 (8bit) bytes long.
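The 24-bit grouping described above can be sketched by hand in Python. The function names are mine, and the sketch assumes the input is already a multiple of 3 bytes; in practice you would use the built-in functions instead.

```python
import base64

URL_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def encode_group(three_bytes: bytes) -> str:
    """Turn 3 bytes (24 bits) into four characters of 6 bits each."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    bits = int.from_bytes(three_bytes, "big")
    # Take the four 6-bit slices from most to least significant.
    return "".join(URL_ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))

def encode_url_base64(data: bytes) -> str:
    """Encode data whose length is a multiple of 3 bytes (no padding logic)."""
    if len(data) % 3:
        raise ValueError("this sketch only handles multiples of 3 bytes")
    return "".join(encode_group(data[i:i + 3]) for i in range(0, len(data), 3))

print(encode_url_base64(b"Man"))  # TWFu
# Cross-check against the standard library implementation.
print(encode_url_base64(b"\xfb\xff\xff") == base64.urlsafe_b64encode(b"\xfb\xff\xff").decode())
```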
Depending on the language you are using, a lot of them have built in encode and decode functions.
You can do even better with base64-url encoding (a-z, A-Z, 0-9, - and _ [see RFC4648 Section 5]). RFC4648 covers a number of different encoding methods (base16, base32, and base64) and a couple of variants. Also, depending on the sparsity of the bits that are set in the number, you could conceivably run it through gzip and then use one of the described encoding methods. Of course, whether gzip is worthwhile really depends on how large the number you are encoding is.
If you want it tight you can use a base-36 encoding (from 0 to Z).
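A base-36 conversion is short to write by hand. This is only a sketch with a made-up function name; Python's built-in `int(text, 36)` handles the decoding side.

```python
import string

DIGITS = string.digits + string.ascii_uppercase  # "0".."9" then "A".."Z"

def to_base36(n: int) -> str:
    """Render a non-negative integer using the digits 0-9 and A-Z."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))

biggest_128_bit = 2**128 - 1
token = to_base36(biggest_128_bit)
print(len(token), len(format(biggest_128_bit, "x")))  # 25 32
print(int(token, 36) == biggest_128_bit)              # True
```

So a 128-bit value needs at most 25 base-36 characters, compared with 32 in hex.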
Following the base-36 hint, I currently use base 32 (which is in the standard library), something like this (in Python 3):
>>> import base64, uuid
>>> base64.b32encode(uuid.uuid1().bytes).rstrip(b'=').decode('ascii')
'MTB2ONDSL3YWJN3CA6XIG7O4HM'
Just use hex. Even if you were to get 8 bits per character you're still using a 16-20 character random sequence, which nobody will want to type or say. If you can't put up a short identifier, work on your search capabilities.
