QChar stores a negative Latin1 code for multiply sign '×' - qt

I want to get the Latin1 code for the multiply sign ×, but when I inspect the QChar in the debugger it shows the value -41 '×'.
My code:
QString data = "×";
QChar m = data.at(0);
unsigned short ascii = (unsigned short)m.toLatin1();
When I debug, on the second line I see the QChar value shown as -41 '×'.
I changed the code:
unsigned int ascii = m.unicode();
But I get the value 215, whereas I expected 158.

The multiply sign × is not an ASCII character, as you can see by checking man ascii on a Unix system.
Which value it has depends on the encoding; see here for its UTF representations.
For example, in UTF-8 it is encoded as the two bytes 0xC3 0x97.
As mentioned on the Unicode page I linked, 215 is the decimal value of this character's UTF-16 code unit, which is what m.unicode() returns.
I don't know why you expect 158.
There is an ASCII multiplication sign though, which is *.
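For illustration, here is a minimal Qt sketch (the × is spelled out as explicit UTF-8 bytes so the result does not depend on the source file's encoding) that prints the values × has under the different encodings discussed above:

#include <QString>
#include <QDebug>

int main() {
    QString data = QString::fromUtf8("\xC3\x97");  // the multiply sign ×, given as explicit UTF-8 bytes

    QChar m = data.at(0);
    qDebug() << m.unicode();              // 215: the UTF-16 code unit, i.e. code point U+00D7
    qDebug() << data.toUtf8().toHex();    // "c397": the two-byte UTF-8 encoding
    qDebug() << data.toLatin1().toHex();  // "d7": the single-byte Latin-1 encoding, 215 in decimal
    return 0;
}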

If you check the Latin1 code table, it's obvious that × is indeed encoded as 215, or -41 when that byte is read as a signed char. Qt is giving you the correct result.
Your mistakes are:
Assuming that Latin1 is equivalent to ASCII. Latin1 merely contains ASCII as a subset: it defines 256 codes, twice as many as ASCII's 128.
Assuming that × is representable in ASCII. It is not.
I have no clue where you got the idea that the Latin1-encoded × should be 158. Surely it didn't come from the Latin1 code table! Incidentally, the Latin1 code and the Unicode code point of × are identical: both are 215 (U+00D7).
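The -41 shown in the debugger is just the signed view of that byte: on most platforms char is signed, so the Latin-1 byte 0xD7 prints as -41. A minimal sketch of how to recover the unsigned value:

#include <QChar>
#include <QDebug>

int main() {
    QChar m(0x00D7);             // ×, Unicode code point 215
    char latin1 = m.toLatin1();  // the Latin-1 byte 0xD7
    qDebug() << int(latin1);                              // -41 where char is signed
    qDebug() << int(static_cast<unsigned char>(latin1));  // 215, the actual Latin-1 code
    return 0;
}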

Related

While converting Hexadecimal to ASCII characters I am getting Boxes and unrecognized Symbols

When I convert hexadecimal numbers to their ASCII characters, some come out as boxes with numbers and question-mark-like symbols. I want them converted to ASCII characters.
I have tried printing them as hex, which gives proper results, but %c does not.
printf("\nKey Algorithm String: ");
for(i = 80; i<=105; i++)
{
printf("%c", packet[i]);
}
I am getting A������T1^�d;���F=D but I want this: A.. .....T1^.d;....F =D
Your data is not pure ASCII; that's why you're getting those "weird symbols". ASCII uses the lowest 7 bits with the highest bit always 0, so it covers the range 0-127, but your data is probably in the range 0-255, so you need to know how to interpret those other values.
The alternatives are to ignore all non-ASCII characters or to decode them correctly.
If you want to just ignore them, filtering out the characters that are higher than 127 should be enough. Something similar to this:
printf("\nKey Algorithm String: ");
for(i = 80; i <= 105; i++)
{
if(packet[i] < 128)
printf("%c", packet[i]);
else
printf(".");
}
If you want to correctly print the rest of the characters, you need to know the charset being used. Most likely it's UTF-8 or Latin-1/cp1252. Once you know what it is, you can write a decoder yourself or search for one to use (I'd recommend the latter).
See
https://en.m.wikipedia.org/wiki/Character_encoding
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
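If it turns out to be Latin-1, every byte maps to exactly one character, so a hex dump next to the filtered output already tells you a lot. A minimal sketch of that idea (the packet contents below are made up purely for illustration):

#include <stdio.h>

int main(void) {
    /* hypothetical packet contents, for illustration only */
    unsigned char packet[] = { 'A', 0xC3, 0xA9, 'T', '1', '^', 0x9D, 'd', ';', 0xF0, 0x9F, 0x99, 0x82, 'F', '=', 'D' };
    int i;

    printf("Key Algorithm String:\n");
    for (i = 0; i < (int)sizeof(packet); i++) {
        int c = packet[i];
        /* show the byte value and, if it is printable ASCII, the character itself */
        printf("%02X %c\n", c, (c >= 32 && c < 127) ? c : '.');
    }
    return 0;
}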

Simple string encryption - safety of higher ASCII characters

I am trying to create a simple encryption scheme for strings. Each character of the string is mapped to another ASCII value.
It entails writing ASCII characters up to 246 to a simple file on disk.
I want to find out whether it is safe to write these special characters to disk or whether it can cause untoward results. Thanks for your help.
Edit: I am considering an algorithm similar to the following:
* Convert each character of the string to its integer value (e.g. 110 for 'n' and 122 for 'z')
* Double that number (getting 220 and 244)
* Convert the result back to a character (yielding extended ASCII codes)
* Save these characters to the file.
Is it safe to save these extended ASCII characters to disk files using the usual text-file writing functions?
There is only a limited set of ASCII characters. There are 95 printable characters, such as 'A' but also the space character, and 33 non-printable control characters, such as Line Feed, Carriage Return, NUL and DELETE. So you cannot use 246 characters of ASCII, as there are only 128 available in total: ASCII is strictly 7 bits, giving you 2^7 = 128 possible values.
Even if you used the ISO 8859 Latin-1 character set or the Windows-1252 character set, you would still have the unprintable control characters to deal with, leaving you with 256 - 33 - 5 = 218 characters (Windows-1252 still has 5 undefined characters).
What you can do, of course, is save your data as bytes. Each byte has 256 possible values (usually 0 to 255 or -128 to 127). As long as you open the file in binary mode, this poses no problem.
You can of course store as many characters in a file as you want, up to the file system or operating system limit. So I presume you didn't ask that.
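As a minimal sketch of that advice (the doubling scheme comes from the question; the file name and everything else are assumptions), writing the transformed bytes through a stream opened in binary mode avoids any text-mode translation:

#include <fstream>
#include <string>

int main() {
    std::string input = "nz";  // 'n' = 110, 'z' = 122
    std::string encoded;
    for (unsigned char ch : input) {
        // double the value as described in the question; the result may exceed 127,
        // which is fine for raw bytes but outside strict ASCII
        encoded.push_back(static_cast<char>((ch * 2) & 0xFF));
    }

    // binary mode: no newline or codepage translation is applied to the bytes
    std::ofstream out("encoded.bin", std::ios::binary);
    out.write(encoded.data(), static_cast<std::streamsize>(encoded.size()));
    return 0;
}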

How to represent acute accents in ASCII?

I'm having an encoding problem related to cookies on one of my websites.
A user is inputting Usuário, which has an acute accent, and that's being put in a cookie. The raw hex for the cookie response is (for the Usuário string):
55 73 75 C3 A1 72 69 6F
When I see it in the browser, it comes out as garbled characters, which is really messy. I need to fix this up.
Then I went to this website: http://www.rapidtables.com/convert/number/hex-to-ascii.htm and converted the hex value to see what it would look like, and I got the same garbled output.
Right. This means the hex code is wrong. Then I tried to convert Usuário to ASCII to see how it should be, using this website: http://www.asciitohex.com/.
To my surprise, the hex it produced is exactly the one that is showing up garbled. Why???
And how do I represent Usuário in ASCII so I can put it in a cookie? Should I manually encode it?
PS: I'm using ASP.NET, just in case it matters.
As of 2015, the standard encoding of the web for storing character data is UTF-8, not ASCII. ASCII only contains the first 128 characters of the codepage and does not include any kind of accented characters. To add accented characters to these 128, there were many legacy solutions: codepages. Each added 128 extra characters on top of the default ASCII list, thereby allowing 256 different characters to be represented.
The problem was that this didn't properly solve the issue: ASCII-based codepages were more or less incompatible with each other (except for the first 128 characters), and there was usually no way of programmatically knowing which codepage was in use.
One of the solutions was UTF-8, which is a way to encode the Unicode character set (containing most of the characters used around the world, and more) while trying to remain compatible with ASCII. The first 128 characters are the same in both cases, but beyond that UTF-8 characters become multi-byte: one character is encoded as a series of bytes (usually 2-3, depending on which character needs to be encoded).
The problem arises if you are using some kind of ASCII-based single-byte codepage (like ISO-8859-1), which encodes its supported characters as single bytes, but your input is actually UTF-8, which encodes accented characters as multiple bytes (you can see this in your hex example: á is encoded as C3 A1, two bytes). If you read these two bytes with an ASCII-based codepage, which uses a single byte for every character (in Western Europe this codepage is usually ISO-8859-1), then each of these two bytes is rendered as a separate, different character.
In the web world the default encoding is UTF-8, so your clients will usually send their requests using UTF-8. ASP.NET is Unicode aware, so it can handle these requests. However, somewhere in your code this UTF-8 is accidentally converted into ISO-8859-1 and then back into UTF-8. This might happen on various layers; as you are seeing the issue in a cookie, it probably happens at the cookie layer, which is sometimes problematic (here is how it worked in 2009). You should also double-check that your application uses UTF-8 everywhere else (views, database, etc.) if you want to properly support accented characters.
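As a small Qt sketch of the mechanism described above (purely illustrative; the actual site is ASP.NET), interpreting the cookie's UTF-8 bytes as Latin-1 reproduces the garbling, while decoding them as UTF-8 gives the intended string:

#include <QByteArray>
#include <QString>
#include <QDebug>

int main() {
    // the raw cookie bytes from the question: "Usuário" encoded as UTF-8
    QByteArray raw = QByteArray::fromHex("557375C3A172696F");

    qDebug() << QString::fromUtf8(raw);    // "Usuário" - the correct interpretation
    qDebug() << QString::fromLatin1(raw);  // each UTF-8 byte read as a separate Latin-1 character
    return 0;
}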

What causes XOR encryption to return a "blank"?

What causes certain characters to become blank when using XOR encryption? Furthermore, how can this be compensated for when decrypting?
For instance:
....
void basic_encrypt(char *to_encrypt) {
    char c;
    while (*to_encrypt) {
        *to_encrypt = *to_encrypt ^ 20;
        to_encrypt++;
    }
}
will return "nothing" for the character k. Clearly, character decay is problematic for decryption.
I assume this is caused by the bit operator, but I am not very good with binary so I was wondering if anyone could explain.
Is it converting an element, k, in this case, to some spaceless ASCII character? Can this be compensated for by choosing some y < x < z operator where x is the operator?
Lastly, if it hasn't been compensated for, is there a realistic decryption strategy for filling in blanks besides guess and check?
'k' has the ASCII value 107 = 0x6B. 20 is 0x14, so
'k' ^ 20 == 0x7F == 127
if your character set is ASCII compatible. 127 is DEL in ASCII, which is a non-printable character, so it won't be displayed if you print it out.
You will have to know the difference between bytes and characters to understand what is happening. On the one hand you have the C char type, which is simply a representation of a byte, not a character.
In the old days each character was mapped to a one-byte (octet) value in a character-encoding table, or code page. Nowadays we have encodings that take more bytes for certain characters, e.g. UTF-8, or even encodings that always take more than one byte, such as UTF-16. The last two are Unicode encodings, which means that each character has a certain numeric value and the encoding is used to encode this number into bytes.
Many computers will interpret bytes as ISO/IEC 8859-1 or Latin-1, sometimes extended to Windows-1252. These code pages have holes for control characters, or byte values that are simply not used. It depends on the runtime system how these values are handled: Java by default substitutes a ? character in place of the missing character, other runtimes will simply drop the value or, of course, execute the control code. Some terminals may use the ESC control code to set the color or to switch to another code page (making a mess of the screen).
This is why ciphertext should be converted to another encoding, such as hexadecimals or Base64. These encodings make sure that the result is readable text. That takes care of the ciphertext; you will have to choose a character set for your plaintext too, e.g. simply perform ASCII or UTF-8 encoding before encryption.
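A minimal sketch of that idea, reusing the XOR routine from the question (the key 20 is the question's; the sample string is made up): print the ciphertext as hexadecimals instead of raw characters, so non-printable bytes such as 0x7F remain visible:

#include <stdio.h>
#include <string.h>

int main(void) {
    char msg[] = "kick";    /* sample plaintext; 'k' ^ 20 becomes the non-printable 0x7F */
    size_t i, len = strlen(msg);

    for (i = 0; i < len; i++)
        msg[i] ^= 20;       /* the same XOR "encryption" as in the question */

    /* print the ciphertext as hexadecimals instead of raw characters */
    printf("ciphertext: ");
    for (i = 0; i < len; i++)
        printf("%02X", (unsigned char)msg[i]);
    printf("\n");
    return 0;
}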
Getting a zero value from encryption does not matter, because once you re-XOR with the same XOR key you get the original value back:
value XOR key == ciphertext [encryption]
(value XOR key) XOR key == value [decryption]
If you're using a zero-terminated string mechanism, then you have two main strategies for preventing 'character degradation':
* store the length of the string before encryption and make sure to decrypt at least that number of characters on decryption (see the sketch below)
* check for a zero character after decoding each character
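A minimal sketch of the first strategy (key value, buffer and names are assumptions, not taken from the question): record the plaintext length before encrypting, so an embedded zero byte in the ciphertext cannot truncate anything:

#include <stdio.h>
#include <string.h>

/* XOR a buffer of known length; the same call works for encryption and decryption */
void xor_buffer(char *buf, size_t len, char key) {
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] ^= key;
}

int main(void) {
    char text[] = "kick";
    size_t len = strlen(text);    /* record the length BEFORE encrypting */

    xor_buffer(text, len, 20);    /* encrypt: 'k' becomes 0x7F, which is fine as a byte */
    xor_buffer(text, len, 20);    /* decrypt: XOR with the same key restores the original */

    printf("%s\n", text);         /* prints "kick" */
    return 0;
}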

What is QString::toUtf8 doing?

This may sound like an obvious question, but I'm missing something about either how UTF-8 is encoded or how the toUtf8 function works.
Let's look at a very simple program
QString str("Müller");
qDebug() << str << str.toUtf8().toHex();
Then I get the output
"Müller" "4dc383c2bc6c6c6572"
But I was under the impression that the letter ü should have been encoded as c3bc and not c383c2bc.
Thanks
Johan
It depends on the encoding of your source file.
I tend to think that your file is already encoded in UTF-8, with the character ü encoded as C3 BC.
You're calling the QString::QString(const char *str) constructor which, according to http://doc.qt.io/qt-4.8/qstring.html#QString-8, converts your string to Unicode using the QString::fromAscii() method, which by default treats the input as Latin-1 content.
As C3 and BC are both valid in Latin-1, representing Ã and ¼ respectively, converting them to UTF-8 leads to the following byte sequences:
Ã (C3) -> C3 83
¼ (BC) -> C2 BC
which leads to the string you get: "4d c3 83 c2 bc 6c 6c 65 72"
To sum things up, it's double UTF-8 encoding.
There are several options to solve this issue:
1) You can convert your source file to Latin-1 using your favorite text editor.
2) You can escape the ü character as \xFC in the string literal, so the string won't depend on the file's encoding.
3) You can keep the file and string as UTF-8 data and use QString str = QString::fromUtf8("Müller");
Update: This issue is no longer relevant in Qt 5. http://doc.qt.io/qt-5/qstring.html#QString-8 states that the constructor now uses QString::fromUtf8() internally instead of QString::fromAscii(). So, as long as UTF-8 encoding is used consistently, it is handled correctly by default.
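A minimal sketch of option 3, with the ü spelled out as explicit UTF-8 bytes so the result does not depend on this file's own encoding; it should print the hex string you expected:

#include <QString>
#include <QDebug>

int main() {
    // "Müller" built from explicit UTF-8 bytes: ü is C3 BC
    QString str = QString::fromUtf8("M\xC3\xBCller");

    qDebug() << str;                   // "Müller"
    qDebug() << str.toUtf8().toHex();  // "4dc3bc6c6c6572" - ü is the two bytes c3 bc
    return 0;
}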
Running your code I get the expected result:
"4dc3bc6c6c6572"
I think the problem is with your input, not the output.
Check the encoding of your source file and look at
void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) [static]
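For completeness, a minimal sketch of that approach (Qt 4 only: QTextCodec::setCodecForCStrings was removed in Qt 5), telling QString to decode plain const char * literals as UTF-8:

#include <QString>
#include <QTextCodec>
#include <QDebug>

int main() {
    // Qt 4 only: make the QString(const char *) constructor decode literals as UTF-8
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

    QString str("Müller");             // decoded correctly, provided the source file is saved as UTF-8
    qDebug() << str.toUtf8().toHex();  // "4dc3bc6c6c6572"
    return 0;
}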
