Which is the first multi-byte UTF code point?

I just wanted to know which Unicode blocks can be safely used when limited to single-byte code points only.
So, which is the last single-byte code point, and which is the first multi-byte code point?

In UTF-8, the last single-byte code point is U+007F, and the first 2-byte code point is U+0080.
See https://en.wikipedia.org/wiki/UTF-8#Encoding
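A quick way to check the boundary yourself (a minimal Python sketch):

    # U+007F (DELETE) is the last code point that fits in a single UTF-8 byte.
    print("\u007f".encode("utf-8"))  # b'\x7f'      -> 1 byte
    # U+0080 is the first code point that needs two bytes.
    print("\u0080".encode("utf-8"))  # b'\xc2\x80'  -> 2 bytes

So any block at or below U+007F (i.e., Basic Latin) is safe if you are limited to single-byte code points.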

Related

How to represent acute accents in ASCII?

I'm having an encoding problem related to cookies on one of my websites.
A user is inputting Usuário, which has an acute accent, and that's being put in a cookie. The raw HEX for the cookie response is (for the Usuário string):
55 73 75 C3 A1 72 69 6F
When I see it in the browser, it shows up as UsuÃ¡rio, which is really messy. I need to fix this up.
Then I went to this website: http://www.rapidtables.com/convert/number/hex-to-ascii.htm and converted the HEX value to see what it would look like. And I got the same output: UsuÃ¡rio.
Right. This means the HEX code is wrong. Then I tried to convert Usuário to ASCII to see how it should be. I used this website: http://www.asciitohex.com/ and the result was the same HEX value as above.
To my surprise, the HEX is exactly the one that is showing up messy. Why???
And how do I represent Usuário in ASCII so I can put it in a cookie? Should I manually encode it?
PS: I'm using ASP.NET, just in case it matters.
As of 2015 the web standard for storing character data is UTF-8, not ASCII. ASCII actually only contains the first 128 characters of the codepage, and does not include any kind of accented characters. To add accented characters to these 128 characters there were many legacy solutions: codepages. Each added a different set of 128 characters on top of the default ASCII list, thereby allowing 256 different characters to be represented.
The problem was that this didn't properly solve the issue: ASCII-based codepages were more or less incompatible with each other (except for the first 128 characters), and there was usually no way of programmatically knowing which codepage was in use.
One of the solutions was UTF-8, which is a way to encode the Unicode character set (containing most of the characters used around the world, and more) while trying to remain compatible with ASCII. The first 128 characters are the same in both cases, but beyond that UTF-8 characters become multi-byte: one character is encoded as a series of bytes (usually 2 or 3, depending on which character needs to be encoded).
The problem arises if you are using some kind of ASCII-based single-byte codepage (like ISO-8859-1), which encodes supported characters in single bytes, but your input is actually UTF-8, which encodes accented characters in multiple bytes (you can see this in your HEX example: á is encoded as C3 A1, two bytes). If you try to read these two bytes in an ASCII-based codepage, which uses a single byte for every character (in Western Europe this codepage is usually ISO-8859-1), then each of these two bytes will be represented as a separate character.
In the web world the default encoding is UTF-8, so your clients will usually send their requests using UTF-8. ASP.NET is Unicode aware, so it can handle these requests. However, somewhere in your code this UTF-8 is accidentally converted into ISO-8859-1 and then back into UTF-8. This might happen at various layers. Since you are seeing the issue in a cookie, it probably happens at the cookie layer, which is sometimes problematic (here is how it worked in 2009). You should also double-check that your application uses UTF-8 everywhere else (views, database, etc.) if you want to properly support accented characters.
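You can reproduce the mangling in a couple of lines (a minimal Python sketch of the double-encoding described above):

    data = "Usuário".encode("utf-8")
    print(data.hex(" "))              # 55 73 75 c3 a1 72 69 6f -- the cookie bytes
    print(data.decode("iso-8859-1"))  # UsuÃ¡rio -- UTF-8 bytes misread as ISO-8859-1
    print(data.decode("utf-8"))       # Usuário  -- the same bytes decoded correctly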

Anything odd about Chinese unicode characters 稍 and 稊 that would affect KDiff3?

I have reported a bug and entered a support request at the KDiff3 site (https://sourceforge.net/p/kdiff3/bugs/198/), but I wonder if anyone has any prompt information about the behavior I'm seeing that might help me understand why such a bug might exist; in particular, whether there is anything unusual about these Unicode characters.
When I merge two identical files containing the character 稍 using KDiff3 version 0.9.98, it reads the character as 稊 and shows that character in all the panes of the merge. The output then contains that character instead of 稍.
I've observed this behavior with UCS-2 Little Endian encoding in version 0.9.98 of KDiff3, but not with UTF-8 encoding, and not with version 0.9.96a, the version of KDiff3 that comes with TortoiseHg. Although I can reproduce the problem in 0.9.96 and 0.9.97, TortoiseHg's KDiff3 reports that it is version 0.9.96a and does not exhibit the problem.
Edit: I vaguely suspect the source of the problem to be somewhere in the Qt library. So any information about what Qt does especially in regard to handling international text might be useful.
Utilities that process text files need to break the text into characters to operate effectively. The simplest possible process is to treat each 8-bit byte as a single character. Unfortunately this doesn't work well with UTF-16 or UCS-2 input, since each byte is only half of the character.
The character you're having problems with is 稍 (U+7A0D), which is being converted to 稊 (U+7A0A). When you break those down into little-endian bytes, you get 0x0D, 0x7A and 0x0A, 0x7A. The 8-bit character 0x0D is the ASCII code for carriage return, and 0x0A is the code for linefeed. It seems that KDiff3 is interpreting these bytes as line endings and substituting a linefeed when it encounters a carriage return. This is consistent with your report of an error message indicating inconsistent line endings in the file.
When working with Unicode it is often better to use UTF-8 encoding. Characters above U+007F will still take up more than one byte, but each of those bytes will have a value of 0x80 or greater and cannot accidentally be mistaken for an ASCII character. For example, 稍 becomes 0xE7, 0xA8, 0x8D.
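The byte patterns are easy to verify (a minimal Python sketch):

    # UTF-16LE: the low byte of U+7A0D is 0x0D, the ASCII carriage return
    print("稍".encode("utf-16-le").hex(" "))  # 0d 7a
    print("稊".encode("utf-16-le").hex(" "))  # 0a 7a -- 0x0A is the linefeed
    # UTF-8: continuation bytes are all >= 0x80, so no collision with ASCII
    print("稍".encode("utf-8").hex(" "))      # e7 a8 8d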

CSS character encoding

According to W3C, CSS can declare its character encoding with an @charset rule on the first line. Is it valid to say that I should put @charset "UTF-8"; in every CSS file I make, even if it only contains ASCII characters?
Will there be any performance penalty after I declare it as UTF-8?
P.S. I can't think of a way to test it out.
No, it is not valid to say so as an unqualified statement. If your file contains only ASCII characters, it is very likely that its character encoding is ASCII-compatible (EBCDIC is not much used these days), so the rule would be harmless, but also pointless as long as the file stays ASCII-only.
What matters is what happens when a non-ASCII character gets inserted into the CSS file, for whatever reason. It could be, for example, an innocent-looking smart quote (”) inserted when editing the file with a program that produces smart quotes. It is more likely that the smart quote gets inserted in windows-1252 encoding than in UTF-8 encoding. So if the file has the @charset "UTF-8"; rule, it probably becomes a bit more difficult to analyze the problem.
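To make that failure mode concrete (a minimal Python sketch; the byte value is what a windows-1252 editor would write):

    smart_quote = "”".encode("windows-1252")  # b'\x94' in windows-1252
    try:
        smart_quote.decode("utf-8")  # what a parser honoring @charset "UTF-8" attempts
    except UnicodeDecodeError as err:
        print(err)  # 0x94 is not valid UTF-8, so the declared encoding now misleads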
If, on the other hand, you know that your CSS file will be edited using software that uses UTF-8 encoding by default, then it is OK to declare it as UTF-8 encoded even if it only contains ASCII characters. For example, if you some day edit the file and add a declaration like content: "“foo”", you might forget to add the @charset rule at that point, so having it in place from the start is safer.
There is no overhead in declaring the encoding as UTF-8. If the data contains ASCII characters only, any decent routine that reads UTF-8 will process the characters as fast as a simple ASCII read: a routine reading a UTF-8 byte stream has to check whether each byte is in the ASCII range, and if it is, take it as standing for an ASCII character.

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quote) into Unix from Notepad, it's getting converted to \273. The corresponding hex value is BB and the decimal value is 187.
My actual requirement is to have this character as the delimiter when I export a .dat file from a database table. So this character was put in as the delimiter after each column name. But while copy-pasting, it's getting converted to \273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines character codes up to 127 (0x7F); everything after that belongs to another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use: the locale command will report your current locale settings, and the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that's probably the character set you want; the locale would then be something like en_US.ISO8859-1 for that character set with US English messages, date formats, currency settings, etc.
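A quick check of the byte values involved (a minimal Python sketch):

    print(oct(0xBB))                           # 0o273 -- 273 octal, the \273 you see when pasting
    print(bytes([0xBB]).decode("iso-8859-1"))  # »     -- one byte in ISO-8859-1
    print("»".encode("utf-8").hex(" "))        # c2 bb -- two bytes in a UTF-8 locale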

Why does the encoding's of a URL and the query string part differ?

I was researching why my query parameters contain plus signs (+) instead of %20, and why they contain strings like %C3%BC instead of ü (UTF-8), as an encoded URL does.
After two hours of thinking my web app was not compliant with the URL encoding standard, I found that the encoding scheme of a query string is not the same as the encoding of a URL (here I mean the part without the query string).
Examples:
URL:
whitespace encodes to %20
UTF-8 chars stay UTF-8 chars
Query params:
whitespace encodes to +
UTF-8 chars encode to their hex representation
So can someone tell me why the encoding schemes differ, since the query parameters are a part of the URL?
See:
Wikipedia: Percent-encoding
Wikipedia: Query string
URIs originated in RFC 1630, with percent-encoding as a method to allow "unsafe" characters to be represented. This original version actually mentioned the ISO Latin 1 character set as the encoding for non-ASCII characters. RFC 1738 later that year removed this reference to Latin-1 in defining URLs.
The query string format is actually a different but related encoding, application/x-www-form-urlencoded, defined in RFC 1866 along with HTML 2.0. It was based on RFC 1738, but specified that spaces (not all whitespace, just the character with ASCII code 0x20) are replaced by '+' and that line breaks are to be encoded as CRLF (i.e. %0D%0A). The former is likely because it saves 2 bytes for a very common character in form submissions at the expense of using an extra 2 bytes for a much less common character, and the latter is to avoid problems when transferring between systems using different end-of-line conventions. Non-ASCII characters were left unconsidered.
UTF-8 coding in URIs came over a decade later, in RFC 3986, although individual protocols may have specified this or another encoding of non-ASCII characters earlier. To maintain backwards compatibility, all UTF-8 octets must be percent-encoded. The companion RFC 3987 defines "Internationalized Resource Identifiers" (IRIs) which are basically "URIs with most codepoints 160 and above allowed to appear unencoded", but many protocols still require URIs. Note that your statement above is incorrect, as a URL may not contain an unencoded ü or any other non-ASCII character.
application/x-www-form-urlencoded has been internationalized in a different manner. The HTML5 specification of application/x-www-form-urlencoded explicitly allows that any ASCII-compatible character set may be used for characters in the query string, and in fact different fields may use different character sets, but all non-ASCII octets must still be percent-encoded. When used in the query part of an IRI, it is possible that these characters could be represented unencoded if properly-normalized UTF-8 is being used as the character set, since conversion back to a URI would result in correct application/x-www-form-urlencoded data.
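Most libraries expose the two schemes as separate functions; for example (a minimal Python sketch using urllib.parse):

    from urllib.parse import quote, quote_plus, urlencode

    print(quote("a b ü"))             # a%20b%20%C3%BC -- plain percent-encoding (path part)
    print(quote_plus("a b ü"))        # a+b+%C3%BC     -- application/x-www-form-urlencoded
    print(urlencode({"q": "a b ü"}))  # q=a+b+%C3%BC   -- query strings use the form variant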
They don't necessarily have to differ; a + is a valid path character, and a ü is a valid search character (per RFC 3987). You're probably seeing browsers or some other preconceived encoding scheme making assumptions that are either outdated or overly cautious.
There is no difference between + and %20 when it comes to query string parameters: SPACE may be encoded as either '+' or '%20'.
