How to repair unicode letters? - http

Someone in email sent me letters like this
IVIØR†€™
correct should be
IVIØR†€™
suppose to be
How do I represent them in their original Portuguese langauge, it got altered after being passed through HTTP GET request.
I probably will not be able to fix the site.. but maybe create a repair tool to repair these broken encoded letters? or anyone know of any repair tool? or how to do it manually by hand? Seems like nothing is lost.. just badly interpreted

What happened here is that UTF-8 got misinterpreted as ISO-8859-1; and then other kinds of mangling (the bad ISO-8859-1 string being re-UTF-8-encoded; the non-breaking space character '\xA0' being converted to regular space '\x20') seem to have happened afterward, though those may just be a result of pasting it into Stack Overflow.
Due to the subsequent mangling, there's no really good way to completely undo it, but you can largely undo it by passing it through a not-very-strict UTF-8 interpreter. For example, if I save "IVIØR†€™" as a text-file on my computer, using Notepad, with the "ANSI" (single-byte) encoding, and then I open it in Firefox and tell it to interpret it as UTF-8 (Firefox > Web Developer > Character Encoding > Unicode (UTF-8)), then it displays "IVIØR� €™". (The "�" is because of the '\xA0' having been changed to '\x20', which broke the UTF-8 encoding.)

They're probably not broken. It's just a difference between the encoding they were sent in, vs. the decoding you're viewing them in.
Figure out what encoding was originally used, and use the same one to decode it, and it should look like the original. In terms of writing a "fix-it" tool, you'd always need to know what encoding they were originally created in, which can be complicated depending on the source, and whether or not you have access to said information.

Related

How to detect wrong encoding declaration?

I am building a ASP.NET webservice loading other webpages and then hand it clients.
I have been doing quite well with character code treatment, reading the meta tag from HTML then use that codeset to read the file.
But nevertheless, some less educated users just don't understand code sets. They declare a specific encoding method e.g. "gb2312", but in fact, he is just using normal UTF8. When I use gb2312 to decode the text, everything turns out a holy mess.
How can I detect whether the text is properly decoded? I loaded that page into my IE, which correctly use UTF-8 to decode the page. How does it achieve that?
Based on the BOM you can tell what encoding is used.
BOM and encoding
If you want to detect character set you could use the C# port of mozilla's character set detector.
CharDetSharp
If you want to make it extra sure that you are using a correct one, you maybe could be looking for special characters that are not supposed to be there. It is not very likely to include "óké". So you could be looking for such characters and try to use different encoding/character set to process your file.
Actually it is really hard to make your application completely "fool-proof".

What is causing the corruption of text fields with ¿ characters?

We have a very strange problem in out application, all of a sudden we started noticing
upside down question marks being saved along with other text typed in to the fields on the screen. These upside down question marks were not originally entered by the users and it is unclear where they come from. We are using Oracle 10g with Asp.Net.
Here is an example of the issue: "140, 141) ¿ 16-Oct-07". If any one have seen this before and found a way to fix this please let me know how.
This sounds like a character encoding issue. Please check what encoding your database (tables) are set to, and what encoding the objects or strings which are passing data in the database are of. If there is a mis-match (DB in ANSI, App in UTF-8), these sorts of issues can appear.
Greg, you should check NLS_CHARACTERSET not NLS_NCHAR_CHARACTERSET settings. And I bet you it's WE8ISO8859P1 or something similar and not unicode. The problem occurs when the submitted data in unicode, which is probably UTF8, and Oracle tries to map the characters to WE8ISO8859P1 character set. It does fine for most of them but fails for high ASCII number characters, like 140.
So yes, I have seen the same issue in our application and in our case it was caused by special quote marks (“example”, ‘example’) that were copied from MS Word. Word automatically converts double quotes to some other quotes. The solution was to convert the database to UTF-8.
IF your users are copying from MS Word you can turn the feature off . Its part of the autocorrect/autoformat functionality. If you uncheck the replace options for quotes and apostrophes you should be ok. Be sure turn off the replacements in both the AutoFormat and AutoFormat as you type.

Warning when validating my website with http://validator.w3.org?

I created a simple test page on my website www.xaisoft.com and it had no errors, but it came back with the following warning and I am not sure what it means.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting) :
Q: What is a BOM?
A: A byte order mark (BOM) consists of
the character code U+FEFF at the
beginning of a data stream, where it
can be used as a signature defining
the byte order and encoding form,
primarily of unmarked plaintext files.
Under some higher level protocols, use
of a BOM may be mandatory (or
prohibited) in the Unicode data stream
defined in that protocol.
Depending on your editor, you might find an option in the preferences to indicate it should save unicode documents without a BOM... or change editor ^^
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it's Unicode and what byte-order it is in. You don't expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly resave it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-r, or that sort of thing, don't. Use the HTML entities instead.

what encoding it is?

If i have a HTML page with setting to be UTF-8.
and then I input Chinese characters with encoding big5 in the form and submit.
what encoding it is at server side ?
is it automatically converted to UTF-8?
Or how it works ??
Thanks!
Supplement1:
Actually i am really not sure, why the browser can decide which encoding to use ? since the encode was generated by IME. for example: the tool i used to input Chinese character, right ?
supplement2:
if everything just like what "Michael Madsen" said at the below response, then how can asp.net handle this, such that whatever and no matter how i input the characters in the forms, it will not get corrupted always but jsp can't?
The browser works with Unicode - when the characters are typed in there, they're internally stored as Unicode. When the form is submitted, it outputs the characters in whatever encoding is appropriate - usually the encoding of the page.
If you're talking about copy/pasting from a Big5 document, then it will already have been converted to Unicode when it's inserted into the clipboard - maybe even when the document is loaded, depending on your editor.
If you're talking about using some IME to input the characters, the question is kind of faulty, since your IME should be working exclusively with Unicode and Big5 encoding is therefore never involved. If it is, then there's some layer inbetween doing the conversion to/from Unicode anyway, so regardless of that part, the browser never knows the source encoding.
The browser can send up its post in big5 if it wants to, and the server should be able to handle that. But what do you mean by "I input Chinese characters with encoding big5 in the form"? When you input the characters, it's up to the browser to decide which encoding to use, surely?

using non-latin characters in a URL

I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns we have generated URL re-writing rules that mimic the layout of the application which has lead to having many non-ascii charachters in the URLs.
Examples š ž č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programatic Response.Redirects and some through adding 301 status codes and location headers to the response. I'm testing in IE6, IE7 and Firefox 3 and internitmtently, the browsers display the non-latin chars url encoded.
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS to not URL encode these chars or is my best bet to replace these with non-diacritic chars?
Thanks
Ask yourself if you really want them non-url encoded. What happens when a user that does not have support for those characters installed comes around? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the urls look nice? If so, using a regular z instead of ž will do just fine. Do you use the urls for user input? If so, url-encode everything before parsing it to link output, and url-decode it before using the input. But don't use ž and other local letters in urls...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in urls - we use a, a and o, because browsers won't support the urls otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the url. The text will still show correctly on the page, right? ;)
Does anyone know of a way of forcing IIS to not URL encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different to if you'd just written ‘%C5%A1’ in the first place.
Including a raw ‘š’ in a link is not wrong as such, the browser is supposed to encode it to UTF-8 and URL-encode as per the IRI spec. But to make sure this actually works you should ensure that the page with the link in is served as UTF-8 encoded. Again, manual URL-encoding is probably safest.
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
do you have a link to a reference where it details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most situations it is included in HTTP, cannot be contrived to be an atom. Anyway RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue for how one might be able to interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO stuff on a large travel site and that's when I learnt that. When you force diacritics to ascii you can change the meaning of words if you're not careful. There often is no translation as diacritics only exist in their context.

Resources