Byte-Order Mark found in UTF-8 File. W3C Validation Error - xhtml

I have created a web site that is valid Strict XHTML and passes validation, but the W3C validator gives me this note (error):
Byte-Order Mark found in UTF-8 File.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
But I have no BOM in my file. It's straight XHTML done in VS.
Is the server adding it? How can I get rid of the error?
This is important as it screws up semantic extraction. http://www.w3.org/2003/12/semantic-extractor.html

You do have a BOM (EF BB BF) in your resource. Consider removing it, perhaps using a hex editor. See also: How do I remove the BOM character from my xml file
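If you would rather not poke at bytes in a hex editor, a small C# sketch along these lines (the path is taken from the command line; nothing here is specific to any one tool) strips the EF BB BF signature when it is present:

    using System;
    using System.IO;

    class StripBom
    {
        static void Main(string[] args)
        {
            string path = args[0];   // the file flagged by the validator
            byte[] bytes = File.ReadAllBytes(path);

            // The UTF-8 BOM is the three-byte signature EF BB BF at the very start.
            if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            {
                byte[] stripped = new byte[bytes.Length - 3];
                Array.Copy(bytes, 3, stripped, 0, stripped.Length);
                File.WriteAllBytes(path, stripped);
                Console.WriteLine("BOM removed.");
            }
            else
            {
                Console.WriteLine("No UTF-8 BOM found.");
            }
        }
    }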

The W3C Markup Validator does not indicate a BOM in UTF-8 as an error; it would itself be in error if it did, since a BOM is allowed at the start of UTF-8 data. It issues a warning.
The warning is seriously outdated. No problems have been observed in relevant browsers for many years. On the contrary, the BOM should be regarded as useful: if, for example, a file is saved locally (and the HTTP headers are thus lost), the BOM lets browsers infer, with practical certainty, that the document is UTF-8 encoded.
The Semantic data extraction tool is not very up to date, and it suffers from an overly theoretical approach, but it does not seem to have any problem with a BOM at the start of UTF-8 data.
It is possible that the server adds the BOM, or that your authoring tool adds it. Either way, it should be considered useful rather than a problem.

Related

Are paths in content.opf URL-encoded by standard?

When processing EPUB files, I've run into the issue that in some EPUB books the paths of the XHTML files are written into content.opf URL-encoded.
For example, the path "abcá.xhtml" is written into content.opf as href="abc%C3%A1.xhtml" (%C3%A1 being the URL-encoded representation of the character 'á').
I could not find any information about this anywhere. Is this in the EPUB standard? The EPUB file in question was generated with Adobe InDesign.
UPDATE: I tested the epub with the Calibre E-book viewer, with the following results:
Special character in file name, URL-encoded path in content.opf (abcá.xhtml and href="abc%C3%A1.xhtml"): Calibre opens the epub with no problem.
Special character in file name, special character is directly written into path in content.opf with UTF-8 (abcá.xhtml and href="abcá.xhtml"): Calibre opens the epub with no problem.
File name contains a string which happens to be URL-decodable, and the same string is written into the content.opf (abc%C3%A1.xhtml and href="abc%C3%A1.xhtml"): Calibre cannot open the epub and displays an error message.
So I guess that Calibre URL-decodes every path in the content.opf before it tries to open the files, which can lead to weird edge cases like the last one.
However this seems to be quite a rare case, so I think I am going to process the paths the same way by URL-decoding them.
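A minimal sketch of that decode-with-fallback approach (the ResolveHref helper and the set of archive entry names are hypothetical), which also covers the last edge case above:

    using System;
    using System.Collections.Generic;

    static class OpfPaths
    {
        // Try the percent-decoded form of an href first, then fall back to the
        // literal href, which covers file names that merely look URL-encoded.
        public static string ResolveHref(string href, ISet<string> archiveEntries)
        {
            string decoded = Uri.UnescapeDataString(href);   // "abc%C3%A1.xhtml" -> "abcá.xhtml"

            if (archiveEntries.Contains(decoded)) return decoded;
            if (archiveEntries.Contains(href)) return href;

            throw new KeyNotFoundException("No archive entry matches href: " + href);
        }
    }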
It looks like it's probably a bad thing done by InDesign. Two relevant passages from the OPF spec:
From section 1.3.4: Relationship to Unicode
Reading Systems must parse all UTF-8 and UTF-16 characters properly (as required by XML). Reading Systems may decline to display some characters, but must be capable of signaling in some fashion that undisplayable characters are present. Reading Systems must not display Unicode characters merely as if they were 8-bit characters.
And section 1.4 Conformance
1.4.1.1: Package Conformance
Each conformant OPF Package Document must meet these necessary conditions:
it is a well-formed XML document (as defined in XML 1.0); and
it is encoded in UTF-8 or UTF-16; and
...
My reading of that is that a reading system needs to be capable of parsing href="abcá.xhtml", and so that is what InDesign should put in the .opf file.
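For illustration, a short sketch of pulling the manifest hrefs out of content.opf with an ordinary XML API; the namespace is the usual OPF 2.0 one, and since XML parsers are Unicode-aware, an href written directly as "abcá.xhtml" comes back untouched:

    using System;
    using System.Xml.Linq;

    class OpfManifest
    {
        static void Main()
        {
            XNamespace opf = "http://www.idpf.org/2007/opf";   // OPF 2.0 package namespace
            XDocument doc = XDocument.Load("content.opf");

            // Print every manifest item's href exactly as the parser delivers it.
            foreach (XElement item in doc.Descendants(opf + "item"))
            {
                Console.WriteLine((string)item.Attribute("href"));
            }
        }
    }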

How to repair Unicode letters?

Someone sent me letters like this by email:
IVIØR†€™
which are supposed to read correctly as:
IVIØR†€™
How do I restore them to the original Portuguese? The text got altered after being passed through an HTTP GET request.
I probably will not be able to fix the site, but maybe I could create a tool to repair these badly encoded letters? Does anyone know of such a repair tool, or how to do it manually by hand? It seems like nothing is lost, just badly interpreted.
What happened here is that UTF-8 got misinterpreted as ISO-8859-1; and then other kinds of mangling (the bad ISO-8859-1 string being re-UTF-8-encoded; the non-breaking space character '\xA0' being converted to regular space '\x20') seem to have happened afterward, though those may just be a result of pasting it into Stack Overflow.
Due to the subsequent mangling, there's no really good way to completely undo it, but you can largely undo it by passing it through a not-very-strict UTF-8 interpreter. For example, if I save "IVIØR†€™" as a text-file on my computer, using Notepad, with the "ANSI" (single-byte) encoding, and then I open it in Firefox and tell it to interpret it as UTF-8 (Firefox > Web Developer > Character Encoding > Unicode (UTF-8)), then it displays "IVIØR� €™". (The "�" is because of the '\xA0' having been changed to '\x20', which broke the UTF-8 encoding.)
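In code, that "pass it through a UTF-8 interpreter" step amounts to re-encoding the mangled text as ISO-8859-1 bytes and then decoding those bytes as UTF-8. A small sketch with a made-up Portuguese example (characters already destroyed by the later mangling, such as the '\xA0' to '\x20' change, cannot be recovered this way):

    using System;
    using System.Text;

    class MojibakeRepair
    {
        static void Main()
        {
            // "Olá" saved as UTF-8 but decoded as ISO-8859-1 shows up as "OlÃ¡".
            string mangled = "OlÃ¡";

            // Reverse the misinterpretation: recover the bytes an ISO-8859-1 reader saw,
            // then decode those bytes as the UTF-8 they originally were.
            byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(mangled);
            string repaired = Encoding.UTF8.GetString(latin1Bytes);

            Console.WriteLine(repaired);   // "Olá"
        }
    }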
They're probably not broken. It's just a difference between the encoding they were sent in, vs. the decoding you're viewing them in.
Figure out what encoding was originally used, and use the same one to decode it, and it should look like the original. In terms of writing a "fix-it" tool, you'd always need to know what encoding they were originally created in, which can be complicated depending on the source, and whether or not you have access to said information.

ASP.NET requestEncoding and responseEncoding UTF-8 or ISO-8859-1

In a Microsoft security document, in the Code Review section ( http://msdn.microsoft.com/en-us/library/aa302437.aspx ), it suggests setting globalization.requestEncoding and globalization.responseEncoding to "ISO-8859-1" as opposed to "UTF-8" or another Unicode encoding.
What are the downsides to using "ISO-8859-1"? In the past I've set both to UTF-8 for maximum compatibility.
The downside is that it's not as compatible. In fact, there are lots of reasons not to use anything but UTF-8.
I looked at that doc page and I'm not sure it's actually suggesting that you use Latin-1 - I think it might just be using that as an example.
The HttpUtility encoding methods all use UTF-8 by default, so unless you really didn't want international characters coming in with your inputs, I don't see any reason to set it to Latin-1.
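For reference, a minimal web.config sketch of the two settings in question, both set to UTF-8:

    <configuration>
      <system.web>
        <!-- UTF-8 for both request and response encoding. -->
        <globalization requestEncoding="utf-8" responseEncoding="utf-8" />
      </system.web>
    </configuration>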
That page doesn't seem to recommend ISO-8859-1 specifically, all it says is:
"To help prevent attackers using canonicalization and multi-byte escape sequences to trick your input validation routines, check that the character encoding is set correctly to limit the way in which input can be represented."
Also, on another page it says "Both approaches are shown below using the ISO-8859-1 character encoding, which is the default in early versions of HTML and HTTP"

Warning when validating my website with http://validator.w3.org?

I created a simple test page on my website www.xaisoft.com and it had no errors, but it came back with the following warning, and I am not sure what it means:
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting):
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
Depending on your editor, you might find an option in the preferences to indicate it should save unicode documents without a BOM... or change editor ^^
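If the page is generated from .NET code rather than saved from an editor, a small sketch of writing UTF-8 without the signature: UTF8Encoding(false) omits the BOM that passing Encoding.UTF8 would write.

    using System.IO;
    using System.Text;

    class SaveWithoutBom
    {
        static void Main()
        {
            string html = "<!DOCTYPE html>";   // stand-in for the real page markup

            // false = do not emit the EF BB BF identifier at the start of the output.
            var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

            File.WriteAllText("page.html", html, utf8NoBom);
        }
    }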
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it's Unicode and what byte order it is in. You might not expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly resave it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-r, or that sort of thing, don't. Use the HTML entities instead.

File names containing non-ascii international language characters

Has anyone had experience generating files that have filenames containing non-ascii international language characters?
Is doing this an easy thing to achieve, or is it fraught with danger?
Is this functionality expected from Japanese/Chinese speaking web users?
Should file extensions also be international language characters?
Info: We currently support multiple languages on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET Framework. This would be used in a scenario where international users could choose a common format and name for their files.
Is this functionality expected from Japanese/Chinese speaking web users?
Yes.
Is doing this an easy thing to achieve, or is it fraught with danger?
There are issues. If you are serving files directly, or otherwise have the filename in the URL (e.g. http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.
But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:
Content-Disposition: attachment;filename="こんにちは.txt"
How do we encode those characters into the filename parameter? Well, it would be nice if we could just dump it in as UTF-8, and that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8, and you can't really guess what it's going to be in advance of sending the header.
Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.
So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
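A small sketch of that trick (the /files/{id}/ route and the helper name are hypothetical): build the download URL so that its last segment is the percent-encoded name, and have the handler behind it send a bare Content-Disposition: attachment with no filename parameter.

    using System;

    static class DownloadLinks
    {
        // Put the (percent-encoded) file name in the last path segment of the URL,
        // so the browser derives the saved name from the URL rather than the header.
        public static string ForFile(string id, string displayName)
        {
            // "こんにちは.txt" -> "%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt"
            return "/files/" + Uri.EscapeDataString(id) + "/" + Uri.EscapeDataString(displayName);
        }
    }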
Should file extensions also be international language characters?
In principle yes, file extensions are just part of the filename and can contain any character.
In practice on Windows I know of no application that has ever used a non-ASCII file extension.
One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to allow Asians to type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.
That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘42’ integer or ‘.txt’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to ‘Unicode Normal Form NFKC’ before doing anything with them.
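A minimal sketch of that normalisation step in .NET, assuming the full-width sample input shown in the comment:

    using System;
    using System.Text;

    class NormalizeInput
    {
        static void Main()
        {
            // Full-width digits and letters, as an East Asian IME might produce them.
            string raw = "４２.ｔｘｔ";

            // NFKC folds these compatibility characters down to their plain ASCII forms.
            string folded = raw.Normalize(NormalizationForm.FormKC);

            Console.WriteLine(folded);   // "42.txt"
        }
    }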
Refer to this overview of file name limitations on Wikipedia.
You will have to consider where your files will travel, and stay within the most restrictive set of rules.
From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.
The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.
I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:
It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors/tools so that your tools understand those characters.
Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitors, that they need a browser which uses the relevant encoding.
Your file extensions need not be i18n-ed.
My two cents:
The key thing for international file names is to build URLs the way bobince suggested:
www.example.com/files/%E3%81%93%E3%82%93%E3.txt
I had to make a special routine for IE7, since it crops the filename if it is longer than 30 characters. So instead of "Your very long file name.txt" the file will appear as "%d4y long file name.txt". The interesting thing, however, is that IE7 actually understands the header attachment;filename=%E3%81%93%E3%82%93%E3.txt correctly.
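For illustration only, a sketch of building that header value the way described above; browser behaviour still varies, as the earlier answer explains, so treat it as an observation about IE7 rather than a general guarantee.

    using System;

    static class AttachmentHeader
    {
        // The file name percent-encoded as UTF-8 and placed directly in the
        // filename parameter, as in the IE7 example above.
        public static string For(string fileName)
        {
            return "attachment;filename=" + Uri.EscapeDataString(fileName);
            // e.g. "attachment;filename=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt"
        }
    }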
