I am building an ASP.NET web service that loads other web pages and then hands them to clients.
I have been doing quite well with character encoding handling: I read the meta tag from the HTML and then use that charset to decode the file.
Nevertheless, some less experienced authors don't understand character sets. They declare a specific encoding, e.g. "gb2312", when in fact the page is just plain UTF-8. When I use gb2312 to decode the text, everything turns into a holy mess.
How can I detect whether the text has been decoded properly? I loaded that page in IE, which correctly uses UTF-8 to decode it. How does it achieve that?
Based on the BOM you can tell what encoding is used.
BOM and encoding
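As a rough illustration of the BOM check, here is a minimal Python sketch (Python rather than C#, purely for brevity; the same byte comparisons apply in .NET). The helper name `sniff_bom` is made up for this example, and a missing BOM tells you nothing, so it returns `None` in that case.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# The codec names are ones that consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

Most web pages carry no BOM at all, so this is only a first check before falling back to other heuristics.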
If you want to detect the character set, you could use the C# port of Mozilla's character set detector.
CharDetSharp
If you want to be extra sure that you are using the correct one, you could look for special characters that are not supposed to be there. A page is not very likely to include "óké". So you could look for such characters and, if you find them, try a different encoding/character set to process your file.
Actually, it is really hard to make your application completely "fool-proof".
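One common heuristic for the "declared gb2312 but really UTF-8" case is to try a strict UTF-8 decode first, because text in another multi-byte encoding rarely happens to be valid UTF-8. The sketch below shows that idea in Python; `decode_page` and its return shape are made up for this example, and it is a heuristic, not a guarantee.

```python
from typing import Tuple


def decode_page(data: bytes, declared: str) -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8 over the declared charset."""
    try:
        # Strict decoding fails on byte sequences that are not well-formed UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        return data.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        # Unknown or wrong declared charset: keep what can be read.
        return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# A page that declares gb2312 but is actually UTF-8 is decoded as UTF-8.
text, used = decode_page("中文网页".encode("utf-8"), "gb2312")
```

Pure-ASCII pages decode identically either way, so the order of the attempts only matters when non-ASCII bytes are present.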
As a security measure we're using the Microsoft.Security.Application.Encoder.HtmlEncode method to encode and render values that have been stored in our database by various users.
We would like to allow the user to use single quotes, but they are being encoded as &#39;
Does anyone know of a safe way to allow single quotes to render but ensure the rest of the input is encoded? Is it just a case of replacing after the encoding has taken place? This approach seems a bit hacky.
I got to the bottom of this. The web control was also encoding the input data and therefore html encoding was taking place twice.
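The double-encoding effect is easy to reproduce. The sketch below uses Python's standard `html.escape` as a stand-in for the encoder (an assumption for illustration; it writes the single quote as `&#x27;`, whereas the output quoted above uses `&#39;`), but the mechanism is the same: the second pass escapes the `&` of the first pass, so the browser shows the entity text instead of the quote.

```python
import html

raw = "O'Reilly"

once = html.escape(raw)    # "O&#x27;Reilly"     -> browser renders O'Reilly
twice = html.escape(once)  # "O&amp;#x27;Reilly" -> browser renders O&#x27;Reilly

# A browser undoes exactly one level of encoding when it renders the page.
rendered_once = html.unescape(once)
rendered_twice = html.unescape(twice)
```

So the fix is to encode in exactly one place, not to patch the quotes back in afterwards.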
I want to use the Microsoft AntiXss library for my project. When I use the Microsoft.Security.Application.Encoder.HtmlEncode(str) function to safely show some value in my web page, it encodes Farsi characters, which I consider to be safe. For instance, it converts لیست to &#1604;&#1740;&#1587;&#1578;. Am I using the wrong function? How can I print the user input in my page safely?
I'm currently using it like this:
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>
I think I messed up! The Razor view encodes values unless you use @Html.Raw, right? Well, I encoded the string and then Razor encoded it again. So in the end it got encoded twice, hence the weird-looking characters (Unicode values)!
If your encoding (let's assume that it's Unicode by default) supports Farsi, it's almost always safe to use Farsi in ASP.NET MVC without any additional effort.
First of all, escape-on-input is just wrong: you've taken some input and applied a transformation that has nothing to do with that data. It's generally wrong to encode your data immediately after you receive it from the user. You should store the data in its raw form in your database and encode it only when you display it to the user, according to the vulnerabilities of the system that displays it. For example, the 'dangerous' HTML characters are not 'dangerous' for SQL, Android, etc., and that's one of the main reasons why you shouldn't encode the data when you store it on the server. One more reason: when you HTML-encode a string, each encoded character can expand to 6-7 characters, which can be a problem with server constraints on string length. When you store the data in SQL Server, you should escape, validate, and sanitize it only for that target and guard only against its vulnerabilities (like SQL injection).
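To make the "store raw, encode on output" idea concrete, here is a small Python sketch using the standard `sqlite3` and `html` modules. It is only an illustration: the `comments` table and the `save_comment` / `render_comment` helpers are made up for this example, and in ASP.NET the equivalents would be parameterized SQL commands plus Razor's built-in output encoding.

```python
import html
import sqlite3


def save_comment(conn, text):
    """Store the text exactly as received; the ? placeholder guards against SQL injection."""
    cur = conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))
    return cur.lastrowid


def render_comment(conn, comment_id):
    """HTML-encode only at the moment the value is written into HTML."""
    (body,) = conn.execute(
        "SELECT body FROM comments WHERE rowid = ?", (comment_id,)
    ).fetchone()
    return "<p>" + html.escape(body) + "</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

user_input = "<b>لیست</b>'; DROP TABLE comments; --"
comment_id = save_comment(conn, user_input)
page_fragment = render_comment(conn, comment_id)
```

The database keeps the original characters, so the same row can later be rendered for a different target (JSON, plain text, a mobile client) with whatever escaping that target needs.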
Now, for ASP.NET MVC and Razor, you don't need to HTML-encode your strings, because that is done by default unless you use Html.Raw(). Generally you should avoid Html.Raw() (or HTML-encode the value yourself when you do use it). Also, if you double-encode your data, you'll end up with corrupted output :)
I hope this helps to clear things up.
Someone sent me an email containing letters like this:
IVIÃ˜Râ€ â‚¬â„¢
It is supposed to be:
IVIØR†€™
How do I restore them to their original Portuguese? The text got altered after being passed through an HTTP GET request.
I probably won't be able to fix the site, but maybe I could create a tool to repair these badly encoded letters. Does anyone know of such a repair tool, or how to do it by hand? It seems like nothing is lost, just badly interpreted.
What happened here is that UTF-8 got misinterpreted as ISO-8859-1; and then other kinds of mangling (the bad ISO-8859-1 string being re-UTF-8-encoded; the non-breaking space character '\xA0' being converted to regular space '\x20') seem to have happened afterward, though those may just be a result of pasting it into Stack Overflow.
Due to the subsequent mangling, there's no really good way to completely undo it, but you can largely undo it by passing it through a not-very-strict UTF-8 interpreter. For example, if I save "IVIÃ˜Râ€ â‚¬â„¢" as a text file on my computer, using Notepad, with the "ANSI" (single-byte) encoding, and then open it in Firefox and tell it to interpret the file as UTF-8 (Firefox > Web Developer > Character Encoding > Unicode (UTF-8)), then it displays "IVIØR� €™". (The "�" is because of the '\xA0' having been changed to '\x20', which broke the UTF-8 encoding.)
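The Notepad-and-Firefox round trip can also be scripted. The Python sketch below is one possible way to do it, assuming the "ANSI" code page was Windows-1252; `repair_mojibake` is a name made up for this example. It turns the mangled characters back into the bytes they came from and decodes those bytes leniently as UTF-8.

```python
def repair_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes read as Windows-1252' by reversing the wrong decode."""
    raw_bytes = text.encode("cp1252")                   # back to the original bytes
    return raw_bytes.decode("utf-8", errors="replace")  # lenient: bad bytes become U+FFFD


original = "IVIØR†€™"

# Simulate the damage: UTF-8 bytes misread as a single-byte code page...
mangled = original.encode("utf-8").decode("cp1252")
# ...and then the non-breaking space turned into a regular space.
damaged = mangled.replace("\xa0", " ")

fully_repaired = repair_mojibake(mangled)
partly_repaired = repair_mojibake(damaged)  # the dagger is lost, the rest survives
```

When the non-breaking space is still intact, the text comes back completely; once it has been replaced by a regular space, the dagger cannot be recovered from the bytes alone and shows up as a replacement character.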
They're probably not broken. It's just a difference between the encoding they were sent in, vs. the decoding you're viewing them in.
Figure out what encoding was originally used, and use the same one to decode it, and it should look like the original. In terms of writing a "fix-it" tool, you'd always need to know what encoding they were originally created in, which can be complicated depending on the source, and whether or not you have access to said information.
I created a simple test page on my website www.xaisoft.com and it had no errors, but it came back with the following warning and I am not sure what it means.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting):
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
Depending on your editor, you might find an option in the preferences to save Unicode documents without a BOM... or change editor ^^
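If the editor offers no such option, the BOM can also be removed after the fact. A minimal Python sketch, assuming the file is UTF-8 (`strip_utf8_bom` is a name made up for this example):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# "utf-8-sig" writes a BOM when encoding, which makes it handy for a quick check.
with_bom = "<html></html>".encode("utf-8-sig")
without_bom = strip_utf8_bom(with_bom)
```

The text itself is unchanged; only the three signature bytes at the start of the file go away.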
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it's Unicode and what byte-order it is in. You don't expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly resave it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-r, or that sort of thing, don't. Use the HTML entities instead.
Inside an asp.net page, should I use
<html><title>My page's title from México</title></html>
Or
<html><title>My page&rsquo;s title from M&eacute;xico</title></html>
Both examples have the same output. Since asp.net encodes all my pages to utf-8, there is no need to use html entities, is that right?
The ASCII table is a set of characters, arguably the first standardized character set, from back in the days when you could only spare 1 byte per character. http://asciitable.com/ But I did some looking around at the extended ASCII character set, and it appears that the character you are referencing is in it. So there really isn't a problem whichever way you choose to display your title.
My revised answer is to go for the less expensive one in terms of space (i.e. the first one).
The second example will ensure compatibility with the ASCII standards of HTML transmission. So my vote is for the second example, so that you don't have to ensure the HTML is output and encoded as UTF-8 all the way through all the proxy servers and any other kind of caching and translation that might occur.
You're correct; as long as there's Unicode at both ends of the pipe, it really doesn't matter. Personally, I would use the first simply because it's more readable.
And, honestly, Unicode has been widespread for some time. I personally believe that it's time to leave anyone who can't handle UTF-8 behind.
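Both points made in these answers (same rendered text, different size on the wire) can be checked quickly. The Python sketch below uses the standard `html.unescape` to stand in for what a browser does with entities; the variable names are made up for this example, and a typographic apostrophe is used in both titles so that they are directly comparable.

```python
import html

literal_title = "My page’s title from México"
entity_title = "My page&rsquo;s title from M&eacute;xico"

# After entity decoding, the two forms are the same text.
same_text = html.unescape(entity_title) == literal_title

# The entity form is pure ASCII, so it survives any ASCII-compatible encoding,
# but it needs more bytes than the literal form sent as UTF-8.
literal_size = len(literal_title.encode("utf-8"))
entity_size = len(entity_title.encode("ascii"))
```

So the trade-off is robustness against a misdeclared encoding versus a smaller, more readable source.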