I need to sanitise user input (or output) for a web app I'm developing. The user input is just plain text, and I want to prevent HTML or other "harmful" strings. However, characters such as less than, greater than, apostrophes, ampersands, quotes, etc., should be allowed.
I guess the first step is to disable request validation to prevent the generic "a potentially dangerous value was detected" message, but what else do I need to do? I can't simply htmlencode the output, otherwise I'll end up with &lt; being displayed in place of a less-than character, for example.
Are there any tools that can help? I had a quick look at the AntiXSS library but from what I've seen it's just a glorified htmlencoder, or am I missing something? What about MVC - does this have anything built in?
I've never found a decent article on this kind of thing. Some say to sanitise input, while others say to sanitise output, and examples are typically over-simplistic, using techniques like htmlencoding, which will reformat perfectly valid characters such as a less-than sign.
The Anti-XSS library is the standard library in ASP.NET WebForms for now, though it is suboptimal, and the latest version (4.2) has several breaking bugs that haven't been fixed in a while.
Also see the MSDN article Information Security - Anti-Cross Site Scripting.
See Should I use the Anti-XSS Security Runtime Engine in ASP.NET MVC? for your answer regarding MVC. From that answer:
Phil Haack has an interesting blog post here: http://haacked.com/archive/2009/02/07/take-charge-of-your-security.aspx
He suggests using Anti-XSS combined with CAT.NET.
Related
What are the techniques that one can use to prevent cross-site scripting in ASP.NET? Are there any ready-made implementations that one can use to achieve a website protected against XSS?
We did in-house development for this purpose for a long time, but Microsoft finally provided a library for it. We have now replaced our library with it completely. It can be used as simply as follows:
string sanitizedString = Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(myStringToBeChecked);
The only problem with this method is that it trims runs of whitespace next to line-ending characters. If you do not want that to happen, you can split the string on the line endings (\r\n), record the whitespace before and after each split segment, apply the sanitizer to the segment, then re-attach the whitespace and concatenate the results.
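A minimal sketch of that workaround, reusing the same Anti-XSS Sanitizer as above (the helper name is hypothetical):

    using System;
    using System.Text;
    using Microsoft.Security.Application;

    static string SanitizePreservingWhitespace(string input)
    {
        var lines = input.Split(new[] { "\r\n" }, StringSplitOptions.None);
        var result = new StringBuilder();
        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i];
            // Find the leading/trailing whitespace the sanitizer would otherwise trim.
            int start = 0;
            while (start < line.Length && char.IsWhiteSpace(line[start])) start++;
            int end = line.Length;
            while (end > start && char.IsWhiteSpace(line[end - 1])) end--;
            // Sanitize only the core, then re-attach the surrounding whitespace.
            result.Append(line.Substring(0, start));
            result.Append(Sanitizer.GetSafeHtmlFragment(line.Substring(start, end - start)));
            result.Append(line.Substring(end));
            if (i < lines.Length - 1)
                result.Append("\r\n");
        }
        return result.ToString();
    }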
Other than that, the Microsoft library works fine.
The Microsoft Anti-Cross Site Scripting Library is a good start. It has some useful helper methods to prevent XSS, and it is now part of the larger Microsoft Web Protection Library.
To protect against cross-site scripting attacks:
1. Use HtmlEncoding when saving input content received from web application controls such as textboxes.
2. Use InnerText instead of InnerHtml when displaying data on the page.
3. Sanitize input before saving the data to the database.
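For illustration, a brief WebForms sketch of points 1 and 2 (the control and method names are hypothetical):

    // Point 1: HTML-encode the textbox input before persisting it.
    string encoded = Server.HtmlEncode(txtComment.Text);
    SaveComment(encoded);                  // hypothetical data-access call

    // Point 2: use InnerText so stored markup is displayed, not interpreted.
    divComment.InnerText = LoadComment();  // divComment is a <div runat="server">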
My colleagues and I have been debating how best to protect ourselves from XSS attacks while still preserving HTML characters that get entered into fields in our software.
To me, the ideal solution is to accept the data as the user enters it (turning off ASP.NET request validation) and throw it in the database exactly as they entered it. Then, whenever you display the data on the web, HTML-encode it. The problem with this approach is that there's a high likelihood that a developer somewhere, someday, will forget to HTML-encode the display of a value somewhere. Bam! XSS vulnerability.
Another solution that was proposed was to turn request validation off and strip out any HTML users enter, using a regex, before it is stored in the database. Devs will still have to HTML-encode things for display, but since you've stripped out any HTML tags, even if a dev forgets, we think it would be safe. The drawback is that users can't enter HTML tags into descriptions and fields and things, even if they explicitly want to, or they may accidentally paste in an email address surrounded by < > and the regex strips it out... whatever. It screws with the data, and it's not ideal.
The other issue we have to keep in mind is that the system has been built in fear of commitment to any one strategy around this. At one point, some devs wrote some pages to HTML-encode data before it gets entered into the database. So some data may already be HTML-encoded in the database and some data is not - it's a mess. We can't really trust any data that comes from the database as safe for display in a browser.
My question is: what would be the ideal solution if you were building an ASP.NET web app from the ground up, and what would be a good approach for us, given our situation?
Assuming you go ahead and store the HTML directly in the database: in ASP.NET MVC's Razor views, HTML encoding is done automatically, so your negligent developer would have to really go above and beyond the call of duty to introduce the XSS. With standard WebForms (or the WebForm view engine), you can force developers to use the <%: syntax, which will accomplish the same thing (albeit with more risk that a developer will be negligent).
Furthermore, you could consider only selectively disabling request validation. Do you really need to support it for every request? The vast majority of requests, presumably, would not need to preserve (or allow) the HTML.
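As an example, a sketch of the selective approach in ASP.NET MVC (the action, model, and repository names are hypothetical):

    // Request validation stays on for every other action; only this one accepts raw HTML.
    [HttpPost]
    [ValidateInput(false)]
    public ActionResult SaveDescription(DescriptionModel model)
    {
        _repository.Save(model);          // hypothetical persistence call
        return RedirectToAction("Index"); // Razor's @ (or <%: %> in WebForms views) encodes on display
    }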
Using a regex to strip HTML is fairly easy to defeat and very difficult to get correct. If you want to clean HTML input, it's better to use an actual parser to enforce strict XML compliance.
What I would do in this situation is store two fields in the database: a clean and a raw version of the data. When the user wants to edit their content, you send them the raw data. When they submit changes, you sanitize it and store it in the clean field. Developers then only ever use the clean field when outputting the content to the page. I would even go so far as to name the raw field dangerousRawContent so it's obvious that care must be taken when referencing that field.
The added benefit of this technique is that you can re-sanitize the raw data with improved parsers at a later date without ever losing the originally intended content.
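A hedged sketch of that shape (the class and field names are illustrative, reusing the Anti-XSS sanitizer mentioned earlier in this thread):

    public class Article
    {
        public string CleanContent { get; set; }         // sanitized; the only field views may render
        public string DangerousRawContent { get; set; }  // exactly what the user submitted; for editing/re-sanitizing only
    }

    // On every save, the clean copy is re-derived from the raw copy:
    article.DangerousRawContent = submittedText;
    article.CleanContent = Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(submittedText);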
I am looking for more info on these kinds of HTTP Parameters that are found in ASP.NET web applications:
ctl00$ContentPlaceHolder1$GenericWebUserControl$StartDate5
ctl00$ContentPlaceHolder1$_rptStateLabels$ctl00$_rptFacilities$ctl01$_btnSelectFacilit.x
I want to understand the logic behind these input element names:
How are they generated?
How does a given name like this map to some given HTML or web-application structure?
What is the significance of the "common" parts of the parameter names, like the ctl### bits (or any other recurring pieces I haven't noticed a pattern in)?
How many of those should I expect to see?
I am looking at this as someone who wants to understand the HTTP requests being sent to such an application - i.e. when I can expect to see such-and-such HTTP parameter versus something else, given some structure of the site.
I haven't found this in the ASP.NET docs, though I am not really familiar with them - any pointers are appreciated. To be clear, I am not asking as an ASP.NET programmer, which I'm not (i.e. I don't want to know how to code this kind of thing in ASP.NET), but as someone analyzing the web traffic at the HTTP level who wants to know the significance of these parameters to the web application and how to parse them - i.e. to understand their structure, not as a machine, but as a human (which I am).
These look like ASP.Net-generated control names, not ASP.Net MVC control names.
In ASP.Net MVC, you have control over them because you have to consume them directly, whereas in ASP.Net you're divorced from the naming conventions and instead consume events that the controls generate on postback (or talk to the button/whatever controls to access their values).
I'm not really clear what you're looking to gain from being able to parse the structure of the names when ASP.Net already does that for you.
This looks like the element ID assigned by the framework to server-side web/HTML controls. It is not really related to HTTP parameters as such.
The '$' is generally the delimiter between parent and child controls' IDs.
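For illustration, here is a hypothetical reconstruction of the second parameter above. Each naming container contributes one segment to the control's UniqueID, joined with '$', and controls without an explicit ID are assigned auto-generated ctlNN names, numbered within their container:

    Master page content control (auto ID)        -> ctl00
      ContentPlaceHolder "ContentPlaceHolder1"   -> ContentPlaceHolder1
        Repeater "_rptStateLabels"               -> _rptStateLabels
          repeater item #0 (auto ID)             -> ctl00
            Repeater "_rptFacilities"            -> _rptFacilities
              repeater item #1 (auto ID)         -> ctl01
                ImageButton "_btnSelectFacility" -> _btnSelectFacility

    UniqueID: ctl00$ContentPlaceHolder1$_rptStateLabels$ctl00$_rptFacilities$ctl01$_btnSelectFacility

The trailing .x in the posted name is the click X-coordinate that an image button (<input type="image">) submits; a matching .y parameter is sent alongside it.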
If I HTML-encode any data entered by website users when I redisplay it, will this prevent XSS vulnerabilities?
Also, is there a tool/product available that will sanitize my user input for me, so that I don't have to write my own routines?
There are various subtleties to this question, although the answer in general is yes.
The safety of your website is highly dependent on where you put the data. If you put it as legit text, there is essentially no way for the attacker to execute XSS. If you put it in an attribute, if you forget to escape quotes or don't check for multibyte well-formedness, you have a possible attack. If you put it in a JSON variable, not escaping properly can lead to arbitrary JavaScript. Etc. etc. Context is very important.
Other users have suggested using XSS removal or XSS detection functions. I tend to think of XSS removal as user unfriendly; if I post an email address like <foo#example.com> and your remove XSS function thinks it's an HTML tag, this text mysteriously disappears. If I am running an XSS discussion forum, I don't want people's sample code to be removed. Detection is a little more sensible; if your application can tell when someone is attacking it, it can ban the IP address or user account. You should be careful with this sort of functionality, however; innocents can and will get caught in the crossfire.
Validation is an important part of website logic, but it's also independent of escaping. If I don't validate anything but escape everything, there will be no XSS attacks, but someone can say that their birthday is "the day the music died", and the application wouldn't be the wiser. In theory, strict enough validation for certain data types can perform all the duties of escaping (think numbers, enumerations, etc.), but it's good defense-in-depth practice to escape them anyway. Even if you're 100% sure it's an integer. It might not be.
Escaping plain text is a trivial problem; if your language doesn't give you a function, a string replace of <, >, ", ' and & with their corresponding HTML entities will do the trick. (You only need other HTML entities if you're not using UTF-8.) Allowing HTML tags is non-trivial and merits its own Stack Overflow question.
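A minimal sketch of that string replace, in case no framework helper is at hand (in .NET you would normally just use HttpUtility.HtmlEncode):

    static string EscapeHtml(string s)
    {
        return s.Replace("&", "&amp;")   // ampersand first, or the entities below get double-escaped
                .Replace("<", "&lt;")
                .Replace(">", "&gt;")
                .Replace("\"", "&quot;")
                .Replace("'", "&#39;");
    }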
Encoding your HTML is a start... but it does not protect from all XSS attacks.
If you use PHP, here is a good function you can use in your sites: Kallahar's RemoveXSS() function
If you don't use PHP, at least the code is well commented, explaining the purpose of each section, and could then be adapted to another programming language.
The answer is no, encoding is not enough. The best protection against XSS is a combination of "whitelist" validation of all incoming data and appropriate encoding of all output data. Validation allows the detection of attacks, and encoding prevents any successful script injection from running in the browser. If you are using .NET you can check this library: http://msdn.microsoft.com/en-us/library/aa973813.aspx
You can also check some cheat sheets to test your protections: http://ha.ckers.org/xss.html
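As a hedged example of the whitelist idea, validate each field against the characters you actually expect rather than trying to enumerate the bad ones (the pattern below is illustrative):

    using System.Text.RegularExpressions;

    // Accept only letters, digits, underscore and hyphen, 1-32 characters; reject everything else.
    string userName = Request.Form["username"];
    bool isValidUserName = Regex.IsMatch(userName ?? "", @"^[A-Za-z0-9_\-]{1,32}$");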
HtmlEncoding input gets you a good portion of the way by not allowing the HTML to render to the page.
Depending on your language, facilities should already exist to sanitize the data. In .NET you can use Server.HtmlEncode(txtInput.Text) to encode data from a textbox named txtInput.
As others have mentioned more items are needed to be truly protected.
I'm writing an ASP.NET application that will need to be localized for several regions other than North America. What do I need to do to prepare for this globalization? What are your top one or two resources for learning how to write a world-ready application?
A couple of things that I've learned:
Absolutely and brutally minimize the number of images you have that contain text. Doing so will make your life a billion percent easier since you won't have to get a new set of images for every friggin' language.
Be very wary of css positioning that relies on things always remaining the same size. If those things contain text, they will not remain the same size, and you will then need to go back and fix your designs.
If you use character types in your sql tables, make sure that any of those that might receive international input are unicode (nchar, nvarchar, ntext). For that matter, I would just standardize on using the unicode versions.
If you're building SQL queries dynamically, make sure that you include the N prefix before any quoted text if there's any chance that text might be Unicode (see the sketch after this list). If you end up with garbage in a SQL table, check whether that prefix was there.
Make sure that all your web pages definitively state that they are in a unicode format. See Joel's article, mentioned above.
You're going to be using resource files a lot for this project. That's good - ASP.NET 2.0 has great support for such. You'll want to look into the App_LocalResources and App_GlobalResources folder as well as GetLocalResourceObject, GetGlobalResourceObject, and the concept of meta:resourceKey. Chapter 30 of Professional ASP.NET 2.0 has some great content regarding that. The 3.5 version of the book may well have good content there as well, but I don't own it.
Think about fonts. Many of the standard fonts you might want to use aren't unicode capable. I've always had luck with Arial Unicode MS, MS Gothic, MS Mincho. I'm not sure about how cross-platform these are, though. Also, note that not all fonts support all of the Unicode character definition. Again, test, test, test.
Start thinking now about how you're going to get translations into this system. Go talk to whoever your translation vendor is about how they want data passed back and forth for translation. Think about the fact that, through your local resource files, you will likely be repeating some commonly used strings throughout the system. Do you normalize those into global resource files, or do you have some sort of database layer where only one copy of each text is generated? In our recent project, we used resource files that were generated from a database table containing all the translations and the original, English version of the resource files.
Test. Generally speaking, I will test in German, Polish, and an Asian language (Japanese, Chinese, Korean). German and Polish are wordy and nearly guaranteed to stretch text areas, while Asian languages use an entirely different set of characters, which tests your Unicode support.
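The sketch promised above for the N-prefix point; the table and column names are hypothetical, and a parameterized query sidesteps the problem entirely:

    using System.Data;
    using System.Data.SqlClient;

    // Dynamic SQL: without the N prefix the literal degrades to varchar,
    // and 'São Paulo' may arrive in the table as 'S?o Paulo'.
    string sql = "INSERT INTO Cities (Name) VALUES (N'São Paulo')";

    // Better: a parameter typed as NVarChar keeps the text Unicode end to end.
    using (var cmd = new SqlCommand("INSERT INTO Cities (Name) VALUES (@name)", connection))
    {
        cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100).Value = cityName;
        cmd.ExecuteNonQuery();
    }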
Learn about the System.Globalization namespace.
Also, a good book is .NET Internationalization: The Developer's Guide to Building Global Windows and Web Applications.
It would be good to refresh a bit on Unicode if you are targeting other cultures and languages:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This is a hard problem. I live in Canada, so multilingualism is a big issue. In all my years of doing software development, I've never seen a solution that I liked. I've seen a lot of solutions that worked, and got the job done, but they've always felt like a big kludge. I would go with #harriyott, and make sure that none of your strings are actually in code. A resource file works well for desktop applications. However in ASP.Net, I'd recommend using the database. #John Christensen also has some good pointers.
Make sure you're compiling with Code Analysis turned on, and pay attention to the Globalization warnings that it gives you. Keep data in an invariant format (CultureInfo.InvariantCulture) until you display it to the user (then use CultureInfo.CurrentCulture).
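A small sketch of that rule of thumb (the variable names are illustrative):

    using System.Globalization;

    // Persist and parse using the invariant culture...
    decimal price = 1234.50m;
    string stored = price.ToString(CultureInfo.InvariantCulture);             // "1234.50"
    decimal roundTripped = decimal.Parse(stored, CultureInfo.InvariantCulture);

    // ...and format with the user's culture only at display time.
    string display = roundTripped.ToString("C", CultureInfo.CurrentCulture);  // e.g. "1.234,50 €" under de-DE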
I would seriously consider reading the following code project article:
Globalization and localization demystified in ASP.NET 2.0
It covers everything from cultures and locales to setting the thread's current culture, resource files, encodings - you name it!
And of course it's loaded with pretty pictures and examples :-). Good luck!
I would suggest:
Put all strings in either the database or resource files.
Allow extra space for translated text, as some languages (e.g. German) are wordier.
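For the first suggestion, a minimal sketch of pulling a string from a global resource file (the resource class and key names are hypothetical):

    // Code-behind: fetch from App_GlobalResources/Strings.resx
    lblGreeting.Text = (string)GetGlobalResourceObject("Strings", "WelcomeMessage");

    <%-- Or declaratively, via an explicit resource expression: --%>
    <asp:Label ID="lblGreeting" runat="server" Text="<%$ Resources:Strings, WelcomeMessage %>" />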