Storing HTML in db while avoiding persistent xss/sql injection - asp.net

I'm building a page in ASP.NET that will use TinyMCE to provide a rich text editor on the page. TinyMCE outputs the rich text as HTML, which I would like to save to a database. Then, at a later date, I want to pull the HTML from the database and display it in a page.
I'm concerned about allowing malicious HTML or JS tags into my database that would later be output.
Can someone walk me through at what point in my process I should HTML encode/decode, etc., to prevent a persistent XSS attack and/or SQL injection attack?

We use the Microsoft Web Protection Library to scrape out any potentially dangerous HTML on the way in. What I mean by "on the way in" - when the page is posted to the server, we scrub the HTML using MS WPL and take the results of that and throw that into the database. Don't even let any bad data get to your database, and you'll be safer for it. As far as encoding, you won't want to mess with HTML encoding/decoding - just take whatever is in your tinyMCE control, scrub it, and save it. Then on your display page, just write it out like it exists in your database into a literal control or something like that, and you should be good.
I believe Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(input) will do exactly what you want here.

Are these admins that are using the RTE? If so, I wouldn't worry about it.
If not, then I don't recommend using a WYSIWYG editor such as TinyMCE. You'll have to actually look for malicious input, and chances are you will miss some. Since the RTE outputs plain HTML, which I assume you want, you can't just convert HTML entities. That would kind of eliminate the whole point of using TinyMCE.
Stopping SQL injection is done in the backend when inserting the data into the database. You will want to use a parametrized query or escape the input (not sure how in ASP.NET, I'm a PHP guy.)
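For the ASP.NET side of that, a parameterized insert might look roughly like the sketch below. It assumes SQL Server via System.Data.SqlClient; the PostStore class, the Posts table and the Body column are made-up names for illustration only.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class PostStore
{
    // Builds an INSERT whose HTML body travels as a parameter,
    // never as part of the SQL text itself.
    // "Posts" and "Body" are hypothetical table/column names.
    public static SqlCommand BuildInsert(string sanitizedHtml)
    {
        var cmd = new SqlCommand("INSERT INTO Posts (Body) VALUES (@body)");
        // Size -1 with SqlDbType.NVarChar maps to NVARCHAR(MAX).
        cmd.Parameters.Add("@body", SqlDbType.NVarChar, -1).Value = sanitizedHtml;
        return cmd;
    }

    // Opens a connection and runs the insert.
    public static void Save(string connectionString, string sanitizedHtml)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = BuildInsert(sanitizedHtml))
        {
            cmd.Connection = conn;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

Because the value is bound as a parameter, quotes or SQL keywords inside the HTML are stored as data and never parsed as SQL. This only addresses SQL injection; the stored HTML still has to be sanitized for XSS separately.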

Couldn't you use a rich text editor that uses BBCode and on the server, escape everything that needs to be escaped and convert BBCode to HTML markup afterwards?
You could also, instead of producing BBCode on the client, convert the HTML markup to BBCode on the server, escape the remaining HTML and convert the result from BBCode back to HTML.
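A minimal sketch of the "escape everything, then convert BBCode" order, assuming only [b], [i] and [u] are supported. The BbCode class and its tag list are hypothetical, and WebUtility.HtmlEncode is used so the snippet does not depend on System.Web.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class BbCode
{
    // Hypothetical set of supported BBCode tags.
    static readonly string[] Tags = { "b", "i", "u" };

    public static string ToHtml(string input)
    {
        // 1. Encode everything, so any raw HTML the user typed becomes inert text.
        string html = WebUtility.HtmlEncode(input);

        // 2. Only then turn the known BBCode pairs into real HTML tags.
        //    Unmatched or unknown BBCode is left as literal text.
        foreach (string tag in Tags)
        {
            html = Regex.Replace(
                html,
                @"\[" + tag + @"\](.*?)\[/" + tag + @"\]",
                "<" + tag + ">$1</" + tag + ">",
                RegexOptions.Singleline);
        }
        return html;
    }
}
```

With this, an input such as `[b]hi[/b] <script>alert(1)</script>` comes out as `<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;`. The order matters: if the BBCode were converted first, the encoding step would also escape the tags you just generated.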

There are two approaches; you will probably use the first one:
1) Make a list of permitted tags and escape or strip the rest. TinyMCE probably has some feature to restrict which tags the user can use (but this is client-side only; you should still validate on the server).
2) Encode permitted tags differently ([b]bold[/b]). Then you can save everything to the DB and, when rendering, escape everything and then interpret your special tags.
Third approach: if the user is an admin (someone who should know what he is doing), then you can leave everything unescaped. He is responsible for his own mistakes.

Related

Allowing user-created templates in an ASP.NET site

I have a website I’m converting from Classic ASP to ASP.NET. The old site allowed users to edit the website template to create their own design for their section of the website (think MySpace, only LESS professional.)
I’ve been racking my brain trying to figure out how to do this with .NET. My sites generally use master pages, but obviously that won’t work for end-users.
I’ve tried loading the HTML templates as a regular text file and parsing it to ‘fit around’ the content place holders. It is as ugly as it sounds.
There’s got to be something generally regarded as the best practice here, but I sure can’t find it.
Any suggestions?
How much control do you want your users to have?
There are a few ways to implement this. I'll give you a quick summary of all the ideas I can think of:
Static HTML with predefined fields.
With this approach you get full control and you minimize the risk of any security vulnerabilities. You would store per-user HTML in a database table somewhere. This HTML would have predefined fields with some markup, like using {fieldName}. You would then use a simple parser to identify curly brackets and replace fieldName with a string value pulled from a Dictionary<String,String> somewhere.
This approach can be fast if you're smart with string processing (i.e. using a state-machine parser to find the curly brackets, sending the rebuilt string either directly to the output or into a StringBuilder, and, importantly, not using String.Replace). The downside is that it isn't a real templating system: there's no provision for looping over result sets (assuming you want to allow that) or for expression evaluation. But it works for simple "insert this content into this space"-type designs.
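As a rough sketch of the predefined-fields idea: a single forward scan over the template that copies text into a StringBuilder and substitutes {fieldName} from a dictionary. The FieldTemplate class name and the choice to leave unknown fields untouched are my assumptions, not requirements.

```csharp
using System.Collections.Generic;
using System.Text;

public static class FieldTemplate
{
    // Replaces {name} with fields[name] in one pass; no String.Replace involved.
    public static string Render(string template, IDictionary<string, string> fields)
    {
        var output = new StringBuilder(template.Length);
        int i = 0;
        while (i < template.Length)
        {
            char c = template[i];
            if (c != '{')
            {
                output.Append(c);
                i++;
                continue;
            }

            int close = template.IndexOf('}', i + 1);
            if (close < 0)
            {
                // No closing bracket: copy the rest verbatim and stop.
                output.Append(template, i, template.Length - i);
                break;
            }

            string name = template.Substring(i + 1, close - i - 1);
            string value;
            if (fields.TryGetValue(name, out value))
                output.Append(value);
            else
                output.Append(template, i, close - i + 1); // unknown field: leave as-is

            i = close + 1;
        }
        return output.ToString();
    }
}
```

For example, rendering `<h1>{title}</h1>{missing}` with a dictionary containing only title = "Hi" gives `<h1>Hi</h1>{missing}`. Values are inserted exactly as given, so anything user-supplied should be encoded or sanitized before it goes into the dictionary.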
Allow users to edit their own ASPX or ASCX files.
Why build your own templating system if you can use ASP.NET's? Well, this approach is the simplest if you want to build a quick 'n' dirty reporting system, but it fails terribly for security. Unfortunately you cannot secure any <% %> / <script runat="server"> code in ASPX files in a sandbox or use CAS owing to how ASP.NET works (I looked into this myself earlier: Code Access Security on a per-view ASP.NET MVC basis ).
You don't need to store the actual ASPX and ASCX files in the website's filesystem, you can store the files in your database using a VirtualPathProvider implementation, but getting it to work right can be a bit of a pain (especially as the ASP.NET runtime compiles ASPX files, so you'd need to inform it if an ASPX file was modified by the user). You also need to be aware that ASPX loading is tied into the user's request path (unless you're using Routing or MVC) so you're better off using ASCX, not that it matters really.
A custom ASP.NET handler that runs in its own CAS sandbox that implements a fully-fledged templating engine
This is the most painful option, and it exists between the two: you get the flexibility of a first-class templating engine (loops, fields, evaluation if necessary) without needing to open your application up to gaping security flaws. The downside is you need to build pretty much everything by yourself. I won't go into detail here.
Which option you go for depends on what your requirements are. MySpace's system was more "skinning" than "templating", in that the users were free to set a stylesheet and define some arbitrary common HTML rather than modify their page's general template directly.
You can easily implement a MySpace-like system in ASP.NET assuming that each skinnable feature is implemented as a Control subclass, just extend your Render method to allow for the insertion of said arbitrary HTML. Adding custom stylesheets is also easy: just add it inside a <style type="text/css"> element in your page's <head>.
When/if you do allow the user to enter HTML directly, I strongly recommend you parse it and filter out any dangerous elements (such as <script>, <style>, <object>, etc). If you want to allow for the embedding of YouTube videos and related then you should analyse <object> elements to ensure they actually are of YouTube videos, extract the video ID, then recreate the element from a known, trusted template. It is important that any custom HTML is "tag-balanced" (you can verify this by passing it through a strict XML parser instead of a more forgiving HTML parser, as XHTML is (technically) a subset of HTML anyway), that way any custom markup won't break the page.
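The tag-balance check can be sketched with the strict XML parser in System.Xml.Linq. The MarkupCheck class is a hypothetical helper; note that a strict parser also rejects things that are legal in HTML, such as an unclosed <br> or an entity like &nbsp;.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Returns true only if the fragment is well-formed XML,
    // which implies every opened tag is closed in the right order.
    public static bool IsTagBalanced(string fragment)
    {
        try
        {
            // Wrap in a dummy root so a fragment with several
            // top-level nodes still parses as one document.
            XElement.Parse("<root>" + fragment + "</root>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

So `<p><b>ok</b></p>` passes, while `<p><b>bad</p>` is rejected. This only checks structure; it says nothing about whether the tags themselves are safe.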
Have fun.

Is HttpUtility.HtmlEncode safe?

I want the user to enter text, and I would like to show the text back to the user and keep all the whitespace. I don't want any exploits where the user injects HTML or JavaScript. Is HttpUtility.HtmlEncode safe enough to use? At the moment it looks correct, since it properly encodes <, > and the other characters I tested. What do I use to display the text back correctly? Right now I am using <pre><code>. It looks all right, but is this the correct way to display it?
HtmlEncode should be secure as far as HTML markup or JavaScript is concerned. Any HTML markup characters are encoded, so they show up as literal characters when displayed on a web page rather than being interpreted as markup.
Yes, if I wanted to keep formatting (including all spaces), I would use <pre>.
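Put together, that could look like the sketch below. It uses WebUtility.HtmlEncode from System.Net as a stand-in for HttpUtility.HtmlEncode so the snippet has no System.Web dependency; PlainTextView is a made-up helper name.

```csharp
using System.Net;

public static class PlainTextView
{
    // Encodes the user's text, then wraps it in <pre> so the browser
    // keeps spaces and line breaks exactly as typed.
    public static string ToPreBlock(string userText)
    {
        return "<pre>" + WebUtility.HtmlEncode(userText) + "</pre>";
    }
}
```

The encoding has to happen before the wrapping, so that only the <pre> tags you add yourself are real markup.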
You'll want to have a look at the GetSafeHtmlFragment method in the AntiXSS section of the Web Protection Library. This uses a whitelist of what HTML is considered 'safe' for XSS purposes; anything not in the whitelist is stripped out. Blowdart (who works on the WPL team) has a great blog post on using the method.

Building a Wikipedia on ASP.NET (learning exercise). How to clean untrusted data, but keep formatting?

I want to give end users the ability to save HTML to my backend store. Since this feature could easily cause SQL Injection, and loads of other issues, does anyone know of a server side library that will clean the input so only the "safe" parts of HTML can be used?
Some things I'd like to avoid:
Object Tag use
JavaScript use
Windows "style" pop-up boxes (such as your PC is infected with a virus)
CSS with a Javascript action
inline data from external sites
Since there is a 100% guarantee that I didn't come up with all the ways a user could be malicious with this feature, I'd like to learn what options I have to clean the data, but preserve basic formatting
Consider sanitizing user input with the Microsoft AntiXSS library.
http://wpl.codeplex.com/
http://msdn.microsoft.com/en-us/security/aa973814.aspx

Best practice for preventing saving malicious client script in HTML

We have an ASP.NET custom control that lets users enter HTML (similar to a Rich text box). We noticed that a user can potentially inject malicious client scripts within the <script> tag in the HTML view. I can validate HTML code on save to ensure that I remove any <script> elements.
Is this all I need to do? Are all other tags other than the <script> tag safe? If you were an attacker, what else would you attempt to do?
Any best practices I need to follow?
EDIT - How is the MS anti Xss library different from the native HtmlEncode for my purpose?
XSS (Cross Site Scripting) is a big and difficult subject to tackle correctly.
Instead of blacklisting some tags (and missing some of the ways you may be attacked), it is better to decide on a set of tags that are OK for your site and allow only those.
This in itself will not be enough, as you will have to catch all possible encodings an attacker might try and there are other things an attacker might try. There are anti-xss libraries that help - here is one from Microsoft.
For more information and guidance, see this OWASP article.
Have a look at this page:
http://ha.ckers.org/xss.html
to get an idea of different XSS attacks that somebody may try.
There's a whole lot to do when it comes to filtering out JavaScript from HTML. Here's a short list of some of the bigger points:
Multiple passes over the input are required to make sure that what you removed before doesn't create a new injection. If you're doing a single pass, things like <scr<script></script>ipt>alert("XSS!");</scr<script></script>ipt> will get past you, since after you remove the <script> tags from the string, you'll have created a new one.
Strip the use of the javascript: protocol in href and src attributes.
Strip embedded event handler attributes like onmouseover/out, onclick, onkeypress, etc.
White lists are safer than black lists. Only allow tags and attributes that you know are safe.
Make sure you're dealing with all the same character encoding. If you treat the input like ASCII (single byte) and the input has Unicode (multibyte) characters, you're going to get a nasty surprise.
Here's a more complete cheat sheet. Also, Oli linked to a good article at ha.ckers.org with samples to test your filtration.
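To see the single-pass problem from the first point concretely, here is a small demonstration. ScriptStripDemo is a hypothetical class, and its regex is deliberately naive: this illustrates the pitfall and is not a recommended sanitizer.

```csharp
using System.Text.RegularExpressions;

public static class ScriptStripDemo
{
    // Naive pattern for opening and closing script tags (demo only).
    static readonly Regex ScriptTag =
        new Regex(@"</?script[^>]*>", RegexOptions.IgnoreCase);

    // One pass: removes the tags it can see right now.
    public static string StripOnce(string html)
    {
        return ScriptTag.Replace(html, "");
    }

    // Repeats the pass until the string stops changing, so tags
    // reassembled by an earlier removal are caught as well.
    public static string StripUntilStable(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = StripOnce(html);
        } while (html != previous);
        return html;
    }
}
```

Running StripOnce on `<scr<script></script>ipt>alert(1)</scr<script></script>ipt>` leaves `<script>alert(1)</script>`, which is exactly the attack. StripUntilStable reduces it to `alert(1)`. Even so, looping a blacklist regex still misses event handlers, javascript: URLs and the rest of the list above.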
Removing only the <script> tags will not be sufficient as there are lots of methods for encoding / hiding them in input. Most languages now have anti-xss and anti-csrf libraries and functions for filtering input. You should use one of these generally agreed upon libraries to filter your user input.
I'm not sure what the best options are in ASP.NET, but this might shed some light:
http://msdn.microsoft.com/en-us/library/ms998274.aspx
This is called a Cross Site Scripting (XSS) attack. They can be very hard to prevent, as there are a lot of surprising ways of getting JavaScript code to execute (javascript: URLs, sometimes CSS, object and iframe tags, etc).
The best approach is to whitelist tags, attributes, and types of URLs (and keep the whitelist as small as possible to do what you need) instead of blacklisting. That means that you only allow certain tags that you know are safe, rather than banning tags that you believe to be dangerous. This way, there are fewer possible ways for people to get an attack into your system, because tags that you didn't think about won't be allowed, rather than blacklisting where if you missed something, you will still have a vulnerability. Here's an example of a whitelist approach to sanitization.
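To make the whitelist idea concrete, here is a sketch that assumes the input is well-formed XHTML (as an editor like TinyMCE can be configured to emit) and walks it with System.Xml.Linq. The class name, the tag list and the "only absolute http(s) links" rule are my own assumptions. Treat it as an illustration of the approach rather than a vetted sanitizer; a maintained library is still the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WhitelistSanitizer
{
    // Hypothetical whitelist: keep it as small as the site can live with.
    static readonly HashSet<string> AllowedTags = new HashSet<string>
        { "p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a" };

    // Attributes allowed per tag; anything not listed here is removed.
    static readonly Dictionary<string, string[]> AllowedAttributes =
        new Dictionary<string, string[]> { { "a", new[] { "href" } } };

    // Throws XmlException if the fragment is not well-formed XML.
    public static string Sanitize(string xhtmlFragment)
    {
        XElement root = XElement.Parse("<root>" + xhtmlFragment + "</root>",
                                       LoadOptions.PreserveWhitespace);
        Clean(root);
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    static void Clean(XElement parent)
    {
        // Copy the node list first, since nodes are removed while iterating.
        foreach (XNode node in parent.Nodes().ToList())
        {
            var cdata = node as XCData;
            if (cdata != null)
            {
                // Re-emit CDATA as ordinary text so its content is escaped on output.
                cdata.ReplaceWith(new XText(cdata.Value));
                continue;
            }
            if (node is XText) continue;

            var element = node as XElement;
            bool allowed = element != null
                && element.Name.Namespace == XNamespace.None
                && AllowedTags.Contains(element.Name.LocalName);
            if (!allowed)
            {
                // Unknown tags (with their content), comments, processing instructions.
                node.Remove();
                continue;
            }

            foreach (XAttribute attr in element.Attributes().ToList())
                if (!IsAllowed(element, attr)) attr.Remove();

            Clean(element);
        }
    }

    static bool IsAllowed(XElement element, XAttribute attr)
    {
        if (attr.IsNamespaceDeclaration || attr.Name.Namespace != XNamespace.None)
            return false;
        string[] names;
        if (!AllowedAttributes.TryGetValue(element.Name.LocalName, out names)
            || !names.Contains(attr.Name.LocalName))
            return false;
        // The only attribute allowed above is href: accept absolute http(s) URLs only.
        Uri uri;
        return Uri.TryCreate(attr.Value, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Anything the code does not recognise is dropped by default: a <script> element, an onclick attribute and a javascript: href all disappear without being named anywhere, which is the point of whitelisting. Input that is not well-formed makes Sanitize throw, and the caller should treat that as a rejection.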

HTMLEncode script tags only

I'm working on StackQL.net, which is just a simple web site that allows you to run ad hoc tsql queries on the StackOverflow public dataset. It's ugly (I'm not a graphic designer), but it works.
One of the choices I made is that I do not want to html encode the entire contents of post bodies. This way, you see some of the formatting from the posts in your queries. It will even load images, and I'm okay with that.
But I am concerned that this will also leave <script> tags active. Someone could plant a malicious script in a stackoverflow answer; they could even immediately delete it, so no one sees it. One of the most common queries people try when they first visit is a simple Select * from posts, so with a little bit of timing a script like this could end up running in several people's browsers. I want to make sure this isn't a concern before I update to the (hopefully soon-to-be-released) October data export.
What is the best, safest way to make sure just script tags end up encoded?
You may want to modify the HTMLSanatize script to fit your purposes. It was written by Jeff Atwood to allow certain kinds of HTML to be shown. Since it was written for Stack Overflow, it'd fit your purpose as well.
I don't know whether it's 'up to date' with what Jeff currently has deployed, but it's a good starting point.
Don't forget onclick, onmouseover, etc or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">) or CSS (style="property: expression(evil!Evil!);") or…
There are a host of attack vectors beyond simple script elements.
Implement a white list, not a black list.
If the messages are in XHTML format then you could do an XSL transform and encode/strip tags and properties that you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a wysiwyg editor that outputs XHTML.
What about simply breaking the <script> tags? Escaping only < and > for that tag, ending up with &lt;script&gt;, could be one simple and easy way.
Of course links are another vector. You should also disable every instance of href='javascript:', and every attribute starting with on*.
Just to be sure, nuke it from orbit.
"But I am concerned that this will also leave <script> tags active."
Oh, that's just the beginning of HTML ‘malicious content’ that can cause cross-site scripting. There's also event handlers; inline, embedded and linked CSS (expressions, behaviors, bindings), Flash and other embeddable plugins, iframes to exploit sites, javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL, meta-refresh, UTF-8 overlongs, UTF-7 mis-sniffing, data binding, VML and other non-HTML stuff, broken markup parsed as scripts by permissive browsers...
In short any quick-fix attempt to sanitise HTML with a simple regex will fail badly.
Either escape everything so that any HTML is displayed as plain text, or use a full parser-and-whitelist-based sanitiser. (And keep it up-to-date, because even that's a hard job and there are often newly-discovered holes in them.)
But aren't you using the same Markdown system as SO itself to render posts? That would be the obvious thing to do. I can't guarantee there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past and there are probably some more obscure ones still in there as it's quite a complicated system). But at least you'd be no more insecure than SO is!
Use a Regex to replace the script tags with their encoded equivalents. This finds the tags whose name contains the word "script" and HtmlEncodes them. Thus, all the script tags, such as <script>, </script> and <script type="text/javascript">, get encoded, and the other tags in the string are left alone.
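// Requires System.Text.RegularExpressions and System.Web (for HttpUtility).
// Encode only tags whose name contains "script"; leave every other tag untouched.
text = Regex.Replace(text, @"</?(\w+)[^>]*>",
    tag => tag.Groups[1].Value.ToLower().Contains("script") ? HttpUtility.HtmlEncode(tag.Value) : tag.Value,
    RegexOptions.Singleline);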

Resources