HTMLEncode script tags only - asp.net

I'm working on StackQL.net, which is just a simple web site that allows you to run ad hoc T-SQL queries on the StackOverflow public dataset. It's ugly (I'm not a graphic designer), but it works.
One of the choices I made is that I do not want to html encode the entire contents of post bodies. This way, you see some of the formatting from the posts in your queries. It will even load images, and I'm okay with that.
But I am concerned that this will also leave <script> tags active. Someone could plant a malicious script in a stackoverflow answer; they could even immediately delete it, so no one sees it. One of the most common queries people try when they first visit is a simple Select * from posts, so with a little bit of timing a script like this could end up running in several people's browsers. I want to make sure this isn't a concern before I update to the (hopefully soon-to-be-released) October data export.
What is the best, safest way to make sure just script tags end up encoded?

You may want to modify the HTML sanitizer script to fit your purposes. It was written by Jeff Atwood to allow certain kinds of HTML to be shown. Since it was written for Stack Overflow, it'd fit your purpose as well.
I don't know whether it's 'up to date' with what Jeff currently has deployed, but it's a good starting point.

Don't forget onclick, onmouseover, etc. or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">) or CSS (style="property: expression(evil!Evil!);") or…
There are a host of attack vectors beyond simple script elements.
Implement a white list, not a black list.

If the messages are in XHTML format then you could do an XSL transform and encode/strip tags and attributes that you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a WYSIWYG editor that outputs XHTML.

What about simply breaking the <script> tags? Escaping only < and > for that tag, ending up with &lt;script&gt;, could be one simple and easy way.
Of course links are another vector. You should also disable every instance of href='javascript:', and every attribute starting with on*.
Just to be sure, nuke it from orbit.

But I am concerned that this will also leave <script> tags active.
Oh, that's just the beginning of HTML ‘malicious content’ that can cause cross-site scripting. There are also event handlers; inline, embedded and linked CSS (expressions, behaviors, bindings), Flash and other embeddable plugins, iframes to exploit sites, javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL, meta-refresh, UTF-8 overlongs, UTF-7 mis-sniffing, data binding, VML and other non-HTML stuff, broken markup parsed as scripts by permissive browsers...
In short any quick-fix attempt to sanitise HTML with a simple regex will fail badly.
Either escape everything so that any HTML is displayed as plain text, or use a full parser-and-whitelist-based sanitiser. (And keep it up-to-date, because even that's a hard job and there are often newly-discovered holes in them.)
But aren't you using the same Markdown system as SO itself to render posts? That would be the obvious thing to do. I can't guarantee there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past and there are probably some more obscure ones still in there as it's quite a complicated system). But at least you'd be no more insecure than SO is!

Use a Regex to replace the script tags with encoded tags. This matches every tag, HTML-encodes any whose name contains the word "script", and leaves all other tags in the string untouched. Thus, all the script tags such as <script>, </script> and <script type="text/javascript"> etc. will get encoded, while other tags will not be.
// Requires: using System.Text.RegularExpressions; and using System.Web;
string encoded = Regex.Replace(text, @"</?(\w+)[^>]*>",
    tag => tag.Groups[1].Value.ToLower().Contains("script")
        ? HttpUtility.HtmlEncode(tag.Value)  // encode the entire script tag
        : tag.Value,                         // pass every other tag through
    RegexOptions.Singleline);

Related

How do I create an html link that has a link name, the same as the URL address?

Is this the easiest way in an HTML doc to create a link to a page that has the same name as the URL?
So basically it will say:
Please click the following link:
http://test.com.
That is all I want it to say.
The code I wrote for this is as follows:
<a href="http://test.com">http://test.com</a>
Or is there a more all-inclusive way, where you don't have to write the URL twice?
Obviously my code doesn't include the initial text; this is just for example purposes.
Unless you want to copy the URL from one place to another using JavaScript, you will have to write the URL twice.
I advise against the JavaScript copying, because its performance and SEO costs are much worse than the cost of typing everything twice.
What you have got now is the easiest way.
If that's not an option for some reason, you can use server-side scripting to search the page content for URLs and wrap an <a> tag around them.
This will require some very complicated regex. Daring Fireball has a very good blog post showing how to do this, and explaining exactly why it's impossible for this to be perfectly reliable (which is probably why HTML doesn't do it for you automatically):
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
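For illustration, here's a minimal C# sketch of that server-side approach; the URL pattern is deliberately naive (the Daring Fireball pattern above is far more robust), and AutoLink is just a name for this example:
// Wrap bare http(s) URLs in anchor tags. Naive pattern, for illustration only:
// it mishandles trailing punctuation and re-links URLs already inside markup.
// Requires: using System.Text.RegularExpressions;
static string AutoLink(string html)
{
    return Regex.Replace(html,
        @"\bhttps?://[^\s<>""]+",
        m => "<a href=\"" + m.Value + "\">" + m.Value + "</a>");
}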
I've done this sort of thing before (with emails actually) and it's very difficult and took years to get right. If at all possible, you should just do what you're already doing - manually type in the <a> tag yourself.
Alternatively, you could use something like Smarty (for PHP; I don't know what the ASP.NET equivalent would be) to write something along the lines of the following, to programmatically generate the full <a> tag:
{link url='http://example.com'}
Why don't we just sidestep the issue by making our links more semantically-rich?
Instead of:
For more information on our delicious pizza, visit www.pizzasrawesome.com.
Use this:
<a href="http://www.pizzasrawesome.com">Read more about our delicious pizza</a>.

Best practice for preventing saving malicious client script in HTML

We have an ASP.NET custom control that lets users enter HTML (similar to a Rich text box). We noticed that a user can potentially inject malicious client scripts within the <script> tag in the HTML view. I can validate HTML code on save to ensure that I remove any <script> elements.
Is this all I need to do? Are all tags other than the <script> tag safe? If you were an attacker, what else would you attempt to do?
Any best practices I need to follow?
EDIT - How is the MS Anti-XSS library different from the native HtmlEncode for my purpose?
XSS (Cross Site Scripting) is a big and difficult subject to tackle correctly.
Instead of black-listing some tags (and missing some of the ways you may be attacked), it is better to decide on a set of tags that are OK for your site and only allowing them.
This in itself will not be enough, as you will have to catch all the possible encodings an attacker might try, and there are other things an attacker might try as well. There are anti-XSS libraries that help - here is one from Microsoft.
For more information and guidance, see this OWASP article.
Have a look at this page:
http://ha.ckers.org/xss.html
to get an idea of different XSS attacks that somebody may try.
There's a whole lot to do when it comes to filtering out JavaScript from HTML. Here's a short list of some of the bigger points:
Multiple passes over the input are required to make sure that what you removed before doesn't create a new injection. If you're doing a single pass, things like <scr<script></script>ipt>alert("XSS!");</scr<script></script>ipt> will get past you, since after you remove the <script> tags from the string, you'll have created a new one (see the sketch after this list).
Strip the use of the javascript: protocol in href and src attributes.
Strip embedded event handler attributes like onmouseover/out, onclick, onkeypress, etc.
White lists are safer than black lists. Only allow tags and attributes that you know are safe.
Make sure you're dealing with all the same character encoding. If you treat the input like ASCII (single byte) and the input has Unicode (multibyte) characters, you're going to get a nasty surprise.
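As a rough sketch of the first point above, the naive fix in C# is to keep stripping until a pass removes nothing; even this only defeats that one trick, and a parser-plus-whitelist sanitizer remains the better tool:
// Repeatedly strip script tags until a pass changes nothing, so removals
// can't splice the surrounding text into a brand-new <script> tag.
// Requires: using System.Text.RegularExpressions;
static string StripScriptTags(string input)
{
    string previous;
    do
    {
        previous = input;
        input = Regex.Replace(input, @"</?\s*script[^>]*>", string.Empty,
            RegexOptions.IgnoreCase);
    } while (input != previous);
    return input;
}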
Here's a more complete cheat sheet. Also, Oli linked to a good article at ha.ckers.org with samples to test your filtration.
Removing only the <script> tags will not be sufficient, as there are lots of methods for encoding / hiding them in input. Most languages now have anti-XSS and anti-CSRF libraries and functions for filtering input. You should use one of these generally agreed upon libraries to filter your user input.
I'm not sure what the best options are in ASP.NET, but this might shed some light:
http://msdn.microsoft.com/en-us/library/ms998274.aspx
This is called a Cross Site Scripting (XSS) attack. They can be very hard to prevent, as there are a lot of surprising ways of getting JavaScript code to execute (javascript: URLs, sometimes CSS, object and iframe tags, etc).
The best approach is to whitelist tags, attributes, and types of URLs (and keep the whitelist as small as possible to do what you need) instead of blacklisting. That means you only allow certain tags that you know are safe, rather than banning tags that you believe to be dangerous. This way there are fewer possible ways for people to get an attack into your system, because tags you didn't think about won't be allowed; with a blacklist, if you missed something, you still have a vulnerability. Here's an example of a whitelist approach to sanitization.
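In case that example isn't handy, here's a minimal sketch of the whitelist idea using the Html Agility Pack (recommended elsewhere on this page); the allowed tag set is purely illustrative, and a real sanitizer must also whitelist attributes and URL schemes:
// Keep only whitelisted elements; everything else is removed outright.
// Requires the HtmlAgilityPack NuGet package, plus using System.Linq;
// and using System.Collections.Generic;
static string WhitelistSanitize(string html)
{
    var allowed = new HashSet<string> { "b", "i", "em", "strong", "p", "a" };
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    foreach (var node in doc.DocumentNode.Descendants().ToList())
    {
        if (node.NodeType == HtmlAgilityPack.HtmlNodeType.Element
            && !allowed.Contains(node.Name))
        {
            node.Remove(); // drops the element and its children
        }
    }
    // Attributes (href="javascript:...", on* handlers) still need filtering.
    return doc.DocumentNode.OuterHtml;
}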

What's the best way to remove (or ignore) script and form tags in HTML?

I have text stored in SQL as HTML. I'm not guaranteed that this data is well-formed, as users can copy/paste from anywhere into the editor control I'm using, or manually edit the HTML that's generated.
The question is: what's the best way of going about removing or somehow ignoring <script/> and <form/> tags so that, when the user's text is displayed elsewhere in the Web Application, it doesn't disrupt the normal operation of the containing page.
I've toyed with the idea of simply doing a "Find and Replace" for <script>/<form> with <div> (obviously taking into account whitespace and closing tags, if they exist). I'm also open to any way to somehow "ignore" certain tags. For all I know, there could be some built-in way of saying (in HTML, CSS, or JavaScript) "for all elements in <div id="MyContent">, treat <form> and <script> as <div>".
Any help or advice would be greatly appreciated!
In terms of sanitising user input, form and script tags are not the only ones that should be cleaned up.
The best way of doing this job depends a little on what tools you are using. Have a look at these questions:
What’s the best method for sanitizing user input with PHP?
Sanitising user input using Python
It depends on which language you're using. In general, I'd recommend using an HTML parser, constructing a small DOM from the snippet, then nuking unwanted elements. There are many good HTML parsers, especially ones designed to handle real-world, messy HTML. Examples include BeautifulSoup (Python), HTMLParser (Java)... And, since the answer came in while I was typing, what Colin said!
Don't try and do it yourself - there are far too many tricks for getting bits of script and general nastiness into a page. Use the Microsoft AntiXSS library - version 3.1 has HTML sanitisation built in. You probably want the GetSafeHtmlFragment method, which returns a sanitised chunk of HTML. See my previous answer.
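Usage is essentially a one-liner; this sketch assumes the AntiXSS 3.1 API, where the method hangs off the AntiXss class (later versions moved it to a Sanitizer class), so double-check against the version you install:
// Assumes AntiXSS 3.1; userHtml is the untrusted input.
string safeHtml = Microsoft.Security.Application.AntiXss.GetSafeHtmlFragment(userHtml);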
Since you're using .Net I would recommend HtmlAgilityPack as it is easy to work with and works well with malformed HTML.
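A minimal sketch of that approach for this question's specific case, dropping script and form elements (the method name is just for illustration, and the whitelist advice elsewhere on this page still applies):
// Parse the (possibly malformed) HTML and remove every script and form element.
// Requires the HtmlAgilityPack NuGet package.
static string RemoveScriptAndForm(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var nodes = doc.DocumentNode.SelectNodes("//script|//form");
    if (nodes != null)  // SelectNodes returns null when nothing matches
    {
        foreach (var node in nodes)
            node.Remove();
    }
    return doc.DocumentNode.OuterHtml;
}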
Though the answers suggested were acceptable, I ended up using a good old regular expression to replace begin and end <script> and <form> tags with <div>s.
// Swap opening/closing script and form tags for divs, per the sentence above.
txtStore.Text = Regex.Replace(txtStore.Text, @"<(/?)(script|form)[^>]*>", "<$1div>", RegexOptions.IgnoreCase);
I had faced the same problem before, but my scenario was a little different. I was adding content to the page with an AJAX request. The content coming in the AJAX response was HTML, and it also included script tags. I just wanted to get the HTML without any scripts, so I removed all the script tags from the AJAX response with jQuery.
jquery-remove-script-tags-from-string

Preventing XSS (Cross-site Scripting)

Let's say I have a simple ASP.NET MVC blog application and I want to allow readers to add comments to a blog post. If I want to prevent any type of XSS shenanigans, I could HTML encode all comments so that they become harmless when rendered. However, what if I wanted to allow some basic functionality like hyperlinks, bolding, italics, etc.?
I know that StackOverflow uses the WMD Markdown Editor, which seems like it would be a great choice for what I'm trying to accomplish, were it not for the fact that it supports both HTML and Markdown, which leaves it open to XSS attacks.
If you are not looking to use an editor you might consider OWASP's AntiSamy.
You can run an example here:
http://www.antisamy.net/
How much HTML are you going to support? Just bold/italics/the basic stuff? In that case, you can convert those to markdown syntax and then strip the rest of the HTML (a crude sketch follows this answer).
The stripping needs to be done server side, before you store it. You need to validate the input on the server as well, just as when checking for SQL vulnerabilities and other unwanted input.
If you need to do it in the browser: http://code.google.com/p/google-caja/wiki/JsHtmlSanitizer
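A crude sketch of that convert-then-strip idea in C#, assuming only plain <b> and <i> tags are supported; everything else gets stripped:
// Convert bold/italic tags to markdown, then strip all remaining tags.
// Naive: ignores attributes on b/i tags. Requires: using System.Text.RegularExpressions;
static string HtmlToMarkdown(string html)
{
    string md = Regex.Replace(html, @"</?b>", "**", RegexOptions.IgnoreCase);
    md = Regex.Replace(md, @"</?i>", "*", RegexOptions.IgnoreCase);
    return Regex.Replace(md, @"<[^>]+>", string.Empty); // drop everything else
}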
I'd suggest you only submit the markdown syntax. On the front end, the client can type markdown and have an HTML preview (same as SO), but only submit the markdown syntax server-side. Then you can validate it, generate the HTML, escape it and store it (a rough sketch follows this answer).
I believe that's the way most of us do it. In either case, markdown is there to spare people from writing structured HTML code, and to give power to those who wouldn't even know how to.
If there's something specific you'd like to do with the HTML, then you can tweak it with some CSS inheritance '.comment a { color: #F0F; }', front end JS or just traverse over the generated HTML from parsing markdown before you store it.
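As a rough sketch of that server-side flow, assuming the Markdig library for the markdown-to-HTML step and reusing a whitelist sanitizer like the WhitelistSanitize sketch earlier on this page:
// Store the submitted markdown; render and sanitize it when producing HTML.
// Assumes the Markdig NuGet package (Markdig.Markdown.ToHtml).
static string RenderComment(string markdownFromClient)
{
    string html = Markdig.Markdown.ToHtml(markdownFromClient);
    return WhitelistSanitize(html); // whitelist pass, as sketched above
}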
Why don't you use Jeff's code? http://refactormycode.com/codes/333-sanitize-html
I'd vote for the FCKEditor, but you'll have to take some extra steps with the returned output too.
You could use an HTML whitelist so that certain tags can still be used, but everything else is blocked.
There are tools that can do this for you. SO uses the code that Slough linked.

HTML Sanitization - bad markup?

I've been scanning some of the discussions on sanitizing HTML markup strings for redisplay on a page (e.g. blog comments). In the past I've just unilaterally escaped the markup for re-display.
Does anyone know if there are any solutions out there that go beyond just removing "unsafe" tags?
What if the markup is invalid? For example, how do you prevent an unclosed <b> tag from boldfacing all the text that follows it on the page?
It seems like Stackoverflow handles this.
Example of unclosed 'b' tag
Thanks.
Stack Overflow uses either Textile or something very much like it.
Textile is more or less guaranteed to spit out valid (X)HTML, ameliorating many typical problems with sanitizing user input.
Check this code:
Sanitize HTML, I think StackOverflow uses it somewhere...
A method to sanitize any potentially dangerous tags from the provided raw HTML input using a whitelist based approach, leaving the "safe" HTML tags.
The Html Agility Pack is probably a good starting point, as it claims to be very tolerant of badly formatted and malformed HTML. On top of that you may want to build some rules to do further sanitization. In the end you serialize the resulting DOM back to plain HTML code.
I faced the same problem you did and built such a rule-based HTML sanitizer on top of the Html Agility Pack. It allows you to flatten or remove tags, transform tags (for example replacing b with strong tags) and restrict attribute usage. Take a look at the source code of the HtmlRuleSanitizer for ideas, or just get the NuGet package if you want to be done quickly.
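If you go the NuGet route, usage looks roughly like this; it assumes HtmlRuleSanitizer's documented factory and Sanitize call, so verify the exact API against the project's README:
// Assumed HtmlRuleSanitizer (Vereyon.Web) usage; dirtyHtml is the untrusted input.
var sanitizer = Vereyon.Web.HtmlSanitizer.SimpleHtml5Sanitizer();
string clean = sanitizer.Sanitize(dirtyHtml);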
