Preventing XSS (Cross-site Scripting) - asp.net

Let's say I have a simple ASP.NET MVC blog application and I want to allow readers to add comments to a blog post. If I want to prevent any type of XSS shenanigans, I could HTML encode all comments so that they become harmless when rendered. However, what if I wanted to allow some basic functionality like hyperlinks, bolding, italics, etc.?
I know that StackOverflow uses the WMD Markdown Editor, which seems like a great choice for what I'm trying to accomplish, if not for the fact that it supports both HTML and Markdown, which leaves it open to XSS attacks.

If you are not looking to use an editor you might consider OWASP's AntiSamy.
You can run an example here:
http://www.antisamy.net/

How much HTML are you going to support? Just bold/italics/the basic stuff? In that case, you can convert those to markdown syntax and then strip the rest of the HTML.
The stripping needs to be done server side, before you store it. You need to validate the input on the server as well, just as you would when checking for SQL injection and other unwanted input.

If you need to do it in the browser: http://code.google.com/p/google-caja/wiki/JsHtmlSanitizer

I'd suggest you only submit the markdown syntax. On the front end, the client can type markdown and have an HTML preview (same as SO), but only submit the markdown syntax server-side. Then you can validate it, generate the HTML, escape it and store it.
I believe that's the way most of us do it. In either case, markdown is there to spare anyone from having to write structured HTML, and to give power to those who wouldn't even know how to.
If there's something specific you'd like to do with the resulting HTML, you can tweak it with some CSS inheritance ('.comment a { color: #F0F; }'), front-end JS, or just traverse the generated HTML from parsing markdown before you store it.
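A minimal sketch of that flow in Python (the idea ports directly to ASP.NET): escape the raw submission first, then convert a small markdown subset to HTML, so no user-supplied markup ever reaches the page. The function name and the supported subset here are illustrative assumptions, not SO's actual pipeline.

```python
import html
import re

def render_comment(markdown_text):
    """Escape all raw HTML, then convert a tiny markdown subset to tags."""
    # Step 1: escape, so any HTML the user typed renders as inert text.
    safe = html.escape(markdown_text)
    # Step 2: convert a small whitelisted subset of markdown to HTML.
    safe = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", safe)  # **bold**
    safe = re.sub(r"\*(.+?)\*", r"<em>\1</em>", safe)              # *italic*
    return safe

print(render_comment("**hi** <script>alert(1)</script>"))
```

Because the escape happens before the markdown pass, the only tags that can appear in the output are the ones the converter itself emits.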

Why don't you use Jeff's code? http://refactormycode.com/codes/333-sanitize-html

I'd vote for FCKeditor, but you have to do some extra steps on the returned output too.

You could use an HTML whitelist so that certain tags can still be used, but everything else is blocked.
There are tools that can do this for you. SO uses the code that Slough linked.
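As a sketch of the whitelist approach in Python's stdlib (the tag set is an illustrative assumption; a real implementation would also whitelist safe attributes such as href after checking its scheme):

```python
import html
from html.parser import HTMLParser

ALLOWED = {"b", "i", "em", "strong", "a"}  # illustrative whitelist

class WhitelistSanitizer(HTMLParser):
    """Keeps only whitelisted tags (attributes dropped); escapes all text."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # inside <script>/<style>, whose content we discard

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ALLOWED:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(html.escape(data))

def sanitize(raw):
    parser = WhitelistSanitizer()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)

print(sanitize('<b>hi</b><script>alert(1)</script>'))
```

Anything not explicitly allowed simply never reaches the output, which is the property a blacklist can't give you.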

Related

How do I output HTML form data to PDF?

I need to collect data from a visitor in an HTML form and then have them print a document with the appropriate fields pre-populated. They'll need to have a couple of signatures on the document, so it has to be printed.
The paper form already exists, so one idea was to scan it in, with nothing filled out, as an image. I would then have the HTML form data print out using CSS for positioning and using the blank scanned form as a background image.
A better option, I would think, would be to automatically generate the PDF with this data, but I'm not sure how to accomplish either.
Suggestions and ideas would be greatly appreciated! =)
I would have to respectfully disagree with Osvaldo. Using CSS to align elements on a printed document would take ages to do efficiently across browsers. Plus, if Microsoft comes out with a new browser, you're going to have to constantly update it for the new browser versions.
If you know any PHP (and if you know JavaScript and HTML, basic PHP is very simple), here's a good library you can use, FPDF:
Thankfully, PHP doesn't deprecate a whole lot of methods, and the total code is less than 10 lines if you have to go in and change things around.
You can control printed documents acceptably well with CSS, so I would suggest trying that option first, because it's easier.
This is actually a great PHP library for converting HTML to PDF documents: http://code.google.com/p/dompdf/ There are many demos available on the site.
XSL-FO is what I would recommend. XSL-FO (along with XSLT and XPath) is a sub-standard of XSL that was designed to be an abstract representation of a formatted document (containing text, graphic elements, fonts, styles, etc.).
XSL-FO documents are valid XML documents, and there exist tools and APIs that let you convert an XSL-FO document to MS Word, PDF, RTF, etc. Depending on the technology you use, a quick Google search will tell you what is available.
Here are a few links to help you get started with XSL-FO:
http://en.wikipedia.org/wiki/XSL_Formatting_Objects
http://www.w3schools.com/xslfo/xslfo_intro.asp
http://www.w3.org/TR/xsl11/
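To make that concrete, here is a minimal hand-written XSL-FO document sketch (the page geometry and field text are placeholders); a processor such as Apache FOP can render it straight to PDF:

```xml
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-width="8.5in" page-height="11in">
      <fo:region-body margin="1in"/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="12pt">Name: John Doe</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

In practice you would generate the fo:block contents from the submitted form data (typically via an XSLT transform) rather than writing them by hand.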

Is HttpUtility.HtmlEncode safe?

I want the user to enter text and I would like to show the text back to the user, keeping all the whitespace. I don't want any exploits where the user injects HTML or JavaScript. Is HttpUtility.HtmlEncode safe enough to use? At the moment it looks correct, since it properly encodes < > and other test characters. What do I use to display the text back correctly? Right now I am using <pre><code>. It looks alright; is this the correct way to display it?
HtmlEncode should be secure against any HTML or JavaScript injection: all markup characters are encoded so that they appear only as literal characters when displayed on a web page.
Yes, if I wanted to keep formatting (including all spaces), I would use <pre>.
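As a quick illustration in Python (HttpUtility.HtmlEncode behaves analogously), encoding turns the markup characters into entities, and wrapping the result in <pre> then preserves the whitespace when displayed:

```python
import html

user_text = 'line one\n    indented <script>alert(1)</script>'
encoded = html.escape(user_text)
# The browser shows the literal text, whitespace intact, script inert.
print(f"<pre>{encoded}</pre>")
```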
You'll want to have a look at the GetSafeHtmlFragment method in the AntiXSS portion of the Web Protection Library. This uses a whitelist of what HTML is considered 'safe' for XSS purposes; anything not in the whitelist is stripped out. Blowdart (who works on the WPL team) has a great blog post on using the method.

HTMLEncode script tags only

I'm working on StackQL.net, which is just a simple web site that allows you to run ad hoc tsql queries on the StackOverflow public dataset. It's ugly (I'm not a graphic designer), but it works.
One of the choices I made is that I do not want to html encode the entire contents of post bodies. This way, you see some of the formatting from the posts in your queries. It will even load images, and I'm okay with that.
But I am concerned that this will also leave <script> tags active. Someone could plant a malicious script in a stackoverflow answer; they could even immediately delete it, so no one sees it. One of the most common queries people try when they first visit is a simple Select * from posts, so with a little bit of timing a script like this could end up running in several people's browsers. I want to make sure this isn't a concern before I update to the (hopefully soon-to-be-released) October data export.
What is the best, safest way to make sure just script tags end up encoded?
You may want to modify the HTML sanitizer script to fit your purposes. It was written by Jeff Atwood to allow certain kinds of HTML through. Since it was written for Stack Overflow, it'd fit your purpose as well.
I don't know whether it's 'up to date' with what Jeff currently has deployed, but it's a good starting point.
Don't forget onclick, onmouseover, etc., or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">), or CSS (style="property: expression(evil!Evil!);"), or…
There are a host of attack vectors beyond simple script elements.
Implement a white list, not a black list.
If the messages are in XHTML format then you could do an XSL transform and encode/strip tags and properties that you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a wysiwyg editor that outputs XHTML.
What about simply breaking the <script> tags? Escaping only < and > for that tag, ending up with &lt;script&gt;, could be one simple and easy way.
Of course links are another vector. You should also disable every instance of href='javascript:', and every attribute starting with on*.
Just to be sure, nuke it from orbit.
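A sketch of those blacklist patches in Python, purely to show the moving parts (the regexes are naive assumptions and easy to bypass, which is exactly why the whitelist advice above is the better approach):

```python
import re

def neutralize(raw):
    """Naive blacklist patches: escape <script> tags, strip on* handlers,
    and defuse javascript: URLs. Easy to bypass -- prefer a whitelist."""
    # Escape only the <script>/</script> tags themselves.
    raw = re.sub(r"<(/?)(script\b[^>]*)>", r"&lt;\1\2&gt;", raw, flags=re.I)
    # Drop on* event-handler attributes (onclick, onmouseover, ...).
    raw = re.sub(r"\son\w+\s*=\s*(\"[^\"]*\"|'[^']*'|\S+)", "", raw, flags=re.I)
    # Defuse javascript: pseudo-URLs in href/src attributes.
    raw = re.sub(r"((?:href|src)\s*=\s*[\"']?)\s*javascript:", r"\1blocked:",
                 raw, flags=re.I)
    return raw

print(neutralize('<a href="javascript:alert(1)" onclick="evil()">x</a>'))
```

Even this three-rule version misses whole attack classes (CSS expressions, encoded URLs, broken markup), reinforcing the point that blacklisting doesn't scale.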
But I am concerned that this will also leave <script> tags active.
Oh, that's just the beginning of HTML ‘malicious content’ that can cause cross-site scripting. There's also event handlers; inline, embedded and linked CSS (expressions, behaviors, bindings), Flash and other embeddable plugins, iframes to exploit sites, javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL, meta-refresh, UTF-8 overlongs, UTF-7 mis-sniffing, data binding, VML and other non-HTML stuff, broken markup parsed as scripts by permissive browsers...
In short any quick-fix attempt to sanitise HTML with a simple regex will fail badly.
Either escape everything so that any HTML is displayed as plain text, or use a full parser-and-whitelist-based sanitiser. (And keep it up-to-date, because even that's a hard job and there are often newly-discovered holes in them.)
But aren't you using the same Markdown system as SO itself to render posts? That would be the obvious thing to do. I can't guarantee there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past and there are probably some more obscure ones still in there as it's quite a complicated system). But at least you'd be no more insecure than SO is!
Use a regex to replace the script tags with encoded tags. This will filter the tags which have the word "script" in them and HtmlEncode them. Thus, all script tags such as <script>, </script> and <script type="text/javascript"> will get encoded, while other tags in the string are left alone.
string result = Regex.Replace(text, @"</?(\w+)[^>]*>",
    tag => tag.Groups[1].Value.ToLower().Contains("script") ? HttpUtility.HtmlEncode(tag.Value) : tag.Value,
    RegexOptions.Singleline);

What's the best way to remove (or ignore) script and form tags in HTML?

I have text stored in SQL as HTML. I'm not guaranteed that this data is well-formed, as users can copy/paste from anywhere into the editor control I'm using, or manually edit the HTML that's generated.
The question is: what's the best way of going about removing or somehow ignoring <script/> and <form/> tags so that, when the user's text is displayed elsewhere in the Web Application, it doesn't disrupt the normal operation of the containing page.
I've toyed with the idea of simply doing a "Find and Replace" for <script>/<form> with <div> (obviously taking into account whitespace and closing tags, if they exist). I'm also open to any way to somehow "ignore" certain tags. For all I know, there could be some built-in way of saying (in HTML, CSS, or JavaScript) "for all elements in <div id="MyContent">, treat <form> and <script> as <div>".
Any help or advice would be greatly appreciated!
In terms of sanitising user input, form and script tags are not the only ones that should be cleaned up.
The best way of doing this job depends a little on what tools you are using. Have a look at these questions:
What’s the best method for sanitizing user input with PHP?
Sanitising user input using Python
It depends on which language you're using. In general, I'd recommend using an HTML parser, constructing a small DOM from the snippet, then nuking unwanted elements. There are many good HTML parsers, especially ones designed to handle real-world, messy HTML. Examples include BeautifulSoup (Python) and HTMLParser (Java)... And, since the answer came in while I was typing: what Colin said!
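A parse-and-nuke sketch using Python's stdlib parser (the element set matches the question; note this is a blacklist of elements only, so attributes on the tags that are kept, such as event handlers, still need separate handling as other answers point out):

```python
from html.parser import HTMLParser

REMOVE = {"script", "form"}  # elements to nuke, per the question

class ElementRemover(HTMLParser):
    """Re-emits its input, skipping REMOVE elements and their content.

    Comments and doctypes are dropped in this sketch, and attributes on
    kept tags pass through untouched.
    """
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # nesting depth inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE:
            self.depth += 1
        elif not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in REMOVE and not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in REMOVE:
            self.depth = max(0, self.depth - 1)
        elif not self.depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

def strip_elements(raw):
    parser = ElementRemover()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)

print(strip_elements('<p>hi</p><script>alert(1)</script>'))
```

Because the parser tracks nesting, everything inside a removed element goes with it, which a find-and-replace on tag strings can't guarantee against malformed input.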
Don't try to do it yourself - there are far too many tricks for getting bits of script and general nastiness into a page. Use the Microsoft AntiXSS library - version 3.1 has HTML sanitization built in. You probably want the GetSafeHtmlFragment method, which returns a sanitized chunk of HTML. See my previous answer.
Since you're using .Net I would recommend HtmlAgilityPack as it is easy to work with and works well with malformed HTML.
Though the answers suggested were acceptable, I ended up using a good old regular expression to replace the opening and closing <script> and <form> tags with <div>s.
txtStore.Text = Regex.Replace(txtStore.Text, @"<(/?)(script|form)\b[^>]*>", "<$1div>", RegexOptions.IgnoreCase);
I had faced the same problem before, but my scenario was somewhat different. I was adding content to the page with an ajax request. The content in the ajax response was HTML and it included script tags. I just wanted the HTML without any scripts, so I removed all script tags from the ajax response with jQuery.
jquery-remove-script-tags-from-string

HTML Sanitization - bad markup?

I've been scanning some of the discussions on sanitizing HTML markup strings for redisplay on a page (e.g. blog comments). In the past I've just unilaterally escaped the markup for redisplay.
Does anyone know if there are any solutions out there that go beyond just removing "unsafe" tags?
What if the markup is invalid? For example, how do you prevent an unclosed <b> tag from boldfacing all the text that follows it on the page?
It seems like Stackoverflow handles this.
Example of unclosed 'b' tag
Thanks.
Stackoverflow either uses textile or something very much like it.
Textile is more or less guaranteed to spit out valid (x)html, ameliorating many typical problems with sanitizing user input.
Check this code:
Sanitize HTML, I think StackOverflow uses it somewhere...
A method to sanitize any potentially dangerous tags from the provided raw HTML input using a whitelist based approach, leaving the "safe" HTML tags.
The Html Agility Pack is probably a good starting point, as it claims to be very tolerant of badly formatted and malformed HTML. On top of that, you may want to build some rules to do further sanitization. In the end you serialize the obtained DOM back to plain HTML code.
I faced the same problem you did and built such a rule-based HTML sanitizer on top of the Html Agility Pack. It allows you to flatten or remove tags, transform tags (for example replacing b with strong tags) and restrict attribute usage. Take a look at the source code of the HtmlRuleSanitizer for ideas, or just get the NuGet package if you want to be done quickly.
