When is it definitely needed, or at least good practice, to use escaping functions?
Such as using esc_url(); with:
get_template_directory_uri();
get_permalink();
get_author_posts_url();
get_edit_post_link();
wp_get_attachment_url();
And esc_html(); with:
get_the_title();
get_the_author();
get_the_date();
get_search_query();
Also, I think esc_html() and esc_attr() are very similar, aren't they? What are the differences?
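For context, here's a simplified sketch of the kind of template output I have in mind (the markup is just illustrative, not my actual theme code):
<!-- Inside the Loop: URLs go through esc_url(), plain-text output through esc_html(). -->
<article>
    <h2>
        <a href="<?php echo esc_url( get_permalink() ); ?>">
            <?php echo esc_html( get_the_title() ); ?>
        </a>
    </h2>
    <p>
        By <a href="<?php echo esc_url( get_author_posts_url( get_the_author_meta( 'ID' ) ) ); ?>">
            <?php echo esc_html( get_the_author() ); ?>
        </a>
        on <?php echo esc_html( get_the_date() ); ?>
    </p>
</article>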
Part 1
According to the documentation - Validating, Sanitizing, and Escaping - by the WP VIP team:
Guiding Principles
Never trust user input.
Escape as late as possible.
Escape everything from untrusted sources (like databases and users), third-parties (like Twitter), etc.
Never assume anything.
Never trust user input.
Sanitization is okay, but validation/rejection is better.
Never trust user input.
“Escaping isn’t only about protecting from bad guys. It’s just making our software durable. Against random bad input, against malicious input, or against bad weather.” –nb
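To make these principles concrete, here's a minimal sketch of my own: validate/reject on the way in, and escape at the very last moment on the way out. The my_color query variable and its allowed values are made up purely for illustration:
<?php
// Validation/rejection on input: only accept values we know are good.
$allowed_colors = array( 'red', 'green', 'blue' );
$color          = isset( $_GET['my_color'] ) ? wp_unslash( $_GET['my_color'] ) : 'red';
if ( ! in_array( $color, $allowed_colors, true ) ) {
    $color = 'red'; // Reject anything unexpected instead of trying to clean it up.
}

// Escaping as late as possible: at the moment of output, for the HTML context in use.
echo '<div class="' . esc_attr( 'box-' . $color ) . '">' . esc_html( $color ) . '</div>';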
Part 2
Codex entry for esc_html
Codex entry for esc_attr
According to the article - Introduction to WordPress Front End Security: Escaping the Things by Andy Adams from CSS-Tricks.
Function: esc_html
Used for: Output that should have absolutely no HTML in it.
What it does: Converts HTML special characters (such as <, >, &) into their "escaped" entities (&lt;, &gt;, &amp;).
Function: esc_attr
Used for: Output being used in the context of an HTML attribute (think "title", "data-" fields, "alt" text).
What it does: The exact same thing as esc_html. The only difference is that different WordPress filters are applied to each function.
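To illustrate the difference between the two contexts, here's a rough sketch of my own (not from the article; the $attachment_id variable is assumed to be defined elsewhere):
<?php
$title = get_the_title();

// Element content: esc_html() so any markup in the title is rendered as plain text.
echo '<h2>' . esc_html( $title ) . '</h2>';

// Attribute values: esc_attr() inside the quoted alt attribute, esc_url() for the URL.
echo '<img src="' . esc_url( wp_get_attachment_url( $attachment_id ) ) . '" alt="' . esc_attr( $title ) . '" />';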
Related
I'm making an ad manager plugin for WordPress, so the advertisement code can be almost anything - from good code to dirty, even evil.
I'm using simple sanitization like:
$get_content = '<script>/*code to destroy the site*/</script>';
//insert into db
$sanitized_code = addslashes( $get_content );
When viewing:
$fetched_data = /*slashed code*/;
//show as it's inserted
echo stripslashes( $fetched_data );
I'm avoiding base64_encode() and base64_decode() because I've read they're a bit slow.
Is that enough?
If not, what else should I ensure to protect the site and/or the DB from an evil attack using bad ad code?
I'd love an explanation of why you're suggesting something - it'll help me decide the right thing in the future too. Any help would be greatly appreciated.
addslashes followed by stripslashes is a round trip. You are echoing the original string exactly as it was submitted to you, so you are not protected from anything at all. '<script>/*code to destroy the site*/</script>' will be output exactly as-is to your web page, allowing your advertisers to do whatever they like in your page's security context.
Normally, when including submitted content in a web page, you should be using htmlspecialchars so that everything comes out as plain text and < just means a less-than sign.
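For example, a minimal sketch at output time, reusing the $fetched_data variable from your code above:
<?php
// Escape at output: the submitted ad code is rendered as visible text, not executed.
// ENT_QUOTES also escapes single quotes, which matters if this ever lands in an attribute.
echo htmlspecialchars( $fetched_data, ENT_QUOTES, 'UTF-8' );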
If you want an advertiser to be able to include markup, but not dangerous constructs like <script> then you need to parse the HTML, only allowing tags and attributes you know to be safe. This is complicated and difficult. Use an existing library such as HTMLPurifier to do it.
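A rough sketch of what that looks like with HTMLPurifier - the allowed tag/attribute list here is only an example, so tune it to what your advertisers actually need:
<?php
require_once 'HTMLPurifier.auto.php'; // Path depends on how the library is installed.

$config = HTMLPurifier_Config::createDefault();
// Whitelist: only these elements and attributes survive purification.
$config->set( 'HTML.Allowed', 'p,b,i,strong,em,a[href],img[src|alt]' );

$purifier = new HTMLPurifier( $config );
echo $purifier->purify( $fetched_data ); // Scripts, event handlers and unknown tags are stripped.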
If you want an advertiser to be able to include markup with scripts, then you should put them in an iframe served from a different domain name, so they can't touch what's in your own page. Ads are usually done this way.
I don't know what you're hoping to do with addslashes. It is not the correct form of escaping for any particular injection context and it doesn't even remove difficult characters. There is almost never any reason to use it.
If you are using it on string content to build a SQL query containing that content then STOP, this isn't the proper way to do that and you will also be mangling your strings. Use parameterised queries to put data in the database. (And if you really can't, the correct string literal escape function would be mysql_real_escape_string or other similarly-named functions for different databases.)
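For instance, since your plugin runs inside WordPress, here's a sketch using $wpdb->prepare(); the my_ads table and ad_code column are made up for illustration:
<?php
global $wpdb;

// Parameterised insert: the ad code is passed as data and never concatenated into the SQL string.
$wpdb->query(
    $wpdb->prepare(
        "INSERT INTO {$wpdb->prefix}my_ads (ad_code) VALUES (%s)",
        $get_content
    )
);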
I have a website that allows entering HTML through a TinyMCE rich editor control. Its purpose is to allow users to format text using HTML.
This user-entered content is then output to other users of the system.
However, this means someone could insert JavaScript into the HTML in order to perform an XSS attack on other users of the system.
What is the best way to filter out JavaScript code from an HTML string?
If I perform a regular expression check for <SCRIPT> tags, it's a good start, but an evildoer could still attach JavaScript to the onclick attribute of a tag.
Is there a fool-proof way to strip out all JavaScript code, whilst leaving the rest of the HTML untouched?
For my particular implementation, I'm using C#
Microsoft have produced their own anti-XSS library, Microsoft Anti-Cross Site Scripting Library V4.0:
The Microsoft Anti-Cross Site Scripting Library V4.0 (AntiXSS V4.0) is an encoding library designed to help developers protect their ASP.NET web-based applications from XSS attacks. It differs from most encoding libraries in that it uses the white-listing technique -- sometimes referred to as the principle of inclusions -- to provide protection against XSS attacks. This approach works by first defining a valid or allowable set of characters, and encodes anything outside this set (invalid characters or potential attacks). The white-listing approach provides several advantages over other encoding schemes. New features in this version of the Microsoft Anti-Cross Site Scripting Library include:
- A customizable safe list for HTML and XML encoding
- Performance improvements
- Support for Medium Trust ASP.NET applications
- HTML Named Entity Support
- Invalid Unicode detection
- Improved Surrogate Character Support for HTML and XML encoding
- LDAP Encoding Improvements
- application/x-www-form-urlencoded encoding support
It uses a whitelist approach to strip out potential XSS content.
Here are some relevant links related to AntiXSS:
Anti-Cross Site Scripting Library
Microsoft Anti-Cross Site Scripting Library V4.2 (AntiXSS V4.2)
Microsoft Web Protection Library
Peter, I'd like to introduce you to two concepts in security:
Blacklisting - Disallow things you know are bad.
Whitelisting - Allow things you know are good.
While both have their uses, blacklisting is insecure by design.
What you are asking for is, in fact, blacklisting. If there were ever an alternative to <script> (such as <img src="bad" onerror="hack()"/>), you wouldn't be able to avoid the issue.
Whitelisting, on the other hand, allows you to specify the exact conditions you are allowing.
For example, you would have the following rules:
allow only these tags: b, i, u, img
allow only these attributes: src, href, style
That is just the theory. In practice, you must parse the HTML accordingly, hence the need for a proper HTML parser.
If you want to allow some HTML but not all, you should use something like OWASP AntiSamy, which allows you to build a whitelisted policy over which tags and attributes you allow.
HTMLPurifier might also be an alternative.
It's of key importance that this be a whitelist approach, since new attributes and events are added to HTML5 all the time; any blacklist would fail within a short time, and knowing all the "bad" attributes is also difficult.
Edit: Oh, and regex is a bit hard to use here. HTML can come in lots of different forms: tags can be unclosed, attributes can start with or without quotes (single or double), and you can have line breaks and all kinds of spaces within the tags, to name a few issues. I would rely on a well-tested library like the ones I mentioned above.
Regular expressions are the wrong tool for the job, you need a real HTML parser or things will turn bad. You need to parse the HTML string and then remove all elements and attributes but the allowed ones (whitelist approach, blacklists are inherently insecure). You can take the lists used by Mozilla as a starting point. There you also have a list of attributes that take URL values - you need to verify that these are either relative URLs or use an allowed protocol (typically only http:/https:/ftp:, in particular no javascript: or data:). Once you've removed everything that isn't allowed you serialize your data back to HTML - now you have something that is safe to insert on your web page.
I tried to replace the tag element format like this:
using System.Text.RegularExpressions;

public class Utility
{
    public static string PreventXSS(string sInput)
    {
        if (sInput == null)
            return string.Empty;

        // Insert a space after every '<' so the browser no longer sees the start of a tag.
        string sResult = Regex.Replace(sInput, "<", "< ");
        // Collapse any whitespace following '<' down to a single space.
        sResult = Regex.Replace(sResult, @"<\s*", "< ");
        return sResult;
    }
}
Usage before saving to the DB:
string sResultNoXSS = Utility.PreventXSS(varName);
I have tested it with input data like:
<script>alert('hello XSS')</script>
which would run in the browser. After I add the anti-XSS code above, it becomes:
< script>alert('hello XSS')< /script>
(There is a space after <.)
As a result, the script won't be run in the browser.
I'm currently working on a project that demands a few web applications written in Prolog, and I chose to use the famous SWI-Prolog PWP library, which processes Prolog queries embedded in an HTML file.
I have a page responding to the following request example:
/user?id=N
Where N is an integer value.
But I'm having trouble reading the query-string id of the request inside the HTML file.
I have the .pl file:
showUser(UserId, Request) :-
reply_pwp_file(mydir('user_page.html'), [mime_type('text/html')], Request).
I don't know how I can read the UserId, or how to read the Request in order to retrieve the UserId from the query string.
I tried this way in the HTML markup:
<span pwp:ask="http_parameters(Request, [id(UserId, [optional(true)])])." pwp:use="UserId" />
Has anyone had this kind of trouble before?
Thank you very much.
Here are some interesting links that may help us:
PWP/SGML Pages
SWI-Prolog HTTP Library
It took me some time, but at least I've been able to run the demo_pwp.pl that I found in ~/pl-devel/packages/http/examples. Now, after
?- server(1234).
I open the URL
http://localhost:1234/user_id.pwp?user_id=1&user_name=carlo
where I wrote the following in the ~/pl-devel/packages/http/examples/pwp/user_id.pwp file:
<?xml version="1.0"?>
<!DOCTYPE html>
<html xmlns:pwp="http://www.cs.otago.ac.nz/staffpriv/ok/pwp.pl">
<head>
<title>Context variables for PWP scripts</title>
</head>
<body>
<p>This PWP demo lists the context-parameters that are passed into
the script.
</p>
<ul>
<li pwp:ask="member(Name=Value, CONTEXT)">
<span class=name pwp:use="Name"/>
=
<span class=value pwp:use="writeq(Value)"/>
</li>
</ul>
<!-- here is the specific part for my answer -->
<p pwp:ask="memberchk('QUERY'=Q, CONTEXT),memberchk(user_id=UID,Q),memberchk(user_name=NAME,Q)">
UID : <span pwp:use="UID"/> / NAME : <span pwp:use="NAME"/>
</p>
<!-- nested access is well thought out -->
<p pwp:ask="member('QUERY'=Q,CONTEXT)">
UID : <span pwp:use="UID" pwp:ask="member(user_id=UID,Q)"/>
/ NAME : <span pwp:use="NAME" pwp:ask="member(user_name=NAME,Q)"/>
</p>
</body>
</html>
(that's a copy of context.pwp, with my info added at the bottom)
and I get
This PWP demo lists the context-parameters that are passed into the script.
...
- QUERY = [user_id='1',user_name=carlo]
...
UID : 1 / NAME : carlo
UID : 1 / NAME : carlo
So I can confirm that the guidelines Giulio suggested are OK.
This is really out of the blue, since I haven't churned Prolog in a long time, but I'm slightly amused at the effort of writing web applications in Prolog, and because I sympathize (long story short: I tried it myself years ago, but it wasn't pure Prolog) I figured I could take my chance at pointing out what I noticed by reading the documentation. Its clarity and extensiveness, by the way, are not the reason why PWP is "famous", I presume.
However, buried somewhere in the PWP page you linked there is a blurb about the attribute pwp:use, which is said to take a Term as its value.
Term is a Prolog term; variables in Term are bound by the context. An empty Term is regarded as a missing value for this attribute. The Prolog variable CONTEXT refers to the entire context, a list of Name = Value, where Name is a Prolog atom holding the name of the context variable and Value is an arbitrary Prolog term.
Buried somewhere else, namely in the documentation page for reply_pwp_page/3 (oh, there's no reply_pwp_file/3 up there on the page you linked, really, even though you used it), there's another interesting snippet listing the contents of the so-called initial context, and in particular:
QUERY [is a] Var=Value list representing the query-parameters
Since there is no hint or suggestion or even example about the use of the query parameters list - but that's hardly the worst problem for someone forced to write web applications in Prolog anyway - my personal take is that the name of the query parameter id is just id (hoping that Var is just a misnomer for Param, not a real Prolog variable) and that the value is, well, just the value. But then again, we know nothing about conversions or whatever may happen automatically during the parsing of the query string; in the query string everything is a string, but you may need a numeric id, and you are probably left on your own converting that string to a number. I guess there's some magical predicate doing exactly that, somewhere. Ain't Prolog wonderful?
So, without any other clue, and with lots of thanks to those writing the documentation of this... stuff, my wild guess is that you need somewhere the following element - an empty span, no less, which is illegal in any reasonably valid HTML document:
<span pwp:ask="..."/>
where, as the ask value, you should provide a query that traverses the CONTEXT list (by means of member/2, maybe?) until it finds a term of the form 'QUERY'=QueryParameters; then QueryParameters should hold the actual query parameters list, so you need to traverse it in the same fashion as the CONTEXT list before, and when you find a term of the form id=N, there you finally are: N should contain the value of your hard-earned user id.
Now, I really hope it's way simpler than what I have outlined. Remember, it's just a wild guess from looking at the documentation you pointed to. But, while others will quite probably be busy down-voting this answer for a number of reasons (hopefully because it's plain wrong and the solution is way simpler), my last, parting suggestion is for you to discuss the constraints of your project again with whoever is in charge of them, because writing web applications in Prolog is really an unreasonable thing to do when there are plenty of frameworks (frameworks, I say, not just some module thrown into the standard library for the "greater good") written in other languages that are incredibly well documented, much simpler to understand and, of course, to use.
I just bought a book on ASP.NET MVC with the Razor view engine. There is a subsection called Usage of @ Operator, and this subsection title makes me ... well, uncomfortable.
Is @ inside the Razor view engine called an operator?
UPDATE
I guess my question is not so clear. I want to know if @ is an operator inside the Razor view engine. For example, <, >, =, !=, >=, => - these are called operators in the C# language. Is it the same for @ inside the Razor view engine?
I think the reason for your discomfort is that the @ token (when talking about it from a parsing perspective, though the word character would also do) is overloaded to indicate a number of different situations. Let's examine what those are:
Write a value to the output:
@this.Value
Indicate a code block transition:
@{
var foo = 1;
foo += 1;
}
Indicate a code statement transition:
<div>
@if(foo) {
// code
}
</div>
Indicate an escape to markup until the end of the line:
@if(foo) {
foo = false;
@: value printed to output
}
Indicate directive statements:
@inherits MyCustomBaseType
Indicate special code blocks:
@section Foo {
<div />
}
@helper Bar(int param) {
param += 1;
}
Delimit comment blocks:
@*
This is a comment
*@
Escape the @ character:
Email me at myemail@@example.com
In my opinion, only the first usage can be considered an operator. The operand (i.e. everything that follows the @ up to, but excluding, the first markup-significant whitespace character) is passed as the parameter to the Write() method. All the other usages don't have any clearly identifiable operands, or require additional tokens (the * in the comment block, etc.) to be identified.
According to the official pages on Razor (you can find one example here), it does not seem that this is called an operator.
From the linked page:
You add code to a page using the @ character
(my emphasis)
I also found numerous other pages on that same site, all referring to it as just the "@ character", so in that sense it isn't considered an operator.
<opinion>
However, if you read on Wikipedia on the topic of operators, then:
Syntactically operators usually contrast to functions. In most languages, functions may be seen as a special form of prefix operator with fixed precedence level and associativity, often with compulsory parentheses e.g. Func(a) (or (Func a) in LISP). Most languages support programmer-defined functions, but cannot really claim to support programmer-defined operators, unless they have more than prefix notation and more than a single precedence level. Semantically operators can be seen as special form of function with different calling notation and a limited number of parameters (usually 1 or 2).
(again, my emphasis)
Then I would argue that @ is in fact an operator. It is a symbol with a specific meaning, and you could argue that you're "escaping out of the surrounding context to do something else", sort of like a function call.
In other words, while the word operator does not appear in the website articles I've seen so far, I would consider it to be an operator nonetheless.
</opinion>
Hmm, are you referring to the Fonts and Colors section of the Tools > Options dialog in Visual Studio? If so, no it isn't an Operator, it's "HTML Server-Side Script". The "functions", "section" and other razor-specific keywords are also this color.
Otherwise, yes it is an operator from a technical description. I'm not entirely sure what other distinction you are looking for.
No, the @ character is not used as an operator. It's only that author's choice to call it so.
For example, in the blog post introducing Razor it's not described as an operator anywhere.
There doesn't seem to be an official or de-facto term for it yet. Personally I would call it a tag, as it's used in the markup code along with other tags, and it's used as a replacement for the <% %> script tag.
I guess it's not really incorrect to call it an operator, but the term isn't commonly used. As the Razor tag contains C# or VB code, which can also contain operators, it can get confusing.
The @ character starts inline expressions, single statement blocks, and multi-statement blocks.
On Scott Gu's blog, he doesn't refer to it as an operator but there are lots of other developers who do. Personally, I don't see this as an operator but more of an identifier.
As far as I know, symbols like >, != and && are called equality, relational, or conditional operators. The @ symbol does not seem to fit into any of those groups, so I must say, no, it cannot be called an operator. On the other hand, however, it does indicate that 'some operation' is performed, which makes it rather likely to be called an operator.
The @ symbol is not an operator; it is an indicator that marks where a Razor statement begins.
It is comparable to <?php ?> in PHP, which indicates to the compiler/interpreter where your code is.
e.g.
@Html.ActionLink(...)
and a @{ ... } block gives you the opportunity to do some coding in there,
e.g.
@{
    int a = 0;
    int b = 4;
    int c = a + b;
}
I hope this helps you on the way
I was reading some questions trying to find a good solution to preventing XSS in user-provided URLs (which get turned into links). I've found one for PHP, but I can't seem to find anything for .NET.
To be clear, all I want is a library which will make user-provided text safe (including Unicode gotchas?) and make user-provided URLs safe (used in a or img tags).
I noticed that Stack Overflow has very good XSS protection, but sadly that part of their Markdown implementation seems to be missing from MarkdownSharp (and I use MarkdownSharp for a lot of my content).
Microsoft has the Anti-Cross Site Scripting Library; you could start by taking a look at it and determining if it fits your needs. They also have some guidance on how to avoid XSS attacks that you could follow if you determine the tool they offer is not really what you need.
There are a few things to consider here. Firstly, you've got ASP.NET Request Validation, which will catch many of the common XSS patterns. Don't rely exclusively on this, but it's a nice little value-add.
Next up, you want to validate the input against a whitelist, and in this case your whitelist is all about conforming to the expected structure of a URL. Try using Uri.IsWellFormedUriString for compliance with RFC 2396 and RFC 2732:
var sourceUri = UriTextBox.Text;
if (!Uri.IsWellFormedUriString(sourceUri, UriKind.Absolute))
{
// Not a valid URI - bail out here
}
AntiXSS has Encoder.UrlEncode, which is great for encoding strings to be appended to a URL, i.e. in a query string. The problem is that you want to take the original string and not escape characters such as the forward slashes, otherwise http://troyhunt.com ends up as http%3a%2f%2ftroyhunt.com and you've got a problem.
As the context you're encoding for is an HTML attribute (it's the "href" attribute you're setting), you want to use Encoder.HtmlAttributeEncode:
MyHyperlink.NavigateUrl = Encoder.HtmlAttributeEncode(sourceUri);
What this means is that a string like http://troyhunt.com/<script> will get escaped to http://troyhunt.com/&lt;script&gt; - but of course Request Validation would catch that one first anyway.
Also take a look at the OWASP Top 10 Unvalidated Redirects and Forwards.
I think you can do it yourself by creating an array of the characters and another array with the codes;
if you find characters from the array, replace them with the code. This will help you! [But definitely not 100%.]
character array
<
>
...
Code Array
&lt;
&gt;
...
I rely on HtmlSanitizer. It is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks.
It uses AngleSharp to parse, manipulate, and render HTML and CSS.
Because HtmlSanitizer is based on a robust HTML parser, it can also shield you from deliberate or accidental "tag poisoning", where invalid HTML in one fragment can corrupt the whole document, leading to broken layout or style.
Usage:
var sanitizer = new HtmlSanitizer();
var html = @"<script>alert('xss')</script><div onload=""alert('xss')"""
    + @"style=""background-color: test"">Test<img src=""test.gif"""
    + @"style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "http://www.example.com");
Assert.That(sanitized, Is.EqualTo(@"<div style=""background-color: test"">"
    + @"Test<img style=""margin: 10px"" src=""http://www.example.com/test.gif""></div>"));
There's an online demo, plus there's also a .NET Fiddle you can play with.
(copy/paste from their readme)