Is there a difference between Server.UrlEncode and HttpUtility.UrlEncode?
I had significant headaches with these methods before, I recommend you avoid any variant of UrlEncode, and instead use Uri.EscapeDataString - at least that one has a comprehensible behavior.
Let's see...
HttpUtility.UrlEncode(" ") == "+" //breaks ASP.NET when used in paths, non-
//standard, undocumented.
Uri.EscapeUriString("a?b=e") == "a?b=e" // makes sense, but rarely what you
// want, since you still need to
// escape special characters yourself
But my personal favorite has got to be HttpUtility.UrlPathEncode - this thing is really incomprehensible. It encodes:
" " ==> "%20"
"100% true" ==> "100%%20true" (ok, your url is broken now)
"test A.aspx#anchor B" ==> "test%20A.aspx#anchor%20B"
"test A.aspx?hmm#anchor B" ==> "test%20A.aspx?hmm#anchor B" (note the difference with the previous escape sequence!)
It also has the lovelily specific MSDN documentation "Encodes the path portion of a URL string for reliable HTTP transmission from the Web server to a client." - without actually explaining what it does. You are less likely to shoot yourself in the foot with an Uzi...
In short, stick to Uri.EscapeDataString.
HttpServerUtility.UrlEncode will use HttpUtility.UrlEncode internally. There is no specific difference. The reason for existence of Server.UrlEncode is compatibility with classic ASP.
Fast-forward almost 9 years since this was first asked, and in the world of .NET Core and .NET Standard, it seems the most common options we have for URL-encoding are WebUtility.UrlEncode (under System.Net) and Uri.EscapeDataString. Judging by the most popular answer here and elsewhere, Uri.EscapeDataString appears to be preferable. But is it? I did some analysis to understand the differences and here's what I came up with:
WebUtility.UrlEncode encodes space as +; Uri.EscapeDataString encodes it as %20.
Uri.EscapeDataString percent-encodes !, (, ), and *; WebUtility.UrlEncode does not.
WebUtility.UrlEncode percent-encodes ~; Uri.EscapeDataString does not.
Uri.EscapeDataString throws a UriFormatException on strings longer than 65,520 characters; WebUtility.UrlEncode does not. (A more common problem than you might think, particularly when dealing with URL-encoded form data.)
Uri.EscapeDataString throws a UriFormatException on the high surrogate characters; WebUtility.UrlEncode does not. (That's a UTF-16 thing, probably a lot less common.)
For URL-encoding purposes, characters fit into one of 3 categories: unreserved (legal in a URL); reserved (legal in but has special meaning, so you might want to encode it); and everything else (must always be encoded).
According to the RFC, the reserved characters are: :/?#[]#!$&'()*+,;=
And the unreserved characters are alphanumeric and -._~
The Verdict
Uri.EscapeDataString clearly defines its mission: %-encode all reserved and illegal characters. WebUtility.UrlEncode is more ambiguous in both definition and implementation. Oddly, it encodes some reserved characters but not others (why parentheses and not brackets??), and stranger still it encodes that innocently unreserved ~ character.
Therefore, I concur with the popular advice - use Uri.EscapeDataString when possible, and understand that reserved characters like / and ? will get encoded. If you need to deal with potentially large strings, particularly with URL-encoded form content, you'll need to either fall back on WebUtility.UrlEncode and accept its quirks, or otherwise work around the problem.
EDIT: I've attempted to rectify ALL of the quirks mentioned above in Flurl via the Url.Encode, Url.EncodeIllegalCharacters, and Url.Decode static methods. These are in the core package (which is tiny and doesn't include all the HTTP stuff), or feel free to rip them from the source. I welcome any comments/feedback you have on these.
Here's the code I used to discover which characters are encoded differently:
var diffs =
from i in Enumerable.Range(0, char.MaxValue + 1)
let c = (char)i
where !char.IsHighSurrogate(c)
let diff = new {
Original = c,
UrlEncode = WebUtility.UrlEncode(c.ToString()),
EscapeDataString = Uri.EscapeDataString(c.ToString()),
}
where diff.UrlEncode != diff.EscapeDataString
select diff;
foreach (var diff in diffs)
Console.WriteLine($"{diff.Original}\t{diff.UrlEncode}\t{diff.EscapeDataString}");
Keep in mind that you probably shouldn't be using either one of those methods. Microsoft's Anti-Cross Site Scripting Library includes replacements for HttpUtility.UrlEncode and HttpUtility.HtmlEncode that are both more standards-compliant, and more secure. As a bonus, you get a JavaScriptEncode method as well.
Server.UrlEncode() is there to provide backward compatibility with Classic ASP,
Server.UrlEncode(str);
Is equivalent to:
HttpUtility.UrlEncode(str, Response.ContentEncoding);
The same, Server.UrlEncode() calls HttpUtility.UrlEncode()
Related
If I have a URL with query parameters, is it valid to "escape" the & query parameter delimiter?
Ex.
go
vs
go
RFC 2396 clearly states that use of "&" is proper, but i cant find anything on the (in)validity of using escaped versions of the reserved characters.
One thing i noticed is Chrome seems to forgive them when clicking on the link in the browser, however when i view source of the page, and click on the link (/foo.html?cat=meow&dog=woof) from the view-source view, it doesn't work.
I'd love to know if there is any spec/section i can point to that says "only use & and dont use & or %26 (which is & URL encoded).
(Note: this question arises as I started working w a code base that structures their URLs in this fashion, I would personally use '&')
RCF 2396: http://www.ietf.org/rfc/rfc2396.txt
UPDATE 1
Correct - the actual URL that the server writes to the page is: < a href="/foo.html?cat=meow&dog=woof" >go< /a > .. is there a spec that speaks to the validity of using & as a query param delimiter? Im not looking for "what works mostly" in browsers, but what is the correct way(s) to delimit query params.
TLDR; All formulations that evaluate to & are equally valid.
From OP's link:
Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URI is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.
-- RFC: 2396 - Uniform Resource Identifiers (URI): Generic Syntax August 1998
by
T. Berners-Lee*
MIT/LCS
R. Fielding
U.C. Irvine
L. Masinter
Xerox Corporation
*: how cool is that!
The escaping is happening in HTML - when you click on such a link, the browser will treat & as &.
To encode & on the URL you can percent encode it to %26.
What I don't really understand is the benefit of using '?' instead of '&' in urls:
It makes nobody's life easier if we use a different character as the first separator character.
Can you come up with a reasonable explanation?
EDIT: after more research I found that "&" can be a part of file name (terms&conditions.html) so "?" is a good separator. But still I think using "?" for separators makes lives easier (from url generators and parsers point of view):
Is there any advantage in using "&" which is not clear at the first glance?
From the URI spec's (RFC 3986) point of view, the only separator here is "?". the format of the query is opaque; the ampersands just are something that HTML happens to use for form submissions.
The answer's pretty much in this article - http://www.skorks.com/2010/05/what-every-developer-should-know-about-urls/ . To highlight it, here goes :
Query is the preferred way to send some parameters to a resource on
the server. These are key=value pairs and are separated from the rest
of the URL by a ? (question mark) character and are normally separated
from each other by & (ampersand) characters. What you may not know is
the fact that it is legal to separate them from each other by the ;
(semi-colon) character as well. The following URLs are equivalent:
http://www.blah.com/some/crazy/path.html?param1=foo¶m2=bar
http://www.blah.com/some/crazy/path.html?param1=foo;param2
The RFC 3896 (https://www.ietf.org/rfc/rfc3986.txt) defines general and sub delimiters ... '?' is a general, '&' and ';' are sub. The spec is pretty clear about that.
In this case the latter '?' chars would be treated as part of the query. If the query parser follows the spec strictly, it would then pass the whole query on to the app-destination. If the app-destination could choose to further process the query string in a manner which treats the ? as a param name-value pairs delimiter, that is up to the app's designers.
My guess is that this often 'just works' because code that splits query strings and the original uri uses all delimiters for matching: 1) first query is split on '?' then 2) query string is parsed using char match list that includes '?' (convenience only).... This could be occurring in ubiquitous parsing libraries already.
I was reading some questions trying to find a good solution to preventing XSS in user provided URLs(which get turned into a link). I've found one for PHP but I can't seem to find anything for .Net.
To be clear, all I want is a library which will make user-provided text safe(including unicode gotchas?) and make user-provided URLs safe(used in a or img tags)
I noticed that StackOverflow has very good XSS protection, but sadly that part of their Markdown implementation seems to be missing from MarkdownSharp. (and I use MarkdownSharp for a lot of my content)
Microsoft has the Anti-Cross Site Scripting Library; you could start by taking a look at it and determining if it fits your needs. They also have some guidance on how to avoid XSS attacks that you could follow if you determine the tool they offer is not really what you need.
There's a few things to consider here. Firstly, you've got ASP.NET Request Validation which will catch many of the common XSS patterns. Don't rely exclusively on this, but it's a nice little value add.
Next up you want to validate the input against a white-list and in this case, your white-list is all about conforming to the expected structure of a URL. Try using Uri.IsWellFormedUriString for compliance against RFC 2396 and RFC 273:
var sourceUri = UriTextBox.Text;
if (!Uri.IsWellFormedUriString(sourceUri, UriKind.Absolute))
{
// Not a valid URI - bail out here
}
AntiXSS has Encoder.UrlEncode which is great for encoding string to be appended to a URL, i.e. in a query string. Problem is that you want to take the original string and not escape characters such as the forward slashes otherwise http://troyhunt.com ends up as http%3a%2f%2ftroyhunt.com and you've got a problem.
As the context you're encoding for is an HTML attribute (it's the "href" attribute you're setting), you want to use Encoder.HtmlAttributeEncode:
MyHyperlink.NavigateUrl = Encoder.HtmlAttributeEncode(sourceUri);
What this means is that a string like http://troyhunt.com/<script> will get escaped to http://troyhunt.com/<script> - but of course Request Validation would catch that one first anyway.
Also take a look at the OWASP Top 10 Unvalidated Redirects and Forwards.
i think you can do it yourself by creating an array of the charecters and another array with the code,
if you found characters from the array replace it with the code, this will help you ! [but definitely not 100%]
character array
<
>
...
Code Array
& lt;
& gt;
...
I rely on HtmlSanitizer. It is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks.
It uses AngleSharp to parse, manipulate, and render HTML and CSS.
Because HtmlSanitizer is based on a robust HTML parser it can also shield you from deliberate or accidental
"tag poisoning" where invalid HTML in one fragment can corrupt the whole document leading to broken layout or style.
Usage:
var sanitizer = new HtmlSanitizer();
var html = #"<script>alert('xss')</script><div onload=""alert('xss')"""
+ #"style=""background-color: test"">Test<img src=""test.gif"""
+ #"style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "http://www.example.com");
Assert.That(sanitized, Is.EqualTo(#"<div style=""background-color: test"">"
+ #"Test<img style=""margin: 10px"" src=""http://www.example.com/test.gif""></div>"));
There's an online demo, plus there's also a .NET Fiddle you can play with.
(copy/paste from their readme)
I have different cases:
No spaces allowed at all
No spaces allowed at the beginning or the end of the string only
..A little question, is it good to check (to validate) the input for spaces through the RegEx of the RegularExpressionValidator ?
The \S escape sequence matches anything that isn't a whitespace character.
Thus the regexes that you need follow:
^\S+$
^\S.*\S$
In your previous question, you mentioned you wanted from 0 to 50 characters. If that's still the case, here's what you want:
/^\S{0,50}$/
/^(?!\s).{0,50}(?<!\s)$/
As of right now, I think these are the only regexes posted that allow for less than one letter with the first pattern, and less than two letters with the second pattern.
Regexes are not a "bad" thing, they're just a specialized tool that isn't suited for every task. If you're trying to validate input in ASP.NET, I would definitely use a RegularExpressionValidator for this particular pattern, because otherwise you'll have to waste your time writing a CustomValidator for a pretty meager performance boost. See my answer to this other question for a little guidance on when and when not to use regex.
In this case, the reason I'd use a regex validator has less to do with the pattern itself and more to do with ASP.NET. A RegularExpressionValidator can just be dragged and dropped into your ASPX code, and all you'd have to write would be 10-21 characters of regex. With a CustomValidator, you'd have to write custom validation functions, both in the codebehind and the JavaScript. You might squeeze a little more performance out of it, but think about when validation comes into play: only once per postback. The performance difference is going to be less than a millisecond. It's simply not worth your time as a developer -- to you or your employer. Remember: Hardware is cheap, programmers are expensive, and premature optimization is the root of all evil.
You know the saying about regex's (now you have two problems) - this is doable without regex and should be more performant and easier to read.
string checkString = /* whatever */
if(checkString.IndexOf(" ") > -1)
// Failed Condition 1
if(checkString.Trim() != checkString)
// Failed Condition 2
No spaces allowed at all:
^\S+$
No spaces allowed at the beginning or end:
^\S+.*\S+$
The System.String class contains everything you need:
No spaces allowed at all
This will handle the case of spaces only:
bool valid = !str.Contains(" ");
If you need to check for tabs as well:
char[] naughty = " \t".ToCharArray();
bool fail = (str.IndexOfAny(naughty) == -1);
There are other whitespace characters you could check for, see Character Escapes for more details.
No spaces allowed at the beginning or the end of the string only
A bit simpler, since Trim() will remove any kind of whitespace, including newlines:
bool valid = str.Length == str.Trim().Length;
Can I with ASP.NET Resources/Localization translate strings to depend on one or other (the English grammar) in a easy way, like I pass number 1 in my translate, it return "You have one car" or with 0, 2 and higher, "You have %n cars"?
Or I'm forced to have logic in my view to see if it's singular or plural?
JasonTrue wrote:
To the best of my knowledge, there isn't a language that requires something more complicated than singular/plural
Such languages do exist. In my native Polish, for example, there are three forms: for 1, for 2-4 and for zero and numbers greater than 4. Then after you reach 20, the forms for 21, 22-24 and 25+ are again different (same grammatical forms as for numerals 0-9). And yes, "you have 0 things" sounds awkward, because you're not likely to see that used in real life.
As a localization specialist, here's what I would recommend:
If possible, use forms which put the numeral at the end:
a: Number of cars: %d
This means the form of the noun "car" does not depend on the numeral, and 0 is as natural as any other number.
If the above is not acceptable, at least always make a complete sentence a translatable resource. That is, do use
b: You have 1 car.
c: You have %d cars.
But never split such units into smaller fragments such as
d: You have
e: car(s)
(Then somewhere else you have a non-localizable resource such as %s %d %s)
The difference is that while I cannot translate (c) directly, since the form of the noun will change, I can see the problem and I can change the sentence to form (a) in translation.
On the other hand, when I am faced with (d) and (e) fragments, there is no way to make sure the resulting sentence will be grammatical. Again: using fragments guarantees that in some languages the translation will be anything from grammatically awkward to completely broken.
This applies across the board, not just to numerals. For example, a fragment such as %s was deleted is also untranslatable, since the form of the verb will depend on the gender of the noun, which is unavailable here. The best I can do for Polish is the equivalent of Deleted: %s, but I can only do it as long as the %s placeholder is included in the translatable resource. If all I have is "was deleted" with no clue to the referent noun, I can only startle my dog by cursing aloud and in the end I still have to produce garbage grammar.
I've been working on a library to assist with internationalization of an application. It's called SmartFormat and is open-source on GitHub.
It contains "grammatical number" rules for many languages that determine the correct singular/plural form based on the locale. When translating a phrase that has words that depend on a quantity, you specify the multiple forms and the formatter chooses the correct form based on these rules.
For example:
var message = "There {0:is:are} {0} {0:item:items} remaining.";
var output = Smart.Format(culture, message, items.Count);
It has a very similar syntax to String.Format(...), but has tons of great features that make it easy to create natural and grammatically-correct messages.
It also deals with possible gender-specific formatting, lists, and much more.
moodforaday wrote:
This applies across the board, not just to numerals. For example, a fragment such as "%s was deleted" is also untranslatable, since the form of the verb will depend on the gender of the noun, which is unavailable here. The best I can do for Polish is the equivalent of "Deleted: %s", but I can only do it as long as the %s placeholder is included in the translatable resource. If all I have is "was deleted" with no clue to the referent noun, I can only startle my dog by cursing aloud and in the end I still have to produce garbage grammar.
The point I would like to make here, is never include a noun as a parameter. Many European languages modify the noun based on its gender, and in the case of German, whether it is the subject, object, or indirect object. Just use separate messages.
Instead of, "%s was deleted.", use a separate translation string for each type:
"The transaction was deleted."
"The user was deleted." etc.
This way, each string can be translated properly.
The solution to handle plural forms in different languages can be found here:
http://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html
While you can't use the above software in a commercial application (it's covered under GPL), you can use the same concepts. Their plural form expressions fit perfectly with lambda expressions in .NET. There are a limited number of plural form expressions that cover a large number of languages. You can map a particular language to a lambda expression that calculates which plural form to use based on the language. Then, look up the appropriate plural form in a .NET resx file.
There is nothing built in, we ended up coding something like this:
Another solution can be:
use place holders like {CAR} in the resource file strings
Maintain separate resource entries for singular and plural words for "car" :
CAR_SINGULAR car
CAR_PLURAL cars
Develop a class with this logic:
class MyResource
{
private List<string> literals = new List<string>();
public MyResource() { literals.Add("{CAR}") }
public static string GetLocalizedString(string key,bool isPlural)
{
string val = rm.GetString(key);
if(isPlural)
{
val = ReplaceLiteralWithPlural(val);
}
else
{
val = ReplaceLiteralWithSingular(val);
}
}
}
string ReplaceLiteralWithPlural(string val)
{
StringBuilder text = new StringBuilder(val)
foreach(string literal in literals)
{
text = text.Replace(literal,GetPluralKey(literal));
}
}
string GetPluralKey(string literal)
{
return literal + "_PLURAL";
}
This is still not possible in .NET natively. But it is a well known problem and already solved by gettext. It have native support for many technologies, even includes C# support. But looks like there is modern reimplementation now in .NET: https://github.com/VitaliiTsilnyk/NGettext
Orchard CMS uses GNU Gettext's PO files for localization and plural forms.
Here is their code where you can grab the idea:
https://github.com/OrchardCMS/OrchardCore/tree/main/src/OrchardCore/OrchardCore.Localization.Core
https://github.com/OrchardCMS/OrchardCore/tree/main/src/OrchardCore/OrchardCore.Localization.Abstractions
The logic needn't be in your view, but it's certainly not in the resource model the DotNet framework. Yours is a simple enough case that you can probably get away with creating a simple format string for singular, and one for plural. "You have 1 car."/"You have {0} cars." You then need to write a method that discriminates between the plural and singular case.
There are also a number of languages where there's no distinction between singular and plural, but this just requires a little duplication in the resource strings. (edit to add: gender and sometimes endings change depending on number of units in many languages, but this should be fine as long as you distinguish between singular and plural sentences, rather than just words).
Microsoft mostly gave up on semantically clever localization because it's hard to generalize even something like pluralization to 30+ languages. This is why you see such boring presentation of numeric quantities in most applications that are translated to many languages, along the lines of "Cars: {0}". That sounds lame, but surprisingly, the last time I checked, usability studies didn't actually favor the verbose natural language presentation in most cases.