Should I use Replace() instead of HtmlEncode()? - asp.net

Should HtmlEncode() be abandoned and Replace() used instead if I want to parse links in posts/comments (with regular expressions)? HtmlEncode() replaces & with &amp;, which I assume can cause problems with links. Should I just use Replace() to replace < with &lt;?
For example if a user posts something like:
See this site http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3
I want it to be:
See this site <a href="http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3">http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3</a>
But with HtmlEncode() the URL will become (notice the ampersands):
See this site http://www.somesite.com/somepage.aspx?qs1=1&amp;qs2=2&amp;qs3=3
Should I avoid the problem by using Replace() instead?
Thanks

Actually, your last example - the one you're worried about - is the only correct one. In HTML documents, ampersands are used to introduce entity references, and therefore must be escaped. While most browsers are forgiving enough to let them slip through when not obviously part of an entity reference, you can run into subtle problems should their use in a URL happen to look like an entity.
Let HtmlEncode() do its job.
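To make the order of operations concrete, here is a minimal sketch in Python rather than ASP.NET: html.escape stands in for HtmlEncode(), and the function name and URL pattern are only illustrative assumptions. The text is encoded first and the links are wrapped afterwards, so the href ends up with &amp; as it should.

```python
import html
import re

# Illustrative pattern: after escaping, the text contains no raw quotes or
# angle brackets, so "any run of non-whitespace" is enough for a sketch.
URL_PATTERN = re.compile(r"https?://\S+")


def render_comment(text: str) -> str:
    """Encode the user text first, then wrap URLs in anchor tags."""
    escaped = html.escape(text, quote=True)  # plays the role of HtmlEncode()
    return URL_PATTERN.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', escaped
    )


print(render_comment(
    "See this site http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3"
))
```

A real linkifier would also need to deal with trailing punctuation and with URLs wrapped in quotes, which this sketch ignores.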

Perhaps you are looking for UrlEncode()?
http://msdn.microsoft.com/en-us/library/zttxte6w.aspx

What are you looking to replace and why? HtmlEncode() is typically used to sanitize user-supplied data. That said, if you're allowing users to submit links, you probably don't want to HtmlEncode them, in the first place. You're basically going to render them exactly as the user supplied them.

Replacing & with &amp; inside of an href attribute is correct. If you do not, then your code is technically invalid. Also, you should escape it even if it's inside of a link. The only case where you'll run into problems is if you end up HTML-encoding it multiple times.
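The multiple-encoding problem is easy to reproduce. A small sketch, again using Python's html.escape as a stand-in for HtmlEncode() (the sample URL is made up):

```python
import html

raw = "somepage.aspx?qs1=1&qs2=2"

once = html.escape(raw)    # correct: each & becomes &amp;
twice = html.escape(once)  # bug: the & of &amp; gets encoded again

print(once)   # somepage.aspx?qs1=1&amp;qs2=2
print(twice)  # somepage.aspx?qs1=1&amp;amp;qs2=2
```

Decoding the once-encoded value gives back the original string; decoding the twice-encoded value leaves a literal &amp; in the URL, which is where broken links come from.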

I recommend against using Replace to do the job of HTMLEncode or URLEncode. These functions are custom designed to take care of most of the problems that you'd see in user entered content and if you try to replace them with your own code, the results might get ugly (I am talking from experience here) if you forgot something vital.

How do I search with the "&" keyword, e.g I want food & drink

In bootstrap-table jquery (http://issues.wenzhixin.net.cn/bootstrap-table/)
How do I search with the "&" keyword, e.g. I want food & drink?
However, when I read the source code, it has a .replace of /&/ with &amp;. Any idea how I can bypass this? It is impossible to ask the user to key in &amp; in the search text box.
So... I'm still a little unclear, as there is no bug in the core code.
See this fiddle, which shows that "&" search works fine:
http://jsfiddle.net/dabros/x8efv6wf/1/
If this somehow doesn't answer your needs, elaborate on exactly why.
Then look at some extensions like multiple search, filter and filter-control:
http://bootstrap-table.wenzhixin.net.cn/extensions/
If still not happy, then create a custom search function. Easy to do if using server-side pagination, but possible even if client-side.
If you are fixed on using client-side (not server-side pagination + search, meaning that the JS does the search, not your own server code), then look at custom search.
It is a relatively new feature, I think; I have not used it myself, because if I wanted a custom search I would use server-side code.
But here it is:
Custom search function #1956
https://github.com/wenzhixin/bootstrap-table/issues/1956
How to search within row details? #2007
https://github.com/wenzhixin/bootstrap-table/issues/2007
https://github.com/wenzhixin/bootstrap-table/pull/1979
That pull request seems the most detailed. It appears a full example is still on the way, but it shouldn't be hard if you read through there.
Though frankly, stick with the plugins/extensions I listed above, or use server-side code if still not happy; it gives you far greater control with far simpler execution and code maintenance.

Using duplicate parameters in a URL

We are building an API in-house and often are passing a parameter with multiple values.
They use: mysite.com?id=1&id=2&id=3
Instead of: mysite.com?id=1,2,3
I favor the second approach but I was curious if it was actually incorrect to do the first?
I'm not an HTTP guru, but from what I understand there's not a definitive standard on the query part of the URL regarding multiple values, it's typically up to the CGI that handles the request to parse the query string.
RFC 1738 section 3.3 mentions a searchpart and that it should go after the ? but doesn't seem to elaborate on its format.
http://<host>:<port>/<path>?<searchpart>
I did not (bother to) check which RFC standard defines it. (Anyone who knows about this, please leave a reference in the comments.) But in practice, the mysite.com?id=1&id=2&id=3 form is already what a browser produces when a form contains duplicated fields, typically checkboxes. See it in action in this w3schools example page. So there is a good chance that whatever programming language you are using already provides some helper functions to parse an input like that, and probably returns a list.
You could, of course, go with your own approach such as mysite.com?id=1,2,3, which is not bad at all in this particular case. But you will need to implement your own logic to produce and to consume such format. Now you may or may not need to think about handling some corner cases by yourself, such as: what if the input is not well-formed, like mysite.com?id=1,2,? And do you need to invent yet another separator, if the comma sign itself can also be a valid input, like mysite.com?name=Doe,John|Doe,Jane? Would you reach to a point that you will use a json string as the value, like mysite.com?name=["John Doe", "Jane Doe"]? etc. etc.. Your mileage may vary.
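As one illustration of the "helper functions" point, Python's standard urllib.parse handles the repeated-key form out of the box, while the comma form needs hand-written splitting. A minimal sketch (the parse_comma_list helper is hypothetical, not part of any library):

```python
from urllib.parse import parse_qs, urlencode

# Repeated keys: the standard parser already returns a list per key.
repeated = parse_qs("id=1&id=2&id=3")
print(repeated)  # {'id': ['1', '2', '3']}

# Comma-separated: the parser sees a single value, so splitting is up to you.
joined = parse_qs("id=1,2,3")
print(joined)  # {'id': ['1,2,3']}


def parse_comma_list(value: str) -> list[str]:
    """Hypothetical helper: split on commas and drop empty items such as
    the trailing one in '1,2,'."""
    return [item for item in value.split(",") if item]


print(parse_comma_list("1,2,"))  # ['1', '2']

# Producing the repeated-key form is also built in.
print(urlencode({"id": [1, 2, 3]}, doseq=True))  # id=1&id=2&id=3
```

How a trailing comma or an embedded comma should be treated is exactly the kind of decision the custom format forces on you; dropping empty items here is just one possible choice.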
Worth adding that inconsistent handling of duplicate parameters in the URL on the server may lead to vulnerabilities, specifically server-side HTTP parameter pollution. For a practical example, see "Client side Http Parameter Pollution - Yahoo! Classic Mail Video Poc".
In your first approach you will get an array of query string values, but in the second approach you will get a single string of query string values.
I guess it depends on the technology you use and which option it makes more convenient. I am currently facing the same question: currency=USD,CHF or currency=USD&currency=CHF?
I am using Thymeleaf, and the second option makes it easy to work with: I can then request something like ${param.currency.contains(currency.value)}. When I try the first option, it seems to treat the "array" as a string, so I need to split first and then call contains, which leads to messier code.
Just my 50 cents :-)

How do you test a function that just retrieves a template output?

I have a template class that grabs HTML and basically returns html to the caller. How do I test the caller using PHP Unit? Do I just assertTrue(is_string(call_function))? It seems like a stupid test, and I thought I may be testing it improperly.
Is the returned HTML supposed to be well-formed? If so you could validate it.
And/or if there is always supposed to be a certain node, or string of text, present you could check for its existence. Using strpos, regexes, or a proper DOM parser.
This StackOverflow question gives you some ideas for ways to parse and query your HTML: How do you parse and process HTML/XML in PHP?
More generally, the way I usually approach how to test a function that returns a string is to use:
$html=call_function();
$this->assertEquals("dummy",$html);
Then it fails, but tells me the correct output, so I paste that in:
$html=call_function();
$expected=<<<EOD
<html>
...
</html>
EOD;
$this->assertEquals($expected,$html);
If it fails again I then study the differences between the two correct answers I have. If this is a good unit test should they really even be different? Do I want to use a mock object to replace some uncontrollable aspect of the system? (E.g. if the HTML it is returning is google search results, then maybe I want a mock object to simulate calling google, but always return exactly the same search results page.)
If the only differences are timestamps I might use regexes to hunt-and-destroy them, to give me a string that should always be the same, e.g.
$html=preg_replace('/\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/','[TIMESTAMP]',$html);
ADDITION
If the HTML string is very big, one alternative is to use md5() to reduce it to a short string. This will still warn you when something breaks, but the (big) downside is when it breaks you won't know where. If you are concerned about that then it is better to use the DOM approach (or its poor cousin, regexes) to just cherry-pick a few key parts of the HTML to test.
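The same two ideas (neutralise the timestamps, then cherry-pick a few key parts instead of comparing the whole string) look like this in Python's standard library. This is only a sketch of the technique, not PHPUnit code; render_template is a made-up stand-in for the function under test.

```python
import re
from html.parser import HTMLParser

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(html_text: str) -> str:
    """Replace volatile timestamps so the output is stable between runs."""
    return TIMESTAMP.sub("[TIMESTAMP]", html_text)


class TagCollector(HTMLParser):
    """Records every start tag so a test can assert on key elements."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_in(html_text: str) -> list[str]:
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


def render_template() -> str:
    # Hypothetical stand-in for the real template call.
    return "<html><body><h1>Report</h1><p>Generated 2024-01-02 03:04:05</p></body></html>"


output = normalize(render_template())
assert "[TIMESTAMP]" in output
assert "h1" in tags_in(output)
```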

How should I sanitize urls so people don't put 漢字 or á or other things in them?

How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using java. The url will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is slugify. There's no fixed mechanism for doing it; every framework handles it in their own way.
Yes, I would sanitize/remove. Otherwise it will either be inconsistent or look ugly when encoded.
If you are using Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the characters allowed in URLs to a subset of the US-ASCII character set.
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL all together :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
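For what it's worth, here is one way the slugify step could look, sketched in Python rather than Java. The function names and the /questions/ path layout are assumptions for illustration, not how StackOverflow actually implements it. It turns á into a, drops characters like 漢字 that have no ASCII equivalent, and puts the ID in front so colliding slugs don't matter.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Reduce a title to lowercase ASCII words joined by hyphens."""
    # NFKD splits 'á' into 'a' plus a combining accent; the ASCII encode
    # step then drops the accent and anything else outside ASCII.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


def question_path(question_id: int, title: str) -> str:
    """Hypothetical URL layout: the ID makes the link unique, the slug is
    only there for readability."""
    return f"/questions/{question_id}/{slugify(title)}"


print(slugify("Café á la carte"))          # cafe-a-la-carte
print(question_path(1, "résumé"))          # /questions/1/resume
print(question_path(2, "resume"))          # /questions/2/resume
```

The last two lines show the collision mentioned above: two different titles give the same slug, and only the ID keeps the URLs apart. A title made up entirely of non-ASCII characters slugifies to an empty string, which is another reason not to rely on the slug alone.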
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

Token replacement

I currently implement a replace function in the page render method which replaces commonly used strings - such as replace [cfe] with the root to the customer front end. This is because the value may be different based on the version of the site - for example the root to the image folder ([imagepath]) is /Images on development and live, but /Test/Images on test.
I have a catalogue of products for which I would like to change [productName] to a link to the catalogue page for that product. I would like to go through the entire page and replace all instances of [someValue] with the relevant link. Currently I do this by looping through all the products in the product database and replacing [productName] with the link to the catalogue page for that product. However, this is limited to products which exist in the database. "Links" to products which have been removed currently won't be replaced, so [someValue] will be displayed to the user. This does not look good.
So you should be able to see my problem from this. Does anyone know of a way to achieve what I would like to easily? I could use regexes, but I don't have much experience of those. If this is the easiest way, using "For Each Match As String In Regex.Matches(blah, blah)" then I am willing to look further into this.
However, at some point I would like to take this further, for example setting page layouts such as 3 columns with an image top right using [layout type="3colImageTopRight" imageURL="imageURL"]Content here[/layout]. I think I could kind of do this now, but I can't figure out how to deal with it if the imageURL were, say, [Image:Product01.gif]. Using regex.match("[[a-zA-Z]{0,}]"), I think it would match just [layout type="3colImageTopRight" imageURL="[Image:Product01.gif] (it would not get to the end of the layout tag). Obviously the above wouldn't quite work, as I haven't included double quotes in the match string or anything, but you should get the general idea of what I am trying to do.
Does anyone have any ideas or pointers which could help me with this? Also if this is not strictly token replacement then please point me to what it is, so I can further develop this.
Aristos - hope reexplaining this resolves the confusion.
Thanks in advance,
Regards,
Richard Clarke
@RichardClarke - I would go with regular expressions. They're not as terrible to learn as you might think, and with a bit of careful usage they will solve your problems.
I've always found this a very useful tool.
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
goes nicely with a cheat sheet ;-)
http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
Good luck.
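To give an idea of the regex approach for the simple [productName] case, here is a minimal sketch in Python (the same pattern-plus-callback idea is available in .NET's Regex.Replace). The function name, token pattern and link table are all made up for illustration. Every bracketed token is matched in one pass, and tokens with no matching product fall back to plain text instead of showing the raw brackets.

```python
import re

# Assumed token syntax: letters, digits and underscores inside square brackets.
TOKEN_PATTERN = re.compile(r"\[([A-Za-z0-9_]+)\]")


def replace_tokens(page: str, links: dict[str, str]) -> str:
    """Replace each [token] with a link, or with its bare name if unknown."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        url = links.get(name)
        if url is None:
            # Product no longer in the database: show the name without brackets.
            return name
        return f'<a href="{url}">{name}</a>'

    return TOKEN_PATTERN.sub(substitute, page)


catalogue = {"Widget": "/catalogue/widget"}  # hypothetical lookup table
print(replace_tokens("Buy [Widget] or [Gone].", catalogue))
# Buy <a href="/catalogue/widget">Widget</a> or Gone.
```

This does not cover the nested [layout ...]...[/layout] case; a single regex struggles with nesting, so that part would likely need a small parser or a multi-pass replacement.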
