Replace multiple occurrences of same string in href using Regular Expressions - asp.net

Our CMS is (I suppose correctly) encoding comma characters in URLs. So instead of "?values=1,2,3" the CMS renders "?values=1%2c2%2c3". This in itself is not a problem; however, the external system that these links point at cannot handle the encoded commas and only works if we pass actual commas in the query string.
We already have a Regex clean-up tool that processes the HTML pre-render and cleans out non XHTML compliant mark-up. This is an old CMS running on ASP.Net v2.
My question is what regular expression would be required to swap out all occurrences of "%2c" for a comma, but only where this text exists within an anchor tag. I've been easily able to swap out all instances of %2c but this runs the risk of corrupting the page elsewhere if that string happened to be used for a non-URL purpose.
I'm using .Net and System.Text.RegularExpressions. We have an XML file that contains all of the Find and Replace rules. This gets loaded at runtime and cleans the HTML. Each rule consists of:
Text to find - e.g. "<script>"
Text to replace - e.g. "<script type='text/javascript'>"
We then have some C# that loops over each of the rules and does the following:
// HTML = full page HTML
Regex regex = new Regex(searchTxt, RegexOptions.IgnoreCase);
HTML = regex.Replace(HTML, replaceTxt);
Simple as that. I just can't get the right regex syntax for our specific scenario.
Many thanks for your help.
Here is a complete C# console app that hopefully explains my scenario
class Program
{
static void Main(string[] args)
{
string html = GetPageHTML();
string regexString = "(<a href=).*|(%2c)";
string replaceTxt = ",";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Multiline;
Regex regex = new Regex(regexString, options);
// We are currently using a simple regex.Replace
string cleanHTML = regex.Replace(html, replaceTxt);
// But for this example should we be doing something with the Matches collection?
foreach (Match match in regex.Matches(html))
{
if (match.Success)
{
// do something?
}
}
}
private static string GetPageHTML()
{
return @"<html>
<head></head>
<body>
<a title='' href='http://www.testsite.com/?x=491191%2cy=291740%2czoom=6%2cbase=demo%2clayers=%2csearch=text:WE9%203QA%2cfade=false%2cmX=0%2cmY=0' target='_blank'>A link</a>
<p>We wouldn't want this (%2c) to be replaced</p>
</body>
</html>";
}
}

If .NET supported PCRE's backtracking control verbs, you could do something like this:
^(?!<a href=").*(*SKIP)(*FAIL)|(%2c)
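That is what you want: the regex above matches %2c only on lines that start with an anchor tag. .NET does not support these verbs, but you can achieve the same result with the regex discard technique plus some logic.
With the regex below (and RegexOptions.Multiline, so that ^ matches at the start of each line), lines that do not start with an anchor tag are matched whole and can be discarded, while each %2c on an anchor line is captured in group 1: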
^(?!<a href=").*|(%2c)
So you can add logic that checks whether the capturing group content is equal to %2c. If it is, the match is a %2c from an anchor tag and you can replace it with a comma; otherwise, leave the match unchanged.
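For .NET specifically, here is a minimal sketch of two ways this could be wired up. It swaps in a different technique from the discard pattern above: .NET's support for unbounded lookbehind, plus a MatchEvaluator callback. The class and method names (AnchorCommaFixer, FixWithLookbehind, FixHrefOnly) are hypothetical, and matching HTML with regular expressions is a heuristic that unusual markup (for example a ">" inside an attribute value) can defeat.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCommaFixer
{
    // Option 1: a pattern that could be dropped into a find/replace rule as-is.
    // .NET allows unbounded lookbehind, so "%2c" only matches when an opening
    // "<a " precedes it with no ">" in between, i.e. inside the anchor's start tag.
    public const string RulePattern = @"(?<=<a\s[^>]*)%2c";

    public static string FixWithLookbehind(string html)
    {
        return Regex.Replace(html, RulePattern, ",", RegexOptions.IgnoreCase);
    }

    // Option 2: capture only the quoted href value of each anchor and rewrite
    // that value in a callback, leaving other attributes untouched.
    private static readonly Regex HrefInAnchor = new Regex(
        @"(?<prefix><a\b[^>]*?\bhref\s*=\s*)(?<quote>[""'])(?<url>.*?)\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string FixHrefOnly(string html)
    {
        return HrefInAnchor.Replace(html, delegate(Match m)
        {
            string quote = m.Groups["quote"].Value;
            string url = Regex.Replace(m.Groups["url"].Value, "%2c", ",", RegexOptions.IgnoreCase);
            return m.Groups["prefix"].Value + quote + url + quote;
        });
    }
}
```

Option 1 fits an existing rules file because it needs no code change, but it also rewrites %2c in other attributes of the anchor (such as title). Option 2 is narrower but requires the callback overload of Regex.Replace.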

Related

How to replace url if it is not anchored yet?

I need to replace all urls in the text IF they are not put into '...' HTML tags yet.
The unconditional way to replace is described here: Recognize URL in plain text.
Here is my implementation of it:
private static readonly Regex UrlMatcherRegex = new Regex(@"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])", RegexOptions.Compiled | RegexOptions.IgnoreCase);
public static string GetProcessedMessage(this INews news)
{
string res = UrlMatcherRegex.Replace(news.Mess, ReplaceHrefByAnchor);
return res;
}
private static string ReplaceHrefByAnchor(Match match)
{
string href = match.Groups[0].Value;
return string.Format("{0}", href);
}
But how can I ignore those URLs which are already formatted properly?
Please advise.
P.S. I'm using ASP.NET 4.5
P.P.S. I could imagine that one of the solutions could be enhance regex to check for "
From my point of view there are 2 solutions:
Use a library to parse your HTML document (if it is well-formed markup). For example, you can use XDocument.Parse. After parsing the document you can easily tell whether a URL sits inside a normal HTML "a" element or is just plain text.
You can assume that a link that is already formatted properly has an "href=" prefix before the URL. So your regex can search for all links that do not have "href=" before them. This can be done either in C# or with a regex negative lookbehind. You can see an example here: Regular expression to match string not containing a word?
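The second suggestion could be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's code: the URL pattern is deliberately simplified (http/https only), the output anchor format is assumed, and UrlLinker / Linkify are hypothetical names. The first lookbehind skips URLs that directly follow "href="; the second skips URLs preceded by a quote or ">", which also leaves alone link text that is itself a URL, at the cost of skipping bare URLs that directly follow any tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLinker
{
    // Simplified URL pattern (an assumption for illustration only).
    // (?<!href\s*=\s*["']?)  -> not an href attribute value
    // (?<!["'>])             -> not directly after a quote or a closing ">"
    private static readonly Regex BareUrl = new Regex(
        @"(?<!href\s*=\s*[""']?)(?<![""'>])\bhttps?://[^\s<>""']+",
        RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        return BareUrl.Replace(text, delegate(Match m)
        {
            // Assumed output format: wrap the bare URL in a plain anchor.
            return string.Format("<a href=\"{0}\">{0}</a>", m.Value);
        });
    }
}
```

For anything beyond simple fragments, parsing the markup (the first suggestion) is likely to be more robust than lookbehind heuristics.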

Character + is converted to %2B in HTTP Post

I'm adding functionality to a GM script we use here at work, but when trying to post (cross site may I add) to another page, my posting value of CMD is different than what it is on the page.
It's supposed to be Access+My+Account+Info but the value that is posted becomes Access%2BMy%2BAccount%2BInfo.
So I guess my question is: What's escaping my value and how do I make it not escape? And if there's no way to unescape it, does anyone have any ideas of a workaround?
Thanks!
%2B is the code for a +. You (or whatever framework you're using) should already be decoding the POST data server-side...
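As a small illustration of that round trip, sketched in C# with the framework's System.Uri helpers (the wrapper class PlusEncodingDemo is hypothetical): in form-encoded data a literal "+" stands for a space, which is why a sender escapes a real plus sign as %2B, and the receiving side is expected to decode it again.

```csharp
using System;

public static class PlusEncodingDemo
{
    // Percent-encodes reserved characters, including '+', for use in a URL or form value.
    public static string Encode(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Reverses percent-encoding; "%2B" becomes '+' again.
    public static string Decode(string value)
    {
        return Uri.UnescapeDataString(value);
    }
}
```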
Just a quick remark: If you want to decode a path segment, you can use UriUtils (spring framework):
@Test
public void decodeUriPathSegment() {
String pathSegment = "some_text%2B"; // encoded path segment
String decodedText = UriUtils.decode(pathSegment, "UTF-8");
System.out.println(decodedText);
assertEquals("some_text+", decodedText);
}
Uri path segments are different from HTML escape chars (see list). Here is an example:
@Test
public void decodeHTMLEscape() {
String someString = "some_text&#43;"; // HTML-escaped '+'
String stringJsoup = org.jsoup.parser.Parser.unescapeEntities(someString, false);
String stringApacheCommons = StringEscapeUtils.unescapeHtml4(someString);
String stringSpring = htmlUnescape(someString);
assertEquals("some_text+", stringJsoup);
assertEquals("some_text+", stringApacheCommons);
assertEquals("some_text+", stringSpring);
}
/data/v50.0/query?q=SELECT Id from Case
This worked for me. Give space instead of '+'

How can I convert arbitrary strings to CLS-Compliant names?

Does anyone know of an algorithm (or external library) that I could call to convert an arbitrary string (i.e. outside my control) to be CLS compliant?
I am generating a dynamic RDLC (Client Report Definition) for an ASP.Net Report Viewer control and some of the field names need to be based on strings entered by the user.
Unfortunately I have little control over the entry of the field names by the client (through a 3rd party CMS). But I am quite flexible around substitutions required to create the compliant string.
I have a reactive hack algorithm for now along the lines of:
public static string FormatForDynamicRdlc(this string s)
{
//We need to change this string to be CLS compliant.
return s.Replace(Environment.NewLine, string.Empty)
.Replace("\t", string.Empty)
.Replace(",", string.Empty)
.Replace("-", "_")
.Replace(" ", "_");
}
But I would love something more comprehensive. Any ideas?
NOTE: If it is of any help, the algorithm I am using to create the dynamic RDLC is based on the BuildRDLC method found here: http://csharpshooter.blogspot.com/2007/08/revised-dynamic-rdlc-generation.html
Here's the algorithm I use to create C/C++ identifiers from arbitrary strings (translated to C#):
static void Main(string[] args)
{
string input = "9\ttotally no # # way!!!!";
string safe = string.Concat("_", Regex.Replace(input, "[^A-Za-z0-9_]+", "_")); // keep upper-case letters too
Console.WriteLine(safe);
}
The leading underscore is unnecessary if the first character of the regex result is not numeric.
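A minimal sketch of that refinement, adding the prefix only when it is needed. IdentifierSanitizer and ToIdentifier are hypothetical names, and the character class assumes ASCII-only identifiers are acceptable. One caveat worth checking for your case: as far as I know, the C# compiler reports public identifiers that begin with an underscore as not CLS-compliant (warning CS3008), so a letter prefix may be a safer choice than "_" if strict CLS compliance matters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdentifierSanitizer
{
    public static string ToIdentifier(string input)
    {
        // Collapse every run of characters outside [A-Za-z0-9_] into one underscore.
        string safe = Regex.Replace(input ?? string.Empty, "[^A-Za-z0-9_]+", "_");

        // Identifiers cannot be empty or start with a digit, so prefix only then.
        if (safe.Length == 0 || char.IsDigit(safe[0]))
        {
            safe = "_" + safe;
        }
        return safe;
    }
}
```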
Here is a regex which I found could be useful for splitting CLS-compliant string part and non-CLS-compliant string part. Below implementation in C#:
string strRegex = @"^(?<nonCLSCompliantPart>[^A-Za-z]*)(?<CLSCompliantPart>.*)";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
string strTargetString = @" _aaaaaa[5]RoundingHeader";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}

How to validate email address inputs?

I have an ASP.NET web form where I can enter an email address.
I need to validate that field with acceptable email addresses ONLY in the below pattern:
xxx@home.co.uk
xxx@home.com
xxx@homegroup.com
A regular expression to validate this would be:
^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$
C# sample:
string emailAddress = "jim@home.com";
string pattern = @"^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$";
if (Regex.IsMatch(emailAddress, pattern, RegexOptions.IgnoreCase))
{
// email address is valid
}
VB sample:
Dim emailAddress As String = "jim@home.com"
Dim pattern As String = "^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$"
If Regex.IsMatch(emailAddress, pattern, RegexOptions.IgnoreCase) Then
' email address is valid
End If
Here's how I would do the validation using System.Net.Mail.MailAddress:
bool valid = true;
try
{
MailAddress address = new MailAddress(email);
}
catch(FormatException)
{
valid = false;
}
if(!(email.EndsWith("@home.co.uk") ||
email.EndsWith("@home.com") ||
email.EndsWith("@homegroup.com")))
{
valid = false;
}
return valid;
MailAddress first validates that it is a valid email address. Then the rest validates that it ends with the destinations you require. To me, this is simpler for everyone to understand than some clumsy-looking regex. It may not be as performant as a regex would be, but it doesn't sound like you're validating a bunch of them in a loop, just one at a time on a web page.
Depending on what version of ASP.NET you are using, you can use one of the Form Validation controls in your toolbox under 'Validation.' This is probably preferable to setting up your own logic after a postback. There are several types that you can drag to your form and associate with controls, and you can customize the error messages and positioning as well.
There are several types that can make it a required field or make sure it's within a certain range, but you probably want the Regular Expression validator. You can use one of the expressions already shown, or I think Visual Studio might supply a sample email address one.
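The same idea as a self-contained method, offered as a sketch rather than a drop-in replacement. AllowedDomainValidator and IsAllowed are hypothetical names; it compares the parsed Host property instead of calling EndsWith on the raw string, and ignores case, which are my changes to the snippet above.

```csharp
using System;
using System.Net.Mail;

public static class AllowedDomainValidator
{
    // The three domains named in the question.
    private static readonly string[] AllowedHosts = { "home.co.uk", "home.com", "homegroup.com" };

    public static bool IsAllowed(string email)
    {
        if (string.IsNullOrEmpty(email))
        {
            return false;
        }

        MailAddress address;
        try
        {
            // Let the framework decide whether the text parses as an address.
            address = new MailAddress(email);
        }
        catch (FormatException)
        {
            return false;
        }
        catch (ArgumentException)
        {
            return false;
        }

        // Compare the parsed host part against the allowed domains, ignoring case.
        foreach (string host in AllowedHosts)
        {
            if (string.Equals(address.Host, host, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```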
You could use a regular expression.
See e.g. here:
http://tim.oreilly.com/pub/a/oreilly/windows/news/csharp_0101.html
Here is a commonly cited regex based on RFC 2822, which matches most valid email addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
I second the use of a regex, however Patrick's regex won't work (wrong alternation). Try:
[A-Z0-9._%+-]+@home(\.co\.uk|(group)?\.com)
And don't forget to escape backslashes in a string that you use in source code, depending on the language used.
"[A-Z0-9._%+-]+@home(\\.co\\.uk|(group)?\\.com)"
Try this:
Regex matcher = new Regex(@"([a-zA-Z0-9_\-\.]+)\@((home\.co\.uk)|(home\.com)|(homegroup\.com))");
if(matcher.IsMatch(theEmailAddressToCheck))
{
//Allow it
}
else
{
//Don't allow it
}
You'll need to add the Regex namespace to your class too:
using System.Text.RegularExpressions;
Use an <asp:RegularExpressionValidator ../> with the regular expression in the ValidationExpression property.
An extension method to do this would be:
public static bool ValidEmail(this string email)
{
var emailregex = new Regex(@"[A-Za-z0-9._%-]+(@home\.co\.uk$)|(@home\.com$)|(@homegroup\.com$)");
var match = emailregex.Match(email);
return match.Success;
}
Patrick's answer seems pretty well worked out but has a few flaws.
You do want to group parts of the regex but don't want to capture them. Therefore you'll need to use non-capturing parentheses.
The alternation is partly wrong.
It does not test if this was part of the string or the entire string
It uses Regex.Match instead of Regex.IsMatch.
A better solution in C# would be:
string emailAddress = "someone@home.co.uk";
if (Regex.IsMatch(emailAddress, @"^[A-Z0-9._%+-]+@home(?:\.co\.uk|(?:group)?\.com)$", RegexOptions.IgnoreCase))
{
// email address is valid
}
Of course to be completely sure that all email addresses pass you can use a more thorough expression:
string emailAddress = "someone@home.co.uk";
if (Regex.IsMatch(emailAddress, @"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@home(?:\.co\.uk|(?:group)?\.com)$", RegexOptions.IgnoreCase))
{
// email address is valid
}
{
// email address is valid
}
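To make the alternation flaw listed above concrete, here is a small comparison sketch. The class name AlternationDemo is hypothetical; the two patterns are the unanchored one from the extension-method answer and the anchored, grouped one from this answer. Because "|" has the lowest precedence, the unanchored pattern reads as three separate alternatives, and the second and third do not require any valid local part at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Parsed as: (local part + "@home.co.uk" at end) OR ("@home.com" at end) OR ("@homegroup.com" at end)
    private const string Flawed = @"[A-Za-z0-9._%-]+(@home\.co\.uk$)|(@home\.com$)|(@homegroup\.com$)";

    // Anchored at both ends, with the domain alternatives grouped after the "@".
    private const string Fixed = @"^[A-Z0-9._%+-]+@home(?:\.co\.uk|(?:group)?\.com)$";

    public static bool FlawedAccepts(string input)
    {
        return Regex.IsMatch(input, Flawed);
    }

    public static bool FixedAccepts(string input)
    {
        return Regex.IsMatch(input, Fixed, RegexOptions.IgnoreCase);
    }
}
```

With these, an input such as "<script>@home.com" is accepted by the flawed pattern (its second alternative only needs the suffix) and rejected by the fixed one.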

Allowing dangerous query strings

I need to be able to allow query strings that contain characters like '<' and '>'. However, putting something like id=mi<ke into the URL will output an error page saying:
A potentially dangerous Request.QueryString value was detected from the client (id="mi<ke").
If I first url encode the url (to create id=mi%3Cke) I still get the same error. I can get around this by putting ValidateRequest="false" into the Page directive, but I'd prefer not to do that if at all possible.
So is there any way to allow these characters in query strings without turning off ValidateRequest?
EDIT: I want to allow users to be able to type the urls in by hand as well, so encoding them in some way might not work.
I ran into a problem similar to this. I chose to base64 encode the query string to work around it.
using
System.Text.ASCIIEncoding.ASCII.GetBytes
to get the string as bytes
and then
System.Convert.ToBase64String
to turn it into a "safe" string.
To get it back, use:
System.Convert.FromBase64String
and then:
System.Text.ASCIIEncoding.ASCII.GetString
to reverse the polarity of the flow.
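Put together, the steps above might look like the sketch below. QueryValueCodec is a hypothetical helper name. It swaps in UTF-8 for the ASCII encoding mentioned above so that non-ASCII characters survive the round trip, and it additionally percent-encodes the Base64 text, since Base64 output can contain '+', '/' and '=' which have their own meaning in URLs.

```csharp
using System;
using System.Text;

public static class QueryValueCodec
{
    // Turns an arbitrary string into a value that is safe to place in a query string.
    public static string Encode(string value)
    {
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
        return Uri.EscapeDataString(base64);
    }

    // Reverses Encode: undo the percent-encoding, then the Base64 step.
    public static string Decode(string queryValue)
    {
        string base64 = Uri.UnescapeDataString(queryValue);
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

Note that Base64 is an encoding, not a sanitizer: the decoded value still needs the same care as any other user input before it is written back into a page.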
A little googling and I don't think so.
The exception seems to happen before your code even runs so you can't trap the exception.
I like the encoding as base64 or something idea.
Instead of URL encode, you could encrypt your id value to get around the issue. You will probably then need to URL encode the encrypted string.
I think you have some options. You could do as you indicate and turn off ValidateRequest. You would then need to take care of any input sanitization on your own. Or you could allow only certain characters and either have the user use a meta language to input them, i.e., instead of '<' use '[' and replace '>' with ']' or re-encoding these before submission yourself to the meta language (or Base64). Doing the re-encoding yourself would require Javascript be available for queries that used forbidden characters. You may still need to do input sanitization.
Quick stab at a jquery implementation:
$(document).ready( function() {
$('form').bind('submit', function() {
$('form > input[type=text]').each( function(i) {
if (this.value) {
this.value = encode(this.value);
}
});
});
});
function encode(value) {
return ...suitable encoding...
}
I was working on the same problem; however, I stumbled upon this JavaScript encoding method:
<script type="text/javascript">
var unencodedText = "This is my text that contains whitespaces and characters like and Ø";
var encodedText = "";
var decodedText = "";
alert('unencodedText: ' + unencodedText);
//To encode whitespaces and the 'Ø' character - use encodeURI
encodedText = encodeURI(unencodedText);
//We see that whitespaces and 'Ø' are encoded, but the '' is still there:
alert('encodedText: ' + encodedText);
//If we decode it we should get our unencodedText back
decodedText = decodeURI(encodedText);
alert('decodedText: ' + decodedText);
//To also encode the '' we use the encodeURIComponent
encodedText = encodeURIComponent(unencodedText);
//Now all the characters have been encoded:
alert('encodedText: ' + encodedText);
//To get our unencodedText back we now need to use the decodeURIComponent
decodedText = decodeURIComponent(encodedText);
alert('decodedText: ' + decodedText);
</script>
If you're dealing with more complicated symbols then you might want to use the encodeURIComponent for the url.
I took this gem from this link.

Resources