How to replace url if it is not anchored yet? - asp.net

I need to replace all URLs in the text, but only if they are not already wrapped in HTML anchor tags.
The unconditional way to replace is described here: Recognize URL in plain text.
Here is my implementation of it:
private static readonly Regex UrlMatcherRegex = new Regex(@"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])", RegexOptions.Compiled | RegexOptions.IgnoreCase);
public static string GetProcessedMessage(this INews news)
{
string res = UrlMatcherRegex.Replace(news.Mess, ReplaceHrefByAnchor);
return res;
}
private static string ReplaceHrefByAnchor(Match match)
{
string href = match.Groups[0].Value;
return string.Format("{0}", href);
}
But how can I ignore those URLs which are already formatted properly?
Please advise.
P.S. I'm using ASP.NET 4.5
P.P.S. I could imagine that one of the solutions could be enhance regex to check for "

From my point of view there are 2 solutions:
1. Use a dedicated library to parse your HTML document (if it is a proper, well-formed HTML document). For example, you can use XDocument.Parse. After parsing the document, you can easily tell whether a URL already sits inside a normal HTML "a" tag or is just plain text.
2. You can assume that if a link is already formatted properly, it will have an "href=" prefix before the URL. So your regex can search for all links that do not have "href=" before them. This can be done either in C# code or via the regex negative look-around functionality. You can see an example here: Regular expression to match string not containing a word?
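The negative look-around idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the URL pattern is deliberately simplified (it is not the full regex from the question), and the names UrlLinker, BareUrl and Linkify are hypothetical. It relies on .NET allowing variable-length lookbehind, and it also skips URLs that are already used as the visible text of an anchor.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLinker
{
    // Simplified URL pattern (an assumption for this sketch).
    // (?<!href\s*=\s*["']?)  rejects URLs that directly follow href= (optionally quoted)
    // (?![^<]*</a>)          rejects URLs followed by a closing </a> with no other tag in between
    private static readonly Regex BareUrl = new Regex(
        @"(?<!href\s*=\s*[""']?)\bhttps?://[^\s<>""']+(?![^<]*</a>)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        // Wrap every remaining (bare) URL in an anchor tag.
        return BareUrl.Replace(text, m => string.Format("<a href=\"{0}\">{0}</a>", m.Value));
    }

    public static void Main()
    {
        string input = "Visit http://plain.example.com or " +
                       "<a href=\"http://linked.example.com\">http://linked.example.com</a>.";
        // Only the first URL gets wrapped; the existing anchor is left untouched.
        Console.WriteLine(Linkify(input));
    }
}
```

A regex like this stays a heuristic: nested markup inside an anchor or unusual attribute spacing can still defeat it, which is why the parser-based option may be the safer choice for arbitrary HTML.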

Related

Replace multiple occurrences of same string in href using Regular Expressions

Our CMS is (I suppose correctly) encoding comma characters in URLs. So instead of being "?values=1,2,3" the CMS is rendering "?values=1%2c2%2c3". This in itself is not a problem however the external system that these links are pointing at cannot handle the encoded commas and only works if we pass actual commas in the query string.
We already have a Regex clean-up tool that processes the HTML pre-render and cleans out non XHTML compliant mark-up. This is an old CMS running on ASP.Net v2.
My question is what regular expression would be required to swap out all occurrences of "%2c" for a comma, but only where this text exists within an anchor tag. I've been easily able to swap out all instances of %2c but this runs the risk of corrupting the page elsewhere if that string happened to be used for a non-URL purpose.
I'm using .Net and System.Text.RegularExpressions. We have an XML file that contains all of the Find and Replace rules. This gets loaded at runtime and cleans the HTML. Each rule consists of:
Text to find - e.g. "<script>"
Text to replace - e.g. "<script type='text/javascript'>"
We then have some C# that loops over each of the rules and does the following:
// HTML = full page HTML
Regex regex = new Regex(searchTxt, RegexOptions.IgnoreCase);
HTML = regex.Replace(HTML, replaceTxt);
Simple as that. I just can't get the right regex syntax for our specific scenario.
Many thanks for your help.
Here is a complete C# console app that hopefully explains my scenario
class Program
{
static void Main(string[] args)
{
string html = GetPageHTML();
string regexString = "(<a href=).*|(%2c)";
string replaceTxt = ",";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Multiline;
Regex regex = new Regex(regexString, options);
// We are currently using a simple regex.Replace
string cleanHTML = regex.Replace(html, replaceTxt);
// But for this example should we be doing something with the Matches collection?
foreach (Match match in regex.Matches(html))
{
if (match.Success)
{
// do something?
}
}
}
private static string GetPageHTML()
{
return @"<html>
<head></head>
<body>
<a title='' href='http://www.testsite.com/?x=491191%2cy=291740%2czoom=6%2cbase=demo%2clayers=%2csearch=text:WE9%203QA%2cfade=false%2cmX=0%2cmY=0' target='_blank'>A link</a>
<p>We wouldn't want this (%2c) to be replaced</p>
</body>
</html>";
}
}
If .NET supported PCRE backtracking verbs, you could do something like this:
^(?!<a href=").*(*SKIP)(*FAIL)|(%2c)
That does what you want: it matches %2c only on lines that start with <a href=". .NET does not support these verbs, but you can achieve the same result with the regex "discard" technique plus some logic.
The regex below matches every line that does not start with <a href=" as a whole (these matches are to be discarded) and, on the remaining lines, captures each %2c in group 1:
^(?!<a href=").*|(%2c)
So you can add logic that checks whether the capturing group content is equal to %2c; if it is, the match is a %2c from the anchor tag, and you can replace it with a comma.
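A minimal sketch of that capture-group check, using a MatchEvaluator so that discarded matches are written back unchanged. The class and method names are hypothetical, and the sample input assumes each anchor starts its line with <a href=" exactly as the pattern requires.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaFixer
{
    // Alternative 1 swallows every line that does NOT start with <a href=" (discarded).
    // Alternative 2 captures %2c in group 1, which can only happen on the remaining lines.
    private static readonly Regex Pattern = new Regex(
        @"^(?!<a href="").*|(%2c)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static string DecodeCommasInAnchors(string html)
    {
        // Group 1 participated: this is a %2c on an anchor line, so emit a comma.
        // Otherwise the match is a discarded line and is returned as-is.
        return Pattern.Replace(html, m => m.Groups[1].Success ? "," : m.Value);
    }

    public static void Main()
    {
        string html = "<a href=\"http://www.testsite.com/?x=1%2cy=2%2czoom=6\">A link</a>\n" +
                      "<p>We wouldn't want this (%2c) to be replaced</p>";
        Console.WriteLine(DecodeCommasInAnchors(html));
    }
}
```

Note that the sample page in the question indents the anchor and puts a title attribute before href, so the lookahead would have to be adapted for that markup. Any %2c appearing later on the same line as the anchor is replaced as well.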

How to make url case sensitive asp.net application hosted on IIS 8

I have an issue with URL case sensitivity: we show results for http://www.starmicronics.com/Printer/Home.aspx (the actual page that exists) as well as for http://www.starmicronics.com/printer/home.aspx (a second page and folder listed with lower-case names that don't actually exist).
I want to convert the second URL to the first URL automatically. How can I do that? Any suggestion is highly appreciated.
Thanks
Dwarika
I am not sure what language you are using. But if you are doing this on the server side in C# you could use a regex:
using System;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    // Your test string
    string test = @"http://www.starmicronics.com/printer/home.aspx";
    // Upper-case every character that directly follows a single '/'
    var result = Regex.Replace(test, "(?<=[^/]/)[^/]", delegate(Match match)
    {
        string v = match.ToString();
        return char.ToUpper(v[0]) + v.Substring(1);
    });
    Console.WriteLine(result); // http://www.starmicronics.com/Printer/Home.aspx
}
Explanation of the regex (?<=[^/]/)[^/]:
It matches a character that is not a / and is preceded by a / that is itself not preceded by a /.
[^/] any character that is not a /
(?<=...) a positive lookbehind
This is a simple approach that would satisfy your example.
Please try ISAPI_Rewrite 3; it may help you. You will need to write a rewrite rule for it.
http://www.helicontech.com/isapi_rewrite/

Character + is converted to %2B in HTTP Post

I'm adding functionality to a GM script we use here at work, but when trying to post (cross site may I add) to another page, my posting value of CMD is different than what it is on the page.
It's supposed to be Access+My+Account+Info but the value that is posted becomes Access%2BMy%2BAccount%2BInfo.
So I guess my question is: What's escaping my value and how do I make it not escape? And if there's no way to unescape it, does anyone have any ideas of a workaround?
Thanks!
%2B is the code for a +. You (or whatever framework you're using) should already be decoding the POST data server-side...
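To illustrate why a literal + gets escaped, here is a small sketch in C# (not the Greasemonkey script from the question; the class name is hypothetical). It uses the standard Uri and WebUtility helpers: percent-encoding turns + into %2B, decoding turns it back, and a form decoder reads an unescaped + as a space.

```csharp
using System;
using System.Net;

public static class PlusEncodingDemo
{
    public static void Main()
    {
        string raw = "Access+My+Account+Info";

        // Percent-encoding turns each literal '+' into %2B.
        string encoded = Uri.EscapeDataString(raw);
        Console.WriteLine(encoded);                          // Access%2BMy%2BAccount%2BInfo

        // Decoding on the receiving side restores the original value.
        Console.WriteLine(Uri.UnescapeDataString(encoded));  // Access+My+Account+Info

        // A form decoder interprets an unescaped '+' as a space,
        // which is why the sender escapes it in the first place.
        Console.WriteLine(WebUtility.UrlDecode(raw));        // Access My Account Info
    }
}
```

So if the receiving page decodes its POST data, the escaped value should arrive as the intended Access+My+Account+Info.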
Just a quick remark: If you want to decode a path segment, you can use UriUtils (spring framework):
@Test
public void decodeUriPathSegment() {
String pathSegment = "some_text%2B"; // encoded path segment
String decodedText = UriUtils.decode(pathSegment, "UTF-8");
System.out.println(decodedText);
assertEquals("some_text+", decodedText);
}
Uri path segments are different from HTML escape chars (see list). Here is an example:
@Test
public void decodeHTMLEscape() {
String someString = "some_text&#43;"; // "+" written as an HTML numeric character reference
String stringJsoup = org.jsoup.parser.Parser.unescapeEntities(someString, false);
String stringApacheCommons = StringEscapeUtils.unescapeHtml4(someString);
String stringSpring = htmlUnescape(someString);
assertEquals("some_text+", stringJsoup);
assertEquals("some_text+", stringApacheCommons);
assertEquals("some_text+", stringSpring);
}
/data/v50.0/query?q=SELECT Id from Case
This worked for me: use a space instead of '+'.

How can I convert arbitrary strings to CLS-Compliant names?

Does anyone know of an algorithm (or external library) that I could call to convert an arbitrary string (i.e. outside my control) to be CLS compliant?
I am generating a dynamic RDLC (Client Report Definition) for an ASP.Net Report Viewer control and some of the field names need to be based on strings entered by the user.
Unfortunately I have little control over the entry of the field names by the client (through a 3rd party CMS). But I am quite flexible around substitutions required to create the compliant string.
I have a reactive hack algorithm for now along the lines of:
public static string FormatForDynamicRdlc(this string s)
{
//We need to change this string to be CLS compliant.
return s.Replace(Environment.NewLine, string.Empty)
.Replace("\t", string.Empty)
.Replace(",", string.Empty)
.Replace("-", "_")
.Replace(" ", "_");
}
But I would love something more comprehensive. Any ideas?
NOTE: If it is of any help, the algorithm I am using to create the dynamic RDLC is based on the BuildRDLC method found here: http://csharpshooter.blogspot.com/2007/08/revised-dynamic-rdlc-generation.html
Here's the algorithm I use to create C/C++ identifiers from arbitrary strings (translated to C#):
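static void Main(string[] args)
{
    string input = "9\ttotally no # # way!!!!";
    // Replace every run of characters that are not ASCII letters, digits or '_' with a single '_'.
    // The class includes A-Z so that upper-case letters are kept.
    string safe = string.Concat("_", Regex.Replace(input, "[^A-Za-z0-9_]+", "_"));
    Console.WriteLine(safe);
}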
The leading underscore is unnecessary if the first character of the regex result is not numeric.
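The remark about the leading underscore could be folded into the helper itself. The sketch below is one possible way to do it; the names IdentifierHelper and ToSafeIdentifier are hypothetical, and it targets C-style identifiers rather than guaranteeing CLS compliance.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdentifierHelper
{
    public static string ToSafeIdentifier(string input)
    {
        // Collapse every run of characters outside [A-Za-z0-9_] into one underscore.
        string body = Regex.Replace(input, "[^A-Za-z0-9_]+", "_");

        // Add the prefix only when the result is empty or starts with a digit.
        if (body.Length == 0 || char.IsDigit(body[0]))
        {
            return "_" + body;
        }
        return body;
    }

    public static void Main()
    {
        Console.WriteLine(ToSafeIdentifier("9\ttotally no # # way!!!!")); // _9_totally_no_way_
        Console.WriteLine(ToSafeIdentifier("Order Total"));               // Order_Total
    }
}
```

As far as I know, the C# compiler flags public identifiers that begin with an underscore as not CLS-compliant, so for strict CLS compliance a letter prefix may be a better choice than "_".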
Here is a regex that I found useful for splitting the CLS-compliant part of a string from the non-CLS-compliant part. Below is an implementation in C#:
string strRegex = @"^(?<nonCLSCompliantPart>[^A-Za-z]*)(?<CLSCompliantPart>.*)";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
string strTargetString = @" _aaaaaa[5]RoundingHeader";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}

How to validate email address inputs?

I have an ASP.NET web form where I can enter an email address.
I need to validate that field so that it accepts ONLY email addresses matching the patterns below:
xxx@home.co.uk
xxx@home.com
xxx@homegroup.com
A regular expression to validate this would be:
^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$
C# sample:
string emailAddress = "jim@home.com";
string pattern = @"^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$";
if (Regex.IsMatch(emailAddress, pattern, RegexOptions.IgnoreCase))
{
    // email address is valid
}
VB sample:
Dim emailAddress As String = "jim@home.com"
Dim pattern As String = "^[A-Z0-9._%+-]+((@home\.co\.uk)|(@home\.com)|(@homegroup\.com))$"
If Regex.IsMatch(emailAddress, pattern, RegexOptions.IgnoreCase) Then
    ' email address is valid
End If
Here's how I would do the validation using System.Net.Mail.MailAddress:
bool valid = true;
try
{
MailAddress address = new MailAddress(email);
}
catch(FormatException)
{
valid = false;
}
if(!(email.EndsWith("@home.co.uk") ||
email.EndsWith("@home.com") ||
email.EndsWith("@homegroup.com")))
{
valid = false;
}
return valid;
MailAddress first validates that it is a valid email address. Then the rest validates that it ends with the destinations you require. To me, this is simpler for everyone to understand than some clumsy-looking regex. It may not be as performant as a regex would be, but it doesn't sound like you're validating a bunch of them in a loop ... just one at a time on a web page
Depending on what version of ASP.NET you are using, you can use one of the Form Validation controls in your toolbox under 'Validation.' This is probably preferable to setting up your own logic after a postback. There are several types that you can drag to your form and associate with controls, and you can customize the error messages and positioning as well.
Some of them make a field required or make sure it's within a certain range, but you probably want the Regular Expression validator. You can use one of the expressions already shown, or I think Visual Studio might supply a sample email address one.
You could use a regular expression.
See e.g. here:
http://tim.oreilly.com/pub/a/oreilly/windows/news/csharp_0101.html
Here is a widely cited regex that implements the RFC 2822 address syntax and will match nearly any proper email address:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
I second the use of a regex, however Patrick's regex won't work (wrong alternation). Try:
[A-Z0-9._%+-]+@home(\.co\.uk|(group)?\.com)
And don't forget to escape backslashes in a string that you use in source code, depending on the language used.
"[A-Z0-9._%+-]+@home(\\.co\\.uk|(group)?\\.com)"
Try this:
Regex matcher = new Regex(@"([a-zA-Z0-9_\-\.]+)\@((home\.co\.uk)|(home\.com)|(homegroup\.com))");
if(matcher.IsMatch(theEmailAddressToCheck))
{
//Allow it
}
else
{
//Don't allow it
}
You'll need to add the Regex namespace to your class too:
using System.Text.RegularExpressions;
Use an <asp:RegularExpressionValidator ../> with the regular expression in the ValidationExpression property.
An extension method to do this would be:
public static bool ValidEmail(this string email)
{
var emailregex = new Regex(@"[A-Za-z0-9._%-]+(@home\.co\.uk$)|(@home\.com$)|(@homegroup\.com$)");
var match = emailregex.Match(email);
return match.Success;
}
Patrick's answer seems pretty well worked out but has a few flaws:
You do want to group parts of the regex but don't want to capture them, so you need non-capturing parentheses.
The alternation is partly wrong.
It does not test whether the pattern matched the entire string or only part of it.
It uses Regex.Match instead of Regex.IsMatch.
A better solution in C# would be:
string emailAddress = "someone@home.co.uk";
if (Regex.IsMatch(emailAddress, @"^[A-Z0-9._%+-]+@home(?:\.co\.uk|(?:group)?\.com)$", RegexOptions.IgnoreCase))
{
// email address is valid
}
Of course to be completely sure that all email addresses pass you can use a more thorough expression:
string emailAddress = "someone@home.co.uk";
if (Regex.IsMatch(emailAddress, @"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@home(?:\.co\.uk|(?:group)?\.com)$", RegexOptions.IgnoreCase))
{
// email address is valid
}
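To make the difference concrete, the sketch below compares the unanchored pattern with top-level alternation (as in the extension method above) against the anchored, non-capturing version. The class and member names are hypothetical and the test addresses are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailPatternComparison
{
    // Unanchored pattern with top-level alternation: the second and third
    // alternatives match "@home.com" / "@homegroup.com" without any local part.
    public const string Loose =
        @"[A-Za-z0-9._%-]+(@home\.co\.uk$)|(@home\.com$)|(@homegroup\.com$)";

    // Anchored pattern with non-capturing groups: the whole string must match.
    public const string Strict =
        @"^[A-Z0-9._%+-]+@home(?:\.co\.uk|(?:group)?\.com)$";

    public static bool IsLooseMatch(string email)
    {
        return Regex.IsMatch(email, Loose);
    }

    public static bool IsStrictMatch(string email)
    {
        return Regex.IsMatch(email, Strict, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string[] candidates = { "jim@home.co.uk", "jim@homegroup.com", "!!!@home.com", "jim@home.org" };
        foreach (string candidate in candidates)
        {
            Console.WriteLine("{0}: loose={1}, strict={2}",
                candidate, IsLooseMatch(candidate), IsStrictMatch(candidate));
        }
    }
}
```

The address "!!!@home.com" is accepted by the loose pattern, because its second alternative only requires the text to end in "@home.com", while the strict pattern rejects it.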

Resources