negative-lookahead in gsub - r

In a recent scenario I wanted to extract the very last part of each URL in a vector of URLs.
E.g.
> urls <- c('https::abc/efg/hij/', 'https::abc/efg/hij/lmn/', 'https::abc/efg/hij/lmn/opr/')
> rs <- regexpr("([^/])*(?=/$)", urls, perl = TRUE)
> substr(urls, rs, rs + attr(rs, 'match.length'))
[1] "hij/" "lmn/" "opr/"
which is reasonably simple to read. But I'd like to understand how I could do something similar by inverting the lookahead expression, e.g. remove the second-to-last '/' and anything preceding it (assuming that the string always ends with '/'). I can't seem to get the exact logic straight:
> gsub('([^/]|[/])(?!([^/]*/)$)', '', urls, perl = TRUE)
[1] "/hij" "/lmn" "/opr"
Basically I'm looking for the regexp logic that would return the result in the first example, but using only a single gsub call.

To get a match only, you could still use the lookahead construct:
^.*/(?=[^/]*/$)
^ Start of the string
.*/ Match as much as possible up to a / (backtracking until the lookahead succeeds, which is at the second-to-last /)
(?= Positive lookahead, assert that what is on the right is
[^/]*/$ 0+ times any char except /, then a / at the end of the string
) Close lookahead
For example
gsub('^.*/(?=[^/]*/$)', '', urls, perl = TRUE)
An option using a negative lookahead:
^.*/(?!$)
^ Start of string
.*/ Match up to the last / for which the lookahead below succeeds
(?!$) Negative lookahead, assert what is directly to the right is not the end of the string
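For comparison, both patterns appear to behave the same way outside R. The following is a minimal sketch in Python's re module, assuming the same sample URLs as in the question; the function names are made up for illustration and are not part of the original answer.

```python
import re

urls = ["https::abc/efg/hij/", "https::abc/efg/hij/lmn/", "https::abc/efg/hij/lmn/opr/"]

def strip_prefix_positive(url):
    # Remove everything up to the "/" that is followed by "<non-slashes>/" at the end.
    return re.sub(r"^.*/(?=[^/]*/$)", "", url)

def strip_prefix_negative(url):
    # Remove everything up to the last "/" that is not itself at the end of the string.
    return re.sub(r"^.*/(?!$)", "", url)

positive = [strip_prefix_positive(u) for u in urls]
negative = [strip_prefix_negative(u) for u in urls]
print(positive)
print(negative)
```

In both cases the greedy .*/ first runs to the final slash, fails the lookahead there, and backtracks one slash to the left.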

The non-regex & very quick solution would be to use basename():
basename(urls)
[1] "hij" "lmn" "opr"
Or, for your case:
paste0(basename(urls), '/')
[1] "hij/" "lmn/" "opr/"
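The same non-regex idea can be sketched in Python with plain string methods. This is only an illustration under the assumption that every URL ends with a single trailing '/'; last_segment is a hypothetical helper name.

```python
urls = ["https::abc/efg/hij/", "https::abc/efg/hij/lmn/", "https::abc/efg/hij/lmn/opr/"]

def last_segment(url):
    # Drop the trailing "/" first, then keep whatever follows the last remaining "/".
    return url.rstrip("/").rsplit("/", 1)[-1]

names = [last_segment(u) for u in urls]
with_slash = [name + "/" for name in names]
print(names)
print(with_slash)
```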

My preferred method is to replace the whole string with a part of the string, like so:
gsub("^.*/([^/]+/)$", "\\1", urls)
The "\\1" is a backreference to whatever was matched inside the first ().
So basically I am replacing the whole string with the last part of the URL.
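A rough Python equivalent of the capture-group approach, as a sketch only (keep_last_part is a hypothetical name): the pattern consumes the entire string and the replacement puts back just group 1.

```python
import re

urls = ["https::abc/efg/hij/", "https::abc/efg/hij/lmn/", "https::abc/efg/hij/lmn/opr/"]

def keep_last_part(url):
    # The whole string is matched; group 1 captures the final "segment/".
    return re.sub(r"^.*/([^/]+/)$", r"\1", url)

last_parts = [keep_last_part(u) for u in urls]
print(last_parts)
```

A string that does not match the pattern at all is returned unchanged, which may or may not be what you want.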

Related

Regular expression for a username

I'm trying to write a regular expression for a username that fits the following criteria...
Must be between 6 and 16 characters,
any 4 of which must be letters (though not necessarily consecutive),
May also contain letters, numbers, dash and underscore.
So _1Bobby1_ and -Bo-By19- would match, but _-bo-_ and -123-456_ wouldn't.
I've tried:
^(?=.*[a-zA-Z].{4})([a-zA-Z0-9_-]{6,16})$
But this doesn't seem to work, I've looked online and can't find anything that works and used Regexper to visualise the expression and try to build it from scratch.
Any pointers would be greatly appreciated.
This regex can be used to verify the username:
^(?=.{6,16}$)(?=(?:.*[A-Za-z]){4})[\w-]+$
Regex Breakdown
^ #Start of string
(?=.{6,16}$) #There should be between 6 to 16 characters
(?=
(?:.*[A-Za-z]){4} # Lookahead to match 4 letters anywhere in the string
)
[\w-]+ #If the above conditions hold, match the string. It should only contain digits, letters, underscores and dashes
$ #End of string. Not necessary as the first check (?=.{6,16}$) itself does that
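To check the breakdown against the examples from the question, here is a small Python sketch. It assumes ASCII-only usernames, so the character class is spelled out instead of \w (which in Python also matches non-ASCII letters); is_valid_username is a hypothetical name.

```python
import re

# Two lookaheads (length, at least four letters) followed by the allowed characters.
USERNAME_RE = re.compile(r"(?=.{6,16}$)(?=(?:.*[A-Za-z]){4})[A-Za-z0-9_-]+")

def is_valid_username(name):
    # fullmatch anchors the pattern at both ends of the string.
    return USERNAME_RE.fullmatch(name) is not None

for candidate in ["_1Bobby1_", "-Bo-By19-", "_-bo-_", "-123-456_"]:
    print(candidate, is_valid_username(candidate))
```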
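// requires: using System.Linq; (for Count)
bool IsValid(string userName)
{
    // 6 to 16 characters, at least 4 of which are letters
    return userName.Length >= 6 && userName.Length <= 16 && userName.Count(s => char.IsLetter(s)) >= 4;
}
It is simpler not to use regular expressions.
You can also use the other char.Is[something] methods (for example char.IsLetterOrDigit) if you need to restrict which characters are allowed.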
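The non-regex idea translates directly to other languages. Below is a hedged Python sketch that also enforces the allowed-character rule from the question; the helper name and the ASCII-only assumption are mine.

```python
import string

# Letters, digits, dash and underscore, as listed in the question (ASCII only, by assumption).
ALLOWED = set(string.ascii_letters + string.digits + "_-")

def is_valid_username_plain(name):
    if not 6 <= len(name) <= 16:
        return False
    if any(ch not in ALLOWED for ch in name):
        return False
    # Count the letters; they do not have to be consecutive.
    letters = sum(ch in string.ascii_letters for ch in name)
    return letters >= 4

print(is_valid_username_plain("_1Bobby1_"))
print(is_valid_username_plain("-123-456_"))
```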

Matching series of Ampersands in R?

I am unable to solve the problem below and would appreciate any help.
I have series of ampersands (&) in my data and I want to replace each pair of ampersands with some value, but for some reason I am unable to do it.
My attempt and example:
string1 <- "This aa should be replaced: but this aaa shouldn't"
string2 <- "This && should be replaced: but this &&& shouldn't"
gsub("aa", "XXX", string1) #1.
gsub("\\baa\\b", "XXX", string1) #2.
gsub("&&", "XXX", string2) #3.
gsub("\\b&&\\b", "XXX", string2) #4.
Above, if I want to match 'aa' in string1, I can use two approaches.
In approach 1 (denoted #1), I can simply pass 'aa', but this will also partially match 'aaa', which I don't want; I want my regex to match exactly a pair of 'a', which in my case is 'aa'.
To solve this I use regex #2, which works fine.
Now, in string2, I expected similar behavior, where instead of matching a pair of 'a' I want to match a pair of '&', but it is not matching.
Attempt #3 works, but it is not the result I want, as it also partially matches '&&&'.
Attempt #4 does not work for some reason: it does not replace anything.
My question is:
1) Why do pairs of ampersands not work with word boundaries?
2) What is the workaround for this problem?
I really had a hard time and spent my entire day on this; I tried finding a solution on Google, without success so far.
If someone knows of an existing post, please redirect me to it, or if someone finds this is a duplicate, please let me know and I will remove it.
Thanks for your help and for reading the question.
EDIT: My word boundary is space for now.
Outputs:
> gsub("aa", "XXX", string1)
[1] "This XXX should be replaced: but this XXXa shouldn't"
> gsub("\\baa\\b", "XXX", string1)
[1] "This XXX should be replaced: but this aaa shouldn't"
>
> gsub("&&", "XXX", string2)
[1] "This XXX should be replaced: but this XXX& shouldn't"
> gsub("\\b&&\\b", "XXX", string2)
[1] "This && should be replaced: but this &&& shouldn't"
>
Note: I have also checked with perl=TRUE, but it's not working either.
The \b word boundary means:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Since & is not a word char, the "\\b&&\\b" pattern matches && only when it is enclosed with word chars (letters, digits or _ chars).
To match whitespace boundaries, you may use
gsub("(?<!\\S)&&(?!\\S)", "XXX", string2, perl=TRUE)
The pattern matches
(?<!\\S) - a location not immediately preceded with a non-whitespace char (that is, there must be start of string or a whitespace char immediately to the left of the current location)
&& - a literal substring
(?!\\S) - a location not immediately followed with a non-whitespace char (that is, there must be end of string or a whitespace char immediately to the right of the current location).
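The difference between the two kinds of boundary can be seen in a short Python sketch (the lookarounds work the same way in Python's re module; the variable names are only illustrative):

```python
import re

string2 = "This && should be replaced: but this &&& shouldn't"

# \b needs a word character next to the ampersands, so nothing matches here.
with_word_boundary = re.sub(r"\b&&\b", "XXX", string2)

# Whitespace boundaries: no non-whitespace char directly before or after the pair.
with_space_boundary = re.sub(r"(?<!\S)&&(?!\S)", "XXX", string2)

print(with_word_boundary)
print(with_space_boundary)
```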
More specific, but you could use a function that works in steps, like so:
replace2steps <- function(mystring, toreplace, replacement, toexclude, intermediate) {
  # hide the substring that must be left alone behind a placeholder
  intermstring <- gsub(toexclude, intermediate, mystring)
  # do the actual replacement
  result <- gsub(toreplace, replacement, intermstring)
  # restore the hidden substring
  result <- gsub(intermediate, toexclude, result)
  return(result)
}
replace2steps(string2, "&&", "XX", "&&&", "%%%")
[1] "This XX should be replaced: but this &&& shouldn't"
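The same placeholder trick, sketched in Python with plain (non-regex) string replacement. It assumes the placeholder never occurs in the input, and like the R version it does not handle runs of four or more ampersands specially; replace_two_steps is a hypothetical name.

```python
def replace_two_steps(text, to_replace, replacement, to_exclude, placeholder):
    # Step 1: hide the substring that must survive behind a placeholder.
    hidden = text.replace(to_exclude, placeholder)
    # Step 2: perform the real replacement on what is left.
    swapped = hidden.replace(to_replace, replacement)
    # Step 3: put the protected substring back.
    return swapped.replace(placeholder, to_exclude)

string2 = "This && should be replaced: but this &&& shouldn't"
result = replace_two_steps(string2, "&&", "XX", "&&&", "%%%")
print(result)
```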

How to replace special characters using regex

I'm using regex in ASP.NET.
I've written an extension method that I want to use to replace whole words - a word might also be a single special character like '&'.
In this case I want to replace '&' with 'and', and I'll need to use the same technique to reverse it back from 'and' to '&', so it must work for whole words only and not extended words like 'hand'.
I've tried a few variations for the regex pattern - started with '\bWORD\b' which didn't work at all for the ampersand, and now have '\sWORD\s' which almost works except that it also removes the spaces around the word, meaning that a phrase like "health & beauty" ends up as "healthandbeauty".
Any help appreciated.
Here's the extension method:
public static string ReplaceWord(this string @this,
    string wordToFind,
    string replacement,
    RegexOptions regexOptions = RegexOptions.None)
{
    Guard.String.NotEmpty(() => @this);
    Guard.String.NotEmpty(() => wordToFind);
    Guard.String.NotEmpty(() => replacement);
    var pattern = string.Format(@"\s{0}\s", wordToFind);
    return Regex.Replace(@this, pattern, replacement, regexOptions);
}
In order to match a dynamic string that should be enclosed with whitespace (or be located at the start or end of the string), you can use negative lookarounds:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", wordToFind);
or, even safer, escape the word first so that any special characters in it are treated literally:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", Regex.Escape(wordToFind));
The (?<!\S) lookbehind will fail the match if the word is preceded with a non-whitespace character, and the (?!\S) lookahead will fail the match if the word is followed with a non-whitespace character.
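For readers outside .NET, the same construction can be sketched in Python, with re.escape playing the role of Regex.Escape. The function name replace_word is only illustrative, and a callable is used as the replacement so that backslashes in it are not interpreted.

```python
import re

def replace_word(text, word, replacement):
    # Whitespace (or string start/end) must surround the escaped word.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    # A function replacement inserts the text literally.
    return re.sub(pattern, lambda match: replacement, text)

print(replace_word("health & beauty", "&", "and"))
print(replace_word("hand and foot", "and", "&"))
```

Because the surrounding whitespace is only asserted and never consumed, the spaces around the word stay in place.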

Regex: Split X length words

I'm new to regular expressions. I have a gigantic text. In the application, I need the words of 4 characters and want to delete the rest. The text is in Spanish. So far, I can select 4-character words, but I still need to delete the rest.
This is my regular expression:
\s(\w{3,3}[a-zA-ZáéíóúäëïöüñÑ])\s
How can I get all words with 4 letters in ASP.NET VB?
/(?:\A|(?<=\P{L}))(\p{L}{4})(?:(?=\P{L})|\z)/g
Explanation:
The /g switch makes the search repeat (global match)
\A is start of the string (not start of line)
\p{L} matches a single code point in the category letter
\P{L} matches a single code point not in the category letter
{n} specify a specific amount of repetition [n is number]
\z is end of string (not end of line)
| is logic OR operator
(?<=) is lookbehind
(?=) is lookahead
(?:) is a non-capturing group (it cannot be backreferenced)
() is a capturing group (it can be backreferenced)
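The same idea (a run of exactly four letters with no letter on either side) can be sketched in Python. Python's re module has no \p{L}, so the class [^\W\d_] is used here as an approximation of "any letter"; that substitution and the names below are assumptions, not part of the answer.

```python
import re

# Approximation of \p{L}: a word character that is neither a digit nor an underscore.
LETTER = r"[^\W\d_]"
FOUR_LETTER_WORD = re.compile(rf"(?<!{LETTER}){LETTER}{{4}}(?!{LETTER})")

def four_letter_words(text):
    # findall returns every non-overlapping match, like the /g switch.
    return FOUR_LETTER_WORD.findall(text)

sample = "Hola, esto es una casa azul; el ni\u00f1o sube al \u00e1rbol."
words = four_letter_words(sample)
print(words)
```

Collecting the matches and joining them afterwards is usually simpler than trying to match and delete everything that is not a four-letter word.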
This uses the character class provided in another answer (unfortunately, \w does NOT match accented Spanish characters in every regex flavor).
You can use this for a match (it matches the reverse, basically matches everything that is NOT a 4-character word, so you can replace with " ", leaving only the 4-character words):
/(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)/gis
Approximated code in VB (not tested; needs Imports System.Text.RegularExpressions):
Dim input As String = "This is your text"
' .NET patterns take no /.../ delimiters; Replace is already global, and the i and s flags become RegexOptions
Dim pattern As String = "(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)"
Dim replacement As String = " "
Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
Dim result As String = rgx.Replace(input, replacement)
Console.WriteLine("Original String: {0}", input)
Console.WriteLine("Replacement String: {0}", result)
You can see the result of the regex in action here:
http://regexr.com?30n29
/[^a-zA-ZáéíóúäëïöüñÑ][a-zA-ZáéíóúäëïöüñÑ]{4}[^a-zA-ZáéíóúäëïöüñÑ]/g
Translated:
A non-letter, followed by 4 letters, followed by a non-letter. The 'g' flag makes it match globally, i.e. more than once.
Check out this link to find out more info on looping over your matches:
http://osherove.com/blog/2003/5/12/practical-parsing-using-groups-in-regular-expressions.html

Regular Expression required for condition

I need a regular expression which can match a string with the following requirements:
Must be between 6 and 64 characters long
Cannot include the following symbols : @, &, ', <, >, !, ", /, #, $, %, +, ?, (, ), *, [ , ] , \ , { , }
Cannot contain spaces, tabs, or consecutive underscores, i.e. __
Cannot contain elements that imply an email address or URL, such as ".com", ".net", ".org", ".edu" or any variation (e.g. "_com" or "-com")
Cannot start with underscore '_', dash '-' or period '.'
Cannot contain the words "honey" or "allied"
Cannot contain single letter followed by numbers
This is better done with several regular expressions! And some of your conditions don't even need regexes (in fact, they would be counterproductive). In the order of your requirements:
use a string length function
use a string function that looks up each of these characters in your string
match against _{2,} and \s
match against [._-](?:com|net|....)
use a string function looking for these characters at the first position, or ^[-._]
whole words? What about "calliedaaa"? If whole words, match against \b(?:honey|allied)\b, otherwise use a string lookup function
match against [A-Za-z]\d+ (\w\d+ would also accept a digit or an underscore as the first character)
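The "several small checks" approach could look roughly like this in Python. It is a sketch under stated assumptions: the rule labels and failed_rules are hypothetical names, the TLD list is limited to the four named in the question, '@' is included in the forbidden set as an assumption, "honey"/"allied" are treated as substrings rather than whole words, and the last rule is read as "a letter immediately followed by digits".

```python
import re

# Symbols from the question; "@" is an assumption (implied by the email rule).
FORBIDDEN_SYMBOLS = set("#&'<>!\"/$%+?()*[]\\{}@")

def failed_rules(name):
    failures = []
    if not 6 <= len(name) <= 64:
        failures.append("length")
    if any(ch in FORBIDDEN_SYMBOLS for ch in name):
        failures.append("symbol")
    if re.search(r"_{2,}|\s", name):
        failures.append("whitespace_or_double_underscore")
    if re.search(r"[._-](?:com|net|org|edu)", name, re.IGNORECASE):
        failures.append("tld")
    if name.startswith(("_", "-", ".")):
        failures.append("leading")
    if re.search(r"honey|allied", name, re.IGNORECASE):
        failures.append("word")
    # A letter immediately followed by one or more digits.
    if re.search(r"[A-Za-z]\d+", name):
        failures.append("letter_digits")
    return failures

print(failed_rules("good_name"))
print(failed_rules("john_com_x"))
```

Returning the list of failed rules, rather than a single boolean, makes it easy to tell the user which requirement was violated.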
