Regex: Split X length words - asp.net

I'm new to regular expressions. I have a very large text, and in my application I need to keep only the words that are exactly 4 characters long and delete the rest. The text is in Spanish. So far I can select the 4-character words, but I still need to delete everything else.
This is my regular expression:
\s(\w{3,3}[a-zA-ZáéíóúäëïöüñÑ])\s
How can I get all words with 4 letters in ASP.NET VB?

/(?:\A|(?<=\P{L}))(\p{L}{4})(?:(?=\P{L})|\z)/g
Explanation:
The /.../ delimiters and the g switch are Perl-style notation; g means the search is repeated to find every match. In .NET, pass only the part between the slashes to Regex and call Regex.Matches to get every match.
\A is the start of the string (not the start of a line)
\p{L} matches a single code point in the category "letter"
\P{L} matches a single code point not in the category "letter"
{n} repeats the preceding element exactly n times (n is a number)
\z is the end of the string (not the end of a line)
| is the alternation (logical OR) operator
(?<=) is a lookbehind
(?=) is a lookahead
(?:) is a non-capturing group
() is a capturing group
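For readers who want to try the lookaround idea outside .NET, here is a minimal sketch in Python rather than VB. Python's built-in re module has no \p{L}, so the sketch assumes [^\W\d_] as a stand-in for "any letter"; the function name four_letter_words and the sample sentence are illustrative only.

```python
import re
import unicodedata

# [^\W\d_] means "a word character that is neither a digit nor an underscore".
# It is an assumed stand-in for \p{L}, which Python's re module does not support.
LETTER = r"[^\W\d_]"

# Exactly four letters, with no letter directly before or after them.
FOUR_LETTERS = re.compile(rf"(?<!{LETTER}){LETTER}{{4}}(?!{LETTER})")


def four_letter_words(text: str) -> list[str]:
    """Return every run of exactly four letters found in text."""
    # Normalise to NFC so that an accented letter is one code point, not two.
    return FOUR_LETTERS.findall(unicodedata.normalize("NFC", text))


sample = "Hola niño, la casa está muy lejos del árbol"
print(" ".join(four_letter_words(sample)))
```

Joining the matches with a space gives the "keep the 4-letter words, delete the rest" result directly, without a separate replace step.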

This uses the character class from another answer, because \w does not match accented Spanish letters in ASCII-only regex flavors such as JavaScript (.NET's \w is Unicode-aware unless RegexOptions.ECMAScript is set).
The pattern below matches the reverse: it matches everything that is NOT a 4-character word, so you can replace each match with " " and leave only the 4-character words:
/(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)/gis
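Approximate code in VB (not tested):
Imports System.Text.RegularExpressions

Dim input As String = "This is your text"
' .NET patterns take no /.../ delimiters; the i and s flags become RegexOptions,
' and Replace already substitutes every match (the g flag).
Dim pattern As String = "(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)"
Dim replacement As String = " "
Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
Dim result As String = rgx.Replace(input, replacement)
Console.WriteLine("Original String: {0}", input)
Console.WriteLine("Replacement String: {0}", result)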
You can see the result of the regex in action here:
http://regexr.com?30n29

/[^a-zA-ZáéíóúäëïöüñÑ][a-zA-ZáéíóúäëïöüñÑ]{4}[^a-zA-ZáéíóúäëïöüñÑ]/g
Translated:
A non-letter, followed by 4 letters, followed by a non-letter. The trailing 'g' makes the pattern match globally, i.e. more than once.
Check out this link for more information on looping over your matches:
http://osherove.com/blog/2003/5/12/practical-parsing-using-groups-in-regular-expressions.html
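One thing to keep in mind with this pattern is that the surrounding non-letters are consumed by each match. Two 4-letter words separated by a single space therefore cannot both match, and a word at the very start or end of the text has no non-letter next to it. The sketch below, in Python purely for illustration, compares that behaviour with lookarounds while looping over the matches; the sample text and variable names are made up.

```python
import re

LETTERS = "a-zA-ZáéíóúäëïöüñÑ"  # character class taken from the answer above

# Variant 1: the neighbouring non-letters are part of the match (consumed).
consuming = re.compile(f"[^{LETTERS}]([{LETTERS}]{{4}})[^{LETTERS}]")

# Variant 2: the neighbours are only inspected through lookarounds.
lookaround = re.compile(f"(?<![{LETTERS}])[{LETTERS}]{{4}}(?![{LETTERS}])")

text = "una casa está aquí"

# Loop over the match objects and pull out the word itself.
consumed_words = [m.group(1) for m in consuming.finditer(text)]
lookaround_words = [m.group(0) for m in lookaround.finditer(text)]

print(consumed_words)    # only "casa": its trailing space is used up
print(lookaround_words)  # "casa", "está" and "aquí"
```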

Related

negative-lookahead in gsub

In a recent scenario I wanted to extract the very last part of each URL in a vector of URLs, e.g.:
> urls <- c('https::abc/efg/hij/', 'https::abc/efg/hij/lmn/', 'https::abc/efg/hij/lmn/opr/')
> rs <- regexpr("([^/])*(?=/$)", urls, perl = TRUE)
> substr(urls, rs, rs + attr(rs, 'match.length'))
[1] "hij/" "lmn/" "opr/"
which is fairly simple to read. But I'd like to understand how I could do something similar by inverting the lookahead expression, e.g. by removing the second-to-last '/' and everything preceding it (assuming that the string always ends with '/'). I can't seem to get the logic quite right:
> gsub('([^/]|[/])(?!([^/]*/)$)', '', urls, perl = TRUE)
[1] "/hij" "/lmn" "/opr"
Basically I'm looking for the regexp logic that would return the result in the first example, but using only a single gsub call.
To get a match only, you could still use the lookahead construct:
^.*/(?=[^/]*/$)
^ Start of the string
.*/ Match until the last /
(?= Positive lookahead, assert that what is on the right is
[^/]*/$ 0+ times any char except /, followed by / at the end of the string
) Close lookahead
For example
gsub('^.*/(?=[^/]*/$)', '', urls, perl = TRUE)
An option using a negative lookahead:
^.*/(?!$)
^ Start of string
.*/ Match until the last / that passes the lookahead below
(?!$) Negative lookahead, assert what is directly to the right is not the end of the string
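Both patterns use only constructs that most regex engines share. As a cross-check, here is a small Python sketch (Python is used only for illustration, and the helper name last_segment is hypothetical) that applies them with re.sub to the same sample URLs.

```python
import re

urls = ["https::abc/efg/hij/", "https::abc/efg/hij/lmn/", "https::abc/efg/hij/lmn/opr/"]

POSITIVE = re.compile(r"^.*/(?=[^/]*/$)")  # stop right before the final "segment/"
NEGATIVE = re.compile(r"^.*/(?!$)")        # last "/" that does not end the string


def last_segment(url: str, pattern: re.Pattern) -> str:
    """Delete everything the pattern matches, keeping the trailing segment."""
    return pattern.sub("", url)


print([last_segment(u, POSITIVE) for u in urls])
print([last_segment(u, NEGATIVE) for u in urls])
```

In both cases the greedy .* first runs to the end of the string and then backtracks until the lookahead is satisfied, which is why the match ends at the second-to-last '/'.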
The non-regex & very quick solution would be to use basename():
basename(urls)
[1] "hij" "lmn" "opr"
Or, for your case:
paste0(basename(urls), '/')
[1] "hij/" "lmn/" "opr/"
My preferred method is to replace the whole string with part of the string, like so:
gsub("^.*/([^/]+/)$", "\\1", urls)
The "\\1" refers to whatever was matched inside ().
So basically I am replacing the whole string with the last part of the URL.
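The same capture-group idea can be sketched in Python, shown here only as an illustration of the technique (the names LAST_PART and trimmed are made up):

```python
import re

urls = ["https::abc/efg/hij/", "https::abc/efg/hij/lmn/", "https::abc/efg/hij/lmn/opr/"]

# Group 1 captures the final "segment/"; the replacement keeps only that group.
LAST_PART = re.compile(r"^.*/([^/]+/)$")

trimmed = [LAST_PART.sub(r"\1", u) for u in urls]
print(trimmed)
```

A string that does not match the pattern at all is returned unchanged, so inputs without a trailing '/' pass through untouched.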

How to replace special characters using regex

Using Asp.net for regex.
I've written an extension method that I want to use to replace whole words - a word might also be a single special character like '&'.
In this case I want to replace '&' with 'and', and I'll need to use the same technique to reverse it back from 'and' to '&', so it must work for whole words only and not extended words like 'hand'.
I've tried a few variations for the regex pattern - started with '\bWORD\b' which didn't work at all for the ampersand, and now have '\sWORD\s' which almost works except that it also removes the spaces around the word, meaning that a phrase like "health & beauty" ends up as "healthandbeauty".
Any help appreciated.
Here's the extension method:
public static string ReplaceWord(this string @this,
    string wordToFind,
    string replacement,
    RegexOptions regexOptions = RegexOptions.None)
{
    Guard.String.NotEmpty(() => @this);
    Guard.String.NotEmpty(() => wordToFind);
    Guard.String.NotEmpty(() => replacement);
    var pattern = string.Format(@"\s{0}\s", wordToFind);
    return Regex.Replace(@this, pattern, replacement, regexOptions);
}
In order to match a dynamic string that should be enclosed with spaces (or be located at the start or end of the string), you can use negative lookarounds:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", wordToFind);
                              ^^^^^^^   ^^^^^^
or even safer:
var pattern = string.Format(@"(?<!\S){0}(?!\S)", Regex.Escape(wordToFind));
                                                 ^^^^^^^^^^^^^
The (?<!\S) lookbehind fails the match if the word is preceded by a non-whitespace character, and the (?!\S) lookahead fails the match if the word is followed by a non-whitespace character. Because lookarounds consume nothing, the surrounding spaces are left in place.
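To see the lookaround approach work in both directions, here is a minimal sketch in Python (used for illustration only; replace_word is a hypothetical helper, not the C# extension method from the question):

```python
import re


def replace_word(text: str, word: str, replacement: str) -> str:
    """Replace word only where it is delimited by whitespace or the string edges."""
    # re.escape neutralises regex metacharacters in the search term.
    pattern = rf"(?<!\S){re.escape(word)}(?!\S)"
    # A function replacement inserts the text literally (no backslash handling).
    return re.sub(pattern, lambda _match: replacement, text)


print(replace_word("health & beauty", "&", "and"))
print(replace_word("hand and foot", "and", "&"))
```

The first call yields "health and beauty" with the spaces intact, and the second leaves "hand" alone because the "and" inside it is preceded by a non-whitespace character.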

How to split strings from a line in robot framework

How do I get the rest of the values from the variable?
${random employee}=    Convert To String    ${random emp}
${replace}=    Remove String Using Regexp    ${random employee}    ['\\[\\]\\,]
${splitline}=    Fetch From Left    ${replace}    ${SPACE}
Output:
${replace} = Alagu kartest1234+3alagu@gmail.cokartest1234+3ramu@gmail.com Developer Team B3 Team lead
${splitline} = Alagu
How do I get the rest of the values from the variable ${replace}?
The Split String keyword from the String standard library does this.
Split String    string, separator=None, max_split=-1
Splits the string using separator as a delimiter string.
If a separator is not given, any whitespace string is a separator. In that case, consecutive whitespace as well as leading and trailing whitespace is also ignored.
Split words are returned as a list. If the optional max_split is given, at most max_split splits are done, and the returned list will have at most max_split + 1 elements.
Examples:
@{words} =    Split String    ${string}
@{words} =    Split String    ${string}    ,${SPACE}
To get a single value from @{words}, use the common array syntax: @{NAME}[i], where i is the index of the selected value. Indexes start from zero.
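The separator and max_split semantics described above closely mirror Python's str.split, which makes a reasonable mental model. That Split String behaves identically in every edge case is my assumption, not something stated here, and the sample string is hypothetical.

```python
line = "Alagu  Developer Team B3   Team lead"

# No separator: any run of whitespace splits, leading/trailing blanks are ignored.
words = line.split()

# Equivalent of max_split=1: at most one split, so at most two elements come back.
first, rest = line.split(None, 1)

# Explicit separator ", " (what ,${SPACE} denotes in the second example).
parts = "a, b, c".split(", ")

print(words)
print(first, "|", rest)
print(parts)
```

Limiting the split count to 1 is one way to separate the first value from "the rest of the values" in a single call.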

How do I create a function to check if a string only consists of A-Z , 0-9

Is there any way to check in XQuery (with an XQuery function) whether an input string has only the characters A-Z or the digits 0-9 and no other characters?
For example, if the string is ABZ10 the function should return true, and if the input string is 5& 123x it should return false.
I am able to do it in Java by simply using the following:
inputstring.matches("^[0-9A-Z]+$")
Use:
matches($vYourString, '^[A-Z0-9]+$')
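For comparison, the same whole-string check can be sketched in Python (illustrative only; is_upper_alnum is a made-up name). It assumes re.fullmatch instead of explicit ^ and $ anchors, because in Python a plain $ also matches just before a trailing newline.

```python
import re


def is_upper_alnum(value: str) -> bool:
    """True only if value is non-empty and made solely of A-Z and 0-9."""
    # fullmatch anchors at both ends, so ^ and $ are not needed.
    return re.fullmatch(r"[A-Z0-9]+", value) is not None


print(is_upper_alnum("ABZ10"))
print(is_upper_alnum("5& 123x"))
```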

Regular Expression required for condition

I need a regular expression which can match a string with the following requirements:
Must be between 6 and 64 characters long
Cannot include the following symbols: @, &, ', <, >, !, ", /, #, $, %, +, ?, (, ), *, [, ], \, {, }
Cannot contain spaces, tabs, or consecutive underscores, i.e. __
Cannot contain elements that imply an email address or URL, such as ".com", ".net", ".org", ".edu" or any variation (e.g. "_com" or "-com")
Cannot start with underscore '_', dash '-' or period '.'
Cannot contain the words "honey" or "allied"
Cannot contain single letter followed by numbers
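This is better done with several regular expressions! And some of your conditions don't even need regexes (in fact, regexes would be counterproductive for them). Taking the conditions in order:
length: use a string length function
forbidden symbols: use a function that looks up each of those characters in your string
spaces, tabs, consecutive underscores: match against _{2,} and \s
email/URL elements: match against [._-](?:com|net|....)
first character: use a string function that checks the character at the first position, or match against ^[-._]
"honey" and "allied": whole words only? What about "calliedaaa"? If whole words, match against \b(?:honey|allied)\b, otherwise use a string lookup function
letter followed by numbers: match against \w\d+ (or [A-Za-z]\d+ if the first character must strictly be a letter, since \w also matches digits and underscores)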
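The "one small check per rule" approach could look roughly like the Python sketch below. Several points are my assumptions rather than part of the answer: the forbidden-symbol set is my reading of the requirement list (including "@"), the TLD list covers only the four suffixes named in the question and not "any variation", "honey" and "allied" are treated as substrings, and "single letter followed by numbers" is read as any ASCII letter directly followed by a digit. CHECKS and failed_rules are hypothetical names.

```python
import re

FORBIDDEN = set("@&'<>!\"/#$%+?()*[]\\{}")  # "@" is an assumption, see above

# One (label, predicate) pair per requirement; a predicate returns True when OK.
CHECKS = [
    ("length 6-64", lambda s: 6 <= len(s) <= 64),
    ("no forbidden symbols", lambda s: not (set(s) & FORBIDDEN)),
    ("no whitespace or double underscore", lambda s: not re.search(r"_{2,}|\s", s)),
    ("no email/URL element", lambda s: not re.search(r"[._-](?:com|net|org|edu)", s, re.I)),
    ("valid first character", lambda s: not re.match(r"[-._]", s)),
    ("no banned word", lambda s: not re.search(r"honey|allied", s, re.I)),
    ("no letter followed by digits", lambda s: not re.search(r"[A-Za-z]\d", s)),
]


def failed_rules(name: str) -> list[str]:
    """Return the labels of all rules that name violates (empty list = valid)."""
    return [label for label, ok in CHECKS if not ok(name)]


print(failed_rules("good_name-ok"))
print(failed_rules("_honey.com"))
```

Keeping the rules separate also makes it possible to tell the user which requirement failed, which a single combined regex cannot do.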
