R regex match whole word taking punctuation into account

I'm in R. I want to match whole words in text, taking punctuation into account.
Example:
to_match = c('eye','nose')
text1 = 'blah blahblah eye-to-eye blah'
text2 = 'blah blahblah eye blah'
I would like eye to be matched in text2 but not in text1.
That is, the command:
to_match[sapply(paste0('\\<',to_match,'\\>'),grepl,text1)]
should return character(0). But right now, it returns eye.
I also tried with '\\b' instead of '\\<', with no success.

Use
to_match[sapply(paste0('(?:\\s|^)',to_match,'(?:\\s|$)'),grepl,text1)]
The point is that a word boundary matches between a word character and a non-word character, which is why you had a match in eye-to-eye: the hyphen is a non-word character. You want to match only when the word sits between the start or end of the string and whitespace.
In a TRE regex, this is better done with groups, since this regex library does not support lookarounds and you only need to test a string for a single pattern match that returns TRUE or FALSE.
The (?:\s|^) noncapturing group matches any whitespace or start of string and (?:\s|$) matches whitespace or end of string.
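For comparison, here is a minimal sketch of the same idea in Python rather than R. Python's re module stands in for TRE here, and the helper names whole_token_matches and boundary_matches are hypothetical. It illustrates that a plain word-boundary pattern still finds eye inside eye-to-eye, while the whitespace-or-anchor groups do not.

```python
import re

to_match = ["eye", "nose"]
text1 = "blah blahblah eye-to-eye blah"
text2 = "blah blahblah eye blah"

def whole_token_matches(words, text):
    """Return the words delimited by whitespace or the string edges."""
    return [w for w in words
            if re.search(r"(?:\s|^)" + re.escape(w) + r"(?:\s|$)", text)]

def boundary_matches(words, text):
    """Return the words that match when wrapped in plain word boundaries."""
    return [w for w in words
            if re.search(r"\b" + re.escape(w) + r"\b", text)]

print(whole_token_matches(to_match, text1))  # []
print(whole_token_matches(to_match, text2))  # ['eye']
print(boundary_matches(to_match, text1))     # ['eye']
```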

Related

Remove all whitespace from string AX 2012

The initHeader method of the PurchPackingSlipJournalCreate class has this line:
vendPackingSlipJour.PackingSlipId = purchParmTable.Num;
When I copy and paste ' FDG 2020 ' (all blanks are tab characters) into the Num field and click OK, I want the value to be written as 'FDG2020' in the PackingSlipId field of the vendPackingSlipJour table.
I tried -> vendPackingSlipJour.PackingSlipId = strRem(purchParmTable.Num, " ");
but that doesn't work for tab characters.
How can I remove all whitespace characters from a string?
Version 1
Try the strAlpha() function.
From the documentation:
Copies only the alphanumeric characters from a string.
Version 2
Because version 1 also deletes allowed hyphens (-), you could use strKeep().
From the documentation:
Builds a string by using only the characters from the first input string that the second input string specifies should be kept.
This will require you to specify all desired characters, a rather long list...
Version 3
Use regular expressions to replace any unwanted characters (defined as "not a wanted character"). This is similar to version 2, but the list of allowed characters can be expressed much more concisely.
The example below allows alphanumeric characters (a-z, A-Z, 0-9), underscores (_) and hyphens (-). The final value for newText is ABC-12_3.
str badCharacters = @"[^a-zA-Z0-9_-]"; // so NOT an allowed character
str newText = System.Text.RegularExpressions.Regex::Replace(' ABC-12_3 ', badCharacters, '');
Version 4
If you know the only unwanted characters are tabs ('\t'), then you can go hunting for those specifically as well.
vendPackingSlipJour.PackingSlipId = strRem(purchParmTable.Num, '\t');
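To see the difference between removing all whitespace and keeping only allow-listed characters, here is a small Python sketch. It is not X++; the variable names are made up for illustration, and the tab-separated sample value is an assumption based on the question.

```python
import re

raw = "\tFDG\t2020\t"  # tabs stand in for the pasted blanks (assumed input)

# Remove every whitespace character: spaces, tabs, newlines.
no_whitespace = re.sub(r"\s+", "", raw)

# Allow-list variant mirroring version 3: drop anything that is not
# a letter, digit, underscore or hyphen.
allowed_only = re.sub(r"[^a-zA-Z0-9_-]", "", " ABC-12_3 ")

print(no_whitespace)  # FDG2020
print(allowed_only)   # ABC-12_3
```

The first form only strips whitespace, so other punctuation in the value survives. The second form is stricter and also removes any character outside the allowed set.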

Removing a specific first item in a string in R

I have strings such as:
'THE HOUSE'
'IN THE HOUSE'
'THE THE HOUSE'
And I would like to remove 'THE' only if it occurs at the first position in the string.
I know how to remove 'THE' with:
gsub("\\<THE\\>", "", string)
And I know how to grab the first word with:
"([A-Za-z]+)" or "([[:alpha:]]+)" or "(\\w+)"
But I have no idea how to combine the two to end up with:
'HOUSE'
'IN THE HOUSE'
'THE HOUSE'
Cheers!
You may use
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE")
sub("^THE\\b\\s*", "", string)
## => [1] "HOUSE" "IN THE HOUSE" "THE HOUSE"
Details
^ - start of string
THE - a literal substring
\\b - a word boundary (you may keep \\> trailing word boundary if you wish)
\\s* - 0+ whitespace chars.
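The same anchored pattern can be sketched in Python, as an illustration rather than a replacement for the R call above. The variable names are chosen for this example. Because ^ only matches at the start of the string, only a leading THE is removed.

```python
import re

strings = ["THE HOUSE", "IN THE HOUSE", "THE THE HOUSE"]

# ^ anchors at the start, \b ends the word, \s* eats the following spaces.
stripped = [re.sub(r"^THE\b\s*", "", s) for s in strings]

print(stripped)  # ['HOUSE', 'IN THE HOUSE', 'THE HOUSE']
```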

Regex for "Characters Numbers"

I need a Regex that matches these Strings:
Test 1
Test 123
Test 1.1 (not required but would be neat)
Test
Test a
But not the following:
Test 1a
I don't know what this pattern should look like so that it allows text or whitespace at the end, but not if there is a number before it.
I tried this one
^.*([0-9])$ (matches only Test 1, but not for example Test or Test a)
and this one
^.*[0-9].$ (matches only Test 1a, but not for example Test or Test 1)
but they don't match what I need.
This works for all the cases you provided:
^\w+(\s(\d+(\.\d+)?|[a-z]))?$
Regex Breakdown
^              # Start of string
\w+            # One or more word characters (letters, digits, underscore)
(\s            # A single whitespace character
  (
    \d+        # One or more digits
    (\.\d+)?   # Optional decimal part
    |          # Alternation (OR)
    [a-z]      # A single lowercase letter
  )
)?             # Make the whole group optional
$              # End of string
If you also want to include capital letters, then you can use
^\w+(\s(\d+(\.\d+)?|[A-Za-z]))?$
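A quick way to check a candidate pattern against the sample strings from the question is a small Python script like the one below. This is a sketch using Python's re module, whose \w and \d may differ slightly from other engines on non-ASCII input. The list names are made up for this example.

```python
import re

pattern = re.compile(r"^\w+(\s(\d+(\.\d+)?|[A-Za-z]))?$")

should_match = ["Test 1", "Test 123", "Test 1.1", "Test", "Test a"]
should_not_match = ["Test 1a"]

# Map each sample string to whether the pattern accepts it.
results = {s: bool(pattern.match(s)) for s in should_match + should_not_match}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```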
Try with
^\w+\s+((\d+\.\d+)|(\d+)|([^\d^\s]\w+))?\s*$
Another pattern for you to try:
^(Test(?:$|\s(?:\d$|[a-z]$|\d{3}|\d\.\d$)))
As per your strings in your question (and your comments):
^\w+(\s[a-z]|\s\d+(\.\d+)?)?$

ASP.Net RegEx with ampersand and spaces

I am using the following regular expression to find words and phrases in a document. (Have to use regular expression and have to use \b.)
\b (zoo|a & b|dummy)\b
When I try to find matches in the following string
going to the zoo with a & b
a & b doesn't get matched. However, if I remove the leading and following spaces from the string and the regex, making both a&b, it matches, but I do need those spaces.
Use \s for spaces
// Requires: using System; using System.Text.RegularExpressions;
string strRegex = @"\b\s(zoo|a\s&\sb|dummy)\b";
Regex myRegex = new Regex(strRegex, RegexOptions.None);
string strTargetString = @"going to the zoo with a & b";
// Print every match found in the target string.
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
    Console.WriteLine(myMatch);
}
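The pattern itself can be tried out with a short Python sketch. Python's re module is used here only as a stand-in for the .NET engine, and the variable names are chosen for this example. It collects the captured group of each match in the sample sentence.

```python
import re

pattern = re.compile(r"\b\s(zoo|a\s&\sb|dummy)\b")
target = "going to the zoo with a & b"

# group(1) is the word or phrase without the leading whitespace.
matches = [m.group(1) for m in pattern.finditer(target)]

print(matches)  # ['zoo', 'a & b']
```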

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
My problem is that this doesn't search for words but for substrings. For example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns True because EW is contained in EWM.
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- separating the commaseparatedstring by comma
- calling equals() on each substring instead of contains() on whole thing?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)
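Both approaches can be compared side by side in a short Python sketch. This is not the VB.NET or C# code from the answers; the function names contains_token_split and contains_token_regex are hypothetical. Note that the \b version only acts as a delimiter check when the token starts and ends with a word character, so the split version is the safer default for arbitrary values.

```python
import re

def contains_token_split(csv_text, token):
    """Exact comparison against each comma-separated item."""
    return token in csv_text.split(",")

def contains_token_regex(csv_text, token):
    """Whole-word search using word boundaries around the escaped token."""
    return re.search(r"\b" + re.escape(token) + r"\b", csv_text) is not None

csv_text = "EWM,KI,KP"
print(contains_token_split(csv_text, "EW"))  # False
print(contains_token_split(csv_text, "KI"))  # True
print(contains_token_regex(csv_text, "EW"))  # False
print(contains_token_regex(csv_text, "KI"))  # True
```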

Resources