Unexpected regular expression result in stringr (R)

Would somebody please explain why str_detect (from the stringr package, version 1.1.0) returns TRUE for each of the following three calls, contrary to my expectations?
str_detect("01", "^[0]*[1-9]*[0]+")
str_detect("01", "^0*[1-9]*0+")
str_detect("01", "^0*[1-9]*0")
I wanted to match any zeroes at the beginning, followed by at least one non-zero digit, followed later by a zero in the string.
Clearly the "01" string cannot qualify as it does not have a 0 after the 1.
Am I missing something? Is the pattern wrong for what I am looking for?
Thank you for your time!

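Since everything before the final zero is optional in your patterns (both 0* and [1-9]* may match nothing), those parts match the empty string, and the trailing 0 of the pattern then matches the leading 0 of the string.
Use a $ to anchor the pattern to the end of the string: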
str_detect("01", "^[0]*[1-9]*[0]+$")
str_detect("01", "^0*[1-9]*0+$")
str_detect("01", "^0*[1-9]*0$")
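The explanation above can be checked by printing what the engine actually matched. The sketch below uses Python's re module rather than stringr, purely as an illustration (an assumption: both engines treat these particular patterns the same way), and the helper name matched_text is hypothetical.

```python
import re

def matched_text(pattern, text):
    """Return the substring the pattern matched, or None if nothing matched."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

unanchored = ["^[0]*[1-9]*[0]+", "^0*[1-9]*0+", "^0*[1-9]*0"]
for p in unanchored:
    # Without "$" the optional parts match nothing and the final zero
    # of the pattern consumes the leading "0" of the string.
    print(p, "->", repr(matched_text(p, "01")))
    # With "$" the match must reach the end of the string.
    print(p + "$", "->", repr(matched_text(p + "$", "01")))
```

Seeing the matched substring rather than just TRUE/FALSE usually makes this kind of surprise easier to diagnose.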

I believe you want the following pattern:
^0[1-9]+0
See https://regex101.com/r/v9cwHJ/1 for full pattern explanation.
Your specific error was using * for the first 0; * also matches zero occurrences.
Also use + for the non-zero digits so that at least one is required.
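As a quick check of the suggested pattern, here is a small sketch in Python's re module (used only for illustration; the sample strings are made up):

```python
import re

# "+" requires at least one non-zero digit between the two zeros.
pattern = re.compile(r"^0[1-9]+0")

for s in ["01", "010", "0120", "00"]:
    print(s, bool(pattern.search(s)))
```

Note that this pattern requires exactly one leading zero. If no leading zero or several leading zeros should also be accepted, a variant such as ^0*[1-9]+0 may be closer to the original intent.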

Related

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by a . I tried using Xpath but that seemed even harder than a regex expression. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try the following.
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: gsub (global substitution) replaces every match of a pattern. Here the pattern is a space followed by one or more letters (lower or upper case), and each match in val is replaced with an empty string.
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
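The lookahead approach can be sketched outside R as well. The example below uses Python's re module for illustration; the function name extract_counts and the sample input are assumptions, not part of the original answers.

```python
import re

# A number with optional thousands separators and an optional decimal part,
# matched only when it is immediately followed by the fixed phrase.
NUMBER_BEFORE_PHRASE = re.compile(
    r"\d{1,3}(?:,\d{3})*(?:\.\d+)?(?= people own these)"
)

def extract_counts(html):
    """Return every number that directly precedes ' people own these'."""
    return NUMBER_BEFORE_PHRASE.findall(html)

print(extract_counts(">742,811 people own these<"))
```

Because the phrase sits inside a lookahead, it is required for a match but is not part of the returned text, so no further clean-up of the result is needed.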

Need some help building a somewhat simple REGEX expression

I'm trying to build a somewhat simple regex that accepts only numbers, optionally with a decimal part, with a maximum of 3 digits to the right of the decimal point (thousandths) and 50 to the left. Valid entries would look something like these:
1
1.0
.1
1.011
.011
1202938.123
1237923782.0
So far I have ^([0-9]*|\d*\.\d{1}?\d*){1,999}$. Any help appreciated. Thanks.
I believe this should suffice:
^(?=.)\d{0,50}(?:\.\d{0,3})?$
See the regex demo. Note that this will also match 1. (a trailing dot); if this is undesired, change \d{0,3} to \d{1,3}. Similarly, this regex will match .5 (with no integer part); if you don't want this, use \d{1,50} instead of \d{0,50}.
You could try:
^(?=.+)\d{0,50}(?:\.\d{1,3})?$
Demonstration here at regex101.com
Explanation -
^ tells the regex that the match will begin at the start of the string,
\d{0,50} matches 0 - 50 digits,
(?=.+) is a positive look-ahead, that tells the regex that the matching should only start if the line contains some characters in it (as rightly pointed out in the comments!),
(?:\.\d{1,3})? matches an optional dot (.), followed by 1 - 3 digits,
$ tells the regex that whatever it has matched so far will be followed by the end of the string.
Another way: you can check that the string isn't empty and that a dot is always followed by digits by putting a word boundary at a strategic place:
^\d{0,50}\.?\b\d{0,3}$
As you can see, everything in the pattern is optional except the word boundary, which does the magic.
demo
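To compare the three proposals on the listed valid entries and on a few edge cases, here is a small sketch using Python's re module (for illustration only; the dictionary labels, the edge-case list and the helper accepts are all hypothetical):

```python
import re

# The three patterns proposed above, keyed by made-up labels.
PATTERNS = {
    "lookahead_0_3": re.compile(r"^(?=.)\d{0,50}(?:\.\d{0,3})?$"),
    "lookahead_1_3": re.compile(r"^(?=.+)\d{0,50}(?:\.\d{1,3})?$"),
    "word_boundary": re.compile(r"^\d{0,50}\.?\b\d{0,3}$"),
}

VALID = ["1", "1.0", ".1", "1.011", ".011", "1202938.123", "1237923782.0"]
EDGE_CASES = ["", ".", "1.", "1.0111", "1" * 51]

def accepts(label, text):
    """True if the pattern with the given label matches the whole text."""
    return PATTERNS[label].search(text) is not None

for label in PATTERNS:
    rejected_valid = [s for s in VALID if not accepts(label, s)]
    accepted_edge = [s for s in EDGE_CASES if accepts(label, s)]
    print(label, "| rejects valid:", rejected_valid,
          "| accepts edge cases:", accepted_edge)
```

All three accept the valid entries. The variant with \d{0,3} additionally accepts 1. and a lone dot, which may or may not be acceptable depending on the requirements.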

how to use grep in R to get the specified character?

I have
str=c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
and I want to get
"00005.profit" "00006.profit"
How can I achieve this using grep in R?
Here is one way:
R> s <- c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
> unique(gsub("([0-9]+.profit).*", "\\1", s))
[1] "00005.profit" "00006.profit"
R>
We define a regular expression as digits followed by .profit, which we capture by wrapping the expression in parentheses. The \\1 then recalls the first captured group, and since we recall nothing else, that is all we get. The unique() then reduces the four items to two unique ones.
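The same capture-and-recall idea, sketched in Python's re module for illustration (the dot is escaped here so that it matches a literal dot only; the variable names are made up):

```python
import re

s = ["00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in"]

# Keep group 1 (digits followed by ".profit") and drop whatever follows it.
trimmed = [re.sub(r"([0-9]+\.profit).*", r"\1", x) for x in s]

# dict.fromkeys removes duplicates while preserving first-seen order.
unique_items = list(dict.fromkeys(trimmed))
print(unique_items)
```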
Dirk's answer is pretty much the ideal generalisable answer, but here are a couple of other options based on the fact that your example always has a - character starting the part you wish to chop off:
1: gsub to return everything prior to the -
gsub("(.+)-.+","\\1",str)
2: strsplit on - and keep only the first part.
sapply(strsplit(str,"-"),head,1)
Both return:
[1] "00005.profit" "00005.profit" "00006.profit" "00006.profit"
which you can then wrap in unique to not return duplicates like:
unique(gsub("(.+)-.+","\\1",str))
unique(sapply(strsplit(str,"-"),head,1))
These will then return:
[1] "00005.profit" "00006.profit"
Another non-generalisable solution would be to just take the first 12 characters (assuming string length for the part you want to keep doesn't change):
unique(substr(str,1,12))
[1] "00005.profit" "00006.profit"
I'm actually interpreting your question differently. I think you might want
grep("[0-9]+\\.profit$",str,value=TRUE)
That is, if you only want the strings that end with profit. The $ special character stands for "end of string", so it excludes cases that have additional characters at the end ... The \\. means "I really want to match a dot, not any character at all" (a . by itself would match any character). You weren't entirely clear about your target pattern -- you might prefer "0+[1-9]\\.profit$" (any number of zeros followed by a single non-zero digit), or even "0{4}[1-9]\\.profit$" (4 zeros followed by a single non-zero digit).
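A filtering version of this interpretation, again sketched in Python's re module for illustration (names are hypothetical), also shows why escaping the dot matters:

```python
import re

strs = ["00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in"]

# Keep only the strings that end with digits, a literal dot and "profit".
ends_with_profit = re.compile(r"[0-9]+\.profit$")
kept = [x for x in strs if ends_with_profit.search(x)]
print(kept)

# An unescaped dot matches any character, so it is more permissive.
print(bool(re.search(r"[0-9]+.profit$", "00005xprofit")))
print(bool(re.search(r"[0-9]+\.profit$", "00005xprofit")))
```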

Using Regex OR operator to solve 2 conditions

I am trying to combine 2 regular expressions into 1 with the OR operator: |
I have one that checks for match of a letter followed by 8 digits:
Regex.IsMatch(s, "^[A-Z]\d{8}$")
I have another that checks for simply 9 digits:
Regex.IsMatch(s, "^\d{9}$")
Now, Instead of doing:
If Not Regex.IsMatch(s, "^[A-Z]\d{8}$") AndAlso
Not Regex.IsMatch(s, "^\d{9}$") Then
...
End If
I thought I could simply do:
If Not Regex.IsMatch(s, "^[A-Z]\d{8}|\d{9}$") Then
...
End If
Apparently I am not combining the two correctly and apparently I am horrible at regular expressions. Any help would be much appreciated.
And for those wondering, I did take a glance at How to combine 2 conditions and more in regex and I am still scratching my head.
The | operator has the lowest precedence of all regex operators, so in your original regex it splits the whole pattern into two alternatives: ^[A-Z]\d{8} and \d{9}$. You should combine the two regexes with grouping parentheses to make the intended scope explicit, as in:
"^(([A-Z]\d{8})|(\d{9}))$"
How about using ^[A-Z0-9]\d{8}$ ?
I think you want to group the conditions:
Regex.IsMatch(s, "^(([A-Z]\d{8})|(\d{9}))$")
The ^ and $ represent the beginning and end of the line, so you don't want them considered in the or condition. The parens allow you to be explicit about "everything in this paren" or "anything in this other paren".
@MikeC's offering seems the best:
^[A-Z0-9]\d{8}$
...but as to why your expression is not working the way you might expect, you have to understand that the | "or" (alternation) operator has a very low precedence, lower than concatenation, so each alternative extends as far as it can unless a grouping construct limits it. If you use your example:
^[A-Z]\d{8}|\d{9}$
...you're basically saying "match beginning of string, capital letter, then 8 digits OR match 9 digits then end of string" -- if, instead you mean "match beginning of string, then a capital letter followed by 8 digits then the end of string OR the beginning of the string followed by 9 digits, then the end of string", then you want one of these:
^([A-Z]\d{8}|\d{9})$
^[A-Z]\d{8}$|^\d{9}$
Hope this is helpful for your understanding
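The difference between the ungrouped and grouped patterns can be seen on inputs that carry extra text. The sketch below uses Python's re module for illustration; the sample strings are made up.

```python
import re

ungrouped = re.compile(r"^[A-Z]\d{8}|\d{9}$")
grouped = re.compile(r"^([A-Z]\d{8}|\d{9})$")

samples = ["A12345678", "123456789", "A12345678-extra", "extra-123456789"]
for s in samples:
    # The ungrouped pattern accepts trailing text after the first form
    # and leading text before the second form.
    print(s, bool(ungrouped.search(s)), bool(grouped.search(s)))
```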
I find the OR operator a bit weird sometimes as well. What I do is use groups to denote which sections I want to match, so your regex would become something like this: ^(([A-Z]\d{8})|(\d{9}))$

Need help with a regex

Hi, I'm trying to write a regular expression that will take a string and ensure it starts with an 'R' followed by 4 numeric digits and then anything,
e.g. RXXXX.................
Can anybody help me with this? This is for ASP.NET
You want it to be at the beginning of the line, not anywhere. Also, for efficiency, you don't want the .+ or .* at the end because that will match unnecessary characters. So the following regex is what you really want:
^R\d{4}
This should do it...
^R\d{4}.*$
\d{4} matches 4 digits
.* is simply a way to match any character 0 or more times
the beginning ^ and end $ anchors ensure that nothing precedes or follows
As Vincent suggested, for your specific task it could even be simplified to this...
^R\d{4}
Because as you stated, it doesn't really matter what follows.
/^R\d{4}.*/ and set the case insensitive option unless you only want capital R's
^R\d{4}.*
The caret ^ matches the position before the first character in the string.
\d matches any numeric character (it's the same as [0-9])
{4} indicates that there must be exactly 4 numbers, and
.* matches 0 or more other characters
To use:
using System.Text.RegularExpressions;

string input = "R0012 etc..";
// The verbatim string (@"...") keeps the backslash in \d literal.
Match match = Regex.Match(input, @"^R\d{4}.*", RegexOptions.IgnoreCase);
if (match.Success)
{
    // Success!
}
Note the use of RegexOptions.IgnoreCase to ignore the case of the letter R (so it'll also match strings which start with r). Leave this out if you don't want a case-insensitive match.
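For comparison, the same check sketched in Python's re module (an illustration only, not ASP.NET code; the constant name and sample strings are made up):

```python
import re

# re.match only matches at the start of the string, so no "^" is needed,
# and nothing has to consume the rest of the input.
STARTS_WITH_R_AND_4_DIGITS = re.compile(r"R\d{4}", re.IGNORECASE)

for s in ["R0012 etc..", "r9999", "R12", "XR1234"]:
    print(s, bool(STARTS_WITH_R_AND_4_DIGITS.match(s)))
```

Dropping re.IGNORECASE restricts the match to a capital R, mirroring the note about RegexOptions.IgnoreCase.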
