Extract up to two more digits - r

This may be a very simple question but I have not much experience with regex expressions. This page is a good source of regex expressions but could not figure out how to include them into my following code:
data %>% filter(grepl("^A01H1", icl))
Question
I would like to extract the values in one column of my data frame starting with this A01H1 up to 2 more digits, for example A01H100, A01H140, A01H110. I could not find a solution despite my few attempts:
Attempts
I looked at this question from which I used ^A01H1[0-9].{2} to select up tot two more digits.
I tried with adding any character ^A01H1[0-9][0-9][x-y] to stop after two digits.
Any help would be much appreciated :)

You can use "^A01H1\\d{1,2}$".
The first part ("^A01H1"), you figured out yourself, so what are we doing in the second part ("\\d{1,2}$")?
\d includes all digits and is equivalent to [0-9], since we are working in R you need to escape \ and thus we use \\d
{1,2} indicates we want to have 1 or 2 matches of \\d
$ specifies the end of the string, so nothing should come afterwards and this prevents to match more than 2 digits

It looks as if you want to match a part of a string that starts with A01H1, then contains 1 or 2 digits and then is not followed with any digit.
You may use
^A01H1\d{1,2}(?!\d)
See the regex demo. If there can be no text after two digits at all, replace (?!\d) with $.
Details
^ - start of strinmg
A01H1 - literal string
\d{1,2} - one to two digits
(?!\d) - no digit allowed immediately to the right
$ - end of string
In R, you could use it like
grepl("^A01H1\\d{1,2}(?!\\d)", icl, perl=TRUE)
Or, with the string end anchor,
grepl("^A01H1\\d{1,2}$", icl)
Note the perl=TRUE is only necessary when using PCRE specific syntax like (?!\d), a negative lookahead.

Related

R: How to use stringr to extract the substring as the output to mutate a column of strings that begins with a string pattern and end with a number?

I'm creating a small example to be put into mutate(). Not sure why this doesn't work.
> str_extract("rs1234-<b>C</b>","^rs*\\d$")
[1] NA
I'd be great if you can point to my misunderstanding of the language instead of merely providing a solution. I expect to get "rs1234".
The ^rs*\d$ regex matches
^ - start of string
rs* - r and zero or more occurrences of s char
\d - a digit
$ - end of string.
So, your pattern matches strings like rsssss1, r3, etc.
You need
str_extract("rs1234-<b>C</b>", "^rs\\d+")
where ^rs\d+ matches rs at the start of string and then one or more digits. See this regex demo.
But if I just want the substring in between "rs" and the last number. What should I do?
You would use rs.*\d:
str_extract("rs1234-<b>C</b>", "rs.*\\d")
where rs.*\d matches rs, then any zero or more chars other than line break chars as many as possible and then a digit.
NOTE: If you need to match line endings, too, you need to prepend the last pattern with (?s) inline DOTALL modifier.
See this regex demo.

How to match more than one ending character? [duplicate]

I try to find a regex that matches the string only if the string does not end with at least three '0' or more. Intuitively, I tried:
.*[^0]{3,}$
But this does not match when there one or two zeroes at the end of the string.
If you have to do it without lookbehind assertions (i. e. in JavaScript):
^(?:.{0,2}|.*(?!000).{3})$
Otherwise, use hsz's answer.
Explanation:
^ # Start of string
(?: # Either match...
.{0,2} # a string of up to two characters
| # or
.* # any string
(?!000) # (unless followed by three zeroes)
.{3} # followed by three characters
) # End of alternation
$ # End of string
You can try using a negative look-behind, i.e.:
(?<!000)$
Tests:
Test Target String Matches
1 654153640 Yes
2 5646549800 Yes
3 848461158000 No
4 84681840000 No
5 35450008748 Yes
Please keep in mind that negative look-behinds aren't supported in every language, however.
What wrong with the no-look-behind, more general-purpose ^(.(?!.*0{3,}$))*$?
The general pattern is ^(.(?!.* + not-ending-with-pattern + $))*$. You don't have to reverse engineer the state machine like Tim's answer does; you just insert the pattern you don't want to match at the end.
This is one of those things that RegExes aren't that great at, because the string isn't very regular (whatever that means). The only way I could come up with was to give it every possibility.
.*[^0]..$|.*.[^0].$|.*..[^0]$
which simplifies to
.*([^0]|[^0].|[^0]..)$
That's fine if you only want strings not ending in three 0s, but strings not ending in ten 0s would be long. But thankfully, this string is a bit more regular than some of these sorts of combinations, and you can simplify it further.
.*[^0].{0,2}$

REGEX pattern match in R for Course number

I need to identify matching course number that have xx.3xxxxxx.
These are some examples of the course numbers.
26.3730004
27.0210000
26.3730009
26.7114001
23.9610071
26.0A34430
23.3670005
26.0B05430
I tried many patterns one example I used is the pattern below. It did not get any match.
"[^0-9]{2}\Q.\E3[^0-9]+$"
I tried using grep and grepl. I actually need the code to return indexes.
This code shows my attempt to tag the rows that have matches.
Teacher$virtual[
which(
grepl("[^0-9]{2}\\Q.\\E3[^0-9]+$",Teacher$CourseNumber))]
<- "1"
I need to remove any row from my dataframe that have the course number with that pattern. XX.3XXXXXX
But, my code did not find any match. Can you please help me?
You should use
grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)
See the regex graph:
Details:
^ - start of a string
[0-9]{2} - two digits
\\. - a dot (note that a regex escape is a literal backslash, but inside a string literal, "...", a single backslash is used to form string escape sequences, hence the backslash must be double to obtain a literal backslash char necessary for a regex escape)
3 - a 3 char.
NOTE: If you want to use in-pattern quoting with \Q and \E (in between which all chars are treated literally) you need to use PCRE regex, add perl=TRUE and use
grepl("^[0-9]{2}\\Q.\\E3", Teacher$CourseNumber, perl=TRUE)
Now, the dot is treated as a literal dot, not a . metacharacter that matches any char but a line break char (in a PCRE regex, . does not match line break chars by default).
Here, this simple expression would likely cover that:
^[0-9]{2}\.[3].+$
which has a [3] boundary right after the .. It would probably work without start and end anchors:
[0-9]{2}\.[3].+
Demo
We can add or reduce the boundaries, if it'd be necessary.

Need some help building a somewhat simple REGEX expression

I'm trying to build a somewhat REGEX expression of the of only numbers including decimal with a maximum of 3 numbers to the right of the decimal (thousandths) and 50 to the left. Valid entries would like something like these.
1
1.0
.1
1.011
.011
1202938.123
1237923782.0
So far I have ^([0-9]*|\d*\.\d{1}?\d*){1,999}$.. Any help appreciated. Thanks.
I believe this should suffice:
^(?=.)\d{0,50}(?:\.\d{0,3})?$
See the regex demo. Note this will also match 1., if this is undesired change \d{0,3} to \d{1,3}. Similarely, this regex will match .5 (with no integer part), if you dont want this then use \d{1,50} instead of \d{0,50}.
You could try:
^(?=.+)\d{0,50}(?:\.\d{1,3})?$
Demonstration here at regex101.com
Explanation -
^ tells the regex that the match will begin at the start of the string,
\d{0, 50} matches 0 - 50 digits,
(?=.+) is a positive look-ahead, that tells the regex that the matching should only start if the line contains some characters in it (as rightly pointed out in the comments!),
(?:\.\d{1,3})? matches an optional dot (.), followed by 1 - 3 digits,
$ tells the regex that whatever it has matched so far will be followed by the end of the string.
Other way: You can check if the string isn't empty and if the dot is always followed by digits, putting a word-boundary at a strategic place:
^\d{0,50}\.?\b\d{0,3}$
As you can see, all is optional in the pattern except the word-boundary that does the magic.
demo

Need help with a regex

Hi I'm trying to right a regular expression that will take a string and ensure it starts with an 'R' and is followed by 4 numeric digits then anything
eg. RXXXX.................
Can anybody help me with this? This is for ASP.NET
You want it to be at the beginning of the line, not anywhere. Also, for efficiency, you dont want the .+ or .* at the end because that will match unnecessary characters. So the following regex is what you really want:
^R\d{4}
This should do it...
^R\d{4}.*$
\d{4} matches 4 digits
.* is simply a way to match any character 0 or more times
the beginning ^ and end $ anchors ensure that nothing precedes or follows
As Vincent suggested, for your specific task it could even be simplified to this...
^R\d{4}
Because as you stated, it doesn't really matter what follows.
/^R\d{4}.*/ and set the case insensitive option unless you only want capital R's
^R\d{4}.*
The caret ^ matches the position before the first character in the string.
\d matches any numeric character (it's the same as [0-9])
{4} indicates that there must be exactly 4 numbers, and
.* matches 0 or more other characters
To use:
string input = "R0012 etc..";
Match match = Regex.Match(input, #"^R\d{4}.*", RexOptions.IgnoreCase);
if (match.Success)
{
// Success!
}
Note the use of RexOptions.IgnoreCase to ignore the case of the letter R (so it'll match strings which start with r. Leave this out if you don't want to undertake a case insensitive match.

Resources