Regex force length of specific regex [closed] - r

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I'm using R and need a regex for
a block of N characters starting with zero or more
whitespaces and continuing with one or more digits afterwards
For N = 9 here are
examples of valid strings
123456789
kfasdf  3456789asdf
a        1
and examples of invalid strings
12345 789
1 9
a 678a

Another option is to match 8 times either a digit OR a space not preceded by a digit and then match a digit at the end.
(?<![\d\h])(?>\d|(?<!\d)\h){8}\d
In parts
(?<![\d\h]) Negative lookbehind, assert what is on the left is not a horizontal whitespace char or digit
(?> Atomic group (no backtracking)
\d Match a digit
| Or
(?<!\d)\h Match a horizontal whitespace char, asserting that it is not preceded by a digit
){8} Close the group and repeat 8 times
\d Match the last digit
Example code, using perl=TRUE
x <- "123456789
kfasdf  3456789asdf
a        1
12345 789
1 9
a 678a"
regmatches(x, gregexpr("(?<![\\d\\h])(?>\\d|(?<!\\d)\\h){8}\\d", x, perl=TRUE))
Output
[[1]]
[1] "123456789" "  3456789" "        1"
If no digit may follow the 9th character, you could end the pattern with a negative lookahead asserting that the next character is not a digit.
(?<![\d\h])(?>\d|(?<!\d)\h){8}\d(?!\d)
If there cannot be a digit on either side of the match:
(?<!\d)(?>\d|(?<!\d)\h){8}\d(?!\d)
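As a rough illustration of what the trailing lookahead changes, the sketch below compares the base pattern with the (?!\d) variant. The test vector x and the names base and no_tail are made up for this example, and strrep() is used only to make the number of spaces explicit.

```r
# Base pattern from above and the variant that forbids a trailing digit
base    <- "(?<![\\d\\h])(?>\\d|(?<!\\d)\\h){8}\\d"
no_tail <- paste0(base, "(?!\\d)")

# Hypothetical inputs: 9 digits, 10 digits, and 2 spaces + 7 digits
x <- c("123456789", "1234567890", paste0("ab", strrep(" ", 2), "3456789"))

res_base    <- grepl(base, x, perl = TRUE)
res_no_tail <- grepl(no_tail, x, perl = TRUE)
res_base     # TRUE TRUE TRUE
res_no_tail  # TRUE FALSE TRUE
```

The 10-digit string still matches the base pattern (its first 9 digits form a block), but not the variant, because a digit follows the 9th character.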

Using string s from @d.b's answer.
Extract optional whitespace followed by numbers.
library(stringr)
str_extract(s, '(\\s+)?\\d+')
#[1] "123456789" "  3456789" "        1" "12345" "1" " 678"
Check their length using nchar.
nchar(str_extract(s, '(\\s+)?\\d+')) == 9
#[1] TRUE TRUE TRUE FALSE FALSE FALSE
Using the same logic in base R function.
nchar(regmatches(s, regexpr('(\\s+)?\\d+', s))) == 9
#[1] TRUE TRUE TRUE FALSE FALSE FALSE
If there could be multiple such instances we can use str_extract_all :
sapply(str_extract_all(s, '(\\s+)?\\d+'), function(x) any(nchar(x) == 9))
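The same "any run of length 9" check can probably be done without stringr. This is a minimal base R sketch using gregexpr() and regmatches(); the input vector s below is a shortened, hypothetical version of the test data, with strrep() making the two spaces explicit.

```r
# Hypothetical input; strrep() makes the two spaces explicit
s <- c("123456789",
       paste0("kfasdf", strrep(" ", 2), "3456789asdf"),
       "12345 789")

# All runs of optional whitespace followed by digits, one list entry per string
hits <- regmatches(s, gregexpr("(\\s+)?\\d+", s))

# TRUE if any run in a string is exactly 9 characters long
ok <- vapply(hits, function(m) any(nchar(m) == 9), logical(1))
ok  # TRUE TRUE FALSE
```

vapply() is used here instead of sapply() only so that the result is guaranteed to be a logical vector.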

The desired substring contains 9 digits or fewer than 9 digits. In the second case it begins with a space, ends with a digit and each of the 7 characters in between is a space preceded by a space or a digit followed by a digit. We therefore could use the following regular expression.
\d{9}|\s(?:(?<=\s)\s|\d(?=\d)){7}\d
The regex engine performs the following operations.
\d{9} : match 9 digits
| : or
\s : match a space
(?: : begin non-capture group
(?<=\s) : next character must be preceded by a space
\s : match a space
| : or
\d : match a digit
(?=\d) : next character must be a digit
) : end non-capture group
{7} : execute non-capture group 7 times
\d : match a digit
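To try this pattern from R, something like the following sketch should work. It needs perl=TRUE because of the lookarounds; the test vector x and the variable names are assumptions made for the example.

```r
# The alternation from above, with backslashes doubled for an R string
pat <- "\\d{9}|\\s(?:(?<=\\s)\\s|\\d(?=\\d)){7}\\d"

# Hypothetical inputs: 9 digits, 2 spaces + 7 digits, digits split by a space
x <- c("123456789", paste0("ab", strrep(" ", 2), "3456789"), "12345 789")

found <- grepl(pat, x, perl = TRUE)
found  # TRUE TRUE FALSE

# Extract the matched block from the second string
block <- regmatches(x[2], regexpr(pat, x[2], perl = TRUE))
nchar(block)  # 9
```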

Basic form: space bias
This is a basic form that has no anchors or boundaries.
(?:[ ]|\d(?![ ])){8}\d
Features:
block of 9
minimum block size of 2
match takes maximum spaces vs minimal digits
Basic form: number bias
same basic form that has been modified to get number bias.
(?=((?:[ ]|\d(?![ ])){8}\d(?!\d)|\d{9}))\1
Features:
block of 9
minimum block size of 2
match takes minimal spaces vs maximum digits
End of line Anchor method (numeric bias) :
(?=[ ]{0,8}?\d{1,9}(.*)$)[ \d]{9}(?=\1$)
Features:
block of 9
minimum block size of 2
match takes minimal spaces vs maximum digits
single capture is not part of match
line orientated regex, needs multi-line option if string is more than 1 line

Add a comma before the spaces
split at the comma
keep only either space or digits
Count number of characters and see if it matches the required size
s = c("123456789", "kfasdf  3456789asdf",
"a        1", "12345 789", "1 9",
"a 678a")
sapply(strsplit(gsub("(\\s+)", ",\\1", s), ","), function(x) {
any(nchar(gsub("[A-Za-z]", "", x)) == 9)
})
#[1] TRUE TRUE TRUE FALSE FALSE FALSE

You may use the regex pattern
[ \d](?:(?<=[ ])[ ]|\d){7}\d
and in R use
str_extract(x, regex('[ \\d](?:(?<=[ ])[ ]|\\d){7}\\d'))
See this demo.
Please note that in the above regex pattern the [ ] may be replaced by a simple space character. Using [ ] is a common practice to increase readability.

If you are looking for a clean regex solution, then you should use the following pattern:
(?=[ \d]{9}(.*$))[ ]*\d+\1$
...where you combine a positive lookahead with a regular matching that includes a match from the lookahead.
The R syntax is then
str_extract(x, regex('(?=[ \\d]{9}(.*$))[ ]*\\d+\\1$'))
and you can test this code here.
If your desire is also to catch a matching N-character long substring, then use
str_match(x, regex('(?=[ \\d]{9}(.*$))([ ]*\\d+)\\1$')) [,3]
as shown in this demo.

That's not an easy task for regexp-s. You really should consider parsing the string yourself. At least partially. Because you need the lengths of capturing groups and regexp-s do not have this feature.
But if you really want to use them, then there's a workaround:
I'll use JS so that the code can be run right here.
const re = /^(.*)(\s*\d+)(.*)$(?<=\1.{9}\3)/
console.log(re.test("123456789"))
console.log(re.test("kfasdf 3456789asdf"))
console.log(re.test("a 1"))
console.log(re.test("12345 789"))
console.log(re.test("1 9" ))
console.log(re.test("a 678a"))
where
\s*\d+ meets your base condition of zero or more spaces followed by one or more digits
we can't get groups' lengths, but we can get everything before and after the main group. That is what ^(.*) and (.*)$ are for.
Now we need to check that all three groups add up to a full string, for that we use look behind assertion (?<=\1.{9}\3) and we set the desired N for a number of symbols allowed in the main group (9 in this case)
You didn't mention how the regexp should behave in all situations, for example in this one:
" 3456780000000"
with extra spaces and extra digits.
So I won't try to guess. But it's easy to fix the regexp I've provided for all your cases.
Update:
I think Edward's original answer is the best for you (look in the history), but I'm not sure about the boundary constraints. They are not clear from your question.
But I'll still leave mine because, while Edward's answer is shortest and fastest for your specific case, mine is more general and better suits the title of the question.
And I added performance tests:
const chars = Array(1000000)
const half_len = chars.length/2
chars.fill("a", 0, half_len)
chars.fill("1", half_len, half_len + 9)
chars.fill("a", half_len + 9)
const str = chars.join("")
function test(name, re) {
console.log(name)
console.time(re.toString())
const res = re.test(str)
console.timeEnd(re.toString())
console.log("res",res)
}
test("Edward's original", /((?<!\d)\s|\d){9}(?<=\d)/)
test("Ωmega's" , /(?=[ \d]{9}(.*$))[ ]*\d+\1$/)
test("Edward's modified", /(?=[ ]{0,8}?\d{1,9}(.*))[ \d]{9}(?=\1$)/)
test("mine" , /^(.*)(\s*\d+)(.*)$(?<=\1.{9}\3)/)
Surely lookbehinds are not cheap!

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
in R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, (.) captures any char into consecutively numbered groups and the (?!...) negative lookaheads make sure each subsequent . does not match the already captured char(s).
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that the stringr::str_match (as stringr::str_extract) takes the input as the first argument and the regex as the second argument.
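Since the question asks for the position rather than the substring, one possible base R follow-up is to take the match start from regexpr() and add the match length. This is only a sketch; the names pat, m and end_pos are introduced here for illustration.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pat <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the 1-based start of the first match,
# with the match length stored as an attribute
m <- regexpr(pat, txt, perl = TRUE)

# Position of the last character of the first block of 4 distinct characters
end_pos <- as.integer(m) + attr(m, "match.length") - 1L
end_pos  # 5
```

If there is no match, regexpr() returns -1, so that case would need to be checked before using end_pos.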

Regular expression weird result [duplicate]

This question already has answers here:
Multiple overlapping regex matches instead of one
(2 answers)
Biostrings gregexpr2 gives errors while gregexpr works fine
(1 answer)
Closed 3 years ago.
Code
gsub('101', '111', '110101101')
#[1] "111101111"
Would anyone know why the second 0 in the input isn't being substituted into a 1 in the output?
I'm looking for the pattern 101 in string and replace it with string 111. Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
You could use a lookahead ?=
The way this works is q(?=u) matches a q that is followed by a u, without making the u part of the match.
Example:
gsub('10(?=1)', '11', '110101101', perl=TRUE)
# [1] "111111111"
Edit: you need to use gsub in perl mode to use lookaheads
It's because gsub doesn't work recursively.
gsub('101', '111', '110101101') consumes the input string as it finds matches. It finds the first 101 and is left with 01101, which it then scans for further matches. Think about it: if it replaced "recursively", something like gsub('11', '111', '11') would produce an infinite string of '1' and break. It doesn't re-check the text it has already replaced.
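A shorter input makes the non-overlapping behaviour easier to see. The sketch below contrasts the plain pattern with the lookahead pattern '10(?=1)'; the input "10101" and the variable names are chosen just for this illustration.

```r
# Plain pattern: the first match consumes "101", leaving only "01" to scan
plain <- gsub("101", "111", "10101")
plain      # "11101"

# Lookahead: the trailing 1 is checked but not consumed,
# so it can start the next match
lookahead <- gsub("10(?=1)", "11", "10101", perl = TRUE)
lookahead  # "11111"
```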
It is because once R has matched the first 101 in 110101101, it treats the next 0 as part of 011: the 1 before it has already been consumed by the first match.
It seems that you only want to replace '0' by '1'. Then you can just use gsub('0', '1', '110101101')
Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
Hopefully, R provides a means to generate the replacement string based on the matched substring. (This is a common feature.)
If so, search for 10+, and have the replacement string generator create a string consisting of a number of 1 characters equal to the length of the match. (e.g. If 100 is matched, replace with 111. If 1000 is matched, replace with 1111. etc.)
I don't know R in the least. Here's how it's done in some other languages in case that helps:
Perl:
$s =~ s{10+}{ "1" x length($&) }ger
Python:
re.sub(r'10+', lambda match: '1' * len(match.group()), s)
JavaScript:
s.replace(/10+/g, function(match) { return '1'.repeat(match.length) })
JavaScript (ES6):
s.replace(/10+/g, match => '1'.repeat(match.length))
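For what it's worth, base R can likely do the same thing with gregexpr() and the replacement form of regmatches(). This is a sketch under the same assumption as the snippets above (the pattern 10+, each match replaced by 1s of equal length); the input string is just an example.

```r
s <- "11010110001"

# Locate every run of a 1 followed by one or more 0s
m <- gregexpr("10+", s)

# Replace each match with a string of 1s of the same length
regmatches(s, m) <- lapply(regmatches(s, m),
                           function(z) strrep("1", nchar(z)))
s  # "11111111111"
```

Note that 10+ also rewrites a trailing run of zeros (for example "110" becomes "111"), which may or may not be wanted.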
According to the OP
Later on I wish to turn longer sub-sequences into sequences of 1's,
such as 10001 to 11111.
If I understand correctly, the final goal is to replace any sub-sequence of consecutive 0 into the same number of 1 if they are surrounded by a 1 on both sides.
In R, this can be achieved by the str_replace_all() function from the stringr package. For demonstration and testing, the input vector contains some edge cases where substrings of 0 are not surrounded by 1.
input <- c("110101101",
"11010110001",
"110-01101",
"11010110000",
"00010110001")
library(stringr)
str_replace_all(input, "(?<=1)0+(?=1)", function(x) str_dup("1", str_length(x)))
[1] "111111111" "11111111111" "110-01111" "11111110000" "00011111111"
The regex "(?<=1)0+(?=1)" uses look behind (?<=1) as well as look ahead (?=1) to ensure that the subsequence 0+ to replace is surrounded by 1. Thus, leading and trailing subsequences of 0 are not replaced.
The replacement is computed by a function that returns a subsequence of 1 of the same length as the subsequence of 0 to replace.

Matching character followed by exactly 1 digit

I need to align formatting of some clinical trial IDs to merge two databases. For example, in database A patient 123 visit 1 is stored as '123v01' and in database B just '123v1'
I can match A to B by grep match those containing 'v0' and strip out the trailing zero to just 'v', but for academic interest & expanding R / regex skills, I want to reverse match B to A by matching only those containing 'v' followed by only 1 digit, so I can then separately pad that digit with a leading zero.
For a reprex:
string <- c("123v1", "123v01", "123v001")
I can match those with >= 2 digits following a 'v', then inverse subset
> idx <- grepl("v(\\d{2})", string)
> string[!idx]
[1] "123v1"
But there must be a way to match 'v' followed by just a single digit only? I have tried the lookarounds
# Negative look ahead "v not followed by 2+ digits"
grepl("v(?!\\d{2})", string)
# Positive look behind "single digit following v"
grepl("(?<=v)\\d{1})", string)
But both return an 'invalid regex' error
Any suggestions?
You need to set the perl=TRUE flag on your grepl function.
e.g.
grepl("v(?!\\d{2})", string, perl=TRUE)
[1] TRUE FALSE FALSE
See this question for more info.
You may use
grepl("v\\d(?!\\d)", string, perl=TRUE)
The v\d(?!\d) pattern matches v, one digit and then makes sure there is no digit immediately to the right of the current location (i.e. after the v + 1 digit).
See the regex demo.
Note that you need to enable PCRE regex flavor with the perl=TRUE argument.
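Since the stated goal is to pad the single digit with a leading zero, the match and the padding can probably be combined in one sub() call by capturing the digit. A minimal sketch, using the reprex vector from the question (the name padded is made up here):

```r
string <- c("123v1", "123v01", "123v001")

# Capture the single digit after 'v' and re-insert it behind a '0';
# strings where 'v' is followed by two or more digits are left alone
padded <- sub("v(\\d)(?!\\d)", "v0\\1", string, perl = TRUE)
padded  # "123v01" "123v01" "123v001"
```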

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use the stringr library, from which the str_replace you are using comes, just use str_extract
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches at least 1 to a maximum of 3 digits occurring right in the beginning of the phrase. str_extract, as the name implies, extracts and returns these matches.
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
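The question also mentions keeping the first two digits of a 5-digit string. Assuming the same capture-group approach carries over, the two rules could be applied one after the other, as in this sketch (step1 and step2 are illustrative names; the vector is the one from the question):

```r
new <- c("111", "1234567891", "12", "12345")

# Rule 1: a string of exactly 10 digits keeps its first 3 digits
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)
step1  # "111" "123" "12" "12345"

# Rule 2: a string of exactly 5 digits keeps its first 2 digits
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)
step2  # "111" "123" "12" "12"
```

Because both patterns are anchored with ^ and $, strings of any other length pass through unchanged.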
You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

R Regex for matching comma separated sections in a column/vector

The original title for this question was: R Regex for word boundary excluding space. It reflected the manner in which I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell(row) is a comma separated list of items. I want to find the rows that have a specific item.
df<-data.frame( nms= c("XXXCAP,XXX CAPITAL LIMITED" , "XXX,XXX POLYMERS LIMITED, 3455" , "YYY,XXX REP LIMITED,999,XXX" ),
b = c('A', 'X', "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because XXX in row 1 is separated by spaces in each side, I am having trouble filtering it out with \\b or [[:<:]]
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this of course is strsplit() but I'd like to avoid it. Any suggestions on performance are welcome.
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you want to only match a word in between commas (or at the start/end of the string).
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
See the regex demo (the expression is "converted" to use positive lookarounds due to the fact it is a demo with a single multiline string).
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)).
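To reuse the PCRE pattern for arbitrary items, one option is to wrap it in a small helper. has_item() below is a hypothetical function, not part of any package; it quotes the item with \Q...\E so that regex metacharacters in it are taken literally (this would break if the item itself contained \E).

```r
# Hypothetical helper: TRUE where `item` appears as a whole
# comma-delimited entry in x
has_item <- function(x, item) {
  pat <- paste0("(?<![^,])\\Q", item, "\\E(?![^,])")
  grepl(pat, x, perl = TRUE)
}

nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

r_xxx <- has_item(nms, "XXX")
r_999 <- has_item(nms, "999")
r_xxx  # FALSE TRUE TRUE
r_999  # FALSE FALSE TRUE
```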
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).
