Replacing repeated groups of characters using regex - r

In R, I have a string that contains repeated groups of characters:
testString <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
I'm trying to use a gsub regex to replace repeated groups of characters within each word to produce the following output:
"Hi hi missing u lol halol sillybilly haaaaa!"
I've tried the following line but it isn't producing the right output:
gsub("[[:blank:]](.+?){2,}[[blank]]\\1",
replacement="\\1", testString, perl=TRUE)
What have I done wrong?

You may match repeated consecutive word chars and skip them, and then handle all other repeated consecutive chars with a solution like
x <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+", "\\2", x, perl=TRUE)
See the regex demo and an online R demo
Details:
(\\w)\\1+(*SKIP)(*F) - match and capture a word char (with (\\w); this can be adjusted), then 1+ occurrences of the same char (with \\1+); the matched text is then discarded and the engine resumes searching after the end of that match (the PCRE (*SKIP)(*FAIL) verb sequence)
| - or
(\\w+?)\\2+ - 1 or more word chars, as few as possible, are captured into Group 2 (with (\\w+?)) and then 1+ occurrences of the same value are matched (with \\2+).
The replacement is just the Group 2 value.
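As a quick check of the pattern above, here is a minimal sketch that wraps the call in a helper and runs it on the string from the question. The helper name collapse_repeats is made up for illustration; everything else comes from the answer.

```r
# Hypothetical helper: collapse repeated multi-character groups inside words,
# while leaving runs of one repeated character (e.g. "aaaaa") untouched.
collapse_repeats <- function(x) {
  gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+", "\\2", x, perl = TRUE)
}

test_string <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
result <- collapse_repeats(test_string)
print(result)
# [1] "Hi hi missing u lol halol sillybilly haaaaa!"
```

The first alternative is what protects "ss" in "missing", "ll" in "sillybilly" and "aaaaa" in "haaaaa": those runs are matched and then deliberately failed, so the second alternative never gets to shorten them.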

Related

regex to find the position of the first four concurrent unique values

I've solved Advent of Code 2022 day 6, but was wondering whether there is a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
In R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck.
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any char into consecutively numbered groups, and the (?!...) negative lookaheads make sure each subsequent . does not match any of the already captured chars.
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input string as the first argument and the regex as the second.
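The question asks for a position rather than the matched text. A minimal base R sketch, assuming the same pattern is run through regexpr with perl = TRUE, could derive it as follows; the variable names are illustrative only.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the 1-based start of the first match (-1 if there is none)
start <- as.integer(regexpr(pattern, txt, perl = TRUE))

# The block is 4 characters long, so it ends 3 positions after its start
marker_end <- if (start > 0) start + 3L else NA_integer_
print(marker_end)
# [1] 5
```

The match "vwbj" starts at position 2, so the block ends at the 5th character, which agrees with the walkthrough in the question.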

Regex force length of specific regex [closed]

I'm using R and need a regex for
a block of N characters starting with zero or more
whitespaces and continuing with one or more digits afterwards
For N = 9 here are
examples of valid strings
123456789
kfasdf  3456789asdf
a        1
and examples of invalid strings
12345 789
1 9
a 678a
Another option is to match 8 times either a digit OR a space not preceded by a digit and then match a digit at the end.
(?<![\d\h])(?>\d|(?<!\d)\h){8}\d
In parts
(?<![\d\h]) Negative lookbehind, assert what is on the left is not a horizontal whitespace char or digit
(?> Atomic group (no backtracking)
\d Match a digit
| Or
(?<!\d)\h Match a horizontal whitespace char, asserting that it is not preceded by a digit
){8} Close the group and repeat 8 times
\d Match the last digit
Regex demo | R demo
Example code, using perl=TRUE
x <- "123456789
kfasdf  3456789asdf
a        1
12345 789
1 9
a 678a"
regmatches(x, gregexpr("(?<![\\d\\h])(?>\\d|(?<!\\d)\\h){8}\\d", x, perl=TRUE))
Output
[[1]]
[1] "123456789" "  3456789" "        1"
If no digit may follow the final (9th) digit of the block, end the pattern with a negative lookahead that asserts the next character is not a digit.
(?<![\d\h])(?>\d|(?<!\d)\h){8}\d(?!\d)
Regex demo
If no digit may appear on either side of the block:
(?<!\d)(?>\d|(?<!\d)\h){8}\d(?!\d)
Regex demo
Using string s from @d.b's answer.
Extract optional whitespace followed by numbers.
library(stringr)
str_extract(s, '(\\s+)?\\d+')
#[1] "123456789" "  3456789" "        1" "12345" "1" " 678"
Check their length using nchar.
nchar(str_extract(s, '(\\s+)?\\d+')) == 9
#[1] TRUE TRUE TRUE FALSE FALSE FALSE
Using the same logic in base R function.
nchar(regmatches(s, regexpr('(\\s+)?\\d+', s))) == 9
#[1] TRUE TRUE TRUE FALSE FALSE FALSE
If there could be multiple such instances, we can use str_extract_all:
sapply(str_extract_all(s, '(\\s+)?\\d+'), function(x) any(nchar(x) == 9))
The desired substring contains 9 digits or fewer than 9 digits. In the second case it begins with a space, ends with a digit and each of the 7 characters in between is a space preceded by a space or a digit followed by a digit. We therefore could use the following regular expression.
\d{9}|\s(?:(?<=\s)\s|\d(?=\d)){7}\d
Demo
The regex engine performs the following operations.
\d{9} : match 9 digits
| : or
\s : match a space
(?: : begin non-capture group
(?<=\s) : next character must be preceded by a space
\s : match a space
| : or
\d : match a digit
(?=\d) : next character must be a digit
) : end non-capture group
{7} : execute non-capture group 7 times
\d : match a digit
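The operations listed above can be checked in base R with perl = TRUE. This is a sketch under stated assumptions: the vector s is rebuilt here with strrep() so the runs of spaces are explicit, and the run lengths for the two valid inputs (2 and 8) are assumed so that each block totals 9 characters.

```r
# Inputs rebuilt with strrep() so the number of spaces is explicit (assumed values)
s <- c(
  "123456789",
  paste0("kfasdf", strrep(" ", 2), "3456789asdf"),  # 2 spaces + 7 digits
  paste0("a", strrep(" ", 8), "1"),                 # 8 spaces + 1 digit
  "12345 789",                                      # digits before the space
  "1 9",
  "a 678a"
)

pattern <- "\\d{9}|\\s(?:(?<=\\s)\\s|\\d(?=\\d)){7}\\d"

# Which strings contain a valid 9-character block?
has_block <- grepl(pattern, s, perl = TRUE)
print(has_block)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

# Extract the blocks from the strings that contain one and check their length
blocks <- regmatches(s, regexpr(pattern, s, perl = TRUE))
print(nchar(blocks))
# [1] 9 9 9
```

Note that regmatches() combined with regexpr() silently drops the non-matching strings, which is why only three lengths are printed.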
Basic form: space bias
This is a basic form that has no anchors or boundaries.
(?:[ ]|\d(?![ ])){8}\d
dem0
feature:
block of 9
minimum block size of 2
match takes maximum spaces vs minimal digits
Basic form: number bias
same basic form that has been modified to get number bias.
(?=((?:[ ]|\d(?![ ])){8}\d(?!\d)|\d{9}))\1
dem1
feature:
block of 9
minimum block size of 2
match takes minimal spaces vs maximum digits
End of line Anchor method (numeric bias) :
(?=[ ]{0,8}?\d{1,9}(.*)$)[ \d]{9}(?=\1$)
dem2
feature:
block of 9
minimum block size of 2
match takes minimal spaces vs maximum digits
single capture is not part of match
line orientated regex, needs multi-line option if string is more than 1 line
Add a comma before the spaces
split at the comma
keep only either space or digits
Count number of characters and see if it matches the required size
s = c("123456789", "kfasdf  3456789asdf",
"a        1", "12345 789", "1 9",
"a 678a")
sapply(strsplit(gsub("(\\s+)", ",\\1", s), ","), function(x) {
any(nchar(gsub("[A-Za-z]", "", x)) == 9)
})
#[1] TRUE TRUE TRUE FALSE FALSE FALSE
You may use the regex pattern
[ \d](?:(?<=[ ])[ ]|\d){7}\d
and in R use
str_extract(x, regex('[ \\d](?:(?<=[ ])[ ]|\\d){7}\\d'))
See this demo.
Please note that in the above regex pattern the [ ] may be replaced by a simple space character. Using [ ] is a common practice to increase readability.
If you are looking for a clean regex solution, then you should use the following pattern:
(?=[ \d]{9}(.*$))[ ]*\d+\1$
...where you combine a positive lookahead with a regular matching that includes a match from the lookahead.
The R syntax is then
str_extract(x, regex('(?=[ \\d]{9}(.*$))[ ]*\\d+\\1$'))
and you can test this code here.
If you also want to capture the matching N-character substring, use
str_match(x, regex('(?=[ \\d]{9}(.*$))([ ]*\\d+)\\1$'))[, 3]
as shown in this demo.
That's not an easy task for regexp-s. You really should consider parsing the string yourself. At least partially. Because you need the lengths of capturing groups and regexp-s do not have this feature.
But if you really want to use them, then there's a workaround:
I'll use JS so that the code can be run right here.
const re = /^(.*)(\s*\d+)(.*)$(?<=\1.{9}\3)/
console.log(re.test("123456789"))
console.log(re.test("kfasdf 3456789asdf"))
console.log(re.test("a 1"))
console.log(re.test("12345 789"))
console.log(re.test("1 9" ))
console.log(re.test("a 678a"))
where
\s*\d+ meets your base condition of zero or more spaces followed by one or more digits
we can't get groups' lengths, but we can get everything before and after the main group. That is what ^(.*) and (.*)$ are for.
Now we need to check that all three groups add up to a full string, for that we use look behind assertion (?<=\1.{9}\3) and we set the desired N for a number of symbols allowed in the main group (9 in this case)
You didn't mention how the regexp should behave in all situations, for example in this one:
" 3456780000000"
with extra spaces and extra digits.
So I won't try to guess. But it's easy to fix the regexp I've provided for all your cases.
Update:
I think Edward's original answer is the best for you (look in the history). But I'm not sure about the boundary constraints. They are not clear from your question.
But I'll still leave mine because, while Edward's answer is shortest and fastest for your specific case, mine is more general and better suits the title of the question.
And I added performance tests:
const chars = Array(1000000)
const half_len = chars.length/2
chars.fill("a", 0, half_len)
chars.fill("1", half_len, half_len + 9)
chars.fill("a", half_len + 9)
const str = chars.join("")
function test(name, re) {
console.log(name)
console.time(re.toString())
const res = re.test(str)
console.timeEnd(re.toString())
console.log("res",res)
}
test("Edward's original", /((?<!\d)\s|\d){9}(?<=\d)/)
test("Ωmega's" , /(?=[ \d]{9}(.*$))[ ]*\d+\1$/)
test("Edward's modified", /(?=[ ]{0,8}?\d{1,9}(.*))[ \d]{9}(?=\1$)/)
test("mine" , /^(.*)(\s*\d+)(.*)$(?<=\1.{9}\3)/)
Surely lookbehinds are not cheap!

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use the stringr package, which provides the str_replace you are already using, just use str_extract:
library(stringr)
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches between 1 and 3 digits at the very start of the string. str_extract, as the name implies, extracts and returns these matches.
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
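The question also mentions a second pass that keeps the first two digits of 5-digit strings. A minimal sketch of chaining the same technique, assuming the 5-digit rule is applied after the 10-digit rule (the names step1 and step2 are illustrative):

```r
new <- c("111", "1234567891", "12", "12345")

# Step 1: a string of exactly 10 digits is reduced to its first 3 digits
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)
print(step1)
# [1] "111"   "123"   "12"    "12345"

# Step 2: a string of exactly 5 digits is reduced to its first 2 digits
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)
print(step2)
# [1] "111" "123" "12"  "12"
```

Because both patterns are anchored with ^ and $, strings of any other length pass through unchanged, so the order of the two steps does not matter for this input.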
You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

Replace matched string with its sub-groups

I have some DNA sequences to process, they look like:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
I coded a regex in order to detect the title line of each sequence:
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
What function should I use to replace this whole match with its group3 + group4? In addition, I've got 72 matches in one txt file, how can I replace them in one run?
Your current regex won't work for lines where Group 3 or 4 contains a single letter word because [a-zA-Z]+\\-?[a-zA-Z]+ matches 1+ letters, then an optional hyphen, and then again 1+ letters (that means, there must be at least 2 letters). With [a-zA-Z]+(?:-[a-zA-Z]+)?, you can match 1+ letters followed with an optional sequence of - and then 1+ letters.
Also, \s also matches line breaks, and if your title lines are shorter than you assume then .* may grab a sequence line by mistake. You may use \h or [ \t] instead.
Note that \n is not necessary at the end of the pattern because .* matches any 0+ chars other than line break chars with an ICU regex library (it is used in your current code, str_replace_all).
In general, you should only capture with (...) what you need to keep, everything else can be just matched. Remove extra capturing parentheses, and it will save some performance.
If you add (?m)^ at the start, you will make sure you only match > that is at the start of a line.
You may use
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
See the regex demo.
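Before running the pattern on a whole file, it can be tried on a small in-memory sample. The sketch below uses base R's gsub with perl = TRUE instead of stringr so that it is self-contained; PCRE also understands (?m) and \h. The sequence lines are shortened for illustration, and keeping the leading > in the replacement is an assumption made here so the output stays valid FASTA.

```r
# Two header lines from the question, each followed by a shortened sequence line
fasta <- paste(
  ">KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast",
  "TAAATTATGTGTCAGAGC",
  ">KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast",
  "GATATATTAATACCTTAC",
  sep = "\n"
)

# Same pattern as above, split over several lines for readability
header_re <- paste0(
  "(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+",
  "([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+",
  "([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
)

# gsub() replaces every header in one call; sequence lines do not start with ">"
cleaned <- gsub(header_re, ">\\1 \\2", fasta, perl = TRUE)
writeLines(cleaned)
# >Acalypha australis
# TAAATTATGTGTCAGAGC
# >Achyranthes bidentata
# GATATATTAATACCTTAC
```

Since .* stops at the line break, each header is rewritten in place and the sequence lines are left untouched.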
Code:
Sequence <- str_replace_all(SequenceRaw,
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*",
"\\1 \\2")
I figured it out myself, with tidyverse packages:
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")

text manipulation in R

I am trying to add parentheses around certain parts of book-title strings, and I would like to be able to do it with the paste0 function. I want to take this vector:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")
wrap certain strings in parentheses:
a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, "[0-9]+.e"),")")
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
See the R demo
And here is the regex demo. In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: one or more whitespaces (same as Group 2)
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
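To confirm that the backreference-based replacement gives the output requested in the question, the call can be run and compared against the expected strings. This is a small sketch; the variable names wrapped and expected are illustrative.

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)")

# Groups 1, 3 and 5 are wrapped in parentheses; groups 2, 4 and 6 are
# re-inserted unchanged so the original spacing and the last part are kept.
wrapped <- sub(
  "\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))",
  "(\\1)\\2(\\3)\\4(\\5)\\6",
  a
)
writeLines(wrapped)
# I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)
# My Life (1993) (07e) (pdfDrama) (amazon.com)

expected <- c("I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)",
              "My Life (1993) (07e) (pdfDrama) (amazon.com)")
print(identical(wrapped, expected))
# [1] TRUE
```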
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
See the R demo and a regex demo
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.
