zero padding regex dependent on length of digits - r

I have a field which contains two characters, some digits and potentially a single letter. For example
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to consistently return all letters in their original position, but digits as follows.
for 1 to 3 digits :
return all digits OR the digits left padded with zeros
For 4 or more digits :
the result must not begin with a zero: return the first 4 digits, OR, if the first digit is a zero, truncate to three digits
example from the data above
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R but I'm interested in a straight regex solution, I'll work out the R side of things. It may not be possible in straight regex which is why I can't get my head round it.
This identifies the correct ones, but I'm hoping to correct those which are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...

You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - the text matched so far is removed from the match
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
\\d* - 0+ digits.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to a width of at least three digits.
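If you would rather avoid the gsubfn dependency, roughly the same logic can be sketched in base R by splitting each value into its parts and reassembling them. This is only an illustration, assuming every value starts with two uppercase letters followed by at least one digit; the names flights, prefix, digits, suffix and fixed are made up for the example.

```r
flights <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")

# The two leading letters
prefix <- sub("^([A-Z]{2}).*$", "\\1", flights, perl = TRUE)
# Skip leading zeros, then keep at most four digits
digits <- sub("^[A-Z]{2}0*(\\d{1,4}).*$", "\\1", flights, perl = TRUE)
# Whatever follows the digit run (e.g. a trailing letter)
suffix <- sub("^[A-Z]{2}\\d*", "", flights, perl = TRUE)

# Left-pad the number to at least three digits and reassemble
fixed <- paste0(prefix, sprintf("%03d", as.integer(digits)), suffix)
fixed
```

Values that do not fit the assumed shape (for example, no digits at all) would need a separate check before this is applied.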

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
in R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any char into a consecutively numbered group, and the (?!...) negative lookaheads make sure each subsequent . does not match any of the already captured chars.
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input as the first argument and the regex as the second argument.
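Since the question asks for a position rather than the matched text, here is a small sketch of how the position could be derived in base R with the same pattern. It assumes the PCRE engine (perl = TRUE); block_start and block_end are illustrative names only.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the 1-based start of the first match (or -1 if there is none)
m <- regexpr(pattern, txt, perl = TRUE)
block_start <- as.integer(m)
# Position of the last character of the four-character block
block_end <- block_start + attr(m, "match.length") - 1L
c(start = block_start, end = block_end)
```

A result of -1 from regexpr() means there is no block of four distinct characters, so that case should be checked before using block_end.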

Regex for alternating pattern between two character groups

I'm trying to find matches where the pattern alternates between two character groups, D\E and R\K\H.
The pattern I've come up with (through reading other posts on here) is
(([DE](?=[RKH])*)|(([RKH])(?=[DE])*))+
Using this pattern with this test string: DREDRDRDRARDK
I get the following matches: DR, DRDRD, RD
I want: DRE, DRDRDR, RDK
The matches are missing the last letter for each group.
Please could someone help me figure out why.
Match the first group followed by the second, with that pair repeated any number of times and then optionally followed by the first group, i.e. ([DE][RKH])+[DE]?; or the same with the groups interchanged, i.e. ([RKH][DE])+[RKH]?; or just a single character from the first group, i.e. [DE]; or just a single character from the second group, i.e. [RKH]:
library(gsubfn)
x <- "DREDRDRDRARDK" # input
rx <- "(([DE][RKH])+[DE]?|([RKH][DE])+[RKH]?|[DE]|[RKH])"
strapply(x, rx)
## [[1]]
## [1] "DRE" "DRDRDR" "RDK"
In your pattern, you repeatedly match a single character out of 2 character classes followed by a positive lookahead which asserts that there should be a character present directly at the right.
(Note that the positive lookahead should not be optionally repeated (?=[RKH])* or else it will always be true, matching too much)
If the quantifier * is not present after the lookahead you will get your matches where characters are missing.
The reason why the matches are missing the last letter for each group is when [DE] is matched, there is a positive lookahead asserting what is directly to the right is [RKH] (and the other way around due to the alternation)
It does not match the E in DRE because, when matching E, the lookahead asserts one of [RKH] after it, which is not the case
It does not match the last R in DRDRDR because that R is followed by A, which is not one of [DE]
As the positive lookahead asserts that there should be a next character present, you also don't match the last K because there is no character after it
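The effect described above can be reproduced with a short sketch. It compares a lookahead-based pattern without the * quantifier against a pattern that consumes the pairs and optionally one trailing character. The variable names are illustrative, and perl = TRUE is assumed for lookahead support.

```r
x <- "DREDRDRDRARDK"

# Lookahead version: each character must be followed by one from the other
# class, so the final character of every alternating run is never consumed
with_lookahead <- "(?:[DE](?=[RKH])|[RKH](?=[DE]))+"
short_matches <- regmatches(x, gregexpr(with_lookahead, x, perl = TRUE))[[1]]

# Pair-consuming version with an optional trailing character
pair_pattern <- "(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?"
full_matches <- regmatches(x, gregexpr(pair_pattern, x, perl = TRUE))[[1]]

short_matches
full_matches
```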
As already answered, you can repeatedly match the pairs of character classes followed by optionally matching the first character class after it.
Without the groups, I think it could also be shortened to:
(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?
Regex demo
library(stringr)
str_extract_all("DREDRDRDRARDK", "(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?")
Output
[[1]]
[1] "DRE" "DRDRDR" "RDK"

How to match binomial expressions in R?

I want to match binomials, that is, bisyllabic words, sometimes hyphenated, with slightly varied syllable reduplication; the variation always concerns the first (and, possibly, second) letter in the reduplicated syllable:
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty", "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")
Here, we have said syllable reduplication in pow-wow, pickwick, easy-peasy, and nitty-gritty (which are then the expected output) but not in bzzzzzzz, mmmmmm, shish, wedged and yaaaaaa.
This regex does at least manage to get rid of wedged (which is pronounced as one syllable) as well as monosyllabic words by requiring the presence of a vowel in the capturing group:
grep("\\b\\w?((?!ed)(?=[aeiou])\\w{2,})-?\\w\\w?\\1\\b$", x, value = T, perl = T)
[1] "pow-wow" "pickwick" "easy-peasy" "nitty-gritty" "yaaaaaa"
However, yaaaaaa is getting matched too. To avoid matching it, my feeling is that the capturing group should not be allowed to contain two identical vowels in immediate succession, but I don't know how to implement that restriction.
Any ideas?
It looks as though you want to match words that cannot contain ed after the initial chars and 2 or more repeated chars if the same chunk is not found farther in the string. Also, the allowed "difference" window at the start and middle is 0 to 2 characters.
You may use
\b\w{0,2}(?!((.)\2+)(?!.*\1)|ed)([aeiou]\w+)-?\w\w?\3\b
See the regex demo
Details
\b - a word boundary (you may use ^ if your "words" are equal to whole string)
\w{0,2} - zero to two word chars (replace \w with \p{L} to only match letters)
(?!((.)\2+)(?!.*\1)|ed) - no ed or two or more identical chars that do not repeat later in the string are allowed immediately to the right of the current location
([aeiou]\w+) - Capturing group 3: a vowel followed by 1+ word chars (replace \w with \p{L} to only match letters)
-? - an optional hyphen
\w\w? - 1 or 2 word chars
\3 - same value as captured in Group 3
\b - a word boundary (you may use $ if your "words" are equal to whole string)
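As a quick check, the pattern can be run against the sample vector from the question. This is a sketch assuming the PCRE engine (perl = TRUE) and doubled backslashes for R string literals; rx and binomials are illustrative names.

```r
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty",
       "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")

# Same pattern as above, with backslashes escaped for an R string
rx <- "\\b\\w{0,2}(?!((.)\\2+)(?!.*\\1)|ed)([aeiou]\\w+)-?\\w\\w?\\3\\b"

# value = TRUE returns the matching elements rather than their indices
binomials <- grep(rx, x, value = TRUE, perl = TRUE)
binomials
```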

Matching character followed by exactly 1 digit

I need to align the formatting of some clinical trial IDs to merge two databases. For example, in database A patient 123 visit 1 is stored as '123v01' and in database B just '123v1'
I can match A to B by using grep to find those containing 'v0' and stripping out the zero to leave just 'v', but for academic interest & expanding R / regex skills, I want to reverse match B to A by matching only those containing 'v' followed by only 1 digit, so I can then separately pad that digit with a leading zero.
For a reprex:
string <- c("123v1", "123v01", "123v001")
I can match those with >= 2 digits following a 'v', then inverse subset
> idx <- grepl("v(\\d{2})", string)
> string[!idx]
[1] "123v1"
But there must be a way to match 'v' followed by just a single digit only? I have tried the lookarounds
# Negative look ahead "v not followed by 2+ digits"
grepl("v(?!\\d{2})", string)
# Positive look behind "single digit following v"
grepl("(?<=v)\\d{1})", string)
But both return an 'invalid regex' error
Any suggestions?
You need to set the perl=TRUE flag on your grepl function.
e.g.
grepl("v(?!\\d{2})", string, perl=TRUE)
[1] TRUE FALSE FALSE
See this question for more info.
You may use
grepl("v\\d(?!\\d)", string, perl=TRUE)
The v\d(?!\d) pattern matches v, one digit, and then makes sure there is no digit immediately to the right of the current location (i.e. after the v + 1 digit).
See the regex demo.
Note that you need to enable PCRE regex flavor with the perl=TRUE argument.
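The padding step mentioned in the question could then be done in one pass by capturing the single digit and re-inserting it after a zero. This is a minimal sketch, assuming each ID contains at most one 'v' + digits part; padded is an illustrative name.

```r
string <- c("123v1", "123v01", "123v001")

# Capture a digit after 'v' that is not followed by another digit,
# then put it back with a leading zero
padded <- sub("v(\\d)(?!\\d)", "v0\\1", string, perl = TRUE)
padded
```

Values that already have two or more digits after the 'v' are left unchanged, because the lookahead prevents a match.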

How to remove ending zeros in binary bit sequence in R?

I need to remove ending zeros from binary bit sequences.
The length of the bit sequence is fixed, say 52. i.e.,
0101111.....01100000 (52-bit),
10111010..1010110011 (52-bit),
10111010..1010110100 (52-bit).
From converting decimal number to normalized double precision, significand is 52 bit, and hence zeros are populated to the right hand side even if significand is less than 52 bit at first step. I am reversing the process: i.e., I am trying to convert a normalized double precision in memory to decimal number, hence, I have to remove zeros (at the end) that are used to populate 52 bits for significand.
It is not guaranteed that the sequence in hand necessarily ends in 0s (like the 2nd example above). If it does, all ending zeros must be truncated:
f(0101111.....01100000) # 0101111.....011; leading 0 must be kept
f(10111010..1010110011) # 10111010..1010110011; no truncation
f(10111010..1010110100) # 10111010..10101101
Unfortunately, the number of truncated 0s at the end differs. (5 in the 1st example; 2 in the 3rd example).
It is OK for me if input and output class are string:
f("0101111.....01100000") # "0101111.....011"; leading 0 must be kept
f("10111010..1010110011") # "10111010..1010110011"; no truncation
f("10111010..1010110100") # "10111010..10101101"
Any help is greatly appreciated.
This is a simple regular expression.
f <- function(x) sub('0+$', '', x)
Explanation:
0 - matches the character 0.
0+ - the character zero repeated at least one time, meaning, one or more times.
$ matches the end of the string.
0+$ the character 0 repeated one or more times and nothing else until the end of the string.
Replace the sub-string matched by the pattern with the empty string, ''.
Now test the function.
f("010111101100000")
#[1] "0101111011"
f("0100000001010101100010000000000000000000000000000000000000000000")
#[1] "010000000101010110001"
f("010000000101010110001000000")
#[1] "010000000101010110001"
f("00010000000101010110001000000")
#[1] "00010000000101010110001"
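Because sub() is vectorized, the same function can be applied to several sequences at once. The short inputs below are made up for illustration and are not real 52-bit significands.

```r
f <- function(x) sub("0+$", "", x)

# Illustrative inputs: trailing zeros, no trailing zeros, all zeros
bits <- c("01100000", "1011", "0000")
trimmed <- f(bits)
trimmed
```

Note that a sequence consisting only of zeros is reduced to the empty string. If that case can occur in your data and should be kept as a single "0", it would need separate handling.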
