Negative Lookahead Invalidated by extra numbers in string - r

I am trying to write a regular expression in R that matches a certain string up to the point where a . occurs. I thought a negative lookahead might be the answer, but I am getting some false positives.
So in the following 9-item vector
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
The grep
grep("mcq_q[0-9]+(?!\\.)", vec, perl = T)
does its job for the first six elements in the vector, matching "mcq_q11" but not "mcq_q2.factor". Unfortunately though it does match the last 3 elements, when there are two numbers following the second q. Why does that second number kill off my negative lookahead?

I think you want your negative lookahead to scan the entire string first, ensuring it sees no "dot":
(?!.*\.)mcq_q[0-9]+
https://regex101.com/r/f5XxR2/2/

If you are to capture until a dot then you should use this:
mcq_q[0-9]+(?![\d\.])
Demo
Sample Source ( run here )
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
grep("mcq_q[0-9]+(?![\\d\\.])", vec, perl = T)

We can use it without any lookaround to match zero or more characters that are not a . after the numbers ([0-9]+) till the end of the string ($)
grep("mcq_q[0-9]+[^.]*$", vec, value = TRUE)
#[1] "mcq_q9" "mcq_q10" "mcq_q11" "mcq_q12"

A negative lookahead is tricky nere, as explained in a comment. But you don't need it
/mcq_q[0-9]+(?:$|[^.0-9])/
This requires that a string of digits is followed by either end-of-string or a non-[.,digit] character. So it will allow mcq_q12a etc. If your permissible strings may only end in numbers remove |[^...], and then the non-capturing group (?:...) isn't needed either, for /mcq_q[0-9]+$/
Tested only in Perl as the question was tagged with it. It should be the same for your example in R.

Related

Can quantifiers be used in regex replacement in R?

My objective would be replacing a string by a symbol repeated as many characters as have the string, in a way as one can replace letters to capital letters with \\U\\1, if my pattern was "...(*)..." my replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x so I would get so many x as characters captured by *.
Is this possible?
I am thinking mainly in sub,gsub but you can answer with other libraris like stringi,stringr, etc.
You can use perl = TRUE or perl = FALSE and any other options with convenience.
I assume the answer can be negative, since seems to be quite limited options (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
?
The preceding item is optional and will be matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
{n}
The preceding item is matched exactly n times.
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
Ok, but it seems to be an option (which is not in PCRE, not sure if in PERL or where...) (*) which captures the number of characters the star quantifier is able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html) so then it could be used \q1 (same reference) to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,} but I'm not sure if this is really the fact for what I'm interested in.
EDIT UPDATE:
Since asked by commenters I update my question with an specific example provide by this interesting question. I modify a bit the example. Let's say we have a <- "I hate extra spaces elephant" so we are interested in keeping the a unique space between words, the 5 first characters of each word (till here as the original question) but then a dot for each other character (not sure if this is what is expected in the original question but doesn't matter) so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant in the end of elephant). So I started by keeping the 5 first characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?
Quantifiers cannot be used in the replacement pattern, nor the information how many chars they match.
What you need is a \G base PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
See the R demo and the regex demo.
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.
gsubfn is like gsub except the replacement string can be a function which inputs the match and outputs the replacement. The function can optionally be expressed a formula as we do here replacing each string of word characters with the output of the function replacing that string. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same except function is slightly different:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."

regular expression in R, reuse matched string in replacement

I want to insert a '0' before the single digit month (e.g. 2020M6 to 2020M06) using regular expressions.
The one below correctly matches the string I need to replace (a single digit at the end of the string following a 'M', excluding 'M'), but the replacement pattern '0$0' is interepreted literally in R; elsewhere (regeprep in matlab) I referenced the matched string, '6' in the example, by '$0'.
sub('(?<=M)([0-9]{1})$','0$0', c('2020M6','2020M10'), perl = T)
[1] "2020M0$0" "2020M10"
I cannot find how to reference and re-use matched strings in the replacement pattern.
PS: There are alternative ways to accomplish the task, but I need to use regular expressions.
Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.
You can use
sub("(M)([0-9])$", "\\10\\2", x)
With TRE regex like here, you do not have to worry about a digit after a backreference, since only 9 backreferences starting with 1 till 9 are allowed in TRE regex patterns. What is of interest is that you may use perl=TRUE in the above line of code and it will yield the same results.
See the R demo online:
x <- c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
Also, see the regex demo.
I think you have to capture the digit after 'M' and not 'M' itself, therefore :
sub('(?<=M)([0-9]{1})$','0\\1', c('2020M6','2020M10'), perl = T)
Captured strings can be reused with \\1, \\2 etc, by the way.

remove parenthesis after number, keep number

I need to remove a parenthesis after a number in a string:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The resulting string should look like this:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1-a3cons*HGDI_r_lag_1-(1-a3cons)*HNW_r_lag_2+a4cons*rate_90_r_lag_1)+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2+a7cons*dl_HNW_r_lag_1+a8cons*d_rate_UNE_lag_2+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The regular expression I am trying to capture here is the first parenthesis after the string "lag_" followed by some number. Note, that in places there are two parenthesis:
rate_90_r_lag_1))
And I only want to remove the first one.
I've tried a simple regex in gsub
a <- "dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
gsub("[0-9]\\)","[0-9]",a)
But I the resulting string removes the number and replaces it with [0-9]:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_[0-9]-a3cons*HGDI_r_lag_[0-9]-(1-a3cons)*HNW_r_lag_[0-9]+a4cons*rate_90_r_lag_[0-9])+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_[0-9]+a7cons*dl_HNW_r_lag_[0-9]+a8cons*d_rate_UNE_lag_[0-9]+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
I understand that the gsub is doing what it is intended to do. What I don't know is how to keep the number before the parenthesis?
You need to use a look around (in this case the preceded by) so that it will match just the parentheses as the matching group instead of the numbers and the parentheses. Then you can just remove the parentheses.
gsub("(?<=[0-9])\\)","", a, perl = TRUE)
You can do this using capture groups:
Lets just try it on the string my_string <- " = a0cons+a2cons*(CONH_r_lag_1)-a3cons*"
reg_expression <- "(.*[0-9])\\)(.*)" #two capture groups, with the parenthesis not in a group
my_sub_string <- sub(reg_expression,"\\1\\2", my_string)
Notice "\\1" reads like \1 to the regex engine, and so is a special character referring to the first capture group. (These can also be named)
Another way of doing this is lookarounds:
There are two basic kinds of lookarounds, a lookahead (?=) and a lookbehind (?<=). Since we want to match a pattern, but not capture, something behind our matched expression we need a lookbehind.
reg_expression <- "(?<=[0-9])\\)" #lookbehind
my_sub_string <- sub(reg_expression,"", my_string)
Which will match the pattern, but only replace the parenthesis.

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried #Onyambu 's first method and I ran into a variant which I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". In the replacement part, I also have numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL" and I get it using sub("\\.\\w+.\\B.",".",x) per the second part of his answer!
See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl=T)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
x= "ABC.A.SVN.10.10.390.10.UDGGL" "XYZ.Z.SVN.11.12.111.99.ASDDL"
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Capture any word that has the characters A-Z
(\\.\\d+) The captured word above, must be followed with a dot ie\\..This dot is then followed by numbers ie \\d+. This completes the capture.
so far the captured part of the string "ABC.A.SVN.10.10.390.10.UDGGL" is SVN.10 since this is the part that matches the regular expression. But this part was captured as SVN and .10. we do a backreference ie replace the whole SVN.10 with the 2nd part .10
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
FUN = function(str){
ind = unlist(gregexpr("\\.", str))[2:3]
paste(c(substring(str, 1, ind[1]),
"SUBSTITUTION",
substring(str, ind[2], )), collapse = "")
},
USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"

How to extract a string between a symbol and a space?

I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
A positive lookbehind (?<=
That asserts that what is behind is an #
Close positive lookbehind )
Match one or more word characters \w+
The central part of your regex ((.*)) is a sequence of any chars.
Instead you shoud look for a sequence of chars other than white space
(\S+) or word chars (\w+).
Note also that I changed * to +, as you are probably not interested
in any empty sequence of chars.
To capture also a name which has "last" position in the source
string, the last part of your regex should match not only a sequence
of whitespace chars, but also the end of the string, so change
(?=\\s+) to (?=\\s+|$).
And the last remark: Actually you don't need the parentheses around
the "central" part.
So to sum up, the whole regex can be like this:
(?<=#)\w+(?=\s+|$)
(with global oprion).
Here is a non-regex approach or rather a minimal-regex approach since grep takes the detection of # through the regex engine
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or to avoid strsplit, we can use scan (taken from this answer), i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"

Resources