Insert character string between period and digit in R - r

I have a vector of character strings like so:
test <- c("A1.7","A1.8")
and I want to use regular expressions to insert A1c<= between the period and the digit, like so:
A1.A1c<=7 A1.A1c<=8
I looked through existing questions and found @zx8754's similar question; I tried to modify the answer posted there but had no luck:
insert <- 'A1c<='
n <- 4
old <- test
lhs <- paste0('([[:alpha:]][[:digit:]][[:punct:]]{', n-1, '})([[:digit:]]+)$')
rhs <- paste0('\\1', insert, '\\2')
gsub(lhs, rhs, test)
Can anyone direct me as to how to correctly execute this?

Another pattern:
gsub("\\.(\\d+)", "\\.A1c<=\\1", test)
## [1] "A1.A1c<=7" "A1.A1c<=8"

You may use
insert <- 'A1c<='
test <- c("A1.7","A1.8")
sub("(?<=\\.)(?=\\d)", insert, test, perl=TRUE)
## => A1.A1c<=7 A1.A1c<=8
See the online R demo
Details
(?<=\\.) - a positive lookbehind that matches a location that is immediately preceded with a dot
(?=\\d) - a positive lookahead that matches a location that is immediately followed with a digit.
The sub function will replace the first occurrence only, and perl=TRUE makes it possible to use the lookaround constructs in the pattern (as it is now parsed with the PCRE regex engine).
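To make the first-occurrence point concrete, here is a small sketch comparing sub and gsub with the same lookaround pattern. The third input, "B2.3.4", is a made-up value added for illustration; it is not from the question.

```r
insert <- "A1c<="
# "B2.3.4" is a hypothetical extra input with two dot-digit boundaries
x <- c("A1.7", "A1.8", "B2.3.4")
pattern <- "(?<=\\.)(?=\\d)"

# sub() inserts at the first dot-digit boundary only
first_only <- sub(pattern, insert, x, perl = TRUE)
# gsub() inserts at every dot-digit boundary
every_hit <- gsub(pattern, insert, x, perl = TRUE)

print(first_only)
print(every_hit)
```

For the two inputs from the question the results are identical; only "B2.3.4" differs, because it has a second boundary that sub leaves alone.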

Related

regular expression in R, reuse matched string in replacement

I want to insert a '0' before the single digit month (e.g. 2020M6 to 2020M06) using regular expressions.
The one below correctly matches the string I need to replace (a single digit at the end of the string following an 'M', excluding the 'M'), but the replacement pattern '0$0' is interpreted literally in R; elsewhere (regexprep in MATLAB) I referenced the matched string, '6' in the example, with '$0'.
sub('(?<=M)([0-9]{1})$','0$0', c('2020M6','2020M10'), perl = T)
[1] "2020M0$0" "2020M10"
I cannot find how to reference and re-use matched strings in the replacement pattern.
PS: There are alternative ways to accomplish the task, but I need to use regular expressions.
Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.
You can use
sub("(M)([0-9])$", "\\10\\2", x)
With a TRE regex like this one, you do not have to worry about a digit following a backreference, since TRE replacement patterns only allow the nine backreferences \\1 through \\9, so \\10 is read as \\1 followed by a literal 0. Interestingly, you may add perl=TRUE to the above line of code and it will yield the same result.
See the R demo online:
x <- c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
Also, see the regex demo.
I think you have to capture the digit after 'M' and not 'M' itself, therefore :
sub('(?<=M)([0-9]{1})$','0\\1', c('2020M6','2020M10'), perl = T)
Captured strings can be reused with \\1, \\2 etc, by the way.
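A short sketch of reusing captured groups in the replacement. The second call, which reorders year and month, is my own illustration and not something the question asked for.

```r
x <- c("2020M6", "2020M10")

# Lookbehind keeps 'M' out of the match; \\1 re-inserts the captured digit
padded <- sub("(?<=M)([0-9])$", "0\\1", x, perl = TRUE)

# Hypothetical extra example: two groups, reused in swapped order
swapped <- sub("^([0-9]{4})M([0-9]+)$", "\\2/\\1", x)

print(padded)
print(swapped)
```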

Is there an R function to replace a matched RegEx with a string of characters with the same length? [duplicate]

I have a vector
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
and I want to replace every N at the head of each element with "-", keeping the same length.
When I use gsub, the whole run is replaced with a single "-":
gsub("^N+", "-", test)
# [1] "-CTCGTNNNGTCGTNN" "-CGTNNNGTCGTGN"
But I want the result to look like this:
# "---CTCGTNNNGTCGTNN", "-----CGTNNNGTCGTGN"
Is there any R function that can do this? Thanks for your patience and advice.
You can write:
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")
gsub("\\GN", "-", perl=TRUE, test)
which returns:
"---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN" "XNNNNNCGTNNNGTCGTGN"
\G, which is supported by Perl (and by PCRE (PHP), Ruby, Python's PyPI regex engine and others), asserts that the current position is at the beginning of the string for the first match and at the end of the previous match thereafter.
If the string were "NNNCTCGTNNNGTCGTNN" the first three "N"'s would each be matched (and replaced with a hyphen by gsub), then the attempt to match "C" would fail, terminating the match and string replacement.
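To show what the \G anchor buys you, this sketch runs the same replacement with and without it. The variable names are mine, chosen for illustration.

```r
test <- c("NNNCTCGTNNNGTCGTNN", "XNNNNNCGTNNNGTCGTGN")

# With \G each match must start where the previous one ended,
# so only the unbroken run of N at the very start is replaced
leading_only <- gsub("\\GN", "-", test, perl = TRUE)

# Without \G every N in the string is replaced
all_n <- gsub("N", "-", test)

print(leading_only)
print(all_n)
```

The second element starts with "X", so the \G version leaves it untouched.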
One approach would be to use the stringr functions, which support regex callbacks:
library(stringr)
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
# callback: turn every N in the matched text into a dash
repl <- function(x) { gsub("N", "-", x) }
str_replace_all(test, "^N+", function(m) repl(m))
[1] "---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN"
The strategy here is to first match ^N+ to capture one or more leading N. Then, we pass that match to a callback function which replaces each N with a dash.
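If you would rather not depend on stringr, the same match-then-transform strategy can be sketched in base R with regexpr and regmatches<-. This is a swapped-in alternative, not the stringr callback itself, and the names m and out are only for illustration.

```r
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")

# Locate the leading run of N in each element (-1 where there is none)
m <- regexpr("^N+", test)

# Work on a copy; replace each matched run with dashes of the same length
out <- test
regmatches(out, m) <- gsub("N", "-", regmatches(out, m))

print(out)
```

Elements with no leading N (the third one here) are left as they are, because regmatches only returns and replaces the parts that actually matched.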

Replace Strings in R with regular expression in it dynamically [duplicate]

I want to build a regex by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex; that way the search still works if a string contains regex metacharacters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
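For illustration, here is a base R sketch of the same substitution used inside an actual search. quotemeta_base is a hypothetical name for a gsub-based variant (not the stringr function above), and perl = TRUE is used throughout since that is the flavor this answer is about.

```r
# Hypothetical base R variant of quotemeta: backslash-escape every non-word character
quotemeta_base <- function(string) gsub("(\\W)", "\\\\\\1", string, perl = TRUE)

x <- "foo[bar]"
escaped <- quotemeta_base(x)

haystack <- c("foo[bar]", "foob")

# Unescaped, [bar] is a character class, so the pattern matches "foob"
raw_hits <- grepl(x, haystack, perl = TRUE)
# Escaped, the brackets are literal, so only "foo[bar]" matches
escaped_hits <- grepl(escaped, haystack, perl = TRUE)

print(escaped)
print(raw_hits)
print(escaped_hits)
```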
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$","\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for(i in seq_along(vals)){
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
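A minimal sketch of the \\Q...\\E approach, assuming perl = TRUE (I have only relied on it with the PCRE engine here). The variable names are illustrative.

```r
x <- "foo[bar]"

# Everything between \Q and \E is taken literally
# (this breaks if x itself contains "\E")
quoted <- paste0("\\Q", x, "\\E")
literal_hits <- grepl(quoted, c("foo[bar]", "foob"), perl = TRUE)

# Anchors and other regex syntax can still go outside the quoted part
anchored <- paste0("^\\Q", x, "\\E$")
exact_hits <- grepl(anchored, c("foo[bar]", "xfoo[bar]"), perl = TRUE)

print(literal_hits)
print(exact_hits)
```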
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect the occurrences of non-word characters and escape them with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or, equivalently, use "([^[:alnum:]_])" in place of "(\\W)".

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is an "/" .
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" (see regexr.com/4om61), but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions of you all, one last issue remains.
Because I will use the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position, using ifelse() and substr() (this will probably be a little bit faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
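A quick sketch of that distinction: grep selects elements, and a separate sub call is what actually changes them. The third element "zzz" is a made-up non-matching value.

```r
# "zzz" is a hypothetical element that does not match
x <- c("a1bc/", "a2bc", "zzz")

idx <- grep("bc", x)                # positions of the matching elements
vals <- grep("bc", x, value = TRUE) # the matching elements, unmodified

# Stripping the trailing slash is a separate step
cleaned <- sub("/$", "", vals)

print(idx)
print(vals)
print(cleaned)
```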
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
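In R, one way to actually get the matched text out (rather than the whole element, which is what grep returns) is regexpr plus regmatches. This is a sketch under the assumption that extracting the match is acceptable for your use case; it uses the same look-behind pattern.

```r
x <- c("a1bc/", "a1bc",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")

# .+ is greedy, then backtracks until the match no longer ends in "/"
m <- regexpr(".+(?<!/)", x, perl = TRUE)
trimmed <- regmatches(x, m)

print(trimmed)
```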
How about trying gsub("(.*?)/+$","\\1",s)?

Negative Lookahead Invalidated by extra numbers in string

I am trying to write a regular expression in R that matches a certain string up to the point where a . occurs. I thought a negative lookahead might be the answer, but I am getting some false positives.
So in the following 9-item vector
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
The grep
grep("mcq_q[0-9]+(?!\\.)", vec, perl = T)
does its job for the first six elements in the vector, matching "mcq_q11" but not "mcq_q2.factor". Unfortunately, it also matches the last three elements, where two digits follow the second q. Why does that second digit kill off my negative lookahead?
I think you want your negative lookahead to scan the entire string first, ensuring it sees no "dot":
(?!.*\.)mcq_q[0-9]+
https://regex101.com/r/f5XxR2/2/
If you are to capture until a dot then you should use this:
mcq_q[0-9]+(?![\d\.])
Sample Source ( run here )
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
grep("mcq_q[0-9]+(?![\\d\\.])", vec, perl = T)
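To see why the original pattern gives false positives, it helps to look at what it actually matches. This sketch uses a shortened three-element vector of my own choosing and extracts the matched text with regmatches.

```r
vec <- c("mcq_q9", "mcq_q2.factor", "mcq_q10.factor")

# Original pattern: [0-9]+ can backtrack to a single digit,
# after which the next character is "0", not ".", so the lookahead passes
m <- regexpr("mcq_q[0-9]+(?!\\.)", vec, perl = TRUE)
original_matches <- regmatches(vec, m)

# Adding \d to the lookahead blocks that partial match
fixed_idx <- grep("mcq_q[0-9]+(?![\\d.])", vec, perl = TRUE)

print(original_matches)
print(fixed_idx)
```

The second extracted match is "mcq_q1", taken from "mcq_q10.factor", which is exactly the false positive described in the question.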
We can do it without any lookaround by matching zero or more characters that are not a . after the digits ([0-9]+), up to the end of the string ($):
grep("mcq_q[0-9]+[^.]*$", vec, value = TRUE)
#[1] "mcq_q9" "mcq_q10" "mcq_q11" "mcq_q12"
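A negative lookahead is tricky here: [0-9]+ can backtrack and give up a digit, so that the next character is a digit rather than a dot and the lookahead succeeds. But you don't need it: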
/mcq_q[0-9]+(?:$|[^.0-9])/
This requires that the string of digits is followed either by end-of-string or by a character that is neither a dot nor a digit. So it will allow mcq_q12a etc. If your permissible strings may only end in digits, remove |[^...], and then the non-capturing group (?:...) isn't needed either, giving /mcq_q[0-9]+$/
Tested only in Perl as the question was tagged with it. It should be the same for your example in R.
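For what it's worth, a rough R translation of the two patterns might look like this. I pass perl = TRUE for the first one to stay with the engine the answer was tested against, and "mcq_q12a" is a made-up element added to show the difference between the two variants.

```r
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor",
         "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor",
         "mcq_q12a")  # hypothetical extra element

# Digits followed by end-of-string or by a non-dot, non-digit character
with_suffix <- grep("mcq_q[0-9]+(?:$|[^.0-9])", vec, perl = TRUE, value = TRUE)

# Digits must run to the end of the string
digits_only <- grep("mcq_q[0-9]+$", vec, value = TRUE)

print(with_suffix)
print(digits_only)
```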

Resources