I want to insert a '0' before the single digit month (e.g. 2020M6 to 2020M06) using regular expressions.
The one below correctly matches the string I need to replace (a single digit at the end of the string following an 'M', excluding the 'M'), but the replacement pattern '0$0' is interpreted literally in R; elsewhere (regexprep in MATLAB) I referenced the matched string, '6' in the example, by '$0'.
sub('(?<=M)([0-9]{1})$','0$0', c('2020M6','2020M10'), perl = T)
[1] "2020M0$0" "2020M10"
I cannot find how to reference and re-use matched strings in the replacement pattern.
PS: There are alternative ways to accomplish the task, but I need to use regular expressions.
Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.
You can use
sub("(M)([0-9])$", "\\10\\2", x)
With a TRE regex like this one, you do not have to worry about a digit following a backreference, since only nine backreferences, 1 through 9, are allowed. Interestingly, you may also add perl=TRUE to the line of code above and it will yield the same result.
See the R demo online:
x <- c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
I think you have to capture the digit after 'M' and not 'M' itself, therefore :
sub('(?<=M)([0-9]{1})$','0\\1', c('2020M6','2020M10'), perl = T)
Captured strings can be reused with \\1, \\2 etc, by the way.
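Since base R has no whole-match reference, here is a minimal sketch of two workarounds using the example from the question. The variable names are illustrative and not from the original posts: either wrap the whole pattern in a capturing group, or locate the match with regexpr and overwrite it through the regmatches replacement function.

```r
# Sketch only: variable names (padded_group, padded_rm, m) are illustrative.
x <- c("2020M6", "2020M10")

# Option 1: wrap the entire pattern in a capturing group and refer to it as \\1.
padded_group <- sub("(?<=M)([0-9])$", "0\\1", x, perl = TRUE)

# Option 2: find the match positions, then overwrite the matched text in place.
padded_rm <- x
m <- regexpr("(?<=M)[0-9]$", padded_rm, perl = TRUE)
regmatches(padded_rm, m) <- paste0("0", regmatches(padded_rm, m))

print(padded_group)
print(padded_rm)
```

Both approaches leave "2020M10" untouched, because the lookbehind requires the single digit to come directly after the 'M'.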
Related
How can I back reference file_version_1a.csv in the following?
vec = c("dir/file_version_1a.csv")
In particular, I wonder why
gsub("(file.*csv$)", "", vec)
[1] "dir/"
as if I have a correct pattern, yet
gsub("(file.*csv$)", "\\1", vec)
[1] "dir/file_version_1a.csv"
You want to extract the substring starting with file and ending with csv at the end of string.
Since gsub replaces the match, and you want to use it as an extraction function, you need to match all the text in the string.
As the text not matched with your regex is at the start of the string, you need to prepend your pattern with .* (this matches any zero or more chars, as many as possible, if you use TRE regex in base R functions, and any zero or more chars other than line break chars in PCRE/ICU regexps used in perl=TRUE powered base R functions and stringr/stringi functions):
vec = c("dir/file_version_1a.csv")
gsub(".*(file.*csv)$", "\\1", vec)
However, stringr::str_extract (or regmatches with regexpr in base R) seems a more natural choice here:
stringr::str_extract(vec, "file.*csv$")
regmatches(vec, regexpr("file.*csv$",vec))
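One practical difference between replacing and extracting shows up when a string does not match at all. The sketch below uses base R only; the second element of vec is a made-up non-matching example added for illustration.

```r
# Sketch only: "dir/notes.txt" is a hypothetical non-matching input.
vec <- c("dir/file_version_1a.csv", "dir/notes.txt")

# sub() returns a non-matching string unchanged.
replaced <- sub(".*(file.*csv)$", "\\1", vec)

# regmatches() with regexpr() returns only the elements that matched.
extracted <- regmatches(vec, regexpr("file.*csv$", vec))

print(replaced)
print(extracted)
```

So the replacement approach keeps the vector length but may silently pass through unrelated strings, while the extraction approach drops them.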
I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
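As a quick check of the two variants described above, here is a small sketch on made-up words (the sample vector is an assumption, not from the question). The last word ends in "ceses", which shows that the ending is stripped only once per string.

```r
# Sketch only: the sample words are invented for illustration.
text <- c("processes", "faces", "boxes", "abceses")

# Capturing group: keep the s or c and drop the trailing "es".
by_group <- gsub("([sc])es$", "\\1", text)

# Lookbehind: match only "es" when it is preceded by s or c.
by_lookbehind <- gsub("(?<=[sc])es$", "", text, perl = TRUE)

print(by_group)
print(by_lookbehind)
```

Because the pattern is anchored with $, each string can match at most once, so "abceses" becomes "abces" rather than "abc".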
Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$"
If I understand correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
I have a vector of character strings like so:
test <- c("A1.7","A1.8")
and I want to use regular expressions to insert A1c<= between the period and digit like so:
A1.A1c<=7 A1.A1c<=8
I looked through questions and found #zx8754's similar question; I tried to modify the answer posted there but had no luck:
insert <- 'A1c<='
n <- 4
old <- test
lhs <- paste0('([[:alpha:]][[:digit:]][[:punct:]]{', n-1, '})([[:digit:]]+)$')
rhs <- paste0('\\1', insert, '\\2')
gsub(lhs, rhs, test)
Can anyone direct me as to how to correctly execute this?
Another pattern:
gsub("\\.(\\d+)", "\\.A1c<=\\1", test)
## [1] "A1.A1c<=7" "A1.A1c<=8"
You may use
insert <- 'A1c<='
test <- c("A1.7","A1.8")
sub("(?<=\\.)(?=\\d)", insert, test, perl=TRUE)
## => A1.A1c<=7 A1.A1c<=8
Details
(?<=\\.) - a positive lookbehind that matches a location that is immediately preceded with a dot
(?=\\d) - a positive lookahead that matches a location that is immediately followed with a digit.
The sub function will replace the first occurrence only, and perl=TRUE makes it possible to use the lookaround constructs in the pattern (as it is now parsed with the PCRE regex engine).
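To compare the zero-width lookaround approach with an ordinary capturing-group replacement, here is a minimal sketch on the data from the question. The result variable names are illustrative. Note that insert is used as a replacement string, so a value containing backslashes would need escaping first.

```r
# Sketch only: with_lookaround and with_groups are illustrative names.
test <- c("A1.7", "A1.8")
insert <- "A1c<="

# Zero-width match: nothing is consumed, so nothing has to be restored.
with_lookaround <- sub("(?<=\\.)(?=\\d)", insert, test, perl = TRUE)

# Capturing groups: the dot and the digit are consumed and put back.
with_groups <- sub("(\\.)(\\d)", paste0("\\1", insert, "\\2"), test)

print(with_lookaround)
print(with_groups)
```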
I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
A positive lookbehind (?<=
That asserts that what is behind is an #
Close positive lookbehind )
Match one or more word characters \w+
The central part of your regex ((.*)) is a sequence of any chars.
Instead you should look for a sequence of chars other than white space
(\S+) or word chars (\w+).
Note also that I changed * to +, as you are probably not interested
in any empty sequence of chars.
To capture also a name which has "last" position in the source
string, the last part of your regex should match not only a sequence
of whitespace chars, but also the end of the string, so change
(?=\\s+) to (?=\\s+|$).
And the last remark: Actually you don't need the parentheses around
the "central" part.
So to sum up, the whole regex can be like this:
(?<=#)\w+(?=\s+|$)
(with the global option).
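In R, the "global" behaviour can be obtained in base R with gregexpr and regmatches. The following is a minimal sketch using the chat line from the question. Since \w+ already stops at whitespace, the trailing lookahead is left out here, and names_found is an illustrative name.

```r
# Sketch only: names_found and first_word are illustrative names.
x <- "#Jack #Marie Hi there!"

# gregexpr finds every match; perl = TRUE enables the lookbehind.
names_found <- regmatches(x, gregexpr("(?<=#)\\w+", x, perl = TRUE))[[1]]
print(names_found)

# The same pattern applied to the single-tag example.
y <- "#This is what I want to extract"
first_word <- regmatches(y, regexpr("(?<=#)\\w+", y, perl = TRUE))
print(first_word)
```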
Here is a non-regex approach or rather a minimal-regex approach since grep takes the detection of # through the regex engine
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or to avoid strsplit, we can use scan (taken from this answer), i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"
I am looking for a way to replace all _ (by, say, '') in each of the following character strings
x <- c('test_(match)','test_xMatchToo','test_a','test_b')
if and only if _ is followed by ( or x. So the output wanted is:
x <- c('test(match)','testxMatchToo','test_a','test_b')
How can this be done (using any package is fine)?
Using a lookahead:
_(?=[(x])
A lookahead asserts that its pattern matches at the current position, but it does not consume the text it looks ahead at. So, here, the final match text consists of only the underscore, but the lookahead asserts that it's followed by an x or (.
Your R code would look a bit like this (one arg per line for clarity):
gsub(
"_(?=[(x])", # The regex
"", # Replacement text
c("your_string", "your_(other)_string"), # Vector of strings
perl=TRUE # Make sure to use PCRE
)
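Applied to the vector from the question, the lookahead pattern can be checked as in the sketch below. The capturing-group variant is shown only for comparison and is not part of the answer above; the result variable names are illustrative.

```r
# Sketch only: by_lookahead and by_group are illustrative names.
x <- c("test_(match)", "test_xMatchToo", "test_a", "test_b")

# Lookahead: remove "_" only when "(" or "x" follows; the follower is not consumed.
by_lookahead <- gsub("_(?=[(x])", "", x, perl = TRUE)

# Alternative without perl = TRUE: capture the follower and put it back.
by_group <- gsub("_([(x])", "\\1", x)

print(by_lookahead)
print(by_group)
```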