I am trying to extract a portion of text that is enclosed in parentheses within a text string:
"Dominion Diamond Corporation (DDC) "
(I want to extract DDC).
Perusing the interwebs suggests that the regular expression
"\([^)]*\)"
will be useful.
I try the following:
ret = Regex(regExp)
match(ret, "Dominion Diamond Corporation (DDC) ")
Output:
RegexMatch("Dominion Diamond Corporation (DDC", 1="Dominion Diamond Corporation (DDC")
However, when I enter the regex directly into the match function:
match(r"\([^)]*\)", "Dominion Diamond Corporation (DDC) ")
The output is:
RegexMatch("(DDC)")
Why / how are these two expressions different? How do I interpolate an arbitrary regex expression into the first arg for match?
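As @Laurel suggests in a comment, the single backslashes weren't making it through to the match function. In an ordinary string literal each backslash has to be doubled, whereas the r"..." literal passes backslashes to the regex engine unchanged.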
julia> rstring = "\\([^)]*\\)"
"\\([^)]*\\)"
julia> match(Regex(rstring), "Dominion Diamond Corporation (DDC) ")
RegexMatch("(DDC)")
I am working on Single Cell RNA, and trying to demultiplex RAW count matrix. I am following this. In this tutorial, the barcode is in this format:
"BIOKEY_33_Pre_AAACCTGAGAGACTTA-1", where BIOKEY_33_Pre is the prefix and AAACCTGAGAGACTTA-1 is the sequence of bases. The prefixes are sample names, so they will be used to demultiplex the data.
Using this regular expression, I can extract the prefixes.
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
The problem is, in my data, the barcode is in this format:
AAACCTGAGAAACCGC_LN_05 where the sequence is first, and the sample name is last. I need to extract the suffixes. If I run the above regular expression on my data, I get the following output:
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
sample.names <- unique(data.pfx)
head(sample.names)
"AAACCTGAGAAACCGC_LN_05"
"AAACCTGAGAAACGCC_NS_13"
"AAACCTGAGCAATATG_LUNG_N34"
The desired output:
"LN_05"
"NS_13"
"LUNG_N34"
You can use
sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", sample.names)
See the regex demo.
Details:
.* - any zero or more chars as many as possible
_ - an underscore
([A-Z]+_[0-9A-Z]+) - Group 1 (\1): one or more uppercase ASCII letters, _ and one or more uppercase ASCII letters or digits
$ - end of string.
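As a minimal sketch of the pattern described above, the snippet below applies it to the three barcodes from the question. The vector name `barcodes` is made up for illustration, and its last element is a hypothetical input that does not fit the pattern, included to show that sub() returns non-matching strings unchanged.

```r
# Example barcodes from the question, plus one hypothetical non-matching value
barcodes <- c("AAACCTGAGAAACCGC_LN_05",
              "AAACCTGAGAAACGCC_NS_13",
              "AAACCTGAGCAATATG_LUNG_N34",
              "no-suffix-here")

# Keep only Group 1: letters, an underscore, then letters or digits at the end
suffixes <- sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", barcodes)
print(suffixes)
```

Because a non-matching string comes back untouched, it may be worth checking for leftovers (for example with grepl() on the same pattern) before using the result as sample names.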
A slightly simpler approach is to remove all leading capital letters up to and including the first underscore:
sample.names <- c("AAACCTGAGAAACCGC_LN_05" ,
"AAACCTGAGAAACGCC_NS_13")
sub("^[A-Z]+_", "", sample.names)
#> [1] "LN_05" "NS_13"
I want to insert a '0' before the single digit month (e.g. 2020M6 to 2020M06) using regular expressions.
The one below correctly matches the string I need to replace (a single digit at the end of the string following an 'M', excluding 'M'), but the replacement pattern '0$0' is interpreted literally in R; elsewhere (regexprep in MATLAB) I referenced the matched string, '6' in the example, by '$0'.
sub('(?<=M)([0-9]{1})$','0$0', c('2020M6','2020M10'), perl = T)
[1] "2020M0$0" "2020M10"
I cannot find how to reference and re-use matched strings in the replacement pattern.
PS: There are alternative ways to accomplish the task, but I need to use regular expressions.
Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.
You can use
sub("(M)([0-9])$", "\\10\\2", x)
With a TRE regex like this one, you do not have to worry about a digit following a backreference, since TRE patterns only allow the nine backreferences \1 through \9. Interestingly, you may add perl=TRUE to the above line of code and it will yield the same result.
See the R demo online:
x <- c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
Also, see the regex demo.
I think you have to capture the digit after 'M' and not 'M' itself, therefore:
sub('(?<=M)([0-9]{1})$','0\\1', c('2020M6','2020M10'), perl = T)
Captured strings can be reused with \\1, \\2 etc, by the way.
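To illustrate the point about reusing captured strings, here is a small sketch that assumes inputs shaped like the ones in the question. The second call is a made-up reordering example, not something from the original post; the variable names `padded` and `swapped` are likewise hypothetical.

```r
x <- c("2020M6", "2020M10")

# The lookbehind keeps 'M' out of the match, so only the digit is captured as \\1
padded <- sub("(?<=M)([0-9])$", "0\\1", x, perl = TRUE)
print(padded)

# Hypothetical second use: two groups referenced in swapped order
swapped <- sub("^([0-9]{4})M([0-9]{2})$", "\\2/\\1", padded)
print(swapped)
```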
I am trying to extract (from a string) all the chunks of characters between two \r\n expressions that do not contain a white space. To do so, I am using the negative lookahead operator.
This is my string:
my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
And this is what I've tried:
pat <- "\\r\\n+(?! )\\r\\n.*"
out <- unlist(regmatches(my_string,
regexpr(pat, my_string, perl=TRUE)))
This is what I got in R:
> out
[1] "\r\n\r\nDBhHB\r\n"
As you can see, it stops on the first match.
EDIT
My expected output, in this case, would be the final part of the string.
> out
[1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
I would like to be able to retrieve multiple parts if there is another one or two white spaces in other chunks in the middle of the string.
my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"
Suggestions under the base R approach would be greatly appreciated.
Thanks in advance.
I suggest using
(?m)^\S+(?:\R\S+)*$
See the regex demo. Details:
(?m) - multiline mode on
^ - this anchor now matches all line start positions
\S+ - one or more non-whitespace chars
(?:\R\S+)* - zero or more repetitions of a line break sequence and then one or more non-whitespace chars
$ - end of a line.
R demo:
library(stringr)
my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
pat <- "(?m)^\\S+(?:\\R\\S+)*$"
unlist(str_extract_all(my_string, pat))
## => [1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU"
my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"
unlist(str_extract_all(my_string, pat))
## => [1] "KeepThis\r\nKeepThis" "KeepThis"
Base R usage
Note that with perl=TRUE, base R uses the PCRE engine, and $ in multiline mode (when (?m) is used) only matches right before \n. Since you have \r\n line endings, you cannot use a plain $ to mark the line end. Consuming \r (with \r$) is not a good idea, as you do not want \r in the output. You can tell PCRE to treat CRLF, CR or LF as a line ending sequence with the (*ANYCRLF) PCRE verb:
unlist(regmatches(my_string, gregexpr("(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$",my_string, perl=TRUE)))
Note that the (*ANYCRLF) PCRE verb must be at the start of the regex pattern.
See this R demo online.
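For completeness, here is a sketch of the base R call wrapped in a small helper and run on both strings from the question. The function name `extract_chunks` is hypothetical; the pattern is the one given above.

```r
# Hypothetical helper: return runs of consecutive lines that contain no whitespace
extract_chunks <- function(s) {
  pat <- "(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$"
  unlist(regmatches(s, gregexpr(pat, s, perl = TRUE)))
}

s1 <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
s2 <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"

chunks1 <- extract_chunks(s1)
chunks2 <- extract_chunks(s2)
print(chunks1)
print(chunks2)
```

Under these assumptions the helper should return the same chunks as the stringr version, without a trailing \r on any of them.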
I want to build a regex by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex; that way the search still works if a string contains regex metacharacters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
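A quick sketch of how that substitution behaves, using a local copy so that Hmisc does not need to be installed. The wrapper name `escape_regex` is made up here, and the installed Hmisc version may differ from the definition quoted above.

```r
# Local copy of the substitution quoted above; the name escape_regex is made up
escape_regex <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

x <- "foo[bar]"
escaped <- escape_regex(x)
print(escaped)

# Unescaped, [bar] is a character class, so "foob" matches; escaped, it does not
raw_hits_foob     <- grepl(x, "foob")
escaped_hits_foob <- grepl(escaped, "foob")
escaped_hits_self <- grepl(escaped, "foo[bar]")
print(c(raw_hits_foob, escaped_hits_foob, escaped_hits_self))
```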
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
"\\{", "\\}", "\\^", "\\$","\\*",
"\\+", "\\?", "\\.", "\\|")
replace.vals <- paste0("\\\\", vals)
for(i in seq_along(vals)){
strings <- gsub(vals[i], replace.vals[i], strings)
}
strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
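A minimal sketch of the \\Q...\\E approach, using perl = TRUE so that PCRE handles the quoting. It assumes the literal text does not itself contain \E, which would end the quoted section early. The variable names are illustrative only.

```r
x <- "foo[bar]"

# Wrap the literal text in \Q...\E so PCRE treats every character literally
pattern <- paste0("\\Q", x, "\\E")

hit_literal <- grepl(pattern, "see foo[bar] here", perl = TRUE)
hit_foob    <- grepl(pattern, "foob", perl = TRUE)

# When the entire pattern is literal, fixed = TRUE avoids regex syntax altogether
hit_fixed <- grepl(x, "see foo[bar] here", fixed = TRUE)

print(c(hit_literal, hit_foob, hit_fixed))
```

The quoting route is mainly useful when the literal text is only one piece of a larger pattern; for a purely literal search, fixed = TRUE is the simpler choice.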
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect occurrences of non-word characters and escape them with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or, equivalently, use "([^[:alnum:]_])" in place of "(\\W)".
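As a small sketch, the two spellings can be compared side by side on an ASCII example, and the escaped text can then be used as one piece of a larger pattern. The input string and variable names below are made up for illustration.

```r
s <- "foo[bar].baz"

# Escape every non-word character, once with \W and once with the POSIX class
esc_w     <- gsub("(\\W)", "\\\\\\1", s)
esc_posix <- gsub("([^[:alnum:]_])", "\\\\\\1", s)
print(esc_w)
same <- identical(esc_w, esc_posix)

# Use the escaped text inside an anchored pattern
pattern <- paste0("^", esc_w, "$")
hits <- grepl(pattern, c("foo[bar].baz", "foobXbaz"))
print(hits)
```

Without escaping, the second candidate would also match, since [bar] would act as a character class and . as a wildcard.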
Extract all the valid equations in the following text.
I have tried a few regex expressions but none seem to work.
Hoping to use sub or gsub functions in R.
myText <- 'equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6'
expected result : 2+3=5 2*3=6
Here is a base R approach. We can use gregexpr() to find multiple matches of equations in the input string:
x <- c("equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6")
m <- gregexpr("\\b\\w+(?:[+*-]\\w+)+=\\w+\\b", x)
regmatches(x, m)
[[1]]
[1] "2+3=5" "2*3=6"
Here is an explanation of the regex:
\\b\\w+ - a word boundary, then an initial number or variable
(?:[+*-]\\w+)+ - one or more repetitions of an arithmetic symbol (+, * or -) followed by another number or variable; the hyphen goes last in the character class so that it is taken literally rather than as a range
=\\w+\\b - an equals sign followed by a number or variable, then a word boundary
For the examples you've posted, the regex (\d+[+\-*\/]\d+=\d+) should extract the equations and not the rest of the text. Note that this regex does not handle variables or variable names, only numbers and the basic arithmetic operators. It may need to be adapted for R.
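One possible adaptation for R, as a sketch: the backslashes are doubled inside the string literal, the outer capture group is dropped because regmatches() returns the whole match anyway, and perl = TRUE is assumed so that \\d and the escaped hyphen are handled by PCRE.

```r
myText <- "equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6"

# digits, one operator (+ - * /), digits, '=', digits
pat <- "\\d+[+\\-*/]\\d+=\\d+"
equations <- regmatches(myText, gregexpr(pat, myText, perl = TRUE))[[1]]
print(equations)
```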