Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries varying a lot. The only things they have in common is "orf|" or "sp|" at the beginning and that "|" is on the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ,', remove the last character withsubstr, andpastethelist` elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | and the end of the string and removes it. The outer gsub looks for ,| and replaces with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
Related
I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we use [A-Z] also to match it will match the upper case letter and replace with blank ("") as in the OP's str_replace_all
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I cant figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
Regex demo
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the left most HUMAN right after.
Note the default value for the FIXED argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern, by setting fixed=FALSE. However, the ^(?:[^|]*\|){2} as the left argument regex will not work because the qdap package creates an ICU regex with lookarounds (since you use extract=TRUE that sets include.markers to FALSE), which is (?<=^(?:[^|]*\|){2}).*?(?=HUMAN).
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than a 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - startof strig
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
this is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution is an application of regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 matches whatever was matched inside the first parenthesis.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The part (.*?) is replaced by itself as the replacement contains the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a mamixmum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = T),
function(item) strsplit(item[3], "_HUMAN", fixed = T)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.
Need to take out all the characters before "(" and combine them with ";"
stringr::word(pilist, 2, sep = '(\\s*|\\')
pilist = "pi1(tag1,tag2);pi2(tag3,tag4,tag5);"
I expect the output as
"pi1;pi2"
You can try this pattern
\([^)]+\)|;+$
Regex Demo
Note:- Use escape character as \\ or \ depending on your regex engine
If you are string has the exact same structure as shown, we can remove everything which comes between round brackets and the trailing ; using gsub
gsub("\\(.*?\\)|;$", "", pilist)
#[1] "pi1;pi2"
However, following your description it can also be done by extracting the words which we want instead of removing. Using str_extract_all
paste0(stringr::str_extract_all(pilist, "(\\w+)(?=\\(.*\\))")[[1]], collapse = ";")
#[1] "pi1;pi2"
Here's a reproducible example
S0 <- "\n3 4 5"
S1 <- "\n3 5"
I want to use gsub and the following regex pattern (outside of R it works - tested in regex101) to return the digits. This regex should ignore \ and n whether they occur together or not.
([^\\n])(\s{1})?
I am not looking for a way to match the digits with a fundamentally different pattern - I'd like to know how to get the above pattern to work in R. The following do not work for me
gsub("([^\\\n])(\\s{1})?", "\\1", S0)
gsub("([^[\\\]n])(\\s{1})?", "\\1", S1)
The output should be
#S0 - 345
#S1 - 3 5
Since you specifically want that regex to work you could match and optional \n (using (\n)?):
gsub("(\n)?([^\\n])(\\s{1})", "\\2", S0)
#[1] "345"
gsub("(\n)?([^\\n])(\\s{1})", "\\2", S1)
#[1] "3 5"
Note that you were right, if you use a regex tester like: https://regex101.com/ it works without the extra "(\n)?". However, I think in R you have to match more for capture groups to work properly.
Your ([^\\n])(\s{1})? pattern in regex101 (PCRE) matches different strings than the same pattern used in gsub without perl=TRUE (that is, when it is handled by the TRE regex library). They would work the same if you used perl=TRUE and use gsub("([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE).
What is so peculair with the PCRE Regex ([^\\n])(\s{1})?
This pattern in a regex tester with PCRE option matches:
([^\\n]) - any char other than \ and n (put into Group 1)
(\s{1})? - matches and captures into Group 2 any single whitespace char, optionally, 1 or 0 times.
Note this pattern does not match any non-newline char with the first capturing group, it would match any non-newline if it were [^\n].
Now, the same regex with gsub will be
gsub("([^\n])(\\s{1})?", "\\1", S1) # OR
gsub("([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
Why different number of backslashes? Because the first regex is handled with TRE regex library and in these patterns, inside bracket expressions, no regex escapes are parsed as such, the \ and n are treated as 2 separate chars. In a PCRE pattern, the one with perl=TRUE, the [...] are called character classes and inside them, you can define regex escapes, and thus the \ regex escape char should be doubled (that is, inside the R string literal it should be quadrupled as you need a \ to escape \ for the R engine to "see" a backslash).
Actually, if you want to match a newline, you just need to use \n in the regex pattern, you may either use "\n" or "\\n" as both TRE and PCRE regex engines parse LF and a \n regex escape as a newline char matching pattern. These four are equivalent:
gsub("\n([^\n])(\\s{1})?", "\\1", S1)
gsub("\\n([^\n])(\\s{1})?", "\\1", S1)
gsub("\n([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
gsub("\\n([^\\\\n])(\\s{1})?", "\\1", S1, perl=TRUE)
If the \n must be optional, just add ? quantifier after it, no need wrapping it with a group:
gsub("\n?([^\n])(\\s{1})?", "\\1", S1)
^
And simplifying it further:
gsub("\n?([^\n])\\s?", "\\1", S1)
And also, if by [^\n] you want to match any char but a newline, just use . with (?n) inline modifier:
gsub("(?n)(.)(\\s{1})?", "\\1", S1)
See R demo online.
A couple of issues. The is not a backslash in your S object (it's an escape-operator rather than a character) and there is a predefined digit character class that can be negated:
gsub("[^[:digit:]]", "", S)
[1] "345"
If in the other hand you wanted to exclude the newline character and the spaces, it would be done by removing one of the escape operators, since they are not needed except for the small group of special characters that exist in the character class context:
gsub("[\n ]", "", S)
[1] "345"
I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay
We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
For the second case, we use regex lookarounds i.e. match a letter ((?<=[A-Za-z])) followed by a capital letter and replace with _.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")