split at entire deliminator but not each component of deliminator - r

I want to split a string and keep where its being split.
str = 'Glenn: $53 Sutter: $44'
strsplit(str, '[0-9]\\s+[A-Z]', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "utter: $44" ## taking out what was matched
strsplit(str, '(?=[0-9]\\s+[A-Z])', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "3" " Sutter: $44" ## splitting at each component of the match
Is there a way to split it at the entire deliminator? So it returns:
# [1] "Glenn: $53" "Sutter: $44"

We can use a regex lookaround to split at one ore more spaces (\\s+) before an upper case letter and after a digit
strsplit(str, "(?<=[0-9])\\s+(?=[A-Z])", perl = TRUE)[[1]]
#[1] "Glenn: $53" "Sutter: $44"

My understanding is that you wish to split on spaces following strings comprise of a dollar sign followed by one or more digits, provided the spaces are followed by a letter.
By setting perl = true, you will use Perl's regex engine, which supports \K, which effectively means to discard everything matched so far. You therefore could use the following regex (with the case-indifferent flag set):
\$\d+\K\s+(?=[a-z])
Demo
In some cases, as here, \K can be used as a substitute for a variable-length lookbehind. Alas, most regex engines, including Perl's, do not support variable-length lookbehinds.

Related

Get substring before the second capital letter

Is there an R function to get only the part of a string before the 2nd capital character appears?
For example:
Example <- "MonkeysDogsCats"
Expected output should be:
"Monkeys"
Maybe something like
stringr::str_extract("MonkeysDogsCats", "[A-Z][a-z]*")
#[1] "Monkeys"
Here is an alternative approach:
Here we first put a space before all uppercase and then extract the first word:
library(stringr)
word(gsub("([a-z])([A-Z])","\\1 \\2", Example), 1)
[1] "Monkeys"
A base solution with sub():
x <- "MonkeysDogsCats"
sub("(?<=[a-z])[A-Z].*", "", x, perl = TRUE)
# [1] "Monkeys"
Another way using stringr::word():
stringr::word(x, 1, sep = "(?=[A-Z])\\B")
# [1] "Monkeys"
If the goal is strictly to capture any string before the 2nd capital character, one might want pick a solution it'll also work with all types of strings including numbers and special characters.
strings <- c("MonkeysDogsCats",
"M4DogsCats",
"M?DogsCats")
stringr::str_remove(strings, "(?<=.)[A-Z].*")
Output:
[1] "Monkeys" "M4" "M?"
It depends on what you want to allow to match. You can for example match an uppercase char [A-Z] optionally followed by any character that is not an uppercase character [^A-Z]*
If you don't want to allow whitespace chars, you can exclude them [^A-Z\\s]*
library(stringr)
str_extract("MonkeysDogsCats", "[A-Z][^A-Z]*")
Output
[1] "Monkeys"
R demo
If there should be an uppercase character following, and there are only lowercase characters allowed:
str <- "MonkeysDogsCats"
regmatches(str, regexpr("[A-Z][a-z]*(?=[A-Z])", str, perl = TRUE))
Output
[1] "Monkeys"
R demo

Split string in parts by minus and plus in R

I want to split this string:
test = "-1x^2+3x^3-x^8+1-x"
...into parts by plus and minus characters in R. My goal would be to get:
"-1x^2" "+3x^3" "-x^8" "+1" "-x"
This didn't work:
strsplit(test, split = "-")
strsplit(test, split = "+")
We can provide a regular expression in strsplit, where we use ?= to lookahead to find the plus or minus sign, then split on that character. This will allow for the character itself to be retained rather than being dropped in the split.
strsplit(x, "(?<=.)(?=[+])|(?<=.)(?=[-])",perl = TRUE)
# [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
Try
> strsplit(test, split = "(?<=.)(?=[+-])", perl = TRUE)[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
where (?<=.)(?=[+-]) captures the spliter that happens to be in front of + or -.
This uses gsub to search for any character followed by + or - and inserts a semicolon between the two characters. Then it splits on semicolon.
s <- "-1x^2+3x^3-x^8+1-x"
strsplit(gsub("(.)([+-])", "\\1;\\2", s), ";")[[1]]
## [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
In your examples, you use strsplit with a plus and a minus sign which will split on every encounter.
You could assert that what is directly to the left is not either the start of the string or + or -, while asserting + and - directly to the right.
(?<!^|[+-])(?=[+-])
Explanation
(?<! Negative lookabehind assertion
^ Start of string
| Or - [+-] Match either + or - using a character class
) Close lookbehind
(?= Positive lookahead assertion
[+-] Match either + or -
) Close lookahead
As the pattern uses lookaround assertions, you have to use perl = T to use a perl style regex.
Example
test <- "-1x^2+3x^3-x^8+1-x"
strsplit(test, split = "(?<!^|[\\s+-])(?=[+-])", perl = T)
Output
[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
See a online R demo.
If there can also not be a space to the left, you can write the pattern as
(?<!^|[\\s+-])(?=[+-])
See a regex demo.

Extract string using `rm_between` function

I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I cant figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
Regex demo
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the left most HUMAN right after.
Note the default value for the FIXED argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern, by setting fixed=FALSE. However, the ^(?:[^|]*\|){2} as the left argument regex will not work because the qdap package creates an ICU regex with lookarounds (since you use extract=TRUE that sets include.markers to FALSE), which is (?<=^(?:[^|]*\|){2}).*?(?=HUMAN).
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than a 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - startof strig
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
this is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution is an application of regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 matches whatever was matched inside the first parenthesis.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The part (.*?) is replaced by itself as the replacement contains the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a mamixmum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = T),
function(item) strsplit(item[3], "_HUMAN", fixed = T)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries varying a lot. The only things they have in common is "orf|" or "sp|" at the beginning and that "|" is on the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ,', remove the last character withsubstr, andpastethelist` elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | and the end of the string and removes it. The outer gsub looks for ,| and replaces with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).

Camel Case format conversion using regular expressions in R

I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay
We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
For the second case, we use regex lookarounds i.e. match a letter ((?<=[A-Za-z])) followed by a capital letter and replace with _.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")

Resources