Split string in parts by minus and plus in R

I want to split this string:
test = "-1x^2+3x^3-x^8+1-x"
...into parts by plus and minus characters in R. My goal would be to get:
"-1x^2" "+3x^3" "-x^8" "+1" "-x"
This didn't work:
strsplit(test, split = "-")
strsplit(test, split = "+")

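We can pass a regular expression to strsplit that uses a lookahead, (?=...), to find the position just before a plus or minus sign and split there. Because lookarounds match a position rather than a character, the sign itself is retained rather than being dropped in the split. The lookbehind (?<=.) requires at least one character before that position, so the string is not split ahead of its leading sign.
strsplit(test, "(?<=.)(?=[+])|(?<=.)(?=[-])", perl = TRUE)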
# [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"

Try
> strsplit(test, split = "(?<=.)(?=[+-])", perl = TRUE)[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
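where (?<=.)(?=[+-]) matches the zero-width position directly in front of a + or -, provided at least one character precedes it. The split happens at that position, so no character is consumed.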

This uses gsub to search for any character followed by + or - and inserts a semicolon between the two characters. Then it splits on semicolon.
s <- "-1x^2+3x^3-x^8+1-x"
strsplit(gsub("(.)([+-])", "\\1;\\2", s), ";")[[1]]
## [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
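
The same result can also be reached by extracting the terms instead of splitting. The following is a minimal sketch, not taken from the answers above, assuming every term is an optional leading sign followed by characters that are neither + nor -. The name terms is only illustrative.

```r
s <- "-1x^2+3x^3-x^8+1-x"
# Match an optional sign followed by a run of non-sign characters,
# then pull all matches out of the string.
terms <- regmatches(s, gregexpr("[+-]?[^+-]+", s))[[1]]
terms
## [1] "-1x^2" "+3x^3" "-x^8"  "+1"    "-x"
```

Since nothing is split here, no lookarounds and no perl = TRUE are needed.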

In your examples, you use strsplit with a plus and a minus sign which will split on every encounter.
You could assert that what is directly to the left is not either the start of the string or + or -, while asserting + and - directly to the right.
(?<!^|[+-])(?=[+-])
Explanation
(?<! Negative lookbehind assertion
^ Start of string
| Or
[+-] Match either + or - using a character class
) Close lookbehind
(?= Positive lookahead assertion
[+-] Match either + or -
) Close lookahead
As the pattern uses lookaround assertions, you have to set perl = TRUE to use a Perl-style regex.
Example
test <- "-1x^2+3x^3-x^8+1-x"
strsplit(test, split = "(?<!^|[+-])(?=[+-])", perl = TRUE)
Output
[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
If there can also not be a space to the left, you can write the pattern as
(?<!^|[\\s+-])(?=[+-])

Related

R Sub function: pull everything after second number

Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04"
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
sub will find and replace the first occurrence of
\D* (=[^0-9]*) - zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you want to remove all after the second number:
^\D+\d+(\D+\d+).*
As an alternative, you can get only the match itself by setting perl = TRUE to enable PCRE and using \K to discard what has been matched so far:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = TRUE))
Output
[1] "imga-04"
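
The patterns above can be generalized to "everything after the n-th number". This is only a sketch under stated assumptions: drop_through_nth_number is a hypothetical helper name, and a "number" is taken to mean a run of digits.

```r
# Hypothetical helper: remove everything up to and including the n-th
# run of digits. (?:...){n} repeats "non-digits, then digits" n times.
drop_through_nth_number <- function(x, n) {
  pattern <- sprintf("^(?:\\D*\\d+){%d}", as.integer(n))
  sub(pattern, "", x, perl = TRUE)
}

drop_through_nth_number("john02imga-04", 1)
## [1] "imga-04"
drop_through_nth_number("john02imga-04.png", 2)
## [1] ".png"
```

If the string contains fewer than n numbers, the pattern does not match and the input is returned unchanged.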

Delete string parts within delimiter

I have a string as"dfgdf" sa"2323":
a <- "as\"dfgdf\" sa\"2323\""
The delimiter (the same at the start and the end) here is ". What I want is a string where everything between the delimiters is deleted, but not the delimiters themselves. The end result should look like as"" sa""
You could match " and then discard what has been matched so far using \K.
Then use a negated character class to match one or more characters other than " or whitespace, followed by a lookahead asserting a " directly to the right.
Use perl=TRUE to enable Perl-like regular expressions.
a <- "as\"dfgdf\" sa\"2323\""
gsub('"\\K[^"\\s]+(?=")', "", a, perl=TRUE)
Output
[1] "as\"\" sa\"\""
Here is another base R option using paste0 + strsplit
s <- paste0(paste0(unlist(strsplit(a, '"\\w+"')), '""'), collapse = "")
which gives
> s
[1] "as\"\" sa\"\""
> cat(s)
as"" sa""
Here is one option with regex lookarounds: match a word (\\w+) that is preceded by a double quote and followed by one, and replace it with an empty string ("")
cat(gsub('(?<=")\\w+(?=")', "", a, perl = TRUE), "\n")
#as"" sa""
Or without regex lookaround
cat(gsub('"\\w+"', '""', a), "\n")
#as"" sa""
I also found a way with stringr library:
library(stringr)
a <- "as\"dfgdf\" sa\"2323\""
result <- str_replace_all(a, "\".*?\"", "\"\"")
cat(result)
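
The \\w+ based patterns above only match quoted text made up of word characters. The sketch below uses a hypothetical input b, whose quoted parts contain a space and a hyphen, to show the difference from a negated character class.

```r
# Hypothetical input: the quoted parts contain a space and a hyphen.
b <- "as\"df gdf\" sa\"23-23\""

# \\w+ cannot cross the space or the hyphen, so nothing is replaced.
cat(gsub('"\\w+"', '""', b), "\n")
#as"df gdf" sa"23-23"

# [^"]* matches any run of non-quote characters, including an empty one.
cat(gsub('"[^"]*"', '""', b), "\n")
#as"" sa""
```

Which pattern is appropriate depends on what the delimited text may contain.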

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
For example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
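We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and use str_remove to remove that substring. Note that this removes text only up to the first |, so it assumes the string contains a single pipe.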
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If [A-Z] is also included in the character class, as in the OP's str_replace_all, every upper-case letter is matched and replaced with an empty string (""), which is why the A went missing.
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
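I would keep it simple (regexpr returns the position of the first |, so this assumes the string contains only one; my_string holds the input):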
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
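
The answers above differ once a string contains more than one pipe. As a small sketch, with a hypothetical input x chosen only for illustration:

```r
# Hypothetical input with more than one pipe.
x <- "a|b|BLCU142-09|Apodemia_mejicanus"

# regexpr() finds the first pipe, so everything after it is kept.
substring(x, regexpr("|", x, fixed = TRUE) + 1L)
## [1] "b|BLCU142-09|Apodemia_mejicanus"

# The greedy .* runs up to the last pipe.
sub(".*\\|", "", x)
## [1] "Apodemia_mejicanus"

# Splitting on the literal pipe and keeping the last piece also works.
tail(strsplit(x, "|", fixed = TRUE)[[1]], 1)
## [1] "Apodemia_mejicanus"
```

Only the last two forms match the "after the last occurrence" requirement in the question title.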

How to insert a white space before open bracket

I have a string 3.4(2.5-4.7), I want to insert a white space before the open bracket "(" so that the string becomes 3.4 (2.5-4.7).
Any idea how this could be done in R?
x <- "3.4(2.5-4.7)"
sub("(.*)(?=\\()", "\\1 ", x, perl = T)
[1] "3.4 (2.5-4.7)"
This regex is based on a lookahead: it creates one capturing group that subsumes everything up to the lookahead, namely the opening parenthesis ((?=\\()), recalls it, and inserts one whitespace after it in the replacement argument to sub (which is enough unless you need more than one such substitution per string, in which case gsub is needed). The argument perl = T (short for perl = TRUE) needs to be added to enable the lookahead.
EDIT:
If you have a string like this:
x <- "3.4(2.5to4.7)"
the regex gets slightly more complex; the underlying idea though remains the same: you divide the string into different capturing groups (...), which you then recall using the appropriate backreferences in the replacement argument while adding the sought spaces:
sub("(.*)(\\(\\d+\\.\\d+)(to)(\\d+\\.\\d+\\))", "\\1 \\2 \\3 \\4", x)
[1] "3.4 (2.5 to 4.7)"
EDIT2:
x <- '3.4(2.5,4.7)'
sub("(.*)(\\(\\d+\\.\\d+)(,)(\\d+\\.\\d+\\))", "\\1 \\2\\3 \\4", x)
[1] "3.4 (2.5, 4.7)"
EDIT3:
x <- '3(2,4)'
sub("(.*)(\\(\\d+)(,)(\\d+)", "\\1 \\2\\3 \\4", x)
A very short way uses sub, which will substitute the first open bracket ( with a space followed by an open bracket, i.e. " (".
x <- '3.4(2.5-4.7)'
sub("\\(", " (", x)
# [1] "3.4 (2.5-4.7)"
Alternatively, you can specify the argument fixed = TRUE which considers the pattern as fixed and not as a regular expression.
x <- '3.4(2.5-4.7)'
sub("(", " (", x, fixed = TRUE)
# [1] "3.4 (2.5-4.7)"
Try
gsub('(.*)(\\(.*\\))', '\\1 \\2', '3.4(2.5-4.7)')
#[1] "3.4 (2.5-4.7)"
The regex creates two groups. The first group, (.*), greedily takes everything before the opening parenthesis, and the second group, (\\(.*\\)), takes the parenthesized part, from the opening parenthesis to the last closing one. Note that the parentheses need to be escaped, so we use \\( and \\). We then join the two groups with a space between them using \\1 \\2
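
If a string can contain several brackets, or may already have a space before one, a lookbehind keeps the substitution from adding a second space. This is a sketch with hypothetical inputs, not one of the answers above; the name spaced is illustrative.

```r
x <- c("3.4(2.5-4.7)", "3.4 (2.5-4.7)", "1(2) and 3(4)")
# (?<=\\S) asserts a non-whitespace character directly before the "(",
# so brackets that already follow a space are left alone.
spaced <- gsub("(?<=\\S)\\(", " (", x, perl = TRUE)
spaced
## [1] "3.4 (2.5-4.7)"   "3.4 (2.5-4.7)"   "1 (2) and 3 (4)"
```

Because already-spaced brackets are skipped, running the call a second time leaves the result unchanged.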

split at entire deliminator but not each component of deliminator

I want to split a string and keep the text where it's being split.
str = 'Glenn: $53 Sutter: $44'
strsplit(str, '[0-9]\\s+[A-Z]', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "utter: $44" ## taking out what was matched
strsplit(str, '(?=[0-9]\\s+[A-Z])', perl = TRUE)
# [[1]]
# [1] "Glenn: $5" "3" " Sutter: $44" ## splitting at each component of the match
Is there a way to split it at the entire delimiter? So it returns:
# [1] "Glenn: $53" "Sutter: $44"
We can use regex lookarounds to split at one or more spaces (\\s+) that come after a digit and before an upper-case letter
strsplit(str, "(?<=[0-9])\\s+(?=[A-Z])", perl = TRUE)[[1]]
#[1] "Glenn: $53" "Sutter: $44"
My understanding is that you wish to split on spaces that follow a dollar sign and one or more digits, provided the spaces are followed by a letter.
By setting perl = TRUE, you will use a Perl-compatible regex engine, which supports \K, which effectively means to discard everything matched so far. You could therefore use the following regex (with the case-insensitive flag set):
\$\d+\K\s+(?=[a-z])
In some cases, as here, \K can be used as a substitute for a variable-length lookbehind. Alas, most regex engines, including Perl's, do not support variable-length lookbehinds.
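
In R, the \K pattern could be passed to strsplit roughly as follows. This is a sketch rather than part of the answer above: it spells out [A-Za-z] instead of setting a case-insensitive flag, and the name parts is illustrative. It is expected to yield the two pieces asked for in the question.

```r
str <- 'Glenn: $53 Sutter: $44'
# \\$\\d+ matches the dollar amount and \\K drops it from the reported
# match, so only the following whitespace is treated as the split point.
parts <- strsplit(str, "\\$\\d+\\K\\s+(?=[A-Za-z])", perl = TRUE)[[1]]
parts
```

Unlike the fixed-width lookbehind (?<=[0-9]), this form can require the whole "$" plus digits prefix, whatever its length.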
