Extract substring between two colons / special characters - r

I want to extract "SUBSTRING" with sub() from the following string:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
I used the following code, but unfortunately it didn't work:
sub(".*: (.*) : .*", "\\1", attribute)
Does anyone know an answer for that?

You may use
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
See the regex demo
You need to rely on a negated character class, [^:], which matches any char but :, because .* greedily matches any 0 or more chars and runs past the first colon. Also, your pattern contains a space before the second : that is missing in the string.
Details
^ - start of string
[^:]* - any 0+ chars other than :
: - a colon with a space
([^:]*) - Capturing group 1 (\1 refers to this value): any 0+ chars other than :
.* - the rest of the string.
R Demo:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
## => [1] "SUBSTRING"
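To see why the stray space matters, here is a small sketch that runs the original attempt next to the fixed pattern. The strsplit() variant at the end is not part of the answer above; it is an alternative that assumes the wanted value is always the second colon-separated field.

```r
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."

# Original attempt: the string never contains " : " (a space before a colon),
# so the pattern does not match and sub() returns the input unchanged.
unchanged <- sub(".*: (.*) : .*", "\\1", attribute)
identical(unchanged, attribute)
## => [1] TRUE

# Fixed pattern from the answer.
extracted <- sub("^[^:]*: ([^:]*).*", "\\1", attribute)
extracted
## => [1] "SUBSTRING"

# Alternative without a regex (assumption: the value is always field 2).
fields <- strsplit(attribute, ":", fixed = TRUE)[[1]]
second <- trimws(fields[2])
second
## => [1] "SUBSTRING"
```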

Related

Extract digits after matching the certain string second time

I want to extract the digits after the second occurrence of an underscore _ in a string.
Following similar posts here
Matching different digits after a lookahead
regex - return all before the second occurrence
I tried
library(stringr)
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
str_extract(pattern, "(^_){2}(\\d+\\.*\\d*)")
which outputs
[1] NA
instead of 1400. Could you help?
You may use a base R solution with regexpr/regmatches (x is the input character vector, called pattern in the question):
regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl=TRUE))
Or, with sub:
sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
See the R demo online.
The regex is
^(?:[^_]*_){2}[^_0-9]*\K\d*\.?\d+
See the online regex demo.
Details
^ - start of string
(?:[^_]*_){2} - 2 repetitions of
[^_]* - any 0+ chars other than _
_ - an underscore
[^_0-9]* - any 0+ chars other than _ and digits
\K - match reset operator discarding all text matched so far
\d*\.?\d+ - a float or integer number pattern (0+ digits, an optional . and then 1+ digits).
In the sub regex variation, the \K is not necessary: the number pattern is captured into a capturing group and the rest of the string is matched with the .* pattern. The result is the contents of Group 1, referred to with the \1 placeholder.
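As a quick check, the two variants can be run side by side on the string from the question. This is only a sketch; the "abc" input is a made-up value added to show how the two approaches differ when nothing matches.

```r
x <- "1/2/3_500k/855kk_1400k/AVBB"

# regmatches/regexpr with \K: returns only the matched number.
m <- regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl = TRUE))
m
## => [1] "1400"

# sub with a capturing group: replaces the whole string with Group 1.
s <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
s
## => [1] "1400"

# Difference on a non-matching input (hypothetical value):
# regmatches drops it, sub returns it unchanged.
y <- "abc"
m_none <- regmatches(y, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", y, perl = TRUE))
s_none <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", y)
length(m_none)
## => [1] 0
s_none
## => [1] "abc"
```

If the result must stay aligned with the input vector, the sub version is usually easier to work with, but non-matching elements then need a separate check.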
One option could be as:
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
sub(".*_.*_(\\d+).*","\\1", pattern, perl = TRUE)
[1] "1400"
The regex is:
".*_.*_(\\d+).*"
Details:
.*_ anything before first _
.*_ anything after first _ and before 2nd _
\\d+ look for digits and take those as selection.
.* anything afterwards.
\\1 replaces matching strings with values found for 1st group.

Replace a specific character only between parenthesis

Let's say I have a string:
test <- "(pop+corn)-bread+salt"
I want to replace only the plus sign that is between parentheses with '|', so I get:
"(pop|corn)-bread+salt"
I tried:
gsub("([+])","\\|",test)
But it replaces all the plus signs of the string (obviously)
If you want to replace all + symbols that are inside parentheses (there may be 1 or more), you can use any of the following solutions (x is the input string, called test in the question):
gsub("\\+(?=[^()]*\\))", "|", x, perl=TRUE)
See the regex demo. Here, the + is only matched when it is followed by any 0+ chars other than ( and ) (with [^()]*) and then a ). It is only reliable if the input is well-formed and there are no nested parentheses, as it does not check whether there was an opening (.
gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl=TRUE)
This is a safer solution since it starts matching + only if there was a starting (. See the regex demo. In this pattern, (?:\G(?!^)|\() matches the end of the previous match (\G(?!^)) or (|) a (, then [^()]*? matches any 0+ chars other than ( and ) chars, and then \K discards all the matched text and \+ matches a + that will be consumed and replaced. It still does not handle nested parentheses.
Also, see an online R demo for the above two solutions.
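A short sketch of the two patterns on the string from the question. The second input, y, is a made-up malformed string (no opening parenthesis) added to illustrate the difference described above.

```r
x <- "(pop+corn)-bread+salt"

# Lookahead version: a + that is followed by a ) with no ( or ) in between.
a <- gsub("\\+(?=[^()]*\\))", "|", x, perl = TRUE)
a
## => [1] "(pop|corn)-bread+salt"

# \G version: only starts matching after an opening (.
b <- gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl = TRUE)
b
## => [1] "(pop|corn)-bread+salt"

# Hypothetical malformed input with a closing ) but no opening (.
y <- "pop+corn)-bread+salt"
a_bad <- gsub("\\+(?=[^()]*\\))", "|", y, perl = TRUE)
b_bad <- gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", y, perl = TRUE)
a_bad
## => [1] "pop|corn)-bread+salt"
b_bad
## => [1] "pop+corn)-bread+salt"
```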
library(gsubfn)
s <- "(pop(+corn)+unicorn)-bread+salt+malt"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("+", "|", m, fixed=TRUE), s, perl=TRUE, backref=0)
## => [1] "(pop(|corn)|unicorn)-bread+salt+malt"
This solves the problem of matching nested parentheses, but requires the gsubfn package. See another regex demo. See this regex description here.
Note that if you do not have to match nested parentheses, you may use the "\\([^()]*\\)" regex with the gsubfn code above. The \([^()]*\) regex matches (, then any zero or more chars other than ( and ) (replace with [^)]* to also allow ( inside), and then a ).
We can try
sub("(\\([^+]+)\\+","\\1|", test)
#[1] "(pop|corn)-bread+salt"

Replace some text after a string with Regex and Gsub in R

It's a simple question, but I'm not good with Regex. (I tried many expressions without success)
I want to replace all the text (replace for nothing) after a pattern.
My pattern is something like this:
/canais/*/
My data is:
/canais/b3/conheca-o-pai-dos-indices-da-b3/
/canais/cpbs/cvm-abre-audiencia-publica-de-instruc
/canais/stocche-forbes/dividendo-controverso/
The desired result is:
/canais/b3/
/canais/cpbs/
/canais/stocche-forbes/
How can I do it with gsub?
Thanks
You may use the following sub:
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/","/canais/cpbs/cvm-abre-audiencia-publica-de-instruc","/canais/stocche-forbes/dividendo-controverso/")
sub("^(/canais/[^/]+/).*", "\\1", x)
See the online R demo
Details:
^ - start of string
(/canais/[^/]+/) - Group 1 (later referred to with \1) capturing:
/canais/ - a substring /canais/
[^/]+ - 1 or more chars other than /
/ - a slash
.* - any 0+ chars up to the end of string.
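For reference, this is what the call returns on the sample data. The regmatches/regexpr variant and the "/outros/x/y" value are additions for illustration: sub leaves non-matching elements untouched, while regmatches drops them.

```r
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/",
       "/canais/cpbs/cvm-abre-audiencia-publica-de-instruc",
       "/canais/stocche-forbes/dividendo-controverso/")

res <- sub("^(/canais/[^/]+/).*", "\\1", x)
res
## => [1] "/canais/b3/"             "/canais/cpbs/"           "/canais/stocche-forbes/"

# Hypothetical element that does not start with /canais/.
x2 <- c(x, "/outros/x/y")

# sub keeps the non-matching element as it is.
kept <- sub("^(/canais/[^/]+/).*", "\\1", x2)
kept[4]
## => [1] "/outros/x/y"

# regmatches returns only the elements that matched.
only_matches <- regmatches(x2, regexpr("^/canais/[^/]+/", x2))
length(only_matches)
## => [1] 3
```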

Remove all characters after the 2nd occurrence of "-" in each element of a vector

I would like to remove all characters after the 2nd occurrence of "-" in each element of a vector.
Initial string
aa-bbb-cccc => aa-bbb
aa-vvv-vv => aa-vvv
aa-ddd => aa-ddd
Any help?
Judging by the sample input and expected output, I assume you need to remove everything starting from the 2nd hyphen.
You may use
sub("^([^-]*-[^-]*).*", "\\1", x)
See the regex demo
Details:
^ - start of string
([^-]*-[^-]*) - Group 1 capturing 0+ chars other than -, - and 0+ chars other than -
.* - any 0+ chars (in a TRE regex like this, a dot matches line break chars, too.)
The \\1 (\1) is a backreference to the text captured into Group 1.
R demo:
x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
sub("^([^-]*-[^-]*).*", "\\1", x)
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"
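If the occurrence number may vary, the same idea can be wrapped in a small function. This is only a sketch: truncate_at_nth_hyphen is a hypothetical helper name, it uses perl = TRUE for the non-capturing group, and it leaves elements with fewer than n - 1 hyphens unchanged.

```r
# Hypothetical helper: drop everything from the n-th "-" onwards.
truncate_at_nth_hyphen <- function(x, n) {
  stopifnot(n >= 1)
  # Keep the first field plus (n - 1) further "-field" chunks.
  pattern <- sprintf("^([^-]*(?:-[^-]*){%d}).*", as.integer(n) - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
second <- truncate_at_nth_hyphen(x, 2)
second
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"
first <- truncate_at_nth_hyphen(x, 1)
first
## => [1] "aa" "aa" "aa"
```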

Regex - Substitute character in a matching substring

Let's say I have the following string:
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
I need to replace the white-spaces with underscores, but only in the substrings that match a pattern. (In this case the pattern would be a semi-colon before and after.)
The expected output should be:
output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?
I tried:
gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub
Couldn't put them both together though.
Regarding the original question:
Substitute character in a matching substring
You may do it easily with gsubfn:
> library(gsubfn)
> input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
> gsubfn(";([^;]+);", function(g1) paste0(";",gsub(" ", "-", g1, fixed=TRUE),";"), input)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
The ;([^;]+); pattern matches a ; and everything up to the next ;, capturing the text in between; the function then replaces the spaces with hyphens only inside the captured part (use "_" as the replacement to get the underscores the question asks for).
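If installing gsubfn is not an option, a similar result can be sketched in base R with gregexpr and the replacement form of regmatches. This is an alternative not taken from the answer above, and it uses "_" as the replacement to match the expected output in the question.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"

# Locate every ;...; chunk.
m <- gregexpr(";[^;]+;", input)

# Replace spaces only inside the located chunks, then write them back.
output <- input
regmatches(output, m) <- lapply(
  regmatches(output, m),
  function(s) gsub(" ", "_", s, fixed = TRUE)
)
output
## => [1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```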
Another approach is to use a PCRE regex with a \G based regex with gsub:
> p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
> gsub(p, "-", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
See the online regex demo
Pattern details:
(?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
(?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
[^;\\s]* - 0+ chars other than ; and whitespaces
\\K - omitting the text matched so far
\\s - 1 single whitespace character (if multiple whitespaces are to be replaced with 1 hyphen, add + after it).
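The same pattern can be used with "_" to produce the output requested in the question. The sketch below also shows the + variant mentioned in the last item; the string with a double space is a made-up example.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
p <- "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"

# One underscore per whitespace character between the semicolons.
out <- gsub(p, "_", input, perl = TRUE)
out
## => [1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"

# With \\s+ a run of whitespace collapses into a single underscore.
p_plus <- paste0(p, "+")
out_plus <- gsub(p_plus, "_", "a;B  C;d", perl = TRUE)
out_plus
## => [1] "a;B_C;d"
```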
