R keep a character part of the selection gsub [duplicate] - r

s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of central "X" but not including the "X".
Note that "X" is always in the center with 6 letters on each side.
I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?

Your regex "([A-Z]{4})(X)([A-Z]{4})" does match, but only the middle of your string: gsub replaces just the matched part and leaves the characters before and after it untouched. So you can add .* on both sides to match any character (.) 0 or more times (*) and make the pattern consume the whole string.
You can then reference the groups in the gsub replacement using \\n, where n is the number of the capture group:
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
which is basically matching the entire string and replacing it with whatever was captured in groups 1 and 3 and pasting that together.
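To see what the surrounding .* contributes, here is a small comparison. It is only an illustrative sketch; the variable names middle_only and whole_string are made up for this example.

```r
s <- "YXABCDXABCDYX"

# Without .* only the matched middle part is rewritten;
# the leading "YX" and the trailing "YX" are left in place.
middle_only <- gsub('([A-Z]{4})X([A-Z]{4})', '\\1\\2', s)

# With .* on both sides the match covers the whole string,
# so only the two captured groups survive the replacement.
whole_string <- gsub('.*([A-Z]{4})X([A-Z]{4}).*', '\\1\\2', s)

print(middle_only)   # "YXABCDABCDYX"
print(whole_string)  # "ABCDABCD"
```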
Another way would be to use (?i), which turns on case-insensitive matching, along with [a-z] or \\w:
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or gsub('.*(.{4})X(.{4}).*', '\\1\\2', s) if you like dots
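If you would rather pull the groups out than replace around them, base R's regexec() together with regmatches() is one option. This is a minimal sketch, not part of the answers above; the names m and result are just for illustration.

```r
s <- "YXABCDXABCDYX"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings (one list element per input).
m <- regmatches(s, regexec('([A-Z]{4})X([A-Z]{4})', s))[[1]]

# m[1] is the full match, m[2] and m[3] are the two capture groups.
result <- paste0(m[2], m[3])
print(result)  # "ABCDABCD"
```

No .* is needed here because nothing is being replaced; only the captured pieces are read.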

Related

replace positions of elements in a string using R

I have a string:
str = 'Mr[5]'
I want to switch the positions of Mr and 5 in str, and get a result like this:
result = '[5]Mr'
How can I do this in R?
You can use a regex with two capture groups and swap their positions in the replacement.
The stringr package helps with string manipulation.
s <- c("Mr[5]", "Mr[3245]", "Mrs[98j]")
stringr::str_replace_all(s, "^(.*)(\\[.*\\])$", "\\2\\1")
#> [1] "[5]Mr" "[3245]Mr" "[98j]Mrs"
About the regex:
^ is the beginning of the string and $ is the end
.* matches any character, zero or more times
( and ) define a capture group
\\[ and \\] match literal brackets
together you get a simple regex that matches, for example, Mr followed by [5]: "(.*)(\\[.*\\])"
\\1 refers to the first capture group, \\2 refers to the second; \\2\\1 swaps the groups
Obviously, you can write a stricter regex that fits your needs more precisely. The mechanism with capture groups stays the same. regex101 is a good site for experimenting with regexes.
For R, the stringr website has a nice introduction to regular expressions.
You can use gsub:
values <- c("Mr[5]","Mr[1234]", "Mrs[456]")
values2 <- gsub("^(.+)(\\[[0-9]+\\])$", "\\2\\1", values)
# > values2
# [1] "[5]Mr" "[1234]Mr" "[456]Mrs"
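One thing worth knowing about this approach: elements that do not fit the pattern are returned unchanged, not dropped or set to NA. A small sketch with made-up inputs ("NoBrackets" and "Dr[12]x" are hypothetical values, not from the question):

```r
values <- c("Mr[5]", "NoBrackets", "Dr[12]x")
pattern <- "^(.+)(\\[[0-9]+\\])$"

# Only "Mr[5]" ends in a bracketed number, so only it is swapped.
swapped <- sub(pattern, "\\2\\1", values)
print(swapped)  # "[5]Mr" "NoBrackets" "Dr[12]x"

# grepl() with the same pattern shows which elements were left untouched.
untouched <- values[!grepl(pattern, values)]
print(untouched)  # "NoBrackets" "Dr[12]x"
```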

How to extract first occurrence of alphabets in a string in R?

I have a character column having values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract characters "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC",
"CHELSEAFC", "TOTTENHAMFC", : replacement has 11366767 rows, data
has 11366772 Calls: $<- -> $<-.data.frame Execution halted
I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)
Assuming that we need to extract the non-numeric part, one option is to remove the other characters by matching one or more numbers ([0-9]+) followed by other characters (.*) and replace it with ""
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
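Instead of [A-z]+ you should use ^[A-Za-z]+. The range A-z also covers the ASCII characters that sit between Z and a ([, \, ], ^, _ and `), so it matches more than letters. See https://stackoverflow.com/a/29771926/4082217 for more on why you shouldn't use it.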
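A quick way to check the difference between the two character classes is shown below. This is an illustrative sketch; perl = TRUE is my own addition so that the range is interpreted by character code regardless of locale, and the names chars, loose and strict are made up.

```r
chars <- c("A", "z", "_", "^", "[", "5")

# [A-z] spans everything from "A" to "z", including the symbols in between.
loose <- grepl("[A-z]", chars, perl = TRUE)

# [A-Za-z] only accepts the two letter ranges.
strict <- grepl("[A-Za-z]", chars, perl = TRUE)

print(loose)   # TRUE TRUE TRUE TRUE TRUE FALSE
print(strict)  # TRUE TRUE FALSE FALSE FALSE FALSE
```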
The error appears because you have some values in the input vector that do not contain letters (and some symbols that [A-z] matches). That makes regmatches return no value in case there is no match, and thus, assigning the column values becomes impossible as the number of matches does not coincide with the number of rows in the data frame.
What you may do is:
1) Use sub
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
sub("^([a-zA-Z]+).*|.*", "\\1", x)
# [1] "" "CHELSEAFC" "BARCAFC"
# For the column from the question (where x is the data frame):
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern will match and capture one or more ASCII letters (replace [a-zA-Z]+ with [[:alpha:]]+ to match letters other than ASCII, too) at the start of the string (^), and .* will match the rest of the string, OR (|) the whole string will be matched by the second branch. Either way, the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).
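The length mismatch behind the error can also be reproduced, and avoided, in base R alone. The following is a sketch under the assumption that NA is the desired value for rows without a match; the names m, found and prefix are illustrative.

```r
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

# regexpr() returns -1 for elements without a match ...
m <- regexpr("^[A-Za-z]+", x)

# ... and regmatches() silently drops those elements,
# so the result is shorter than the input.
n_matches <- length(regmatches(x, m))
print(n_matches)  # 2

# Pre-fill with NA and write the matches back only where one was found,
# which keeps the result the same length as the input.
found <- as.integer(m) != -1L
prefix <- rep(NA_character_, length(x))
prefix[found] <- regmatches(x, m)
print(prefix)  # NA "CHELSEAFC" "BARCAFC"
```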

regular expression: remove consecutive repeated characters at least 2 times as well as those after it in a string in R

I have a vector with different strings like this:
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!")
and
> s
[1] "mir123mm8" "qwe98wwww98" "123m3tqppppw23!"
I would like to have the answer like this:
> c("mir123", "qwe98", "123m3tq")
[1] "mir123" "qwe98" "123m3tq"
That is, if a string contains a character repeated at least twice in a row, the repeated run and everything after it should be removed.
What is the best way to do this with a regular expression in R?
You can use a backreference in the pattern to match repeated characters:
sub("(.*?)(.)\\2.*", "\\1", s)
# [1] "mir123" "qwe98" "123m3tq"
The pattern matches when the single character captured by the second group is immediately followed by itself (\\2). The ? makes the first capture group non-greedy, so it stops at the first repeated pair, and the replacement keeps only that first group.
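To make the behaviour on edge cases visible, here is a short sketch with two extra inputs that are not from the question ("abc" has no repeat, "aabc" starts with one). I pass perl = TRUE, which is my own choice to get PCRE's well-defined lazy quantifier semantics; the answer above uses the default engine.

```r
s <- c("mir123mm8", "abc", "aabc")

# (.)\\1 alone is enough to detect whether any character is doubled.
has_repeat <- grepl("(.)\\1", s, perl = TRUE)
print(has_repeat)  # TRUE FALSE TRUE

# Strings without a repeat do not match and are returned unchanged;
# a string that starts with a repeat is reduced to the empty string.
trimmed <- sub("(.*?)(.)\\2.*", "\\1", s, perl = TRUE)
print(trimmed)  # "mir123" "abc" ""
```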

How to extract character string dynamically from character vector r

Here are three character vectors:
[1] "Session_1/Focal_1_P1/240915_P1_S1_F1.csv"
[2] "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv"
[3] "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv"
I'm trying to extract the strings P1, PA10 and DA100, respectively, in a standardised manner (I have several hundred other strings from which I want to extract this).
I know I need to use regex but I'm fairly new to it and not exactly sure which one.
I can see that the commonalities are 6 digits (\d\d\d\d\d\d) followed by an _, then what I want, followed by another _.
How do I extract what I want? I believe with grep, but I am not 100% sure which regular expression I need.
We can use gsub. We match zero or more characters (.*) followed by a forward slash (\\/), followed by one or more digits and an underscore (\\d+_), or (|) two instances of an underscore followed by one or more characters that are not an underscore ((_[^_]+){2}), and replace each match with an empty string ("").
gsub(".*\\/\\d+_|(_[^_]+){2}", "", v1)
#[1] "P1" "PA10" "DA100"
Or we extract the basename of each path, match one or more digits followed by an underscore (\\d+_), then one or more characters that are not an underscore as a capture group (([^_]+)), followed by the remaining characters up to the end of the string, and replace the match with the backreference (\\1) for the captured group.
sub("\\d+_([^_]+).*", "\\1", basename(v1))
#[1] "P1" "PA10" "DA100"
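Since the wanted piece is always the second underscore-separated field of the file name, splitting is another possibility that avoids regex altogether. This is a sketch that assumes every file name follows that layout; parts and ids are illustrative names.

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# basename() drops the directories; strsplit() with fixed = TRUE
# splits each file name on literal underscores.
parts <- strsplit(basename(v1), "_", fixed = TRUE)

# Take the second field of each file name; vapply() guarantees
# one character value per input element.
ids <- vapply(parts, function(p) p[2], character(1))
print(ids)  # "P1" "PA10" "DA100"
```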
data
v1 <- c( "Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
"Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
"Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")
