replace positions of elements in a string using R - r

I have a string:
str = 'Mr[5]'
I want to switch the positions of Mr and 5 in str, and get a result like this:
result = '[5]Mr'
How can I do this in R?

You can use a regex with 2 matching group for which you change position.
stringr package helps with character manipulation.
s <- c("Mr[5]", "Mr[3245]", "Mrs[98j]")
stringr::str_replace_all(s, "^(.*)(\\[.*\\])$", "\\2\\1")
#> [1] "[5]Mr" "[3245]Mr" "[98j]Mrs"
about the regex
^ is the beginning of the string and $ the end
.* matches every character, zero or more time
( and ) define matching group
\\[ and \\] match literal bracket
together you have a simple regex that match for exemple Mr then [5] : "(.*)(\\[.*\\])"
\\1 refers to the first matching group, \\2 refers to the second. \\2\\1 inverse the groups
Obviously, you can create a better regex that fits precisely to your need. The mechanism with matching groups with remain. regex101 is a good site to help you with regex.
In R, stringr website have nice intro about regex

You can use gsub :
values <- c("Mr[5]","Mr[1234]", "Mrs[456]")
values2 <- gsub("^(.+)(\\[[0-9]+\\])$", "\\2\\1", values)
# > values2
# [1] "[5]Mr" "[1234]Mr" "[456]Mrs"

Related

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying yo use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!
Your ^([a-z])-([0-9]+)$ regex attempts to match strings that start with a letter, then have a - and then one or more digits. This can't work as there are no hyphens in the strings, you want to introduce it into the strings.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
See the regex demo and the online R demo.
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
Here is the version I like using sub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This requires using only one capture group.
Using stringr package
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')
You could also use lookbehinds if you set perl=TRUE:
> gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"
>

R keep a character part of the selection gsub [duplicate]

s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of central "X" but not including the "X".
Note that "X" is always in the center with 6 letters on each side.
I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?
Your regex "([A-Z]{4})(X)([A-Z]{4})" won't match your string since you have characters before the first capture group ([A-Z]{4}), so you can add .* to match any character (.) 0 or more times (*) until your first capture group.
You can reference the groups in gsub, for example, using \\n where n is the nth capture group
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
which is basically matching the entire string and replacing it with whatever was captured in groups 1 and 3 and pasting that together.
Another way would be to use (?i) which is case-insensitive matching along with [a-z] or \\w
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or gsub('.*(.{4})X(.{4}).*', '\\1\\2', s) if you like dots

Regex in R - is it possible to do a partial string substitution? [duplicate]

s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of central "X" but not including the "X".
Note that "X" is always in the center with 6 letters on each side.
I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?
Your regex "([A-Z]{4})(X)([A-Z]{4})" won't match your string since you have characters before the first capture group ([A-Z]{4}), so you can add .* to match any character (.) 0 or more times (*) until your first capture group.
You can reference the groups in gsub, for example, using \\n where n is the nth capture group
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
which is basically matching the entire string and replacing it with whatever was captured in groups 1 and 3 and pasting that together.
Another way would be to use (?i) which is case-insensitive matching along with [a-z] or \\w
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or gsub('.*(.{4})X(.{4}).*', '\\1\\2', s) if you like dots

How to substitute a character in multiples locations with R

I'm trying to split a dataframe with "," separators. However, some parts of the strings have the pattern [0-9][,][0-9]{2}, and i'd like to substitute only the comma inside, not the hole pattern, in order to preserve the numerical inputs.
I try to solve with stringr, but got stucked in the following pattern of error:
library(stringr)
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'
str_replace_all(string, "[0-9][,][0-9]{2}", "[0-9][;][0-9]{2}")
[1] "\"name: John\",\"age: 27\",\"height: [0-9][;][0-9]{2}\", \"weight: 7[0-9][;][0-9]{2}\""
I know it can be done with substitution by position, but the string is too big.
I'd appreciate any help. Thanks in advance.
You need to use capturing groups around the parts of the pattern you need to keep and, in the replacement pattern, refer to those submatches with backreferences:
> str_replace_all(string, "([0-9]),([0-9]{2})", "\\1;\\2")
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Or the same regex can be used with gsub:
> gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Details:
([0-9]) - capturing group 1, whose value is referred to using \\1 in the replacement pattern, matching a single digit
, - a comma
([0-9]{2}) - capturing group 2, whose value is referred to using \\2 in the replacement pattern, matching 2 digits.

How to extract first occurrence of alphabets in a string in R?

I have a character column having values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract characters "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC",
"CHELSEAFC", "TOTTENHAMFC", : replacement has 11366767 rows, data
has 11366772 Calls: $<- -> $<-.data.frame Execution halted
I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)
Assuming that we need to extract the non-numeric part, one option is to remove the other characters by matching one or more numbers ([0-9]+) followed by other characters (.*) and replace it with ""
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
Instead of [A-z]+ you should use ^[A-Za-z]+ Check this for more understanding why you shouldn't do that: https://stackoverflow.com/a/29771926/4082217
The error appears because you have some values in the input vector that do not contain letters (and some symbols that [A-z] matches). That makes regmatches return no value in case there is no match, and thus, assigning the column values becomes impossible as the number of matches does not coincide with the number of rows in the data frame.
What you may do is:
1) Use sub
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
> sub("^([a-zA-Z]+).*|.*", "\\1", df$x)
[1] "" "CHELSEAFC" "BARCAFC"
>
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern will match and capture one or more ASCII letters (replace [a-zA-Z]+ with [[:alpha:]]+ to match letters other than ASCII, too) at the start of the string (^), and .* will match the rest of the string, OR (|) the whole string will get matches with the second branch and the match will be replaced with the capturing group contents (so, it will be either filled with a letter value or will be empty).
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
> x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).

Resources