How to extract the first run of letters from a string in R?

I have a character column with values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract the leading letters, "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC",
"CHELSEAFC", "TOTTENHAMFC", : replacement has 11366767 rows, data
has 11366772 Calls: $<- -> $<-.data.frame Execution halted
I can't seem to find the problem row. Can somebody help me debug this or suggest a better way to do it? :)

Assuming that we need to extract the non-numeric part, one option is to remove everything from the first digit onward: match one or more digits ([0-9]+) followed by any remaining characters (.*) and replace the match with ""
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the run of upper-case letters as a group (([A-Z]+)) at the start (^) of the string, match the rest with .*, and replace the whole match with the backreference (\\1) to that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

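Instead of [A-z]+ you should use ^[A-Za-z]+. The range [A-z] also matches the six ASCII characters that sit between Z and a ([, \, ], ^, _ and `). See this answer for more detail on why you shouldn't use it: https://stackoverflow.com/a/29771926/4082217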
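As a quick check of the point above, the sketch below compares the two character classes on a few single-character strings. The variable names are only illustrative, and perl = TRUE is an assumption made here so that the range is interpreted by code point rather than by locale.

```r
# Characters that sit between "Z" and "a" in ASCII, plus one letter and one digit
chars <- c("[", "\\", "]", "^", "_", "`", "a", "1")

# perl = TRUE (an assumption of this sketch) makes ranges follow code point order
loose  <- grepl("[A-z]", chars, perl = TRUE)
strict <- grepl("[A-Za-z]", chars, perl = TRUE)

# Side-by-side view: loose is TRUE for the punctuation, strict is not
print(data.frame(char = chars, loose = loose, strict = strict))
```

Only the letter "a" is accepted by both classes; the punctuation is accepted by [A-z] alone.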

The error appears because some values in the input vector contain no letters (and none of the symbols that [A-z] also matches). regmatches returns nothing for an element without a match, so the result is shorter than the input, and assigning it to the column fails because the number of matches does not equal the number of rows in the data frame.
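To locate the problem rows, it may help to look at the regexpr result directly, since it is -1 wherever nothing matched. The following is a minimal base R sketch on a three-element stand-in vector; the names m, hits, no_match and out are hypothetical.

```r
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

m <- regexpr("^[A-Za-z]+", x)   # match positions; -1 means no match
hits <- regmatches(x, m)        # elements without a match are dropped

length(x)      # 3
length(hits)   # 2, hence the "replacement has ... rows" error

# as.vector() drops the attributes that regexpr() attaches
matched <- as.vector(m) != -1
no_match <- which(!matched)     # indices of the problem rows
x[no_match]

# Same-length result, with NA where nothing matched
out <- rep(NA_character_, length(x))
out[matched] <- hits
out
```

On the real data, which(!matched) would give the row numbers of the five values that have no leading letters.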
What you may do is:
1) Use sub
> x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
> sub("^([a-zA-Z]+).*|.*", "\\1", x)
[1] "" "CHELSEAFC" "BARCAFC"
# for the column from the question, where x is the data frame:
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern has two branches. The first matches and captures one or more ASCII letters (replace [a-zA-Z]+ with [[:alpha:]]+ to match letters other than ASCII, too) at the start of the string (^), and .* then matches the rest of the string. If that fails, the second branch (after |) matches the whole string. Either way the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
> x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).

Related

R keep a character part of the selection gsub [duplicate]

s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of central "X" but not including the "X".
Note that "X" is always in the center with 6 letters on each side.
I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?
Your regex "([A-Z]{4})(X)([A-Z]{4})" does match inside your string, but gsub only replaces the part that matched, so the characters before and after the match (YX on each side) are left in place. Add .* to match any character (.) 0 or more times (*) before the first capture group and after the last one, so that the whole string is consumed.
You can reference the groups in the gsub replacement using \\n, where n is the number of the capture group
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
which basically matches the entire string and replaces it with whatever was captured in groups 1 and 3, pasted together.
Another way would be to use (?i), which turns on case-insensitive matching, along with [a-z] or \\w
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or gsub('.*(.{4})X(.{4}).*', '\\1\\2', s) if you like dots
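For comparison, here is a short sketch of what happens without the surrounding .*, together with two base R alternatives that return the groups without a replacement step. The variable names are illustrative, and the position-based variant relies on the question's statement that X is always the 7th of 13 characters.

```r
s <- "YXABCDXABCDYX"

# Without the surrounding .*, only the matched middle part is rewritten
partial <- gsub("([A-Z]{4})(X)([A-Z]{4})", "\\1\\3", s)
partial

# regexec() + regmatches() return the full match followed by each capture group
parts <- regmatches(s, regexec("([A-Z]{4})X([A-Z]{4})", s))[[1]]
from_groups <- paste0(parts[2], parts[3])
from_groups

# Position-based alternative: characters 3-6 and 8-11 around the central X
from_positions <- paste0(substr(s, 3, 6), substr(s, 8, 11))
from_positions
```

The first call keeps the leading and trailing "YX", which is why the answer above wraps the pattern in .* on both sides.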

replace positions of elements in a string using R

I have a string:
str = 'Mr[5]'
I want to switch the positions of Mr and 5 in str, and get a result like this:
result = '[5]Mr'
How can I do this in R?
You can use a regex with two capture groups and swap their positions in the replacement.
The stringr package helps with string manipulation.
s <- c("Mr[5]", "Mr[3245]", "Mrs[98j]")
stringr::str_replace_all(s, "^(.*)(\\[.*\\])$", "\\2\\1")
#> [1] "[5]Mr" "[3245]Mr" "[98j]Mrs"
About the regex:
^ is the beginning of the string and $ is the end
.* matches any character, zero or more times
( and ) define a capture group
\\[ and \\] match literal brackets
together they give a simple regex that matches, for example, Mr and then [5]: "(.*)(\\[.*\\])"
\\1 refers to the first capture group and \\2 to the second, so \\2\\1 swaps the groups
Obviously, you can write a stricter regex that fits your needs more precisely; the mechanism with capture groups stays the same. regex101 is a good site for testing regexes.
For R, the stringr website has a nice introduction to regex.
You can use gsub:
values <- c("Mr[5]","Mr[1234]", "Mrs[456]")
values2 <- gsub("^(.+)(\\[[0-9]+\\])$", "\\2\\1", values)
# > values2
# [1] "[5]Mr" "[1234]Mr" "[456]Mrs"
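One thing worth knowing about the digit-only pattern is that sub and gsub return an element unchanged when the pattern does not match it. The sketch below shows this on made-up inputs (the second and third values are hypothetical) and uses grepl to flag which elements were actually rewritten.

```r
values <- c("Mr[5]", "Mrs[98j]", "NoBrackets")
pattern <- "^(.+)(\\[[0-9]+\\])$"

# Elements that do not match the pattern come back unchanged
swapped <- sub(pattern, "\\2\\1", values)
swapped

# TRUE only where the swap was applied
matched <- grepl(pattern, values)
matched
```

Checking matched afterwards is a cheap way to spot inputs that need a looser pattern, such as the "(.*)(\\[.*\\])" form from the first answer.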

How to substitute a character in multiples locations with R

I'm trying to split a data frame on "," separators. However, some parts of the strings match the pattern [0-9][,][0-9]{2}, and I'd like to replace only the comma inside it, not the whole pattern, in order to preserve the numeric values.
I tried to solve this with stringr, but got stuck on the following incorrect result:
library(stringr)
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'
str_replace_all(string, "[0-9][,][0-9]{2}", "[0-9][;][0-9]{2}")
[1] "\"name: John\",\"age: 27\",\"height: [0-9][;][0-9]{2}\", \"weight: 7[0-9][;][0-9]{2}\""
I know it can be done with substitution by position, but the string is too big.
I'd appreciate any help. Thanks in advance.
You need to use capturing groups around the parts of the pattern you need to keep and, in the replacement pattern, refer to those submatches with backreferences:
> str_replace_all(string, "([0-9]),([0-9]{2})", "\\1;\\2")
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Or the same regex can be used with gsub:
> gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Details:
([0-9]) - capturing group 1, whose value is referred to using \\1 in the replacement pattern, matching a single digit
, - a comma
([0-9]{2}) - capturing group 2, whose value is referred to using \\2 in the replacement pattern, matching 2 digits.
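As a follow-up sketch, the same replacement can also be written with lookarounds, so that only the comma itself is matched, and the cleaned string can then be split on the remaining commas. The lookaround form requires perl = TRUE; the split pattern and the variable names are assumptions made for this example, not part of the original answer.

```r
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'

# Capture-group version from the answer above
with_groups <- gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)

# Lookaround version: matches only a comma between a digit and two digits
with_lookarounds <- gsub("(?<=[0-9]),(?=[0-9]{2})", ";", string, perl = TRUE)
identical(with_groups, with_lookarounds)

# The remaining commas are real separators, so the string can now be split
fields <- strsplit(with_groups, ", *")[[1]]
fields <- gsub('"', "", fields, fixed = TRUE)  # drop the literal quote characters
fields
```

Both versions leave the separator commas untouched because those are preceded by a quote character rather than a digit.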

How to extract character string dynamically from character vector r

Here are three character vectors:
[1] "Session_1/Focal_1_P1/240915_P1_S1_F1.csv"
[2] "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv"
[3] "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv"
I'm trying to extract the strings P1, PA10 and DA100, respectively, in a standardised manner (as I have several hundred other strings from which I want to extract this).
I know I need to use regex but I'm fairly new to it and not exactly sure which one.
I can see that the commonalities are 6 digits (\d\d\d\d\d\d) followed by an _, then what I want, followed by another _.
How do I extract what I want? I believe with grep but am not 100% on the regular expression I need.
We can use gsub. We match zero or more characters (.*) followed by a forward slash (\\/), followed by one or more digits and an underscore (\\d+_), or (|) two instances of an underscore followed by one or more characters that are not an underscore ((_[^_]+){2}), and replace the match with blank ("").
gsub(".*\\/\\d+_|(_[^_]+){2}", "", v1)
#[1] "P1" "PA10" "DA100"
Or we extract the basename of each path, match one or more digits followed by an underscore (\\d+_), then capture one or more characters that are not an underscore (([^_]+)) as a group, match the remaining characters up to the end of the string (.*), and replace the match with the backreference (\\1) to the captured group.
sub("\\d+_([^_]+).*", "\\1", basename(v1))
#[1] "P1" "PA10" "DA100"
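Since the wanted piece is always the second underscore-separated field of the file name, a regex-free sketch is also possible. It assumes every file name follows the date_ID_session_focal.csv layout shown in the question; the names parts and ids are illustrative.

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# basename() drops the directories; fixed = TRUE splits on a literal underscore
parts <- strsplit(basename(v1), "_", fixed = TRUE)

# Take the second field of each file name
ids <- vapply(parts, function(p) p[2], character(1))
ids
```

vapply with character(1) fails loudly if a file name yields something other than a single string, which can help catch paths that do not follow the expected layout.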
data
v1 <- c( "Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
"Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
"Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")
