How to extract a character string dynamically from a character vector in R

Here are three elements of a character vector:
[1] "Session_1/Focal_1_P1/240915_P1_S1_F1.csv"
[2] "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv"
[3] "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv"
I'm trying to extract the strings P1, PA10 and DA100, respectively, in a standardised manner (I have several hundred other strings from which I want to extract the same part).
I know I need to use regex but I'm fairly new to it and not exactly sure which one.
I can see that the common pattern is six digits (\d\d\d\d\d\d) followed by an _, then the part I want, followed by another _.
How do I extract what I want? I believe with grep but am not 100% on the regular expression I need.

We can use gsub. We match zero or more characters (.*) followed by a forward slash (\\/), followed by one or more digits and an underscore (\\d+_), or (|) two instances of an underscore followed by one or more characters that are not an underscore ((_[^_]+){2}), and replace the match with an empty string ("").
gsub(".*\\/\\d+_|(_[^_]+){2}", "", v1)
#[1] "P1" "PA10" "DA100"
Or we take the basename of each path, match one or more digits followed by an underscore (\\d+_), then one or more characters that are not an underscore as a capture group (([^_]+)), then the remaining characters up to the end of the string (.*), and replace the match with the backreference (\\1) to the captured group.
sub("\\d+_([^_]+).*", "\\1", basename(v1))
#[1] "P1" "PA10" "DA100"
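The question notes that the wanted part always follows six digits and an underscore. A minimal sketch that encodes this directly, assuming every file name follows that layout (the names files, by_regex and by_split are made up for illustration):

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# Work on the file name only, so directory parts cannot interfere
files <- basename(v1)

# Six digits, an underscore, then capture everything up to the next underscore
by_regex <- sub("^\\d{6}_([^_]+)_.*$", "\\1", files)

# Regex-free variant: split on "_" and keep the second piece
by_split <- vapply(strsplit(files, "_", fixed = TRUE), `[`, character(1), 2)

by_regex
# [1] "P1"    "PA10"  "DA100"
```

Note that sub returns a string unchanged when the pattern does not match, so file names with a different layout would pass through untouched rather than raise an error.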
data
v1 <- c( "Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
"Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
"Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

Related

Using regular expressions (regex) to replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but how can I say that I want to replace the match with the letter I just detected (c or s in this case)? Thank you in advance.
To remove es at the end of words when it is preceded by s or c, you may use either of the following:
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), the (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds do not consume text: the text they match does not become part of the match value, so there is no need to restore it. The replacement is an empty string.
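A small sketch comparing the two variants on a made-up vector (the words and the names with_group and with_lookbehind are assumptions for illustration):

```r
text <- c("glasses", "faces", "buses", "trees")

# Capture the s or c, then put it back with \\1
with_group <- gsub("([sc])es$", "\\1", text)

# The lookbehind only asserts the s or c, so nothing has to be restored
with_lookbehind <- gsub("(?<=[sc])es$", "", text, perl = TRUE)

with_group
# [1] "glass" "fac"   "bus"   "trees"
identical(with_group, with_lookbehind)
# [1] TRUE
```

Each string is scanned once, so the result of a replacement is not matched again within the same call.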
Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$".
If I understand correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"

R keep a character part of the selection gsub [duplicate]

s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of central "X" but not including the "X".
Note that "X" is always in the center with 6 letters on each side.
I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?
Used on its own in gsub, your regex "([A-Z]{4})(X)([A-Z]{4})" matches only the middle of the string, so the characters before and after the match are left in place. Add .* (any character (.), 0 or more times (*)) on both sides of the pattern so that the match covers the whole string.
You can reference the groups in gsub, for example, using \\n where n is the nth capture group
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
which is basically matching the entire string and replacing it with whatever was captured in groups 1 and 3 and pasting that together.
Another way would be to use (?i), which enables case-insensitive matching, along with [a-z] or \\w:
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or gsub('.*(.{4})X(.{4}).*', '\\1\\2', s) if you like dots
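As a sketch of an alternative that extracts the groups instead of replacing around them, base R's regexec returns the full match together with each capture group (the names parts and result are made up for illustration):

```r
s <- "YXABCDXABCDYX"

# Without the surrounding .*, only the matched middle part is rewritten
gsub("([A-Z]{4})(X)([A-Z]{4})", "\\1\\3", s)
# [1] "YXABCDABCDYX"

# regexec() returns the full match followed by each capture group
parts <- regmatches(s, regexec("([A-Z]{4})X([A-Z]{4})", s))[[1]]

# parts[1] is the whole match, parts[2] and parts[3] are the two groups
result <- paste0(parts[2], parts[3])
result
# [1] "ABCDABCD"
```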

Separate long and complex names in R

Say I have the following list of full scientific names of plant species inside my dataset:
FullSpeciesNames <- c("Aronia melanocarpa (Michx.) Elliott", "Cotoneaster divaricatus Rehder & E. H. Wilson","Rosa canina L.","Ranunculus montanus Willd.")
I want to obtain a list of simplified names, i.e just the first two elements of a given name, namely:
SimpleSpeciesNames<- c("Aronia melanocarpa", "Cotoneaster divaricatus", "Rosa canina", "Ranunculus montanus")
How can this be done in R?
We can use sub to match two repetitions of a word (\\w+) followed by one or more whitespace characters (\\s+), captured together as a group, and then the rest of the characters (.*). In the replacement, use the backreference to the captured group (\\1); trimws removes the trailing whitespace that the group includes.
trimws(sub("^((\\w+\\s+){2}).*", "\\1", FullSpeciesNames))
An alternative that uses more function calls but does not require regular expressions is:
substring(FullSpeciesNames, 1,
          sapply(gregexpr(" ", FullSpeciesNames, fixed = TRUE), "[[", 2) - 1)
[1] "Aronia melanocarpa" "Cotoneaster divaricatus" "Rosa canina" "Ranunculus montanus"
gregexpr can be used to find the positions of certain characters in a string (it can also look for patterns with regular expressions). Here we are looking for spaces. It returns a list of the positions for each string in the character vector. sapply is used to extract the position of the second space. The vector of these positions (minus one) is fed to substring, which runs through the initial vector and takes the substrings starting from the first character to the indicated position.
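The steps described above can be unrolled as follows. This is only a sketch of the same approach with intermediate variables (spaces, second_space and simple are names made up for illustration):

```r
FullSpeciesNames <- c("Aronia melanocarpa (Michx.) Elliott",
                      "Cotoneaster divaricatus Rehder & E. H. Wilson",
                      "Rosa canina L.",
                      "Ranunculus montanus Willd.")

# One integer vector of space positions per name
spaces <- gregexpr(" ", FullSpeciesNames, fixed = TRUE)

# Position of the second space in each name
second_space <- sapply(spaces, "[[", 2)

# Keep everything before the second space
simple <- substring(FullSpeciesNames, 1, second_space - 1)
simple
# [1] "Aronia melanocarpa"      "Cotoneaster divaricatus" "Rosa canina"             "Ranunculus montanus"
```

This assumes every name contains at least two spaces; a name with fewer would make the "[[" call fail with a subscript error, so such entries would need separate handling.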

How to extract first occurrence of alphabets in a string in R?

I have a character column having values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract characters "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC", "CHELSEAFC", "TOTTENHAMFC", :
  replacement has 11366767 rows, data has 11366772
Calls: $<- -> $<-.data.frame
Execution halted
I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)
Assuming that we need to extract the non-numeric part, one option is to remove the other characters by matching one or more numbers ([0-9]+) followed by other characters (.*) and replace it with ""
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
Instead of [A-z]+ you should use ^[A-Za-z]+. The range [A-z] also matches the characters that sit between Z and a in ASCII ([, \, ], ^, _ and `). See this answer for more detail: https://stackoverflow.com/a/29771926/4082217
The error appears because you have some values in the input vector that do not contain letters (and some symbols that [A-z] matches). That makes regmatches return no value in case there is no match, and thus, assigning the column values becomes impossible as the number of matches does not coincide with the number of rows in the data frame.
What you may do is:
1) Use sub
v <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
sub("^([a-zA-Z]+).*|.*", "\\1", v)
# [1] ""          "CHELSEAFC" "BARCAFC"
# In the question, x is a data frame, so apply the same call to its column:
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern has two branches. The first matches and captures one or more ASCII letters at the start of the string (^), with .* matching the rest of the string (replace [a-zA-Z]+ with [[:alpha:]]+ to match non-ASCII letters, too). Otherwise (|), the second branch matches the whole string. Either way, the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).
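The length mismatch behind the original error, and a base R way to keep NA for non-matching rows, can be sketched as follows. The sample vector and the names m and out are assumptions for illustration, not the asker's data:

```r
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

# regexpr() returns -1 where nothing matches
m <- regexpr("^[A-Za-z]+", x)

# regmatches() silently drops the non-matching element
length(regmatches(x, m))
# [1] 2

# Pre-fill with NA and assign matches only where one was found
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
# [1] NA          "CHELSEAFC" "BARCAFC"
```

Because out always has the same length as the input, it can be assigned back to a data frame column without the row-count error.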
