Replace patterns separated by delimiter in R - r

I need to replace the prefix matching "CBII_*_*_" with "MAP_" in the vector tt below.
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")
I tried
gsub("CBII_*_*", "MAP_", tt)
which does not give the expected result. What would be the solution so that I get:
"MAP_62770", "MAP_6272", "MAP_1222"

You can use:
gsub("^CBII_.*_.*_", "MAP_",tt)
or
stringr::str_replace(tt, "^CBII_.*_.*_", "MAP_")
Output
[1] "MAP_62770" "MAP_6272" "MAP_1222"
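The original attempt fails because * is a quantifier in a regex, not a shell-style wildcard, so "_*" means "zero or more underscores". A minimal sketch of the difference follows; the stricter pattern with [^_]* (one underscore-delimited field) is my own variant, not part of the answer above:

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# "_*" only repeats the underscore, so just the "CBII_" prefix is replaced
glob_like <- gsub("CBII_*_*", "MAP_", tt)
# "MAP_27_1018_62770" "MAP_2733_101448_6272" "MAP_1222"

# "[^_]*_" consumes exactly one field and its trailing underscore
strict <- sub("^CBII_[^_]*_[^_]*_", "MAP_", tt)
# "MAP_62770" "MAP_6272" "MAP_1222"
```

Unlike .*, the [^_]* form cannot run past an underscore, so it stops matching if an input has fewer fields than expected.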

An option with trimws from base R (the whitespace argument needs R 3.6.0 or later) along with paste0. We pass the regex .*_ as the whitespace argument, so trimws strips everything up to and including the last _; paste0 then prepends the new prefix ("MAP_").
paste0("MAP_", trimws(tt, whitespace = ".*_"))
#[1] "MAP_62770" "MAP_6272" "MAP_1222"

sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = TRUE)
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Here we use a positive lookbehind to assert that there is an underscore _ immediately to the left of the capturing group (\\d+) at the very end of the string ($); we recall that capturing group with \\1 in the replacement argument to sub and put MAP_ in front of it.
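For this particular data the lookbehind is not strictly required, because a greedy .* already runs up to the last underscore. A small sketch comparing the two, assuming every element contains at least one underscore (the name greedy_only is only illustrative):

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# Lookbehind version: keep the trailing digits via the capture group
with_lookbehind <- sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = TRUE)

# Greedy version: replace everything up to and including the last "_"
greedy_only <- sub(".*_", "MAP_", tt)

identical(with_lookbehind, greedy_only)
```

The lookbehind version is stricter, since it only rewrites strings that end in digits; the greedy version rewrites any string that contains an underscore.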

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: anchor the pattern to the start of the string with ^
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it is; it is not restricted to the start of the string.
For example, suppose we have the string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
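To combine the two points above (remove only one underscore, and only when it is the leading character), the pattern can be anchored with ^. A minimal sketch using the same two example strings:

```r
str1 <- "_6302_I-PAL_SPSY_000237_001"  # has a leading underscore
str2 <- "6302_I-PAL_SPSY_000237_001"   # has none

# "^_" matches an underscore only at the start, so str2 is left untouched
cleaned <- sub("^_", "", c(str1, str2))
cleaned
# both elements are "6302_I-PAL_SPSY_000237_001"
```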
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
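The gsub variant mentioned above is not shown, so here is a hedged sketch of what it could look like. I use perl = TRUE on the assumption that PCRE's handling of \\b is the more predictable choice here:

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# "\\b_" matches an underscore that starts a word: "_" counts as a word
# character, so underscores in the middle of a token are not at a boundary
no_leading <- gsub("\\b_", "", string2, perl = TRUE)
no_leading
# "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
```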

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
Example in R:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
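As an alternative to the double backslash, the pipe can be placed in a bracket expression, where it loses its special meaning. A small sketch showing that both spellings behave the same on the example vector:

```r
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")

# Escaped pipe, as in the answer above
escaped <- sub(".*\\|", "", x)

# Pipe inside a character class: no escaping needed
bracketed <- sub(".*[|]", "", x)

identical(escaped, bracketed)
```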
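We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and remove that substring with str_remove. Note that this removes text up to the first |, which is sufficient when the string contains a single pipe.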
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we also put [A-Z] in the character class, it matches every upper-case letter and replaces it with blank (""), which is why the OP's str_replace_all dropped the A
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple (note that regexpr finds the first |, which is enough when the string contains only one):
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
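If the string can contain several pipes, the same substring idea can be adapted to cut after the last one. A sketch under the assumption that every input contains at least one |; the variable names are illustrative:

```r
my_string <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# regexpr() reports only the first match, so this keeps too much
first_cut <- substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
# "b|c|BLCU142-09|Apodemia_mejicanus"

# gregexpr() reports every match; take the largest position instead
pipe_pos <- gregexpr("|", my_string, fixed = TRUE)[[1]]
last_cut <- substring(my_string, max(pipe_pos) + 1L)
# "Apodemia_mejicanus"
```

Because gregexpr returns a list with one element per input string, a vector of strings would need sapply or similar around the max() step.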

Extract all words from a sentence ending in an expression using R

Suppose I have the following string:
"palavras a serem encontradas fazer-se encontrar-se, enganar-se"
How can I extract the words "fazer-se" "encontrar-se" "enganar-se"?
I tried to use stringr like this:
library(stringr)
sentence <- "palavras a serem encontradas fazer-se encontrar-se, enganar-se"
str_extract_all(sentence, "se$")
I'd like this output:
[1] "fazer-se" "encontrar-se" "enganar-se"
The end-of-string anchor ($) can match only once, at the very end of the string, so we use a word boundary (\\b) instead. We also need the non-whitespace characters before the se substring, so we add \\S+, i.e. one or more non-whitespace characters
library(stringr)
str_extract_all(sentence, "\\S+se\\b")[[1]]
#[1] "fazer-se" "encontrar-se" "enganar-se"
In base R, we can use gregexpr and regmatches :
regmatches(sentence, gregexpr('\\w+-se', sentence))[[1]]
#[1] "fazer-se" "encontrar-se" "enganar-se"
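To make the difference between $ and \\b concrete, here is a base R sketch that runs both patterns on the same sentence. The perl = TRUE flag is my choice for predictable \\S and \\b handling, not something the answers above require:

```r
sentence <- "palavras a serem encontradas fazer-se encontrar-se, enganar-se"

# "$" anchors to the end of the whole string, so there is a single match
at_end <- regmatches(sentence, gregexpr("se$", sentence, perl = TRUE))[[1]]
# "se"

# "\\b" matches at the end of each word, including before the comma
words <- regmatches(sentence, gregexpr("\\S+-se\\b", sentence, perl = TRUE))[[1]]
# "fazer-se" "encontrar-se" "enganar-se"
```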

Remove specific sub string in a string with regex expression in R

I'm quite new to the regex world and I'm struggling with this problem. I'd like to remove the specific word in a string. I was able to remove last n characters in this way:
gsub('.{5}$', '', mystring)
like this
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_1_001"
Now I wanted to remove the central sub string in this way:
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_CA_1"
Any help is appreciated. Thanks in advance!
We can use substring as it would be faster
substring(mystring, 1, nchar(mystring)-5)
#[1] "HOBBIES_1_001"
To remove the middle string, match the _ followed by one or more digits (\\d+) followed by the _ and digits and replace with blank ("")
sub("_\\d+_\\d+", "", mystring)
#[1] "HOBBIES_CA_1"
Or another option is to capture the substring and replace with the backreference
sub("^([^_]+)_\\d+_\\d+", "\\1", mystring)
#[1] "HOBBIES_CA_1"
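We can capture two parts using sub. The first group is the capital letter [A-Z] just before the first underscore (the text before it is outside the match and stays as it is), and the second group is capital letters [A-Z]+ followed by _ and a number at the end of the string.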
sub('([A-Z])_.*?([A-Z]+_\\d+)$', '\\1_\\2',mystring)
#[1] "HOBBIES_CA_1"
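A non-regex alternative is to split on the delimiter and drop fields by position. This is only a sketch, and it assumes the unwanted part is always the second and third underscore-separated field:

```r
mystring <- "HOBBIES_1_001_CA_1"

# Split into fields: "HOBBIES" "1" "001" "CA" "1"
parts <- strsplit(mystring, "_", fixed = TRUE)[[1]]

# Drop fields 2 and 3, then glue the rest back together
newstring <- paste(parts[-c(2, 3)], collapse = "_")
newstring
# "HOBBIES_CA_1"
```

For a vector of strings, the same steps can be wrapped in vapply over the list that strsplit returns.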

Extract string between the last occurrence of a character and a fixed expression

I have a set of strings such as
mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"
I would like to retrieve the string between the last occurrence of a underscore "_" and ".rds"
I can do it in two steps
str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat" "master"
And there are other ways I can do it.
Is there any single regex expression that would get me all the characters between the last occurrence of a generic character and another fixed expression?
Regex such as
str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')
do not work
The [^_]+$(?=\.rds) pattern matches 1+ chars other than _ up to the end of the string and then requires .rds after the end of the string, which is impossible, so this regex will never match. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, runs to the end of the string, and then looks for .rds after it.
You may use
str_extract(mystring, "[^_]+(?=\\.rds$)")
Or, base R equivalent:
regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))
Pattern details
[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
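The effect of moving $ inside the lookahead can be checked directly. A short base R sketch on the two example paths (the names bad and good are illustrative):

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# "$" before the lookahead: ".rds" would have to follow the end of the string
bad <- regmatches(mystring, regexpr("[^_]+$(?=\\.rds)", mystring, perl = TRUE))
length(bad)  # 0, nothing matches

# "$" inside the lookahead: ".rds" must be the last thing in the string
good <- regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl = TRUE))
good
# "cat" "master"
```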
With base R, we take the basename and use sub to capture the word (\\w+) that follows the last _ and precedes the extension (a . followed by characters that are not a . up to the end ($) of the string), then replace the whole match with the backreference (\\1) to the captured group
sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat" "master"
If the extension is fixed
sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))
Or using gsub
gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat" "master"
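Another way to split the task is to let a file-name helper remove the extension and use a regex only for the underscore part. A sketch using tools::file_path_sans_ext from the tools package that ships with R; it assumes the inputs are ordinary file paths with a single extension:

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# Strip the directory and the extension first
stem <- tools::file_path_sans_ext(basename(mystring))
# "processed_AutoServico_cat" "processed_AutoServico_cat_master"

# Then drop everything up to and including the last underscore
last_field <- sub(".*_", "", stem)
last_field
# "cat" "master"
```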
