Remove a specific substring from a string with a regular expression in R

I'm quite new to regex and I'm struggling with this problem. I'd like to remove a specific part of a string. I was able to remove the last n characters this way:
gsub('.{5}$', '', mystring)
which gives, for example:
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_1_001"
Now I want to remove the central substring, like this:
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_CA_1"
Any help is appreciated. Thanks in advance!

We can use substring, which should be faster than a regex here
substring(mystring, 1, nchar(mystring)-5)
#[1] "HOBBIES_1_001"
To remove the middle string, match the _ followed by one or more digits (\\d+) followed by the _ and digits and replace with blank ("")
sub("_\\d+_\\d+", "", mystring)
#[1] "HOBBIES_CA_1"
Or another option is to capture the substring and replace with the backreference
sub("^([^_]+)_\\d+_\\d+", "\\1", mystring)
#[1] "HOBBIES_CA_1"
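As a quick check, here is a minimal sketch that runs the approaches above on a small vector. The second element ("FOODS_3_120_TX_2") is a made-up value, assumed to follow the same naming scheme as the string in the question.

```r
# The second element is hypothetical, added only to show that sub() is vectorised.
mystring <- c("HOBBIES_1_001_CA_1", "FOODS_3_120_TX_2")

# Drop the last 5 characters by position (no regex involved).
drop_tail <- substring(mystring, 1, nchar(mystring) - 5)

# Drop the first "_digits_digits" run found in each string.
drop_mid <- sub("_\\d+_\\d+", "", mystring)

print(drop_tail)
print(drop_mid)
```

Note that the positional version only works while the part to remove has a fixed length, whereas the regex version adapts to numbers of any width.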

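We can extract the string in two parts using sub. The first group captures the capital letter just before the first underscore (everything before it is left untouched), and the second group captures capital letters followed by an underscore and a number at the end of the string.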
sub('([A-Z])_.*?([A-Z]+_\\d+)$', '\\1_\\2',mystring)
#[1] "HOBBIES_CA_1"

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too if the pattern is anchored to the start of the string with ^:
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it is; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
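To make the difference concrete, this small sketch compares the unanchored and anchored patterns on both example strings from above (str1 with a leading underscore, str2 without one).

```r
str1 <- "_6302_I-PAL_SPSY_000237_001"
str2 <- "6302_I-PAL_SPSY_000237_001"

# Unanchored: removes the first "_" wherever it occurs.
unanchored <- sub("_", "", c(str1, str2))

# Anchored: removes "_" only when it is the very first character.
anchored <- sub("^_", "", c(str1, str2))

print(unanchored)
print(anchored)
```

The anchored pattern leaves str2 unchanged, which is usually what is wanted when only a leading underscore should go.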
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
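The gsub variant mentioned above is not shown, so here is a sketch of what it could look like in base R. It assumes perl = TRUE so that \\b is handled by PCRE.

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# "\\b_" matches an underscore that starts a word, i.e. one that sits at the
# start of the string or right after a non-word character such as a space.
cleaned <- gsub("\\b_", "", string2, perl = TRUE)

print(cleaned)
```

Underscores inside a word are preceded by a letter or digit, so there is no word boundary in front of them and they are kept.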

Replace patterns separated by delimiter in R

I need to replace values matching "CBII_*_*_" with "MAP_" in vector tt below.
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")
I tried
gsub("CBII_*_*", "MAP_") which won't give the expected result. What would be the solution for this so I get:
"MAP_62770", "MAP_6272", "MAP_1222"
You can use:
gsub("^CBII_.*_.*_", "MAP_",tt)
or
stringr::str_replace(tt, "^CBII_.*_.*_", "MAP_")
Output
[1] "MAP_62770" "MAP_6272" "MAP_1222"
An option with trimws from base R along with paste. We specify the whitespace as characters (.*) till the _. Thus, it removes the substring till the last _ and then with paste concatenate a new string ("MAP_")
paste0("MAP_", trimws(tt, whitespace = ".*_"))
#[1] "MAP_62770" "MAP_6272" "MAP_1222"
sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = TRUE)
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Here we use a positive lookbehind to assert that there is an underscore _ to the left of the capturing group (\\d+) at the very end of the string ($); we recall that capturing group with \\1 in the replacement argument to sub and put MAP_ in front of it.
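If the two middle fields are always numeric, a stricter pattern may be safer than .*, which accepts anything between the underscores. The sketch below assumes that is the case; the extra element "CBII_27_A_1018_62770" is a made-up value used only to show the difference.

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# Strict: the prefix must be exactly CBII_<digits>_<digits>_
strict <- sub("^CBII_\\d+_\\d+_", "MAP_", tt)

# Hypothetical malformed value, not from the question.
odd <- "CBII_27_A_1018_62770"
greedy_odd <- sub("^CBII_.*_.*_", "MAP_", odd)       # .* swallows the extra field
strict_odd <- sub("^CBII_\\d+_\\d+_", "MAP_", odd)   # no match, string is unchanged

print(strict)
print(c(greedy_odd, strict_odd))
```

Which behaviour is preferable depends on whether unexpected values should be rewritten silently or left alone for inspection.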

Extract string between the last occurrence of a character and a fixed expression

I have a set of strings such as
mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"
I would like to retrieve the string between the last occurrence of an underscore "_" and ".rds"
I can do it in two steps
str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat" "master"
And there are other ways I can do it.
Is there any single regex expression that would get me all the characters between the last occurrence of a generic character and another fixed expression?
Regex such as
str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')
do not work
The [^_]+$(?=\.rds) pattern matches 1+ chars other than _ up to the end of the string and then requires .rds after the end of the string, which is impossible, so this regex will never match any string. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, runs to the end of the string, and then looks for .rds after it.
You may use
str_extract(mystring, "[^_]+(?=\\.rds$)")
Or, base R equivalent:
regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))
Pattern details
[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
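The pattern details above can be checked with a short base R sketch. It uses the two file names from the question and also confirms that the pattern with $ placed before the lookahead finds nothing.

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# Working pattern: the lookahead contains the end-of-string anchor.
pattern <- "[^_]+(?=\\.rds$)"
extracted <- regmatches(mystring, regexpr(pattern, mystring, perl = TRUE))

# Failing pattern from the question: "$" comes before the lookahead,
# so ".rds" would have to appear after the end of the string.
failed <- regexpr("[^_]+$(?=\\.rds)", mystring, perl = TRUE)

print(extracted)
print(as.integer(failed))  # -1 means "no match"
```

Keep in mind that regmatches silently drops elements with no match, so the result can be shorter than the input when some file names do not fit the pattern.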
With base R, we can take the basename and use sub to capture the word (\\w+) that follows the last _ and precedes a . and the remaining non-. characters up to the end ($) of the string, then replace the whole match with the backreference (\\1) of the captured group
sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat" "master"
If it is a fixed character
sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))
Or using gsub
gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat" "master"

How to remove start and end of the string in R?

I have this string mystring. I want to remove the beginning and end of the string in one go and get the result. How do I do this?
mystring <- c("new_DCLd_2_LTR_assembly.csv", "new_nonLTR_DCLd_2_assembly.csv")
result I want:
DCLd_2_LTR_assembly
nonLTR_DCLd_2_assembly
We can use gsub to match zero or more character that are not a _ ([^_]*) followed by a _ from the start (^) of the string or (|) the . followed by csv and replace it with blank ("")
gsub("^[^_]*_|\\.csv", "", mystring)
#[1] "DCLd_2_LTR_assembly" "nonLTR_DCLd_2_assembly"
Or use sub with capture groups
sub("^[^_]*_([^.]*)\\..*", "\\1", mystring)
library(stringr)
str_sub(mystring,5,-5)
[1] "DCLd_2_LTR_assembly" "nonLTR_DCLd_2_assembly"
Or, as akrun suggests, just use substr
substr(mystring, 5, nchar(mystring)-4)
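The position-based answers rely on the prefix being exactly four characters ("new_") and the extension exactly four characters (".csv"). As a hedged alternative, the sketch below combines tools::file_path_sans_ext with sub so that neither length is hard-coded. The third file name is a made-up value with a longer prefix.

```r
mystring <- c("new_DCLd_2_LTR_assembly.csv", "new_nonLTR_DCLd_2_assembly.csv")

# Hypothetical extra file with a different prefix length.
files <- c(mystring, "legacy_DCLd_3_assembly.csv")

# Strip the extension first, then everything up to and including the first "_".
no_ext <- tools::file_path_sans_ext(files)
trimmed <- sub("^[^_]*_", "", no_ext)

print(trimmed)
```

For the two original names this gives the same result as the positional versions, while the third name shows where fixed positions would cut in the wrong place.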

Characters before/after a symbol

I have the following string in R: "xxx, yyy. zzz"
I want to get the yyy part only, which are in between "," and "."
I don't want to use regex.
I searched half a day, found many string functions in R but none which deal with "cut before/after a character" function.
Is there such?
We can use gsub to match zero or more characters that are not a , ([^,]*) from the start (^) of the string followed by a , followed by zero or more spaces (\\s*) or (|) a dot (\\. - it is a metacharacter meaning any character so it is escaped) followed by other characters (.*) until the end of the string ($) and replace it with blank ("")
gsub("^[^,]*,\\s*|\\..*$", "", str1)
#[1] "yyy"
Alternatively, strsplit the string by a , followed by zero or more spaces or by a . (the split pattern is still a regular expression) and select the second entry of the first list element ([[1]])
strsplit(str1, ",\\s*|\\.")[[1]][2]
#[1] "yyy"
data
str1 <- "xxx, yyy. zzz"
It could be that this suffices:
unlist(strsplit("xxx, yyy. zzz","[,.]"))[2] # get yyy with space, or:
gsub(" ","",unlist(strsplit("xxx, yyy. zzz","[,.]")))[2] # remove space
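Since the question asks for a solution without regex, here is a minimal sketch that locates the two delimiters with fixed = TRUE (literal matching) and cuts by position. It assumes the string contains exactly one "," before the first ".".

```r
str1 <- "xxx, yyy. zzz"

# fixed = TRUE treats "," and "." as literal characters, not as regex syntax.
comma_pos <- as.integer(regexpr(",", str1, fixed = TRUE))
dot_pos   <- as.integer(regexpr(".", str1, fixed = TRUE))

# Take what lies strictly between the two delimiters and trim the spaces.
middle <- trimws(substr(str1, comma_pos + 1, dot_pos - 1))

print(middle)
```

If a delimiter may be missing, regexpr returns -1 for it, so it would be worth checking both positions before calling substr.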
