How to remove start and end of the string in R? - r

I have this character vector, mystring. I want to remove the beginning and the end of each string in one go and get the result below. How do I do this?
mystring <- c("new_DCLd_2_LTR_assembly.csv", "new_nonLTR_DCLd_2_assembly.csv"
)
result I want:
DCLd_2_LTR_assembly
nonLTR_DCLd_2_assembly

We can use gsub to match zero or more characters that are not a _ ([^_]*) followed by a _ from the start (^) of the string, or (|) a literal . followed by csv, and replace each match with an empty string ("")
gsub("^[^_]*_|\\.csv", "", mystring)
#[1] "DCLd_2_LTR_assembly" "nonLTR_DCLd_2_assembly"
Or use sub with capture groups
sub("^[^_]*_([^.]*)\\..*", "\\1", mystring)

library(stringr)
str_sub(mystring,5,-5)
[1] "DCLd_2_LTR_assembly" "nonLTR_DCLd_2_assembly"
Or, in base R, just use substr (as suggested by akrun)
substr(mystring, 5, nchar(mystring)-4)
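The positional answers (str_sub, substr) assume the prefix and the extension are always exactly four characters long. As a minimal sketch that does not depend on those lengths, the extension can be dropped with tools::file_path_sans_ext and the prefix with the same sub pattern used above. The helper name strip_ends is made up for illustration.

```r
mystring <- c("new_DCLd_2_LTR_assembly.csv", "new_nonLTR_DCLd_2_assembly.csv")

# strip_ends() is a hypothetical helper name, not part of any package
strip_ends <- function(x) {
  no_ext <- tools::file_path_sans_ext(x)  # drop the file extension (".csv")
  sub("^[^_]*_", "", no_ext)              # drop everything up to the first "_"
}

strip_ends(mystring)
#[1] "DCLd_2_LTR_assembly" "nonLTR_DCLd_2_assembly"
```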

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this as well: anchor the pattern with ^ so that only an underscore at the start of the string is matched
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has the _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
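To make the difference between the anchored and unanchored patterns concrete, here is a small sketch that runs both on a vector where only the first element starts with an underscore. The variable names are chosen for illustration only.

```r
x <- c("_6302_I-PAL_SPSY_000237_001", "6302_I-PAL_SPSY_000237_001")

# Anchored: only an underscore at the very start is removed
anchored <- sub("^_", "", x)
anchored
#[1] "6302_I-PAL_SPSY_000237_001" "6302_I-PAL_SPSY_000237_001"

# Unanchored: the first underscore is removed wherever it occurs
unanchored <- sub("_", "", x)
unanchored
#[1] "6302_I-PAL_SPSY_000237_001" "6302I-PAL_SPSY_000237_001"
```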
This can be done with base R's trimws() too (the whitespace argument requires R 3.6.0 or later)
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
If the string contains several words with leading underscores, we can add a word boundary (\\b) to the regex and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
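The gsub variant mentioned above is not shown, so here is a sketch of what it could look like in base R. It assumes perl = TRUE so that \\b is interpreted by the PCRE engine.

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# "\\b_" matches an underscore only where a word starts; the other
# underscores follow a letter or digit, so there is no boundary before them
res <- gsub("\\b_", "", string2, perl = TRUE)
res
#[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
```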

Replace patterns separated by delimiter in R

I need to replace the part of each value matching "CBII_*_*_" with "MAP_" in the vector tt below.
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")
I tried
gsub("CBII_*_*", "MAP_", tt), which does not give the expected result. What would be the solution for this so I get:
"MAP_62770", "MAP_6272", "MAP_1222"
You can use:
gsub("^CBII_.*_.*_", "MAP_",tt)
or
stringr::str_replace(tt, "^CBII_.*_.*_", "MAP_")
Output
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Another option is trimws from base R together with paste0. We pass the regex .*_ as the whitespace argument, so everything up to and including the last _ is trimmed; paste0 then prepends the new prefix ("MAP_")
paste0("MAP_", trimws(tt, whitespace = ".*_"))
#[1] "MAP_62770" "MAP_6272" "MAP_1222"
sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = TRUE)
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Here we use a positive lookbehind to assert that there is an underscore _ immediately to the left of the capturing group (\\d+) at the very end of the string ($); we recall that capturing group with \\1 in the replacement argument to sub and put MAP_ in front of it.
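The original attempt fails because * in a regex is a quantifier ("zero or more of the preceding character"), not a shell-style wildcard, so _* only matches runs of underscores. As a minimal sketch, and assuming every element should end up as MAP_ plus its last field, a single greedy .*_ is enough. Note that this would also rewrite prefixes other than CBII_.

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# Greedy ".*_" consumes everything up to and including the last underscore
res <- sub(".*_", "MAP_", tt)
res
#[1] "MAP_62770" "MAP_6272" "MAP_1222"
```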

Remove specific sub string in a string with regex expression in R

I'm quite new to the regex world and I'm struggling with this problem. I'd like to remove the specific word in a string. I was able to remove last n characters in this way:
gsub('.{5}$', '', mystring)
like this
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_1_001"
Now I wanted to remove the central sub string in this way:
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_CA_1"
Any help is appreciated. Thanks in advance!
We can use substring as it would be faster
substring(mystring, 1, nchar(mystring)-5)
#[1] "HOBBIES_1_001"
To remove the middle string, match the _ followed by one or more digits (\\d+) followed by the _ and digits and replace with blank ("")
sub("_\\d+_\\d+", "", mystring)
#[1] "HOBBIES_CA_1"
Or another option is to capture the substring and replace with the backreference
sub("^([^_]+)_\\d+_\\d+", "\\1", mystring)
#[1] "HOBBIES_CA_1"
We can also capture two parts of the string with sub and keep only those. The first group is the uppercase letter ([A-Z]) just before the first underscore, and the second is the uppercase letters followed by an underscore and digits at the end of the string.
sub('([A-Z])_.*?([A-Z]+_\\d+)$', '\\1_\\2',mystring)
#[1] "HOBBIES_CA_1"
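If the pieces to drop are better described by position than by pattern, a non-regex sketch is to split on the underscore, drop the unwanted fields, and paste the rest back together. The helper name drop_fields and its drop argument are made up for illustration.

```r
mystring <- "HOBBIES_1_001_CA_1"

# drop_fields() is a hypothetical helper: split on "_", remove the fields
# at the positions given in `drop`, and rejoin the remaining fields
drop_fields <- function(x, drop) {
  vapply(strsplit(x, "_", fixed = TRUE),
         function(p) paste(p[-drop], collapse = "_"),
         character(1))
}

drop_fields(mystring, 2:3)
#[1] "HOBBIES_CA_1"
```

This assumes every string has at least as many fields as the largest position in drop.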

Extract string between the last occurrence of a character and a fixed expression

I have a set of strings such as
mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"
I would like to retrieve the string between the last occurrence of an underscore "_" and ".rds"
I can do it in two steps
str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat" "master"
And there are other ways I can do it.
Is there any single regex expression that would get me all the characters between the last occurrence of a generic character and another fixed expression?
Regex such as
str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')
do not work
The [^_]+$(?=\.rds) pattern matches one or more characters other than _ up to the end of the string and then requires .rds after the end of the string. That is impossible, so this regex can never match. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, runs to the end of the string, and then looks for .rds after it.
You may use
str_extract(mystring, "[^_]+(?=\\.rds$)")
Or, base R equivalent:
regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))
Pattern details
[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
With base R, we can take the basename and use sub to capture the word (\\w+) that follows the last _ and precedes the extension (a . followed by characters that are not a . until the end ($) of the string), then replace the whole match with the backreference (\\1) of the captured group
sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat" "master"
If the extension is fixed
sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))
Or using gsub
gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat" "master"
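One difference between the two extraction approaches is worth keeping in mind: str_extract returns NA for elements that do not match, whereas regmatches with regexpr silently drops them, so the result can be shorter than the input. The sketch below keeps the result aligned with the input in base R. The third file name is invented for illustration.

```r
# The last element is a made-up non-matching example
files <- c("RData/processed_AutoServico_cat.rds",
           "RData/processed_AutoServico_cat_master.rds",
           "RData/notes.txt")

m <- regexpr("[^_]+(?=\\.rds$)", files, perl = TRUE)

# Pre-fill with NA and write the matches back at the matching positions
out <- rep(NA_character_, length(files))
out[m > 0] <- regmatches(files, m)
out
#[1] "cat" "master" NA
```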

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, change .* (zero or more of any character; . is a metacharacter that matches any character) to [^_]* (zero or more characters that are not a _), anchored at the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
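Without this change, the greedy .* matches up to the last _ in the string, so everything before it is lost and the result is 'v6' and 'binloopv2'
Another option is word from stringr. Note that word(x, 2, sep = "_") returns only the second piece ("binloop" for the first element), so pass -1 as the end to keep everything through the last piece
library(stringr)
word(x, 2, -1, sep = "_")
#[1] "binloop_v6" "binloopv2"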
regexpr gives the position of first match (in this case _). Then substring can be used to extract the part of x from relevant position to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"
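Both the sub and the regexpr/substring answers return the input unchanged when a string contains no underscore at all. If that case should be flagged instead, a minimal sketch is to check the match position first. The helper name after_first and the third test string are made up for illustration.

```r
# after_first() is a hypothetical helper: everything after the first `sep`,
# or NA when `sep` does not occur in the string
after_first <- function(x, sep = "_") {
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))  # -1 when there is no match
  ifelse(pos > 0, substring(x, pos + 1), NA_character_)
}

after_first(c("AG.av08_binloop_v6", "TL.av1_binloopv2", "nounderscore"))
#[1] "binloop_v6" "binloopv2" NA
```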
