Characters before/after a symbol - r

I have the following string in R: "xxx, yyy. zzz"
I want to get only the yyy part, which is between "," and "."
I don't want to use regex.
I searched half a day, found many string functions in R but none which deal with "cut before/after a character" function.
Is there such?

We can use gsub to match zero or more characters that are not a , ([^,]*) from the start (^) of the string, followed by a , and zero or more spaces (\\s*), or (|) a dot (\\. - the dot is a metacharacter that matches any character, so it is escaped) followed by any other characters (.*) until the end of the string ($), and replace the match with blank ("")
gsub("^[^,]*,\\s*|\\..*$", "", str1)
#[1] "yyy"
Alternatively, strsplit the string by a , followed by zero or more spaces or by a . (the split pattern is still a regex), extract the vector from the list output ([[1]]) and select its second element ([2])
strsplit(str1, ",\\s*|\\.")[[1]][2]
#[1] "yyy"
data
str1 <- "xxx, yyy. zzz"

It could be that this suffices:
unlist(strsplit("xxx, yyy. zzz","[,.]"))[2] # get yyy with space, or:
gsub(" ","",unlist(strsplit("xxx, yyy. zzz","[,.]")))[2] # remove space
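Since the question asks for a way without regex, here is a minimal sketch that locates the two delimiters as literal strings with regexpr(..., fixed = TRUE) and cuts between them with substr. The names comma_pos and dot_pos are only illustrative, and the sketch assumes both delimiters are present and that the first "." comes after the first ",".

```r
str1 <- "xxx, yyy. zzz"

# fixed = TRUE treats the pattern as a literal string, not as a regex
comma_pos <- regexpr(",", str1, fixed = TRUE)
dot_pos   <- regexpr(".", str1, fixed = TRUE)

# keep what lies strictly between the two positions, then trim the space
res <- trimws(substr(str1, comma_pos + 1, dot_pos - 1))
res
#[1] "yyy"
```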

Related

Remove specific sub string in a string with regex expression in R

I'm quite new to the regex world and I'm struggling with this problem. I'd like to remove the specific word in a string. I was able to remove last n characters in this way:
gsub('.{5}$', '', mystring)
like this
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_1_001"
Now I wanted to remove the central sub string in this way:
mystring = "HOBBIES_1_001_CA_1"
newstring= "HOBBIES_CA_1"
Any help is appreciated, thanks in advance!!
We can use substring as it would be faster
substring(mystring, 1, nchar(mystring)-5)
#[1] "HOBBIES_1_001"
To remove the middle string, match the _ followed by one or more digits (\\d+) followed by the _ and digits and replace with blank ("")
sub("_\\d+_\\d+", "", mystring)
#[1] "HOBBIES_CA_1"
Or another option is to capture the substring and replace with the backreference
sub("^([^_]+)_\\d+_\\d+", "\\1", mystring)
#[1] "HOBBIES_CA_1"
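We can also capture two parts with sub and join them. The first group is the single uppercase letter ([A-Z]) just before the first underscore, and the second group is one or more uppercase letters followed by an underscore and digits ([A-Z]+_\\d+) at the end of the string.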
sub('([A-Z])_.*?([A-Z]+_\\d+)$', '\\1_\\2',mystring)
#[1] "HOBBIES_CA_1"
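If the string always has the same underscore-separated layout, a regex-free sketch is to split on the literal "_" and drop the unwanted pieces by position. This assumes the second and third fields are the ones to remove, as in the example; parts is just an illustrative name.

```r
mystring <- "HOBBIES_1_001_CA_1"

# split on a literal underscore (fixed = TRUE, so no regex is involved)
parts <- strsplit(mystring, "_", fixed = TRUE)[[1]]

# drop fields 2 and 3 and glue the rest back together
newstring <- paste(parts[-c(2, 3)], collapse = "_")
newstring
#[1] "HOBBIES_CA_1"
```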

sub function in r does not replace the first match

I am trying to manipulate a character vector and want to delete all characters before the first occurrence of a specific string using the sub function in R. Since sub replaces only the first match, I expected this to work, but in my code sub seems to replace up to the last match instead of the first.
Here below is an example
Vec <- c("ID1.P.001", "ID2.P.002") # character vector
# I want to get rid of all characters before the first dot (including the dot)
# So i want to get this vector
c("P.001", "P.002")
#[1] "P.001" "P.002"
# my code
sub('.*\\.', "", Vec )
#[1] "001" "002"
# sub replace the last not the first match !!
How can i use sub to get rid of characters before the first match (including the pattern)?
You can make the * quantifier lazy (as opposed to the default greedy matching) by adding a ? after it, i.e.:
sub('.*?\\.', "", Vec)
#[1] "P.001" "P.002"
We can anchor at the start (^) of the string, match one or more characters that are not a dot ([^.]+), followed by a literal dot (\\. - escaped because . is a metacharacter outside a bracket expression; inside [] it is taken literally), and replace the match with blank ("")
sub("^[^.]+\\.", "", Vec)
#[1] "P.001" "P.002"
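As a cross-check without any quantifiers, the position of the first dot can be found as a literal string and everything after it kept with substring. This is only a sketch of an alternative, not what sub does internally; pos is an illustrative name.

```r
Vec <- c("ID1.P.001", "ID2.P.002")

# position of the first literal dot in each element (-1 if there is none)
pos <- regexpr(".", Vec, fixed = TRUE)

# keep everything after that position
res <- substring(Vec, pos + 1)
res
#[1] "P.001" "P.002"
```

For an element without a dot, regexpr returns -1, so substring starts at position 0 and the element is returned unchanged.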

Extract string between the last occurrence of a character and a fixed expression

I have a set of strings such as
mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"
I would like to retrieve the string between the last occurrence of a underscore "_" and ".rds"
I can do it in two steps
str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat" "master"
And there are other ways I can do it.
Is there any single regex expression that would get me all the characters between the last occurrence of a generic character and another fixed expression?
Regex such as
str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')
do not work
The [^_]+$(?=\.rds) pattern matches 1+ chars other than _ up to the end of the string and then requires .rds after the end of the string, which is impossible, so this regex will never match any string. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, runs to the end of the string, and then looks for .rds after it.
You may use
str_extract(mystring, "[^_]+(?=\\.rds$)")
Or, base R equivalent:
regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))
Pattern details
[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
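The pattern details above can be tried on the two strings from the question with base R only. This is a small sketch; m and res are illustrative names.

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# perl = TRUE is required because lookaheads are a PCRE feature
m <- regexpr("[^_]+(?=\\.rds$)", mystring, perl = TRUE)

# regmatches returns only the matched text; elements with no match are dropped
res <- regmatches(mystring, m)
res
#[1] "cat"    "master"
```

Because regmatches silently drops non-matching elements, the result can be shorter than the input when some strings do not end in .rds.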
With base R, we can take the basename and use sub to capture the word characters (\\w+) that follow the last _ and precede the final ., match the extension (one or more characters that are not a ., up to the end ($) of the string), and replace the whole match with the backreference (\\1) of the captured group
sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat" "master"
If the extension is fixed
sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))
Or using gsub
gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat" "master"

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries vary a lot. The only things they have in common are "orf|" or "sp|" at the beginning and "|" in the last position. I tried different things with gsub(), and also with the stringr package, regexpr() and [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ",", remove the last character with substr, and paste the list elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | at the end of the string and removes it. The outer gsub looks for |, and replaces it with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
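To check that the answers above agree, the three approaches can be run on the sample data and compared. This is only a sanity-check sketch; r_lookahead, r_group and r_split are illustrative names.

```r
a <- c("sp|Q9Y6W5|", "sp|Q9HB90|,sp|Q9NQL2|", "orf|NCBIAAYI_c_1_1023|",
       "orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
       "orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")

# PCRE lookahead: remove a | that is followed by a comma or the end of string
r_lookahead <- gsub("[|](?=(,|$))", "", a, perl = TRUE)

# default engine: match the | plus what follows, then put the follower back
r_group <- gsub("\\|(,|$)", "\\1", a)

# no pattern for |: split on commas, chop the last character, re-join
r_split <- sapply(strsplit(a, ","),
                  function(x) paste(substr(x, 1, nchar(x) - 1), collapse = ","))

identical(r_lookahead, r_group) && identical(r_group, r_split)
#[1] TRUE
```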

In R: grab all alnum characters before the first punctuation

I have a vector s of strings (or NAs), and would like to get a vector of the same length containing everything before the first occurrence of punctuation (.).
s <- c("ABC1.2", "22A.2", NA)
I would like a result like:
[1] "ABC1" "22A" NA
You can remove all symbols (incl. a newline) from the first dot with the following Perl-like regex:
s <- c("ABC1.2", "22A.2", NA)
gsub("[.][\\s\\S]*$", "", s, perl=TRUE)
## => [1] "ABC1" "22A" NA
The regex matches
[.] - a literal dot
[\\s\\S]* - any symbols incl. a newline
$ - end of string.
All matched substrings are replaced with "", i.e. removed from the input. As the regex engine analyzes the string from left to right, the first dot is matched with [.], and the greedy * quantifier with [\\s\\S] will match everything up to the end of the string.
If there are no newlines, a simpler regex will do: [.].*$:
gsub("[.].*$", "", s)
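To see why [\\s\\S] matters when a newline is present, here is a small sketch using perl = TRUE for both patterns, since under PCRE the dot does not match a newline by default. The string x is a made-up example, not from the question.

```r
x <- "ABC1.2\nmore"

# [\s\S] matches any character, including the newline
with_class <- gsub("[.][\\s\\S]*$", "", x, perl = TRUE)
with_class
#[1] "ABC1"

# under PCRE, . stops at the newline, so $ cannot be reached and nothing matches
with_dot <- gsub("[.].*$", "", x, perl = TRUE)
with_dot
#[1] "ABC1.2\nmore"
```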
