Remove the last few capital letters in R

I was wondering how I could remove the last few capital letters and symbol "/" of each observation string in R? For example, if I have data like
PlayerFirstLastNameABC
PlayerNameAB/CDF
PlayerFirstMN
PlayerLastNameABC/RS
and so on, how do I get it to return to me:
PlayerFirstLastName
PlayerName
PlayerFirst
PlayerLastName
where the last letter of the string is always a lower case letter? That is, remove characters from the end of each string until you hit a lower case letter. Thanks!

We can use sub from base R to match one or more (+) upper case letters or / up to the end ($) of the string and replace the match with an empty string ("").
sub("[A-Z/]+$", "", v1)
#[1] "PlayerFirstLastName" "PlayerName"
#[3] "PlayerFirst" "PlayerLastName"
Or using trimws
trimws(v1, whitespace = "[A-Z/]+", which = "right")
#[1] "PlayerFirstLastName" "PlayerName"
#[3] "PlayerFirst" "PlayerLastName"
data
v1 <- c("PlayerFirstLastNameABC", "PlayerNameAB/CDF", "PlayerFirstMN",
"PlayerLastNameABC/RS")

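You can capture everything before the run of upper case letters and / at the end of the string and keep only the captured part (x here holds the same strings as v1 in the answer above).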
sub('(.*?)[/A-Z]+$', '\\1', x)
#[1] "PlayerFirstLastName" "PlayerName" "PlayerFirst" "PlayerLastName"
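As a quick check on the answers above, here is a minimal sketch that wraps the pattern in a helper and confirms that every result ends in a lower case letter. The name strip_trailing_caps is made up for this illustration. The sketch uses the POSIX classes [[:upper:]] and [[:lower:]] instead of A-Z and a-z, on the assumption that you would rather not depend on how the current locale orders character ranges.

```r
# Hypothetical helper: drop a trailing run of upper case letters and "/"
strip_trailing_caps <- function(x) sub("[[:upper:]/]+$", "", x)

v1 <- c("PlayerFirstLastNameABC", "PlayerNameAB/CDF", "PlayerFirstMN",
        "PlayerLastNameABC/RS")
out <- strip_trailing_caps(v1)
print(out)

# Every cleaned string should now end in a lower case letter
ends_lower <- grepl("[[:lower:]]$", out)
print(all(ends_lower))
```

Strings that already end in a lower case letter are returned unchanged, because sub leaves input alone when the pattern does not match.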

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this as well: anchor the pattern with ^ so that it only matches an underscore at the start of the string.
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurrence of _ wherever it is; it is not limited to a particular position such as the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
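To make the difference between the anchored and unanchored patterns concrete, here is a small sketch that runs both on one string with a leading underscore and one without. The vector name strs and its shortened values are made up for the illustration.

```r
# One string with a leading underscore, one without (illustrative values)
strs <- c("_6302_I-PAL", "6302_I-PAL")

# Anchored: only an underscore at the very start is removed
anchored <- sub("^_", "", strs)
print(anchored)

# Unanchored: the first underscore is removed wherever it occurs
unanchored <- sub("_", "", strs)
print(unanchored)
```

The anchored version leaves the second string untouched, while the unanchored one turns it into "6302I-PAL".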
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
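For readers without stringr, a base R sketch of the same idea follows. Instead of \\b it matches an underscore that sits at the start of the string or directly after a space, and puts the space back with a backreference. This assumes the words are separated by single spaces, which is true for string2 but may not hold for other data.

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# Remove "_" only at the start of the string or directly after a space;
# \\1 restores what the group matched (a space, or nothing at the start).
cleaned <- gsub("(^| )_", "\\1", string2, perl = TRUE)
print(cleaned)
```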

Replace patterns separated by delimiter in R

I need to replace values matching "CBII_*_*_" with "MAP_" in vector tt below.
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")
I tried
gsub("CBII_*_*", "MAP_") which won't give the expected result. What would be the solution for this so I get:
"MAP_62770", "MAP_6272", "MAP_1222"
You can use:
gsub("^CBII_.*_.*_", "MAP_",tt)
or
stringr::str_replace(tt, "^CBII_.*_.*_", "MAP_")
Output
[1] "MAP_62770" "MAP_6272" "MAP_1222"
An option with trimws from base R along with paste. We specify the whitespace as characters (.*) till the _. Thus, it removes the substring till the last _ and then with paste concatenate a new string ("MAP_")
paste0("MAP_", trimws(tt, whitespace = ".*_"))
#[1] "MAP_62770" "MAP_6272" "MAP_1222"
sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = TRUE)
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Here we use a positive lookbehind to assert that there is an underscore _ to the left of the capturing group (\\d+) at the very end of the string ($); we recall that capturing group with \\1 in the replacement argument to sub and put MAP_ in front of it.
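It may help to see why the attempt in the question fails. In a regular expression, * is a quantifier rather than a wildcard, so "_*" means "zero or more underscores" and the pattern only ever matches the "CBII_" prefix. The sketch below shows that, followed by a stricter variant that assumes the two middle fields contain only digits, as they do in the sample data.

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# "_*" is "zero or more underscores", so only the "CBII_" prefix is replaced
attempt <- gsub("CBII_*_*", "MAP_", tt)
print(attempt)

# Stricter sketch (assumes digit-only middle fields): replace the prefix
# plus exactly two numeric fields
strict <- sub("^CBII_[0-9]+_[0-9]+_", "MAP_", tt)
print(strict)
```

Unlike a pattern built on .*, the stricter version leaves any string that does not have the CBII_number_number_ shape unchanged.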

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, we can change "zero or more of any character" (.*, where . is a metacharacter that matches any character) to "zero or more characters that are not _" ([^_]*), anchored at the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
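If we don't restrict it this way, .* is greedy and matches up to the last _ in the string, so everything before that point is removed and we get 'v6' and 'binloopv2'.
Another option is word from stringr. With only a start position, word(x, 2, sep = "_") returns just the second piece ("binloop" for the first string), so we pass -1 as the end to keep everything through the last piece.
library(stringr)
word(x, 2, -1, sep = "_")
#[1] "binloop_v6" "binloopv2"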
regexpr gives the position of first match (in this case _). Then substring can be used to extract the part of x from relevant position to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"
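The difference between the greedy pattern from the question and the alternatives can be seen side by side in the short sketch below. The lazy variant (.*?) is an extra option not mentioned in the answers; it is run with perl = TRUE here.

```r
x <- c("AG.av08_binloop_v6", "TL.av1_binloopv2")

# Greedy: .* runs to the LAST underscore
greedy <- sub(".*_", "", x)
print(greedy)

# Negated class: stops at the FIRST underscore
negated <- sub("^[^_]*_", "", x)
print(negated)

# Lazy quantifier: also stops at the first underscore
lazy <- sub("^.*?_", "", x, perl = TRUE)
print(lazy)
```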

Add underscore before every upper case letter followed by lower case

I'm trying to add underscore before every capital letter followed by lower case. Here is the example:
cases <- c("XrefAcctnoAcctID", "NewXref1AcctID", "NewXref2AcctID", "ClientNo")
I have this:
[1] "XrefAcctnoAcctID" "NewXref1AcctID"
[3] "NewXref2AcctID" "ClientNo"
And I want to have this:
"xref_acctno_acct_id"
"new_xref1_acct_id"
"new_xref2_acct_id"
"client_no"
I'm able to go this far:
> tolower(gsub("([a-z])([A-Z])", "\\1_\\2", cases))
[1] "xref_acctno_acct_id" "new_xref1acct_id"
[3] "new_xref2acct_id" "client_no"
But "new_xref1acct_id" "new_xref2acct_id" does not reflect what I want.
We can use regex lookarounds to match the patterns that show a lowercase letter or a number followed by an upper case letter and replace it with _
tolower(gsub("(?<=[a-z0-9])(?=[A-Z])", "_", cases, perl = TRUE))
#[1] "xref_acctno_acct_id" "new_xref1_acct_id" "new_xref2_acct_id"
#[4] "client_no"
Or without lookarounds, we can capture the lower case letter or digit as one group and the following upper case letter as another group, then replace the match with the backreferences to those groups separated by _
tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", cases))
#[1] "xref_acctno_acct_id" "new_xref1_acct_id" "new_xref2_acct_id"
#[4] "client_no"
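A minimal sketch that wraps the lookaround version in a helper is shown below; to_snake is a name made up for this illustration. It also runs two extra inputs that are not in the question, to show one consequence of the rule: an underscore is only inserted where a lower case letter or digit is followed by an upper case letter, so runs of capitals are never split.

```r
# Hypothetical helper built on the lookaround pattern
to_snake <- function(x) {
  tolower(gsub("(?<=[a-z0-9])(?=[A-Z])", "_", x, perl = TRUE))
}

cases <- c("XrefAcctnoAcctID", "NewXref1AcctID", "NewXref2AcctID", "ClientNo")
snake <- to_snake(cases)
print(snake)

# Illustrative inputs (not from the question): acronyms stay in one piece
acronyms <- to_snake(c("getHTTPCode", "HTTPServer"))
print(acronyms)
```

If acronym boundaries matter for your names, this pattern alone will not find them and a second rule would be needed.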

How to substring a char vector using patterns in R?

I have this kind of char vector:
"MODIS.evi.2013116.yL2.BOKU.tif"
The number in the middle of the vector is gonna change. And the evi word will change to ndvi some times.
I want to use substr (or other function, maybe) to sub-string the vector after the second point: ., ie, just take the 2013116.yL2.BOKU.tif, even when the string is MODIS.evi.2013116.yL2.BOKU.tif or MODIS.ndvi.2013116.yL2.BOKU.tif.
We can use sub to match two instances of "one or more characters that are not a ., followed by a ." from the start (^) of the string and replace them with an empty string ("")
sub("^([^.]+\\.){2}", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
If the part to keep always starts with a number, then the above can be simplified to match one or more non-digit characters from the start (^) of the string and replace them with an empty string
sub("^\\D+", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
data
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
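If the number of leading fields to drop may vary, the {2} pattern can be built on the fly. The sketch below does that with a helper named drop_first_fields, which is made up for this illustration. It assumes sep is a single character that is safe inside a bracket expression (so not ^, ] or -).

```r
# Hypothetical helper: remove the first n sep-delimited fields from each string
drop_first_fields <- function(x, n, sep = ".") {
  # For n = 2 and sep = "." this builds "^([^.]+[.]){2}"
  pattern <- sprintf("^([^%s]+[%s]){%d}", sep, sep, as.integer(n))
  sub(pattern, "", x)
}

str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
res <- drop_first_fields(str1, 2)
print(res)
```

Strings with fewer than n leading fields do not match the pattern and come back unchanged.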
This deletes all leading non-digit characters in s :
sub("^\\D*", "", s)
If s is as in the Note at the end then the result of running the above is:
[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
Note:
s <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
l = c("MODIS.evi.2013116.yL2.BOKU.tif","MODIS.ndvi.2013116.yL2.BOKU.tif")
sapply(l, function(x) strsplit(x, "vi.", fixed = T)[[1]][2])
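The strsplit call above splits on the literal text "vi.", so it relies on both product names (evi, ndvi) ending in "vi". A sketch that does not depend on that is to split on the literal dot and paste everything from the third field onward back together:

```r
l <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")

# fixed = TRUE treats "." as a literal dot rather than a regex metacharacter
parts <- strsplit(l, ".", fixed = TRUE)

# Drop the first two fields of each string and rejoin the rest with dots
rest <- vapply(parts, function(p) paste(p[-(1:2)], collapse = "."), character(1))
print(rest)
```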
