How to replace an internal capital letter in a string - r

I have a range of strings as follows:
vec<-c("Peronospora boniNhenrici","Cystoseira abiesNmarina","Niplommatina rubra",
"Padina sanctaeNcrucis","Nachygrapsus NaurusNliguricus","Melphidippa borealis")
I would like to replace the internal capital "N" in the second word for each element with "-", so that it would look like:
("Peronospora boni-henrici","Cystoseira abies-marina","Niplommatina rubra",
"Padina sanctae-crucis","Nachygrapsus Naurus-liguricus","Melphidippa borealis")
Any suggestions? I've got the locations using the following:
stri_locate_all(vec,regex = "[N]")
but I'm not sure how to replace the "N" if it's internal. When I try to replace the capital letter "N" using gsub, it replaces all occurrences of N, rather than only the internal "N".

We can look for any N's surrounded by \w, which in regex matches any alphanumeric characters or underscores. If that's too broad you could replace \w with [a-zA-Z] to only match letters:
stringr::str_replace_all(vec, "(\\w)N(\\w)", "\\1-\\2")
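To make the difference between \w and [a-zA-Z] concrete, here is a small sketch in base R (gsub accepts the same patterns). The input strings are made up for illustration and are not part of the question's data:

```r
# Hypothetical inputs: an ordinary name, and one where "N" sits next to "_" and a digit.
x <- c("boniNhenrici", "sp_N2")

# \w also matches digits and underscores, so the second string is changed too.
with_w <- gsub("(\\w)N(\\w)", "\\1-\\2", x, perl = TRUE)

# [a-zA-Z] only accepts letters on both sides, so "sp_N2" is left alone.
with_letters <- gsub("([a-zA-Z])N([a-zA-Z])", "\\1-\\2", x, perl = TRUE)

print(with_w)
print(with_letters)
```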

We can use lookarounds to replace an "N" in the middle of a word with a "-". The lookbehind and lookahead assert a word character on each side without consuming it, so only the "N" itself is replaced:
gsub("(?<=\\w)N(?=\\w)", "-", vec, perl = TRUE)
#[1] "Peronospora boni-henrici" "Cystoseira abies-marina" "Niplommatina rubra"
#[4] "Padina sanctae-crucis" "Nachygrapsus Naurus-liguricus" "Melphidippa borealis"

We can use gsub with capture groups
gsub("([a-z])N([a-z])", "\\1-\\2", vec)
#[1] "Peronospora boni-henrici" "Cystoseira abies-marina" "Niplommatina rubra"
#[4] "Padina sanctae-crucis"
#[5] "Nachygrapsus Naurus-liguricus" "Melphidippa borealis"
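One caveat with capture groups, sketched here on a made-up string: the letters on either side of the N are consumed by the match, so two N's separated by a single letter are not both replaced. Lookarounds (with perl = TRUE) check the neighbours without consuming them. None of the strings in the question hit this case.

```r
# Hypothetical input with two internal N's separated by one letter.
y <- "abiesNmNarina"

# The first match consumes "sNm", so the "m" cannot serve as the left
# neighbour of the second N.
captured <- gsub("([a-z])N([a-z])", "\\1-\\2", y)

# Lookbehind/lookahead only assert the neighbours, so both N's match.
lookaround <- gsub("(?<=[a-z])N(?=[a-z])", "-", y, perl = TRUE)

print(captured)
print(lookaround)
```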

Related

r - replace part of string after its matched

I'm trying to replace part of a string that has been matched, as in the following example:
str1 <- "abc sdak+ 123+"
I would like to replace every + that comes after 3 digits, but not a + that comes after letters. I tried the following, but it replaces the whole matched string, when I only want to replace the + with a -:
gsub("[0-9]{3}\\+", "-", str1)
The desired outcome should be:
"abc sdak+ 123-"
We could capture the 3 digits as a group ((...)) followed by the +, and replace with the backreference (\\1) of the captured group plus a -. To make sure that there are no digits before the 3 digits, use either a word boundary (\\b) or a space (\\s)
gsub("\\b(\\d{3})\\+", "\\1-", str1)
-output
[1] "abc sdak+ 123-"
You can also use a look-behind: if the + symbol is preceded by a space and 3 digits, replace it.
str1 <- "abc sdak+ 123+"
gsub("(?<= [0-9]{3})\\+", "-", str1, perl = TRUE)
[1] "abc sdak+ 123-"
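The two answers differ slightly in what they require before the digits. A short sketch on a few strings (only the first comes from the question, the others are made up): the look-behind above needs a literal space before the digits, so it skips a match at the very start of the string, while a word boundary inside the look-behind also covers that case.

```r
s <- c("abc sdak+ 123+", "123+ abc", "1234+")

# Look-behind with a leading space: no match when the digits start the string.
space_lb <- gsub("(?<= [0-9]{3})\\+", "-", s, perl = TRUE)

# Word boundary instead of the space: matches at the start of the string too,
# but still rejects four digits in a row.
boundary_lb <- gsub("(?<=\\b[0-9]{3})\\+", "-", s, perl = TRUE)

print(space_lb)
print(boundary_lb)
```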

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this as well; use the start-of-string anchor ^ so that only a leading underscore is removed:
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it is; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
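To tie the options above together, here is a minimal sketch on a made-up vector where only some elements start with an underscore. Anchoring the pattern with ^ leaves the other elements alone; substring() needs a guard for the same reason.

```r
# Hypothetical mix: the first element has a leading underscore, the second does not.
ids <- c("_6302_I-PAL_SPSY_000237_001", "6302_I-PAL_SPSY_000237_001")

# Unanchored: removes the first "_" wherever it is (changes the second element).
unanchored <- sub("_", "", ids)

# Anchored: removes "_" only at the start of the string.
anchored <- sub("^_", "", ids)

# substring() with a guard, so nothing is dropped when there is no leading "_".
guarded <- ifelse(startsWith(ids, "_"), substring(ids, 2), ids)

print(unanchored)
print(anchored)
print(guarded)
```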
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
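One difference worth knowing, shown as a sketch on a made-up string: trimws() strips every leading underscore, whereas sub("^_", ...) strips just one. The whitespace argument of trimws() needs R 3.6.0 or later.

```r
z <- "__6302_I-PAL"  # hypothetical input with two leading underscores

trimmed <- trimws(z, which = "left", whitespace = "_")  # removes all leading "_"
one_off <- sub("^_", "", z)                             # removes only the first

print(trimmed)
print(one_off)
```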
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
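The gsub variant is mentioned but not shown. A sketch of what it might look like, assuming perl = TRUE so that \\b follows PCRE semantics:

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# "\\b_" matches an underscore that starts a word; underscores inside a word
# sit between two word characters, so there is no boundary before them.
cleaned <- gsub("\\b_", "", string2, perl = TRUE)
print(cleaned)
```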

Using gsub replacement with regex

I want to replace "_s_" with "'s_", but only if it is preceded by more than one letter.
e.g.
If the input is "john_s_fingerprinting", the output should be "john's_fingerprinting". But if the input is "j_s_fingerprinting" then it should not change.
I have tried a regex that matches the strictly-more-than-one-letter criterion, but I am having an issue with the replacement.
Here is what I have so far
gsub("[a-z]{2,}_s_", "[a-z]{2,}'s_", "john_s_fingerprinting")
The replacement "[a-z]{2,}'s_" is not giving me the correct output
We need to capture the letters as a group and replace with a backreference (\\1) to the captured group
gsub("([A-Za-z]{2,})_s", "\\1's", str1)
-output
[1] "john's_fingerprinting" "j_s_fingerprinting"
Or another option is a regex lookaround
gsub("(?<=[A-Za-z]{2})_s", "'s", str1, perl = TRUE)
[1] "john's_fingerprinting" "j_s_fingerprinting"
data
str1 <- c("john_s_fingerprinting", "j_s_fingerprinting")
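Two details may be worth spelling out, sketched below. First, the replacement argument of gsub is not a regex, so the asker's "[a-z]{2,}'s_" is inserted literally. Second, the patterns above match "_s" without the trailing underscore, so they would also fire on a name such as "my_string" (a made-up example); a lookahead for the following "_" is one way to avoid that.

```r
# The replacement is taken literally (apart from backreferences).
literal <- gsub("[a-z]{2,}_s_", "[a-z]{2,}'s_", "john_s_fingerprinting")

# Hypothetical extra input "my_string" to show the over-matching.
x <- c("john_s_fingerprinting", "j_s_fingerprinting", "my_string")

# Without the trailing "_" the pattern also hits "my_string".
loose <- gsub("([A-Za-z]{2,})_s", "\\1's", x)

# Requiring a following "_" via lookahead keeps "my_string" intact.
strict <- gsub("([A-Za-z]{2,})_s(?=_)", "\\1's", x, perl = TRUE)

print(literal)
print(loose)
print(strict)
```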

Remove the last few capital letters on R

I was wondering how I could remove the last few capital letters and symbol "/" of each observation string in R? For example, if I have data like
PlayerFirstLastNameABC
PlayerNameAB/CDF
PlayerFirstMN
PlayerLastNameABC/RS
and so on, how do I get it to return to me:
PlayerFirstLastName
PlayerName
PlayerFirst
PlayerLastName
where the last letter of the string is always a lower case letter? i.e. remove everything from the end of the string back to the last lower case letter. Thanks!
We can use sub from base R to match one or more (+) upper case letters along with / till the end ($) of the string and replace with blank ("")
sub("[A-Z/]+$", "", v1)
#[1] "PlayerFirstLastName" "PlayerName"
#[3] "PlayerFirst" "PlayerLastName"
Or using trimws
trimws(v1, whitespace = "[A-Z/]+", which = "right")
#[1] "PlayerFirstLastName" "PlayerName"
#[3] "PlayerFirst" "PlayerLastName"
data
v1 <- c("PlayerFirstLastNameABC", "PlayerNameAB/CDF", "PlayerFirstMN",
"PlayerLastNameABC/RS")
You can capture everything up to the upper case letters and / at the end of the string, and keep only the captured part.
sub('(.*?)[/A-Z]+$', '\\1', v1)
#[1] "PlayerFirstLastName" "PlayerName" "PlayerFirst" "PlayerLastName"
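A quick sketch of how the pattern behaves at the edges, on partly made-up inputs: a string already ending in a lower case letter is untouched, and an all-capitals string is reduced to the empty string. [[:upper:]] is used here as a locale-aware alternative to [A-Z]; that is a substitution on my part, not what the answers above use.

```r
# Inputs: one from the question, one already clean, one all caps (the last two are hypothetical).
v2 <- c("PlayerNameAB/CDF", "PlayerName", "ABC")

# Strip a trailing run of upper case letters and slashes.
stripped <- sub("[[:upper:]/]+$", "", v2)
print(stripped)
```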

Camel Case format conversion using regular expressions in R

I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like to convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay
We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
For the second case, we use regex lookarounds, i.e. match the zero-width position that is preceded by a letter ((?<=[A-Za-z])) and followed by a capital letter ((?=[A-Z])), and insert _ there.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")
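sub only converts the first separator in each string. For names with several separators, and for the lower-cased variant that the question's first example suggests (End_of_file), a sketch along the same lines might look like this. The input "obi-wan_kenobi" is made up, and the second pattern consumes a lower case letter followed by a capital, so unlike the lookaround above it would leave "ABC" alone.

```r
# gsub instead of sub: every run of punctuation plus the next character is replaced.
camel <- gsub("[[:punct:]]+(.)", "\\U\\1", "obi-wan_kenobi", perl = TRUE)

# Insert "_" between a lower case letter and a capital, lower-casing the capital (\\L).
snake <- gsub("([a-z])([A-Z])", "\\1_\\L\\2", c("EndOfFile", "CamelCase"), perl = TRUE)

print(camel)
print(snake)
```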
