Regex to add comma between any character - r

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"

Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = TRUE)
Prints:
[1] "A,B,C,D"
I might have misread the query; OP is looking to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = TRUE)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
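As a rough illustration of what "letters only" means here, the sketch below runs the letter-only pattern on a few made-up inputs (the vector is an assumption for this example, not from the question). Digits and punctuation are left alone:

```r
# Hypothetical inputs, chosen only to show the effect on non-letters
x <- c("ABCD", "AB12CD", "A-B")

# Insert a comma only where a letter is directly followed by another letter
res <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)
print(res)
# elements: "A,B,C,D", "A,B12C,D", "A-B"
```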

You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) is a negative lookahead that fails the match if the end of the string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
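Since the question asks for a vector of strings of arbitrary length, here is a small sketch showing that gsub is vectorized and that the lookahead leaves single-character and empty strings untouched. The input vector is made up for illustration:

```r
# Hypothetical vector with strings of different lengths
x <- c("ABCD", "XY", "Z", "")

# A comma is added after every char that has another char to its right
res <- gsub("(.)(?=.)", "\\1,", x, perl = TRUE)
print(res)
# elements: "A,B,C,D", "X,Y", "Z", ""
```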

A non-regex friendly answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
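The one-liner above handles a single string (note the [[1]]). A possible way to apply the same idea to a whole vector, sketched here with a made-up input, is to paste each split element separately:

```r
# Hypothetical vector of strings
x <- c("ABCD", "XY", "Z")

# strsplit() returns one character vector per input string;
# vapply() collapses each of them with commas
res <- vapply(strsplit(x, ""), paste, character(1), collapse = ",")
print(res)
# elements: "A,B,C,D", "X,Y", "Z"
```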

Another option is to use positive look behind and look ahead to assert there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

Related

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex so that each string element is split only once, at the "_" after the first occurrence of "R" or "L", unless that first "R" or "L" is followed by "pole", in which case the split should happen after the first occurrence of "pole"?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
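If stringr is not available, the same pattern can probably be used with base R. The sketch below is an assumption about an equivalent approach, not part of the answer; it relies on regexec() accepting perl = TRUE, which needs a reasonably recent R version:

```r
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")

# regexec() returns the positions of the full match and of each capture group;
# regmatches() turns them into the matched substrings
m <- regmatches(x, regexec("^(.*?[RL](?:_pole)?)_(.*)", x, perl = TRUE))

# Drop the first element (the full match), keep Group 1 and Group 2
res <- lapply(m, function(v) v[-1])
print(res)
```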
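library(stringr)
library(magrittr)  # provides %>%

# Second pass: elements that were not split after "pole" are split once
# after the first R or L.
split_again <- function(x) {
  if (length(x) > 1) {
    return(x)
  }
  str_split(string = x, pattern = '(?<=[RL])_', n = 2)
}

# First pass: split once at the "_" that follows "pole", then fall back to R/L.
str_split(string = string, pattern = '(?<=pole)_', n = 2) %>%
  lapply(split_again) %>%
  unlist()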
You could use sub, then strsplit, as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
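The sub/strsplit approach assumes that ":" never occurs in the data. A sketch that avoids the placeholder, under the assumption that every string contains a match, is to locate the underscore with regexpr() and cut the string there (the names pos, left and right are made up for this example):

```r
string <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")

# \\K drops everything matched so far, so the reported match is just the "_"
pos <- as.integer(regexpr("^.*?[LR](?:_pole)?\\K_", string, perl = TRUE))

# pos is -1 for strings without a match; those would need separate handling
left  <- substring(string, 1, pos - 1)
right <- substring(string, pos + 1)
print(Map(c, left, right))
```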

Extracting a string from the first square brackets, starting from right to left

I am trying to extract the string only from the first square brackets starting from right to left
I have tried multiple approaches using str_match and regexpr but I couldn't make it.
c<-"Sens [91] [DRCol105]_Issuer[Risk\\Issuer]"
str_match(c,"\\[.*?\\]$")
OR
start.char<-regexpr("\\[*$",c)[1]+2
stop.char<-regexpr("\\]*$",c)[1]-1
substr(c,start.char,stop.char)
I want to extract everything that is inside the last square brackets. In this example, I would like to extract and save in a variable only "Risk\Issuer".
Here is another solution using regex
# s <- "Sens [91] [DRCol105]_Issuer[Risk\\Issuer]"
gsub('.*\\[(.*)\\]', '\\1', s, perl = TRUE)
# [1] "Risk\\Issuer"
The regular expression .*\\[(.*)\\] extracts the string inside the last square brackets: the leading .* is greedy, so \\[ ends up matching the last opening bracket.
Or
# s <- c("Sens [91] [DRCol105]_Issuer[Risk\\Issuer]", "123 [91]#[test] something follows")
gsub('.*\\[(.*)\\][^\\[]*', '\\1', s, perl = TRUE)
# [1] "Risk\\Issuer" "test"
which has the advantage of working if the string does not end with brackets.
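To see why these patterns pick the last pair of brackets, it may help to compare the greedy form with a lazy one. This is only an illustrative sketch; the variable names are made up:

```r
s <- "Sens [91] [DRCol105]_Issuer[Risk\\Issuer]"

# Greedy leading .* : backtracks from the end, so the last "[" is used
last_br <- sub(".*\\[(.*)\\]", "\\1", s, perl = TRUE)

# Lazy leading .*? : stops at the first "[" instead
first_br <- sub(".*?\\[(.*?)\\].*", "\\1", s, perl = TRUE)

print(last_br)   # the content of the last brackets
print(first_br)  # the content of the first brackets, "91"
```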
Here are a few options:
tail(stringr::str_match_all(s, "\\[(.*?)\\]")[[1]][, 2], 1)
#[1] "Risk\\Issuer"
Using the same regex
stringi::stri_extract_last_regex(s, "\\[(.*?)\\]")
#[1] "[Risk\\Issuer]"
Or to remove brackets
gsub("\\[|\\]", "", stringi::stri_extract_last_regex(s, "\\[(.*?)\\]"))
#[1] "Risk\\Issuer"
I have changed the string name to s, since c is a base R function name.
s <- "Sens [91] [DRCol105]_Issuer[Risk\\Issuer]"
sub("^.*(\\[.*?\\]$)", "\\1", s)
#[1] "[Risk\\Issuer]"
Or, if you want to remove the brackets:
sub("^.*\\[(.*?)\\]$", "\\1", s)
#[1] "Risk\\Issuer"
Here is a strsplit approach:
tail(strsplit(s, '[', fixed = TRUE)[[1]], 1)
[1] "Risk\\Issuer]"
# or, if you don't want the closing bracket:
sub(']', '', tail(strsplit(s, '[', fixed = TRUE)[[1]], 1), fixed = TRUE)
[1] "Risk\\Issuer"
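A base R sketch that works on a vector: collect every bracketed value with gregexpr() and keep the last one per string. It assumes each string contains at least one pair of brackets; the second input is taken from the earlier answer:

```r
s <- c("Sens [91] [DRCol105]_Issuer[Risk\\Issuer]",
       "123 [91]#[test] something follows")

# Match runs of non-bracket chars that sit between "[" and "]"
all_in <- regmatches(s, gregexpr("(?<=\\[)[^\\]\\[]*(?=\\])", s, perl = TRUE))

# Keep only the last match of each string
# (this errors if a string has no brackets at all)
res <- vapply(all_in, function(m) tail(m, 1), character(1))
print(res)
```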

R gsub regex Pascal Case to Camel Case

I want to write a gsub function using R regexes to replace all capital letters in my string with an underscore and the lowercase variant. In a separate gsub, I want to replace the first letter with the lowercase variant. The function should do something like this:
pascal_to_camel("PaymentDate") -> "payment_date"
pascal_to_camel("AccountsOnFile") -> "accounts_on_file"
pascal_to_camel("LastDateOfReturn") -> "last_date_of_return"
The problem is, I don't know how to tolower a "\\1" returned by the regex.
I have something like this:
name_format = function(x) gsub("([A-Z])", paste0("_", tolower("\\1")), gsub("^([A-Z])", tolower("\\1"), x))
But it is doing tolower on the string "\\1" instead of on the matched string.
Use two regexes, ([A-Z]) and (?!^[A-Z])([A-Z]), with perl = TRUE and the replacements \\L\\1 and _\\L\\1 (\\L lower-cases the text that follows it in the replacement):
name_format <- function(x) gsub("([A-Z])", perl = TRUE, "\\L\\1", gsub("(?!^[A-Z])([A-Z])", perl = TRUE, "_\\L\\1", x))
> name_format("PaymentDate")
[1] "payment_date"
> name_format("AccountsOnFile")
[1] "accounts_on_file"
> name_format("LastDateOfReturn")
[1] "last_date_of_return"
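A slightly shorter variant, offered as a sketch rather than as the answer's method, inserts the underscores in one pass and leaves the lower-casing to tolower(). The first call shows what \\L does on its own:

```r
x <- c("PaymentDate", "AccountsOnFile", "LastDateOfReturn")

# \\L lower-cases the rest of the replacement (perl = TRUE only)
lowered <- gsub("([A-Z])", "\\L\\1", x, perl = TRUE)

# One pass: put "_" before every capital that is not at the start,
# then lower-case the whole string
res <- tolower(gsub("(?!^)([A-Z])", "_\\1", x, perl = TRUE))
print(res)
# elements: "payment_date", "accounts_on_file", "last_date_of_return"
```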
You may use the following solution (converted from Python, see the Elegant Python function to convert CamelCase to snake_case? post):
> pascal_to_camel <- function(x) tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x)))
> pascal_to_camel("PaymentDate")
[1] "payment_date"
> pascal_to_camel("AccountsOnFile")
[1] "accounts_on_file"
> pascal_to_camel("LastDateOfReturn")
[1] "last_date_of_return"
Explanation
gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x) is executed first to insert a _ between any char followed with an uppercase ASCII letter followed with 1+ ASCII lowercase letters (the output is marked as y in the bullet point below)
gsub("([a-z0-9])([A-Z])", "\\1_\\2", y) - inserts _ between a lowercase ASCII letter or a digit and an uppercase ASCII letter (result is defined as z below)
tolower(z) - turns the whole result to lower case.
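The three steps can be traced one at a time. In the sketch below, the second input is a made-up acronym case that shows why the second gsub is needed, and perl = TRUE is an addition of this example so that [A-Z] and [a-z] are treated as plain ASCII ranges:

```r
# "HTTPResponseCode" is a hypothetical input, not from the question
x <- c("PaymentDate", "HTTPResponseCode")

# Step 1: "_" before an uppercase letter that starts a capitalised word
y <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x, perl = TRUE)
# y: "Payment_Date", "HTTP_ResponseCode"

# Step 2: "_" between a lowercase letter or digit and an uppercase letter
z <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", y, perl = TRUE)
# z: "Payment_Date", "HTTP_Response_Code"

# Step 3: lower-case everything
res <- tolower(z)
print(res)
```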
The same regex with Unicode support (\p{Lu} matches any uppercase Unicode letter and \p{Ll} matches any Unicode lowercase letter):
pascal_to_camel_uni <- function(x) {
tolower(gsub("([\\p{Ll}0-9])(\\p{Lu})", "\\1_\\2",
gsub("(.)(\\p{Lu}\\p{Ll}+)", "\\1_\\2", x, perl=TRUE), perl=TRUE))
}
pascal_to_camel_uni("ДеньОплаты")
## => [1] "день_оплаты"

removes part of string in r

I'm trying to remove the "es" at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result
[1] "phra" "pha" "princes" "clas" "pas"
but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"
You can use a word boundary (\\b) if it is guaranteed that each word is followed by punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
Maybe you are looking for a lookbehind assertion (a zero-length match); it requires perl = TRUE:
"(?<=s)es\\b"
Or, because a lookbehind can't have a variable length, use the Perl \K construct, which keeps everything to the left of \K out of the match:
"\\ws\\Kes\\b"

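For reference, here is a sketch of how the lookbehind and \K patterns from the last answer could be plugged into gsub, using the data from the question. Both need perl = TRUE, and both only strip "es" after an "s", so "class" and "pass" stay unchanged (unlike the output requested in the edited question):

```r
data <- c("phrases", "phases", "princesses", "class", "pass")

# Lookbehind: "es" at a word end, but only if an "s" precedes it
a <- gsub("(?<=s)es\\b", "", data, perl = TRUE)

# \\K: match a word char and "s", then drop them from the reported match
b <- gsub("\\ws\\Kes\\b", "", data, perl = TRUE)

print(a)
print(b)
# both: "phras", "phas", "princess", "class", "pass"
```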
r sub negation of [:digit:] in regex

I am trying to use sub to remove everything from the end of string s (the pattern always includes :, digits and parentheses) back to, but not including, the first digit before the opening parenthesis (.
s <- "NXF1F-Z10_(1:111)"
>sub("\\(1:[[:digit:]]+)$", "", s) #Almost work!
[1] "NXF1F-Z10_"
To also remove all non-digit characters before the parenthesis (like _, or anything of any length except a digit), I tried in vain to negate digits like this:
sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)
The desired output is :
[1] "NXF1F-Z10"
s <- "NXF1F-Z10_(1:111)"
Try this
sub("_.+", "", s)
# "NXF1F-Z10"
More general
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", s, perl=TRUE)
# "NXF1F-Z10"
t <- "NXF1F-Z10_abc(1:111)"  # hypothetical second input (assumption; its definition is not shown)
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", t, perl=TRUE)
# "NXF1F-Z10"
Or this
sub("[(](\\d+):.+", "\\1", s)
# "NXF1F-Z10_1"
Depending on what you want
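The attempt in the question does not match because the unescaped ( and ) form a capture group instead of matching literal parentheses. A sketch that keeps the question's negated [[:digit:]] class and only escapes the parentheses (the second input is made up for illustration):

```r
# The second element is a hypothetical input
s <- c("NXF1F-Z10_(1:111)", "NXF1F-Z10_abc(1:111)")

# Any run of non-digits, then a literal "(digits:digits)" at the end
res <- sub("[^[:digit:]]*\\([[:digit:]]+:[[:digit:]]+\\)$", "", s)
print(res)
# elements: "NXF1F-Z10", "NXF1F-Z10"
```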
