R gsub regex Pascal Case to Camel Case

I want to write a gsub function using R regexes to replace all capital letters in my string with an underscore and the lower case variant. In a separate gsub, I want to replace the first letter with the lowercase variant. The function should do something like this:
pascal_to_camel("PaymentDate") -> "payment_date"
pascal_to_camel("AccountsOnFile") -> "accounts_on_file"
pascal_to_camel("LastDateOfReturn") -> "last_date_of_return"
The problem is, I don't know how to tolower a "\\1" returned by the regex.
I have something like this:
name_format = function(x) gsub("([A-Z])", paste0("_", tolower("\\1")), gsub("^([A-Z])", tolower("\\1"), x))
But it is doing tolower on the string "\\1" instead of on the matched string.

Use two regexes, ([A-Z]) and (?!^[A-Z])([A-Z]), with perl = TRUE and the replacements \\L\\1 and _\\L\\1:
name_format <- function(x) gsub("([A-Z])", perl = TRUE, "\\L\\1", gsub("(?!^[A-Z])([A-Z])", perl = TRUE, "_\\L\\1", x))
> name_format("PaymentDate")
[1] "payment_date"
> name_format("AccountsOnFile")
[1] "accounts_on_file"
> name_format("LastDateOfReturn")
[1] "last_date_of_return"
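To see why the attempt in the question fails and what the perl = TRUE replacement does, here is a minimal sketch. The helper name to_snake is hypothetical and not part of the answer above; it folds both steps into one lookbehind-guarded pass and assumes ASCII input.

```r
# tolower() is evaluated before gsub() runs, so it only receives the
# literal string "\\1"; that string has no letters, so nothing changes.
identical(tolower("\\1"), "\\1")

# With perl = TRUE, \\L in the replacement lowercases the text after it.
gsub("([A-Z])", "_\\L\\1", "paymentDate", perl = TRUE)

# Hypothetical one-pass variant: put "_" before every capital that is
# not at the start of the string, then lowercase the whole result.
to_snake <- function(x) tolower(gsub("(?<!^)([A-Z])", "_\\1", x, perl = TRUE))
to_snake(c("PaymentDate", "AccountsOnFile", "LastDateOfReturn"))
```

Lowercasing the whole string at the end avoids the need for a second gsub that handles the first letter separately.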

You may use the following solution (converted from Python, see the Elegant Python function to convert CamelCase to snake_case? post):
> pascal_to_camel <- function(x) tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x)))
> pascal_to_camel("PaymentDate")
[1] "payment_date"
> pascal_to_camel("AccountsOnFile")
[1] "accounts_on_file"
> pascal_to_camel("LastDateOfReturn")
[1] "last_date_of_return"
Explanation
gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x) is executed first to insert a _ between any char and an uppercase ASCII letter followed by 1+ ASCII lowercase letters (the output is referred to as y below)
gsub("([a-z0-9])([A-Z])", "\\1_\\2", y) - inserts _ between a lowercase ASCII letter or a digit and an uppercase ASCII letter (the result is referred to as z below)
tolower(z) - turns the whole result to lower case.
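The three steps can be traced one at a time. This is only an illustrative sketch using one of the inputs from the question; the variable names y and z follow the explanation above.

```r
x <- "LastDateOfReturn"

# Step 1: "_" before an uppercase letter that starts a run of lowercase letters.
y <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x)

# Step 2: "_" between a lowercase letter or digit and an uppercase letter.
z <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", y)

y           # "Last_DateOf_Return": "Of" is skipped, the "e" before it was already consumed
z           # "Last_Date_Of_Return"
tolower(z)  # "last_date_of_return"
```

The second pass exists to catch boundaries such as "eO" that the first pass cannot match because the preceding character was part of the previous match.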
The same regex with Unicode support (\p{Lu} matches any uppercase Unicode letter and \p{Ll} matches any Unicode lowercase letter):
pascal_to_camel_uni <- function(x) {
tolower(gsub("([\\p{Ll}0-9])(\\p{Lu})", "\\1_\\2",
gsub("(.)(\\p{Lu}\\p{Ll}+)", "\\1_\\2", x, perl=TRUE), perl=TRUE))
}
pascal_to_camel_uni("ДеньОплаты")
## => [1] "день_оплаты"
See this online R demo.

Related

Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = T)
Prints:
[1] "A,B,C,D"
Might have misread the query; OP is looking to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = T)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) is a negative lookahead that fails the match if the end of string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A non-regex answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
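The strsplit line above handles a single string because of the [[1]]. Since the question asks about a vector of strings, here is a hedged sketch of a vectorized variant; the function name add_commas is hypothetical.

```r
# Split every element into single characters, then re-join each piece
# with commas; vapply() guarantees one string per input element.
add_commas <- function(x) {
  vapply(strsplit(x, ""), paste, character(1), collapse = ",")
}

add_commas(c("ABCD", "xy", "Q"))
```

A one-character string comes back unchanged, because there is nothing to join.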
Another option is to use positive look behind and look ahead to assert there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

Get substring before the second capital letter

Is there an R function to get only the part of a string before the 2nd capital character appears?
For example:
Example <- "MonkeysDogsCats"
Expected output should be:
"Monkeys"
Maybe something like
stringr::str_extract("MonkeysDogsCats", "[A-Z][a-z]*")
#[1] "Monkeys"
Here is an alternative approach: we first put a space before each uppercase letter that follows a lowercase one, and then extract the first word:
library(stringr)
word(gsub("([a-z])([A-Z])","\\1 \\2", Example), 1)
[1] "Monkeys"
A base solution with sub():
x <- "MonkeysDogsCats"
sub("(?<=[a-z])[A-Z].*", "", x, perl = TRUE)
# [1] "Monkeys"
Another way using stringr::word():
stringr::word(x, 1, sep = "(?=[A-Z])\\B")
# [1] "Monkeys"
If the goal is strictly to capture everything before the 2nd capital character, one might want to pick a solution that also works with all types of strings, including those with numbers and special characters.
strings <- c("MonkeysDogsCats",
"M4DogsCats",
"M?DogsCats")
stringr::str_remove(strings, "(?<=.)[A-Z].*")
Output:
[1] "Monkeys" "M4" "M?"
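If stringr is not available, the same idea can be sketched in base R with sub() and perl = TRUE. The extra "Monkeys" element is a hypothetical input added to show what happens when there is no second capital.

```r
strings <- c("MonkeysDogsCats", "M4DogsCats", "M?DogsCats", "Monkeys")

# Drop everything from the first capital letter that has some
# character before it; strings without such a capital are unchanged.
sub("(?<=.)[A-Z].*", "", strings, perl = TRUE)
```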
It depends on what you want to allow to match. You can for example match an uppercase char [A-Z] optionally followed by any character that is not an uppercase character [^A-Z]*
If you don't want to allow whitespace chars, you can exclude them [^A-Z\\s]*
library(stringr)
str_extract("MonkeysDogsCats", "[A-Z][^A-Z]*")
Output
[1] "Monkeys"
If there should be an uppercase character following, and there are only lowercase characters allowed:
str <- "MonkeysDogsCats"
regmatches(str, regexpr("[A-Z][a-z]*(?=[A-Z])", str, perl = TRUE))
Output
[1] "Monkeys"

R Sub function: pull everything after second number

Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04"
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
See the regex demo. Also, see the R demo online.
sub will find and replace the first occurrence of
\D* (=[^0-9]*) - zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
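The "first occurrence" part matters here. A short sketch comparing sub() with gsub() on the string from the question:

```r
str <- "john02imga-04"

# sub() removes only the first "non-digits then digits" chunk.
sub("\\D*[0-9]+", "", str)

# gsub() removes every such chunk, which consumes the whole string.
gsub("\\D*[0-9]+", "", str)
```

With sub() the result is "imga-04"; with gsub() nothing is left, because "imga-04" matches the same pattern.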
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you want to remove all after the second number:
^\D+\d+(\D+\d+).*
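A minimal sketch of the pattern with the trailing .* in R. The input with extra text after the second number is hypothetical and only serves to show the difference.

```r
str2 <- "john02imga-04_backup"  # hypothetical input with trailing text

# Without .* the text after the second number is kept.
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str2)

# With .* the rest of the string is matched too, so it is dropped.
sub("^\\D+\\d+(\\D+\\d+).*", "\\1", str2)
```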
As an alternative, you can get the match alone by using perl = TRUE (PCRE) and \K, which drops the text matched so far from the reported match:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = T))
Output
[1] "imga-04"

R - gsub paste combination returns gibberish [duplicate]

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed to the function to dynamically create the pattern to be searched. How can this be done in R? The text file has a number and text on every line.
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""), toRead, value=TRUE)
In order to build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), toRead, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
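A self-contained sketch of both options, escaping the braces or switching off regex interpretation with fixed = TRUE. The txt vector is hypothetical and stands in for the lines read from the file.

```r
# Hypothetical lines standing in for readLines("abc.txt")
txt <- c("{3} number of cases", "{12} number of cases", "3 number")
n <- 3

# Option 1: escape the braces so they are matched literally.
grep(paste0("\\{", n, "\\} number"), txt, value = TRUE)

# Option 2: treat the whole pattern as a literal string.
grep(paste0("{", n, "} number"), txt, value = TRUE, fixed = TRUE)
```

fixed = TRUE is only suitable when no part of the pattern needs regex features.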
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.. See the R demo and the regex demo.
NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (used by the stringr/stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana (see an R demo).
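For a runnable version of the whole-word case, here is a small sketch. The examples vector is hypothetical; it is chosen so that partial words such as banana and plumsauce are present and must not match.

```r
words <- c("bananas", "mangoes", "plums")
# Hypothetical test strings
examples <- c("Ben likes bananas.", "One banana, two mangoes", "plums and plumsauce")

# Wrap the alternation in \b so only whole words match.
regex <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
found <- unlist(regmatches(examples, gregexpr(regex, examples, perl = TRUE)))
found
```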
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend a \ to each metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own using the lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."

Camel Case format conversion using regular expressions in R

I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like to convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay
We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
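Note that sub only replaces the first match in each string. If a string may contain several separators, gsub applies the same replacement to all of them. A small sketch with a hypothetical input:

```r
s <- "one_two.three"  # hypothetical input with two separators

# sub(): only the first separator is handled.
sub("[[:punct:]]+(.)", "\\U\\1", s, perl = TRUE)

# gsub(): every separator is removed and the next char upper-cased.
gsub("[[:punct:]]+(.)", "\\U\\1", s, perl = TRUE)
```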
For the second case, we use regex lookarounds, i.e. we match the position that is preceded by a letter ((?<=[A-Za-z])) and followed by a capital letter ((?=[A-Z])), and insert _ there.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
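A variant of the same idea, sketched under the assumption of ASCII input, captures the preceding letter instead of using a lookbehind. If fully lower-case output is wanted, as the wording of the question suggests, tolower can be applied afterwards.

```r
str2 <- c("EndOfFile", "CamelCase", "ABC")

# Capture a letter that is followed by a capital and append "_" to it.
res <- gsub("([A-Za-z])(?=[A-Z])", "\\1_", str2, perl = TRUE)
res

# Optional: lowercase everything for snake_case output.
tolower(res)
```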
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")
