R sub: negation of [:digit:] in regex

I am trying to use sub to remove everything from the end of string s (the pattern always includes :, digits and parentheses) back to, but not including, the nearest digit before the opening parenthesis (.
s <- "NXF1F-Z10_(1:111)"
> sub("\\(1:[[:digit:]]+)$", "", s) # Almost works!
[1] "NXF1F-Z10_"
To also remove all non-digit characters before the parenthesis (like _, or anything of any length except a digit), I tried in vain to negate digits like this:
sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)
The desired output is:
[1] "NXF1F-Z10"

s <- "NXF1F-Z10_(1:111)"
Try this:
sub("_.+", "", s)
# "NXF1F-Z10"
More general:
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", s, perl=TRUE)
# "NXF1F-Z10"
# t is not defined in the original answer; a hypothetical second input is assumed here
t <- "NXF1F-Z10_abc(1:111)"
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", t, perl=TRUE)
# "NXF1F-Z10"
Or this:
sub("[(](\\d+):.+", "\\1", s)
# "NXF1F-Z10_1"
Depending on what you want.
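For completeness, the second attempt in the question fails because its unescaped ( and ) are read as a capture group, not as literal parentheses, so the pattern never matches the "(" in the string. A minimal sketch of a fix, assuming the suffix always has the form (digits:digits) as in the example; the variable name cleaned is only for illustration:

```r
s <- "NXF1F-Z10_(1:111)"

# Escape both parentheses so they match literally; [^[:digit:]]* then
# consumes the run of non-digit characters (here "_") just before "(".
cleaned <- sub("[^[:digit:]]*\\([[:digit:]]+:[[:digit:]]+\\)$", "", s)
cleaned
# [1] "NXF1F-Z10"
```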

Related

Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = TRUE)
Prints:
[1] "A,B,C,D"
I might have misread the query; OP is looking to add commas between letters only. Therefore, try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = TRUE)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
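The two patterns only differ on input that contains word characters other than letters. A small sketch with a hypothetical input (y is not from the question) that mixes letters and digits:

```r
y <- "AB12"  # hypothetical input mixing letters and digits

# \B matches between any two word characters, so digits get commas too
all_gaps <- gsub("\\B", ",", y, perl = TRUE)
all_gaps
# [1] "A,B,1,2"

# the \p{L} version only inserts a comma between two letters
letters_only <- gsub("(\\p{L})(?=\\p{L})", "\\1,", y, perl = TRUE)
letters_only
# [1] "A,B12"
```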
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) fails the match if the current position is the end of the string.
The full R demo:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A non-regex answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
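Note that [[1]] limits this to a single string. A sketch of a vectorised variant, assuming a character vector as input (xs is a made-up example), could run paste over each element of the strsplit result with vapply:

```r
xs <- c("ABCD", "xy", "Q")  # hypothetical vector of strings

# strsplit(xs, "") returns one character vector per input string;
# paste each of them back together with "," as the separator
with_commas <- vapply(strsplit(xs, ""), paste, character(1), collapse = ",")
with_commas
# "A,B,C,D" "x,y" "Q"
```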
Another option is to use positive look behind and look ahead to assert there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

Regex expression to match every nth occurrence of a pattern

Consider this string,
str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
I'd like to separate the string at every nth occurrence of a pattern, here -:
f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...
f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...
I know I could do it like this:
library(stringr)
spl <- str_split(str, "-")[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
But I'm looking for a shorter and cleaner solution.
What are the possibilities?
What about the following (where 'n-1' is a placeholder for a number):
(?:[^-]*(?:-[^-]*){n-1})\K-
(?: - Open 1st non-capture group;
[^-]* - Match 0+ characters other than a hyphen;
(?: - Open a nested 2nd non-capture group;
-[^-]* - Match a hyphen and 0+ characters other than a hyphen;
){n-1} - Close the nested non-capture group and match it n-1 times;
) - Close 1st non-capture group;
\K- - Forget what we just matched and match the trailing hyphen.
Note: The use of \K means we must use PCRE (perl=TRUE).
To create the 'n-1' we can use sprintf() functionality to use a variable:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
for (n in 1:10) {
print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
}
This prints the resulting split for each n from 1 to 10.
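The pattern can also be wrapped into the f(str, n) interface asked for in the question. A minimal sketch, assuming a single input string, a literal "-" separator and n >= 1:

```r
# Split x at every n-th occurrence of "-" (x is a single string, n >= 1)
f <- function(x, n) {
  # n - 1 hyphens are skipped inside the match; \K drops them from the
  # reported match so only the n-th hyphen is used as the split point
  pattern <- sprintf("(?:[^-]*(?:-[^-]*){%d})\\K-", as.integer(n) - 1L)
  strsplit(x, pattern, perl = TRUE)[[1]]
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
f(str, n = 2)
# "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
f(str, n = 3)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
```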
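You could use str_extract_all (from stringr) with the pattern \w+(?:-\w+){0,2}, for instance, to extract chunks of up to 3 words joined by 2 hyphens. Note that n below is the maximum number of hyphens per chunk, so each chunk holds up to n + 1 words: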
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 2
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
n <- 3
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi-j" "k-lm-n-o" "p-qrst-u-vw" "x-yz"
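A base R sketch of the same idea, assuming chunks are runs of non-hyphen characters and that n should count words per chunk rather than hyphens (hence the n - 1 in the quantifier). The helper name chunk_words is made up for illustration:

```r
chunk_words <- function(x, n) {
  # one word, then up to n - 1 further "-word" parts
  pattern <- sprintf("[^-]+(?:-[^-]+){0,%d}", as.integer(n) - 1L)
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
chunk_words(str, 3)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
```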
1) gsubfn. gsubfn, in the package of the same name, is like gsub except that the replacement can be a function, list or proto object. In the case of a proto object one can supply a fun method which has a built-in count variable that can be used to distinguish the occurrences. For each match, the match is passed to fun and replaced with the output of fun.
We use the input shown in the Note at the end and also n to specify the number of components to use in each element of the result and sep to specify a character that does not appear in the input.
gsubfn replaces every n-th minus with sep and the strsplit splits on that.
No complex regular expressions are needed.
library(gsubfn)
n <- 3
sep <- " "
p <- proto(fun = function(., x) if (count %% n) "-" else sep)
strsplit(gsubfn("-", p, STR), sep)
## [[1]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
##
## [[2]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
2) rollapply. Another approach is to split on every - and then paste it together again using rollapply, giving the same result as in (1).
library(zoo)
roll <- function(x) rollapply(x, n, by = n, paste, collapse = "-",
partial = TRUE, align = "left")
lapply(strsplit(STR, "-"), roll)
Note
# input
STR = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
STR <- c(STR, STR)
Another approach: first split on every occurrence of the split pattern, then paste/collapse the pieces into groups of length n, using the split pattern variable as the collapse character.
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 3
pattern <- "-"
ans <- unlist(strsplit(str, pattern))
sapply(split(ans,
ceiling(seq_along(ans)/n)),
paste0, collapse = pattern)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
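One caveat with the snippet above: strsplit treats pattern as a regular expression, so a separator such as "." would need escaping. A sketch of a reusable wrapper that sidesteps this by assuming the separator is a literal string (fixed = TRUE); the function name split_every_n is hypothetical:

```r
split_every_n <- function(x, n, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]   # literal split, no regex
  groups <- ceiling(seq_along(parts) / n)        # 1,1,..,2,2,.. group ids
  unname(vapply(split(parts, groups), paste, character(1), collapse = sep))
}

split_every_n("abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz", 3)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
split_every_n("a.b.c.d.e", 2, sep = ".")
# "a.b" "c.d" "e"
```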

R Sub function: pull everything after second number

Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04"
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
sub will find and replace the first occurrence of
\D* (=[^0-9]*) - zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
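The same idea extends to dropping the first k numbers by repeating the \D*\d+ unit. A sketch, where the helper drop_first_numbers, its argument k and the second input string are assumptions for illustration, not part of the answer:

```r
drop_first_numbers <- function(x, k = 1) {
  # (?:\D*\d+){k}: k repetitions of "optional non-digits, then a number"
  sub(sprintf("^(?:\\D*\\d+){%d}", as.integer(k)), "", x, perl = TRUE)
}

drop_first_numbers("john02imga-04")        # "imga-04"
drop_first_numbers("a1b22c333d", k = 2)    # "c333d"
```

The pattern is anchored with ^ here, so a string with fewer than k numbers is returned unchanged.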
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you want to remove all after the second number:
^\D+\d+(\D+\d+).*
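Used with sub and the \\1 backreference, this keeps only the captured part. A short sketch with a made-up input (str2) that has trailing text after the second number:

```r
str2 <- "john02imga-04xyz"  # hypothetical input with a trailing "xyz"

# Group 1 captures "imga-04"; the trailing .* swallows the rest, so
# replacing the whole match with \\1 leaves only the captured part
res <- sub("^\\D+\\d+(\\D+\\d+).*", "\\1", str2)
res
# [1] "imga-04"
```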
As an alternative, you can get the match only by using perl = TRUE to enable PCRE and \K to clear the match buffer:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = TRUE))
Output
[1] "imga-04"

R: Drop all not matching letters of string vector

I have a string vector
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
I want to remove all characters from this vector that do not
match "a-z", "A-Z" and "'".
I tried to use
gsub("![a-zA-z']", "", d)
but that doesn't work.
We could make your replacement pattern even tighter by doing a case-insensitive substitution:
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
gsub("[^a-z]", "", d, ignore.case=TRUE)
[1] "sladfjrn" "assadfn" "Hlbasdflk"
We can use the ^ inside the square brackets to match all characters except those specified within the brackets:
gsub("[^a-zA-Z]", "", d)
#[1] "sladfjrn" "assadfn" "Hlbasdflk"
data
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
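Neither answer keeps the apostrophe the question asked for, and the original attempt fails because ! has no negating meaning in a regex (it matches a literal "!"). A sketch of a pattern that also keeps "'", using made-up inputs and perl = TRUE so that the ranges are compared by code point rather than depending on the locale:

```r
d2 <- c("it's 42", "O'Neil_7")  # hypothetical input

# ^ as the first character inside [...] negates the class
kept <- gsub("[^a-zA-Z']", "", d2, perl = TRUE)
kept
# "it's" "O'Neil"

# "!" is just a literal character: this removes "!" plus the next letter
bang <- gsub("![a-zA-Z']", "", "a!b", perl = TRUE)
bang
# [1] "a"

# A-z (lowercase z) also spans a few ASCII punctuation characters,
# including "_", so the underscore survives here
wide <- gsub("[^A-z]", "", "a_b1", perl = TRUE)
wide
# [1] "a_b"
```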

R gsub regex Pascal Case to Camel Case

I want to write a gsub function using R regexes to replace all capital letters in my string with an underscore and the lowercase variant. In a separate gsub, I want to replace the first letter with the lowercase variant. The function should do something like this:
pascal_to_camel("PaymentDate") -> "payment_date"
pascal_to_camel("AccountsOnFile") -> "accounts_on_file"
pascal_to_camel("LastDateOfReturn") -> "last_date_of_return"
The problem is, I don't know how to tolower a "\\1" returned by the regex.
I have something like this:
name_format = function(x) gsub("([A-Z])", paste0("_", tolower("\\1")), gsub("^([A-Z])", tolower("\\1"), x))
But it is doing tolower on the string "\\1" instead of on the matched string.
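Use two regexes with perl = TRUE: (?!^[A-Z])([A-Z]) with the replacement _\\L\\1 to put an underscore before every uppercase letter except the first and lowercase it, then ([A-Z]) with \\L\\1 to lowercase what is left:
name_format <- function(x) gsub("([A-Z])", "\\L\\1", gsub("(?!^[A-Z])([A-Z])", "_\\L\\1", x, perl = TRUE), perl = TRUE)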
> name_format("PaymentDate")
[1] "payment_date"
> name_format("AccountsOnFile")
[1] "accounts_on_file"
> name_format("LastDateOfReturn")
[1] "last_date_of_return"
You may use the following solution (converted from Python, see the Elegant Python function to convert CamelCase to snake_case? post):
> pascal_to_camel <- function(x) tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x)))
> pascal_to_camel("PaymentDate")
[1] "payment_date"
> pascal_to_camel("AccountsOnFile")
[1] "accounts_on_file"
> pascal_to_camel("LastDateOfReturn")
[1] "last_date_of_return"
Explanation
gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x) is executed first to insert a _ between any char followed with an uppercase ASCII letter followed with 1+ ASCII lowercase letters (the output is marked as y in the bullet point below)
gsub("([a-z0-9])([A-Z])", "\\1_\\2", y) - inserts _ between a lowercase ASCII letter or a digit and an uppercase ASCII letter (result is defined as z below)
tolower(z) - turns the whole result to lower case.
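The three steps can be traced one at a time. A sketch with a hypothetical input containing an acronym, which is where the two passes matter:

```r
x <- "HTTPResponseCode"  # hypothetical input containing an acronym

# Step 1: split before an uppercase letter that starts a capitalised word
y <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", x)
y
# [1] "HTTP_ResponseCode"

# Step 2: split between a lowercase letter or digit and an uppercase letter
z <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", y)
z
# [1] "HTTP_Response_Code"

# Step 3: lowercase everything
result <- tolower(z)
result
# [1] "http_response_code"
```

By comparison, the \\L-based name_format above prefixes every non-initial uppercase letter with an underscore, so the same input would come out as "h_t_t_p_response_code".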
The same regex with Unicode support (\p{Lu} matches any uppercase Unicode letter and \p{Ll} matches any Unicode lowercase letter):
pascal_to_camel_uni <- function(x) {
tolower(gsub("([\\p{Ll}0-9])(\\p{Lu})", "\\1_\\2",
gsub("(.)(\\p{Lu}\\p{Ll}+)", "\\1_\\2", x, perl=TRUE), perl=TRUE))
}
pascal_to_camel_uni("ДеньОплаты")
## => [1] "день_оплаты"
