Regex expression to match every nth occurrence of a pattern - r

Consider this string,
str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
I'd like to separate the string at every nth occurrence of a pattern, here -:
f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...
f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...
I know I could do it like this:
library(stringr)
spl <- str_split(str, "-")[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
But I'm looking for a shorter and cleaner solution
What are the possibilities?

What about the following (where 'n-1' is a placeholder for a number):
(?:[^-]*(?:-[^-]*){n-1})\K-
(?: - Open 1st non-capture group;
[^-]* - Match 0+ characters other than a hyphen;
(?: - Open a nested 2nd non-capture group;
-[^-]* - Match a hyphen and 0+ characters other than a hyphen;
){n-1} - Close the nested non-capture group and match it n-1 times;
) - Close 1st non-capture group;
\K- - Forget what we just matched and match the trailing hyphen.
Note: The use of \K means we must use PCRE (perl=TRUE).
To fill in the 'n-1' from a variable we can use sprintf():
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
for (n in 1:10) {
print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
}
Prints:
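As a minimal sketch, the pattern above could be wrapped in the f(str, n) shape the question asks for. The name split_every_nth is hypothetical (it is not part of the answer), and the helper assumes a single input string and a literal hyphen as separator.

```r
# Hypothetical helper (name not from the original answer): split `x` at every
# n-th hyphen using the \K pattern. Assumes n is a whole number >= 1.
split_every_nth <- function(x, n) {
  stopifnot(length(x) == 1, n >= 1)
  # {n-1} repeats the "-token" group, so each piece holds n tokens
  pattern <- sprintf("(?:[^-]*(?:-[^-]*){%d})\\K-", as.integer(n - 1))
  strsplit(x, pattern, perl = TRUE)[[1]]
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
split_every_nth(str, 2)
# "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
split_every_nth(str, 3)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
```

With n = 1 the repeated group is matched zero times, so the string is simply split at every hyphen.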

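You could use str_extract_all with the pattern \w+(?:-\w+){0,2}, for instance, to find terms with up to 3 words and 2 hyphens. Here n is the maximum number of hyphens per term, so each term holds up to n + 1 words:
library(stringr)
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"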
n <- 2
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
n <- 3
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi-j" "k-lm-n-o" "p-qrst-u-vw" "x-yz"
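In this answer n counts hyphens, while the question's f(str, n) counts words per piece. A hedged base R variant that follows the question's convention might look like the sketch below. The name extract_groups is hypothetical, [^-]+ is swapped in for \w+ so that tokens may contain any non-hyphen character, and regmatches/gregexpr replace str_extract_all.

```r
# Hypothetical helper: extract runs of up to n hyphen-separated tokens.
# Assumes tokens are non-empty and the separator is a literal hyphen.
extract_groups <- function(x, n) {
  stopifnot(length(x) == 1, n >= 1)
  # one token, then up to n-1 further "-token" repetitions
  pattern <- sprintf("[^-]+(?:-[^-]+){0,%d}", as.integer(n - 1))
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
extract_groups(str, 2)
# "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
extract_groups(str, 3)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
```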

1) gsubfn
gsubfn, in the package of the same name, is like gsub except that the replacement can be a function, list or proto object. In the case of a proto object one can supply a fun method which has a built-in count variable that can be used to distinguish the occurrences. Each match is passed to fun and replaced with the output of fun.
We use the input shown in the Note at the end, n to specify the number of components in each element of the result, and sep to specify a character that does not appear in the input.
gsubfn replaces every n-th minus with sep and then strsplit splits on that.
No complex regular expressions are needed.
library(gsubfn)
n <- 3
sep <- " "
p <- proto(fun = function(., x) if (count %% n) "-" else sep)
strsplit(gsubfn("-", p, STR), sep)
## [[1]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
##
## [[2]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
2) rollapply
Another approach is to split on every - and then paste the pieces together again using rollapply, giving the same result as in (1).
library(zoo)
roll <- function(x) rollapply(x, n, by = n, paste, collapse = "-",
partial = TRUE, align = "left")
lapply(strsplit(STR, "-"), roll)
Note
# input
STR = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
STR <- c(STR, STR)

Another approach: first split on every occurrence of the split pattern, then paste the pieces back together in groups of n, using the split pattern as the collapse character.
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 3
pattern <- "-"
ans <- unlist(strsplit(str, pattern))
unname(sapply(split(ans,
ceiling(seq_along(ans)/n)),
paste0, collapse = pattern))
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"

Related

split a string knowing some of the substrings

Say I have the following string and a vector of substrings:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
I would like to split this string by extracting the substrings from my vector and making new substrings from the characters in between, so I would get the following :
res <- c("abc", "[[", "+", "de.f", "[", "-", "[[", "g")
In case of conflicting matches the longer one wins (here [[ over [); you can assume there won't be conflicting matches of the same length.
Tagging with regex but open to any solution, faster being better.
Please don't make any assumption on the type of character used in any of these strings, apart from the fact they're ASCII. There is no pattern to be inferred if I didn't explicitly mention it.
Another example:
x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
res <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g", "[*", "+-", "h", "-", "+")
This almost seems more like a lexing problem than a matching problem. I seem to get decent results with the minilexer package
library(minilexer) #devtools::install_github("coolbutuseless/minilexer")
patterns <- c(
dbracket = "\\[\\[",
bracket = "\\[",
plus = "\\+",
minus = "\\-",
name = "[a-z.]+"
)
x <- "abc[[+de.f[-[[g"
lex(x, patterns)
unname(lex(x, patterns))
# [1] "abc" "[[" "+" "de.f" "[" "-"
# [7] "[[" "g"
Using stringr::str_match_all and Hmisc::escapeRegex :
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
tmp <- v[order(-nchar(v))] # sort to have longer first, to match in priority
tmp <- Hmisc::escapeRegex(tmp)
tmp <- paste(tmp,collapse="|") # compile a match string
pattern <- paste0(tmp,"|(.+?)") # add a pattern to match the rest
# extract all matches into a matrix
mat <- stringr::str_match_all(x, pattern)[[1]]
# aggregate where second column is NA
res <- unname(tapply(mat[,1],
cumsum(is.na(mat[,2])) + c(0,cumsum(abs(diff(is.na(mat[,2]))))),
paste, collapse=""))
res
#> [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
A pure regex-based solution will look like
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res <- regmatches(x, gregexpr(pat, x, perl=TRUE))
## => [[1]]
## [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
The PCRE regex here is
(?s)\[\[|\+|-|\[|(?:(?!\[\[|\+|-|\[).)+
Details
(?s) - a DOTALL modifier that makes . match any char including newlines
\[\[ - [[ substring (escaped with regex.escape)
| - or
\+ - a +
|- - or a - (no need to escape - as it is not inside a character class)
|\[ - or [
| - or
(?:(?!\[\[|\+|-|\[).)+ - a tempered greedy token that matches any char (.), 1 or more repetitions, as many as possible (+ at the end), as long as that char does not start a [[, +, - or [ character sequence.
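To see the tempered greedy token at work on the second example from the question, here is a hedged sketch. The name split_on_tokens is hypothetical, and one detail is swapped in: the literals are quoted with PCRE's \Q...\E instead of the regex.escape function, which assumes no element of v contains the sequence \E.

```r
# Hypothetical wrapper around the tempered-greedy-token pattern.
# Swapped-in detail: literals are quoted with \Q...\E (PCRE only),
# assuming no element of v contains "\E".
split_on_tokens <- function(x, v) {
  v <- v[order(-nchar(v))]                        # longer tokens first
  alt <- paste0("\\Q", v, "\\E", collapse = "|")  # alternation of literals
  # either a known token, or a run of chars none of which starts a token
  pat <- paste0("(?s)", alt, "|(?:(?!", alt, ").)+")
  regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
}

x2 <- "a*bc[[+de.f[-[[g[*+-h-+"
v2 <- c("+", "-", "[", "[[", "[*", "+-")
split_on_tokens(x2, v2)
# "a*bc" "[[" "+" "de.f" "[" "-" "[[" "g" "[*" "+-" "h" "-" "+"
```

Because the alternation lists longer tokens first, [[ and [* are tried before [, which is what gives the "longer match wins" behaviour the question asks for.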
You may also consider a less "regex intensive" solution with a TRE regex:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Interleaving function
riffle3 <- function(a, b) {
mlab <- min(length(a), length(b))
seqmlab <- seq_len(mlab)
c(rbind(a[seqmlab], b[seqmlab]), a[-seqmlab], b[-seqmlab])
}
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
res <- riffle3(regmatches(x, gregexpr(pat, x), invert=TRUE)[[1]], regmatches(x, gregexpr(pat, x))[[1]])
res <- res[res != ""]
## => [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
To summarize: the search items are escaped for use in a regex and sorted by length in descending order, the alternation pattern is built dynamically, all matching and non-matching substrings are found and interleaved into a single character vector, and empty items are discarded at the end.
One option to get your matches might be to use an alternation:
[a-z.]+|\[+|[+-]
[a-z.]+ Match 1+ times a-z or dot
| Or
\[+ match 1+ times a [
| Or
[+-] Match + or -
For example, to get the matches:
library(stringr)
x <- "abc[[+de.f[-[[g"
str_extract_all(x, "[a-z.]+|\\[+|[+-]")

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
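The same pattern can be built from variables. The following is a minimal base R sketch; the name cut_at_count is hypothetical, it assumes the counted characters are plain letters or digits (so they are safe inside a character class), and it returns NA where a string has fewer than k occurrences, mirroring str_extract.

```r
# Hypothetical helper: keep each string up to and including the k-th
# occurrence of any character in `chars`; NA if there are fewer than k.
# Assumes `chars` holds plain letters or digits (safe inside [...]).
cut_at_count <- function(x, chars, k) {
  cls <- paste(chars, collapse = "")
  pattern <- sprintf("^(?:[^%1$s]*[%1$s]){%2$d}", cls, as.integer(k))
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1L            # TRUE where the pattern matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)          # regmatches returns matches only
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_at_count(strings, c("A", "G", "N"), 3)
# "ABBSDGN" "AABSDG" "AGN" "GGG"
cut_at_count(strings, c("A", "G", "N"), 4)
# "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
```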
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
Identify the positions of the pattern using gregexpr, then take the n-th position (here the 3rd) and extract everything from position 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If a string has fewer than 3 matches, this generates NA for it, so you just need to apply na.omit to the final result.
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, character(1))
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

strsplit and keep part after first underscore

I would like to keep the part after the FIRST underscore. Please see the example code.
colnames(df)
"EGAR00001341740_P32_1" "EGAR00001341741_PN32"
My attempt gives only P32 instead of P32_1, which is wrong.
sapply(strsplit(colnames(df), split='_', fixed=TRUE), function(x) (x[2]))
Desired output: P32_1, PN32
It could be done with a regex that matches zero or more characters that are not an underscore ([^_]*) from the start (^) of the string, followed by an underscore (_), and replaces the match with an empty string ("").
colnames(df) <- sub("^[^_]*_", "", colnames(df))
colnames(df)
#[1] "P32_1" "PN32"
With strsplit, the string is split wherever the split character occurs. One option is str_split from stringr, which has an argument n for the number of pieces to return. If we choose n = 2, we get 2 substrings, as it will only split at the first _
library(stringr)
sapply(str_split(colnames(df), "_", n = 2), `[`, 2)
#[1] "P32_1" "PN32"
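A base R way to split only at the first underscore, as a hedged sketch (the variable names parts and after_first are illustrative): regexpr locates just the first match, and regmatches with invert = TRUE returns the text on either side of it.

```r
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# regexpr finds only the first "_"; invert = TRUE returns the pieces around it
parts <- regmatches(x, regexpr("_", x), invert = TRUE)

# second piece of each element = everything after the first underscore
after_first <- vapply(parts, `[`, character(1), 2)
after_first
# "P32_1" "PN32"
```

Unlike sub, this keeps the part before the underscore available as the first piece of each list element, in case both halves are needed.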
Here are a few ways. The first fixes the code in the question and the remaining ones are alternatives. All use only base except (6). (4) and (7) assume that the first field is fixed length, which is the case in the question.
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")
# 1 - using strsplit
sapply(strsplit(x, "_"), function(x) paste(x[-1], collapse = "_"))
## [1] "P32_1" "PN32"
# 2 - a bit easier using sub. *? is a non-greedy match
sub(".*?_", "", x)
## [1] "P32_1" "PN32"
# 3 - locate the first underscore and extract all after that
substring(x, regexpr("_", x) + 1)
## [1] "P32_1" "PN32"
# 4 - if the first field is fixed length as in the example
substring(x, 17)
## [1] "P32_1" "PN32"
# 5 - replace first _ with character that does not appear and remove all until it
sub(".*;", "", sub("_", ";", x))
## [1] "P32_1" "PN32"
# 6 - extract everything after first _
library(gsubfn)
strapplyc(x, "_(.*)", simplify = TRUE)
## [1] "P32_1" "PN32"
# 7 - like (4) assumes fixed length first field
read.fwf(textConnection(x), widths = c(16, 99), as.is = TRUE)$V2
## [1] "P32_1" "PN32"

r sub negation of [:digit:] in regex

I am trying to use sub to remove everything from the end of string s (the pattern always includes :, digits and parentheses) back to, but not including, the first digit before the opening parenthesis (.
s <- "NXF1F-Z10_(1:111)"
> sub("\\(1:[[:digit:]]+)$", "", s) # Almost works!
[1] "NXF1F-Z10_"
To remove all characters that are not digits (like _, or anything of any length except a digit), I tried in vain to negate digits:
sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)
The desired output is :
[1] "NXF1F-Z10"
s <- "NXF1F-Z10_(1:111)"
Try this
sub("_.+", "", s)
# "NXF1F-Z10"
More general
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", s, perl=TRUE)
# "NXF1F-Z10"
t <- "NXF1F-Z10_abc(1:111)" # hypothetical second input; its original definition is not shown here
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", t, perl=TRUE)
# "NXF1F-Z10"
Or this
sub("[(](\\d+):.+", "\\1", s)
# "NXF1F-Z10_1"
Depending on what you want
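For reference, a short sketch of why the attempt in the question does nothing and how a minimal change might fix it. The variable names attempt and fixed are illustrative only.

```r
s <- "NXF1F-Z10_(1:111)"

# The attempt from the question: the unescaped ( ) form a group, so the
# pattern needs the string to END in "1:<digits>"; s ends in ")", so no match.
attempt <- sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)
attempt
# "NXF1F-Z10_(1:111)"  (unchanged)

# Escaping the parentheses makes them literal characters, so the negated
# digit class can absorb the "_" in front of "(1:111)".
fixed <- sub("[^[:digit:]]*\\(1:[[:digit:]]+\\)$", "", s)
fixed
# "NXF1F-Z10"
```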

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?
Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
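One caveat with this pattern: when the string length is an exact multiple of 5, a trailing space is appended. A hedged sketch of a variant that avoids this with a lookahead follows; the name add_spaces and the parameter k are hypothetical, and the lookahead requires perl = TRUE.

```r
x1 <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"   # 28 characters
x2 <- "XOVEWVJIEW"                     # 10 characters, a multiple of 5

gsub("(.{5})", "\\1 ", x2)
# "XOVEW VJIEW "   (note the trailing space)

# (?=.) requires at least one more character after each block of k,
# so no space is added after the final block.
add_spaces <- function(x, k = 5) {
  gsub(sprintf("(.{%d})(?=.)", as.integer(k)), "\\1 ", x, perl = TRUE)
}

add_spaces(x1)
# "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
add_spaces(x2)
# "XOVEW VJIEW"
```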
Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
No *apply stringi solution:
library(stringi)
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.
You can also use substring without a loop; substring is the vectorized version of substr.
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

Resources