split a string knowing some of the substrings - r

Say I have the following string and a vector of substrings:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
I would like to split this string by extracting the substrings from my vector and making new substrings from the characters in between, so I would get the following :
res <- c("abc", "[[", "+", "de.f", "[", "-", "[[", "g")
in case of conflicting matches the longer wins (here [[ over [), you can consider there won't be conflicting matches of same length.
Tagging with regex but open to any solution, faster being better.
Please don't make any assumption on the type of character used in any of these strings, apart from the fact they're ASCII. There is no pattern to be inferred if I didn't explicitly mention it.
another example :
x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
res <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g", "[*", "+-", "h", "-", "+")

This almost seems more like a lexing problem than a matching problem. I seem to get decent results with the minilexer package
library(minilexer) #devtools::install_github("coolbutuseless/minilexer")
patterns <- c(
dbracket = "\\[\\[",
bracket = "\\[",
plus = "\\+",
minus = "\\-",
name = "[a-z.]+"
)
x <- "abc[[+de.f[-[[g"
lex(x, patterns)
unname(lex(x, patterns))
# [1] "abc" "[[" "+" "de.f" "[" "-"
# [7] "[[" "g"

Using stringr::str_match_all and Hmisc::escapeRegex :
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
tmp <- v[order(-nchar(v))] # sort to have longer first, to match in priority
tmp <- Hmisc::escapeRegex(tmp)
tmp <- paste(tmp,collapse="|") # compile a match string
pattern <- paste0(tmp,"|(.+?)") # add a pattern to match the rest
# extract all matches into a matrix
mat <- stringr::str_match_all(op_chr, pattern)[[1]]
# aggregate where second column is NA
res <- unname(tapply(mat[,1],
cumsum(is.na(mat[,2])) + c(0,cumsum(abs(diff(is.na(mat[,2]))))),
paste, collapse=""))
res
#> [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"

A pure regex-based solution will look like
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res <- regmatches(x, gregexpr(pat, x, perl=TRUE))
## => [[1]]
## [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
See this R demo online. The PCRE regex here is
(?s)\[\[|\+|-|\[|(?:(?!\[\[|\+|-|\[).)+
See the regex demo and the Regulex graph:
Details
(?s) - a DOTALL modifier that makes . match any char including newlines
\[\[ - [[ substring (escaped with regex.escape)
| - or
\+ - a +
|- - or a - (no need to escape - as it is not inside a character class)
|\[ - or [
| - or
(?:(?!\[\[|\+|-|\[).)+ - a tempered greedy token that matches any char (.), 1 or more repetitions as many as possible (+ at the end), that does not start a a [[, +, - or [ character sequences (learn more about tempered greedy token).
You may also consider a less "regex intensive" solution with a TRE regex:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Interleaving function
riffle3 <- function(a, b) {
mlab <- min(length(a), length(b))
seqmlab <- seq(length=mlab)
c(rbind(a[seqmlab], b[seqmlab]), a[-seqmlab], b[-seqmlab])
}
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
res <- riffle3(regmatches(x, gregexpr(pat, x), invert=TRUE)[[1]], regmatches(x, gregexpr(pat, x))[[1]])
res <- res[res != ""]
## => [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
See the R demo.
So, the search items are properly escaped to be used in regex, they are sorted by length in descending order, the regex pattern based on alternation is built dynamically, then all matching and non-matching strings are found and then they are joined into a single character vector and empty items are discarded in the end.

One option to get your matches might be to us an alternation:
[a-z.]+|\[+|[+-]
[a-z.]+ Match 1+ times a-z or dot
| Or
\[+ match 1+ times a [
|` or
[+-] Match + or -
Regex demo | R demo
For example, to get the matches:
library(stringr)
x <- "abc[[+de.f[-[[g"
str_extract_all(x, "[a-z.]+|\\[+|[+-]")

Related

Regex expression to match every nth occurence of a pattern

Consider this string,
str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
I'd like to separate the string at every nth occurrence of a pattern, here -:
f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...
f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...
I know I could do it like this:
spl <- str_split(str, "-", )[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
But I'm looking for a shorter and cleaner solution
What are the possibilities?
What about the following (where 'n-1' is a placeholder for a number):
(?:[^-]*(?:-[^-]*){n-1})\K-
See an online demo
(?: - Open 1st non-capture group;
[^-]* - Match 0+ characters other hyphen;
(?: - Open a nested 2nd non-capture group;
-[^-]* - Match an hyphen and 0+ characters other than hyphen;
){n} - Close nested non-capture group and match n-times;
) - Close 1st non-capture group;
\K- - Forget what we just matched and match the trailing hyphen.
Note: The use of \K means we must use PCRE (perl=TRUE)
To create the 'n-1' we can use sprintf() functionality to use a variable:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
for (n in 1:10) {
print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
}
Prints:
You could use str_extract_all with the pattern \w+(?:-\w+){0,2}, for instance to find terms with 3 words and 2 hyphens:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 2
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
n <- 3
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi-j" "k-lm-n-o" "p-qrst-u-vw" "x-yz"
1) gsubfn gsubfn in the package of the same name is like gsub except that the replacement can be a function, list or proto object. In the case of a proto object one can supply a fun method which has a built in count variable that can be used to distinguish the occurrences. For each match the match is passed to fun and replaced with the output of fun.
We use the input shown in the Note at the end and also n to specify the number of components to use in each element of the result and sep to specify a character that does not appear in the input.
gsubfn replaces every n-th minus with sep and the strsplit splits on that.
No complex regular expressions are needed.
library(gsubfn)
n <- 3
sep <- " "
p <- proto(fun = function(., x) if (count %% n) "-" else sep)
strsplit(gsubfn("-", p, STR), sep)
## [[1]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
##
## [[2]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
2) rollapply Another approach is to split on every - and the paste it together again using rollapply giving the same result as in (1).
library(zoo)
roll <- function(x) rollapply(x, n, by = n, paste, collapse = "-",
partial = TRUE, align = "left")
lapply(strsplit(STR, "-"), roll)
Note
# input
STR = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
STR <- c(STR, STR)
another approach: First split on every split-pattern found, then paste/collapse into groups of n-length, using the split-pattern-variable as collapse character.
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 3
pattern <- "-"
ans <- unlist(strsplit(str, pattern))
sapply(split(ans,
ceiling(seq_along(ans)/n)),
paste0, collapse = pattern)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we defined groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elemnts containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string, as soon as the number of occurances of A, G and N reach a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parenthesis and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N, that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
Identify positions of pattern using gregexpr then extract n-th position (3) and substring everything from 1 to this n-th position using subset.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version without strsplit to Maurits Evers neat solution.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

How can I split a string and ignore the delimiter if it's "quoted"

Say I have the following string:
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
I want to split it using , as a separator, then extract the "quoted substrings", so I get 2 vectors as follow :
params_clean <- c("var1","var2","var3")
params_def <- c("first, variable","","third, variable") # note the empty string as a second element.
I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.
I found a workaround based on read.table and the fact it allows quoted elements :
library(magrittr)
params %>%
gsub("/\\*","_temp_sep_ '",.) %>%
gsub("\\*/","'",.) %>%
read.table(text=.,strin=F,sep=",") %>%
unlist %>%
unname %>%
strsplit("_temp_sep_") %>%
lapply(trimws) %>%
lapply(`length<-`,2) %>%
do.call(rbind,.) %>%
inset(is.na(.),value="")
But it's quite ugly and hackish, what's a simpler way ? I'm thinking there must be a regex to feed to strsplit for this situation.
related to this question
You may use
library(stringr)
cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
res <- str_match_all(params, cmnt_rx)
params_clean <- res[[1]][,2]
params_clean
## => [1] "var1" "var2" "var3"
params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
params_def[is.na(params_def)] <- ""
params_def
## => [1] "first, variable" "" "third, variable"
The main regex details (it is actually (\w+)\s*)(COMMENTS_REGEX)?):
(\w+) - Capturing group 1: one or more word chars
\s* - 0+ whitespace chars
( - Capturing group 2 start
/\* - match the comment start /*
[^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
(?:[^/*][^*]*\*+)* - 0+ sequences of:
[^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
/ - closing /
)? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).
See the regex demo.
The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.
params_def[is.na(params_def)] <- "" part replaces NA with empty strings.
Here you are
library(stringr)
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
# Split by , which are not enclosed in your /*...*/
params_split <- str_split(params, ",(?=[^(/[*])]*(/[*]))")[[1]]
# Extract matches of /*...*/, only taking the contents
params_def <- str_match(params_split, "/[*] *(.*?) *[*]/")[,2]
params_def[is.na(params_def)] <- ""
# Remove traces of /*...*/
params_clean <- trimws(gsub("/[*] *(.*?) *[*]/", "", params_split))
You can wrap it in a function and use the (not well documented) (*SKIP)(*FAIL) mechanism in plain R:
getparams <- function(params) {
tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))
params_clean <- vector(length = length(tmp))
params_def <- vector(length = length(tmp))
for (i in seq_along(tmp)) {
# get params_def if available
match <- regmatches(tmp[i], regexec("/\\*(.*?)\\*/", tmp[i]))
params_def[i] <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))
# params_clean
params_clean[i] <- trimws(gsub("/(.*)\\*.*?\\*/", "\\1", tmp[i]))
}
return(list(params_clean = params_clean, params_def = params_def))
}
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
getparams(params)
This splits the initial string using (*SKIP)(*FAIL) (see a demo on regex101.com) and analyzes the parts afterwards.
This yields a list:
$params_clean
[1] "var1" "var2" "var3"
$params_def
[1] "first, variable" "" "third, variable"
Or, shorter with sapply:
getparams <- function(params) {
tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))
(p <- sapply(tmp, function(x) {
match <- regmatches(x, regexec("/\\*(.*?)\\*/", x))
def <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))
clean <- trimws(gsub("/(.*)\\*.*?\\*/", "\\1", x))
c(clean, def)
}, USE.NAMES = F))
}
Which will yield a matrix:
[,1] [,2] [,3]
[1,] "var1" "var2" "var3"
[2,] "first, variable" "" "third, variable"
With the latter, you get the variable names with e.g. result[1,].

R Match And Sub On Space Between Specific Characters

I need a little help with a regular expression using gsub. Take this object:
x <- "4929A 939 8229"
I want to remove the space in between "A" and "9", but I am not sure how to match on only the space between them and not on the second space. I essentially need something like this:
x <- gsub("A 9", "", x)
But I am not sure how to write the regular expression to not match on the "A" and "9" and only the space between them.
Thanks in advance!
You may use the following regex in sub:
> x <- "4929A 939 8229"
> sub("\\s+", "", x)
[1] "4929A939 8229"
The \\s+ will match 1 or more whitespace symbols.
The replacement part is an empty string.
See the online R demo
gsub matches/uses all regex found whereas sub only matches/uses the first one. So
sub(" ", "", "4929A 939 8229") # returns "4929A939 8229"
Will do the job
Removing second/nth occurence
You can do that e.g. by using strsplit as follows:
x <- c("4929A 939 8229", "4929A 9398229")
collapse_nth <- function(x_split, split, nth, replacement){
left <- paste(x_split[seq_len(nth)], collapse = split)
right <- paste(x_split[-seq_len(nth)], collapse = split)
paste(left, right, sep = replacement)
}
remove_nth <- function(x, nth, split, replacement = ""){
x_split <- strsplit(x, split, fixed = TRUE)
x_len <- vapply(x_split, length, integer(1))
out <- x
out[x_len>nth] <- vapply(x_split[x_len>nth], collapse_nth, character(1), split, nth, replacement)
out
}
Which gives you:
# > remove_nth(x, 2, " ")
# [1] "4929A 9398229" "4929A 9398229"
and
# > remove_nth(x, 2, " ", "---")
# [1] "4929A 939---8229" "4929A 9398229"

Resources