Subset string by counting specific characters - r

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string, as soon as the number of occurances of A, G and N reach a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.

You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parenthesis and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N, that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"

Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))

Identify positions of pattern using gregexpr then extract n-th position (3) and substring everything from 1 to this n-th position using subset.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.

This is just a version without strsplit to Maurits Evers neat solution.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})

Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

Related

Regex expression to match every nth occurence of a pattern

Consider this string,
str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
I'd like to separate the string at every nth occurrence of a pattern, here -:
f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...
f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...
I know I could do it like this:
spl <- str_split(str, "-", )[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
But I'm looking for a shorter and cleaner solution
What are the possibilities?
What about the following (where 'n-1' is a placeholder for a number):
(?:[^-]*(?:-[^-]*){n-1})\K-
See an online demo
(?: - Open 1st non-capture group;
[^-]* - Match 0+ characters other hyphen;
(?: - Open a nested 2nd non-capture group;
-[^-]* - Match an hyphen and 0+ characters other than hyphen;
){n} - Close nested non-capture group and match n-times;
) - Close 1st non-capture group;
\K- - Forget what we just matched and match the trailing hyphen.
Note: The use of \K means we must use PCRE (perl=TRUE)
To create the 'n-1' we can use sprintf() functionality to use a variable:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
for (n in 1:10) {
print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
}
Prints:
You could use str_extract_all with the pattern \w+(?:-\w+){0,2}, for instance to find terms with 3 words and 2 hyphens:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 2
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
n <- 3
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi-j" "k-lm-n-o" "p-qrst-u-vw" "x-yz"
1) gsubfn gsubfn in the package of the same name is like gsub except that the replacement can be a function, list or proto object. In the case of a proto object one can supply a fun method which has a built in count variable that can be used to distinguish the occurrences. For each match the match is passed to fun and replaced with the output of fun.
We use the input shown in the Note at the end and also n to specify the number of components to use in each element of the result and sep to specify a character that does not appear in the input.
gsubfn replaces every n-th minus with sep and the strsplit splits on that.
No complex regular expressions are needed.
library(gsubfn)
n <- 3
sep <- " "
p <- proto(fun = function(., x) if (count %% n) "-" else sep)
strsplit(gsubfn("-", p, STR), sep)
## [[1]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
##
## [[2]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
2) rollapply Another approach is to split on every - and the paste it together again using rollapply giving the same result as in (1).
library(zoo)
roll <- function(x) rollapply(x, n, by = n, paste, collapse = "-",
partial = TRUE, align = "left")
lapply(strsplit(STR, "-"), roll)
Note
# input
STR = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
STR <- c(STR, STR)
another approach: First split on every split-pattern found, then paste/collapse into groups of n-length, using the split-pattern-variable as collapse character.
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 3
pattern <- "-"
ans <- unlist(strsplit(str, pattern))
sapply(split(ans,
ceiling(seq_along(ans)/n)),
paste0, collapse = pattern)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"

split a string knowing some of the substrings

Say I have the following string and a vector of substrings:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
I would like to split this string by extracting the substrings from my vector and making new substrings from the characters in between, so I would get the following :
res <- c("abc", "[[", "+", "de.f", "[", "-", "[[", "g")
in case of conflicting matches the longer wins (here [[ over [), you can consider there won't be conflicting matches of same length.
Tagging with regex but open to any solution, faster being better.
Please don't make any assumption on the type of character used in any of these strings, apart from the fact they're ASCII. There is no pattern to be inferred if I didn't explicitly mention it.
another example :
x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
res <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g", "[*", "+-", "h", "-", "+")
This almost seems more like a lexing problem than a matching problem. I seem to get decent results with the minilexer package
library(minilexer) #devtools::install_github("coolbutuseless/minilexer")
patterns <- c(
dbracket = "\\[\\[",
bracket = "\\[",
plus = "\\+",
minus = "\\-",
name = "[a-z.]+"
)
x <- "abc[[+de.f[-[[g"
lex(x, patterns)
unname(lex(x, patterns))
# [1] "abc" "[[" "+" "de.f" "[" "-"
# [7] "[[" "g"
Using stringr::str_match_all and Hmisc::escapeRegex :
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
tmp <- v[order(-nchar(v))] # sort to have longer first, to match in priority
tmp <- Hmisc::escapeRegex(tmp)
tmp <- paste(tmp,collapse="|") # compile a match string
pattern <- paste0(tmp,"|(.+?)") # add a pattern to match the rest
# extract all matches into a matrix
mat <- stringr::str_match_all(op_chr, pattern)[[1]]
# aggregate where second column is NA
res <- unname(tapply(mat[,1],
cumsum(is.na(mat[,2])) + c(0,cumsum(abs(diff(is.na(mat[,2]))))),
paste, collapse=""))
res
#> [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
A pure regex-based solution will look like
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res <- regmatches(x, gregexpr(pat, x, perl=TRUE))
## => [[1]]
## [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
See this R demo online. The PCRE regex here is
(?s)\[\[|\+|-|\[|(?:(?!\[\[|\+|-|\[).)+
See the regex demo and the Regulex graph:
Details
(?s) - a DOTALL modifier that makes . match any char including newlines
\[\[ - [[ substring (escaped with regex.escape)
| - or
\+ - a +
|- - or a - (no need to escape - as it is not inside a character class)
|\[ - or [
| - or
(?:(?!\[\[|\+|-|\[).)+ - a tempered greedy token that matches any char (.), 1 or more repetitions as many as possible (+ at the end), that does not start a a [[, +, - or [ character sequences (learn more about tempered greedy token).
You may also consider a less "regex intensive" solution with a TRE regex:
x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
## Escaping function
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Interleaving function
riffle3 <- function(a, b) {
mlab <- min(length(a), length(b))
seqmlab <- seq(length=mlab)
c(rbind(a[seqmlab], b[seqmlab]), a[-seqmlab], b[-seqmlab])
}
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
res <- riffle3(regmatches(x, gregexpr(pat, x), invert=TRUE)[[1]], regmatches(x, gregexpr(pat, x))[[1]])
res <- res[res != ""]
## => [1] "abc" "[[" "+" "de.f" "[" "-" "[[" "g"
See the R demo.
So, the search items are properly escaped to be used in regex, they are sorted by length in descending order, the regex pattern based on alternation is built dynamically, then all matching and non-matching strings are found and then they are joined into a single character vector and empty items are discarded in the end.
One option to get your matches might be to us an alternation:
[a-z.]+|\[+|[+-]
[a-z.]+ Match 1+ times a-z or dot
| Or
\[+ match 1+ times a [
|` or
[+-] Match + or -
Regex demo | R demo
For example, to get the matches:
library(stringr)
x <- "abc[[+de.f[-[[g"
str_extract_all(x, "[a-z.]+|\\[+|[+-]")

substring replace nth positions R

I need to replace the 6,7,8th position to "_". In substring, I mentioned the start and stop position. It didn't work.
> a=c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
> substring(a, 6,8) <- "_"
> a
[1] "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"
I need UHI78_RH2V TYR32_ASJKDG DHA92_SYFN34
Using sub, we can match on the pattern (?<=^.{5}).{3}, and then replace it by a single underscore:
a <- c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
out <- sub("(?<=^.{5}).{3}", "_", a, perl=TRUE)
out
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
Demo
We could also try doing substring operations here, but we would have to do some splicing:
out <- paste0(substr(a, 1, 5), "_", substr(a, 9, nchar(a)))
1) str_sub<- The str_sub<- replacement function in the stringr package can do that.
library(stringr)
str_sub(a, 6, 8) <- "_"
a
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
2 Base R With only base R you could do this. It replaces the entire string with the match to the first capture group, an underscore and the match to the second capture group.
sub("(.....)...(.*)", "\\1_\\2", a)
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
That regex could also be written as "(.{5}).{3}(.*)" .
3) separate/unite If a is a column in a data frame then we could use dplyr and tidyr to do this:
library(dplyr)
library(tidyr)
DF <- data.frame(a)
DF %>%
separate(a, into = c("pre", "junk", "post"), sep = c(5, 8)) %>%
select(-junk) %>%
unite(a)
giving:
a
1 UHI78_RH2V
2 TYR32_ASJKDG
3 DHA92_SYFN34
From the documentation:
If the portion to be replaced is longer than the replacement string, then only the portion the length of the string is replaced.
So we could do something like this:
substring(a, 6,8) <- "_##"
sub("#+", "", a)
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"

How to replace only 2nd character from a string in R

s<-"HsdKjnsjsHLKsmH"
how to substitute a character at a particular position in R
replace only second "H" in the string s with "Q"
We can use gregexpr and substr. gregexpr finds all matches and returns the position of the matches. Using the position of the second match, we can then replace the second "H" with "Q" using substr. This guarantees that it's always the second "H" that we're replacing:
s = "HsdKjnsjsHLKsmH"
pos <- gregexpr("H", s)[[1]][2]
substr(s, pos, pos) <- "Q"
# [1] "HsdKjnsjsQLKsmH"
Another method using stringr, and making it a :
library(stringr)
str_pos_replace <- function(string, pattern, replacement, pos=1){
str_locate_all(string, pattern)[[1]][pos,, drop=FALSE] %>%
`str_sub<-`(string, ., value = replacement)
}
str_pos_replace(s, "H", "QQQ", 2)
# [1] "HsdKjnsjsQQQLKsmH"
s = "HsdKjnsjsHLKsmH"
sub("[^H]*H[^H]*\\KH","Q",s,perl=T)
#[1] "HsdKjnsjsQLKsmH"
Try the following.
s <- "HsdKjnsjsHLKsmH"
sub("(H[^H]*)H", "\\1Q", s)
If you want to generalize the code above, here is a function that does so.
replaceSecond <- function(s, old, new){
pattern <- paste0("(", old, "[^", old, "]*)", old)
new <- paste0("\\1", new)
sub(pattern, new, s)
}
replaceSecond(s, "H", "Q")
#[1] "HsdKjnsjsQLKsmH"

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?
Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to #flodel for the condense expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
No *apply stringi solution:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in #Jilber answer, but stri_sub function is vectorized se we don't need to use *apply here.
You can also use a sub-string without a loop. substring is the vectorized substr
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

Resources