Find nested substrings that match a pattern in a string - r

I need to find all the substring within this string 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM' that start with the character M and end with the character *.
I tried to use str_extract_all() and stri_extract_all() but I can't get the result I want:
aa <- 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'
str_extract_all(aa, 'M.*\\*')[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*"
stri_extract_all(aa, regex = ('M.*/*'))[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM"
But I get a substring that starts with the first M and ends with either the last *, or with the last character of aa. I would like to get, instead, are all the substrings, even if one is nested within another:
MDDDVLPLISLFWTFGRGDVPRRY*
MCAPARH*
MDLWIRASICWGMGLLN*
MGLLN*
MNPDARGFSRV*
Here are some info on my software versions:
Windows 10
R version 3.5.2
R studio version 1.1.463
stringr version 1.4.0
I'm sorry if I used the wrong lingo, I'm still new to programming.
Thank you for all your help!

The need to find all nested substrings suggests that recursion may be the simplest way:
First remove everything after the final * (since the strings we search must be delimited by a final * according to the question).
x = sub("*[^*]+$", "", aa)
Now let's split this at every *
y = unlist(strsplit(x, '*', fixed = T))
and keep only the strings that contain at least one M
y = grep('M', y, value = T)
Now we use a recursive function to get all the substrings
find.M = function(z){
z = sub('.+?M', 'M', z)
if (length(zz <- grep('.+M', z, value = T))) {
c(z, find.M(sub('.+?M','M',zz)))
}
else z
}
find.M(y)
# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY"
# [2] "MCAPARH"
# [3] "MDLWIRASICWGMGLLN"
# [4] "MNPDARGFSRV"
# [5] "MDDDVLPLISLFWTFGRGDVPRRY"
# [6] "MGLLN"

EDIT: This does not exactly result in the desired output but thought I would share it(since I also spent some time on it):
library(stringi)
result<-unlist(strsplit(aa,".(?=M.*)",perl = TRUE))
res<-unlist(stri_split(unlist(result),regex="[A-Z](?<=\\*[A-Z]|(?<=\\M[A-Z]))"))
res1<-res[grep("^M",unlist(res))]
res1[stri_endswith(res1,charclass = "[*|W]")]
#[1] "MDDDVLPLISLFWTFGRGDVPRRY*" "MCAPARH*" "MDLWIRASICW"
#[4] "MGLLN*" "MNPDARGFSRV*"
ORIGINAL:
We can use(This has removed the * at the end):
aa<-'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'
aa
res1<-unlist(strsplit(aa,".(?=M)",perl = TRUE))
res2<-unlist(strsplit(res1[grep("\\*{1,}",res1)],"\\*"))
res2[grep("^M",res2)]
Result:
# [1] "MDDDVLPLISLFWTFGRGDVPRRY" "MCAPARH" "MGLLN"
# [4] "MNPDARGFSRV"

You can use [^\\*]* to match anything except the asterix. Noting that you want all matches, including any overlapping patterns, we can add a lookahead. This doesn't seem to be supported with stringr but works with stringi::stri_match_all_regex():
library(stringi)
stri_match_all_regex(aa, '(?=(M[^\\*]*\\*))')[[1]][,2]
# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*"
# [2] "MDDDVLPLISLFWTFGRGDVPRRY*"
# [3] "MCAPARH*"
# [4] "MDLWIRASICWGMGLLN*"
# [5] "MGLLN*"
# [6] "MNPDARGFSRV*"

Related

how to extract specific character using str_extrac() in R

Context
I have a character vector a.
I want to extract the text between the last slash(/) and the .nc using the str_extract()function.
I have tried like this: str_extract(a, "(?=/).*(?=.nc)"), but failed.
Question
How can I get the text between the last lash and .nc in character vector a.
Reproducible code
a = c(
'data/temp/air/pm2.5/pm2.5_year_2014.nc',
'data/temp/air/pm10/pm10_year_2014.nc',
'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)
# My solution (failed)
str_extract(a, "(?=/).*(?=.nc)")
# [1] "/temp/air/pm2.5/pm2.5_year_2014"
# [2] "/temp/air/pm10/pm10_year_2014"
# [3] "/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe"
# The expected output should like this:
# [1] "pm2.5_year_2014"
# [2] "pm10_year_2014"
# [3] "ss_fef_10233_dfdfe"
Here is a regex replacement approach:
a = c(
'data/temp/air/pm2.5/pm2.5_year_2014.nc',
'data/temp/air/pm10/pm10_year_2014.nc',
'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)
output <- gsub(".*/|\\.[^.]+$", "", a)
output
[1] "pm2.5_year_2014" "pm10_year_2014" "ss_fef_10233_dfdfe"
Here is the regex logic:
.*/ match everything from the start of the string until the last /
| OR
\.[^.]+$ match everything from final dot until the end of the string
Then we replace these matches by empty string to remove them, leaving behind the filenames.

how to extract part of a string matching pattern with separation in r

I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.
Example file names:
a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"
I'm trying to pull out the parts that start with "model" so I end up with
"modelExisting"
"modelDesign"
The filenames are stored as a variable in a data.frame
I've tried
library(tidyverse)
tibble(files = c(a,b))%>%
mutate(attempt1 = str_extract(files, "model"),
attempt2 = str_match(str_split(files, "_"), "model"))
but just ended up with the "model" in all cases and not the "model...." that I need.
The pieces I need are a consisent number of pieces from the end, but I couldn't figure out how to specify that either. I tried
str_split(files, "_")[-3]
but this threw an error that it must be size 480 or 1 not size 479
We can create a function to capture the word before the _ and one or more digits (\\1), in the replacement, specify the backreference (\\1) of the captured group
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)
-testing
> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
We can use strsplit or regmatches like below
> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")
> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"

How to use str_split with regex in R?

I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string by the 6-digit numbers. I.e. - I want this:
235072,testing,some2wg2f4,wf484-things
224072,and,other25wg4,14-thingies
223552,testing,some/2wr24,14084-things
How do I do this with regex? The following does not work (using stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing??
Here's an approach with base R using a positive lookahead and lookbehind, and thanks to #thelatemail for the correction:
strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
An alternative approach with str_extract_all. Note I've used .*? to do 'non-greedy' matching, otherwise .* expands to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
An easy-to-understand approach is to add a marker and then split on the locations of those markers. This has the advantage of being able to only look for 6-digit sequences and not require any other features in the surrounding text, whose features may change as you add new and unvetted data.
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
out <-
str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
str_split("#SPLIT_HERE#") %>%
unlist
[1] "" "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
If your match occurs at the start or end of a string, str_split() will insert blank character entries in the results vector to indicate that (as it did above). If you don't need that information, you can easily remove it with out[nchar(out) != 0].
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"
With less complex regex, you can do as following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"

Extracting and matching regular expressions in R

I have a list of strings, an example is shown below (the actual list has a much bigger variety in format)
[1] "AB-123"
[2] "AB-312"
[3] "AB-546"
[4] "ZXC/123456"
Assuming [1] is the correct format, I want to extract the regular expression from [1] and match it against the rest to detect that [4] is inconsistent. Is there a method to do this or is there a better way to achieve the same outcome?
*EDIT - I found something close to what I require, anyone know of any packages that does this?
Given a string, generate a regex that can parse *similar* strings
We may need grep
grepl(sub("-.*", "", v1[1]), v1[-1])
data
v1 <- c( "AB-123" , "AB-312" , "AB-546" , "ZXC/123456")
Here's an attempt at making a function which checks if each value is a Character Digit or Other. It is a bit rough but I'm sure this can be expanded upon to match exactly what you want:
test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
compare_1st <- function(x) {
x <- toupper(x)
chars <- list("A",1,"-")
repl <- c("[A-Z]", "[0-9]", "[^0-9A-Z]")
for(i in seq_along(repl)) x <- gsub(repl[i], chars[i], x)
out <- x[1] == x
attr(out, "values") <- chartr("A1-", "CDO", x)
out
}
compare_1st(test)
#[1] TRUE TRUE TRUE FALSE
#attr(,"values")
#[1] "CCODDD" "CCODDD" "CCODDD" "CCCODDDDDD"

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

Resources