How to use str_split with regex in R? - r

I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string by the 6-digit numbers. I.e. - I want this:
235072,testing,some2wg2f4,wf484-things
224072,and,other25wg4,14-thingies
223552,testing,some/2wr24,14084-things
How do I do this with regex? The following does not work (using stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing??

Here's an approach with base R using a positive lookahead and lookbehind, and thanks to #thelatemail for the correction:
strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"

An alternative approach with str_extract_all. Note I've used .*? to do 'non-greedy' matching, otherwise .* expands to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"

An easy-to-understand approach is to add a marker and then split on the locations of those markers. This has the advantage of being able to only look for 6-digit sequences and not require any other features in the surrounding text, whose features may change as you add new and unvetted data.
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
out <-
str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
str_split("#SPLIT_HERE#") %>%
unlist
[1] "" "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
If your match occurs at the start or end of a string, str_split() will insert blank character entries in the results vector to indicate that (as it did above). If you don't need that information, you can easily remove it with out[nchar(out) != 0].
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"

With less complex regex, you can do as following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"

Related

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex expression so it splits each string element at the "_" after the first occurrence of "R", "L" unless the first occurrence of "R" or "L" is followed by "pole", then it splits the string element after the first occurrence of "pole" and only splits each string element once?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
See the regex demo
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
split_again = function(x){
if(length(x) > 1){
return(x)
}
else{
str_split(
string = x,
pattern = '(?<=[R|L])_',
n = 2)
}
}
str_split(
string = string,
pattern = '(?<=pole)_',
n = 2) %>%
lapply(split_again) %>%
unlist()
you could use sub then strsplit as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"

Find nested substrings that match a pattern in a string

I need to find all the substring within this string 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM' that start with the character M and end with the character *.
I tried to use str_extract_all() and stri_extract_all() but I can't get the result I want:
aa <- 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'
str_extract_all(aa, 'M.*\\*')[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*"
stri_extract_all(aa, regex = ('M.*/*'))[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM"
But I get a substring that starts with the first M and ends with either the last *, or with the last character of aa. I would like to get, instead, are all the substrings, even if one is nested within another:
MDDDVLPLISLFWTFGRGDVPRRY*
MCAPARH*
MDLWIRASICWGMGLLN*
MGLLN*
MNPDARGFSRV*
Here are some info on my software versions:
Windows 10
R version 3.5.2
R studio version 1.1.463
stringr version 1.4.0
I'm sorry if I used the wrong lingo, I'm still new to programming.
Thank you for all your help!
The need to find all nested substrings suggests that recursion may be the simplest way:
First remove everything after the final * (since the strings we search must be delimited by a final * according to the question).
x = sub("*[^*]+$", "", aa)
Now let's split this at every *
y = unlist(strsplit(x, '*', fixed = T))
and keep only the strings that contain at least one M
y = grep('M', y, value = T)
Now we use a recursive function to get all the substrings
find.M = function(z){
z = sub('.+?M', 'M', z)
if (length(zz <- grep('.+M', z, value = T))) {
c(z, find.M(sub('.+?M','M',zz)))
}
else z
}
find.M(y)
# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY"
# [2] "MCAPARH"
# [3] "MDLWIRASICWGMGLLN"
# [4] "MNPDARGFSRV"
# [5] "MDDDVLPLISLFWTFGRGDVPRRY"
# [6] "MGLLN"
EDIT: This does not exactly result in the desired output but thought I would share it(since I also spent some time on it):
library(stringi)
result<-unlist(strsplit(aa,".(?=M.*)",perl = TRUE))
res<-unlist(stri_split(unlist(result),regex="[A-Z](?<=\\*[A-Z]|(?<=\\M[A-Z]))"))
res1<-res[grep("^M",unlist(res))]
res1[stri_endswith(res1,charclass = "[*|W]")]
#[1] "MDDDVLPLISLFWTFGRGDVPRRY*" "MCAPARH*" "MDLWIRASICW"
#[4] "MGLLN*" "MNPDARGFSRV*"
ORIGINAL:
We can use(This has removed the * at the end):
aa<-'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'
aa
res1<-unlist(strsplit(aa,".(?=M)",perl = TRUE))
res2<-unlist(strsplit(res1[grep("\\*{1,}",res1)],"\\*"))
res2[grep("^M",res2)]
Result:
# [1] "MDDDVLPLISLFWTFGRGDVPRRY" "MCAPARH" "MGLLN"
# [4] "MNPDARGFSRV"
You can use [^\\*]* to match anything except the asterix. Noting that you want all matches, including any overlapping patterns, we can add a lookahead. This doesn't seem to be supported with stringr but works with stringi::stri_match_all_regex():
library(stringi)
stri_match_all_regex(aa, '(?=(M[^\\*]*\\*))')[[1]][,2]
# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*"
# [2] "MDDDVLPLISLFWTFGRGDVPRRY*"
# [3] "MCAPARH*"
# [4] "MDLWIRASICWGMGLLN*"
# [5] "MGLLN*"
# [6] "MNPDARGFSRV*"

removes part of string in r

I'm trying to extract ES at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result
[1] "phra" "pha" "princes" "clas" "pas"
but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"
You can use a word boundary (\\b) if it is guaranteed that each word is followed by a punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
maybe you are looking for a lookbehind assertion (which is a 0 length match)
"(?<=s)es\\b"
or because lookbehind can't have a variable length perl \K construct to keep out of match left of \K
"\\ws\\Kes\\b"

Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I ned to extract the three letters after "teams/" (so for the example above, I need "MIL")
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.
1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub or with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is all upper case and no other characters in the input are this gsub which removes all characters that are not upper case letters would work:
gsub("[^A-Z]", "", u)
## [1] "MIL"
Many different ways to achieve this using regexp's. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url);
#[1] "MIL"
Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"
Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with positive lookbehind to catch alphanumeric codes following the string "teams\":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"

Use substr until condition is met

I have a vector from which I just need the first word. The words have different lengths. Words are separated by a symbol (. and _) How can I use the substr() function to get a new vector with just the first word?
I was thinking of something like this
x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
y <- substr(x,0, ???)
I think sub with some regular expressions would be the easiest solution:
sub(pattern = "[._].*", replacement = "", x = x)
# [1] "wooombel" "mugran" "friendly" "hungry"
Try:
sapply(strsplit(x,'[._]'), function(x) x[1])
[1] "wooombel" "mugran" "friendly" "hungry"
You could also use package stringr. It has some really handy functions for string manipulation.
One that comes to mind for this problem is word. It has a sep argument that allows the use of a regular expression.
> x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
> library(stringr)
> word(x, sep = "[._]")
# [1] "wooombel" "mugran" "friendly" "hungry"
Another option that allows you to continue to use substr is str_locate. So if we just subtract 1 from its result, we can get the desired first words.
> substr(x, 1, str_locate(x, "[._]")-1)
# [1] "wooombel" "mugran" "friendly" "hungry"
An extraction approach with stringi:
library(stringi)
stri_extract_first_regex(x, "[a-z]+(?=[._])")
## [1] "wooombel" "mugran" "friendly" "hungry"
Though "[^a-z]+(?=[._])" may be more explicit.
Regex explanation:
[^a-z]+ any character except: 'a' to 'z' (1 or
more times)
(?= look ahead to see if there is:
[._] any character of: '.', '_'
) end of look-ahead

Resources