Trying to match and replace "and" or "or" to "&" and "|" when it occurs outside of quotes except when they occur within quotes.
Quotes could be single(') or double(").
The string is as follows:
Industry ='Education' or Industry =\"Energy\" or Industry = 'Financial or Bank' or Industry = 'Hospitality' or Industry = \"Food and Beverage\" and Industry = 'Utilities'
Expected output:
Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' | Industry = 'Hospitality' | Industry = \"Food and Beverage\" & Industry = 'Utilities'
I know that we might have to use lookarounds but cant figure out how. I am using stringr package in R for all my regex manipulations.
Let me know if you need more info.
You should consider an approach to match double- and single-quoted substrings to omit them and only process and or or in all other contexts. The easiest way is to use gsubfn where you may pass a PCRE regex that will do exactly that:
> library(gsubfn)
> x <- "Industry ='Education' or Industry =\"Energy\" or Industry = 'Financial or Bank' or Industry = 'Hospitality' or Industry = \"Food and Beverage\" and Industry = 'Utilities'"
> pat = "(?:\"[^\"]*\"|'[^']*')(*SKIP)(*F)|\\b(and|or)\\b"
> gsubfn(pat, ~ ifelse(z=="or","|", "&"), x, backref=0, perl=TRUE)
[1] "Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' | Industry = 'Hospitality' | Industry = \"Food and Beverage\" & Industry = 'Utilities'"
The (?:\"[^\"]*\"|'[^']*')(*SKIP)(*F)|\\b(and|or)\\b pattern will match:
(?: - an alternation group:
\"[^\"]*\" - a double quoted substring having no double quotes inside
| - or
'[^']*' - a single quoted substring
) - end of the group
(*SKIP)(*F) - discard the match, proceed looking for the next match
| - or
\\b(and|or)\\b - Group 1: either an and or or as a whole word.
See the regex demo.
Depending on how the literal " and ' are escaped inside "..." and '...' you will need to adjust the (?:\"[^\"]*\"|'[^']*') part of the regex.
The ~ ifelse(z=="or","|", "&") part is a callback function that receives the only argument (named z inside this function) and its contents are the match value you get from the regex (i.e. either or or and). If the match value is equal to or, the match is substituted with |, else, with &.
this is an ugly way to do it, but it's working for your specific case:
For Or :
(?:'|")(?:.*?)(?:'|")(?:.*?)(or)(?:.*?)
For And :
(?:'|")(?:.*?)(?:'|")(?:.*?)(and)(?:.*?)
i recommend using https://regex101.com/ to help build and test your regex
Your question has potential problems, because nested content may not be handled well or at all by a single regex. That being said, if we assume that the or values you want to replace by pipes always occur after a quoted string, then we can try the following:
gsub("([\"'])\\s*or", "\\1 |", input)
[1] "Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' |
Industry = 'Hospitality' | Industry = \"Food and Beverage\" and Industry = 'Utilities'"
By inspection, the or values occurring inside quoted strings are surrounded on both sides by unquoted words. Obviously, this may break down upon seeing other data, or more nested content.
Demo
Related
I'm having an issue with the stringr::str_replace_all function. I'm trying to replace all instances of iv with insuredvehicle, but the function only seems to catch the first term.
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The outcome looks like the following, which missed the 2nd iv term:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
I believe the issue is that the 2 instances share a space, which is part of the search pattern. I did that because I want to replace the iv term, and not iv within driver.
I DON'T want to simply consolidate the repeated terms to 1. I'd like the result to look like:
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
I'd appreciate any help getting this to work!
Maybe if you include a word boundary in your regex, than remove the white spaces from the replacement? It is ideal when you want just a full word matching the pattern, but not parts of words, while staying away from these blank space issues.
\\bseems to do the trick
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]
new_text
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
Output:
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the uninsuredvehicle uninsuredvehicle at a stop"
Use space boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
iv 'iv'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )
(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no no immediately to the left of the current location, and only then RT is consumed.
I’m still new to R and regexes, but I’m trying to achieve the following; suppose I have a data table of the following sort:
Title | URL
stackoverflow.com | https://stackoverflow.com
google.com | http://
youtube.com | https://youtube.com
overclock.net | https://
I want to append the cells in column URL with their corresponding value in column Title, in case URL consists only of either http:// or https://, so the desired output would look as follows:
Title | URL
stackoverflow.com | https://stackoverflow.com
google.com | http://google.com
youtube.com | https://youtube.com
overclock.net | https://overclock.net
To do so, I tried using the sub function in conjunction with a lookahead regex as follows:
dt$URL <- sub("(?:^|\\W)https?://(?:$|\\W)", "\\1", dt$Title, perl = TRUE)
or
dt$URL <- sub("\\s(https?://)", "\\1", dt$Title, perl = TRUE)
or
dt$URL <- sub("\\b(https?://\\b)", "\\1", dt$Title, perl = TRUE)
But none of the above produces the desired output. The issue is that it doesn’t append/replace anything at all, possibly because the regex doesn’t match anything, or it also matches if there is more data than just http:// or https:// present, i.e. it will also match on a full domain name (which I do not want). How should I adjust my code so that it produces the desired output, given the example input above?
Thank you!
url.col <- c("https://stackoverflow.com",
"http://",
"https://youtube.com",
"https://")
title.col <- c("stackoverflow.com",
"google.com",
"youtube.com",
"overclock.net")
ifelse(grepl("^(\\w*http(s)?://)$", url.col), # if pattern matches url.col elem:
paste0(url.col, title.col), # join content of cols together and return!
url.col) # but if not return url.col element 'as is'
[1] "https://stackoverflow.com"
[2] "http://google.com"
[3] "https://youtube.com"
[4] "https://overclock.net"
Have the following string that I'd like to parse:
((K00134,K00150) K00927,K11389) (K00234,K00235)
each step is separated by a space and alternation is represented by a comma. I'm stuck in the first part of the string where there is a space inside the brackets. The desired output I'm looking for is:
[[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
What I've got so far is a basic setup to do recursive parsing, but I'm stumped on how to code in a space separated list into the bracket expression
from pyparsing import Word, Literal, Combine, nums, \
Suppress, delimitedList, Group, Forward, ZeroOrMore
ortholog = Combine(Literal('K') + Word(nums, exact=5))
exp = Forward()
ortholog_group = Suppress('(') + Group(delimitedList(ortholog)) + Suppress(')')
atom = ortholog | ortholog_group | Group(Suppress('(') + exp + Suppress(')'))
exp <<= atom + ZeroOrMore(exp)
You are on the right track, but I think you only need one place where you include grouping with ()'s, not two.
import pyparsing as pp
LPAR,RPAR = map(pp.Suppress, "()")
ortholog = pp.Combine('K' + pp.Word(pp.nums, exact=5))
ortholog_group = pp.Forward()
ortholog_group <<= pp.Group(LPAR + pp.OneOrMore(ortholog_group | pp.delimitedList(ortholog)) + RPAR)
expr = pp.OneOrMore(ortholog_group)
tests = """\
((K00134,K00150) K00927,K11389) (K00234,K00235)
"""
expr.runTests(tests)
gives:
((K00134,K00150) K00927,K11389) (K00234,K00235)
[[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
[0]:
[['K00134', 'K00150'], 'K00927', 'K11389']
[0]:
['K00134', 'K00150']
[1]:
K00927
[2]:
K11389
[1]:
['K00234', 'K00235']
This is not exactly what you said you were looking for:
you wanted: [[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
I output : [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
I'm not sure why there is grouping in your desired output around the space-separated part (K00134,K00150) K00927. Is this your intention or a typo? If intentional, you'll need to rework the definition of ortholog_group, something that will do a delimited list of space-delimited groups in addition to the grouping at parens. The closest I could get was this:
[[[[['K00134', 'K00150']], 'K00927'], ['K11389']], [['K00234', 'K00235']]]
which required some shenanigans to group on spaces, but not group bare orthologs when grouped with other groups. Here is what it looked like:
ortholog_group <<= pp.Group(LPAR + pp.delimitedList(pp.Group(ortholog_group*(1,) & ortholog*(0,))) + RPAR) | pp.delimitedList(ortholog)
The & operator in combination with the repetition operators gives the space-delimited grouping (*(1,) is equivalent to OneOrMore, *(0,) with ZeroOrMore, but also supports *(10,) for "10 or more", or *(3,5) for "at least 3 and no more than 5"). This too is not quite exactly what you asked for, but may get you closer if indeed you need to group the space-delimited bits.
But I must say that grouping on spaces is ambiguous - or at least confusing. Should "(A,B) C D" be [[A,B],C,D] or [[A,B],C],[D] or [[A,B],[C,D]]? I think, if possible, you should permit comma-delimited lists, and perhaps space-delimited also, but require the ()'s when items should be grouped.
I have a column within a data frame containing long text sequences (often in the thousands of characters) of the format:
abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)
i.e. a string, a suffix in brackets, then a delimiter
I'm trying to work out the syntax in R to delete the items that end in (VR), including the trailing pipe if present, so that I'm left with:
ddee(NR) | def(NR) | oqq | pqq
I cannot work out the regular expression (or gsub) that will remove these entries and would like to request if anyone could help me please.
If you want to use gsub, you can remove the pattern in two stages:
gsub(" \\| $", "", gsub("\\w+\\(VR\\)( \\| )?", "", s))
# firstly remove all words ending with (VR) and optional | following the pattern and
# then remove the possible | at the end of the string
# [1] "ddee(NR) | def(NR) | oqq | pqq"
regular expression \\w+\\(VR\\) will match words ending with (VR), parentheses are escaped by \\;
( \\| )? matches optional delimiter |, this makes sure it will match the pattern both in the middle and at the end of the string;
possible | left out at the end of the string can be removed by a second gsub;
Here is a method using strsplit and paste with the collapse argument:
paste(sapply(strsplit(temp, split=" +\\| +"),
function(i) { i[setdiff(seq_along(i), grep("\\(VR\\)$", i))] }),
collapse=" | ")
[1] "ddee(NR) | def(NR) | oqq | pqq"
We split on the pipe and spaces, then feed the resulting list to sapply which uses the grep function to drop any elements of the vector that end with "(VR)". Finally, the result is pasted together.
I added a subsetting method with setdiff so that vectors without any "(VR)" will return without any modification.