Appending cells in R based on lookahead regex matching an entire string

I’m still new to R and regexes, but I’m trying to achieve the following; suppose I have a data table of the following sort:
Title | URL
stackoverflow.com | https://stackoverflow.com
google.com | http://
youtube.com | https://youtube.com
overclock.net | https://
I want to append the cells in column URL with their corresponding value in column Title, in case URL consists only of either http:// or https://, so the desired output would look as follows:
Title | URL
stackoverflow.com | https://stackoverflow.com
google.com | http://google.com
youtube.com | https://youtube.com
overclock.net | https://overclock.net
To do so, I tried using the sub function in conjunction with a lookahead regex as follows:
dt$URL <- sub("(?:^|\\W)https?://(?:$|\\W)", "\\1", dt$Title, perl = TRUE)
or
dt$URL <- sub("\\s(https?://)", "\\1", dt$Title, perl = TRUE)
or
dt$URL <- sub("\\b(https?://\\b)", "\\1", dt$Title, perl = TRUE)
But none of the above produces the desired output. The issue is that it doesn’t append/replace anything at all, possibly because the regex doesn’t match anything, or it also matches if there is more data than just http:// or https:// present, i.e. it will also match on a full domain name (which I do not want). How should I adjust my code so that it produces the desired output, given the example input above?
Thank you!

url.col <- c("https://stackoverflow.com",
             "http://",
             "https://youtube.com",
             "https://")
title.col <- c("stackoverflow.com",
               "google.com",
               "youtube.com",
               "overclock.net")
# ^https?://$ only matches when the whole string is a bare http:// or https://
ifelse(grepl("^https?://$", url.col), # if the pattern matches the url.col element:
       paste0(url.col, title.col),    # join the contents of both columns and return that
       url.col)                       # otherwise return the url.col element as is
[1] "https://stackoverflow.com"
[2] "http://google.com"
[3] "https://youtube.com"
[4] "https://overclock.net"
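For what it's worth, the attempts in the question all run sub on dt$Title, which never contains http:// or https://, so none of the patterns can match; the first pattern also has no capturing group for \\1 to refer to. Below is a minimal sketch of the same idea applied directly to a table. The data.frame named dt is my reconstruction of the example input, not the asker's actual object, and the anchored pattern ^https?://$ is my assumption about what "consists only of http:// or https://" means.

```r
# Hypothetical reconstruction of the question's table as a plain data.frame
dt <- data.frame(
  Title = c("stackoverflow.com", "google.com", "youtube.com", "overclock.net"),
  URL   = c("https://stackoverflow.com", "http://", "https://youtube.com", "https://"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose URL is nothing but the scheme
bare <- grepl("^https?://$", dt$URL)

# Append the title only on those rows; all other rows are left untouched
dt$URL[bare] <- paste0(dt$URL[bare], dt$Title[bare])

print(dt)
```

Subsetting with the logical vector avoids building a full replacement column, but the ifelse version is equivalent here.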

Related

Regex to replace matches but also ignore when matches within quotes

I am trying to replace "and" with "&" and "or" with "|" when they occur outside of quotes, while leaving any occurrences inside quotes untouched.
Quotes could be single(') or double(").
The string is as follows:
Industry ='Education' or Industry =\"Energy\" or Industry = 'Financial or Bank' or Industry = 'Hospitality' or Industry = \"Food and Beverage\" and Industry = 'Utilities'
Expected output:
Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' | Industry = 'Hospitality' | Industry = \"Food and Beverage\" & Industry = 'Utilities'
I know that we might have to use lookarounds, but I can't figure out how. I am using the stringr package in R for all my regex manipulations.
Let me know if you need more info.
Consider an approach that matches double- and single-quoted substrings in order to skip them, and only processes "and" or "or" in all other contexts. The easiest way is to use gsubfn, to which you can pass a PCRE regex that does exactly that:
> library(gsubfn)
> x <- "Industry ='Education' or Industry =\"Energy\" or Industry = 'Financial or Bank' or Industry = 'Hospitality' or Industry = \"Food and Beverage\" and Industry = 'Utilities'"
> pat = "(?:\"[^\"]*\"|'[^']*')(*SKIP)(*F)|\\b(and|or)\\b"
> gsubfn(pat, ~ ifelse(z=="or","|", "&"), x, backref=0, perl=TRUE)
[1] "Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' | Industry = 'Hospitality' | Industry = \"Food and Beverage\" & Industry = 'Utilities'"
The (?:\"[^\"]*\"|'[^']*')(*SKIP)(*F)|\\b(and|or)\\b pattern will match:
(?: - an alternation group:
\"[^\"]*\" - a double quoted substring having no double quotes inside
| - or
'[^']*' - a single quoted substring
) - end of the group
(*SKIP)(*F) - discard the match, proceed looking for the next match
| - or
\\b(and|or)\\b - Group 1: either "and" or "or" as a whole word.
Depending on how the literal " and ' are escaped inside "..." and '...' you will need to adjust the (?:\"[^\"]*\"|'[^']*') part of the regex.
The ~ ifelse(z=="or","|", "&") part is a callback function that receives a single argument (named z inside the function) holding the match value from the regex, i.e. either "or" or "and". If the match value equals "or", the match is replaced with |; otherwise it is replaced with &.
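If you would rather not depend on gsubfn, the same (*SKIP)(*F) idea can be sketched with two passes of base R's gsub and perl = TRUE, one per operator. This is only a sketch under the same assumption as above (quotes are balanced and contain no escaped quotes); the names skip_quoted and replace_ops are made up for illustration.

```r
x <- "Industry ='Education' or Industry =\"Energy\" or Industry = 'Financial or Bank' or Industry = 'Hospitality' or Industry = \"Food and Beverage\" and Industry = 'Utilities'"

# Quoted substrings are matched first and then discarded with (*SKIP)(*F),
# so the alternative after the final | is only tried outside quotes.
skip_quoted <- "(?:\"[^\"]*\"|'[^']*')(*SKIP)(*F)|"

# Hypothetical helper: one gsub pass per operator
replace_ops <- function(s) {
  s <- gsub(paste0(skip_quoted, "\\bor\\b"),  "|", s, perl = TRUE)
  s <- gsub(paste0(skip_quoted, "\\band\\b"), "&", s, perl = TRUE)
  s
}

out <- replace_ops(x)
cat(out, "\n")
```

Two passes keep each replacement a plain string, at the cost of scanning the input twice.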
This is an ugly way to do it, but it works for your specific case:
For "or":
(?:'|")(?:.*?)(?:'|")(?:.*?)(or)(?:.*?)
For "and":
(?:'|")(?:.*?)(?:'|")(?:.*?)(and)(?:.*?)
I recommend using https://regex101.com/ to help build and test your regex.
Your question has potential problems, because nested content may not be handled well, or at all, by a single regex. That being said, if we assume that the "or" values you want to replace with pipes always occur right after a quoted string, then we can try the following:
gsub("([\"'])\\s*or", "\\1 |", input)
[1] "Industry ='Education' | Industry =\"Energy\" | Industry = 'Financial or Bank' | Industry = 'Hospitality' | Industry = \"Food and Beverage\" and Industry = 'Utilities'"
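By inspection, the "or" values occurring inside quoted strings are surrounded on both sides by ordinary words rather than preceded by a closing quote, so the pattern leaves them alone. Note that this call only handles "or"; the "and" in the output above would need the same kind of substitution with "&". Obviously, this may break down on other data or on more nested content.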

How to replace text sequences ending in a fixed pattern within a long text string in R?

I have a column within a data frame containing long text sequences (often in the thousands of characters) of the format:
abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)
i.e. a string, a suffix in brackets, then a delimiter
I'm trying to work out the syntax in R to delete the items that end in (VR), including the trailing pipe if present, so that I'm left with:
ddee(NR) | def(NR) | oqq | pqq
I cannot work out the regular expression (or gsub call) that will remove these entries, and would appreciate any help.
If you want to use gsub, you can remove the pattern in two stages:
gsub(" \\| $", "", gsub("\\w+\\(VR\\)( \\| )?", "", s))
# firstly remove all words ending with (VR) and optional | following the pattern and
# then remove the possible | at the end of the string
# [1] "ddee(NR) | def(NR) | oqq | pqq"
The regular expression \\w+\\(VR\\) matches words ending with (VR); the parentheses are escaped with \\.
( \\| )? matches the optional delimiter |, so the pattern matches both in the middle and at the end of the string.
Any | left over at the end of the string is removed by the second gsub.
Here is a method using strsplit and paste with the collapse argument:
paste(sapply(strsplit(temp, split=" +\\| +"),
function(i) { i[setdiff(seq_along(i), grep("\\(VR\\)$", i))] }),
collapse=" | ")
[1] "ddee(NR) | def(NR) | oqq | pqq"
We split on the pipe and spaces, then feed the resulting list to sapply which uses the grep function to drop any elements of the vector that end with "(VR)". Finally, the result is pasted together.
I added a subsetting method with setdiff so that vectors without any "(VR)" will return without any modification.
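Since the question mentions a whole column of such strings, here is a rough sketch of the split-filter-paste idea wrapped so that it handles one string per element. The function name drop_vr and the sample vector are made up for illustration, and it assumes there are no NA values in the column.

```r
# Hypothetical helper: drop the "(VR)" items from each string in a character vector
drop_vr <- function(x) {
  parts <- strsplit(x, split = " +\\| +")   # one vector of items per input string
  vapply(parts, function(p) {
    keep <- p[!grepl("\\(VR\\)$", p)]       # logical indexing is safe when nothing matches
    paste(keep, collapse = " | ")
  }, character(1))
}

temp <- c("abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)",
          "oqq | pqq")
res <- drop_vr(temp)
print(res)
```

Filtering with !grepl() avoids the need for setdiff, because an all-TRUE logical index simply keeps every item.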

Pyparsing - name not starting with a character

I am trying to use Pyparsing to identify a keyword that does not begin with $. For the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following
from pyparsing import Word, printables
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var. How can I specify all printable characters other than $ in the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
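from pyparsing import Word, printables
# any printable except '$' may start the word; any printable may follow
NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)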
This should be a bit more efficient than building up with a Combine and a NotAny. But this will match almost anything, integers, words, valid identifiers, invalid identifiers, so I'm skeptical of the value of this kind of expression in your parser.

grep on two strings

I'm working to grab two different elements in a string.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is,
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
any help would be appreciated.
Use normal parentheses ( ) to group the alternatives, not square brackets [ ], which define a character class:
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
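To see why the bracketed pattern from the question returns nothing, it may help to run the pieces side by side. This is just an illustrative sketch on the question's own vector (kept under the name str as in the question, even though that shadows the base function of the same name).

```r
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')

# [_abc] is a character class: exactly ONE character out of "_", "a", "b", "c".
# The original pattern also demands a literal ":" which never occurs.
none <- grep('[_abc]:[_zxy]', str, value = TRUE)

# On its own, the class matches any string containing one of those characters
loose <- grep('[_abc]', str, value = TRUE)

# A group with alternation matches "_" followed by the whole word abc or zxy
wanted <- grep('_(abc|zxy)', str, value = TRUE)

print(none)    # character(0)
print(loose)   # "a_abc" "b_abc" "abc" "z_zxy" "x_zxy"
print(wanted)  # "a_abc" "b_abc" "z_zxy" "x_zxy"
```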
This should work: grep('_abc|_zxy', str, value = TRUE)
X|Y matches when either X matches or Y matches.
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?

Replacement and non-matches with 'sub'

Months ago I ended up with a sub statement that originally worked with my input data. It has since stopped working causing me to re-examine my ugly process. I hate to share it but it accomplished several things at once:
active$id[grep("CIR",active$description)] <- sub(".*CIR0*(\\d+).*","\\1",active$description[grep("CIR",active$description)],perl=TRUE)
This statement created a new id column by finding rows that had an id embedded in the description column. The sub statement would find the number following a "CIR0" and populate the id column iff there was an id within a row's description. I recognize it is inefficient with the embedded grep subsetting either side of the assignment.
Is there a way to have a 'sub' replacement be NA or empty if the pattern does not match? I feel like I'm missing something very simple but ask for the community's assistance. Thank you.
Example with the results of creating an id column:
| name | id | description |
|------+-----+-------------------|
| a | 343 | Here is CIR00343 |
| b | | Didn't have it |
| c | 123 | What is CIR0123 |
| d | | CIR lacks a digit |
| e | 452 | CIR452 is next |
I was struggling with the same issue a few weeks ago. I ended up using the str_match function from the stringr package. It returns NA if the target string is not found. Just make sure you subset the result correctly. An example:
library(stringr)
str <- "Little_Red_Riding_Hood"
sub(".*(Little).*", "\\1", str)       # Returns 'Little'
sub(".*(Big).*", "\\1", str)          # Returns 'Little_Red_Riding_Hood'
str_match(str, ".*(Little).*")[1, 2]  # Returns 'Little'
str_match(str, ".*(Big).*")[1, 2]     # Returns NA
I think in this case you could try using ifelse(), i.e.,
active$id[grep("CIR",active$description)] <- ifelse(match, replacement, "")
where match should evaluate to true if there's a match, and replacement is what that element would be replaced with in that case. Likewise, if match evaluates to false, that element's replaced with an empty string (or NA if you prefer).
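A possible concrete version of that ifelse idea, using only base R, is sketched below. The data.frame is my reconstruction of the example table from the question, and the names pattern and has_id are placeholders; it fills the whole id column in one go instead of subsetting with grep on both sides of the assignment.

```r
# Hypothetical reconstruction of the example data from the question
active <- data.frame(
  name = c("a", "b", "c", "d", "e"),
  description = c("Here is CIR00343", "Didn't have it", "What is CIR0123",
                  "CIR lacks a digit", "CIR452 is next"),
  stringsAsFactors = FALSE
)

pattern <- ".*CIR0*(\\d+).*"

# TRUE only where the description really contains CIR followed by digits
has_id <- grepl(pattern, active$description, perl = TRUE)

# Extract the digits where the pattern matched, NA everywhere else
active$id <- ifelse(has_id,
                    sub(pattern, "\\1", active$description, perl = TRUE),
                    NA_character_)

print(active[, c("name", "id", "description")])
```

Using NA_character_ keeps the column a character vector; swap in "" if an empty string is preferred.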
