I need help figuring out why my regex does not match with what I am looking for - r

I am working on an R script that checks whether a data.frame is built correctly and contains the right information in the right place.
I need to make sure a row contains the right information, so I want to compare each cell of that row against a regular expression.
I thought it might be failing because I compared the regex to a value pulled directly from the table, but avoiding that did not help either.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I wrapped the value in as.character(), but the result was still FALSE.
To sum up, the regex works on regex101.com but never in my R script.
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE.

The == operator tests exact, whole-string equality; it does not treat either side as a regular expression. For pattern matching, use grepl:
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
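Here, we match either 'a' or 'b' at the start (^) of the string, followed by two digits in the range 0 to 8. To require the string to end there as well, add $ at the end of the pattern. Note that the leading and trailing . in the original pattern each require one extra character, which would explain why it only matched on regex101.com when the test string was wrapped in quotes.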
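A minimal sketch of the difference, reusing the test vector from the question. The extra inputs ("b49" and the quoted string) and the variable names are made up for illustration:

```r
test <- c("b40", "b40")

# == compares the two strings literally, so a pattern never equals "b40"
literal <- "^[ab][0-8]{2}" == test[1]

# grepl() treats its first argument as a regular expression
anchored <- grepl("^[ab][0-8]{2}$", c(test, "b49"))

# Each . in the original pattern must consume one character
dotted_plain  <- grepl(".[ab][0-8]{2}.", "b40")
dotted_quoted <- grepl(".[ab][0-8]{2}.", "\"b40\"")

print(literal)        # FALSE
print(anchored)       # TRUE TRUE FALSE
print(dotted_plain)   # FALSE
print(dotted_quoted)  # TRUE
```

Because grepl() is vectorised, the same call can be applied to a whole row or column of the data.frame at once.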

Related

Deleting the last TWO characters if they fit certain requirements in R

I want to use a function to go through strings and replace the last two characters only if they match certain criteria. How can I do that?
I tried
cleantable3 = gsub('.{2}$', '', cleantable2)
but that always deletes the last two characters. Let's say I only want them replaced when they contain " D| E".
Thank you all!
Your regex is the problem: '.{2}$' matches any (.) two characters at the end of the string ($), so it always removes them.
You haven't defined the question very well, but I think the regex you need is '(D.$)|(E.$)|(.D$)|(.E$)'. So this is the desired code:
cleantable3 = gsub('(D.$)|(E.$)|(.D$)|(.E$)', '', cleantable2)
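To see what that pattern does in practice, here is a small sketch on made-up data (the contents of cleantable2 below are hypothetical). It also shows a narrower pattern, " [DE]$", which is only an assumption about what the question meant by " D| E", namely a trailing space followed by D or E:

```r
# Hypothetical sample data; the real cleantable2 is not shown in the question
cleantable2 <- c("Item A D", "Item B E", "Item C X", "CODE")

# Pattern from the answer: drop the last two characters if either is D or E
broad <- gsub("(D.$)|(E.$)|(.D$)|(.E$)", "", cleantable2)

# Narrower assumption: drop only a trailing " D" or " E"
narrow <- sub(" [DE]$", "", cleantable2)

print(broad)   # "Item A" "Item B" "Item C X" "CO"
print(narrow)  # "Item A" "Item B" "Item C X" "CODE"
```

The two differ on strings such as "CODE", where the last two characters contain D or E but are not preceded by a space.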

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by @joran in the comments, the pattern should look like so:
patterns<-c("dog","cat","\\(fish\\)")
The doubled backslashes (\\) tell R to read the parentheses literally when searching for the pattern.
Easiest way to achieve this if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, what happens here is that it looks for any one character within the square brackets. The round brackets around that tell it to save whatever it matches. Then, the first four backslashes in the second argument insert a single literal backslash before the match (R's string parser turns four into two, and the replacement syntax turns those two into one; print() then displays it as \\), and the \\1 tells it to add whatever it saved from the first argument, i.e. either ( or ).
Another option is to forget regex and use grepl with fixed = TRUE:
rowSums(sapply(patterns, grepl, data, fixed = T)) > 0
# [1] TRUE FALSE TRUE FALSE
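Both approaches can be checked side by side with the vectors from the question. This is only a sketch; the names escaped, by_regex and by_fixed are introduced here for illustration:

```r
patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Option 1: escape every ( and ) so the regex engine treats them literally
escaped <- gsub("([()])", "\\\\\\1", patterns)
by_regex <- grepl(paste(escaped, collapse = "|"), data)

# Option 2: skip regex and test each keyword as a fixed substring
by_fixed <- rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0

print(escaped)   # "dog" "cat" "\\(fish\\)"
print(by_regex)  # TRUE FALSE TRUE FALSE
print(by_fixed)  # TRUE FALSE TRUE FALSE
```

The escaping step only handles parentheses; keywords containing other metacharacters such as . or + would need those escaped too, which is one reason the fixed = TRUE route can be simpler.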

Pattern lookup within a string in R using regular expression matching

I am trying to pick patterns within a specific string and their respective location. I have explained below with an example:
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
pattern = "Web_797-*Web_797" (Web_797 followed by Web_797 with anything in between)
I used the following function:
str_locate_all(String,pattern)[[1]]
I am getting the following result:
     start end
[1,]     1  15
[2,]    17  31
which is partially what I need. However, the pattern does not pick up the following combination (highlighted in black).
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
I would appreciate if anyone could help with this. I believe there is something wrong with the way I am defining the pattern but not able to fix it.
The problem with your pattern
pattern = "Web_797-*Web_797"
is the -* part, which means zero or more dashes (-). I believe what you wanted was a dash followed by any characters. So a first (incorrect) attempt would be
pattern = "Web_797-.*Web_797"
where the . means "any character". But that is not quite right. You only want to collect characters until the next time you see Web_797, not all the way until the last time you see Web_797. By default, matching is "greedy" and takes the biggest possible match. If we use
pattern = "Web_797-.*?Web_797"
the ? turns off greedy matching so that it only matches up to the next Web_797.
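The greedy and lazy versions can be compared directly. The sketch below uses base R (gregexpr and regmatches with perl = TRUE) instead of stringr's str_locate_all so that it runs without extra packages; find_all is a helper name made up for this example:

```r
String <- paste0(
  "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-",
  "Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
)

# Hypothetical helper: return every non-overlapping match of pattern
find_all <- function(pattern) {
  regmatches(String, gregexpr(pattern, String, perl = TRUE))[[1]]
}

greedy <- find_all("Web_797-.*Web_797")   # runs to the last Web_797
lazy   <- find_all("Web_797-.*?Web_797")  # stops at the next Web_797

print(length(greedy))  # 1
print(lazy)
# "Web_797-Web_797" "Web_797-Web_797" "Web_797-PCP_IM_PAR-Prior_OP-Web_797"
```

Matches found this way never overlap: once a Web_797 has been consumed as the end of one match, it cannot also start the next one.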

ASPX attribute regex parsing in c#

I need to find attribute values in an ASPX file using regular expressions.
That means you don't need to worry about malformed HTML or any HTML related issues.
I need to find the value of a particular attribute (LocText). I want to get what's inside the quotes.
Any ASPX tags such as <%=, <%#, <%$ etc. inside the value don't make sense for this attribute therefore are considered as part of it.
The regex I began with looks like this:
LocText="([^"]+)"
This works great: the first group, which is the result text, captures everything except double quotes, which are not allowed there (&quot; must be used instead).
But the ASPX file also allows single quotes, so a second regular expression must be applied:
LocText='([^']+)'
I could use these two regular expressions but I'm looking for a way to connect them.
LocText=("([^"]+)"|'([^']+)')
This also works, but it doesn't seem very efficient because it creates an unnecessary number of groups. I think this could somehow be done with backreferences, but I can't get it to work.
LocText=(["']{1})([^\1]+)\1
I thought that by this, I save the single/double quote to the first group and then I tell it to read anything that is NOT the char found in the first group. This is enclosed again by the quote from the first group. Obviously, I'm wrong and it's not working like that.
Is there any way to combine the first two expressions while creating a minimum number of groups, with one group being the value of the attribute I want to get? Is it possible to use a backreference for the single/double quote, or have I completely misunderstood what backreferences mean?
I'd say your solution with alternation isn't that bad, but you could use named captures so the result will always be found in the same group's value:
Regex regexObj = new Regex(@"LocText=(?:""(?<attr>[^""]+)""|'(?<attr>[^']+)')");
resultString = regexObj.Match(subjectString).Groups["attr"].Value;
Explanation:
LocText=                 # Match LocText=
(?:                      # Either match
  "(?<attr>[^"]+)"       # "...", capture in named group <attr>
|                        # or match
  '(?<attr>[^']+)'       # '...', also capture in named group <attr>
)                        # End of alternation
Another option would be to use lookahead assertions ([^\1] isn't working because you can't place backreferences inside a character class, but you can use them in lookarounds):
Regex regexObj = new Regex(@"LocText=([""'])((?:(?!\1).)*)\1");
resultString = regexObj.Match(subjectString).Groups[2].Value;
Explanation:
LocText=       # Match LocText=
(["'])         # Match and capture (group 1) " or '
(              # Match and capture (group 2)...
  (?:          # Try to match...
    (?!\1)     # (unless it's the quote character we matched before)
    .          # any character
  )*           # repeat any number of times
)              # End of capturing group 2
\1             # Match the previous quote character
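The lookahead pattern itself is not specific to .NET. As a rough check, here it is run through R's PCRE engine (perl = TRUE), assuming PCRE treats these constructs the same way; only the string escaping differs from the C# version. The sample tags and variable names are invented for illustration:

```r
# Same regex as above, written as an R string literal
pattern <- "LocText=([\"'])((?:(?!\\1).)*)\\1"

# Hypothetical inputs: each value contains the *other* kind of quote
tags <- c(
  "<a LocText=\"It's here\" />",
  "<a LocText='say \"hi\"' />"
)

# regexec() returns the full match followed by each capture group
groups <- regmatches(tags, regexec(pattern, tags, perl = TRUE))

# Element 1 is the whole match, 2 is the quote, 3 is the attribute value
values <- vapply(groups, function(g) g[3], character(1))

print(values)  # "It's here" "say \"hi\""
```

Because group 1 remembers which quote opened the value, the other quote character is allowed inside it, and the value always ends up in group 2.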

ASP.NET regular expression to restrict consecutive characters

Using ASP.NET syntax for the RegularExpressionValidator control, how do you specify restriction of two consecutive characters, say character 'x'?
You can provide a regex like the following:
(\\w)\\1+
(\\w) will match any word character, and \\1+ will match whatever character was matched with (\\w).
I do not have access to asp.net at the moment, but take this console app as an example:
var regex = new Regex("(\\w)\\1+"); // the pattern shown above
Console.WriteLine(regex.IsMatch("hello") ? "Not valid" : "Valid"); // "hello" contains two consecutive l's, hence not valid
Console.WriteLine(regex.IsMatch("Bar") ? "Not valid" : "Valid"); // "Bar" does not contain any consecutive repeated characters, so it's valid
Alexn is right, this is the way you match consecutive characters with a regex, i.e. (a)\1 matches aa.
However, I think this is a case of everything looking like a nail when you're holding a hammer. I would not use regex to validate this input. Rather, I suggest validating this in code (just looping through the string, comparing str[i] and str[i-1], checking for this condition).
This should work:
^((?<char>\w)(?!\k<char>))*$
It matches abc, but not abbc.
The key is to use a so-called "zero-width negative lookahead assertion" (syntax: (?! subexpression)).
Here we make sure that a character matched by (?<char>\w) is not followed by itself (expressed with (?!\k<char>)).
Note that \w can be replaced with any valid set of characters (\w does not match whitespace characters).
You can also do it without named group (note that the referenced group has number 2):
^((\w)(?!\2))*$
And it's important to start with ^ and end with $ to match the whole text.
If you want to only exclude text with consecutive x characters, you may use this
^((?<char>x)(?!\k<char>)|[^x\W])*$
or without backreferences
^(x(?!x)|[^x\W])*$
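As a quick sanity check of the numbered-group and x-only variants, the sketch below runs them through R's PCRE engine (perl = TRUE) instead of .NET, on the assumption that both engines agree on these constructs. The test inputs and variable names are made up:

```r
inputs <- c("abc", "abbc", "axbx", "axxb", "aabb")

# No character may be immediately followed by itself
no_repeats <- grepl("^((\\w)(?!\\2))*$", inputs, perl = TRUE)

# Only a doubled x is rejected; other repeats are allowed
no_double_x <- grepl("^(x(?!x)|[^x\\W])*$", inputs, perl = TRUE)

print(no_repeats)   # TRUE FALSE TRUE FALSE FALSE
print(no_double_x)  # TRUE TRUE TRUE FALSE TRUE
```

"aabb" separates the two patterns: it fails the general check but passes the x-only one.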
All syntax elements for .NET Framework Regular Expressions are explained here.
You can, of course, use a regex to detect what's wrong as well as what's right. The regex (.)\1 matches any two consecutive identical characters, so you can simply reject any input that it matches. If this is the only validation you need, I think this is far easier than trying to come up with a regex that validates correct input instead.
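The reject-on-match idea looks like this in a short sketch, again using R with perl = TRUE as a stand-in for the .NET engine; the inputs are invented for the example:

```r
inputs <- c("abc", "abbc", "hello", "Bar")

# TRUE where some character is immediately repeated
has_repeat <- grepl("(.)\\1", inputs, perl = TRUE)

# Input is valid exactly when the "bad" pattern does not match
valid <- !has_repeat

print(valid)  # TRUE FALSE FALSE TRUE
```

No anchors or lookaheads are needed here, because a single match anywhere in the string is enough to reject it.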

Resources