grepping special characters in R

I have a variable named full.path, and I want to check whether the string it contains includes any of a set of special characters.
In the code below I use grepl for this. None of those characters is in the string, yet the result is TRUE.
Could someone explain why and help me fix it? Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!##$%^&*|~`{}]", full.path)

By plugging this regex into https://regexr.com/ I was able to spot the issue: a - between two characters in a character class creates a range. Here '-_ is the range from ' (ASCII 39) to _ (ASCII 95), which covers the digits, the uppercase letters, and punctuation such as /, so the / in /home/xyz produces a spurious match.
To avoid this behaviour, put - first in the character class, which signals that you want to match a literal - rather than a range:
> grepl("[-?.,;:'_+=()!##$%^&*|~`{}]", full.path)
[1] FALSE
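
A minimal sketch of why the original class matched, using only base R. The three short character classes below are simplified stand-ins chosen for illustration, not the asker's full pattern.

```r
full.path <- "/home/xyz"

# Characters with ASCII codes 39 (') to 95 (_): everything the range '-_ spans
in_range <- strsplit(rawToChar(as.raw(39:95)), "")[[1]]
slash_in_range <- "/" %in% in_range        # / is ASCII 47, inside the range

# Simplified classes (illustrative only)
as_range   <- grepl("['-_]", full.path)    # - between two characters: a range
dash_first <- grepl("[-'_]", full.path)    # - first: a literal hyphen
dash_last  <- grepl("['_-]", full.path)    # - last: also a literal hyphen

print(c(slash_in_range, as_range, dash_first, dash_last))
```

Only the version with - in the middle matches the path. Placing the hyphen last in the class is an equivalent alternative to placing it first.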

Related

Crazy unexpected behavior of grepl

What explains the following very unexpected behavior of grepl?
I am using grepl for basic string matching here, and I think the default behavior as illustrated below is dangerous.
> grepl('a','a')
[1] TRUE
> grepl('a ()','a ()')
[1] TRUE
> grepl('a (b)','a (b)')
[1] FALSE
Adding fixed=TRUE fixes it. The documentation says:
pattern: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.
The average user should get from the message above that the default usage of grepl is NOT string matching but regular expression matching, which is not super clear. Someone unaware of regular expressions may not realize the dangers of leaving fixed to its default value. I think a warning should be added about this.
Posting here mainly to alert the community about this behavior. It took me a couple of hours of debugging to narrow down the issue I was experiencing in my Shiny app to this function. I would have never thought that grepl could be dangerous like this.

pattern: a ()
Breakdown: an a, followed by a space, followed by an empty capture group, which matches nothing.
The a and the space match the start of the string, and the empty group matches at the position right after the space, so the whole pattern is found in the string. The result is TRUE.
Second part:
pattern: a (b)
Breakdown: this literally means a b, i.e. an a, then a space, then a b. The parentheses only capture the b; they are not matched as characters.
The string is a (b). The space is followed by ( rather than b, so the pattern cannot be found in the string, hence FALSE.
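
For illustration, here is a small sketch of the two usual ways to match the parentheses literally: fixed = TRUE, as the question mentions, or escaping them in the regular expression. The variable names are only for this example.

```r
x <- "a (b)"

as_regex <- grepl("a (b)", x)                # parentheses form a capture group
as_fixed <- grepl("a (b)", x, fixed = TRUE)  # pattern is taken literally
escaped  <- grepl("a \\(b\\)", x)            # parentheses escaped in the regex

print(c(as_regex, as_fixed, escaped))
```

The first call returns FALSE and the other two return TRUE. fixed = TRUE is the simpler choice when no regex features are needed at all.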

I need help figuring out why my regex does not match with what I am looking for

I am working on an R script that checks whether a data.frame is built correctly and contains the right information in the right place.
I need to make sure a row contains the right information, so I want to compare each cell of that row against a regular expression.
I thought maybe it did not work because I compared the regex to the value by calling the value directly from the table, but it did not work.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I added as.character() to the value, but it came out FALSE.
To sum up, the regex works on regex101.com, but it has never worked in my R script:
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE

The == operator tests whether two strings are exactly equal; it does not interpret either side as a regular expression. For pattern matching, use grepl instead:
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
Here, we match either 'a' or 'b' at the start (^) of the string, followed by two digits in the range 0 to 8. To require that nothing follows the two digits, add $ at the end of the pattern.
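
As a rough sketch of the difference, the snippet below compares == with grepl and adds the $ anchor. The extra strings "b409", "c40" and "a88" are made-up test values, not data from the question.

```r
test <- c("b40", "b40")
pattern <- "^[ab][0-8]{2}$"

# == compares the pattern text itself with the value, character by character
equal_result <- pattern == test[1]

# grepl treats the first argument as a regular expression
match_result <- grepl(pattern, test)

# Hypothetical extra values: too long, wrong first letter, valid
other_result <- grepl(pattern, c("b409", "c40", "a88"))

print(equal_result)
print(match_result)
print(other_result)
```

With both anchors in place, grepl only returns TRUE when the entire string has the expected shape, which is probably what a cell-by-cell validation needs.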

Issue with a column containing special characters

I have a data frame in R that contains a character column with values as follows:
"\"121.29\""
"\"288.1\""
"\"120\""
"\"V132.3\""
"\"800\""
I am trying to get rid of the extra " and \ and retain clean values as below
121.29
288.10
120.00
V132.30
800.00
I tried gsub("([\\])","", x) and also the str_replace_all function, with no luck so far. I would much appreciate it if anybody could help me resolve this issue. Thanks in advance.

Try
gsub('\\"',"",x)
[1] "121.29" "288.1" "120" "V132.3" "800"
Since the fourth entry is not numeric and an atomic vector can only contain entries of the same mode, all entries remain character strings in this case (the most flexible mode capable of storing the data). So each entry is still printed with quotes around it.
The backslashes shown in the original values are not part of the data: \" is simply how R prints a double quote inside a double-quoted string, so only the quotes need to be removed. The R string '\\"' reaches the regex engine as \", an escaped (literal) double quote. Moreover, as suggested by @rawr, writing the pattern in single quotes avoids having to escape the double quote for R itself.
An alternative would be to use double quotes and escape them, too:
gsub("\\\"","",x)
which yields the same result.
Hope this helps.
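
To make the point about the printed backslashes concrete, here is a minimal sketch. It assumes the column has been extracted into a plain character vector x, and it uses fixed = TRUE as a swapped-in alternative to the regex patterns in the answer.

```r
x <- c("\"121.29\"", "\"288.1\"", "\"120\"", "\"V132.3\"", "\"800\"")

# The first value has 8 characters: the six of 121.29 plus two double quotes.
# No backslash is stored; print() only adds it for display.
n_before <- nchar(x[1])

# Remove the literal double quotes; fixed = TRUE avoids any regex escaping
cleaned <- gsub('"', "", x, fixed = TRUE)

cat(x[1], "\n")   # cat() shows the raw characters, quotes included
print(cleaned)
```

The cleaned vector is still of type character, because V132.3 cannot be converted to a number.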

Need help with a regex

Hi, I'm trying to write a regular expression that will take a string and ensure it starts with an 'R' and is followed by 4 numeric digits, then anything
eg. RXXXX.................
Can anybody help me with this? This is for ASP.NET

You want it to be at the beginning of the line, not anywhere. Also, for efficiency, you don't want the .+ or .* at the end because that will match unnecessary characters. So the following regex is what you really want:
^R\d{4}

This should do it...
^R\d{4}.*$
\d{4} matches 4 digits
.* is simply a way to match any character 0 or more times
the beginning ^ and end $ anchors ensure that nothing precedes or follows
As Vincent suggested, for your specific task it could even be simplified to this...
^R\d{4}
Because as you stated, it doesn't really matter what follows.

Use /^R\d{4}.*/ and set the case-insensitive option, unless you only want to match a capital R.

^R\d{4}.*
The caret ^ matches the position before the first character in the string.
\d matches any decimal digit (in .NET this also includes non-ASCII Unicode digits; write [0-9] if only ASCII digits should match)
{4} indicates that there must be exactly 4 numbers, and
.* matches 0 or more other characters
To use:
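using System.Text.RegularExpressions; // namespace that provides Regex, Match and RegexOptions

string input = "R0012 etc..";
// The @ prefix makes this a verbatim string, so \d does not need a doubled backslash
Match match = Regex.Match(input, @"^R\d{4}.*", RegexOptions.IgnoreCase);
if (match.Success)
{
// Success!
}
Note the use of RegexOptions.IgnoreCase to ignore the case of the letter R (so it'll also match strings which start with r). Leave this out if you don't want a case-insensitive match.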

ASP.NET regular expression to restrict consecutive characters

Using ASP.NET syntax for the RegularExpressionValidator control, how do you specify restriction of two consecutive characters, say character 'x'?

You can provide a regex like the following:
(\\w)\\1+
(\\w) will match any word character, and \\1+ will match one or more repetitions of whatever character was matched by (\\w).
I do not have access to asp.net at the moment, but take this console app as an example:
var regex = new Regex("(\\w)\\1+"); // needs using System.Text.RegularExpressions;
Console.WriteLine(regex.IsMatch("hello") ? "Not valid" : "Valid"); // "hello" contains two consecutive l's, hence not valid
Console.WriteLine(regex.IsMatch("Bar") ? "Not valid" : "Valid"); // "Bar" does not contain any consecutive repeated characters, so it's valid

Alexn is right, this is the way you match consecutive characters with a regex, i.e. (a)\1 matches aa.
However, I think this is a case of everything looking like a nail when you're holding a hammer. I would not use regex to validate this input. Rather, I suggest validating this in code (just looping through the string, comparing str[i] and str[i-1], checking for this condition).

This should work:
^((?<char>\w)(?!\k<char>))*$
It matches abc, but not abbc.
The key is to use a so-called "zero-width negative lookahead assertion" (syntax: (?! subexpression)).
Here we make sure that a character matched by the group (?<char>\w) is not followed by itself (expressed with (?!\k<char>)).
Note that \w can be replaced with any valid set of characters (\w does not match whitespace characters).
You can also do it without named group (note that the referenced group has number 2):
^((\w)(?!\2))*$
And it's important to start with ^ and end with $ to match the whole text.
If you want to only exclude text with consecutive x characters, you may use this
^((?<char>x)(?!\k<char>)|[^x\W])*$
or without backreferences
^(x(?!x)|[^x\W])*$
All syntax elements for .NET Framework Regular Expressions are explained here.

You can use a regex to validate what's wrong as well as what's right of course. The regex (.)\1 will match any two consecutive characters, so you can just reject any input that gives an IsValid result to that. If this is the only validation you need, I think this way is far easier than trying to come up with a regex to validate correct input instead.
