Is dash a special character in R regex?

Despite reading the R regex help page, which says
"Finally, to include a literal -, place it first or last (or, for perl = TRUE only, precede it by a backslash)."
I can't understand the difference between
grepl(pattern=paste("^thing1\\-",sep=""),x="thing1-thing2")
and
grepl(pattern=paste("^thing1-",sep=""),x="thing1-thing2")
Both return TRUE. Should I escape or not here? What is the best practice?

The hyphen is mostly a normal character in regular expressions.
You do not need to escape the hyphen outside of a character class; it has no special meaning there.
Within a character class [ ] a hyphen is literal when it is the first or last character. Anywhere else it denotes a range, so to add a literal hyphen there you need to escape it, which the help page quoted in the question only guarantees with perl = TRUE.
Examples:
grepl('^thing1-', x='thing1-thing2')
[1] TRUE
grepl('[-a-z]+', 'foo-bar')
[1] TRUE
grepl('[a-z-]+', 'foo-bar')
[1] TRUE
grepl('[a-z\\-\\d]+', 'foo-bar', perl = TRUE)
[1] TRUE
Note: It is more common to find a hyphen placed first or last within a character class.
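A minimal sketch of the placement rules above, using only base R. The test strings are made up for illustration, and the escaped form is shown with perl = TRUE because that is the only case the help page quoted in the question guarantees.

```r
# Hyphen first or last in the class: treated as a literal hyphen.
first_pos <- grepl("[-ac]", "-")   # TRUE
last_pos  <- grepl("[ac-]", "-")   # TRUE

# Hyphen between two characters: a range from "a" to "c".
range_hit  <- grepl("[a-c]", "b")  # TRUE, "b" lies inside the range
range_miss <- grepl("[a-c]", "-")  # FALSE, "-" is not part of the range

# Escaped hyphen in the middle of the class, using the PCRE engine.
esc_hit  <- grepl("[a\\-c]", "-", perl = TRUE)  # TRUE, literal hyphen
esc_miss <- grepl("[a\\-c]", "b", perl = TRUE)  # FALSE, no range is formed
```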

To see what it means for - to have a special meaning inside of a character class (and how putting it last gives it its literal meaning), try the following:
grepl("[w-y]", "x")
# [1] TRUE
grepl("[w-y]", "-")
# [1] FALSE
grepl("[wy-]", "-")
# [1] TRUE
grepl("[wy-]", "x")
# [1] FALSE

They are both matching the exact same text in these instances. I.e.:
x <- "thing1-thing2"
regmatches(x,regexpr("^thing1\\-",x))
#[1] "thing1-"
regmatches(x,regexpr("^thing1-",x))
#[1] "thing1-"
A - is a special character in certain situations though: it specifies a range of values, such as the characters between a and z, when used inside [], e.g.:
regmatches(x,regexpr("[a-z]+",x))
#[1] "thing"

Related

RegEx R: match strings with same character exact number of times anywhere in string

I guess this is straightforward but I can't manage to get my RegEx right and haven't found an exact example yet...
How can I match only strings that contain a specific character an exact number of times (not necessarily consecutively)?
Let's look at a data set
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
I want to select all strings that have exactly two underscores ('_')
I can try:
stringr::str_detect(terms, "_{2}")
no luck
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or
terms[stringr::str_detect(terms, "._.{2,}")]
gives
[1] "breeding_season" "foraging_time" "seabird_ecology"
[4] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
but I want only
[1] "annual_reproductive_success" "sea_surface_temperature" "mean_chick_weight"
Thank you RegEx masters
What you're missing is the negated character class. You want to match something that is not an underscore and then an underscore. It's generally like [^X].
/^[^_]*_[^_]*_[^_]*$/
# or
/^(?:[^_]*_){2}[^_]*$/
That is:
beginning of string
anything not an underscore
underscore
anything not an underscore
underscore
anything not an underscore
end of string
This is just one way to do it.
HTH
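The two patterns above are written in generic /.../ notation. As a sketch of how they might be passed to grepl in R, using the terms vector from the question (the perl = TRUE on the grouped form is a cautious choice for the (?: ) syntax, not something the answer specifies):

```r
terms <- c("breeding", "foraging", "prey",
           "breeding_season", "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# Spelled-out form: non-underscores, "_", non-underscores, "_", non-underscores.
explicit <- grepl("^[^_]*_[^_]*_[^_]*$", terms)

# Grouped form: "non-underscores followed by _" exactly twice, then the rest.
grouped <- grepl("^(?:[^_]*_){2}[^_]*$", terms, perl = TRUE)

terms[explicit]
```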
Here's why your efforts were failing:
"_{2}" # matches two underscores beside each other
"._.{2,}" # matches any char, then "_", then two or more of any char
Simplest (but not quite what you asked for) with grepl or str_detect would be:
grepl("_.*_", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The "*" allows an arbitrary number of characters, so use a negated character class before, after and in the middle to get exactly 2 underscores separated by non-underscores:
grepl("^[^_]*_[^_]*_[^_]*$", terms)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Added the "^" and the "$" to anchor the pattern at the beginning and end of the string. The "^" operator has a different meaning inside character-class brackets (negation) than outside them (start of string).
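To see where "at least two" and "exactly two" diverge, here is a small check on three made-up strings (not from the question). It assumes the anchored pattern carries a * after each negated class.

```r
x <- c("a_b", "a_b_c", "a_b_c_d")

# Unanchored: any string with two or more underscores matches.
at_least_two <- grepl("_.*_", x)                 # FALSE TRUE TRUE

# Anchored with negated classes: only strings with exactly two underscores.
exactly_two <- grepl("^[^_]*_[^_]*_[^_]*$", x)   # FALSE TRUE FALSE
```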
Solution using the stringr package. It is a little more straightforward because it avoids a complex regular expression. Here we use the str_count function to count the number of matches in each string.
terms<-c('breeding'
,'foraging'
,'prey'
,'breeding_season'
,'foraging_time'
,'seabird_ecology'
,'annual_reproductive_success'
,'sea_surface_temperature'
,'mean_chick_weight')
library(stringr)
terms[str_count(terms, "_") == 2]
#> [1] "annual_reproductive_success" "sea_surface_temperature"
#> [3] "mean_chick_weight"
Created on 2021-03-08 by the reprex package (v0.3.0)
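If stringr is not available, the same counting idea can probably be expressed in base R. This is a sketch of an alternative, not part of the answer above: gregexpr finds every "_" and lengths counts the matches per string.

```r
terms <- c("breeding", "foraging", "prey",
           "breeding_season", "foraging_time", "seabird_ecology",
           "annual_reproductive_success", "sea_surface_temperature",
           "mean_chick_weight")

# One list element per string, holding every literal "_" found in it.
hits <- regmatches(terms, gregexpr("_", terms, fixed = TRUE))

# Number of underscores in each string (0 when there is no match).
counts <- lengths(hits)

two_underscores <- terms[counts == 2]
```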
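Using grep with value = TRUE (note that \w also matches an underscore, so this pattern accepts strings with more than two underscores as well; it gives the wanted result for this data):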
grep('^\\w+_\\w+_\\w+$', terms, value = TRUE)
#[1] "annual_reproductive_success" "sea_surface_temperature"
#[3] "mean_chick_weight"
In stringr you can use str_subset with the same regex:
stringr::str_subset(terms, '^\\w+_\\w+_\\w+$')
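A quick check of how \w treats underscores, on two made-up strings. Because \w is a synonym for [[:alnum:]_], the \w-based pattern cannot by itself cap the number of underscores; the negated-class variant shown for comparison can.

```r
x <- c("a_b_c", "a_b_c_d")

# \w+ can absorb underscores, so the four-part string also matches.
word_class <- grepl("^\\w+_\\w+_\\w+$", x)        # TRUE TRUE

# [^_]+ cannot cross an underscore, so only the three-part string matches.
negated_class <- grepl("^[^_]+_[^_]+_[^_]+$", x)  # TRUE FALSE
```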

Match elements from a character range n times

Assume I have a string like this:
id = "ce91ffbe-8218-e211-86da-000c29e211a0"
What regex can I write in R that will verify that this string is 36 characters long and only contains letters, numbers, and dashes?
There is nothing in the documentation on how to use a character range (e.g. [0-9A-z-]) with a quantifier (e.g. {36}). The following code is always returning TRUE regardless of the quantifier. I'm sure I'm missing something simple here...
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("[0-9A-z-]{36}", id)
#> [1] TRUE
grepl("[0-9A-z-]{34}", id)
#> [1] TRUE
This behavior only starts when I add the check for the numbers 0-9 in the character range.
Could you please try the following:
grepl("^[0-9a-zA-Z-]{36}$",id)
OR
grepl("^[[:alnum:]-]{36}$",id)
After running it we will get the following output.
grepl("^[0-9a-zA-Z-]{36}$",id)
[1] TRUE
Explanation of the pattern, piece by piece:
grepl(" ## grepl returns TRUE or FALSE depending on whether the regex matches.
^ ## ^ anchors the match at the start of the string.
[[:alnum:]-] ## Character class [[:alnum:]] plus a dash (-): matches a letter, a digit or a dash.
{36} ## Exactly 36 occurrences of the preceding class.
$", ## $ anchors the match at the end of the string, so together with ^ the whole value must match.
id) ## The value being tested.
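The effect of the anchors can be checked directly. Without ^ and $, the quantifier only needs to match somewhere inside the string, which is why {34} returned TRUE in the question. The last line is an alternative sketch that separates the length check from the character check.

```r
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"

# Unanchored: any 34-character run inside the 36-character id matches.
unanchored <- grepl("[0-9a-zA-Z-]{34}", id)     # TRUE

# Anchored: the whole string must consist of exactly that many characters.
anchored_34 <- grepl("^[0-9a-zA-Z-]{34}$", id)  # FALSE
anchored_36 <- grepl("^[0-9a-zA-Z-]{36}$", id)  # TRUE

# Alternative: test the length, then reject any disallowed character.
by_length <- nchar(id) == 36 && !grepl("[^0-9a-zA-Z-]", id)  # TRUE
```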
You want to use:
^[0-9a-z-]{36}$
^ Assert position start of line.
[0-9a-z-] Character set for numbers, letters a to z and dashes -.
{36} Match preceding pattern 36 times.
$ Assert position end of line.
If the string can have other characters before or after the target characters, try
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id)
#[1] FALSE
And this will still work.
id2 <- paste0(":+)!#", id)
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id2)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id2)
#[1] FALSE

Remove fields with special characters

I'm trying to remove all fields that have special characters (#?.* etc) in their text.
I think I should be using
Filter(function(x) {grepl('|[^[:punct:]]).*?', x)} == FALSE, data$V1)
where data$V1 contains my data. However, it seems like
grepl('|[^[:punct:]]).*?', x)
fails with trivial examples like
grepl('|[^[:punct:]]).*?', 'M')
which outputs TRUE even though M has no special characters. How should I be using grepl to remove fields with special characters from a column of data?
To search for "special characters", you can search for the negation of alphanumeric characters as such:
grepl('[^[:alnum:]_]+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
or use the symbol \W
grepl('\\W+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
\W is explained in the regular expression help:
"The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_])."
Starting a regular expression with a | makes it useless, since the empty alternative before the pipe matches anything.
See this JS example:
console.log('With the starting pipe => ' + /|([\W]).*?/.test('M'));
console.log('Without the starting pipe => ' + /([\W]).*?/.test('M'));
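The same comparison can be sketched in R. perl = TRUE is used here as an assumption, so that the empty alternative is handled by the PCRE engine, where it is known to match the empty string.

```r
# Leading pipe: the empty left-hand alternative matches at position 1.
with_pipe <- grepl("|\\W", "M", perl = TRUE)    # TRUE

# No pipe: "M" contains no non-word character, so nothing matches.
without_pipe <- grepl("\\W", "M", perl = TRUE)  # FALSE
```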
Simply put the characters you want to exclude inside [...], provide this as the pattern argument to grepl, then negate the result.
data$V1[!grepl("[#?.*]", data$V1)]
For example,
> x <- c("M", "3#3", "8.*x")
> x[!grepl("[#?.*]", x)]
[1] "M"
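If the list of special characters is open-ended, the [[:punct:]] class may be a convenient substitute for an explicit list. A sketch on a hypothetical data frame (the data and V1 values below are invented for illustration); note that [[:punct:]] also treats the underscore as punctuation.

```r
# Hypothetical stand-in for the asker's data.
data <- data.frame(V1 = c("M", "3#3", "8.*x", "ok_name"),
                   stringsAsFactors = FALSE)

# Explicit list: drops only fields containing #, ?, . or *.
listed <- data$V1[!grepl("[#?.*]", data$V1)]           # "M" "ok_name"

# POSIX class: drops fields containing any punctuation, "_" included.
any_punct <- data$V1[!grepl("[[:punct:]]", data$V1)]   # "M"
```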

How to remove the last character of a string if it is a punctuation?

It's supposed to be very simple.
For example:
a = "fafasdf..", b = "sdfs?>", c = "safwe"
The result I want would be
a = "fafasdf", b = "sdfs", c = "safwe"
How do I remove the last few characters if they are punctuation?
I tried sub("[:punct:]\Z", "", mystring), but it does not work...
You're almost there,
sub("[[:punct:]]+$", "", mystring)
You need to put [:punct:] inside a character class and make it repeat one or more times by adding + after it. Also replace \Z with $, since sub without perl=TRUE does not support \Z (which matches at the end of the string).
Example:
x <- c("fafasdf..", "sdfs?>", "safwe")
sub("[[:punct:]]+$", "", x)
# [1] "fafasdf" "sdfs" "safwe"
If you really want to use \\Z, then enable perl=TRUE param.
sub("[[:punct:]]+\\Z", "", x, perl=TRUE)
# [1] "fafasdf" "sdfs" "safwe"
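To illustrate why the + matters, here is what happens with and without it on the same vector. Without the quantifier, only the single punctuation character right before the end of the string is removed.

```r
x <- c("fafasdf..", "sdfs?>", "safwe")

# No quantifier: strips just the last punctuation character.
one_char <- sub("[[:punct:]]$", "", x)       # "fafasdf." "sdfs?" "safwe"

# With +: strips the whole trailing run of punctuation.
all_trailing <- sub("[[:punct:]]+$", "", x)  # "fafasdf" "sdfs" "safwe"
```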
POSIX character classes need to be wrapped inside a bracket expression; the correct syntax is [[:punct:]]. And because the pattern is anchored at the end of the string, you need a quantifier to match more than one trailing punctuation character.
As noted in the other answer, perl = TRUE needs to be set to use \Z.
For future reference: this anchor behaves differently depending on the engine being used. In R with perl = TRUE, it also allows a match before a final line break. It is fine to use here, but I would just stick to $ instead.
sub('[[:punct:]]+$', '', c('fafasdf..', 'sdfs?>', 'safwe'))
## [1] "fafasdf" "sdfs" "safwe"
Also take into account the 'locale', it could affect the behavior of the POSIX class. If this becomes an issue, you can read up on this previously answered question.
If you just want to remove trailing non-word characters, you could use:
sub('\\W+$', '', c('fafasdf..', 'sdfs?>', 'safwe'))
## [1] "fafasdf" "sdfs" "safwe"
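The two approaches differ on underscores: \W excludes "_" because it counts as a word character, while [[:punct:]] includes it. A small check on made-up strings:

```r
x <- c("name_", "name!?", "name")

# \W+ leaves a trailing underscore in place.
non_word <- sub("\\W+$", "", x)        # "name_" "name" "name"

# [[:punct:]]+ removes it along with other punctuation.
punct <- sub("[[:punct:]]+$", "", x)   # "name" "name" "name"
```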

grep can not match regular expression with a date whose type is character

pattern<-"[0-9][0-9]\\.[0-9][0-9]\\.[0-9][0-9][0-9][0-9])"
grepl(pattern,"10.06.2011")
[1] FALSE
I am trying to match them and want it to return TRUE. I also tried a pattern with dd\., but without success.
What should I do?
Remove the ) at the end of the pattern and it will work:
pattern<-"[0-9][0-9]\\.[0-9][0-9]\\.[0-9][0-9][0-9][0-9]"
grepl(pattern, "10.06.2011")
# [1] TRUE
By the way, the pattern can be simplified to
"(?:\\d{2}\\.){2}\\d{4}"
You do need to remove the closing parenthesis at the end of your pattern; you can also simplify it:
grepl("\\d{2}\\.\\d{2}\\.\\d{4}", "10.06.2011")
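If the goal is to validate whole strings rather than find a date somewhere inside them, anchoring the pattern may help. The sketch below uses invented test values and additionally runs as.Date with a dd.mm.yyyy format as one possible way to reject impossible dates; neither step is part of the answers above.

```r
dates <- c("10.06.2011", "110.06.20115", "10-06-2011", "31.02.2011")

# Unanchored: matches whenever the date shape appears anywhere in the string.
loose <- grepl("\\d{2}\\.\\d{2}\\.\\d{4}", dates)     # TRUE TRUE FALSE TRUE

# Anchored: the whole string must have the dd.mm.yyyy shape.
strict <- grepl("^\\d{2}\\.\\d{2}\\.\\d{4}$", dates)  # TRUE FALSE FALSE TRUE

# The regex checks shape only; as.Date returns NA for dates that do not exist.
parsed <- as.Date(dates, format = "%d.%m.%Y")
valid <- strict & !is.na(parsed)                      # TRUE FALSE FALSE FALSE
```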

Resources