Inverting a regex in R

I have this character vector:
[1] "19980213" "19980214" "19980215" "19980216" "19980217" "iffi" "geometry"
[8] "date_consid"
and I want to match all the elements that are not dates and not "date_consid". I tried
res = grep("(?!\\d{8})|(?!date_consid)", vec, value=T)
But I just can't make it work...

You can use
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(\\d{8}|date_consid)$", vec, value=TRUE, invert=TRUE)
## => [1] "iffi" "geometry"
See the R demo
The ^(\d{8}|date_consid)$ regex matches a string that only consists of any eight digits or that is equal to date_consid.
The value=TRUE argument makes grep return values rather than indices, and invert=TRUE inverts the regex match result (it returns the elements that do not match).
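As a rough illustration of what value and invert change, the sketch below runs the same pattern with different argument combinations. The name pattern is a helper introduced here, and the grepl line is an equivalent base R formulation added for comparison, not part of the answer above.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")
pattern <- "^(\\d{8}|date_consid)$"  # helper name, introduced for this sketch

idx_match  <- grep(pattern, vec)                 # indices of matching elements
idx_other  <- grep(pattern, vec, invert = TRUE)  # indices of non-matching elements
val_other  <- grep(pattern, vec, value = TRUE, invert = TRUE)
val_other2 <- vec[!grepl(pattern, vec)]          # same values via logical negation

print(idx_match)   # 1 2 3 4 5 8
print(idx_other)   # 6 7
print(val_other)   # "iffi" "geometry"
```

The grepl form is handy when the logical vector is needed for something else as well, such as subsetting a data frame.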

The pattern that you tried matches every element because the lookaheads are unanchored.
Joining two separate lookaheads with | still matches all strings, since at least one of the two assertions is always true.
You can change the logic to assert, from the start of the string, that what is directly to the right is neither 8 digits nor date_consid, in a single check.
To use a negative lookahead, you have to add perl=T, add an anchor ^ to assert the start of the string, and add an anchor $ after each alternative inside the lookahead to assert the end of the string.
^(?!\\d{8}$|date_consid$)
^ Start of string
(?! Negative lookahead
\\d{8}$ Match 8 digits until end of string
| Or
date_consid$ Match date_consid until end of string
) Close lookahead
For example
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(?!\\d{8}$|date_consid$)", vec, value=T, perl=T)
Output
[1] "iffi" "geometry"
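To see the difference between the two lookahead forms side by side, here is a minimal sketch. It assumes perl=TRUE for the original attempt as well (the question did not pass it); the names loose and strict are introduced here.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# Two unanchored lookaheads joined by |: at the start of every string
# at least one of them succeeds, so every element matches.
loose <- grep("(?!\\d{8})|(?!date_consid)", vec, value = TRUE, perl = TRUE)

# One anchored lookahead that rejects both alternatives in a single check.
strict <- grep("^(?!\\d{8}$|date_consid$)", vec, value = TRUE, perl = TRUE)

print(length(loose))  # 8
print(strict)         # "iffi" "geometry"
```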

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
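We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and remove that substring with str_remove. Note that this removes text only up to the first |, which is enough when the string contains a single pipe.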
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we also put [A-Z] in the character class, it matches every upper-case letter and replaces it with a blank (""), which is why the A went missing in the OP's str_replace_all call.
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
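The difference between removing up to the first pipe and removing up to the last pipe only shows on strings with several pipes. The sketch below uses base R sub for both patterns (an assumption made here so that no package is needed); after_first and after_last are names introduced for illustration.

```r
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")

# Anchored negated class: stops at the first pipe.
after_first <- sub("^[^|]+\\|", "", x)

# Greedy .* : runs to the last pipe.
after_last <- sub(".*\\|", "", x)

print(after_first)  # "Apodemia_mejicanus" "b|c|BLCU142-09|Apodemia_mejicanus"
print(after_last)   # "Apodemia_mejicanus" "Apodemia_mejicanus"
```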
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple (this takes everything after the first pipe, so it assumes the string contains only one):
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
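A small sketch of the substring approach, with one possible variation for the last pipe. The second element of my_string and the pattern "\\|[^|]*$" are assumptions added here, not part of the answer.

```r
my_string <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c")  # second value is hypothetical

# Position of the first literal pipe (fixed = TRUE disables regex syntax).
first_pipe  <- regexpr("|", my_string, fixed = TRUE)
after_first <- substring(my_string, first_pipe + 1L)

# Position of the last pipe: a pipe followed only by non-pipe characters.
last_pipe  <- regexpr("\\|[^|]*$", my_string)
after_last <- substring(my_string, last_pipe + 1L)

print(after_first)  # "Apodemia_mejicanus" "b|c"
print(after_last)   # "Apodemia_mejicanus" "c"
```

If a string contains no pipe at all, regexpr returns -1 and substring returns the whole string unchanged.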

sub function in r does not replace the first match

I am trying to manipulate a character vector and want to delete all characters before the first occurrence of a specific string using the sub function in R. Since sub replaces only the first match, I expected this to work, but in my code sub seems to replace the last match rather than the first.
Here below is an example
Vec <- c("ID1.P.001", "ID2.P.002") # character vector
# I want to get rid of all characters before the first dot (including the dot)
# So i want to get this vector
c("P.001", "P.002")
#[1] "P.001" "P.002"
# my code
sub('.*\\.', "", Vec )
#[1] "001" "002"
# sub replace the last not the first match !!
How can i use sub to get rid of characters before the first match (including the pattern)?
You can make the * quantifier lazy (as opposed to the default greedy matching) by adding a ? after it, i.e.:
sub('.*?\\.', "", Vec)
[1] "P.001" "P.002"
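To make the greedy versus lazy behaviour visible, the following sketch extracts what each pattern actually consumes before it is replaced. It passes perl=TRUE throughout, which is an assumption made here to keep the engine explicit; the variable names are introduced for illustration.

```r
Vec <- c("ID1.P.001", "ID2.P.002")

# What each pattern matches (the part that sub would delete).
greedy_match <- regmatches(Vec, regexpr(".*\\.", Vec, perl = TRUE))
lazy_match   <- regmatches(Vec, regexpr(".*?\\.", Vec, perl = TRUE))

print(greedy_match)  # "ID1.P." "ID2.P."
print(lazy_match)    # "ID1." "ID2."

# The corresponding replacements.
greedy_out <- sub(".*\\.", "", Vec, perl = TRUE)   # "001" "002"
lazy_out   <- sub(".*?\\.", "", Vec, perl = TRUE)  # "P.001" "P.002"
```

In both cases sub performs a single replacement of the first match; the first match simply extends to the last dot when the quantifier is greedy.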
We can anchor at the start (^) of the string, match the characters that are not a dot ([^.]+ - one or more characters other than a dot; inside [] the . is literal and needs no escaping), followed by a dot (\\. - escaped because . is a metacharacter outside []), and specify a blank ("") as the replacement.
sub("^[^.]+\\.", "", Vec)
#[1] "P.001" "P.002"

Using gsub to replace last occurrence of string in R

I have the following character vector that I need to modify with gsub.
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")
Desired output of filtered strings:
"x", "pm2.5.median", "rmin", "rmin", "rmax"
My current attempt works for everything except the pm2.5.median string which has dots that need to be preserved. I'm really just trying to remove the buffer size that is appended to the end of each variable, e.g. 1000m, 2500m, 5000m, 7500m, and 10000m.
gsub("\\..*m$", "", strings)
"x", "pm2", "rmin", "rmin", "rmax"
Match a dot, one or more digits, an m, and the end of the string, and replace that with the empty string. Note that we prefer sub to gsub here because we are only interested in one replacement per string.
sub("\\.\\d+m$", "", strings)
## [1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
The .* pattern matches any 0 or more chars, as many as possible. The \..*m$ pattern matches the first (leftmost) . in the string and then grabs all the text after it, provided the string ends with m.
You need
> sub("\\.[^.]*m$", "", strings)
[1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
Here, \.[^.]*m$ matches ., then 0 or more chars other than a dot and then m at the end of the string.
See the regex demo.
Details
\. - a dot (must be escaped since it is a special regex char otherwise)
[^.]* - a negated character class matching any char but . 0 or more times
m - an m char
$ - end of string.
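The three suffix patterns discussed in this thread can be compared on a name that has inner dots and also ends in a buffer size. The value "pm2.5.1000m" is a hypothetical example added here; it does not appear in the question.

```r
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")
extra   <- "pm2.5.1000m"  # hypothetical name with inner dots and a buffer suffix

greedy  <- sub("\\..*m$", "", extra)     # starts at the first dot
negated <- sub("\\.[^.]*m$", "", extra)  # starts at the last dot
digits  <- sub("\\.\\d+m$", "", extra)   # last dot followed by digits only

print(greedy)   # "pm2"
print(negated)  # "pm2.5"
print(digits)   # "pm2.5"

cleaned <- sub("\\.[^.]*m$", "", strings)
print(cleaned)  # "x" "pm2.5.median" "rmin" "rmin" "rmax"
```

The digit-based pattern is the stricter of the two working variants, since it only strips a suffix that looks like a number followed by m.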

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
See the R demo
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
See the regex demo.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
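The lookahead and lookbehind variants only differ when _dk or _none appears in the middle of a name. A minimal sketch, assuming a hypothetical extra value "business_metric_dk_total" that is not in the question:

```r
x <- c("business_metric_one", "business_metric_one_dk", "business_metric_one_none",
       "business_metric_two", "business_metric_two_dk", "business_metric_two_none",
       "business_metric_dk_total")  # last value is hypothetical

# Rejects _dk / _none anywhere after "metric".
by_lookahead <- grep("metric(?!.*_(?:dk|none))", x, value = TRUE, perl = TRUE)

# Rejects _dk / _none only at the very end of the string.
by_lookbehind <- grep("metric.*$(?<!_dk|_none)", x, value = TRUE, perl = TRUE)

print(by_lookahead)   # "business_metric_one" "business_metric_two"
print(by_lookbehind)  # "business_metric_one" "business_metric_two" "business_metric_dk_total"
```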
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
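One caveat with the unanchored "(dk|none)" pattern is that it also excludes names that merely contain those letters. The sketch below shows one possible tightening with an anchored suffix; both the pattern "_(dk|none)$" and the value "business_metric_nonexempt" are assumptions introduced here.

```r
string <- c("business_metric_one", "business_metric_one_dk", "business_metric_one_none",
            "business_metric_two", "business_metric_two_dk", "business_metric_two_none",
            "business_metric_nonexempt")  # last value is hypothetical

# Unanchored: drops anything containing "dk" or "none" anywhere.
loose <- string[!grepl("(dk|none)", string)]

# Anchored: drops only names that end in _dk or _none.
anchored <- string[!grepl("_(dk|none)$", string)]

print(loose)     # "business_metric_one" "business_metric_two"
print(anchored)  # "business_metric_one" "business_metric_two" "business_metric_nonexempt"
```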

In R: grab all alnum characters before the first punctuation

I have a vector s of strings (or NAs), and would like to get a vector of the same length containing everything before the first occurrence of punctuation (.).
s <- c("ABC1.2", "22A.2", NA)
I would like a result like:
[1] "ABC1" "22A" NA
You can remove all symbols (incl. a newline) from the first dot with the following Perl-like regex:
s <- c("ABC1.2", "22A.2", NA)
gsub("[.][\\s\\S]*$", "", s, perl=T)
## => [1] "ABC1" "22A" NA
See IDEONE demo
The regex matches
[.] - a literal dot
[\\s\\S]* - any symbols incl. a newline
$ - end of string.
All matches are replaced with "". As the regex engine analyzes the string from left to right, the first dot is matched with [.], and the greedy * quantifier with [\\s\\S] will match everything up to the end of the string.
If there are no newlines, a simpler regex will do: [.].*$:
gsub("[.].*$", "", s)
See another demo
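The effect of a newline on the two patterns can be checked with a short sketch. It assumes perl=TRUE for both calls (the default engine may treat . differently, which this sketch does not cover), and the value "ABC1.2\nxyz" is a hypothetical input added here.

```r
s <- c("ABC1.2", "22A.2", NA, "ABC1.2\nxyz")  # last value is hypothetical

# [\s\S] matches any character, including a newline.
with_class <- gsub("[.][\\s\\S]*$", "", s, perl = TRUE)

# With perl = TRUE, . stops at a newline, so the last element is left unchanged.
with_dot <- gsub("[.].*$", "", s, perl = TRUE)

print(with_class)  # "ABC1" "22A" NA "ABC1"
print(with_dot)    # "ABC1" "22A" NA "ABC1.2\nxyz"
```

NA elements pass through unchanged in both cases, so the result keeps the same length as the input.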

Resources