Match elements from a character range n times - r

Assume I have a string like this:
id = "ce91ffbe-8218-e211-86da-000c29e211a0"
What regex can I write in R that will verify that this string is 36 characters long and only contains letters, numbers, and dashes?
There is nothing in the documentation on how to use a character range (e.g. [0-9A-z-]) with a quantifier (e.g. {36}). The following code is always returning TRUE regardless of the quantifier. I'm sure I'm missing something simple here...
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("[0-9A-z-]{36}", id)
#> [1] TRUE
grepl("[0-9A-z-]{34}", id)
#> [1] TRUE
This behavior only starts when I add the check for the numbers 0-9 in the character range.

Try either of the following:
grepl("^[0-9a-zA-Z-]{36}$",id)
or
grepl("^[[:alnum:]-]{36}$",id)
Running the first one gives the following output.
grepl("^[0-9a-zA-Z-]{36}$",id)
[1] TRUE
Explanation: the call is broken down below for explanation purposes only.
grepl(" ##Use grepl to check whether the regex matches; it returns TRUE or FALSE.
^ ##^ asserts the start of the string.
[[:alnum:]-] ##The character class [[:alnum:]] with a dash (-) added matches any letter, digit, or dash.
{36} ##Match exactly 36 occurrences of the preceding class (letters, digits, or dashes).
$", ##$ asserts the end of the string; together with ^, the 36 characters must span the whole value.
id) ##The value to test.
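As for why the original attempt returns TRUE for any quantifier up to 36: without ^ and $, grepl only needs to find a run of that many allowed characters somewhere inside the string. A minimal sketch of the difference follows. The variable names are illustrative, and perl = TRUE is used in the last check only so that the A-z range is interpreted by code point.

```r
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"

# Unanchored: succeeds because some 34-character run inside id matches.
unanchored_34 <- grepl("[0-9A-Za-z-]{34}", id)

# Anchored: the whole string must consist of exactly 34 (or 36) allowed characters.
anchored_34 <- grepl("^[0-9A-Za-z-]{34}$", id)
anchored_36 <- grepl("^[0-9A-Za-z-]{36}$", id)

# The range A-z also covers the punctuation between "Z" and "a"
# (such as "_" and "^"), so it accepts more than letters.
underscore_in_A_z <- grepl("^[A-z]$", "_", perl = TRUE)

print(c(unanchored_34, anchored_34, anchored_36, underscore_in_A_z))
# TRUE FALSE TRUE TRUE
```

This is why the answers here spell the letters out as a-zA-Z or use [[:alnum:]] rather than A-z.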

You want to use:
^[0-9a-z-]{36}$
^ Assert position start of line.
[0-9a-z-] Character set for numbers, letters a to z and dashes -.
{36} Match preceding pattern 36 times.
$ Assert position end of line.

If the string can have other characters before or after the target characters, try
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id)
#[1] FALSE
And this will still work.
id2 <- paste0(":+)!#", id)
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id2)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id2)
#[1] FALSE

Related

Inverting a regex in R

I have this character vector:
[1] "19980213" "19980214" "19980215" "19980216" "19980217" "iffi" "geometry"
[8] "date_consid"
and I want to match all the elements that are not dates and not "date_consid". I tried
res = grep("(?!\\d{8})|(?!date_consid)", vec, value=T)
But I just can't make it work...
You can use
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(\\d{8}|date_consid)$", vec, value=TRUE, invert=TRUE)
## => [1] "iffi" "geometry"
The ^(\d{8}|date_consid)$ regex matches a string that only consists of any eight digits or that is equal to date_consid.
The value=TRUE makes grep return values rather than indices, and invert=TRUE inverts the regex match result (returns the elements that do not match).
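The same filter can be sketched with grepl and logical negation, which is equivalent to invert=TRUE here. The names is_excluded and kept are illustrative.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# TRUE for elements that are exactly eight digits or exactly "date_consid".
is_excluded <- grepl("^(\\d{8}|date_consid)$", vec)

# Keep everything that was not flagged.
kept <- vec[!is_excluded]
print(kept)
# [1] "iffi"     "geometry"
```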
The pattern that you tried matches every element because the lookaheads are unanchored.
Combining two separate negative lookaheads with | will still match all strings, since every string satisfies at least one of them.
You can change the logic to assert, from the start of the string, that what is directly to the right is neither 8 digits nor date_consid, in a single check.
Using a negative lookahead, you have to add perl=T, add an anchor ^ to assert the start of the string, and add an anchor $ after each alternative inside the lookahead to assert the end of the string.
^(?!\\d{8}$|date_consid$)
^ Start of string
(?! Negative lookahead
\\d{8}$ Match 8 digits until end of string
| Or
date_consid$ Match date_consid until end of string
) Close lookahead
For example
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(?!\\d{8}$|date_consid$)", vec, value=T, perl=T)
Output
[1] "iffi" "geometry"
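To see why the original attempt returned everything, the two patterns can be compared side by side. This is a small sketch, and the names unanchored and anchored are illustrative.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# Each alternative is a zero-width check that may succeed at any position,
# and every string satisfies at least one of them, so all elements match.
unanchored <- grep("(?!\\d{8})|(?!date_consid)", vec, value = TRUE, perl = TRUE)

# Anchored at the start, with $ inside the lookahead: only whole-string
# dates and "date_consid" are rejected.
anchored <- grep("^(?!\\d{8}$|date_consid$)", vec, value = TRUE, perl = TRUE)

print(length(unanchored))  # 8
print(anchored)            # "iffi" "geometry"
```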

Extracting string between punctuation, when present

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0
Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
##[1] 2 30 NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
##[1] 2 30 NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1] 2 30 0
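The NA-to-0 replacement mentioned above can be sketched as follows for solution (2). This assumes the three-element version of test that matches the outputs shown, and a version of R that provides strcapture. The name result is illustrative.

```r
test <- c("Score (ABC): 2 (of 100); see note",
          "Amount of ABC; 30%",
          "Presence of ABC: negative")

# NA where no digits follow the first colon or semicolon.
x <- strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num

# Replace the NAs with 0, as suggested in the text.
result <- ifelse(is.na(x), 0, x)
print(result)
# [1]  2 30  0
```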
Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1] 2 30 0
Per the OP's description (italics is mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criterion. So here is the OP's test set with one extra case at the end to test that:
test<-c( "Score (ABC): 2 (of 100); see note",
"Amount of ABC; 30%",
"Presence of ABC: negative",
"...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='\\1',
x=test, perl=TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set the sub() replacement = '0\\1' and wrap with as.integer() as follows:
as.integer( gsub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='0\\1',
x=test, perl=TRUE) )
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
(\\d*) Everything between parentheses is a capturing group - this is the 1st capturing group: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible (greedy)
.*? Match any character between zero and unlimited times, as few times as possible (lazy)
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.
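As a quick check of the explanation above, the capture group can be inspected directly with regexec and regmatches instead of sub. This is only a sketch. The shortened fourth test string and the names pattern and captured are illustrative, not from the original post.

```r
test <- c("Score (ABC): 2 (of 100); see note",
          "Amount of ABC; 30%",
          "Presence of ABC: negative",
          "Note; none; 666")
pattern <- ".+?[:;]\\D*?[^;](\\d*).*?;*.*"

# Each list element holds the full match followed by capture group 1.
m <- regmatches(test, regexec(pattern, test, perl = TRUE))

# Pull out capture group 1 for every input string.
captured <- vapply(m, function(parts) parts[2], character(1))
print(captured)
# [1] "2"  "30" ""   ""
```

Note that a string with no : or ; at all does not match the pattern, so sub() would return it unchanged rather than as an empty string.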

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - a negative lookahead that fails the match if, after any 0+ chars other than line break chars, there is a _ followed by either dk or none.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
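The difference between the two patterns only shows up for a name that contains _dk or _none somewhere other than at the end. A small sketch, where business_metric_dk_total is a made-up value added for illustration:

```r
x <- c("business_metric_one", "business_metric_one_dk",
       "business_metric_one_none", "business_metric_dk_total")

# Lookahead: rejects any value with _dk or _none anywhere after "metric".
no_suffix_anywhere <- grep("metric(?!.*_(?:dk|none))", x,
                           value = TRUE, perl = TRUE)

# Lookbehind at the end: rejects only values that end with _dk or _none.
no_suffix_at_end <- grep("metric.*$(?<!_dk|_none)", x,
                         value = TRUE, perl = TRUE)

print(no_suffix_anywhere)  # "business_metric_one"
print(no_suffix_at_end)    # "business_metric_one" "business_metric_dk_total"
```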
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

Replace all characters between the 3rd occurrence of “-” and the ":" in each element of a vector

Here is what I am trying to do:
Given a string, I want to remove everything between the third occurrence of '-' and the ':' character, assuming there is a third occurrence, which there may not be.
This is my expected result :
Initial string
yy-aa-bbb-cccc1:HYT => yy-aa-bbb:HYT
yy-aa-vvv-vv:ZTR => yy-aa-vvv:ZTR
yy-aa-ddd:YTLM => yy-aa-ddd:YTLM
Any help?
gsub('(.*-.*-.*)\\-.*(\\:.*)','\\1\\2',string)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
We capture the first three dash-separated fields as one group: two instances of characters that are not a - followed by - (([^-]+-){2}), then another run of characters that are not a - ([^-]+). This is followed by an optional - and any characters that are not a : (-*[^:]*), and then a final capture group that starts with : ((:.*)). The match is replaced with the backreferences to the outer first group and the last group. Note that [^:]* rather than [^:]+ is needed here, otherwise the pattern would consume the last character of the third field when there is no third -.
sub("(([^-]+-){2}[^-]+)-*[^:]*(:.*)", "\\1\\3", str1)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
data
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Match the first three fields and everything afterwards up to the colon, and replace that with the first three fields and the colon. A string with fewer than three dashes does not match and is returned unchanged. Note that \w matches any word character and the \ needs to be doubled inside "..." :
sub("^(\\w+-\\w+-\\w+)-.+:", "\\1:", xx)
## [1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
Note: The input xx in reproducible form is:
xx <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
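Just throwing a stringi solution in there. It replaces the last - with a marker and then drops everything from the marker to the colon, so it assumes each element has a third -; an element with only two dashes (the last one here) loses its third field.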
library(stringi)
sub('_.*:' ,':', stri_replace_last_fixed(x, '-', '_'))
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"
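Since the elements may or may not have a third -, it can help to check a candidate pattern against the expected output from the question. The following is a sketch of one such check. The pattern and the names input, expected and result are illustrative rather than taken from any of the answers.

```r
input <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
expected <- c("yy-aa-bbb:HYT", "yy-aa-vvv:ZTR", "yy-aa-ddd:YTLM")

# Keep the first three dash-separated fields, then drop the third "-"
# and everything up to (not including) the colon. Elements without a
# third "-" do not match and are left unchanged.
result <- sub("^((?:[^-:]+-){2}[^-:]+)-[^:]*", "\\1", input, perl = TRUE)

print(identical(result, expected))
# [1] TRUE
```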

is dash a special character in R regex?

Despite reading the help page of R regex
Finally, to include a literal -, place it first or last (or, for perl = TRUE only, precede it by a backslash).
I can't understand the difference between
grepl(pattern=paste("^thing1\\-",sep=""),x="thing1-thing2")
and
grepl(pattern=paste("^thing1-",sep=""),x="thing1-thing2")
Both return TRUE. Should I escape or not here? What is the best practice?
The hyphen is mostly a normal character in regular expressions.
You do not need to escape the hyphen outside of a character class; it has no special meaning.
Within a character class [ ] you can place a hyphen as the first or last character to match it literally. If you place the hyphen anywhere else you need to escape it (which, per the help page, is only supported with perl = TRUE) in order to add it to your class.
Examples:
grepl('^thing1-', x='thing1-thing2')
[1] TRUE
grepl('[-a-z]+', 'foo-bar')
[1] TRUE
grepl('[a-z-]+', 'foo-bar')
[1] TRUE
grepl('[a-z\\-\\d]+', 'foo-bar', perl = TRUE)
[1] TRUE
Note: It is more common to find a hyphen placed first or last within a character class.
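A short sketch of the escaped form: with perl = TRUE, a backslash-escaped hyphen in the middle of a class is a literal, while the unescaped form defines a range. The test strings and variable names are illustrative.

```r
# Escaped hyphen: the class contains "a", "-" and "z" only.
escaped_hyphen <- grepl("[a\\-z]", "-", perl = TRUE)
escaped_m <- grepl("[a\\-z]", "m", perl = TRUE)

# Unescaped hyphen between two characters: the range a to z.
range_hyphen <- grepl("[a-z]", "-", perl = TRUE)
range_m <- grepl("[a-z]", "m", perl = TRUE)

print(c(escaped_hyphen, escaped_m, range_hyphen, range_m))
# TRUE FALSE FALSE TRUE
```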
To see what it means for - to have a special meaning inside of a character class (and how putting it last gives it its literal meaning), try the following:
grepl("[w-y]", "x")
# [1] TRUE
grepl("[w-y]", "-")
# [1] FALSE
grepl("[wy-]", "-")
# [1] TRUE
grepl("[wy-]", "x")
# [1] FALSE
They are both matching the exact same text in these instances. I.e.:
x <- "thing1-thing2"
regmatches(x,regexpr("^thing1\\-",x))
#[1] "thing1-"
regmatches(x,regexpr("^thing1-",x))
#[1] "thing1-"
A - is a special character in certain situations though: it specifies a range of values, such as the characters between a and z, when used inside [], e.g.:
regmatches(x,regexpr("[a-z]+",x))
#[1] "thing"

Resources