Is percent-encoded format allowed for the IPv4address form of a URI host in RFC 3986? - uri

I am trying to create a URI library from the RFC 3986 definitions. While validating host there are 3 allowed formats:
IP-literal
IPv4address
Reg-name
Here the IPv4address format is a little ambiguous (or I am misunderstanding it). The definition says it has the form dec-octet "." dec-octet "." dec-octet "." dec-octet, where
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
When the definition says "2" %x30-34 DIGIT, does it mean that any of the three digits (of a number in the 0-255 range) can be represented in percent-encoded form, or that only some particular digits can be?
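For what it's worth, in ABNF (RFC 5234) a terminal such as %x30-34 is a range of character values written in hexadecimal, here the characters "0" through "4". It is not URI percent-encoding, so under that reading dec-octet only ever matches literal decimal digits. A minimal Python sketch of the rule as a regular expression (the names DEC_OCTET, IPV4ADDRESS and is_ipv4address are made up for illustration):

```python
import re

# dec-octet from RFC 3986, one alternative per ABNF line.
# %x31-39 is the character range "1".."9", %x30-34 is "0".."4", and so on.
DEC_OCTET = (
    r"(?:25[0-5]"        # "25" %x30-35      ; 250-255
    r"|2[0-4][0-9]"      # "2" %x30-34 DIGIT ; 200-249
    r"|1[0-9]{2}"        # "1" 2DIGIT        ; 100-199
    r"|[1-9][0-9]"       # %x31-39 DIGIT     ; 10-99
    r"|[0-9])"           # DIGIT             ; 0-9
)

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4address(host):
    """Return True if the whole host string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(host) is not None
```

A host such as %31.2.3.4 fails this check. Since reg-name permits pct-encoded, a parser following RFC 3986 would presumably treat such a host as a reg-name rather than as an IPv4address.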

Related

Regex, match year listed as range

I have a list of years like this:
2018-
2001–2020
1999-
2005-
I would like to create a regex to match the year with these criteria:
xxxx- matches xxxx
yyyy-nnnn matches nnnn
Can you please help me?
I've tried [[:digit:]]{4}$, or alternatively [[:digit:]]{4}-$, but they only partially work.
To get the last year in the "range," delimited by the - character, the cleanest way is
my $year = (split /-/, $range)[-1];
Perl's split drops trailing empty fields, so when nothing follows the last delimiter the final element it returns is the text before that delimiter. The last element of the returned list (index -1) is therefore either the second year, as in 2001-2020, or the only one, as in the other examples. This performs no checking of input.
With a regex, one way is to seek the last number in the string
my ($year) = $range =~ /([0-9]+)[^0-9]*$/;
where if you use [0-9]{4} then there is a small additional measure of checking.
The POSIX character class [[:digit:]] and its negation [[:^digit:]] (or \P{PosixDigit}) can be used instead if desired, but note that these match all manner of Unicode "digit characters," just like \d and \D do (a few hundred), on top of the ascii [0-9] (unless /a modifier is used).
A full test program, for both
use warnings;
use strict;
use feature 'say';
my @ranges = qw(2018- 2001-2020 1999- 2005-);
foreach my $range (@ranges) {
my $year = (split /-/, $range)[-1];
# Or, using regex
# my ($year) = $range =~ /([0-9]+)[^0-9]*$/;
say $year;
}
Prints as desired.
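For comparison, here is a rough Python sketch of the regex approach above (the function name last_year is hypothetical). The split idiom does not carry over directly, because Python's str.split keeps the trailing empty string, so "2018-".split("-")[-1] is empty:

```python
import re


def last_year(range_text):
    """Return the last run of ASCII digits in the string, or None if there is none."""
    m = re.search(r"([0-9]+)[^0-9]*$", range_text)
    return m.group(1) if m else None


for r in ["2018-", "2001-2020", "1999-", "2005-"]:
    print(last_year(r))
```

Because the pattern only looks for the last digit run, it does not depend on which dash character separates the years.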
We can capture the 4 digits as a group, followed by an optional - at the end ($) of the string, and replace the match with the backreference (\\1) to the captured group
sub(".*(\\d{4})-?$", "\\1", str1)
#[1] "2018" "2020" "1999" "2005"
data
str1 <- c("2018-", "2001-2020", "1999-", "2005-")
You can split the text on "-" and get the last number.
x <- c("2018-", "2001-2020", "1999-", "2005-")
sapply(strsplit(x, '-', fixed = TRUE), tail, 1)
#[1] "2018" "2020" "1999" "2005"

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' should pop up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
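The same idea (delete every accepted character and look at what is left) can be sketched in Python. Python's re module has no [:punct:] class, so this sketch assumes string.punctuation (ASCII punctuation only) as a stand-in. The names ALLOWED and invalid_chars are illustrative:

```python
import re
import string

# Accepted characters: a-z, 1-9, ASCII punctuation, space,
# French letters with diacritics, and the hyphen (last, so it is literal).
ALLOWED = "[a-z1-9" + re.escape(string.punctuation) + " àâæçéèêëîïôœùûüÿ-]+"


def invalid_chars(text):
    """Strip every accepted character; whatever remains is invalid."""
    return re.sub(ALLOWED, "", text, flags=re.IGNORECASE)


print(invalid_chars("This is an 2-piece à-la-carte dessert"))
print(invalid_chars("Ã this Øs an apple"))
```

An empty result means the string contains only accepted characters.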
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
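A comparable position lookup in Python might use re.finditer with a negated class. This is only a sketch: it assumes string.punctuation in place of [:punct:], spells the letter range as A-Za-z rather than A-z, and the names INVALID and invalid_spans are made up:

```python
import re
import string

# Negated class: one or more characters that are NOT in the accepted set.
INVALID = re.compile("[^A-Za-z1-9" + re.escape(string.punctuation) + " àéèçè]+")


def invalid_spans(text):
    """Return (start_index, matched_text) for each run of invalid characters."""
    return [(m.start(), m.group()) for m in INVALID.finditer(text)]


print(invalid_spans("Ã this Øs an apple"))
```

Python indices are 0-based, so the positions reported here are one lower than the 1-based positions in the R output.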

Regex: Extracting numbers from parentheses with multiple matches

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\( # (
(\d+ # capture 1+ digits
(?: B\.C\.)? # optionally followed by " B.C."
)
Note that backslashes need to be escaped in R.
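The same expression can be tried outside R. Below is a small Python sketch (the names YEAR and first_year are hypothetical); with a raw string the backslashes do not need doubling:

```python
import re

# "(" followed by captured digits, optionally followed by " B.C."
YEAR = re.compile(r"\((\d+(?: B\.C\.)?)")


def first_year(title):
    """Return the first parenthesised number in the title, or None."""
    m = YEAR.search(title)
    return m.group(1) if m else None


a = '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b = "Þegar það gerist (1998/I) (TV)"
print(first_year(a), first_year(b))
```

Since search stops at the first match, later groups such as (399 B.C.) are ignored.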
Your pattern contains .+ parts that match one or more characters, as many as possible, so at best it could grab the last chunk of digits from each input string.
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (OPTIONAL, MAY BE OMITTED)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match thanks to the \K match reset operator; the (?=(?:/[^)]*)?\)) lookahead then ensures that the digits are followed by an optional / plus 0+ chars other than ), and then a ). Note that regexpr extracts the first match only.
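Python's built-in re module does not support \K, so a sketch of the same extraction there would swap in a lookbehind, (?<=\(), for the match reset. The names PAREN_YEAR and paren_year are illustrative:

```python
import re

# Four digits preceded by "(" and followed by an optional "/..." part and ")".
PAREN_YEAR = re.compile(r"(?<=\()\d{4}(?=(?:/[^)]*)?\))")


def paren_year(title):
    """Return the first four-digit number enclosed as (YYYY) or (YYYY/...)."""
    m = PAREN_YEAR.search(title)
    return m.group(0) if m else None


titles = [
    '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
    "Þegar það gerist (1998/I) (TV)",
    "Johannes Passion, BWV. 245 (1725 Version) (1996) (V)",
]
print([paren_year(t) for t in titles])
```

In the third title, (1725 Version) is skipped because the lookahead requires either ) or a / right after the four digits.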

Why is this end of line (\\b) not recognised as a word boundary in stringr/ICU and Perl

Using stringr, I tried to detect a € sign at the end of a string as follows:
str_detect("my text €", "€\\b") # FALSE
Why is this not working? It is working in the following cases:
str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution
But it also fails in perl mode:
grepl("€\\b", "2009in €", perl=TRUE) # FALSE
So what is wrong about the €\\b-regex? The regex €$ is working in all cases...
When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.
It appears that a TRE word boundary:
When used after a non-word character matches the end of string position, and
When used before a non-word character matches the start of string position.
See the R tests:
> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"
>
This is not a common behavior of a word boundary in PCRE and ICU regex flavors where a word boundary before a non-word character only matches when the character is preceded with a word char, excluding the start of string position (and when used after a non-word character requires a word character to appear right after the word boundary):
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
\b
is equivalent to
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
which is to say it matches
between a word char and a non-word char,
between a word char and the start of the string, and
between a word char and the end of the string.
€ is a symbol, and symbols aren't word characters.
$ uniprops €
U+20AC <€> \N{EURO SIGN}
\pS \p{Sc}
All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and non-space (treating the start and end as a space).
(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
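Python's re module follows the same word-boundary definition as described for PCRE and ICU, so the behaviour can be reproduced there. A minimal sketch (the variable names are made up for illustration):

```python
import re

# Boundary between space and non-space, treating start/end of string as space.
SPACE_BOUNDARY = r"(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))"

# The euro sign is not a word character, so no \b exists between it and the end.
euro_word = re.search(r"€\b", "my text €")

# A letter is a word character, so \b matches right before the end of string.
letter_word = re.search(r"a\b", "my text a")

# The lookaround-based boundary only cares about whitespace, so it matches.
euro_space = re.search("€" + SPACE_BOUNDARY, "my text €")

print(euro_word is None, letter_word is not None, euro_space is not None)
```

The whitespace-based boundary is a different test from \b, so it is only a substitute when "not followed by a non-space character" is what is actually wanted.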

Regular Expression for IP address separated by comma or *

I want to build a regex that matches either a string of IPs separated by commas (,) or a string consisting only of *. The string should not contain both IP addresses and *.
Validate each IP, e.g. 1.1.1.1 (digits and the . character). Also, * alone is allowed.
If * is present, no other IPs should be present.
This is the regex
(((25[0-5]|2[0-4]\d|[01]?\d\d?)\.(25[0-5]|2[0-4]\d|[01]?\d\d?)\.(25[0-5]|2[0-4]\d|[01]?\d\d?)\.(25[0-5]|2[0-4]\d|[01]?\d\d?)(,\n|,?))|(,*))
Testing string:
192.168.1.1,192.56.3.23,189.35.2.2,198.23.45.56,198.168.1.255
How do I check for *?
You may use
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)|\*)(?:,\s*(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)|\*))*$
See the regex demo
Expanded/verbose/free-spacing version:
^ # start of string
(?: # start of a grouping
(?: # start of another grouping
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # First octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # Second octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # Third octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?) # Fourth octet
|\* # or just a * char instead of an IP
) # end of another grouping
) # end of grouping
(?:,\s* # a group that will repeat 0+ times, matches , then 0+ whitespaces
(?: # an IP matching grouping
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # First octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # Second octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\. # Third octet and .
(?:25[0-5]|2[0-4]\d|[01]?\d\d?) # Fourth octet
|\*) # Or a *
)* # ... zero or more times
$ # end of string
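Note that the pattern as written also accepts mixtures such as 1.1.1.1,* because * is allowed at every list position. If the string must be either a lone * or a list of IPs, one possible variant moves the * alternative to the top level. A Python sketch under that assumption (the names OCTET, IP, IP_LIST and is_valid_ip_list are hypothetical):

```python
import re

# One octet, using the same alternatives as the pattern above.
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
# Four octets separated by dots.
IP = OCTET + r"(?:\." + OCTET + r"){3}"
# Either a lone "*", or one IP followed by zero or more ", IP" items.
# re.ASCII keeps \d limited to 0-9.
IP_LIST = re.compile(r"(?:\*|" + IP + r"(?:,\s*" + IP + r")*)", re.ASCII)


def is_valid_ip_list(text):
    """True for '*' alone or for a comma-separated list of IPv4 addresses."""
    return IP_LIST.fullmatch(text) is not None


print(is_valid_ip_list("192.168.1.1,192.56.3.23,189.35.2.2"))
print(is_valid_ip_list("*"))
print(is_valid_ip_list("1.1.1.1,*"))
```

Building the pattern from named pieces avoids repeating the octet expression eight times, and fullmatch takes the place of the ^ and $ anchors.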
