Distinguish between email address and IRI - uri

I have a string that can contain either an email address or an IRI (internationalized URI). The strings do not contain additional surrounding whitespace or any HTTP linefolding characters. Moreover they do not contain any elements marked as "obsolete" in their corresponding specifications. I need a simple way to distinguish which of these things the string contains.
I'm looking at what I believe to be the latest respective specifications: RFC 5322 § 3.4.1. Addr-Spec Specification for emails, and RFC 3987 § 2.2. ABNF for IRI References and IRIs for IRIs. I've come up with the following algorithm, with explanations in parentheses:
If the string begins with a quote " character, it is an email address. (Email address local-part may be a quoted string, but an IRI scheme may not.)
Otherwise find the first at # sign or colon : character.
If the character encountered is an at # sign, the string contains an email address.
Otherwise, if it is a colon : character, the string contains an IRI.
Is that approach correct? Is there another simpler approach? Lastly for bonus, how would I expand this algorithm to also distinguish those two things from an IP address (including both IPv4 and IPv6)?

I would think the rules as specified are correct and fast to determine the type (email or IRI). To extend this to IP addresses their corresponding grammar should be added: https://datatracker.ietf.org/doc/html/draft-main-ipaddr-text-rep-00.
So then your rules could be extended to:
Rules: (I assumed well formed input)
First char " => email
First char : => IpV6 (because an IRI the scheme has to contain at least one char)
First of : or #
# => email
: =>
If it does not match the grammar for IpV6 => IRI
Otherwise: ambiguous, also in the grammar, some options
Use as IpV6 => it will be valid, likely to be the thing intended
Use it as IRI => the first part (before the ':') will be a scheme the later part will be one 'segment' in the protocol
So ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff will lead to scheme ffff and 'segment' ffff:ffff:ffff:ffff:ffff:ffff:ffff
I would find this situation very unlikely
Raise an exception, depending on the environment this could be a valid option
Both not in the string => IpV4
ipchar := hex / ':'
hex := [0-9A-Fa-f]

Related

Should the plus in tel URIs be encoded?

In a URI, spaces can be encoded as +. Since this is the case, should the leading plus be encoded when creating tel URIs with international prefix?
Which is better? Do both work in practice?
Call me
Call me
No.
From section 3 of RFC 3966 (The tel URI for Telephone Numbers):
If the reserved characters "+", ";", "=", and "?" are used as delimiters between components of the "tel" URI, they MUST NOT be percent encoded.
You would only percent-encode a + if it’s part of a parameter value:
These characters ["+", ";", "=", and "?"] MUST be percent encoded if they appear in tel URI parameter values.
I’m not sure if the leading +, which indicates that it’s a global number, counts as delimiter, but the definition of a global number says:
Globally unique numbers are identified by the leading "+" character.
So it refers to +, not to something percent-encoded.
And also the examples make clear that it’s not supposed to be percent-encoded, e.g.:
tel:+1-201-555-0123
Note that spaces in tel URIs (e.g., in parameter values) may not be encoded with a +. Using + instead of %20 for a space character is not something that may be done in any URI; it’s only possible in URIs whose URI scheme explicitly defines that.
The tel: URI scheme doesn't have a provision for encoding spaces - see RFC 3966:
5.1.1. Separators in Phone Numbers
...
even though ITU-T E.123 [E.123] recommends the use of space
characters as visual separators in printed telephone numbers, "tel"
URIs MUST NOT use spaces in visual separators to avoid excessive
escaping.
The plus sign encodes a space specifically only in application/x-www-form-urlencoded (default content type for form submission - see W3C info re: forms). There's no valid way to encode a space in tel: URIs. See again RFC 3966 (page 5) for valid visual separators.

Bad production of 2.5.4.5 oid into X509IssuerName, change proposal

I noticed that durnig a xades signature with xades4j the element X509IssuerName presents a bad formatted serialnumber issuer value, it shows a PrintableString Hex encoded, i search into xades4j code and i found that the problem is into the DataGenBaseCertRefs class, if you set
cert.getIssuerX500Principal().getName(X500Principal.RFC1779)
into the generate method you can resolve this problem and procuce an issuer value from this:
2.5.4.5=#130b3037393435323131303036
to this
OID.2.5.4.5=07945211006
I'm not sure that change is correct. XML-DSIG states that RFC 4514 should be used when encoding the distinguished names. Regarding the attribute type, on that RFC one reads:
If the AttributeType is defined to have a short name (...) that short name, a descr, is used. Otherwise the AttributeType is encoded as the dotted-decimal encoding, a numericoid, of its OBJECT IDENTIFIER.
In turn, numericoid is defined on RFC 4512 as follows:
numericoid = number 1*( DOT number )
Regarding the attribute value, one reads:
If the AttributeType is of the dotted-decimal form, the AttributeValue is represented by an number sign ('#' U+0023) character followed by the hexadecimal encoding of each of the octets of the BER e ncoding of the X.500 AttributeValue.
My understanding is that, since a short name was not known, the hex value should be used. What do you think?
This actually makes me realize that xades4j is using RFC 2253, since it is the default on getName().
Are you also including a X509IssuerSerial element on KeyInfo/X509Data? Is that one different from the cert ref?
Can you send me, on another channel, a certificate with those characteristics for tests?

User info in URI without password

I know that URI supports the following syntax:
http://[user]:[password]#[domain.tld]
When there is no password or if the password is empty, is there a colon?
In other words, should I accept this:
http://[user]:#[domain.tld]
Or this:
http://[user]#[domain.tld]
Or are they both valid?
The current URI standard (STD 66) is RFC 3986, and the relevant section is 3.2.1. User Information.
There it’s defined that the userinfo subcomponent (which gets followed by #) can contain any combination of
the character :,
percent-encoded characters, and
characters from the sets unreserved and sub-delims.
So this means that both of your examples are valid.
However, note that the format user:password is deprecated. Anyway, they give recommendations how applications should handle such URIs, i.e., everything after the first : character should not be displayed by applications, unless
the data after the colon is the empty string (indicating no password).
So according to this recommendation, the userinfo subcomponent user: indicates that there is the username "user" and no password.
This is more like convenience and both are valid. I would go with http://[user]#[domain.tld] (and prompt for a password.) because it's simple and not ambiguous. It does not give any chance for user to think if he has to add anything after :

How to determine if a URI is escaped?

I am using apache commons HTTPClient to download web resources. The URI for these resources come from third parties, I do not generate them.
The commons httpclient requires a URI object to be given to the GetMethod object.
The URI constructor takes a string (for the uri) and a boolean specifying if it is escaped or not.
Currently, I am doing the following to determine if the original url I am given is already escaped...
boolean isEscaped = URIUtil.getPathQuery(originalUrl).contains("%");
m.setURI(new URI(originalUrl, isEscaped));
Is this the correct way to determine if a uri is already escaped?
Update...
according to wikipedia ( Well, according to wikipedia ( http://en.wikipedia.org/wiki/Percent-encoding ) it says that percent is a reserved character and should always be encoded... I am quoting verbatim here...
Percent-encoding the percent character[edit] Because the percent ("%")
character serves as the indicator for percent-encoded octets, it must
be percent-encoded as "%25" for that octet to be used as data within a
URI.
Doesnt this mean that you can never have a naked '%' character in a valid uri?
Also, the uri(s) come from various sources so I cannot be sure if they are escaped or unescaped.
This wouldn't work. It's possible the un-encoded string has a % in it already.
ex:
https://www.google.com/#q=like%25&safe=off
is the url for a google search for like%. In unescaped form it would be https://www.google.com/#q=like%&safe=off
Your consumers should let you know if the URI is escaped or not.

ASN.1 Octet Strings

I'm decoding a X.509 Certificate in ASN.1 format. I'm decoding it successfully, traversing the structure, but there is one thing that I don't understand.
There are some scenarios where I get an octet string and this website that I am playing with (http://lapo.it/asn1js/) shows that these octet strings actually contain more of the ASN.1 tree. This website annotates such octet strings with (encapsulates)
My question is this: how do I know during parsing that an octet string actually encapsulates something more? Do I just try to parse it, looking if I get a tag and valid length? If not then it is pure bytes data? And if yes then it is a valid sub-tree?
Or is this meant to be output as bytes and the consumer should then only try to parse it if he knows that it is encoded data from for certain keys?
Take the example that is already loaded on the site and hit "decode". I am referring for example to offset 332 which is an octet string that encapsulates a bit string.
This is what "extensions" looks like in ASN.1 speak (RFC 2459 §B.2 — I know that RFC is "obsolete", but that useful appendix isn't present in the later versions).
Extensions ::= SEQUENCE OF Extension
Extension ::= SEQUENCE {
extnId OBJECT IDENTIFIER,
critical BOOLEAN DEFAULT FALSE,
extnValue OCTET STRING }
Every extension payload is encapsulated within an OCTET STRING. The OID of the extensions tells you what to expect within that octet string.
In the case of keyUsage it's a BIT STRING (§4.2.1.3).
And now I have an answer about my own question on subjectAltName, it's in §4.2.1.7.
One benefit of using OCTET STRING for the content is that, as per spec, unknown (non-critical) extensions can be identified as such and trivially be skipped over (though I think DER makes it trivial too).
And the way to tell ASN.1 tools to deal with that encapsulation is by using the keyword "CONTAINING". For example (this is not the actual/correct certificate spec, but it should give you an idea):
TstCert DEFINITIONS IMPLICIT TAGS ::=
BEGIN
Sun ::= SEQUENCE {
subjAltType OBJECT IDENTIFIER,
name GenNames
}
GenNames ::= SEQUENCE SIZE (1..5) OF GenName
GenName ::= CHOICE {
otherName [0] OtherName,
rfc822Name [1] UTF8String
}
OtherName ::= OCTET STRING (CONTAINING SEQUENCE {
type-id OBJECT IDENTIFIER,
value [0] EXPLICIT UTF8String
} )
END

Resources