ASN.1 Octet Strings - decoding

I'm decoding an X.509 certificate in ASN.1 format. I'm decoding it successfully and traversing the structure, but there is one thing that I don't understand.
In some places I get an octet string, and the website I am playing with (http://lapo.it/asn1js/) shows that these octet strings actually contain more of the ASN.1 tree. The website annotates such octet strings with "(encapsulates)".
My question is this: how do I know during parsing that an octet string actually encapsulates something more? Do I just try to parse it, checking whether I get a valid tag and length? If not, is it raw byte data, and if yes, is it a valid sub-tree?
Or is it meant to be output as raw bytes, with the consumer only trying to parse it further when it knows that certain keys carry encoded data?
Take the example that is already loaded on the site and hit "decode". I am referring, for example, to offset 332, which is an octet string that encapsulates a bit string.
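To illustrate the "just try to parse it" idea, this is roughly the heuristic I have in mind (a rough Java sketch, assuming DER with definite lengths and single-byte tag numbers; the class and method names are made up):
class Asn1Heuristics {
    // Rough sketch of the heuristic from the question: returns true if the octet
    // string's contents parse exactly as a run of definite-length DER TLVs.
    // Assumes DER (no indefinite lengths) and single-byte tags; not a full BER parser.
    static boolean looksLikeNestedDer(byte[] content) {
        if (content.length == 0) return false;
        int pos = 0;
        while (pos < content.length) {
            int tag = content[pos++] & 0xFF;
            if ((tag & 0x1F) == 0x1F) return false;      // multi-byte tag numbers are out of scope here
            if (pos >= content.length) return false;     // no room for a length octet
            int len = content[pos++] & 0xFF;
            if (len == 0x80) return false;               // indefinite length: not DER
            if (len > 0x80) {                            // long form: (len & 0x7F) length octets follow
                int numOctets = len & 0x7F;
                if (numOctets > 3 || pos + numOctets > content.length) return false;
                len = 0;
                for (int i = 0; i < numOctets; i++) {
                    len = (len << 8) | (content[pos++] & 0xFF);
                }
            }
            if (pos + len > content.length) return false;
            pos += len;                                  // skip over the value
        }
        return true;                                     // consumed the content exactly
    }
}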

This is what "extensions" looks like in ASN.1 speak (RFC 2459 §B.2 — I know that RFC is "obsolete", but that useful appendix isn't present in the later versions).
Extensions ::= SEQUENCE OF Extension
Extension ::= SEQUENCE {
    extnId OBJECT IDENTIFIER,
    critical BOOLEAN DEFAULT FALSE,
    extnValue OCTET STRING }
Every extension payload is encapsulated within an OCTET STRING. The OID of the extension tells you what to expect within that octet string.
In the case of keyUsage it's a BIT STRING (§4.2.1.3).
And now I have an answer to my own question about subjectAltName: it's in §4.2.1.7.
One benefit of using OCTET STRING for the content is that, as per the spec, unknown (non-critical) extensions can be identified as such and trivially skipped over (though I think DER makes that trivial anyway).
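To make the OID-driven approach concrete, here is a minimal sketch of decoding keyUsage once you have matched extnId against 2.5.29.15 and taken the contents of the outer OCTET STRING (hand-rolled DER handling with a short-form length assumed; the class and method names are my own):
class KeyUsageDecoder {
    // Sketch: decode the keyUsage BIT STRING that sits inside the extnValue
    // OCTET STRING (extnId == 2.5.29.15). Assumes DER with a short-form length.
    static void printKeyUsage(byte[] extnValueContents) {
        if (extnValueContents.length < 3 || (extnValueContents[0] & 0xFF) != 0x03) {
            throw new IllegalArgumentException("expected a BIT STRING");
        }
        int length = extnValueContents[1] & 0xFF;      // short-form length assumed
        int unusedBits = extnValueContents[2] & 0xFF;  // first content octet of a BIT STRING
        // Bit 0 is the most significant bit of the first payload byte (RFC 5280 bit names).
        String[] names = { "digitalSignature", "nonRepudiation", "keyEncipherment",
                "dataEncipherment", "keyAgreement", "keyCertSign", "cRLSign",
                "encipherOnly", "decipherOnly" };
        int totalBits = (length - 1) * 8 - unusedBits;
        for (int bit = 0; bit < totalBits && bit < names.length; bit++) {
            int b = extnValueContents[3 + bit / 8] & 0xFF;
            if ((b & (0x80 >> (bit % 8))) != 0) {
                System.out.println(names[bit]);
            }
        }
    }
}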

And the way to tell ASN.1 tools to deal with that encapsulation is by using the keyword "CONTAINING". For example (this is not the actual/correct certificate spec, but it should give you an idea):
TstCert DEFINITIONS IMPLICIT TAGS ::=
BEGIN
Sun ::= SEQUENCE {
    subjAltType OBJECT IDENTIFIER,
    name GenNames
}
GenNames ::= SEQUENCE SIZE (1..5) OF GenName
GenName ::= CHOICE {
    otherName [0] OtherName,
    rfc822Name [1] UTF8String
}
OtherName ::= OCTET STRING (CONTAINING SEQUENCE {
    type-id OBJECT IDENTIFIER,
    value [0] EXPLICIT UTF8String
} )
END

Related

Why is extnValue in X.509 Extensions always encapsulated in an OCTET_STRING?

I'm curious, and I have not been able to find an explanation so far.
In RFC 5280, Extension is defined as follows:
Extension ::= SEQUENCE {
    extnID OBJECT IDENTIFIER,
    critical BOOLEAN DEFAULT FALSE,
    extnValue OCTET STRING
        -- contains the DER encoding of an ASN.1 value
        -- corresponding to the extension type identified
        -- by extnID
}
What is the reason for defining the encapsulating OCTET STRING for extnValue, instead of directly defining extnValue as the "DER encoding of an ASN.1 value corresponding to the extension type identified by extnID"?
Thank you.
Not an authoritative answer, but my thoughts are: this is because extension values may have arbitrary enclosing tags and can be defined in external modules.
Most extensions use SEQUENCE, but some do not; in the given example, Subject Key Identifier is just another OCTET STRING and Key Usage is a BIT STRING. And in the base type definition you would have to use a fixed tag to represent variable content (ANY).
In addition, a parser may not know how to parse a particular extension, so it can read the value as an octet string without having to dig deeper when the extension type is unknown to it.
Update 13.02.2023 (based on comments):
Regarding the type/tag: from my understanding, each different type can be easily identified by its leading tag number, such as SEQUENCE=0x10, OCTET STRING=0x04 or BIT STRING=0x03.
You cannot define a field with a variable tag, because that introduces type ambiguity. That is, an extnValue ANY field definition is not valid, because its type is indeterminate. When you define a type (in this case, the Extension type), all of its fields must have deterministic tags.
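As a small illustration of the skipping point above: a decoder can step over an extension value it does not recognize purely from the fixed OCTET STRING tag and its length, without understanding the inner type (rough Java sketch, short-form length only; names are made up):
class ExtensionSkipper {
    // Sketch: given a buffer positioned at the extnValue OCTET STRING of an
    // unknown extension, return the offset just past it. The fixed 0x04 tag and
    // the length are enough; the inner type never has to be understood.
    // Assumes DER with a short-form (single byte) length for brevity.
    static int skipUnknownExtnValue(byte[] der, int offset) {
        if ((der[offset] & 0xFF) != 0x04) {
            throw new IllegalArgumentException("expected OCTET STRING (0x04)");
        }
        int length = der[offset + 1] & 0xFF;   // short form assumed; real code must handle long form
        return offset + 2 + length;            // tag + length octet + content
    }
}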

Distinguish between email address and IRI

I have a string that can contain either an email address or an IRI (internationalized URI). The strings do not contain additional surrounding whitespace or any HTTP linefolding characters. Moreover they do not contain any elements marked as "obsolete" in their corresponding specifications. I need a simple way to distinguish which of these things the string contains.
I'm looking at what I believe to be the latest respective specifications: RFC 5322 § 3.4.1. Addr-Spec Specification for emails, and RFC 3987 § 2.2. ABNF for IRI References and IRIs for IRIs. I've come up with the following algorithm, with explanations in parentheses:
If the string begins with a quote " character, it is an email address. (Email address local-part may be a quoted string, but an IRI scheme may not.)
Otherwise, find the first at sign @ or colon : character.
If the character encountered is an at sign @, the string contains an email address.
Otherwise, if it is a colon : character, the string contains an IRI.
Is that approach correct? Is there another simpler approach? Lastly for bonus, how would I expand this algorithm to also distinguish those two things from an IP address (including both IPv4 and IPv6)?
I would think the rules as specified are correct and fast for determining the type (email or IRI). To extend this to IP addresses, their corresponding grammar should be added: https://datatracker.ietf.org/doc/html/draft-main-ipaddr-text-rep-00.
So then your rules could be extended as follows (a rough code sketch follows after the rules):
Rules (assuming well-formed input):
First char " => email
First char : => IPv6 (because in an IRI the scheme has to contain at least one character)
Find the first of : or @
    @ => email
    : =>
        If it does not match the grammar for IPv6 => IRI
        Otherwise it is ambiguous (also valid in the grammar); some options:
            Use it as IPv6 => it will be valid and is likely the thing intended
            Use it as an IRI => the first part (before the ':') will be the scheme and the later part will be one 'segment' in the protocol,
            so ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff will lead to scheme ffff and 'segment' ffff:ffff:ffff:ffff:ffff:ffff:ffff.
            I would find this situation very unlikely.
            Raise an exception; depending on the environment this could be a valid option.
    Neither : nor @ in the string => IPv4
ipchar := hex / ':'
hex := [0-9A-Fa-f]
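A rough sketch of those rules in code (well-formed input assumed, as above; isValidIPv6 is only a placeholder for a real grammar check, and all names are made up):
class AddrClassifier {
    enum Kind { EMAIL, IRI, IPV4, IPV6, AMBIGUOUS }

    static Kind classify(String s) {
        if (s.startsWith("\"")) return Kind.EMAIL;    // quoted local-part
        if (s.startsWith(":")) return Kind.IPV6;      // an IRI scheme needs at least one character
        int first = indexOfFirst(s, '@', ':');
        if (first < 0) return Kind.IPV4;              // neither '@' nor ':' present
        if (s.charAt(first) == '@') return Kind.EMAIL;
        // first special character is ':'
        if (!isValidIPv6(s)) return Kind.IRI;
        return Kind.AMBIGUOUS;                        // could be IPv6 or an IRI like "ffff:..."
    }

    static int indexOfFirst(String s, char a, char b) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == a || c == b) return i;
        }
        return -1;
    }

    // Placeholder only: a real check would implement the full textual IPv6 grammar.
    static boolean isValidIPv6(String s) {
        return s.matches("[0-9A-Fa-f:]+") && s.chars().filter(c -> c == ':').count() >= 2;
    }
}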

How to send binary data (DEL and NUL characters) through an HTTP URL

I want to send some binary characters as part of an HTTP URL; can someone tell me the best way to do it?
Ex: \x7F/a.html (\x7F represents ASCII DEL in binary form)
Sending it with telnet or curl sends it as a string. Do you think sending it on a socket directly will work?
Will sock.send('GET /test\x7F/a.html HTTP/1.0\r\nHost: 1.1.1.1\r\n\r\n') work?
According to the HTTP spec, the request-target takes one of several forms derived from the URI syntax. From the URI spec, a path can only contain printable 7-bit ASCII characters: alphanumerics and a few symbols like '-', '.', '%', '~' and others. It does not allow ASCII control characters.
According to the URI spec, path characters outside the printable 7-bit ASCII range should be percent-encoded, so ASCII DEL should be encoded as %7F and ASCII NUL as %00.
It's hard to say whether percent-encoding your binary characters “would work” as you do not explain what you expect to get from them. An HTTP request-target is an opaque identifier interpreted by the server, and need not correspond to a file name or actual data. It is perfectly feasible (and common) to refer to binary targets with ASCII alphanumeric request-targets.
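If you do want to try it, a minimal sketch of sending the percent-encoded request-target over a raw socket could look like this (host and path are placeholders):
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: send the request with the DEL byte percent-encoded in the
// request-target. Host and path are placeholders.
public class RawRequest {
    public static void main(String[] args) throws Exception {
        try (Socket sock = new Socket("example.com", 80)) {
            String request = "GET /test%7F/a.html HTTP/1.0\r\n"
                    + "Host: example.com\r\n"
                    + "\r\n";                                   // blank line ends the headers
            OutputStream out = sock.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII)); // the request line itself stays plain ASCII
            out.flush();
            InputStream in = sock.getInputStream();
            int b;
            while ((b = in.read()) != -1) {
                System.out.write(b);                             // echo the raw response
            }
            System.out.flush();
        }
    }
}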

Bad production of 2.5.4.5 oid into X509IssuerName, change proposal

I noticed that during a XAdES signature with xades4j, the X509IssuerName element contains a badly formatted serial number issuer value: it shows a hex-encoded PrintableString. I searched the xades4j code and found that the problem is in the DataGenBaseCertRefs class. If you set
cert.getIssuerX500Principal().getName(X500Principal.RFC1779)
in the generate method, you can resolve this problem and produce an issuer value that goes from this:
2.5.4.5=#130b3037393435323131303036
to this
OID.2.5.4.5=07945211006
I'm not sure that change is correct. XML-DSIG states that RFC 4514 should be used when encoding distinguished names. Regarding the attribute type, in that RFC one reads:
If the AttributeType is defined to have a short name (...) that short name, a descr, is used. Otherwise the AttributeType is encoded as the dotted-decimal encoding, a numericoid, of its OBJECT IDENTIFIER.
In turn, numericoid is defined on RFC 4512 as follows:
numericoid = number 1*( DOT number )
Regarding the attribute value, one reads:
If the AttributeType is of the dotted-decimal form, the AttributeValue is represented by a number sign ('#' U+0023) character followed by the hexadecimal encoding of each of the octets of the BER encoding of the X.500 AttributeValue.
My understanding is that, since a short name was not known, the hex value should be used. What do you think?
This actually makes me realize that xades4j is using RFC 2253, since it is the default on getName().
Are you also including an X509IssuerSerial element in KeyInfo/X509Data? Is that one different from the cert ref?
Can you send me, on another channel, a certificate with those characteristics for tests?
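For what it's worth, the difference between the two formats can be reproduced directly with X500Principal (a small sketch; the CN is made up and the hex value is the one from the example above):
import javax.security.auth.x500.X500Principal;

// Sketch: compare how the two formats render a DN whose serialNumber
// (OID 2.5.4.5) value is given as a hex-encoded PrintableString.
public class DnFormats {
    public static void main(String[] args) {
        X500Principal p = new X500Principal(
                "2.5.4.5=#130b3037393435323131303036,CN=Test");
        System.out.println("RFC2253: " + p.getName(X500Principal.RFC2253));
        System.out.println("RFC1779: " + p.getName(X500Principal.RFC1779));
    }
}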

How to determine if a URI is escaped?

I am using apache commons HTTPClient to download web resources. The URI for these resources come from third parties, I do not generate them.
The commons httpclient requires a URI object to be given to the GetMethod object.
The URI constructor takes a string (for the uri) and a boolean specifying if it is escaped or not.
Currently, I am doing the following to determine if the original URL I am given is already escaped...
boolean isEscaped = URIUtil.getPathQuery(originalUrl).contains("%");
m.setURI(new URI(originalUrl, isEscaped));
Is this the correct way to determine if a URI is already escaped?
Update...
Well, according to Wikipedia (http://en.wikipedia.org/wiki/Percent-encoding), the percent character is reserved and should always be encoded... I am quoting verbatim here...
Percent-encoding the percent character: Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.
Doesn't this mean that you can never have a naked '%' character in a valid URI?
Also, the URIs come from various sources, so I cannot be sure whether they are escaped or unescaped.
This wouldn't work. It's possible the un-encoded string has a % in it already.
ex:
https://www.google.com/#q=like%25&safe=off
is the URL for a Google search for like%. In unescaped form it would be https://www.google.com/#q=like%&safe=off
Your consumers should let you know whether the URI is escaped or not.
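If you still need a heuristic in the meantime, something slightly stronger than contains("%") is to treat a string as possibly escaped only when every '%' starts a valid escape sequence; as the like% example shows, this still cannot tell an intentional literal "%25" apart from an escaped "%" (rough sketch; the class and method names are made up):
import java.util.regex.Pattern;

// Heuristic sketch: a string can only be *already* escaped if every '%'
// starts a valid escape (two hex digits). The reverse is not decidable:
// an unescaped string may happen to contain "%25" too.
public class EscapeCheck {
    private static final Pattern BARE_PERCENT =
            Pattern.compile("%(?![0-9A-Fa-f]{2})");

    static boolean couldBeEscaped(String uri) {
        return !BARE_PERCENT.matcher(uri).find();
    }

    public static void main(String[] args) {
        System.out.println(couldBeEscaped("https://www.google.com/#q=like%25&safe=off")); // true
        System.out.println(couldBeEscaped("https://www.google.com/#q=like%&safe=off"));   // false: bare '%'
    }
}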
