I've been reading about data URIs on MDN, which gives the following example of a valid data URI:
data:text/html,<script>alert('hi');</script>
However reading through RFC 2397 I have found the following:
dataurl := "data:" [ mediatype ] [ ";base64" ] "," data
mediatype := [ type "/" subtype ] *( ";" parameter )
data := *urlchar
parameter := attribute "=" value
where "urlchar" is imported from RFC2396
My understanding is that urlchar should be what is described in Section 2.4.3 of RFC2396, which lists the US-ASCII characters that have been excluded and specifically says:
The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields.
So my question is, are angle brackets allowed in Data URLs? Am I misinterpreting the RFC or is the example at MDN wrong?
The example is indeed wrong (in that the Data URI is invalid, although it might "work").
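As a rough illustration of how the example could be made to conform, here is a minimal Python sketch that percent-encodes the payload before building the data URI. It relies on the standard library's urllib.parse.quote; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# The payload from the example, containing "<" and ">",
# which RFC 2396 excludes from URIs.
payload = "<script>alert('hi');</script>"

# quote() percent-encodes everything except letters, digits, "_.-~" and "/".
encoded = quote(payload)
data_uri = "data:text/html," + encoded

print(data_uri)
# data:text/html,%3Cscript%3Ealert%28%27hi%27%29%3B%3C/script%3E

# Decoding gives back the original payload.
print(unquote(encoded) == payload)
```

Browsers tend to accept the unencoded form as well, which is why the original example "works" even though it does not match the grammar.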
Related
Reading through the URI syntax description (RFC 3986) and trying to understand what their syntax descriptions mean.
For example, a URI has to have a scheme part, which is restricted by the following syntax description:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
But the specification never tells you what * ( and / mean. Anything in quotations seems to mean exactly that character and ALPHA and DIGIT are seemingly the sets of ASCII characters pertaining to the alphanumeric set. I am guessing / is an or, ( may be a group, and * may be 0 or more. But it is not clarified in the specification.
There are other syntax descriptions like:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
I am also guessing the [ means that part is optional.
Does anybody know if my interpretation is correct? And would you be able to point me to the RFC specification of these characters?
These are all described in RFC 5234, which defines the Augmented BNF (ABNF) format.
/ is for alternatives
* is for variable repetition
( ) groups a sequence of elements
[ ] marks an optional sequence
It is a Backus-Naur-like grammar.
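To make the notation concrete, the scheme rule quoted in the question reads as "one ALPHA, followed by zero or more characters that are each an ALPHA, a DIGIT, "+", "-" or "."". A small Python sketch that translates it into a regular expression (the function name is made up for this example):

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   ALPHA        -> [A-Za-z]
#   *( a / b )   -> zero or more of the alternatives, i.e. [...]*
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def is_valid_scheme(candidate: str) -> bool:
    """Return True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(candidate) is not None


print(is_valid_scheme("http"))     # True
print(is_valid_scheme("svn+ssh"))  # True
print(is_valid_scheme("1http"))    # False: must start with ALPHA
print(is_valid_scheme(""))         # False: the first ALPHA is mandatory
```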
I have $text = "Hello πππ π ππ» π¦¦üäö$"
I want to remove just the emoji from the text using XQuery. How can I do that?
Expected result: "Hello üäö$"
I tried to use:
replace($text, '\p{IsEmoticons}+', '')
but it didn't work; it removed only the smileys.
Result now: "Hello π ππ» π¦¦üäö$"
Thanks in advance :)
I outlined the approach in my answer to the original question, which I updated based on your comment asking about how to strip out 💜.
Quoting from that expanded answer:
The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, 💜 (Purple Heart, U+1F49C), according to a site like https://www.compart.com/en/unicode/U+1F49C that lets you look up Unicode character information, is from:
Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF
This block is not available in XPath or XQuery processors, since it is neither listed in the XML Schema 1.0 spec linked above, nor is it in Unicode block names for use in XSD regular expressions, a list of blocks that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.
For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:
replace("Purple 💜 heart", "[🌀-🗿]", "")
This returns the expected result:
Purple  heart
This approach can be applied to ππ» , π¦¦, or any other character:
Locate the character's Unicode block.
Craft your regular expression with the block name (if available in XPath) or character class.
Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:
xquery version "3.1";
let $text := "Hello πππ π ππ» π¦¦üäö$"
return
replace($text, "\P{IsBasicLatin}", "")
This query returns:
Hello $
Notice that this has stripped out the characters with diacritics, which perhaps isn't desired. These characters with diacritics belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:
xquery version "3.1";
let $text := "Hello πππ π ππ» π¦¦üäö$"
return
replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")
... which returns:
Hello üäö$
This now preserves the characters with diacritics.
To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.
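For comparison only (this is Python, not XQuery), the same "keep these blocks, drop everything else" idea can be sketched with explicit code point ranges instead of block names. Basic Latin is U+0000 to U+007F and Latin-1 Supplement is U+0080 to U+00FF; the sample string is a shortened stand-in that uses the Purple Heart (U+1F49C) and one further emoji, U+1F9A6.

```python
import re

# Shortened stand-in for the sample text, written with escapes so the
# code points are explicit: U+1F49C (Purple Heart), U+1F9A6, then "üäö$".
text = "Hello \U0001F49C \U0001F9A6\u00fc\u00e4\u00f6$"

# Match anything outside Basic Latin (U+0000-U+007F) and
# Latin-1 Supplement (U+0080-U+00FF).
OUTSIDE_LATIN = re.compile(r"[^\u0000-\u00ff]")

cleaned = OUTSIDE_LATIN.sub("", text)
print(cleaned)  # the emoji are gone, "üäö$" is kept
```

Note that the spaces around the removed characters remain, so you may want to normalise whitespace afterwards.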
See: https://www.rfc-editor.org/rfc/rfc3986#section-3
And: https://www.rfc-editor.org/rfc/rfc3986#section-3.3
The origin of "abempty" is mysterious to me, and a quick search didn't turn up any definitions of it.
"abempty", as stated in the comment to the right of its definition in the RFC you reference, means that the path can be either absolute or empty (hence "ab" + "empty").
"Abempty", meaning away from empty, describes the path's relationship to its preceding authority. Where path-abempty is relevant, the hier-part consists of "//", authority, and path-abempty. The authority component may be zero length; scheme:/// is a valid URI.
However, when the authority is zero length and the path is empty, there is no way to distinguish the two components, hence a path-abempty path - it "begins with "/" or is empty" (Section 3.3) depending on the circumstances.
Source: http://w3-org.9356.n7.nabble.com/path-abempty-in-URI-td170118.html (See Fielding's response to Petch.)
NB The word "abempty" is not a portmanteau of the words absolute and empty.
Please:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
The hier-part is not optional in the context of a "generic" URI.
hier-part = ("//" authority path-abempty) / path-absolute / path-rootless / path-empty
The double slashes, interestingly enough, are not optional where path-abempty is relevant. And, jumping ahead a little, the authority may be zero-length:
reg-name = *( unreserved / pct-encoded / sub-delims )
Path-abempty is relevant where the hier-part consists of "//", authority, and path-abempty. Path-abempty is defined as:
path-abempty = *( "/" segment )
The RFC states, "When authority is present, the path must either be empty or begin with a slash ("/") character." If the reg-name is zero length, a casual reading of that statement might suggest that the following URI is valid:
scheme://
It's not. The very next sentence states, "When authority is not present, the path cannot begin with two slash characters ("//")." This means that in parsing our URI that begins with "scheme://" we indicate the possibility of a zero-length authority and a zero-length path; otherwise we could stop right there because the URI would be invalid.
In this case, not the common case by any means, the zero-length authority cannot be discerned from the zero-length path. Hence, when the authority is zero-length, WE DO NOT HAVE A CHOICE, the path MUST begin with a forward slash (more precisely, it must match path-abempty) and discern the path from the authority; otherwise, and I will say it again: the URI would be invalid.
The word "abempty" doesn't imply that the path may be absolute or empty. The word means that the path must distinguish itself from the authority, hence it is abempty, i.e., away from empty.
Examples:
This URI is ambiguous because even if it has a zero-length authority and a zero-length path, there is no way to discern it from an invalid URI that omits the authority and has a path that starts with two forward slashes.
scheme://
This URI is not ambiguous as it clearly contains a zero-length authority and a path-abempty path.
scheme:///
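To see how the two example URIs break into components, here is a small Python sketch using the component-splitting regular expression given in Appendix B of RFC 3986. That expression only splits a reference into parts; it does not validate anything, so this does not settle which reading of the grammar is correct. The function name split_uri is hypothetical.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Group 2 = scheme, group 4 = authority, group 5 = path.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI into scheme, authority and path (no validation)."""
    match = URI_RE.match(uri)  # always matches: every part is optional
    return {
        "scheme": match.group(2),
        "authority": match.group(4),  # None when there is no "//" part
        "path": match.group(5),
    }


print(split_uri("scheme://"))
# {'scheme': 'scheme', 'authority': '', 'path': ''}
print(split_uri("scheme:///"))
# {'scheme': 'scheme', 'authority': '', 'path': '/'}
```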
Given its definition and context in RFC 3986, Section 3.3: I'm confident that abempty is a portmanteau of absolute and empty; as opposed to empty with a Latin ab-prefix.
Possible path patterns are defined as:
path-abempty = *( "/" segment ) ; begins with "/" or is empty
path-absolute = "/" [ segment-nz *( "/" segment ) ] ; begins with "/" but not "//"
path-noscheme = segment-nz-nc *( "/" segment ) ; begins with a non-colon segment
path-rootless = segment-nz *( "/" segment ) ; begins with a segment
path-empty = 0<pchar> ; zero characters
Path-abempty is essentially an extended path-absolute, combined with path-empty.
Path-absolute-or-empty becomes path-abempty.
Disclaimer
My assertion is based solely on inferential conjectures, as I couldn't find the word's etymology, or who coined it. So if anyone has relevant knowledge, to contradict or corroborate: Please, do share!
What characters are allowed in a URL query string?
Do query strings have to follow a particular format?
Per https://www.rfc-editor.org/rfc/rfc3986
In section 2.2 Reserved Characters, the following characters are listed:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The spec then says:
If data for a URI component would conflict with a reserved character's
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
Next, in section 2.3 Unreserved Characters, the following are listed:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
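As a quick way to apply these rules, here is a minimal Python sketch that percent-encodes a single query key or value, leaving only the unreserved characters untouched. It uses urllib.parse.quote with safe="" (on Python 3.7 or later, "~" is left as-is); the helper name encode_component is made up for this example.

```python
from urllib.parse import quote


def encode_component(value: str) -> str:
    """Percent-encode everything except unreserved characters.

    safe="" stops quote() from leaving "/" unencoded, so reserved
    characters inside a key or value cannot act as delimiters.
    """
    return quote(value, safe="")


print(encode_component("a b&c=d~"))  # a%20b%26c%3Dd~
print(encode_component("key/1?"))    # key%2F1%3F
```

This is stricter than the RFC requires for a query component, since several reserved characters are allowed there literally, but encoding them is always safe for data.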
Wikipedia has your answer: http://en.wikipedia.org/wiki/Query_string
"URL Encoding: Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document; the character = is used to separate a name from a value. A query string may need to be converted to satisfy these constraints. This can be done using a schema known as URL encoding.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20[citation needed]
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
The octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
The encoding of SPACE as '+' and the selection of "as-is" characters distinguishes this encoding from RFC 1738."
Regarding the format, query strings are name value pairs. The ? separates the query string from the URL. Each name value pair is separated by an ampersand (&) while the name (key) and value is separated by an equals sign (=). eg. http://domain.com?key=value&secondkey=secondvalue
Under Structure in the Wikipedia reference I provided:
The question mark is used as a separator and is not part of the query string.
The query string is composed of a series of field-value pairs
Within each pair, the field name and value are separated by an equals sign, '='.
The series of pairs is separated by the ampersand, '&' (or semicolon, ';' for URLs embedded in HTML and not generated by a ...; see below).
W3C recommends that all web servers support semicolon separators in addition to ampersand separators[6] to allow application/x-www-form-urlencoded query strings in URLs within HTML documents without having to entity escape ampersands.
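The name=value structure described above is what Python's standard library builds and parses for you. A short sketch (the keys and values are just the ones from the example URL, plus a value containing a space to show the "+" encoding):

```python
from urllib.parse import urlencode, parse_qsl

# Build a query string from field-value pairs; spaces become "+".
query = urlencode({"key": "value", "secondkey": "second value"})
print(query)  # key=value&secondkey=second+value

# Parse a query string (without the leading "?") back into pairs.
pairs = parse_qsl("key=value&secondkey=secondvalue")
print(pairs)  # [('key', 'value'), ('secondkey', 'secondvalue')]
```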
This link has the answer and formatted values you all need.
https://perishablepress.com/url-character-codes/
For your convenience, this is the list:
< %3C
> %3E
# %23
% %25
{ %7B
} %7D
| %7C
\ %5C
^ %5E
~ %7E
[ %5B
] %5D
` %60
; %3B
/ %2F
? %3F
: %3A
@ %40
= %3D
& %26
$ %24
+ %2B
" %22
space %20
I need to pass 2 parameters in a query string but would like them to appear as a single parameter to the user. At a low level, how can I concatenate these two values and then later separate them? Both values are Base64 encoded.
?Name=abcyxz
where both abc and yxz are separate Base64 encoded strings.
Why don't you just do something like this:
temp = base64_encode("var1=abc&var2=yxz")
and then call
?Name=temp
Later you can decode the whole string and split the vars.
(Sorry for the pseudocode :P)
Edit: a small quote from wikipedia
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
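Here is roughly what the "wrap both values in one Base64 string" idea could look like in Python. This is only a sketch: pack and unpack are hypothetical helper names, var1/var2 are the names from the pseudocode, and it uses the URL-safe Base64 alphabet ("-" and "_" instead of "+" and "/") so the outer token needs less escaping in a query string.

```python
import base64
from urllib.parse import urlencode, parse_qs


def pack(var1: str, var2: str) -> str:
    """Combine two values into one URL-safe Base64 token."""
    inner = urlencode({"var1": var1, "var2": var2})  # plain ASCII
    return base64.urlsafe_b64encode(inner.encode("ascii")).decode("ascii")


def unpack(token: str) -> tuple:
    """Reverse pack(): decode the token and split the two values."""
    inner = base64.urlsafe_b64decode(token.encode("ascii")).decode("ascii")
    fields = parse_qs(inner, keep_blank_values=True)
    return fields["var1"][0], fields["var2"][0]


token = pack("abc", "yxz")
print("?Name=" + token)
print(unpack(token))  # ('abc', 'yxz')
```

Because urlencode escapes "+", "/" and "=" inside the inner string, the two values can themselves be Base64 without clashing with the inner separators.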
You should either use some separator or store the length of the first item.
First of all, I would be curious as to why you can't just pass two parameters. But with that as a given, just choose any character that's a valid character in a URL query string, but won't show up in your base64 encoding, such as ~
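A minimal Python sketch of the separator approach, assuming "~" as the separator (it is an unreserved URL character and not part of either Base64 alphabet). The function names are made up, and the URL-safe Base64 variant is used here so the values avoid "+" and "/".

```python
import base64

SEPARATOR = "~"  # never produced by Base64, and valid unescaped in a URL


def join_values(first: bytes, second: bytes) -> str:
    """Base64-encode both values and join them with the separator."""
    a = base64.urlsafe_b64encode(first).decode("ascii")
    b = base64.urlsafe_b64encode(second).decode("ascii")
    return a + SEPARATOR + b


def split_values(combined: str) -> tuple:
    """Split on the separator and decode both halves."""
    a, b = combined.split(SEPARATOR, 1)
    return base64.urlsafe_b64decode(a), base64.urlsafe_b64decode(b)


combined = join_values(b"first value", b"second value")
print("?Name=" + combined)
print(split_values(combined))  # (b'first value', b'second value')
```

The "=" padding that Base64 may append is permitted in a query string, though you may prefer to percent-encode it if whatever parses the URL treats "=" specially.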