List of valid characters for the fragment identifier in a URL?

I'm using the fragment identifier to create a permalink for AJAX events in my web app similar to this guy. Something like:
http://www.myapp.com/calendar#filter:year/2010/month/5
I've done quite a bit of searching but can't find a list of valid characters for the fragment identifier. The W3C spec doesn't offer anything.
Do I need to encode the characters the same way as in the rest of the URL?
There doesn't seem to be any good information on this anywhere.

See RFC 3986.
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So you can use !, $, &, ', (, ), *, +, ,, ;, =, something matching %[0-9a-fA-F]{2}, something matching [a-zA-Z0-9], -, ., _, ~, :, @, /, and ?
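As a rough illustration, the grammar can be turned into a regular expression. This is a minimal sketch in Python; the names FRAGMENT_RE and is_valid_fragment are made up for this example, and the character classes are transcribed from the RFC 3986 rules (including "@" in pchar).

```python
import re

# Character classes transcribed from the RFC 3986 ABNF.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# fragment = *( pchar / "/" / "?" ), where pchar also admits ":" and "@".
# A "%" is only accepted as part of a pct-encoded triplet.
FRAGMENT_RE = re.compile(
    rf"(?:[{UNRESERVED}{SUB_DELIMS}:@/?]|%[0-9A-Fa-f]{{2}})*"
)


def is_valid_fragment(fragment: str) -> bool:
    """Return True if the text after '#' matches the fragment rule."""
    return FRAGMENT_RE.fullmatch(fragment) is not None


print(is_valid_fragment("filter:year/2010/month/5"))  # True
print(is_valid_fragment("a b"))                       # False (raw space)
print(is_valid_fragment("100%"))                      # False (raw percent sign)
```

The example fragment from the question passes unchanged, so filter:year/2010/month/5 needs no encoding.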

https://www.rfc-editor.org/rfc/rfc3986#section-3.5:
fragment = *( pchar / "/" / "?" )
and
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
So, combined, the fragment cannot contain #, a raw % (one not followed by two hex digits), ^, [, ], {, }, \, |, `, ", <, > or a space according to the RFC.
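Characters outside the allowed set have to be percent-encoded before they go into the fragment. A minimal sketch in Python, assuming the input is raw, not-yet-encoded text (the helper name encode_fragment and the constant FRAGMENT_SAFE are hypothetical); it relies on the standard library's urllib.parse.quote, which always leaves the unreserved characters alone:

```python
from urllib.parse import quote, unquote

# sub-delims plus ":", "@", "/" and "?"; quote() never encodes the
# unreserved characters (letters, digits, "-", ".", "_", "~").
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def encode_fragment(raw: str) -> str:
    """Percent-encode everything the RFC 3986 fragment rule disallows."""
    return quote(raw, safe=FRAGMENT_SAFE)


encoded = encode_fragment("filter:name/a b[1]")
print(encoded)           # filter:name/a%20b%5B1%5D
print(unquote(encoded))  # filter:name/a b[1]
```

Because a literal % is itself encoded (as %25), passing an already-encoded string through this helper would encode it twice.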

One other RFC also covers this: RFC 1738
URL schemeparts for ip based protocols:
HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

Related

Searching and replacing characters with classes in R

I am trying to replace text in R. I want to find spaces between letters and numbers only and delete them, but when I search using [:alpha:] and [:alnum:] it replaces with that class operator.
> string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
> str_replace_all(string,
+ "[:alpha:] & [:alnum:]",
+ "[:alpha:]&[:alnum:]")
[1] "WORD = 500 * WORD + ((WOR[:alpha:]&[:alnum:]00) - (WOR[:alpha:]&[:alnum:]00))"
How can I use the function so that it returns:
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
Capture the characters on either side and refer back to them in the replacement:
str_replace_all(string, "([:alpha:]) & ([:alnum:])", "\\1&\\2")
Your requirement is easy enough to handle using gsub with lookarounds:
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
output <- gsub("(?<=\\w) & (?=\\w)", "&", string, perl=TRUE)
output
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
Here is a brief explanation of the regex:
(?<=\\w) assert that what precedes is a word character
[ ]&[ ] then match a space, followed by `&`, followed by another space
(?=\\w) assert that what follows is also a word character
Then, we replace with just a single &, with no spaces on either side.
Here is another option that uses regex lookarounds to match one or more spaces (\\s+) either preceding or following a & and replaces them with an empty string ("")
gsub("(?<=&)\\s+|\\s+(?=&)", "", string, perl = TRUE)
#[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"

HTTP empty header

Is it acceptable to have an empty header in HTTP?
By empty I mean ":" with no header name and no header value.
The same question is also relevant to HTTP/2 (I suppose the answer is the same, but I want to be sure).
Thanks.
HTTP defines a header field as:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
; obsolete line folding
; see Section 3.2.4
The token part is later on defined as:
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
The implication is that the header name must be at least 1 byte, and the value can be 0 or more characters.
HTTP/2 uses the same underlying data-model.
https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4
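To make the implication concrete, here is a minimal Python sketch of a header-line check built from the grammar above. The names HEADER_RE and parse_header_line are invented for this example, and the value pattern is a simplification: it accepts any run of visible ASCII, obs-text, spaces and tabs, and does not handle obs-fold.

```python
import re

# tchar set from the grammar above; a field-name (token) is 1*tchar.
TCHAR = r"!#$%&'*+\-.^_`|~0-9A-Za-z"

# Simplified field-value: visible ASCII, obs-text, SP or HTAB.
# Obsolete line folding (obs-fold) is deliberately not handled.
HEADER_RE = re.compile(
    rf"(?P<name>[{TCHAR}]+):[ \t]*(?P<value>[\t\x20-\x7e\x80-\xff]*?)[ \t]*"
)


def parse_header_line(line: str):
    """Return (name, value) for a well-formed header line, else None."""
    m = HEADER_RE.fullmatch(line)
    return (m.group("name"), m.group("value")) if m else None


print(parse_header_line("Host: example.com"))  # ('Host', 'example.com')
print(parse_header_line("X-Empty:"))           # ('X-Empty', '')
print(parse_header_line(":"))                  # None
```

A bare ":" is rejected because the name needs at least one tchar, while an empty value after a valid name is accepted.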

vector of punctuation

For digits I can write a vector like this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
How can I get an analogous vector of punctuation marks?
You could convert numbers to punctuation using Unicode code points (thanks, Konrad, for pointing that out).
strsplit(intToUtf8(c(33:47, 58:64, 91:96)), "")[[1]]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
#[15] "/" ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`"
Some Ethiopic punctuation (0x1361:0x1367):
strsplit(intToUtf8(0x1361:0x1367), "")[[1]]
[1] "፡" "።" "፣" "፤" "፥" "፦" "፧"
If this is missing punctuation you want to use, you can look up the Unicode code points for the punctuation you want and use them (e.g. somewhere like http://www.fileformat.info/info/unicode/category/Po/list.htm). You can also get the integers from utf8ToInt. For instance, "~" isn't included above:
utf8ToInt("~")
#[1] 126

Valid characters in cookie string?

Is this cookie string valid? Specifically this bit: I0=; []scayt_verLang=6; I can't find a simple breakdown of the spec or an online validator.
Cookie JavascriptEnabled=true; Cms_User_Id=removed6CYjfBVknUjmvf9Pp/uSVYoemoQOXCcB0SOg3kZWX9/KZfo9v5C8O7MmLg1Xz0qXf94Wf86p4rLi2lxxminXfnP/16p6pzmwIU5qz7Of4plcQkK6JM6XiU/zbyZb3gksDOz2s8xjhfzWg0ekjgTZUx76/kFuW10/Rf7O8n05aIZzhUX0Gd9UNjk40zLA1DkJ02uNGtMbnil9P9iqVARhE0CNjCZFxc9qoLpyyRXtqG8nv0V/3k175KXzzg6iW6j9jH/DuGH8ko5YZoo6TxiIcW3ViRnFVfoiMK49iatauD2nF6xOtRV6LLH57RV3DhkhTTb/MQurw8bHYbsZWJRIuSnFwKeFUEOoxvRG4friI6d4Qug11F1oM3ECSdbDeKKPXuq5+IUImt8XXZUtBFUeakqWT4oXgnsToeNoI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
Thanks,
Joe
Cookie and Set-Cookie HTTP headers are defined in RFC 6265 Section 4 with RFC 2616 Section 2.2 providing the basic types.
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = <token, defined in [RFC2616], Section 2.2>
Token as defined in RFC 2616...
token = 1*<any CHAR except CTLs or separators>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
Let's look at your cookie (I've stripped out most of the junk).
JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
You have a bunch of cookie-pairs...
JavascriptEnabled=true
Cms_User_Id=removedlotsoftextI0=
[]scayt_verLang=6
ASP.NET_SessionId=removed0l4mhioft0uavblzdeq
last_msg_check=1425606361000
The cookie-name []scayt_verLang is invalid because it contains separators which are not allowed in a token.
I0= is not its own pair, but the tail end of the very long value of Cms_User_Id. = is allowed in a cookie-value so it's valid.
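The same check can be done mechanically. Below is a minimal Python sketch (the function names is_token and check_cookie_string are hypothetical) that splits a cookie-string on "; ", splits each pair at the first "=", and tests the name against the RFC 2616 token rule and the value against the cookie-octet ranges quoted above:

```python
import re

# RFC 2616 separators; a token is 1*<any CHAR except CTLs or separators>.
SEPARATORS = '()<>@,;:\\"/[]?={} \t'

# cookie-octet ranges from RFC 6265, optionally wrapped in double quotes.
OCTETS = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*"
COOKIE_VALUE_RE = re.compile(rf'{OCTETS}|"{OCTETS}"')


def is_token(s: str) -> bool:
    """True if s is non-empty printable ASCII with no separators."""
    return bool(s) and all(32 < ord(c) < 127 and c not in SEPARATORS for c in s)


def check_cookie_string(cookie_string: str):
    """Return a list of (name, value, name_ok, value_ok) per cookie-pair."""
    results = []
    for pair in cookie_string.split("; "):
        # Split at the first "=" only; later "=" characters belong to the value.
        name, _, value = pair.partition("=")
        value_ok = COOKIE_VALUE_RE.fullmatch(value) is not None
        results.append((name, value, is_token(name), value_ok))
    return results


cookie = (
    "JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; "
    "[]scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; "
    "last_msg_check=1425606361000"
)
bad_names = [name for name, _, name_ok, _ in check_cookie_string(cookie) if not name_ok]
print(bad_names)  # ['[]scayt_verLang']
```

Splitting at the first "=" is what keeps the trailing = of the Cms_User_Id value inside that value instead of starting a new pair.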

Non alphanumeric characters in R

For uppercase letters, lowercase letters and the ten digits, I can generate vectors as follows:
A <- LETTERS[0:26]
B <- letters[0:26]
C <- seq(0,9)
I wonder whether there is a similar function for non-alphanumeric characters.
~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
I tried
D <- c("~","!","@","#","$","%","^", "&","*","_","-","+","=","`","|","\\","(",")","{","}","[","]",":",";","\"","'","<",">",",",".","?","/")
Thanks
This is another option. Generate all ascii characters, then filter out the non punctuation with regular expressions.
ascii <- rawToChar(as.raw(0:127), multiple=TRUE)
ascii[grepl('[[:punct:]]', ascii)]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "<" "=" ">" "?" "@"
# [23] "[" "\\" "]" "^" "_" "`" "{" "|" "}" "~"
This might be useful. The ASCII character set is arranged in ranges of similar types of characters (letters, etc).
http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html
It's a bit drawn out, and there's probably a better website (and a better way to get the same result), but
library(XML); library(RCurl)
doc <- htmlParse(getURL("https://wci.llnl.gov/codes/basis/manual/node161.html"))
xp <- xpathSApply(doc, "//tr/td", xmlValue, trim = TRUE)
xp[nzchar(xp) & nchar(xp) == 1]
# [1] "!" "[" "%" "," "]" "&" "-" "|" "'" "." "=" "~" "("
# [14] "/" ")" "*" "=" "{" "?" "`" "}" "#" ":" ";" "^" " "
Also, using the website from the other answer yields a more complete result
> URL <- "http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html"
> r <- readLines(URL, warn = FALSE)[780:874]
> s <- sapply(strsplit(r, "\\s+"), "[", 1)
> s[!s %in% c(letters, LETTERS, 0:9)]
# [1] "" "!" "\"" "#" "$" "%" "&" "'" "("
# [10] ")" "*" "+" "," "-" "." "/" ":" ";"
# [19] "<" "=" ">" "?" "@" "[" "\\\\" "]" "^"
# [28] "_" "`" "{" "|" "}" "~"
...or yeah, just use rawToChar(as.raw(...)) like MrFlick said :-)
This answer is only for amusement: list the characters you want and use strsplit to generate your vector.
> D <- strsplit('!"#$%&\'()*+,-./\\:;<=>?@[]^_`{|}~', '(?=.)', perl=T)[[1]]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] "\\" ":" ";" "<" "=" ">" "?" "@" "[" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
Or filter the characters you want.
> D <- gsub('[^\\pP\\pS]', '', rawToChar(as.raw(1:127), multiple=T), perl=T)
> D[D != ""]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
