HTTP empty header - http

Is it acceptable to have an empty header in HTTP?
By empty I mean ":" with no header name and no header value.
The same question is also relevant to HTTP/2 (I suppose the answer is the same, but I want to be sure).
Thanks.

HTTP defines a header field as:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
; obsolete line folding
; see Section 3.2.4
The token part is later on defined as:
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
The implication is that the header name must be at least 1 byte, and the value can be 0 or more characters.
HTTP/2 uses the same underlying data-model.
https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4
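The rule above can be checked mechanically. The following is a minimal Python sketch, not a full header parser: the function name `has_valid_field_name` is made up for illustration, and it only tests that the text before the first colon is a non-empty token built from the tchar set quoted above.

```python
import re

# tchar set from the grammar above: the listed punctuation plus DIGIT and ALPHA.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def has_valid_field_name(header_line: str) -> bool:
    """Return True if the text before the first ':' is a non-empty token."""
    name, sep, _value = header_line.partition(":")
    return sep == ":" and TOKEN.fullmatch(name) is not None

print(has_valid_field_name(":"))            # False: empty field-name
print(has_valid_field_name("X-Empty:"))     # True: an empty value is allowed
print(has_valid_field_name("Bad Name: x"))  # False: space is not a tchar
```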

Related

Is percent-encoded format allowed for the IPv4address format of the host of a URI in RFC 3986?

I am trying to create a URI library from the RFC 3986 definitions. While validating host there are 3 allowed formats:
IP-literal
IPv4address
Reg-name
Here the IPv4address format is a little ambiguous (or I am misunderstanding it). The definition says it can be of the format dec-octet "." dec-octet "." dec-octet "." dec-octet, where
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
When the definition says "2" %x30-34 DIGIT, does it mean that any of the three digits (of a number in the 0-255 range) can be represented in percent-encoded format, or that only some particular digits can be?
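For context: in ABNF (RFC 5234), a terminal such as %x30-34 is a range of character codes written in hexadecimal, so it stands for one literal character from "0" to "4". It is unrelated to URI percent-encoding, which is the separate pct-encoded rule. Under that reading, a rough Python sketch of the dec-octet grammar could look like this (the names `DEC_OCTET`, `IPV4` and `is_ipv4address` are illustrative only):

```python
import re

# dec-octet alternatives, longest first.
# %x31-39 is one character "1".."9"; %x30-34 is "0".."4"; %x30-35 is "0".."5".
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

def is_ipv4address(host: str) -> bool:
    """Return True if host matches the IPv4address rule exactly."""
    return IPV4.fullmatch(host) is not None

print(is_ipv4address("192.168.0.1"))  # True
print(is_ipv4address("256.1.1.1"))    # False: 256 is out of range
print(is_ipv4address("%31.2.3.4"))    # False: "%31" is not a dec-octet
```

A host containing percent-encoded octets can still be valid as a reg-name, since that rule does allow pct-encoded; it just does not match IPv4address.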

A regex to split a text string in R

I have a very long string like the sample below, and I'm struggling to find a regex to split it into parts according to the pattern, for example: '1. OAS / AC' and '2. OAS / AD'.
This slice of text has:
1) a varying number in the beginning
2) two capital letters varying from A to Z
I tried this:
x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
but it does not work.
Thanks in advance, for any help!
Example
require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
want <- list(
"1. OAS / AC " = "12345/this is a test string to regex,",
"2. OAS / AD " = "79856/this is another test string to regex,",
"3. OAS / AE " = "87987/this is a new test string to regex.",
"4. OAS / AZ " = "78798456/this is one mode test string to regex."
)
We could do this with a positive lookahead, looking for the pattern of a number followed by a period:
str_split(have, "(?=\\d+\\.)")
[1] "" "1. OAS / AC 12345/this is a test string to regex, "
[3] "2. OAS / AD 79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."
And we can further clean it up:
str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]
[1] "1. OAS / AC 12345/this is a test string to regex, " "2. OAS / AD 79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. " "4. OAS / AZ 78798456/this is one mode test string to regex."
You may use
library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]
Result:
dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,",
# "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
# ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
# ))
See the regex demo.
Pattern details
(\d+\. OAS / [A-Z]{2}) - Capturing group 1:
\d+ - 1+ digits
\. - a .
OAS / - a literal OAS / substring
[A-Z]{2} - two uppercase letters
\s* - 0+ whitespaces
(.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
(?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current location, there must be either
\s*\d+\. OAS / [A-Z]{2} - 0+ whitespaces, 1+ digits, ., space, OAS, space, /, space, two uppercase letters
| - or
\z - end of string.
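For readers who want to try the same pattern outside R, here is a rough Python equivalent. It is a sketch, not part of the original answer: Python's `re` module uses `\Z` as its end-of-string anchor, so that replaces `\z`, and the names `PATTERN` and `res` are chosen for illustration.

```python
import re

have = ("1. OAS / AC 12345/this is a test string to regex, "
        "2. OAS / AD 79856/this is another test string to regex, "
        "3. OAS / AE 87987/this is a new test string to regex. "
        "4. OAS / AZ 78798456/this is one mode test string to regex.")

# Group 1 is the label, group 2 the text up to the next label or end of string.
PATTERN = re.compile(r"(\d+\. OAS / [A-Z]{2})\s*(.*?)(?=\s*\d+\. OAS / [A-Z]{2}|\Z)")

# findall returns (label, text) tuples, which dict() turns into a mapping.
res = dict(PATTERN.findall(have))
for label, text in res.items():
    print(repr(label), "=", repr(text))
```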
The way you described the issue is somewhat unclear, but if you simply want to extract everything up to "OAS / AC",
library(qdap)
beg2char(have, " ", 4) # looks for the fourth occurrence of a space and extracts everything before it
For the above function to work, the sentences should be individual strings in a character vector.
If your aim is to actually insert an "=" sign between the two letter sub-string and the number occurring after "OAS",
gsub("([A-Z])\\s*([0-9])","\\1 = \\2",have,perl=T)

Valid characters in cookie string?

Is this cookie string valid? Specifically this bit: I0=; []scayt_verLang=6; I can't find a simple breakdown of the spec or an online validator.
Cookie JavascriptEnabled=true; Cms_User_Id=removed6CYjfBVknUjmvf9Pp/uSVYoemoQOXCcB0SOg3kZWX9/KZfo9v5C8O7MmLg1Xz0qXf94Wf86p4rLi2lxxminXfnP/16p6pzmwIU5qz7Of4plcQkK6JM6XiU/zbyZb3gksDOz2s8xjhfzWg0ekjgTZUx76/kFuW10/Rf7O8n05aIZzhUX0Gd9UNjk40zLA1DkJ02uNGtMbnil9P9iqVARhE0CNjCZFxc9qoLpyyRXtqG8nv0V/3k175KXzzg6iW6j9jH/DuGH8ko5YZoo6TxiIcW3ViRnFVfoiMK49iatauD2nF6xOtRV6LLH57RV3DhkhTTb/MQurw8bHYbsZWJRIuSnFwKeFUEOoxvRG4friI6d4Qug11F1oM3ECSdbDeKKPXuq5+IUImt8XXZUtBFUeakqWT4oXgnsToeNoI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
Thanks,
Joe
Cookie and Set-Cookie HTTP headers are defined in RFC 6265 Section 4 with RFC 2616 Section 2.2 providing the basic types.
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = <token, defined in [RFC2616], Section 2.2>
Token as defined in RFC 2616...
token = 1*<any CHAR except CTLs or separators>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
Let's look at your cookie (I've stripped out most of the junk).
JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
You have a bunch of cookie-pairs...
JavascriptEnabled=true
Cms_User_Id=removedlotsoftextI0=
[]scayt_verLang=6
ASP.NET_SessionId=removed0l4mhioft0uavblzdeq
last_msg_check=1425606361000
The cookie-name []scayt_verLang is invalid because it contains separators which are not allowed in a token.
I0= is not its own pair, but the tail end of the very long value of Cms_User_Id. = is allowed in a cookie-value so it's valid.
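The same check can be sketched in Python. This is an illustrative approximation of the two grammars quoted above (token from RFC 2616, cookie-octet from RFC 6265), not a complete cookie parser; the helper names `is_token` and `invalid_pairs` are invented for the example, and it assumes pairs are separated by exactly "; ".

```python
import re

# token per RFC 2616: any CHAR except CTLs (0-31, 127) or separators.
SEPARATORS = '()<>@,;:\\"/[]?={} \t'
TOKEN_CHARS = {chr(c) for c in range(33, 127)} - set(SEPARATORS)

# cookie-octet per RFC 6265, optionally wrapped in double quotes.
OCTETS = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*"
VALUE = re.compile(rf'(?:{OCTETS}|"{OCTETS}")')

def is_token(text):
    """Return True for a non-empty string made only of token characters."""
    return bool(text) and all(ch in TOKEN_CHARS for ch in text)

def invalid_pairs(cookie_string):
    """Return the cookie-pairs whose name or value breaks the grammar."""
    bad = []
    for pair in cookie_string.split("; "):
        name, sep, value = pair.partition("=")
        if not (sep and is_token(name) and VALUE.fullmatch(value)):
            bad.append(pair)
    return bad

cookie = ("JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; "
          "[]scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; "
          "last_msg_check=1425606361000")
print(invalid_pairs(cookie))  # ['[]scayt_verLang=6']
```

Splitting each pair at the first "=" is what keeps the trailing "=" of the Cms_User_Id value inside the value rather than treating I0= as a separate pair.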

Regular expression to match ID's and classes in CSS page

I'm trying to analyze HTML code and extract all CSS classes and ID's from the source. So I need to extract whatever is between two quotation marks, which can be preceded by either class or id:
id="<extract this>"
class="<extract this>"
/(?:id|class)="([^"]*)"/gi
replacement expression: $1
This regex in English: match either "id" or "class", then an equals sign and a quote, then capture everything that is not a quote before matching the closing quote. Do this globally and case-insensitively.
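As a quick way to try that pattern, here is a small Python sketch. The sample markup and the names `ATTR`, `values` and `names` are made up for illustration; splitting each captured value on whitespace is an extra step, added because a class attribute can hold several names.

```python
import re

# Same pattern as above; re.IGNORECASE stands in for the "i" flag,
# and findall plays the role of the "g" flag.
ATTR = re.compile(r'(?:id|class)="([^"]*)"', re.IGNORECASE)

html = '<div id="main" class="box wide"><p CLASS="note">hi</p></div>'

values = ATTR.findall(html)
# A class attribute may contain several space-separated names.
names = [name for value in values for name in value.split()]
print(names)  # ['main', 'box', 'wide', 'note']
```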
Since you prefer using regular expression, here is one way I suppose.
\b(?:id|class)\s*=\s*"([^"]*)"
Regular expression:
\b # the boundary between a word char (\w) and not a word char
(?: # group, but do not capture:
id # 'id'
| # OR
class # 'class'
) # end of grouping
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
= # '='
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
" # '"'
( # group and capture to \1:
[^"]* # any character except: '"' (0 or more times)
) # end of \1
" # '"'
You may want to try this:
<?php
$css = <<< EOF
id="<extract this>"
class="<extract this>"id="<extract this2>"
class="<extract this3>"id="<extract this4>"
class="<extract this5>"id="<extract this6>"
class="<extract this7>"id="<extract this8>"
class="<extract this9>"
EOF;
preg_match_all('/(?:id|class)="(.*?)"/sim', $css , $classes, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($classes[1]); $i++) {
echo $classes[1][$i]."\n";
}
/*
<extract this>
<extract this>
<extract this2>
<extract this3>
<extract this4>
<extract this5>
<extract this6>
<extract this7>
<extract this8>
<extract this9>
*/
?>
DEMO:
http://ideone.com/Nr9FPt

List of valid characters for the fragment identifier in an URL?

I'm using the fragment identifier to create a permalink for AJAX events in my web app similar to this guy. Something like:
http://www.myapp.com/calendar#filter:year/2010/month/5
I've done quite a bit of searching but can't find a list of valid characters for the fragment identifier. The W3C spec doesn't offer anything.
Do I need to encode the characters the same way as in the rest of the URL?
There doesn't seem to be any good information on this anywhere.
See RFC 3986.
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So you can use !, $, &, ', (, ), *, +, ,, ;, =, something matching %[0-9a-fA-F]{2}, something matching [a-zA-Z0-9], -, ., _, ~, :, @, /, and ?
https://www.rfc-editor.org/rfc/rfc3986#section-3.5:
fragment = *( pchar / "/" / "?" )
and
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
So, combined, the fragment cannot contain #, a raw %, ^, [, ], {, }, \, ", < and > according to the RFC.
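The fragment rule translates directly into a regular expression. The sketch below is one possible Python rendering, with pchar expanded per RFC 3986 (unreserved, sub-delims, ":" and "@", plus pct-encoded); the names `FRAGMENT` and `is_valid_fragment` are illustrative.

```python
import re

# fragment = *( pchar / "/" / "?" )
# Character class: unreserved, sub-delims, ":", "@", "/", "?".
# Alternative branch: pct-encoded ("%" followed by two hex digits).
FRAGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")

def is_valid_fragment(fragment: str) -> bool:
    """Return True if the text after '#' matches the fragment rule."""
    return FRAGMENT.fullmatch(fragment) is not None

print(is_valid_fragment("filter:year/2010/month/5"))  # True
print(is_valid_fragment("a#b"))                       # False: raw "#"
print(is_valid_fragment("100%"))                      # False: raw "%"
print(is_valid_fragment("100%25"))                    # True: "%" encoded as %25
```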
Another RFC also covers this: RFC 1738
URL schemeparts for ip based protocols:
HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

Resources