According to the HTTP specs a header can look like this:
Header-Name=value1,value2,value3,...
I try to parse the header values and store them as an array:
array('value1', 'value2', 'value3')
So far so good: I can just split the string wherever a comma appears.
BUT how should I handle headers like this one:
Expires=Thu, 01 Dec 1994 16:00:00 GMT
There's a comma, but it is inside the single value the header has. "Oh, that's easy," I thought, and figured out a rule: only split at a comma when there is no space before or after it. This way both examples get parsed correctly.
BUT then I came across a header like this:
Accept-Encoding=gzip, deflate
And now? Is this one value array('gzip, deflate') or two values array('gzip', 'deflate')? To me they are two separate values, but then my rule from above no longer holds.
Is there a list of which headers are allowed more than once, so that I can check against it to determine whether a comma is a value delimiter or not?
Comma concatenation can occur for any header field, even those that aren't designed for it; it's how libraries and intermediaries happen to work.
It is designed to be used for header fields that use list syntax (RFC 7230 has all the details).
Finally, you can't use generic code to tokenize, because the way the comma can occur inside values varies from field to field.
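The point about field-specific parsing can be illustrated with a minimal Python sketch. It assumes a small, hand-picked set of list-valued fields (`LIST_FIELDS` is illustrative and deliberately incomplete) and only splits those; the helper names are hypothetical, and real fields may need stricter, field-specific grammars than this quoted-string handling.

```python
# Illustrative, incomplete set of fields whose grammar uses list syntax.
# This set is an assumption for the sketch, not an authoritative registry.
LIST_FIELDS = {
    "accept", "accept-encoding", "accept-language",
    "cache-control", "connection", "vary",
}


def split_list_value(value):
    """Split on commas that are not inside a double-quoted string."""
    items, current = [], []
    in_quotes = escaped = False
    for ch in value:
        if escaped:
            # Character following a backslash inside quotes is kept verbatim.
            current.append(ch)
            escaped = False
        elif ch == "\\" and in_quotes:
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    # Empty list elements are dropped.
    return [item for item in items if item]


def parse_header(name, value):
    """Return a list of values; only known list fields are split."""
    if name.lower() in LIST_FIELDS:
        return split_list_value(value)
    return [value.strip()]


print(parse_header("Accept-Encoding", "gzip, deflate"))
print(parse_header("Expires", "Thu, 01 Dec 1994 16:00:00 GMT"))
```

With this approach the whitespace around the comma plays no role; whether a comma is a delimiter depends only on which field is being parsed.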
I have a file that has headers but they may not be in the first line (and if they are they may not be the right headers). How can I deal with this?
df<-data.frame(a=seq(0:10))
colnames(df)<-("The Wrong Header")
df[4,1]<-"TheRightHeader" #This is one concatenated string because I couldn't work out how to add spaces using this input method
To tell whether the first header is the correct one: the two rows after the correct header always contain a 0.
The real nasty data is here (available until 27th Sept):
https://leeds365-my.sharepoint.com/:t:/g/personal/cenmk_leeds_ac_uk/EQFIeY_U1f5MrC5B_YdqChkBrNSxQ_6vvVHZ_NR-6kNTYg?e=Jsb918
A desired output would be:
colnames(df)<-"The Right Header"
Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.)
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?
Your second representation is the normal way to generate a CSV file and so should be easy to work with in any software. See the RFC 4180 specifications. https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs  id  name                      value
1    1   Blah                      100
2    2   Has space                 200
3    3   Ends with quotes"         300
4    4   "Surrounded with quotes"  300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words NOT as a standard CSV file), then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter, then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that, you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid making an ambiguous file. For example, the quotes in the 4th observation in your first file look like they are optional quotes around a value instead of part of the value.
Many programs try to handle ambiguous situations. For example SAS does not allow values to contain embedded line breaks so you will always get four observations with your first example file.
But Excel allows the end-of-line character(s) to be embedded inside quoted values. So in your original file the value of the second field in the third observation looks like what you would start to get if you added quotes around this value:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of four complete observations with three field values each, there are only three observations, and the last observation has only two field values.
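As a rough illustration of the quoting rule described above, here is a small Python sketch using the standard `csv` module, whose default dialect doubles embedded quotes. The helper names `to_csv` and `from_csv` are made up for this example; the rows are the ones from the question.

```python
import csv
import io

# The data from the question, with the quote characters as part of the values.
rows = [
    ["ID", "NAME", "VALUE"],
    ["1", "Blah", "100"],
    ["2", "Has space", "200"],
    ["3", 'Ends with quotes"', "300"],
    ["4", '"Surrounded with quotes"', "300"],
]


def to_csv(data):
    """Write rows as CSV text; embedded quotes are doubled by default."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, lineterminator="\n").writerows(data)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


text = to_csv(rows)
print(text)
print(from_csv(text) == rows)
```

The writer here only quotes fields that need it, so `Has space` comes out unquoted; that is still valid, since the quotes around such a field are optional.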
This is caused by the fact that the escape character for " in Excel is "" (see "Escaping quotes and delimiters in CSV files with Excel").
A quick and simple workaround that comes to mind in R is to first read the content of the CSV with readLines, then replace each doubled (escaped) double quote with a single double quote, and then pass the result to read.table:
read.table(
  text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
  sep = ",",
  header = TRUE
)
I'm almost certain this has been asked before, but due to a certain social media app I'm drowning in unrelated search results.
The data set that I'm importing contains actual "#" characters, as in Apartment #404, and I'd like to preserve them if possible, but R seems to treat # as an end of line or something. At first it would bomb out on the first occurrence; then I set fill=TRUE, and now it just ignores the rest of the line after it.
How does one instruct R to treat #'s as regular characters?
If you are not using "#" as a comment symbol in your data, you can use
read.table(..., comment.char="")
That should treat "#" like any other character.
Many languages allow one to pass an array of values through the URL. I need to, for various reasons, construct the URL directly by hand. How is an array of values URL-encoded?
It looks like content in the form of the MIME type application/x-www-form-urlencoded.
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by +, and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH, a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
The control names/values are listed in the order they appear in the document. The name is separated from the value by = and name/value pairs are separated from each other by &.
This is what is used for POST. For GET, you'll have to append a ? to your URL; the rest is almost the same. In the comments, mdma states that the URL may not contain a + for a space character; use %20 instead.
So an array of values:
http://localhost/someapp/?0=zero&1=valueone%20withspace&2=etc&3=etc
Often there is some functionality in libraries that will do the URL encoding for you (the escaping step). The second part is easy to implement by looping over your array and building the string: append the index, an =, the URL-encoded value and, when it's not the last entry, an &.
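The loop described above could be sketched in Python roughly as follows. `build_array_url` is a hypothetical helper name; `urllib.parse.quote` is the standard-library function used here for percent-encoding, so spaces become %20 rather than +.

```python
from urllib.parse import quote


def build_array_url(base_url, values):
    """Append index=value pairs to base_url, percent-encoding each value."""
    pairs = []
    for index, value in enumerate(values):
        # safe="" makes quote() escape every reserved character, including "/".
        pairs.append(f"{index}={quote(str(value), safe='')}")
    # "&" goes between pairs, never after the last one.
    return base_url + "?" + "&".join(pairs)


print(build_array_url(
    "http://localhost/someapp/",
    ["zero", "valueone withspace", "etc", "etc"],
))
```

How the receiving side turns these pairs back into an array depends on the server framework, so the index-as-name convention is only one possible choice.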
I need to pass 2 parameters in a query string but would like them to appear as a single parameter to the user. At a low level, how can I concatenate these two values and then later separate them? Both values are Base64 encoded.
?Name=abcyxz
where abc and yxz are separate Base64 encoded strings.
why don't you just do something like this
temp = base64_encode("var1=abc&var2=yxz")
and then call
?Name=temp
Later you can decode the whole string and split the vars.
(sry for pseudo code :P)
Edit: a small quote from wikipedia
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
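The pseudocode above could look roughly like this in Python. It is a sketch under a few assumptions: `pack` and `unpack` are made-up names, and the URL-safe Base64 variant is swapped in for plain Base64 because "+" and "/" have their own meaning in URLs. The "=" padding can still appear in the token and may need percent-encoding depending on where it is used.

```python
import base64
from urllib.parse import parse_qs, urlencode


def pack(params):
    """Encode several name/value pairs as one URL-safe Base64 token."""
    raw = urlencode(params).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def unpack(token):
    """Decode a token produced by pack() back into a dict."""
    raw = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    # parse_qs returns a list per name; keep the first value of each.
    return {name: values[0] for name, values in parse_qs(raw).items()}


token = pack({"var1": "abc", "var2": "yxz"})
print("?Name=" + token)
print(unpack(token))
```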
You should either use some separator or store the length of the first item.
First of all, I would be curious as to why you can't just pass two parameters. But with that as a given, just choose any character that's a valid character in a URL query string, but won't show up in your base64 encoding, such as ~
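A minimal Python sketch of the separator approach, assuming "~" as the separator (it is not part of the Base64 alphabet and does not need escaping in a query string). The function names and sample inputs are hypothetical.

```python
import base64

SEPARATOR = "~"  # not in the Base64 alphabet, so it cannot occur in a value


def join_values(first, second):
    """Join two Base64 strings into a single query-string value."""
    if SEPARATOR in first or SEPARATOR in second:
        raise ValueError("values must not contain the separator")
    return first + SEPARATOR + second


def split_values(combined):
    """Split a combined value back into its two Base64 strings."""
    first, found, second = combined.partition(SEPARATOR)
    if not found:
        raise ValueError("separator not found")
    return first, second


a = base64.b64encode(b"first value").decode("ascii")
b = base64.b64encode(b"second value").decode("ascii")
combined = join_values(a, b)
print("?Name=" + combined)
print(split_values(combined) == (a, b))
```

Standard Base64 output can still contain "+", "/" and "=", so the combined value may need percent-encoding before it goes into a URL.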