Why do github-webhook payloads contain 'zen'? - github-webhook

I've noticed in the ping section here that the payloads contain a "Random string of GitHub zen". What's the point of these strings?

Related

Generating a multipart/byterange response without scanning the parts ahead of sending

I would like to generate a multipart byte range response. Is there a way for me to do it without scanning each segment I am about to send out, since I need to generate multipart boundary strings?
For example, a user can request a byte range that would have me fetch and scan 2 GB of data, which in my case involves loading that data into my (slow) VM as strings and so forth. Ideally I would like to simply state in the response that a part has a length of a certain number of bytes, and be done with it. Is there any tooling that could provide me with this option? I see that many developers just grab a UUID as the boundary and are probably willing to risk the tiny probability that it will appear somewhere within the part. Is that risk really small enough that so many people accept it?
To explain in more detail: scanning the parts ahead of time (before generating the response) is not really feasible in my case since I need to fetch them via HTTP from an upstream service. This means that I effectively have to prefetch the entire part first to compute a non-matching multipart boundary, and only then can I splice that part into the response.
Assuming the data can be arbitrary, I don’t see how you could guarantee absence of collisions without scanning the data.
If the format of the data is very limited (like... base 64 encoded?), you may be able to pick a boundary that is known to be an illegal sequence of bytes in that format.
Even if your boundary does collide with the data, it must be followed by headers such as Content-Range, which is even more improbable, so the client is likely to treat it as an error rather than consume the wrong data.
Major Web servers use very simple strategies. Apache grabs 8 random bytes at startup and renders them in hexadecimal. nginx uses a sequential counter left-padded with zeroes.
UUIDs are designed to avoid collisions with other UUIDs, not with arbitrary data. A UUID is no more likely to be a good boundary than a completely random string of the same length. Moreover, some UUID variants include information that you may not want to disclose, such as your machine’s MAC address.
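The random-boundary strategy described above can be sketched in Python as follows. This is only an illustration in the spirit of the "8 random bytes rendered in hexadecimal" approach; the function names, the default content type, and the sample data are assumptions, not taken from any server's source. It shows that the framing for each part needs only the range and the total length, never a scan of the part body.

```python
import secrets


def make_boundary(num_bytes: int = 8) -> str:
    """Return random bytes rendered as hexadecimal, for use as a boundary."""
    return secrets.token_hex(num_bytes)


def part_header(boundary: str, start: int, end: int, total: int,
                content_type: str = "application/octet-stream") -> bytes:
    """Framing that precedes one part; it needs only the range, not the data."""
    return (
        f"\r\n--{boundary}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Range: bytes {start}-{end}/{total}\r\n"
        "\r\n"
    ).encode("ascii")


def multipart_byteranges(data: bytes, ranges, boundary: str):
    """Yield the response body piece by piece, without scanning the parts."""
    for start, end in ranges:  # inclusive byte positions
        yield part_header(boundary, start, end, len(data))
        yield data[start:end + 1]  # in practice, streamed from upstream
    yield f"\r\n--{boundary}--\r\n".encode("ascii")


boundary = make_boundary()
body = b"".join(multipart_byteranges(b"0123456789", [(0, 3), (6, 9)], boundary))
```

The response would then declare Content-Type: multipart/byteranges; boundary=<the generated value>. As discussed above, this does not guarantee the absence of collisions; it only makes them improbable.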
Ideally I would like to simply state in the response that a part has a length of a certain number of bytes, and be done with it. Is there any tooling that could provide me with this option?
Maybe you can avoid supporting multiple ranges and simply tell the clients to request each range separately. In that case, you don’t use the multipart format, so there is no problem.
If you do want to send multiple ranges in one response, then RFC 7233 requires the multipart format, which requires the boundary string.
You can, of course, invent your own mechanism instead of that of RFC 7233. In that case:
You cannot use 206 (Partial Content). You must use 200 (OK) or some other applicable status code.
You cannot use the multipart/byteranges media type. You must come up with your own media type.
You cannot use the Range request header.
Because a 200 (OK) response to a GET request is supposed to carry a (full) representation of the resource, you must do one of the following:
encode the requested ranges in the URL; or
use something like POST instead of GET; or
use a custom, non-standard status code instead of 200 (OK); or
(not sure if this is a correct approach) use media type parameters, send them in Accept, and add Accept to Vary.
The chunked transfer coding may be useful, but you cannot rely on it alone, because it is a property of the connection, not of the payload.

".." (double dots) in otherwise valid IP4 addreses, e.g. 183.60..244.37

My production server recently got a slew of access probes (attempts to find a way to break in, via URIs like /admin.php, /administrator, /wp-login.php, etc.), and I noticed that some of the REMOTE_ADDR values reported by Apache (IPv4 addresses) had two dots where there should be one.
What's up with this? Is this some way for servers to hide?
For one, it means that I need to log these to a wider field than expected. Expected would be xxx.xxx.xxx.xxx or 15 characters, but this might make it 16 or even 19.
[Edit: or better yet 50, see this]
The problem is happening in some code somewhere in your application (etc) that is doing formatting.
IP addresses are actually an array of 4 unsigned bytes. They are conventionally represented character-wise (for human consumption) in "ddd.ddd.ddd.ddd" form, but that is not the fundamental representation. The fundamental representation does not have dots in it at all.
It therefore follows that the extra dots you are seeing are some problem with either the way the IP addresses are converted to strings, or the resulting strings are incorporated into messages, or those messages are handled and ultimately displayed. The extra dots do not "mean" anything ... except ... possibly ... to say that some characters have been left out.
Without more information, we can't tell you where those dots come from, or how to stop them.
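To make the "4 bytes versus dotted string" distinction concrete, here is a small Python sketch using the standard ipaddress module. The helper name is_valid_ipv4 is hypothetical; the sample addresses are the ones from the question. It could also serve as a check before writing a value to a log.

```python
import ipaddress


def is_valid_ipv4(text: str) -> bool:
    """Return True only if text is a well-formed dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


# The fundamental representation: four bytes, no dots anywhere.
packed = ipaddress.IPv4Address("183.60.244.37").packed

# The dotted string is just a rendering of those four bytes.
rendered = ".".join(str(b) for b in packed)

good = is_valid_ipv4("183.60.244.37")
bad = is_valid_ipv4("183.60..244.37")
```

A string with a doubled dot cannot be produced from four bytes by a correct formatter, which is why the answers point at the formatting or logging code.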
What's up with this? Is this some way for servers to hide?
Nope.
At the point that your systems first see those IP addresses, they are in 4-byte form, just like other IP addresses. The dots are not a new way to hide. Rather they are just a result of a local problem in the way things are being logged.
UPDATE
Looking at the evidence in your "half answer", one possibility is that you have some progress monitoring or debugging code somewhere that occasionally outputs a "dot" into the output stream. It looks like it would be on a different thread ...
So far my hosting company says only that I can clean up these values.
They are right. But you probably want to find where your application is injecting the garbage and fix that ... rather than massaging the log files.
What are you doing with that variable in your code? I expect it's being translated or parsed in some way that's adding the extra period.
It's extremely unlikely that Apache would report it that way, as that would be invalid as an IPv4 address.
Compare your output with the web server's access logs, which will have recorded the remote IP as Apache saw it.
Half of the answer is that PHP's $_SERVER['REMOTE_ADDR'] is untrusted: it comes from the HTTP request as passed by the server to PHP, and according to other reports it can apparently be spoofed.
EDIT2: I have more recently found two more bad variables from $_SERVER with double dots, as follows:
SERVER_ADDR       REMOTE_ADDR      REQUEST_TIME_FLOAT
184..154.227.128  183.60.244.30    1391788916.198
184.154..227.128  183.60.244.37    1391788913.537
184.154..227.128  183.60.244.37    1391788914.368
184.154..227.128  184.154.227.128  1391086482.1889
184.154.227.128   183..60.244.30   1391788914.1494
184.154.227.128   183..60.244.37   1391788913.0523
184.154.227.128   183.60..244.37   1391788911.5938
184.154.227.128   183.60..244.37   1391788914.3977
184.154.227.128   183.60.244.37    1391788911..9855
So far my hosting company says only that I can clean up these values. That is easy, but cleaned-up garbage is still garbage. If dots can be and are being added, then the numbers can be, and possibly are being, changed too, I think. Hmm?
See: this comment from the php manual.
Now that leaves the question of where to find a trusted IP for the accessing client. Apache has it, I'm guessing, from the incoming HTTP packet exchange with the client. (I'll ask this question on Stack Overflow.)

application/x-www-form-urlencoded or multipart/form-data?

In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence of (unencoded) binary data
the need to transfer additional data (like filename)
I basically found no formal guidance on the web regarding the use of the different content-types so far.
TL;DR
Summary: if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
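The size difference can be checked with Python's standard urllib.parse module. This is a small sketch, with the sample field names taken from the example above and the binary sample chosen arbitrarily for illustration.

```python
from urllib.parse import quote_from_bytes, urlencode

# Plain alphanumeric name/value pairs pass through unchanged.
form = urlencode({"MyVariableOne": "ValueOne", "MyVariableTwo": "ValueTwo"})

# 128 non-ASCII bytes standing in for binary file content.
binary = bytes(range(128, 256))
encoded = quote_from_bytes(binary)  # every byte becomes %HH

expansion = len(encoded) / len(binary)
```

For this sample every byte is escaped, so the encoded form is exactly three times the size of the input. Real files contain some alphanumeric bytes too, so the factor is usually somewhat below three.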
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.
READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning one arbitrary byte into three 7BIT bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send it as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.
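The reasoning in this answer can be illustrated with a short Python check. It is a sketch only: the helper name can_appear_in_base64 is hypothetical, and it tests the answer's criterion (does the boundary consist solely of base64 alphabet characters?) rather than how any real multipart parser behaves.

```python
import base64
import string

# The base64 output alphabet named above: alphanumerics, '+', '/' and '='.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def can_appear_in_base64(boundary: str) -> bool:
    """True if every character of the boundary is a legal base64 character."""
    return all(ch in BASE64_ALPHABET for ch in boundary)


# "blob" is itself valid base64, so some 3-byte input encodes to exactly it.
three_bytes = base64.b64decode("blob")
round_trip = base64.b64encode(three_bytes)

risky = can_appear_in_base64("blob")
safe = can_appear_in_base64("!blob!")
```

Because '!' is outside the alphabet, "!blob!" can never occur inside base64 output, whereas "blob" is produced by at least one input.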
I don't think HTTP POST is limited to multipart or x-www-form-urlencoded. The Content-Type header is orthogonal to the HTTP POST method (you can use whichever MIME type suits you). This is also the case for typical HTML-based web apps (e.g. JSON became very popular as the payload format for Ajax requests).
For RESTful APIs over HTTP, the most popular content types I have come across are application/xml and application/json.
application/xml:
data-size: XML is very verbose, but this is usually not an issue when using compression, considering that write access (e.g. through POST or PUT) is much rarer than read access (in many cases it is <3% of all traffic). There were rarely cases where I had to optimize write performance
existence of non-ascii chars: you can use utf-8 as encoding in XML
existence of binary data: would need to use base64 encoding
filename data: you can encapsulate this inside field in XML
application/json
data-size: more compact than XML; still text, but you can compress it
non-ascii chars: json is utf-8
binary data: base64 (also see json-binary-question)
filename data: encapsulate as own field-section inside json
binary data as own resource
I would try to represent binary data as its own asset/resource. It adds another call but decouples things better. Example for images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply inline the binary resource as link:
<main-resource>
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>
I agree with much that Manuel has said. In fact, his comments refer to this url...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you expect your API users to be building their apps with?
Do they have frameworks or components they can use that favour one method over the other?
If you get a clear idea of your users, and how they'll make use of your API, then that will help you decide. If you make the upload of files hard for your API users then they'll move away, or you'll spend a lot of time on supporting them.
Secondary to this would be the tool support YOU have for writing your API and how easy it is for you to accommodate one upload mechanism over the other.
Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print-shop and had some problems due to uploading images to the server that came from an HTML5 canvas element. I was struggling for at least an hour and I did not get it to save the image correctly on my server.
Once I set the contentType option of my jQuery ajax call to application/x-www-form-urlencoded everything went the right way and the base64-encoded data was interpreted correctly and successfully saved as an image.
Maybe that helps someone!
If you need to use Content-Type: application/x-www-form-urlencoded, then DO NOT use FormDataCollection as the parameter: in ASP.NET Core 2+, FormDataCollection has no default constructor, which the formatters require. Use IFormCollection instead:
public IActionResult Search([FromForm] IFormCollection type)
{
    return Ok();
}
In my case the issue was that the Content-Type was application/x-www-form-urlencoded but the body of the request actually contained JSON. When we access request.data, Django cannot convert it properly, so access request.body instead.
Refer to this answer for a better understanding:
Exception: You cannot access body after reading from request's data stream

Please help me trace how charsets are handled every step of the way

We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible but my question is for you folks to correct any mistakes I make and fill in any BLANKs.
When reading this scenario, imagine that this is being done on a Mac by John, and on Windows by Jane, and add comments if one behaves differently than the other in any particular situation.
Our hero (John/Jane) starts by writing a paragraph in Microsoft Word. Word's charset is BLANK1 (CP1252?).
S/he copies the paragraph, including smart quotes (e.g. “ ”). The act of copying is done by the BLANK2 (Operating system...Windows/Mac?) which BLANK3 (detects what charset the application is using and inherits the charset?). S/he then pastes the paragraph in a text box at StackOverflow.
Let's assume StackOverflow is running on Apache/PHP and that their set up in httpd.conf does not specify AddDefaultCharset utf-8 and their php.ini sets the default_charset to ISO-8859-1.
Yet neither charset above matters, because Stack Overflow's header contains this statement META http-equiv="Content-Type" content="text/html; charset=UTF-8", so even though when you clicked on "Ask Question" you might have seen a *RESPONSE header in firebug of "Content-type text/html;" ... in fact, Firefox/IE/Opera/Other browsers BLANK4 (completely 100% ignore the server header and override it with the Meta Content-type declaration in the header? Although it must read the file before knowing the Content-type, since it doesn't have to do anything with the encoding until it displays the body, this makes no difference to the browser?).
Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters. BLANK5 (If someone can go into excruciating detail about what the browser does in this step, it would be very helpful...here's my understanding...since the operating system controls the clipboard and display of the character in the form, it inserts the character in whatever charset it was copied from. And displays it in the form as that charset...OVERRIDING the UTF-8 in this example).
Let's assume the form method=GET rather than post so we can play w/ the URL browser input.... Continuing our story, the form is submitted as UTF-8. The smart quotes which represent decimal code 147 & 148, when the browser converts them to UTF-8, it gets transformed into BLANK6 characters.
Let's assume that after submission, Stack Overflow found an error in the form, so rather than displaying the resulting question, it pops back up the input box with your question inside the form. In the php, the form variables are escaped with htmlspecialchars($var) in order for the data to be properly displayed, since this time it's the BLANK7 (browser controlling the display, rather than the operating system...therefore the quotes need to be represented as its UTF-8 equivalent or else you'd get the dreaded funny looking � question mark?)
However, if you take the smart quotes, and insert them directly in the URL bar and hit enter....the htmlspecialchars will do BLANK8, messing up the form display and inserting question marks �� since querying a URL directly will just use the encoding in the url...or even a BLANK9 (mix of encodings?) if you have more than one in there...
When the REQUEST is sent out, the browser lists acceptable charsets to the server. The list of charsets comes from BLANK10.
Now you might think our story ends there, but it doesn't. Because StackOverflow needs to save this data to a database. Fortunately, the people running this joint are smart. So when their MySQL client connects to the database, it makes sure the client and server are talking to each other UTF-8 by issuing the SET NAMES UTF-8 command as soon as the connection is initiated. Additionally, the default character set for MySQL is set to UTF-8 and each field is set the same way.
Therefore, Stack Overflow has completely secured their website from dB injections, CSRF forgeries and XSS site scripting issues...or at least those borne from charset game playing.
*Note, this is an example, not the actual response by that page.
I don't know if this "answers" your "question", but I can at least help you with what I think may be a critical misunderstanding.
You say, "Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters." There is no such thing as a "UTF-8 character", and it isn't true or even meaningful to think of the form "converting" anything into anything when you paste it. Characters are a completely abstract concept, and there's no way of knowing (without reading the source) how a given program, including your web browser, decides to implement them. Since most important applications these days are Unicode-savvy, they probably have some internal abstraction to represent text as Unicode characters--note, that's Unicode and not UTF-8.
A piece of text, in Unicode (or in any other character set), is represented as a series of code points, integers that are uniquely assigned to characters, which are named entities in a large database, each of which has any number of properties (such as whether it's a combining mark, whether it goes right-to-left, etc.). Here's the part where the rubber meets the road: in order to represent text in a real computer, by saving it to a file, or sending it over the wire to some other computer, it has to be encoded as a series of bytes. UTF-8 is an encoding (or a "transformation format" in Unicode-speak), that represents each integer code point as a unique sequence of bytes. There are several interesting and good properties of UTF-8 in particular, but they're not relevant to understanding, in general, what's going on.
In the scenario you describe, the content-type metadata tells the browser how to interpret the bytes being sent as a sequence of characters (which are, remember, completely abstract entities, having no relationship to bytes or anything). It also tells the browser to please encode the textual values entered by the user into a form as UTF-8 on the way back to the server.
All of these remarks apply all the way up and down the chain. When a computer program is processing "text", it is doing operations on a sequence of "characters", which are abstractions representing the smallest components of written language. But when it wants to save text to a file or transmit it somewhere else, it must turn that text into a sequence of bytes.
We use Unicode because its character set is universal, and because the byte sequences it uses in its encodings (UTF-8, the UTF-16s, and UTF-32) are unambiguous.
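The distinction between a character, its code point, and its encoded bytes can be seen directly in Python. A small sketch, using the left smart quote from the question (byte 147 in Windows-1252) as the example; the variable names are arbitrary.

```python
# Byte 147 (0x93) in Windows-1252 is the left "smart" double quote.
left_quote = bytes([147]).decode("cp1252")

# The abstract character has one code point, independent of any encoding.
code_point = ord(left_quote)

# The same character becomes different byte sequences in different encodings.
as_cp1252 = left_quote.encode("cp1252")
as_utf8 = left_quote.encode("utf-8")
as_utf16 = left_quote.encode("utf-16-le")
```

Here the single character U+201C is one byte in Windows-1252, three bytes in UTF-8, and two bytes in UTF-16, which is why "a UTF-8 character" is not a meaningful notion.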
P.S. When you see �, there are two possible causes.
1) A program was asked to write some characters using some character set (say, ISO-8859-1) that does not contain a particular character that appears in the text. So if text is represented internally as a sequence of Unicode code points, and the text editor is asked to save as ISO-8859-1, and the text contains some Japanese character, it will have to either refuse to do it, or spit out some arbitrary ISO-8859-1 byte sequence to mean "no puedo".
2) A program received a sequence of bytes that perhaps does represent text in some encoding, but it interprets those bytes using a different encoding. Some byte sequences are meaningless in that encoding, so it can either refuse to do it, or just choose some character (such as �) to represent each unintelligible byte sequence.
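Both causes can be reproduced in a few lines of Python. This is a sketch using the Katakana character and the smart quote mentioned elsewhere in this thread; the choice of errors="replace" is an assumption about how a lenient program might behave.

```python
# Cause 1: the target character set has no such character.
katakana = "サ"
lossy = katakana.encode("iso-8859-1", errors="replace")  # substitutes "?"

try:
    katakana.encode("iso-8859-1")  # the strict default refuses instead
    refused = False
except UnicodeEncodeError:
    refused = True

# Cause 2: bytes written in one encoding are read back as another.
raw = "\u201c".encode("cp1252")                 # smart quote as byte 0x93
misread = raw.decode("utf-8", errors="replace")  # 0x93 alone is invalid UTF-8
```

In the second case the lone byte 0x93 is not a valid UTF-8 sequence, so the decoder emits U+FFFD, the replacement character that is displayed as �.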
P.P.S. These encode/decode dances happen between applications and the clipboard in your OS of choice. Imagine the possibilities.
In answer to your comments:
It's not true that "Word uses CP1252 encoding"; it uses Unicode to represent text internally. You can verify this, trivially, by pasting some Katakana character such as サ into Word. Windows-1252 cannot represent such a character.
When you "copy" something, from any application, it's entirely up to the application to decide what to put on the clipboard. For example, when I do a copy operation in Word, I see 17 different pieces of data, each having a different format, placed into the clipboard. One of them has type CF_UNICODETEXT, which happens to be UTF-16.
Now, as for URLs... Details are found here. Before sending an HTTP request, the browser must turn an IRI (which can contain any text at all) into a URI. You convert an IRI to a URI by first encoding it as UTF-8, then representing UTF-8 bytes outside the ASCII printable range by their percent-escaped forms. So, for example, the correct encoding for http://foo.com/dir1/引き割り.html is http://foo.com/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html . (Host names follow different rules, but it's all in the linked-to resource).
Now, in my opinion, the browser ought to show plain old text in the location bar, and do all of the encoding behind the scenes. But some browsers make stupid choices, and they show you the URI form, or some chimera of a URI and an IRI.
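The percent-escaping of UTF-8 bytes described above can be reproduced with Python's standard urllib.parse.quote, which encodes to UTF-8 by default. A minimal sketch covering only the path component (host names follow different rules and are not handled here).

```python
from urllib.parse import quote, unquote

path = "/dir1/引き割り.html"

# Encode as UTF-8, then percent-escape every byte outside the safe set.
escaped = quote(path)

# The transformation is reversible.
restored = unquote(escaped)

url = "http://foo.com" + escaped
```

Each of the four Japanese characters takes three bytes in UTF-8, hence twelve %HH groups in the escaped path.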

Google Translation API

Has anyone used the Google Translation API? What is the maximum length limit for using it?
The limit was 500... now it is 5000 chars.
source
500 characters
source
At the moment, the throttle limit is 100,000 characters per day. Looks like you can apply to have that limit increased/removed.
I've used it to translate Japanese to English.
I don't believe the 500 char limit is true if you use http://code.google.com/p/jquery-translate/, but one thing that is true is you're restricted as to the number of requests you can make within a certain period of time. They also try to detect whether or not you're sending a lot of requests with a similar period, almost like a mini "denial of service" attack.
So when I did this I wrote a client with a random length sleep between requests. I also ran it on a grid so all the requests didn't come from a single IP address.
I had to translate ~2000 Java messages from a resource bundle from Japanese to English. It worked out pretty nicely, as long as the text was single words. Longer phrases with context came out awkwardly.
Please have a look at this link; it gives the correct answer at the bottom of the page.
https://developers.google.com/translate/v2/faq
What is the maximum number of characters per request?
The maximum size of each text to be translated is 5000 characters, not including any HTML tags.
You can send source strings of up to 5,000 characters, but there are a few provisos that are sometimes lost.
You can only send the 5,000 characters via the POST method.
If you use the GET method, you are limited to a 2,000-character limit on URLs. If a URL is longer than that, Google's servers will just reject it.
Note: the 2,000-character limit includes the path and the rest of the query string as well, and you must count URI encoding (for instance, every space becomes a %20, every quotation mark a %22).
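The limits quoted above can be checked on the client side before a request is made. The following Python sketch makes no API calls; the constants mirror the 5,000 and 2,000 figures from this answer, and the helper names (encoded_length, fits_in_get, split_text) and the base URL length are hypothetical. The splitting is deliberately naive and may cut in the middle of a word or sentence.

```python
from urllib.parse import quote

MAX_TEXT_CHARS = 5000  # per-text limit quoted above
MAX_URL_CHARS = 2000   # URL length limit quoted above for GET


def encoded_length(text: str) -> int:
    """Length of the text once URI-encoded (a space becomes %20, etc.)."""
    return len(quote(text, safe=""))


def fits_in_get(text: str, base_url_length: int) -> bool:
    """True if the encoded text plus the rest of the URL stays within the limit."""
    return base_url_length + encoded_length(text) <= MAX_URL_CHARS


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> list[str]:
    """Cut the text into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


chunks = split_text("x" * 12000)
sizes = [len(chunk) for chunk in chunks]
```

Here fits_in_get can decide between GET and POST, and split_text keeps each POSTed text within the per-text limit.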
The Cloud Translation API is optimized for translating of smaller requests. The recommended maximum length for each request is 5K characters (code points). However, the more characters that you include, the higher the response latency. For Cloud Translation - Advanced, the maximum number of code points for a single request is 30K. Cloud Translation - Basic has a maximum request size of 100K bytes.
https://cloud.google.com/translate/quotas

Resources