I need to get a CRC32 checksum of a file i'm downloading through an http GET request - without actually opening the response body.
I am building a proxy app - which gets a request from a client, and does the actual GET call. I'd like the response the proxy gets from the server to contain the checksum, without having to read through the actual data in the response body. I connect the response body reader stream, to the writer stream which I return to the client.
I read about the "Want-Digest" header which I can add to the request, and should result in the response containing a "Digest" header, with a checksum - but it did not work.
I also looked into the Content-MD5 header, but when I try to download some photos, I see i'm not getting it in the response (also, I read that it is deprecated).
Thanks in advance!
Any headers, such as 'Want-Digest' or 'Content-MD5', will be up to the server to implement. Most servers will probably ignore those headers, which is why they aren't working for you. If you want to calculate the CRC32 of the body, you'll have to open the body and calculate it yourself.
If you have access to the TCP headers I suppose you could access the TCP checksum, though that is a relatively weak checksum even compared to CRC32, and it is also a checksum of the entire packet, not just the body.
Related
There are similar questions on managing long-running HTTP requests, but the goal is to avoid streaming or chunked encoding.
The reason is to keep a simpler experience for the API user, where they make an API request and receive back an image file in the response -- no need to reassemble chunked/stream data.
Based on posts elsewhere, sending the response headers immediately, before the body is ready, will keep the connection alive.
Chunked encoding lets you do this, but this imposes more work on the user to reassemble the chunked/stream body data.
Is it possible to send one response header property back, then send the remaining headers when the body is ready?
If I make
r = requests.get('http://github.com', stream=True)
and see in tcpdump, the content of page downloaded just after requests.get. After r.content, no tcpdump transfer activity. The same with requests.Session(stream=True).
Don't use GET if you don't want a response body to be sent by the server. Use a HEAD request instead if all you need is the header information.
All stream=True does is not read the response body from the socket. The server can still initiate sending that body, so the socket receive buffer will already have (some) of that body for Python to read.
I'm working with an HTTP request tool (similar to cURL) and having an issue with the server response. Either that or my understanding of the RFC for HTTP 1.1 and chunked data.
What I'm seeing is chunked data should be in this format:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
in\r\n\r\nchunks.\r\n
0\r\n
\r\n
what I'm actually seeing is the following:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
in\r\n\r\nchunks.\r\n
0
In other words, the few servers I've tested with send no more data after the 0.. not CRLF, much less CRLFCRLF.
How are we supposed to know it's the end of the chunked data without the proper format of the chunked tags? Timeouts happen looking for the CRLFs after the 0, and that's no sufficient.
Yes, it violates standard. But we want to be compatible with all possible http servers and clients, so we have to understand a way how it can be violated.
Chunked is used often in a way of content streaming over http 1.1 protocol. Standard ask to end content with additional CRLF. So we can see the following pseudo code:
def stream(endpoint)
Socket.open(endpoint) do |socket|
sleep 10
more_data do |data|
print data.length.to_s(16)
print data
print "CRLF"
end
end
print "CRLF"
end
But the right code is the following:
def stream(endpoint)
Socket.open(endpoint) do |socket|
sleep 10
more_data do |data|
print data.length.to_s(16)
print data
print "CRLF"
end
end
ensure
print "CRLF"
end
It means that after input socket interruption of any other exception wrong version of method won't be able to print additional "CRLF" to output socket.
How are we supposed to know it's the end of the chunked data without
the proper format of the chunked tags? Timeouts happen looking for the
CRLFs after the 0, and that's no sufficient.
Many implementations ignores this violation because they don't need to know the size of content. They just tries to receive as much data as possible before socket will be closed.
Use Content-Length, definitely whenever I know it; for file download, checking the filesize is insignificant in terms of resources. For chunked transfer we do not scan the message body for a CRLF pair. It first reads the specified number of bytes, and then reads two more bytes to confirm that they are CR and LF. If they're not, the message body is ill-formed, and either the size was specified improperly or the data was otherwise corrupted.
For more information read RCF, which says
A server using chunked transfer-coding in a response MUST NOT use the
trailer for any header fields unless at least one of the following is
true:
a)the request included a TE header field that indicates "trailers" is
acceptable in the transfer-coding of the response, as described in
section 14.39; or,
b)the server is the origin server for the response, the trailer fields
consist entirely of optional metadata, and the recipient could use the
message (in a manner acceptable to the origin server) without
receiving this metadata. In other words, the origin server is willing
to accept the possibility that the trailer fields might be silently
discarded along the path to the client.
Way to Determine Message Body Length:
If header has Transfer-Encoding and the chunked transfer is final encoding, then message body length is determined by reading and decoding the chunked data until the transfer coding indicates the data is complete.
If header has Transfer-Encoding and the chunked transfer is not final encoding, then message body length is determined by reading the connection until it is closed by the server.
If header has Transfer-Encoding in request and the chunked transfer is not final encoding, then message body length cannot be determined reliably; the server MUST respond with the 400 (Bad Request) status code and then close the connection.
If a message is received with both a Transfer-Encoding and Content-Length header field, the Transfer-Encoding overrides the Content-Length. Such a message might indicate an attempt to perform request response splitting and ought to be handled as an error. A sender MUST remove the received Content-Length field prior to forwarding such a message downstream.
How is a socket used by an HTTP client properly closed after transmitting the request? Or does it have to remain open (bidirectionally) until the complete response has been received? If so, how is the end of the request body determined by the server?
According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4, closing the socket is not an option for a request. That doesn't sound logical to me - why should a half-closed TCP connection be a problem for the server, if the client doesn't try to transmit anything after closing its half of the socket? The client can still receive data after all.
It seems to me that shutting down the write part of a socket would be a very practical way of letting the server know that the request has been finished. http://docs.python.org/howto/sockets.html#disconnecting even specifically mentions that use case.
If that's really the wrong way to do it, what's the alternative? Do I really always have to send a "Content-length" or use chunked transport to enable the server to properly find the end of a request? How does that work for requests with unknown body length?
Transfer-Encoding: chunked is specifically designed to allow sending data with an unknown body length, for both requests and responses. The end of the data is determined by receiving a chunk whose payload size is 0. If you do not send a chunked request, then you must send a Content-Length instead.
Are you talking about this?:
Closing the connection cannot be used to indicate the end of a request body, since that would leave no possibility for the server to send back a response.
I think the text is talking about full close, you can do a half(write)-close. I'm not sure that's a HTTP compilant way of doing it, but I would think most servers will accept it.
Regarding your second question, simply use chunked encoding:
All HTTP/1.1 applications that receive entities MUST accept the "chunked" transfer-coding (section 3.6), thus allowing this mechanism to be used for messages when the message length cannot be determined in advance.
I have an HTTP server that returns large bodies in response to POST requests (it is a SOAP server). These bodies are "streamed" via chunking. If I encounter an error midway through streaming the response how can I report that error to the client and still keep the connection open? The implementation uses a proprietary HTTP/SOAP stack so I am interested in answers at the HTTP protocol level.
Once the server has sent the status line (the very first line of the response) to the client, you can't change the status code of the response anymore. Many servers delay sending the response by buffering it internally until the buffer is full. While the buffer is filling up, you can still change your mind about the response.
If your client has access to the response headers, you could use the fact that chunked encoding allows the server to add a trailer with headers after the chunked-encoded body. So, your server, having encountered the error, could gracefully stop sending the body, and then send a trailer that sets some header to some value. Your client would then interpret the presence of this header as a sign that an error happened.
Also keep in mind that chunked responses can contain "footers" which are just like HTTP headers. After failing, you can send a footer such as:
X-RealStatus: 500 Some bad stuff happened
Or if you succeed:
X-RealStatus: 200 OK
you can change the status code as long as response.iscommitted() returns false.
(fot HttpServletResponse in java, im sure there exists an equivalent in other languages)