How do I get "bytes transferred" (before decompression) from Python requests?

Is there a way to get the number of bytes transferred from Python's requests library? len(response.text) measures the body after decompressing a possibly gzipped response, and it does not include headers.
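One hedged approach: with stream=True, response.raw is the underlying urllib3 stream, and reading it directly yields the bytes as they came off the wire, still compressed. A minimal sketch; the requests call in the comment is a placeholder URL, and the demo substitutes an in-memory gzip stream for the socket:

```python
import gzip
import io

def transferred_bytes(raw, chunk_size=8192):
    """Sum the sizes of undecoded chunks read from a raw body stream."""
    total = 0
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total

# With requests (URL is a placeholder), stream=True keeps the body
# undecoded, so this counts compressed bytes; headers are not included:
#   resp = requests.get("https://example.com/big.json", stream=True)
#   n = transferred_bytes(resp.raw)

# Demo with an in-memory gzip stream standing in for the socket:
wire = gzip.compress(b"x" * 10000)
assert transferred_bytes(io.BytesIO(wire)) == len(wire)   # far below 10000
```

Header bytes would still have to be estimated separately, e.g. by summing the lengths of the response.headers key/value pairs.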


Reading vs parsing an HTTP request

I am reading the book A Philosophy of Software Design by J. Ousterhout.
In chapter 5 he mentions the following exercise:
“Implement one or more classes to make it easy for Web servers to receive incoming HTTP requests and send responses.”
He then discusses a common mistake made when solving the exercise:
“Use two different classes for receiving HTTP requests; the first class read the request from the network connection into a string, and the second class parsed the string.”
“Information leakage occurred because a HTTP request can’t be read without parsing much of the message; for example, the Content-Length header specifies the length of the request body, so the headers must be parsed in order to compute the total request length. As a result, both classes needed to understand most of the structure of HTTP requests, and parsing code was duplicated in both classes. ”
I can't understand the example because I don't know much about HTTP requests. More precisely, I don't understand what reading and parsing mean in the sentence:
"HTTP request can’t be read without parsing much of the message"
Any help?
Reading means taking a bunch of bytes from some external source (like a network socket) and storing them in memory.
Parsing means breaking up that string of bytes into meaningful, domain-specific chunks so you can understand the message.
I haven't read that book, but the author's point is that you can't simply read the bytes first and then parse them as two separate, non-overlapping operations. HTTP requests can be of arbitrary size, so before you know how many bytes to read (that is, how many bytes make up a single HTTP request) you have to figure out how long the request is. You do that by reading the Content-Length header, and that requires parsing and understanding the message.
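The interleaving described above can be sketched in a few lines of Python. This is an illustrative toy (no chunked encoding, minimal error handling), not a real HTTP parser, but it shows why the reader must parse the headers before it can finish reading:

```python
import io

def read_one_request(conn, bufsize=4096):
    """Read exactly one HTTP request from a byte stream. The reader must
    parse the headers: without Content-Length it cannot know where the
    request ends."""
    data = b""
    while b"\r\n\r\n" not in data:              # read until end of headers
        chunk = conn.read(bufsize)
        if not chunk:
            raise ConnectionError("connection closed mid-headers")
        data += chunk
    head, _, body = data.partition(b"\r\n\r\n")
    length = 0                                  # parse Content-Length
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.decode().strip())
    while len(body) < length:                   # now we know how much to read
        chunk = conn.read(bufsize)
        if not chunk:
            raise ConnectionError("connection closed mid-body")
        body += chunk
    return head, body[:length]

head, body = read_one_request(
    io.BytesIO(b"POST /x HTTP/1.1\r\nContent-Length: 5\r\n\r\nhello"))
assert body == b"hello"
```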

How to get the crc32 of a resource in the response headers?

I need to get a CRC32 checksum of a file I'm downloading through an HTTP GET request, without actually opening the response body.
I am building a proxy app which receives a request from a client and performs the actual GET call. I'd like the response the proxy gets from the server to contain the checksum, without having to read through the actual data in the response body. I connect the response body's reader stream to the writer stream that I return to the client.
I read about the "Want-Digest" header, which I can add to the request and which should result in the response containing a "Digest" header with a checksum, but it did not work.
I also looked into the Content-MD5 header, but when I try to download some photos I see I'm not getting it in the response (also, I read that it is deprecated).
Thanks in advance!
Headers such as 'Want-Digest' or 'Content-MD5' are up to the server to implement. Most servers will simply ignore them, which is why they aren't working for you. If you want the CRC32 of the body, you'll have to open the body and calculate it yourself.
If you have access to the TCP headers I suppose you could access the TCP checksum, though that is a relatively weak checksum even compared to CRC32, and it is also a checksum of the entire packet, not just the body.
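If the proxy does end up computing the checksum itself, it can fold each chunk into a running CRC32 while forwarding it, so the body is still read exactly once and never buffered whole. A sketch using zlib.crc32 with generic file-like streams (the function name is illustrative):

```python
import io
import zlib

def copy_with_crc32(src, dst, chunk_size=8192):
    """Stream src to dst, folding each chunk into a running CRC32 so the
    body is read exactly once and never held in memory whole."""
    crc = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

out = io.BytesIO()
assert copy_with_crc32(io.BytesIO(b"hello world"), out) == zlib.crc32(b"hello world")
assert out.getvalue() == b"hello world"
```

In the proxy, src would be the upstream response body and dst the writer returned to the client; the checksum is known only once the copy finishes, so it could go in an HTTP trailer rather than a header.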

Reading Nginx HTTP Response Streams in Reverse Byte Order

Does Nginx, or HTTP/1.x or HTTP/2, offer a way to serve a file in reverse byte order? I'm interested in reading a binary HTTP response stream in reverse byte order, to seek a byte sequence near the end of files (which range from 5-500 MB). My current solution uses iterative Range requests and byte scanning with the Streams API. That suffices, but is not optimal.
The implementation goal is to calculate the durations of Opus audio files, as explained in "How do I get the duration of a .opus file?". A server-side script (Python, Go, PHP) could always work as a fallback, but I'm curious whether Nginx or another HTTP server can already do it out of the box. Preserving Opus's low latency is important. If no existing option does, a custom Nginx module would be written that responds with HTTP headers containing the duration (and other Opus meta info).
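For reference, there is no standard way to have the server reverse the byte order itself, but HTTP/1.1 suffix byte ranges (RFC 7233) let a client ask for just the tail of a file, which Nginx serves for static files out of the box. A small sketch; the requests usage in the comment is hypothetical:

```python
def suffix_range(n):
    """Range header asking for only the final n bytes (RFC 7233)."""
    return {"Range": f"bytes=-{n}"}

# Hypothetical usage with requests; a 206 reply means the server honored
# the range, a 200 means it sent the whole file instead:
#   resp = requests.get("https://cdn.example.com/a.opus",
#                       headers=suffix_range(16384))
#   tail = resp.content if resp.status_code == 206 else resp.content[-16384:]

assert suffix_range(16384) == {"Range": "bytes=-16384"}
```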

Changing the encoding when changing the Content-Type

I have a server that should respond to certain requests. The requests have a "Content-Type" of "application/x-protobuf", so I need to respond with a set of bytes (a serialized proto object). When I tried to send these bytes with Content-Type "text/plain", each byte was delivered to users successfully, but when I changed the value to "application/x-protobuf", every byte with a value greater than 128 was replaced by \uFFF8.
I use the CherryPy framework for my server.
Does anybody have an idea why this happens? And how can I tell which encoding was used in the "text/plain" case?
Thank you for your answers.
Google Protocol Buffers code generator for Nginx module developers: https://github.com/dbcode/protobuf-nginx
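A likely cause, stated with some hedging: if the handler returns a str rather than bytes, the framework runs the body through a text encoder, which is where bytes above 127 get mangled into replacement characters. In CherryPy the usual fix is to return a bytes object and set cherrypy.response.headers['Content-Type'] yourself. The same principle in a self-contained standard-library sketch (the handler and payload are illustrative):

```python
from http.server import BaseHTTPRequestHandler

class ProtoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = bytes(range(256))    # stand-in for msg.SerializeToString()
        self.send_response(200)
        self.send_header("Content-Type", "application/x-protobuf")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)      # raw bytes, no text-encoding step
```

Because the body is written as bytes, every value 0-255 reaches the client unchanged regardless of the Content-Type.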

Will HTTP ResponseWriter's write function buffer in Go?

Assume that we have a function handling an HTTP Request, something like:
func handler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("first piece of data"))
    // do something
    w.Write([]byte("second piece of data"))
}
I'm wondering whether the first call to w.Write() is flushed to the client or not.
If it is flushed, then we actually respond to the client twice; this is strange, because how can we determine Content-Length before the second call to Write?
If it is not flushed (say the data is buffered locally), then what if we write a huge amount of data in the first call? (Will that overflow the stack?)
Any explanation will be appreciated! :)
I'm wondering whether the first call to w.Write() is flushed to the client or not.
net/http's default ResponseWriter has an output buffer (currently 4 KB) over the net.Conn it writes to. Additionally, the OS will normally buffer writes to a socket. So in most cases some kind of buffering takes place.
If it is flushed, then we actually respond to the client twice; this is strange, because how can we determine Content-Length before the second call to Write?
Well, there's HTTP/1.1, which allows persistent connections. Such responses usually don't include a Content-Length header. Additionally, there are HTTP trailers.
If your client does not support HTTP/1.1 and persistent connections, it will have some sort of read timeout; during that time you can write to the connection as many times as you like, and it's still one response.
This has more to do with the nature of TCP sockets and HTTP implementations than Go.
If it is not flushed (say the data is buffered locally), then what if we write a huge amount of data in the first call? (Will that overflow the stack?)
No. Allocating such a buffer on the stack makes no sense; the buffer's backing storage will live on the heap. If you hit your per-process memory limit, your application will panic with "out of memory".
See also:
How to turn off buffering on write() system call?
*TCPConn.SetNoDelay
Edit to answer your question in the comments:
Chunked transfer encoding is part of the HTTP/1.1 specification and is not supported in HTTP/1.0.
Edit to clarify:
As long as the total time it takes to write both parts of your response does not exceed your client's read timeout, and you don't specify a Content-Length header, you can just write your response and then close the connection. That's totally OK and not "hacky".
