Compressing text for HTTP with POST parameters

I am writing client software that initiates an HTTP request with a large blob of text (a JSON object, actually) as a POST parameter. I want to compress this text before sending it and decompress it on the server.
Gzip produces binary data, which I don't think I can send as a POST parameter.
Which options/algorithms exist to compress text and send it to a web server?
Edit: Would it be an option to gzip and then Base64-encode the binary data?
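A minimal Python sketch of the gzip-then-Base64 idea: the resulting ASCII string is safe to put in a form field, at the cost of roughly 33% size overhead on top of the compressed bytes.

import base64
import gzip
import json

blob = json.dumps({"key": "value" * 1000})
compressed = gzip.compress(blob.encode("utf-8"))
field_value = base64.b64encode(compressed).decode("ascii")  # POST this as the parameter

# The server reverses the two steps:
round_tripped = gzip.decompress(base64.b64decode(field_value)).decode("utf-8")
assert round_tripped == blob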

Why don't you just use the standard HTTP gzip compression?
(It just seems a bit mad to needlessly re-invent the wheel.)
Update
Ah yes - my bad. So why not simply gzip the file, upload it to the server as you would a multipart/form-data file upload, and then un-gzip it on the server?
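For instance, a hedged sketch using Python's requests library; the URL and the "payload" field name are placeholders, not anything the server mandates:

import gzip
import json
import requests

blob = json.dumps({"key": "value" * 1000}).encode("utf-8")
files = {"payload": ("blob.json.gz", gzip.compress(blob), "application/gzip")}
requests.post("http://example.com/upload", files=files)  # placeholder endpoint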

Uploading a file is a long, unnecessary workaround; the original question is about battling an unbearably large JSON blob. From my hacking around, I can tell that support for compressed request bodies highly depends on the server: some support it, some don't.
To the original question: you can put binary data in an HTTP POST; the real question is what the server is going to do with it. It is the same way a C# client does not automatically unzip a compressed response: you have to write extra code.
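To illustrate "putting the binary data in the POST", here is a hedged Python sketch that sends gzipped bytes as the raw request body with a Content-Encoding header. Whether anything decompresses it is entirely up to your server-side code, and the URL is a placeholder.

import gzip
import json
import urllib.request

payload = json.dumps({"key": "value" * 1000}).encode("utf-8")
req = urllib.request.Request(
    "http://example.com/api",  # placeholder endpoint
    data=gzip.compress(payload),
    headers={
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",  # the server must handle this itself
    },
)
urllib.request.urlopen(req)  # a Request with data defaults to POST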

HTTP Accept-Encoding header

I have difficulty understanding how this header works. Briefly, my questions are:
Say I send a POST request to a certain resource, where in the first case the response is a JSON string and in the second case the response is a .jar file.
1. Should the client include Accept-Encoding: gzip, deflate in both cases when sending the HTTP request, knowing that the first one results in a JSON string?
2. What if the response is already zipped? Doesn't compressing the already-zipped data create problems?
3. What happens if I include Accept-Encoding: gzip in the first case, where a JSON string is received? Do I receive zipped data as my response? (I am not even sure whether I get zipped data or some encoded data as the response. I think zipped data means something zipped like a .jar/.zip, while encoded data means an encoding of the original data; which one is happening, zipping or encoding?)
4. Let's say the server sends the response with the Content-Type header set to "application/octet-stream". Is it then necessary to use Accept-Encoding: gzip, deflate?
A client can use the Accept-Encoding HTTP request header to tell the server that it can accept a compressed response.
The server can use the request header to decide if it should send a compressed response or not. It can ignore the header and always send a non-compressed response (possibly less efficient). It can ignore the header and always send a compressed response (risking giving a client a response it can't decode).
Should the client include Accept-Encoding: gzip, deflate in both cases
I can't think of any reason not to tell the server that a client can handle a compressed response (assuming that fact is true).
What if the response is already zipped? Doesn't compressing the already-zipped data create problems?
It might be a waste of processor power for little or no saving in bytes.
That's not a reason for the client to say it can't handle a compressed response though. That's a decision to be made on the server.
What happens if I include Accept-Encoding: gzip in the first case, where a JSON string is received?
Then the client has told the server that a compressed response is acceptable.
Do I receive zipped data as my response?
The server might send a compressed response. It might ignore the header.
I am not even sure whether I get zipped data or some encoded data as the response
There isn't an "or" here.
The data is encoded using a compression algorithm.
Let's say the server sends the response with the Content-Type header set to "application/octet-stream"
That just means the server doesn't know what type of data it is sending. Instead of saying "This is JSON" or "This is a jar file" it is saying "I dunno what this is, it's just a stream of bytes to me".
Is it then necessary to use Accept-Encoding: gzip, deflate?
It doesn't make a difference.
The server can compress the data. It can send uncompressed data. It can use the Accept-Encoding request header to decide which of the two.
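From the client side, the negotiation looks roughly like this Python sketch (the URL is a placeholder): advertise gzip support, then decompress only if the server actually chose it.

import gzip
import urllib.request

req = urllib.request.Request(
    "http://example.com/resource",  # placeholder
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)  # the server chose a compressed response
    # otherwise the server ignored the header and sent the bytes as-is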
Taking the four questions in order:
1. Yes, why not? If the JSON payload is big, compressing it will make a lot of sense.
2. It's just overhead.
3. You might receive gzipped data, not a ZIP file. You may want to read RFC 7230 and RFC 7231 for the details.
4. The internet media type of the payload is completely independent of the content coding.

Generating PDF on the fly with standard HTTP response fields

I'm developing a web page with a form which returns a PDF document based on the form data. Currently I use the HTTP response header fields
Content-Type: application/pdf
Content-Disposition: attachment; filename="foo.pdf"
However, since the field Content-Disposition is non-standard and doesn't work in all browsers, I'm looking for a different approach. Do I have to save the PDF document on the server? What is the modus operandi?
Edit: By "doesn't work in all browsers" I mean that with some browsers the filename is not set to foo.pdf. Dillo, for instance, just sets the default filename (in the download dialog) to the basename of the URL path (plus query string).
Do I have to save the PDF document on the server?
No. As far as the HTTP client is concerned, the inner workings of the server are completely opaque to it. All it sees is a TCP stream of bytes from the server, and how exactly that stream is produced doesn't matter as long as it matches the specified Content-Type.
Just send the PDF right after the HTTP headers and you're done.
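For concreteness, a minimal sketch as a bare WSGI application; make_pdf is a hypothetical in-memory generator standing in for whatever produces your document.

def application(environ, start_response):
    pdf_bytes = make_pdf()  # hypothetical: render the form data to PDF in memory
    start_response("200 OK", [
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", 'attachment; filename="foo.pdf"'),
        ("Content-Length", str(len(pdf_bytes))),
    ])
    return [pdf_bytes]  # nothing is ever written to disk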
Update due to comment
So if you're wondering how to supply a filename without using a header field: just augment the URL with it, i.e. something like
http://${DOMAIN}/${PDF_GENERATOR}/${DESIRED_FILENAME}
In the HTTP server, add a rewrite rule that simply strips the filename part and maps the request to just
http://${DOMAIN}/${PDF_GENERATOR}
The HTTP client does not see any of that; all it sees is a URL ending with a "filename", which it can present to the user as the default for saving.
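A hedged sketch of the same trick as a WSGI application instead of a rewrite rule; the /pdf-generator/ prefix and make_pdf are illustrative placeholders.

def application(environ, start_response):
    path = environ.get("PATH_INFO", "")
    if path.startswith("/pdf-generator/"):
        # Everything after the prefix is the "filename"; it exists only so the
        # browser has a sensible default in its save dialog, and is ignored here.
        pdf_bytes = make_pdf()  # hypothetical generator
        start_response("200 OK", [("Content-Type", "application/pdf")])
        return [pdf_bytes]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]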

nginx resumable upload with upload_module and multipart/form

I currently upload to a web service on an nginx server using the upload module (http://www.grid.net.ru/nginx/upload.en.html) from a custom desktop application, doing a simple multipart/form-data POST that sends the file in one part and a Base64-encoded XML document with the file's metadata in another part.
The server receives this POST, passes it to my webservice which reads the metadata, processes the file and all is good.
What I want to do now is use the upload module's upload_resumable directive to do the POST in several chunks to minimize disconnection chances and allow resume. I can currently do this following the protocol described here: http://www.grid.net.ru/nginx/resumable_uploads.en.html
One sends byte ranges of the file, along with some headers to identify the chunk and the session, in several POSTs, and once all the parts have been uploaded, nginx composes a final POST containing the file name and path and passes it to your upload_pass location (which in my case CGIs to a Django app).
However, I am not clear on how one would send a multipart POST with this method, since the protocol indicates that the body of the POST must be exactly the bytes of the byte range. I need the final POST to also contain the XML I wrote about above.
I can think of sending the XML as the first bytes of the body, with a header indicating how many bytes belong to it, but that would mean extra processing of the final file to strip that prefix, and the final files are potentially in the GB size range.
Any other ideas?
Since the protocol supported by nginx specifically states that the POST should not be multipart, I ended up sending the file in the body and the rest of the parameters encoded in the URL. Not the prettiest URLs, but it works.
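For illustration, a hedged Python sketch of that approach using the requests library. The X-Content-Range and Session-ID header names are taken from the resumable-uploads protocol page linked above (double-check them against your module version); the meta query parameter is this answer's own convention, not part of the protocol.

import os
import requests

CHUNK = 1 << 20  # 1 MiB per POST

def upload(url, path, session_id, metadata_b64):
    total = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as f:
        while offset < total:
            chunk = f.read(CHUNK)
            end = offset + len(chunk) - 1
            headers = {
                "Content-Type": "application/octet-stream",
                "Content-Disposition": 'attachment; filename="%s"' % os.path.basename(path),
                "X-Content-Range": "bytes %d-%d/%d" % (offset, end, total),
                "Session-ID": session_id,
            }
            # The body must be the raw byte range, so the metadata rides in the URL.
            resp = requests.post(url, params={"meta": metadata_b64},
                                 data=chunk, headers=headers)
            resp.raise_for_status()
            offset = end + 1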

Will sending an HTTP Header with Accept: text/html only download text from the page?

I'm writing a simple crawler, and ideally to save bandwidth, I'd only like to download the text and links on the page. Can I do that using HTTP Headers? I'm confused about how they work.
You're on the right track to solving the problem.
I'm not sure how much you already know about HTTP headers, but basically an HTTP header is just a formatted string sent to a web server - it follows a protocol - and is pretty straightforward in that respect. You write a request and receive a response. The requests look like what you see in the Firefox plugin LiveHTTPHeaders at https://addons.mozilla.org/en-US/firefox/addon/3829/.
I wrote a small post at my site, http://blog.gnucom.cc/2010/write-http-request-to-web-server-with-php/, that shows how you can write a request to a web server and then read the response. If you only accept text/html, you'll only accept a subset of what is available on the web (so yes, it will "optimize" your script to an extent). Note that this example is really low-level; if you're going to write a spider, you may want to use an existing library like cURL, or whatever other tools your implementation language offers.
Yes, by using Accept: text/html you should only get HTML as a valid response. At least, that's how it ought to be.
But in practice there is a huge difference between the standards and the actual implementations. And proper content negotiation (that’s what Accept is for) is one of the things that are barely supported.
An HTML page contains just the text plus some tag markup.
Images, scripts and stylesheets are (usually) external files that are referenced from the HTML markup. This means that if you request a page, you will already receive just the text (without the images and other stuff).
Since you are writing the crawler, you should make sure it doesn't follow URLs from images, scripts or stylesheets.
I'm not 100% sure, but I believe that GET /foobar.png will return the image even if you send Accept: text/html. For this reason I believe you should just filter what kind of URLs you crawl.
In addition, you can read the response headers in the crawler and close the connection before reading the body whenever the Content-Type is not text/html. That might be worthwhile for larger undesired files.
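In Python that check can be done at a fairly low level with http.client, reading the status line and headers first and closing the socket before the body; a minimal sketch:

import http.client

def fetch_html(host, path):
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path, headers={"Accept": "text/html"})
    resp = conn.getresponse()
    if not resp.getheader("Content-Type", "").startswith("text/html"):
        conn.close()  # drop the connection without downloading the body
        return None
    body = resp.read()
    conn.close()
    return body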

HTTP GET content type

I have a program that is supposed to interact with a web server and retrieve a file containing structured data using HTTP and CGI. I have a couple of questions:
The CGI script on the server needs to specify a body, right? What should the Content-Type be?
Should I be using POST or GET?
Could anyone tell me a good resource for reading about HTTP?
If you just want to retrieve the resource, I'd use GET. With GET you don't need a Content-Type, since a GET request has no body. As for HTTP, I'd suggest reading the HTTP 1.1 specification.
The Content-Type specified by the server will depend on what type of data you plan to return. As Jim said, if it's JSON you can use application/json. The obvious payload for the request would be whatever data you're sending to the client.
From the server's perspective it shouldn't matter that much. In general, if you're not expecting a lot of information from the client, I'd set up the server to respond to GET requests rather than POST requests. An advantage I like is simply being able to specify what I want in the URL (this can't be done if the server expects a POST request).
I would point you to the RFC for HTTP - probably the best source of information. Maybe not the most user-friendly way to get your answers, but it should have all the answers you need.
For (1) the Content-Type depends on the structured data. If it's XML you can use application/xml, JSON can be application/json, etc. Content-Type is set by the server. Your client would ask for that type of content using the Accept header. (Try to use existing data format standards and content types if you can.)
For (2) GET is best (you aren't sending up any data to the server).
I found RESTful Web Services by Richardson and Ruby a very interesting introduction to HTTP. It takes a very strict, but very helpful, view of HTTP.
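To make (2) concrete, here is a small Python sketch of a plain GET that states the preferred representation via the Accept header; the endpoint is a placeholder.

import json
import urllib.request

req = urllib.request.Request(
    "http://example.com/data",  # placeholder endpoint
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    assert resp.headers.get_content_type() == "application/json"
    data = json.load(resp)  # parse the structured payload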
