What tool should I use to fetch the HTTP headers of a remote web server?

I am looking for a tool similar to cURL, but simpler, that fetches the HTTP headers without the body. I am not interested in downloading the body. I noticed that cURL seems to download the body, which consumes unnecessary bandwidth for my needs.

Use the -I flag with curl to make it issue a HEAD request, which returns just the headers.
(The headers are not guaranteed to be exactly the same as those of a GET response, but they are supposed to be.)

If you are using the libcurl library, the curl_easy_setopt() function has a CURLOPT_NOBODY option available, which causes libcurl to send a HEAD request to download just the headers, instead of a GET request to download the entire data.
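For anyone scripting this rather than calling curl, here is a minimal sketch of the same HEAD request using only Python's standard library. The function names are made up for illustration, and example.com in the usage comment is a placeholder host, not one from the question.

```python
import urllib.request


def build_head_request(url: str) -> urllib.request.Request:
    """Build a request that asks the server for headers only (HTTP HEAD)."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url: str, timeout: float = 10.0) -> dict:
    """Send the HEAD request and return the response headers as a dict."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
        # A HEAD response has no body, so only the header block is transferred.
        return dict(resp.headers.items())


# Usage (placeholder URL, requires network access):
#     for name, value in fetch_headers("http://example.com/").items():
#         print(f"{name}: {value}")
```

As with curl -I, the server decides what a HEAD response contains, so the headers may differ slightly from those of a GET.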

Related

Curl redirect without sending the first POST

I'm using "curl -L --post302 --request PUT --data-binary @file " to post a file to a redirected address. At the moment the redirection is not optional, since it allows for signed headers and a new destination. The GET version works well. The PUT version also works under a certain file size threshold. I need a way for the PUT to allow itself to be redirected without sending the file on the first request (to the redirectorURL), and then only send the file when the POST is redirected to a new URL. In other words, I don't want to transfer the same file twice. Is this possible? According to the RFC (https://www.rfc-editor.org/rfc/rfc2616#section-8.2) it appears that a server may send a 100 "with an undeclared wait for 100 (Continue) status, applies only to HTTP/1.1 requests without the client asking to send its payload", so what I'm asking for may be thwarted by the server. Is there a way around this with one curl call? If not, two curl calls?
Try curl -L -T file $URL as the more "proper" way to PUT that file. (Often repeated by me: -X and --request should be avoided if possible, they cause misery.)
curl will send "Expect: 100-continue" by itself in this case, but you'll probably also learn that many servers don't bother supporting it, so it will most likely still end up having to PUT twice...

Sending data with CURL GET

I thought GET is not supposed to have a body. But in the context of elasticsearch, I keep seeing this kind of query (see here for instance):
curl -XGET localhost:9200/test/_msearch --data-binary @requests; echo
How is the binary data sent in this case? Can somebody explain what is going on and how this works?
I first thought it was converted to a POST, but I put a proxy in front of Elasticsearch and saw that curl was really sending a GET. However, I could find the data neither in the headers, nor in the parameters, nor in the body. So it seems my proxy was also confused by this request.
But when I execute the request directly against elasticsearch, it works just fine. What gives?
GETs with bodies are allowed but not considered to be very "meaningful". See the question "HTTP GET with request body" and its answers for a full discussion,
and this answer about your proxy: https://stackoverflow.com/a/978173/3516034
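To make it concrete where the data goes, here is a rough sketch of the bytes a client puts on the wire for a GET that carries a body. The helper name is hypothetical and the header set is simplified. The Content-Type shown is what curl uses by default with --data-binary, and real clients add further headers such as User-Agent and Accept.

```python
def build_get_with_body(host: str, path: str, body: bytes) -> bytes:
    """Serialize a simplified HTTP/1.1 GET request that carries a message body."""
    head = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        # Content-Length tells the server how many body bytes follow the headers.
        f"Content-Length: {len(body)}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "\r\n"
    )
    # The body comes after the blank line, exactly as it would for a POST.
    return head.encode("ascii") + body


# Usage: print the request text for a small, made-up payload.
#     raw = build_get_with_body("localhost:9200", "/test/_msearch", b"{}\n")
#     print(raw.decode("ascii"))
```

The request line says GET, but the payload sits in the same place as a POST body. An intermediary that does not expect a body on a GET may ignore or drop it, which would match what the asker's proxy showed.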

Which CDN solutions support caching with content negotiation?

I'm serving a set of resources through content negotiation.
Concretely, any URL can be represented in different formats,
depending on the client's Accept header.
An example of this can be seen at Facebook:
curl -H "Accept: application/json" http://graph.facebook.com/daft-punk
results in JSON
curl -H "Accept: text/turtle" http://graph.facebook.com/daft-punk
results in Turtle
I'm looking for a CDN that caches content based on URL and the client's Accept header.
Example of what goes wrong
CloudFlare doesn't support this: if one client asks for HTML, then all subsequent requests to that URL receive the HTML representation, regardless of their preferences. Others have similar issues.
For example, if I placed CloudFlare in front of graph.facebook.com (and configured it to cache “extensionless” resources, which it does not do by default), then it would behave incorrectly:
I ask for http://graph.facebook.com/daft-punk in JSON through curl;
in response, CloudFlare requests the JSON representation from the origin server, caches it, and serves it.
I ask for http://graph.facebook.com/daft-punk through my browser (thus in HTML);
in response, CloudFlare sends the cached JSON (!) representation, even though the origin server would have sent the HTML version.
What would be needed instead
The correct behavior would be that CloudFlare asks the server again, since the second client had a different Accept header.
After this, requests with similar Accept headers can be served from cache.
Which CDN solutions support content-negotiation, and also cache negotiated content?
So note that only respecting Accept is not enough; negotiated responses should be cached too.
PS1: It's easy to make your own caching servers support it. For instance, for nginx:
proxy_cache_key "$scheme$host$request_uri$http_accept";
Note how the client's Accept header is part of the key that indexes the cache. I want that on CDN.
PS2: It is not an option to use different URLs for different representations. My application is in the Linked Data domain, where URLs play an important role for identification.
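To illustrate the behavior asked for, here is a minimal in-memory sketch of a cache that keys entries on both the URL and the Accept header, the same idea as the nginx cache key in PS1. The class and the fetch_from_origin callback are hypothetical names for illustration. A real cache would also normalize Accept values and honor expiry, which this sketch ignores.

```python
from typing import Callable, Dict, Tuple


class NegotiatingCache:
    """Cache responses per (URL, Accept header) pair instead of per URL."""

    def __init__(self, fetch_from_origin: Callable[[str, str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[Tuple[str, str], bytes] = {}

    def get(self, url: str, accept: str) -> bytes:
        # The Accept value is part of the key, so a JSON response is never
        # served to a client that asked for HTML (or the other way around).
        key = (url, accept)
        if key not in self._store:
            self._store[key] = self._fetch(url, accept)
        return self._store[key]
```

With this keying, the second request in the CloudFlare example (same URL, different Accept) would miss the cache and go back to the origin, while later requests with the same Accept value would be served from cache.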
Seems maxcdn still can set up custom nginx rules for content negotiation (despite what their faq says) - http://blog.maxcdn.com/how-to-reduce-image-size-with-webp-automagically/#comment-1048561182
I can't think of any way we would impact this at all at this time. We don't, for example, cache HTML by default. Have you actually seen an issue with this? Have you opened a support ticket?

Is there a standard way in HTTP to specify no content should be returned?

For a PUT or POST (for example), I would like to specify to the server that I don't want any content returned in the response, even if it normally would. Essentially I'm looking for a way to perform blind inserts/updates, and was trying to avoid unnecessary response payloads if I have no intention of using them.
I thought maybe Accept: none as a request header (or something similar) might be an option, but couldn't find anything to support that.
Is there a standard way to specify this in an HTTP request, or do I have to just live with a little extra content in the response?
I think a minimal response is necessary to know whether the web server handled the request correctly or whether there were errors, even if it carries no data other than the status code and HTTP headers.
That said, you can use the HTTP HEAD method in place of a GET to get a response without the message body (you get back only the headers). But AFAIK there is no equivalent for POST or PUT requests.
Regards.
You might be interested in the proposal outlined in https://datatracker.ietf.org/doc/html/draft-snell-http-prefer-18.
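That draft was later published as RFC 7240, which defines a "return=minimal" preference. As a small sketch, assuming a server that chooses to honor it (servers are free to ignore the header), a PUT carrying the preference could be built like this. The function name is hypothetical.

```python
import urllib.request


def build_minimal_put(url: str, payload: bytes) -> urllib.request.Request:
    """Build a PUT that asks the server to keep its response body minimal."""
    req = urllib.request.Request(url, data=payload, method="PUT")
    # "Prefer: return=minimal" is only a hint; the server may still send a body.
    req.add_header("Prefer", "return=minimal")
    return req


# Usage (placeholder URL, requires a server that accepts PUT):
#     with urllib.request.urlopen(build_minimal_put("http://example.com/item/1", b"data")) as resp:
#         print(resp.status)
```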

How to perform an action when a remote (Http) file changed?

I want to create a script that checks a URL and performs an action (download + unzip) when the "Last-Modified" header of the remote file changes. I thought about fetching the header with curl, but then I have to store it somewhere for each file and perform a date comparison.
Does anyone have a different idea using (mostly) standard unix tools?
thanks
A possible solution would be periodically running this algorithm on the client box.
Create an HTTP request with the If-Modified-Since header set to the modification date of your local file. If the file does not exist yet, do not include this header;
The server will either send you the file in the payload, if it has changed since the date in the If-Modified-Since header, or send a 304 Not Modified HTTP status.
If you receive a 200 OK HTTP status, simply get the payload from the HTTP body and unzip the file.
If, on the other hand, you receive a 304 Not Modified, you know that your file is up to date.
Use the Last-Modified header to touch your local file. This way you will be in sync with the server's date and time.
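The steps above can be sketched roughly as follows, using only Python's standard library. The function names are made up for illustration, and the unzip step is left out to keep the sketch short.

```python
import os
import urllib.error
import urllib.request
from email.utils import formatdate, parsedate_to_datetime


def http_date(timestamp: float) -> str:
    """Format a Unix timestamp as an HTTP date, e.g. for If-Modified-Since."""
    return formatdate(timestamp, usegmt=True)


def build_conditional_request(url: str, local_path: str) -> urllib.request.Request:
    """Add If-Modified-Since only when a local copy already exists."""
    req = urllib.request.Request(url)
    if os.path.exists(local_path):
        req.add_header("If-Modified-Since", http_date(os.path.getmtime(local_path)))
    return req


def download_if_changed(url: str, local_path: str) -> bool:
    """Return True if a new copy was downloaded, False on 304 Not Modified."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, local_path)) as resp:
            data = resp.read()
            last_modified = resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return False
        raise
    with open(local_path, "wb") as fh:
        fh.write(data)
    if last_modified:
        # "Touch" the local file so its mtime matches the server's Last-Modified.
        ts = parsedate_to_datetime(last_modified).timestamp()
        os.utime(local_path, (ts, ts))
    return True
```

Setting the local mtime from Last-Modified means the next If-Modified-Since value comes from the server's clock rather than the client's, which avoids problems with clock skew.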
Another way would be for the server to push notifications (a broadcast package for example) when the file is changed. When the notification is received the client would then execute the above algorithm. This would imply code to live in the HTTP server that listens for file system changes and then broadcast them to interested parties.
Perhaps this info for the curl command is of some importance:
TIME CONDITIONS
HTTP allows a client to specify a time condition for the document it requests. It is If-Modified-Since or If-Unmodified-Since. Curl allows you to specify them with the -z/--time-cond flag.
For example, you can easily make a download that only gets performed if the remote file is newer than a local copy. It would be made like:
curl -z local.html http://remote.server.com/remote.html
Or you can download a file only if the local file is newer than the remote one. Do this by prepending the date string with a '-', as in:
curl -z -local.html http://remote.server.com/remote.html
You can specify a "free text" date as condition. Tell curl to only download the file if it was updated since yesterday:
curl -z yesterday http://remote.server.com/remote.html
Curl will then accept a wide range of date formats. You always make the date check the other way around by prepending it with a dash '-'.
To sum up, you will need:
curl command
touch command
some bash scripting
Is Java applicable in your case? I did something similar for a homework assignment using the Apache HttpCore library. You need to add the "If-Modified-Since" header to your HTTP request before you send it to the server. If the status code of the response you receive from the server is not 304, then you know that the file has changed since the time value you are checking against.

Resources