Given a file on a webserver (e.g., http://foo.com/bar.zip -> only accessible through HTTP), is there any way to get the date attributes (e.g., date [created, modified]) without downloading the entire archive in the first place?
Right now, I download the archive and read the attributes programmatically. Trouble is that the archive is dozens of MiB so it seems like a waste of resources to download the entire thing and end up reading off just a couple of bytes of information.
I realize that bandwidth is practically free, but I don't like to be wasteful in any case.
Try to read the Last-Modified response header.
Be sure to use an HTTP HEAD request instead of an HTTP GET request to read the HTTP headers only. If you do an HTTP GET, you will download the whole file anyway, even if you only intend to inspect the HTTP headers.
Just for the sake of simplicity, here's a compilation of the existing (perfect) answers from @ihorko and @JanThomä that uses curl. Other options are available too, of course, but here's a fully functional answer.
Use curl with the -I option:
-I, --head
(HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document. When used on an FTP or FILE file, curl displays the file size and last modification time only.
Also, the -s option is nice here:
-s, --silent
Silent or quiet mode. Don't show progress meter or error messages. Makes Curl mute. It will still output the data you ask for, potentially even to the terminal/stdout unless you redirect it.
Hence, something like this would do the trick:
curl -sI http://foo.com/bar.zip | grep -i '^Last-Modified:' | cut -d' ' -f 2-
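If curl isn't available, wget can make a similar request; a minimal sketch assuming the same URL, where --spider prevents any download and -S prints the response headers (which wget writes to stderr, hence the redirect):
wget -S --spider http://foo.com/bar.zip 2>&1 | grep -i 'Last-Modified'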
Related
There's a simple game that my friends and I play both in person and online. I developed a CLI that records our in-person games (I just type in each move), but I now want to use it to record our online games. All I need to do is pipe the HTTP response bodies being sent to my browser (Firefox) to my CLI. Unfortunately, I can't figure out how to do this.
Ideally, I'm looking for an Ubuntu package that I can run from the command line that will capture and return all HTTP response bodies from a specific endpoint. I've looked into tcpdump and some simple proxy servers, but I'm not sure they do what I want them to do.
Thanks for your help! Let me know if I need to provide any further information!
I used MITMProxy as ZachChilders recommended in the comments. I found it somewhat difficult to set up, so I'll include the directions I followed to get it up and running:
1) Install MITMProxy.
2) Configure Firefox.
3) Create an add-on to parse the response bodies.
4) Stream data via Python to the CLI (TODO); see the sketch below for how the proxy can be launched with such an add-on.
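For illustration, a minimal sketch of how the proxy can be launched once the add-on exists; this assumes mitmproxy is installed, Firefox is pointed at localhost:8080 as its proxy, and the add-on script name (dump_bodies.py) and the URL filter are placeholders for your own:
# Run mitmproxy headless, load the add-on script, and only keep flows
# whose URL matches the endpoint of interest.
mitmdump --listen-port 8080 -s dump_bodies.py "~u example.com/game/moves"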
I'm currently using XQuery queries (launched via the REST API) to extract data from MarkLogic 8.0.6.
Query in my file extract_data.xqy:
xdmp:save("toto.csv",let $nl := "
"
return
document {
for $data in collection("http://book/polar")
return ($data)
})
API call:
$ curl --anyauth --user ${MARKLOGIC_USERNAME}:${MARKLOGIC_PASSWORD} -X POST -i -d @extract_data.xqy \
  -H "Content-type: application/x-www-form-urlencoded" \
  -H "Accept: multipart/mixed; boundary=BOUNDARY" \
  $node:$port/v1/eval?database=$db_name
It works fine, but I'd like to schedule this extract directly in MarkLogic and have it run in the background, to avoid a timeout if the request takes too long to execute.
Is there a feature for that?
Regards,
Romain.
You can use the task scheduler to set up recurring script execution.
The timeout can be adjusted in the script with xdmp:set-request-time-limit, which takes a limit in seconds, e.g. xdmp:set-request-time-limit(3600) for an hour.
I would suggest you take a look at MLCP as well.
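For example, a hedged sketch of what an MLCP export of that collection might look like; the host, port, credentials, and output path are placeholders, and the exact options should be checked against the MLCP documentation for your version:
# Export all documents in the collection to flat files on disk.
mlcp.sh export -host localhost -port 8000 \
  -username ${MARKLOGIC_USERNAME} -password ${MARKLOGIC_PASSWORD} \
  -collection_filter "http://book/polar" \
  -output_file_path /tmp/polar_export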
As suggested by Mads, a tool like CORB can help pull CSV data out of MarkLogic.
A schedule as suggested by Michael can trigger a periodic export, and save the output to disk, or push it elsewhere via HTTP. I'd look into running incremental exports in that case though, and I'd also suggest batching things up. In a large cluster, I'd even suggest chunking the export into batches per forest or per host on which content forests are attached. Scheduled tasks allow targeting specific hosts on which they should run.
You can also run an ad hoc export, particularly if you batch up the work using a tool like taskbot. And if you combine it with its OPTIONS-SYNC-UPDATE mode, you can merge multiple batches back into one result file before emitting it, and get better performance than running it single-threaded. Merging results doesn't scale endlessly, but if you have a relatively small dataset (only a few million small records, say), that might be sufficient.
HTH!
We have one .sh file which contains all the configurations.
We have something like this,
export MARK_REMOTE_NODE=<server name>
The requirement is that we have to send the same file to two different servers. Is it possible to transfer the same XFB file to different REMOTE_NODEs or servers in UNIX?
When I was searching, I learned that BTOPUT transfers one file at a time to one partner. So can anyone tell me how to transfer a file to two different servers?
XFB already has a hard job matching different operating systems and filesystems, with optional compression and retry mechanisms. You need to decide what should happen when one transfer fails (only send the second when the first succeeds; fire-and-forget; always try to send both and trust your incident management to catch the errors thrown by your monitoring; wait for the async transfer for a time depending on the file size; ...).
I wouldn't rely on the XFB options; just make a loop in your script doing exactly what you want. The additional advantage is that a migration to another communication tool will be easier.
while read -r targethost; do
  # You need a copy, since XFB will rename and delete the file
  cp outputfile "outputfile.${targethost}"
  my_send_xfb "${targethost}" "outputfile.${targethost}"
  # optional: check the result of posting the file in the queue
  if [ $? -ne 0 ]; then
    echo "XFB not ready or configured for ${targethost}"
    # Perhaps break / send alert / ..
  fi
done < myhosts
I have a list of approximately 4300 URLs, all very similar. It is likely that a few of them have been removed, and I wish to identify which ones are no longer valid. I'm not interested in the content (at this point in time), only in whether each one would currently return valid content (HTTP 200) or doesn't exist (HTTP 404). Essentially, I'm looking for a URL ping service. This is a one-off exercise.
If there aren't any existing tools specifically for this purpose, I'm very comfortable in Java and could code my own solution. However, I don't want to reinvent the wheel, and I'm not sure how best to do this without it looking like a denial-of-service attack. Would it be acceptable to hit each URL in turn, one immediately after the other (so no concurrent requests)? I'm very conscious of not putting undue strain on the target servers.
Many thanks for any ideas or suggestions.
wget conveniently returns 0 for 200, and a nonzero return value for 404, thus the following would work:
for i in $(cat listOfUrls.txt); do
wget --quiet $i && echo $i >> goodUrls.txt || echo $i >> badUrls.txt;
done
or some close variant.
Consider:
sleeping for, say, 1s between requests
randomising listOfUrls.txt using, say, sort -R, to spread multiple requests to the same server over time (both combined in the sketch below)
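Putting both suggestions together, here's a hedged sketch that checks only the status code with curl instead of downloading each file (output file names are placeholders; note that a few servers answer HEAD requests differently than GET):
sort -R listOfUrls.txt | while read -r url; do
  # -I sends a HEAD request; -w prints just the numeric status code
  code=$(curl -s -o /dev/null -I -w '%{http_code}' "$url")
  if [ "$code" = "200" ]; then
    echo "$url" >> goodUrls.txt
  else
    echo "$url ($code)" >> badUrls.txt
  fi
  sleep 1   # be gentle to the target servers
done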
There is no 100% solution for this issue. For example, if the response status is determined by a server-side script (PHP, say), the server will usually send the content along with the status regardless of which request headers you send.
Still, you could play with a Range request header to ask for only the first bytes of the content, though this must be supported by the server backend.
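For instance, a minimal sketch of such a request with curl; if the server honours it, you get a 206 Partial Content response and only the first KiB crosses the wire (URL and byte range are placeholders):
# -r asks for a byte range; -D - dumps the response headers to stdout
curl -s -r 0-1023 -D - -o /dev/null http://foo.com/bar.zip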
When I ran the fetch command, the following message was output:
"size of remote file is not known"
Is this an error? Is there an option I can specify to make it go away? Or is it safe to ignore?
This is the Freebsd fetch command, right?
That is not an error, just a warning. I don't think there's an option to suppress that warning, though.
This is because the HTTP server doesn't send the Content-Length header with the response. That way, the client doesn't know in advance how long the file is, and has to assume that it ends when the connection is closed by the server, with the side effect that if the connection drops prematurely, you'll end up with an incomplete download without knowing it.
This doesn't sound very good, but is in fact quite usual practice on the web, especially for dynamic content generated by scripts.
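If you want to confirm what the server sends, a quick check is to request only the headers and look for Content-Length; a minimal sketch assuming curl is available (the URL is a placeholder):
curl -sI http://example.com/file.bin | grep -i 'Content-Length'
If nothing comes back, the server simply isn't advertising the size, and fetch has nothing to report.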
Here is a way to do it in PHP; note that getRemoteFileSize() is not a PHP builtin, it's a helper you have to define yourself (for example by issuing a HEAD request and reading the Content-Length header):
// File size of a remote image
echo "Weberdev Logo Size : " . getRemoteFileSize('http://www.weberdev.com/images/BlueLogo150x45.gif');