I wish to get a particular set of files and the only access I have on that box is through the http inteface, which I can get from wget. Now the issue is that I want the latest files, and there are multiple which must be of the same time stamp.
wget http://myserver/abc_20090901.tgz
wget http://myserver/xyz_20090901.tgz
wget http://myserver/pqr_20090901.tgz
The issue being that I do not know if all the above files exist and I wish to get only when all 3 files with the above time stamp exist.
The other issue is that I have another file in a separate folder which is also to be obtained. How do I get these files? Any suggestions?
wget http://myserver/text/myfile_20090901.txt
wget offers some basic resource presence detection with its --spider option. Something like this should do the trick:
wget --spider http://myserver/abc_20090901.tgz &&
wget --spider http://myserver/xyz_20090901.tgz &&
wget --spider http://myserver/pqr_20090901.tgz &&
wget http://myserver/abc_20090901.tgz &&
wget http://myserver/xyz_20090901.tgz &&
wget http://myserver/pqr_20090901.tgz &&
wget http://myserver/text/myfile_20090901.txt
Related
I had been using a proxy for a long time. Now I need to remove it. I have forgotten how I have added the proxy to wget. Can someone please help me get back to the normal wget where it doesn't use any proxy. As of now, I'm using
wget <link> --proxy=none
But I'm facing a problem when I'm installing using a pre-written script. It's painstaking to search all through the scripts and change each command.
Any simpler solution will be very much appreciated.
Thanks
Check your
~/.wgetrc
/etc/wgetrc
and remove proxy settings.
Or use wget --no-proxy command line option to override them.
In case your OS is alpine/busybox then the wget might vary from the one used by #Logu.
There the correct command is
wget --proxy off http://server:port/
Running wget --help outputs:
/ # wget --help
BusyBox v1.31.1 () multi-call binary.
Usage: wget [-c|--continue] [--spider] [-q|--quiet] [-O|--output-document FILE]
[-o|--output-file FILE] [--header 'header: value'] [-Y|--proxy on/off]
[-P DIR] [-S|--server-response] [-U|--user-agent AGENT] [-T SEC] URL...
Retrieve files via HTTP or FTP
--spider Only check URL existence: $? is 0 if exists
-c Continue retrieval of aborted transfer
-q Quiet
-P DIR Save to DIR (default .)
-S Show server response
-T SEC Network read timeout is SEC seconds
-O FILE Save to FILE ('-' for stdout)
-o FILE Log messages to FILE
-U STR Use STR for User-Agent header
-Y on/off Use proxy
I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively -r download files jpegs -A jpg and store all files in a single directory, without recreating website directory tree -nd
Running this command downloads only the jpegs from the homepage of the website, not the whole jpegs of all the website.
I know that a jpeg file could have different extensions (jpg, jpeg) and so on, but this is not the case, also there aren't any robots.txt restrictions acting.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug or I am misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., there is an explicit extension. Then the option -A jpeg has the "side-effect" of removing those pages from the traversal process.
Perhaps a bit dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz anwer pointed me to the right way, the --accept acclist parameter has a dual role, it define the rules of file saving and the rules of following links.
Reading deeply the manual I found this
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will match be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could have been using custom extensions
So I think that the most robust solution would be a bash script, something
like
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
image_urls=`wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)'`
for image_url in $image_urls; do
DESTFILE="$DEST_DIR/$RANDOM.jpg"
wget "$image_url" -O "$DESTFILE"
done
--spider wget will not download the pages, just check that they are there
$RANDOM asks a random number to the operating system
The following did not work.
wget -r -A .pdf home_page_url
It stop with the following message:
....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED
I don't know why it only stops in the starting url, do not go into the links in it to search for the given file type.
Any other way to recursively download all pdf files in an website. ?
It may be based on a robots.txt. Try adding -e robots=off.
Other possible problems are cookie based authentication or agent rejection for wget.
See these examples.
EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
the following cmd works for me, it will download pictures of a site
wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
This is certainly because of the links in the HTML don't end up with /.
Wget will not follow this has it think it's a file (but doesn't match your filter):
page
But will follow this:
page
You can use the --debug option to see if it's the actual problem.
I don't know any good solution for this. In my opinion this is a bug.
In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
wget --no-verbose --mirror --spider https://example.com -o - | while read line
do
[[ $line == *'200 OK' ]] || continue
[[ $line == *'.pdf'* ]] || continue
echo $line | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done
Explanation: Recursively crawl https://example.com and pipe log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and tailing request info (7 characters) and use wget to download the PDF.
I have the following script for some number crunching
#!/bin/bash
sudo apt-get update -y
sudo apt-get upgrade -y
sudo apt-get install -y r-base r-base-dev htop s3cmd p7zip-full
wget https://s3.amazonaws.com/#######/###.7z
7z e ###.7z
sudo R CMD BATCH --slave --no-timing --vanilla "--args 0 1 100 200 500 2" SOME-ROUTINE.R
s3cmd put *.results s3://#########/
on EC2. I upload the script as file at the Launch Instance->Instance Details->User Data
The machine fires up, updates and upgrades but then it does not execute wget and does not download the file. When i SSH in the Instance and run the exact same commands the process completes without problems.
Any ideas why wget does not work?
Any other alternatives?
EC
It is always a bit of guessing, but here is how I would debug this:
My first suggestion would be to check for special characters in the S3 URL. This might cause the wget call to fail.
Second, I would give an explicit output path to wget with the -O option. While you are editing the command, you can also add -o to output logging information.
Last step is to check your access rights to the S3 bucket. Perhaps you can try to put the file on another webspace to see if the command executes then.
Greetings to everyone,
I'm on OSX. I use the terminal a lot as a habit from my Linux old days that I never surpassed. I wanted to download the files listed in this http server: http://files.ubuntu-gr.org/ubuntistas/pdfs/
I select them all with the mouse, put them in a txt files and then gave the following command on the terminal:
for i in `cat ../newfile`; do wget http://files.ubuntu-gr.org/ubuntistas/pdfs/$i;done
I guess it's pretty self explanatory.
I was wondering if there's any easier, better, cooler way to download this "linked" pdf files using wget or curl.
Regards
You can do this with one line of wget as follows:
wget -r -nd -A pdf -I /ubuntistas/pdfs/ http://files.ubuntu-gr.org/ubuntistas/pdfs/
Here's what each parameter means:
-r makes wget recursively follow links
-nd avoids creating directories so all files are stored in the current directory
-A restricts the files saved by type
-I restricts by directory (this one is important if you don't want to download the whole internet ;)