How to wget recursively on specific TLDs? - recursion

Is it possible to recursively download files from specific TLDs with wget?
Specifically, I'm trying to download the full text of the Code of Massachusetts Regulations. The actual text of the regulations is stored in multiple files across multiple domains—so I'd like to start the recursive download from the index page, but only follow links to .gov and .us domains.

With help from the wget documentation on spanning hosts, I was able to make this work with the -H and -D flags:
wget -r -l5 -H -D.us,.gov http://www.lawlib.state.ma.us/source/mass/cmr/index.html
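For reference, here is the same command annotated flag by flag; the --wait=1 delay is my own addition to be gentler on the servers, not part of the original command:

# -r            recurse into links
# -l5           limit the recursion depth to 5 levels
# -H            allow the recursion to span to other hosts...
# -D.us,.gov    ...but only hosts in the .us or .gov domains
# --wait=1      pause one second between requests (optional addition)
wget -r -l5 -H -D.us,.gov --wait=1 http://www.lawlib.state.ma.us/source/mass/cmr/index.html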

Related

wget, recursively download all jpegs works only on website homepage

I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download the jpegs (-A jpg) and store all files in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the jpegs from the homepage of the website, not the jpegs from the whole website.
I know that a jpeg file could have different extensions (jpg, jpeg) and so on, but that is not the case here; there also aren't any robots.txt restrictions in effect.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug, or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., with an explicit extension. The option -A jpg then has the side effect of removing those pages from the traversal process.
Perhaps a somewhat dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me in the right direction: the --accept acclist parameter has a dual role; it defines both the rules for saving files and the rules for following links.
Reading the manual more deeply, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will match be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could be using custom extensions.
So I think that the most robust solution would be a bash script, something like:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
image_urls=$(wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)')
for image_url in $image_urls; do
    DESTFILE="$DEST_DIR/$RANDOM.jpg"
    wget "$image_url" -O "$DESTFILE"
done
With --spider, wget will not download the pages, just check that they are there.
$RANDOM asks the shell for a random number.
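If you would rather keep the original filenames instead of random ones, here is a variant of the same spider-then-fetch idea (just a sketch; it assumes the image URLs end in .jpg or .jpeg and that no two images share a filename):

WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
# --spider lists the URLs without downloading them; -P saves each image
# under its original name inside DEST_DIR
wget -nd --spider -r "$WEBSITE" 2>&1 \
    | grep '^--' | awk '{ print $3 }' \
    | grep -i '\.\(jpeg\|jpg\)' \
    | while read -r image_url; do
          wget -nv "$image_url" -P "$DEST_DIR"
      done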

wget download limited by http address

I want to download a website using its address as a download limiter.
To be precise...
The website address is http://das.sdss.org/spectro/1d_26/
It contains around 2700 subsites with data. I want to limit the recursive download so that I only download the sites from:
http://das.sdss.org/spectro/1d_26/0182/
to
http://das.sdss.org/spectro/1d_26/0500/
Using this tutorial I have made a wget command:
wget64 -r -nH -np -N http://das.sdss.org/spectro/1d_26/{0182..0500}/
but the last part of the address gives me a 404 error.
Is there a mistake in my command or is the tutorial faulty?
P.S. I know it's possible to achieve this with -I lists, but I want to do it this way if possible.
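One way to sidestep the brace expansion entirely is an explicit loop, so a missing directory only produces a 404 for that single iteration (a sketch; it assumes a bash-like shell and that the directory names are zero-padded to four digits):

for n in $(seq 182 500); do
    dir=$(printf '%04d' "$n")    # 182 -> 0182, ..., 500 -> 0500
    wget64 -r -nH -np -N "http://das.sdss.org/spectro/1d_26/${dir}/"
done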

Using wget to download a file from a password protected link

I am trying to use wget to download a file from an http link that is password protected. I am using the following syntax:
wget --http-user=user --http-password=xxxxxx http://......
Am I using the right syntax? Should user and password be surrounded by quotes or double quotes?
I did this a few years ago and luckily found the script in a backup I still have.
I remember it was a two-stage process.
The first step is to get and store the cookie(s):
wget --keep-session-cookies --save-cookies nameofcookiesfile.txt --post-data 'email=my.email#address.com&password=mypassword123' https://web.site.com/redirectLogin -O login.html
The second is to use those cookies to get the file/page you need:
wget --load-cookies nameofcookiesfile.txt -p http://web.site.com/section/ -O savedoutputfile.html -nv
These are the commands exactly as I used them (except that I have changed usernames, passwords, filenames and websites). I also came across this link, which may be of some assistance, particularly the "referer" part:
http://www.linuxscrew.com/2012/03/20/wget-cookies/
Hope this helps or at least gives someone a starting point.
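As for the original question: if the link is protected with plain HTTP Basic authentication rather than a login form, the syntax shown is already valid, and quotes are only needed when the username or password contains characters the shell would otherwise interpret. A minimal sketch with placeholder values:

# single quotes keep the shell from touching special characters such as ! or &
wget --http-user='myuser' --http-password='s3cret!pass' 'http://example.com/protected/file.zip'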

Get list of files via http server using cli (zsh/bash)

Greetings to everyone,
I'm on OSX. I use the terminal a lot, a habit from my old Linux days that I never outgrew. I wanted to download the files listed on this http server: http://files.ubuntu-gr.org/ubuntistas/pdfs/
I selected them all with the mouse, put them in a txt file, and then gave the following command in the terminal:
for i in `cat ../newfile`; do wget http://files.ubuntu-gr.org/ubuntistas/pdfs/$i;done
I guess it's pretty self explanatory.
I was wondering if there's any easier, better, cooler way to download these linked pdf files using wget or curl.
Regards
You can do this with one line of wget as follows:
wget -r -nd -A pdf -I /ubuntistas/pdfs/ http://files.ubuntu-gr.org/ubuntistas/pdfs/
Here's what each parameter means:
-r makes wget recursively follow links
-nd avoids creating directories so all files are stored in the current directory
-A restricts the files saved by type
-I restricts by directory (this one is important if you don't want to download the whole internet ;)
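If you prefer curl, a rough equivalent is to scrape the index page for the pdf links and fetch each one (just a sketch; it assumes the index is a plain directory listing with relative href="...pdf" links):

base="http://files.ubuntu-gr.org/ubuntistas/pdfs/"
curl -s "$base" \
    | grep -o 'href="[^"]*\.pdf"' \
    | sed 's/^href="//; s/"$//' \
    | while read -r f; do
          curl -O "${base}${f}"    # -O saves each file under its remote name
      done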

How to change relative URL to absolute URL in wget

I am writing a shell script to download and display the content from a site and I am saving this content to my local file system.
I have used the following command in the script to get the content:
/usr/sfw/bin/wget -q -p -nH -np --referer=$INFO_REF --timeout=300 -P $TMPDIR $INFO_URL
where INFO_REF is the page where I need to display the content from INFO_URL.
The problem is that I am able to get the content (images/css) as an html page, but in this html the links on the images and headlines, which point to a different site, are not working, and the paths of those URLs (image links) are being changed to my local file system paths.
I tried adding the -k option to wget, and with this option those URLs point to the correct location, but now the images do not load because their paths change from relative to absolute. Without -k the images load properly.
Please tell me what option I can use so that both the images and the links in the page come out properly. Do I need to use two separate wget commands, one for the images and another for the links in the page?
As per the wget manual:
Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://site/document
In order to adjust it to your needs:
/usr/sfw/bin/wget -q -E -H -k -K -p -nH --referer=$INFO_REF --timeout=300 -P $TMPDIR $INFO_URL
I removed the -np because I think it's wrong (maybe a page dependency is in the parent directory).
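For reference, the long forms of the single-letter options the manual recommends (the variables are the ones from the original script):

# -E  --adjust-extension   save pages with an .html extension so they open locally
# -H  --span-hosts         allow page requisites to be fetched from other domains
# -k  --convert-links      rewrite links in the saved pages for local viewing
# -K  --backup-converted   keep an .orig copy of each file before converting it
# -p  --page-requisites    also fetch the images, CSS, etc. the page needs
/usr/sfw/bin/wget -q -E -H -k -K -p -nH --referer=$INFO_REF --timeout=300 -P $TMPDIR $INFO_URL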
