I'm using wget to download data for a research project on far-right extremism. I have a list of URLs, but wget fails to download them (they do work in the browser).
The URLs are all structured like this:
https://www.forum.org/forum/printthread.php?t=1182735&pp=100
For these, wget redirects to the front page. However, URLs like this one work fine with wget:
https://www.forum.org/forum/printthread.php?t=1182735
The problem seems to be the last bit of the URL, &pp=100.
Things I've tried so far:
Escaping the & character (\&) or replacing it with % or %20.
Turning off robots (-e robots=off).
Here's the wget code I use:
cat urls.txt | parallel -j 4 wget -e robots=off --no-check-certificate --auth-no-challenge --load-cookies cookies.txt --keep-session-cookies --random-wait --max-redirect=0 -P forumfiles -a wget_log_15dec2018
Edit: for what it's worth, the URLs do download with HTTrack, which makes me even more curious about this wget issue.
Edit2: changed the original URLs for anonymity.
Edit3: thanks to the answer below, the following code works:
cat urls.txt | parallel -j 4 wget --no-check-certificate --auth-no-challenge --load-cookies cookies.txt -nc --keep-session-cookies -U "Mozilla/5.0" --random-wait --max-redirect=0 -P forumfiles -a wget_log_17dec2018
Interestingly, the website in your example returns different results depending on the user-agent string. With the default user agent, the server returns a 301 response and redirects wget to the first page only.
You can simply change the user-agent string to make it work. e.g.:
--user-agent=mozilla
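For instance, to confirm that it really is the user agent that triggers the redirect, you could compare the server responses with and without a custom user-agent string (a quick sketch using the anonymized URL from the question):

wget -S --max-redirect=0 "https://www.forum.org/forum/printthread.php?t=1182735&pp=100"
wget -S --max-redirect=0 --user-agent="Mozilla/5.0" "https://www.forum.org/forum/printthread.php?t=1182735&pp=100"

The first call should show the 301 redirect in the response headers, the second one a 200 response.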
I had been using a proxy for a long time, and now I need to remove it, but I've forgotten how I added the proxy to wget in the first place. Can someone please help me get back to a normal wget that doesn't use any proxy? As of now, I'm using
wget <link> --proxy=none
But I'm facing a problem when installing with a pre-written script: it's painstaking to search through all the scripts and change each command.
Any simpler solution would be very much appreciated.
Thanks
Check your
~/.wgetrc
/etc/wgetrc
and remove proxy settings.
Or use the wget --no-proxy command-line option to override them.
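The proxy settings in those files typically look something like this (the host and port here are just placeholders):

use_proxy = on
http_proxy = http://proxy.example.com:8080/
https_proxy = http://proxy.example.com:8080/

Delete or comment out those lines, or bypass them for a single invocation:

wget --no-proxy http://example.com/somefile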
If your OS is Alpine/BusyBox, then wget may differ from the one used by @Logu.
There the correct command is
wget --proxy off http://server:port/
Running wget --help outputs:
/ # wget --help
BusyBox v1.31.1 () multi-call binary.
Usage: wget [-c|--continue] [--spider] [-q|--quiet] [-O|--output-document FILE]
[-o|--output-file FILE] [--header 'header: value'] [-Y|--proxy on/off]
[-P DIR] [-S|--server-response] [-U|--user-agent AGENT] [-T SEC] URL...
Retrieve files via HTTP or FTP
--spider Only check URL existence: $? is 0 if exists
-c Continue retrieval of aborted transfer
-q Quiet
-P DIR Save to DIR (default .)
-S Show server response
-T SEC Network read timeout is SEC seconds
-O FILE Save to FILE ('-' for stdout)
-o FILE Log messages to FILE
-U STR Use STR for User-Agent header
-Y on/off Use proxy
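So on a BusyBox system, a minimal invocation with the proxy switched off would look something like this (http://server:port/ being the same placeholder as above):

wget -Y off -P /tmp http://server:port/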
I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download only JPEG files (-A jpg) and store them all in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the JPEGs from the homepage of the website, not the JPEGs from the rest of the site.
I know that a JPEG file could have different extensions (jpg, jpeg) and so on, but that is not the case here; there also aren't any robots.txt restrictions in play.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., with an explicit extension. The option -A jpg then has the "side-effect" of removing those pages from the traversal process.
Perhaps a bit dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me in the right direction: the --accept acclist parameter has a dual role; it defines both the rules for saving files and the rules for following links.
Reading the manual more carefully, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will match be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could have been using custom extensions
So I think that the most robust solution would be a bash script, something like this:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
image_urls=$(wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)')
for image_url in $image_urls; do
    DESTFILE="$DEST_DIR/$RANDOM.jpg"
    wget "$image_url" -O "$DESTFILE"
done
--spider tells wget not to download the pages, just to check that they are there
$RANDOM asks the shell for a random number, used here to give each image a unique-ish filename
The following did not work.
wget -r -A .pdf home_page_url
It stops with the following message:
....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED
I don't know why it stops at the starting URL and doesn't follow the links in it to search for the given file type.
Is there any other way to recursively download all PDF files from a website?
It may be because of robots.txt. Try adding -e robots=off.
Other possible problems are cookie-based authentication or user-agent rejection for wget.
See these examples.
EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
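Putting those pieces together, a command roughly like this (with home_page_url standing in for the real start URL, as in the question) may get further:

wget -r -A pdf -e robots=off home_page_url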
The following command works for me; it will download the pictures of a site:
wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
This is certainly because the links in the HTML don't end with /.
Wget will not follow a link like this, as it thinks it points to a file (but it doesn't match your filter):
<a href="http://site.com/page">page</a>
But it will follow this:
<a href="http://site.com/page/">page</a>
You can use the --debug option to see if it's the actual problem.
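For example, something along these lines (using the same home_page_url placeholder as the question) writes the full decision log to a file you can then search for rejected links:

wget --debug -r -A pdf home_page_url -o wget_debug.log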
I don't know any good solution for this. In my opinion this is a bug.
In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
wget --no-verbose --mirror --spider https://example.com -o - | while read line
do
[[ $line == *'200 OK' ]] || continue
[[ $line == *'.pdf'* ]] || continue
echo $line | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done
Explanation: Recursively crawl https://example.com and pipe log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and trailing request info (7 characters) and use wget to download the PDF.
I've looked around for quite a while now and haven't figured out how to sort this out.
I'm trying to download files from a website, but only ever get an 'index.html' returned. This is useless to me, as I need the actual files.
I've been using commands like
wget --no-check-certificate -nc -nH -r -k -p -np --cut-dirs=3 \https://websitename/directory/folder_of_interest/
(I have my username and password set up in the .wgetrc file).
The above command recreates the recursive directory structure, and in the final directory there is just the index.html file.
I could really use a hand here.
In your question you have
wget \https://websitename/directory/folder_of_interest
This originally might have been
wget \
https://websitename/directory/folder_of_interest
which is correct because the backslash is escaping the newline, but in your example it is incorrectly escaping the h. Remove the backslash or move the URL to the next line.
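In other words, the command from the question should work once the URL is no longer escaped, either written on a single line without the backslash or with the URL moved to the next line, for example (websitename and the path are the placeholders from the question):

wget --no-check-certificate -nc -nH -r -k -p -np --cut-dirs=3 \
    "https://websitename/directory/folder_of_interest/"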
I'd like to download web pages while supplying the URLs from stdin. Essentially, one process continuously produces URLs to stdout/a file and I want to pipe them to wget or curl. (Think of it as a simple web crawler if you want.)
This seems to work fine:
tail 1.log | wget -i - -O - -q
But when I use 'tail -f', it doesn't work anymore (buffering, or is wget waiting for EOF?):
tail -f 1.log | wget -i - -O - -q
Could anybody provide a solution using wget, curl or any other standard Unix tool? Ideally I don't want to restart wget in a loop, just keep it running and downloading URLs as they come.
What you need to use is xargs. E.g.
tail -f 1.log | xargs -n1 wget -O - -q
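If you also want a few downloads in flight at once, xargs can handle that as well (a sketch; -P 4 is an arbitrary degree of parallelism, and note that with -O - the output of parallel downloads may interleave):

tail -f 1.log | xargs -n1 -P 4 wget -O - -q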
Use xargs, which converts stdin into arguments.
tail 1.log | xargs -L 1 wget
Try piping the tail -f through python -c $'import pycurl;c=pycurl.Curl()\nwhile True: c.setopt(pycurl.URL,raw_input().strip()),c.perform()'
This gets curl (well, you probably meant the command-line curl and I'm calling it as a library from a Python one-liner, but it's still curl) to fetch each URL immediately, while still taking advantage of keeping the socket to the server open if you're requesting multiple URLs from the same server in sequence. It's not completely robust though: if one of your URLs is duff, the whole command will fail (you might want to make it a proper Python script and add try / except to handle this), and there's also the small detail that it will throw EOFError on EOF (but I'm assuming that's not important if you're using tail -f).
An effective way, if you are downloading files from the same web server, is to avoid xargs altogether:
wget -q -N -i - << EOF
http://sitename/dir1/file1
http://sitename/dir2/file2
http://sitename/dir3/file3
EOF
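The same single-process approach works when the URLs are already sitting in a file of your own, say urls.txt:

wget -q -N -i urls.txt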