How to resolve "Invalid URL http://: Invalid host name" using wget - web-scraping

I am trying to use wget to download PDFs from a repository. I have a list of URLs saved to a text file that I am feeding to wget.
Example URL in text file:
https://digitalscholarship.unlv.edu/cgi/viewcontent.cgi?article=3849&context=thesesdissertations
Error returned:
Invalid URL http://: Invalid host name
Example command:
wget -i etd_engineering_list.txt
The goal is to download all PDFs located at the URLs in the etd_engineering_list.txt file.

Here's a simple bash script that should do the job.
#!/bin/bash
input="./etd_engineering_list.txt"
while IFS= read -r line
do
wget "$line"
done < "$input"
Based on the example here.
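If the error persists, it is often (an assumption here, since the list file itself isn't shown) caused by Windows line endings or blank lines in the text file, which make wget see a URL with an empty host. A quick sanitizing pass before feeding the list to wget costs nothing:
# strip carriage returns and blank lines, then feed the cleaned list to wget
tr -d '\r' < etd_engineering_list.txt | grep -v '^$' > etd_engineering_list.clean.txt
wget -i etd_engineering_list.clean.txt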

Related

OpenStack Swift cannot use bulk operation to auto extract tar file

I want to upload many files with a single operation in OpenStack Swift. I found the middleware -- Bulk Operations -- which can auto-extract files from a tar archive. However, I failed to extract the files from the tar.
I PUT the tar file using the bulk operation like this:
curl -X PUT http://127.0.0.1:8080/v1/AUTH_test/ContainerName/$?extract-archive=tar \
-T theTarName.tar \
-H "Content-Type: text/plain" \
-H "X-Auth-Token: token"
I am sure that the storageURL, tar file path, and token are correct. But I didn't get any response (success or error). When I list the objects in the container, I find just one object named 0extract-archive=tar was uploaded, and the files in the tar were not extracted.
I want to know how to extract the tar automatically in OpenStack Swift so that all of the files in the tar appear in the container.
Thanks in advance.
The issue is the $? part. In bash (which I suspect you're using), $? refers to the exit code of the last command (http://tldp.org/LDP/abs/html/exit-status.html).
If you'd like to use $ as the archive prefix, consider escaping it with \:
$ curl -X PUT \
"http://127.0.0.1:8080/v1/AUTH_test/container/\$?extract-archive=tar" \
-T test.tar \
-H "X-Auth-Token: <token>"
You should get the following output:
Number Files Created: 3
Response Body:
Response Status: 201 Created
Errors:
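An equivalent way to stop the shell from expanding $? is to single-quote the whole URL instead of escaping the dollar sign; a minimal sketch using the same placeholder names as above:
# single quotes prevent the shell from expanding $? before curl sees the URL
curl -X PUT \
  'http://127.0.0.1:8080/v1/AUTH_test/container/$?extract-archive=tar' \
  -T test.tar \
  -H "X-Auth-Token: <token>"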

wget, recursively download all jpegs works only on website homepage

I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download jpeg files (-A jpg) and store them all in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the jpegs from the homepage of the website, not the jpegs from the rest of the site.
I know that a jpeg file could have different extensions (jpg, jpeg) and so on, but that is not the case here; there also aren't any robots.txt restrictions in play.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug, or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., there is an explicit extension. The option -A jpg then has the "side-effect" of removing those pages from the traversal process.
Perhaps a bit dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me in the right direction: the --accept acclist parameter has a dual role; it defines both the rules for saving files and the rules for following links.
Reading the manual more deeply, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will match be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could be using custom extensions, so I think that the most robust solution would be a bash script, something like this:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
image_urls=`wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)'`
for image_url in $image_urls; do
DESTFILE="$DEST_DIR/$RANDOM.jpg"
wget "$image_url" -O "$DESTFILE"
done
--spider tells wget not to download the pages, just to check that they are there
$RANDOM asks the shell for a random number
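If you would rather keep the original file names instead of random ones, a variation on the same idea (a sketch, assuming the same WEBSITE and DEST_DIR variables and the same log format that the grep/awk pipeline above relies on) is to pipe the extracted URLs straight into wget:
# spider the site, extract the jpeg URLs from the log, and let wget read them from stdin
wget -nd --spider -r "$WEBSITE" 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -i '\.\(jpeg\|jpg\)' \
  | wget -nd -i - -P "$DEST_DIR"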

Download all files of a particular type from a website using wget stops in the starting url

The following did not work.
wget -r -A .pdf home_page_url
It stops with the following message:
....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED
I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type.
Is there any other way to recursively download all PDF files from a website?
It may be caused by robots.txt. Try adding -e robots=off.
Other possible problems are cookie-based authentication or user-agent rejection for wget.
See these examples.
EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
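Putting those suggestions together, a sketch that drops the dot from the accept list, ignores robots.txt, and sends a browser-like user agent (home_page_url stands in for the real start page):
wget -r -A pdf -e robots=off --user-agent="Mozilla/5.0" home_page_url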
The following command works for me; it will download the pictures of a site:
wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
This is certainly because the links in the HTML don't end with /.
Wget will not follow a link like this, since it thinks it points to a file (which doesn't match your filter):
<a href="http://site.com/page">page</a>
But it will follow this:
<a href="http://site.com/page/">page</a>
You can use the --debug option to see if it's the actual problem.
I don't know any good solution for this. In my opinion this is a bug.
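For example, assuming your wget build has debug output enabled, you can save the log and then search it for the links wget decided to reject:
# save the full crawl log, then look for the rejected links
wget -r -A .pdf --debug home_page_url 2>&1 | tee wget-debug.log
grep -i 'reject' wget-debug.log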
In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
wget --no-verbose --mirror --spider https://example.com -o - | while read line
do
[[ $line == *'200 OK' ]] || continue
[[ $line == *'.pdf'* ]] || continue
echo $line | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done
Explanation: Recursively crawl https://example.com and pipe the log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and trailing request info (7 characters), and use wget to download the PDF.
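Because the fixed offsets (25 and 7 characters) depend on wget's exact log line format, a slightly more tolerant variant is sketched below; it assumes the log lines look like "TIMESTAMP URL: https://... 200 OK" and pulls the URL out with a regular expression instead:
# crawl, keep only PDF lines that returned 200 OK, extract the URL, and download it
wget --no-verbose --mirror --spider https://example.com -o - \
  | grep ' 200 OK$' \
  | grep -o 'https\?://[^ ]*\.pdf' \
  | xargs -r wget --no-verbose -P scraped-files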

How to ssh to a certain directory and then be able to use the FIND command to find certain files and gunzip them

Hi everyone, I'm new to ksh. What I'm trying to do is write a script to scp one (or many) zip files from a local directory to a remote host, and then have the script ssh into the remote host to gunzip the files I just scp'd over. Is there any simple way to do this? I keep trying, but once I ssh over to the remote host the rest of my commands no longer run, like the cd /file/directory and then gzip -d /files etc.
NB: don't confuse "zip" and "gzip", two different animals
This should work:
cd <local_directory>
# collect files names as $1 $2 ... $N
set -- *.gz # or use your own filter like "dumps*.gz"
# put source file a tar archive and send it as input to ssh
# then, on the other side, untar the file then decompress
tar cf - $* | ssh <user>@<remote_host> "cd <remote dir> && tar xf - && gunzip $*"
Note: "&&" is used instead of ";" to prevent the remote "tar" and "gunzip" commands from being executed if "cd" fails for any reason.
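If you would rather keep the original scp-then-ssh flow from the question, the key point is that the remote commands have to be passed as a single argument to ssh so they all run on the remote side; a minimal sketch with placeholder host and paths:
# copy the archives, then run cd and gunzip in one remote shell invocation
scp dumps*.gz <user>@<remote_host>:/remote/dir/
ssh <user>@<remote_host> 'cd /remote/dir && gunzip *.gz'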

Post to Twitter using Terminal with CURL

I got this far:
:~ curl -u username:password -d status="new_status" http://twitter.com/statuses/update.xml
Now, how can I alias this with variables so I can easily tweet from Terminal? How can I make the alias persist across sessions (when I close Terminal, aliases reset)?
Thanks!
Basic Authentication is no longer supported by twitter. Please use OAuth.
You clearly have the alias command: stick it in your ~/.bashrc and it will be set up when your bash shell starts. (.shrc should also work for sh-like shells.)
If you stick it in a script file as the previous answer suggests:
(a) add the line
#!/bin/sh
at the top;
(b) make sure it's on your path or you'll have to type the whole path to the script when you want to run it.
(c) to make it executable,
chmod +x tweet.sh
What about putting it in a file and using the first argument as $1:
# tweet.sh "post my status, moron!":
curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml
will that work?
You need to create a file in your home directory that will get referenced each time a new terminal opens.
Do a bit of research as to what to name the file, according to what type of shell you are using (tcsh looks for a file called .tcshrc while bash looks for .bashrc).
Once you have that file, make it executable by running:
chmod +x name_of_file
Then, in that file, create your alias (again, you'll need to research how to do this depending on what type of shell you are using). For tcsh, my alias looks like this:
alias tw 'curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Bash aliases use an equals sign. A bash alias would look something more like this:
alias tw='curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Note the change in the command after "status=". The \!^ tells the line of code to insert the first argument passed after the alias itself.
Save your file.
You could then run an update to twitter by typing the following in a new terminal:
tw 'my first post to twitter via the terminal, using aliases'
Don't forget to escape 'special' characters (like exclamations) with the escape character, \ (i.e. \!)
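If you are using bash rather than tcsh, note that bash aliases do not expand arguments the way the \!^ history syntax does in csh; the usual bash equivalent is a small function in ~/.bashrc. A sketch with the same placeholder credentials and the (now defunct) endpoint from the question:
# tw "some status text" - post the first argument as a status update
tw() {
  curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml
}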
Since Basic Authentication is no longer supported by twitter, you have to use OAuth to achieve your goal.
But if you just want to post to Twitter from the terminal, there are many applications that can do it.
Take a look at Rainbowstream or t.
With Rainbowstream, the following lines will let you tweet from the console:
$ sudo pip install rainbowstream
$ rainbowstream
[@yourscreenname] t whatever you want
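The t command-line client mentioned above works in a similar way; a sketch, assuming you have Ruby installed and complete its one-time OAuth setup:
$ gem install t
$ t authorize
$ t update "whatever you want"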
