wget won't download actual files - unix

I've looked around for quite a while now and haven't figured out how to sort this out.
I'm trying to download files from a website, but only ever get an 'index.html' returned. This is useless to me, as I need the actual files.
I've been using commands like
wget --no-check-certificate -nc -nH -r -k -p -np --cut-dirs=3 \https://websitename/directory/folder_of_interest/
(I have my username and password set up in the .wgetrc file).
The above command recreates the directory tree recursively, but the final directory contains only an index.html file.
I could really use a hand here.

In your question you have
wget \https://websitename/directory/folder_of_interest
This originally might have been
wget \
https://websitename/directory/folder_of_interest
which is correct because the backslash escapes the newline, but in your example it incorrectly escapes the h. Remove the backslash or move the URL to the next line.
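For example, using the options from your original command, either drop the backslash entirely or keep it as a line continuation with the URL on the next line:
wget --no-check-certificate -nc -nH -r -k -p -np --cut-dirs=3 \
  https://websitename/directory/folder_of_interest/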

Related

SCP issue with multiple files - UNIX

I'm getting an error when copying multiple files. The command below copies only the first file and gives an error for the rest. Can someone please help me out?
Command:
scp $host:$(ssh -n $host "find /incoming -mmin -120 -name 2018*") /incoming/
Result:
user#host:~/scripts/OTA$ scp $host:$(ssh -n $host "find /incoming -mmin -120 -name 2018*") /incoming/
Password:
Password:
2018084session_event 100% |**********************************************************************************************************| 9765 KB 00:00
cp: cannot access /incoming/2018084session_event_log.195-10.45.40.9
cp: cannot access /incoming/2018084session_event_log.195-10.45.40.9_2_3
Your command uses Command Substitution to generate a list of files. Your assumption is that there is some magic in the "source" notation for scp that would cause multiple members of the list generated by your find command to be assumed to live on $host, when in fact your command might expand into something like:
scp remotehost:/incoming/someoldfile anotheroldfile /incoming
Only the first file is being copied from $host, because none of the rest include $host: at the beginning of the path. They're not found in your local /incoming directory, hence the error.
Oh, and in addition, you haven't escaped the asterisk in the find command, so 2018* may expand to multiple files in the login directory of the user in question. I can't tell from here; it depends on your OS and shell configuration.
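For instance, quoting the pattern inside the remote command keeps the remote shell from expanding it before find sees it (a sketch, using your original criteria):
ssh -n "$host" "find /incoming -mmin -120 -name '2018*'"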
I should point out that you are providing yet another example of the classic Parsing LS problem. Special characters WILL break your command. The "better" solution usually offered for this problem tends to be to use a for loop, but that's not really what you're looking for. Instead, I'd recommend making a tar of the files you're looking for. Something like this might do:
ssh "$host" "find /incoming -mmin -120 -name 2018\* -exec tar -cf - {} \+" |
tar -xvf - -C /incoming
What does this do?
ssh runs a remote find command with your criteria.
find feeds the list of filenames (regardless of special characters) to a tar command as options.
The tar command sends its result to stdout (-f -).
That output is then piped into another tar running on your local machine, which extracts the stream.
If your tar doesn't support -C, you can either remove it and run a cd /incoming before the ssh, or you might be able to replace that pipe segment with a curly-braced command: { cd /incoming && tar -xvf -; }
The curly brace notation assumes a POSIX-like shell (bash, zsh, etc). The rest of this should probably work equally well in csh if that's what you're stuck with.
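For reference, the curly-brace variant of the whole pipeline would look something like this (same best-effort caveat as below):
ssh "$host" "find /incoming -mmin -120 -name 2018\* -exec tar -cf - {} \+" |
  { cd /incoming && tar -xvf -; }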
Limited warranty: Best Effort Only. Untested on animals or computers. Your mileage may vary. May contain nuts.
If this doesn't work for you, poke at it until it does.

wget, recursively download all jpegs works only on website homepage

I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download JPEG files (-A jpg) and store them all in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the jpegs from the homepage, not all the jpegs from the whole website.
I know that a jpeg file could have different extensions (jpg, jpeg) and so on, but that is not the case here; there also aren't any robots.txt restrictions in effect.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug, or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., there is an explicit extension. Then the option -A jpg has the "side-effect" of removing those pages from the traversal process.
Perhaps a bit dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me in the right direction: the --accept acclist parameter has a dual role, defining both the rules for saving files and the rules for following links.
Reading the manual more closely, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could be using custom extensions.
So I think the most robust solution would be a bash script, something like this:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."
image_urls=`wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)'`
for image_url in $image_urls; do
    DESTFILE="$DEST_DIR/$RANDOM.jpg"
    wget "$image_url" -O "$DESTFILE"
done
With --spider, wget will not download the pages; it just checks that they are there.
$RANDOM asks the shell for a random number, used here to give each saved file a unique name.
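If you'd rather keep each file's original name instead of a random one (a sketch, assuming the URLs end in the actual filenames), a small tweak to the loop would be:
for image_url in $image_urls; do
    # derive the local name from the URL instead of using $RANDOM
    DESTFILE="$DEST_DIR/$(basename "$image_url")"
    wget "$image_url" -O "$DESTFILE"
done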

Using wget to download a file from a password protected link

I am trying to use wget to download a file from a http link that is password protected. I am using the following syntax:
wget --http-user=user --http-password=xxxxxx http://......
Am I using the right syntax? Should user and password be surrounded by quotes or double quotes?
I did this a few years ago and luckily found the script in a backup I still have.
I remember it was a two-stage process.
The first step is to get and store the cookie(s):
wget --keep-session-cookies --save-cookies nameofcookiesfile.txt --post-data 'email=my.email@address.com&password=mypassword123' https://web.site.com/redirectLogin -O login.html
The second is to use those cookies to get the file/page you need:
wget --load-cookies nameofcookiesfile.txt -p http://web.site.com/section/ -O savedoutputfile.html -nv
These are the commands exactly as I used them (except I have changed usernames, passwords, filenames and websites). I also came across this link which may be of some assistance, particularly the "referer" part:
http://www.linuxscrew.com/2012/03/20/wget-cookies/
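If the site also checks the Referer header, wget's --referer option can be added to the second step; a sketch (the login URL is just a guess based on the commands above):
wget --load-cookies nameofcookiesfile.txt --referer="https://web.site.com/redirectLogin" -p http://web.site.com/section/ -O savedoutputfile.html -nv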
Hope this helps or at least gives someone a starting point.

Get list of files via http server using cli (zsh/bash)

Greetings to everyone,
I'm on OSX. I use the terminal a lot, a habit from my old Linux days that I never got past. I wanted to download the files listed on this http server: http://files.ubuntu-gr.org/ubuntistas/pdfs/
I selected them all with the mouse, put them in a txt file, and then ran the following command in the terminal:
for i in `cat ../newfile`; do wget http://files.ubuntu-gr.org/ubuntistas/pdfs/$i;done
I guess it's pretty self-explanatory.
I was wondering if there's any easier, better, cooler way to download these linked pdf files using wget or curl.
Regards
You can do this with one line of wget as follows:
wget -r -nd -A pdf -I /ubuntistas/pdfs/ http://files.ubuntu-gr.org/ubuntistas/pdfs/
Here's what each parameter means:
-r makes wget recursively follow links
-nd avoids creating directories so all files are stored in the current directory
-A restricts the files saved by type
-I restricts by directory (this one is important if you don't want to download the whole internet ;)
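A roughly equivalent way to fence wget in, since the crawl already starts inside the target directory, is --no-parent (-np), which stops it from climbing above the starting path:
wget -r -np -nd -A pdf http://files.ubuntu-gr.org/ubuntistas/pdfs/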

Post to Twitter using Terminal with CURL

I got this far:
:~ curl -u username:password -d status="new_status" http://twitter.com/statuses/update.xml
Now, how can I alias this with variables so I can easily tweet from Terminal? How can I make the alias persist across sessions (when I close Terminal, my aliases reset)?
Thanks!
Basic Authentication is no longer supported by twitter. Please use OAuth.
You clearly have the alias command: stick it in your ~/.bashrc and it will be set up when your bash shell starts. (.shrc should also work for sh-like shells.)
If you stick it in a script file as the previous answer suggests:
(a) add the line
#!/bin/sh
at the top;
(b) make sure it's on your path or you'll have to type the whole path to the script when you want to run it.
(c) make it executable:
chmod +x tweet.sh
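Putting (a) through (c) together with the script from the other answer, tweet.sh would be just this (the endpoint is the one from the question; as noted elsewhere in the thread, Twitter no longer accepts Basic Auth, so treat it as a shell illustration):
#!/bin/sh
# tweet.sh - post the first argument as the status text
curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml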
What about putting it in a file and using the first argument as $1:
# tweet.sh "post my status, moron!":
curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml
will that work?
You need to create a file in your home directory that will get referenced each time a new terminal opens.
Do a bit of research as to what to name the file, according to what type of shell you are using (tcsh looks for a file called .tcshrc while bash looks for .bashrc).
Once you have that file, make it executable by running:
chmod +x name_of_file
Then, in that file, create your alias (again, you'll need to research how to do this depending on what type of shell you are using). For tcsh, my alias looks like this:
alias tw 'curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Bash aliases use an equals sign. A bash alias would look something more like this:
alias tw='curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Note the change in the command after "status=": the \!^ tells the shell to insert the first argument passed after the alias itself.
Save your file.
You could then run an update to twitter by typing the following in a new terminal:
tw 'my first post to twitter via the terminal, using aliases'
Don't forget to escape 'special' characters (like exclamations) with the escape character, \ (i.e. \!)
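One caveat if you're in bash: unlike tcsh, bash aliases don't expand argument placeholders such as \!^, so the usual substitute is a small function in your ~/.bashrc (same curl command and same Basic Auth caveat as above):
tw() {
    # post the first argument as the status text
    curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml
}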
Since Basic Authentication is no longer supported by twitter, you have to use OAuth to achieve your goal.
But if you just want to post to Twitter from the terminal, there are many applications that can do it.
Take a look at Rainbowstream or t.
With rainbowstream, the following lines will let you tweet from the console:
$ sudo pip install rainbowstream
$ rainbowstream
[@yourscreenname]t whatever you want
