Wget - how to avoid downloading the next files when there is no connection - recursion

When using wget with the recursive option (-r), if page.html contains 100 download links to 100 different files and the connection between the PC and the device is interrupted (say, after the 20th file has been downloaded), the program keeps trying to download the other 80 files, wasting a lot of time.
I tried these options, but nothing worked:
-T 1
-t 1
--dns-timeout=1
--connect-timeout=1
--read-timeout=1
Is there a way to stop wget after N minutes of no connection?
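For reference, a combined invocation of the options above would look something like this (the URL is a placeholder; this is only a sketch of what was tried, not a solution):
# recursive download with every retry/timeout option set to its minimum
wget -r -t 1 -T 1 --dns-timeout=1 --connect-timeout=1 --read-timeout=1 \
     http://device.example/page.html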

Related

File Transfer to Hadoop HDFS from remote linux server

I need to transfer files from a remote Linux server directly to HDFS.
I have a keytab placed on the remote server; after the kinit command it is activated, but I still cannot browse the HDFS folders. I know that from the edge nodes I can copy files directly to HDFS, but I need to skip the edge node and transfer the files to HDFS directly.
How can we achieve this?
Let's assume a couple of things first. You have one machine on which the external hard drive is mounted (named DISK) and one cluster of machines with ssh access to the master (in the commands below, master stands for the user@hostname part of the master machine). You run the script on the machine with the drive. The data on the drive consists of multiple directories with multiple files in each (around 100); the numbers don't matter, they are just there to justify the loops. The path to the data is stored in the ${DIR} variable (on Linux it would be /media/DISK, on Mac OS X /Volumes/DISK). Here is what the script looks like:
DIR=/Volumes/DISK;
for d in $(ls ${DIR}/); do
  for f in $(ls ${DIR}/${d}/); do
    cat ${DIR}/${d}/${f} | ssh master "hadoop fs -put - /path/on/hdfs/${d}/${f}";
  done;
done;
Note that we go over each file and we copy it into a specific file because the HDFS API for put requires that "when source is stdin, destination must be a file."
Unfortunately, it takes forever. When I came back the next morning, it had only done a fifth of the data (100 GB) and was still running... basically 20 minutes per directory! I ended up going with the solution of copying the data temporarily onto one of the cluster machines and then copying it locally to HDFS. For space reasons, I did it one folder at a time, deleting the temporary folder immediately after. Here is what that script looks like:
DIR=/Volumes/DISK;
PTH=/path/on/one/machine/of/the/cluster;
for d in $(ls ${DIR}/); do
  scp -r -q ${DIR}/${d} master:${PTH}/
  ssh master "hadoop fs -copyFromLocal ${PTH}/${d} /path/on/hdfs/";
  ssh master "rm -rf ${PTH}/${d}";
done;
Hope it helps!

How to display Jupyter Notebook connection info while everything else sent to a log file?

I am writing a developer tool, part of which will launch a Jupyter notebook in the background with output sent to a particular file, such as
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root \
>> ${NOTEBOOK_LOGFILE} 2>&1 &
However, I still want the notebook's start-up information to be printed to the console via stdout. Such as
[I 18:25:33.166 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 18:25:33.189 NotebookApp] Serving notebooks from local directory: /faces
[I 18:25:33.189 NotebookApp] 0 active kernels
[I 18:25:33.189 NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=b02f25972...
[I 18:25:33.189 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:25:33.189 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://0.0.0.0:8888/?token=b02f25972...
so that users can still see which URL connection string they need.
I have tried to cat this file after the notebook command but this has some downsides.
The time it takes for the notebook to launch and print the message is variable, and combining sleep with cat on the log file is undesirable: if there's a rare delay in start-up time, cat might print nothing because the file is still empty.
On the other hand, I don't want to set the sleep time to an overly high number, because then users will have to wait too long at startup.
I have also tried tail -f ${NOTEBOOK_LOGFILE} | head -n 10 (because the start-up lines will be the first 10). This is promising, but the notebook server does not write the newline for a line until the next line is coming, so if you wait for 10 lines, the pipeline will hang until some other message is logged to the log file (producing the 10th newline).
How can I ensure that the start-up information is displayed to stdout in a timely fashion from when the notebook outputs this information, while still redirecting notebook output into a log file?
I figured out a hack to do it with tail and head, but I would be interested in something simpler.
(tail -f -n +1 ${NOTEBOOK_LOGFILE} | head -n 5);
This relies on the fact that the connection URL is also printed among the first 5 lines, so the missing trailing newline that keeps head waiting doesn't matter the way it would if you tried to extract lines 9 and 10.
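Another possible approach (only a sketch, assuming the connection line always contains 'token=' and that grep supports -m) is to poll the log file until the URL shows up, which avoids depending on a fixed line count or on trailing newlines:
# wait until the line with the connection URL appears in the log, then print it
until grep -q 'token=' "${NOTEBOOK_LOGFILE}" 2>/dev/null; do
  sleep 0.5
done
grep -m 1 'token=' "${NOTEBOOK_LOGFILE}"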

Fastest way to move 90 million files (270GB) between two NFS 1Gb/s folders

I need to move 90 million files from one NFS folder to a second NFS folder; both NFS mounts use the same eth0 interface, which is 1 Gb/s to the NFS servers. Sync is not needed, only a move (overwriting if the file exists). I think my main problem is the number of files, not the total size. The best way should be the one with the fewest system calls per file to the NFS folders.
I tried cp, rsync, and finally parsync (http://moo.nac.uci.edu/~hjm/parsync/). parsync first took 10 hours to generate a 12 GB gzip of the file list; after another 40 hours not a single file had been copied. It was running with 10 threads until I cancelled it and started debugging, and with the -vvv option I found it is making a call (stat?) again for each file in the list (it uses rsync):
[sender] make_file(accounts/hostingfacil/snap.2017-01-07.041721/hostingfacil/homedir/public_html/members/vendor/composer/62ebc48e/vendor/whmcs/whmcs-foundation/lib/Domains/DomainLookup/Provider.php,*,0)*
the parsync command is:
time parsync --rsyncopts="-v -v -v" --reusecache --NP=10 --startdir=/nfsbackup/folder1/subfolder2 thefolder /nfsbackup2/folder1/subfolder2
Each rsync has this form:
rsync --bwlimit=1000000 -v -v -v -a --files-from=/root/.parsync/kds-chunk-9 /nfsbackup/folder1/subfolder2 /nfsbackup2/folder1/subfolder2
The NFS folders are mounted:
server:/export/folder/folder /nfsbackup2 nfs auto,noexec,noatime,nolock,bg,intr,tcp,actimeo=1800,nfsvers=3,vers=3 0 0
Any idea how to instruct rsync to copy the files already in the list from the nfs folder to the nfs2 folder? Or any other way to make this copy efficient (one system call per file?)
I've had issues doing the same thing once, and I found that it's best to just run a find command and move each file individually.
cd /origin/path
find . | cpio -updm ../destination/
The -u option will overwrite existing files.
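Spelled out with the paths from the question above and the flag meanings annotated (same command, just written with absolute paths):
cd /nfsbackup/folder1/subfolder2
# -p: copy-pass mode (copy the tree straight into another directory)
# -u: unconditional, overwrite files that already exist
# -d: create leading directories as needed
# -m: preserve file modification times
find . | cpio -pudm /nfsbackup2/folder1/subfolder2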

wget download limited by http address

I want to download a website, using its address as a download limiter.
To be precise...
The website address is http://das.sdss.org/spectro/1d_26/
It contains around 2700 subsites with data. I want to limit the recursive download of all files so that I only download the subsites from:
http://das.sdss.org/spectro/1d_26/0182/
to
http://das.sdss.org/spectro/1d_26/0500/
Using this tutorial I have made a wget command:
wget64 -r -nH -np -N http://das.sdss.org/spectro/1d_26/{0182..0500}/
but the last part of the address gives me a 404 error.
Is there a mistake in my command or is the tutorial faulty?
P.S. I know it's possible to achieve with -I lists but I want to do it this way if it's possible.
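One thing to note is that brace expansion is done by the shell, not by wget, so whether {0182..0500} gets expanded depends on which shell wget64 is invoked from. If the braces are reaching the server literally, an explicit loop is one way to generate the range; a sketch assuming a bash-like shell with seq available:
# fetch each numbered subsite from 0182 through 0500 in turn
for i in $(seq 182 500); do
  wget64 -r -nH -np -N "http://das.sdss.org/spectro/1d_26/0${i}/"
done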

Why would HTTP transfer via wget be faster than lftp/pget?

I'm building software that needs to do massive amounts of file transfer via both HTTP and FTP. Oftentimes I get faster HTTP downloads with a multi-connection download accelerator like axel or lftp with pget. In some cases, I've seen 2x-3x faster file transfers using something like:
axel http://example.com/somefile
or
lftp -e 'pget -n 5 http://example.com/somefile;quit'
vs. just using wget:
wget http://example.com/somefile
But other times, wget is significantly faster than lftp. Strangely, this is true even when I use lftp with pget and a single connection, like so:
lftp -e 'pget -n 1 http://example.com/somefile;quit'
I understand that downloading a file via multiple connections won't always result in a speedup, depending on how bandwidth is constrained. But: why would it be slower? Especially when calling lftp/pget with -n 1?
Is it possible that the HTTP server is compressing the stream using gzip? I can't remember whether wget handles gzip Content-Encoding or not. If it does, then this might explain the performance boost. Another possibility is that there is an HTTP cache somewhere in the pipeline. You can try something like
wget --no-cache --header="Accept-Encoding: identity"
and compare this to your FTP-based transfer times.
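A rough way to time the two HTTP fetches side by side (only a sketch; -O /dev/null discards the downloaded body, and the URL is the placeholder from the question):
# plain wget with server-side caching and response compression disabled
time wget --no-cache --header="Accept-Encoding: identity" -O /dev/null http://example.com/somefile
# single-connection pget for comparison
time lftp -e 'pget -n 1 http://example.com/somefile; quit'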
