File Transfer to Hadoop HDFS from remote linux server

I need to transfer files from a remote Linux server directly to HDFS.
I have a keytab on the remote server. After running kinit the ticket is active, but I still cannot browse the HDFS folders. I know I can copy files to HDFS from the edge nodes, but I need to skip the edge node and transfer the files directly to HDFS.
How can we achieve this?

Let's assume a couple of things first. You have one machine on which the external hard drive is mounted (named DISK) and one cluster of machines with ssh access to the master (in the commands, master stands for the user@hostname part of the master machine). You run the script on the machine with the drive. The data on the drive consists of multiple directories with multiple files in each (around 100; the exact numbers don't matter, they just justify the loops). The path to the data is stored in the ${DIR} variable (on Linux it would be /media/DISK, and on Mac OS X /Volumes/DISK). Here is what the script looks like:
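DIR=/Volumes/DISK
# Loop over every directory on the drive, then over every file inside it.
for d in "${DIR}"/*/; do
  d=$(basename "${d}")
  for f in "${DIR}/${d}"/*; do
    f=$(basename "${f}")
    # "-" makes hadoop fs -put read the file contents from stdin
    cat "${DIR}/${d}/${f}" | ssh master "hadoop fs -put - '/path/on/hdfs/${d}/${f}'"
  done
done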
Note that we go over each file and we copy it into a specific file because the HDFS API for put requires that "when source is stdin, destination must be a file."
Unfortunately, it takes forever. When I came back the next morning, it had only done a fifth of the data (100GB) and was still running, taking about 20 minutes per directory! I ended up copying the data temporarily onto one of the machines and then copying it from there into HDFS. For space reasons, I did it one folder at a time and deleted the temporary folder immediately afterwards. Here is what the script looks like:
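DIR=/Volumes/DISK
PTH=/path/on/one/machine/of/the/cluster
for d in "${DIR}"/*/; do
  name=$(basename "${d}")
  # Stage the directory on the master, load it into HDFS, then delete the staging copy
  scp -r -q "${DIR}/${name}" "master:${PTH}/"
  ssh master "hadoop fs -copyFromLocal '${PTH}/${name}' /path/on/hdfs/"
  ssh master "rm -rf '${PTH}/${name}'"
done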
Hope it helps!

Related

scp remote file into hadoop without copying it to edge node

I want to copy a file from a remote server to Hadoop without copying it to the edge node.
1. Per the article below, we can do it in two steps: first scp to the local edge node, then run an hdfs dfs command to move the file from the edge node to HDFS.
https://community.cloudera.com/t5/Support-Questions/Import-data-from-remote-server-to-HDFS/td-p/233148
2. Per the question below, we can use ssh with cat, but we have files such as .gz which cannot be read with cat.
putting a remote file into hadoop without copying it to local disk
But I am looking for a third option where we can use scp instead of ssh with cat and copy to Hadoop without copying to the edge node.
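On the concern in option 2: cat copies bytes unchanged, so a gzip file should survive the pipe intact. A minimal local check, with a plain pipe standing in for the ssh connection; the scratch file names and the HDFS path in the comment are purely illustrative:

```shell
#!/bin/sh
# Create a small gzip file in a scratch directory.
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/sample.txt"
gzip -c "$workdir/sample.txt" > "$workdir/sample.txt.gz"

# Stream the compressed file through cat and a pipe. In the real setup the
# right-hand side would be something like:
#   ssh user@edge "hdfs dfs -put - /hdfs/path/sample.txt.gz"
cat "$workdir/sample.txt.gz" | cat > "$workdir/received.gz"

# Compare checksums of the original and the streamed copy.
sum_sent=$(cksum < "$workdir/sample.txt.gz")
sum_received=$(cksum < "$workdir/received.gz")
```

If the two checksums match, the compressed bytes were not altered by the pipe. Note that the ssh variant still passes through the edge node, it just does not write the file to its disk.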

Hadoop doesn't have an SCP upload feature.
If you want to get files in without an edge node or SSH, that is what WebHDFS or the NFS Gateway offer.
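For the WebHDFS route, a minimal sketch of the two-step upload might look like the following. It assumes WebHDFS is enabled on the NameNode, that the NameNode HTTP port is 9870 (the Hadoop 3 default; Hadoop 2 used 50070), and that curl was built with SPNEGO support if the cluster is Kerberized. The host name, paths, and function names are hypothetical.

```shell
#!/bin/sh
# Build the WebHDFS URL for creating a file (op=CREATE).
# Arguments: NameNode host, HTTP port, absolute HDFS path.
webhdfs_create_url() {
    printf 'http://%s:%s/webhdfs/v1%s?op=CREATE&overwrite=true\n' "$1" "$2" "$3"
}

# Upload a local file in two steps, as the WebHDFS REST API requires:
#   1. PUT to the NameNode without a body; it answers with a 307 redirect
#      whose Location header names a DataNode.
#   2. PUT the file body to that Location.
# "--negotiate -u :" makes curl use the Kerberos ticket obtained by kinit.
# Arguments: local file, NameNode host, HTTP port, absolute HDFS path.
webhdfs_put() {
    local_file=$1
    url=$(webhdfs_create_url "$2" "$3" "$4")
    location=$(curl -sS -i --negotiate -u : -X PUT "$url" |
        tr -d '\r' | sed -n 's/^[Ll]ocation: //p')
    [ -n "$location" ] || { echo "no redirect from NameNode" >&2; return 1; }
    curl -sS --negotiate -u : -X PUT -T "$local_file" "$location"
}

# Hypothetical usage (not run here):
# webhdfs_put report.csv.gz namenode.example.com 9870 /data/incoming/report.csv.gz
```

The sending machine needs network access to the NameNode and to the DataNodes named in the redirect, so this only helps when the firewall allows that.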

Transfer using a pipe
mkfifo creates a named pipe on the local server (the pipe itself does not store any data).
Try:
mkfifo <pipename - some path on your server where ssh keys are present> | scp : | hdfs dfs -put | rm

Fastest way to move 90 million files (270GB) between two NFS 1Gb/s folders

I need to move 90 million files from an NFS folder to a second NFS folder. Both connections to the NFS folders use the same eth0, which is 1Gb/s to the NFS servers. Sync is not needed, only a move (overwriting if the file exists). I think my main problem is the number of files, not the total size. The best way should be the one with the fewest system calls per file to the NFS folders.
I tried cp, rsync, and finally parsync (http://moo.nac.uci.edu/~hjm/parsync/). parsync first took 10 hours to generate a 12 GB gzip of the file list; after that it ran for 40 hours without copying a single file. It was running 10 threads until I canceled it and started debugging. With the -vvv option (it uses rsync) I found that it makes another call (stat?) for each file in the list:
[sender] make_file(accounts/hostingfacil/snap.2017-01-07.041721/hostingfacil/homedir/public_html/members/vendor/composer/62ebc48e/vendor/whmcs/whmcs-foundation/lib/Domains/DomainLookup/Provider.php,*,0)*
the parsync command is:
time parsync --rsyncopts="-v -v -v" --reusecache --NP=10 --startdir=/nfsbackup/folder1/subfolder2 thefolder /nfsbackup2/folder1/subfolder2
Each rsync has this form:
rsync --bwlimit=1000000 -v -v -v -a --files-from=/root/.parsync/kds-chunk-9 /nfsbackup/folder1/subfolder2 /nfsbackup2/folder1/subfolder2
The NFS folders are mounted:
server:/export/folder/folder /nfsbackup2 nfs auto,noexec,noatime,nolock,bg,intr,tcp,actimeo=1800,nfsvers=3,vers=3 0 0
Any idea how to instruct the rsync to copy the files already in the list from the nfs to the nfs2 folder? Or any way to make this copy efficiently (one system call per file?)

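I had issues doing the same thing once, and I found that it's best to just run a find command and move each file individually.
cd /origin/path
find . | cpio -updm ../destination/
The -u option overwrites existing files unconditionally; -p is pass-through (copy) mode, -d creates directories as needed, and -m preserves modification times.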

scp or sftp copy multiple files with single command

I'd like to copy files from/to remote server in different directories.
For example, I want to run these 4 commands at once.
scp remote:A/1.txt local:A/1.txt
scp remote:A/2.txt local:A/2.txt
scp remote:B/1.txt local:B/1.txt
scp remote:C/1.txt local:C/1.txt
What is the easiest way to do that?

Copy multiple files from remote to local:
$ scp your_username@remote.edu:/some/remote/directory/\{a,b,c\} ./
Copy multiple files from local to remote:
$ scp foo.txt bar.txt your_username@remotehost.edu:~
$ scp {foo,bar}.txt your_username@remotehost.edu:~
$ scp *.txt your_username@remotehost.edu:~
Copy multiple files from remote to remote:
$ scp your_username@remote1.edu:/some/remote/directory/foobar.txt \
your_username@remote2.edu:/some/remote/directory/
Source: http://www.hypexr.org/linux_scp_help.php

From local to server:
scp file1.txt file2.sh username@ip.of.server.copyto:~/pathtoupload
From server to local (up to OpenSSH v9.0):
scp -T username@ip.of.server.copyfrom:"file1.txt file2.txt" ~/yourpathtocopy
From server to local (OpenSSH v9.0+):
scp -OT username@ip.of.server.copyfrom:"file1.txt file2.txt" ~/yourpathtocopy
From man 1 scp:
-O Use the legacy SCP protocol for file transfers instead of the SFTP protocol. Forcing the use of the
SCP protocol may be necessary for servers that do not implement SFTP, for backwards-compatibility for
particular filename wildcard patterns and for expanding paths with a ‘~’ prefix for older SFTP
servers.
HISTORY
Since OpenSSH 9.0, scp has used the SFTP protocol for transfers by default.

You can copy whole directories with the -r switch, so if you can isolate your files into their own directory, you can copy everything at once.
scp -r ./dir-with-files user@remote-server:upload-path
scp -r user@remote-server:path-to-dir-with-files download-path
So, for instance:
scp -r root@192.168.1.100:/var/log ~/backup-logs
Or, if there are just a few of them, you can use:
scp 1.txt 2.txt 3.log user@remote-server:upload-path

As Jiri mentioned, you can use scp -r user@host:/some/remote/path /some/local/path to copy files recursively. This assumes that there's a single directory containing all of the files you want to transfer (and nothing else).
However, SFTP provides an alternative if you want to transfer files from multiple different directories, and the destinations are not identical:
sftp user@host << EOF
get /some/remote/path1/file1 /some/local/path1/file1
get /some/remote/path2/file2 /some/local/path2/file2
get /some/remote/path3/file3 /some/local/path3/file3
EOF
This uses the "here doc" syntax to define a sequence of SFTP input commands. As an alternative, you could put the SFTP commands into a text file and execute sftp -b batchFile.txt user@host
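The batch-file variant could be scripted along these lines. This is only a sketch: the function name, the tab-separated input format, and the paths are assumptions made for illustration.

```shell
#!/bin/sh
# Turn "remote<TAB>local" pairs read from stdin into sftp "get" commands.
# Paths are wrapped in double quotes so that names containing spaces survive.
make_sftp_batch() {
    while IFS="$(printf '\t')" read -r remote local_path; do
        [ -n "$remote" ] || continue
        printf 'get "%s" "%s"\n' "$remote" "$local_path"
    done
}

# Write two example transfers into a temporary batch file.
batch_file=$(mktemp)
printf '%s\t%s\n' \
    /some/remote/path1/file1 /some/local/path1/file1 \
    /some/remote/path2/file2 /some/local/path2/file2 |
    make_sftp_batch > "$batch_file"

# Then run it non-interactively (requires key-based login):
# sftp -b "$batch_file" user@host
```

Generating the batch file keeps everything in one SFTP session, so authentication happens only once regardless of how many directories are involved.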

The answers that use {file1,file2,file3} work only with bash (on the remote side or locally).
A more portable way is:
scp user@remote:'/path1/file1 /path2/file2 /path3/file3' /localPath

After playing with scp for a while I have found the most robust solution:
(Beware of the single and double quotation marks)
Local to remote:
scp -r "FILE1" "FILE2" HOST:'"DIR"'
Remote to local:
scp -r HOST:'"FILE1" "FILE2"' "DIR"
Notice that whatever comes after "HOST:" is sent to the remote side and parsed there, so we must make sure it is not processed by the local shell. That is why the single quotation marks are needed. The double quotation marks handle spaces in the file names.
If the files are all in the same directory, we can use * to match them all, such as
scp -r "DIR_IN"/*.txt HOST:'"DIR"'
scp -r HOST:'"DIR_IN"/*.txt' "DIR"
Compared to the "{}" syntax, which is supported only by some shells, this one is universal.

The simplest way is
local$ scp remote:{A/1,A/2,B/3,C/4}.txt ./
So {.. } list can include directories (A,B and C here are directories; "1.txt" and "2.txt" are file names in those directories).
Although it would copy all these four files into one local directory - not sure if that's what you wanted.
In the above case you will end up with the remote files A/1.txt, A/2.txt, B/3.txt and C/4.txt copied into a single local directory, with file names ./1.txt, ./2.txt, ./3.txt and ./4.txt

Problem: Copying multiple directories from remote server to local machine using a single SCP command and retaining each directory as it is in the remote server.
Solution: SCP can do this easily. This solves the annoying problem of entering password multiple times when using SCP with multiple folders. Consequently, this also saves a lot of time!
e.g.
# copies folders t1, t2, t3 from `test` to your local working directory
# note that there shouldn't be any space in between the folder names;
# we also escape the braces.
# please note the dot at the end of the SCP command
~$ cd ~/working/directory
~$ scp -r username@contact.server.de:/work/datasets/images/test/\{t1,t2,t3\} .
PS: Motivated by this great answer: scp or sftp copy multiple files with single command
Based on the comments, this also works fine in Git Bash on Windows

You can do it this way:
scp hostname@serverNameOrServerIp:/path/to/files/\{file1,file2,file3\}.fileExtension ./
This will download all the listed files to whatever local directory you're in.
Make sure not to put spaces between the filenames; separate them with a comma only.

Copy multiple directories:
scp -r dir1 dir2 dir3 admin@127.0.0.1:~/

It is simpler without scp:
tar cf - file1 ... file_n | ssh user@server 'tar xf -'
This also lets you do things like compress the stream (ssh -C) or, since OpenSSH v7.3, use -J to jump through one or more proxy servers.
Avoid using passwords by copying your public key to ~/.ssh/authorized_keys (on the server) with ssh-copy-id (on the client).
Posted also here (with more details) and here.

scp remote:"[A-C]/[12].txt" local:

NOTE: I apologize in advance for answering only a portion of the above question. However, I found these commands to be useful for my current unix needs.
Uploading specific files from a local machine to a remote machine:
~/Desktop/dump_files$ scp file1.txt file2.txt lab1.cpp etc.ext your-user-id@remotemachine.edu:Folder1/DestinationFolderForFiles/
Uploading an entire directory from a local machine to a remote machine:
~$ scp -r Desktop/dump_files your-user-id@remotemachine.edu:Folder1/DestinationFolderForFiles/
Downloading an entire directory from a remote machine to a local machine:
~/Desktop$ scp -r your-user-id@remote.host.edu:Public/web/ Desktop/

In my case, I am restricted to only using the sftp command.
So, I had to use a batchfile with sftp. I created a script such as the following. This assumes you are working in the /tmp directory, and you want to put the files in the destdir_on_remote_system on the remote system. This also only works with a noninteractive login. You need to set up public/private keys so you can login without entering a password. Change as needed.
#!/bin/bash
cd /tmp
# start script with list of files to transfer
ls -1 fileset1* > batchfile1
ls -1 fileset2* >> batchfile1
sed -i -e 's/^/put /' batchfile1
echo "cd destdir_on_remote_system" > batchfile
cat batchfile1 >> batchfile
rm batchfile1
sftp -b batchfile user@host

In the specific case where all the files share the same base name but have different suffixes (say, the number of a log file), you can use the following:
scp user_name@ip.of.remote.machine:/some/log/folder/some_log_file.* ./
This will copy all files named some_log_file from the given folder on the remote machine, i.e. some_log_file.1, some_log_file.2, some_log_file.3, and so on.

In my case there were too many files with non related names.
I ended up using,
$ for i in $(ssh remote 'ls ~/dir'); do scp remote:~/dir/$i ./$i; done
1.txt 100% 322KB 1.2MB/s 00:00
2.txt 100% 33KB 460.7KB/s 00:00
3.txt 100% 61KB 572.1KB/s 00:00
$

scp uses ssh for data transfer, with the same authentication and the same security as ssh.
A best practice here is to set up SSH keys and public key authentication. With this, you can write your scripts without worrying about authentication. Simple as that.
See WHAT IS SSH-KEYGEN

serverHomeDir='/home/somepath/ftp/'
backupDirAbsolutePath=${serverHomeDir}'_sqldump_'
backupDbName1='2021-08-27-03-56-somesite-latin2.sql'
backupDbName2='2021-08-27-03-56-somesite-latin1.sql'
backupDbName3='2021-08-27-03-56-somesite-utf8.sql'
backupDbName4='2021-08-27-03-56-somesite-utf8mb4.sql'
scp -i ~/.ssh/id_rsa user@server.domain.com:${backupDirAbsolutePath}/"{$backupDbName1,$backupDbName2,$backupDbName3,$backupDbName4}" .
. - at the end downloads the files to the current directory
-i ~/.ssh/id_rsa - the private key matching the public key installed on your server (scp -i expects the private key file, not the .pub file)

scp -r root@ip-address:/root/dir/ C:\Users\your-name\Downloads\
The -r option lets you download all the files inside the dir directory of your remote server.

putting a remote file into hadoop without copying it to local disk

I am writing a shell script to put data into Hadoop as soon as it is generated. I can ssh to my master node, copy the files to a folder there, and then put them into Hadoop. I am looking for a shell command that avoids copying the file to the local disk on the master node. To better explain what I need, here is what I have so far:
1) Copy the file to the master node's local disk:
scp test.txt username@masternode:/folderName/
I have already set up the SSH connection using keys, so no password is needed to do this.
2) Use ssh to remotely execute the hadoop put command:
ssh username@masternode "hadoop dfs -put /folderName/test.txt hadoopFolderName/"
What I am looking for is how to pipe/combine these two steps into one and skip the local copy of the file on the master node's local disk.
thanks
In other words, I want to pipe several command in a way that I can

Try this (untested):
cat test.txt | ssh username@masternode "hadoop dfs -put - hadoopFoldername/test.txt"
I've used similar tricks to copy directories around:
tar cf - . | ssh remote "(cd /destination && tar xvf -)"
This sends the output of local-tar into the input of remote-tar.
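To see the tar pipe in isolation, the same pattern can be tried locally. In this sketch a subshell stands in for the ssh remote command, and the scratch directories and file names are made up for the demonstration.

```shell
#!/bin/sh
# Scratch source and destination directories.
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/sub"
printf 'alpha\n' > "$src/a.txt"
printf 'beta\n' > "$src/sub/b.txt"

# The left tar writes an archive to stdout; the right tar reads it from stdin.
# Over the network the right-hand side would be:
#   ssh remote "(cd /destination && tar xf -)"
(cd "$src" && tar cf - .) | (cd "$dest" && tar xf -)
```

Because the archive only ever exists as a byte stream, no intermediate tar file is written on either side.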

Is the node where you generated the data able to reach each of your cluster nodes (the name node and all the datanodes)?
If you do have data connectivity, then you can just execute the hadoop fs -put command from the machine where the data is generated (assuming you have the Hadoop binaries installed there too):
#> hadoop fs -fs masternode:8020 -put test.bin hadoopFolderName/

Hadoop provides a couple of REST interfaces: see Hoop (since merged into Hadoop as HttpFS) and WebHDFS. Using them, you should be able to copy the file from a non-Hadoop environment without first copying it to the master.

Create a named pipe and run the transfer through it. This way the file is not stored locally.
mkfifo transfer_pipe
scp remote_file transfer_pipe| hdfs dfs -put transfer_pipe <hdfs_path>
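How the named pipe connects the two commands can be shown with local stand-ins. In this sketch printf plays the role of the scp writer and cat the role of hdfs dfs -put; whether scp and hdfs dfs -put accept a FIFO on a given system is an assumption that should be tested there.

```shell
#!/bin/sh
workdir=$(mktemp -d)
pipe="$workdir/transfer_pipe"
mkfifo "$pipe"

# Writer: runs in the background and blocks until a reader opens the pipe.
# In the real transfer this would be: scp remote_host:remote_file "$pipe" &
printf 'payload from the remote file\n' > "$pipe" &

# Reader: consumes the bytes as they arrive; nothing is stored on disk.
# In the real transfer this would be: hdfs dfs -put "$pipe" <hdfs_path>
received=$(cat "$pipe")

wait            # reap the background writer
rm -f "$pipe"   # the pipe is only a directory entry; remove it afterwards
```

The writer and reader have to run at the same time, which is why the writer is started in the background before the reader.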

(untested)
Since the node where you create your data has internet access, perhaps you could install the Hadoop client software on it and add it to the cluster. After a normal hadoop fs -put, disconnect and remove the temporary node; Hadoop should then automatically re-replicate the file blocks inside the cluster.

Using local settings through SSH

Is it possible to have an SSH session use all your local configuration files (.bash_profile, .vimrc, etc..) on login? That way you would have the same configuration for, say, editing files in vim in the remote session.

I just came across two alternatives to just doing a git clone of your dotfiles. I take no credit for either of these and can't say I've used either extensively so I don't know if there are pitfalls to either of these.
sshrc
sshrc is a tool (actually just a big bash function) that copies over local rc-files without permanently writing them to the remote user's $HOME, the idea being that it might be a shared admin account that other people use. It appears to be customizable for different remote hosts as well.
.ssh/config and LocalCommand
This blog post suggests a way to automatically run a command when you login to a remote host. It tars and pipes a set of files to the remote, then un-tars them on the remote's $HOME:
Your local ~/.ssh/config would look like this:
Host *
PermitLocalCommand yes
LocalCommand tar c -C${HOME} .bashrc .bash_profile .exports .aliases .inputrc .vimrc .screenrc \
| ssh -o PermitLocalCommand=no %n "tar mx -C${HOME}"
You could modify the above to only run the command on certain hosts (instead of the * wildcard) or customize for different hosts as well. There might be a fair amount of duplication per host with this method - although you could package the whole tar c ... | ssh .. "tar mx .." into a script maybe.
Note the above looks like it clobbers the same files on the remote when you connect, so use with caution.

Use a dotfiles.git repo
What I do is keep all my config files in a dotfiles.git on a central server.
You can set it up so that when you ssh into a remote machine, you automatically pull the latest version of the dotfiles. I do something like this:
ssh myhost
cd ~/dotfiles
git pull --rebase
cd ~
ln -sf dotfiles/$username/linux/.* .
Note:
To put that in a shell script, you can automate the process of executing commands on a remote machine by piping to ssh.
The "$username" is there so that you can share your config files with other people you're working with.
The "ln -sf" creates symbolic links to all your dotfiles, overwriting any local ones, such that ~/.emacs is linked to the version controlled file ~/dotfiles/$username/.emacs.
The use of a "linux" subdirectory is just to allow for configuration changes across platforms. I also have a mac directory under dotfiles/$username/mac. Most of the files in the /mac directory are symlinked from the linux directory as it's very similar, but there are some exceptions.
Finally, note that you can make this even more sophisticated with hostnames and the like rather than just a generic 'linux'. With a dotfiles.git, you can also raid dotfiles from your friends, which is awesome -- everyone has their own set of little tricks and hacks.

No, because it's not SSH using your config files, but the remote shell.
I suggest keeping your config files in Subversion or some other VCS. Here's how I do it.

Well, no, because as Andy Lester says, the remote machine is the one doing the work, and it has no access back to your local machine to get .vimrc ...
On the other hand, you could use sshfs to mount the remote file system locally and edit the files locally. This doesn't require you to install anything on the remote machine. Not sure how efficient it is, maybe not great for editing big files over slow links.
Or Komodo IDE has a neat "Open >> Remote File" option which lets you edit files on remote machines by scping them back and forth automatically.

I do this kind of things every day. I have about 15 bash rc files and .vimrc, a few vim plugin scripts, .screenrc and some other rc files. I have a sync script (written in bash) which uses the cool rsync command to sync all these files to remote servers. Every time I update some files on my main server, I would call the script to sync them to remote servers.
Setting up a svn/git/hg repository on the main server also works for me but my remote servers need to be repeatedly reinstalled for testing. So I find it's more convenient to use rsync.
A few years ago I also used the rdist tool which can also meet the requirement for most of the time. But now I prefer rsync as it supports incremental sync which is very efficient.

ssh can be configured to pass certain environment variables through to the remote side. Since most shells check some environment variables for additional settings to apply, you can hack that into applying some local settings remotely. But it's a bit complicated, and most administrators turn off the ssh environment variable pass-through in the sshd config anyway.

You could always just copy the files to the machine before connecting with ssh:
#!/bin/bash
scp ~/.bash_profile ~/.vimrc user@host:
ssh user@host
This works best if you are using keys to login and no one else logs in as that user.

Here's a simple bash script I've used for this purpose. It uses rsync to sync over some folders I like to have on every machine and then adds the ~/bin folder to the remote machine's .bashrc if it's not there already. It works best if you have copied your ssh keys to each server. I use this approach instead of a "dotfiles repo" because lots of the servers I connect to don't have git on them.
So to use it, you'd do something like this:
./bin_sync_to_machine.sh server1
bin_sync_to_machine.sh
function show_help()
{
echo ""
echo "usage: SERVER {SERVER2 SERVER3 etc...}"
echo ""
exit
}
if [ "$1" == "help" ]
then
show_help
fi
if [ -z "$1" ]
then
show_help
fi
# Sync ~/bin and some dot files to remote server using rsync
for SERVER in "$@"; do
  rsync -avrz --progress ~/bin/ -e ssh "$SERVER":~/bin
  rsync -avrz --progress ~/.vim/ -e ssh "$SERVER":~/.vim
  rsync -avrz --progress ~/.vimrc -e ssh "$SERVER":~/.vimrc
  rsync -avrz --progress ~/.aliases "$SERVER":~/.aliases
  rsync -avrz --progress ~/.aliases "$SERVER":~/.bash_aliases
  # Ensure remote server has ~/bin in the path
  ssh "$SERVER" '~/bin/path_add_to_path.sh'
done
path_add_to_path.sh
pathadd() {
if [ -d "$1" ] && [[ ":$PATH:" != *":$1:"* ]]; then
PATH="${PATH:+"$PATH:"}$1"
fi
}
# Add to current path if running in a shell
pathadd ~/bin
# Add to ~/.bashrc
if ! grep -q 'PATH:~/bin' ~/.bashrc; then
  echo "PATH=\$PATH:~/bin" >> ~/.bashrc
fi
# Quote the pattern so grep searches ~/.bashrc only, not ~/.aliases as a second file
if ! grep -q 'source ~/.aliases' ~/.bashrc; then
  echo "source ~/.aliases" >> ~/.bashrc
fi

I wrote an extremely simple tool for this that will allow you to natively transport your .vimrc file whenever you ssh, by using SSHd built-in config options in a non-standard way.
No additional svn,scp,copy/paste, etc required.
It is simple, lightweight, and works by default on all server configurations I have tested so far.
https://github.com/gWOLF3/viSSHous

I think that https://github.com/fsquillace/kyrat does what you need.
I wrote it a long time ago, before sshrc was born, and it has several advantages over sshrc:
It does not depend on xxd on either host (xxd may be unavailable on the remote host)
Kyrat uses a more efficient encoding algorithm
It is just ~20 lines of code (really easy to understand!)
No root access or installation is needed on the remote host
For instance:
$> echo "alias q=exit" > ~/.config/kyrat/bashrc
$> kyrat myuser@myserver.com
myserver.com $> q
exit
