Optimally copy new files in a directory from local to HDFS - Unix

I am trying to write a script to optimally copy new files, e.g. files created within the last 7 days, from a local directory to HDFS. I am using the find command to locate matching files and -exec to run the hdfs command. However, -exec ... \; launches a new hdfs process for each file, which is suboptimal. I want one hdfs command to copy all the matched files to HDFS. I am using the command below (the leading echo just prints the hdfs command while testing):
find ./ -type f -mtime -7 -exec echo hdfs dfs -put -f {} hdfs://server1:9000/data/ \;
I have read other answers on SO suggesting adding a + sign at the end of the command, as below, but that returns find: missing argument to -exec:
find ./ -type f -mtime -7 -exec echo hdfs dfs -put -f {} hdfs://server1:9000/data/ +
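The + form fails because find requires {} to be the last argument, immediately before the +, so there is nowhere to place the destination after it. One way around this (a sketch, reusing the destination from the question) is to have find batch the files into an inline shell, which then issues a single hdfs command per batch; this works because hdfs dfs -put accepts multiple local sources before the target directory:
find ./ -type f -mtime -7 -exec sh -c '
  # "$@" receives a batch of matched files; -put takes many sources, then one destination
  hdfs dfs -put -f "$@" hdfs://server1:9000/data/
' sh {} +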

Count number of lines in each directory

I have a directory structure as below
output/a/1/multipleFiles
output/a/2/multipleFiles
output/a/3/multipleFiles
output/b/1/multipleFiles
output/b/2/multipleFiles
output/b/3/multipleFiles
I want to know the number of lines in each directory, i.e. line counts at the innermost directory level rather than per file. The innermost directories 1, 2, 3 are different kinds of output we generate for our analytics, each containing multiple Hadoop part-xxxx files.
I changed to the output directory and tried the command below.
find . -maxdepth 2 -type d -name '*' | awk -F "/" 'NF==3' | awk '{print $0"/*"}' | xargs wc -l
But I get errors like:
wc: ./a/1/*: No such file or directory
wc: ./a/2/*: No such file or directory
wc: ./a/3/*: No such file or directory
but if I try
wc -l ./a/1/*
I get the correct output for that specific folder.
What am I missing here?
EDIT:
I updated my command as below to remove unnecessary awk commands.
find . -mindepth 2 -maxdepth 2 -type d -name '*' | xargs wc -l
This again results in errors:
wc: ./a/1: Is a directory
wc: ./a/2: Is a directory
wc: ./a/3: Is a directory
The pipeline fails because nothing ever expands the *: globs are expanded by a shell, and here xargs hands wc the literal string ./a/1/*, which names no file. Give -execdir a try, for example:
find . -maxdepth 2 -type f -execdir wc -l {} \;
This runs wc -l {} from within the directory where each file was found. From the man page:
-execdir  The -execdir primary is identical to the -exec primary with the exception that utility will be executed from the directory that holds the current file.
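Note that -execdir prints one count per file. If you want a single total per innermost directory instead, one option (a sketch, assuming the two-level layout from the question) is to let an inline shell expand the glob that xargs could not:
find . -mindepth 2 -maxdepth 2 -type d -exec sh -c '
  # the inner shell expands "$1"/* -- exactly the step the xargs version was missing
  printf "%s: " "$1"; cat "$1"/* | wc -l
' sh {} \;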

bash find with two commands in an exec: How to find a specific Java class within a set of JARs

My use case is I want to search a collection of JARs for a specific class file. More specifically, I want to search recursively within a directory for all *.jar files, then list their contents, looking for a specific class file.
So this is what I have so far:
find . -name '*.jar' -type f -exec echo {} \; -exec jar tf {} \;
This will list the contents of all JAR files found recursively. I want to put a grep within the second exec, because I want the second exec to print only the lines of the JAR listing that grep matches.
If I just put a pipe and pipe it all to grep afterward, like:
find . -name '*.jar' -type f -exec echo {} \; -exec jar tf {} \; | grep $CLASSNAME
Then I lose the output of the first exec, which tells me where the class file is (the name of JAR file is likely to not match the class file name).
So if there was a way for the exec to run two commands, like:
-exec "jar tf {} | grep $CLASSNAME" \;
Then this would work. Using a grep $(...) in the exec command wouldn't work because I need the {} from the find to take the place of the file that was found.
Is this possible?
(Also I am open to other ways of doing this, but the command line is preferred.)
I find it difficult to execute multiple commands within find -exec, so I usually just grab the results with find and loop over them.
Maybe something like this might help:
find . -type f -name '*.jar' | while read -r jarfile; do echo "$jarfile"; jar tf "$jarfile"; done
I figured it out - still using "one" command. What I was looking for was actually answered in the question How to use pipe within -exec in find. What I have to do is use a shell command with my exec. This ends up making the command look like:
find . -name '*.jar' -type f -exec echo {} \; -exec sh -c "jar tf {} | grep --color $CLASSNAME" \;
The --color helps the matches stick out while the command is recursively listing all the JAR files.
A couple of points:
This assumes I have $CLASSNAME set. The class name has to appear as it would inside a JAR, not as a Java package name. So com.ibm.json.java.JSONObject becomes com/ibm/json/java/JSONObject.class.
This requires a JDK, which is where the jar command comes from, and the jar executable must be on the system path. If your JDK is not on the path, you can set an environment variable such as JAR to point to the jar executable. I am running this from Cygwin, and my jar installation turns out to live under the "Program Files" directory; the space in that path breaks the command, so I have to use these two commands instead:
export JAR=/cygdrive/c/Program\ Files/Java/jdk1.8.0_65/bin/jar
find . -name '*.jar' -type f -exec echo {} \; -exec sh -c "\"$JAR\" tf {} | grep --color $CLASSNAME" \;
The quotes around $JAR in the shell command must be escaped; otherwise the shell will not know what to do with the space in "Program Files".
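As a side note, splicing {} into the sh -c string can misbehave with unusual file names (and relies on find substituting {} inside a larger argument, which not every implementation does). A variant that passes the JAR path and class name as positional parameters may be safer (a sketch; same $CLASSNAME convention as above):
find . -name '*.jar' -type f -exec sh -c '
  # $1 is the JAR, $2 the class name pattern; print the JAR name only on a match
  jar tf "$1" | grep --color -- "$2" && echo "==> $1"
' sh {} "$CLASSNAME" \;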

Formatting Find output before it's used in next command

I am batch uploading files to an FTP server with find and curl using this command:
find /path/to/target/folder -not -path '*/\.*' -type f -exec curl -u username:password --ftp-create-dirs -T {} ftp://ftp.myserver.com/public/{} \;
The problem is find is outputting paths like
/full/path/from/root/to/file.txt
so on the FTP server I get the file uploaded to
ftp.myserver.com/public/full/path/from/root/to/file.txt
instead of
ftp.myserver.com/public/relative/path/to/file.txt
The goal was to have all files and folders that are in the target folder get uploaded to the public folder, but this problem is destroying the file structure. Is there a way to edit the find output to trim the path before it gets fed into curl?
Not sure exactly what you want to end up with in your path, but this should give you an idea. The trick is to exec sh so you can modify the path and then run a command:
find . -type f -exec sh -c 'joe=$(basename "$1"); echo "$joe"' sh {} \;
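Applied to the upload problem (a sketch reusing the server, credentials, and paths from the question): run find from inside the target folder so every result is already relative, then strip find's leading ./ before handing the path to curl:
cd /path/to/target/folder
find . -not -path '*/\.*' -type f -exec sh -c '
  rel=${1#./}    # drop the "./" prefix that find puts on every result
  curl -u username:password --ftp-create-dirs -T "$1" "ftp://ftp.myserver.com/public/$rel"
' sh {} \;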

Script for zipping files from 10 days back in one folder in Unix?

I am trying to zip log files using the zip command on a UNIX server, but I want to zip the files from 10 days back automatically, without changing the command manually each time. Can anyone suggest a script for zipping the files from 10 days back in one folder?
You can use the find command together with zip to do this, like the following (-@ tells zip to read the list of files from standard input):
find . -name '*.log' -mtime +10 | zip logfiles.zip -@
Another alternative:
find . -iname '*.log' -mtime +9 -exec zip logfiles.zip {} +
You will probably want to use the -m option of zip which will delete the logs after zipping them.
find . -iname '*.log' -mtime +9 -exec zip -m logfiles.zip {} +
How does it work?
find . will find files starting in the current directory.
-iname '*.log' matches files ending in .log case-insensitively: .log, .LOG, .Log, etc.
-mtime +9 selects files modified more than 9 full days ago, i.e. at least 10 days old.
-exec zip -m logfiles.zip {} + runs zip -m logfiles.zip file1 file2 file3 ..., batching as many files as possible into each invocation.
zip -m file.zip file zips file into file.zip and deletes file.
Some other options:
You will probably want to run this script more than once, so you can put a variable part in the archive name, such as the date in mm.dd.yyyy format:
find . -iname '*.log' -mtime +9 -exec zip -m logfiles-$(date +%m.%d.%Y).zip {} +
If you only want the logs in the current directory (and not its subdirectories), you can use the -maxdepth option of find:
find . -maxdepth 1 -iname '*.log' -mtime +9 -exec zip -m logfiles-$(date +%m.%d.%Y).zip {} +
Suppose that you have something like:
app1/logs/log1.log
app1/logs/log2.log
app2/logs/log1.log
app2/logs/log2.log
With -execdir (instead of -exec), the zip command is executed separately in each subdirectory in which files are found.
So:
find . -iname '*.log' -mtime +9 -execdir zip -m logfiles.zip {} +
Will leave you with:
app1/logs/logfiles.zip --> (contains log1 and log2 from app1/logs)
app2/logs/logfiles.zip --> (contains log1 and log2 from app2/logs)
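Putting the pieces together as an actual script you could run from cron (a minimal sketch; the log directory is a hypothetical path to adjust):
#!/bin/sh
# Zip and remove (-m) logs older than 10 days in a single folder
LOGDIR=/var/myapp/logs                      # hypothetical path, change to your folder
ARCHIVE="logfiles-$(date +%m.%d.%Y).zip"
find "$LOGDIR" -maxdepth 1 -iname '*.log' -mtime +9 -exec zip -m "$ARCHIVE" {} +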

Unix - how to source multiple shell scripts in a directory?

When I want to execute a shell script in Unix (let's say I am in the directory where the script is), I just type:
./someShellScript.sh
and when I want to "source" it (i.e. run it in the current shell, NOT in a new shell), I type the same command with a "." (or the equivalent "source" command) in front of it:
. ./someShellScript.sh
And now the tricky part. When I want to execute multiple shell scripts (let's say all the files with the .sh suffix) in the current directory, I type:
find . -type f -name '*.sh' -exec {} \;
but "what command should I use to "SOURCE" multiple shell scripts in a directory"?
I tried this so far but it DIDN'T work:
find . -type f -name *.sh -exec . {} \;
and it only threw this error:
find: `.': Permission denied
Thanks.
for file in ./*.sh
do . "$file"
done
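If the scripts live in subdirectories as well, bash's globstar option lets the same loop recurse without involving find at all (a sketch; requires bash 4 or later):
shopt -s globstar
for file in ./**/*.sh
do . "$file"
done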
Try the following version of Jonathan's code. It keeps the loop in the current shell, so the sourced definitions persist, and it handles spaces in file names by pairing find -print0 with a NUL-delimited read. The process substitution makes it bash-specific:
while IFS= read -r -d '' file
do source "$file"
done < <(find . -type f -name '*.sh' -print0)
The problem lies in the way shells work: '.' (and source) is a shell builtin, not an external program, so find cannot execute it; the find: `.': Permission denied error is find trying to execute '.' (which in the filesystem is a directory). Even if it could, find runs each -exec in a child process, so any variables or other environment changes would land in that child process and never reach your current shell.
Also, note that your command (and Jonathan's loop, unless "$file" is quoted) will not work if there are spaces in the file names.
You can use find and xargs to do this:
find . -type f -name '*.sh' | xargs -I {} sh {}
Note, though, that this executes each script in a new sh process; it does not source them into your current shell.
