Unix to find pdf files from list in text file - unix

I have a directory (for Endnote) that is filled with PDF files (1000's of them). I have used Unix to print a list of all of the pdf files and saved this list as a text file. Most of these pdf files are located in other directories throughout my computer (duplicates).
Now, I want to use the find command to search for duplicates of these pdf files throughout the rest of my computer and, if a duplicate is found, move it to a new directory. If a specific file name is found more than once, I want to give each a unique name (i.e. basename.pdf.1, basename.pdf.2, etc.). At the end, I want a single directory for all duplicates so I can double check them and then delete them.
However, I do not want find to search the directory from which my list was made, or my Dropbox, as I do not want to move these pdf files (only the other pdfs scattered throughout my computer).
I have found (I think) how to do all of the individual steps that I need to complete this task, but I cannot seem to put everything together into a working Unix command.
1) In order to find files while excluding a directory:
find -name "what to search for" -not -path "excluded_directory"
or
find build -not \( -path excluded_directory1 -prune \) -not \( -path excluded_directory2 -prune \) -name \*.what_to_find
or my current favorite
find . -name '*.what_to_find' | grep -v excludeddir1 | grep -v excludeddir2
2) In order to read a text file into find and use the lines as search patterns:
find . -type f -print | fgrep -f file_list.txt
3) to find and move files
find / -iname "*.what_to_find" -type f -exec mv {} /new_directory \;
or
find / -iname "*.what_to_find" -type f | xargs -I '{}' mv '{}' /new_directory
or (to rename files so files with same name are not just overwritten by each other). I haven't quite figured everything going on in this command out yet...
find -name '*.what_to_find' -type f -exec bash -c 'mv -v "$0" "./$( mktemp "$( basename "$0" ).XXX" )"' '{}' \;
So, I can execute these commands individually, but have not been able to get them to work together as desired (maybe my order of commands is wrong? other problems?).
find . type f -print | fgrep -f file_list.txt | grep -v excludeddir1 | grep -v excludeddir2 -exec bash -c 'echo mv -v "$0" "./$( mktemp "$( basename "$0" ).XXX" )"' '{}' \;
Any help is much appreciated!
Thanks,
Derrick

Well I wasn't able to complete this task exactly how I wanted to, but I found a work around that got the job done.
I printed a list of all PDFs I have in Endnote, then deleted the path name, leaving just the file names (find and replace function in TextWrangler). I then used the find command to search this list against my computer, printing all occurrences of each PDF.
Then in TextWrangler, I deleted all lines containing the initial path to my Endnote PDFs, leaving just the desired duplicates.
Next, I used the find command to search for these exact paths and move them to a new folder.
All in all, I got by with the exact same commands I have in my original post, and a little help from TextWrangler. Unfortunately I never figured out how to combine all my desired steps into a single Unix command.
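For what it's worth, the separate steps from the question can probably be combined along the following lines. This is only a sketch under assumptions that are not in the original post: the list file holds one bare file name per line (paths already stripped), the function name collect_duplicates and its argument order are made up for illustration, and find supports -iname and -exec ... {} + (GNU and BSD find both do). Excluded directories are pruned, so find never descends into them, and a name that already exists in the destination gets a numeric suffix.

```shell
# Hypothetical helper, not from the original post.
# usage: collect_duplicates LIST DEST SEARCH_ROOT [EXCLUDED_DIR...]
# The body runs in a subshell ( ... ) so its variables do not leak.
collect_duplicates() (
    list=$1 dest=$2 root=$3
    shift 3
    mkdir -p "$dest" || exit 1

    # Turn the excluded directories into "-o -path DIR" triples.
    # The loop list is fixed when the loop starts, so rebuilding "$@" is safe.
    for dir do
        set -- "$@" -o -path "$dir"
        shift
    done

    # Prune the destination and the excluded directories, then hand every
    # remaining PDF to a small inline script in batches.
    find "$root" \( -path "$dest" "$@" \) -prune -o -type f -iname '*.pdf' \
        -exec sh -c '
            list=$1 dest=$2
            shift 2
            for file do
                name=${file##*/}
                # -F fixed string, -x whole line: exact match against the list
                grep -Fxq -e "$name" "$list" || continue
                target=$dest/$name
                n=1
                while [ -e "$target" ]; do
                    target=$dest/$name.$n
                    n=$((n + 1))
                done
                mv -- "$file" "$target"
            done
        ' sh "$list" "$dest" {} +
)
```

It could be called as collect_duplicates file_list.txt "$HOME/pdf_duplicates" "$HOME" "$HOME/Endnote" "$HOME/Dropbox" (all of these paths are placeholders). The excluded directories have to be written the same way find will print them, so if the search root is an absolute path the exclusions should be absolute too. Note that -path treats its argument as a pattern, so directory names containing characters such as * or [ would need escaping.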

Related

Unix Recursively move all files but keeping the structure

I have a folder named "in" that contains several folders "a" "b" "c" and I want to move all files to the folder "proc" and compress them. The tricky part is that the files in "in/a" have to be moved to "proc/a", those in "in/b" to "proc/b", and so on.
I managed to find all files and zip them with this command
find . -type f ! \( -name "*gz" -o -name "*tmp" -o -name "*xftp" \) -exec gzip -n '{}' \;
But I'm not finding a generic command to move the files that works without me telling it the names of the folders. Can anyone give me a hand?
Well, I ended up finding out I had a couple more problems, for example the target folder not existing, so I ended up using this code
find . -type f ! \( -name "*gz" -o -name "*tmp" -o -name "*xftp" \) -exec gzip -n '{}' \;
find . -name "*.gz" | cpio -p -dumv "$1"
if [ "$?" = "0" ]; then
find . -name "*.gz" -exec rm -rf {} \;
else
echo "cpio Failed!" 1>&2
exit 1
fi
The 1st line finds all files to be processed and zips them.
The second line finds all the .gz files and copies them to the target dir, in my case $1 (argument 1), creating as many folders as necessary to ensure the same structure.
The if statement then checks the status of the last command: if it worked, it finds and removes all .gz files from the source folder without deleting any folder. If it didn't, it deletes nothing, so I can analyse what happened (maybe it ran out of space).
I bet there's a faster way of doing this without having to use so much disk space, but since that was not a problem for me it looks acceptable.
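The extra disk space comes from copying everything with cpio before deleting the originals. One possible alternative, sketched here and not taken from the original answer, is to compress each file in place and then move it, creating the matching sub-folder in the target first. The function name gzip_and_move is made up, and the sketch assumes the target folder is not located inside the source folder.

```shell
# Hypothetical helper: compress files under SRC_DIR and move them to
# DEST_DIR, keeping the same sub-folder layout.
# usage: gzip_and_move SRC_DIR DEST_DIR
gzip_and_move() (
    src=$1 dest=$2
    mkdir -p "$dest" || exit 1
    dest=$(cd "$dest" && pwd) || exit 1   # absolute path, because we cd below
    cd "$src" || exit 1

    # Same filter as the original command, but gzip gets many files per call.
    find . -type f ! \( -name '*gz' -o -name '*tmp' -o -name '*xftp' \) \
        -exec gzip -n {} +

    # Move every .gz file to the same relative location under the target.
    find . -type f -name '*.gz' -exec sh -c '
        dest=$1
        shift
        for file do
            rel=${file#./}
            case $rel in
                */*) mkdir -p "$dest/${rel%/*}" || exit 1 ;;
            esac
            mv -- "$file" "$dest/$rel" || exit 1
        done
    ' sh "$dest" {} +
)
```

Called as gzip_and_move in proc, this leaves the (now empty) folders in "in" untouched. When both folders are on the same file system, mv only renames the files, so no second copy is written.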

UNIX get info about file in all directories matching a pattern

I have a bunch of directories that all contain a file /SubDir1/SubDir2/File, and I want to see the memory of each file under directories matching a certain pattern. How do I do this?
So far I have ls -l | grep "pattern*" to get a list of the directories, but am stuck at this.
You should use the find command:
find . -name 'pattern*' -printf '%s\t%p\n'
By "memory of each file" I guess you mean file size.
The find command will do a better job:
find . -name "pattern*" -exec du -b {} \;
This will print the size of every file or directory in your tree whose name matches pattern*, along with its path.
Bash Pitfall #1: Don't parse ls
You can use find or shell patterns:
for i in pattern*; do
cat "$i"
done
One of your special problems is to get a list of all files under a set of matching directories, and you can do that with a more elaborate pattern:
for i in pattern*/*; do
if [ -f "$i" ]; then
cat "$i"
fi
done
In addition to what SirDarius said, you can also use the -R option to ls to get a recursive listing.
Something like ls -lRh | grep "pattern" should do what you want.
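For the exact layout described in the question (each matching directory contains SubDir1/SubDir2/File), a plain shell pattern may be enough. The following is a sketch with a made-up function name, sizes_under; it prints the size in bytes and the path of each file, using only wc so that it does not depend on GNU-specific options such as find -printf or du -b.

```shell
# Hypothetical helper: print "SIZE<TAB>PATH" for SubDir1/SubDir2/File in
# every directory whose name starts with the given prefix.
# usage: sizes_under PREFIX
sizes_under() {
    for f in "$1"*/SubDir1/SubDir2/File; do
        [ -f "$f" ] || continue          # skip if the pattern matched nothing
        # The arithmetic expansion strips padding that some wc versions add.
        printf '%s\t%s\n' "$(($(wc -c < "$f")))" "$f"
    done
}
```

For example, sizes_under run would cover run1/SubDir1/SubDir2/File, run2/SubDir1/SubDir2/File and so on (directory names invented for the example).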

How to limit grep to only search the files that you want

We have a rather large and complex file system and I am trying to generate a list of files containing a particular text string. This should be simple, but I need to exclude the './svn' and './pdv' directories (and probably others) and to only look at files of type *.p, *.w or .i.
I can easily do this with a program, but it is proving very slow to run. I want to speed up the process (so that I'm not searching thousands of files repeatedly) as I need to run such searches against a long list of criteria.
Normally, we search the file system using:
find . -name "*.[!r]*" -exec grep -i -l "search for me" {} \;
This is working, but I'm then having to use a program to exclude the unwanted directories , so it is running very slowly.
After looking at the topics here:
Stack Overflow thread
I've decided to try a few other approaches:
grep -ilR "search for me" . --exclude ".svn" --excluse "pdv" --exclude "!.{p,w,i*}"
Excludes the './svn', but not the './pdv' directories, Doesn't limit the files looked at.
grep -ilR "search for me" . --exclude ".svn" --excluse "pdv" --include "*.p"
Excludes the './svn', but not the './pdv' directories, Doesn't limit the files looked at.
find . -name "*.[!r]*" -exec grep -i -l ".svn" | grep -i -l "search for me" {} \;
I can't even get this (or variations on it) to run successfully.
find . ! -name "*.svn*" -prune -print -exec grep -i -l "search for me" {} \;
Doesn't return anything. It looks like it stops as soon as it finds the .svn directory.
How about something like:
find . \( \( -name .svn -o -name pdv \) -type d -prune \) -o \( -name '*.[pwi]' -type f -exec grep -i -l "search for me" {} + \)
This will:
- ignore the contents of directories named .svn and pdv
- grep files (and symlinks to files) named *.[pwi]
The + option after exec means gather as many files into a single command as will fit on the command line (roughly 1 million chars in Linux). This can seriously speed up processing if you have to iterate over thousands of files.
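The difference between the two -exec terminators can be made visible by counting how often the command is started. This is a small illustrative sketch; the function name count_invocations is made up here.

```shell
# Hypothetical demo: each started command prints exactly one line,
# so counting lines counts invocations.
# usage: count_invocations DIR
count_invocations() (
    dir=$1
    per_file=$(find "$dir" -type f -exec sh -c 'echo run' sh {} \; | wc -l)
    batched=$(find "$dir" -type f -exec sh -c 'echo run' sh {} + | wc -l)
    # $(( )) strips the padding some wc implementations add
    echo "per-file: $((per_file)) batched: $((batched))"
)
```

With \; the command runs once per file, while with + find starts it once per batch, so for a handful of files the second count is 1.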
Following command finds only *.rb files containing require 'bundler/setup' line and excludes search in .git and .bundle directories. That is the same use case I think.
grep -ril --exclude-dir .git --exclude-dir .bundle \
--include \*.rb "^require 'bundler/setup'$" .
The problem was with swapping of --exclude and --exclude-dir parameters I believe. Refer to the grep(1) manual.
Also note that exclude/include parameters accept GLOB only, not regexps, therefore single character suffix range can be done with one --include parameter, but more complex conditions would require more of the parameters:
--include \*.[pwi] --include \*.multichar_sfx ...
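Applied to the directories and suffixes from the question, this might look like the sketch below. It assumes a grep that supports --exclude-dir (GNU grep 2.5.3 or later; BSD grep has it as well), and the function name search_sources is made up for illustration.

```shell
# Hypothetical wrapper: list files named *.p, *.w or *.i that contain
# STRING (case-insensitive), skipping .svn and pdv directories.
# usage: search_sources STRING [DIR]
search_sources() {
    grep -ril \
        --exclude-dir=.svn --exclude-dir=pdv \
        --include='*.[pwi]' \
        -e "$1" "${2:-.}"
}
```

Because grep walks the tree itself here, no separate program is needed to filter out the unwanted directories afterwards.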
You can try the following:
find path_starting_point -type f | grep regex_to_filter_file_names | xargs grep regex_to_find_inside_matched_files
find . -name "filename_regex" | grep -v -e '.svn' -e '.pdv' | xargs grep -i 'your search string'

Use grep --exclude/--include syntax to not grep through certain files

I'm looking for the string foo= in text files in a directory tree. It's on a common Linux machine, I have bash shell:
grep -ircl "foo=" *
In the directories are also many binary files which match "foo=". As these results are not relevant and slow down the search, I want grep to skip searching these files (mostly JPEG and PNG images). How would I do that?
I know there are the --exclude=PATTERN and --include=PATTERN options, but what is the pattern format? The man page of grep says:
--include=PATTERN Recurse in directories only searching file matching PATTERN.
--exclude=PATTERN Recurse in directories skip file matching PATTERN.
Searching on grep include, grep include exclude, grep exclude and variants did not find anything relevant
If there's a better way of grepping only in certain files, I'm all for it; moving the offending files is not an option. I can't search only certain directories (the directory structure is a big mess, with everything everywhere). Also, I can't install anything, so I have to do with common tools (like grep or the suggested find).
Use the shell globbing syntax:
grep pattern -r --include=\*.cpp --include=\*.h rootdir
The syntax for --exclude is identical.
Note that the star is escaped with a backslash to prevent it from being expanded by the shell (quoting it, such as --include="*.cpp", would work just as well). This matters most with the space-separated form: if you wrote --include *.cpp and had any files in the current working directory that matched the pattern, the command line would expand to something like grep pattern -r --include foo.cpp bar.cpp rootdir, which is quite likely not what you wanted. With the --include=*.cpp form the shell matches the whole word against file names, so an accidental expansion is unlikely in bash, but zsh by default aborts with "no matches found", so escaping or quoting remains the safe habit.
Update 2021-03-04
I've edited the original answer to remove the use of brace expansion, which is a feature provided by several shells such as Bash and zsh to simplify patterns like this; but note that brace expansion is not POSIX shell-compliant.
The original example was:
grep pattern -r --include=\*.{cpp,h} rootdir
to search through all .cpp and .h files rooted in the directory rootdir.
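The effect of quoting can be checked without running grep at all. The sketch below uses two made-up helpers, show_args and demo_glob_expansion, and a throwaway temporary directory; it uses the space-separated form of the option so that the expansion is easy to see.

```shell
# Print each argument on its own line so shell expansion becomes visible.
show_args() {
    printf '<%s>\n' "$@"
}

# Hypothetical demo, run in a subshell so the cd does not affect the caller.
demo_glob_expansion() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch foo.cpp bar.cpp
    echo "unquoted:"
    show_args --include *.cpp      # the shell replaces *.cpp with file names
    echo "quoted:"
    show_args --include '*.cpp'    # the pattern is passed through unchanged
    echo "escaped:"
    show_args --include=\*.cpp     # same effect with a backslash
    cd / && rm -rf "$dir"
)
```

In the unquoted case the command receives the file names bar.cpp and foo.cpp instead of the pattern, which is the situation the answer warns about.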
If you just want to skip binary files, I suggest you look at the -I (upper case i) option. It ignores binary files. I regularly use the following command:
grep -rI --exclude-dir="\.svn" "pattern" *
It searches recursively, ignores binary files, and doesn't look inside Subversion hidden folders, for whatever pattern I want. I have it aliased as "grepsvn" on my box at work.
Please take a look at ack, which is designed for exactly these situations. Your example of
grep -ircl --exclude=*.{png,jpg} "foo=" *
is done with ack as
ack -icl "foo="
because ack never looks in binary files by default, and -r is on by default. And if you want only CPP and H files, then just do
ack -icl --cpp "foo="
grep 2.5.3 introduced the --exclude-dir parameter which will work the way you want.
grep -rI --exclude-dir=\.svn PATTERN .
You can also set an environment variable: GREP_OPTIONS="--exclude-dir=\.svn"
I'll second Andy's vote for ack though, it's the best.
I found this after a long time, you can add multiple includes and excludes like:
grep "z-index" . --include=*.js --exclude=*js/lib/* --exclude=*.min.js
The suggested command:
grep -Ir --exclude="*\.svn*" "pattern" *
is conceptually wrong, because --exclude works on the basename. Put in other words, it will skip only the .svn in the current directory.
In grep 2.5.1 you have to add this line to ~/.bashrc or ~/.bash_profile
export GREP_OPTIONS="--exclude=\*.svn\*"
I find grepping grep's output to be very helpful sometimes:
grep -rn "foo=" . | grep -v "Binary file"
Though, that doesn't actually stop it from searching the binary files.
If you are not averse to using find, I like its -prune feature:
find [directory] \
-name "pattern_to_exclude" -prune \
-o -name "another_pattern_to_exclude" -prune \
-o -name "pattern_to_INCLUDE" -print0 \
| xargs -0 -I FILENAME grep -IR "pattern" FILENAME
On the first line, you specify the directory you want to search. . (current directory) is a valid path, for example.
On the 2nd and 3rd lines, use "*.png", "*.gif", "*.jpg", and so forth. Use as many of these -o -name "..." -prune constructs as you have patterns.
On the 4th line, you need another -o (it specifies "or" to find), the patterns you DO want, and you need either a -print or -print0 at the end of it. If you just want "everything else" that remains after pruning the *.gif, *.png, etc. images, then use
-o -print0 and you're done with the 4th line.
Finally, on the 5th line is the pipe to xargs which takes each of those resulting files and stores them in a variable FILENAME. It then passes grep the -IR flags, the "pattern", and then FILENAME is expanded by xargs to become that list of filenames found by find.
For your particular question, the statement may look something like:
find . \
-name "*.png" -prune \
-o -name "*.gif" -prune \
-o -name "*.svn" -prune \
-o -print0 | xargs -0 -I FILES grep -IR "foo=" FILES
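One thing to watch for in the commands above: a bare -print0 also prints directories (including . itself), and grep -R then descends into them again, pruned names included. A variant that avoids this is sketched below. It restricts the output to regular files and lets find pass them in batches to a non-recursive grep. The function name grep_text_files is made up, and -I assumes GNU or BSD grep.

```shell
# Hypothetical helper: search regular files only, skipping images and
# .svn directories. /dev/null is added so grep always sees at least two
# file arguments and therefore always prints the file name.
# usage: grep_text_files PATTERN [DIR]
grep_text_files() {
    find "${2:-.}" \
        \( -name '*.png' -o -name '*.gif' -o -name '.svn' \) -prune \
        -o -type f -exec grep -I -e "$1" /dev/null {} +
}
```

For the question's case this would be called as grep_text_files 'foo='.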
On CentOS 6.6/Grep 2.6.3, I have to use it like this:
grep "term" -Hnir --include \*.php --exclude-dir "*excluded_dir*"
Notice the lack of equal signs "=" (otherwise --include, --exclude, include-dir and --exclude-dir are ignored)
git grep
Use git grep which is optimized for performance and aims to search through certain files.
By default it ignores binary files and it is honoring your .gitignore. If you're not working with Git structure, you can still use it by passing --no-index.
Example syntax:
git grep --no-index "some_pattern"
For more examples, see:
How to exclude certain directories/files from git grep search.
Check if all of multiple strings or regexes exist in a file
I'm a dilettante, granted, but here's how my ~/.bash_profile looks:
export GREP_OPTIONS="-orl --exclude-dir=.svn --exclude-dir=.cache --color=auto" GREP_COLOR='1;32'
Note that to exclude two directories, I had to use --exclude-dir twice.
If you search non-recursively you can use glob patterns to match the filenames.
grep "foo" *.{html,txt}
includes html and txt. It searches in the current directory only.
To search in the subdirectories:
grep "foo" */*.{html,txt}
In the subsubdirectories:
grep "foo" */*/*.{html,txt}
In the directories are also many binary files. I can't search only certain directories (the directory structure is a big mess). Is there's a better way of grepping only in certain files?
ripgrep
This is one of the quickest tools designed to recursively search your current directory. It is written in Rust, built on top of Rust's regex engine for maximum efficiency. Check the detailed analysis here.
So you can just run:
rg "some_pattern"
It respects your .gitignore and automatically skips hidden files/directories and binary files.
You can still customize include or exclude files and directories using -g/--glob. Globbing rules match .gitignore globs. Check man rg for help.
For more examples, see: How to exclude some files not matching certain extensions with grep?
On macOS, you can install via brew install ripgrep.
find and xargs are your friends. Use them to filter the file list rather than grep's --exclude
Try something like
find . -type f -not -name '*.png' -print | xargs grep -icl "foo="
The advantage of getting used to this, is that it is expandable to other use cases, for example to count the lines in all non-png files:
find . -type f -not -name '*.png' -print | xargs wc -l
To remove all non-png files:
find . -type f -not -name '*.png' -print | xargs rm
etc.
As pointed out in the comments, if some files may have spaces in their names, use -print0 and xargs -0 instead.
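The null-delimited variant mentioned above might look like this. It is a sketch with a made-up function name, grep_non_png, and it assumes a find and xargs that support -print0 and -0 (GNU and BSD both do). The tests are written as -type f ! -name '*.png' so that only regular files that are not PNGs are listed.

```shell
# Hypothetical helper: list non-png files containing PATTERN
# (case-insensitive); file names are separated by NUL bytes, so spaces
# and newlines in names are harmless.
# usage: grep_non_png PATTERN [DIR]
grep_non_png() {
    find "${2:-.}" -type f ! -name '*.png' -print0 |
        xargs -0 grep -il -e "$1"
}
```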
Try this one:
$ find . -name "*.txt" -type f -print | xargs file | grep "foo=" | cut -d: -f1
Found here: http://www.unix.com/shell-programming-scripting/42573-search-files-excluding-binary-files.html
Those scripts don't solve the whole problem. Try this instead:
du -ha | grep -i -o "\./.*" | grep -v "\.svn\|another_file\|another_folder" | xargs grep -i -n "$1"
This script is better because it uses "real" regular expressions to exclude directories from the search. Just separate folder or file names with "\|" in the grep -v pattern.
enjoy it!
found on my linux shell! XD
Look at this one.
grep --exclude="*\.svn*" -rn "foo=" * | grep -v Binary | grep -v tags
The --binary-files=without-match option to GNU grep gets it to skip binary files. (Equivalent to the -I switch mentioned elsewhere.)
(This might require a recent version of grep; 2.5.3 has it, at least.)
suitable for tcsh .alias file:
alias gisrc 'grep -I -r -i --exclude="*\.svn*" --include="*\."{mm,m,h,cc,c} \!* *'
Took me a while to figure out that the {mm,m,h,cc,c} portion should NOT be inside quotes.
~Keith
To ignore all binary results from grep
grep -Ri "pattern" * | awk '{if($1 != "Binary") print $0}'
The awk part will filter out all the Binary file foo matches lines
Try this:
Create a folder named "--F" under the current directory (or link another folder there, renamed to "--F", i.e. double-minus-F).
#> grep -i --exclude-dir="\-\-F" "pattern" *

Unix shell file copy flattening folder structure

On the UNIX bash shell (specifically Mac OS X Leopard) what would be the simplest way to copy every file having a specific extension from a folder hierarchy (including subdirectories) to the same destination folder (without subfolders)?
Obviously there is the problem of having duplicates in the source hierarchy. I wouldn't mind if they are overwritten.
Example: I need to copy every .txt file in the following hierarchy
/foo/a.txt
/foo/x.jpg
/foo/bar/a.txt
/foo/bar/c.jpg
/foo/bar/b.txt
To a folder named 'dest' and get:
/dest/a.txt
/dest/b.txt
In bash:
find /foo -iname '*.txt' -exec cp \{\} /dest/ \;
find will find all the files under the path /foo matching the wildcard *.txt, case insensitively (That's what -iname means). For each file, find will execute cp {} /dest/, with the found file in place of {}.
The only problem with Magnus' solution is that it forks off a new "cp" process for every file, which is not terribly efficient especially if there is a large number of files.
On Linux (or other systems with GNU coreutils) you can do:
find . -name "*.xml" -print0 | xargs -0 echo cp -t a
(The -0 allows it to work when your filenames have weird characters -- like spaces -- in them.)
Unfortunately I think Macs come with BSD-style tools. Anyone know a "standard" equivalent to the "-t" switch?
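One way to get the same batching without GNU's -t is to let a small inline shell put the destination last. The sketch below uses only find -exec ... {} + and sh -c, both of which are in POSIX; the function name copy_flat is made up, and the destination is assumed to lie outside the source tree.

```shell
# Hypothetical helper: copy every *.EXT file under SRC_DIR into DEST_DIR
# (flat, no sub-folders), starting cp once per batch of files.
# usage: copy_flat SRC_DIR DEST_DIR EXT
copy_flat() {
    mkdir -p "$2" || return 1
    find "$1" -type f -name "*.$3" -exec sh -c '
        dest=$1
        shift
        cp -- "$@" "$dest"/
    ' sh "$2" {} +
}
```

For the example in the question this would be copy_flat /foo /dest txt. When two source files share a name within one batch, BSD cp overwrites silently while GNU cp refuses the second copy with a warning.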
The answers above don't allow for name collisions as the asker didn't mind files being over-written.
I do mind files being over-written so came up with a different approach. Replacing each / in the path with - keeps the hierarchy in the names, and puts all the files in one flat folder.
We use find to get the list of all files, then awk to create a mv command with the original filename and the modified filename then pass those to bash to be executed.
find ./from -type f | awk '{ str=$0; sub(/\.\//, "", str); gsub(/\//, "-", str); print "mv " $0 " ./to/" str }' | bash
where ./from and ./to are directories to mv from and to.
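Because the awk version builds mv commands as text, file names with spaces or quotes would be split or misread by bash. A variant that passes the names as arguments instead is sketched below. The function name flatten_move is made up, the new names are built from the path relative to the source directory (so they do not start with "from-" as in the awk version), and the target is assumed to lie outside the source tree.

```shell
# Hypothetical helper: move every file under FROM_DIR into TO_DIR,
# replacing each / in its relative path with - to keep names unique.
# usage: flatten_move FROM_DIR TO_DIR
flatten_move() (
    from=$1
    mkdir -p "$2" || exit 1
    to=$(cd "$2" && pwd) || exit 1       # absolute path, because we cd below
    cd "$from" || exit 1
    find . -type f -exec sh -c '
        to=$1
        shift
        for file do
            rel=${file#./}
            flat=$(printf "%s" "$rel" | tr "/" "-")
            mv -- "$file" "$to/$flat"
        done
    ' sh "$to" {} +
)
```

Called as flatten_move ./from ./to, a file ./from/a/b.txt ends up as ./to/a-b.txt.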
If you really want to run just one command, why not cons one up and run it? Like so:
$ find /foo -name '*.txt' | xargs echo | sed -e 's/^/cp /' -e 's|$| /dest|' | bash -sx
But that won't matter too much performance-wise unless you do this a lot or have a ton of files. Be careful of name collisions, however. I noticed in testing that GNU cp at least warns of collisions:
cp: will not overwrite just-created `/dest/tubguide.tex' with `./texmf/tex/plain/tugboat/tubguide.tex'
I think the cleanest is:
$ find /foo -name '*.txt' | xargs -I {} cp {} /dest
Less syntax to remember than the -exec option.
As far as the man page for cp on a FreeBSD box goes, there's no need for a -t switch. cp will assume the last argument on the command line to be the target directory if more than two names are passed.
