How to limit grep to only search the files that you want - unix

We have a rather large and complex file system and I am trying to generate a list of files containing a particular text string. This should be simple, but I need to exclude the '.svn' and 'pdv' directories (and probably others) and to only look at files of type *.p, *.w or *.i.
I can easily do this with a program, but it is proving very slow to run. I want to speed up the process (so that I'm not searching thousands of files repeatedly) as I need to run such searches against a long list of criteria.
Normally, we search the file system using:
find . -name "*.[!r]*" -exec grep -i -l "search for me" {} \;
This works, but I then have to use a program to exclude the unwanted directories, so it runs very slowly.
After looking at the topics here:
Stack Overflow thread
I've decided to try a few other approaches:
grep -ilR "search for me" . --exclude ".svn" --excluse "pdv" --exclude "!.{p,w,i*}"
Excludes the '.svn' directory but not the 'pdv' directory, and doesn't limit the files looked at.
grep -ilR "search for me" . --exclude ".svn" --excluse "pdv" --include "*.p"
Excludes the '.svn' directory but not the 'pdv' directory, and doesn't limit the files looked at.
find . -name "*.[!r]*" -exec grep -i -l ".svn" | grep -i -l "search for me" {} \;
I can't even get this (or variations on it) to run successfully.
find . ! -name "*.svn*" -prune -print -exec grep -i -l "search for me" {} \;
Doesn't return anything. It looks like it stops as soon as it finds the .svn directory.

How about something like:
find . \( \( -name .svn -o -name pdv \) -type d -prune \) -o \( -name '*.[pwi]' -type f -exec grep -i -l "search for me" {} + \)
This will:
- ignore the contents of directories named .svn and pdv
- grep regular files named *.p, *.w or *.i (symlinks are skipped, because -type f does not match them unless find is run with -L)
The + terminator after -exec (in place of \;) means gather as many files into a single command as will fit on the command line (roughly 1 million chars in Linux). This can seriously speed up processing if you have to iterate over thousands of files.
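To see the prune-then-grep pattern in action, here is a minimal sketch that runs the same find expression against a throwaway tree. The directory layout and file names are invented for the demo; only the find expression comes from the answer above.

```shell
# Build a scratch tree (all names below are hypothetical demo data).
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/.svn" "$demo/pdv"
echo "Search For Me" > "$demo/src/a.p"    # wanted: right suffix, matching text
echo "nothing here"  > "$demo/src/b.w"    # right suffix, no match
echo "search for me" > "$demo/src/c.r"    # wrong suffix
echo "search for me" > "$demo/.svn/d.p"   # inside a pruned directory
echo "search for me" > "$demo/pdv/e.i"    # inside a pruned directory

# Prune .svn and pdv, then grep only regular files named *.p, *.w or *.i.
matches=$(cd "$demo" && find . \( \( -name .svn -o -name pdv \) -type d -prune \) \
    -o \( -name '*.[pwi]' -type f -exec grep -i -l "search for me" {} + \))
echo "$matches"

rm -rf "$demo"
```

Only the file under src that has both a wanted suffix and the search text should be listed; the files inside the pruned directories are never opened.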

The following command finds only *.rb files containing the line require 'bundler/setup' and excludes the .git and .bundle directories from the search. I think that is the same use case.
grep -ril --exclude-dir .git --exclude-dir .bundle \
--include \*.rb "^require 'bundler/setup'$" .
The problem was with swapping of --exclude and --exclude-dir parameters I believe. Refer to the grep(1) manual.
Also note that exclude/include parameters accept GLOB only, not regexps, therefore single character suffix range can be done with one --include parameter, but more complex conditions would require more of the parameters:
--include \*.[pwi] --include \*.multichar_sfx ...
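Applied to the original question, the same idea might look like the sketch below. It assumes GNU grep (for --exclude-dir); the scratch tree and the *.inc suffix are made up to show a second --include alongside the single-character range.

```shell
# Scratch tree with hypothetical file names.
demo=$(mktemp -d)
mkdir -p "$demo/lib" "$demo/.svn" "$demo/pdv"
echo "search for me" > "$demo/lib/one.p"
echo "SEARCH FOR ME" > "$demo/lib/two.inc"
echo "search for me" > "$demo/lib/three.r"   # suffix not included
echo "search for me" > "$demo/.svn/four.p"   # excluded directory
echo "search for me" > "$demo/pdv/five.w"    # excluded directory

# One --exclude-dir per directory, one --include per glob.
hits=$(cd "$demo" && grep -ril --exclude-dir=.svn --exclude-dir=pdv \
    --include='*.[pwi]' --include='*.inc' "search for me" . | sort)
echo "$hits"

rm -rf "$demo"
```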

You can try the following:
find path_starting_point -type f | grep regex_to_filter_file_names | xargs grep regex_to_find_inside_matched_files

find . -name "filename_glob" | grep -v -e '\.svn' -e 'pdv' | xargs grep -i 'your search string'

Related

Find replace text in multiple files in subdirectories. Exclude some subdirectories

I want to find and replace pattern1 with pattern2 only in certain files of my subdirectories, while excluding some subdirectories from the replacement. What is wrong with this command?
find ./ -type f --exclude-dir='workspace' --exclude-dir='builds' \
-exec sed -i '' 's/foo/bar/g' {} \;
I don't see the option --exclude-dir in man find (I do in man grep, but you can't just borrow other command's options).
Try
find . -type f -not -path './workspace*' ...
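A fuller sketch of that approach, using -prune so find never descends into the excluded directories. The scratch tree is invented for the demo, and sed -i.bak is used here as an assumption on my part because the attached-suffix form is accepted by both GNU and BSD sed; the backup files are deleted afterwards.

```shell
# Scratch tree; the directory names mirror the question, the files are made up.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/workspace" "$demo/builds"
echo "foo" > "$demo/src/a.txt"
echo "foo" > "$demo/workspace/b.txt"
echo "foo" > "$demo/builds/c.txt"

(
  cd "$demo" || exit 1
  # Skip ./workspace and ./builds entirely, edit every other regular file.
  find . \( -path ./workspace -o -path ./builds \) -prune \
      -o -type f -exec sed -i.bak 's/foo/bar/g' {} +
  # Remove the backup copies that sed left behind.
  find . -name '*.bak' -type f -exec rm -f {} +
)

edited=$(cat "$demo/src/a.txt")
skipped=$(cat "$demo/workspace/b.txt")
echo "edited: $edited, skipped: $skipped"

rm -rf "$demo"
```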

bzgrep not printing the file name

find . -name '{fileNamePattern}*.bz2' | xargs -n 1 -P 3 bzgrep -H "{patternToSearch}"
I am using the command above to find out a .bz2 file from set of files that have a pattern that I am looking for. It does go through the files because I can see the pattern that I am trying to find being printed on the console but I don't see the file name.
If you look at the bzgrep script (for example this version for OS X) you will see that it pipes the output from bzip2 through grep. That process loses the original filenames. grep never sees them so it cannot print them out (despite your -H flag).
Something like this should do, not exactly what you want but something similar. (You could get the prefix you were expecting by piping the output from bzgrep into sed/awk but that's a bit less simple of a command to write out.)
find . -name '{fileNamePattern}*.bz2' -printf '### %p\n' -exec bzgrep "{patternToSearch}" {} \;
I printed the file name through echo command and xargs.
find . -name "*bz2" | parallel -j 128 echo -n {}\" \" | xargs bzgrep {pattern}
Etan is very close with his answer: grep indeed does not show the filename when it is given only one file, so you can make grep believe it is looking into multiple files just by adding the null device, /dev/null, as an extra file. The command becomes:
find . -name '{fileNamePattern}*.bz2' -printf '### %p\n' \
-exec bzgrep "{patternToSearch}" {} /dev/null \;
(It's a dirty trick but it's helping me already for more than 15 years :-) )
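The effect of the extra /dev/null argument can be seen with plain grep, without involving bzgrep at all. This is only an illustration of the underlying grep behaviour; the file name and search word are made up.

```shell
demo=$(mktemp -d)
echo "needle" > "$demo/only.txt"

# One file operand: grep prints just the matching line.
without=$(cd "$demo" && grep needle only.txt)

# Two file operands (the second is empty): grep prefixes the file name.
with=$(cd "$demo" && grep needle only.txt /dev/null)

# The same trick inside find -exec, one grep per file.
via_find=$(cd "$demo" && find . -name '*.txt' -exec grep needle {} /dev/null \;)

printf '%s\n%s\n%s\n' "$without" "$with" "$via_find"
rm -rf "$demo"
```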

Unix Recursively move all files but keeping the structure

I have a folder named "in" that contains several folders "a", "b", "c", and I want to move all files to the folder "proc" and compress them. The tricky part is that the files in "in/a" have to be moved to "proc/a", the files in "in/b" to "proc/b", and so on.
I managed to find all files and zip them with this command:
find . -type f ! \( -name "*gz" -o -name "*tmp" -o -name "*xftp" \) -exec gzip -n '{}' \;
But I can't find a generic command to move the files that works without me giving the names of the folders. Can anyone give me a hand?
Well, I ended up finding out I had a couple more problems, for example the target folder not existing, so I ended up using this code:
find . -type f ! \( -name "*gz" -o -name "*tmp" -o -name "*xftp" \) -exec gzip -n '{}' \;
find . -name "*.gz" | cpio -p -dumv $1
if [ "$?" = "0" ]; then
find . -name "*.gz" -exec rm -rf {} \;
else
echo "cpio Failed!" 1>&2
exit 1
fi
The first line finds all files to be processed and zips them.
The second line finds all .gz files and copies them to the target directory (in my case $1, the first argument), creating as many folders as necessary to preserve the same structure.
The if block checks the status of the last command. If it worked, it finds and removes all .gz files from the source folder without deleting any folder. If it didn't, it deletes nothing, so I can analyse what happened (maybe it ran out of space).
I bet there's a faster way of doing this without having to use so much disk space, but since that was not a problem for me it looks acceptable.
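One possible way to avoid the intermediate copies is to write each compressed file straight into the target tree. This is a sketch under assumptions, not the approach used above: it relies on gzip -c writing to standard output, and the "in"/"proc" layout and file names are demo data.

```shell
# Scratch tree shaped like the question (file names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/in/a" "$demo/in/b"
echo "alpha" > "$demo/in/a/one.log"
echo "beta"  > "$demo/in/b/two.log"
echo "temp"  > "$demo/in/b/skip.tmp"     # excluded by the name tests

(
  cd "$demo/in" || exit 1
  # For each selected file: create the matching target directory,
  # compress directly into it, and delete the source only on success.
  find . -type f ! \( -name '*gz' -o -name '*tmp' -o -name '*xftp' \) -exec sh -c '
    dest=$1; shift
    for f do
      mkdir -p "$dest/$(dirname "$f")" &&
      gzip -n -c "$f" > "$dest/$f.gz" &&
      rm -f "$f"
    done' sh "$demo/proc" {} +
)

moved=$(cd "$demo/proc" && find . -type f | sort)
left=$(cd "$demo/in" && find . -type f | sort)
printf 'moved:\n%s\nleft:\n%s\n' "$moved" "$left"

rm -rf "$demo"
```

Because the source file is removed only after mkdir and gzip both succeed, a failure (for example a full disk) leaves the original in place.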

Unix to find pdf files from list in text file

I have a directory (for Endnote) that is filled with PDF files (1000's of them). I have used Unix to print a list of all of the pdf files and saved this list as a text file. Most of these pdf files are located in other directories throughout my computer (duplicates).
Now, I want to use the find command to search for duplicates of these pdf files throughout the rest of my computer and, if a duplicate is found, move it to a new directory. If a specific file name is found more than once, I want to give each a unique name (i.e. basename.pdf.1, basename.pdf.2 etc). At the end, I want a single directory for all duplicates so I can double check them and then delete them.
However, I do not want find to search the directory in which my list was made from or my Dropbox, as I do not want to move these pdf files (only move the other pdfs scattered throughout my computer).
I have found (I think) how to do all of the individual steps that I need to complete this task, but I cannot seem to put everything together into a working Unix command.
1) In order to find files while excluding a directory:
find -name "what to search for" -not -path "excluded_directory"
or
find build -not \( -path excluded_directory1 -prune \) -not \( -path excluded_directory2 -prune \) -name \*.what_to_find
or my current favorite
find . -name '*.what_to_find' | grep -v excludeddir1 | grep -v excludeddir2
2) In order to read a text file into find and use the lines as search patterns:
find . -type f -print | fgrep -f file_list.txt
3) to find and move files
find / -iname "*.what_to_find" -type f -exec mv {} /new_directory \;
or
find / -iname "*.what_to_find" -type f | xargs -I '{}' mv '{}' /new_directory
or (to rename files so files with same name are not just overwritten by each other). I haven't quite figured everything going on in this command out yet...
find -name '*.what_to_find' -type f -exec bash -c 'mv -v "$0" "./$( mktemp "$( basename "$0" ).XXX" )"' '{}' \;
So, I can execute these commands individually, but have not been able to get them to work together as desired (maybe my order of commands is wrong? other problems?).
find . type f -print | fgrep -f file_list.txt | grep -v excludeddir1 | grep -v excludeddir2 -exec bash -c 'echo mv -v "$0" "./$( mktemp "$( basename "$0" ).XXX" )"' '{}' \;
Any help is much appreciated!
Thanks,
Derrick
Well, I wasn't able to complete this task exactly how I wanted to, but I found a workaround that got the job done.
I printed a list of all PDFs I have in Endnote, then deleted the path name, leaving just the file names (find and replace function in TextWrangler). I then used the find command to search this list against my computer, printing all occurrences of each PDF.
Then in TextWrangler, I deleted all lines containing the initial path to my Endnote PDFs, leaving just the desired duplicates.
Next, I used the find command to search for these exact paths and move them to a new folder.
All in all, I got by with the exact same commands I have in my original post, and a little help from TextWrangler. Unfortunately I never figured out how to combine all my desired steps into a single Unix command.
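For what it's worth, the individual steps can be combined in a short loop rather than a single pipeline. The sketch below is one possible arrangement, not a tested solution for the original machine: it assumes file_list.txt holds bare file names (one per line), that no path contains a newline, and the directory names (endnote, Dropbox, dups and the rest) are demo stand-ins.

```shell
# Scratch tree; every name here is hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/endnote" "$demo/Dropbox" "$demo/docs/old" "$demo/misc" "$demo/dups"
for d in endnote Dropbox docs docs/old misc; do
  echo "pdf" > "$demo/$d/paper.pdf"
done
echo "pdf" > "$demo/misc/other.pdf"        # not in the list, must stay put
echo "paper.pdf" > "$demo/file_list.txt"

(
  cd "$demo" || exit 1
  # Prune the directories that must not be touched, list the other PDFs.
  find . \( -path ./endnote -o -path ./Dropbox -o -path ./dups \) -prune \
      -o -type f -name '*.pdf' -print |
  while IFS= read -r path; do
    name=$(basename "$path")
    # Keep only files whose name appears as a whole line in the list.
    grep -Fxq -- "$name" file_list.txt || continue
    # Pick the first unused numeric suffix: name.1, name.2, ...
    n=1
    while [ -e "dups/$name.$n" ]; do n=$((n + 1)); done
    mv "$path" "dups/$name.$n"
  done
)

dups=$(ls "$demo/dups" | sort)
kept=$(cd "$demo" && find . -name '*.pdf' | sort)
printf 'dups:\n%s\nkept:\n%s\n' "$dups" "$kept"

rm -rf "$demo"
```

The combined command in the question fails because -exec is an option of find, not of grep, so it cannot be appended after a pipe; the loop takes over that role here.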

Use grep --exclude/--include syntax to not grep through certain files

I'm looking for the string foo= in text files in a directory tree. It's on a common Linux machine, I have bash shell:
grep -ircl "foo=" *
In the directories are also many binary files which match "foo=". As these results are not relevant and slow down the search, I want grep to skip searching these files (mostly JPEG and PNG images). How would I do that?
I know there are the --exclude=PATTERN and --include=PATTERN options, but what is the pattern format? The man page of grep says:
--include=PATTERN Recurse in directories only searching file matching PATTERN.
--exclude=PATTERN Recurse in directories skip file matching PATTERN.
Searching on grep include, grep include exclude, grep exclude and variants did not find anything relevant.
If there's a better way of grepping only in certain files, I'm all for it; moving the offending files is not an option. I can't search only certain directories (the directory structure is a big mess, with everything everywhere). Also, I can't install anything, so I have to do with common tools (like grep or the suggested find).
Use the shell globbing syntax:
grep pattern -r --include=\*.cpp --include=\*.h rootdir
The syntax for --exclude is identical.
Note that the star is escaped with a backslash to prevent it from being expanded by the shell (quoting it, such as --include="*.cpp", would work just as well). Otherwise, if you had any files in the current working directory that matched the pattern, the command line would expand to something like grep pattern -r --include=foo.cpp --include=bar.cpp rootdir, which would only search files named foo.cpp and bar.cpp, which is quite likely not what you wanted.
Update 2021-03-04
I've edited the original answer to remove the use of brace expansion, which is a feature provided by several shells such as Bash and zsh to simplify patterns like this; but note that brace expansion is not POSIX shell-compliant.
The original example was:
grep pattern -r --include=\*.{cpp,h} rootdir
to search through all .cpp and .h files rooted in the directory rootdir.
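A small runnable sketch of the quoted, one-glob-per-option form. The rootdir tree and its files are invented for the demo; quoting each glob keeps it away from the shell whichever shell is in use.

```shell
# Scratch tree with hypothetical source files.
demo=$(mktemp -d)
mkdir -p "$demo/rootdir/sub"
echo "pattern" > "$demo/rootdir/main.cpp"
echo "pattern" > "$demo/rootdir/sub/util.h"
echo "pattern" > "$demo/rootdir/notes.txt"   # matches the text but not the globs

# Quoted globs are passed to grep unchanged; grep applies them to file names.
found=$(cd "$demo" && grep -rl --include='*.cpp' --include='*.h' pattern rootdir | sort)
echo "$found"

rm -rf "$demo"
```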
If you just want to skip binary files, I suggest you look at the -I (upper case i) option. It ignores binary files. I regularly use the following command:
grep -rI --exclude-dir="\.svn" "pattern" *
It searches recursively, ignores binary files, and doesn't look inside Subversion hidden folders, for whatever pattern I want. I have it aliased as "grepsvn" on my box at work.
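A quick way to check that behaviour, assuming GNU grep: the sketch below builds a scratch tree containing a fake "image" with NUL bytes (which is what makes grep classify a file as binary) and a .svn directory. All file names are made up.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/.svn" "$demo/conf"
echo "foo=1" > "$demo/conf/app.ini"
echo "foo=2" > "$demo/.svn/entries"                  # inside the excluded directory
printf 'foo=\000\001\002' > "$demo/conf/image.png"   # NUL bytes: treated as binary

# -I skips binary files, --exclude-dir skips the Subversion metadata.
text_hits=$(cd "$demo" && grep -rIl --exclude-dir=.svn "foo=" . | sort)
echo "$text_hits"

rm -rf "$demo"
```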
Please take a look at ack, which is designed for exactly these situations. Your example of
grep -ircl --exclude=*.{png,jpg} "foo=" *
is done with ack as
ack -icl "foo="
because ack never looks in binary files by default, and -r is on by default. And if you want only CPP and H files, then just do
ack -icl --cpp "foo="
grep 2.5.3 introduced the --exclude-dir parameter which will work the way you want.
grep -rI --exclude-dir=\.svn PATTERN .
You can also set an environment variable: GREP_OPTIONS="--exclude-dir=\.svn"
I'll second Andy's vote for ack though, it's the best.
I found this after a long time, you can add multiple includes and excludes like:
grep "z-index" . --include=*.js --exclude=*js/lib/* --exclude=*.min.js
The suggested command:
grep -Ir --exclude="*\.svn*" "pattern" *
is conceptually wrong, because --exclude works on the basename. In other words, it will skip only the .svn in the current directory.
In grep 2.5.1 you have to add this line to ~/.bashrc or ~/.bash_profile:
export GREP_OPTIONS="--exclude=\*.svn\*"
I find grepping grep's output to be very helpful sometimes:
grep -rn "foo=" . | grep -v "Binary file"
Though, that doesn't actually stop it from searching the binary files.
If you are not averse to using find, I like its -prune feature:
find [directory] \
-name "pattern_to_exclude" -prune \
-o -name "another_pattern_to_exclude" -prune \
-o -name "pattern_to_INCLUDE" -print0 \
| xargs -0 -I FILENAME grep -IR "pattern" FILENAME
On the first line, you specify the directory you want to search. . (current directory) is a valid path, for example.
On the 2nd and 3rd lines, use "*.png", "*.gif", "*.jpg", and so forth. Use as many of these -o -name "..." -prune constructs as you have patterns.
On the 4th line, you need another -o (it specifies "or" to find), the patterns you DO want, and you need either a -print or -print0 at the end of it. If you just want "everything else" that remains after pruning the *.gif, *.png, etc. images, then use
-o -print0 and you're done with the 4th line.
Finally, on the 5th line is the pipe to xargs which takes each of those resulting files and stores them in a variable FILENAME. It then passes grep the -IR flags, the "pattern", and then FILENAME is expanded by xargs to become that list of filenames found by find.
For your particular question, the statement may look something like:
find . \
-name "*.png" -prune \
-o -name "*.gif" -prune \
-o -name "*.svn" -prune \
-o -print0 | xargs -0 -I FILES grep -IR "foo=" FILES
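One thing to watch with the bare -o -print0 form is that find also prints directories (including . itself), and grep -R will then descend into them again, pruned or not. A variant that restricts the output to regular files and drops -R might look like the sketch below; it assumes a find and xargs that support -print0 and -0, and the scratch tree is demo data.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/.svn" "$demo/my docs"
echo "foo=1" > "$demo/my docs/read me.txt"   # spaces in the path on purpose
echo "foo=2" > "$demo/.svn/entries"          # pruned directory
echo "foo=3" > "$demo/logo.png"              # pruned by name
echo "bar"   > "$demo/other.txt"             # searched, no match

# Prune unwanted names, print only regular files, hand them to grep in batches.
pruned_hits=$(cd "$demo" && find . \
    \( -name '*.png' -o -name '*.gif' -o -name .svn \) -prune \
    -o -type f -print0 | xargs -0 grep -Il "foo=" | sort)
echo "$pruned_hits"

rm -rf "$demo"
```

Dropping -I FILENAME also lets xargs pass many files to each grep invocation instead of starting one grep per file.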
On CentOS 6.6/Grep 2.6.3, I have to use it like this:
grep "term" -Hnir --include \*.php --exclude-dir "*excluded_dir*"
Notice the lack of equal signs "=" (otherwise --include, --exclude, include-dir and --exclude-dir are ignored)
git grep
Use git grep which is optimized for performance and aims to search through certain files.
By default it ignores binary files and honors your .gitignore. If you're not working inside a Git repository, you can still use it by passing --no-index.
Example syntax:
git grep --no-index "some_pattern"
For more examples, see:
How to exclude certain directories/files from git grep search.
Check if all of multiple strings or regexes exist in a file
I'm a dilettante, granted, but here's how my ~/.bash_profile looks:
export GREP_OPTIONS="-orl --exclude-dir=.svn --exclude-dir=.cache --color=auto" GREP_COLOR='1;32'
Note that to exclude two directories, I had to use --exclude-dir twice.
If you search non-recursively you can use glob patterns to match the filenames.
grep "foo" *.{html,txt}
includes html and txt. It searches in the current directory only.
To search in the subdirectories:
grep "foo" */*.{html,txt}
In the subsubdirectories:
grep "foo" */*/*.{html,txt}
In the directories are also many binary files. I can't search only certain directories (the directory structure is a big mess). Is there's a better way of grepping only in certain files?
ripgrep
This is one of the quickest tools designed to recursively search your current directory. It is written in Rust, built on top of Rust's regex engine for maximum efficiency. Check the detailed analysis here.
So you can just run:
rg "some_pattern"
It respects your .gitignore and automatically skips hidden files/directories and binary files.
You can still customize include or exclude files and directories using -g/--glob. Globbing rules match .gitignore globs. Check man rg for help.
For more examples, see: How to exclude some files not matching certain extensions with grep?
On macOS, you can install via brew install ripgrep.
find and xargs are your friends. Use them to filter the file list rather than grep's --exclude
Try something like
find . -type f -not -name '*.png' -print | xargs grep -icl "foo="
The advantage of getting used to this is that it is expandable to other use cases, for example to count the lines in all non-png files:
find . -type f -not -name '*.png' -print | xargs wc -l
To remove all non-png files:
find . -type f -not -name '*.png' -print | xargs rm
etc.
As pointed out in the comments, if some files may have spaces in their names, use -print0 and xargs -0 instead.
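A sketch of the -print0 / xargs -0 variant, written so that only regular files that are not PNGs reach grep. It assumes a find that accepts -not and -print0 (GNU and BSD find both do); the file names are made up, including one with a space.

```shell
demo=$(mktemp -d)
printf 'foo=1\n'    > "$demo/has space.txt"
printf 'foo=2\n'    > "$demo/picture.png"    # filtered out by name
printf 'one\ntwo\n' > "$demo/plain.txt"      # searched, no match

# NUL-separated names survive spaces and other odd characters intact.
safe_hits=$(cd "$demo" && find . -type f -not -name '*.png' -print0 |
    xargs -0 grep -il "foo=")
echo "$safe_hits"

rm -rf "$demo"
```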
Try this one:
$ find . -name "*.txt" -type f -print | xargs file | grep -i "text" | cut -d: -f1 | xargs grep -l "foo="
Found here: http://www.unix.com/shell-programming-scripting/42573-search-files-excluding-binary-files.html
Those scripts don't solve the whole problem. Try this instead:
du -ha | grep -i -o "\./.*" | grep -v "\.svn\|another_file\|another_folder" | xargs grep -i -n "$1"
This script is better because it uses "real" regular expressions to exclude directories from the search. Just separate folder or file names with "\|" in the grep -v pattern.
enjoy it!
found on my linux shell! XD
Look at this one.
grep --exclude="*\.svn*" -rn "foo=" * | grep -v Binary | grep -v tags
The --binary-files=without-match option to GNU grep gets it to skip binary files. (Equivalent to the -I switch mentioned elsewhere.)
(This might require a recent version of grep; 2.5.3 has it, at least.)
suitable for tcsh .alias file:
alias gisrc 'grep -I -r -i --exclude="*\.svn*" --include="*\."{mm,m,h,cc,c} \!* *'
Took me a while to figure out that the {mm,m,h,cc,c} portion should NOT be inside quotes.
~Keith
To ignore all binary results from grep
grep -Ri "pattern" * | awk '{if($1 != "Binary") print $0}'
The awk part will filter out all the "Binary file foo matches" lines.
Try this:
Create a folder named "--F" in the current directory (or link another folder there, renamed to "--F", i.e. double-minus-F).
#> grep -i --exclude-dir="\-\-F" "pattern" *
