unix command line ...how to grep and show only file names that contain a string? - unix

I know I can search for a string with:
grep -n -d recurse 'snoopy' *
and then it shows every file name and instance that contains that string, like:
file/name.txt:23 some snoopy here
file/name2.txt:59 another snoopy there
file/name2.txt:343 some more snoopy
etc...
The problem is that with many occurrences, the list is huge. How do I make it show only the actual file names that contain the string, without duplicates and without the occurrence?
Only like:
file/name1.txt
file/name52.txt
file/name28293.txt
Thanks a lot for any help :)

The -l flag (or, in both BSD and GNU grep, --files-with-matches) does what you want.
From the POSIX spec:
Write only the names of files containing selected lines to standard output. Pathnames shall be written once per file searched. If the standard input is searched, a pathname of "(standard input)" shall be written, in the POSIX locale. In other locales, "standard input" may be replaced by something more appropriate in those locales.
Both BSD and GNU also explicitly guarantee that this will be more efficient. (Older BSD versions say "… grep will only search a file until a match has been found, making searches potentially less expensive", newer BSD and GNU say "The scanning will stop on the first match".) If you don't know which grep you have and which options it has, just type man grep at the shell and you should get the manpage.

Related

Glob path to get newest file only

Let's say I have a directory with product inventories that are saved per day:
$ ls *.csv
2014_01_01.csv
2014_01_02.csv
...
Is there a glob pattern that will only grab the newest file? Or do I need to chain it with other commands? Basically I'm just looking to do what would about to a LIMIT 1 based on the filename sort.
Assuming your shell is bash, ksh93 or zsh, and your files have the same naming convention as the example in your question:
files=( *.csv )
printf "The newest file is %s\n" "${files[-1]}"
Since the date in the filenames is in a format that naturally sorts, storing all of them in an array and taking the last element gives you the newest one (And conversely the first element is the oldest one).

In what order does cat choose files to display?

I have the following line in a bash script:
find . -name "paramsFile.*" | xargs -n131072 cat > parameters.txt
I need to make sure the order the files are concatenated in does not change when I use this command. For example, if I run this command twice on the same set of paramsFile.*, parameters.txt should be the same both times. My question is, is this the case? And if it isn't, how can I make sure it is?
Thanks!
Edit: the same question goes for xargs: would that change how the files are fed to cat?
Edit2: as William Pursell pointed out, this question is actually about find. Does find always return files in the same order?
From description in man cat:
The cat utility reads files sequentially, writing them to the standard
output. The file operands are processed in command-line order.
If file is a single dash (`-') or absent, cat reads from the standard input. If file is a UNIX domain socket, cat connects to it
and
then reads it until EOF. This complements the UNIX domain binding capability available in inetd(8).
So yes as long as you pass the files to cat in the same order every time you'll be ok.

How can I implement the command 'ls' with wildcard, '*'?

EDIT #1 : I'm under the limit that all arguments are enclosed in two quotes, so that shell do not expand any argument with * to the corresponding path.
EDIT #2 : In order to retrieve directories such as */*, ../*, and dirA/*/file.out, How should I use iteration loop or recursive call?
I have just learned about the function fnmatch(). But I don't know start place.
There are many possible cases. I'm confused dealing with these all cases.
For example, Let me assume that executable program is a.out.
$./a.out -l */*
$./a.out -l ../*
$./a.out -l [file_name] [directory_name]
/* Since I also have to implement ls command with no wildcard. */
What should I do? Any advice would be awesome.
Thank you in advance.
Your problem is : shell replaces wildcard caracter * with all of the filenames matching the pattern.
Solution:
If you do not want to use this feature of bash, just put quotation marks around your command line arguments.
Calling your program that way will have the original arguments, containing wildcards.
After this, you can list all the filenames with their paths. For example using some recursive algorithm. Then you can apply some matching to these path string. (when visiting it)
If you want to be a good unix citizen, the rule is Don't do filename globbing unless you are writing a shell.
You want to write an ls-like program? Don't do any wildcard expansion. Don't treat "*" specially. Just treat your argv as a list of filenames. If your program handles these cases:
./a.out file1
./a.out file1 file2 file3
Then it will also handle
./a.out file*
correctly because the shell will do the expansion and your program won't need to know about it. And besides that, it will handle this:
zsh% ./a.out **/file<40-185>~file<90-100>(.mm-30OL[1,2])
which in zsh expanded glob syntax means: expand file40 through file185, except for file90 through file100, include only the ones that have been modified in the last 30 minutes, and use only the largest 2 files in the resulting set.
fnmatch is never going to do anything like that. But these fancy globs can be used with any command that just takes a filename list and doesn't care where it came from.
When you're in a situation where you can't take a list of filenames from the command line, then consider using fnmatch. ls isn't one of those situations.

Compare two folders which have many files inside contents

Have two folders with approx. 150 java property files.
In a shell script, how to compare both folders to see if there is any new property file in either of them and what are the differences between the property files.
The output should be in a report format.
To get summary of new/missing files, and which files differ:
diff -arq folder1 folder2
a treats all files as text, r recursively searched subdirectories, q reports 'briefly', only when files differ
diff -r will do this, telling you both if any files have been added or deleted, and what's changed in the files that have been modified.
I used
diff -rqyl folder1 folder2 --exclude=node_modules
in my nodejs apps.
Could you use dircmp ?
Diff command in Unix is used to find the differences between files(all types). Since directory is also a type of file, the differences between two directories can easily be figure out by using diff commands. For more option use man diff on your unix box.
-b Ignores trailing blanks (spaces and tabs)
and treats other strings of blanks as
equivalent.
-i Ignores the case of letters. For example,
`A' will compare equal to `a'.
-t Expands <TAB> characters in output lines.
Normal or -c output adds character(s) to the
front of each line that may adversely affect
the indentation of the original source lines
and make the output lines difficult to
interpret. This option will preserve the
original source's indentation.
-w Ignores all blanks (<SPACE> and <TAB> char-
acters) and treats all other strings of
blanks as equivalent. For example,
`if ( a == b )' will compare equal to
`if(a==b)'.
and there are many more.

How do I distinguish between 'binary' and 'text' files?

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and form that point of view if the encoding is 'binary' or 'text', it doesn't really matter. And of course files just store bytes of data so they are all 'binary' and 'text' doesn't mean anything without knowing the encoding. And yet, it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell if a file is 'text' or 'binary'? And to restrict is further, how do you tell on a Linux like file-system? I am not aware of any filesystem meta-data that indicates the 'type' of a file, so the question further becomes, by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? And for simplicity, lets restrict 'text' to mean characters which are printable on the user's console. And in particular how would you implement this? (I thought this was implied on this site, but I guess it is helpful, in general, to be pointed at existing code that does this, I should have specified), I'm not really after what existing programs can I use to do this.
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).
If it starts with text/, it's text, otherwise binary. The only exception are XML applications. You can match those by looking for +xml at the end of the file type.
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
then, exit status '0' would mean the file is a text; '1' - binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T to test for text). Here's shell a one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
Its an old topic, but maybe someone will find this useful.
If you have to decide in a script if something is a file then you can simply do like this :
if file -i $1 | grep -q text;
then
.
.
fi
This will get the file type, and with a silent grep you can decide if its a text.
You can use libmagic which is a library version of the Unix file command line (source).
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII charcters). For finer distiction there's always the 'file' command on UNIX-like systems.
One simple check is if it has \0 characters. Text files don't have them.
As previously stated *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic was historically stored in /etc, although this may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within the file and can then examine these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic)
As for an implementation, well that can be found within file.c itself, however the relevant portion of the file command that determines whether it is readable text or not is the following
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
if (!isascii(buf[i]) ||
(iscntrl(buf[i]) && !isspace(buf[i]) &&
buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'
)
)
return 0; /* not all ASCII */
}

Resources