Can I create an ack file type based on a filename, not extension? - ack

I would like to include files with a specific name -- not an extension -- in my ack search. Is this possible with ack?
If I use the -G option, this would exclude all other file types. (So I can't put it in my .ackrc file.)
I tried --type-set mytype=filename.txt, but this only works for extensions, so it would search for files matching the pattern .filename.txt and thus not find filename.txt. (That's also what ack --help types shows: --mytype .filename.txt, not --mytype filename.txt.)
Does anyone have any ideas?

man ack says that the files to be searched in can be given through standard input.
So this should work:
find . -name filename.txt | ack PATTERN -
Unfortunately, it doesn't. It gives ack: Ignoring 1 argument on the command-line while acting as a filter., which is apparently a bug in ack. Once this bug is fixed, we should be able to use
find . -name filename.txt | ack --nofilter PATTERN -
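As a workaround (my own suggestion, not part of the original answer), you can have xargs pass the matching files to ack as regular arguments instead of relying on ack's filter mode:
find . -name filename.txt | xargs ack PATTERN
If the filenames may contain spaces, use find's -print0 together with xargs -0.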

You can also do it like this if you're using zsh:
ack 'Pattern' **/filename.txt

What you're asking is "can I make a filetype that ack recognizes based on a filename", and the answer is "No, not in ack 1.x, but you can in ack 2.0". ack 2.0 is in alpha release, and we hope to have a beta by Christmas.
As @Christian pointed out above, you can specify the given filename on the command line, but that bypasses filetype checking entirely.

I know this is a late reply, but could you simply specify the filename when you run ack?
ack 'My Text' filename.txt


Ack doesn't show the line number when I search in a single file

Example: when I search for allocproc in the proc.c file,
$ ack allocproc proc.c
allocproc(void)
p = allocproc();
if((np = allocproc()) == 0){
// Return to "caller", actually trapret (see allocproc).
but when I search in the whole directory,
$ ack allocproc
---- blah blah blah ----
proc.c
36:allocproc(void)
84: p = allocproc();
139: if((np = allocproc()) == 0){
357: // Return to "caller", actually trapret (see allocproc).
I want to see line numbers when I search for a string in a single file.
Maybe adding a line to .bashrc, alias ackl='ack -H',
and using the ackl command by default will solve this temporarily.
The -H flag forces ack to print the filename heading and line numbers even when searching a single file. This behavior is copied directly from GNU grep.
You point out the option of creating a shell alias. Another option is to put the -H in an ackrc file. ack supports three different places to find an ackrc. There's a system-wide one in /etc/ackrc, there's one that's personal to you in your ~/.ackrc file, and you can also have a project-specific file, typically in the root of a project.
For more about ackrc files, look at the ack manual (ack --man is one way to see it) and look for the section "THE .ackrc FILE" and "ACKRC LOCATION SEMANTICS".
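For example, a personal ~/.ackrc that always turns this on could contain nothing more than the single line:
-H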
The one downside to putting -H in your .ackrc is that it will always be in force no matter how you call ack, so if, for example, you're piping output from one process through ack, ack will still show the heading and line numbers.
One other way to deal with this: Just add the -H option when you need it.
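For example:
ack -H allocproc proc.c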
I've discovered that, in both grep and ack, if they behave differently with one file than they do with multiple files, you can force the multiple-file behavior by including /dev/null as a second file to search through.
Of course, using the -H switch is much cleaner (and has the advantage of being listed in the documentation, so that curious maintainers can see the exact purpose behind its use), but if you're in a pinch and don't have the documentation available (or if you're using some other program that behaves differently with one vs. with multiple files), then using /dev/null will probably work.
I don't recommend using this /dev/null technique in commands called in scripts -- in such cases, the -H switch is the preferred method, unless for some reason -H is literally unavailable to you.
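With the search from the question above, that would look like:
ack allocproc proc.c /dev/null
The extra /dev/null argument makes ack believe it is searching multiple files, so it prints the filename heading and line numbers.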

Ack — Ignoring multiple directories without repeating the flag

Is it possible to ignore multiple directories in Ack, without repeating the flag?
e.g. I know the following works (i.e. setting multiple flags):
ack --ignore-dir=install --ignore-dir=php 'teststring'
I was hoping that I could separate directories with commas, like I can do with the extensions as follows:
ack --ignore-file=ext:css,scss,orig 'teststring'
However, the following comma separated ignore flag doesn't work:
ack --ignore-dir=install,php 'teststring'
Is it possible to use some short-hand equivalent, so I don't have to repeatedly type out the --ignore-dir flag?
It's actually similar to how you would specify the include patterns for grep:
ack <term> --ignore-dir={dir_a,dir_b}
However, this format does not work with a single directory, because the shell only performs brace expansion when there is more than one item inside the braces. So
ack <term> --ignore-dir={log}
will not work. In short: to ignore a single directory use
ack <term> --ignore-dir=dir_a
and to ignore multiple directories use
ack <term> --ignore-dir={dir_a,dir_b}
Since you're using ack 2, you can also put --ignore-dir=install and --ignore-dir=php in a .ackrc file in the root of your project. Then every ack invocation in that tree will use those flags.
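For example, such a project-level .ackrc might contain nothing more than:
--ignore-dir=install
--ignore-dir=php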
One approach could be to select the directories to exclude with a regular expression in the -G option, complemented by the --invert-file-match option. Based on your question, something like the following:
ack -a -G 'install|php' --invert-file-match 'teststring' .

In what order does cat choose files to display?

I have the following line in a bash script:
find . -name "paramsFile.*" | xargs -n131072 cat > parameters.txt
I need to make sure the order the files are concatenated in does not change when I use this command. For example, if I run this command twice on the same set of paramsFile.*, parameters.txt should be the same both times. My question is, is this the case? And if it isn't, how can I make sure it is?
Thanks!
Edit: the same question goes for xargs: would that change how the files are fed to cat?
Edit2: as William Pursell pointed out, this question is actually about find. Does find always return files in the same order?
From the description in man cat:
The cat utility reads files sequentially, writing them to the standard output. The file operands are processed in command-line order. If file is a single dash (`-') or absent, cat reads from the standard input. If file is a UNIX domain socket, cat connects to it and then reads it until EOF. This complements the UNIX domain binding capability available in inetd(8).
So yes, as long as you pass the files to cat in the same order every time, you'll be OK.
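That said, find's output order is not guaranteed to be stable across filesystems or after directory changes, so if you want a deterministic order you could (this is my suggestion, not part of the original answer) sort the file list before handing it to cat:
find . -name "paramsFile.*" | sort | xargs cat > parameters.txt
For filenames that may contain whitespace, the null-separated variant find . -name "paramsFile.*" -print0 | sort -z | xargs -0 cat > parameters.txt is safer.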

Ack: Search directory tree for files with a particular extension

I basically just want to do ack foo *.citrus and have ack drill down and find the string 'foo' in all Citrus files in the current directory and below. The trouble is that this won't work if there aren't any Citrus files in the current directory.
I tried messing with -G without success. Do I really need to add a file type in .ackrc just to limit the search to files with a given extension?
As suggested by Andy Lester, you can also define the type on the command line without taking the trouble to add it to your .ackrc file:
ack --type-set=cit=.citrus --cit "foo"
By default, ack searches only in files with known types (like *.java, *.cpp, etc.). It doesn't know about *.citrus files, so to search in such files you must use the -a command-line switch:
$ ack -a -G '\.citrus$' foo
1.d/1.citrus
1:foo_bar
You don't have to set it in .ackrc if you don't want to. You can also set ACK_OPTIONS in your environment, or specify --type-set arguments on the command line. ack doesn't care.
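For instance (a sketch reusing the type definition from above), the environment-variable route might look like:
export ACK_OPTIONS="--type-set=cit=.citrus"
ack --cit foo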

How do I distinguish between 'binary' and 'text' files?

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view whether the encoding is 'binary' or 'text' doesn't really matter. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use scare quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell whether a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like file system? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell whether it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters that are printable on the user's console. And in particular, how would you implement this? (I thought this was implied, but I should have specified: I'm not really after which existing programs I can use, although pointers to existing code that does this are helpful.)
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS.
If it starts with text/, it's text; otherwise it's binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the file type.
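A small shell sketch of that rule (the is_text helper name is my own):
is_text() {
    # Ask file(1) for just the MIME type, e.g. text/plain or application/xhtml+xml
    case "$(file -b --mime-type "$1")" in
        text/*|*+xml) return 0 ;;  # plain text, or an XML application type
        *)            return 1 ;;
    esac
}
is_text README && echo "README looks like text"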
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
then an exit status of 0 means the file is text, and 1 means it is binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a Unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
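If you just want to inspect the byte-order mark from a shell (a quick sketch, not part of the original answer), you can dump the first four bytes as hex and compare them against the table above:
head -c 4 somefile.txt | od -An -tx1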
It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, you can simply do it like this:
if file -i "$1" | grep -q text;
then
    # ... handle the text file here ...
fi
This will get the file type, and with a silent grep you can decide whether it's text.
You can use libmagic, which is a library version of the Unix file command-line tool.
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing whether those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For a finer distinction there's always the file command on UNIX-like systems.
One simple check is whether it contains \0 (NUL) characters. Text files don't have them.
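One possible shell version of that check (my own sketch; it reads the whole file): delete every NUL byte with tr and compare the result to the original file; if nothing was removed, the file contains no NULs.
tr -d '\0' < somefile | cmp -s - somefile && echo "no NUL bytes, probably text"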
As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines the magic numbers contained within many popular file structures.
This file, called magic, was historically stored in /etc, although it may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within a file, and the file command can then examine these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic).
As for an implementation, that can be found within file.c itself; however, the relevant portion of the file command that determines whether content is readable text is the following:
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
        if (!isascii(buf[i]) ||
            (iscntrl(buf[i]) && !isspace(buf[i]) &&
             buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'
            )
           )
                return 0;       /* not all ASCII */
}
