I am really amazed by the functionality of grep in the shell. I used to use a substring method in Java, but now I use grep instead and it executes in a matter of seconds; it is blazingly faster than the Java code I used to write (in my experience, at least; I might be wrong).
That being said, I have not been able to figure out how this happens, and there is also not much available on the web about it.
Can anyone help me with this?
Assuming your question regards GNU grep specifically, here's a note from the author, Mike Haertel:
GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.
GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it does look at.
GNU grep uses the well-known Boyer-Moore algorithm, which looks first
for the final letter of the target string, and uses a lookup table to
tell it how far ahead it can skip in the input whenever it finds a
non-matching character.
GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the
Boyer-Moore delta table entries in such a way that it doesn't need to
do the loop exit test at every unrolled step. The result of this is
that, in the limit, GNU grep averages fewer than 3 x86 instructions
executed for each input byte it actually looks at (and it skips many
bytes entirely).
GNU grep uses raw Unix input system calls and avoids copying data
after reading it. Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO
LINES. Looking for newlines would slow grep down by a factor of
several times, because to find the newlines it would have to look at
every byte!
So instead of using line-oriented input, GNU grep reads raw data into
a large buffer, searches the buffer using Boyer-Moore, and only when
it finds a match does it go and look for the bounding newlines. (Certain command line options like -n disable this optimization.)
This answer is a subset of the information taken from here.
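You can get a feel for that last point yourself. In this hedged sketch the file name and pattern are placeholders; the second command forces grep to number lines, so it has to find every newline instead of skipping over most of the input:
time grep 'some-fixed-string' big.log > /dev/null
time grep -n 'some-fixed-string' big.log > /dev/null
On a large file with matches spread through it, the -n run is usually noticeably slower, for exactly the reason described above.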
To add to Steve's excellent answer.
It may not be widely known, but grep is almost always faster when grepping for a longer pattern string than for a short one, because with a longer pattern Boyer-Moore can skip forward in longer strides and achieve even better sublinear speeds:
Example:
# after running these twice to ensure apples-to-apples comparison
# (everything is in the buffer cache)
$ time grep -c 'tg=f_c' 20140910.log
28
0.168u 0.068s 0:00.26
$ time grep -c ' /cc/merchant.json tg=f_c' 20140910.log
28
0.100u 0.056s 0:00.17
The longer form is 35% faster!
How come? Boyer-Moore constructs a skip-forward table from the pattern string, and whenever there's a mismatch it picks the longest skip possible (the pattern is compared from its last char towards its first) before it has to compare another input char against the skip table.
Here's a video explaining Boyer-Moore (credit to kommradHomer).
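If you want to see the mechanics, here is a rough awk sketch of the Horspool simplification of Boyer-Moore (purely illustrative, not how GNU grep is actually implemented; the pattern and log file name are borrowed from the timing example above):
awk -v pat='tg=f_c' '
BEGIN {
    m = length(pat)
    # Characters that occur in the pattern (except at its last position)
    # get a shorter shift; everything else falls back to the full length m.
    for (i = 1; i < m; i++)
        skip[substr(pat, i, 1)] = m - i
}
{
    n = length($0); pos = 0
    while (pos + m <= n) {
        last = substr($0, pos + m, 1)              # byte under the end of the window
        if (last == substr(pat, m, 1) && substr($0, pos + 1, m) == pat) {
            print FILENAME ":" $0                  # report the matching line once
            next
        }
        pos += (last in skip) ? skip[last] : m     # jump ahead; most bytes are never read
    }
}' 20140910.log
A longer pattern means larger shift values in that table, which is exactly why the longer search above finishes sooner.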
Another common misconception (about GNU grep) is that fgrep is faster than grep. The f in fgrep doesn't stand for 'fast', it stands for 'fixed' (see the man page), and since both are the same program and both use Boyer-Moore, there's no difference in speed between them when searching for fixed strings without regexp special characters. The only reason I use fgrep is when the pattern contains a regexp special character (like ., [], or *) that I don't want interpreted as such. And even then, the more portable/standard form grep -F is preferred over fgrep.
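For example (the file name is just a placeholder), both of these look for the same literal text, but -F saves you the escaping:
grep -F '1.5*2' prices.txt
grep '1\.5\*2' prices.txt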
Related
grep -F -f file1 file2
file1 is 90 MB (2.5 million lines, one word per line)
file2 is 45 GB
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some SQL commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C. It turns the search from UTF-8 to plain ASCII, which for me sped things up to about 140 times the original speed. I have a 26 GB file that used to take around 12 hours; it now takes a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
I don't think there is an easy solution.
Imagine you wrote your own program to do what you want: you would end up with a nested loop, where the outer loop iterates over the lines of file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2), which is a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of no longer giving the correct result.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
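If that word-by-word assumption holds (you want the lines of file2 that are exactly equal to some word in file1), one hedged sketch of the sorting idea is to sort both files once and let comm do a single merge pass instead of the quadratic nested loops (the *.sorted names are arbitrary):
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -12 file1.sorted file2.sorted     # lines present in both sorted files
Note that comm pairs up duplicates, so if file2 repeats a matching line it will only be reported as often as it appears in file1; the awk hash approach further down handles that case.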
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
You can try ack. It is said to be faster than grep.
You can try parallel:
parallel --progress -a file1 'grep -F {} file2'
parallel has many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? If so, you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings rather than regexes, so it could be even slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s, \w, etc.).
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.
EDIT #1: I'm working under the constraint that all arguments are enclosed in quotes, so that the shell does not expand any argument containing * into the matching paths.
EDIT #2: In order to handle patterns such as */*, ../*, and dirA/*/file.out, should I use an iterative loop or a recursive call?
I have just learned about the function fnmatch(), but I don't know where to start.
There are many possible cases, and I'm confused about how to deal with them all.
For example, let me assume that the executable program is a.out.
$./a.out -l */*
$./a.out -l ../*
$./a.out -l [file_name] [directory_name]
/* Since I also have to implement ls command with no wildcard. */
What should I do? Any advice would be awesome.
Thank you in advance.
Your problem is: the shell replaces the wildcard character * with all of the filenames matching the pattern.
Solution:
If you do not want to use this feature of bash, just put quotation marks around your command line arguments.
Called that way, your program will receive the original arguments, wildcards included.
After that, you can list all the filenames with their paths (for example, using some recursive algorithm) and apply the matching to each path string as you visit it.
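A quick way to see the difference at the prompt (printf merely stands in for your a.out here):
printf '%s\n' */*       # the shell expands the glob before the program runs
printf '%s\n' '*/*'     # the program receives the literal string */*
In the second case it is up to your program, e.g. via fnmatch() or your own recursive walk, to do the expansion.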
If you want to be a good unix citizen, the rule is Don't do filename globbing unless you are writing a shell.
You want to write an ls-like program? Don't do any wildcard expansion. Don't treat "*" specially. Just treat your argv as a list of filenames. If your program handles these cases:
./a.out file1
./a.out file1 file2 file3
Then it will also handle
./a.out file*
correctly because the shell will do the expansion and your program won't need to know about it. And besides that, it will handle this:
zsh% ./a.out **/file<40-185>~file<90-100>(.mm-30OL[1,2])
which in zsh expanded glob syntax means: expand file40 through file185, except for file90 through file100, include only the ones that have been modified in the last 30 minutes, and use only the largest 2 files in the resulting set.
fnmatch is never going to do anything like that. But these fancy globs can be used with any command that just takes a filename list and doesn't care where it came from.
When you're in a situation where you can't take a list of filenames from the command line, then consider using fnmatch. ls isn't one of those situations.
We are having a discussion at work about the best UNIX command-line tool to view log files. One side says use LESS, the other says use MORE. Is one better than the other?
A common problem is that logs have too many processes writing to them, so I prefer to filter my log files and control the output using:
tail -f /var/log/<some logfile> | grep <some identifier> | more
This combination of commands allows you to watch an active log file without getting overwhelmed by the output.
I opt for less. One reason is that (with the aid of lessopen) it can read gzipped logs (as archived by logrotate).
As an example, with this single command I can read the dpkg logs in time order, without treating the gzipped ones differently:
less $(ls -rt /var/log/dpkg.log*) | less
Multitail is the best option, because you can view multiple logs at the same time. It also colors stuff, and you can set up regex to highlight entries you're looking for.
You can use any of these programs: less, nano, vi, tail, cat, etc.; they differ in functionality.
There are also many dedicated log viewers: gnome-system-log, kiwi, etc. (they can sort logs by date, type, and so on).
Less is more. Although, since when I'm looking at my logs I'm typically searching for something specific or just interested in the last few events, I find myself using cat, pipes, and grep or tail rather than more or less.
less is the best, imo. It is lightweight compared to an editor, it allows forward and backward navigation, it has powerful search capabilities, and much more. Hit 'h' for help. It's well worth the time getting familiar with it.
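For example, I routinely open a log and jump straight to the interesting part (the path is just an example; the key sequences are shown as comments):
less /var/log/syslog
# inside less:  /ERROR    search forward for ERROR
#               n / N     jump to the next / previous hit
#               G         go to the end of the file
#               q         quit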
On my Mac, using the standard terminal windows, there's one difference between less and more, namely, after exiting:
less leaves less mess on my screen
more leaves more useful information on my screen
Consequently, if I think I might want to do something with the material I'm viewing after the viewer finishes (for example, copy'n'paste operations), I use more; if I don't want to use the material after I've finished, then I use less.
The primary advantage of less is the ability to scroll backwards; therefore, I tend to use less rather than more, but both have uses for me. YMMV (YMWV; W = Will in this case!).
As your question was generically about 'Unix systems', keep in mind that in some cases you have no choice: on old systems only MORE is available, but not LESS.
LESS is part of the GNU tools; MORE comes from the UCB days.
Turn on grep's line buffering mode.
Using tail (Live monitoring)
tail -f fileName
Using less (Live monitoring)
less +F fileName
Using tail & grep
tail -f fileName | grep --line-buffered my_pattern
Using less & grep
less +F fileName | grep --line-buffered my_pattern
Using watch & tail to highlight new lines
watch -d tail fileName
Note: for Linux systems.
I have a text file (more correctly, a "German style" CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
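Applied to the sample data above (the file names are hypothetical), a single cut could look like this:
sed '/^28\.01\.2005 14:52:38;/,/^01\.02\.2005 00:11:43;/d' measurements.csv > cleaned.csv
Keeping one such expression per begin/end pair in a small script gives you the documented, replayable set of corrections you asked for.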
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write >> tmp.txt | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
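For that clean-up step, remember that uniq only collapses adjacent duplicates, so sort first (the output name is arbitrary):
sort tmp.txt | uniq > tmp.dedup.txt     # or: sort -u tmp.txt > tmp.dedup.txt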
You can also use awk. To delete the same start/end ranges (rather than print them), skip the lines inside each range and print everything else:
awk '/start_pat/,/end_pat/ {next} 1' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which are pretty simple if you are used to using awk, sed, grep, etc.
You won't have to remember how to use lots of different tools and where you would previously have used multiple tools piped together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)
Use grep -v (print non-matching lines).
Sorry - I thought you just wanted the lines without 0,000 at the end.
I have two files ...
file1:
002009092312291100098420090922111
010555101070002956200453T+00001190.81+00001295.920010.87P
010555101070002956200449J+00003128.85+00003693.90+00003128
010555101070002956200176H+00000281.14+00000300.32+00000281
file2:
002009092410521000098420090709111
010560458520002547500432M+00001822.88+00001592.96+00001822
010560458520002547500432D+00000106.68+00000114.77+00000106
In both files, in every record starting with 01, the string from the 3rd char to the 25th char (i.e., up to the letter) is the key.
Based on this key I have to compare the two files: if a record in file2 matches one in file1, I have to replace that record in file1; otherwise I have to append it to file1.
Well, this is a fairly unspecific (and basic) programming question. We'll be better able to help you if you explain exactly what you did and where you got stuck.
Also, it looks a bit like homework, and people are wary of giving too much help on homework problems, as it might look like cheating.
To get you started:
I'd recommend Perl to solve this, but awk or another scripting language will also do. I'd recommend against sh/bash, as they are weak on text manipulation; also combining grep et al will become rather cumbersome.
First write a Perl program that filters records starting with 01. Then extract the key and put it into a hash (a Perl structure). Then output a new, combined file as required.
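If you end up going the awk route instead of Perl, here is a minimal sketch of that plan (untested against your exact format; it assumes the key is characters 3-25 of every record starting with 01, that file2's version wins, that unmatched file2 records are appended at the end, and that merged_file is a hypothetical output name):
awk '
NR == FNR {                                 # first pass: file2
    if (/^01/) { key = substr($0, 3, 23); repl[key] = $0 }
    next
}
/^01/ {                                     # file1 records that carry a key
    key = substr($0, 3, 23)
    if (key in repl) { print repl[key]; seen[key] = 1; next }
}
{ print }                                   # everything else passes through unchanged
END {
    for (k in repl)                         # append file2 records never seen in file1
        if (!(k in seen)) print repl[k]     # (in arbitrary order)
}' file2 file1 > merged_file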
Using awk, get characters 3-25 of the relevant records by doing something like
awk '/^01/' file_name | cut -c 3-25
for both files (the /^01/ selects only the records starting with 01). Collect the resulting lines in two different buffers and compare the buffers using a for line in ... loop in a shell script.
Whenever a line in the second buffer matches one in the first, grep for that line in the first file and replace the matching line in the first file with the line from the second. I think you still need to work out some of the logic around this.