Does grep process line by line or entire file?

As I'm learning more about UNIX commands, I've started working with sed at work. sed is designed to read a file line by line and execute commands on each line individually.
How does grep process files? I've tried various ways of googling "does grep process line by line" and nothing really concrete shows up.

From Why GNU grep is fast:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
and then
Don't look for newlines in the input until after you've found a match.

EDIT:
I will correct myself: it is neither line by line nor the whole file; grep reads chunks of data into a buffer and searches those.
More details here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
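You can observe the chunked reads yourself on Linux by tracing grep's read() calls (a rough illustration; the exact buffer size varies by version and system):
strace -e trace=read grep 'pattern' bigfile.txt
The trace shows read() pulling in tens of kilobytes per call rather than one line at a time.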

The regular expression you pass to grep doesn't have any way of specifying newlines (although you can specify matches against the start or end of a line).
So it appears to work line by line, even though internally it may not treat line endings differently from other characters.
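A quick illustration of that line-oriented output:
printf 'alpha beta\ngamma\n' | grep beta
This prints the whole line alpha beta, not just the matching word: once grep finds a match, it scans backwards and forwards for the enclosing newlines so that it can print the line containing the match.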

Related

How can I tail -f but only in whole lines?

I have a constantly updating huge log file (MainLog).
I want to create another file which is only the last n lines of the log file BUT also updating.
If I use:
tail -f MainLog > RecentLog
I get ALMOST what I want, except that RecentLog is written to as MainLog grows, and at any given moment it may contain only part of MainLog's last line.
How can I specify to tail that I only want it to write when a WHOLE line is available?
By default, tail outputs whole lines unless you use the -c switch to count bytes. Something like
tail -n 20 -f MainLog > RecentLog
(substituting the number of lines you want the second file to start with for "20") should work as you want.
But if it doesn't, it is possible that using grep to line-buffer your output will fix this condition. See this question.
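For example (an untested sketch), you could force line-buffered output with GNU grep's --line-buffered option and an empty pattern, which matches every line:
tail -n 20 -f MainLog | grep --line-buffered '' > RecentLog
grep then flushes its output after each complete line, so RecentLog should only ever grow by whole lines.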
After many attempts, the only solution for multiple files that worked (fantastically well) for me is the fdlinecombine command. It's a small binary that reads multiple file descriptors and prints data to stdout linewise.
My use case is spawning multiple long-running ssh commands in the background and following their output, without having the lines garbled or interrupted in between.

How can I implement the command 'ls' with wildcard, '*'?

EDIT #1: I'm working under the constraint that all arguments are enclosed in quotes, so that the shell does not expand any argument containing * into the matching paths.
EDIT #2: In order to handle patterns such as */*, ../*, and dirA/*/file.out, should I use an iterative loop or a recursive call?
I have just learned about the function fnmatch(), but I don't know where to start.
There are many possible cases, and I'm confused about how to deal with them all.
For example, assume the executable program is a.out.
$./a.out -l */*
$./a.out -l ../*
$./a.out -l [file_name] [directory_name]
/* I also have to implement the ls command with no wildcard. */
What should I do? Any advice would be awesome.
Thank you in advance.
Your problem is: the shell replaces the wildcard character * with all of the filenames matching the pattern.
Solution:
If you do not want the shell to do this, just put quotation marks around your command line arguments.
Called that way, your program will receive the original arguments with the wildcards intact.
After this, you can list all the filenames with their paths, for example using some recursive algorithm, and then apply pattern matching to each path string as you visit it.
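A rough bash sketch of that idea (illustrative only; a real implementation in C would pair a recursive directory walk with fnmatch()):
#!/bin/bash
# Rough sketch: walk directory $1 recursively and print paths matching the
# glob pattern in $2. bash case patterns use fnmatch-style syntax, though
# an unquoted * here also matches '/', unlike fnmatch() with FNM_PATHNAME.
match_recursive() {
    local entry
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue    # empty directory: the glob stays literal
        case "$entry" in
            $2) printf '%s\n' "$entry" ;;
        esac
        if [ -d "$entry" ]; then
            match_recursive "$entry" "$2"
        fi
    done
}
match_recursive . '*/file.out'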
If you want to be a good unix citizen, the rule is Don't do filename globbing unless you are writing a shell.
You want to write an ls-like program? Don't do any wildcard expansion. Don't treat "*" specially. Just treat your argv as a list of filenames. If your program handles these cases:
./a.out file1
./a.out file1 file2 file3
Then it will also handle
./a.out file*
correctly because the shell will do the expansion and your program won't need to know about it. And besides that, it will handle this:
zsh% ./a.out **/file<40-185>~file<90-100>(.mm-30OL[1,2])
which in zsh expanded glob syntax means: expand file40 through file185, except for file90 through file100, include only the ones that have been modified in the last 30 minutes, and use only the largest 2 files in the resulting set.
fnmatch is never going to do anything like that. But these fancy globs can be used with any command that just takes a filename list and doesn't care where it came from.
When you're in a situation where you can't take a list of filenames from the command line, then consider using fnmatch. ls isn't one of those situations.
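To convince yourself that the shell does the expansion before your program ever runs, here is a minimal demo (assuming a throwaway script named args.sh):
#!/bin/sh
# args.sh (hypothetical name): print each argument on its own line,
# exactly as this program received it from the shell.
for arg in "$@"; do
    printf '%s\n' "$arg"
done
Running ./args.sh */* prints the expanded pathnames, while ./args.sh '*/*' prints the literal pattern, which is exactly what your a.out would see in its argv.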

Search for multiple patterns in a file not necessarily on the same line

I need a unix command that would search for multiple patterns (basically an AND), however those patterns need not be on the same line (otherwise I could use a grep AND expression). For example, suppose I have a file like the following:
This is first line.
This is second line.
This is last line.
If I search for words 'first' and 'last', above file should be included in the result.
Try this question, seems to be the same as yours with plenty of solutions: How to find patterns across multiple lines using grep?
I think instead of AND you actually mean OR:
grep 'first\|last' file.txt
Results:
This is first line.
This is last line.
If you have a large number of patterns, add them to a file; for example if patterns.txt contains:
first
last
Run:
grep -f patterns.txt file.txt
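If you really do need an AND across lines, as the question asks, one simple approach (a sketch; adjust the filenames to your case) is to run one grep per pattern and require both to succeed:
grep -q 'first' file.txt && grep -q 'last' file.txt && echo 'file.txt matches'
grep -q exits with status 0 only if the pattern occurs somewhere in the file, so the echo runs only when both words are present, regardless of which lines they are on.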

Remove lines which are between given patterns from a file (using Unix tools)

I have a text file (more correctly, a “German style” CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something other than 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
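Applied to the sample data above, one begin/end pair from the question would look something like this (untested; the dots are escaped so they match literally, and measurements.csv is a stand-in for your file name):
sed '/^28\.01\.2005 14:52:38/,/^01\.02\.2005 00:11:43/d' measurements.csv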
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1write >> tmp.txt | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line (-1) to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
You can also use awk. A range pattern selects the lines between (and including) the two patterns, so to delete them, skip the range and print everything else:
awk '/start/,/end/{next} 1' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools and where you would previously have used multiple tools piped together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)
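As a taste of that, a perl equivalent of the sed range delete above (a sketch using perl's flip-flop range operator):
perl -ne 'print unless /start_pat/../end_pat/' file
Like the sed version, it drops the boundary lines as well as everything between them.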
Use grep -L (print the names of files with no matching lines).
Sorry - I thought you just wanted the lines without 0,000 at the end.

simple shell script in cygwin

#!/bin/bash
echo 'first line' >foo.xml
echo 'second line' >>foo.xml
I am a total newbie to shell scripting.
I am trying to run the above script in cygwin. I want to be able to write one line after the other to a new file.
However, when I execute the above script, I see the following contents in foo.xml:
second line
The second time I run the script, I see in foo.xml:
second line
second line
and so on.
Also, I see the following error displayed at the command prompt after running the script:
: No such file or directory.xml
I will eventually be running this script on a unix box, I am just trying to develop it using cygwin. So I would appreciate it if you could point out if it is a cygwin oddity and if so, should I avoid trying to use cygwin for development of such scripts?
Thanks in advance.
Run dos2unix on your shell script. That will fix the problem.
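If you want to confirm that line endings are the culprit before converting, cat -A makes them visible (a rough illustration; CRLF line endings show up as ^M$):
cat -A yourscript.sh
If you see something like echo 'first line' >foo.xml^M$, the carriage return becomes part of the filename the shell writes to, which would explain both the missing first line and the scrambled error message.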
I had the same kind of problem as the original poster: A very simple script file was not working in Cygwin.
Thanks to Don Branson for the clue.
The fix for me was built into the text editor I'm using. (Most programmer's editors have a feature like this.) For example, in my case I'm using Notepad++, which has a menu item to convert the file line endings to Unix-style. From the menu: [Edit]->[EOL Conversion]->[Unix (LF)]
Then the script behaved as expected.
But there must be something else that is wrong here. When I try it, it works as expected.
> foo.xml writes the line to foo.xml, replacing any previous contents.
>> foo.xml appends to the file.
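A quick way to see the difference (using a hypothetical throwaway file t.txt):
echo 'one' > t.txt      # t.txt contains: one
echo 'two' >> t.txt     # t.txt contains: one, two
echo 'three' > t.txt    # t.txt contains only: three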
It sounds like you may have a typo somewhere. Also keep in mind that while the Windows command prompt can be forgiving about paths with embedded spaces, cygwin's shells will not be, so if you have a filename that contains embedded spaces, you need to either quote the filename or escape the spaces:
echo 'first line' > 'My File.txt'
echo 'first line' > My\ File.txt
The same goes for certain "special" characters including quotes, ampersand (&), semicolons (;) and generally most punctuation other than period/full-stop (.).
So if you are seeing these issues with the exact script you quoted (i.e. you copied and pasted it, so there is no possibility of transcription errors), then something truly strange may be happening that I can't explain. Otherwise, there may be a misplaced space or unquoted character somewhere.
I cannot reproduce your results. The script you quote looks correct, and indeed works as expected in my installation of Cygwin here, producing the file foo.xml containing the lines first line and second line; implying that what you are actually running differs from what you quoted in some way that is causing the problem.
The error message implies some sort of problem with the filename in the first echo line. Do you have some nonprintable characters in the script you are running? Have you missed escaping a space in the filename? Are you substituting shell variables and mistyping the name of the variable, or failing to escape the resulting string?
The above should work normally.
However, you can always use a heredoc:
#!/bin/bash
cat <<EOF > foo.xml
first line
second line
EOF
