Grep but very literal - unix

This question might have been asked a million times before, but I didn't see my exact case.
Suppose a text file contains:
a
ab
bac
Now I want to grep on ‘a’ and have a hit only on the 1st line. After the ‘a’ there’s always a [tab] character.
Anyone any ideas?
Thanks!
Ronald

Try this:
head -1 *.txt | grep -P "a\t"
head prints the specified number of lines from each file (all .txt files in my example), and grep -P uses regular expressions as defined by Perl, where \t matches a tab.
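If the goal is to match an a that is immediately followed by a tab on any line, not just on the first line of each file, a variant that does not need -P might look like the sketch below. The file name sample.txt and the text after each tab are made up for illustration.

```shell
# Hypothetical sample data: each key is followed by a tab and some text.
tmp=$(mktemp -d)
printf 'a\tfirst\nab\tsecond\nbac\tthird\n' > "$tmp/sample.txt"

# Put a literal tab in a variable; POSIX grep does not define a \t escape.
tab=$(printf '\t')

# Anchor at line start so "bac" does not match, and require the tab
# right after "a" so "ab" does not match either.
hits=$(grep "^a$tab" "$tmp/sample.txt")
printf '%s\n' "$hits"

rm -rf "$tmp"
```

Only the line that starts with a followed by a tab is printed.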

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While you say the one you have works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that contain a case-insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a dot (with -P, the \. inside the brackets is just a literal dot). So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include a very limited capability for filtering its output (with -o, or with \K in GNU grep's optional Perl-compatible regular expressions), it does NOT provide any options for formatting the filename. Since you can't change how grep prints its output, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
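Note that both filters keep the .txt extension, so the result starts with 1999_03_02.txt: rather than 1999_03_02:. A sketch that also drops the extension, assuming the file names always end in a date made of digits and underscores followed by .txt:

```shell
# One line of grep output, as shown in the question.
line='aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda'

# Drop everything before the first digit, keep the digits and underscores
# of the date, and drop ".txt" in front of the first colon.
cleaned=$(printf '%s\n' "$line" | sed 's/^[^0-9]*\([0-9_]*\)\.txt:/\1:/')
printf '%s\n' "$cleaned"
```

In a real pipeline the sed command would simply be appended after the grep, in place of the printf used here to feed it one sample line.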
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
  /[\.]$/ {next}                      # skip lines ending in a dot
  /^([A-ZÖÄÜÕŠŽ].*)?PALL/ {           # lines to match
    f=FILENAME
    sub(/^[^0-9]*/,"",f)              # strip unwanted part of filename, like sed
    printf "%s:%s\n", f, $0
    if ((getline) > 0)                # simulate the "-A 1" from grep
      printf "%s:%s\n", f, $0
  }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

grep matches between two files and convert to lower case

I need a fast and efficient approach to the following problem (I am working with many files.) But for example:
I have two files: file2
Hello
Goodbye
Salut
Bonjour
and file1
Hello, is it Me you're looking for?
I would like to find any word in file 2 that exists in file 1, and then convert that word to lower case.
I can grep the words in a file by doing:
grep -f file2.txt file1.txt
and returns
Hello
So now I want to convert to
hello
so that the final output is
hello, is it Me you're looking for?
Where if I match multiple files:
grep -f file2.txt *_infile.txt
The output will be stored in respective separate outfiles.
I know I can convert to lower case using something like tr, but I only know how to do this on every instance of an uppercase letter. I only want to convert words common between two files from uppercase to lowercase.
Thanks.
I would solve the problem a bit differently.
First, I would mark matches in grep. --color=always works well, although it's somewhat cumbersome and potentially unreliable in detection. Then I would change marked matches with sed or perl:
grep --color=always -F -f file2.txt file1.txt | \
perl -p -e 's/\x1b.*?\[K(.*?)\x1b.*?\[K/\L$1/g'
The cryptic RE matches the coloring escape sequence before the match, de-coloring escape sequence right after the match and captures everything in between into group 1. Then it applies lowercase \L conversion to the capture. Likely GNU sed can do the same, but probably perl is more portable.

grep: how to show the next lines after the matched one until a blank line [not possible!]

I have a dictionary (not python dict) consisting of many text files like this:
##Berlin
-capital of Germany
-3.5 million inhabitants
##Earth
-planet
How can I show one entry of the dictionary with the facts?
Thank you!
You can't. grep doesn't have a way of showing a variable amount of context. You can use -A to show a set number of lines after the match, such as -A3 to show three lines after a match, but it can't be a variable number of lines.
You could write a quick Perl program to read from the file in "paragraph mode" and then print blocks that match a regular expression.
As Andy Lester pointed out, you can't have grep show a variable amount of context, but a short awk statement might do what you're hoping for (IGNORECASE is a gawk extension, so this needs GNU awk).
If your example file were named file.dict:
awk -v term="earth" 'BEGIN{IGNORECASE=1}{if($0 ~ "##"term){loop=1} if($0 ~ /^$/){loop=0} if(loop == 1){print $0}}' *.dict
returns:
##Earth
-planet
Just change the variable term to the entry you're looking for.
This assumes two things:
dictionary files have the same extension (.dict for example purposes)
dictionary files are all in the same directory (where the command is called)
If your grep supports perl regular expressions, you can do it like this:
grep -iPzo '(?s)##Berlin.*?\n(\n|$)'
See this answer for more on this pattern.
You could also do it with GNU sed like this:
query=berlin
sed -n "/$query/I"'{ :a; $p; N; /\n$/!ba; p; }'
That is, when case-insensitive $query is found, print until an empty line is found (/\n$/) or the end of file ($p).
Output in both cases (minor difference in whitespace):
##Berlin
-capital of Germany
-3.5 million inhabitants

Another grep advanced

Q1. I want to grep something like that:
grep -Ir --exclude-dir="some*dirs" "my-text" ~/somewhere
but I don't want to show the whole lines containing "my-text", I only want to see the list of files.
Q2. I want to see the list of files containing "my-text" but not containing "another-text". How do I do that?
Sorry, but I could not find the answer in man grep or on Google.
Q1. You mustn't have googled very hard on that one.
man grep
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match.
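Q2. Unless you expect both patterns to be on the same line, you'll need multiple invocations of grep. Note that -v with -l would list files that have at least one line without another-text, which is not the same thing; -L (--files-without-match) lists files with no matching line at all. Something like:
$ grep -rl my-text ~/somewhere | xargs grep -L another-text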
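As a sketch of the two-step approach on a throwaway directory (the file names are invented, and -Z, -L and xargs -0 -r are assumed to be available, as they are in GNU grep and GNU findutils):

```shell
tmp=$(mktemp -d)
printf 'my-text\n' > "$tmp/only mine.txt"
printf 'my-text\nanother-text\n' > "$tmp/both.txt"
printf 'another-text\n' > "$tmp/other.txt"

# Step 1: list files containing "my-text", NUL-terminated so that
#         names with spaces survive the pipe.
# Step 2: of those, list the files with no line matching "another-text".
found=$(grep -rlZ -- 'my-text' "$tmp" | xargs -0 -r grep -L -- 'another-text')
printf '%s\n' "$found"

rm -rf "$tmp"
```

Only the file that has my-text and lacks another-text is listed. The -r on xargs keeps the second grep from running (and waiting on stdin) when the first one finds nothing.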

List only certain files in a directory matching the word BOZO and ending with either '123' or '456'

I'm trying to figure out how to get a list of file names for a file named BOZO but ending with ONLY 123 OR 456.
Files are:
BOZO12389
BOZOand3
BOZOand456
BOZOand5
BOZOhello123
So the command should only display 'BOZOhello123' and 'BOZOand456'
I can't figure it out. I've tried all forms of LS and GREP that I can think of. The funny thing is, we tried to do it in class for about 10mins and no one could get it (including the instructor).
I did the following and it worked
ls BOZO*456 BOZO*123
Using shell's globs:
ls BOZO*{123,456}
Use regular expressions to help you. The command egrep (equivalent to grep -E) lets you use extended regular expressions.
You're searching for names that start with BOZO and end in 456 or 123.
A period . is a wildcard that matches any single character, and * repeats the preceding item 0 or more times. The | between 456 and 123 means OR, and the round brackets limit what the OR applies to. The trailing $ anchors the match to the end of the line; without it, BOZO12389 would match too.
Thus, you want BOZO, then any character repeated 0 or more times, followed by 123 or 456 at the end of the name.
Example (data here stands for a file containing the list of names):
egrep '^BOZO.*(456|123)$' data
Thank you to Nathan Fellman for the help and edits.
You could also use the find command:
find . \( -name "BOZO*123" -o -name "BOZO*456" \)
$ ll grep *BOZO* should work too!
This shouldn't be that hard. The most naive way is to ls the directory and then grep for only what you want:
$ ls *BOZO* | grep -e '123$' -e '456$'
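The same filter can also be done without ls or grep at all. A sketch in plain POSIX shell, recreating the example files in a scratch directory:

```shell
tmp=$(mktemp -d)
touch "$tmp/BOZO12389" "$tmp/BOZOand3" "$tmp/BOZOand456" \
      "$tmp/BOZOand5" "$tmp/BOZOhello123"

# Loop over names starting with BOZO and keep those ending in 123 or 456.
matches=$(
    cd "$tmp" || exit 1
    for f in BOZO*; do
        case $f in
            *123|*456) printf '%s\n' "$f" ;;
        esac
    done
)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

The case patterns are shell globs, not regular expressions, so *123 already means "ends in 123" and no $ anchor is needed.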

Resources