awk find patterns (stored in a file) in a file - unix

I am learning awk and I am having a hard time trying to do this:
I have a file, let's name it pattern_file.txt, which contains multiple patterns, one per line. For example, it looks like this:
pattern_file.txt
PATTERN1
PATTERN2
PATTERN3
PATTERN4
I have a second file, containing some text. Let's name it text_file.txt. It looks like this:
text_file.txt
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
What I am trying to do is: if one of the patterns in pattern_file.txt is present in the current line read from text_file.txt, print that line.
I know how to print a line with awk; what gives me a hard time is using the patterns stored in pattern_file.txt and checking whether any of them is present in the line.

In awk using index
awk 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' pattern text
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
Store "patterns" to a hash and for each record use index to try to find the "patterns" from the record.

A variation of James Brown's helpful answer uses match(), which does a regex match and also returns the starting index of the matching string:
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print}' pattern_file.txt text_file.txt
which prints the required lines.
Printing the starting position returned by the match() function:
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print RSTART}' pattern_file.txt text_file.txt
gives an output as
9 # Meaning 'PATTERN1' match started at index 9 in 'xxxxxxxxPATTERN1xxxxxxx'
6
10
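match() also sets RLENGTH, so the matched text itself can be recovered with substr() (the same command, printing the match instead of the whole line):
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print substr($0, RSTART, RLENGTH)}' pattern_file.txt text_file.txt
which should print PATTERN1, PATTERN2 and PATTERN3, one per line.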

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: also adding the working command I've used, just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While you say the one you have works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines containing a case-insensitive "PALL", either at the start of the line or preceded by a run of characters beginning with a capital letter, where the line must not end in a backslash or a dot. So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So, in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with Perl-compatible REs), it does NOT provide you with any options for formatting the filename. Since you can't change the format of grep's output, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore the fact that these lines contain a filename and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to but not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
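Putting the two halves together, one possible pipeline (simplified here to a case-insensitive fixed-string match on "pall" in place of your full regex, with a second substitution added to drop the .txt that your desired output omits; untested against your data):
grep -i 'pall' -A 1 *.txt | sed 's/^[^0-9]*//; s/\.txt//'
Bear in mind that grep's -A 1 context lines are prefixed with file- rather than file:, and match groups are separated by -- lines, so the real output needs slightly more cleanup than this sketch suggests.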
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/[\.]$/ {next} # skip lines ending in backslash or dot
/^([A-ZÖÄÜÕŠŽ].*)?PALL/ { # lines to match
f=FILENAME
sub(/^[^0-9]*/,"",f) # strip unwanted part of filename, like sed
sub(/\.txt$/,"",f) # also drop the extension, per the desired output
printf "%s:%s\n", f, $0
getline # simulate the "-A 1" from grep
printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Unix redact data

I want to mask only the 2nd column of the data.
Input:
First_name,second_name,phone_number
ram,prakash,96174535
hari,pallavi,98888234
anurag,aakash,82783784
Expected output:
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
The sed program will do this just fine:
sed '2,$s/,[^,]*,/,*****,/'
The 2,$ only operates on lines 2 through to the end of the file (to leave the header line alone) and the substitute command s/,[^,]*,/,*****,/ will replace anything between the first and second comma with the mask *****.
Note that I've specifically used a fixed number of asterisks in the replacement string. Whether you're hiding passwords or anonymising data (as seems to be the case here), you don't want to leak any information, including the size of the names being replaced.
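For example, with the sample input saved in a file (call it input.csv; the name is just for illustration):
$ sed '2,$s/,[^,]*,/,*****,/' input.csv
First_name,second_name,phone_number
ram,*****,96174535
hari,*****,98888234
anurag,*****,82783784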
If you really want to use the same number of characters as in the original data, and you also want to cater for the possibility of replacing multiple fields, you can use something like:
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
This will also leave the first line untouched but will anonymise columns two and four (albeit with the information leakage previously mentioned):
echo 'First_name,second_name,phone_number,other
ram,prakash,96174535,abc
hari,pallavi,98888234,def
anurag,aakash,82783784,g
bob,santamaria,124,xyzzy' | awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
First_name,second_name,phone_number,other
ram,*******,96174535,***
hari,*******,98888234,***
anurag,******,82783784,*
bob,**********,124,*****
Doing multiple columns with full anonymising would entail using $2="*****" rather than the gsub (for both columns of course).
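That is, something along these lines (a sketch, not a tested drop-in):
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{$2="*****";$4="*****";print}'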
Another in awk. Using gsub to replace every char in $2 with an *:
$ awk 'BEGIN{FS=OFS=","}NR>1{gsub(/./,"*",$2)}1' file
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
Try the following too and let me know if this helps you.
awk -F"," 'NR>1{$2="*******"} 1' OFS=, Input_file

awk - Find duplicate entries in 2 columns, keep 1 duplicate and unique entries

I need to find duplicate entries across 2 different columns and keep only one of each duplicate pair, plus all unique entries. For me, if A123 is in the first column and it shows up later in the second column, it's a duplicate. I also know for sure that A123 will always be paired to B123, either as A123,B123 or B123,A123. I only need to keep one, and it doesn't matter which one it is.
Ex: My input file would contain:
A123,B123
A234,B234
C123,D123
B123,A123
B234,A234
I'd like the output to be:
A123,B123
A234,B234
C123,D123
The best I can do is to extract the unique entries with:
awk -F',' 'NR==FNR{x[$1]++;next}; !x[$2]' file1 file1
or get only the duplicates with
awk -F',' 'NR==FNR{x[$1]++;next}; x[$2]' file1 file1
Any help would be greatly appreciated.
This can be shorter!
First print the line if its second field is not yet present in the array; then add the first field to the array. Only one pass over the input file is necessary:
awk -F, '!x[$2];{x[$1]++}' file1
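The same one-liner, spelled out (identical behavior):
awk -F, '
!x[$2]       # print the line if field 2 has not been recorded yet
{ x[$1]++ }  # record field 1, so the reversed pair is skipped later
' file1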
This awk one-liner works for your example:
awk -F, '!($2 in a){a[$1]=$0}END{for(x in a)print a[x]}' file
The conventional, idiomatic awk solution:
$ awk -F, '!seen[$1>$2 ? $1 : $2]++' file
A123,B123
A234,B234
C123,D123
By convention we always use seen (rather than x or anything else) as the array name when it represents a set where you want to check if its index has been seen before, and using a ternary expression to produce the larger of the two possible key values as the index ensures the order they appear in the input doesn't matter.
The above doesn't care about your unique situation where every $2 is paired to a specific $1 - it simply prints unique individual occurrences across a pair of fields. If you wanted it to work on the pair of fields combined (and assuming you have more fields so just using $0 as the index wouldn't work) that'd be:
awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file

Field separator to be used only if not escaped, using awk

I have a question: suppose I am using "=" as the field separator. In this case, if my string contains, for example,
abc=def\=jkl
then if I use = as the field separator, it will split into 3 parts:
abc def\ jkl
but as I have escaped the 2nd "=", my output should be:
abc def\=jkl
Can anyone please suggest how I can achieve this?
Thanks in advance
I find it simplest to just convert the offending string to some other string or character that doesn't appear in your input records, then process as normal, converting back within each field when necessary. I tend to use RS if it's not a regexp*, since that cannot appear within a record, or the awk builtin SUBSEP otherwise (if that appears in your input you have other problems). E.g.:
$ cat file
abc=def\=jkl
$ awk -F= '{
gsub(/\\=/,RS)
for (i=1; i<=NF; i++) {
gsub(RS,"\\=",$i)
print i":"$i
}
}' file
1:abc
2:def\=jkl
* The issue with using RS if it is an RE (i.e. multiple characters) is that the gsub(RS...) within the loop could match a string that didn't get resolved to a record separator initially, e.g.
$ echo "aa" | gawk -v RS='a$' '{gsub(RS,"foo",$1); print "$1=<"$1">"}'
$1=<afoo>
When the RS is a single character, e.g. the default newline, that cannot happen so it's safe to use.
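For reference, the SUBSEP variant mentioned above looks almost identical; a sketch along the same lines (not tested beyond this example):
$ awk -F= '{
gsub(/\\=/,SUBSEP)        # hide each escaped = behind the SUBSEP character
for (i=1; i<=NF; i++) {
    gsub(SUBSEP,"\\=",$i) # restore the escaped = inside each field
    print i":"$i
}
}' file
1:abc
2:def\=jkl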
If it is like the example in your question, it can be done.
awk doesn't support look-around regexes, so it would be a bit difficult to get what you want by setting FS.
If I were you, I would do some preprocessing to make the data easier for awk to handle. Or you could read the line and use other awk functions, e.g. gensub(), to remove the = signs you don't want in the result, then split... But I guess you want to achieve the goal by playing with the field separator, so I won't give those solutions.
However, it can be done with gawk's FPAT variable:
awk -vFPAT='\\w*(\\\\=)?\\w*' '...' file
This will work for your example; I am not sure whether it will work for your real data.
Let's make an example, splitting this string: "abc=def\=jkl=foo\=bar=baz"
kent$ echo "abc=def\=jkl=foo\=bar=baz"|awk -vFPAT='\\w*(\\\\=)?\\w*' '{for(i=1;i<=NF;i++)print $i}'
abc
def\=jkl
foo\=bar
baz
I think you want that result, don't you?
My awk version:
kent$ awk --version|head -1
GNU Awk 4.0.2
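Unpacking that FPAT value: the single quotes hand \\w*(\\\\=)?\\w* to awk, and the -v assignment strips one more level of backslashes, so the regex gawk actually uses is \w*(\\=)?\w*, i.e.:
\w*      # word characters before an escaped =
(\\=)?   # optionally a literal backslash followed by =
\w*      # word characters after it
FPAT describes what a field looks like rather than what separates fields, so a bare = (with no preceding backslash) can never be part of a match and is treated as separator text.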

grep: how to show the next lines after the matched one until a blank line [not possible!]

I have a dictionary (not python dict) consisting of many text files like this:
##Berlin
-capital of Germany
-3.5 million inhabitants
##Earth
-planet
How can I show one entry of the dictionary with the facts?
Thank you!
You can't. grep doesn't have a way of showing a variable amount of context. You can use -A to show a set number of lines after the match, such as -A3 to show three lines after a match, but it can't be a variable number of lines.
You could write a quick Perl program to read from the file in "paragraph mode" and then print blocks that match a regular expression.
As Andy Lester pointed out, you can't have grep show a variable amount of context, but a short awk statement might do what you're hoping for.
If your example file were named file.dict:
awk -v term="earth" 'BEGIN{IGNORECASE=1}{if($0 ~ "##"term){loop=1} if($0 ~ /^$/){loop=0} if(loop == 1){print $0}}' *.dict
returns:
##Earth
-planet
Just change the variable term to the entry you're looking for.
This assumes two things:
dictionary files have the same extension (.dict for example purposes)
dictionary files are all in the same directory (where the command is called)
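Note that IGNORECASE is specific to GNU awk. A more portable variant of the same idea, assuming a lower-case search term, might be:
awk -v term="earth" 'tolower($0) ~ "##" term {loop=1} /^$/ {loop=0} loop' *.dict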
If your grep supports perl regular expressions, you can do it like this:
grep -iPzo '(?s)##Berlin.*?\n(\n|$)'
See this answer for more on this pattern.
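Reading the flags and the pattern apart:
# -i        case-insensitive match
# -P        Perl-compatible regular expressions
# -z        treat the whole input as one NUL-separated "line"
# -o        print only the matching part
# (?s)      let . match newlines too
# ##Berlin  the entry header
# .*?       as little as possible, up to...
# \n(\n|$)  ...a blank line or the end of the file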
You could also do it with GNU sed like this:
query=berlin
sed -n "/$query/I"'{ :a; $p; N; /\n$/!ba; p; }'
That is, when case-insensitive $query is found, print until an empty line is found (/\n$/) or the end of file ($p).
Output in both cases (minor difference in whitespace):
##Berlin
-capital of Germany
-3.5 million inhabitants
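For completeness, awk's paragraph mode offers a third route: with RS set to the empty string, blank-line-separated entries are read as single records (case-sensitive as written; a sketch, not part of the original answers):
$ awk -v RS= '/##Berlin/' file.dict
##Berlin
-capital of Germany
-3.5 million inhabitants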
