Use egrep and sed with pattern list to return first instance of every pattern in a single target file - unix

I have a lengthy pattern list in a text file, one item per line. I'm using an older version of Solaris Unix, so I have to use egrep at the command line, as I have very limited scripting experience. The file I am searching through has many instances of each pattern. I want to return only the line for the first instance of each pattern.
$ cat patterns.txt
p1
p2
p3
$ cat target.txt
p1
p3
p1
p1
p3
p2
p3
p2
p1
The command to get the whole list of matches is
egrep -f patterns.txt target.txt
I have found many examples of how to return only the first line, or the first and the last line, for the patterns in the list. What I need is to return the first match of each pattern from patterns.txt in target.txt.
I have tried to adapt examples using awk and sed (below), but I am not very familiar with the commands or their usage, so I'm likely doing it wrong.
awk 'BEGIN { while(getline<"patterns.txt") M[$1]=1 }; { if(M[$1]==1) { print; M[$1]=2 } }' target.txt
egrep -f patterns.txt target.txt | sed -n '1p;$p'
The last one yielded the first pattern matched and the last pattern matched in the target.txt file. I think this is heading in the right direction, but I don't understand sed well enough to get the parameters right.

Based solely on OP's provided data, it looks like we can simply match on whole lines.
One awk idea:
awk '
FNR==NR {ptn[$0];next} # 1st file: store line in array ptn[]; skip to next input line
$0 in ptn {print; delete ptn[$0]} # 2nd file: if line is an index for the array then print line and delete array entry (so it will not match next time we see it)
' patterns.txt target.txt
# or as a one-liner sans comments:
awk 'FNR==NR {ptn[$0];next} $0 in ptn {print; delete ptn[$0]}' patterns.txt target.txt
This generates:
p1
p3
p2
Granted, we can't tell solely from this output which line we matched on, so for debugging purposes we'll add an explicit print to the mix to include the input line number:
$ awk 'FNR==NR {ptn[$0];next} $0 in ptn {print FNR,$0; delete ptn[$0]}' patterns.txt target.txt
1 p1
2 p3
6 p2
NOTE: While this seems to answer OP's question for the (limited) provided inputs, I'm guessing OP's real-world data may be more involved (e.g., the patterns could exist as a subset of a line; we do (not?) need to match on whole words; we do (not?) need to worry about case-sensitive matching; etc.). If OP's real requirement is more involved, I'd suggest trying to modify any answers received here (for this question and data), and if unsuccessful then ask a new question, making sure to provide a more realistic set of sample data.
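For instance, if the patterns can appear as substrings of a line rather than whole lines, one possible adaptation is the following sketch (untested against OP's real data; it treats each pattern as a fixed string rather than a regex, and on older Solaris you may need nawk or /usr/xpg4/bin/awk instead of awk):
awk '
FNR==NR { ptn[$0]; next }          # 1st file: store each pattern as a fixed string
{
    matched = 0
    for (p in ptn)                 # 2nd file: test every not-yet-matched pattern
        if (index($0, p)) {        # fixed-string substring match
            matched = 1
            delete ptn[p]          # report only the first matching line per pattern
        }
    if (matched) print             # print the line once even if several patterns match it
}' patterns.txt target.txt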

This might work for you (GNU sed):
sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' filePatterns | sed -f - fileTarget
Generate a sed script from the patterns file and apply the script to a second invocation of sed using the target file.
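For the sample patterns.txt above, the first sed invocation generates a script along these lines (one block per pattern); each block records its pattern in the hold space the first time it matches and deletes any later matching line:
/p1/{x;/p1/{x;d};s/^/\np1/;x;b}
/p2/{x;/p2/{x;d};s/^/\np2/;x;b}
/p3/{x;/p3/{x;d};s/^/\np3/;x;b}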

Related

Removing comments from a datafile. What are the differences?

Let's say that you would like to remove comments from a datafile using one of two methods:
cat file.dat | sed -e "s/\#.*//"
cat file.dat | grep -v "#"
How do these individual methods work, and what is the difference between them? Would it also be possible for a person to write the clean data to a new file, while preventing any possible warnings or error messages from ending up in that datafile? If so, how would you go about doing this?
How do these individual methods work, and what is the difference between them?
They give similar results here, though sed and grep are two different commands. Your sed command substitutes everything from the # to the end of the line with NULL (nothing), so only the comment portion of each line is removed. grep -v, on the other hand, simply skips (ignores) any line which has a # in it.
You can get more information on these from the man pages, as follows:
man grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
man sed:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
Would it also be possible for a person to write the clean data to a new file, while preventing any possible warnings or error messages from ending up in that datafile?
Yes, we could redirect the errors by using 2>/dev/null with both commands.
If so, how would you go about doing this?
You could try something like 2>/dev/null 1>output_file.
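For example (the output file name here is just illustrative):
sed -e "s/\#.*//" file.dat 2>/dev/null 1>clean_file.dat
grep -v "#" file.dat 2>/dev/null 1>clean_file.dat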
Explanation of the sed command: this is only for understanding purposes. Also, there is no need to use cat and then sed; you could use sed -e "s/\#.*//" Input_file instead.
sed -e " ##Initiating sed command here with adding the script to the commands to be executed
s/ ##using s for substitution of regexp following it.
\#.* ##telling sed to match a line if it has # till everything here.
//" ##If match found for above regexp then substitute it with NULL.
That grep -v will lose all the lines that have # on them, for example:
$ cat file
first
# second
thi # rd
so
$ grep -v "#" file
first
so
It drops every line that has a # on it (losing the "thi" from the third line), which is not favorable. Rather, you should:
$ grep -o "^[^#]*" file
first
thi
so
like that sed command does, but this way you won't get empty lines. From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
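For comparison, the sed substitution from the question, run on the same sample file, keeps every line but leaves an empty line where the pure comment line was:
$ sed -e "s/\#.*//" file
first

thi
so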

Sed line match plus line below

I can find my lines with this pattern, but in some cases the info is on the line after the match. How can I also get the line following my match line?
sed -n '/SQL3227W Record token/p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Firstly, this looks like a job for grep:
grep -A 1 'SQL3227W Record token' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
(-A 1 means to print an additional 1 line After the match).
Secondly, if you're using GNU sed, you can use a second address of +1 thus:
sed -n '/SQL3227W Record token/,+1p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Otherwise (if you really must use non-GNU sed), each time you match, append the following line to your pattern space and print both lines, then delete the first line before continuing the loop (in case the second line is also a match).
Untested code:
#!/bin/sed -nf
/SQL3227W Record token/{
N
p
D
}
sed is for simple substitutions on individual lines, that is all. For anything even slightly more interesting just use awk:
awk '/SQL3227W Record token/{c=2} c&&c--' file
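Here /SQL3227W Record token/{c=2} arms a counter on each match, and the bare condition c&&c-- stays true for the matching line and the one after it, so exactly those two lines are printed.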
See Printing with sed or awk a line following a matching pattern for other related idioms.

How to find a pattern using sed?

How can I combine multiple filters using sed?
Here's my data set
sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45
I want to identify all lines which contain MALE AND OXFORD. Here's my approach:
sed -n '/male/,/oxford/p' file
Thanks
You can associate a block with the first check and put the second in there. For example:
sed -n '/male/ { /oxford/ p; }' file
Or invert the check and action:
sed '/male/!d; /oxford/!d' file
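With the sample data, both of these also print the female,oxford,23 row, because "female" contains "male":
male,oxford,64
female,oxford,23
male,oxford,45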
As @Jotne points out, lines that contain female also contain male, and you probably don't want to match them, so the patterns should at least be amended to contain word boundaries:
sed -n '/\<male\>/ { /\<oxford\>/ p; }' file
sed '/\<male\>/!d; /\<oxford\>/!d' file
But since that looks like comma-separated data and the check is probably not meant to test whether someone went to male university, it would probably be best to use a stricter check with awk:
awk -F, '$1 == "male" && $2 == "oxford"' file
This checks not only if a line contains male and oxford but also if they are in the appropriate fields. The same can be achieved, somewhat less prettily, with sed by using
sed '/^male,oxford,/!d' file
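With the sample data, both the awk field check and this sed version print only the strictly matching rows:
male,oxford,64
male,oxford,45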
A single sed command can be used to solve this. Let's look at two variations of using sed:
$ sed -e 's/^\(male,oxford,.*\)$/\1/;t;d' file
male,oxford,64
male,oxford,45
$ sed -e 's/^male,oxford,\(.*\)$/\1/;t;d' file
64
45
Both have essentially the same regex:
^male,oxford,.*$
The interesting features are the capture group placement (either the whole line or just the age portion) and the use of ;t;d to discard non-matching lines.
By doing it this way, we can avoid the requirement of using awk or grep to solve this problem.
You can use awk
awk -F, '/\<male\>/ && /\<oxford\>/' file
male,oxford,64
male,oxford,45
It uses word anchors to prevent a hit on female.

Remove every x lines from text input

I'm looking to grep some log files with a few surrounding lines, but then discard the junk lines from the matches. To make matters worse, the stupid code outputs the same exception twice, so I want to junk every other grep match. I don't know that there's a good way to skip every other grep match while also including surrounding lines, so I'm happy to do it all in one step.
So let's say we have the following results from grep:
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
Obviously the grep match is "MatchExceptionText" (simplified, of course). So I'd like to pipe this to something where I can remove lines 2,5,6,7,8 and then repeat that pattern, so the results look like this:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
The repeating is where things get tricky for me. I know sed can just remove certain line numbers but I don't know how to group them into groups of 8 lines and repeat that cut in all the groups.
Any ideas? Thanks for your help.
awk can do modular arithmetic, so printing conditionally on the number of lines read mod 8 should allow you to repeat the pattern.
awk 'NR%8 ~ /[134]/' file
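Spelled out more explicitly, the same modular check could be written as (an equivalent sketch):
awk '{ r = NR % 8; if (r == 1 || r == 3 || r == 4) print }' file
Lines 1, 3 and 4 of every 8-line group leave remainders 1, 3 and 4, which is what the character class [134] tests for above.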
Sed can do it:
sed -n 'N;s/\n.*//;N;N;p;N;N;N;N' filename
EDIT:
Come to think of it, this is a little better:
sed -n 'p;n;n;N;p;n;n;n;n' filename
With GNU awk you can split the input at appropriate record separators and print the wanted output, e.g.:
awk 'NR%2 {print $1, $3}' RS='--\n' ORS='\n--\n' OFS='\n' infile
Output:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
This might work for you (GNU sed):
sed -n 'p;n;n;p;n;p;n;n;n;n' file
sed -n "s/^.*\n//;x;s/^/²/
/^²\{1\}$/ b print
/^²\{3\}$/ b print
/^²\{4\}$/ b print
/^²\{7\}$/ b print
/^²\{8\}$/ b print
b cycle
: print
x;
# your treatment begin
p
# your treatment stop
x
: cycle
/^²\{8\}$/ s/.*//
x
" YourFile
This is mainly for reference, as a kind of "case" statement on the relative line number; just change the number in /^²\{YourLineNumber\}$/ to select a different relative line position.
Don't forget the last line number, which resets the cycle.
The first part takes the line and prepares the relative line counter.
The second part is the "case" selection.
The third part is the treatment (here a print).
The last part resets the cycle counter when needed.

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that is indexed by every line of the file. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and inverts that false to true. AWK prints every line for which the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on.
AWK evaluates anything other than 0 and "" (empty string) as true. When a duplicate line comes along, seen[$0] is already non-zero, so !seen[$0] evaluates to false and the line is not written to the output.
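A quick illustration on throwaway sample input:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c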
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line that has no characters. On my Mac, my CPU just spins.
This is an infinite loop if the last line is blank and doesn't have any characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the content of the current pattern space is two identical strings separated by \n, then the next line is the same as the current line, so according to our core idea we must not print it yet; otherwise the current line is the last appearance of its run of duplicate consecutive lines, and we use the P command to print the characters in the pattern space up to and including the first \n.
D: we use the D command to delete the characters in the pattern space up to and including the first \n, after which the content of the pattern space is the next line.
The D command also forces sed to jump back to its first command, $!N, without reading the next line from the file or standard input stream.
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and the t command to implement a loop.
Explanation:
read a new line from the input stream or file and print it once.
use the :loop command to set a label named loop.
use N to read the next line into the pattern space.
use s/^(.*)\n\1$/\1/ to collapse the two lines into one when the next line is the same as the current line; we use the s command to do the delete action.
if the s command is executed successfully, then use the tloop command to force sed to jump back to the label named loop, which repeats the check on the following lines until there is no further duplicate of the line that was just printed; otherwise, use the D command to delete the most recently printed line from the pattern space and force sed to jump back to the first command, which is the p command. The content of the pattern space is then the next new line.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparisons, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work; it failed at the character set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
sed -E '
$!{
N;
s/[ \t]+$//;
/^(.*)\n\1$/!P;
D;
}
';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
sed -n -E '
s/[ \t]+$//;
G;
/^(\n){2,}/d;
/^([^\n]+).*\n\1(\n|$)/d;
h;
P;
';
}
lastrepeat() {
sed -n -E '
s/[ \t]+$//;
/^$/{
H;
d;
};
G;
# delete previous repeated line if found
s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
# after searching for previous repeat, move tested last line to end
s/^([^\n]+)(\n)(.*)/\3\2\1/;
$!{
h;
d;
};
# squeeze blank lines to one
s/(\n){3,}/\n\n/g;
s/^\n//;
p;
';
}
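A usage sketch, assuming the three functions above have been defined in the current shell (the file names are just placeholders):
# keep the first occurrence of each line, squeezing runs of blank lines
norepeat < input.txt | dedupe > output.txt
# keep only the last occurrence of each repeated line
lastrepeat < input.txt > output.txt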
This can be achieved using AWK piped to uniq.
The line below will display the unique values (note that uniq only collapses consecutive duplicates, so this helps when the duplicate lines are adjacent):
awk '{ print }' file_name | uniq
You can output these unique values to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will then contain the values without consecutive duplicates.
Use:
cat filename | sort | uniq -c | awk '$1 < 2 { sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print }'
It uses AWK on the uniq -c counts, but note that it keeps only the lines that occur exactly once (every copy of a duplicated line is dropped), and the output comes out sorted rather than in the original order.
