Match line that contains two strings from another file - unix

I have a file source.txt containing two columns of strings separated by a whitespace.
foo bar
foo baz
goo gaa
Also, there is another file pattern.txt which is a list of strings (1 per line) that should serve as pattern source. This could look like
foo
bar
goo
The goal is to extract only lines that contain two strings from the pattern file.
Repetitions are fine (e.g. foo foo would be valid).
So the desired output here would be
foo bar
I managed to extract lines that contain at least one term from the pattern file with grep:
grep -wFf pattern.txt source.txt
The command above would return all lines from source.txt since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (which are shown in related questions considering only two search terms) have not worked out.
grep is not mandatory; awk, sed, or perl would work as well. I have a solution in Python, but it is terribly slow (not blazingly fast).
Thank you!
Response to Answers
My Python solution looks like this:
import sys

f_pattern = sys.argv[1]
f_source = sys.argv[2]

with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))

with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string
Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python
time python matcher.py pattern.txt source.txt
>> 158,12s user 1,82s system 99% cpu 2:40,08 total
awk by @Avinash Chandravansi
time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total
Not quite sure yet, but I think that gives an incorrect result.
awk by @KamilCuk
time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes; aborted with Ctrl+C
awk by @Fravadona
time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total
^-- This seems to be the accepted answer (for me).

You're using grep -F, so I guess that the "patterns" aren't regexps. Now, if you're looking to match the full strings (and not substrings), then you can do:
awk '
FNR == NR { patterns[$0]; next }
($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt
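With the sample pattern.txt and source.txt shown in the question, this should print only the one qualifying line:
$ awk 'FNR==NR {patterns[$0]; next} ($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
foo bar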

With awk, store the patterns in an array and then check whether at least two of them match.
$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar

This might work for you (GNU sed):
sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file
Create a sed script from the patternFile and apply it to source file.
If the same alternation regexp matches twice in the line, print it; otherwise delete the line.
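For illustration, with the sample pattern.txt from the question the first sed invocation should emit a one-line script along these lines, which the second sed then applies to the source file:
/(foo|bar|goo).*(foo|bar|goo)/p;d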

Related

How can I identify lines from a delimited file, based on a lookup file in unix

Assume that there are two files
File1 - lookup.txt
CAN
USD
INR
EUR
Another file Input.txt
1~Canada~CAN
2~United States of America~USD
3~Brazil~BRL
Both files may be very large, hypothetically several thousand records. Now I'm trying to identify the records in Input.txt based on the values in the lookup file.
The expected output should be
1~Canada~CAN
2~United States of America~USD
I tried to do something like below
#!/bin/sh
lookupFile=$1   #lookup.txt
inputFile=$2    #input.txt
outputFile=$3   #output.txt

while IFS= read -r line
do
    awk -F'~' '{if ($3==$line) print >> $outputFile}' $inputFile
done < "$lookupFile"
But I'm getting error like
awk: cmd. line:1: (FILENAME=input.txt FNR=2) fatal: can't redirect to
How can I fix this issue? Also, if the files are really huge, with several thousand records to search, is this an efficient way to do it?
With your shown samples, please try the following awk code. We can do this in a single awk; we just need to take care of setting the field separator to ~ before input.txt.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' lookup.txt FS="~" input.txt
Explanation:
awk '               ##starting awk program from here.
FNR==NR{            ##Checking condition which will be TRUE when lookup.txt is being read.
  arr[$0]           ##Creating array arr with $0 as index.
  next              ##next to skip all further statements from here.
}
($3 in arr)         ##If $3 is present in arr then print that line.
' lookup.txt FS="~" input.txt   ##Mentioning Input_files and setting FS to ~ before input.txt
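With the sample lookup.txt and Input.txt from the question, this should print the expected output:
$ awk 'FNR==NR{arr[$0];next} ($3 in arr)' lookup.txt FS="~" input.txt
1~Canada~CAN
2~United States of America~USD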
A non-awk solution that you could compare with from a performance point of view:
$ grep -wFf lookup.txt input.txt
1~Canada~CAN
2~United States of America~USD
Warning: this does not match only on the last field. So if some values in lookup.txt can also be found elsewhere in input.txt, prefer another solution. Or, if lookup.txt contains nothing that could be interpreted as a regular expression operator, preprocess it before grep. Example with bash, sed and grep:
$ grep -f <( sed 's/.*/~&$/' lookup.txt ) input.txt
1~Canada~CAN
2~United States of America~USD
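For illustration, the process substitution feeds grep an anchored pattern list that should look like this:
~CAN$
~USD$
~INR$
~EUR$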

Use egrep and sed with pattern list to return first instance of every pattern in a single target file

I have a lengthy pattern list in a text file, one item per line. I'm using an older version of Solaris Unix, so I have to use egrep at the command line, as I have very limited scripting experience. The file I am searching through has many instances of each pattern; I want to return only the line from the first instance of each pattern.
$ cat patterns.txt
p1
p2
p3
$ cat target.txt
p1
p3
p1
p1
p3
p2
p3
p2
p1
The command to get the whole list of matches is
egrep -f patterns.txt target.txt
I have found many examples of how to return only the first line, or the first and the last line, for the patterns in the list. What I need is to return the first match of each pattern from patterns.txt in target.txt.
I have tried to adapt examples using awk and sed (below), but I am not very familiar with the commands or their usage, so I'm likely doing it wrong.
awk 'BEGIN { while(getline<"patterns.txt") M[$1]=1 }; { if(M[$1]==1) { print; M[$1]=2 } }' target.txt
egrep -f patterns.txt target.txt | sed -n '1p;$p'
The last one yielded the first pattern matched and the last pattern matched in the target.txt file. I think this is heading in the right direction, but I don't understand sed well enough to get the parameters right.
Based solely on OP's provided data it looks like we can merely match on whole lines.
One awk idea:
awk '
FNR==NR {ptn[$0];next} # 1st file: store line in array ptn[]; skip to next input line
$0 in ptn {print; delete ptn[$0]} # 2nd file: if line is an index for the array then print line and delete array entry (so it will not match next time we see it)
' patterns.txt target.txt
# or as a one-liner sans comments:
awk 'FNR==NR {ptn[$0];next} $0 in ptn {print; delete ptn[$0]}' patterns.txt target.txt
This generates:
p1
p3
p2
Granted, we can't tell solely from this output which line we matched on so for debug purposes we'll add an explicit print to the mix to include the input line number:
$ awk 'FNR==NR {ptn[$0];next} $0 in ptn {print FNR,$0; delete ptn[$0]}' patterns.txt target.txt
1 p1
2 p3
6 p2
NOTE: while this seems to answer OP's question for the (limited) provided inputs, I'm guessing OP's real-world data may be more involved (e.g., the patterns could exist as a subset of a line; we may or may not need to match on whole words; we may or may not need to worry about case-sensitive matching; etc.). If OP's real requirement is more involved, I'd suggest trying to modify any answers received here (for this question and data) and, if unsuccessful, asking a new question, making sure to provide a more realistic set of sample data.
This might work for you (GNU sed):
sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' filePatterns | sed -f - fileTarget
Generate a sed script from the patterns file and apply the script to a second invocation of sed using the target file.
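For illustration, with the sample patterns.txt from the question the generated script should look roughly like the following; each block checks the hold space to see whether its pattern has already been seen, deletes the line if so, and otherwise records the pattern and lets the line print:
/p1/{x;/p1/{x;d};s/^/\np1/;x;b}
/p2/{x;/p2/{x;d};s/^/\np2/;x;b}
/p3/{x;/p3/{x;d};s/^/\np3/;x;b}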

Join lines depending on the line beginning

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with a space, is empty, or starts with a nonnumeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like to join the split lines back onto the previous line (as shown below):
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed/awk. I'm not clear on how to join a line with the preceding one.
Any suggestion?
awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
Only print a newline when the current line starts with a digit; otherwise append the rows (you may want to add a space to ORS if the line break didn't preserve the space).
Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
It checks whether each line starts with a digit; if it does and the line number is greater than 1, it prints a newline before the line, otherwise it simply prints the line as-is. It also prints a newline after reading the whole file; without the END block, no newline would be added at the end of the output.
If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile
If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to write:
perl -0777 -ape 's/\n / /g' input
You can also use sed, provided the file contains no \r characters:
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'

Getting highest extensions value in unix script

I need to create new files with extensions like file.1, file.2, file.3, and then check which numbered files exist and create file.(n+1), where n is the number of the highest existing file. I was trying to get the extensions using basename, but it doesn't handle multiple files:
file=`basename $file.*`
ext=${file##*.}
It only works when I input the whole file name, like $file.3.
If the filenames are guaranteed not to have newline characters in them, you can, for example, use standard unix text processing tools:
printf '%s\n' file.* | #full list
sed 's/.*\.//' | #extensions
grep '^[0-9][0-9]*$' | #numerical extensions
awk '{ if($0>m) m=$0} END{ print m }' #get maximum
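For example, in a directory containing only file.1, file.2 and file.3, the pipeline should print the highest numeric extension:
$ printf '%s\n' file.* | sed 's/.*\.//' | grep '^[0-9][0-9]*$' | awk '{ if($0>m) m=$0} END{ print m }'
3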
Here's my take on this.
You can do this entirely in standard awk.
$ awk '{ext=FILENAME;sub(/.*\./,"",ext)} ext>n&&ext~/^[0-9]+$/{n=ext}{nextfile} END {print n}' *.*
Broken out for easier reading:
$ awk '
    {
        # Capture the extension...
        ext=FILENAME
        sub(/.*\./,"",ext)
    }
    # Then, if we have a numeric extension that is bigger than "n"...
    ext > n && ext ~ /^[0-9]+$/ {
        # let "n" be that extension.
        n=ext
    }
    {
        # We aren't actually interested in the contents of this file, so move on.
        nextfile
    }
    # No more files? Print our result.
    END {print n}
' *.*
The idea here is that we'll step through the list of filenames and let awk do ALL the processing to capture and "sort" the extensions. (We're not really sorting, we're just recording the highest number as we pass through the files.)
There are a few provisos with this solution:
This only works if all the files have a non-zero length. Technically awk conditions are being compared on "lines of the file", so if there are no lines, awk will pass right by that file.
You don't really need to use the ext variable, you can modify FILENAME directly. I included it for improved readability.
The nextfile command is fairly standard, but not universal. If you have a very old machine, or are running an esoteric variety of unix, nextfile may not be included. (I don't expect this to be a problem.)
Another alternative, which might be easier for you, would be to implement the same logic directly in POSIX shell:
$ n=0; for f in *.*; do ext=${f##*.}; if expr "$ext" : '[0-9][0-9]*$' >/dev/null && [ "$ext" -gt "$n" ]; then n="$ext"; fi; done; echo "$n"
Or, again broken out for easier reading (or scripting):
n=0
for f in *.*; do
    ext=${f##*.}
    if expr "$ext" : '[0-9][0-9]*$' >/dev/null && [ "$ext" -gt "$n" ]; then
        n="$ext"
    fi
done
echo "$n"
This steps through all files using a for loop, captures the extension, makes sure it's numeric, determines whether it's greater than "n", records it if it is, and then prints the result.
It requires no pipes and no external tools except expr, which is a POSIX.1 tool available on every system.
One proviso for this solution is that if you have NO files with extensions (i.e. *.* returns no files), this script will erroneously report that the highest numbered extension is 0. You can of course handle that easily enough, but I thought I should mention it.
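For example, a minimal sketch (assuming POSIX sh and the file.N naming from the question) that guards against an empty glob and prints the name of the next file to create might look like this:
n=0
for f in file.*; do
    [ -e "$f" ] || continue          # glob matched nothing; skip the literal pattern
    ext=${f##*.}
    case $ext in
        ''|*[!0-9]*) continue ;;     # ignore non-numeric extensions
    esac
    [ "$ext" -gt "$n" ] && n=$ext
done
echo "file.$((n + 1))"               # prints file.1 when no numbered files exist yet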
Thanks for all the answers. I've come up with a quite similar and slightly simpler idea which I'd like to present:
n=0
for i in file.*; do
    # read the extension
    ext=${i##*.}
    if [ "$ext" -gt "$n" ]; then
        # increase n
        n=$((n+1))
    fi
done
Then, if we want to get the number exceeding n by one:
until [[ $a -gt "$n" ]]; do
    a=$((a+1))
done
Finally, a is one bigger than the highest file extension. So if there are three files, file.1 file.2 file.3, the returned value will be 4.

removing duplicate lines from file /grep

I want to remove all lines where the value in the second column (05408736032) is the same:
0009300|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|asdf|
0009367|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|adff|
These lines are not consecutive. It's fine to remove all of the lines; I don't have to keep one of them around.
Sorry, my unix fu is really weak from non-usage :).
If all your input data is formatted as above - i.e. fixed-size fields - and the order of the lines in the output doesn't matter, sort --key=8,19 --unique should do the trick. If the order does matter, but duplicate lines are always consecutive, uniq -s 8 -w 11 will work. If the fields are not fixed-width but duplicate lines are always consecutive, Pax's awk script will work. In the most general case we're probably looking at something slightly too complicated for a one-liner though.
Assuming that they're consecutive and you want to remove subsequent ones, the following awk script will do it:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'
It works by printing the first line and storing the second column. Then for subsequent lines, it skips ones where the stored value and second column are the same (if different, it prints the line and updates the stored value).
If they're not consecutive, I'd opt for a Perl solution where you maintain an associative array to detect and remove duplicates - I'd code it up but my 3yo daughter has just woken up , it's midnight and she has a cold - see you all tomorrow, if I survive the night :-)
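For what it's worth, a rough sketch of that Perl approach (assuming the second |-separated column is the key and that keeping the first occurrence is acceptable) could be as short as:
perl -F'\|' -lane 'print unless $seen{$F[1]}++' input.txt
If every duplicated line should go, including the first occurrence, a two-pass approach such as the awk answer further down is closer to the mark.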
This is code that can be used for removing duplicate words in a line:
awk '{for (i=1; i<=NF; i++) {x=0; for(j=i-1; j>=1; j--) {if ($i == $j){x=1} } if( x != 1){printf ("%s ", $i) }}print ""}' sent
If the columns are not fixed width, you can still use sort:
sort -t '|' --key=10,10 -g FILENAME
The -t flag will set the separator.
The -g is just for natural numeric ordering.
Unix includes python, so the following few-liners may be just what you need:
f = open('input.txt', 'rt')
d = {}
for s in f.readlines():
    l = s.split('|')
    if l[1] not in d:  # the second column is index 1
        print s
        d[l[1]] = True
This will work without requiring fixed-length fields, and even if identical values are not neighbours.
This awk will print only those lines where the second column is not 05408736032:
awk -F'|' '{if ($2 != "05408736032") print}' filename
Takes two passes over the input file: 1) find the duplicate values, 2) remove them
awk -F\| '
{count[$2]++}
END {for (x in count) {if (count[x] > 1) {print x}}}
' input.txt >input.txt.dups
awk -F\| '
NR==FNR {dup[$1]++; next}
!($2 in dup) {print}
' input.txt.dups input.txt
If you use bash, you can omit the temp file: combine into one line using process substitution: (deep breath)
awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt
(phew!)
awk -F"|" '!_[$2]++' file
Put the lines in a hash, using the line as key and value, then iterate over the hash (this should work in almost any programming language: awk, perl, etc.).
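For example, a loose awk sketch of that idea (assumption: key on the second |-separated column and emit only values that occur exactly once, so every duplicated line is dropped; output order is not preserved):
awk -F'|' '{count[$2]++; line[$2]=$0} END{for (k in count) if (count[k]==1) print line[k]}' input.txt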
