removing duplicate lines from file / grep - unix

I want to remove all lines where the value in the second column (05408736032 here) is the same:
0009300|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|asdf|
0009367|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|adff|
These lines are not consecutive. It's fine to remove all of them; I don't have to keep one of them around.
Sorry, my Unix fu is really weak from non-usage :).

If all your input data is formatted as above - i.e. fixed-size fields - and the order of the lines in the output doesn't matter, sort --key=8,19 --unique should do the trick. If the order does matter, but duplicate lines are always consecutive, uniq -s 8 -w 11 will work. If the fields are not fixed-width but duplicate lines are always consecutive, Pax's awk script will work. In the most general case we're probably looking at something slightly too complicated for a one-liner though.
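For instance, a rough sketch for the case where every duplicated key should go (including the first copy), assuming the widths from the sample (a 7-character first field plus the | delimiter, then an 11-digit key) and a placeholder file name:
sort -t '|' -k2,2 file.txt | uniq -s 8 -w 11 -u
sort groups the lines by the second field; uniq then skips the first 8 characters, compares the next 11, and with -u keeps only lines whose key occurs exactly once.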

Assuming that they're consecutive and you want to remove subsequent ones, the following awk script will do it:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'
It works by printing the first line and storing the second column. Then for subsequent lines, it skips ones where the stored value and second column are the same (if different, it prints the line and updates the stored value).
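Run against the sample input above (file.txt is just a placeholder name), only the first record survives, since the second carries the same value in column 2:
$ awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}' file.txt
0009300|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|asdf|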
If they're not consecutive, I'd opt for a Perl solution where you maintain an associative array to detect and remove duplicates - I'd code it up but my 3yo daughter has just woken up, it's midnight and she has a cold - see you all tomorrow, if I survive the night :-)

This is the code used for removing duplicate words within a line:
awk '{for (i=1; i<=NF; i++) {x=0; for(j=i-1; j>=1; j--) {if ($i == $j){x=1} } if( x != 1){printf ("%s ", $i) }}print ""}' sent
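For example, with a hypothetical file sent containing a single line of words, every repetition after the first occurrence is dropped:
$ cat sent
the quick the lazy quick dog
$ awk '{for (i=1; i<=NF; i++) {x=0; for(j=i-1; j>=1; j--) {if ($i == $j){x=1} } if( x != 1){printf ("%s ", $i) }}print ""}' sent
the quick lazy dog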

If the columns are not fixed width, you can still use sort:
sort -t '|' --key=10,10 -g FILENAME
The -t flag will set the separator.
The -g is just for natural numeric ordering.
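If the goal is deduplication on the second column rather than just ordering by it, a variant (a sketch only, and one that keeps one line per key rather than dropping all of them) would be to key on field 2 and add --unique:
sort -t '|' -k2,2 -u FILENAME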

Unix includes python, so the following few-liners may be just what you need:
import sys

f = open('input.txt', 'rt')
d = {}
for s in f:
    l = s.split('|')
    if l[1] not in d:        # second column
        sys.stdout.write(s)  # s already ends with a newline
        d[l[1]] = True
This will work without requiring fixed-length, and even if identical values are not neighbours.

This awk will print only those lines where the second column is not 05408736032:
awk -F'|' '{if ($2 != "05408736032") print}' filename
Here -F'|' sets the field separator, and the value is quoted so the comparison is done on strings (a numeric comparison would drop the leading zero).

Takes two passes over the input file: 1) find the duplicate values, 2) remove them
awk -F\| '
{count[$2]++}
END {for (x in count) {if (count[x] > 1) {print x}}}
' input.txt >input.txt.dups
awk -F\| '
NR==FNR {dup[$1]++; next}
!($2 in dup) {print}
' input.txt.dups input.txt
If you use bash, you can omit the temp file: combine into one line using process substitution: (deep breath)
awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt
(phew!)
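On the question's sample data (saved as input.txt), the first pass would write the single duplicated key, 05408736032, to input.txt.dups, and the second pass would then print nothing, since both input lines carry that key - which matches the requirement of removing every copy.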

awk -F"|" '!_[$2]++' file

Put the lines in a hash, using line as key and value, then iterate over the hash (this should work in almost any programming language, awk, perl, etc.)
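A minimal awk sketch of that idea, keyed on the whole line so that only exact duplicate lines are suppressed (the first copy is kept):
awk '!seen[$0]++' file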

Related

Match line that contains two strings from another file

I have a file source.txt containing two columns of strings separated by a whitespace.
foo bar
foo baz
goo gaa
Also, there is another file pattern.txt which is a list of strings (1 per line) that should serve as pattern source. This could look like
foo
bar
goo
The goal is to extract only lines, that contain two strings from the pattern file.
Repetitions are fine (e.g. foo foo would be valid).
So the desired output here would be
foo bar
I managed to extract lines that contain at least one term from the pattern file with grep:
grep -wFf pattern.txt source.txt
The command above would return all lines from source.txt since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (which are shown in related questions considering only two search terms) have not worked out.
grep is not mandatory. awk, sed, perl work as well. I have a solution in Python, but it is terribly slow (not blazingly fast).
Thank you!
Response to Answers
My Python solution looks like this:
import sys

f_pattern = sys.argv[1]
f_source = sys.argv[2]

with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))

with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string
Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python
time python matcher.py pattern.txt source.txt
>> 158,12s user 1,82s system 99% cpu 2:40,08 total
awk by #Avinash Chandravansi
time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total
Not quite sure yet, but I think that gives an incorrect result.
awk by #KamilCuk
time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes. Ctrl+C
awk by #Fravadona
time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total
^-- This seems to be the accepted answer (for me).
You're using grep -F so I guess that the "patterns" aren't regexps. Now, if you're looking to match the full strings (and not substrings), then you can do:
awk '
FNR == NR { patterns[$0]; next }
($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt
With awk, store the patterns in array and then check if at least two match.
$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar
This might work for you (GNU sed):
sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file
Create a sed script from patternFile and apply it to the source file.
The generated script uses the same alternation regexp twice in one match: if a line matches, it is printed; otherwise it is deleted.
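For the sample patternFile above (foo, bar, goo), the first sed builds the script
/(foo|bar|goo).*(foo|bar|goo)/p;d
which the second sed then runs over the source file, printing only lines containing two (substring) matches - here, foo bar.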

How can I identify lines from a delimited file, based on a lookup file in unix

Assume that there are two files
File1 - lookup.txt
CAN
USD
INR
EUR
Another file Input.txt
1~Canada~CAN
2~United States of America~USD
3~Brazil~BRL
Both files may be very large, hypothetically several thousand records. Now I'm trying to identify the records in Input.txt based on the values in the lookup file.
The expected output should be
1~Canada~CAN
2~United States of America~USD
I tried to do something like below
#!/bin/sh
lookupFile=$1 #lookup.txt
inputFile=$2 #input.txt
outputFile=$3 #output.txt
while IFS= read -r line
do
awk -F'~' '{if ($3==$line) print >> $outputFile}' $inputFile
done < "$lookupFile"
But I'm getting an error like:
awk: cmd. line:1: (FILENAME=input.txt FNR=2) fatal: can't redirect to
How can I fix this issue? Also, if the files are really huge, with several thousand records to search, is this an efficient way?
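The error is most likely because $line and $outputFile sit inside the single-quoted awk program, so the shell never expands them; awk treats them as its own (empty) variables and the redirection target ends up being something unintended. A minimal sketch of a corrected loop (keeping the question's variable names) passes the shell value in with -v:
while IFS= read -r line
do
    awk -F'~' -v key="$line" '$3 == key' "$inputFile" >> "$outputFile"
done < "$lookupFile"
This still runs one awk per lookup line, so for large files the single-pass answers below are the efficient way to go.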
With your shown samples, please try the following awk code. We can do this in a single awk; we just need to take care to set the field separator to ~ before input.txt is read.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' lookup.txt FS="~" input.txt
Explanation:
awk ' ##starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when lookup.txt is being read.
arr[$0] ##Creating array arr with $0 as index.
next ##next to skip all further statements from here.
}
($3 in arr) ##If $3 is present in arr then print that line.
' lookup.txt FS="~" input.txt ##Mentioning Input_files and setting FS to ~ before input.txt
A non-awk solution that you could compare with on the performance point of view:
$ grep -wFf lookup.txt input.txt
1~Canada~CAN
2~United States of America~USD
Warning: this does not match only on the last field. So if some values in lookup.txt can also be found elsewhere in input.txt, prefer another solution. Or, if lookup.txt contains nothing that could be interpreted as a regular expression operator, preprocess it before grep. Example with bash, sed and grep:
$ grep -f <( sed 's/.*/~&$/' lookup.txt ) input.txt
1~Canada~CAN
2~United States of America~USD

Expressing tail through a variable

So I have a chunk of formatted text; I basically need to use awk to get certain columns out of it. The first thing I did was get rid of the first 10 lines (the header information, irrelevant to the info I need).
Next I got the tail by taking the total lines in the file minus 10.
Here's some code:
import=$HOME/$1
if [ -f "$import" ]
then
    # file exists
    echo "File Exists."
    totalLines=`wc -l < $import`
    linesMinus=`expr $totalLines - 10`
    tail -n $linesMinus $import
    headless=`tail -n $linesMinus $import`
else
    # file does not exist
    echo "File does not exist."
fi
Now I need to save this tail into a variable (or maybe even separate file) so I can access the columns.
The problem comes here:
headless=`tail -n $linesMinus $import`
When I save the output into this variable and then try to echo it back out, it's all unformatted and I can't distinguish columns to use awk on.
How can I save the tail of this file without compromising the formatting?
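One likely cause, offered as a sketch rather than something tested against your data: echoing the variable unquoted (echo $headless) lets the shell re-split the text on whitespace, collapsing the newlines into single spaces. Quoting the expansion preserves the layout, so the columns stay usable for awk:
headless=$(tail -n "$linesMinus" "$import")
echo "$headless" | awk '{print $2}'   # or whatever column extraction you need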
Just use Awk. It can do everything you need all at once and all in one program.
E.g. to skip the first 10 lines, then print the second, third, and fourth columns separated by spaces for all remaining lines from INPUT_FILE:
awk 'NR <= 10 {next;}
{print $2 " " $3 " " $4;}' INPUT_FILE
Figured it out, I kind of answered my own question when I asked it. All I did was redirect the tail command to a file in the home directory and I can cat that file. Gotta remember to delete it at the end though!

Combining two awk commands in single command

I want to combine these two commands and invoke them as a single command.
In the first command I am storing the 4th column of x.CSV (separator: ,) in the z.csv file.
awk -F, '{print $4}' x.CSV > z.csv
In the second command, I want to find the unique first-column values of z.csv (separator: space).
awk -F\ '{print $1}' z.csv|sort|uniq
How can I combine these two commands into a single command?
Pipe the output of the first awk to the second awk:
awk -F, '{print $4}' x.CSV | awk -F\ '{print $1}' |sort|uniq
or, as Avinash Raj suggested,
awk -F, '{print $4}' x.CSV | awk -F\ '{print $1}' | sort -u
Assuming that the content of z.csv is actually wanted, rather than just an artefact of the way you're currently implementing your program, then you can use:
awk -F, '{ print $4 > "z.csv"
split($4, f, " ")
f4[f[1]] = 1
}
END { for (i in f4) print i }' x.CSV
The split function breaks field 4 on spaces, and (associative) array f4 records the key value. The loop at the end prints out the distinct values, unsorted. If you need them sorted, you can either use GNU awk's built-in sort functions or (if you don't have an awk with built-in sort functions) write your own in awk, or pipe the output to sort.
With GNU awk, you can replace the END block with:
END { asorti(f4); for (i in f4) print f4[i] }
If you don't want the z.csv file, then (a) you could have used a pipe in the first place, and (b) you can simply remove the print $4 > "z.csv" line.
awk '{split($4,b," "); a[b[1]]=1} END { for( i in a) print i }' FS=, x.CSV
This does not sort the data, but it's not clear if you actually want it sorted or merely needed that to get unique entries. If you do want it sorted, pipe it to sort.
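If sorted output is wanted, pipe it to sort as suggested, for example:
$ awk '{split($4,b," "); a[b[1]]=1} END { for( i in a) print i }' FS=, x.CSV | sort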

need some help on awk command

I need help with awk. I'm reading a CSV file and doing some substitution on some of the columns. The 9th column (string type) should be replaced by the value of (the 9th column itself + the value of the 4th column (integer)), then the 15th column by $15+$12, and the 26th column by $26+$23. The same has to be done line by line for all the records. Suggestions please.
Below is the sample input/output. The first line, which is the description (header), must be left as is.
sample Input
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst|Del|20|SD|DA
101|ms|Del|21|XS|DA
Sample output
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
It's like Empname has been concatenated with EmpID, and roleDesc with roleId. Hope that's helpful :)
This will perform the needed transformation:
$ awk 'NR>1{$2=$2$1;$5=$5$4}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
If you have to do this for many columns you can use a for loop like so (provided the column numbers follow an arithmetic or geometric step size):
$ awk 'NR>1{for(i=2;i<=5;i+=3)$i=$i$(i-1)}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
When you say +, I'm assuming you mean string concatenation. In awk, there is no specific concatenation operator; you just put two strings side by side.
awk -F, -v OFS=, '{$9 = $9 $4; $15=$15$12; $26=$26$23; print}' file.csv
Also assuming that by "csv", you actually mean comma-separated.
If you want to edit the file in-place, you need to do this:
awk ... file.csv > newfile && mv file.csv file.csv.bak && mv newfile file.csv
Edit: to leave the first line untouched:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} {print}' file.csv
Now the columns are modified for the 2nd and subsequent lines, but every line is printed.
You'll sometimes see that written this way:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} 1' file.csv
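The trailing 1 is a condition that is always true with no action attached, so awk applies its default action, print, to every line - the same effect as the explicit {print} block in the previous version.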
