Please help me count the line items whose third field equals zero, and the line items whose third field is non-zero, grouped by the second field.
If a group has no matching lines, the "count!=0" / "count=0" value should still be populated as "0".
Input file: f1.txt
ppp1,abc,10,qqq
ppp2,abc,5,qqq
ppp3,abc,0,qqq
ppp4,abc,18,qqq
ppp5,abc,0,qqq
mmm1,xyz,0,rrr
mmm2,xyz,55,rrr
nnn1,ijk,12,sss
nnn2,ijk,89,sss
nnn3,ijk,62,sss
bbb1,lmn,0,ttt
bbb2,lmn,0,ttt
Output.txt
abc,count!=0,3
abc,count=0,2
xyz,count!=0,1
xyz,count=0,1
ijk,count!=0,3
ijk,count=0,0
lmn,count!=0,0
lmn,count=0,2
I broke the command into lines to make it easier to read:
awk -F, -v OFS="," '{c[$2];                 # register every group key seen
if($3!=0)a[$2]++;else b[$2]++}              # count non-zero vs zero per group
END{for(x in c){
print x,"count!=0",a[x]*1;                  # *1 coerces a missing count to 0
print x,"count=0",b[x]*1}}' f1.txt
awk -F, '{ if ($3==0) print $2" count=0"; else print $2" count!=0" }' f1.txt |
sort | uniq -c | awk '{print $2","$3","$1}' >> Output.txt
Note: the sort is needed because uniq -c only collapses adjacent lines, and unlike the END approach above this variant cannot emit the 0 rows for groups such as ijk and lmn.
I have a file that looks like the one below.
1234|A|B|C|10|11|12
2345|F|G|H|13|14|15
3456|K|L|M|16|17|18
I want the output as
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I have tried the script below.
awk -F"|" '{print $1","$2","$3","$4}' file.dat | awk -F"," '{OFS=RS;$1=$1}1'
But the output is generated as below.
1234
A
B
C
2345
F
G
H
3456
K
L
M
Any help is appreciated.
What about a single simple awk process such as this:
$ awk -F\| '{print $1 "|" $2 "\n" $1 "|" $3 "\n" $1 "|" $4}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
No messing with RS and OFS.
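An equivalent formulation with printf, if you prefer a single format string (purely a stylistic sketch):

awk -F'|' '{ printf "%s|%s\n%s|%s\n%s|%s\n", $1, $2, $1, $3, $1, $4 }' file.dat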
If you want to do this dynamically, you can pass in the number of fields that you want and then use a loop starting from the second field.
The script first checks that the number of fields is equal to or greater than the number you pass in (in this case n=4):
awk -F\| -v n=4 '
NF >= n {
for(i=2; i<=n; i++) print $1 "|" $i
}
' file
Output
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
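For comparison, a loop that runs all the way to NF prints every field after the first; for this sample file it would also emit the numeric columns (1234|10 and so on), which is exactly why the answer above caps the loop at n:

awk -F\| '{ for (i = 2; i <= NF; i++) print $1 "|" $i }' file.dat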
$ perl -lne '($a,@b)=((split/\|/)[0..3]); foreach (@b){ print join "|", $a, $_ }' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I am trying to filter a file on the values in the 8th column, keeping rows where the value is >= 10. I am using awk, but for some reason it doesn't work. Am I doing something wrong? What am I missing?
head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract columns 1 and 8 and print the line if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see, lots of numbers shouldn't be there (e.g. 2.82778125 < 10)
Following up on the OP's comment: in case you don't want to match LQN at the start of the line and only want to check whether the 8th column is at least 10, try the following (to restrict to lines that do start with LQN, remove the ! from the code below).
awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv
Or, to get the total number of matching lines, the counting can be done within a single awk itself:
awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
Explanation: a detailed breakdown of the above.
awk -F"," '            ##Start the awk program; fields are comma-separated.
$8+0 >= 10 && !/^LQN/{ ##If the 8th field, coerced to a number by +0, is at least 10 and the line does NOT start with LQN.
count++                ##Increase count by 1.
}
END{                   ##The END block runs once all input has been read.
print count            ##Print the final count value.
}
' df_TPM.csv           ##The input file name.
To strip the control-M (carriage return) characters within the awk code itself, assuming you don't want them in your input file, try:
awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
You need to tell awk to coerce $8 into a number by computing $8+0. It is also worth making sure you have GNU awk installed to avoid portability issues, and you may want to run dos2unix on the file first to normalize the line endings.
The whole command can be written as
awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv
NOTE: To only count these lines, use
awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:
awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
Details
-F"," (= -F,) - set the field separator to a comma
/^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field, coerced to a number by +0, is equal to or larger than 10
!/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is equal to or larger than 10
{print $1, $8} - print Fields 1 and 8
{cnt++} - increment the cnt variable
END{print cnt} - print the cnt variable once awk finishes processing all lines
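As an aside, a quick way to confirm whether the file really has DOS line endings (which is what turns the last field into a string like "70.8815385\r" and makes the plain comparison behave as a string comparison), assuming GNU coreutils cat:

head -n 1 df_TPM.csv | cat -A
# if the line ends in ^M$, the file has carriage returns; fix with: dos2unix df_TPM.csv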
I would like to sort Input.csv based on fields $1 and $5, producing country-wise A-Z order.
While sorting, the country name needs to be taken from either $1 or $5, since one of the two fields may be blank.
Input.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
,,,,mno,50,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
Desired Output.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
I have tried the command below but am not getting the desired output: rows with a blank country in $1 sort to the top instead of next to their $5 country. Please suggest.
head -1 Input.csv > Output.csv; sort -t, -k1,1 -k5,5 <(tail -n +2 Input.csv) >> Output.csv
awk to the rescue!
$ awk -F, '{print ($1==""?$5:$1) "\t" $0}' file | sort | cut -f2-
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
def,20,02-Aug-13,Aug,,,,,
def,20,02-Jul-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
This relies on the header starting with an uppercase letter while the data is lowercase, so the header sorts first. If that is not a valid assumption, the header requires special handling, as you did above, or better, within awk:
$ awk -F, 'NR==1{print; next} {print ($1==""?$5:$1) "\t" $0 | "sort | cut -f2-"}' file
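Equivalently, you can keep the explicit header handling from the question and apply the same decorate-sort-cut trick to just the data rows; a sketch combining the two:

head -1 Input.csv > Output.csv
tail -n +2 Input.csv | awk -F, '{print ($1==""?$5:$1) "\t" $0}' | sort | cut -f2- >> Output.csv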
Is this what you want? (The first line is omitted.)
cat file_containing_your_lines | awk 'NR != 1' | sed "s/,/\t/g" | sort -k1,1 -k5,5 | sed "s/\t/,/g"
I have a requirement to find duplicates based on three columns in a comma-delimited .txt file on unix.
Input:
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
Let's say we are finding duplicates based on fields 1, 2 and 5.
Expected output:
a,b,c,d,e,f,gf,h
a,b,df,d,e,fd,g,h
Can anyone help write a script for this, or is there a command already available?
I tried this:
awk -F, '!x[$1,$2,$3]++' file.txt
but it did not work (it keeps the first occurrence of each key, which is the opposite of printing the duplicates, and it uses $3 instead of $5).
One way using awk:
awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' a.txt a.txt
This is simple but reads the file twice. On the first pass (FNR==NR), it maintains counts keyed on the key fields; during the second pass, it prints the line if its key was seen more than once.
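If reading the file twice is a problem (for example, the input comes from a pipe), here is a single-pass sketch that buffers the lines per key in memory and prints the duplicated groups at the end; note that the order of the groups in the output is then unspecified:

awk -F, '{ k = $1 SUBSEP $2 SUBSEP $5; n[k]++; buf[k] = buf[k] $0 ORS }
END { for (k in buf) if (n[k] > 1) printf "%s", buf[k] }' a.txt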
Another way using awk:
awk -F, '{if (x[$1,$2,$5]) { y[$1,$2,$5]++; if (y[$1,$2,$5] == 1) { print x[$1,$2,$5] } print $0 } x[$1,$2,$5] = $0}' a.txt
Explanation:
1 awk -F,
2 '{if (x[$1,$2,$5])
3    { y[$1,$2,$5]++;
4      if (y[$1,$2,$5] == 1)
5        { print x[$1,$2,$5] }
6      print $0
7    } x[$1,$2,$5] = $0
8 }'
Line 2: If x has the key $1,$2,$5, this key was seen before, so do steps 3-6. (The commas in the subscript join the fields with SUBSEP, so keys like a,bc and ab,c cannot collide the way a plain $1$2$5 concatenation would.)
Line 3: Increment the duplicate count for this key.
Lines 4-5: If we are seeing this key for the 2nd time, first print the stored first line with this key. When we saw that line we did not yet know whether it was a dup, so we print it now, ahead of the current line, to keep the input order.
Line 6: Print the current line because it is a dup.
Line 7: Store the current line against the key so we can use it in step 2.
Another way using sort, uniq and awk
Note: the uniq command has a '-f' option to skip the specified number of fields before it starts comparing.
sort -t, -k1,1 -k2,2 -k5,5 a.txt | awk -F, 'BEGIN { OFS = " "} {print $0, $1, $2, $5}' | sed 's/,/ /g' | uniq -f8 -D | sed 's/ /,/g' | cut -d',' -f 1-8
This sorts on fields 1, 2 and 5 so that duplicates become adjacent. awk prints the original line and appends fields 1, 2 and 5. sed changes the delimiter to spaces because uniq has no option to specify a delimiter. uniq skips the first 8 fields (the original columns), compares the appended key fields and prints all duplicated lines (-D). The trailing sed and cut turn the spaces back into commas and strip the appended key again. Note that this assumes the data itself contains no spaces.
I had a similar issue:
I needed to eliminate duplicate detail records while preserving the flat-file record formatting and the sequence of the records.
The duplication was caused by a time expansion of the date field in column 2 of the detail records only.
The receiving system was reporting duplication on columns 4 and 5.
I cobbled together this quick hack to resolve it.
First, read the file data into an array.
Then read and manipulate the individual records (crudely, with a counter), as demonstrated in this snippet, with a case statement to treat the various record types.
Cheers!
# assumes bash; "[input file name]" is a placeholder for the actual file
infile="[input file name]"
readarray -t inrecs < "${infile}"
filebase=$(echo "${infile}" | cut -d'.' -f1)
i=1
for inrec in "${inrecs[@]}"; do
    field1=$(echo "${inrecs[$i-1]}" | cut -d',' -f1)
    field2=$(echo "${inrecs[$i-1]}" | cut -d',' -f2)
    field3=$(echo "${inrecs[$i-1]}" | cut -d',' -f3)
    field4=$(echo "${inrecs[$i-1]}" | cut -d',' -f4)
    field5=$(echo "${inrecs[$i-1]}" | cut -d',' -f5)
    field6=$(echo "${inrecs[$i-1]}" | cut -d',' -f6)
    field7=$(echo "${inrecs[$i-1]}" | cut -d',' -f7)
    field8=$(echo "${inrecs[$i-1]}" | cut -d',' -f8)
    case $field1 in
    'H')
        # header record: start a fresh output file
        echo "$field1,$field2,$field3" > "${filebase}.new"
        ;;
    'D')
        # detail record: if its key (columns 4,5) is duplicated in the input,
        # write it only if it has not been written to the output yet
        dupecount=$(zegrep -c -e "${field4},${field5}" "${infile}")
        if [[ "$dupecount" -gt 1 ]]; then
            writtencount=$(zegrep -c -e "${field4},${field5}" "${filebase}.new")
            if [[ "${writtencount}" -eq 0 ]]; then
                echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
            fi
        else
            echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
        fi
        ;;
    'T')
        # trailer record: recount the detail records actually written
        dcount=$(zegrep -c '^D' "${filebase}.new")
        echo "$field1,$field2,$dcount,$field4" >> "${filebase}.new"
        ;;
    esac
    ((i++))
done
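For what it's worth, the keep-first-duplicate part of this (everything except the trailer recount) also fits in a single awk pass; a sketch, assuming the same H/D/T layout keyed on columns 4 and 5 and a plain-text (not gzipped) input:

# pass H and T records through; for D records, keep only the first line per (col4,col5) key
awk -F, '$1 != "D" || !seen[$4 FS $5]++' "${infile}" > "${filebase}.new"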
I need to first grep the exact line and then capture the required value. For example:
Total logical records skipped: 0
Total logical records read: 500
Total logical records rejected: 3
Total logical records discarded: 0
I need to capture the value 500. How can I do that?
One way using awk:
awk '/Total logical records read:/ { print $NF }' file.txt
If by capture, you mean store as a shell variable:
variable=$(awk '/Total logical records read:/ { print $NF }' file.txt)
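If your grep is GNU grep built with PCRE support (-P), the same capture works without awk; \K discards everything matched before it, leaving only the number (a sketch):

variable=$(grep -oP 'Total logical records read:\s*\K\d+' file.txt)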
Use grep + awk:
grep read log.txt | awk '{print $5}'
or just awk:
awk '/read/ {print $5}' log.txt
Look at these examples: Awk Introduction Tutorial – 7 Awk Print Examples