Would like print to incrementally count and then print the counts of the unique values in column 1 & column 2 & column 3 ...Column NF and Column $0
and if the word is appeared only one time of column 1, would like to print remarks as "No" as duplicated flag
and if the word is appeared more than one time of column 1, would like to print remarks as "Yes" as duplicated flag
Looking something like this
awk -F"," '{OFS=","; if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 ",", "Yes/No", counter }'
For example, I am trying to check is there any duplicate information in the field $1 (Fruits Name) .
Under Name field, "Apple" appears three times , "Orange" appears two times,"Mango" appear one time.
So if any word is not repeated more than one time is consieder as "Name_Dup=No" duplicate and count of appears is "Name_Counter=1" (i.e Mango)
where "Apple" appears 3 times , so it is repeated/duplicated -remarks as "Yes" when it appears first time count is "Name_Dup=Yes" and Name_Counter=1" ,
when it appears second time "Name_Dup=Yes" and Name_Counter=2, when it appears 3rd time "Name_Dup=Yes" and Name_Counter=3
Then need to check each column $2, $3 .. till $NF and $0 ..
My actual input file is not sorted on any order. No of fields used to be vary like 10 fields, 12 fields and 15 fields etc
Input.csv
Name,Amount,Dept
Apple,10,eee
Orange,20,csc
Apple,30,mec
Mango,40,sss
Apple,10,eee
Orange,10,csc
Desired Output
Name,Amount,Dept,Name_Dup,Name_Counter,Amount_Dup,Amount_Counter,Dept_Dup,Dept_Counter,EntireLine_Dup,EntireLine_Counter
Apple,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,Yes,1,No,1,Yes,1,No,1
Apple,30,mec,Yes,2,No,1,No,1,No,1
Mango,40,sss,No,1,No,1,No,1,No,1
Apple,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
Orange,10,csc,Yes,2,Yes,3,Yes,2,No,1
For example , Please find below steps for reference.
Step#1 - Field $1 check and Output
Name,Name_Dup,Name_Counter
Apple,Yes,1
Orange,Yes,1
Apple,Yes,2
Mango,No,1
Apple,Yes,3
Orange,Yes,2
Step#2 - Field $2 check and Output
Amount,Amount_Dup,Amount_Counter
10,Yes,1
20,No,1
30,No,1
40,No,1
10,Yes,2
10,Yes,3
Step#3 - Field $3 check and Output
Dept,Dept_Dup,Dept_Counter
eee,Yes,1
csc,Yes,1
mec,No,1
sss,No,1
eee,Yes,2
csc,Yes,2
Step#4-Field $0 check, combination of $1 & $2 & $3 and Output
"Name,Amount,Dept",EntireLine_Dup,EntireLine_Counter
"Apple,10,eee",Yes,1
"Orange,20,csc",No,1
"Apple,30,mec",No,1
"Mango,40,sss",No,1
"Apple,10,eee",Yes,2
"Orange,10,csc",No,1
awk solution:
OP asks for, as I understand it, to show per line, per column, if a column value shows up more than once and give an occurrence count of this particular column so far.
$ cat tst.awk
BEGIN{ FS=OFS="," }
NR==1{
header=$0
n=split("Dup,Counter",h)
for (i=1; i<=NF; i++)
for (j=1; j<=n; j++) header=header OFS $i"_"h[j]
printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
next
}
{
r[++lines]=$0
for (col=1; col<=NF; col++) v[col][$col]++
v[col][$0]++
}
END {
for (l=1; l<=lines; l++){
n=split(r[l], s)
res=""
for (c=1; c<=n; c++)
res=res OFS output(v,c,s[c])
res=res OFS output(v,c,r[l])
print r[l] res
}
}
function output(arr, col, val){
return sprintf("%s,%s", (arr[col][val] > 1? "Yes" : "No"), ++count[col][val])
}
with input:
$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee
this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):
$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1
you are not providing what efforts you placed so far. Here is a hint where I would start. I guess since awk is the tool to use, start with shell command sort Input.csv and pipe it to awk. Populate an array when reading the input as well as an associative array with index the first field.
I n the END section go over the array and see if you find the first field more than once. It takes a bit of time however that sounds like a homework. Not a production problem.
I have a tab-delim text file with only 4 columns as shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:2:d:c:a:FAIL
If the string "FAIL" is found in a specific column starting from column2 to columnN (all the strings are separated by ":") then it would need to replace the second element in that column to "-1". Sample output is shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
Any help using awk?
With any awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=2;i<=NF;i++) if ($i~/:FAIL$/) sub(/:[^:]+/,":-1",$i)} 1' file
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
In order to split in awk you can use "split".
An example of it would be the following:
split(1,2,"3");
1 is the string you want to split
2 is the array you want to split it into
and 3 is the character that you want to be split on
e.g
string="hello:world"
result=`echo $string | awk '{ split($1,ARR,":"); printf("%s ",ARR[1]);}'`
In this case the result would be equal to hello, because we split the string to the " : " character and we printed the first half of the ARR, if we would print the second half (so printf("%s ",ARR[2])) of the ARR then it would be returned to result the "world".
With gawk:
awk '{$0=gensub(/[^:]*(:[^:]*:[^:]*:[^:]:FAIL)/,"-1\\1", "g" , $0)};1' File
with sed:
sed 's/[^:]*\(:[^:]*:[^:]*:[^:]:FAIL\)/-1\1/g' File
If you are using GNU awk, you can take advantage of the RT feature1 and split the records at tabs and newlines:
awk '$NF == "FAIL" { $2 = "-1"; } { printf "%s", $0 RT }' RS='[\t\n]' FS=':' infile
Output:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
1 The record separator that follows the current record.
Your requirements are somewhat vague, but I'm pretty sure this does what you want with bog standard awk (no gnu-awk extensions):
awk '/FAIL/{$2=-1}1' ORS=\\t RS=\\t FS=: OFS=: input
Can somebody explain the meaning of the below script please?
awk -F "," 's != $1 || NR ==1{s=$1;if(p){print p};p=$0;next}
{sub($1,"",$0);p=p""$0;}
END{print p}' file
The file has the following data:
2,"URL","website","aaa.com"
2,"abc","text","some text"
2,"Password","password","12345678"
3,"URL","website","10.0.10.75"
3,"uname","text","some text"
3,"Password","password","password"
3,"ulang","text","some text"
4,"URL","website","192.168.2.110"
4,"Login","login","admin"
4,"Password","password","blah-blah"
and the output is:
2,"URL","website","aaa.com","abc","text","some text",Password","password","12345678"
3,"URL","website","10.0.10.75","uname","text","some text""Password","password","password","ulang","text","some text"
awk has this structure
pattern {action}
for your script, let's analyze the elements, first pattern
s != $1 || NR == 1 # if the variable s is not equal to first field
# or we're operating on first row
first action
s = $1 # assign variable s to first field
if (p) { # if p is not empty, print
print p
}
p = $0 # assign the current line to p variable
next # move to next, skip the rest
next pattern is missing, so the action will apply to all rows
sub($1, "", $0) # same as $1="", will clear the first field
p = ((p "") $0) # concat new line to p
last pattern is special reserved word END, only applied when all rows are consumed (there is counterpart BEGIN that's applied before the file is opened)
END {
print p # print the value of p
}