I'm trying to display all the files in a directory that have the same contents, in a specific way. If a file is unique, it does not need to be displayed. Any file that is identical to others needs to be displayed on the same line as its duplicates, separated by commas.
For example,
c176ada8afd5e7c6810816e9dd786c36 2group1
c176ada8afd5e7c6810816e9dd786c36 2group2
e5e6648a85171a4af39bbf878926bef3 4group1
e5e6648a85171a4af39bbf878926bef3 4group2
e5e6648a85171a4af39bbf878926bef3 4group3
e5e6648a85171a4af39bbf878926bef3 4group4
2d43383ddb23f30f955083a429a99452 unique
3925e798b16f51a6e37b714af0d09ceb unique2
should be displayed as,
2group1, 2group2
4group1, 4group2, 4group3, 4group4
I know which files are considered unique in a directory from using md5sum, but I do not know how to do the formatting part. I think the solution involves awk or sed, but I am not sure. Any suggestions?
Awk solution (for your current input):
awk '{ a[$1]=a[$1]? a[$1]", "$2:$2 }END{ for(i in a) if(a[i]~/,/) print a[i] }' file
a[$1]=a[$1]? a[$1]", "$2:$2 - accumulates group names (from field $2) for each unique hash in the 1st field $1. The array a is indexed by hash, with the concatenated group names as values (separated by a comma ,).
for(i in a) - iterates through the array items
if(a[i]~/,/) print a[i] - if the hash is associated with more than one group (the value contains a comma ,), print the item
The output:
2group1, 2group2
4group1, 4group2, 4group3, 4group4
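Since the question mentions generating the hashes with md5sum, the whole thing can be run as one pipeline. A minimal sketch, assuming GNU coreutils md5sum (whose output is `hash  filename`) and a scratch directory with two identical files and one unique file:

```shell
# Set up a scratch directory: two files with identical contents, one unique.
dir=$(mktemp -d) && cd "$dir"
printf 'aaa\n' > 2group1; printf 'aaa\n' > 2group2; printf 'zzz\n' > unique

# Hash every file, then group the filenames by hash and keep groups of 2+.
md5sum * |
awk '{ a[$1] = a[$1] ? a[$1] ", " $2 : $2 } END { for (i in a) if (a[i] ~ /,/) print a[i] }'
# 2group1, 2group2
```

The file names here are made up for the demo; the awk stage is exactly the one above.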
Given the input you provided, you essentially want to collect all the second columns where the first column is the same. So the first step is to use awk to hash the second column by the first. I leverage the solution posted here: Concatenate lines by first column by awk or sed
awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file
c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
3925e798b16f51a6e37b714af0d09ceb => unique2,
2d43383ddb23f30f955083a429a99452 => unique,
And if you really want to filter out the unique ones, just keep the lines with more than two fields when telling AWK to use ',' as the separator (the trailing comma adds an empty field, so a line with a single group has exactly two):
awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file | awk -F ',' 'NF > 2'
c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
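The second awk pass can also be avoided by keeping a per-hash count in the same script. A sketch of the single-pass variant (as with any `for (key in table)` loop, the order of the output lines is unspecified):

```shell
awk '{ cnt[$1]++; table[$1] = table[$1] $2 "," }
     END { for (key in table) if (cnt[key] > 1) print key " => " table[key] }' file
```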
perl:
perl -lane '
push @{$groups{$F[0]}}, $F[1]
} END {
for $g (keys %groups) {
print join ", ", @{$groups{$g}} if @{$groups{$g}} > 1
}
' file
The order of the output is indeterminate.
This might work for you (GNU sed):
sed -r 'H;x;s/((\S+)\s+\S+)((\n[^\n]+)*)\n\2\s+(\S+)/\1,\5\3/;x;$!d;x;s/.//;s/^\S+\s*//Mg;s/\n[^,]+$//Mg;s/,/, /g' file
Gather up all the lines of the file and use pattern matching to collapse the lines. At the end of the file, remove the keys and any unique lines and then print the remainder.
I have a file with duplicate values in it. Based on a few fields (field 2, field 3) I need to remove the duplicates and resequence a field (ID) which is the unique key of the file. How can I achieve this?
for eg. My file (test.txt) contains
1,Eng,ECE
2,Eng,ECE
3,Eng,CS
4,Eng,CS
I want the output to be below
1,Eng,ECE
2,Eng,CS
I have removed the duplicates using the command
awk -F ',' '!a[$2$3]++' test.txt > test1.txt
How can I change the sequence of the ID field now?
You can use
awk -F ',' -v "OFS=," '!a[$2$3]++ { $1=++i; print}' test.txt
This will renumber the first field starting with 1.
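A quick check against the sample test.txt data, fed in via printf:

```shell
printf '1,Eng,ECE\n2,Eng,ECE\n3,Eng,CS\n4,Eng,CS\n' |
awk -F ',' -v OFS=',' '!a[$2$3]++ { $1 = ++i; print }'
# 1,Eng,ECE
# 2,Eng,CS
```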
Another approach:
awk 'BEGIN { FS=OFS="," }
($2,$3) in seen { next }
{ seen[$2,$3] = 1; print ++seqno, $2, $3 }' test.txt
1,Eng,ECE
2,Eng,CS
I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'xY' or 'Xy' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is, say, 5 or more characters long.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line xy12 which I need to replace specific strings
I tried the below commands (did not yet come to the stage where I restrict to word lengths) but it does not work and displays the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b/******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
With sed, for the mental exercise:
sed -E '
s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
:A
s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
tA
s/#/*/g'
This requires that the text does not already contain a # character.
A simple POSIX awk version :
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are converted to a single one); the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
ORS=RS=" "
}
{
for(i=1;i<=NF;i++) {
if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
gsub(/./,"*", $i);
print $i;
}
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
Details
BEGIN {ORS=RS=" "} - start of the awk program: set both the record separator and the output record separator to a space
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)); if the current field ($i) is 5 or more characters long (length($i) >= 5) and it matches xy in any case followed by 1 or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), replace each char with * (with gsub(/./,"*", $i)); then print the current field value.
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string that begins with xy (case insensitive) followed by 5 or more characters, there is no work to be done.
Otherwise:
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat
Can awk process this?
Input
Neil,23,01-Jan-1990
25,Reena,19900203
Output
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
awk approach:
awk -F, '{for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) $i="\047"$i"\047"}1' OFS="," file
The output:
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
if($i~/[[:alpha:]]/) - if the field contains an alphabetic character
\047 - the octal escape for the single-quote character '
My first attempt was incorrect:
sed -r 's/([^,]*[a-zA-Z]+[^,]*)(,{0,1})/"\1"\2/g' inputfile
@Sundeep gave an excellent comment: I need single quotes, and it can be shorter.
I tried to match up to the , or end-of-line, which complicated the matching. You can simply match between the separators, making sure there is an alphabetic character somewhere:
sed 's/[^,]*[a-zA-Z][^,]*/\x27&\x27/g' inputfile
You might use this script:
script.awk
BEGIN { OFS=FS="," }
{ for(i= 1; i<=NF; i++) {
if( !match( $i, /^[0-9]+$/ ) ) $i = "'" $i "'"
}
print
}
and run it like this: awk -f script.awk yourfile .
Explanation
the first line sets both the input and output field separators to ,.
the loop tests each field for whether it consists only of digits (/^[0-9]+$/):
if not, the field is wrapped in single quotes
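The same logic can be checked inline without a separate script file; note `\047` stands in for the literal quote, since the program now sits inside shell single quotes:

```shell
printf 'Neil,23,01-Jan-1990\n25,Reena,19900203\n' |
awk 'BEGIN { OFS=FS="," }
     { for (i = 1; i <= NF; i++)
         if (!match($i, /^[0-9]+$/)) $i = "\047" $i "\047"
       print }'
# 'Neil',23,'01-Jan-1990'
# 25,'Reena',19900203
```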
I have a Unix file which has data like this.
1379545632,
1051908588,
229102020,
1202084378,
1102083491,
1882950083,
152212030,
1764071734,
1371766009,
(FYI, there is no empty line between the numbers; any gaps above are just an artifact of the editor here. It is simply a column with the numbers one below the other.)
I want to transpose it and print as a single line.
Like this:
1379545632,1051908588,229102020,1202084378,1102083491,1882950083,152212030,1764071734,1371766009
Also remove the last comma.
Can someone help? I need a shell/awk solution.
tr -d '\n' < file.txt
This joins all the lines; to remove the trailing comma, pipe through sed 's/,$//'.
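Putting the two together (using tr -d to delete the newlines outright, so no stray spaces are left), with a shortened sample of the input:

```shell
printf '1379545632,\n1051908588,\n229102020,\n' | tr -d '\n' | sed 's/,$//'
# 1379545632,1051908588,229102020
```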
With GNU awk for multi-char RS:
$ printf 'x,\ny,\nz,\n' | awk -v RS='^$' '{gsub(/\n|(,\n$)/,"")} 1'
x,y,z
awk 'BEGIN { ORS="" } { print }' file
ORS : Output Record Separator.
Each record printed is followed by this string; setting it to the empty string joins all the records together.
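On the sample data, combined with a sed cleanup for the trailing comma:

```shell
printf '1379545632,\n1051908588,\n229102020,\n' |
awk 'BEGIN { ORS="" } { print }' | sed 's/,$//'
# 1379545632,1051908588,229102020
```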
my input file is like this.
01,A,34
01,A,35
01,A,36
01,A,37
02,A,40
02,A,41
02,A,42
02,A,45
my output needs to be
01,A,37
01,A,36
01,A,35
02,A,45
02,A,42
02,A,41
i.e. select only the top three records (largest values of the 3rd column) for each key (1st and 2nd columns)
Thanks in advance...
You can use a simple bash script to do this provided the data is as shown.
pax$ cat infile
01,A,34
01,A,35
01,A,36
01,A,37
02,A,40
02,A,41
02,A,42
02,A,45
pax$ ./go.sh
01,A,37
01,A,36
01,A,35
02,A,45
02,A,42
02,A,41
pax$ cat go.sh
keys=$(sed 's/,[^,]*$/,/' infile | sort -u)
for key in ${keys} ; do
grep "^${key}" infile | sort -r | head -3
done
The first line builds the full set of keys from the first two fields: sed removes the final column, then sort -u sorts the output and removes duplicates. In this particular case, the keys are 01,A, and 02,A,.
It then extracts the relevant data for each key (the for loop in conjunction with grep), sorting in descending order with sort -r and keeping only the first three lines per key with head.
Now, if your key is likely to contain characters special to grep, such as . or [, you'll need to watch out.
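If that is a concern, the same per-key top-3 can be done without grep at all. A sketch using sort plus awk (assuming the third field is numeric, as in the sample): sort groups the keys and orders each group descending, then awk keeps the first three lines of each group.

```shell
sort -t, -k1,2 -k3,3nr infile |
awk -F, '{ key = $1 FS $2 }
         key != prev { count = 0; prev = key }
         ++count <= 3'
```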
With Perl:
perl -F, -lane'
push @{$_{join ",", @F[0,1]}}, $F[2];
END {
for $k (keys %_) {
print join ",", $k, $_
for (sort { $b <=> $a } @{$_{$k}})[0..2]
}
}' infile