Remove duplicated string stored in variable - unix

I have a variable $var with this content:
var=word1,word2,word3,word1,word3
and I need to delete duplicate words and the results is required stored in the same variable $var.

Try
var="word1,word2,word3,word1,word3"
list=$(echo $var | tr "," "\n")
var=($(printf "%s\n" "${list[#]}" | sort | uniq -c | sort -rnk1 | awk '{ print $2 }'))
echo "${var[#]}"

If open to perl then:
$ var="word1,word2,word3,word1,word3"
$ var=$(perl -F, -lane'{$h{$_}++ or push #a, $_ for #F; print join ",", #a}' <<< "$var")
$ echo "$var"
word1,word2,word3

Related

How to coerce AWK to evaluate string as math expression?

Is there a way to evaluate a string as a math expression in awk?
balter#spectre3:~$ echo "sin(0.3) 0.3" | awk '{print $1,sin($2)}'
sin(0.3) 0.29552
I would like to know a way to also have the first input evaluated to 0.29552.
You can just create your own eval function which calls awk again to execute whatever command you want it to:
$ cat tst.awk
{ print eval($1), sin($2) }
function eval(str, cmd,line,ret) {
cmd = "awk \047BEGIN{print " str "; exit}\047"
if ( (cmd | getline line) > 0 ) {
ret = line
}
close(cmd)
return ret
}
$ echo 'sin(0.3) 0.3' | awk -f tst.awk
0.29552 0.29552
$ echo '4*7 0.3' | awk -f tst.awk
28 0.29552
$ echo 'tolower("FOO") 0.3' | awk -f tst.awk
foo 0.29552
awk lacks an eval(...) function. This means that you cannot do string to code translation based on input after the awk program initializes. Ok, perhaps it could be done, but not without writing your own parsing and evaluation engine in awk.
I would recommend using bc for this effort, like
[edwbuck#phoenix ~]$ echo "s(0.3)" | bc -l
.29552020666133957510
Note that this would require sin to be shortened to s as that's the bc sine operation.
Here's a simple one liner!
math(){ awk "BEGIN{printf $1}"; }
Examples of use:
math 1+1
Yields "2"
math 'sqrt(25)'
Yeilds "5"
x=100; y=5; math "sqrt($x) + $y"
Yeilds "15"
With gawk version 4.1.2 :
echo "sin(0.3) 0.3" | awk '{split($1,a,/[()]/);f=a[1];print #f(a[2]),sin($2)}'
It's ok with tolower(FOO) too.
You can try Perl as it has eval() function.
$ echo "sin(0.3)" | perl -ne ' print eval '
0.29552020666134
$
For the given input,
$ echo "sin(0.3) 0.3" | perl -ne ' /(\S+)\s+(\S+)/ and print eval($1), " ", $2 '
0.29552020666134 0.3
$

awk to sort two fields:

Would like to sort Input.csv file based on fields $1 and $5 and generate country wise A-Z order.
While doing sort need to consider country name either from $1 or $5 if any of the fields are blank.
Input.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
,,,,mno,50,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
Desired Output.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
I have tried below command but not getting desired output. Please suggest..
head -1 Input.csv > Output.csv; sort -t, -k1,1 -k5,5 <(tail -n +2 Input.csv) >> Output.csv
awk to the rescue!
$ awk -F, '{print ($1==""?$5:$1) "\t" $0}' file | sort | cut -f2-
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
def,20,02-Aug-13,Aug,,,,,
def,20,02-Jul-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
here the header starting with uppercase and data is lowercase. If this is not a valid assumption special handling of header required as you did above or better with awk
$ awk -F, 'NR==1{print; next} {print ($1==""?$5:$1) "\t" $0 | "sort | cut -f2-"}' file
Is this what you want? (Omitted first line)
cat file_containing_your_lines | awk 'NR != 1' | sed "s/,/\t/g" | sort -k 1 -k 5 | sed "s/\t/,/g"

How can I extract all repeated pattern in a line to comma separated format

I am extracting an interested pattern in a file. In each line I have repeated pattern and I want to order all repeated pattern for each line in a comma separated format. For example: In each line I have a string like this:
Line1: InterPro:IPR000504 InterPro:IPR003954 InterPro:IPR012677 Pfam:PF00076 PROSITE:PS50102 SMART:SM00360 SMART:SM00361 EMBL:CP002684 Proteomes:UP000006548 GO:GO:0009507 GO:GO:0003723 GO:GO:0000166 Gene3D:3.30.70.330 SUPFAM:SSF54928 eggNOG:KOG0118 eggNOG:COG0724 InterPro:IPR003954
Line2: InterPro:IPR000306 InterPro:IPR002423 InterPro:IPR002498 Pfam:PF00118 Pfam:PF01363 Pfam:PF01504 PROSITE:PS51455 SMART:SM00064 SMART:SM00330 InterPro:IPR013083 Proteomes:UP000006548 GO:GO:0005739 GO:GO:0005524 EMBL:CP002686 GO:GO:0009555 GO:GO:0046872 GO:GO:0005768 GO:GO:0010008 Gene3D:3.30.40.10 InterPro:IPR017455
I want to extract all InterPro IDs for each line as like as this :
IPR000504,IPR003954,IPR012677,IPR003954
IPR000306,IPR002423,IPR002498,IPR013083,IPR017455
I have used this script:
while read line; do
NUM=$(echo $line | grep -oP 'InterPro:\K[^ ]+' | wc -l)
if [ $NUM -eq 0 ];then
echo "NA" >> InterPro.txt;
fi;
if [ ! $NUM -eq 0 ];then
echo $line | grep -oP 'InterPro:\K[^ ]+' | tr '\n' ',' >> InterPro.txt;
fi;
done <./File.txt
The problem is once I run this script, all the pattern's values in the File.txt print in one line. I want all interested pattern's values of each line print in separated line.
Thank you in advance
With awk:
awk '{for (i=1; i<=NF; ++i) {if ($i~/^InterPro:/) {gsub(/InterPro:/, "", $i); x=x","$i}} gsub (/^,/, "", x); print x; x=""}' file
Output:
IPR000504,IPR003954,IPR012677,IPR003954
IPR000306,IPR002423,IPR002498,IPR013083,IPR017455
With indent and more meaningful variable names:
awk '
{
for (column=1; column<=NF; ++column)
{
if ($column~/^InterPro:/)
{
gsub(/InterPro:/, "", $column)
line=line","$column
}
}
gsub (/^,/, "",line)
print line
line=""
}' file
With bash builtin commands:
while IFS= read -r line; do
for column in $line; do
[[ $column =~ ^InterPro:(.*) ]] && new+=",${BASH_REMATCH[1]}"
done
echo "${new#,*}"
unset new
done < file
Finally, I changed the script and could get the interested results:
while read line; do
NUM=$(echo $line | grep -oP 'InterPro:\K[^ ]+' | wc -l)
if [ $NUM -eq 0 ];then
echo "NA" >> InterPro.txt;
fi;
if [ ! $NUM -eq 0 ];then
echo $line | grep -oP 'InterPro:\K[^ ]+' | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}' | sed 's/ /,/g' >> InterPro.txt;
fi;
done <./File.txt

How to find the distinct values in unix

I need distinct values from the below columns:
AA|BB|CC
a#gmail.com,c#yahoo.co.in|a#gmail.com|a#gmail.com
y#gmail.com|x#yahoo.in,z#redhat.com|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
Here records are '|' seperated and in the 1st column, we can two email id's which are ',' seperated. so, I want to consider that also. I want distinct email id's in the AA,BB,CC column, whether it is '|' seperated or ',' seperated.
Expected output:
c#yahoo.co.in|a#gmail.com|
y#gmail.com|x#yahoo.in|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
is awk unix enough for you?
{
for(i=1; i < NF; i++) {
if ($i ~ /#/) {
mail[$i]++
}
}
}
END {
for (x in mail) {
print mail[x], x
}
}
output:
$ awk -F'[|,]' -f v.awk f1
2 z#redhat.com
3 a#gmail.com
1 x#yahoo.in
1 c#yahoo.co.in
1 c#gmail.com
1 y#gmail.com
1 b#yahoo.co.in
Using awk :
cat file | tr ',' '|' | awk -F '|' '{ line=""; for (i=1; i<=NF; i++) {if ($i != "" && list[NR"#"$i] != 1){line=line $i "|"}; list[NR"#"$i]=1 }; print line}'
Prints :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|z#redhat.com|
c#gmail.com|b#yahoo.co.in|c#uix.xo.in|
Edit :
Now works properly with inputs such as :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|a#gmail.com|
c#gmail.com|c#yahoo.co.in|c#uix.xo.in|
Prints :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|a#gmail.com|
c#gmail.com|c#yahoo.co.in|c#uix.xo.in|
The following python code will solve your problem:
#!/usr/bin/env python
while True:
try:
addrs = raw_input()
except EOFError:
break
print '|'.join(set(addrs.replace(',', '|').split('|')))
In Bash only:
while read s; do
IFS='|,'
for e in $s; do
echo "$e"
done | sort | uniq
unset IFS
done
This seems to work, although I'm not sure what to do if there are more than three unique mails. Run with awk -f filename.awk dataname.dat
BEGIN {IFS=/[,|]/}
NF {
delete uniqmails;
for (i=1; i<=NF; i++)
uniqmails[$i] = 1;
sep="";
n=0;
for (m in uniqmails) {
printf "%s%s", sep, m;
sep="|";
n++;
}
for (;n<3;n++) printf "|";
print ""; // EOL
}
There's also this "one-liner" that doesn't need awk:
while read line; do
echo $line | tr ",|" "\n" | sort -u |\
paste <( seq 3) - | cut -f 2 |\
tr "\n" "|" |\
rev | cut -c 2- | rev;
done
With perl:
perl -lane '$s{$_}++ for split /[|,]/; END { print for keys %s;}' input
I have edited this post, Hope it will work
while read line
do
val1=`echo $line|awk -F"|" '{print $1}'`
val2=`echo $line|awk -F"|" '{print $2}'`
val3=`echo $line|awk -F"|" '{print $3}'`
a=`echo $line|awk -F"|" '{print $2,"|",$3}'|sed 's/'$val1'//g'`
aa=`echo "$val1|$a"`
b=`echo $aa|awk -F"|" '{print $1,"|",$3}'|sed 's/'$val2'//g'`
b1=`echo $b|awk -F"|" '{print $1}'`
b2=`echo $b|awk -F"|" '{print $2}'`
bb=`echo "$b1|$val2|$b2"`
c=`echo $bb|awk -F"|" '{print $1,"|",$2}'|sed 's/'$val3'//g'`
cc=`echo "$c|$val3"|sed 's/,,/,/;s/,|/|/;s/|,/|/;s/^,//;s/ //g'`
echo "$cc">>abcd
done<ab.dat
cat abcd
c#yahoo.co.in||a#gmail.com
y#gmail.com|x#yahoo.in|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
You can subtract all "," separated values and parse in the same way...if your all values are having "," separated.

Unix cut command taking an unordered list as arguments

The Unix cut command takes a list of fields, but not the order that I need it in.
$ echo 1,2,3,4,5,6 | cut -d, -f 1,2,3,5
1,2,3,5
$ echo 1,2,3,4,5,6 | cut -d, -f 1,3,2,5
1,2,3,5
However, I would like a Unix shell command that will give me the fields in the order that I specify.
Use:
pax> echo 1,2,3,4,5,6 | awk -F, 'BEGIN {OFS=","}{print $1,$3,$2,$5}'
1,3,2,5
or:
pax> echo 1,2,3,4,5,6 | awk -F, -vOFS=, '{print $1,$3,$2,$5}'
1,3,2,5
Or just use the shell
$ set -f
$ string="1,2,3,4,5"
$ IFS=","
$ set -- $string
$ echo $1 $3 $2 $5
1 3 2 5
Awk based solution is elegant. Here is a perl based solution:
echo 1,2,3,4,5,6 | perl -e '#order=(1,3,2,5);#a=split/,/,<>;for(#order){print $a[$_-1];}'

Resources