Yosys / abc uses many gates instead of a better monolithic cell - synthesis

For a simple design and a custom cell library, I am getting synthesis results in which Yosys / abc chooses an implementation that is obviously (to a human reader) worse, ignoring an obvious alternative. The result Yosys / abc produces seems worse both in area and in speed, though for the latter I am not sure, because my .lib file is missing all delay/drive/load information and I don't yet know what defaults are used. What I want to optimize is area, not speed.
Design (SevenSegmentDecoder.v):
module SevenSegmentDecoder(
    input [3:0] encoded,
    output reg decoded,
    output [3:0] dummy
);

assign dummy = ~encoded;

always @(*) begin
    case (encoded)
        4'd0: decoded = 1;
        4'd1: decoded = 0;
        4'd2: decoded = 1;
        4'd3: decoded = 1;
        4'd4: decoded = 0;
        4'd5: decoded = 1;
        4'd6: decoded = 1;
        4'd7: decoded = 1;
        4'd8: decoded = 1;
        4'd9: decoded = 1;
        4'd10: decoded = 1;
        4'd11: decoded = 0;
        4'd12: decoded = 1;
        4'd13: decoded = 0;
        4'd14: decoded = 1;
        4'd15: decoded = 1;
        default: decoded = 0;
    endcase
end

endmodule
This is the logic for a single segment of a seven-segment decoder. The "dummy" output enforces the presence of inverter cells for the inputs; the intent is to nudge Yosys / abc towards the intended implementation, but that does not happen.
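As a cross-check that the seg0 cell (shown further below) really implements the case table once "a" is inverted, here is a quick enumeration in Python. The pin mapping (a = encoded[3], b = encoded[2], c = encoded[1], d = encoded[0]) is my assumption; the .lib does not state it:

```python
# Cross-check: the Verilog case table vs. the seg0 Liberty function.
# Assumed pin mapping (not stated in the .lib): a=encoded[3], b=encoded[2],
# c=encoded[1], d=encoded[0]; na is the externally inverted "a" input.

# 'decoded' per the case statement, indexed by the 4-bit value of 'encoded'
case_table = [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1]

def seg0(a, b, c, d):
    na = not a
    return not ((na and not b and not c and d) or
                (na and b and not c and not d) or
                (a and b and not c and d) or
                (a and not b and c and d))

mismatches = [e for e in range(16)
              if seg0(e >> 3 & 1, e >> 2 & 1, e >> 1 & 1, e & 1) != bool(case_table[e])]
print(mismatches)  # expected: []
```

So, under that pin mapping, seg0 plus one inverter is a functionally exact match for 'decoded'.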
Shell script for synthesis (build.sh):
rm -f show.dot
rm -f show.svg
rm -f out.v
rm -rf _tmp_yosys-abc-*
yosys -s build.yosys
Synthesis script (build.yosys):
#
# input
#
read_verilog SevenSegmentDecoder.v
#
# synthesis
#
synth -top SevenSegmentDecoder
#
# tech mapping
#
dfflibmap -liberty own.lib
abc -liberty own.lib -nocleanup
clean
stat -liberty own.lib
#
# output
#
write_verilog out.v
read_liberty -lib own.lib
show -format svg -prefix show
Cell library (own.lib):
/*
delay model : typ
check model : typ
power model : typ
capacitance model : typ
other model : typ
*/
library(my_cells) {
cell(Inverter) {
area: 1;
pin(a) {
direction: input;
}
pin(out) {
direction: output;
function: "(!a)";
}
}
cell(Buffer) {
area: 2450;
pin(a) {
direction: input;
}
pin(out) {
direction: output;
function: "a";
}
}
cell(Nand3) {
area: 2940;
pin(a) {
direction: input;
}
pin(b) {
direction: input;
}
pin(c) {
direction: input;
}
pin(out) {
direction: output;
function: "(!(a b c))";
}
}
cell(Nor) {
area: 2450;
pin(a) {
direction: input;
}
pin(b) {
direction: input;
}
pin(out) {
direction: output;
function: "(!(a+b))";
}
}
cell(OrAndInvert) {
area: 2940;
pin(a) {
direction: input;
}
pin(b) {
direction: input;
}
pin(c) {
direction: input;
}
pin(out) {
direction: output;
function: "(!((a+b) c))";
}
}
cell(seg0) {
area: 1;
pin(a) {
direction: input;
}
pin(b) {
direction: input;
}
pin(c) {
direction: input;
}
pin(d) {
direction: input;
}
pin(na) {
direction: input;
}
pin(out) {
direction: output;
function: "(!((na (!b) (!c) d)+(na b (!c) (!d))+(a b (!c) d)+(a (!b) c d)))";
}
}
}
The last cell, seg0, could easily implement the 'decoded' output when combined with a single inverter producing (na := !a), yet Yosys / abc instead builds a network of 1x Nand3, 3x Nor and 3x OrAndInvert. The presence of the inverter is already enforced through the "dummy" output, and the seg0 cell has an area of just 1 -- yet it is ignored. The generated network also chains multiple gates in series, while seg0 would be a single stage of logic; absent specific delay/load values, I would expect seg0 to be modeled as faster as well.
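For reference, the area gap implied by the .lib values above is plain arithmetic (the inverters cancel out of the comparison, since they are needed for the "dummy" output either way):

```python
# Area comparison using the cell areas from own.lib.
# Mapped result: 1x Nand3 (2940) + 3x Nor (2450) + 3x OrAndInvert (2940).
mapped = 1 * 2940 + 3 * 2450 + 3 * 2940
# Intended result: seg0 alone (area 1); the inverter producing "na" already
# exists for the dummy output, so it adds no extra area here.
intended = 1
print(mapped, intended)  # expected: 19110 1
```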
Now try a slightly modified cell library in which the seg0 cell inverts "a" itself:
cell(seg0) {
area: 1000000;
pin(a) {
direction: input;
}
pin(b) {
direction: input;
}
pin(c) {
direction: input;
}
pin(d) {
direction: input;
}
pin(out) {
direction: output;
function: "(!(((!a) (!b) (!c) d)+((!a) b (!c) (!d))+(a b (!c) d)+(a (!b) c d)))";
}
}
Even though this cell has a huge area, it gets picked up by Yosys / abc and all the Nand, Nor and OAI gates disappear. Only the inverters are still present, of course, to generate the "dummy" output.
Why does Yosys / abc not use the seg0 cell when it has to be combined with an inverter to produce "na", even when that inverter is already present?
Yosys Version: Yosys 0.9+1706 (git sha1 58ab9f60, clang 6.0.0-1ubuntu2 -fPIC -Os)
ABC Version (the yosys-abc executable that comes with Yosys): UC Berkeley, ABC 1.01 (compiled Jan 12 2020 20:46:55)
Since abc was mentioned, here are the relevant files generated by Yosys.
abc.script:
echo + read_blif _tmp_yosys-abc-jAcBTF/input.blif;
read_blif _tmp_yosys-abc-jAcBTF/input.blif;
echo + read_lib -w /home/martin/git-repos/chipdraw/resource/7seg/own.lib;
read_lib -w /home/martin/git-repos/chipdraw/resource/7seg/own.lib;
echo + strash;
strash;
echo + ifraig;
ifraig;
echo + scorr;
scorr;
echo + dc2;
dc2;
echo + dretime;
dretime;
echo + strash;
strash;
echo + &get -n;
&get -n;
echo + &dch -f;
&dch -f;
echo + &nf ;
&nf ;
echo + &put;
&put;
echo + write_blif _tmp_yosys-abc-jAcBTF/output.blif;
write_blif _tmp_yosys-abc-jAcBTF/output.blif
input.blif:
.model netlist
.inputs ys__n0 ys__n1 ys__n3 ys__n4
.outputs ys__n34 ys__n35 ys__n36 ys__n37 ys__n38
# ys__n0 \encoded [1]
# ys__n1 \encoded [0]
# ys__n2 $abc$258$new_n10_
# ys__n3 \encoded [3]
# ys__n4 \encoded [2]
# ys__n5 $abc$258$new_n11_
# ys__n6 $abc$258$new_n12_
# ys__n7 $abc$258$new_n13_
# ys__n8 $abc$258$new_n14_
# ys__n9 $abc$258$new_n15_
# ys__n10 $abc$258$new_n16_
# ys__n11 $abc$258$new_n17_
# ys__n12 $abc$258$new_n18_
# ys__n13 $abc$258$new_n19_
# ys__n14 $abc$258$new_n20_
# ys__n15 $abc$258$new_n21_
# ys__n16 $abc$258$new_n22_
# ys__n17 $abc$258$new_n23_
# ys__n18 $abc$258$new_n24_
# ys__n19 $abc$258$new_n25_
# ys__n20 $abc$258$new_n26_
# ys__n21 $abc$258$new_n27_
# ys__n22 $abc$258$new_n28_
# ys__n23 $abc$258$new_n29_
# ys__n24 $abc$258$new_n30_
# ys__n25 $abc$258$new_n31_
# ys__n26 $abc$258$new_n32_
# ys__n27 $abc$258$new_n33_
# ys__n28 $abc$258$new_n34_
# ys__n29 $abc$258$new_n35_
# ys__n30 $abc$258$new_n36_
# ys__n31 $abc$258$new_n37_
# ys__n32 $abc$258$new_n38_
# ys__n33 $abc$258$new_n39_
# ys__n34 \decoded
# ys__n35 \dummy [0]
# ys__n36 \dummy [1]
# ys__n37 \dummy [2]
# ys__n38 \dummy [3]
.names ys__n0 ys__n1 ys__n2
00 1
.names ys__n3 ys__n4 ys__n5
00 1
.names ys__n5 ys__n2 ys__n6
11 1
.names ys__n0 ys__n1 ys__n7
10 1
.names ys__n7 ys__n5 ys__n8
11 1
.names ys__n8 ys__n6 ys__n9
-1 1
1- 1
.names ys__n0 ys__n1 ys__n10
11 1
.names ys__n10 ys__n5 ys__n11
11 1
.names ys__n1 ys__n0 ys__n12
10 1
.names ys__n3 ys__n4 ys__n13
1- 1
-0 1
.names ys__n12 ys__n13 ys__n14
10 1
.names ys__n14 ys__n11 ys__n15
-1 1
1- 1
.names ys__n15 ys__n9 ys__n16
-1 1
1- 1
.names ys__n7 ys__n13 ys__n17
10 1
.names ys__n10 ys__n13 ys__n18
10 1
.names ys__n18 ys__n17 ys__n19
-1 1
1- 1
.names ys__n4 ys__n3 ys__n20
1- 1
-0 1
.names ys__n2 ys__n20 ys__n21
10 1
.names ys__n12 ys__n20 ys__n22
10 1
.names ys__n22 ys__n21 ys__n23
-1 1
1- 1
.names ys__n23 ys__n19 ys__n24
-1 1
1- 1
.names ys__n24 ys__n16 ys__n25
-1 1
1- 1
.names ys__n7 ys__n20 ys__n26
10 1
.names ys__n3 ys__n4 ys__n27
0- 1
-0 1
.names ys__n2 ys__n27 ys__n28
10 1
.names ys__n28 ys__n26 ys__n29
-1 1
1- 1
.names ys__n7 ys__n27 ys__n30
10 1
.names ys__n10 ys__n27 ys__n31
10 1
.names ys__n31 ys__n30 ys__n32
-1 1
1- 1
.names ys__n32 ys__n29 ys__n33
-1 1
1- 1
.names ys__n33 ys__n25 ys__n34
-1 1
1- 1
.names ys__n1 ys__n35
0 1
.names ys__n0 ys__n36
0 1
.names ys__n4 ys__n37
0 1
.names ys__n3 ys__n38
0 1
.end
stdcells.genlib:
GATE ZERO 1 Y=CONST0;
GATE ONE 1 Y=CONST1;
GATE BUF 1 Y=A; PIN * NONINV 1 999 1 0 1 0
GATE NOT 2 Y=!A; PIN * INV 1 999 1 0 1 0
GATE AND 4 Y=A*B; PIN * NONINV 1 999 1 0 1 0
GATE NAND 4 Y=!(A*B); PIN * INV 1 999 1 0 1 0
GATE OR 4 Y=A+B; PIN * NONINV 1 999 1 0 1 0
GATE NOR 4 Y=!(A+B); PIN * INV 1 999 1 0 1 0
GATE XOR 5 Y=(A*!B)+(!A*B); PIN * UNKNOWN 1 999 1 0 1 0
GATE XNOR 5 Y=(A*B)+(!A*!B); PIN * UNKNOWN 1 999 1 0 1 0
GATE ANDNOT 4 Y=A*!B; PIN * UNKNOWN 1 999 1 0 1 0
GATE ORNOT 4 Y=A+!B; PIN * UNKNOWN 1 999 1 0 1 0
GATE MUX 4 Y=(A*B)+(S*B)+(!S*A); PIN * UNKNOWN 1 999 1 0 1 0
output.blif:
# Benchmark "netlist" written by ABC on Mon Jan 13 18:44:03 2020
.model netlist
.inputs ys__n0 ys__n1 ys__n3 ys__n4
.outputs ys__n34 ys__n35 ys__n36 ys__n37 ys__n38
.gate Inverter a=ys__n0 out=ys__n36
.gate Inverter a=ys__n1 out=ys__n35
.gate Inverter a=ys__n3 out=ys__n38
.gate Inverter a=ys__n4 out=ys__n37
.gate Nor a=ys__n3 b=ys__n37 out=new_n14_
.gate Nor a=ys__n38 b=ys__n4 out=new_n15_
.gate Nor a=ys__n0 b=ys__n35 out=new_n16_
.gate OrAndInvert a=new_n14_ b=new_n15_ c=new_n16_ out=new_n17_
.gate OrAndInvert a=ys__n3 b=ys__n37 c=ys__n35 out=new_n18_
.gate OrAndInvert a=ys__n38 b=ys__n4 c=ys__n0 out=new_n19_
.gate Nand3 a=new_n17_ b=new_n18_ c=new_n19_ out=ys__n34
.end


Modify BED with poliregions [closed]

I have a somewhat tricky BED file format, which I need to convert to a classic BED format so that I can properly use it in further steps.
I have this unconventional BED format:
1 12349 12398 +
1 23523 23578 -
1 23550;23570;23590 23640;23689;23652 +
1 43533 43569 +
1 56021;56078 56099;56155 +
Say those multiple-position rows represent non-coding fragmented regions.
What I would like to get is a canonical BED file such as:
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
where the polyregions that were mixed into one row are split over separate rows, while maintaining the chromosome number and strand.
Could you please try the following.
awk '
{
  num  = split($2, array1, ";")
  num1 = split($3, array2, ";")
}
num > 1 || num1 > 1 {
  for (i = 1; i <= num; i++) {
    print $1, array1[i], array2[i], $NF
  }
  next
}
1' Input_file | column -t
Output will be as follows.
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
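The same row-splitting can be sketched in Python, assuming whitespace-separated columns as in the sample and that the start and end lists always have matching lengths:

```python
# Split rows whose start/end columns hold ';'-separated lists into one
# row per (start, end) pair, keeping chromosome and strand unchanged.
def split_bed(lines):
    out = []
    for line in lines:
        chrom, starts, ends, strand = line.split()
        # zip pairs the i-th start with the i-th end
        for start, end in zip(starts.split(";"), ends.split(";")):
            out.append((chrom, start, end, strand))
    return out

sample = ["1 12349 12398 +",
          "1 23550;23570;23590 23640;23689;23652 +"]
for row in split_bed(sample):
    print("\t".join(row))
```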
#!/usr/bin/env bash
# ^^^^-- NOT /bin/sh
while read -r a b c d; do
if [[ $b = *';'* ]]; then # if b contains any ';'s
IFS=';' read -r -a ba <<<"$b" # read string b into array ba
IFS=';' read -r -a ca <<<"$c" # read string c into array ca
for idx in "${!ba[@]}"; do # iterate over the indices of array ba
# print a and d with the values for a given index for both ba and ca
printf '%s\t%s\t%s\t%s\n' "$a" "${ba[idx]}" "${ca[idx]}" "$d"
done
else
printf '%s\t%s\t%s\t%s\n' "$a" "$b" "$c" "$d"
fi
done
This combines the answers to existing StackOverflow questions:
bash script loop through two variables in lock step
Reading a delimited string into an array in Bash
...and guidance in the BashFAQ:
How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
See this running at https://ideone.com/wmrXPE
$ cat tst.awk
BEGIN { FS="[[:space:];]+" }
{
    n = (NF - 2) / 2
    for (i=1; i<=n; i++) {
        print $1, $(i+1), $(i+n+1), $NF
    }
}
$ awk -f tst.awk file
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
Try this Perl solution:
perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) {
$i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x))
{ print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t
with the given inputs
$ cat emilio.txt
1 12349 12398 +
1 23523 23578 -
1 23550;23570;23590 23640;23689;23652 +
1 43533 43569 +
1 56021;56078 56099;56155 +
$ perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) {
$i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x))
{ print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
$

Aggregate as new column in R

Input:
Time,id1,id2
22:30,1,0
22:32,2,1
22:33,1,0
22:34,2,1
Output Desired
Time,Time2,id1,id2
22:30,22:33,1,0
22:32,22:34,2,1
Output by my code
Time,id1,id2
22:30,22:33,1,0
22:32,22:34,2,1
What change should I make to my code aggregate(Time~.,df,FUN=toString)?
id1 and id2 together form the key, and the times are the in and out times for each key. I need the in time and out time as separate column values; currently they are both in the Time column.
I tried it using awk also.
If you do not want to use any packages, this will work:
df <- aggregate(Time~.,df,FUN=toString)
df
#output
id1 id2 Time
1 0 22:30, 22:33
2 1 22:32, 22:34
df$Time2 <- lapply(strsplit(as.character(df$Time), ","),"[", 2)
df$Time <- lapply(strsplit(as.character(df$Time), ","),"[", 1)
df
#output
id1 id2 Time Time2
1 0 22:30 22:33
2 1 22:32 22:34
With awk
$ cat time.awk
BEGIN {
FS = OFS = ","
}
function in_time() {
n++
store[id1, id2] = n
itime[n] = time; iid1[n] = id1; iid2[n] = id2
}
function out_time( i) {
i = store[id1, id2]
otime[i] = time
}
NR > 1 {
time = $1; id1 = $2; id2 = $3
if ((id1, id2) in store) out_time()
else in_time()
}
END {
print "Time,Time2,id1,id2"
for (i = 1; i <= n; i++)
print itime[i], otime[i], iid1[i], iid2[i]
}
Usage:
awk -f time.awk file.dat
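The same pairing logic can be sketched in Python, assuming (as in the sample) that each (id1, id2) key appears exactly twice, once for the in time and once for the out time:

```python
# Pair in/out times per (id1, id2) key: first occurrence is the in time,
# second occurrence fills in the out time.
import csv
import io

def pair_times(text):
    rows = []  # [in_time, out_time, id1, id2] in first-seen order
    seen = {}  # (id1, id2) -> index into rows
    for time, id1, id2 in csv.reader(io.StringIO(text)):
        key = (id1, id2)
        if key in seen:
            rows[seen[key]][1] = time   # second sighting: out time
        else:
            seen[key] = len(rows)
            rows.append([time, None, id1, id2])
    return rows

data = "22:30,1,0\n22:32,2,1\n22:33,1,0\n22:34,2,1\n"  # header stripped
print("Time,Time2,id1,id2")
for r in pair_times(data):
    print(",".join(r))
```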

unix: counting samples with two variants or a homozygous variant per gene

I have a coding problem beyond my limited skills with the unix power tools. I'm looking to count the number of samples with either: i) a homozygous variant in a gene (BB below); or ii) two variants in a gene (2x AB). For example, from:
Variant Gene Sample1 Sample2 Sample3
1 TP53 AA BB AB
2 TP53 AB AA AB
3 TP53 AB AA AA
4 KRAS AA AB AA
5 KRAS AB AB BB
I'm looking for:
Gene Two_variants Homozygous Either
TP53 2 1 3
KRAS 1 1 2
Any help would be much appreciated. Thanks.
R_G
In GNU awk:
awk '/\<AB\>.+\<AB\>/ { arr[$2,"AB"] += 1 }
/\<BB\>/ { arr[$2,"BB"] += 1 }
END { for ( elt in arr ) {
split ( elt, index_parts, SUBSEP )
genes[index_parts[1]] = 0
}
printf "%4s%13s%11s%7s\n", "Gene", "Two_variants", "Homozygous", "Either"
for ( gene in genes ) {
printf "%4s%6d%13d%9d\n", gene, arr[gene,"AB"], arr[gene,"BB"], arr[gene,"AB"] + arr[gene,"BB"]
}
}' input.txt
use warnings;
use strict;

my (@header, %data);
open(my $file, "<", "input") or die("$!");
while (<$file>) {
    @header = split, next if not @header;
    my @v = split;
    $data{$v[1]}->{$_}++ for (@v[2..$#v]);
}
close $file;

print "Gene Two_variants Homozygous Either\n";
for my $k (keys %data) {
    my ($var2, $homoz) = (int($data{$k}{AB}/2), $data{$k}{BB});
    my $sum = $var2 + $homoz;
    printf("%4s %8d %9d %8d\n", $k, $var2, $homoz, $sum) if $sum;
}
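For comparison, here is a Python sketch that counts per sample rather than per line, which is what the expected output requires (a sample with both a BB and two ABs is counted only once under Either; the function name is mine):

```python
# Count, per gene: samples with >= 2 AB calls, samples with a BB call,
# and samples meeting either condition (each sample counted once).
from collections import defaultdict

def count_variants(table):
    lines = table.strip().splitlines()
    samples = lines[0].split()[2:]                 # sample names from header
    ab = defaultdict(lambda: defaultdict(int))     # gene -> sample -> AB count
    bb = defaultdict(set)                          # gene -> samples with a BB
    genes = []
    for line in lines[1:]:
        fields = line.split()
        gene, calls = fields[1], fields[2:]
        if gene not in genes:
            genes.append(gene)
        for sample, call in zip(samples, calls):
            if call == "AB":
                ab[gene][sample] += 1
            elif call == "BB":
                bb[gene].add(sample)
    result = {}
    for gene in genes:
        two = {s for s in samples if ab[gene][s] >= 2}
        result[gene] = (len(two), len(bb[gene]), len(two | bb[gene]))
    return result

table = """Variant Gene Sample1 Sample2 Sample3
1 TP53 AA BB AB
2 TP53 AB AA AB
3 TP53 AB AA AA
4 KRAS AA AB AA
5 KRAS AB AB BB"""
print(count_variants(table))  # expected: {'TP53': (2, 1, 3), 'KRAS': (1, 1, 2)}
```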

Read a column value from previous line and next line but insert them as additional fields in the current line using awk

I hope you can help me out with my problem.
I have an input file with 3 columns of data which looks like this:
Apl_No Act_No Sfx_No
100 10 0
100 11 1
100 12 2
100 13 3
101 20 0
101 21 1
I need to create an output file which contains the data as in the input plus 3 additional fields. It should look like this:
Apl_No Act_No Sfx_No Crt_Act_No Prs_Act_No Cd_Act_No
100 10 0 - - -
100 11 1 10 11 12
100 12 2 11 12 13
100 13 3 12 13 10
101 20 0 - - -
101 21 1 20 21 20
Every Apl_No has a set of Act_No values mapped to it. 3 new fields need to be created: Crt_Act_No, Prs_Act_No and Cd_Act_No. When the first unique Apl_No is encountered, the column values 4, 5 and 6 (Crt_Act_No, Prs_Act_No, Cd_Act_No) need to be dashed out. For every following occurrence of the same Apl_No, the Crt_Act_No is the same as the Act_No on the previous line, the Prs_Act_No is the same as the Act_No on the current line, and the Cd_Act_No is the same as the Act_No on the next line. This continues for all following rows bearing the same Apl_No except for the last row. In the last row the Crt_Act_No and Prs_Act_No are filled in the same way as in the rows above, but the Cd_Act_No needs to be pulled from the Act_No of the first row where that Apl_No was encountered.
I wish to achieve this using awk. Can anyone please help me with how to go about this?
One solution:
awk '
## Print header in first line.
FNR == 1 {
printf "%s %s %s %s\n", $0, "Crt_Act_No", "Prs_Act_No", "Cd_Act_No";
next;
}
## If first field not found in the hash means that it is first unique "Apl_No", so
## print line with dashes and save some data for use it later.
## "line" variable has the content of the previous iteration. Print it if it is set.
! apl[ $1 ] {
if ( line ) {
sub( /-/, orig_act, line );
print line;
line = "";
}
printf "%s %s %s %s\n", $0, "-", "-", "-";
orig_act = prev_act = $2;
apl[ $1 ] = 1;
next;
}
## For all non-unique "Apl_No"...
{
## If it is the first one after the line with
## dashes (line not set), save its content in "line" and the variable
## that I will have to check later ("Act_No"). Note that I leave a dash in last
## field to substitute in the following iteration.
if ( ! line ) {
line = sprintf( "%s %s %s %s", $0, prev_act, $2, "-" );
prev_act = $2;
next;
}
## Now I know the field, so substitute the dash with it, print and repeat
## the process with current line.
sub( /-/, $2, line );
print line;
line = sprintf( "%s %s %s %s", $0, prev_act, $2, "-" );
prev_act = $2;
}
END {
if ( line ) {
sub( /-/, orig_act, line );
print line;
}
}
' infile | column -t
That yields:
Apl_No Act_No Sfx_No Crt_Act_No Prs_Act_No Cd_Act_No
100 10 0 - - -
100 11 1 10 11 12
100 12 2 11 12 13
100 13 3 12 13 10
101 20 0 - - -
101 21 1 20 21 20
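The grouping logic can also be sketched in Python: group rows by Apl_No, dash out the first row of each group, and wrap the last row's Cd_Act_No around to the group's first Act_No (assumes rows with the same Apl_No are contiguous, as in the sample):

```python
# For each group of rows sharing Apl_No: first row gets dashes; every later
# row gets (previous Act_No, own Act_No, next Act_No), where the last row's
# "next" wraps around to the group's first Act_No.
from itertools import groupby

def annotate(rows):
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        acts = [r[1] for r in grp]
        for i, row in enumerate(grp):
            if i == 0:
                out.append(row + ("-", "-", "-"))
            else:
                nxt = acts[i + 1] if i + 1 < len(acts) else acts[0]
                out.append(row + (acts[i - 1], acts[i], nxt))
    return out

rows = [("100", "10", "0"), ("100", "11", "1"), ("100", "12", "2"),
        ("100", "13", "3"), ("101", "20", "0"), ("101", "21", "1")]
for r in annotate(rows):
    print(" ".join(r))
```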

Implementing proximity matrix for clustering

Please pardon me if the question sounds trivial or basic; I am a little new to this field.
I have a group of datasets (bags of words, to be specific) and I need to generate a proximity matrix using the edit distance between the strings, for the purpose of clustering.
I am, however, quite confused about how I will keep track of which data/strings correspond to which entries in the matrix.
Or, how do you generally approach this kind of problem in the field? I am using Perl and R to implement this.
Here is typical code in Perl I have written that reads from a text file containing my bag of words:
use strict;
use warnings;
use Text::Levenshtein qw(distance);

main(@ARGV);

sub main
{
    my @TokenDistances;
    my $Tokenfile = 'TokenDistinct.txt';
    my @Token;
    my $AppendingCount = 0;
    my @Tokencompare;
    my %Levcount = ();
    open(FH, "< $Tokenfile") or die("Error opening file. $!");
    while (<FH>)
    {
        chomp $_;
        $_ =~ s/^(\s+)$//g;
        push(@Token, $_);
    }
    close(FH);
    @Tokencompare = @Token;
    foreach my $tokenWord (@Tokencompare)
    {
        my $lengthoffile = scalar @Tokencompare;
        my $i = 0;
        chomp $tokenWord;
        #@TokenDistances = levDistance($tokenWord, \@Tokencompare);
        for ($i = 0; $i < $lengthoffile; $i++)
        {
            if (scalar @TokenDistances == scalar @Tokencompare)
            {
                print "Yipeeeeeeeeeeeeeeeeeeeee\n";
            }
            chomp $tokenWord;
            chomp $Tokencompare[$i];
            #print $tokenWord . " {$Tokencompare[$i]} " . " $TokenDistances[$i] " . "\n";
            #$Levcount{$tokenWord}{$Tokencompare[$i]} = $TokenDistances[$i];
            $Levcount{$tokenWord}{$Tokencompare[$i]} = levDistance($tokenWord, $Tokencompare[$i]);
        }
        StoreSortedValues(\%Levcount, \$tokenWord, \$AppendingCount);
        $AppendingCount++;
        %Levcount = ();
    }
    # %Levcount = ();
}

sub levDistance
{
    my $string1 = shift;
    #my @StringList = @{(shift)};
    my $string2 = shift;
    return distance($string1, $string2);
}

sub StoreSortedValues {
    my $Levcount = shift;
    my $tokenWordTopMost = ${(shift)};
    my $j = ${(shift)};
    my @ListToken;
    my $Tokenfile = 'LevResult.txt';
    if ($j == 0)
    {
        open(FH, "> $Tokenfile") or die("Error opening file. $!");
    }
    else
    {
        open(FH, ">> $Tokenfile") or die("Error opening file. $!");
    }
    print $tokenWordTopMost;
    my %tokenWordMaster = %{ $Levcount->{$tokenWordTopMost} };
    @ListToken = sort { $tokenWordMaster{$a} cmp $tokenWordMaster{$b} } keys %tokenWordMaster;
    #@ListToken = keys %tokenWordMaster;
    print FH "-------------------------- " . $tokenWordTopMost . "-------------------------------------\n";
    #print FH map {"$_ \t=> $tokenWordMaster{$_} \n "} @ListToken;
    foreach my $tokey (@ListToken)
    {
        print FH "$tokey=>\t" . $tokenWordMaster{$tokey} . "\n";
    }
    close(FH) or die("Error Closing File. $!");
}
The problem is: how can I represent the proximity matrix from this and still keep track of which comparison is which in my matrix?
In the RecordLinkage package there is the levenshteinDist function, which is one way of calculating an edit distance between strings.
install.packages("RecordLinkage")
library(RecordLinkage)
Set up some data:
fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry",
"Blackcurrant", "Blueberry", "Currant", "Cherry")
Now create a matrix consisting of zeros to reserve memory for the distance table. Then use nested for loops to calculate the individual distances. We end with a matrix with a row and a column for each fruit. Thus we can rename the columns and rows to be identical to the original vector.
fdist <- matrix(rep(0, length(fruit)^2), ncol=length(fruit))
for(i in seq_along(fruit)){
for(j in seq_along(fruit)){
fdist[i, j] <- levenshteinDist(fruit[i], fruit[j])
}
}
rownames(fdist) <- colnames(fdist) <- fruit
The results:
fdist
Apple Apricot Avocado Banana Bilberry Blackberry Blackcurrant
Apple 0 5 6 6 7 9 12
Apricot 5 0 6 7 8 10 10
Avocado 6 6 0 6 8 9 10
Banana 6 7 6 0 7 8 8
Bilberry 7 8 8 7 0 4 9
Blackberry 9 10 9 8 4 0 5
Blackcurrant 12 10 10 8 9 5 0
Blueberry 8 9 9 8 3 3 8
Currant 7 5 6 5 8 10 6
Cherry 6 7 7 6 4 6 10
The proximity or similarity (or dissimilarity) matrix is just a table that stores the similarity score for pairs of objects. So, if you have N objects, then the R code can be simMat <- matrix(nrow = N, ncol = N), and then each entry, (i,j), of simMat indicates the similarity between item i and item j.
In R, you can use several packages, including vwr, to calculate the Levenshtein edit distance.
You may also find this Wikibook to be of interest: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
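The same construction can be sketched in Python, keeping labels alongside the matrix so each entry stays traceable to its pair of strings; a plain dict-of-dicts does the bookkeeping (the example words are mine):

```python
# Build a labeled proximity (edit-distance) matrix: dist[w1][w2] gives the
# Levenshtein distance, so rows/columns stay tied to the original strings.
def levenshtein(s, t):
    # Classic dynamic-programming edit distance, keeping only one row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def proximity_matrix(words):
    return {a: {b: levenshtein(a, b) for b in words} for a in words}

words = ["kitten", "sitting", "flaw", "lawn"]
m = proximity_matrix(words)
print(m["kitten"]["sitting"])  # expected: 3
```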
