Implementing a proximity matrix for clustering - R

Please pardon me if the question sounds trivial or basic; I am a little new to this field.
I have a dataset (a bag of words, to be specific) and I need to generate a proximity matrix using the edit distance between each pair of entries.
I am, however, quite confused about how to keep track of my data/strings in the matrix. I need the proximity matrix for the purpose of clustering.
How do you generally approach this kind of problem in the field? I am using Perl and R to implement this.
Here is typical Perl code I have written that reads from a text file containing my bag of words:
use strict;
use warnings;
use Text::Levenshtein qw(distance);

main(@ARGV);

sub main
{
    my @TokenDistances;
    my $Tokenfile = 'TokenDistinct.txt';
    my @Token;
    my $AppendingCount = 0;
    my @Tokencompare;
    my %Levcount = ();
    open (FH, "< $Tokenfile") or die ("Error opening file. $!");
    while (<FH>)
    {
        chomp $_;
        $_ =~ s/^(\s+)$//g;
        push (@Token, $_);
    }
    close(FH);
    @Tokencompare = @Token;
    foreach my $tokenWord (@Tokencompare)
    {
        my $lengthoffile = scalar @Tokencompare;
        my $i = 0;
        chomp $tokenWord;
        #@TokenDistances = levDistance($tokenWord, \@Tokencompare);
        for ($i = 0; $i < $lengthoffile; $i++)
        {
            if (scalar @TokenDistances == scalar @Tokencompare)
            {
                print "Yipeeeeeeeeeeeeeeeeeeeee\n";
            }
            chomp $tokenWord;
            chomp $Tokencompare[$i];
            #print $tokenWord . " {$Tokencompare[$i]} " . " $TokenDistances[$i] " . "\n";
            #$Levcount{$tokenWord}{$Tokencompare[$i]} = $TokenDistances[$i];
            $Levcount{$tokenWord}{$Tokencompare[$i]} = levDistance($tokenWord, $Tokencompare[$i]);
        }
        StoreSortedValues (\%Levcount, \$tokenWord, \$AppendingCount);
        $AppendingCount++;
        %Levcount = ();
    }
    # %Levcount = ();
}

sub levDistance
{
    my $string1 = shift;
    #my @StringList = @{(shift)};
    my $string2 = shift;
    return distance($string1, $string2);
}

sub StoreSortedValues {
    my $Levcount = shift;
    my $tokenWordTopMost = ${(shift)};
    my $j = ${(shift)};
    my @ListToken;
    my $Tokenfile = 'LevResult.txt';
    if ($j == 0)
    {
        open (FH, "> $Tokenfile") or die ("Error opening file. $!");
    }
    else
    {
        open (FH, ">> $Tokenfile") or die ("Error opening file. $!");
    }
    print $tokenWordTopMost;
    my %tokenWordMaster = %{ $Levcount->{$tokenWordTopMost} };
    # numeric sort (<=>) since the distances are numbers
    @ListToken = sort { $tokenWordMaster{$a} <=> $tokenWordMaster{$b} } keys %tokenWordMaster;
    #@ListToken = keys %tokenWordMaster;
    print FH "-------------------------- " . $tokenWordTopMost . "-------------------------------------\n";
    #print FH map {"$_ \t=> $tokenWordMaster{$_} \n "} @ListToken;
    foreach my $tokey (@ListToken)
    {
        print FH "$tokey=>\t" . $tokenWordMaster{$tokey} . "\n";
    }
    close(FH) or die ("Error Closing File. $!");
}
The problem is: how can I represent the proximity matrix from this and still keep track of which comparison corresponds to which entry in my matrix?

In the RecordLinkage package there is the levenshteinDist function, which is one way of calculating an edit distance between strings.
install.packages("RecordLinkage")
library(RecordLinkage)
Set up some data:
fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry",
"Blackcurrant", "Blueberry", "Currant", "Cherry")
Now create a matrix of zeros to reserve memory for the distance table, then use nested for loops to calculate the individual distances. We end up with a matrix that has a row and a column for each fruit, so we can rename the columns and rows to match the original vector.
fdist <- matrix(rep(0, length(fruit)^2), ncol = length(fruit))
for (i in seq_along(fruit)) {
  for (j in seq_along(fruit)) {
    fdist[i, j] <- levenshteinDist(fruit[i], fruit[j])
  }
}
rownames(fdist) <- colnames(fdist) <- fruit
The results:
fdist
Apple Apricot Avocado Banana Bilberry Blackberry Blackcurrant
Apple 0 5 6 6 7 9 12
Apricot 5 0 6 7 8 10 10
Avocado 6 6 0 6 8 9 10
Banana 6 7 6 0 7 8 8
Bilberry 7 8 8 7 0 4 9
Blackberry 9 10 9 8 4 0 5
Blackcurrant 12 10 10 8 9 5 0
Blueberry 8 9 9 8 3 3 8
Currant 7 5 6 5 8 10 6
Cherry 6 7 7 6 4 6 10
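As an aside, base R's adist() (in the utils package) computes the same pairwise Levenshtein matrix in one call, which avoids the nested loop entirely:
# generalized Levenshtein distances for all pairs of the input vector
fdist2 <- adist(fruit)
# label rows and columns with the original strings
dimnames(fdist2) <- list(fruit, fruit)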

The proximity or similarity (or dissimilarity) matrix is just a table that stores the similarity score for each pair of objects. So, if you have N objects, the R code can be simMat <- matrix(nrow = N, ncol = N), and each entry (i, j) of simMat indicates the similarity between item i and item j.
In R, you can use several packages, including vwr, to calculate the Levenshtein edit distance.
You may also find this Wikibook to be of interest: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
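To connect this back to the clustering goal in the question: once you have a labelled distance matrix such as fdist above, you can feed it to R's built-in hierarchical clustering. A minimal sketch, assuming the fdist matrix built in the first answer:
# treat the symmetric matrix of edit distances as a dist object
d <- as.dist(fdist)
# hierarchical clustering with average linkage
hc <- hclust(d, method = "average")
# cut the dendrogram into, say, 3 clusters
cutree(hc, k = 3)
Because the row and column names travel with the matrix, the cluster labels come back keyed by the original strings, which is exactly the bookkeeping the question asks about.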

Related

Checking for equal values in 2 different data frames row by row

I have 2 different data frames, one of 5.5 MB and the other of 25 GB. I want to check whether these two data frames have the same values in 2 particular columns for each row.
For e.g.
x 0 0 a
x 1 2 b
y 1 2 c
z 3 4 d
and
x 0 0 w
x 1 2 m
y 5 6 p
z 8 9 q
I want to check whether the 2nd and 3rd columns are equal row by row; if so, I return the 4th column of both data frames. Then I should have:
a w
b m
c m
The 2 data frames are sorted by the values of the 2nd and 3rd columns. I tried in R, but the 2nd file (25 GB) is too big. How can I obtain this new file in a faster way (even a few hours is fine)?
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR { a[$2,$3][$4]; next }
($2,$3) in a {
    for (val in a[$2,$3]) {
        print val, $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
and with any awk (a bit less efficiently):
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] FS $4; next }
($2,$3) in a {
    split(a[$2,$3],vals)
    for (i in vals) {
        print vals[i], $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
When reading small_file, the above creates an associative array a[] that maps an index built from the concatenation of the 2nd and 3rd fields to the list of 4th-field values seen for that 2nd/3rd field combination (NR==FNR is only true for the first file read - look those variables up in the awk man page or google). Then, when reading large_file, it looks up that array for the current 2nd/3rd field combination and loops through all the values stored for that combination in the first phase, printing each stored value (the $4 from small_file) together with the current $4.
You said your small file is 5.5 MB and the large file is 25 GB. Since 1 MB is 1,048,576 characters (see https://www.computerhope.com/issues/chspace.htm) and each of your lines is about 8 characters long, that puts your small file at roughly 700 thousand lines and your large one at roughly 3 billion lines. Since the approach keeps only the small file in memory and streams the large one, on an average powered computer it should run in minutes, certainly nothing like an hour.
An alternative to Ed Morton's solution, with an identical idea:
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] $4 ORS; next }
($2,$3) in a {
    s = a[$2,$3]; gsub(ORS, OFS $4 ORS, s)
    printf "%s", s;
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
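Since the poster mentioned trying R first, a data.table keyed join is the natural R counterpart. A hedged sketch (V2/V3/V4 are fread's default column names for header-less files; note the whole 25 GB file must fit in RAM, which it may well not - the streaming awk solutions above avoid that limit):
library(data.table)
small <- fread("small_file", header = FALSE)
big   <- fread("large_file", header = FALSE)
# inner join on the 2nd and 3rd columns; i.V4 is small's 4th column,
# x.V4 is big's 4th column
res <- big[small, .(i.V4, x.V4), on = .(V2, V3), nomatch = 0]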

sort two files by header so that they are in matching field order

I have two files that each have 700 fields, where 699 of the 700 fields have matching headers. I would like to reorder the fields so that they are in the same order in both files (the particular order doesn't matter). For example, given:
File1:
FRUIT MSMC1 MSMC24 MSMC2 MSMC10
Apple 1 2 3 2
Pear 2 1 4 5
File2:
VEG MSMC24 MSMC1 MSMC2 MSMC10
Onion 2 1 3 2
Radish 0 3 9 3
I would like both files to have the first field as the fields that are not common to both files, then the rest of the fields in the same order in both files, for example one possible outcome would be:
File1:
FRUIT MSMC1 MSMC2 MSMC10 MSMC24
Apple 1 3 2 2
Pear 2 4 5 1
File2:
VEG MSMC1 MSMC2 MSMC10 MSMC24
Onion 1 3 2 2
Radish 3 9 3 0
Using data.table can help you here.
First read the files:
library(data.table)
dt1 <- fread("file1.csv")
dt2 <- fread("file2.csv")
then get the names of the fields and the common ones:
ndt1 <- names(dt1)[-1]
ndt2 <- names(dt2)[-1]
common <- intersect(ndt1, ndt2)
and now you can apply the new column order (note that setcolorder reorders columns; setorder would sort the rows by their values):
setcolorder(dt1, c(names(dt1)[1], setdiff(ndt1, common), common))
setcolorder(dt2, c(names(dt2)[1], setdiff(ndt2, common), common))
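A quick sanity check afterwards (assuming, as in the example, that the only non-shared header is the first field of each file) is that the remaining column names now line up:
# both files should now agree on every column after the first
stopifnot(identical(names(dt1)[-1], names(dt2)[-1]))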
A Perl solution that leaves the first file as-is and writes the second file with the columns arranged in the same order as the first file. It reads the 2 files supplied on the command line (which follow the script name).
Update: added the map $_ // (), phrase to allow the second file to be a subset of the first file, in answer to the follow-up question "How could these answers be modified if one file were to be a subset of the other (not all columns from file 1 are in file2)?" – theo4786
#!/usr/bin/perl
use strict;
use warnings;

# commandline: perl script_name.pl fruits.csv veg.csv

my (undef, @fruit_hdrs) = split ' ', <> and close ARGV;

my @veg_hdrs;
while (<>) {
    my ($name, @cols) = split;

    # only executes for the first line (header line) of second file
    @veg_hdrs = @cols unless @veg_hdrs;

    my %line;
    @line{ @veg_hdrs } = @cols;
    print join(" ", $name, map $_ // (), @line{ @fruit_hdrs } ), "\n";
}
Output is:
VEG MSMC1 MSMC24 MSMC2 MSMC10
Onion 1 2 3 2
Radish 3 0 9 3
In Perl, the tool for this job is a hash slice: you can access multiple values of a hash at once as @hash{@keys}.
So, something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @headers;
my $type;
my @rows;

#iterate data - would do this with a normal 'open'
while ( <DATA> ) {
    #set headers if the leading word is all upper case
    if ( m/^[A-Z]+\s/ ) {
        #separate out type (VEG/FRUIT) from the other headings.
        chomp ( ( $type, @headers ) = split );
        #print for debugging
        print Dumper \@headers;
    }
    else {
        #create a hash to store this row.
        my %this_row;
        #split the row on whitespace, capturing name and ordered fields by header row.
        ( my $name, @this_row{@headers} ) = split;
        #insert name and type into the hash
        $this_row{name} = $name;
        $this_row{type} = $type;
        #print for debugging
        print Dumper \%this_row;
        #store it in @rows
        push ( @rows, \%this_row );
    }
}

#print output:
#header line
print join ("\t", "name", "type", @headers ), "\n";
#iterate rows, extract ordered by _last_ set of headers.
foreach my $row ( @rows ) {
    print join ( "\t", $row->{name}, $row->{type}, @{$row}{@headers} ), "\n";
}

__DATA__
FRUIT MSMC1 MSMC24 MSMC2 MSMC10
Apple 1 2 3 2
Pear 2 1 4 5
VEG MSMC24 MSMC1 MSMC2 MSMC10
Onion 2 1 3 2
Radish 0 3 9 3
Note - I've used Data::Dumper for diagnostics - those lines can be removed, but I've left them because they illustrate what's going on.
Likewise reading from <DATA> - normally you'd open a file handle, or just use while ( <> ) { to read STDIN or the files specified on the command line.
The ordering of output is based on the last header line 'seen' - you can, of course, sort or reorder it.
If you need to handle mismatching columns, this will error on the missing one. In that scenario, we can break out map to fill in any blanks, and use a hash of headers to ensure we capture them all.
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @headers;
my %headers_combined;
my $type;
my @rows;

#iterate data - would do this with a normal 'open'
while ( <DATA> ) {
    #set headers if the leading word is all upper case
    if ( m/^[A-Z]+\s/ ) {
        #separate out type (VEG/FRUIT) from the other headings.
        chomp ( ( $type, @headers ) = split );
        #add to hash of headers, to preserve uniques
        $headers_combined{$_}++ for @headers;
        #print for debugging
        print Dumper \@headers;
    }
    else {
        #create a hash to store this row.
        my %this_row;
        #split the row on whitespace, capturing name and ordered fields by header row.
        ( my $name, @this_row{@headers} ) = split;
        #insert name and type into the hash
        $this_row{name} = $name;
        $this_row{type} = $type;
        #print for debugging
        print Dumper \%this_row;
        #store it in @rows
        push ( @rows, \%this_row );
    }
}

#print output:
#header line
#note - extract keys from hash, not the @headers array.
#sort is needed to order them, because default is unordered.
print join ("\t", "name", "type", sort keys %headers_combined ), "\n";
#iterate rows, extract ordered by _last_ set of headers.
foreach my $row ( @rows ) {
    print join ( "\t", $row->{name}, $row->{type}, map { $row->{$_} // '' } sort keys %headers_combined ), "\n";
}

__DATA__
FRUIT MSMC1 MSMC24 MSMC2 MSMC10 OTHER
Apple 1 2 3 2 x
Pear 2 1 4 5 y
VEG MSMC24 MSMC1 MSMC2 MSMC10 NOTHING
Onion 2 1 3 2 p
Radish 0 3 9 3 z
Here, map { $row->{$_} // '' } sort keys %headers_combined takes all the keys of the hash, returns them in sorted order, and then extracts each key from the row - or gives an empty string if it's undefined (that's what // does).
This will reorder the fields in file2 to match the order in file1:
$ cat tst.awk
FNR==1 {
    fileNr++
    for (i=2;i<=NF;i++) {
        name2nr[fileNr,$i] = i
        nr2name[fileNr,i] = $i
    }
}
fileNr==2 {
    printf "%s", $1
    for (i=2;i<=NF;i++) {
        printf "%s%s", OFS, $(name2nr[2,nr2name[1,i]])
    }
    print ""
}
$ awk -f tst.awk file1 file2
VEG MSMC1 MSMC24 MSMC2 MSMC10
Onion 1 2 3 2
Radish 3 0 9 3
With GNU awk you can delete the fileNr++ line and use ARGIND instead of fileNr everywhere else.

Error in lis[[i]] : attempt to select less than one element

This code is meant to compute the total distance of some given coordinates, but I don't know why it's not working.
The error is: Error in lis[[i]] : attempt to select less than one element.
Here is the code:
distant <- function(a, b)
{
  return(sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2))
}

totdistance <- function(lis)
{
  totdis = 0
  for (i in 1:length(lis)-1)
  {
    totdis = totdis + distant(lis[[i]], lis[[i+1]])
  }
  totdis = totdis + distant(lis[[1]], lis[[length(lis)]])
  return(totdis)
}
liss1<-list()
liss1[[1]]<-c(12,12)
liss1[[2]]<-c(18,23)
liss1[[4]]<-c(29,25)
liss1[[5]]<-c(31,52)
liss1[[3]]<-c(24,21)
liss1[[6]]<-c(36,43)
liss1[[7]]<-c(37,14)
liss1[[8]]<-c(42,8)
liss1[[9]]<-c(51,47)
liss1[[10]]<-c(62,53)
liss1[[11]]<-c(63,19)
liss1[[12]]<-c(69,39)
liss1[[13]]<-c(81,7)
liss1[[14]]<-c(82,18)
liss1[[15]]<-c(83,40)
liss1[[16]]<-c(88,30)
Output:
> totdistance(liss1)
Error in lis[[i]] : attempt to select less than one element
> distant(liss1[[2]],liss1[[3]])
[1] 6.324555
Let me reproduce your error in a simple way:
> list1 = list()
> list1[[0]] = list(a = c("a"))
Error in list1[[0]] = list(a = c("a")) :
  attempt to select less than one element
So the next question is: where are you accessing index 0 of a list? (Indexing of lists starts with 1 in R.)
As Molx indicated in a previous post: "The : operator is evaluated before the subtraction." This is what causes the 0-indexed list access.
For example:
> 1:10-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(10-1)
[1] 1 2 3 4 5 6 7 8 9
So replace the loop in your code with:
for (i in 1:(length(lis)-1))
{
  totdis = totdis + distant(lis[[i]], lis[[i+1]])
}
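An equivalent and arguably safer idiom is seq_len(), which also behaves sensibly for a length-1 list (it produces an empty loop instead of the bogus sequence c(1, 0)). A small sketch of the fixed function:
totdistance <- function(lis) {
  totdis <- 0
  # seq_len(0) is empty, so a single-point "tour" adds nothing here
  for (i in seq_len(length(lis) - 1)) {
    totdis <- totdis + distant(lis[[i]], lis[[i + 1]])
  }
  # close the loop back to the first point
  totdis + distant(lis[[1]], lis[[length(lis)]])
}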

Read a column value from previous line and next line but insert them as additional fields in the current line using awk

I hope you can help me out with my problem.
I have an input file with 3 columns of data which looks like this:
Apl_No Act_No Sfx_No
100 10 0
100 11 1
100 12 2
100 13 3
101 20 0
101 21 1
I need to create an output file which contains the data as in the input plus 3 additional fields. It should look like this:
Apl_No Act_No Sfx_No Crt_Act_No Prs_Act_No Cd_Act_No
100 10 0 - - -
100 11 1 10 11 12
100 12 2 11 12 13
100 13 3 12 13 10
101 20 0 - - -
101 21 1 20 21 20
Every Apl_No has a set of Act_No values mapped to it, and 3 new fields need to be created: Crt_Act_No, Prs_Act_No and Cd_Act_No. When the first occurrence of a unique Apl_No is encountered, the values of columns 4, 5 and 6 (Crt_Act_No, Prs_Act_No, Cd_Act_No) need to be dashed out. For every following row with the same Apl_No, Crt_Act_No is the Act_No of the previous line, Prs_Act_No is the Act_No of the current line, and Cd_Act_No is the Act_No of the next line. This continues for all following rows bearing the same Apl_No except the last one: there, Crt_Act_No and Prs_Act_No are filled in as above, but Cd_Act_No is pulled from the Act_No of the first row for that Apl_No.
I wish to achieve this using awk. Can anyone please help me out how to go about this.
One solution:
awk '
    ## Print header in first line.
    FNR == 1 {
        printf "%s %s %s %s\n", $0, "Crt_Act_No", "Prs_Act_No", "Cd_Act_No";
        next;
    }

    ## If first field not found in the hash, it is the first unique "Apl_No", so
    ## print the line with dashes and save some data for later use.
    ## "line" holds the content of the previous iteration. Print it if it is set.
    ! apl[ $1 ] {
        if ( line ) {
            sub( /-/, orig_act, line );
            print line;
            line = "";
        }
        printf "%s %s %s %s\n", $0, "-", "-", "-";
        orig_act = prev_act = $2;
        apl[ $1 ] = 1;
        next;
    }

    ## For all non-unique "Apl_No"...
    {
        ## If it is the first one after the line with dashes ("line" not set),
        ## save its content in "line" and the variable that has to be checked
        ## later ("Act_No"). Note that a dash is left in the last field, to be
        ## substituted in the following iteration.
        if ( ! line ) {
            line = sprintf( "%s %s %s %s", $0, prev_act, $2, "-" );
            prev_act = $2;
            next;
        }

        ## Now the field is known, so substitute the dash with it, print, and
        ## repeat the process with the current line.
        sub( /-/, $2, line );
        print line;
        line = sprintf( "%s %s %s %s", $0, prev_act, $2, "-" );
        prev_act = $2;
    }
    END {
        if ( line ) {
            sub( /-/, orig_act, line );
            print line;
        }
    }
' infile | column -t
That yields:
Apl_No Act_No Sfx_No Crt_Act_No Prs_Act_No Cd_Act_No
100 10 0 - - -
100 11 1 10 11 12
100 12 2 11 12 13
100 13 3 12 13 10
101 20 0 - - -
101 21 1 20 21 20
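For anyone wanting the same previous/next-line logic in R (the other language running through this page), data.table's shift() gives lag and lead within groups. A hedged sketch, assuming a whitespace-delimited infile with the header shown above:
library(data.table)
dt <- fread("infile")   # columns: Apl_No, Act_No, Sfx_No
# previous, current and next Act_No within each Apl_No group
dt[, `:=`(
  Crt_Act_No = as.character(shift(Act_No)),
  Prs_Act_No = as.character(Act_No),
  Cd_Act_No  = as.character(shift(Act_No, type = "lead"))
), by = Apl_No]
# the last row of each group wraps around to the group's first Act_No
dt[, Cd_Act_No := { x <- Cd_Act_No; x[.N] <- as.character(Act_No[1]); x }, by = Apl_No]
# the first row of each group gets dashes in all three new columns
cols <- c("Crt_Act_No", "Prs_Act_No", "Cd_Act_No")
dt[, (cols) := lapply(.SD, function(x) { x[1] <- "-"; x }), by = Apl_No, .SDcols = cols]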

How to find the 5 closest numbers in a matrix, with their positions?

I have a matrix as follows
> y
1 2 3
1 0.8802216 1.2277843 0.6875047
2 0.9381081 1.3189847 0.2046542
3 1.3245534 0.8221709 0.4630722
4 1.2006974 0.8890464 0.6710844
5 1.2344071 0.8354292 0.7259998
6 1.1670665 0.9214787 0.6826173
7 0.9670581 1.1070461 0.7742342
8 0.8867365 1.2160533 0.7024281
9 0.8235792 1.4424190 0.2030302
10 0.8821301 1.0541099 1.2279813
11 1.1958634 0.9708839 0.4297043
12 1.3542734 0.7747481 0.5119648
13 0.4385487 0.3588158 4.9167998
14 0.8530141 1.3578511 0.3698620
15 0.9651803 0.8426226 1.6132899
16 0.8854192 1.2272616 0.6715839
17 0.7779642 0.8132233 2.3386331
18 0.9936722 1.1629110 0.5083558
19 1.1235897 1.0018480 0.5764672
20 0.7887222 1.3101684 0.7373181
21 2.2276176 0.0000000 0.0000000
I found one clue, but it only gives the position of the single closest value in the whole matrix:
n <- read.table(file.choose(), header = T)
y <- n[, c("1", "2", "3")]
my.number <- 1.12270420185886
z <- abs(y - my.number) == min(abs(y - my.number))
which(z)
[1] 19
I want to find the 5 closest values, together with their row and column positions; in other words, the 5 closest single values from the matrix along with where they sit.
I don't know what language that is; is it R?
In a procedural language, I would save all values to a map of (value -> position) = (value -> (row, col)); for example 0.8802216 -> (1, 1). Then sort by value.
Then iterate i from 1 to map.size - 5, compute the difference between value(i + 4) and value(i), and search for the minimum difference: that window holds the 5 mutually closest values, and the map gives their positions.
Here is a solution in Scala:
val matrix = """1 0.8802216 1.2277843 0.6875047
2 0.9381081 1.3189847 0.2046542
3 1.3245534 0.8221709 0.4630722
4 1.2006974 0.8890464 0.6710844
5 1.2344071 0.8354292 0.7259998
6 1.1670665 0.9214787 0.6826173
7 0.9670581 1.1070461 0.7742342
8 0.8867365 1.2160533 0.7024281
9 0.8235792 1.4424190 0.2030302
10 0.8821301 1.0541099 1.2279813
11 1.1958634 0.9708839 0.4297043
12 1.3542734 0.7747481 0.5119648
13 0.4385487 0.3588158 4.9167998
14 0.8530141 1.3578511 0.3698620
15 0.9651803 0.8426226 1.6132899
16 0.8854192 1.2272616 0.6715839
17 0.7779642 0.8132233 2.3386331
18 0.9936722 1.1629110 0.5083558
19 1.1235897 1.0018480 0.5764672
20 0.7887222 1.3101684 0.7373181
21 2.2276176 0.0000000 0.0000000"""
// split block of text into lines
val lines=matrix.split ("\n")
// split lines into words
val rows = lines.map (l => l.split (" +"))
// remove the index from the beginning (1, 2, ... 21) and
// transform values from Strings to double numbers
// triples is: Array(Array(0.8802216, 1.2277843, 0.6875047), Array(0.9381081, 1.3189847, 0.2046542),
val triples = rows.map (_.tail).map(triple=> triple.map (_.toDouble))
// generate an own index for the rows and columns
// elems is: elems: Array[Array[(Double, (Int, Int))]] = Array(Array((0.8802216,(0,0)), (1.2277843,(0,1)), (0.6875047,(0,2))), Array((0.9381081,(1,0)), ...
val elems = triples.zipWithIndex.map {t=> t._1.zipWithIndex.map (vc=> (vc._1 -> (t._2, vc._2)))}
// sorted = Array((0.0,(20,1)), (0.0,(20,2)), (0.2030302,(8,2)), (0.2046542,(1,2)),
val sorted = elems.flatten.sortBy (e => e._1)
// delta5 = List(0.3588158, 0.369862, 0.2266741, 0.2338945, 0.10425639999999997, 0.1384938,
val delta5 = sorted.sliding (5, 1).map (q => q(4)._1-q(0)._1).toList
val minindex = delta5.indexOf (delta5.min) // minindex: Int = 29, delta5.min = 0.008824799999999966
// we found the smallest intervall of 5 values beginning at 29:
(29 until 29 + 5).map (sorted (_))
res568: scala.collection.immutable.IndexedSeq[(Double, (Int, Int))] =
Vector( (0.8802216,(0,0)),
(0.8821301,(9,0)),
(0.8854192,(15,0)),
(0.8867365,(7,0)),
(0.8890464,(3,1)))
Since Scala counts rows from 0 to 20 and columns from 0 to 2, where your indices run from 1 to 21 and 1 to 3 respectively, you have to add (1, 1) to each of the positions => (1, 1), (10, 1), and so on.
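If you would rather stay in R, the asker's own clue extends naturally: compute all the absolute differences at once, take the 5 smallest, and map the flat indices back to row/column positions with arrayInd(). A small sketch, assuming the matrix y and target value from the question (note this finds the 5 values closest to a given number, as in the clue, rather than the 5 mutually closest values the Scala answer finds):
my.number <- 1.12270420185886
ym <- as.matrix(y)            # ensure matrix (flat) indexing
d <- abs(ym - my.number)      # distance of every cell from the target
idx <- order(d)[1:5]          # flat indices of the 5 smallest differences
pos <- arrayInd(idx, dim(ym)) # convert flat indices to (row, col)
data.frame(row = pos[, 1], col = pos[, 2], value = ym[idx])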
