Which similarity measures are immune to word transpositions? - similarity

I am looking for similarity measures to compare phrases with potentially transposed words, for example "extended amounts" and "amounts extended".
The ones I have tried punish those transpositions too heavily for my purposes. Is there a string similarity function that does not punish such transpositions, or punishes them only slightly? Or can I only solve this use case satisfactorily by tokenizing and calculating individual similarities over the tokens?
import org.apache.lucene.search.spell.*;
public class SimilarityTest
{
    public static void main(String[] args)
    {
        String original = "amounts extended";
        String transposed = "extended amounts";
        StringDistance[] distances =
            {new NGramDistance(), new JaroWinklerDistance(), new LevensteinDistance()};
        for (StringDistance dist : distances)
            System.out.println(String.format("%18s", dist) + " " + dist.getDistance(original, transposed));
    }
}
Output
ngram(2) 0.125
jarowinkler(0.7) 0.5416667
levenstein 0.125

I'm not sure whether it fits your purpose, since you didn't specify it, and it is not a distance per se, but you should check whether you can use a bag of words, which ignores word order entirely.
To implement a distance measure, you can either combine it with machine learning techniques or, if your inputs are small, compute string distances between each pair of words and solve the resulting assignment problem (see the Hungarian algorithm).
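Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, `difflib.SequenceMatcher` stands in for whichever per-word string similarity you prefer, and a brute-force search over permutations stands in for the Hungarian algorithm, which is only reasonable for phrases of a few words.

```python
from difflib import SequenceMatcher
from itertools import permutations


def token_set_similarity(a, b):
    """Bag-of-words style score: Jaccard overlap of the two token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_assignment_similarity(a, b):
    """Pair up words so that the summed per-word similarity is maximal.

    Brute force over permutations (assumption: short phrases only);
    the Hungarian algorithm would solve the same assignment problem
    in polynomial time.
    """
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    if not long_:
        return 1.0
    best = max(
        sum(SequenceMatcher(None, x, y).ratio() for x, y in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    # Unmatched words of the longer phrase contribute zero.
    return best / len(long_)
```

With this sketch, "extended amounts" and "amounts extended" score 1.0 under both functions, while a small spelling difference in one word lowers only the assignment-based score slightly.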

Related

R-C50: Definition of minCases in C5.0Control

The parameter minCases of the function C5.0Control in the C50 R package is defined as:
an integer for the smallest number of samples that must be put in at least two of the splits.
How is this implemented? I assume that split in this context refers to the nodes resulting from the split operation. minCases does not seem to represent the smallest number of cases that must be put in at least one node, as I would have expected.
I have tried to find the implementation in the C source code. The variable minCases seems to be defined in extern.h in line 33:
extern CaseCount MINITEMS, LEAFRATIO;
It is used, for instance, in prune.c, lines 249 and 250:
if (BranchCases[v] < MINITEMS) {
ForEach(i, Bp, Ep) { SmallBranches[Class(Case[i])] += Weight(Case[i]); }
What does minCases really do?

How to find all strings I can create with 3 chars

I have 3 chars: "abc".
All combinations with these 3 characters are 3^3 = 27:
aaa, aab, aac, aba, ... etc.
I wrote some pseudocode to print all of these combinations:
string dictionary[3] = {"a", "b", "c"};
string str[3];
for (i = 0; i < 3; ++i) {
    str[0] = dictionary[i];
    for (j = 0; j < 3; ++j) {
        str[1] = dictionary[j];
        for (k = 0; k < 3; ++k) {
            str[2] = dictionary[k];
            println(str);
        }
    }
}
Now, I can see that all loops start at 0 and end at 2.
So I thought there might be a way to write this as a recursive function, although I cannot identify an obvious base case.
So I asked myself:
Is it really possible to create a recursive function for this type of problem?
If so, would it be more or less efficient than the iterative method?
If the number of usable chars increased, would performance get worse or better?
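One possible recursive formulation, sketched in Python rather than the pseudocode above: the base case is the empty string (length 0), and each level of recursion plays the role of one of the nested loops. The function name `all_strings` is made up for this illustration.

```python
def all_strings(alphabet, length):
    """Return every string of the given length over the alphabet."""
    if length == 0:
        # Base case: exactly one string of length 0, the empty string.
        return [""]
    shorter = all_strings(alphabet, length - 1)
    # Append each character to each shorter string, in nested-loop order.
    return [prefix + ch for prefix in shorter for ch in alphabet]


combos = all_strings("abc", 3)
```

Either way the output has k^n strings for k characters and length n, so the recursive and the iterative version do the same amount of work up to constant factors; the recursion mainly removes the need to hard-code the number of loops.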

Sum of ranks in a binary tree - is there a better way

Maybe this question does not belong here, as it is not a programming question per se, and I apologize if that is the case.
I just had an exam in abstract data structures, and there was this question:
The rank of a tree node is defined like this: if you are the root of the tree, your rank is 0. Otherwise, your rank is the rank of your parent + 1.
Design an algorithm that calculates the sum of the ranks of all nodes in a binary tree. What is the runtime of your algorithm?
I believe my answer solves this question. My pseudocode is as follows:
int sum_of_tree_ranks(tree node x)
{
    if x is a leaf return rank(x)
    else return sum_of_tree_ranks(x->left_child) + sum_of_tree_ranks(x->right_child) + rank(x)
}
where the function rank is
int rank(tree node x)
{
    if x->parent == null return 0
    else return 1 + rank(x->parent)
}
It's very simple: the sum of ranks of a tree is the sum for the left subtree + the sum for the right subtree + the rank of the root.
I believe the runtime of this algorithm is O(n^2), because we were not told that the binary tree is balanced. There could be n nodes in the tree but also n different "levels", so that the tree looks like a linked list rather than a tree. To calculate the rank of the leaf we then go up to n-1 steps up, for its parent n-2 steps, and so on, so that is (n-1)+(n-2)+...+1+0 = O(n^2).
My question is: is this correct? Does my algorithm solve the problem? Is my analysis of the runtime correct? And most importantly, is there a better solution that does not run in O(n^2)?
Your algorithm works and your analysis is correct. The problem can be solved in O(n) time by passing the rank down the recursion (take care of the leaf and missing-child cases yourself):
int rank(tree node x, int r)
{
    if x is a leaf return r
    else return rank(x->left_child, r + 1) + rank(x->right_child, r + 1) + r
}
rank(tree->root, 0)
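A minimal runnable sketch of this top-down idea in Python. The `Node` class and the name `sum_of_ranks` are assumptions made for the example; a missing child is represented by `None` and contributes 0, which also covers leaves and nodes with a single child.

```python
class Node:
    """Hypothetical binary tree node with optional children."""

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def sum_of_ranks(node, rank=0):
    """Sum the ranks (depths) of all nodes; each node is visited once."""
    if node is None:
        return 0
    return (rank
            + sum_of_ranks(node.left, rank + 1)
            + sum_of_ranks(node.right, rank + 1))


# Root with two children; the left child has one child of its own.
tree = Node(Node(Node()), Node())
total = sum_of_ranks(tree)  # ranks 0 + 1 + 1 + 2
```

Because the rank is handed down as an argument, no node ever walks back up to the root, which is what removes the quadratic behaviour on degenerate trees. For very deep trees in Python an iterative traversal would avoid the recursion limit.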
You're right, but there is an O(n) solution provided you can use a more "complex" data structure.
Let each node hold its rank and update the ranks whenever you add or remove a node. That way you can use the O(1) statement:
return 1 + node->left->rank + node->right->rank;
and do this for each node of the tree to achieve O(n).
A rule of thumb for reducing time complexity: if you can enrich the data structure with extra fields adapted to your problem, you can often reduce the running time, here to O(n).
It can be solved in O(n) time, where n is the number of nodes in the binary tree.
It is nothing but the sum of the depths of all nodes, where the depth of the root node is zero.
Algorithm:
Input: binary tree with left and right children
sum = 0;
Output: sum
PrintSumOfRank(root, sum):
    if (root == NULL) return 0;
    return PrintSumOfRank(root->lchild, sum + 1) + PrintSumOfRank(root->rchild, sum + 1) + sum;
Edit:
This can also be solved using a queue, i.e. a level-order traversal of the tree.
Algorithm using a queue:
int sum = 0;
int currentHeight = 0;
Node *T = root;
Node *t1;
if (T != NULL)
    enque(T);
while (Q is not empty) begin
    currentHeight = currentHeight + 1;
    for each node that was in Q at the start of this level do
        t1 = deque();
        if (t1->lchild != NULL) begin
            enque(t1->lchild); sum = sum + currentHeight;
        end if
        if (t1->rchild != NULL) begin
            enque(t1->rchild); sum = sum + currentHeight;
        end if
    end for
end while
print sum;
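The queue-based variant might look like this in Python. It is a sketch with a made-up `Node` class; instead of a per-level counter it stores each node's rank next to it in the queue, which is a small deviation from the pseudocode above but gives the same sum.

```python
from collections import deque


class Node:
    """Hypothetical binary tree node with optional children."""

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def sum_of_ranks_level_order(root):
    """Sum of node ranks via breadth-first traversal, O(n) time."""
    if root is None:
        return 0
    total = 0
    queue = deque([(root, 0)])  # pairs of (node, rank of that node)
    while queue:
        node, rank = queue.popleft()
        total += rank
        for child in (node.left, node.right):
            if child is not None:
                queue.append((child, rank + 1))
    return total
```

Being iterative, this version is not limited by recursion depth on trees that degenerate into a linked list.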

Parallel edge detection

I am working on a problem (from Algorithms by Sedgewick, section 4.1, problem 32) to help my understanding, and I have no idea how to proceed.
"Parallel edge detection. Devise a linear-time algorithm to count the parallel edges in a (multi-)graph.
Hint: maintain a boolean array of the neighbors of a vertex, and reuse this array by only reinitializing the entries as needed."
Two edges are considered parallel if they connect the same pair of vertices.
Any ideas what to do?
I think we can use BFS for this.
The main idea is to tell whether two or more edges exist between the same pair of nodes. For this we can use a set and check whether a node in the current node's adjacency list is already in the set.
This uses O(V) extra space and runs in O(V + E) time.
// requires java.util.*
boolean bfs(int start) {
    Queue<Integer> q = new LinkedList<Integer>(); // get a queue
    boolean[] mark = new boolean[num_of_vertices];
    mark[start] = true; // put the first node into the queue
    q.add(start);
    while (!q.isEmpty()) {
        int current = q.remove();
        // hash set storing the nodes seen so far in the current adjacency list
        HashSet<Integer> set = new HashSet<Integer>();
        ArrayList<Integer> adjacentlist = graph.get(current); // get adj. list
        for (int x : adjacentlist) {
            if (set.contains(x)) { // there already was an edge current-->x
                return true; // so we have found a parallel edge
            }
            else set.add(x); // otherwise this is a new edge
            if (!mark[x]) { // normal BFS routine
                mark[x] = true;
                q.add(x);
            }
        }
    }
    return false; // no parallel edge reachable from start
} // assumes the graph has an ArrayList<ArrayList<Integer>> representation
// undirected
Assume that the vertices in your graph are integers 0 .. |V| - 1.
If your graph is directed, edges in the graph are denoted (i, j).
This allows you to produce a unique mapping of any edge to an integer (a hash function), which can be computed in O(1):
h(i, j) = i * |V| + j
You can insert or look up the tuple (i, j) in a hash table in amortised O(1) time. For |E| edges in the adjacency list, this means the total running time will be O(|E|), i.e. linear in the number of edges.
A Python implementation of this might look something like this:
def identify_parallel_edges(adj_list):
    # adj_list is a flat list of (i, j) edge tuples
    # O(n) map from edge to number of occurrences
    # The Python implementation of tuple hashing implements a more sophisticated
    # version of the approach described above, but is still O(1)
    edges = {}
    for edge in adj_list:
        if edge not in edges:
            edges[edge] = 0
        edges[edge] += 1
    # O(n) filter out non-parallel edges
    res = []
    for edge, count in edges.items():
        if count > 1:
            res.append(edge)
    return res

edges = [(1, 0), (2, 1), (1, 0), (3, 4)]
print(identify_parallel_edges(edges))
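The hint quoted in the question can also be followed directly, without hashing. The sketch below is one reading of that hint, under these assumptions: the graph is undirected, vertices are 0 .. |V| - 1, `adj[v]` lists every neighbour of v once per edge (so a doubled edge appears twice in both lists), self-loops are ignored, and the result counts the edges beyond the first for each vertex pair. The function name is made up for the example.

```python
def count_parallel_edges(adj):
    """Count surplus edges between vertex pairs in O(V + E) time."""
    seen = [False] * len(adj)  # reused for every vertex
    count = 0
    for v, neighbors in enumerate(adj):
        for w in neighbors:
            if w <= v:
                # Each pair is handled once, from its smaller endpoint;
                # this also skips self-loops.
                continue
            if seen[w]:
                count += 1  # another edge v-w was already seen
            else:
                seen[w] = True
        # Reinitialize only the entries that could have been touched.
        for w in neighbors:
            seen[w] = False
    return count


# Edges: 0-1 twice, 1-2 once, 2-3 three times.
graph = [[1, 1], [0, 0, 2], [1, 3, 3, 3], [2, 2, 2]]
parallel = count_parallel_edges(graph)
```

Resetting only the entries of the current adjacency list, instead of the whole array, is what keeps the total work proportional to the number of vertices plus edges.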

ActionScript/Flex: bitwise AND/OR over 32-bits

Question: Is there an easy way (library function) to perform a bitwise AND or OR on numbers larger than 32-bit in ActionScript?
From the docs:
"Bitwise operators internally manipulate floating-point numbers to change them into 32-bit integers. The exact operation performed depends on the operator, but all bitwise operations evaluate each binary digit (bit) of the 32-bit integer individually to compute a new value."
Bummer...
I can't use the & or | ops - does AS expose a library function to do this for Numbers?
Specifics: I'm porting a bunch of java to flex and the java maintains a bunch of 'long' masks. I know that I can split the Java masks into two ints on the flex side. Since all of my mask manip is localized this won't be too painful. However, I'd like to keep the port as 1-1 as possible.
Any suggestions?
Thanks!
I think your most straightforward option is to break the masks, and if possible the data being masked, into two pieces. You're butting up against a feature gap, so no point in being tricky if you can help it. And if you don't need real BigNum support, best not to even consider it.
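The arithmetic behind splitting a mask into two pieces can be sketched as follows. The example is written in Python rather than ActionScript, purely to illustrate the idea; the names `split_words`, `and_wide` and `or_wide` are invented here, and only 32-bit-sized operands are ever passed to `&` and `|`.

```python
WORD = 2 ** 32  # size of one 32-bit word


def split_words(value):
    """Split a non-negative integer into (high word, low 32-bit word)."""
    return value // WORD, value % WORD


def and_wide(a, b):
    """Bitwise AND built from two ANDs on the separate words."""
    hi_a, lo_a = split_words(a)
    hi_b, lo_b = split_words(b)
    return (hi_a & hi_b) * WORD + (lo_a & lo_b)


def or_wide(a, b):
    """Bitwise OR built from two ORs on the separate words."""
    hi_a, lo_a = split_words(a)
    hi_b, lo_b = split_words(b)
    return (hi_a | hi_b) * WORD + (lo_a | lo_b)


mask_a = 0x1FFFFF0000FFFF  # 53-bit example values
mask_b = 0x10F0F080008001
```

In a language whose bitwise operators return signed 32-bit results, the low word would additionally have to be reinterpreted as unsigned before it is added back.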
If you don't mind porting some Javascript, Leemon Baird has written a public-domain Javascript library for handling big integers here:
http://www.leemon.com/crypto/BigInt.html
You won't be able to explicitly use the & and | operators, but you should be able to augment the existing code with bitwiseAnd and bitwiseOr methods.
public class NumberUtils
{
    public static const MSB_CONV : Number = Math.pow(2, 32);
    public static function bitwiseAND(num1 : Number, num2 : Number) : Number {
        // high words: everything above the low 32 bits
        var msb1 : int = num1 / MSB_CONV;
        var msb2 : int = num2 / MSB_CONV;
        // >>> 0 keeps the low word unsigned; a plain & yields a signed 32-bit int
        return (msb1 & msb2) * MSB_CONV + ((num1 & num2) >>> 0);
    }
    ..OR..shiftRight..
}
According to http://livedocs.adobe.com/flex/3/html/help.html?content=03_Language_and_Syntax_11.html, there are no 64-bit integers (signed or unsigned)...only 32-bit.
The Number type, as you mentioned above, has a 53-bit mantissa, which is too short for you.
I searched for a BigNum FLEX implementation, but couldn't find one.
I'm guessing that you will have to simulate this with either an array of ints or a class with a high and low int.
Good Luck,
Randy Stegbauer
public function readInt64():Number
{
    var highInt:uint = bytes.readUnsignedInt();
    var lowerInt:uint = bytes.readUnsignedInt();
    return highInt * Math.pow(2, 32) + lowerInt;
}
public function writeInt64(value:Number):void
{
    // divide by 2^32 (not 0xffffffff) so that this is the inverse of readInt64
    this.writeUnsignedInt(uint(Math.floor(value / Math.pow(2, 32)))); // high 32 bits
    this.writeUnsignedInt(uint(value % Math.pow(2, 32))); // low 32 bits
}

Resources