Calculating Standard Deviation in Assignment - standards

I have an assignment for my class that reads like this: Write a class called Stats. The constructor will take no input. There will be a method addData(double a) which will be used to add a value from the test program. Methods getCount(), getAverage() and getStandardDeviation() will return the appropriate values as doubles.
Here's what I have so far:
public class Stats
{
public Stats (double a)
{
a=0.0
}
public void addData(double a)
{
while (
sum=sum+a;
sumsq=sumsq+Math.pow(a,2)
count=count+1
}
public double getCount()
{
return count;
}
public double getAverage()
{
average=sum/count
return average;
}
public double getStandardDeviation()
{
private double sum=o;
private double count=0;
private double sumsq=0;
My problem is figuring out how to calculate the standard deviation using the variables I've defined.
Thanks guys!

You can't do this with the variables you defined. You need to keep the original data to be able to compute the formula
sigma = Math.sqrt( sum(Math.pow(x-mean, 2)) / count )
So,
(1) create private array or list into which you'll add your values in addData. That's all you need to do in addData.
(2) getCount = length of the list
(3) getAverage = sum of values in list / getCount()
(4) getStandardDeviation is something like
double avg = getAverage();
double cnt = getCount();
double sumsq = 0;
for (int i = 0; i < values.Count(); i++) {
sumsq += Math.pow(values[i] - avg, 2);
}
stdev = Math.sqrt(sumsq / cnt);

Related

Calculate sound value with distance

I have a more mathematical than programming question, sorry if I'm not in the right section. In my 2D game, we can move the camera on a map where there are objects that can emit sound, and this sound volume (defined by a float from 0 to 1) must increase when the screen center is near this object. For example, when the object is at the screen center, the sound volume is 1, and when we move away, the volume must decrease. Each object has its own scope value. (for example 1000 pixels).
I don't know how to write a method that can calculate it.
Here is some of my code (which is not the right calculation) :
private function setVolumeWithDistance():Void
{
sound.volume = getDistanceFromScreenCenter() / range;
// So the volume is a 0 to 1 float, the range is the scope in pixels and
// and the getDistanceFromScreenCenter() is the distance in pixels
}
I already have the method which calculates the distance of the object from the center screen :
public function getDistanceFromScreenCenter():Float
{
return Math.sqrt(Math.pow((Cameraman.getInstance().getFocusPosition().x - position.x), 2) +
Math.pow((Cameraman.getInstance().getFocusPosition().y - position.y), 2));
Simple acoustics can help.
Here is the formula for sound intensity from a point source. It follows an inverse square of distance rule. Build that into your code.
You need to consider the mapping between global and screen coordinates. You have to map pixel location on the screen to physical coordinates and back.
Your distance code is flawed. No one should use pow() to square numbers. Yours is susceptible to round off errors.
This code combines the distance calculation, done properly, and attempts to solve the inverse square intensity calculation. Note: Inverse square is singular for zero distance.
package physics;
/**
* Simple model for an acoustic point source
* Created by Michael
* Creation date 1/16/2016.
* #link https://stackoverflow.com/questions/34827629/calculate-sound-value-with-distance/34828300?noredirect=1#comment57399595_34828300
*/
public class AcousticPointSource {
// Units matter here....
private static final double DEFAULT_REFERENCE_INTENSITY = 0.01;
private static final double DEFAULT_REFERENCE_DISTANCE = 1.0;
// Units matter here...
private double referenceDistance;
private double referenceIntensity;
public static void main(String[] args) {
int numPoints = 20;
double x = 0.0;
double dx = 0.05;
AcousticPointSource source = new AcousticPointSource();
for (int i = 0; i < numPoints; ++i) {
x += dx;
Point p = new Point(x);
System.out.println(String.format("point %s intensity %-10.6f", p, source.intensity(p)));
}
}
public AcousticPointSource() {
this(DEFAULT_REFERENCE_DISTANCE, DEFAULT_REFERENCE_INTENSITY);
}
public AcousticPointSource(double referenceDistance, double referenceIntensity) {
if (referenceDistance <= 0.0) throw new IllegalArgumentException("distance must be positive");
if (referenceIntensity <= 0.0) throw new IllegalArgumentException("intensity must be positive");
this.referenceDistance = referenceDistance;
this.referenceIntensity = referenceIntensity;
}
public double distance2D(Point p1) {
return distance2D(p1, Point.ZERO);
}
public double distance2D(Point p1, Point p2) {
double distance = 0.0;
if ((p1 != null) && (p2 != null)) {
double dx = Math.abs(p1.x - p2.x);
double dy = Math.abs(p1.y - p2.y);
double ratio;
if (dx > dy) {
ratio = dy/dx;
distance = dx;
} else {
ratio = dx/dy;
distance = dy;
}
distance *= Math.sqrt(1.0 + ratio*ratio);
if (Double.isNaN(distance)) {
distance = 0.0;
}
}
return distance;
}
public double intensity(Point p) {
double intensity = 0.0;
if (p != null) {
double distance = distance2D(p);
if (distance != 0.0) {
double ratio = this.referenceDistance/distance;
intensity = this.referenceIntensity*ratio*ratio;
}
}
return intensity;
}
}
class Point {
public static final Point ZERO = new Point(0.0, 0.0, 0.0);
public final double x;
public final double y;
public final double z;
public Point(double x) {
this(x, 0.0, 0.0);
}
public Point(double x, double y) {
this(x, y, 0.0);
}
public Point(double x, double y, double z) {
this.x = x;
this.y = y;
this.z = z;
}
#Override
public String toString() {
return String.format("(%-10.4f,%-10.4f,%-10.4f)", x, y, z);
}
}

Kinect skeleton Scaling strange behaviour

I am trying to scale a skeleton to match to the sizes of another skeleton.
My algoritm do the following:
Find the distance between two joints of the origin skeleton and the destiny skeleton using phytagorean teorem
divide this two distances to find a multiply factor.
Multiply each joint by this factor.
Here is my actual code:
public static Skeleton ScaleToMatch(this Skeleton skToBeScaled, Skeleton skDestiny)
{
Joint newJoint = new Joint();
double distanciaOrigem = 0;
double distanciaDestino = 0;
double fator = 1;
SkeletonPoint pos = new SkeletonPoint();
foreach (BoneOrientation bo in skToBeScaled.BoneOrientations)
{
distanciaOrigem = FisioKinectCalcs.Distance3DBetweenJoint(skToBeScaled.Joints[bo.StartJoint], skToBeScaled.Joints[bo.EndJoint]);
distanciaDestino = FisioKinectCalcs.Distance3DBetweenJoint(skDestiny.Joints[bo.StartJoint], skDestiny.Joints[bo.EndJoint]);
if (distanciaOrigem > 0 && distanciaDestino > 0)
{
fator = (distanciaDestino / distanciaOrigem);
newJoint = skToBeScaled.Joints[bo.EndJoint]; // escaling only the end joint as the BoneOrientatios starts from HipCenter, i am scaling from center to edges.
// applying the new values to the joint
pos = new SkeletonPoint()
{
X = (float)(newJoint.Position.X * fator),
Y = (float)(newJoint.Position.Y * fator),
Z = (float)(newJoint.Position.Z * fator)
};
newJoint.Position = pos;
skToBeScaled.Joints[bo.EndJoint] = newJoint;
}
}
return skToBeScaled;
}
Every seems to work fine except for the hands and foots
Look at this images
I have my own skeleton over me, and my skeleton scaled to the sizes of another person, but the hands and foots still crazy. (but code looks right)
Any suggestion?
It's hard to say without running the code, but it somewhat "looks good".
What I would validate though, is your
if (distanciaOrigem > 0 && distanciaDestino > 0)
If distanciaOrigem is very close to 0, but even just epsilon away from 0, it won't be picked up by the if, and then
fator = (distanciaDestino / distanciaOrigem);
Will result in a very large number!
I would suggest to smooth the factor so it generally fits the proper scale. Try this code:
private static Dictionary<JointType, double> jointFactors = null;
static CalibrationUtils()
{
InitJointFactors();
}
public static class EnumUtil
{
public static IEnumerable<T> GetValues<T>()
{
return Enum.GetValues(typeof(T)).Cast<T>();
}
}
private static void InitJointFactors()
{
var jointTypes = EnumUtil.GetValues<JointType>();
jointFactors = new Dictionary<JointType, double>();
foreach(JointType type in jointTypes)
{
jointFactors.Add(type, 0);
}
}
private static double SmoothenFactor(JointType jointType, double factor, int weight)
{
double currentValue = jointFactors[jointType];
double newValue = 0;
if(currentValue != 0)
newValue = (weight * currentValue + factor) / (weight + 1);
else
newValue = factor;
jointFactors[jointType] = newValue;
return newValue;
}
When it comes to factor usage just use the SmoothenFactor method first:
public static Skeleton ScaleToMatch(this Skeleton skToBeScaled, Skeleton skDestiny, double additionalFactor = 1)
{
Joint newJoint = new Joint();
double distanceToScale = 0;
double distanceDestiny = 0;
double factor = 1;
int weight = 500;
SkeletonPoint pos = new SkeletonPoint();
Skeleton newSkeleton = null;
KinectHelper.CopySkeleton(skToBeScaled, ref newSkeleton);
SkeletonPoint hipCenterPosition = newSkeleton.Joints[JointType.HipCenter].Position;
foreach(BoneOrientation bo in skToBeScaled.BoneOrientations)
{
distanceToScale = Distance3DBetweenJoints(skToBeScaled.Joints[bo.StartJoint], skToBeScaled.Joints[bo.EndJoint]);
distanceDestiny = Distance3DBetweenJoints(skDestiny.Joints[bo.StartJoint], skDestiny.Joints[bo.EndJoint]);
if(distanceToScale > 0 && distanceDestiny > 0)
{
factor = (distanceDestiny / distanceToScale) * additionalFactor;
newJoint = skToBeScaled.Joints[bo.EndJoint]; // escaling only the end joint as the BoneOrientatios starts from HipCenter, i am scaling from center to edges.
factor = SmoothenFactor(newJoint.JointType, factor, weight);
pos = new SkeletonPoint()
{
X = (float)((newJoint.Position.X - hipCenterPosition.X) * factor + hipCenterPosition.X),
Y = (float)((newJoint.Position.Y - hipCenterPosition.Y) * factor + hipCenterPosition.Y),
Z = (float)((newJoint.Position.Z - hipCenterPosition.Z) * factor + hipCenterPosition.Z)
};
newJoint.Position = pos;
newSkeleton.Joints[bo.EndJoint] = newJoint;
}
}
return newSkeleton;
}
I also modified your ScaleToMatch method as you see. There was a need to move joints in relation to HipCenter position. Also new positions are saved to a new Skeleton instance so they are not used in further vector calculations.
Experiment with the weight but since our bones length is constant you can use big numbers like 100 and more to be sure that wrong Kinect readings do not disturb the correct scale.
Here's an example of how it helped with scaling HandRight joint position:
The weight was set to 500. The resulting factor is supposed to be around 2 (because the base skeleton was purposely downscaled by a factor of 2).
I hope it helps!

Generate Numbers with 1 increasing order with in a range

How to generate number from 1000 to 9999 with 1 increasing order ?
I tried with Random Class, it is working fine, but instead of generating random number, I want to generate number with in a range.
private int RandomNumber(int min, int max)
{
Random random = new Random();
return random.Next(min, max);
}
Then I call this method :
int returnValue = RandomNumber(5, 20);
You can simply do it this way:
public double RandomNumber(int min, int max)
{
return Random.NextInt(min, max) ;
}
Calling Random.Next(int, int) will give you a random number in that range. You should clarify your question.
How to generate number from 1000 to 9999 with 1 increasing order ?
You can use loop to generate number in range.
for(int i=1000; i < 9999; i++)
{
Console.WriteLine(i);
}
You can use List to store the generated numbers.
List<int> lstNumbers = new List<int>();
for(int i=1000; i < 9999; i++)
{
lstNumbers.Add(i)
}
You can use linq IEnumerable.Range() method as well.
IEnumerable<int> squares = Enumerable.Range(1000 , 9999).Select(x => x * x);
foreach (int num in squares)
{
Console.WriteLine(num);
}
the best way to do SQL Sequence and you call it every time, because the random number process will start over if you restarted IIS.

Are there any example of Mutual recursion?

Are there any examples for a recursive function that calls an other function which calls the first one too ?
Example :
function1()
{
//do something
function2();
//do something
}
function2()
{
//do something
function1();
//do something
}
Mutual recursion is common in code that parses mathematical expressions (and other grammars). A recursive descent parser based on the grammar below will naturally contain mutual recursion: expression-terms-term-factor-primary-expression.
expression
+ terms
- terms
terms
terms
term + terms
term - terms
term
factor
factor * term
factor / term
factor
primary
primary ^ factor
primary
( expression )
number
name
name ( expression )
The proper term for this is Mutual Recursion.
http://en.wikipedia.org/wiki/Mutual_recursion
There's an example on that page, I'll reproduce here in Java:
boolean even( int number )
{
if( number == 0 )
return true;
else
return odd(abs(number)-1)
}
boolean odd( int number )
{
if( number == 0 )
return false;
else
return even(abs(number)-1);
}
Where abs( n ) means return the absolute value of a number.
Clearly this is not efficient, just to demonstrate a point.
An example might be the minmax algorithm commonly used in game programs such as chess. Starting at the top of the game tree, the goal is to find the maximum value of all the nodes at the level below, whose values are defined as the minimum of the values of the nodes below that, whose values are defines as the maximum of the values below that, whose values ...
I can think of two common sources of mutual recursion.
Functions dealing with mutually recursive types
Consider an Abstract Syntax Tree (AST) that keeps position information in every node. The type might look like this:
type Expr =
| Int of int
| Var of string
| Add of ExprAux * ExprAux
and ExprAux = Expr of int * Expr
The easiest way to write functions that manipulate values of these types is to write mutually recursive functions. For example, a function to find the set of free variables:
let rec freeVariables = function
| Int n -> Set.empty
| Var x -> Set.singleton x
| Add(f, g) -> Set.union (freeVariablesAux f) (freeVariablesAux g)
and freeVariablesAux (Expr(loc, e)) =
freeVariables e
State machines
Consider a state machine that is either on, off or paused with instructions to start, stop, pause and resume (F# code):
type Instruction = Start | Stop | Pause | Resume
The state machine might be written as mutually recursive functions with one function for each state:
type State = State of (Instruction -> State)
let rec isOff = function
| Start -> State isOn
| _ -> State isOff
and isOn = function
| Stop -> State isOff
| Pause -> State isPaused
| _ -> State isOn
and isPaused = function
| Stop -> State isOff
| Resume -> State isOn
| _ -> State isPaused
It's a bit contrived and not very efficient, but you could do this with a function to calculate Fibbonacci numbers as in:
fib2(n) { return fib(n-2); }
fib1(n) { return fib(n-1); }
fib(n)
{
if (n>1)
return fib1(n) + fib2(n);
else
return 1;
}
In this case its efficiency can be dramatically enhanced if the language supports memoization
In a language with proper tail calls, Mutual Tail Recursion is a very natural way of implementing automata.
Here is my coded solution. For a calculator app that performs *,/,- operations using mutual recursion. It also checks for brackets (()) to decide the order of precedence.
Flow:: expression -> term -> factor -> expression
Calculator.h
#ifndef CALCULATOR_H_
#define CALCULATOR_H_
#include <string>
using namespace std;
/****** A Calculator Class holding expression, term, factor ********/
class Calculator
{
public:
/**Default Constructor*/
Calculator();
/** Parameterized Constructor common for all exception
* #aparam e exception value
* */
Calculator(char e);
/**
* Function to start computation
* #param input - input expression*/
void start(string input);
/**
* Evaluates Term*
* #param input string for term*/
double term(string& input);
/* Evaluates factor*
* #param input string for factor*/
double factor(string& input);
/* Evaluates Expression*
* #param input string for expression*/
double expression(string& input);
/* Evaluates number*
* #param input string for number*/
string number(string n);
/**
* Prints calculates value of the expression
* */
void print();
/**
* Converts char to double
* #param c input char
* */
double charTONum(const char* c);
/**
* Get error
*/
char get_value() const;
/** Reset all values*/
void reset();
private:
int lock;//set lock to check extra parenthesis
double result;// result
char error_msg;// error message
};
/**Error for unexpected string operation*/
class Unexpected_error:public Calculator
{
public:
Unexpected_error(char e):Calculator(e){};
};
/**Error for missing parenthesis*/
class Missing_parenthesis:public Calculator
{
public:
Missing_parenthesis(char e):Calculator(e){};
};
/**Error if divide by zeros*/
class DivideByZero:public Calculator{
public:
DivideByZero():Calculator(){};
};
#endif
===============================================================================
Calculator.cpp
//============================================================================
// Name : Calculator.cpp
// Author : Anurag
// Version :
// Copyright : Your copyright notice
// Description : Calculator using mutual recursion in C++, Ansi-style
//============================================================================
#include "Calculator.h"
#include <iostream>
#include <string>
#include <math.h>
#include <exception>
using namespace std;
Calculator::Calculator():lock(0),result(0),error_msg(' '){
}
Calculator::Calculator(char e):result(0), error_msg(e) {
}
char Calculator::get_value() const {
return this->error_msg;
}
void Calculator::start(string input) {
try{
result = expression(input);
print();
}catch (Unexpected_error e) {
cout<<result<<endl;
cout<<"***** Unexpected "<<e.get_value()<<endl;
}catch (Missing_parenthesis e) {
cout<<"***** Missing "<<e.get_value()<<endl;
}catch (DivideByZero e) {
cout<<"***** Division By Zero" << endl;
}
}
double Calculator::expression(string& input) {
double expression=0;
if(input.size()==0)
return 0;
expression = term(input);
if(input[0] == ' ')
input = input.substr(1);
if(input[0] == '+') {
input = input.substr(1);
expression += term(input);
}
else if(input[0] == '-') {
input = input.substr(1);
expression -= term(input);
}
if(input[0] == '%'){
result = expression;
throw Unexpected_error(input[0]);
}
if(input[0]==')' && lock<=0 )
throw Missing_parenthesis(')');
return expression;
}
double Calculator::term(string& input) {
if(input.size()==0)
return 1;
double term=1;
term = factor(input);
if(input[0] == ' ')
input = input.substr(1);
if(input[0] == '*') {
input = input.substr(1);
term = term * factor(input);
}
else if(input[0] == '/') {
input = input.substr(1);
double den = factor(input);
if(den==0) {
throw DivideByZero();
}
term = term / den;
}
return term;
}
double Calculator::factor(string& input) {
double factor=0;
if(input[0] == ' ') {
input = input.substr(1);
}
if(input[0] == '(') {
lock++;
input = input.substr(1);
factor = expression(input);
if(input[0]==')') {
lock--;
input = input.substr(1);
return factor;
}else{
throw Missing_parenthesis(')');
}
}
else if (input[0]>='0' && input[0]<='9'){
string nums = input.substr(0,1) + number(input.substr(1));
input = input.substr(nums.size());
return stod(nums);
}
else {
result = factor;
throw Unexpected_error(input[0]);
}
return factor;
}
string Calculator::number(string input) {
if(input.substr(0,2)=="E+" || input.substr(0,2)=="E-" || input.substr(0,2)=="e-" || input.substr(0,2)=="e-")
return input.substr(0,2) + number(input.substr(2));
else if((input[0]>='0' && input[0]<='9') || (input[0]=='.'))
return input.substr(0,1) + number(input.substr(1));
else
return "";
}
void Calculator::print() {
cout << result << endl;
}
void Calculator::reset(){
this->lock=0;
this->result=0;
}
int main() {
Calculator* cal = new Calculator;
string input;
cout<<"Expression? ";
getline(cin,input);
while(input!="."){
cal->start(input.substr(0,input.size()-2));
cout<<"Expression? ";
cal->reset();
getline(cin,input);
}
cout << "Done!" << endl;
return 0;
}
==============================================================
Sample input-> Expression? (42+8)*10 =
Output-> 500
Top down merge sort can use a pair of mutually recursive functions to alternate the direction of merge based on level of recursion.
For the example code below, a[] is the array to be sorted, b[] is a temporary working array. For a naive implementation of merge sort, each merge operation copies data from a[] to b[], then merges b[] back to a[], or it merges from a[] to b[], then copies from b[] back to a[]. This requires n · ceiling(log2(n)) copy operations. To eliminate the copy operations used for merging, the direction of merge can be alternated based on level of recursion, merge from a[] to b[], merge from b[] to a[], ..., and switch to in place insertion sort for small runs in a[], as insertion sort on small runs is faster than merge sort.
In this example, MergeSortAtoA() and MergeSortAtoB() are the mutually recursive functions.
Example java code:
static final int ISZ = 64; // use insertion sort if size <= ISZ
static void MergeSort(int a[])
{
int n = a.length;
if(n < 2)
return;
int [] b = new int[n];
MergeSortAtoA(a, b, 0, n);
}
static void MergeSortAtoA(int a[], int b[], int ll, int ee)
{
if ((ee - ll) <= ISZ){ // use insertion sort on small runs
InsertionSort(a, ll, ee);
return;
}
int rr = (ll + ee)>>1; // midpoint, start of right half
MergeSortAtoB(a, b, ll, rr);
MergeSortAtoB(a, b, rr, ee);
Merge(b, a, ll, rr, ee); // merge b to a
}
static void MergeSortAtoB(int a[], int b[], int ll, int ee)
{
int rr = (ll + ee)>>1; // midpoint, start of right half
MergeSortAtoA(a, b, ll, rr);
MergeSortAtoA(a, b, rr, ee);
Merge(a, b, ll, rr, ee); // merge a to b
}
static void Merge(int a[], int b[], int ll, int rr, int ee)
{
int o = ll; // b[] index
int l = ll; // a[] left index
int r = rr; // a[] right index
while(true){ // merge data
if(a[l] <= a[r]){ // if a[l] <= a[r]
b[o++] = a[l++]; // copy a[l]
if(l < rr) // if not end of left run
continue; // continue (back to while)
while(r < ee){ // else copy rest of right run
b[o++] = a[r++];
}
break; // and return
} else { // else a[l] > a[r]
b[o++] = a[r++]; // copy a[r]
if(r < ee) // if not end of right run
continue; // continue (back to while)
while(l < rr){ // else copy rest of left run
b[o++] = a[l++];
}
break; // and return
}
}
}
static void InsertionSort(int a[], int ll, int ee)
{
int i, j;
int t;
for (j = ll + 1; j < ee; j++) {
t = a[j];
i = j-1;
while(i >= ll && a[i] > t){
a[i+1] = a[i];
i--;
}
a[i+1] = t;
}
}

A better similarity ranking algorithm for variable length strings

I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (levenshtein distance, soundex, etc).
For example,
Given string A: "Robert",
Then string B: "Amy Robertson"
would be a better match than
String C: "Richard"
Also, preferably, this algorithm should be language agnostic (also works in languages other than English).
Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes:
http://www.catalysoft.com/articles/StrikeAMatch.html
Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment by Mark Wong-VanHaren) so that I can use it in my PostgreSQL queries:
CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '
str1.downcase!
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
|pair| pair.include? " "}
str2.downcase!
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
|pair| pair.include? " "}
union = pairs1.size + pairs2.size
intersection = 0
pairs1.each do |p1|
0.upto(pairs2.size-1) do |i|
if p1 == pairs2[i]
intersection += 1
pairs2.slice!(i)
break
end
end
end
(2.0 * intersection) / union
' LANGUAGE 'plruby';
Works like a charm!
marzagao's answer is great. I converted it to C# so I thought I'd post it here:
Pastebin Link
/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
/// <summary>
/// Compares the two strings based on letter pair matches
/// </summary>
/// <param name="str1"></param>
/// <param name="str2"></param>
/// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
public double CompareStrings(string str1, string str2)
{
List<string> pairs1 = WordLetterPairs(str1.ToUpper());
List<string> pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
/// <summary>
/// Gets all letter pairs for each
/// individual word in the string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private List<string> WordLetterPairs(string str)
{
List<string> AllPairs = new List<string>();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, #"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
/// <summary>
/// Generates an array containing every
/// two consecutive letters in the input string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
}
Here is another version of marzagao's answer, this one written in Python:
def get_bigrams(string):
"""
Take a string and return a list of bigrams.
"""
s = string.lower()
return [s[i:i+2] for i in list(range(len(s) - 1))]
def string_similarity(str1, str2):
"""
Perform bigram comparison between two strings
and return a percentage match in decimal form.
"""
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
union = len(pairs1) + len(pairs2)
hit_count = 0
for x in pairs1:
for y in pairs2:
if x == y:
hit_count += 1
break
return (2.0 * hit_count) / union
if __name__ == "__main__":
"""
Run a test using the example taken from:
http://www.catalysoft.com/articles/StrikeAMatch.html
"""
w1 = 'Healed'
words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']
for w2 in words:
print('Healed --- ' + w2)
print(string_similarity(w1, w2))
print()
A shorter version of John Rutledge's answer:
def get_bigrams(string):
'''
Takes a string and returns a list of bigrams
'''
s = string.lower()
return {s[i:i+2] for i in xrange(len(s) - 1)}
def string_similarity(str1, str2):
'''
Perform bigram comparison between two strings
and return a percentage match in decimal form
'''
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))
Here's my PHP implementation of suggested StrikeAMatch algorithm, by Simon White. the advantages (like it says in the link) are:
A true reflection of lexical similarity - strings with small differences should be recognised as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.
A robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognised as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognised as dissimilar.
Language Independence - the algorithm should work not only in English, but in many different languages.
<?php
/**
* LetterPairSimilarity algorithm implementation in PHP
* #author Igal Alkon
* #link http://www.catalysoft.com/articles/StrikeAMatch.html
*/
class LetterPairSimilarity
{
/**
* #param $str
* #return mixed
*/
private function wordLetterPairs($str)
{
$allPairs = array();
// Tokenize the string and put the tokens/words into an array
$words = explode(' ', $str);
// For each word
for ($w = 0; $w < count($words); $w++)
{
// Find the pairs of characters
$pairsInWord = $this->letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p++)
{
$allPairs[] = $pairsInWord[$p];
}
}
return $allPairs;
}
/**
* #param $str
* #return array
*/
private function letterPairs($str)
{
$numPairs = mb_strlen($str)-1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i++)
{
$pairs[$i] = mb_substr($str,$i,2);
}
return $pairs;
}
/**
* #param $str1
* #param $str2
* #return float
*/
public function compareStrings($str1, $str2)
{
$pairs1 = $this->wordLetterPairs(strtoupper($str1));
$pairs2 = $this->wordLetterPairs(strtoupper($str2));
$intersection = 0;
$union = count($pairs1) + count($pairs2);
for ($i=0; $i < count($pairs1); $i++)
{
$pair1 = $pairs1[$i];
$pairs2 = array_values($pairs2);
for($j = 0; $j < count($pairs2); $j++)
{
$pair2 = $pairs2[$j];
if ($pair1 === $pair2)
{
$intersection++;
unset($pairs2[$j]);
break;
}
}
}
return (2.0*$intersection)/$union;
}
}
This discussion has been really helpful, thanks. I converted the algorithm to VBA for use with Excel and wrote a few versions of a worksheet function, one for simple comparison of a pair of strings, the other for comparing one string to a range/array of strings. The strSimLookup version returns either the last best match as a string, array index, or similarity metric.
This implementation produces the same results listed in the Amazon example on Simon White's website with a few minor exceptions on low-scoring matches; not sure where the difference creeps in, could be VBA's Split function, but I haven't investigated as it's working fine for my purposes.
'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:  http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010
Option Explicit
Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Set sPairs1 = New Collection
Set sPairs2 = New Collection
WordLetterPairs str1, sPairs1
WordLetterPairs str2, sPairs2
stringSimilarity = SimilarityMetric(sPairs1, sPairs2)
Set sPairs1 = Nothing
Set sPairs2 = Nothing
End Function
Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim arrOut As Variant
Dim l As Long, j As Long
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
l = rRng.Count
ReDim arrOut(1 To l)
For j = 1 To l
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(j)), sPairs2
arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
Set sPairs2 = Nothing
Next j
strSimA = Application.Transpose(arrOut)
End Function
Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1 : returns the index of the best matching string
' returnType = 2 : returns the similarity metric
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim metric, bestMetric As Double
Dim i, iBest As Long
Const RETURN_STRING As Integer = 0
Const RETURN_INDEX As Integer = 1
Const RETURN_METRIC As Integer = 2
If IsMissing(returnType) Then returnType = RETURN_STRING
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
bestMetric = -1
iBest = -1
For i = 1 To rRng.Count
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(i)), sPairs2
metric = SimilarityMetric(sPairs1, sPairs2)
If metric > bestMetric Then
bestMetric = metric
iBest = i
End If
Set sPairs2 = Nothing
Next i
If iBest = -1 Then
strSimLookup = CVErr(xlErrValue)
Exit Function
End If
Select Case returnType
Case RETURN_STRING
strSimLookup = CStr(rRng(iBest))
Case RETURN_INDEX
strSimLookup = iBest
Case Else
strSimLookup = bestMetric
End Select
End Function
Public Function strSim(str1 As String, str2 As String) As Variant
Dim ilen, iLen1, ilen2 As Integer
iLen1 = Len(str1)
ilen2 = Len(str2)
If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1
strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))
End Function
Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl
Dim Words() As String
Dim word, nPairs, pair As Integer
Words = Split(str)
If UBound(Words) < 0 Then
Set pairColl = Nothing
Exit Sub
End If
For word = 0 To UBound(Words)
nPairs = Len(Words(word)) - 1
If nPairs > 0 Then
For pair = 1 To nPairs
pairColl.Add Mid(Words(word), pair, 2)
Next pair
End If
Next word
End Sub
Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else
Dim Intersect As Double
Dim Union As Double
Dim i, j As Long
If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
SimilarityMetric = CVErr(xlErrNA)
Exit Function
End If
Union = sPairs1.Count + sPairs2.Count
Intersect = 0
For i = 1 To sPairs1.Count
For j = 1 To sPairs2.Count
If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
Intersect = Intersect + 1
sPairs2.Remove j
Exit For
End If
Next j
Next i
SimilarityMetric = (2 * Intersect) / Union
End Function
I'm sorry, the answer was not invented by the author. This is a well known algorithm that was first present by Digital Equipment Corporation and is often referred to as shingling.
http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf
I translated Simon White's algorithm to PL/pgSQL. This is my contribution.
<!-- language: lang-sql -->
create or replace function spt1.letterpairs(in p_str varchar)
returns varchar as
$$
declare
v_numpairs integer := length(p_str)-1;
v_pairs varchar[];
begin
for i in 1 .. v_numpairs loop
v_pairs[i] := substr(p_str, i, 2);
end loop;
return v_pairs;
end;
$$ language 'plpgsql';
--===================================================================
create or replace function spt1.wordletterpairs(in p_str varchar)
returns varchar as
$$
declare
v_allpairs varchar[];
v_words varchar[];
v_pairsinword varchar[];
begin
v_words := regexp_split_to_array(p_str, '[[:space:]]');
for i in 1 .. array_length(v_words, 1) loop
v_pairsinword := spt1.letterpairs(v_words[i]);
if v_pairsinword is not null then
for j in 1 .. array_length(v_pairsinword, 1) loop
v_allpairs := v_allpairs || v_pairsinword[j];
end loop;
end if;
end loop;
return v_allpairs;
end;
$$ language 'plpgsql';
--===================================================================
create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as
$$
select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';
--===================================================================
create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
v_pairs1 varchar[];
v_pairs2 varchar[];
v_intersection integer;
v_union integer;
begin
v_pairs1 := wordletterpairs(upper(p_str1));
v_pairs2 := wordletterpairs(upper(p_str2));
v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1);
v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);
return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';
A version in beautiful Scala:
def pairDistance(s1: String, s2: String): Double = {
def strToPairs(s: String, acc: List[String]): List[String] = {
if (s.size < 2) acc
else strToPairs(s.drop(1),
if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
}
val lst1 = strToPairs(s1.toUpperCase, List())
val lst2 = strToPairs(s2.toUpperCase, List())
(2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)
}
String Similarity Metrics contains an overview of many different metrics used in string comparison (Wikipedia has an overview as well). Much of these metrics is implemented in a library simmetrics.
Yet another example of metric, not included in the given overview is for example compression distance (attempting to approximate the Kolmogorov's complexity), which can be used for a bit longer texts than the one you presented.
You might also consider looking at a much broader subject of Natural Language Processing. These R packages can get you started quickly (or at least give some ideas).
And one last edit - search the other questions on this subject at SO, there are quite a few related ones.
A faster PHP version of the algorithm:
/**
*
* #param $str
* #return mixed
*/
private static function wordLetterPairs ($str)
{
$allPairs = array();
// Tokenize the string and put the tokens/words into an array
$words = explode(' ', $str);
// For each word
for ($w = 0; $w < count($words); $w ++) {
// Find the pairs of characters
$pairsInWord = self::letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p ++) {
$allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
}
}
return array_values($allPairs);
}
/**
*
* #param $str
* #return array
*/
private static function letterPairs ($str)
{
$numPairs = mb_strlen($str) - 1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i ++) {
$pairs[$i] = mb_substr($str, $i, 2);
}
return $pairs;
}
/**
*
* #param $str1
* #param $str2
* #return float
*/
public static function compareStrings ($str1, $str2)
{
$pairs1 = self::wordLetterPairs(mb_strtolower($str1));
$pairs2 = self::wordLetterPairs(mb_strtolower($str2));
$union = count($pairs1) + count($pairs2);
$intersection = count(array_intersect($pairs1, $pairs2));
return (2.0 * $intersection) / $union;
}
For the data I had (approx 2300 comparisons) I had a running time of 0.58sec with Igal Alkon solution versus 0.35sec with mine.
Posting marzagao's answer in C99, inspired by these algorithms
double dice_match(const char *string1, const char *string2) {
//check fast cases
if (((string1 != NULL) && (string1[0] == '\0')) ||
((string2 != NULL) && (string2[0] == '\0'))) {
return 0;
}
if (string1 == string2) {
return 1;
}
size_t strlen1 = strlen(string1);
size_t strlen2 = strlen(string2);
if (strlen1 < 2 || strlen2 < 2) {
return 0;
}
size_t length1 = strlen1 - 1;
size_t length2 = strlen2 - 1;
double matches = 0;
int i = 0, j = 0;
//get bigrams and compare
while (i < length1 && j < length2) {
char a[3] = {string1[i], string1[i + 1], '\0'};
char b[3] = {string2[j], string2[j + 1], '\0'};
int cmp = strcmpi(a, b);
if (cmp == 0) {
matches += 2;
}
i++;
j++;
}
return matches / (length1 + length2);
}
Some tests based on the original article:
#include <stdio.h>
void article_test1() {
char *string1 = "FRANCE";
char *string2 = "FRENCH";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void article_test2() {
printf("====%s====\n", __func__);
char *string = "Healed";
char *ss[] = {"Heard", "Healthy", "Help",
"Herded", "Sealed", "Sold"};
int correct[] = {44, 55, 25, 40, 80, 0};
for (int i = 0; i < 6; ++i) {
printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
}
}
void multicase_test() {
char *string1 = "FRaNcE";
char *string2 = "fREnCh";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void gg_test() {
char *string1 = "GG";
char *string2 = "GGGGG";
printf("====%s====\n", __func__);
printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}
int main() {
article_test1();
article_test2();
multicase_test();
gg_test();
return 0;
}
Here is the R version:
get_bigrams <- function(str)
{
lstr = tolower(str)
bigramlst = list()
for(i in 1:(nchar(str)-1))
{
bigramlst[[i]] = substr(str, i, i+1)
}
return(bigramlst)
}
str_similarity <- function(str1, str2)
{
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
unionlen = length(pairs1) + length(pairs2)
hit_count = 0
for(x in 1:length(pairs1)){
for(y in 1:length(pairs2)){
if (pairs1[[x]] == pairs2[[y]])
hit_count = hit_count + 1
}
}
return ((2.0 * hit_count) / unionlen)
}
Building on Michael La Voie's awesome C# version, as per the request to make it an extension method, here is what I came up with. The primary benefit of doing it this way is that you can sort a Generic List by the percent match. For example, consider you have a string field named "City" in your object. A user searches for "Chester" and you want to return results in descending order of match. For example, you want literal matches of Chester to show up before Rochester. To do this, add two new properties to your object:
public string SearchText { get; set; }
public double PercentMatch
{
get
{
return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
}
}
Then on each object, set the SearchText to what the user searched for. Then you can sort it easily with something like:
zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);
Here's the slight modification to make it an extension method:
/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public static double PercentMatchTo(this string str1, string str2)
{
List<string> pairs1 = WordLetterPairs(str1.ToUpper());
List<string> pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
/// <summary>
/// Gets all letter pairs for each
/// individual word in the string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private static List<string> WordLetterPairs(string str)
{
List<string> AllPairs = new List<string>();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, #"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
/// <summary>
/// Generates an array containing every
/// two consecutive letters in the input string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private static string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
My JavaScript implementation takes a string or array of strings, and an optional floor (the default floor is 0.5). If you pass it a string, it will return true or false depending on whether or not the string's similarity score is greater than or equal to the floor. If you pass it an array of strings, it will return an array of those strings whose similarity score is greater than or equal to the floor, sorted by score.
Examples:
'Healed'.fuzzy('Sealed'); // returns true
'Healed'.fuzzy('Help'); // returns false
'Healed'.fuzzy('Help', 0.25); // returns true
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]
Here it is:
(function(){
var default_floor = 0.5;
function pairs(str){
var pairs = []
, length = str.length - 1
, pair;
str = str.toLowerCase();
for(var i = 0; i < length; i++){
pair = str.substr(i, 2);
if(!/\s/.test(pair)){
pairs.push(pair);
}
}
return pairs;
}
function similarity(pairs1, pairs2){
var union = pairs1.length + pairs2.length
, hits = 0;
for(var i = 0; i < pairs1.length; i++){
for(var j = 0; j < pairs2.length; j++){
if(pairs1[i] == pairs2[j]){
pairs2.splice(j--, 1);
hits++;
break;
}
}
}
return 2*hits/union || 0;
}
String.prototype.fuzzy = function(strings, floor){
var str1 = this
, pairs1 = pairs(this);
floor = typeof floor == 'number' ? floor : default_floor;
if(typeof(strings) == 'string'){
return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
}else if(strings instanceof Array){
var scores = {};
strings.map(function(str2){
scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
});
return strings.filter(function(str){
return scores[str] >= floor;
}).sort(function(a, b){
return scores[b] - scores[a];
});
}
};
})();
The Dice coefficient algorithm (Simon White / marzagao's answer) is implemented in Ruby in the
pair_distance_similar method in the amatch gem
https://github.com/flori/amatch
This gem also contains implementations of a number of approximate matching and string comparison algorithms: Levenshtein edit distance, Sellers edit distance, the Hamming distance, the longest common subsequence length, the longest common substring length, the pair distance metric, the Jaro-Winkler metric.
A Haskell version—feel free to suggest edits because I haven't done much Haskell.
import Data.Char
import Data.List
-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1
-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)
-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
where pairsS1 = wordLetterPairs $ map toLower s1
pairsS2 = wordLetterPairs $ map toLower s2
numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
totalLength = fromIntegral $ length pairsS1 + length pairsS2
Clojure:
(require '[clojure.set :refer [intersection]])
(defn bigrams [s]
(->> (split s #"\s+")
(mapcat #(partition 2 1 %))
(set)))
(defn string-similarity [a b]
(let [a-pairs (bigrams a)
b-pairs (bigrams b)
total-count (+ (count a-pairs) (count b-pairs))
match-count (count (intersection a-pairs b-pairs))
similarity (/ (* 2 match-count) total-count)]
similarity))
Here is another version of Similarity based in Sørensen–Dice index (marzagao's answer), this one written in C++11:
/*
* Similarity based in Sørensen–Dice index.
*
* Returns the Similarity between _str1 and _str2.
*/
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
// Base case: if some string is empty.
if (_str1.empty() || _str2.empty()) {
return 1.0;
}
auto str1 = upper_string(_str1);
auto str2 = upper_string(_str2);
// Base case: if the strings are equals.
if (str1 == str2) {
return 0.0;
}
// Base case: if some string does not have bigrams.
if (str1.size() < 2 || str2.size() < 2) {
return 1.0;
}
// Extract bigrams from str1
auto num_pairs1 = str1.size() - 1;
std::unordered_set<std::string> str1_bigrams;
str1_bigrams.reserve(num_pairs1);
for (unsigned i = 0; i < num_pairs1; ++i) {
str1_bigrams.insert(str1.substr(i, 2));
}
// Extract bigrams from str2
auto num_pairs2 = str2.size() - 1;
std::unordered_set<std::string> str2_bigrams;
str2_bigrams.reserve(num_pairs2);
for (unsigned int i = 0; i < num_pairs2; ++i) {
str2_bigrams.insert(str2.substr(i, 2));
}
// Find the intersection between the two sets.
int intersection = 0;
if (str1_bigrams.size() < str2_bigrams.size()) {
const auto it_e = str2_bigrams.end();
for (const auto& bigram : str1_bigrams) {
intersection += str2_bigrams.find(bigram) != it_e;
}
} else {
const auto it_e = str1_bigrams.end();
for (const auto& bigram : str2_bigrams) {
intersection += str1_bigrams.find(bigram) != it_e;
}
}
// Returns similarity coefficient.
return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}
Why not for a JavaScript implementation, I also explained the algorithm.
Algorithm
Input : France and French.
Map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:
FRANCE: {FR, RA, AN, NC, CE}
FRENCH: {FR, RE, EN, NC, CH}
Find there intersection:
Result:
Implementation
function similarity(s1, s2) {
const
set1 = pairs(s1.toUpperCase()), // [ FR, RA, AN, NC, CE ]
set2 = pairs(s2.toUpperCase()), // [ FR, RE, EN, NC, CH ]
intersection = set1.filter(x => set2.includes(x)); // [ FR, NC ]
// Tips: Instead of `2` multiply by `200`, To get percentage.
return (intersection.length * 2) / (set1.length + set2.length);
}
function pairs(input) {
const tokenized = [];
for (let i = 0; i < input.length - 1; i++)
tokenized.push(input.substring(i, 2 + i));
return tokenized;
}
console.log(similarity("FRANCE", "FRENCH"));
Ranking Results By ( Word - Similarity )
Sealed - 80%
Healthy - 55%
Heard - 44%
Herded - 40%
Help - 25%
Sold - 0%
From same original source.
What about Levenshtein distance, divided by the length of the first string (or alternatively divided my min/max/avg length of both strings)? That has worked for me so far.
Hey guys i gave this a try in javascript, but I'm new to it, anyone know faster ways to do it?
function get_bigrams(string) {
// Takes a string and returns a list of bigrams
var s = string.toLowerCase();
var v = new Array(s.length-1);
for (i = 0; i< v.length; i++){
v[i] =s.slice(i,i+2);
}
return v;
}
function string_similarity(str1, str2){
/*
Perform bigram comparison between two strings
and return a percentage match in decimal form
*/
var pairs1 = get_bigrams(str1);
var pairs2 = get_bigrams(str2);
var union = pairs1.length + pairs2.length;
var hit_count = 0;
for (x in pairs1){
for (y in pairs2){
if (pairs1[x] == pairs2[y]){
hit_count++;
}
}
}
return ((2.0 * hit_count) / union);
}
var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
console.log('Healed --- ' + word[w2])
console.log(string_similarity(w1,word[w2]));
}
I was looking for pure ruby implementation of the algorithm indicated by #marzagao's answer. Unfortunately, link indicated by #marzagao is broken. In #s01ipsist answer, he indicated ruby gem amatch where implementation is not in pure ruby. So I searchd a little and found gem fuzzy_match which has pure ruby implementation (though this gem use amatch) at here. I hope this will help someone like me.
**I've converted marzagao's answer to Java.**
import org.apache.commons.lang3.StringUtils; //Add a apache commons jar in pom.xml
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class SimilarityComparator {
public static void main(String[] args) {
String str0 = "Nischal";
String str1 = "Nischal";
double v = compareStrings(str0, str1);
System.out.println("Similarity betn " + str0 + " and " + str1 + " = " + v);
}
private static double compareStrings(String str1, String str2) {
List<String> pairs1 = wordLetterPairs(str1.toUpperCase());
List<String> pairs2 = wordLetterPairs(str2.toUpperCase());
int intersection = 0;
int union = pairs1.size() + pairs2.size();
for (String s : pairs1) {
for (int j = 0; j < pairs2.size(); j++) {
if (s.equals(pairs2.get(j))) {
intersection++;
pairs2.remove(j);
break;
}
}
}
return (2.0 * intersection) / union;
}
private static List<String> wordLetterPairs(String str) {
List<String> AllPairs = new ArrayList<>();
String[] Words = str.split("\\s");
for (String word : Words) {
if (StringUtils.isNotBlank(word)) {
String[] PairsInWord = letterPairs(word);
Collections.addAll(AllPairs, PairsInWord);
}
}
return AllPairs;
}
private static String[] letterPairs(String str) {
int numPairs = str.length() - 1;
String[] pairs = new String[numPairs];
for (int i = 0; i < numPairs; i++) {
try {
pairs[i] = str.substring(i, i + 2);
} catch (Exception e) {
pairs[i] = str.substring(i, numPairs);
}
}
return pairs;
}
}
Here's another c++ implementation that follows the original article, that minimizes dynamic memory allocations.
It obtains the same matching values in the examples, but I think it's better to take into account also the single character words.
//---------------------------------------------------------------------------
// Similarity based on Sørensen–Dice index
double calc_similarity( const std::string_view s1, const std::string_view s2 )
{
// Check banal cases
if( s1.empty() || s2.empty() )
{// Empty string is never similar to another
return 0.0;
}
else if( s1==s2 )
{// Perfectly equal
return 1.0;
}
else if( s1.length()==1 || s2.length()==1 )
{// Single (not equal) characters have zero similarity
return 0.0;
}
/////////////////////////////////////////////////////////////////////////
// Represents a pair of adjacent characters
class charpair_t final
{
public:
charpair_t(const char a, const char b) noexcept : c1(a), c2(b) {}
[[nodiscard]] bool operator==(const charpair_t& other) const noexcept { return c1==other.c1 && c2==other.c2; }
private:
char c1, c2;
};
/////////////////////////////////////////////////////////////////////////
// Collects and access a sequence of adjacent characters (skipping spaces)
class charpairs_t final
{
public:
charpairs_t(const std::string_view s)
{
assert( !s.empty() );
const std::size_t i_last = s.size()-1;
std::size_t i = 0;
chpairs.reserve(i_last);
while( i<i_last )
{
// Accepting also single-character words (the second is a space)
//if( !std::isspace(s[i]) ) chpairs.emplace_back( std::tolower(s[i]), std::tolower(s[i+1]) );
// Skipping single-character words (as in the original article)
if( std::isspace(s[i]) ) ; // Skip
else if( std::isspace(s[i+1]) ) ++i; // Skip also next
else chpairs.emplace_back( std::tolower(s[i]), std::tolower(s[i+1]) );
++i;
}
}
[[nodiscard]] auto size() const noexcept { return chpairs.size(); }
[[nodiscard]] auto cbegin() const noexcept { return chpairs.cbegin(); }
[[nodiscard]] auto cend() const noexcept { return chpairs.cend(); }
auto erase(std::vector<charpair_t>::const_iterator i) { return chpairs.erase(i); }
private:
std::vector<charpair_t> chpairs;
};
charpairs_t chpairs1{s1},
chpairs2{s2};
const double orig_avg_bigrams_count = 0.5 * static_cast<double>(chpairs1.size() + chpairs2.size());
std::size_t matching_bigrams_count = 0;
for( auto ib1=chpairs1.cbegin(); ib1!=chpairs1.cend(); ++ib1 )
{
for( auto ib2=chpairs2.cbegin(); ib2!=chpairs2.cend(); )
{
if( *ib1==*ib2 )
{
++matching_bigrams_count;
ib2 = chpairs2.erase(ib2); // Avoid to match the same character pair multiple times
break;
}
else ++ib2;
}
}
return static_cast<double>(matching_bigrams_count) / orig_avg_bigrams_count;
}

Resources