When trying to train a model with a dataset of around 3 million rows and 600 columns using the C5.0 CRAN package I get the following error:
Error in paste(apply(x, 1, paste, collapse = ","), collapse = "\n") : result would exceed 2^31-1 bytes
According to the repository owner's answer to a similar issue, this is due to an R limitation: a single character string cannot exceed 2^31 - 1 bytes.
Long answer ahead:
As stated in the question, the error occurs in the last line of the makeDataFile function (from the Cubist package, also used in C5.0), which concatenates all rows into one string. This string is only needed to pass the data to the C5.0 routine in C; R never operates on it. Since a C string is limited only by the machine's memory, my approach is to build the string in C instead. To do this, the R code passes a character vector of several strings, each below the length limit, and the C code concatenates them.
Leaving every row as a separate element and concatenating them in C with strcat in a loop turned out to be quite slow, so I wrote another R function (create_max_len_strings) that joins the rows into strings that are as long as possible (or close to it) without reaching the limit. That way strcat only has to be applied a few times, to these longer strings.
The last line of the original makeDataFile() function is therefore replaced so that each row becomes one element of a character vector, with a line break appended to each row so that the rows stay distinguishable once create_max_len_strings() joins them into longer strings:
makeDataFile.R:
create_max_len_strings <- function(original_vector) {
vector_length = length(original_vector)
nchars = sum(nchar(original_vector, type = "chars"))
## Check if the length of the string would reach 1900000000, which is close to the memory limitation
if(nchars >= 1900000000){
## Calculate how many strings we could create of the maximum length
nchunks = 0
while(nchars > 0){
nchars = nchars - 1900000000
nchunks = nchunks + 1
}
## Get the number of rows that would be contained in each string
chunk_size = vector_length/nchunks
## Get the rounded number of rows in each string
chunk_size = floor(chunk_size)
index = chunk_size
## Create a vector with the indexes of the rows that delimit each string
indexes_vector = c()
indexes_vector = append(indexes_vector, 0)
n = nchunks
while(n > 0){
indexes_vector = append(indexes_vector, index)
index = index + chunk_size
n = n - 1
}
## Get the last few rows if the division had remainder
remainder = vector_length %% nchunks
if (remainder != 0){
indexes_vector = append(indexes_vector, vector_length)
nchunks = nchunks + 1
}
## Create the strings pasting together the rows from the indexes in the indexes vector
strings_vector = c()
i = 2
while (i <= length(indexes_vector)){
## Sum 1 to the index_init so that the next string does not contain the last row of the previous string
index_init = indexes_vector[i-1] + 1
index_end = indexes_vector[i]
## Paste the rows from the vector from index_init to index_end
string <- paste0(original_vector[index_init:index_end], collapse="")
## Create vector containing the strings that were created
strings_vector <- append(strings_vector, string)
i = i + 1
}
}else {
strings_vector = paste0(original_vector, collapse="")
}
strings_vector
}
makeDataFile <- function(x, y, w = NULL) {
## Previous code stays the same
...
x = apply(x, 1, paste, collapse = ",")
x = paste(x, "\n", sep="")
char_vec = create_max_len_strings(x)
}
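Note that create_max_len_strings() splits the rows into chunks with an equal number of rows, which stays under the limit only if the rows have similar lengths. A minimal sketch of a greedy alternative that measures each row, written in Python for brevity; the name pack_rows and the small limit are made up for illustration, and len() counts characters rather than bytes, so this assumes single-byte text:

```python
def pack_rows(rows, max_chars):
    """Greedily join consecutive rows into chunks of at most max_chars characters."""
    chunks, current, current_len = [], [], 0
    for row in rows:
        if len(row) > max_chars:
            raise ValueError("a single row exceeds the chunk limit")
        # Start a new chunk when adding this row would cross the limit.
        if current and current_len + len(row) > max_chars:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(row)
        current_len += len(row)
    if current:
        chunks.append("".join(current))
    return chunks


rows = ["1,2,3\n", "4,5,6\n", "7,8,9\n"]
chunks = pack_rows(rows, max_chars=12)
print(chunks)  # ['1,2,3\n4,5,6\n', '7,8,9\n']
```

Joining the chunks back together gives exactly the original data, so the C side can concatenate them in the same way as before.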
CALLING C5.0
Now, in order to create the final string to pass to the c50() function in C, an intermediate function is created and called instead. In order to do this, the .C() statement that calls c50() in R is replaced with a .Call() statement calling this function, as .Call() allows for complex objects such as vectors to be passed to C. Also, it allows for the result to be returned in the variable result instead of having to pass back the variables tree, rules and output by reference. The result of calling C5.0 will be received in the character vector result containing the strings corresponding to the tree, rules and output in the first three positions:
C5.0.R:
C5.0.default <- function(x,
y,
trials = 1,
rules = FALSE,
weights = NULL,
control = C5.0Control(),
costs = NULL,
...) {
## Previous code stays the same
...
dataString <- makeDataFile(x, y, weights)
num_chars = sum(nchar(dataString, type = "chars"))
result <- .Call(
"call_C50",
as.character(namesString),
dataString,
as.character(num_chars), ## The length of the resulting string is passed as character because it is too long for an integer
as.character(costString),
as.logical(control$subset),
# -s "use the Subset option" var name: SUBSET
as.logical(rules),
# -r "use the Ruleset option" var name: RULES
## for the bands option, I'm not sure what the default should be.
as.integer(control$bands),
# -u "sort rules by their utility into bands" var name: UTILITY
## The documentation has two options for boosting:
## -b use the Boosting option with 10 trials
## -t trials ditto with specified number of trial
## I think we should use -t
as.integer(trials),
# -t : " ditto with specified number of trial", var name: TRIALS
as.logical(control$winnow),
# -w "winnow attributes before constructing a classifier" var name: WINNOW
as.double(control$sample),
# -S : use a sample of x% for training
# and a disjoint sample for testing var name: SAMPLE
as.integer(control$seed),
# -I : set the sampling seed value
as.integer(control$noGlobalPruning),
# -g: "turn off the global tree pruning stage" var name: GLOBAL
as.double(control$CF),
# -c: "set the Pruning CF value" var name: CF
## Also, for the number of minimum cases, I'm not sure what the
## default should be. The code looks like it dynamically sets the
## value (as opposed to a static, universal integer
as.integer(control$minCases),
# -m : "set the Minimum cases" var name: MINITEMS
as.logical(control$fuzzyThreshold),
# -p "use the Fuzzy thresholds option" var name: PROBTHRESH
as.logical(control$earlyStopping)
)
## Get the first three positions of the character vector that contain the tree, rules and output returned by C5.0 in C
result_tree = result[1]
result_rules = result[2]
result_output = result[3]
modelContent <- strsplit(
if (rules)
result_rules
else
result_tree, "\n"
)[[1]]
entries <- grep("^entries", modelContent, value = TRUE)
if (length(entries) > 0) {
actual <- as.numeric(substring(entries, 10, nchar(entries) - 1))
} else
actual <- trials
if (trials > 1) {
boostResults <- getBoostResults(result_output)
## This next line is here to avoid a false positive warning in R
## CMD check:
## * checking R code for possible problems ... NOTE
## C5.0.default: no visible binding for global variable 'Data'
Data <- NULL
size <-
if (!is.null(boostResults))
subset(boostResults, Data == "Training Set")$Size
else
NA
} else {
boostResults <- NULL
size <- length(grep("[0-9])$", strsplit(result_output, "\n")[[1]]))
}
out <- list(
names = namesString,
cost = costString,
costMatrix = costs,
caseWeights = !is.null(weights),
control = control,
trials = c(Requested = trials, Actual = actual),
rbm = rules,
boostResults = boostResults,
size = size,
dims = dim(x),
call = funcCall,
levels = levels(y),
output = result_output,
tree = result_tree,
predictors = colnames(x),
rules = result_rules
)
class(out) <- "C5.0"
out
}
Now on to the C code. The function call_C50() acts as an intermediary between the R code and the C code: it concatenates the elements of the dataString vector into the single string needed by the C function c50(), accessing each element with CHAR(STRING_ELT(x, i)) and appending it with strcat. The remaining arguments are then cast to their respective C types, and the c50() function in top.c (where call_C50() should also be placed) is called. The result of c50() is returned to R as a character vector holding the strings for the tree, rules and output, one per position.
Lastly, the c50() function is left essentially as is, except for the variables treev, rulesv and outputv: since these values are now returned through .Call() instead of being passed by reference, they no longer need to be arguments of the function. As they are all strings, they can be returned in a single array, c50_return, with one string per position.
top.c:
SEXP call_C50(SEXP namesString, SEXP data_vec, SEXP datavec_len, SEXP costString, SEXP subset, SEXP rules, SEXP bands, SEXP trials, SEXP winnow, SEXP sample,
SEXP seed, SEXP noGlobalPruning, SEXP CF, SEXP minCases, SEXP fuzzyThreshold, SEXP earlyStopping){
char* string;
char* concat;
long n = 0;
long size;
int i;
char* eptr;
// Get the length of the data vector
n = length(data_vec);
// Get the string indicating the length of the final string
char* size_str = malloc((strlen(CHAR(STRING_ELT(datavec_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(datavec_len, 0)));
// Turn the string to long
size = strtol(size_str, &eptr, 10);
// The temporary copy of the length string is no longer needed
free(size_str);
// Allocate memory for the number of characters indicated by datavec_len
string = malloc((size+1)*sizeof(char));
// Copy the first element of data_vec into the string variable
strcpy(string, CHAR(STRING_ELT(data_vec, 0)));
// Loop over the data vector until all elements are concatenated in the string variable
for (i = 1; i < n; i++) {
strcat(string, CHAR(STRING_ELT(data_vec, i)));
}
// Copy the value of namesString into a char*
char* namesv = malloc((strlen(CHAR(STRING_ELT(namesString, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(namesString, 0)));
// Copy the value of costString into a char*
char* costv = malloc((strlen(CHAR(STRING_ELT(costString, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(costString, 0)));
// Call c50() function casting the rest of arguments into their respective C types
char** c50_return = c50(namesv, string, costv, asLogical(subset), asLogical(rules), asInteger(bands), asInteger(trials), asLogical(winnow), asReal(sample), asInteger(seed), asInteger(noGlobalPruning), asReal(CF), asInteger(minCases), asLogical(fuzzyThreshold), asLogical(earlyStopping));
free(string);
free(namesv);
free(costv);
// Create a character vector to be returned to the C5.0 R function
SEXP out = PROTECT(allocVector(STRSXP, 3));
SET_STRING_ELT(out, 0, mkChar(c50_return[0]));
SET_STRING_ELT(out, 1, mkChar(c50_return[1]));
SET_STRING_ELT(out, 2, mkChar(c50_return[2]));
UNPROTECT(1);
return out;
}
static char** c50(char *namesv, char *datav, char *costv, int subset,
int rules, int utility, int trials, int winnow,
double sample, int seed, int noGlobalPruning, double CF,
int minCases, int fuzzyThreshold, int earlyStopping) {
int val; /* Used by setjmp/longjmp for implementing rbm_exit */
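char ** c50_return = malloc(3 * sizeof(char*));
// Default every slot to an empty string so that call_C50() never reads an
// uninitialized pointer if a lookup fails or c50main() exits early
c50_return[0] = "";
c50_return[1] = "";
c50_return[2] = "";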
// Initialize the globals to the values that the c50
// program would have at the start of execution
initglobals();
// Set globals based on the arguments. This is analogous
// to parsing the command line in the c50 program.
setglobals(subset, rules, utility, trials, winnow, sample, seed,
noGlobalPruning, CF, minCases, fuzzyThreshold, earlyStopping,
costv);
// Handles the strbufv data structure
rbm_removeall();
// Deallocates memory allocated by NewCase.
// Not necessary since it's also called at the end of this function,
// but it doesn't hurt, and I'm feeling paranoid.
FreeCases();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
// Create a strbuf using *namesv as the buffer.
// Note that this is a readonly strbuf since we can't
// extend *namesv.
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv));
// Register this strbuf using the name "undefined.names"
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
// Create a strbuf using *datav and register it as "undefined.data"
STRBUF *sb_datav = strbuf_create_full(datav, strlen(datav));
// XXX why is sb_datav copied? was that part of my debugging?
// XXX or is this the cause of the leak?
if (rbm_register(strbuf_copy(sb_datav), "undefined.data", 0) < 0) {
error("undefined data already exists");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
/*
* We need to initialize rbm_buf before calling any code that
* might call exit/rbm_exit.
*/
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
c50main();
if (rules == 0) {
// Get the contents of the the tree file
STRBUF *treebuf = rbm_lookup("undefined.tree");
if (treebuf != NULL) {
char *treeString = strbuf_getall(treebuf);
c50_return[0] = R_alloc(strlen(treeString) + 1, 1);
strcpy(c50_return[0], treeString);
c50_return[1] = "";
} else {
// XXX Should *treev be assigned something in this case?
// XXX Throw an error?
}
} else {
// Get the contents of the the rules file
STRBUF *rulesbuf = rbm_lookup("undefined.rules");
if (rulesbuf != NULL) {
char *rulesString = strbuf_getall(rulesbuf);
c50_return[1] = R_alloc(strlen(rulesString) + 1, 1);
strcpy(c50_return[1], rulesString);
c50_return[0] = "";
} else {
// XXX Should *rulesv be assigned something in this case?
// XXX Throw an error?
}
}
} else {
Rprintf("c50 code called exit with value %d\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
c50_return[2] = R_alloc(strlen(outputString) + 1, 1);
strcpy(c50_return[2], outputString);
// Deallocates memory allocated by NewCase
FreeCases();
// We reinitialize the globals on exit out of general paranoia
initglobals();
return c50_return;
}
IMPORTANT: if the resulting string is longer than 2147483647 characters (INT_MAX), you will also need to change the declarations of the variables i and j in the function strbuf_gets() in strbuf.c. This function iterates over each position of the string, so incrementing an int index past INT_MAX to reach those positions causes a segmentation fault. I suggest declaring them as long (or another 64-bit integer type on platforms where long is only 32 bits) to avoid this issue.
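To see why a 32-bit index breaks here, this small Python sketch uses ctypes to imitate fixed-width C integers. It only illustrates the typical wraparound; in C itself, signed overflow is undefined behavior, so the exact failure may differ:

```python
import ctypes

INT_MAX = 2**31 - 1

# A 32-bit signed counter wraps to a negative value one step past INT_MAX.
i32 = ctypes.c_int32(INT_MAX + 1).value
# A 64-bit signed counter simply keeps counting.
i64 = ctypes.c_int64(INT_MAX + 1).value

print(i32)  # -2147483648
print(i64)  # 2147483648
```

A negative index into the string buffer points outside the allocated memory, which matches the segmentation fault described above.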
C5.0 PREDICTIONS
However, as the makeDataFile function is not only used to create the model but also to pass the data to the predictions() function, this function will also have to be modified. Just like previously, the .C() statement in predict.C5.0() used to call predictions() will be replaced with a .Call() statement in order to be able to pass the character vector to C, and the result will be returned in the result variable instead of being passed by reference:
predict.C5.0.R:
predict.C5.0 <- function (object,
newdata = NULL,
trials = object$trials["Actual"],
type = "class",
na.action = na.pass,
...) {
## Previous code stays the same
...
caseString <- makeDataFile(x = newdata, y = NULL)
num_chars = sum(nchar(caseString, type = "chars"))
## When passing trials to the C code, convert to
## zero if the original version of trials is used
if (trials <= 0)
stop("'trials' should be a positive integer", call. = FALSE)
if (trials == object$trials["Actual"])
trials <- 0
## Add trials (not object$trials) as an argument
results <- .Call(
"call_predictions",
caseString,
as.character(num_chars),
as.character(object$names),
as.character(object$tree),
as.character(object$rules),
as.character(object$cost),
pred = integer(nrow(newdata)),
confidence = double(length(object$levels) * nrow(newdata)),
trials = as.integer(trials)
)
predictions = as.numeric(unlist(results[1]))
confidence = as.numeric(unlist(results[2]))
output = as.character(results[3])
if(any(grepl("Error limit exceeded", output)))
stop(output, call. = FALSE)
if (type == "class") {
out <- factor(object$levels[predictions], levels = object$levels)
} else {
out <-
matrix(confidence,
ncol = length(object$levels),
byrow = TRUE)
if (!is.null(rownames(newdata)))
rownames(out) <- rownames(newdata)
colnames(out) <- object$levels
}
out
}
In the file top.c, the predictions() function is modified to receive the variables passed by the .Call() statement: just as before, the caseString vector is concatenated into a single string and the remaining variables are cast to their respective types. Here the variables pred and confidence are also received as integer and double vectors, so they need to be converted to int* and double*. The rest of the function is left as it was in order to compute the predictions, and the resulting predv, confidencev and output variables are placed in the first three positions of a list, respectively.
top.c:
SEXP call_predictions(SEXP caseString, SEXP case_len, SEXP names, SEXP tree, SEXP rules, SEXP cost, SEXP pred, SEXP confidence, SEXP trials){
char* casev;
char* outputv = "";
char* eptr;
char* size_str = malloc((strlen(CHAR(STRING_ELT(case_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(case_len, 0)));
long size = strtol(size_str, &eptr, 10);
// The temporary copy of the length string is no longer needed
free(size_str);
casev = malloc((size+1)*sizeof(char));
strcpy(casev, CHAR(STRING_ELT(caseString, 0)));
int n = length(caseString);
for (int i = 1; i < n; i++) {
strcat(casev, CHAR(STRING_ELT(caseString, i)));
}
char* namesv = malloc((strlen(CHAR(STRING_ELT(names, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(names, 0)));
char* treev = malloc((strlen(CHAR(STRING_ELT(tree, 0)))+1)*sizeof(char));
strcpy(treev, CHAR(STRING_ELT(tree, 0)));
char* rulesv = malloc((strlen(CHAR(STRING_ELT(rules, 0)))+1)*sizeof(char));
strcpy(rulesv, CHAR(STRING_ELT(rules, 0)));
char* costv = malloc((strlen(CHAR(STRING_ELT(cost, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(cost, 0)));
int variable;
int* predv = &variable;
int npred = length(pred);
predv = malloc((npred+1)*sizeof(int));
for (int i = 0; i < npred; i++) {
predv[i] = INTEGER(pred)[i];
}
double variable1;
double* confidencev = &variable1;
int nconf = length(confidence);
confidencev = malloc((nconf+1)*sizeof(double));
for (int i = 0; i < nconf; i++) {
confidencev[i] = REAL(confidence)[i];
}
int* trialsv = &variable;
*trialsv = asInteger(trials);
/* Original code for predictions starts */
int val;
// Announce ourselves for testing
// Rprintf("predictions called\n");
// Initialize the globals
initglobals();
// Handles the strbufv data structure
rbm_removeall();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
STRBUF *sb_cases = strbuf_create_full(casev, strlen(casev));
if (rbm_register(sb_cases, "undefined.cases", 0) < 0) {
error("undefined.cases already exists");
}
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv));
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
if (strlen(treev)) {
STRBUF *sb_treev = strbuf_create_full(treev, strlen(treev));
if (rbm_register(sb_treev, "undefined.tree", 0) < 0) {
error("undefined.tree already exists");
}
} else if (strlen(rulesv)) {
STRBUF *sb_rulesv = strbuf_create_full(rulesv, strlen(rulesv));
if (rbm_register(sb_rulesv, "undefined.rules", 0) < 0) {
error("undefined.rules already exists");
}
setrules(1);
} else {
error("either a tree or rules must be provided");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
// Rprintf("\n\nCalling rpredictmain\n");
rpredictmain(trialsv, predv, confidencev);
// Rprintf("predict finished\n\n");
} else {
// Rprintf("predict code called exit with value %d\n\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
char *output = R_alloc(strlen(outputString) + 1, 1);
strcpy(output, outputString);
// We reinitialize the globals on exit out of general paranoia
initglobals();
/* Original code for predictions ends */
free(namesv);
free(treev);
free(rulesv);
free(costv);
SEXP predx = PROTECT(allocVector(INTSXP, npred));
for (int i = 0; i < npred; i++) {
INTEGER(predx)[i] = predv[i];
}
SEXP confidencex = PROTECT(allocVector(REALSXP, nconf));
for (int i = 0; i < nconf; i++) {
REAL(confidencex)[i] = confidencev[i];
}
SEXP outputx = PROTECT(allocVector(STRSXP, 1));
SET_STRING_ELT(outputx, 0, mkChar(output));
SEXP vector = PROTECT(allocVector(VECSXP, 3));
SET_VECTOR_ELT(vector, 0, predx);
SET_VECTOR_ELT(vector, 1, confidencex);
SET_VECTOR_ELT(vector, 2, outputx);
UNPROTECT(4);
free(predv);
free(confidencev);
return vector;
}
The code below works as expected, but the map lambda is impure. How could I refactor this to make it pure? (There is no need to stick to map; reduce or anything else is fine, I just want it to be pure.)
val entries = listOf(
Pair(LocalDate.now().minusDays(2), 1.0),
Pair(LocalDate.now().minusDays(1), 2.0),
Pair(LocalDate.now().minusDays(0), 3.0)
)
private fun buildSumSchedule(entries: List<Pair<LocalDate, Double>>): Map<LocalDate, Double> {
var runningSum = 0.0
return entries.sortedBy { it.first }.map {
runningSum += it.second
it.copy(second = runningSum)
}.toMap()
}
val sumSchedule = buildSumSchedule(entries)
What you want here is scanReduce: after sorting, it hands each step the previously accumulated item together with the current one.
@ExperimentalStdlibApi
private fun buildSumSchedule(entries: List<Pair<LocalDate, Double>>): Map<LocalDate, Double> =
entries.sortedBy { it.first }.scanReduce { acc, pair ->
pair.copy(second = acc.second + pair.second)
}.toMap()
From Kotlin 1.4.0 onward, use runningReduce, which replaces scanReduce:
private fun buildSumSchedule(entries: List<Pair<LocalDate, Double>>): Map<LocalDate, Double> =
entries.sortedBy { it.first }.runningReduce { acc, pair ->
pair.copy(second = acc.second + pair.second)
}.toMap()
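For comparison, the same pure running-sum idea can be sketched in Python with itertools.accumulate. The function name build_sum_schedule and the fixed dates are only for illustration:

```python
from datetime import date, timedelta
from itertools import accumulate


def build_sum_schedule(entries):
    """Return {date: running total} without mutating any outer state."""
    ordered = sorted(entries, key=lambda e: e[0])
    dates = [d for d, _ in ordered]
    totals = accumulate(v for _, v in ordered)
    return dict(zip(dates, totals))


today = date(2024, 1, 3)  # fixed date so the output is reproducible
entries = [
    (today - timedelta(days=1), 2.0),
    (today - timedelta(days=2), 1.0),
    (today, 3.0),
]
schedule = build_sum_schedule(entries)
print(schedule[today])  # 6.0
```

As in the Kotlin version, each output entry keeps its own date and carries the sum of all values up to and including that date.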
I am having problems assigning a pointer to a map. Maybe this is a bug in Go? Or maybe I am just doing something wrong. The code is here on the playground as well, https://play.golang.org/p/p0NosPtkptz
Here is some super simplified code that illustrates the problem. I am creating an object called collections that has two collection objects in it. I am then looping through those collections and assigning them to a map where the key in the map is the collection ID.
package main
import (
"fmt"
)
type collection struct {
ID string
Name string
}
type collections struct {
Collections []collection
}
type cache struct {
Index int
Collections map[string]*collection
}
func main() {
var c cache
c.Collections = make(map[string]*collection)
// Create 2 Collections
var col1, col2 collection
col1.ID = "aa"
col1.Name = "Text A"
col2.ID = "bb"
col2.Name = "Test B"
// Add to Collections
var cols collections
cols.Collections = append(cols.Collections, col1)
cols.Collections = append(cols.Collections, col2)
fmt.Println("DEBUG Collections Type", cols)
i := 0
for k, v := range cols.Collections {
c.Index = i
c.Collections[v.ID] = &v
fmt.Println("DEBUG k", k)
fmt.Println("DEBUG v", v)
i++
}
fmt.Println("Collection 1", c.Collections["aa"].ID)
fmt.Println("Collection 2", c.Collections["bb"].ID)
fmt.Println(c)
}
The output from this playground code looks like:
DEBUG Collections Type {[{aa Text A} {bb Test B}]}
DEBUG k 0
DEBUG v {aa Text A}
DEBUG k 1
DEBUG v {bb Test B}
Collection 1 bb
Collection 2 bb
{1 map[aa:0x1040a0f0 bb:0x1040a0f0]}
So it seems like the map is for some reason getting the same pointer for each entry. All of the "DEBUG" lines print out what I would expect. However, the three print lines at the very end, do not. Collection 1 should be "aa" not "bb".
When you store &v in c.Collections[v.ID], you are storing the address of the loop variable v, which is the same on every iteration. (This applies to Go versions before 1.22; since Go 1.22 each iteration gets its own v.)
After the loop, that address holds the last value in the list, which is why every key yields bb Test B.
Print these values and you will see the same address:
fmt.Printf("%p\n", c.Collections["aa"])
fmt.Printf("%p\n", c.Collections["bb"])
Copying v to a new variable inside the loop body solves the issue: each iteration then stores a new, unique address in the cache.
for k, v := range cols.Collections {
coll := v
c.Collections[v.ID] = &coll
fmt.Println("DEBUG k", k)
fmt.Println("DEBUG v", v)
i++
}
Unfortunately, the code in the first answer has a flaw: it stores a pointer to the local copy coll, not to the element in the slice:
for k, v := range cols.Collections {
coll := v
c.Collections[v.ID] = &coll
Just try changing a field of a collection:
cols.Collections[0].Name = "Text C"
fmt.Println("Collection 1", cols.Collections[0].Name)
fmt.Println("Collection 1", c.Collections["aa"].Name)
// Collection 1 Text C
// Collection 1 Text A
The following code, however, prints the expected result:
for k, v := range cols.Collections {
c.Index = i
p := &cols.Collections[k]
c.Collections[v.ID] = p
....
cols.Collections[0].Name = "Text C"
fmt.Println("Collection 1", cols.Collections[0].Name)
fmt.Println("Collection 1", c.Collections["aa"].Name)
// Collection 1 Text C
// Collection 1 Text C
The reason for the identical pointers is that the map stores the address of the variable v, not the addresses of the slice elements:
for k, v := range cols.Collections {
......
c.Collections[v.ID] = &v
One correct solution is to store pointers in the slice rather than values:
type collections struct {
Collections []*collection
}
cols.Collections = append(cols.Collections, &col1)
cols.Collections = append(cols.Collections, &col2)
Or take the address of the slice element (as shown above):
for k, v := range cols.Collections {
c.Index = i
p := &cols.Collections[k]
c.Collections[v.ID] = p
Encrypt the message NEED HELP by translating the letters into numbers, applying the
encryption function f (p) = (3p + 7) mod 26, and then translating the numbers back into
letters.
Ans: Encrypted form: UTTQ CTOA.
Could someone please explain to me how they got this answer?
First, assign a number to each letter:
A = 0; B = 1; C = 2; ...; Z = 25
Then apply the function to each number and convert the result back to a letter:
N is 13, so 13 * 3 = 39, and 39 + 7 = 46
46 mod 26 = 20
Converting back, 20 = U
Doing this for every letter of the message gives the encrypted form.
Here is C# code to do this:
private static string encrypt(string s)
{
char[] tmp = new char[s.Length];
int i = 0;
foreach (char c in s)
{
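// Encrypt uppercase letters only; leave spaces and other characters unchanged
tmp[i] = (c >= 'A' && c <= 'Z') ? (char)((((c - 'A') * 3 + 7) % 26) + 'A') : c;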
i++;
}
return new string(tmp);
}
And here is a decrypt function (messy, but it works: applying f six times maps every letter back to itself, so five more encryptions undo the first one):
private static string decrypt(string s)
{
string res = s;
for (int i = 0; i < 5; i++)
res = encrypt(res);
return res;
}
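A more direct way to decrypt is to invert the function: since 3 * 9 = 27 = 1 mod 26, the inverse of f is p = 9 * (c - 7) mod 26. A minimal Python sketch of both directions (the constant names are my own, and pow with a negative exponent needs Python 3.8 or later):

```python
A, B, M = 3, 7, 26
A_INV = pow(A, -1, M)  # modular inverse of 3 mod 26, i.e. 9


def encrypt(text):
    """Apply f(p) = (3p + 7) mod 26 to uppercase letters; keep other characters."""
    return "".join(
        chr((A * (ord(c) - 65) + B) % M + 65) if "A" <= c <= "Z" else c
        for c in text
    )


def decrypt(text):
    """Apply the inverse p = 9 * (c - 7) mod 26 to uppercase letters."""
    return "".join(
        chr(A_INV * (ord(c) - 65 - B) % M + 65) if "A" <= c <= "Z" else c
        for c in text
    )


cipher = encrypt("NEED HELP")
print(cipher)           # UTTQ CTOA
print(decrypt(cipher))  # NEED HELP
```

This gives the same result as encrypting five more times, but in a single pass.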
I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (levenshtein distance, soundex, etc).
For example,
Given string A: "Robert",
Then string B: "Amy Robertson"
would be a better match than
String C: "Richard"
Also, preferably, this algorithm should be language agnostic (also works in languages other than English).
Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes:
http://www.catalysoft.com/articles/StrikeAMatch.html
Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment by Mark Wong-VanHaren) so that I can use it in my PostgreSQL queries:
CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '
str1.downcase!
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
|pair| pair.include? " "}
str2.downcase!
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
|pair| pair.include? " "}
union = pairs1.size + pairs2.size
intersection = 0
pairs1.each do |p1|
0.upto(pairs2.size-1) do |i|
if p1 == pairs2[i]
intersection += 1
pairs2.slice!(i)
break
end
end
end
(2.0 * intersection) / union
' LANGUAGE 'plruby';
Works like a charm!
marzagao's answer is great. I converted it to C# so I thought I'd post it here:
/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
/// <summary>
/// Compares the two strings based on letter pair matches
/// </summary>
/// <param name="str1"></param>
/// <param name="str2"></param>
/// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
public double CompareStrings(string str1, string str2)
{
List<string> pairs1 = WordLetterPairs(str1.ToUpper());
List<string> pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
/// <summary>
/// Gets all letter pairs for each
/// individual word in the string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private List<string> WordLetterPairs(string str)
{
List<string> AllPairs = new List<string>();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, @"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
/// <summary>
/// Generates an array containing every
/// two consecutive letters in the input string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
}
Here is another version of marzagao's answer, this one written in Python:
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]
def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                # Remove the matched pair so it cannot be counted twice
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
if __name__ == "__main__":
    # Run a test using the example taken from:
    # http://www.catalysoft.com/articles/StrikeAMatch.html
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']
    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()
A shorter version of John Rutledge's answer (it uses sets, so a bigram that occurs several times in a string is counted only once):
def get_bigrams(string):
    '''
    Takes a string and returns a set of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in range(len(s) - 1)}
def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))
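If you want the brevity of the set version but still need repeated pairs to be matched one-for-one (as the C# version does by removing each match), collections.Counter can act as a multiset. This is a sketch rather than a port of any answer above; it takes bigrams per whitespace-separated word and, as my own assumption, returns 0.0 when neither string has any bigrams:

```python
from collections import Counter


def bigrams(text):
    """Count adjacent character pairs within each whitespace-separated word."""
    pairs = Counter()
    for word in text.lower().split():
        pairs.update(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def string_similarity(a, b):
    pa, pb = bigrams(a), bigrams(b)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 0.0  # assumption: strings with no bigrams score 0
    shared = sum((pa & pb).values())  # multiset intersection
    return 2.0 * shared / total


print(string_similarity("Healed", "Sealed"))  # 0.8
print(string_similarity("GGGG", "GG"))        # 0.5
```

With the example from the question, "Robert" scores higher against "Amy Robertson" than against "Richard", which shares no letter pairs with it.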
Here's my PHP implementation of the StrikeAMatch algorithm by Simon White suggested above. The advantages (as stated in the link) are:
A true reflection of lexical similarity - strings with small differences should be recognised as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.
A robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognised as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognised as dissimilar.
Language Independence - the algorithm should work not only in English, but in many different languages.
<?php
/**
* LetterPairSimilarity algorithm implementation in PHP
* @author Igal Alkon
* @link http://www.catalysoft.com/articles/StrikeAMatch.html
*/
class LetterPairSimilarity
{
/**
* @param $str
* @return mixed
*/
private function wordLetterPairs($str)
{
$allPairs = array();
// Tokenize the string and put the tokens/words into an array
$words = explode(' ', $str);
// For each word
for ($w = 0; $w < count($words); $w++)
{
// Find the pairs of characters
$pairsInWord = $this->letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p++)
{
$allPairs[] = $pairsInWord[$p];
}
}
return $allPairs;
}
/**
* @param $str
* @return array
*/
private function letterPairs($str)
{
$numPairs = mb_strlen($str)-1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i++)
{
$pairs[$i] = mb_substr($str,$i,2);
}
return $pairs;
}
/**
* @param $str1
* @param $str2
* @return float
*/
public function compareStrings($str1, $str2)
{
// mb_strtoupper keeps case folding consistent with the mb_* calls in letterPairs()
$pairs1 = $this->wordLetterPairs(mb_strtoupper($str1));
$pairs2 = $this->wordLetterPairs(mb_strtoupper($str2));
$intersection = 0;
$union = count($pairs1) + count($pairs2);
for ($i=0; $i < count($pairs1); $i++)
{
$pair1 = $pairs1[$i];
$pairs2 = array_values($pairs2);
for($j = 0; $j < count($pairs2); $j++)
{
$pair2 = $pairs2[$j];
if ($pair1 === $pair2)
{
$intersection++;
unset($pairs2[$j]);
break;
}
}
}
return (2.0*$intersection)/$union;
}
}
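The word-order and anagram properties quoted above can be checked with a few lines. This is only a sketch in Python under the same per-word pairing idea as the PHP class; `word_bigrams` and `compare_strings` are illustrative names, not part of any library:

```python
from collections import Counter

def word_bigrams(text):
    """Collect bigrams word by word, so no pair spans a space."""
    counts = Counter()
    for word in text.upper().split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

def compare_strings(str1, str2):
    """Dice coefficient over the per-word bigram multisets."""
    pairs1, pairs2 = word_bigrams(str1), word_bigrams(str2)
    total = sum(pairs1.values()) + sum(pairs2.values())
    if total == 0:
        return 0.0  # assumption: no bigrams at all means no similarity
    return 2.0 * sum((pairs1 & pairs2).values()) / total

# Same words in a different order: the bigram multisets are identical
print(compare_strings("web database", "database web"))  # 1.0
# An anagram shares few adjacent pairs, so it scores low
print(compare_strings("listen", "silent"))  # 0.2
```

Because pairs are built inside each word, swapping words leaves the multiset unchanged, while shuffling letters destroys most of the adjacent pairs.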
This discussion has been really helpful, thanks. I converted the algorithm to VBA for use with Excel and wrote a few versions of a worksheet function, one for simple comparison of a pair of strings, the other for comparing one string to a range/array of strings. The strSimLookup version returns either the last best match as a string, array index, or similarity metric.
This implementation produces the same results listed in the Amazon example on Simon White's website with a few minor exceptions on low-scoring matches; not sure where the difference creeps in, could be VBA's Split function, but I haven't investigated as it's working fine for my purposes.
'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source: http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010
Option Explicit
Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similarity metric
'between two strings.
'NOTE: This version is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over and over again; use the array-optimized version for this case.
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Set sPairs1 = New Collection
Set sPairs2 = New Collection
WordLetterPairs str1, sPairs1
WordLetterPairs str2, sPairs2
stringSimilarity = SimilarityMetric(sPairs1, sPairs2)
Set sPairs1 = Nothing
Set sPairs2 = Nothing
End Function
Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim arrOut As Variant
Dim l As Long, j As Long
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
l = rRng.Count
ReDim arrOut(1 To l)
For j = 1 To l
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(j)), sPairs2
arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
Set sPairs2 = Nothing
Next j
strSimA = Application.Transpose(arrOut)
End Function
Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return the best match, its index, or the similarity metric
'(depending on the returnType parameter) between str1 and the strings in rRng
' returnType = 0 or omitted: returns the best matching string
' returnType = 1 : returns the index of the best matching string
' returnType = 2 : returns the similarity metric
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim metric, bestMetric As Double
Dim i, iBest As Long
Const RETURN_STRING As Integer = 0
Const RETURN_INDEX As Integer = 1
Const RETURN_METRIC As Integer = 2
If IsMissing(returnType) Then returnType = RETURN_STRING
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
bestMetric = -1
iBest = -1
For i = 1 To rRng.Count
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(i)), sPairs2
metric = SimilarityMetric(sPairs1, sPairs2)
If metric > bestMetric Then
bestMetric = metric
iBest = i
End If
Set sPairs2 = Nothing
Next i
If iBest = -1 Then
strSimLookup = CVErr(xlErrValue)
Exit Function
End If
Select Case returnType
Case RETURN_STRING
strSimLookup = CStr(rRng(iBest))
Case RETURN_INDEX
strSimLookup = iBest
Case Else
strSimLookup = bestMetric
End Select
End Function
Public Function strSim(str1 As String, str2 As String) As Variant
Dim ilen, iLen1, ilen2 As Integer
iLen1 = Len(str1)
ilen2 = Len(str2)
If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1
strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))
End Function
Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl
Dim Words() As String
Dim word, nPairs, pair As Integer
Words = Split(str)
If UBound(Words) < 0 Then
Set pairColl = Nothing
Exit Sub
End If
For word = 0 To UBound(Words)
nPairs = Len(Words(word)) - 1
If nPairs > 0 Then
For pair = 1 To nPairs
pairColl.Add Mid(Words(word), pair, 2)
Next pair
End If
Next word
End Sub
Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else
Dim Intersect As Double
Dim Union As Double
Dim i, j As Long
If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
SimilarityMetric = CVErr(xlErrNA)
Exit Function
End If
Union = sPairs1.Count + sPairs2.Count
Intersect = 0
For i = 1 To sPairs1.Count
For j = 1 To sPairs2.Count
If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
Intersect = Intersect + 1
sPairs2.Remove j
Exit For
End If
Next j
Next i
SimilarityMetric = (2 * Intersect) / Union
End Function
I'm sorry, but the algorithm was not invented by the author. It is a well-known technique that was first presented by Digital Equipment Corporation and is often referred to as shingling.
http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf
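For readers who have not met the term, here is a rough sketch of shingling as it is commonly described: a document is reduced to its set of contiguous word w-grams ("shingles"), and two documents are compared through the Jaccard ratio of those sets. The function names and the default `w=2` are assumptions made for this example, not taken from the linked report:

```python
def shingles(text, w=2):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=2):
    """Jaccard ratio of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # assumption: texts too short to shingle score 0
    return len(sa & sb) / len(sa | sb)

print(resemblance("the quick brown fox", "the quick brown dog"))  # 0.5
```

The letter-pair metric in this thread is the same idea applied to characters instead of words, with the Dice ratio in place of the Jaccard ratio.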
I translated Simon White's algorithm to PL/pgSQL. This is my contribution.
create or replace function spt1.letterpairs(in p_str varchar)
returns varchar[] as
$$
declare
v_numpairs integer := length(p_str)-1;
v_pairs varchar[];
begin
for i in 1 .. v_numpairs loop
v_pairs[i] := substr(p_str, i, 2);
end loop;
return v_pairs;
end;
$$ language 'plpgsql';
--===================================================================
create or replace function spt1.wordletterpairs(in p_str varchar)
returns varchar[] as
$$
declare
v_allpairs varchar[];
v_words varchar[];
v_pairsinword varchar[];
begin
v_words := regexp_split_to_array(p_str, '[[:space:]]');
for i in 1 .. array_length(v_words, 1) loop
v_pairsinword := spt1.letterpairs(v_words[i]);
if v_pairsinword is not null then
for j in 1 .. array_length(v_pairsinword, 1) loop
v_allpairs := v_allpairs || v_pairsinword[j];
end loop;
end if;
end loop;
return v_allpairs;
end;
$$ language 'plpgsql';
--===================================================================
create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as
$$
select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';
--===================================================================
create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
v_pairs1 varchar[];
v_pairs2 varchar[];
v_intersection integer;
v_union integer;
begin
-- schema-qualify the helpers so the calls do not depend on search_path
v_pairs1 := spt1.wordletterpairs(upper(p_str1));
v_pairs2 := spt1.wordletterpairs(upper(p_str2));
v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1);
v_intersection := array_length(spt1.arrayintersect(v_pairs1, v_pairs2), 1);
return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';
A version in beautiful Scala:
def pairDistance(s1: String, s2: String): Double = {
def strToPairs(s: String, acc: List[String]): List[String] = {
if (s.size < 2) acc
else strToPairs(s.drop(1),
if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
}
val lst1 = strToPairs(s1.toUpperCase, List())
val lst2 = strToPairs(s2.toUpperCase, List())
(2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)
}
String Similarity Metrics contains an overview of many different metrics used in string comparison (Wikipedia has an overview as well). Many of these metrics are implemented in the simmetrics library.
Yet another example of a metric, not included in the given overview, is compression distance (which attempts to approximate Kolmogorov complexity); it can be used for somewhat longer texts than the one you presented.
You might also consider looking at a much broader subject of Natural Language Processing. These R packages can get you started quickly (or at least give some ideas).
And one last edit - search the other questions on this subject at SO, there are quite a few related ones.
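To make the compression distance mentioned above a bit more concrete, here is a small sketch of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Using `zlib` as the compressor is my assumption; any real compressor only approximates Kolmogorov complexity, and very short strings are dominated by compressor overhead:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance; lower means more similar."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

text = "the quick brown fox jumps over the lazy dog " * 20
other = "0123456789" * 50
print(ncd(text, text))   # close to 0: the second copy compresses almost for free
print(ncd(text, other))  # noticeably larger: little shared structure
```

Values are only meaningful relative to each other and depend on the compressor, so treat the result as a ranking signal rather than an absolute score.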
A faster PHP version of the algorithm:
/**
*
* @param $str
* @return mixed
*/
private static function wordLetterPairs ($str)
{
$allPairs = array();
// Tokenize the string and put the tokens/words into an array
$words = explode(' ', $str);
// For each word
for ($w = 0; $w < count($words); $w ++) {
// Find the pairs of characters
$pairsInWord = self::letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p ++) {
$allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
}
}
return array_values($allPairs);
}
/**
*
* @param $str
* @return array
*/
private static function letterPairs ($str)
{
$numPairs = mb_strlen($str) - 1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i ++) {
$pairs[$i] = mb_substr($str, $i, 2);
}
return $pairs;
}
/**
*
* @param $str1
* @param $str2
* @return float
*/
public static function compareStrings ($str1, $str2)
{
$pairs1 = self::wordLetterPairs(mb_strtolower($str1));
$pairs2 = self::wordLetterPairs(mb_strtolower($str2));
$union = count($pairs1) + count($pairs2);
$intersection = count(array_intersect($pairs1, $pairs2));
return (2.0 * $intersection) / $union;
}
For the data I had (approx. 2300 comparisons), the running time was 0.58 sec with Igal Alkon's solution versus 0.35 sec with mine.
Posting marzagao's answer in C99, inspired by these algorithms
double dice_match(const char *string1, const char *string2) {
//check fast cases
// NULL or empty input can never match (and must not reach strlen below)
if (string1 == NULL || string2 == NULL ||
string1[0] == '\0' || string2[0] == '\0') {
return 0;
}
if (string1 == string2) {
return 1;
}
size_t strlen1 = strlen(string1);
size_t strlen2 = strlen(string2);
if (strlen1 < 2 || strlen2 < 2) {
return 0;
}
size_t length1 = strlen1 - 1;
size_t length2 = strlen2 - 1;
double matches = 0;
int i = 0, j = 0;
//get bigrams and compare
while (i < length1 && j < length2) {
char a[3] = {string1[i], string1[i + 1], '\0'};
char b[3] = {string2[j], string2[j + 1], '\0'};
int cmp = strcmpi(a, b);
if (cmp == 0) {
matches += 2;
}
i++;
j++;
}
return matches / (length1 + length2);
}
Some tests based on the original article:
#include <stdio.h>
void article_test1() {
char *string1 = "FRANCE";
char *string2 = "FRENCH";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void article_test2() {
printf("====%s====\n", __func__);
char *string = "Healed";
char *ss[] = {"Heard", "Healthy", "Help",
"Herded", "Sealed", "Sold"};
int correct[] = {44, 55, 25, 40, 80, 0};
for (int i = 0; i < 6; ++i) {
printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
}
}
void multicase_test() {
char *string1 = "FRaNcE";
char *string2 = "fREnCh";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void gg_test() {
char *string1 = "GG";
char *string2 = "GGGGG";
printf("====%s====\n", __func__);
printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}
int main() {
article_test1();
article_test2();
multicase_test();
gg_test();
return 0;
}
Here is the R version:
get_bigrams <- function(str)
{
lstr = tolower(str)
bigramlst = list()
# use the lowercased copy; seq_len() avoids the 1:0 trap for short strings
for(i in seq_len(max(nchar(lstr)-1, 0)))
{
bigramlst[[i]] = substr(lstr, i, i+1)
}
return(bigramlst)
}
str_similarity <- function(str1, str2)
{
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
unionlen = length(pairs1) + length(pairs2)
hit_count = 0
for(x in 1:length(pairs1)){
for(y in 1:length(pairs2)){
if (pairs1[[x]] == pairs2[[y]]) {
hit_count = hit_count + 1
break  # count each pair of pairs1 at most once, as in the Python version
}
}
}
return ((2.0 * hit_count) / unionlen)
}
Building on Michael La Voie's awesome C# version, as per the request to make it an extension method, here is what I came up with. The primary benefit of doing it this way is that you can sort a Generic List by the percent match. For example, consider you have a string field named "City" in your object. A user searches for "Chester" and you want to return results in descending order of match. For example, you want literal matches of Chester to show up before Rochester. To do this, add two new properties to your object:
public string SearchText { get; set; }
public double PercentMatch
{
get
{
return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
}
}
Then on each object, set the SearchText to what the user searched for. Then you can sort it easily with something like:
zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);
Here's the slight modification to make it an extension method:
/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public static double PercentMatchTo(this string str1, string str2)
{
List<string> pairs1 = WordLetterPairs(str1.ToUpper());
List<string> pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
/// <summary>
/// Gets all letter pairs for each
/// individual word in the string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private static List<string> WordLetterPairs(string str)
{
List<string> AllPairs = new List<string>();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, @"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
/// <summary>
/// Generates an array containing every
/// two consecutive letters in the input string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
private static string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
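To illustrate the "Chester before Rochester" ordering described above, here is a rough Python sketch of the same sort-by-match idea. It is not a port of the C# extension method; `letter_pairs` and `percent_match` are illustrative names, and the third city is made up for the example:

```python
def letter_pairs(text):
    """All adjacent letter pairs, collected word by word in upper case."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs

def percent_match(str1, str2):
    """Dice coefficient; a matched pair is removed so it cannot match twice."""
    pairs1, pairs2 = letter_pairs(str1), letter_pairs(str2)
    union = len(pairs1) + len(pairs2)
    if union == 0:
        return 0.0  # assumption: nothing to compare means no match
    intersection = 0
    for pair in pairs1:
        if pair in pairs2:
            pairs2.remove(pair)
            intersection += 1
    return 2.0 * intersection / union

cities = ["Rochester", "Manchester", "Chester"]
ranked = sorted(cities, key=lambda city: percent_match(city, "Chester"), reverse=True)
print(ranked)  # ['Chester', 'Rochester', 'Manchester']
```

The exact match scores 1.0, "Rochester" 12/14 and "Manchester" 12/15, so longer names that merely contain the search text sink down the list.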
My JavaScript implementation takes a string or array of strings, and an optional floor (the default floor is 0.5). If you pass it a string, it will return true or false depending on whether or not the string's similarity score is greater than or equal to the floor. If you pass it an array of strings, it will return an array of those strings whose similarity score is greater than or equal to the floor, sorted by score.
Examples:
'Healed'.fuzzy('Sealed'); // returns true
'Healed'.fuzzy('Help'); // returns false
'Healed'.fuzzy('Help', 0.25); // returns true
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]
Here it is:
(function(){
var default_floor = 0.5;
function pairs(str){
var pairs = []
, length = str.length - 1
, pair;
str = str.toLowerCase();
for(var i = 0; i < length; i++){
pair = str.substr(i, 2);
if(!/\s/.test(pair)){
pairs.push(pair);
}
}
return pairs;
}
function similarity(pairs1, pairs2){
var union = pairs1.length + pairs2.length
, hits = 0;
for(var i = 0; i < pairs1.length; i++){
for(var j = 0; j < pairs2.length; j++){
if(pairs1[i] == pairs2[j]){
pairs2.splice(j--, 1);
hits++;
break;
}
}
}
return 2*hits/union || 0;
}
String.prototype.fuzzy = function(strings, floor){
var str1 = this
, pairs1 = pairs(this);
floor = typeof floor == 'number' ? floor : default_floor;
if(typeof(strings) == 'string'){
return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
}else if(strings instanceof Array){
var scores = {};
strings.map(function(str2){
scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
});
return strings.filter(function(str){
return scores[str] >= floor;
}).sort(function(a, b){
return scores[b] - scores[a];
});
}
};
})();
The Dice coefficient algorithm (Simon White / marzagao's answer) is implemented in Ruby in the pair_distance_similar method of the amatch gem:
https://github.com/flori/amatch
This gem also contains implementations of a number of approximate matching and string comparison algorithms: Levenshtein edit distance, Sellers edit distance, the Hamming distance, the longest common subsequence length, the longest common substring length, the pair distance metric, the Jaro-Winkler metric.
A Haskell version—feel free to suggest edits because I haven't done much Haskell.
import Data.Char
import Data.List
-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1
-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)
-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
where pairsS1 = wordLetterPairs $ map toLower s1
pairsS2 = wordLetterPairs $ map toLower s2
numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
totalLength = fromIntegral $ length pairsS1 + length pairsS2
Clojure:
(require '[clojure.set :refer [intersection]]
'[clojure.string :refer [split]])
(defn bigrams [s]
(->> (split s #"\s+")
(mapcat #(partition 2 1 %))
(set)))
(defn string-similarity [a b]
(let [a-pairs (bigrams a)
b-pairs (bigrams b)
total-count (+ (count a-pairs) (count b-pairs))
match-count (count (intersection a-pairs b-pairs))
similarity (/ (* 2 match-count) total-count)]
similarity))
Here is another version of Similarity based in Sørensen–Dice index (marzagao's answer), this one written in C++11:
/*
* Similarity based in Sørensen–Dice index.
*
* Returns the Similarity between _str1 and _str2.
*/
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
// Base case: an empty string is not similar to anything.
if (_str1.empty() || _str2.empty()) {
return 0.0;
}
auto str1 = upper_string(_str1);
auto str2 = upper_string(_str2);
// Base case: equal strings are a perfect match.
if (str1 == str2) {
return 1.0;
}
// Base case: a string without bigrams cannot share any.
if (str1.size() < 2 || str2.size() < 2) {
return 0.0;
}
// Extract bigrams from str1
auto num_pairs1 = str1.size() - 1;
std::unordered_set<std::string> str1_bigrams;
str1_bigrams.reserve(num_pairs1);
for (unsigned i = 0; i < num_pairs1; ++i) {
str1_bigrams.insert(str1.substr(i, 2));
}
// Extract bigrams from str2
auto num_pairs2 = str2.size() - 1;
std::unordered_set<std::string> str2_bigrams;
str2_bigrams.reserve(num_pairs2);
for (unsigned int i = 0; i < num_pairs2; ++i) {
str2_bigrams.insert(str2.substr(i, 2));
}
// Find the intersection between the two sets.
int intersection = 0;
if (str1_bigrams.size() < str2_bigrams.size()) {
const auto it_e = str2_bigrams.end();
for (const auto& bigram : str1_bigrams) {
intersection += str2_bigrams.find(bigram) != it_e;
}
} else {
const auto it_e = str1_bigrams.end();
for (const auto& bigram : str2_bigrams) {
intersection += str1_bigrams.find(bigram) != it_e;
}
}
// Returns similarity coefficient.
return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}
Why not add a JavaScript implementation? I have also explained the algorithm.
Algorithm
Input : France and French.
Map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:
FRANCE: {FR, RA, AN, NC, CE}
FRENCH: {FR, RE, EN, NC, CH}
Find their intersection: {FR, NC}
Result: (2 × 2) / (5 + 5) = 0.4, i.e. 40%
Implementation
function similarity(s1, s2) {
const
set1 = pairs(s1.toUpperCase()), // [ FR, RA, AN, NC, CE ]
set2 = pairs(s2.toUpperCase()), // [ FR, RE, EN, NC, CH ]
intersection = set1.filter(x => set2.includes(x)); // [ FR, NC ]
// Tip: multiply by `200` instead of `2` to get a percentage.
return (intersection.length * 2) / (set1.length + set2.length);
}
function pairs(input) {
const tokenized = [];
for (let i = 0; i < input.length - 1; i++)
tokenized.push(input.substring(i, 2 + i));
return tokenized;
}
console.log(similarity("FRANCE", "FRENCH"));
Ranking Results By ( Word - Similarity )
Sealed - 80%
Healthy - 55%
Heard - 44%
Herded - 40%
Help - 25%
Sold - 0%
From same original source.
What about Levenshtein distance, divided by the length of the first string (or alternatively divided by the min/max/avg length of both strings)? That has worked for me so far.
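A minimal sketch of that idea, assuming the plain dynamic-programming edit distance and normalizing by the longer string (one of the alternatives mentioned; dividing by the first string's length would work the same way). The function names are illustrative:

```python
def levenshtein(a, b):
    """Edit distance computed row by row with the classic DP recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Edit distance divided by the longer length: 0.0 equal, 1.0 disjoint."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))          # 3
print(normalized_distance("Healed", "Sealed"))   # 1 edit over 6 characters
```

Unlike the letter-pair metric, this is a distance (lower is more similar) and it is sensitive to word order, so "web database" and "database web" end up far apart.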
Hey guys, I gave this a try in JavaScript, but I'm new to it. Does anyone know faster ways to do it?
function get_bigrams(string) {
// Takes a string and returns a list of bigrams
var s = string.toLowerCase();
var v = new Array(s.length-1);
for (var i = 0; i < v.length; i++){
v[i] =s.slice(i,i+2);
}
return v;
}
function string_similarity(str1, str2){
/*
Perform bigram comparison between two strings
and return a percentage match in decimal form
*/
var pairs1 = get_bigrams(str1);
var pairs2 = get_bigrams(str2);
var union = pairs1.length + pairs2.length;
var hit_count = 0;
for (var x in pairs1){
for (var y in pairs2){
if (pairs1[x] == pairs2[y]){
hit_count++;
}
}
}
return ((2.0 * hit_count) / union);
}
var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
console.log('Healed --- ' + word[w2])
console.log(string_similarity(w1,word[w2]));
}
I was looking for a pure Ruby implementation of the algorithm indicated by @marzagao's answer. Unfortunately, the link indicated by @marzagao is broken. In @s01ipsist's answer, he pointed to the Ruby gem amatch, whose implementation is not in pure Ruby. So I searched a little and found the gem fuzzy_match, which has a pure Ruby implementation (though this gem uses amatch) here. I hope this will help someone like me.
**I've converted marzagao's answer to Java.**
import org.apache.commons.lang3.StringUtils; //Add a apache commons jar in pom.xml
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class SimilarityComparator {
public static void main(String[] args) {
String str0 = "Nischal";
String str1 = "Nischal";
double v = compareStrings(str0, str1);
System.out.println("Similarity betn " + str0 + " and " + str1 + " = " + v);
}
private static double compareStrings(String str1, String str2) {
List<String> pairs1 = wordLetterPairs(str1.toUpperCase());
List<String> pairs2 = wordLetterPairs(str2.toUpperCase());
int intersection = 0;
int union = pairs1.size() + pairs2.size();
for (String s : pairs1) {
for (int j = 0; j < pairs2.size(); j++) {
if (s.equals(pairs2.get(j))) {
intersection++;
pairs2.remove(j);
break;
}
}
}
return (2.0 * intersection) / union;
}
private static List<String> wordLetterPairs(String str) {
List<String> AllPairs = new ArrayList<>();
String[] Words = str.split("\\s");
for (String word : Words) {
if (StringUtils.isNotBlank(word)) {
String[] PairsInWord = letterPairs(word);
Collections.addAll(AllPairs, PairsInWord);
}
}
return AllPairs;
}
private static String[] letterPairs(String str) {
int numPairs = str.length() - 1;
String[] pairs = new String[numPairs];
for (int i = 0; i < numPairs; i++) {
try {
pairs[i] = str.substring(i, i + 2);
} catch (Exception e) {
pairs[i] = str.substring(i, numPairs);
}
}
return pairs;
}
}
Here's another C++ implementation that follows the original article and minimizes dynamic memory allocations.
It yields the same match values for the article's examples, but I think it would be better to also take single-character words into account.
//---------------------------------------------------------------------------
// Similarity based on Sørensen–Dice index
double calc_similarity( const std::string_view s1, const std::string_view s2 )
{
// Check trivial cases
if( s1.empty() || s2.empty() )
{// Empty string is never similar to another
return 0.0;
}
else if( s1==s2 )
{// Perfectly equal
return 1.0;
}
else if( s1.length()==1 || s2.length()==1 )
{// Single (not equal) characters have zero similarity
return 0.0;
}
/////////////////////////////////////////////////////////////////////////
// Represents a pair of adjacent characters
class charpair_t final
{
public:
charpair_t(const char a, const char b) noexcept : c1(a), c2(b) {}
[[nodiscard]] bool operator==(const charpair_t& other) const noexcept { return c1==other.c1 && c2==other.c2; }
private:
char c1, c2;
};
/////////////////////////////////////////////////////////////////////////
// Collects and gives access to a sequence of adjacent characters (skipping spaces)
class charpairs_t final
{
public:
charpairs_t(const std::string_view s)
{
assert( !s.empty() );
const std::size_t i_last = s.size()-1;
std::size_t i = 0;
chpairs.reserve(i_last);
while( i<i_last )
{
// Accepting also single-character words (the second is a space)
//if( !std::isspace(s[i]) ) chpairs.emplace_back( std::tolower(s[i]), std::tolower(s[i+1]) );
// Skipping single-character words (as in the original article)
if( std::isspace(s[i]) ) ; // Skip
else if( std::isspace(s[i+1]) ) ++i; // Skip also next
else chpairs.emplace_back( std::tolower(s[i]), std::tolower(s[i+1]) );
++i;
}
}
[[nodiscard]] auto size() const noexcept { return chpairs.size(); }
[[nodiscard]] auto cbegin() const noexcept { return chpairs.cbegin(); }
[[nodiscard]] auto cend() const noexcept { return chpairs.cend(); }
auto erase(std::vector<charpair_t>::const_iterator i) { return chpairs.erase(i); }
private:
std::vector<charpair_t> chpairs;
};
charpairs_t chpairs1{s1},
chpairs2{s2};
const double orig_avg_bigrams_count = 0.5 * static_cast<double>(chpairs1.size() + chpairs2.size());
std::size_t matching_bigrams_count = 0;
for( auto ib1=chpairs1.cbegin(); ib1!=chpairs1.cend(); ++ib1 )
{
for( auto ib2=chpairs2.cbegin(); ib2!=chpairs2.cend(); )
{
if( *ib1==*ib2 )
{
++matching_bigrams_count;
ib2 = chpairs2.erase(ib2); // Avoid matching the same character pair multiple times
break;
}
else ++ib2;
}
}
return static_cast<double>(matching_bigrams_count) / orig_avg_bigrams_count;
}