I have a dataset of towns and their geographical data, plus some input data. The input is almost always a town as well, but since these towns were scraped off the internet, they can be slightly misspelled or spelled differently, e.g. Saint Petersburg <-> St. Petersburg.
During my research I came across a couple of algorithms and tried out two. First I tried the Sørensen–Dice coefficient. It gave some promising results, until I tried to match short strings against longer strings. The algorithm works well when all strings are roughly the same length, but when they differ a lot in length you get mixed results. E.g. when matching Saint against the set, it gives Sail as the best match, while I want Saint-Petersburg.
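To make the mismatch concrete, here is a small Python sketch of the bigram-based Sørensen–Dice coefficient, using the example names above (a quick illustration, not production matching code):

```python
def bigrams(s):
    """Character bigrams of a lowercased string, ignoring spaces and hyphens."""
    s = s.lower().replace("-", "").replace(" ", "")
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(a, b):
    """Sørensen–Dice coefficient over character bigrams (multiset overlap)."""
    A, B = bigrams(a), bigrams(b)
    if not A or not B:
        return 0.0
    overlap = sum(min(A.count(g), B.count(g)) for g in set(A))
    return 2.0 * overlap / (len(A) + len(B))

# The short query scores higher against another short string than
# against the longer, correct town name.
print(dice("Saint", "Sail"))              # ~0.57
print(dice("Saint", "Saint-Petersburg"))  # ~0.44
```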
The second algorithm I tried is the Levenshtein distance, but for the same reasons it didn't fare well.
I came across some other algorithms, such as cosine similarity, longest common subsequence and more, but those seem a bit more complicated and I would like to keep the cost of calculation down.
Are there any algorithms that prioritize length of match over the percentage matched?
Anyone have any experience with matching oddly spelled town names? Please let me know!
EDIT:
I thought this SO question was a possible duplicate, but it turns out it describes Sørensen–Dice.
I know that in Word2Vec the length of a word vector can encode properties like term frequency. In that case, two word vectors, say synonyms, can have similar meaning but different lengths, depending on their usage in our corpus.
However, if we normalize the word vectors, we keep their "directions of meaning" and we can cluster them according to exactly that: meaning.
Following that train of thought, the same would be applicable to document vectors in Doc2Vec.
But my question is: is there a reason NOT to normalize document vectors if we want to cluster them? In Word2Vec we can say we want to keep the frequency property of the words; is there a similar property for documents?
I'm not familiar with any reasoning or research precedent which implies that either unit-normalized or non-normalized document-vectors are better for clustering.
So, I'd try both to see which seems to work better for your purposes.
Other thoughts:
In Word2Vec, my general impression is that larger-magnitude word-vectors are associated with words that, in the training data, have more unambiguous meaning. (That is, they reliably tend to imply the same smaller set of neighboring words.) Meanwhile, words with multiple meanings (polysemy) and usage amongst many other diverse words tend to have lower-magnitude vectors.
Still, the common way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. That's likely because most comparisons just need the best sense of a word, without any more subtle indicator of "unity of meaning".
A similar effect might be present in Doc2Vec vectors: lower-magnitude doc-vectors could be a hint that the document has more broad word-usage/subject-matter, while higher-magnitude doc-vectors suggest more focused documents. (I'd similarly have the hunch that longer documents may tend to have lower-magnitude doc-vectors, because they use a greater diversity of words, whereas small documents with a narrow set of words/topics may have higher-magnitude doc-vectors. But I have not specifically observed/tested this hunch, and any effect here could be heavily influenced by other training choices, like the number of training iterations.)
Thus, it's possible that the non-normalized vectors would be interesting for some clustering goals, like separating focused documents from more general documents. So again, after this longer analysis: I'd suggest trying it both ways to see if one or the other seems to work better for your specific needs.
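For what it's worth, trying both is cheap. A minimal Python sketch, assuming your doc-vectors are already collected into a NumPy array (for example from a trained gensim Doc2Vec model; the array below is just a random placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

# doc_vecs: shape (n_docs, vector_size); replace with your real doc-vectors.
doc_vecs = np.random.rand(1000, 100)

# Unit-normalize each vector, keeping only its "direction of meaning".
norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
normalized = doc_vecs / np.clip(norms, 1e-12, None)

# Cluster both variants and inspect which grouping serves your task better.
labels_raw = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(doc_vecs)
labels_norm = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(normalized)
```

With the normalized vectors, Euclidean k-means behaves much like clustering by cosine similarity, which matches the usual way these vectors are compared.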
I am quite new to NLP. My question is: can I combine words with the same meaning into one using NLP? For example, consider the following rows:
1. It’s too noisy here
2. Come on people whats up with all the chatter
3. Why are people shouting like crazy
4. Shut up people, why are you making so much noise
As one can notice, the common aspect here is that the people are complaining about the noise.
noisy, chatter, shouting, noise -> Noise
Is it possible to group the words under a common entity using NLP? I am using R to come up with a solution to this problem.
I have used a sample Twitter data set, and my expected output will be a table that contains:

Noise
1. It’s too noisy here
2. Come on people whats up with all the chatter
3. Why are people shouting like crazy
4. Shut up people, why are you making so much noise
I did search the web for references before posting here. Any suggestions or valuable input would be much appreciated.
Thanks
The problem you mention is better known as paraphrasing, and it is not completely solved. If you want a fast solution, you can start by replacing synonyms; WordNet can help with that.
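For example, candidate synonyms can be pulled from WordNet with NLTK (a Python sketch; the asker works in R, where a WordNet interface also exists, but the lookup idea is the same):

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect lemma names from every WordNet synset of a word."""
    names = set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            names.add(lemma.name().replace("_", " "))
    return names

for w in ["noisy", "chatter", "shouting", "noise"]:
    print(w, "->", sorted(synonyms(w)))
```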
Another idea is to calculate sentence similarity (get a vector representation of each sentence and use cosine distance to measure how similar the sentences are to each other).
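A quick way to prototype that idea (a Python sketch using TF-IDF vectors; note that plain TF-IDF will not bridge synonyms like noisy/chatter by itself, for that you would need word embeddings or the synonym step above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "It's too noisy here",
    "Come on people whats up with all the chatter",
    "Why are people shouting like crazy",
    "Shut up people, why are you making so much noise",
]

# Vector representation of each sentence, then pairwise cosine similarity.
tfidf = TfidfVectorizer().fit_transform(sentences)
print(cosine_similarity(tfidf).round(2))
```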
I think this paper could provide a good introduction for your problem.
As a result of my mathematical research, I have obtained the following figure:
I am trying hard to guess the next value. I know there are multiple extrapolation techniques that can be used here.
However, I am primarily concerned with trying to find some kind of logic behind this apparently chaotic chart. For the more curious: the X-axis represents the index of a member of a given population, whereas the Y-axis is simply how far that member is from the average.
Any algorithm/software to be used to recognise patterns? How would you approach this problem?
The population sizes are sufficiently large for statistics to be useful. Yet the y values are all over the place. It seems extremely unlikely that an underlying pattern could exist when the population index is on the X-axis. Common sense suggests that nothing varies so wildly, yet predictably, with a change in population.
If this were stock market activity throughout the day, you might have a similar chart, and underlying patterns might relate in some way to times various cities around the globe start their work day, for example. Patterns would be plausible, at least.
There seems to be a similar question here, but I was satisfied with neither the clarity of the answer nor its practicality. In a recent interview I was asked what data structure I would use to store a large set of floating point numbers so that, given a new arrival, I could look up either the number itself or its closest neighbor. I said I would use a binary search tree and try to keep it balanced to achieve O(log n) lookups.
Then the question was extended to two dimensions: what data structure would I use to store a large set of (x, y) pairs, such as geographical coordinates, for fast lookup? I couldn't think of a satisfactory answer and gave up completely when it was extended to K dimensions. Using a k-dimensional tree directly, splitting the space on coordinate values, doesn't seem to work, since two points close to the origin but in different quadrants may end up in leaves far away from each other.
After the interview, I remembered that Voronoi diagrams partition K-dimensional space nicely. What is the best way of implementing this, and with what data structure? How would the lookups be performed? This problem feels so common in computer science that by now there must even be a dedicated data structure for it.
You can use a grid and sort the points into the grid cells. There is a similar question here: Distance Calculation for massive number of devices/nodes. You can also use a space-filling curve (quadkey), or a spatial tree such as a quadtree or an R-tree when you need additional information, e.g. a hierarchy.
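A hedged sketch of the grid idea in Python (the cell size and sample points are made up, and a real implementation would keep widening the ring of cells searched until a neighbor is guaranteed):

```python
from collections import defaultdict
from math import dist, floor

CELL = 1.0  # grid cell size; tune it to your point density

def cell_of(p):
    return (floor(p[0] / CELL), floor(p[1] / CELL))

def build_grid(points):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p)].append(p)
    return grid

def nearest(grid, q):
    """Check the query's cell and its 8 neighbors; this assumes the nearest
    point lies within one cell of the query (otherwise widen the search)."""
    cx, cy = cell_of(q)
    candidates = [p for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  for p in grid[(cx + dx, cy + dy)]]
    return min(candidates, key=lambda p: dist(p, q), default=None)

points = [(0.2, 0.3), (5.1, 4.9), (5.0, 5.2), (9.7, 1.1)]
grid = build_grid(points)
print(nearest(grid, (5.05, 5.0)))  # -> (5.1, 4.9)
```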
To start off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to use genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck, I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need resources that will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage and also a couple of IEEE papers on the subject.
Any working code in R (or even in C), or pointers to which R packages (if any) to use, would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R and its "documentation" is in 'S Poetry' (http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf) and in the code itself.
I assume from your question you have some function F(metabolites) which yields a spectrum but you do not have the inverse function F'(spectrum) to get back metabolites. The search space of metabolites is large so rather than brute force it you wish to try an approximate method (such as a genetic algorithm) which will make a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and the trial spectrum. The smoother this function is the better the search will work. If it can only yield true/false it will be a purely random search and you'd be better off with brute force.
Given the F and your score (aka fitness) function all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectrums, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain specific because you can speed the process greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
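To make the loop above concrete, here is a minimal, hedged sketch of a GA over boolean strings, written in Python rather than R for brevity (score_spectrum is a made-up placeholder for running F and scoring the resulting spectrum against the target):

```python
import random

N_METABOLITES = 50      # length of the boolean genome
POP_SIZE = 100
MUTATION_RATE = 0.01    # per-bit flip probability; needs tuning per domain
GENERATIONS = 200

def score_spectrum(genome):
    """Placeholder fitness: in the real problem, run F(genome) and compare
    the simulated spectrum with the measured one. Here we reward an
    arbitrary made-up target pattern just so the sketch runs."""
    return sum(bit == (i % 3 == 0) for i, bit in enumerate(genome))

def crossover(a, b):
    """Single-point crossover of two parent genomes."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome):
    """Flip each bit independently with a small probability."""
    return [bit ^ (random.random() < MUTATION_RATE) for bit in genome]

population = [[random.random() < 0.5 for _ in range(N_METABOLITES)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population.sort(key=score_spectrum, reverse=True)
    parents = population[:POP_SIZE // 2]   # keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=score_spectrum)
print(score_spectrum(best))
```

As noted above, filtering out nonsense genomes (e.g. only one metabolite, or over 1000) is best done inside the crossover/mutation step rather than by penalizing them afterwards.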
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I found the article "The science of computing: genetic algorithms" by Peter J. Denning (American Scientist, vol. 80, no. 1, pp. 12-14). The article is simple and useful if you want to understand what genetic algorithms do, and it is only 3 pages long!