How to compute semantic similarity in documents - information-retrieval

I am doing a project in which I need to rank text documents according to a search query, like a search engine, but I need to rank the documents by the semantic similarity of words or sentences. I am unable to get started on how to find semantic similarity using Java. Is there any link or paper through which I can start finding the semantic similarity of words in documents, or any other idea?

The standard way to represent documents in term-space is to treat the terms as mutually orthogonal, i.e. independent of each other. For example, the terms "atomic" and "nuclear", although synonymous and hence interchangeable, are treated as distinct, even though the semantic similarity between this pair of words should be fairly high.
Thus, for implementing a semantic similarity based score, you need to know the relation between a pair of words, for which you can use either of the following.
An external resource such as WordNet, or a semantic similarity library such as DISCO.
A corpus analysis methodology such as Latent Semantic Analysis (LSA) which reduces the dimensionality of the term space by combining semantically similar terms such as "atomic" and "nuclear".
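To make this concrete, here is a minimal, illustrative Java sketch (the class name and the tiny synonym map are invented for the example): documents and queries are scored by cosine similarity over concept-frequency vectors, where the toy synonym map stands in for a real resource such as WordNet, DISCO, or an LSA-reduced term space.

import java.util.*;

public class SemanticScoreSketch {

    // Toy stand-in for a real resource such as WordNet or DISCO:
    // maps surface terms to a shared "concept" identifier.
    static final Map<String, String> CONCEPTS = Map.of(
            "atomic", "nuclear",   // treat "atomic" and "nuclear" as one concept
            "car", "automobile");

    // Map a term to its concept; unknown terms are their own concept.
    static String concept(String term) {
        return CONCEPTS.getOrDefault(term, term);
    }

    // Build a concept-frequency vector for a tokenized document.
    static Map<String, Integer> vector(String[] tokens) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : tokens) {
            v.merge(concept(t.toLowerCase()), 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two concept-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int x : b.values()) {
            nb += x * x;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String[] query = "nuclear power plant".split(" ");
        String[] doc = "atomic energy plant safety".split(" ");
        System.out.println(cosine(vector(query), vector(doc)));
    }
}

In a real system you would replace the concept() lookup with a call into WordNet or DISCO, or map terms into the reduced LSA space before computing the cosine.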

Have a look at this demo for semantic similarity.
It shows demos of different algorithms, so you can see which one works for you and go with it. The SEMILAR toolkit behind that page can also be used from Java, I think. I haven't tried it myself yet, but the demo on that page covers the same functionality. Thanks :)

Related

Gremlin: How can I merge groups of vertices when they are similar

My query returns groups of user vertices like this:
[
[Pedro, Sabrina, Macka, Fer]
[Pedro, Sabrina, Macka, Fer, Britney]
[Britney, Fred, Christina]
]
The first 2 groups are similar; they contain mostly the same vertices. I need to merge them.
I need to merge the groups that are like for example 80% similar (80% of the elements are the same).
Is this possible in Gremlin? How can I do this?
Edit:
https://gremlify.com/2ykos4047g5
This Gremlify project creates fake output similar to what I have in my query. I need the first 2 lists merged into a single one because they contain almost the same vertices, but not the third one, because it's completely different from the others.
So what I'm asking is how to write a query that compares all the lists, checks how many vertices they have in common, and based on that decides whether or not to merge them into a single one.
The expected output for the gremlify project is:
[
[
"Pedro",
"Sabrina",
"Macka",
"Fer",
"Britney"
],
[
"Garry",
"Dana",
"Lily"
]
]
Gremlin doesn't have steps that merge lists based on how much they are alike. Gremlin is fairly flexible, so I imagine there might be ways to use its steps in creative ways to get what you want, but the added complexity may not be worth it. My personal preference is to use Gremlin to retrieve my data, filter away whatever is extraneous, and then transform it as close as possible to its final result while maintaining a balance with readability.
Given that thinking, if your result from Gremlin is simply a list of lists of strings and your Gremlin up to that point is well structured and performant, then perhaps Gremlin has gotten you far enough and its job is done. Take that result and post-process it on your application side by writing some code to take you to your final result. With that approach you have your full programming language environment at your disposal, with all the libraries available to you to make that final step easier.
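For example, here is a rough Java sketch of that post-processing step (the class name and the exact similarity measure are just illustrative choices): it repeatedly merges any two groups whose overlap, shared names divided by the size of the larger group, is at least the 80% threshold.

import java.util.*;

public class GroupMerger {

    // Overlap ratio: shared elements divided by the size of the larger group.
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.max(a.size(), b.size());
    }

    // Greedily merge any two groups whose similarity is at least the threshold.
    static List<Set<String>> merge(List<Set<String>> groups, double threshold) {
        List<Set<String>> result = new ArrayList<>(groups);
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < result.size(); i++) {
                for (int j = i + 1; j < result.size(); j++) {
                    if (similarity(result.get(i), result.get(j)) >= threshold) {
                        result.get(i).addAll(result.remove(j));
                        merged = true;
                        break outer;
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> groups = new ArrayList<>();
        groups.add(new HashSet<>(List.of("Pedro", "Sabrina", "Macka", "Fer")));
        groups.add(new HashSet<>(List.of("Pedro", "Sabrina", "Macka", "Fer", "Britney")));
        groups.add(new HashSet<>(List.of("Garry", "Dana", "Lily")));
        System.out.println(merge(groups, 0.8));
    }
}

Run on the sample data, this prints the merged group of five names plus the untouched third group, which matches the expected output in the question.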
I'd further add that your example is a bit contrived and focuses on an arbitrary result, which reduces your Gremlin question to a collection manipulation question. With graphs and Gremlin I often find that a heavy focus on collection manipulation to improve the quality of a result (rather than just the format of a result) implies that I should go back to the core of my traversal algorithm rather than try to tack on extra manipulation at the end of the traversal.
For example, if this output you're asking about in this question relates back to your previous questions here and here, then I'd wonder if you shouldn't rethink the rules of your algorithm. Perhaps, you really aren't "detecting triangles and then trying to group them accordingly" as I put it in one of my answers there. Maybe there is a completely different algorithm that will solve your problem which is even more effective and performant.
This blog post, "Reducing Computational Complexity with Correlate Traversals", does an excellent job in explaining this general notion. Though it focuses on centrality algorithms, the general message is quite clear:
All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.
In your case, perhaps you need more flexibility in the rules you stated for your algorithm, to allow a better (i.e. less rigid) grouping in your results. In any case, it is something to think about, and in the worst case you can obviously just take the brute force approach that you describe in your question and get your result.

Generating articles automatically

This question is to learn and understand whether a particular technology exists or not. Following is the scenario.
We are going to provide 200 English words. The software can add an additional 40 words, which is 20% of 200. Now, using these, the software should write dialogs: meaningful dialogs with no grammar mistakes.
For this, I looked into Spintax and article spinning. But you know what they do: take existing articles and rewrite them. That is not the best way to do this (is it? Please let me know if it is). So, is there any technology capable of doing this? Maybe the semantic theory that Google uses? Any proven AI method?
Please help.
To begin with, a word of caution: this is at the forefront of research in natural language generation (NLG), and the state-of-the-art research publications are not nearly good enough to replace a human teacher. The problem is especially complicated for students with English as a second language (ESL), because they tend to think in their native tongue before mentally translating the knowledge into English. If we disregard this fearful prelude, the normal way to go about this is as follows:
NLG comprises three main components:
Content Planning
Sentence Planning
Surface Realization
Content Planning: This stage breaks down the high-level goal of communication into structured atomic goals. These atomic goals are small enough to be reached with a single step of communication (e.g. in a single clause).
Sentence Planning: Here, the actual lexemes (i.e. words or word-parts that bear clear semantics) are chosen to be a part of the atomic communicative goal. The lexemes are connected through predicate-argument structures. The sentence planning stage also decides upon sentence boundaries. (e.g. should the student write "I went there, but she was already gone." or "I went there to see her. She has already left." ... notice the different sentence boundaries and different lexemes, but both answers indicating the same meaning.)
Surface Realization: The semi-formed structure attained in the sentence planning step is morphed into a proper form by incorporating function words (determiners, auxiliaries, etc.) and inflections.
In your particular scenario, most of the words are already provided, so choosing the lexemes is going to be relatively simple. The predicate-argument structures connecting the lexemes need to be learned by using a suitable probabilistic learning model (e.g. hidden Markov models). The surface realization, which ensures the final correct grammatical structure, should be a combination of grammar rules and statistical language models.
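To illustrate just the "statistical language models" part, here is a toy Java bigram model that scores candidate sentences, so a generator could prefer the more fluent realization. The class name and the three training sentences are made up for the example; a real system would train on a large corpus and combine this with grammar rules.

import java.util.*;

public class BigramScorer {

    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();

    // Count unigrams and bigrams from whitespace-tokenized training sentences.
    void train(List<String> sentences) {
        for (String s : sentences) {
            String[] tokens = ("<s> " + s.toLowerCase() + " </s>").split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                unigrams.merge(tokens[i], 1, Integer::sum);
                if (i > 0) {
                    bigrams.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
                }
            }
        }
    }

    // Log-probability of a candidate sentence under the bigram model,
    // with add-one smoothing so unseen bigrams are not impossible.
    double score(String sentence) {
        String[] tokens = ("<s> " + sentence.toLowerCase() + " </s>").split("\\s+");
        double logProb = 0;
        int vocab = unigrams.size() + 1;
        for (int i = 1; i < tokens.length; i++) {
            int big = bigrams.getOrDefault(tokens[i - 1] + " " + tokens[i], 0);
            int uni = unigrams.getOrDefault(tokens[i - 1], 0);
            logProb += Math.log((big + 1.0) / (uni + vocab));
        }
        return logProb;
    }

    public static void main(String[] args) {
        BigramScorer lm = new BigramScorer();
        lm.train(List.of("I went there to see her", "She has already left",
                "I went there , but she was already gone"));
        System.out.println(lm.score("I went there to see her"));
        System.out.println(lm.score("there went I her see to"));
    }
}

The first candidate (a seen word order) gets a higher log-probability than the scrambled one, which is exactly the signal a surface realizer needs.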
At a high-level, note that content planning is language-agnostic (but it is, quite possibly, culture-dependent), while the last two stages are language-dependent.
As a final note, I would like to add that the choice of the 40 extra words is something I have glossed over, but it is no less important than the other parts of this process. In my opinion, these extra words should be chosen based on their syntagmatic relation to the 200 given words.
For further details, the two following papers provide a good start (complete with process flow architectures, examples, etc.):
Natural Language Generation in Dialog Systems
Stochastic Language Generation for Spoken Dialogue Systems
To better understand the notion of syntagmatic relations, I found Sahlgren's article on the distributional hypothesis extremely helpful. The distributional approach in his work can also be used to learn the predicate-argument structures I mentioned earlier.
Finally, to add a few available tools: take a look at this ACL list of NLG systems. I haven't used any of them, but I've heard good things about SPUD and OpenCCG.

Graph Traversing algorithms in Semantic web

I am asking about algorithms that would be useful for querying a Semantic Web database to get all the RDF resources related to an original object.
For example, if the original object is the movie "Inception", I want an algorithm to build queries that fetch the RDF resources for the cast of the movie, the studio, the country, etc., so that I can build a relationship graph.
The closest example is the answer to this question, especially this class. I want similar algorithms, or maybe titles to search for, in order to produce such an algorithm. I am thinking maybe some modifications to graph traversal algorithms could work, but I'm not sure.
NOTE: My project is in ASP.NET, so it would help to use existing .NET libraries.
You should be able to do a simple breadth-first-search to get all the objects that are a certain distance away from a given node.
You'll need to know something about the schema because some neighboring nodes are more meaningful than others. For example, in Freebase, we have intermediate nodes that link a film to an actor and a role. You need to know to go 2-ply deep to get at the actor and the role because just saying that the film is related to the intermediate nodes is not very interesting.
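As a rough illustration of that breadth-first search, here is a self-contained Java sketch over an in-memory list of (subject, predicate, object) triples; the sample triples, the intermediate _performance1 node, and the depth limit are all assumptions made up for the example.

import java.util.*;

public class NeighborhoodBfs {

    record Triple(String subject, String predicate, String object) {}

    // Collect every node reachable from the start node within maxDepth hops,
    // following triples in both directions.
    static Set<String> neighborhood(List<Triple> triples, String start, int maxDepth) {
        Map<String, Set<String>> adjacency = new HashMap<>();
        for (Triple t : triples) {
            adjacency.computeIfAbsent(t.subject(), k -> new HashSet<>()).add(t.object());
            adjacency.computeIfAbsent(t.object(), k -> new HashSet<>()).add(t.subject());
        }

        Set<String> visited = new HashSet<>(Set.of(start));
        Deque<String> frontier = new ArrayDeque<>(List.of(start));
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String neighbor : adjacency.getOrDefault(node, Set.of())) {
                    if (visited.add(neighbor)) {
                        next.add(neighbor);
                    }
                }
            }
            frontier = next;
        }
        visited.remove(start);
        return visited;
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
                new Triple("Inception", "hasPerformance", "_performance1"), // intermediate node
                new Triple("_performance1", "actor", "Leonardo DiCaprio"),
                new Triple("_performance1", "role", "Cobb"),
                new Triple("Inception", "studio", "Warner Bros."));
        // Depth 2 steps through the intermediate performance node.
        System.out.println(neighborhood(triples, "Inception", 2));
    }
}

With a depth limit of 2 the search reaches the actor and the role through the intermediate node, which is the 2-ply behavior described above.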
Did you take a look at "property paths"?
Property paths give a more succinct way to write parts of basic graph patterns and also extend matching of triple patterns to arbitrary length paths. Property paths do not invalidate or change any existing SPARQL query.
Triple stores and SPARQL engines such as OWLIM and AllegroGraph support them.
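If you can run Java somewhere in your stack, a property path query can be executed with a library such as Apache Jena, as in the sketch below; dotNetRDF plays a similar role on the .NET side mentioned in the question. The file name, prefix, and ex:relatedTo predicate are placeholders for whatever your data actually uses.

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class PropertyPathExample {
    public static void main(String[] args) {
        // Load some RDF data; "movies.ttl" is a placeholder file name.
        Model model = ModelFactory.createDefaultModel();
        model.read("movies.ttl");

        // The "+" property path follows ex:relatedTo one or more times,
        // so everything transitively related to ex:Inception is returned.
        String query =
                "PREFIX ex: <http://example.org/> " +
                "SELECT DISTINCT ?o WHERE { ex:Inception ex:relatedTo+ ?o }";

        try (QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(query), model)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("o"));
            }
        }
    }
}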

Measuring distances among classes in RDF/OWL graphs

Maybe someone could give me a hint. Is it possible to measure the distance between 2 concepts/classes that belong to the same ontology?
For example, let's suppose I have an ontology with the
Astronomy class and the Telescope class. There is a link between both, but it is not a direct link. Astronomy has a parent class called Science, and Telescope has a parent class called Optical Instrument which belongs to its parent called Instrumentation, that is related to a class called Empirical Science that finally belongs to a class called Science.
So there is an indirect link between Telescope and Astronomy, and I want to find out the number of steps needed to reach one class starting from the other one.
Is there an easy SPARQL query that resolves that question? Or are there better ways to do that job? Or is not possible to find that out using Semantic Web paradigm?
Any hint will be very appreciated.
SPARQL provides the ability to search for arbitrary length paths in a graph but no mechanism to tell you the length of that path.
So you can do something like:
SELECT * WHERE { ?s ex:property+ ?o }
The syntax is very much like regex, so you can do alternatives, restricted cardinalities, etc.
In my understanding, SPARQL doesn't contain any recursive constructs that would let you measure an indirect link of arbitrary length. The best you could do is to prepare a set of queries distance_1(a, b), distance_2(a, b), ... to check for a specific distance between two concepts.
Another alternative is to discover this information using non-SPARQL technology, for example by writing a graph traversal algorithm in Python with RDFlib.
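For illustration, here is a rough sketch of such a traversal, written here in Java with Apache Jena (the same idea applies with RDFlib in Python): a breadth-first search that treats every statement as an undirected edge and returns the number of hops between two classes. The file name and the class URIs are placeholders.

import java.util.*;
import org.apache.jena.rdf.model.*;

public class ClassDistance {

    // Shortest number of triple "hops" between two resources, treating every
    // statement in the model as an undirected edge between its subject and object.
    static int distance(Model model, Resource from, Resource to) {
        Map<Resource, Integer> dist = new HashMap<>();
        Deque<Resource> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            Resource current = queue.poll();
            if (current.equals(to)) {
                return dist.get(current);
            }
            for (Resource neighbor : neighbors(model, current)) {
                if (!dist.containsKey(neighbor)) {
                    dist.put(neighbor, dist.get(current) + 1);
                    queue.add(neighbor);
                }
            }
        }
        return -1; // not connected
    }

    // Follow statements in both directions: current as subject and as object.
    static List<Resource> neighbors(Model model, Resource current) {
        List<Resource> result = new ArrayList<>();
        StmtIterator out = model.listStatements(current, null, (RDFNode) null);
        while (out.hasNext()) {
            RDFNode object = out.next().getObject();
            if (object.isResource()) {
                result.add(object.asResource());
            }
        }
        StmtIterator in = model.listStatements(null, null, current);
        while (in.hasNext()) {
            result.add(in.next().getSubject());
        }
        return result;
    }

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("ontology.owl"); // placeholder file name
        Resource astronomy = model.createResource("http://example.org/Astronomy");
        Resource telescope = model.createResource("http://example.org/Telescope");
        System.out.println(distance(model, astronomy, telescope));
    }
}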
Since you explicitly mentioned that you are talking about classes and that they will be in the same ontology, it is safe to assume that they will always be connected (because ultimately both will be a subclass of "Thing", right?). On the other hand, the path I mentioned in the parentheses (Class1 -> ... -> Thing <- ... <- Class2) is a trivial one, so I assume you want to find... all of the existing paths between two classes, in other words, all of the existing paths between two vertices. Is that true? Or are you looking for the shortest path? Your question is not very clear in that aspect; can you clarify it?
As far as I know there is no simple SPARQL construct that will list all the paths between classes or the shortest path. However, some Semantic Web triple stores come with graph traversal algorithms such as breadth-first search or depth-first search; please refer to:
http://www.franz.com/agraph/support/documentation/current/lisp-reference.html#sna
You may also find the source code of the following project very useful:
RelFinder, Interactive Relationship Discovery in RDF Data, http://www.visualdataweb.org/relfinder.php

Algorithm to find if one document is included in another, when those two documents are similar

I'm looking for an algorithm that finds whether two text documents are similar, where one document is included in the other document.
I thank you in advance.
You can always use diff with diffstat. The diff documentation isn't precise about the algorithm(s) it uses, but the original authors wrote a paper about it (Google for diff paper), and you can always read the source code.
For more precise answers you will need a more precise question. Are you only interested to know whether one document is a fragment of the other document?
Or are you also interested in knowing whether one can be split up into pieces that each occur in the other document, in the same order? Or are you also interested to know how much material does not occur if you try to match up the material of both documents with a fast algorithm? diff will tell you all those things. Or do you want to know the absolute best matching? diff doesn't always give you that, you'll need something like Levenshtein distance. If one of the documents is much shorter than the other you can use fast string searching algorithms. Etc. Etc.
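To illustrate the Levenshtein-style option for the "is one document included in the other" case, here is a small Java sketch of approximate containment: a dynamic program in which the shorter document may start and end anywhere in the longer one, so the result is the minimum number of edits needed to find the short document somewhere inside the long one. The class name, the sample strings, and any threshold you apply to the result are illustrative assumptions.

public class ApproximateContainment {

    // Minimum edit distance between `needle` and the best-matching
    // substring of `haystack` (semi-global alignment): the first row is
    // all zeros so the match may start at any position in `haystack`.
    static int containmentDistance(String needle, String haystack) {
        int[] prev = new int[haystack.length() + 1]; // all zeros: free start
        int[] curr = new int[haystack.length() + 1];

        for (int i = 1; i <= needle.length(); i++) {
            curr[0] = i; // deleting i characters of the needle
            for (int j = 1; j <= haystack.length(); j++) {
                int cost = needle.charAt(i - 1) == haystack.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution/match
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }

        // The match may also end anywhere, so take the minimum of the last row.
        int best = Integer.MAX_VALUE;
        for (int d : prev) {
            best = Math.min(best, d);
        }
        return best;
    }

    public static void main(String[] args) {
        String shortDoc = "the quick brown fox";
        String longDoc = "yesterday the quikc brown fox jumped over the lazy dog";
        int edits = containmentDistance(shortDoc, longDoc);
        // A small distance relative to the short document's length suggests inclusion.
        System.out.println(edits + " edits out of " + shortDoc.length() + " characters");
    }
}

If the returned edit count is small relative to the length of the shorter document, that document is very likely included (up to minor differences) in the longer one.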

Resources