Algorithm to find if one document is included in another, when the two documents are similar

I'm looking for an algorithm that determines whether two text documents are similar, in the case where one document is included in the other.
Thanks in advance.

You can always use diff with diffstat. The diff documentation isn't precise about the algorithm(s) it uses, but the original authors wrote a paper about it (Google for diff paper), and you can always read the source code.
For more precise answers you will need a more precise question. Are you only interested in knowing whether one document is a fragment of the other?
Or are you also interested in knowing whether one can be split up into pieces that each occur in the other document, in the same order? Or do you want to know how much material does not occur if you try to match up the material of both documents with a fast algorithm? diff will tell you all those things. Or do you want to know the absolute best matching? diff doesn't always give you that; you'll need something like Levenshtein distance. If one of the documents is much shorter than the other, you can use fast string-searching algorithms. Etc., etc.
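As a rough illustration of the "one document is a fragment of the other" case, here is a minimal sketch in Python using the standard library's difflib (not the diff tool itself); the 0.9 threshold and the example strings are arbitrary assumptions.

# Minimal sketch: estimate whether the shorter document is (mostly) contained
# in the longer one, using difflib from the standard library.
from difflib import SequenceMatcher

def containment_ratio(short_doc: str, long_doc: str) -> float:
    """Fraction of the shorter document covered by blocks that also occur in the longer one."""
    matcher = SequenceMatcher(None, short_doc, long_doc, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(short_doc) if short_doc else 1.0

if __name__ == "__main__":
    fragment = "the quick brown fox jumps over the lazy dog"
    document = "Preface. " + fragment + " And then some more text follows."
    ratio = containment_ratio(fragment, document)
    # The 0.9 threshold is an arbitrary choice for this sketch.
    print(f"containment ratio: {ratio:.2f}",
          "-> likely included" if ratio > 0.9 else "-> probably not")

If the documents are of comparable size, SequenceMatcher.ratio() gives an overall similarity instead; for the absolute best alignment you would still want an edit-distance approach, as noted above.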

Related

Gremlin: How can I merge groups of vertices when they are similar

My query returns groups of user vertices like this:
[
[Pedro, Sabrina, Macka, Fer]
[Pedro, Sabrina, Macka, Fer, Britney]
[Brintey, Fred, Christina]
]
The first 2 groups are similar: they contain mostly the same vertices. I need to merge them.
I need to merge the groups that are, for example, 80% similar (80% of the elements are the same).
Is this possible in Gremlin? How can I do this?
Edit:
https://gremlify.com/2ykos4047g5
This Gremlify project creates fake output similar to what I have in my query. I need the first 2 lists merged into a single one, because they contain almost the same vertices, and not the third one, because it's completely different from the others.
So what I'm asking is how to write a query that compares all the lists, checks how many vertices they have in common, and based on that decides whether to merge them into a single list or not.
The expected output for the gremlify project is:
[
[
"Pedro",
"Sabrina",
"Macka",
"Fer",
"Britney"
],
[
"Garry",
"Dana",
"Lily"
]
]
Gremlin doesn't have steps that merge lists based on how much they are alike. Gremlin is fairly flexible, so I imagine there might be ways to use its steps in creative ways to get what you want, but the added complexity may not be worth it. My personal preference is to use Gremlin to retrieve my data, filter away whatever is extraneous, and then transform it as close as possible to its final result while maintaining a balance with readability.
Given that thinking, if your result from Gremlin is simply a list of lists of strings and your Gremlin up to that point is well structured and performant, then perhaps Gremlin has gotten you far enough and its job is done. Take that result and post-process it on your application side by writing some code to take you to your final result. With that approach you have your full programming language environment at your disposal, with all its libraries available to make that final step easier.
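As an example of that post-processing step, here is a minimal sketch in Python (not Gremlin): it repeatedly merges any two groups whose overlap, measured against the smaller group, reaches 80%. The overlap measure and the 0.8 threshold are assumptions; Jaccard similarity would be an obvious alternative.

# Minimal sketch: merge groups of vertex names whose overlap is >= 80%.
def overlap(a: set, b: set) -> float:
    """Fraction of the smaller group's members that also appear in the other group."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def merge_similar(groups, threshold=0.8):
    merged = [set(g) for g in groups]
    changed = True
    while changed:                      # repeat until no pair can be merged
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return [sorted(g) for g in merged]

groups = [["Pedro", "Sabrina", "Macka", "Fer"],
          ["Pedro", "Sabrina", "Macka", "Fer", "Britney"],
          ["Brintey", "Fred", "Christina"]]
print(merge_similar(groups))
# -> [['Britney', 'Fer', 'Macka', 'Pedro', 'Sabrina'], ['Brintey', 'Christina', 'Fred']]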
I'd further add that your example is a bit contrived and focuses on an arbitrary result, which reduces your Gremlin question to a collection-manipulation question. With graphs and Gremlin I often find that a heavy focus on collection manipulation to improve the quality of a result (rather than just the format of a result) implies that I should go back to the core of my traversal algorithm rather than try to tack on extra manipulation at the end of the traversal.
For example, if the output you're asking about in this question relates back to your previous questions here and here, then I'd wonder if you shouldn't rethink the rules of your algorithm. Perhaps you really aren't "detecting triangles and then trying to group them accordingly", as I put it in one of my answers there. Maybe there is a completely different algorithm that will solve your problem and is even more effective and performant.
This blog post, "Reducing Computational Complexity with Correlate Traversals", does an excellent job in explaining this general notion. Though it focuses on centrality algorithms, the general message is quite clear:
All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.
In your case, perhaps you need more flexibility in the rules you stated for your algorithm to thus allow a better (i.e. less rigid) grouping in your results. In any case, it is something to think about and in the worst case you can obviously just take the brute force approach that you describe in your question and get your result.

How to compute semantic similarity in documents

I am doing a project in which I need to rank text documents according to a search query, like a search engine, but I need to rank the documents by the semantic similarity of words or sentences. I don't know where to start with finding semantic similarity using Java. Is there any link or paper through which I can start finding the semantic similarity of words in documents, or any other ideas?
The standard way to represent documents in term-space is to treat the terms as mutually orthogonal or independent of each other, e.g. the terms "atomic" and "nuclear" although being synonymous and hence interchangeable are treated as distinct, whereas the semantic similarity between this pair of words should be fairly high.
Thus, for implementing a semantic similarity based score, you need to know the relation between a pair of words, for which you can use either of the following.
An external resource such as a Wordnet or a semantic similarity library such as DISCO.
A corpus analysis methodology such as Latent Semantic Analysis (LSA) which reduces the dimensionality of the term space by combining semantically similar terms such as "atomic" and "nuclear".
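To illustrate the LSA idea, here is a minimal sketch in Python with NumPy rather than Java (the tiny term-document matrix is made up): after a rank-2 SVD, "atomic" and "nuclear" end up with nearly identical term vectors because they co-occur with the same context words, even though they never appear in the same document.

# Minimal sketch of LSA with NumPy; the toy matrix and documents are invented.
import numpy as np

terms = ["atomic", "nuclear", "energy", "reactor", "banana", "fruit"]
# Term-document count matrix for three toy documents:
#   d1: "atomic energy reactor", d2: "nuclear energy reactor", d3: "banana fruit"
A = np.array([
    [1, 0, 0],   # atomic
    [0, 1, 0],   # nuclear
    [1, 1, 0],   # energy
    [1, 1, 0],   # reactor
    [0, 0, 1],   # banana
    [0, 0, 1],   # fruit
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In raw term space "atomic" and "nuclear" never co-occur, so their similarity is 0.
print("raw cosine:", cosine(A[0], A[1]))

# Rank-2 LSA: project terms onto the top 2 singular vectors.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * S[:k]          # each row is a term in the reduced space
print("LSA cosine:", cosine(term_vectors[0], term_vectors[1]))  # close to 1.0

In Java the same computation can be done with any linear-algebra library that provides an SVD, or delegated to a semantic-similarity toolkit such as the one mentioned below.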
Have a look at this demo for semantic similarity.
It shows demos of different algorithms, so you can see which one works for you and go with it. The "semilar" module can also be used from Java, I think. You can try it; I haven't tried it yet, but the demo on that page covers the same thing. Thanks :)

Graph data structures in LabVIEW

What's the best way to represent graph data structures in LabVIEW?
I'm doing some basic algorithm review over the holiday, and I'd prefer to not implement all of the storage and traversals myself, if possible.
(I'm aware that there was a thread a few years ago on LAVA; is that my best bet?)
I've never had a need to do this myself, so I never really looked into it, but there are some people who have done some work on it, as far as I know.
Brian K. has posted something over here, although it's been a long time since I looked at it:
https://decibel.ni.com/content/docs/DOC-12668
If that doesn't help, I would suggest you read this and then try sending a PM to Daklu there, as he's the most likely candidate to have something.
https://decibel.ni.com/content/thread/8179?tstart=0
If not, I would suggest posting a question on LAVA, as you're more likely to find the relevant people there.
Well, you don't have that many options for graphs, from a simple point of view. It really depends on the types of algorithms you are running, in order to choose the most convenient representation.
Adjacency matrix is simple, but can be slow for some tasks, and can be wasteful if the graph is not dense.
You can keep a couple of lists and hash maps of your edges and vertices. If each edge or vertex is assigned a unique index into its list as it is created, it's pretty simple to keep things under control. Each vertex can then be associated with a list of its neighbors. Depending on your needs, you could divide that neighbor list into in-edges and out-edges. Also, depending on your lookup needs, you could choose to index edges by their in- or out-vertex, or both, or simply by a unique index number.
I had a glance at the LabVIEW quick reference, and while it was not obvious from there how you would do this, as long as it has arrays of some sort, you can implement a graph. I'm sure you'll be fine.
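Since a LabVIEW block diagram can't be shown as text, here is a minimal sketch in Python of the index-based layout described above (flat vertex and edge lists, plus per-vertex in/out edge indices); the same bookkeeping translates to LabVIEW arrays and clusters.

# Minimal sketch of an index-based graph: flat vertex/edge lists, a label->index
# lookup map, and per-vertex lists of outgoing and incoming edge indices.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    label: str
    out_edges: list = field(default_factory=list)   # indices into Graph.edges
    in_edges: list = field(default_factory=list)

@dataclass
class Edge:
    src: int    # index into Graph.vertices
    dst: int

class Graph:
    def __init__(self):
        self.vertices = []           # index -> Vertex
        self.edges = []              # index -> Edge
        self.index_of = {}           # label -> vertex index (the "hash map" part)

    def add_vertex(self, label):
        self.index_of[label] = len(self.vertices)
        self.vertices.append(Vertex(label))
        return self.index_of[label]

    def add_edge(self, src_label, dst_label):
        s, d = self.index_of[src_label], self.index_of[dst_label]
        self.edges.append(Edge(s, d))
        e = len(self.edges) - 1
        self.vertices[s].out_edges.append(e)
        self.vertices[d].in_edges.append(e)
        return e

    def neighbors(self, label):
        v = self.vertices[self.index_of[label]]
        return [self.vertices[self.edges[e].dst].label for e in v.out_edges]

g = Graph()
for name in ("A", "B", "C"):
    g.add_vertex(name)
g.add_edge("A", "B")
g.add_edge("A", "C")
print(g.neighbors("A"))   # ['B', 'C']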

Is there a way to use arbitrary type of value as key in environment or named list in R?

I've been looking for a proper implementation of a hash map in R, with functionality similar to the dict type in Python.
After some googling and searching the R documentation, I found that environments and named lists are the ONLY options I can use (is that really so?).
But the problem with the two is that they can only take characters as keys for the hashing, not even a number, let alone other types of things.
So is there a way to use arbitrary things as keys? Or at least more than just characters?
Or is there a better implementation of a hash map that I didn't find, with better functionality?
Thanks in advance.
Edit:
My current problem: I need a map to store the distance relationship between data points. That is, the key of the map is a tuple (p1, p2) and the value is a number.
The reason I asked a generic question instead of a concrete one is that I'm learning R recently and I want to know how to manipulate some of the most fundamental data structures, not only what my problem refers to. So I may need to use other things as keys in the future, and I want to avoid asking similar questions with only minor differences every time I run into them.
Edit 2:
I got a lot of very good advice on this topic. It seems I'm still thinking quite in the Pythonic way, rather than the should-be R way. I should really get more R-ly! I think my purpose can easily be satisfied by a matrix in R. Thanks all!
The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.
That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.
Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
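To make the "more R-like way" concrete for the asker's distance problem, here is a minimal sketch of the pairwise-distance-matrix idea (shown in Python/NumPy purely for illustration; in R this would be an ordinary matrix with dimnames, indexed as dist_mat["p1", "p2"]). The point names and values are made up.

# Minimal sketch: a matrix indexed by point number replaces a hash keyed on (p1, p2).
import numpy as np

points = ["p1", "p2", "p3"]
idx = {name: i for i, name in enumerate(points)}   # name -> row/column index

# Entry [i, j] is the distance between point i and point j.
dist_mat = np.zeros((len(points), len(points)))

def set_distance(a, b, d):
    dist_mat[idx[a], idx[b]] = d
    dist_mat[idx[b], idx[a]] = d     # keep it symmetric

set_distance("p1", "p2", 3.5)
set_distance("p2", "p3", 1.2)
print(dist_mat[idx["p1"], idx["p2"]])   # 3.5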

Should I use an expression parser in my Math game?

I'm writing some children's Math Education software for a class.
I'm going to try to present students of varying skill levels with randomly generated math problems of different types, in fun ways.
One of the frustrations of using computer-based math software is its rigidity. If you have taken an online math class, you'll know all about the frustration of taking an online quiz and having your correct answer thrown out because it isn't formatted exactly the way they expect, or because of some weird spacing issue.
So, originally I thought, "I know! I'll use an expression parser on the answer box so I'll be able to evaluate anything they enter and even if it isn't in the same form I'll be able to check if it is the same answer." So I fire up my IDE and start implementing the Shunting Yard Algorithm.
This would solve the problem of it not taking fractions in the smallest form and other issues.
However, it then hit me that for most problems a tricky student would simply be able to enter the original expression into the answer box, and my expression parser would dutifully parse and evaluate it to the correct answer!
So, should I not be using an expression parser in this instance? Do I really have to generate a single form of the answer and do a string comparison?
One possible solution is to note how many steps your expression evaluator takes to evaluate the problem's original expression, and to compare this to the optimal answer. If there's too much difference, then the problem hasn't been reduced enough and you can suggest that the student keep going.
Don't be surprised if students come up with better answers than your own definition of "optimal", though! I was a TA/grader for several classes, and the brightest students routinely had answers on their problem sets that were superior to the ones provided by the professor.
For simple problems where you're looking for an exact answer, then removing whitespace and doing a string compare is reasonable.
For more advanced problems, you might do the Shunting Yard Algorithm (or similar) but perhaps parametrize it so you could turn on/off reductions to guard against the tricky student. You'll notice that "simple" answers can still use the parser, but you would disable all reductions.
For example, on a division question, you'd disable the "/" reduction.
This is a great question.
If you are writing an expression system and an evaluation/transformation/equivalence engine (isn't there one available somewhere? I am almost 100% sure that there is an open source one somewhere), then it's more of an education/algebra problem: is the student's answer algebraically closer to the original expression or to the expected expression.
I'm not sure how to answer that, but here's an idea (not necessarily practical): perhaps your evaluation engine can count transformation steps to equivalence. If the answer takes fewer steps to reach the expected expression than it did to reach the original, it might be OK. If it's too close to the original, it's not.
You could use an expression parser, but apply restrictions on the complexity of the expressions permitted in the answer.
For example, if the goal is to reduce (4/5)*(1/2) and you want to allow either (2/5) or (4/10), then you could restrict the set of allowable answers to expressions whose trees take the form (x/y) and which also evaluate to the correct number. Perhaps you would also allow "0.4", i.e. expressions of the form (x) which evaluate to the correct number.
This is exactly what you would (implicitly) be doing if you graded the problem manually -- you would be looking for an answer that is correct but which also falls into an acceptable class.
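A minimal sketch of that restriction (the question doesn't name a language, so Python is used only for illustration): parse the answer, accept only a bare number or a single fraction x/y, and then check the numeric value against the expected one.

# Minimal sketch: accept an answer only if it is a literal number or one literal
# divided by another, and if it evaluates to the expected value.
import ast
import math

def check_answer(answer: str, expected: float) -> bool:
    try:
        tree = ast.parse(answer, mode="eval").body
    except SyntaxError:
        return False
    if isinstance(tree, ast.Constant) and isinstance(tree.value, (int, float)):
        value = tree.value                               # e.g. "0.4"
    elif (isinstance(tree, ast.BinOp) and isinstance(tree.op, ast.Div)
          and all(isinstance(s, ast.Constant) and isinstance(s.value, (int, float))
                  for s in (tree.left, tree.right))
          and tree.right.value != 0):
        value = tree.left.value / tree.right.value       # e.g. "2/5" or "4/10"
    else:
        return False                                     # e.g. "(4/5)*(1/2)" is rejected
    return math.isclose(value, expected)

expected = (4 / 5) * (1 / 2)
for answer in ["2/5", "4/10", "0.4", "(4/5)*(1/2)"]:
    print(answer, "->", check_answer(answer, expected))
# 2/5, 4/10 and 0.4 pass; the unreduced original expression is rejected.

The same shape-checking idea extends to other allowed forms by adding cases for the tree patterns you want to accept, which is effectively the "disable reductions" approach described in the earlier answer.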
The usual way of doing this in mathematics assessment software is to allow the question setter to specify expressions/strings that are not allowed in a correct answer.
If you happen to be interested in existing software, there's the open-source Stack http://www.stack.bham.ac.uk/ (or various commercial options such as MapleTA). I suspect most of the problems that you'll come across have also been encountered by Stack so even if you don't want to use it, it might be educational to look at how it approaches things.
