Can inverted index have multiple words in one entry? - information-retrieval

In information retrieval, the inverted index has one entry per word of the corpus, and each word has a posting list: the list of documents it appears in.
If stemming is applied, each index entry is a stem, so multiple words may map to the same entry if they share the same stem. For example:
Without stemming:
(slowing) --> [D1, D5, D9,...]
(slower) --> [D9, D10, D20,...]
(slow) --> [D2,...]
With stemming:
(slow) --> [D1, D2, D5, D9, D10, D20,...]
I want to avoid stemming, and instead would like to make each entry in my inverted index a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest). For example:
(slow, slows, slowing, slowed, slower, slowest) --> [D1, D2, D5, D9, D10, D20,...]
Would that be possible and feasible or not?

Short Answer:
Simply avoid stemming; that suits your need of not considering slow and slows to be a match.
Long Answer:
Question:
I want to avoid stemming, and instead would like to make each entry in my inverted index as a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest).
Let me try to clear up some confusion you have about inverted lists. It is the documents that are stored in the postings for each term (not the terms themselves).
The words are typically stored in an in-memory dictionary (implemented with a hash table or a trie) containing pointers to the postings (the list of documents that contain that particular term), which are stored on secondary storage and loaded on the fly.
A simple example (without showing document weights):
(information) --> [D1, D5, D9,...]
(informative) --> [D9, D10, D20,...]
(retrieval) --> [D1, D9, D17,...]
..
So, if you don't want to apply stemming, that's fine... In fact, the above example shows an unstemmed index, where the words information and informative appear in their non-conflated forms. In a conflated-term index (with a stemmer or a lemmatizer), you would substitute the different forms with an equivalent representation (say inform). In that case, the index would be:
(inform) --> [D1, D5, D9, D10, D20,...] --- union of the different forms
(retrieval) --> [D1, D9, D17,...]
..
So, this conflated representation matches all possible forms of the word information, e.g. informative, informational etc.
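To make the distinction concrete, here is a minimal Python sketch that builds either kind of index; the toy documents and the startswith-based conflation rule are my own assumptions for illustration:

from collections import defaultdict

def build_index(docs, conflate=None):
    # docs: {doc_id: text}; conflate: optional token -> entry function
    # (e.g. a stemmer). With conflate=None the index stays unstemmed.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term = conflate(token) if conflate else token
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {"D1": "information retrieval", "D9": "informative information"}
print(build_index(docs))   # information/informative get separate postings
# toy conflation rule: map all inform* forms to one entry
print(build_index(docs, conflate=lambda t: "inform" if t.startswith("inform") else t))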
Longer Answer:
Now let's say you want to achieve the best of both worlds, i.e. a representation which allows this conflation to be done in a user-controlled way, e.g. wrapping a word in quotes to require an exact match ("slow" vs. slow in the query), or some indicator to include synonyms of a query term for semantic search (e.g. syn(slow) to include synonyms of the word slow).
For this, you need to maintain separate postings for the non-conflated words, plus additional pointers indicating equivalence between sets of related terms (stem relation, synonym relation, semantic relation, etc.).
Coming back to our example, you would have something like:
(E1)-->(information) --> [D1, D5, D9,...]
|---->(informative) --> [D9, D10, D20,...]
|---->(data) --> [D20, D23, D25,...]
(E2)-->(retrieval) --> [D1, D9, D17,...]
|---->(search) --> [D20, D30, D31,...]
..
Here, I have shown two examples of equivalence classes (concept representations): one for the terms information, informative, data... and one for retrieval, search....
Depending on the query syntax, it would then be possible at retrieval time to perform either exact search or relaxed search (based on inflections, synonyms, etc.).
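A minimal Python sketch of this two-level scheme, mirroring the posting lists above (the lookup function and its exact flag are my own assumptions):

index = {
    "information": ["D1", "D5", "D9"],
    "informative": ["D9", "D10", "D20"],
    "data":        ["D20", "D23", "D25"],
    "retrieval":   ["D1", "D9", "D17"],
    "search":      ["D20", "D30", "D31"],
}
equiv = {  # equivalence classes E1, E2 pointing at their member terms
    "E1": ["information", "informative", "data"],
    "E2": ["retrieval", "search"],
}
term_to_class = {t: e for e, members in equiv.items() for t in members}

def lookup(term, exact=True):
    # Exact search reads one posting list; relaxed search unions the
    # postings of every term in the same equivalence class.
    if exact or term not in term_to_class:
        return sorted(index.get(term, []))
    return sorted(set().union(*(index[m] for m in equiv[term_to_class[term]])))

print(lookup("information"))               # exact: ['D1', 'D5', 'D9']
print(lookup("information", exact=False))  # relaxed: union across E1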

Related

Adding custom properties to RDF statements

I know that I can represent any relation as an RDF triple, as in:
Barack Obama -> president of -> USA
(I am aware that this is not RDF, I am just illustrating)
But how do I add additional information about this relation, for example the time dimension? I mean, he is in his second presidential period, and each period lasts only for a limited time. And what about before and after his presidential periods?
There are several options to do this. I'll illustrate some of the more popular ones.
Named Graphs / Quads
In RDF, named graphs are subsets of an RDF dataset that are assigned a specific identifier (the "graph name"). In most RDF databases, this is implemented by adding a fourth element to the RDF triple, turning it from a triple into a "quad" (sometimes it's also called the 'context' of the triple).
You can use this mechanism to express information about a certain collection of statements. For example (using pseudo N-Quads syntax for RDF):
:i1 a :TimePeriod .
:i1 :begin "2009-01-20T00:00:00Z"^^xsd:dateTime .
:i1 :end "2017-01-20T00:00:00Z"^^xsd:dateTime .
:barackObama :presidentOf :USA :i1 .
Notice the fourth element in the last statement: it links the statement "Barack Obama is president of the USA" to the named graph identified by :i1.
The named graphs approach is particularly useful in situations where you have data to express about several statements at once. It is of course also possible to use it for data about individual statements (as the above example illustrates), though it may quickly become cumbersome if used in that fashion (every distinct time period will need its own named graph).
Representing the relation as an object
An alternative approach is to model the relation itself as an object. The relation between "Barack Obama" and "USA" is not just that one is the president of the other, but that one is president of the other between certain dates. To express this in RDF (as Joshua Taylor also illustrated in his comment):
:barackObama :hasRole :president_44 .
:president_44 a :Presidency ;
:of :USA ;
:begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
:end "2017-01-20T00:00:00Z"^^xsd:dateTime .
The relation itself has now become an object (an instance of the "Presidency" class, with identifier :president_44).
Compared to using named graphs, this approach is much more tailored to asserting data about individual statements. A possible downside is that it becomes a bit more complex to query the relation in SPARQL.
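To make that extra hop concrete, here is a small sketch using Python's rdflib; the http://example.org/ namespace and the property names follow the example above and are assumptions:

from rdflib import Graph

data = """
@prefix :    <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:barackObama :hasRole :president_44 .
:president_44 a :Presidency ;
    :of :USA ;
    :begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
    :end   "2017-01-20T00:00:00Z"^^xsd:dateTime .
"""
g = Graph()
g.parse(data=data, format="turtle")

# The query walks through the role object instead of matching one direct triple.
q = """
PREFIX : <http://example.org/>
SELECT ?person ?country ?begin ?end WHERE {
    ?person :hasRole ?role .
    ?role a :Presidency ; :of ?country ; :begin ?begin ; :end ?end .
}
"""
for person, country, begin, end in g.query(q):
    print(person, country, begin, end)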
RDF Reification
Not sure this approach actually still counts as "popular", but RDF reification is the historically W3C-sanctioned approach to asserting "statements about statements". In this approach we turn the statement itself into an object:
:obamaPresidency a rdf:Statement ;
rdf:subject :barackObama ;
rdf:predicate :presidentOf ;
rdf:object :USA ;
:trueBetween [
:begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
:end "2017-01-20T00:00:00Z"^^xsd:dateTime
] .
There are several good reasons not to use RDF reification in this case, however:
It is conceptually a bit strange. The knowledge that we want to express is about the temporal aspect of the relation, but using RDF reification we are saying something about the statement.
What we have expressed in the above example is: "the statement about Barack Obama being president of the USA is valid between ... and ...". Note that we have not expressed that Barack Obama actually is the president of the USA! You could of course still assert that separately (by just adding the original triple as well as the reified one), but this creates a further duplication/maintenance problem.
It is a pain to use in SPARQL queries.
As Joshua also indicated in his comment, the W3C Note on defining N-ary RDF relations is useful to look at, as it goes into more depth about these (and other) approaches.
RDF*, or RDF-star, allows expressing additional information about an RDF triple via nested structures such as:
<< :BarackObama :presidentOf :USA >> :since :2009
Bit of a contrived example (since the term of a presidency could be expressed by simply normalizing the data structure), but it’s quite useful for expressing “external” concerns (such as probability or provenance).
See Olaf’s blog post and, for technical details, https://www.w3.org/2021/12/rdf-star.html.
I’m pretty sure Apache Jena supports it already, not quite sure about other products like Neo4j.

Most efficient way to print differences of two arrays?

Recently, a colleague of mine asked me how he could test the equality of two arrays. He had two sources of Address objects and wanted to assert that both sources contained exactly the same elements, although order didn't matter.
Using an array, or a collection like List in Java or IList in C#, would be okay, but since there could be two equal Address objects, things like sets can't be used.
In most programming languages, a List already has an equals method doing the comparison (assuming the collection was sorted beforehand), but that gives no information about the actual differences; only that there are some, or none.
The output should inform about elements that are in one collection but not in the other, and vice-versa.
An obvious approach would be to iterate through one of the collections and just call contains(element) on the other one, then do the same the other way around. Assuming a complexity of O(n) for contains, that would result in O(2n²), i.e. O(n²), if I'm correct.
Is there a more efficient way of getting the information "A1 and A2 aren't in List1, A3 and A4 aren't in List2"? Are there data structures better suited to this job than lists? Is it worth it to sort the collections first and use a custom, binary-search contains?
The first thing that comes to mind is using set difference.
In pseudo-Python:
addr1 = set(originalAddr1)
addr2 = set(originalAddr2)
in1notin2 = addr1 - addr2
in2notin1 = addr2 - addr1
allDifferences = in1notin2 | in2notin1  # set union in Python is |, not +
From here you can see that set difference is O(len(set)) and union is O(len(set1) + len(set2)), giving you a linear-time solution with this Python-specific set implementation, instead of the quadratic one you suggest.
I believe other popular languages tend to implement these types of data structures in much the same way, but I can't really be sure about this.
Is it worth to sort the collection [...]?
Compare the naive approach, O(n²), with sorting two lists in O(n log n) and then comparing them in O(n), or with sorting one list in O(n log n) and iterating over the other in O(n).
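One caveat with the set approach: converting to a set silently drops duplicates, which the question explicitly allows for. Here is a sketch of the same idea with a multiset (collections.Counter), assuming Address is hashable (consistent equals/hashCode); the function name is my own:

from collections import Counter

def diff_with_duplicates(list1, list2):
    # Multiset difference: preserves duplicate elements, O(n + m) overall.
    c1, c2 = Counter(list1), Counter(list2)
    in1_not_in2 = list((c1 - c2).elements())  # over-represented in list1
    in2_not_in1 = list((c2 - c1).elements())
    return in1_not_in2, in2_not_in1

print(diff_with_duplicates(["A1", "A2", "A2", "A5"], ["A2", "A5", "A5"]))
# (['A1', 'A2'], ['A5'])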

Dictionary Training for Different Languages

I am working on a messaging system and got the idea of storing the messages in each inbox independently. While working on that idea I asked myself: why not compress the messages? So I am looking for a good way to obtain dictionaries in different languages.
Since the messages are highly related to everyday talk (social chit-chat), I need a good source and method for that.
I need some text for it, like a bunch of millions of emails, books, etc. I would like to create a Huffman tree out of it, with the ability to inline and represent each message as a string within this Huffman tree, so decoding would be fast enough.
The languages I want to support are all over the place. Since newspapers and the like might not be sufficient, I need other sources.
[Update]
As I continue my research, I have noticed that I actually create two dictionaries out of Wikipedia. The first is a dictionary containing typical characters with a certain probability. I also noticed that the special characters used in each language seem to have an even distribution among Latin-based languages (well, actually Latin is just one member of the language family), and even Russian tends to have the same distribution despite its quite different alphabet.
I also noticed that in around 15 to 18% of cases a special character (like ' ', ',', ';') follows another special character. Beyond that, the 8 most frequent words yield 10% of all word occurrences, the next 16 words yield 9%, and so on; by around 128 to 160 words you reach a coverage of 80% of all words. So storing the next 256 and more words becomes pointless in terms of analysis. This leaves me with three dictionaries (characters, words, special characters) per language of around 2 to 5 KB (I use a special format with prefix compression) that save me between 30% and 60% in character reduction; remembering that in Java each character takes 16 bits, the overall reduction is even greater, around 1:5 to 1:10 when the remaining characters are also compressed with the Huffman tree.
Using such a system and compressing numbers as variable-length integers, one produces a byte array that can be used for string matching, loads faster, and makes checking for contained words faster than character-by-character comparison, since one can check for complete words directly, without needing to tokenize or recognize words in the first place.
It also solved the problem of supporting string keys, since I can just transform the string in each language, which results in a set of keys I can use for lookups.
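For the Huffman part itself, here is a minimal sketch of deriving a code table from symbol frequencies with Python's heapq; the toy training string is a stand-in for the Wikipedia-derived frequencies described above:

import heapq
from collections import Counter

def huffman_code(text):
    # Build {symbol: bitstring} by repeatedly merging the two least
    # frequent subtrees; a counter breaks ties so heap tuples stay comparable.
    freq = Counter(text)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_code("social bla bla social talk")
encoded = "".join(codes[ch] for ch in "bla bla")
print(codes, encoded)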

R Constrained Combination generation

Say I have a set of strings:
x=c("a1","b2","c3","d4")
If I have a set of rules that must be met:
if "a1" and "b2" are together in group, then "c3" cannot be in that group.
if "d4" and "a1" are together in a group, then "b2" cannot be in that group.
I was wondering what sort of efficient algorithm are suitable for generating all combinations that meet those rules? What research or papers or anything talk about these type of constrained combination generation problems?
In the above problem, assume it's combn(x,3).
I don't know anything about R, so I'll just address the theoretical aspect of this question.
First, the constraints are really boolean predicates of the form "a1 ∧ b2 → ¬c3" and so on. That means that all valid combinations can be represented by one binary decision diagram (BDD), which can be created by taking each of the constraints and ANDing them together. In theory you might get an exponentially large BDD that way (that usually doesn't happen, but it depends on the structure of the problem), but that would mean that you can't really list all combinations anyway, so it's probably not too bad.
For example, the BDD generated for those two constraints would be as follows (I think - not tested - just to give an idea): [BDD diagram not reproduced here]
But since this is really about a family of sets, a ZDD probably works even better. The difference, roughly, between a BDD and a ZDD is that a BDD compresses nodes that have equal sub-trees (in the total tree of all possibilities), while a ZDD compresses nodes where the solid edge (i.e. "set this variable to 1") goes to False. Both re-use equal sub-trees and thus form a DAG.
The ZDD of the example would be (again not tested): [ZDD diagram not reproduced here]
I find ZDDs a bit easier to manipulate in code, because any time a variable can be set, it will appear in the ZDD. In contrast, in a BDD, "skipped" nodes have to be detected, including "between the last node and the leaf", so for a BDD you have to keep track of your universe. For a ZDD, most operations are independent of the universe (except complement, which is rarely needed in the family-of-sets scenario). A downside is that you have to be aware of the universe when constructing the constraints, because they have to contain "don't care" paths for all the variables not mentioned in the constraint.
You can find more information about both BDDs and ZDDs in The Art of Computer Programming volume 4A, chapter 7.1.4, there is an old version available for free here.
These methods are particularly nice for representing large numbers of such combinations, and for manipulating them before generating all the possibilities. So this will also work when there are many items and many constraints (such that the final count of combinations is not too large), (usually) without creating intermediate results of exponential size.
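For inputs of this size, a brute-force filter over all combinations is a useful baseline before reaching for decision diagrams. A sketch in Python rather than R, where each rule is encoded as a (condition set, forbidden element) pair (my own encoding, for illustration):

from itertools import combinations

x = ["a1", "b2", "c3", "d4"]
rules = [({"a1", "b2"}, "c3"),   # if a1 and b2 are in the group, c3 may not be
         ({"d4", "a1"}, "b2")]   # if d4 and a1 are in the group, b2 may not be

def valid(group):
    s = set(group)
    return all(not cond <= s or bad not in s for cond, bad in rules)

print([g for g in combinations(x, 3) if valid(g)])
# [('a1', 'c3', 'd4'), ('b2', 'c3', 'd4')]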

pattern matching

Suppose I have a set of tuples like this (each tuple will have 1,2 or 3 items):
Master Set:
{(A) (A,C) (B,C,E)}
and suppose I have another set of tuples like this:
Real Set: {(BOB) (TOM) (ERIC,SALLY,CHARLIE) (TOM,SALLY) (DANNY) (DANNY,TOM) (SALLY) (SALLY,TOM,ERIC) (BOB,SALLY) }
What I want to do is to extract all subsets of Tuples from the Real Set where the tuple members can be substituted to become the same as the Master Set.
In the example above, two sets would be returned:
{(BOB) (BOB,SALLY) (ERIC,SALLY,CHARLIE)}
(let BOB=A,ERIC=B,SALLY=C,CHARLIE=E)
and
{(DANNY) (DANNY,TOM) (SALLY,TOM,ERIC)}
(let DANNY=A,SALLY=B,TOM=C,ERIC=E)
It's sort of pattern matching, sort of combinatorics, I guess. I really don't know how to classify this problem or what common plans of attack there are for it. What would the stackoverflow experts suggest?
Separate your tuples into sets by size. Within each set, create a data structure that allows you to efficiently query for tuples containing a given element. The first part of this structure is your tuples as an array (so that each tuple has a canonical index). The second part is a map from each element to the set of indices of tuples containing it (Map String (Set Int)). This is somewhat space-intensive but hopefully not prohibitive.
Then you essentially brute-force it. For all assignments to the first master tuple, restrict the possible assignments to the other master tuples; for all remaining assignments to the second, restrict the assignments to the third and beyond, etc. The algorithm is basically inductive.
I should add that I don't think the problem is NP-complete so much as just flat worst-case exponential. It's not a decision problem, but an enumeration problem. And it's fairly easy to imagine scenarios of inputs that blow up exponentially.
It will be difficult to do efficiently since your problem is probably NP-complete (it includes subgraph isomorphism as a special case). That assumes the patterns and database both vary in size, though. How much data are you searching? How complicated will your patterns be? I would recommend the brute force solution first, then test if that is too slow and you need something fancier.
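Along those lines, here is a minimal brute-force sketch in Python, treating tuples as unordered and enforcing one consistent, injective variable-to-name substitution (the function names and output format are my own assumptions). It recovers the two substitutions from the example, along with any other consistent ones:

from itertools import permutations

def find_matches(master, real):
    # Map each master tuple onto a distinct real tuple of the same size
    # under one consistent, injective variable -> name substitution.
    def solve(i, subst, used):
        if i == len(master):
            yield [real[j] for j in sorted(used)], dict(subst)
            return
        pattern = master[i]
        for j, tup in enumerate(real):
            if j in used or len(tup) != len(pattern):
                continue
            for perm in permutations(tup):  # tuples act as sets here
                new = dict(subst)
                ok = True
                for var, name in zip(pattern, perm):
                    if var in new:
                        ok = new[var] == name
                    elif name in new.values():  # keep the mapping injective
                        ok = False
                    else:
                        new[var] = name
                    if not ok:
                        break
                if ok:
                    yield from solve(i + 1, new, used | {j})
    yield from solve(0, {}, frozenset())

master = [("A",), ("A", "C"), ("B", "C", "E")]
real = [("BOB",), ("TOM",), ("ERIC", "SALLY", "CHARLIE"), ("TOM", "SALLY"),
        ("DANNY",), ("DANNY", "TOM"), ("SALLY",), ("SALLY", "TOM", "ERIC"),
        ("BOB", "SALLY")]
for subset, subst in find_matches(master, real):
    print(subset, subst)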
