The relationship between the inverted index and the Vector Space Model [closed] - information-retrieval

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
These days I am studying information retrieval (especially text retrieval), and I want to build a search engine. But I am confused about how the two things in the title relate: the inverted index and the Vector Space Model (and, in addition, the Boolean model and other ways of representing a document as a vector).
I think the inverted index is an optional component of the Vector Space Model, since this index structure helps a program look up terms (or words) more efficiently.
That is my understanding. Is it right?
Any comments are welcome.

A document-term matrix and an inverted index are two ways to store (index) a document collection.
Once the documents are indexed, you can use the vector space model or a language model as the retrieval model of the search engine; the index is the storage layer and the retrieval model is the scoring layer on top of it.
Also, if you just need a working search engine over data you already have, and implementing one from scratch is not the point, you can use Apache Lucene.
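To make the split between storage and scoring concrete, here is a minimal sketch in Python. It is not how Lucene works internally; the function names, the whitespace tokenizer, and the plain `tf * idf` weighting are all assumptions chosen for illustration. The inverted index stores postings, and the vector space model is applied on top of it at query time.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Storage layer: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: naive lowercase + whitespace tokenization.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Scoring layer: rank documents by TF-IDF dot product,
    normalized by document vector length (cosine similarity up to
    a query-length constant, which does not change the ranking)."""
    # Squared length of every document vector in TF-IDF space.
    norms = defaultdict(float)
    for postings in index.values():
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf) ** 2

    # Only documents that appear in the postings of a query term are touched.
    scores = defaultdict(float)
    for term, q_tf in Counter(query.lower().split()).items():
        postings = index.get(term)
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (q_tf * idf) * (tf * idf)

    ranked = [
        (doc_id, score / math.sqrt(norms[doc_id]))
        for doc_id, score in scores.items()
        if norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

The point of the index here is efficiency: a query only visits the postings lists of its own terms instead of comparing against every document vector.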


Best practice for functions accepting key value pairs as arguments in R? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Some languages have an obvious choice of data structure when writing a function that accepts key-value pairs (e.g. a hash would be the obvious option in Ruby).
In R, there are a few ways that come to mind. Examples:
as a named list
as a standard list
as a named vector
as a character vector with some regular expression to extract the key and value from a single string
Is there any one data structure that is recommended over the others or is considered 'best practice'? Or does it totally depend on the use case? E.g. named anything (vector, list) will create limitations (since names have various constraints). Lists tend to be more tricky to traverse than data.frames. Parsing character vectors could have unintended consequences if the inputs aren't in a very well understood and consistent format. Etc.
Is there a dominant convention that overcomes each of these issues, or is it simply a matter of selecting the best tool for each unique circumstance?

Levenshtein distance with dictionary [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
How can I extend edit distance with one extra operation: taking an anagram of the existing word? Every intermediate step must be a word from a given list of words.
The standard technique for anagrams is to store words in canonical sorted order, e.g. "Banana" becomes "aaabnn". Do that for all valid words, then consider Levenshtein distances between those. You will want to map from canonical to a valid set, e.g. valid['dgo'] = {'dog', 'god'}
Take a look at tail /usr/share/dict/words if you need a set of valid words.
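A minimal Python sketch of the approach described above, assuming a plain word list as input. The helper names `canonical` and `build_valid` are hypothetical, and the sketch does not enforce the constraint that every intermediate step is a valid word; it only shows the canonical mapping and a standard Levenshtein distance that can be applied to the canonical forms.

```python
from collections import defaultdict


def canonical(word):
    """Canonical anagram key: the word's letters, lowercased and sorted."""
    return "".join(sorted(word.lower()))


def build_valid(words):
    """Map each canonical key to the set of valid words that share it."""
    valid = defaultdict(set)
    for word in words:
        valid[canonical(word)].add(word.lower())
    return valid


def levenshtein(a, b):
    """Standard dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

With this in place, all anagrams of a word collapse to one key (so an anagram step costs nothing at the key level), and `levenshtein(canonical(x), canonical(y))` compares two anagram classes. To use the system dictionary, one could read `/usr/share/dict/words` line by line and pass the lines to `build_valid`.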

How to store data in a tree structure in R? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am building a decision tree. I want to store the splitting condition or threshold value, parent, leaf, and other variables in a tree structure, so that I can read those values back at prediction time. I am not using any random-forest package because I want to build the tree exactly the way I wish.
A nested list structure is the usual way to go. Take a look at how the dendrogram objects are stored.
?as.dendrogram
The other package to review would be igraph.
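The nested-structure idea is not specific to R. As a language-neutral illustration, here is a small Python sketch (not R, and not how dendrogram objects are laid out) in which each node is a dictionary holding either a prediction or a split with two children. The field names `feature`, `threshold`, `left`, and `right` are assumptions made for this example, and parent links are omitted because they are implied by the nesting.

```python
def make_leaf(prediction):
    """A terminal node that only stores the predicted value."""
    return {"leaf": True, "prediction": prediction}


def make_split(feature, threshold, left, right):
    """An internal node: rows with row[feature] <= threshold go left."""
    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "left": left,
        "right": right,
    }


def predict(node, row):
    """Walk from the root to a leaf, reading the stored split at each node."""
    while not node["leaf"]:
        if row[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]
```

In R the same shape would be a list of lists, with each node's children stored as list elements.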

What's the best way to shrink a large body of text? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
Suppose I'm given a large essay. What's the best method available to shrink it down into a small string of letters that I can decode later? Suppose I'm allowed to keep a set of predefined keys if I need to.
Assuming the text is English and you want to minimize the size of the "small string", you will find a number of algorithms here: http://www.maximumcompression.com/data/text.php
For ease-of-implementation, however, you might simply want to use zlib, as it's generally available.
Further, if you want to encrypt the compressed text, you should use AES in CTR mode (and possibly append an HMAC; ref: http://www.daemonology.net/blog/2009-06-11-cryptographic-right-answers.html).
Finally, assuming that by "asring of letters" you meant "a string of letters", you could base-64 encode the encrypted data, which would give you a string of letters, numbers, and a limited amount of punctuation.
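A minimal sketch of the compress-then-encode steps, using Python's standard `zlib` and `base64` modules. The function names `shrink` and `restore` are made up for this example, and the optional encryption step is left out because it needs a third-party library.

```python
import base64
import zlib


def shrink(text):
    """Compress UTF-8 text with zlib, then base-64 encode it to a string."""
    compressed = zlib.compress(text.encode("utf-8"), 9)  # 9 = best compression
    return base64.b64encode(compressed).decode("ascii")


def restore(encoded):
    """Reverse of shrink: base-64 decode, then decompress."""
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")
```

Note that base-64 encoding grows the data by about a third, so the result is only smaller than the input when the text compresses well; very short inputs can come out longer.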

What do the terms rep-invariant and rep ok mean? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I hear these terms a lot when people talk about software engineering and abstract data types. What do they mean? Can anyone give me a concrete example of the concept?
A representation invariant is a condition concerning the state of an object. The condition can always be assumed to be true for a given object, and operations are required not to violate it.
In a Deck class, a representation invariant might be that there are always 52 Cards in the deck. A shuffle() operation is thus guaranteed not to drop any cards on the floor. This in turn means that someone calling shuffle(), or indeed any other operation, does not need to check the number of cards before and after: they are guaranteed that it will always be 52.
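The Deck example can be sketched in Python as follows. This is an illustration under assumptions, not a canonical implementation: cards are represented as the integers 0 to 51, and the method name `_rep_ok` follows the common convention in which "rep ok" (often written `repOk`) is the name of the method that checks the representation invariant.

```python
import random


class Deck:
    SIZE = 52  # the invariant: a deck always holds exactly 52 distinct cards

    def __init__(self):
        # Assumption: cards are modeled as the integers 0..51.
        self._cards = list(range(self.SIZE))
        self._rep_ok()

    def _rep_ok(self):
        """Check the representation invariant; raise if it is violated."""
        assert len(self._cards) == self.SIZE, "wrong number of cards"
        assert len(set(self._cards)) == self.SIZE, "duplicate cards"

    def shuffle(self):
        random.shuffle(self._cards)
        self._rep_ok()  # every mutating operation re-establishes the invariant

    def __len__(self):
        return len(self._cards)
```

Because each operation checks the invariant before returning, callers never need to count the cards themselves.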
