What are the cases where Inverse Document Frequency is not useful in information retrieval?
You may not want to use IDF if your system should not weight rare terms more heavily than frequently occurring ones. Moreover, computing idf is a costly operation. This is evident from the fact that in the most commonly used scoring scheme, i.e. lnc.ltc, we do not compute idf scores for terms on the document side.
Moreover, if your search engine only processes one-word queries, using idf is useless, as it will be the same for every document and will not change the ranking. Hope it helps.
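A tiny Python illustration of that single-word-query point (toy corpus and hand-rolled tf-idf, not any particular library's API): dropping idf scales every document's score by the same constant, so the ranking does not change.

import math

# Toy corpus: document id -> term counts (purely illustrative).
docs = {
    "d1": {"apple": 3, "pie": 1},
    "d2": {"apple": 1, "banana": 4},
    "d3": {"banana": 2, "pie": 2},
}

def idf(term):
    # Standard log(N / df) inverse document frequency.
    df = sum(1 for counts in docs.values() if term in counts)
    return math.log(len(docs) / df) if df else 0.0

def score(doc_id, query_terms, use_idf=True):
    counts = docs[doc_id]
    return sum(counts.get(t, 0) * (idf(t) if use_idf else 1.0) for t in query_terms)

# For a one-word query, idf multiplies every score by the same constant,
# so the ranking is identical with or without it.
query = ["apple"]
print(sorted(docs, key=lambda d: score(d, query, True), reverse=True)
      == sorted(docs, key=lambda d: score(d, query, False), reverse=True))  # True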
Given a list of queries and given one document, I want to rank the queries based on how relevant they are to the given document.
For each query, I calculated the term frequency of each word in the query.
(term frequency defined as the number of times the word occurs in the document divided by the total number of words in the document)
Now, I summed up the term frequencies for each term in the query.
For example:
search query: "Hello World"
document: "It is a beautiful world"
tf for 'Hello': 0
tf for 'World': 1/5 = 0.2
total tf for query 'Hello World' = 0 + 0.2 = 0.2
My question is: what is the best way to normalize the term frequency score for each query, so that a long query doesn't result in a larger relevance score?
And, is there a better way for me to score the query than just using the tf score?
I can't use tf-idf in my scenario because I am ranking them against just one document.
Before answering your question, I want to correct you on your definition of term frequency. What you defined is actually the maximum likelihood estimate of the term's probability in the document, not the raw term frequency.
So, I am interpreting your first question as follows.
What is the best way to normalize the final score (the sum of maximum likelihood estimates) for each query?
One simple approach is to divide the score by the query length, so that a longer query doesn't automatically receive a higher score. More advanced techniques are also used to compute relevance scores in search engines.
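For concreteness, a minimal Python sketch of that query-length normalization, following the example above (the helper names are made up):

def term_frequency(term, doc_tokens):
    # Fraction of the document's tokens matching the term
    # (the maximum likelihood estimate discussed above).
    return doc_tokens.count(term) / len(doc_tokens)

def query_score(query, document):
    doc_tokens = document.lower().split()
    query_tokens = query.lower().split()
    total = sum(term_frequency(t, doc_tokens) for t in query_tokens)
    # Divide by query length so long queries are not favoured.
    return total / len(query_tokens)

print(query_score("Hello World", "It is a beautiful world"))  # 0.1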
Is there a better way for me to score the query than just using the tf score?
Yes, of course! One well-known and widely used ranking method, Okapi BM25, can be used here with little modification. You can think of your target task as a ranking problem.
So, given a document, rank a set of queries based on their relevance with the document.
This is a well-known problem in the context of search engines. I encourage you to follow some lectures from any university's information retrieval class. For example, this lecture slide talks about the probabilistic ranking principle, which aligns with your need.
Coming to your remark on not being able to use idf ('I can't use tf-idf in my scenario because I am ranking them against just one document'), here's what you could do:
Keep in mind that your ranking (retrievable) units are queries. Hence, consider that there's a reversal of roles between documents and queries with reference to the standard terminology.
In other words, treat your queries as pseudo-documents and your document as the pseudo-query.
You can then apply a whole range of ranking models that make use of collection statistics (computed over the set of queries), e.g. language models, BM25, DFR, etc.
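A rough, self-contained Python sketch of that role reversal, with a hand-rolled BM25 (standard k1 and b defaults; the example queries are made up): the queries act as pseudo-documents, the document's terms act as the pseudo-query, and the collection statistics come from the query set.

import math
from collections import Counter

# Made-up example queries to be ranked against the one document.
queries = ["hello world", "beautiful scenery of the world", "machine learning"]
document = "it is a beautiful world"

# Role reversal: each query is a pseudo-document; collection statistics
# (document frequency, average length) are computed over the query set.
pseudo_docs = [q.lower().split() for q in queries]
N = len(pseudo_docs)
avgdl = sum(len(d) for d in pseudo_docs) / N
df = Counter(t for d in pseudo_docs for t in set(d))

def bm25(pseudo_doc, pseudo_query, k1=1.2, b=0.75):
    tf = Counter(pseudo_doc)
    score = 0.0
    for term in pseudo_query:
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(pseudo_doc) / avgdl))
        score += idf * norm
    return score

# The document plays the role of the query.
pseudo_query = document.lower().split()
for q, pd in sorted(zip(queries, pseudo_docs),
                    key=lambda qp: bm25(qp[1], pseudo_query), reverse=True):
    print(round(bm25(pd, pseudo_query), 3), q)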
I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy, and most of the results you get out are outright useless. It's about as reliable as a proof by example.
They should only be used for explorative analysis, i.e. to find patterns that you then need to study with other methods.
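If you do use clustering in that exploratory spirit, one simple route (sketched in Python rather than R; the data and parameters are made up) is to turn each user's reading history into a vector of per-category proportions over the 7 values and cluster those numeric vectors; k-modes (e.g. klaR::kmodes in R) is an alternative that works directly on categorical data.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one (user_id, category_value) row per document read.
reads = [
    ("u1", "politics"), ("u1", "politics"), ("u1", "sport"),
    ("u2", "sport"), ("u2", "sport"),
    ("u3", "science"), ("u3", "science"), ("u3", "sport"),
]
categories = sorted({c for _, c in reads})   # the 7 metadata values in practice
users = sorted({u for u, _ in reads})

# One row per user: the proportion of their reads falling in each category.
profiles = np.zeros((len(users), len(categories)))
for u, c in reads:
    profiles[users.index(u), categories.index(c)] += 1
profiles /= profiles.sum(axis=1, keepdims=True)

# Exploratory only; with real data you would try n_clusters=7.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print(dict(zip(users, labels)))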
I know how HyperLogLog works, but I want to understand the real-world situations where it really applies, i.e. where it makes sense to use HyperLogLog and why. If you've used it to solve any real-world problems, please share. What I am looking for is: given HyperLogLog's standard error, in which real-world applications is it actually used today, and why does it work there?
("Applications for cardinality estimation", too broad? I would like to add this simply as a comment but it won't fit).
I would suggest you turn to the numerous academic studies of the subject; academic papers usually contain some information on "prior research on the subject" as well as "applications for which the subject has been used". You could start by traversing the references of interest in the following article:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flajolet et al.
... This problem has received a great deal of attention over the past
two decades, finding an ever growing number of applications in
networking and traffic monitoring, such as the detection of worm
propagation, of network attacks (e.g., by Denial of Service), and of
link-based spam on the web [3]. For instance, a data stream over a
network consists of a sequence of packets, each packet having a
header, which contains a pair (source–destination) of addresses,
followed by a body of specific data; the number of distinct header
pairs (the cardinality of the multiset) in various time slices is an
important indication for detecting attacks and monitoring traffic, as
it records the number of distinct active flows. Indeed, worms and
viruses typically propagate by opening a large number of different
connections, and though they may well pass unnoticed amongst a huge
traffic, their activity becomes exposed once cardinalities are
measured (see the lucid exposition by Estan and Varghese in [11]).
Other applications of cardinality estimators include data mining of
massive data sets of sorts—natural language texts [4, 5], biological
data [17, 18], very large structured databases, or the internet graph,
where the authors of [22] report computational gains by a factor of
500+ attained by probabilistic cardinality estimators.
At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
Stack Overflow might use HyperLogLog to count the views of each question. Stack Overflow wants to make sure that one user can only contribute one view per item, so every view is unique.
It could be implemented with a set: every question would have a set that stores the usernames:
question#ID121e={username1,username2...}
Creating a set for each question would take up some space, and consider how many questions have been asked on this platform: the total amount of space needed to keep track of every view per user would be huge. But a HyperLogLog uses about 12 kB of memory per key no matter how many usernames are added, even for 10 million views.
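As a concrete sketch of that counter (assuming a local Redis instance and the redis-py client; the key name is made up), Redis's built-in HyperLogLog commands do exactly this:

import redis

r = redis.Redis()  # assumes a Redis server running locally

# Each view adds the username to the question's HyperLogLog key.
# PFADD is idempotent per element, so repeat views by the same user
# do not inflate the estimate.
r.pfadd("question:121:views", "username1")
r.pfadd("question:121:views", "username2")
r.pfadd("question:121:views", "username1")  # duplicate view, no effect

# Approximate number of unique viewers; the key stays ~12 kB regardless.
print(r.pfcount("question:121:views"))  # ~2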
Let's assume that we have a moderately growing document corpus, i.e. some new documents get added to the corpus every day. For these newly added documents, I can infer the topic distributions just by using the inference part of LDA; I do not have to execute the whole topic estimation + inference process of LDA for all documents again just to get the topic distributions for the new documents. However, over time, I might need to run the whole topic generation process again, as the documents added since the last LDA run might bring entirely new words into the corpus.
Now, the question I have is: how do I determine a good enough interval between two topic generation runs? Are there any general recommendations on how often we should execute LDA over the whole document corpus?
If I keep this interval very short, I might lose the stable topic distributions, and the topic distributions will keep changing. If I keep the interval too long, I might miss the new topics and new topic structures.
I'm just thinking aloud here... One very simple idea is to sample a subset of documents from the batch of newly added documents (say, over a period of one day).
You could possibly extract key words from each of these documents in the sampled set, and execute each as a query to the index built from a version of the collection that existed before adding these new documents.
You could then measure the average cosine similarity of the top K documents retrieved in response to each query (averaged over the queries from the sampled set). If this average similarity is less than a pre-defined threshold, it might indicate that the new documents are not that similar to the existing ones, and it might thus be a good idea to rerun LDA on the whole collection.
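A rough Python sketch of that check, using scikit-learn TF-IDF vectors as a stand-in for the index; the sample, top K, and threshold values are made-up knobs:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def should_rerun_lda(existing_docs, sampled_new_docs, top_k=10, threshold=0.2):
    # Vectorize with the vocabulary of the existing collection, mirroring
    # "query the old index with keywords from the new documents".
    vectorizer = TfidfVectorizer().fit(existing_docs)
    old_vecs = vectorizer.transform(existing_docs)
    new_vecs = vectorizer.transform(sampled_new_docs)

    sims = cosine_similarity(new_vecs, old_vecs)          # new x old matrix
    k = min(top_k, sims.shape[1])
    top_k_mean = np.mean(np.sort(sims, axis=1)[:, -k:])   # avg of top-K per new doc

    # Low similarity suggests the new documents bring new vocabulary/topics,
    # so a full LDA re-estimation is probably worthwhile.
    return top_k_mean < threshold

# Usage: should_rerun_lda(corpus_before_today, sample_of_todays_docs)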
I am working with the MarkLogic tool.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords that have the maximum frequency in the documents returned by any search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking it would help if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
  (: list distinct values of the element, ordered by their lexicon frequency :)
  for $v in cts:element-values(xs:QName("myelem"))
  let $f := cts:frequency($v)
  order by $f descending
  return $v
)[1 to 10]  (: keep only the ten most frequent values :)
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though: instead of using cts:frequency, you could use an xdmp:estimate on a cts:search to get a fragment count:
(
  for $v in cts:words()
  (: index-resolved estimate of how many fragments match this word :)
  let $f := xdmp:estimate(cts:search(collection(), $v))
  order by $f descending
  return $v
)[1 to 10]
The performance is lower, but still much faster than bluntly running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
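As a toy illustration of how those factors can be combined (hand-rolled Python scoring, not MarkLogic's actual formula; the document structure and boost value are made up):

import math

def relevance(query_terms, doc, all_docs, title_boost=2.0):
    tokens = doc["body"].lower().split()
    title_tokens = doc["title"].lower().split()
    score = 0.0
    for term in query_terms:
        tf = tokens.count(term) / len(tokens)            # term frequency, length-normalized
        df = sum(1 for d in all_docs if term in d["body"].lower().split())
        idf = math.log((len(all_docs) + 1) / (df + 1))   # "protease" outweighs "the"
        boost = title_boost if term in title_tokens else 1.0
        score += tf * idf * boost
    return score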
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient, as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents: it just shows the number of documents which contain similar words in the whole database, irrespective of whether those documents are present in the search result or not.