relevance models - information-retrieval

The relevance model just estimates relevance feedback based on the feedback documents. In this case, the relevance model would have a higher probability of picking up common words as feedback terms. Thus I assumed the performance of the relevance model would not be as good as that of the other two models. However, I learned that all of those models perform quite well. What would be the reason for that?

"In contrast, the relevance model just estimates the relevance feedback based on feedback documents. In this case, the relevance model would have a higher probability of getting common words as its feedbacks"
That's a common perception which isn't necessarily true. To be more specific, recall that the estimation equation of relevance model looks like:
P(w|R) = \sum_{D \in Top-K} P(w|D) \prod_{q \in Q} P(q|D)
which in simple English means that --
To compute the weight of a term w in the set of top-K documents, you iterate over each document D in the top-K set and multiply P(w|D) by the similarity score of Q with D (this is the value \prod_{q \in Q} P(q|D)). Now, the idf factor is hidden inside the expression P(w|D).
Following the standard language-model paradigm (Jelinek-Mercer or Dirichlet), this isn't just a simple maximum-likelihood estimate but rather a collection-smoothed version; e.g., for Jelinek-Mercer this is:
P(w|D) = log(1 + lambda/(1-lambda) * count(w,D)/length(D) * collection_size/cf(w))
which is nothing but a linear-combination-based generalization of tf*idf: the second component, collection_size/cf(w), specifically denotes the inverse collection frequency.
So, this expression of P(w|D) ensures that terms with higher idf values tend to get higher weights in the relevance model estimate. In addition to having high idf, such terms must also co-occur strongly with the query terms, because P(w|D) is multiplied by the query likelihood \prod_{q \in Q} P(q|D).
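As a concrete illustration, here is a minimal R sketch of the estimation described above on a made-up top-K set; all counts, the lambda value, and the collection statistics are invented for the example, and P(w|D) is computed in the standard mixture form of Jelinek-Mercer smoothing (the log expression above is its rank-equivalent form).
lambda <- 0.5
# toy term-by-document counts for the K = 2 top-ranked documents
counts <- matrix(c(3, 0, 1,    # doc1: counts of "query", "rare", "common"
                   1, 2, 4),   # doc2
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("query", "rare", "common")))
cf    <- c(query = 50, rare = 5, common = 5000)  # collection frequencies
csize <- 1e5                                     # total tokens in the collection
# Jelinek-Mercer smoothed P(w|D) for every term and document
p.w.d <- t(apply(counts, 1, function(d)
  lambda * d / sum(d) + (1 - lambda) * cf / csize))
# query likelihood of the single-term query "query" acts as the document weight
doc.weight <- p.w.d[, "query"]
# relevance model weight: sum over top-K docs of P(w|D) * P(Q|D), then normalise
p.w.R <- colSums(p.w.d * doc.weight)
p.w.R / sum(p.w.R)
The doc.weight factor is what ensures that terms coming from documents which match the query well contribute the most to the estimate.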

Related

Why can a Strauss-hardcore model have a gamma bigger than 1?

The spatstat book says clearly that a Strauss model with a gamma bigger than 1 is invalid, and that is true:
multiple.Strauss <- ppm(P1a4.multiple ~ 1, Strauss(r = 51), method = 'ho')
#Warning message:
#Fitted model is invalid - cannot be simulated
Since the L(r) function does have a trough first, I refit the data as a Strauss-hardcore model:
Mo.hybrid <- Hybrid(H = Hardcore(), S = Strauss(51))
multiple.hybrid <- ppm(P1a4.multiple ~ 1, Mo.hybrid, method = 'ho')
#Hard core distance: 12.65963
#Fitted S interaction parameter gamma: 2.7466492
It is interesting to see that the model fitted successfully, with gamma > 1!
I want to know whether gamma in the Strauss-hardcore model has the same meaning as in the Strauss model, and can therefore be used as an indicator of aggregation.
Yes, the interpretation is similar and indicates some aggregation behaviour. The model with gamma > 1 may be less intuitive to understand: say the hardcore distance is r = 12 and the Strauss interaction distance is R = 50. Then pairs of points within distance 12 of each other are heavily penalized (not permitted at all), while pairs of points separated by between 12 and 50 are encouraged (have a higher probability of occurring than at random). Pairs of points separated by more than 50 do not change the baseline probability (complete randomness).
Simulations from the Strauss-hardcore model often show strange aggregation behaviour, but it may be suitable for your data.
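If you want to see what the fitted model implies in practice, one option (a rough sketch, assuming the multiple.hybrid fit and the P1a4.multiple pattern from above, with spatstat loaded) is to simulate from the fitted model and compare a summary function with the data:
library(spatstat)
# simulate a few point patterns from the fitted Strauss-hardcore model
sim <- simulate(multiple.hybrid, nsim = 4)
plot(sim, main = "Simulations from the fitted hybrid model")
# envelope of L(r) under the fitted model, compared with the observed pattern
E <- envelope(multiple.hybrid, Lest, nsim = 39)
plot(E)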

Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of the posterior distribution confuses me.
Let's take Example 4.1.1 from this book to illustrate:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which part of the code tells WinBUGS which parameter I want the posterior distribution of?
This question may seem silly, but without reading the background first, I truly cannot tell directly from the code above which parameter is the focus (e.g., theta, or y?).
Below are some of my thoughts (as a WinBUGS beginner):
I think the following three attributes of the WinBUGS code style confuse me:
(1) The code does not follow "a specific sequence". For example, why is logit.theta ~ dnorm(0,0.368) not placed before logit(theta) <- logit.theta?
(2) Repeated variables. For example, why can the last two lines not be reduced to one line: logit(theta) ~ dnorm(0,0.368)?
(3) Variables are defined in more than one place. For example, y is defined twice: y <- 10 and y ~ dbin(theta, n). This is explained in Appendix A of the book ("However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data"), yet I still cannot grasp its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence, they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model, and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the original full model conditioned on that observation. y <- 10 defines observed data. y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be either: for fixed constants like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter, and an observed value of 100. These two statements are equivalent.
Finally, your big question: Nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values, theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you can decide which of those to look at.
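For instance, if you run this model from R, the choice of which parameters to look at is made in the call that launches WinBUGS, not in the BUGS model itself. A minimal sketch, assuming the R2WinBUGS package, a local WinBUGS installation, and a hypothetical file mortality.txt containing the three model statements wrapped in model { }:
library(R2WinBUGS)
mortality.data <- list(y = 10, n = 100)    # observed data passed separately
fit <- bugs(data = mortality.data,
            inits = NULL,                  # let WinBUGS generate initial values
            parameters.to.save = c("theta", "logit.theta"),
            model.file = "mortality.txt",  # hypothetical file name
            n.chains = 3, n.iter = 5000)
print(fit)   # posterior summaries only for the monitored parameters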

Three-step method LCA in R (poLCA). Posterior probabilities from inclusive LCA?

As recommended by Bray, Lanza, and Tan (2015), I'd like to perform the three-step method to classify individuals into classes by using the posterior probabilities of an inclusive LCA (an LCA including covariates). However, the inclusive model is very different compared with the non-inclusive model if I include all variables of interest.
The conditional probabilities are completely different, as are the numbers of cases per class. Therefore, the interpretation of profiles or patterns changes completely from the non-inclusive model (step 1) when using the posterior probabilities of the inclusive LCA (in order to assign the cases).
My question is: am I doing something wrong? Is it normal to get these changes? Maybe the procedure isn't correct. The model itself stops making sense when looking at the item conditional probabilities of each class.
These are the steps I took:
Perform LCA to study profiles of sexual risk behaviors (using 6 variables) and analyze the association with different types of drug use, gender, and age (model 4 seemed the best choice):
z <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6) ~ 1
lc4 <- poLCA(z, MyData, nclass = 4, nrep = 10)
Include all variables of interest as covariates for an "appropriate" posterior analysis (as recommended by Bray, Lanza, and Tan (2015)):
f <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6)~ drug1+drug2+drug3+gender+age
lc4.cov <- poLCA(f, MyData, nclass = 4, nrep = 10)
Once the inclusive model is fitted, I used the predicted classes and posterior probabilities (which I think poLCA obtains via maximum-probability assignment; I am not sure of this) to assign cases to membership classes:
table(lc4.cov$predclass)
write.csv(cbind(MyData$code, lc4.cov$posterior), 'new.data.csv')
(NOTE: by increasing nrep for both models (inclusive and non-inclusive), the posterior probabilities showed smaller differences.)
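As a side note on the maximum-probability question: a quick sketch to check it, assuming the lc4.cov object fitted above, is to compare predclass with a manual row-wise modal assignment from the posterior matrix:
# modal (maximum-probability) assignment computed by hand from the posterior
manual.assign <- apply(lc4.cov$posterior, 1, which.max)
all(manual.assign == lc4.cov$predclass)   # should return TRUE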

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I am trying to understand the differences between the two methods, bayes and mle, in the bn.fit function of the bnlearn package.
I know about the debate between the frequentist and the Bayesian approach to understanding probabilities. On a theoretical level, I suppose the maximum likelihood estimate mle is a simple frequentist approach that takes the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I have already checked the bnlearn documentation, the description of the bn.fit function, and some application examples, but nowhere is there a real description of what is happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs for combinations that were not observed in the data. Such zero entries could otherwise result in a network that generalizes poorly, as happened with some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.
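To see the two estimators side by side, here is a minimal sketch using the learning.test data set shipped with bnlearn; an empty graph makes every node a root node, so the formula above is easy to verify by hand:
library(bnlearn)
data(learning.test)
dag <- empty.graph(names(learning.test))
fit.mle   <- bn.fit(dag, learning.test, method = "mle")
fit.bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)
# manual check of (Nx + iss / cptsize) / (N + iss) for root node A
iss <- 10
k   <- nlevels(learning.test$A)   # cptsize for a root node
manual <- (table(learning.test$A) + iss / k) / (nrow(learning.test) + iss)
fit.mle$A$prob     # plain relative frequencies
fit.bayes$A$prob   # should match 'manual' above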

R function for Likelihood

I'm trying to analyze repairable systems reliability using growth models.
I have already fitted a Crow-AMSAA model, but I wonder if there is any package or code for fitting a Generalized Renewal Process (Kijima Model I or II) in R and finding its parameters beta, lambda (or alpha), and q
(or some other model for the mean cumulative function, MCF).
Equation 15 of this article gives an expression for the log-likelihood.
I tried to create the function like this:
likelihood.G1 <- function(theta, x) {
  # x is a vector of failure times, theta is the vector of parameters
  a <- theta[1]  # alpha
  b <- theta[2]  # beta
  q <- theta[3]  # q
  logl2 <- log(b / a)  # first part of the equation
  for (i in 1:length(x)) {
    logl2 <- logl2 + (b - 1) * log(x[i] / (a * (1 + q)^(i - 1))) - (x[i] / (a * (1 + q)^(i - 1)))^b
  }
  return(-logl2)  # negative of the log-likelihood
}
And then use some routine to minimize the negative log-likelihood:
theta <- c(0.5, 1.2, 0.8)  # starting parameters (alpha, beta, q)
nlm(likelihood.G1, theta, x = Data)
Or also:
optim(theta, likelihood.G1, method = "BFGS", x = Data)
However, there seems to be some mistake, since the parameters it returns make no sense.
Any idea what I'm doing wrong?
Thanks
Looking at equation (16) of the paper you reference and comparing it with your code, it looks like you are missing one term in the for loop. It seems that each data point contributes three terms to the log-likelihood, but in your code (inside the loop) you only have two terms (not counting the updating term).
Specifically, your code does not include the 4th term of equation (16), nor the 7th term, and so on. This is at least one error in the code. An additional consideration is that α and β are constrained to be greater than zero; I am not sure whether the solver you are using takes this constraint into account.
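One simple way to respect the positivity constraints, sketched below under the assumption that likelihood.G1 and Data are defined as in the question, is to use box constraints with optim's L-BFGS-B method (a log-reparameterisation of alpha and beta would work as well):
# box-constrained optimisation keeping alpha > 0, beta > 0, q >= 0
fit <- optim(c(0.5, 1.2, 0.8), likelihood.G1,
             method = "L-BFGS-B",
             lower = c(1e-6, 1e-6, 0),
             x = Data)
fit$par   # estimated alpha, beta, q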

Resources