R function for Likelihood

I'm trying to analyze repairable systems reliability using growth models.
I have already fitted a Crow-AMSAA model, but I wonder whether there is any package or code in R for fitting a Generalized Renewal Process (Kijima Model I or Model II)
and finding its parameters Beta, Lambda (or Alpha) and q
(or some other model for the mean cumulative function, MCF).
Equation 15 of this article gives an expression for the
log-likelihood.
I tried to create the function like this:
likelihood.G1 <- function(theta, x) {
  # x is a vector with the failure times, theta a vector of parameters
  a <- theta[1]  # Alpha
  b <- theta[2]  # Beta
  q <- theta[3]  # q
  logl2 <- log(b / a)  # First part of the equation
  for (i in 1:length(x)) {
    logl2 <- logl2 + (b - 1) * log(x[i] / (a * (1 + q)^(i - 1))) - (x[i] / (a * (1 + q)^(i - 1)))^b
  }
  return(-logl2)  # Negative of the log-likelihood
}
And then use some routine to minimize the negative log-likelihood:
theta=c(0.5,1.2,0.8) # Starting parameters (alpha, beta, q)
nlm(likelihood.G1,theta, x=Data)
Or also
optim(theta,likelihood.G1,method="BFGS",x=Data)
However, there seems to be some mistake, since the parameters it returns make no sense.
Any ideas of what I'm doing wrong?
Thanks

Looking at equation (16) of the paper you reference and comparing it with your code, it looks like you are missing one term in the for loop. Each data point contributes three terms to the log-likelihood, but inside your loop you only have two (not counting the updating term).
Specifically, your code does not include the fourth term of equation (16),
nor the seventh term, and so on. That is at least one error in the code. An extra consideration is that α and β are constrained to be greater than zero; I am not sure whether the solver you are using takes this constraint into account.
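On the constraint point, one option is to give optim box constraints so the solver stays away from invalid parameter values. A minimal sketch (this only addresses the constraints; the missing term still has to be added inside likelihood.G1 following equation (16)):
# Box-constrained minimisation of the negative log-likelihood:
# alpha and beta must stay strictly positive, q non-negative.
theta0 <- c(0.5, 1.2, 0.8)  # starting values (alpha, beta, q)
fit <- optim(theta0, likelihood.G1, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6, 0), x = Data)
fit$par  # estimated alpha, beta, q
Alternatively, you can optimise over log(alpha) and log(beta) with an unconstrained method, which often behaves better numerically.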


What does the summary function do to the output of regsubsets?

Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which command produces the output of the regsubsets function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes it. But here's the problem: while the typed-up example works fine, when I try to adapt this code to other data I run into trouble with the number of elements in the array. For example:
require(faraway)
data(odor, package='faraway')
b = regsubsets(odor ~ temp + gas + pack +
               I(temp^2) + I(gas^2) + I(pack^2) +
               I(temp*gas) + I(temp*pack) + I(gas*pack), data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.
Ok, so you know the help page for regsubsets says
regsubsets returns an object of class "regsubsets" containing no
user-serviceable parts. It is designed to be processed by
summary.regsubsets.
You're about to find out why.
The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.
The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroutine SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection, fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran, so it's not documented.
The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.
All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
  for (j in 1:nshow) {
    vr <- ll$ress[i, j]/ll$nullrss
    rssvec <- c(rssvec, ll$ress[i, j])
    rsqvec <- c(rsqvec, 1 - vr)
    adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1 + ll$intercept - i))
    cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1 + ll$intercept - 2 * i))
    bicvec <- c(bicvec, (n1 + ll$intercept) * log(vr) + i * log(n1 + ll$intercept))
  }
}
cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.
regsubsets has an nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.
Note, though, that the 50 in your AIC formula is the sample size (i.e., the 50 states in statedata). For your second example it should be nrow(odor), i.e. 15.
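Putting the last two points together, a minimal sketch of the AIC calculation for the odor example (assuming nvmax is raised to 9, and counting the coefficients of each model via rowSums(rs$which)):
b <- regsubsets(odor ~ temp + gas + pack +
                I(temp^2) + I(gas^2) + I(pack^2) +
                I(temp*gas) + I(temp*pack) + I(gas*pack),
                data = odor, nvmax = 9)
rs <- summary(b)
n <- nrow(odor)          # sample size, 15 here (not 50)
p <- rowSums(rs$which)   # coefficients per model, including the intercept
AIC <- n * log(rs$rss / n) + 2 * p
plot(AIC ~ p, ylab = "AIC", xlab = "Number of coefficients")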

Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of the posterior distribution confuses me.
Let's take Example 4.1.1 in this book to illustrate:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which code reflects the logic structure to tell WinBUGS about "which parameter that I want to calculate its posterior distribution"?
This question may seem silly, but without reading the background first I truly cannot tell directly from the code above which parameter is the focus (e.g., theta, or y?).
Below are some of my thoughts (as a beginner in WinBUGS):
I think the following three attributes of the WinBUGS code style confuse me:
(1) The code does not follow "a specific sequence". For example, why is logit.theta ~ dnorm(0,0.368) not placed before logit(theta) <- logit.theta?
(2) Repeated variables. For example, why can the last two lines not be reduced to a single line: logit(theta) ~ dnorm(0,0.368)?
(3) Variables are defined in more than one place. For example, y is defined twice: y <- 10 and y ~ dbin(theta, n). This has been explained in Appendix A of the book (i.e., "However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data"), yet I still cannot grasp its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence; they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model, and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the original full model conditioned on that observation. y <- 10 defines observed data. y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be read either way: for a fixed constant like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter and an observed value of 100. These two statements are equivalent.
Finally, your big question: Nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values, theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you can decide which of those to look at.
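For example, if you drive WinBUGS from R with the R2WinBUGS package (an assumption here; the Sample Monitor Tool in the WinBUGS menus plays the same role), the parameters.to.save argument is where you state which posterior you want to look at:
library(R2WinBUGS)
# "model.txt" is assumed to contain the model block shown above
# (y ~ dbin(theta, n), logit(theta) <- logit.theta, logit.theta ~ dnorm(0, 0.368))
fit <- bugs(data = list(y = 10, n = 100),
            inits = function() list(logit.theta = 0),
            parameters.to.save = c("theta"),  # monitor theta's posterior
            model.file = "model.txt",
            n.chains = 3, n.iter = 10000)
print(fit)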

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I am trying to understand the differences between the two methods bayes and mle in the bn.fit function of the bnlearn package.
I know about the debate between the frequentist and the Bayesian approach to understanding probabilities. On a theoretical level I suppose the maximum likelihood estimate mle is a simple frequentist approach, setting the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I have already checked the bnlearn documentation, the description of the bn.fit function and some application examples, but nowhere is there a real description of what's happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, which leads to bnlearn:::bn.fit.backend and then to bnlearn:::smartSapply, but there I got stuck.
Some help would be really appreciated, as I use the package for academic work and should therefore be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
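A minimal sketch that checks this formula on the learning.test data shipped with bnlearn, using an empty graph so that every node is a root node (an assumption made only to keep the check simple):
library(bnlearn)
iss <- 10
dag <- empty.graph(names(learning.test))  # no arcs, so every node is a root
fit <- bn.fit(dag, learning.test, method = "bayes", iss = iss)
# manual Bayesian estimate for node A: (Nx + iss/cptsize) / (N + iss)
N  <- nrow(learning.test)
Nx <- table(learning.test$A)
manual <- (Nx + iss / length(Nx)) / (N + iss)
all.equal(as.numeric(manual), as.numeric(coef(fit$A)))  # TRUE if the quoted line is followed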
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.

Extracting Lagrange Multipliers from SVM output in R

I would like to extract the alpha Lagrange multipliers from the svm function in the e1071 R package, but I am not sure whether svm$coefs is producing these.
The alphas are defined as in Equation 9.23, p. 352, of An Introduction to Statistical Learning.
The documentation for svm says that
coefs: the corresponding coefficients times the training labels
Could someone please explain this?
$coefs produces alpha_i * y_i, but since the alpha_i are by definition non-negative, you can simply take the absolute value of coefs to get the Lagrange multipliers, and recover y_i by taking the sign (as the labels are only +1 or -1). This is a simplification often used in SVM packages: the multipliers are never actually used on their own, only their product with the label, so they are stored as a single number for simplicity and efficiency, and when needed (as here) you can always reconstruct them.
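A minimal sketch of the reconstruction for a binary problem (using two classes of iris as a stand-in; with more than two classes, coefs has several columns under the one-against-one scheme and the bookkeeping is more involved):
library(e1071)
d <- subset(iris, Species != "setosa")
d$Species <- droplevels(d$Species)
fit <- svm(Species ~ ., data = d, kernel = "linear")
coefs <- fit$coefs   # alpha_i * y_i, one row per support vector
alpha <- abs(coefs)  # the Lagrange multipliers alpha_i
y_sv  <- sign(coefs) # the labels y_i (+1 / -1) of the support vectors
fit$index            # which rows of d are the support vectors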

Quadrature to approximate a transformed beta distribution in R

I am using R to run a simulation in which I use a likelihood ratio test to compare two nested item response models. One version of the LRT uses the joint likelihood function L(θ,ρ) and the other uses the marginal likelihood function L(ρ). I want to integrate L(θ,ρ) over f(θ) to obtain the marginal likelihood L(ρ). I have two conditions: in one, f(θ) is standard normal (μ=0,σ=1), and my understanding is that I can just pick a number of abscissa points, say 20 or 30, and use Gauss-Hermite quadrature to approximate this density. But in the other condition, f(θ) is a linearly transformed beta distribution (a=1.25,b=10), where the linear transformation B'=11.14*(B-0.11) is such that B' also has (approximately) μ=0,σ=1.
I am confused enough about how to implement quadrature for a beta distribution but then the linear transformation confuses me even more. My question is threefold: (1) can I use some variation of quadrature to approximate f(θ) when θ is distributed as this linearly transformed beta distribution, (2) how would I implement this in R, and (3) is this a ridiculous waste of time such that there is an obviously much faster and better method to accomplish this task? (I tried writing my own numerical approximation function but found that my implementation of it, being limited to the R language, was just too slow to suffice.)
Thanks!
First, I assume you can express your L(θ,ρ) and f(θ) in terms of actual code; otherwise you're kinda screwed. Given that assumption, you can use integrate to perform the necessary computations. Something like this should get you started; just plug in your expressions for L and f.
marglik <- function(rho) {
  integrand <- function(theta, rho) L(theta, rho) * f(theta)
  # set your lower/upper integration limits as appropriate
  integrate(integrand, lower=-5, upper=5, rho=rho)
}
For this to work, your integrand has to be vectorized; ie, given a vector input for theta, it must return a vector of outputs. If your code doesn't fit the bill, you can use Vectorize on the integrand function before passing it to integrate:
integrand <- Vectorize(integrand, "theta")
Edit: not sure if you're also asking how to define f(θ) for the transformed beta distribution; that seems rather elementary for someone working with joint and marginal likelihoods. But if you are, then the density of B' = a*B + b, given f(B), is
f'(B') = f((B' - b)/a) / a
So in your case, f(theta) is dbeta(theta/11.14 + 0.11, 1.25, 10) / 11.14
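In code, that density (plus a quick sanity check that it integrates to 1 over the transformed support) might look like this:
# density of theta = 11.14 * (B - 0.11), with B ~ Beta(1.25, 10)
f <- function(theta) dbeta(theta / 11.14 + 0.11, shape1 = 1.25, shape2 = 10) / 11.14
# B lives on (0, 1), so theta runs from 11.14 * (0 - 0.11) to 11.14 * (1 - 0.11)
integrate(f, lower = 11.14 * (0 - 0.11), upper = 11.14 * (1 - 0.11))  # ~ 1
This f can then be plugged straight into the marglik skeleton above.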
