Smoothing M-splines in R

I'm currently approximating the baseline function required in a model using M-splines of order d = 4:
$r(t) = \gamma_1 M_1(t \mid 4, x) + \dots + \gamma_p M_p(t \mid 4, x)$
To avoid local fluctuations, I would like to penalize the likelihood by penalizing the curvature of the baseline function:
$\ell(\theta) - \int r''(u)^2 \, du$
My problem is that I don't know how to calculate the following quantity in R: $\int r''(u)^2 \, du$.
I've found that $\int r''(u)^2 \, du = \gamma^T \Omega \gamma$,
where $\Omega = \int M''(u) \, M''(u)^T \, du$
and $M''(u)$ is the vector of second derivatives of the M-spline basis functions at $u$.
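One possible way to compute Ω numerically (my own sketch, not from the original post; it assumes the splines2 package, whose mSpline() can return basis derivatives, and uses hypothetical knot locations): evaluate the second derivatives of the basis on a fine grid and approximate the integral with a Riemann sum.
library(splines2)

# hypothetical knots and range, for illustration only
knots <- c(2, 4, 6)
bknots <- c(0, 10)
u <- seq(bknots[1], bknots[2], length.out = 1000)
du <- u[2] - u[1]

# second derivatives of the order-4 (degree-3) M-spline basis at u
M2 <- mSpline(u, knots = knots, degree = 3, intercept = TRUE,
              Boundary.knots = bknots, derivs = 2)

# Riemann-sum approximation of Omega = \int M''(u) M''(u)^T du
Omega <- t(M2) %*% M2 * du

# penalty for a given coefficient vector gamma:
# drop(t(gamma) %*% Omega %*% gamma)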

Related

Computing the ECDF of data for parameter estimation using weighted nonlinear least squares in R

I am writing code to estimate the parameters of a GPD using the weighted nonlinear least squares (WNLS) method.
The WNLS method consists of two steps.
Step 1: $(\hat{\xi}_1 , \hat{b}_1) = \arg\min_{(\xi,b)} \sum_{i=1}^{n} [\log(1-F_n(x_i)) - \log(1-G_{\xi,b}(x_i))]^2$,
where $F_n$ is the ECDF and $1-G_{\xi,b}$ is the survival function of the generalized Pareto distribution.
Can anyone let me know how to calculate the ECDF $F_n$ for a data vector X in R?
Does ecdf(X)(X) calculate the ECDF? If so, what is ecdf(X) needed for other than plotting? It would also be really helpful if someone could share some example code that involves calculating the ECDF for data.
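As a side illustration (my own sketch, not part of the question or the answers below): step 1 can be coded directly around R's ecdf(), with a hand-coded GPD survival function for ξ ≠ 0 and arbitrary starting values.
wnls_step1 <- function(x) {
  Fn <- ecdf(x)
  xs <- sort(x)[-length(x)]   # drop the maximum so log(1 - Fn(x)) stays finite
  obj <- function(par) {
    xi <- par[1]; b <- par[2]
    surv <- (1 + xi * xs / b)^(-1 / xi)          # 1 - G_{xi,b}(x), xi != 0
    if (any(!is.finite(log(surv)))) return(1e10) # reject invalid parameter values
    sum((log(1 - Fn(xs)) - log(surv))^2)
  }
  optim(c(0.1, 1), obj)$par   # returns c(xi_hat, b_hat)
}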
The ecdf call creates a function. That is, you can apply ecdf(X) to other data, as your ecdf(X)(X) call does. However, you might want to apply ecdf(X) to something other than X itself. If you want to know the empirical quantile to which three numbers a, b, and c_ correspond, an easy way to do that is to call ecdf(X)(c(a, b, c_)).
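A small illustration of that (my example, not part of the original answer):
set.seed(1)
X <- rnorm(50)
Fn <- ecdf(X)        # Fn is a function, not a vector

Fn(X)                # same as ecdf(X)(X): the ECDF evaluated at the sample itself
Fn(c(-1, 0, 1))      # the ECDF evaluated at new points
plot(Fn)             # the plotting use mentioned in the question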

How to use fitdist when the parameters are already known (Pareto distribution)

I am fitting a Pareto distribution to some data and have already calculated the maximum likelihood estimates for the data. Now I need to create a fitdist object (fitdistrplus library) from them, but I am not sure how to do this. I need a fitdist object because I would like to create QQ, density, etc. plots with functions such as denscomp. Could someone help?
The reason I calculated the MLEs first is that fitdist does not do this properly: the estimates always blow up to infinity, even if I give the correct MLEs as the starting values (see below). If manually giving fitdist my parameters is not possible, is there an optimization method in fitdist that would allow the Pareto parameters to be estimated properly?
I don't have permission to post the original data, but here's a simulation using MLE estimates of a gamma distribution/Pareto distribution fitted to the original.
library(fitdistrplus)
library(actuar)

sim <- rgamma(1000, shape = 4.69, rate = 0.482)
fit.pareto <- fitdist(sim, distr = "pareto", method = "mle",
                      start = list(scale = 0.862, shape = 0.00665))

# Estimates blow up to infinity
fit.pareto$estimate
If you look at the ?fitdist help topic, it describes what fitdist objects look like: they are lists with lots of components. If you can compute substitutes for all of those components, you should be able to create a fake fitdist object using code like
fake <- structure(list(estimate = ..., method = ..., ...),
                  class = "fitdist")
For the second part of your question, you'll need to post some code and data for people to improve.
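As a very rough sketch of what that could look like (the component names below are my guesses from inspecting a working fit; the exact set that denscomp() and friends rely on may differ across fitdistrplus versions, so compare with str() of a successful fitdist object before relying on this):
library(fitdistrplus)
library(actuar)

x <- rgamma(1000, shape = 4.69, rate = 0.482)
my_mle <- c(shape = 1.5, scale = 2)   # pretend these are your precomputed MLEs

fake <- structure(
  list(estimate = my_mle,
       method = "mle",
       distname = "pareto",
       fix.arg = NULL,
       discrete = FALSE,
       data = x,
       n = length(x),
       weights = NULL),
  class = "fitdist")

# denscomp(fake)   # should dispatch on the "fitdist" class if the components suffice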
Edited to add:
I added set.seed(123) before your simulation of random data. Then I get the MLE from fitdist to be
scale shape
87220272 9244012
If I plot the log likelihood function near there, I get this:
loglik <- Vectorize(function(shape, scale) sum(dpareto(sim, shape, scale, log = TRUE)))
shape <- seq(1000000, 10000000, len=30)
scale <- seq(10000000, 100000000, len=30)
surface <- outer(shape, scale, loglik)
contour(shape, scale, surface)
points(9244012, 87220272, pch=16)
That looks as though fitdist has made a somewhat reasonable choice, though there may not actually be a finite MLE. How did you find the MLE to be such small values? Are you sure you're using the same parameters as dpareto uses?
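One possible source of the discrepancy (my guess, not something established in this thread) is the parametrization: in actuar, dpareto() is the Pareto type II (Lomax) density, while dpareto1() is the classical type I Pareto, and MLEs computed under one parametrization do not carry over to the other.
library(actuar)
dpareto(2, shape = 3, scale = 1)   # type II: shape * scale^shape / (x + scale)^(shape + 1)
dpareto1(2, shape = 3, min = 1)    # type I:  shape * min^shape   / x^(shape + 1)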

Weibull distribution with weighted data

I have some time-to-event data from which I need to generate around 200 shape/scale parameter pairs for subgroups in a simulation model. I have analysed the data, and it best follows a Weibull distribution.
Normally, I would use the fitdistrplus package and fitdist(x, "weibull") to do so. However, this data has been matched using kernel matching and I have a variable of weighting values called km, so the fit needs to incorporate weights, which isn't something fitdist can do as far as I can tell.
With my gamma-distributed data, instead of using fitdist I did the calculation manually using the wtd.mean and wtd.var functions from the Hmisc package, which worked well. However, a similar approach for the Weibull is eluding me.
I've been testing a few options and comparing them against the fitdist results:
test_data <- rweibull(100, 0.676, 946)
fitweibull <- fitdist(test_data, "weibull", method = "mle", lower = c(0,0))
fitweibull$estimate
shape scale
0.6981165 935.0907482
I first tested this: The Weibull distribution in R (ExtDist)
library(bbmle)
m1 <- mle2(y ~ dweibull(shape = exp(lshape), scale = exp(lscale)),
           data = data.frame(y = test_data),
           start = list(lshape = 0, lscale = 0))
which gave me lshape = -0.3919991 and lscale = 6.852033
The other thing I've tried is eweibull from the EnvStats package.
eweibull <- eweibull(test_data)
eweibull$parameters
shape scale
0.698091 935.239277
However, while these are giving results, I still don't think I can fit my data with the weights into any of these.
Edit: I have also tried the similarly named eWeibull from the ExtDist package (which I'm not 100% sure still works, but it does have a Weibull function that takes weights!). I get a lot of error messages about the inputs being non-computable (NA or infinite). If I do it with map, i.e. map(test_data, test_km, eWeibull), I get NULL for all 100 values. If I try it just with test_data, I get a long string of errors from optimx.
I have also tried fitDistr from propagate, which gives errors that the weights should be a specific length. For example, if both are set to length 100, I get an error that weights should be length 94. If I set them to 94, it tells me they have to be length 132.
I need to be able to pass either a set of pre-weighted mean/var/sd etc data into the calculation, or have a function that can take data and weights and use them both in the calculation.
After much trial and error, I edited the eweibull function from the EnvStats package so that instead of using mean(x) and sd(x) it uses wtd.mean(x, w) and sqrt(wtd.var(x, w)). This now runs and outputs weighted values.
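An alternative sketch (my own, not the edited EnvStats code): maximise a weighted Weibull log-likelihood directly with optim(), where w stands in for the kernel-matching weights (the km variable above).
weighted_weibull_mle <- function(x, w) {
  negll <- function(par) {
    shape <- exp(par[1]); scale <- exp(par[2])   # log scale keeps both positive
    -sum(w * dweibull(x, shape = shape, scale = scale, log = TRUE))
  }
  fit <- optim(c(0, log(mean(x))), negll)
  c(shape = exp(fit$par[1]), scale = exp(fit$par[2]))
}

set.seed(42)                  # hypothetical example data and weights
test_data <- rweibull(100, 0.676, 946)
test_km <- runif(100)
weighted_weibull_mle(test_data, test_km)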

R function for Likelihood

I'm trying to analyze repairable systems reliability using growth models.
I have already fitted a Crow-AMSAA model, but I wonder if there is any package or code for fitting a Generalized Renewal Process (Kijima Model I or Model II) in R and finding its parameters beta, lambda (or alpha) and q
(or some other model for the mean cumulative function, MCF).
Equation 15 of this article gives an expression for the log-likelihood.
I tried to create the function like this:
likelihood.G1 <- function(theta, x) {
  # x is a vector with the failure times, theta a vector of parameters
  a <- theta[1]  # alpha
  b <- theta[2]  # beta
  q <- theta[3]  # q
  logl2 <- log(b / a)  # first part of the equation
  for (i in 1:length(x)) {
    logl2 <- logl2 + (b - 1) * log(x[i] / (a * (1 + q)^(i - 1))) -
      (x[i] / (a * (1 + q)^(i - 1)))^b
  }
  return(-logl2)  # negative of the log-likelihood
}
And then use some routine to minimize the negative log-likelihood:
theta <- c(0.5, 1.2, 0.8)  # starting parameters (lambda, beta, q)
nlm(likelihood.G1, theta, x = Data)
or alternatively
optim(theta, likelihood.G1, method = "BFGS", x = Data)
However, there seems to be some mistake, since the parameters it returns make no sense.
Any ideas of what I'm doing wrong?
Thanks
Looking at equation (16) of the paper you reference and comparing it with your code, it looks like you are missing one term in the for loop. Each data point contributes three terms to the log-likelihood, but in your code (inside the loop) you only have two (not counting the updating term).
Specifically, your code does not include the 4th term of equation (16), nor the 7th term, and so on. This is at least one error in the code. An additional consideration is that α and β are constrained to be greater than zero; I am not sure whether the solver you are using takes this constraint into account.
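On the constraint point, one simple option (my sketch, assuming likelihood.G1 as defined in the question once the missing terms are added) is to optimise alpha and beta on the log scale, so an unconstrained optimiser cannot step outside the valid region:
likelihood.G1.log <- function(ltheta, x) {
  # ltheta = c(log(alpha), log(beta), q)
  likelihood.G1(c(exp(ltheta[1]), exp(ltheta[2]), ltheta[3]), x)
}

# fit <- optim(c(log(0.5), log(1.2), 0.8), likelihood.G1.log, x = Data, method = "BFGS")
# exp(fit$par[1:2])   # back-transform to alpha and beta; fit$par[3] is q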

loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example, and I realize loess is obviously not needed for an example of this sort, but it illustrates the point):
x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25
loess regression works by fitting polynomials locally at each x, and thus it creates a predicted y_hat at each x. However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between the two nearest existing x values and their associated y_hat values? If so, what are the details behind how it is doing this?
I have read the cloess documentation online but am unable to find where it discusses this.
However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat
Maybe you have used the print(mdl) command, or simply mdl, to see what the model mdl contains, but this is not the case. The model is really complicated and stores a large number of parameters.
To get an idea of what's inside, you may use unlist(mdl) and see the long list of parameters it contains.
This is the part of the command's manual that describes how it really works:
Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit.
What I believe is that it fits a polynomial model in the neighborhood of every point (not just a single polynomial for the whole set). The neighborhood does not mean only one point before and one point after; if I were implementing such a function, I would put a large weight on the points nearest to x, lower weights on more distant points, and fit a polynomial that best fits this weighted set.
Then, if a value should be predicted at a given x' that is closest to a point x, I would take the polynomial fitted on the neighborhood of x, say P, apply it to x', and use P(x') as the prediction.
Let me know if you are looking for anything special.
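To make that local-fit idea concrete, here is a toy version of a local weighted polynomial fit with tricube weights (my own sketch; R's actual loess additionally interpolates over a k-d tree, as described in the answers below):
local_fit <- function(x0, x, y, span = 0.75, degree = 2) {
  k <- ceiling(span * length(x))            # number of points in the neighbourhood
  d <- abs(x - x0)
  idx <- order(d)[seq_len(k)]               # the k nearest points to x0
  w <- (1 - (d[idx] / max(d[idx]))^3)^3     # tricube weights
  dat <- data.frame(x = x[idx], y = y[idx])
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = dat, weights = w)
  predict(fit, newdata = data.frame(x = x0))
}

x <- 1:10
y <- x^2
local_fit(1.5, x, y)      # close to predict(loess(y ~ x), 1.5)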
To better understand what is happening in a loess fit, try running the loess.demo function from the TeachingDemos package. It lets you interactively click on the plot (even between points) and then shows the set of points and their weights used in the prediction, together with the predicted line/curve for that point.
Note also that the default for loess is to do a second smoothing/interpolation on top of the loess fit, so what you see in the fitted object is probably not the raw loess fitting information but the result of that secondary smoothing.
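For example, using the question's toy data (my example call; it assumes TeachingDemos is installed and uses the demo's default settings):
library(TeachingDemos)
x <- 1:10
y <- x^2
loess.demo(x, y)   # click on the plot to see the local weights and fit at that x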
Found the answer on page 42 of the manual:
In this algorithm a set of points typically small in number is selected for direct computation using the loess fitting method and a surface is evaluated using an interpolation method that is based on blending functions. The space of the factors is divided into rectangular cells using an algorithm based on k-d trees. The loess fit is evaluated at the cell vertices and then blending functions do the interpolation. The output data structure stores the k-d trees and the fits at the vertices. This information is used by predict() to carry out the interpolation.
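Related to that, the interpolation step can be switched off with loess.control(surface = "direct"), which forces exact local fitting at prediction time (slower, but no k-d tree blending), so you can check how much difference it makes:
x <- 1:10
y <- x^2
mdl_interp <- loess(y ~ x)                                              # default: interpolate
mdl_direct <- loess(y ~ x, control = loess.control(surface = "direct"))
predict(mdl_interp, 1.5)
predict(mdl_direct, 1.5)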
I guess that to predict at x, predict.loess performs a regression with some points near x and calculates the y-value at x.
See https://stats.stackexchange.com/questions/223469/how-does-a-loess-model-do-its-prediction
