I'm trying to fit a truncated normal distribution to data using MASS::fitdistr and specifying upper and lower bounds. However, when I compare the MLE-fitted parameters to those of an MLE fit without bounds (via fitdistrplus::fitdist), they seem to be the same.
library(fitdistrplus)
library(MASS)

dt <- rnorm(100, 1, 0.5)

cat("truncated:",
    fitdistr(dt, "normal", lower = 0, upper = 1.5, method = "mle")$estimate,  # bounded fit (MASS)
    "original:",
    fitdist(dt, "norm", method = "mle")$estimate,                             # unbounded fit (fitdistrplus)
    sep = "\n")
truncated:
1.034495
0.4112629
original:
1.034495
0.4112629
I'm not a statistics genius, but I'm pretty sure the parameters should be different: when the distribution is truncated, both the mean and the sd change (because the density is rescaled). Is this right?
Thanks for your advice
Cheers,
Simon
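A hedged sketch of how a genuinely truncated normal could be fitted (this is not the code above): it assumes the truncnorm package, holds the truncation bounds fixed via fix.arg, and restricts the sample to the truncation interval; bounds, starting values and the subsetting are illustrative assumptions.
library(truncnorm)
library(fitdistrplus)

set.seed(1)
dt  <- rnorm(100, 1, 0.5)
dtr <- dt[dt > 0 & dt < 1.5]          # keep only observations inside the assumed bounds

# MLE of the truncated normal: a and b fixed, mean and sd estimated
fit_trunc <- fitdist(dtr, distr = "truncnorm",
                     fix.arg = list(a = 0, b = 1.5),
                     start   = list(mean = 1, sd = 0.5))
# naive normal fit to the same truncated sample
fit_plain <- fitdist(dtr, distr = "norm")

fit_trunc$estimate   # mean/sd of the underlying normal before truncation
fit_plain$estimate   # these two sets of estimates should now differ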
tl;dr: I am numerically estimating a PDF from simulated data, and I need the density to decrease monotonically outside of the 'main' density region (as x -> infinity). What I have yields a density close to zero, but one that does not decrease monotonically.
Detailed Problem
I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function of some random variable (whose density cannot be derived analytically) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires that they have no spurious local maxima.
Since I do not have an analytic likelihood function, I simulate the random variable by drawing its random component from some known distribution and applying a non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.
I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:
# some example simulation
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)

# approximation of the PDF at an observed value x
approxfun(density(Simulated_stats_))(x)
This works well within the range of Simulated_stats_ (see image: Example PDF). The problem is that I need to be able to evaluate the PDF far outside the range of the simulated data.
So in the image above, I would need to evaluate the PDF at, say, x=50:
approxfun(density(Simulated_stats_))(50)
> [1] NA
So instead I use the from and to arguments of the density() function, which correctly approximate the near-zero tails:
approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
)(50)
[1] 1.924343e-18
Which is great, with one condition: I need the density to go to zero the further x is from the simulated range. That is, if I evaluate at x = 51 the result must be strictly smaller than at x = 50. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)
To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the probability 'jumps' up and down. It is always very close to zero, but NOT monotonically decreasing.
a  <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x) {
  approxfun(
    density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
  )(x)
})
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[, 1], log(aa[, 2]))
Result:
Non-monotonic log density far from density mass
My question
Does this happen because of the kernel estimation in density(), or because of inaccuracies in approxfun()? (Or something else?)
What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
Or: how can I manually change the approximated PDF so that it declines monotonically the further I am from the density mass? I would happily stick on some linear trend that goes to zero...
Thanks!
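One possible hack along the lines of that last question, as a sketch only (the cutoff, the decay rate, and all names below are illustrative assumptions, and the result is not renormalized to integrate to 1): keep the kernel estimate inside the simulated range and replace it with an exponentially decaying (log-linear) tail outside it.
set.seed(1)
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)

dens  <- density(Simulated_stats_)
f_hat <- approxfun(dens)                 # kernel estimate inside the simulated range
lo    <- min(dens$x)
hi    <- max(dens$x)
rate  <- 1                               # decay rate of the artificial tail (assumption)

f_tail <- function(x) {
  ifelse(x < lo, f_hat(lo) * exp(-rate * (lo - x)),
  ifelse(x > hi, f_hat(hi) * exp(-rate * (x - hi)),
                 f_hat(x)))
}

# the log-density is now strictly decreasing outside [lo, hi]
xs <- seq(0, 100, by = 0.5)
plot(xs, log(f_tail(xs)), type = "l")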
One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model can then be used to estimate the PDF at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.
Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
Generate your data
set.seed(123)
Simulated_stats_ <- runif(n=5000, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
Function to estimate CDF using gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  qp = quantile(ss, probs = p)
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}
betamod <- get_mod(Simulated_stats_)
Very basic estimate of PDF at val given model that estimates CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  as.numeric((yvals[1] - yvals[2]) / (xvals[1] - xvals[2]))
}
Let's check that it is monotonically increasing below the min of Simulated_stats_
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))
[1] TRUE
Let's check that it is monotonically decreasing above the max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))
[1] TRUE
Additional thoughts 3/5/22
As discussed in the comments, using betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly (sketched briefly below), there is another possible shortcut.
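Rough illustration of the "own predict" idea (an assumption, not tested against your exact model): get the linear-predictor matrix once via type = "lpmatrix", then prediction is just a matrix product followed by the model's inverse link, which avoids most of the overhead of calling predict.gam() with type = "response" on every evaluation.
Xp  <- predict(betamod, newdata = data.frame(qp = c(38, 38.001)), type = "lpmatrix")
eta <- drop(Xp %*% coef(betamod))
cdf <- betamod$family$linkinv(eta)        # estimated CDF at the two points
(cdf[1] - cdf[2]) / (38 - 38.001)         # same finite-difference pdf as est_pdf()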
For the shortcut, generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
Then use the approach you were initially using, i.e. a linear interpolation, but rather than interpolating the density() outcome, interpolate over k (i.e. over the above estimates from the beta model)
lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968
and it is very fast
microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
)
Unit: microseconds
expr min lq mean median uq max neval
betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8 100
lin_int 1.7 2.25 3.503 4.35 4.50 10.5 100
Finally, let's check the same plot you did before, but using lin_int() instead of approxfun(density(....))
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
I use the fitdistrplus package to fit the data as follows:
set.seed(100)
x1<-rlnorm(500,1,3)
f.x1<-fitdist(x1,distr = "lnorm",method = "mme")
plot(f.x1)
The following is the plot result:
My question is: the x1 data was in fact generated with rlnorm, but after fitting, the P-P plot is not so perfect. How can this be explained?
Thanks.
Guangming
I strongly suspect it is because moment matching isn't a great way of estimating the parameters. I repeated your example using moment matching:
set.seed(100)
x = rlnorm(500, 1, 3)
library(fitdistrplus)
f.x<-fitdist(x,distr = "lnorm",method = "mme")
The parameter estimates were:
> f.x
Fitting of the distribution ' lnorm ' by matching moments
Parameters:
estimate
meanlog 3.012574
sdlog 2.199019
If I do the fitting using maximum likelihood:
ll = function(meanlog, sdlog){
  sum(dlnorm(x, meanlog, sdlog, log = TRUE))
}
objFun = function(params){
  -ll(params[1], params[2])
}
optim(c(0, 1), objFun)
Then I get parameter estimates of
> optim(c(0, 1), objFun)
$par
[1] 0.8861808 3.0118166
which are much closer to the values you were sampling from, I think you'd agree. So the quantiles (and percentage points) are going to be a lot closer to the empirical values.
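For comparison, the same fit can be obtained directly from the package by switching the estimation method; this is a sketch of the idea, and the estimates should be essentially the same as those from the optim() call above.
f.x.mle <- fitdist(x, distr = "lnorm", method = "mle")
f.x.mle$estimate
plot(f.x.mle)   # the P-P plot should now track the diagonal much more closely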
I have a point pattern with about 84,000 points. Quadrat tests suggested inhomogeneous intensity, so I tried different kernel bandwidths and got very odd behavior in the inhomogeneous implementations of the K-, F- and G-functions. Here is an example of the inhomogeneous F-function plot. Clearly, the estimated F-function does not reach 1 within the distance range, while the Poisson process just flatlines. The F-function should also be increasing, so the dips are odd. When I manually specify a longer range of r in the Finhom() function, the function still does not evaluate beyond the suggested range of 2000.
Unfortunately, I cannot share my data. However, I managed to reproduce some of the errors with an admittedly very simple example of a point pattern on the unit square:
library(spatstat) # version 1.57-1
# define point pattern
ex <- as.ppp(data.frame(x = c(.9, .25, .29, .7, .72, .8, .72, .85),
                        y = c(.1, .25, .29, .5, .5, .1, .45, .08)),
             W = owin(c(0, 1), c(0, 1)))
plot(ex)
# testing inhomogeneity
quadrat.test(ex, 3, 3, method = "M", nsim = 500) # p around 0.05
# set bandwidth
diggle <- bw.diggle(ex)
# suggested bandwidth of 0.028
# estimate inhomogeneous F-function
Fi <- Finhom(ex, sigma = diggle)
plot(Fi, main ="Finhom for ex pattern")
The plot is attached here. Similar to my real data, the plot stops evaluating at r = 0.5, flatlines, and does not go all the way up to 1.
Interestingly, when supplying the intensity directly via the lambda argument in the Finhom() function, the behavior changes:
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
plot(Fi_lambda, main ="Finhom w/ lambda directly")
Here, the functions behave as expected.
My questions are:
why is there a difference between directly supplied intensity vs. intensity internally estimated in the Finhom() function?
what could be the reason for the odd behavior of the F-function here? A code issue or user error? (Side note: the G- and K-functions also show odd behavior; to keep this question short-ish, I've focused on the F-function.)
Thank you!
As pointed out by Adrian Baddeley in the other answer, this is not a bug in Finhom per se. You would expect that
Fi <- Finhom(ex, sigma = diggle)
should be equivalent to
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
However, different values of the argument lmin are implied by these commands. In the first case lambda is estimated everywhere in the window and the minimum value is used. In the second case only the given values of lambda are used to find the minimum. That can of course be quite different. The importance of lmin is illustrated in the code below (note that the discrepancy between the data and the inhomogeneous Poisson process is of the same type in all cases).
The other part about the estimate stopping at r=0.5 is not surprising since border correction is used and the window is the unit square. When r=0.5 the entire window is "shaved off", so there is no data left.
library(spatstat)
#> spatstat 1.56-1.031 (nickname: 'Psycho chicken')
X <- swedishpines
lam <- density(X, at = "points", sigma = 10)
lam_min <- min(lam)
plot(Finhom(X, lmin = lam_min), legend = FALSE, col = 1, main = "Finhom for different values of lmin")
s <- 2^(1:3)
for(i in seq_along(s)){
  plot(Finhom(X, lmin = lam_min/s[i]), col = i + 1, add = TRUE)
}
s <- c(1,s)
legend("topleft", legend = paste0("min(lam)/", s), lty = 1, col = 1:length(s))
Created on 2018-11-24 by the reprex package (v0.2.1)
The "inhomogeneous" functions Kinhom, Ginhom, Finhom involve making adjustments for the spatially varying intensity of the point process. They only work if (a) the intensity has been accurately estimated, and (b) the point process satisfies certain technical assumptions which justify the adjustment calculation (see the references in the help files, or the relevant section of the spatstat book).
The plot of density(ex, sigma=bw.diggle) shows very high peaks and very low troughs in the estimated intensity, suggesting that the data are under-smoothed, so that (a) is not satisfied. The results obtained with bw.scott or bw.CvL are much better behaved. (Remember that bw.diggle is designed for clustered patterns.) For example, I get a reasonably nice plot with
plot(Finhom(ex, sigma=bw.CvL))
Yes, it does seem a bit disconcerting that the results are different when lambda is given as a pixel image and as a numeric vector. This occurs, as Ege explains, because of the different rules for calculating the default value of the important argument lmin. It's not really a bug: the original authors of the code for Ginhom and Finhom designed it this way; I will consult them for advice about whether we should change it. In the meantime, you can make the two calculations agree if you specify the value of lmin.
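For example, a sketch of what fixing lmin in both calls might look like (the argument combination is an assumption based on the discussion above, not a tested recipe): supply the same lmin whether lambda is estimated internally from sigma or passed in as a numeric vector.
lambda_ex <- density(ex, sigma = diggle, at = "points")
lmin_ex   <- min(lambda_ex)

Fi_a <- Finhom(ex, sigma = diggle, lmin = lmin_ex)     # intensity estimated internally
Fi_b <- Finhom(ex, lambda = lambda_ex, lmin = lmin_ex) # intensity supplied directly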
Not sure if this numerical-methods problem should really be here or on Cross Validated, but since I have a nice reproducible example I thought I would start here.
I am going to be estimating and fitting a bunch of distributions both to some large data sets and to data sets generated randomly from similar distributions. As part of this process I will be generating estimates for the conditional mean of various value ranges, including truncated and non-truncated values of the right tail.
The function cr_moment below, given a PDF function dfun and parameters for that function in params, calculates the unconditional mean of that distribution. Given an upper bound, a lower bound, or both, it calculates the conditional mean for the range specified by those bounds, using the singly- or doubly-truncated distribution. The function beneath it, cr_gb2, specializes cr_moment to the generalized beta distribution of the second kind. Finally, the parameter values supplied beneath that approximate the unadjusted current-dollar household income distribution from the US Census/BLS Current Population Survey for the year 2000 (McDonald & Ransom 2008). (Also, kudos to Mikko Marttila on this list for help with coding this function.)
This function gives me a failure-to-converge error, copied below, for various lower bounds and an upper bound equal to 4.55e8 or higher, but not at 4.54e8. The kth moment of the GB2 exists for k < shape1 * shape3, here about 2.51. This is a nice smooth unimodal function being integrated over a finite interval, and I don't know why it is failing to converge or what to do about it. For other parameter values, but not these, I have also seen convergence problems at the low end, for lower bounds ranging from 6 to a couple of hundred.
Error in integrate(f = prob_interval, lower = lb, upper = ub, subdivisions = 100L):
the integral is probably divergent
455 million will be above the highest observable income level by one or two orders of magnitude, but given a wider range of parameter values and using hill-climbing algorithms to fit real and simulated data, I think I will hit this wall many times. I know very little about numerical methods in a case like this and don't really know where to start. Help and suggestions greatly appreciated.
cr_moment <- function(lb = -Inf, ub = Inf, dfun, params, v = 1, ...){
  x_pdf <- function(X){
    X^v * do.call(what = dfun, args = c(list(x = X), params))
  }
  prob_interval <- function(X){
    do.call(what = dfun, args = c(list(x = X), params))
  }
  integral_val  <- integrate(f = x_pdf, lower = lb, upper = ub)
  integral_prob <- integrate(f = prob_interval, lower = lb, upper = ub)
  crm <- integral_val[[1]] / integral_prob[[1]]
  out <- list(value = integral_val[[1]], probability = integral_prob[[1]],
              cond_moment = crm)
  out
}
library(GB2)
cr_gb2 <- function(lb = -Inf, ub = Inf, v = 1, params){
  cr_moment(lb, ub, dfun = dgb2, params = params, v = v)
}
GB2_params <- list(shape1 = 2.2474, scale = 58441.5, shape2 = 0.6186, shape3 = 1.118)
cr_gb2(lb=1, ub= 4.55e8, params = GB2_params)
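One workaround to try, offered as a sketch rather than a guaranteed fix: integrate() can declare a long-tailed integrand "probably divergent" when nearly all of the mass lies in a tiny portion of a huge finite range. Splitting the range (the 1e6 split point below is chosen purely for illustration), or integrating to Inf so that integrate() applies its internal change of variables, often gets around this.
library(GB2)

GB2_params <- list(shape1 = 2.2474, scale = 58441.5, shape2 = 0.6186, shape3 = 1.118)
f <- function(x) do.call(dgb2, c(list(x = x), GB2_params))

p_lo <- integrate(f, lower = 1,   upper = 1e6)$value     # region holding almost all of the mass
p_hi <- integrate(f, lower = 1e6, upper = 4.55e8)$value  # far right tail
p_lo + p_hi

integrate(f, lower = 1, upper = Inf)$value               # transformed-variable alternative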
After scrambling through the internet and related questions, and trying for days without any success, I hope you can potentially help me out with using optim() with or for a polr() function in R.
What I am trying to do is just to set some constraints (lower and/or upper bounds) on the coefficients. If you have a general example of how this would work, I'd be more than delighted.
Let's consider the following fake and senseless data:
set.seed(3)
my.df <- data.frame(id = 1:1000, y = sample(c(1, 2, 3), 1000, replace = TRUE),
                    a = rnorm(5000, 1, 0.1), b = rnorm(100, 1.9, 0.5), c = rnorm(10, 0.8, 1.2))
and a polr function like this:
library(MASS)
model <- polr(factor(y) ~ a + b + c, data = my.df, method = "logistic")
I do get a coefficient for c which is negative (albeit insignificant), but I know its relation to y is positive. Thus, I want to constrain its coefficient to be positive, and I think I can do this with optim().
I want to include something along the lines of:
optim(model, method = "L-BFGS-B", lower = c(a = -1, b = -1, c = 0))
which doesn't work.
I think there might be some option for plugging the optim() function into the polr() function via the (to me) arcane ... argument, but I just cannot figure out how. Any ideas?
Much appreciated, and believe it or not, I really tried hard.
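A hedged sketch of one way to impose box constraints (this is not a polr() option): write the proportional-odds log-likelihood by hand and maximize it with optim() using method = "L-BFGS-B", whose lower/upper arguments bound the slope coefficients. All names, starting values, bounds and the simplified data setup below are illustrative assumptions.
set.seed(3)
n  <- 1000
df <- data.frame(y = sample(1:3, n, replace = TRUE),
                 a = rnorm(n, 1, 0.1),
                 b = rnorm(n, 1.9, 0.5),
                 c = rnorm(n, 0.8, 1.2))
X <- as.matrix(df[, c("a", "b", "c")])
y <- df$y

negll <- function(par) {
  beta <- par[1:3]                              # slopes for a, b, c
  zeta <- c(par[4], par[4] + exp(par[5]))       # two cutpoints, kept increasing
  eta  <- drop(X %*% beta)
  p1 <- plogis(zeta[1] - eta)                   # P(y = 1)
  p2 <- plogis(zeta[2] - eta) - p1              # P(y = 2)
  p3 <- 1 - plogis(zeta[2] - eta)               # P(y = 3)
  p  <- cbind(p1, p2, p3)[cbind(seq_len(n), y)] # probability of the observed category
  -sum(log(pmax(p, 1e-12)))
}

fit <- optim(c(0, 0, 0, 0, 0), negll, method = "L-BFGS-B",
             lower = c(-1, -1, 0, -Inf, -Inf),  # forces the coefficient of c to be >= 0
             upper = rep(Inf, 5))
fit$par[1:3]                                    # constrained slope estimates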