Poisson distribution (dpois) integration in R

I'm trying to integrate the Poisson distribution (dpois) in R, but I get an incorrect answer (0 with absolute error < 0) and 21 warnings. I don't understand how R is digesting my simple meal and why it pukes out 21 warnings.
dpoisd1 <- function(x) {dpois(x, 0.0001)}
dpoisd1(1:20)
integrate(dpoisd1, lower = 1, upper = 20)
It yields 0 with absolute error < 0 and 21 warnings. I would really appreciate it if someone could show me my mistake(s).

Use warnings() to have a look at the warnings:
warnings()
#Warning messages:
#1: In dpois(x, 1e-04) : non-integer x = 10.500000
#<snip>
The first parameter of dpois must be a non-negative integer (see help("dpois")), but integrate passes non-integer values to it. In fact, it is not clear what you want to calculate: you are trying to integrate a discrete probability mass function. Possibly you want ppois, the cumulative distribution function.
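Because dpois is a probability mass function, the "integral" from 1 to 20 is presumably a sum over the integers in that range. A minimal sketch, assuming the goal is P(1 <= X <= 20) for X ~ Poisson(0.0001):

```r
lambda <- 0.0001

# Sum the probability mass function over the integer support 1..20
p_sum <- sum(dpois(1:20, lambda))

# Same probability from the cumulative distribution function:
# P(1 <= X <= 20) = P(X <= 20) - P(X <= 0)
p_cdf <- ppois(20, lambda) - ppois(0, lambda)

print(c(sum = p_sum, cdf = p_cdf))
```

Both routes should agree up to floating-point rounding, and neither triggers the non-integer warnings.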

Related

Fitting a truncated binomial distribution to data in R

I have discrete count data indicating the number of successes in 10 binomial trials for a pilot sample of 46 cases. (Larger samples will follow once I have the analysis set up.) The zero class (no successes in 10 trials) is missing, i.e. each datum is an integer value between 1 and 10 inclusive. I want to fit a truncated binomial distribution with no zero class, in order to estimate the underlying probability p. I can do this adequately on an Excel spreadsheet using least squares with Solver, but because I want to calculate bootstrap confidence intervals on p, I am trying to implement it in R.
Frankly, I am struggling to understand how to code this. This is what I have so far:
d <- detections.data$x
# load required packages
library(fitdistrplus)
library(truncdist)
library(mc2d)
ptruncated.binom <- function(q, p) {
ptrunc(q, "binom", a = 1, b = Inf, p)
}
dtruncated.binom <- function(x, p) {
dtrunc(x, "binom", a = 1, b = Inf, p)
}
fit.tbin <- fitdist(d, "truncated.binom", method="mle", start=list(p=0.1))
I have had lots of error messages which I have solved by guesswork, but the latest one has me stumped and I suspect I am totally misunderstanding something.
Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, :
'start' must specify names which are arguments to 'distr'.
I think this means I must specify starting values for x in dtrunc and q in ptrunc, but I am really unclear what they should be.
Any help would be very gratefully received.
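For comparison, the zero-truncated binomial likelihood can also be written out directly in base R, bypassing fitdistrplus and truncdist altogether. This is only a sketch under stated assumptions and does not diagnose the fitdist error above: nll_ztbinom is a hypothetical helper, and the data are simulated stand-ins for detections.data$x.

```r
# Negative log-likelihood of a zero-truncated binomial with known size:
# P(X = x | X >= 1) = dbinom(x, size, p) / (1 - (1 - p)^size)
nll_ztbinom <- function(p, x, size) {
  -sum(dbinom(x, size, p, log = TRUE) - log1p(-(1 - p)^size))
}

# Hypothetical stand-in data: 46 counts out of 10 trials, zeros dropped
set.seed(1)
x_all <- rbinom(500, size = 10, prob = 0.15)
x <- x_all[x_all > 0][1:46]

# One-dimensional maximum likelihood estimate of p on (0, 1)
fit <- optimize(nll_ztbinom, interval = c(1e-6, 1 - 1e-6), x = x, size = 10)
p_hat <- fit$minimum

# Nonparametric bootstrap: resample the data and refit each time
boot_p <- replicate(500, {
  xb <- sample(x, replace = TRUE)
  optimize(nll_ztbinom, interval = c(1e-6, 1 - 1e-6), x = xb, size = 10)$minimum
})
print(p_hat)
print(quantile(boot_p, c(0.025, 0.975)))
```

With a single parameter, optimize is enough; the percentile interval from boot_p is one simple choice of bootstrap confidence interval among several.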

How to interpret mle() trace = 6 output and why does the mle() procedure stop after 101 iterations in R?

I am trying to estimate seven constrained parameters with the function mle() using the L-BFGS-B method in R. In order to investigate why I get a "non-finite finite-difference value [2]" error, I include control = list(trace = 6) in the mle() call, hoping to learn more about the origins of the error.
I do not understand the output of the tracing very well unfortunately, making the result surprising to me: the program seems to simply stop after 101 iterations without giving me a proper reason.
Does anyone know why?
I suppose the seven X values reported by trace=6 are the parameter values the mle procedure has reached at this iteration. Plugging these values into my log-likelihood function gives me the same value as reported under "final value": -152.449285. When I plug in the seven X values from iteration 97, I get the same log-likelihood of -152.449285.
There are two things that seem to stand out. First, the second value of X, 0.999, is exactly the upper limit of the second parameter I estimate. Second, the second value of G seems relatively large at -412.172 compared to the other G values. What exactly does G indicate? The second values of X and G have been like this for many iterations. Does any of this give me a clue how I can potentially make the estimation work? Thanks in advance!
Since my question is about interpretation/intuition of results, I refrained from providing a reproducible example. It's a lot of code and I do not know how to reproduce this situation with just a tiny bit of code. Please let me know if you need my code.
The final (101st) iteration:
---------------- CAUCHY entered-------------------
There are 4 breakpoints
Piece 1 f1, f2 at start point -9.1069e-03 3.1139e+01
Distance to the next break point = 1.9816e+00
Distance to the stationary point = 2.9246e-04
GCP found in this segment
Piece 1 f1, f2 at start point -9.1069e-03 3.1139e+01
Distance to the stationary point = 2.9246e-04
Cauchy X = -0.749937 0.999 0.841376 1.14695 0.134673 0.121755 0.365289
---------------- exit CAUCHY----------------------
0 variables leave; 0 variables enter
6 variables are free at GCP on iteration 101
LINE SEARCH 0 times; norm of step = 0.000232633
X = -0.749896 **0.999** 0.841349 1.14697 0.134672 0.121757 0.36551
G = -0.0154393 **-412.172** -0.0621798 0.0130552 -0.00801055 0.00692317 -0.0134718
final value -152.449285
**stopped after 101 iterations**
Error in optim(start, f, method = method, hessian = TRUE, ...) :
non-finite finite-difference value [2]
---- UPDATE 1 ---
I followed Roland's suggestion but first tried to set maxit at 200: control = list(maxit=200, trace=6).
The procedure now converges at the 106th iteration, yet, I still get the error from before:
iterations 106
function evaluations 127
segments explored during Cauchy searches 110
BFGS updates skipped 2
active bounds at final generalized Cauchy point 1
norm of the final projected gradient 0.0217961
final function value -152.449
X = -0.749748 0.999 0.841415 1.14687 0.134666 0.121766 0.366383
F = -152.449
final value -152.449295
converged
Error in optim(start, f, method = method, hessian = TRUE, ...) :
non-finite finite-difference value [2]
---- UPDATE 2---
I followed Biswajit Banerjee's suggestion in "optim in r :non finite finite difference error" and set ndeps, which ?optim describes as "A vector of step sizes for the finite-difference approximation to the gradient", to 0.0001 for the second parameter (the default is 0.001 for every parameter). Everything works fine now! I wonder whether this is related to the second parameter's value being at its upper limit or to its G value being relatively large?
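One plausible reading of UPDATE 2, offered as an assumption rather than a diagnosis of the original model: the error appears after convergence because hessian = TRUE makes optim take finite-difference steps around the final parameters, and those steps may not be clipped to the bounds. With the second parameter sitting at 0.999, a step of 0.001 reaches 1, where the objective may no longer be finite. The toy objective f below is hypothetical and stands in for the asker's likelihood.

```r
# Toy objective (not the asker's likelihood): decreasing in p[2] and
# not finite for p[2] >= 1, so the optimum sits on the upper bound 0.999.
f <- function(p) (p[1] - 1)^2 + suppressWarnings(log(1 - p[2]))

fit_with <- function(ndeps) {
  optim(c(0, 0.5), f, method = "L-BFGS-B",
        lower = c(-Inf, 0), upper = c(Inf, 0.999),
        hessian = TRUE, control = list(ndeps = ndeps))
}

# Step 1e-3 for both parameters (the default): the Hessian step taken
# from 0.999 reaches 1, where f is not finite.
res_default <- try(fit_with(c(1e-3, 1e-3)), silent = TRUE)

# A smaller step for the second parameter keeps every evaluation below 1.
res_small <- fit_with(c(1e-3, 1e-4))
print(res_small$par)
print(res_small$hessian)
```

In this toy case the gradient with respect to the second parameter is also large in magnitude at the bound, which mirrors the large G value in the trace: the optimizer would like to move further but the bound stops it.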

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 can I run without errors. Is there any way I can improve on these minimum values with betareg? Or should I just be happy with that?
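One likely reason .Machine$double.xmin does not help at the upper end is floating-point rounding: it is the smallest positive normalized double, but doubles are far coarser near 1 than near 0, so subtracting it from 1 changes nothing. A short check in base R (this explains only the "must be in (0, 1)" error, not the numerical failures seen with 0.000001):

```r
# The smallest positive normalized double is distinguishable from 0 ...
.Machine$double.xmin > 0            # TRUE

# ... but subtracting it from 1 rounds straight back to 1,
# so the "0.9999...9999" replacement is still exactly 1.
1 - .Machine$double.xmin == 1       # TRUE

# The closest double below 1 is one unit of double.neg.eps away.
below_one <- 1 - .Machine$double.neg.eps
below_one < 1                       # TRUE
```

Values this close to the endpoints are representable, but they can still be extreme enough on the link scale to destabilize the fit.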
Try it with eps (the displacement from 0 and 1) first equal to 1e-4 (as you have here) and then with 1e-3. If the results of the models don't differ in any way you care about, that's great. If they do, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameter don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zeros and ones arise through a minimum/maximum detection threshold, so model the zero/one values as actually lying somewhere in the tails; or
a hurdle/zero-one inflation model: zeros and ones arise through a separate process from the rest of the data, so use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a Beta regression on the (0,1) component.
Questions about these steps are probably more appropriate for CrossValidated than for SO.
Sample data:
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
Trial fitting function:
ss <- function(eps) {
dd <- transform(dd,
y=pmin(1-eps,pmax(eps,y)))
m <- try(betareg(y~x,data=dd))
if (inherits(m,"try-error")) return(rep(NA,3))
return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
Try it for a range of values:
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec,reshape2::melt(res)),
aes(e,value,colour=Var2))+
geom_line()+scale_x_log10()

Parameters estimation of a bivariate mixture normal-lognormal model

I have to create a model which is a mixture of a normal and a log-normal distribution. To create it, I need to estimate the 2 covariance matrices and the mixing parameter (7 parameters in total) by maximizing the log-likelihood function. This maximization has to be performed by the nlm routine.
As I use relative data, the means are known and equal to 1.
I've already tried to do it in 1 dimension (with 1 set of relative data) and it works well. However, when I introduce the 2nd set of relative data, I get illogical results for the correlation and a lot of warning messages (25 in all).
To estimate these parameters I defined first the log-likelihood function with the 2 commands dmvnorm and dlnorm.plus. Then I assign starting values of the parameters and finally I use the nlm routine to estimate the parameters (see script below).
`P <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_P-3000.asc", return.header = FALSE)
V <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_V-3000.asc", return.header = FALSE)
p <- c(P); # tranform matrix into a vector
v <- c(V);
p<- p[!is.na(p)] # removing NA values
v<- v[!is.na(v)]
p_rel <- p/mean(p) #Transforming the data to relative values
v_rel <- v/mean(v)
PV <- cbind(p_rel, v_rel) # create a matrix of vectors
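L <- function(par, p_rel, v_rel) {
  # Covariance matrix of the normal component
  sigN <- matrix(c(par[1]^2, par[1]*par[2]*par[3],
                   par[1]*par[2]*par[3], par[2]^2), nrow = 2, ncol = 2)
  # Log-scale covariance matrix of the lognormal component
  sigLN <- matrix(c(par[4]^2, par[4]*par[5]*par[6],
                    par[4]*par[5]*par[6], par[5]^2), nrow = 2, ncol = 2)
  # Negative log-likelihood of the mixture (uses the global matrix PV)
  -sum(log((1 - par[7]) * dmvnorm(PV, mean = c(1, 1), sigma = sigN) +
           par[7] * dlnorm.rplus(PV, meanlog = c(1, 1), varlog = sigLN)))
}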
par.start<- c(0.74, 0.66 ,0.40, 1.4, 1.2, 0.4, 0.5) # log-likelihood estimators
result<-nlm(L,par.start,v_rel=v_rel,p_rel=p_rel, hessian=TRUE, iterlim=200, check.analyticals= TRUE)
Warning messages:
1: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
2: In sqrt(2 * pi * det(varlog)) : NaNs produced
3: In nlm(L, par.start, p_rel = p_rel, v_rel = v_rel, hessian = TRUE) :
NA/Inf replaced by maximum positive value
4: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
... and so on, up to 25.
par.hat <- result$estimate
cat("sigN_p =", par.hat[1],"\n","sigN_v =", par.hat[2],"\n","rhoN =", par.hat[3],"\n","sigLN_p =", par.hat[4],"\n","sigLN_v =", par.hat[5],"\n","rhoLN =", par.hat[6],"\n","mixing parameter =", par.hat[7],"\n")
sigN_p = 0.5403361
sigN_v = 0.6667375
rhoN = 0.6260181
sigLN_p = 1.705626
sigLN_v = 1.592832
rhoLN = 0.9735974
mixing parameter = 0.8113369`
Does someone know what is wrong with my model, or what I should do to find these parameters in 2 dimensions?
Thank you very much for taking time to look at my questions.
Regards,
Gladys Hertzog
When I do this kind of optimization problem, I find that it's important to make sure that all the variables that I'm optimizing over are constrained to plausible values. For example, standard deviation variables have to be positive, and from knowledge of the situation that I'm modelling I'll probably be able to put an upper bound on all my standard deviation variables as well. So if s is one of my standard deviation variables, and if m is the maximum value that I want it to take, instead of working with s I'll solve for the variable z which is related to s via
s = m / (1 + e^(-z))
In that formula, z is unconstrained, but s must lie between 0 and m. This is vital because optimization routines where the variables are not constrained to take plausible values will often try completely implausible values while they're trying to bound the solution. Implausible values often cause problems, e.g. with precision, which then result in NaNs etc. The general formula that I use for constraining a single variable x to lie between a and b is
x = a + (b - a) / (1 + e^(-z))
However, regarding your particular problem where you're looking for covariance matrices, a more sophisticated approach is necessary than simply bounding all the individual variables. Covariance matrices must be positive semi-definite, so if you're simply optimizing the individual values in the matrix, the optimization will probably fail (producing NaNs) if a matrix which isn't positive semi-definite is fed into the likelihood function. My guess is that this is what's causing your optimization to fail. To get round this problem, one approach is to solve for the Cholesky decomposition of the covariance matrix instead of the covariance matrix itself.
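Both reparameterizations can be sketched in a few lines of base R. The helper names to_interval and make_cov are hypothetical, and the 2x2 case is assumed to match the bivariate model in the question; the optimizer would then work on the unconstrained values and the likelihood would call these helpers internally.

```r
# Map an unconstrained z to the interval (a, b): x = a + (b - a) / (1 + e^(-z))
to_interval <- function(z, a, b) a + (b - a) * plogis(z)

# Build a 2x2 covariance matrix from three unconstrained numbers via its
# Cholesky factor: Sigma = C %*% t(C), with a positive diagonal in C.
make_cov <- function(theta) {
  chol_factor <- matrix(c(exp(theta[1]), theta[2],
                          0,             exp(theta[3])), nrow = 2)
  chol_factor %*% t(chol_factor)
}

sigma <- make_cov(c(-0.3, 0.2, -0.4))
print(sigma)
print(eigen(sigma, symmetric = TRUE)$values)   # both eigenvalues are positive
print(to_interval(c(-50, 0, 50), a = 0, b = 1))
```

Exponentiating the diagonal of the factor keeps it strictly positive, so every parameter vector the optimizer tries maps to a valid covariance matrix.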

Quantile regression and p-values

I am applying quantile regression to my data set (using R). It is easy to produce a nice scatterplot with different quantile regression lines
(taus <- c(0.05,0.25,0.75,0.95)).
A problem occurs when I want to produce p-values (in order to see the statistical significance of each regression line) for each one of these quantiles. For the median (tau=0.5) this is not problematic, but when it comes to, for example, tau=0.25, I get the following error message:
>QRmodel<-rq(y~x,tau=0.25,model=T)
>summary(QRmodel,se="nid")
Error in summary.rq(QRmodel, se = "nid") : tau - h < 0: error in summary.rq
What could be the reason for this?
Also: is it advisable to report p-values and coefficients for a quantile regression model, or is it enough to show just the plot and discuss the results based on it?
Best regards, frustrated person
A good way to learn what's going on in these sorts of debugging situations is to find the relevant portion of code that is throwing the error. If you type 'summary.rq' at the console, you'll see the code for the function summary.rq. Scanning through it you'll find the section where it calculates se's using the "nid" method, starting with this code:
else if (se == "nid") {
h <- bandwidth.rq(tau, n, hs = hs)
if (tau + h > 1)
stop("tau + h > 1: error in summary.rq")
if (tau - h < 0)
stop("tau - h < 0: error in summary.rq")
bhi <- rq.fit.fnb(x, y, tau = tau + h)$coef
blo <- rq.fit.fnb(x, y, tau = tau - h)$coef
So what's happening here is that in order to calculate the se's, the function first needs to calculate a bandwidth, h, and the quantreg model is then refit for tau +/- h. For taus near 0 or 1, there's a possibility that adding or subtracting the bandwidth h will lead to a tau below 0 or greater than 1, which isn't good, so the function stops.
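To get a feel for when tau - h drops below 0, the bandwidth can be approximated outside quantreg. The helper hall_sheather_h below is hypothetical: it is written from the Hall-Sheather rule of thumb, and quantreg's own bandwidth.rq may differ in detail, so treat the numbers as indicative only.

```r
# Hypothetical helper: Hall-Sheather rule-of-thumb bandwidth for quantile tau
# and sample size n. Written from the published formula; quantreg's own
# bandwidth.rq may differ in detail.
hall_sheather_h <- function(tau, n, alpha = 0.05) {
  x0 <- qnorm(tau)
  f0 <- dnorm(x0)
  n^(-1/3) * qnorm(1 - alpha / 2)^(2/3) *
    ((1.5 * f0^2) / (2 * x0^2 + 1))^(1/3)
}

tau <- 0.25
for (n in c(10, 20, 50, 200)) {
  h <- hall_sheather_h(tau, n)
  cat(sprintf("n = %3d  h = %.3f  tau - h < 0: %s\n", n, h, tau - h < 0))
}
```

Under this rule the bandwidth shrinks like n^(-1/3), so the check is most likely to trip when the sample is small or tau is close to 0 or 1.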
You have a couple of options:
1.) Try a different se method (bootstrapping?)
2.) Modify the summary.rq code yourself to force it to use either max(tau,0) or min(tau,1) in the instances where the bandwidth pushes tau out of bounds. (There could be serious theoretical reasons why this is a bad idea; not advised unless you know what you're doing.)
3.) You could try to read up on the theory behind the calculation of these se's so you'd have a better idea of when they might work well or not. This might shed some light on why you're running into errors with values of tau near 0 or 1.
Try summary(QRmodel,se="boot")
Have a look at the help for summary.rq as well!
