numerical integration of a tricky function - r

The prob package numerically evaluates characteristic functions for base R distributions. For almost all distributions there are existing formulas. For a few cases, though, no closed-form solution is known. Case in point: the Weibull distribution (but see below).
For the Weibull characteristic function I essentially compute two integrals and put them together:
fr <- function(x) cos(t * x) * dweibull(x, shape, scale)
fi <- function(x) sin(t * x) * dweibull(x, shape, scale)
Rp <- integrate(fr, lower = 0, upper = Inf)$value
Ip <- integrate(fi, lower = 0, upper = Inf)$value
Rp + (0+1i) * Ip
Yes, it's clumsy, but it works surprisingly well! ...ahem, most of the time. A user reported recently that the following breaks:
cfweibull(56, shape = 0.5, scale = 1)
Error in integrate(fr, lower = 0, upper = Inf) :
the integral is probably divergent
Now, we know that the integral isn't divergent, so it must be a numerical problem. With some fiddling I could get the following to work:
fr <- function(x) cos(56 * x) * dweibull(x, 0.5, 1)
integrate(fr, lower = 0.00001, upper = Inf, subdivisions=1e7)$value
[1] 0.08024055
That's OK, but it isn't quite right, plus it takes a fair bit of fiddling, which doesn't scale well. I've been investigating this for a better solution. I found a recently published "closed-form" for the characteristic function with shape > 1 (see here), but it involves Wright's generalized confluent hypergeometric function, which isn't implemented in R (yet). I looked into the archives for integrate alternatives, and there's a ton of stuff out there which doesn't seem very well organized.
As part of that searching it occurred to me to translate the region of integration to a finite interval via the inverse tangent, and voila! Check it out:
cfweibull3 <- function (t, shape, scale = 1){
    if (shape <= 0 || scale <= 0)
        stop("shape and scale must be positive")
    fr <- function(x) cos(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
    fi <- function(x) sin(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
    Rp <- integrate(fr, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
    Ip <- integrate(fi, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
    Rp + (0+1i) * Ip
}
> cfweibull3(56, shape=0.5, scale = 1)
[1] 0.08297194+0.07528834i
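The change of variables behind cfweibull3 can be written out explicitly. This is a sketch in LaTeX notation; the symbols g and u are introduced here for illustration and are not in the original post. Substituting x = tan(u) gives dx = du / cos^2(u) and maps [0, ∞) onto [0, π/2):

```latex
\int_0^\infty g(x)\,dx
  = \int_0^{\pi/2} \frac{g(\tan u)}{\cos^2 u}\,du ,
\qquad
g(x) = \cos(tx)\, f(x) \ \text{ or } \ g(x) = \sin(tx)\, f(x),
```

where f is the Weibull density (dweibull in the code). The interval becomes finite, but the oscillations are compressed towards u = π/2 rather than removed.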
Questions:
Can you do better than this?
Could people who are expert in numerical integration routines shed some light on what's happening here? I have a sneaking suspicion that for large t the cosine fluctuates rapidly, which causes problems...?
Are there existing R routines/packages which are better suited for this type of problem, and could somebody point me to a well-placed position (on the mountain) to start the climb?
Comments:
Yes, it is bad practice to use t as a function argument.
I calculated the exact answer for shape > 1 using the published result with Maple, and the brute-force-integrate-by-the-definition-with-R kicked Maple's ass. That is, I get the same answer (up to numerical precision) in a small fraction of a second and an even smaller fraction of the price.
Edit:
I was going to write down the exact integrals I'm looking for, but it seems this particular site doesn't support MathJax, so I'll give links instead. I'm looking to numerically evaluate the characteristic function of the Weibull distribution for reasonable inputs t (whatever that means). The value is a complex number, but we can split it into its real and imaginary parts, and that's what I was calling Rp and Ip above.
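For reference, the two integrals can be written out as follows. This is a sketch in LaTeX notation; k and λ stand for shape and scale, symbols that are not used in the original post, and f is the density computed by dweibull:

```latex
\varphi_X(t) = \mathbb{E}\!\left[e^{itX}\right]
  = \underbrace{\int_0^\infty \cos(tx)\, f(x;k,\lambda)\,dx}_{\texttt{Rp}}
  \;+\; i\,\underbrace{\int_0^\infty \sin(tx)\, f(x;k,\lambda)\,dx}_{\texttt{Ip}},
\qquad
f(x;k,\lambda) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}
  e^{-(x/\lambda)^{k}}, \quad x > 0 .
```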
One final comment: Wikipedia has a formula listed (an infinite series) for the Weibull c.f. and that formula matches the one proved in the paper I referenced above, however, that series has only been proved to hold for shape > 1. The case 0 < shape < 1 is still an open problem; see the paper for details.

You may be interested to look at this paper, which discusses different integration methods for highly oscillating integrals -- that's what you are essentially trying to compute:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.6944
Another piece of advice: instead of an infinite limit you may want to specify a smaller one. If you specify the precision that you want, then based on the cdf of the Weibull you can easily estimate how much of the tail you can truncate. And if you have a fixed limit, then you can specify exactly (or almost exactly) the number of subdivisions (e.g. in order to have a few (4-8) points per period).
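The truncation idea above might be sketched as follows. This is a minimal illustration, not the answerer's code: the function name cf_weibull_truncated and the eps argument are made up here, and instead of tuning the subdivisions argument it simply integrates one oscillation period at a time. It has not been checked against the failing shape = 0.5, t = 56 case.

```r
# Sketch: Weibull characteristic function by truncating the tail and
# integrating one oscillation period at a time.
# `cf_weibull_truncated` and `eps` are hypothetical names, not from the post.
cf_weibull_truncated <- function(t, shape, scale = 1, eps = 1e-10) {
  # Upper limit chosen so that P(X > upper) = eps; because |cos| and |sin|
  # are at most 1, the dropped tail changes each integral by at most eps.
  upper <- qweibull(eps, shape, scale, lower.tail = FALSE)

  # Break [0, upper] at multiples of the period 2*pi/|t| (one piece if t == 0).
  period <- if (t == 0) upper else 2 * pi / abs(t)
  breaks <- unique(c(seq(0, upper, by = period), upper))
  lower_ends <- breaks[-length(breaks)]
  upper_ends <- breaks[-1]

  # Sum of integrate() results over the individual pieces.
  integrate_pieces <- function(f) {
    sum(mapply(function(a, b) integrate(f, a, b)$value, lower_ends, upper_ends))
  }

  re <- integrate_pieces(function(x) cos(t * x) * dweibull(x, shape, scale))
  im <- integrate_pieces(function(x) sin(t * x) * dweibull(x, shape, scale))
  complex(real = re, imaginary = im)
}

# Sanity check against a case with a known answer: shape = 1 is the
# exponential distribution, whose characteristic function is
# 1 / (1 - 1i * t * scale).
cf_weibull_truncated(5, shape = 1, scale = 1)
1 / (1 - 5i)
```

Splitting at period boundaries keeps each piece smooth and short, at the cost of many integrate calls when t is large or the tail is heavy.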

I had the same problem as Jay - not with the Weibull distribution but with the integrate function. I found my answer to Jay's question 3 in a comment to this question:
Divergent Integral in R is solvable in Wolfram
The R package pracma contains several functions for solving integrals numerically. In the package, one finds some R functions for integrating certain mathematical functions, and there is a more general function, integral. That helped in my case. Example code is given below.
To question 2: The first answer to the linked question (above) states that R does not print the complete error message from the C source file (the function may just converge too slowly). Therefore, I would agree with Jay that the fast fluctuation of the cosine may be a problem. In my case and in the example below it was the problem.
Example Code
# load Practical Numerical Math Functions package
library(pracma)
# define function
testfun <- function(r) cos(r*10^6)*exp(-r)
# Integrate it numerically with the basic 'integrate'.
out1 = integrate(testfun, 0, 100)
# "Error in integrate(testfun, 0, 100) : the integral is probably divergent"
# Integrate it numerically with 'integral' from 'pracma' package
# using 'Gauss-Kronrod' method and 10^-8 as relative tolerance. I
# did not try the other ones.
out2 = integral(testfun, 0, 100, method = 'Kronrod', reltol = 1e-8)
Two remarks
The integral function does not break as the integrate function does, but it may take quite a long time to run. I do not know (and I did not try) whether the user can limit the number of iterations.
Even if the integral function finishes without errors, I am not sure how correct the result is. Numerically integrating a function which fluctuates quickly around zero seems to be quite tricky, since one does not know where exactly on the fluctuating function the values are calculated (for example, twice as many positive as negative values, or positive values close to local maxima and negative values far off). I am not an expert on numerical integration; I just got to know some basic fixed-step integration methods in my numerics lectures. So maybe the adaptive methods used in integral deal with this problem in some way.
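One way to judge how correct a numerical result is for this particular test function: it has a closed-form integral, so the exact value is available for comparison. The sketch below uses a hypothetical helper name, exact_integral, and relies only on the fact that cos(omega * r) * exp(-r) is the real part of exp(-(1 - 1i * omega) * r).

```r
# Closed form of the integral of cos(omega * r) * exp(-r) over [0, upper],
# obtained as the real part of the integral of exp(-(1 - 1i * omega) * r).
# `exact_integral` is a hypothetical helper name.
exact_integral <- function(omega, upper) {
  z <- 1 - 1i * omega
  Re((1 - exp(-z * upper)) / z)
}

# Mildly oscillating case, where base integrate() can serve as a cross-check.
omega_small <- 10
numeric_small <- integrate(function(r) cos(omega_small * r) * exp(-r),
                           lower = 0, upper = 5, rel.tol = 1e-10)$value
c(numeric = numeric_small, exact = exact_integral(omega_small, 5))

# The test function from the answer: omega = 10^6 on [0, 100].
exact_integral(1e6, 100)
```

For large omega the exact value is essentially 1 / (1 + omega^2), which is tiny, so a quadrature routine has to resolve a very small number from large cancelling contributions.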

I'm attempting to answer questions 1 & 3. That being said I am not contributing any original code. I did a google search and hopefully this is helpful. Good luck!
Source: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf (p.6)
#Script
library(ggplot2)
## sampling from a Weibull distribution with parameters shape=2.1 and scale=1.1
x.wei <- rweibull(n=200, shape=2.1, scale=1.1)
## theoretical quantiles from a Weibull population with known parameters shape=2 and scale=1
x.teo <- rweibull(n=200, shape=2, scale=1)
#Figure
qqplot(x.teo, x.wei, main="QQ-plot distr. Weibull") ## QQ-plot
abline(0,1) ## a 45-degree reference line is plotted

Is this of any use?
http://www.sciencedirect.com/science/article/pii/S0378383907000452
Muraleedharana et al (2007) Modified Weibull distribution for maximum and significant wave height simulation and prediction, Coastal Engineering, Volume 54, Issue 8, August 2007, Pages 630–638
From the abstract: "The characteristic function of the Weibull distribution is derived."

Related

Would nonidentifiability create an inconsistent response from optim in R?

My objective is to use a kinetic model to describe reaction data. The application is for a fuel, and the model is widely accepted as one of the more accurate ones given the setup of my problem. I may have a nonidentifiability issue, but it bothers me that optim gives such an inconsistent response.
Take the two graphs. For the first, I picked the point based on its low squared error. The second is what optim selected (I don't have enough rep for picture 2, I will try to post it in a comment; hint hint, it's not close to lining up). When I ran the numbers that optim gave me, they did not match the expected response. I wish I could paste the exact values, but the optimization takes more than two hours each run, so I have been tuning it as much as I can with the time I can get. I can say that R is settling on the boundaries. The bounds are set to physical limits at room temperature one can obtain from the pure compound (i.e. the molarity at room temperature). I can be flexible, but not too much, as the point of the project was to limit the model parameters to observed physical parameters.
This is all to prep it for an MCMC to add Bayesian elements to this. If my first guess is junk, so is the whole project.
To sum it up, I would like to know why the errors are inconsistent and, if this comes from nonidentifiability, whether improving the initial guess can fix it or whether I need to remove a variable.
Code for reference.
Objective function
init = function(param){
#Solve for displacement of triglycerides
T.mcmc2 = T.hat.isf
T.mcmc2 = T.mcmc2 - min(T.mcmc2)
A.mcmc2 = T.mcmc2
A.mcmc2[1] <- (6*1.02*.200)/.250
B.mcmc2 = T.mcmc2
primes <- Bprime(x.fine1, param, T.mcmc2, A.mcmc2, B.mcmc2)
B.mcmc2 <- as.numeric(unlist(primes[1]))
A.mcmc2 <- as.numeric(unlist(primes[2]))
res = t(B.obs-B.mcmc2[x.points])%*%(B.obs-B.mcmc2[x.points])
#print(res)
return (res)
}
Optimization with parameters
l = c(1e-8,1e-8, 1e-8, 1e-8)
u = c(2,1.2,24,24)
th0=c(.052, 0.19, .50, 8)
op = optim(th0[1:3], init, method="L-BFGS-B", lower=l, upper = u)
Once run, this message often occurs "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
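One common way to probe whether nonidentifiability is behind inconsistent optim results is to run the optimizer from several starting points and compare the fits. The sketch below uses a toy objective made up for illustration (it is not the kinetic model above): only the product of the two parameters matters, so many parameter pairs fit equally well.

```r
# Toy objective that is nonidentifiable by construction: only the product
# a * b matters, so every point on the curve a * b = 2 is a minimum.
# All names here are hypothetical and unrelated to the kinetic model.
toy_objective <- function(p) (p[1] * p[2] - 2)^2
toy_gradient  <- function(p) 2 * (p[1] * p[2] - 2) * c(p[2], p[1])

starts <- list(c(1, 1), c(0.5, 5), c(5, 0.5))
fits <- lapply(starts, function(p0) {
  optim(p0, toy_objective, gr = toy_gradient, method = "L-BFGS-B",
        lower = c(0.01, 0.01), upper = c(10, 10))
})

# One row per start: fitted parameters and the objective value reached.
summary_table <- t(sapply(fits, function(f) {
  c(a = f$par[1], b = f$par[2], value = f$value)
}))
summary_table
```

If different starts reach essentially the same objective value with clearly different parameters, as in this toy case, the data cannot distinguish those parameter sets, and a better initial guess alone will not pin them down.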

optim (method=Brent) and optimize not giving correct minimum for binomial distribution (N > 1000)

I'm using optim() (and optimize()) to try and find the quantiles of a binomial distribution; however, for N ~ 2000 (N = 2135), the functions do not give the correct value.
optim(21, function(x) abs(1 - pbinom(x, 2135, 21/2135) - 0.1),
method = "Brent", lower = 1, upper = 2135)
optimize(function(x) abs(1 - pbinom(x, 2135, 21/2135) - 0.1), c(1,2135))
P.S.: I also tried to set the min argument equal to the probability, but I still get incorrect answers.
The problem is that optimize() assumes that small changes in the parameter will give reliable information about whether the minimum has been attained (and which direction to go if not). (I initially said that the function needed to be differentiable, which might not be true: see the Wikipedia article on Brent's method.) In other words, most of the easily available optimization algorithms can fail on an objective function that is piecewise constant, as this one is ...
IMO the accepted answer to this nearly identical question is simply wrong. (It states that "the gradient at your starting point is almost 0", whereas in fact it's exactly zero; using optimize() doesn't help, as you found out, and picking a different starting point is more or less a matter of luck ...)
I made up a smaller example to illustrate: find the 0.6 quantile of the binomial distribution with N=10, prob=0.2. R can do this directly, very easily: qbinom(0.6, size=10, prob=0.2) ! But assuming that you want to solve some other problem of a similar form, and this is just an example, or that the constraints are given by a homework problem, or ...
Slightly simplified objective function (using the squared difference rather than the absolute value):
fx <- function(x) (pbinom(x, size=10, prob=0.2) - 0.6)^2
What does this look like?
curve(fx, from = 0, to =10, n=501)
So the correct answer is any value between 2 and 3. In this particular case optimize(fx, interval=c(1,10)) happens to work OK (returns 2.313, you could use floor() to convert it to 2), but it will fail if I use a wider interval (optimize(fx, interval=c(1,100)) returns 99.99996), or if I did a problem with a larger size. Let me try
fx2 <- function(x) (pbinom(x, size=1000, prob=0.2) - 0.6)^2
qbinom(0.6, size=1000, prob=0.2) ## answer: 203
optimize(fx2, interval=c(1,1000)) ## 999.9999
The problem is that if the initial step of the optimization method jumps less than one unit, the algorithm will conclude that it has found the minimum.
One potential solution is to look for a root rather than a minimum:
fx3 <- function(x) pbinom(x, size=1000, prob=0.2) - 0.6
uniroot(fx3, interval=c(1,1000)) ## 203
I don't know a good way to solve this as an optimization problem. A stochastic global optimizer would work, but would in general be very inefficient. See here for one particular problem involving nonlinear discrete optimization in R. You can also look at the optimization task view, although I didn't find it useful ...
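Because the CDF is a non-decreasing step function on the integers, one alternative that avoids continuous optimizers altogether is a bisection search over integers. The sketch below is an illustration under that assumption; quantile_by_bisection is a hypothetical helper, not part of base R or of the answer above.

```r
# Smallest integer k in [lo, hi] with cdf(k) >= p, found by bisection.
# Assumes cdf is non-decreasing in k and that cdf(hi) >= p.
# `quantile_by_bisection` is a hypothetical helper name.
quantile_by_bisection <- function(p, cdf, lo, hi) {
  stopifnot(cdf(hi) >= p)
  while (lo < hi) {
    mid <- (lo + hi) %/% 2
    if (cdf(mid) >= p) {
      hi <- mid        # mid satisfies the condition, so the answer is <= mid
    } else {
      lo <- mid + 1    # mid is too small, so the answer is > mid
    }
  }
  lo
}

# The small example from the answer, compared with the built-in quantile function.
quantile_by_bisection(0.6, function(k) pbinom(k, size = 1000, prob = 0.2), 0, 1000)
qbinom(0.6, size = 1000, prob = 0.2)

# The question's case: upper-tail probability 0.1, i.e. the 0.9 quantile.
quantile_by_bisection(0.9, function(k) pbinom(k, size = 2135, prob = 21/2135), 0, 2135)
```

The search needs only about log2(hi - lo) CDF evaluations and never depends on a starting point or on local slope information.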

Why is my approximation too large using Composite Simpson's rule in R (numerical integration)?

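I am trying to approximate the following integral, using numerical integration in R:
$$\int_{20}^{72} \exp\left(-\int_0^t \mu(s)\,ds\right) dt,$$
where the function mu is defined by this formula:
$$\mu(x) = 0.000500 + 10^{5.728 + 0.038(x+48) - 10}.$$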
To do this, I have implemented the Composite Simpson's rule as a function in R, which takes as parameters a function (integrand), the integration interval ([a,b]) and the number of subintervals desired (n).
I have tested my code on various different mathematical functions, and it seems to be working just fine. However, when I try to approximate the integral shown in the picture, the approximation becomes too large.
My method has been to first define the inner integral in terms of its Composite Simpson approximation as a function of t in R. Then, use the Composite Simpson's rule again, in order to calculate the outer integral by viewing the inner approximation as the function to be integrated.
When doing this, the inner approximation is correct when calculated by itself, as expected, but the approximation of the entire expression becomes too large, and I can't seem to figure out why.
I am comparing the approximations to those given by Maple; the inner expression calculated by itself, using t=20, should give 0.8157191, and the entire expression should be 12.837. R correctly calculates 0.8157191, but gives 32.9285 for the entire expression.
I have tried simplifying using numerous different mathematical functions, and making the functions independent of t in R, but all seems to result in the same error. So, to sum things up, my question is, why is only the outer integral being approximated wrongly?
I would be greatly appreciative of any hints or pointers - I have included my code illustrating the problem here:
compositesimpson <- function(integrand, a, b, n) {
  # THE DEFINITE INTERVAL IS SCALED BY THE DESIRED NUMBER OF SUBINTERVALS
  h <- (b-a)/n
  # DIVIDES THE DEFINITE INTERVAL INTO THE DESIRED NUMBER OF SUBINTERVALS,
  # EXCLUDING a AND b
  xi <- seq.int(a, b, length.out = n+1)
  xi <- xi[-1]
  xi <- xi[-length(xi)]
  # THE APPROXIMATION ITSELF
  approks <- (h/3)*(integrand(a) + 2*sum(integrand(xi[seq.int(2, length(xi), 2)])) +
                    4*sum(integrand(xi[seq.int(1, length(xi), 2)])) + integrand(b))
  return(approks)
}
# SHOULD YIELD -826.5755 BY Maple, SO THE FUNCTION IS WORKING HERE
ftest<- function(x) {
return(exp(2*x)*sin(3*x))
}
compositesimpson(ftest, -4, 4, 100000)
# MU FUNCTION FOR TESTING
mu.01.kvinde<- function(x){ 0.000500 + 10^(5.728 + 0.038*(x+48) -10)}
#INNER INTEGRAL AS A FUNCTION OF ITS APPROXIMATION
indreintegrale.person1<- function(t){
indre<- exp(-compositesimpson(mu.01.kvinde, 0, t, 100000))
return(indre)
}
indreintegrale.person1(20) #YIELDS 0.8157191, WHICH IS CORRECT
compositesimpson(indreintegrale.person1, 20, 72, 100000) #YIELDS 32.9285,
#BUT SHOULD BE 12.837 ACCORDING TO MAPLE
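This has to do with using vectorisation at two levels of nesting, which doesn't do what you want: the outer compositesimpson call passes a whole vector of t values to indreintegrale.person1, but that function only gives the right answer for a single t. E.g. compare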
indreintegrale.person1(20)
#> [1] 0.8157191
indreintegrale.person1(c(20, 72))
#> [1] 0.8157191 0.4801160
indreintegrale.person1(72)
#> [1] 2.336346e-10
I think the middle answer is wrong, but the other two are right.
Quickest fix, make this replacement:
indreintegrale.person1 <- function(t){
sapply(t, function(t2) exp(-compositesimpson(mu.01.kvinde, 0, t2, 100000)))
}
and it now gives the answer you expect (but takes a bit longer to calculate!).
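The same pattern can be seen in a smaller, self-contained case. The sketch below uses base R's Vectorize as an alternative to writing the sapply by hand; the inner integrand is a toy chosen so that the exact answer is known, and the names inner, inner_vec and outer_value are made up for illustration.

```r
# A function that is only correct for a single t (like the inner integral above).
# Here the inner integral of s from 0 to t equals t^2 / 2.
inner <- function(t) integrate(function(s) s, lower = 0, upper = t)$value

# Vectorize() wraps it so that a vector of t values is handled one element at a time.
inner_vec <- Vectorize(inner)

# integrate() passes a vector of evaluation points to its integrand, so the
# wrapped version is the one to hand over. The exact value of the integral
# of t^2 / 2 from 0 to 2 is 4/3.
outer_value <- integrate(inner_vec, lower = 0, upper = 2)$value
outer_value
```

Vectorize is essentially a convenience wrapper around mapply, so the cost is the same as the sapply fix: one inner integration per evaluation point.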

Determine mode locations of the kernel density estimate of multimodal univariate data

If I have a density function and I plot it with a particular bandwidth, I visually determine that there are 7 local maxima. I would just like to know how to plot separate distributions of the particular maxima on the same plot.
Also, is it possible to know exactly where the maxima occur by running some code? I can make ball-park estimates using the plot, but is there an R function that I can use to get the exact points? I would like to know the mean and variance of the 7 densities that I have identified.
Specifically, I have the following:
plot(density(stamp, bw=0.0013,kernel = "gaussian"))
Determining which modes are real in a kernel density estimate is a matter of which bandwidth you choose. This is a complicated thing, and I don't advise relying on a single bandwidth, as even different optimal rules of thumb can give you different answers. In general, a KDE has fewer modes than the underlying density when it is oversmoothed and more when it is undersmoothed. There are many papers that cover this topic and give you options to help determine the veracity of a mode; Silverman's mode test for Gaussian kernels, Friedman and Fisher's PRIM algorithm, Marron's SiZer, and Minnotte and Scott's mode tree are good places to start.
A naive thing you can do, given a single KDE choice of bandwidth is check the run lengths.
In fact, with the bandwidth you have chosen, I find 9 modes. Just calculate the sign change of the difference in the series, and calculate the cumulative length of the runs in order to find the points. Every other point will be a mode or an antimode, depending on which came first. (You can check the sign to determine this)
library(BSDA)
dstamp <- density(Stamp$thickness, bw=0.0013, kernel = "gaussian")
chng <- cumsum(rle(sign(diff(dstamp$y)))$lengths)
plot(dstamp)
abline(v = dstamp$x[chng[seq(1,length(chng),2)]])
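The sign-change idea can also be tried on data where the answer is known in advance. The sketch below uses a synthetic bimodal sample as a stand-in for the stamp data (an assumption made here for illustration) and reports both the location and the height of each local maximum of the estimate.

```r
set.seed(1)
# Synthetic bimodal sample standing in for the stamp data (an assumption).
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 6))
d <- density(x)

# A local maximum is where the slope changes from positive to negative,
# i.e. the sign of successive differences drops from +1 to -1.
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
modes <- data.frame(location = d$x[peak_idx], density = d$y[peak_idx])

# Strongest modes first.
modes[order(-modes$density), ]
```

The locations are only as fine as the evaluation grid of density() (512 points by default), so increasing its n argument gives a finer answer for a fixed bandwidth.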
Since I needed something to get only the strongest modes, I created a dead simple algorithm that allows you to adjust sensitivity by tweaking the number of density samples (to decrease local noise) and to set a minimum density threshold, proportional to the max density (to decrease the global noise).
find_posterior_modes <- function(x, n.samples = 100, filter = .1) {
  d <- density(x, n = n.samples)
  # keep points that are higher than both neighbours and above the threshold
  x <- with(d, sapply(2:(n.samples - 1), function(i)
    if (y[i] > y[i - 1] & y[i] > y[i + 1] & y[i] > max(y) * filter) x[i]))
  unlist(x)
}
I recently released the package ModEstM, it uses the same method as shayaa, with two features to suppress the less significant modes :
it is possible to choose the bandwidth of the density estimation, by choosing the "adjust" parameter of the density function,
the modes are presented in decreasing order of the corresponding density.

Automatically find the scaling factor of the x-axis using LsqFit (or other method)?

I have the following data: a vector B and a vector R. The vector B is the "independent" variable. For this pair, I have two data sets: One is an experimental measurement of Bex, Rex and the other is a simulation produced by me Bsim, Rsim. The simulation does not have any "scale" for the x-axis (the B vector). Therefore when I am trying to fit my curve to the experiment, I have to find out a scaling parameter B0 "by eye", and with this number B0 I multiply the entire Bsim vector and simply plot(Bsim, Rsim, Bex, Rex).
I wanted to use the package LsqFit to make the procedure automatic and more accurate. However I am having trouble in understanding how I could use it to find the scaling on the independent variable.
My first thought was to just "invert" the roles of B and R. However, there are two issues that I think make matters worse: 1) the R curve/data is not monotonic, 2) the experimental data are much more "dense" (they have more data-points: my simulation has 120 points in total, the experiments have some thousands).
Below I give an example of what I am trying to accomplish (of course, the answer need not use LsqFit). I also attach two figures that demonstrate everything very clearly.
#= stuff happened before this point =#
Bsim, Rsim = load(simulation)
Bex, Rex = load(experiment)
#this is what I want to do:
some_model(x, p) = ???
fit = curve_fit(some_model, Bex, Rex, [3.5])
B0 = fit.param[1]
#this is what I currently do by trial and error:
B0 = 3.85
plot(B0*Bsim, Rsim, Bex, Rex)
P.S.: The R curves (dependent variables) are both normalized by their maximum value because their scaling is not important.
A simple approach, iff you can always expect both your experiment and simulation to feature one high peak, and you're sure that there's only a scaling factor rather than also an offset, is to simply multiply your Bsim vector by mode_rex / mode_rsim (e.g. in your example, mode_rsim = 1 and mode_rex = 4, so multiply Bsim by 4). But I'm sure you've thought of this already.
For a more general approach, one way is as follows:
add and load Interpolations package
Create a grid to interpolate over, e.g. Grid = 0:0.01:Bex[end]
interpolate Rex over that grid, e.g.
RexInterp = interpolate( (Bex,), Rex, Gridded(Linear()));
RexGridVec = RexInterp[Grid];
interpolate Rsim over the same grid, but introduce your multiplier on the Bsim "knots", e.g.
Multiplier = 0.1;
RsimInterp = interpolate( (Multiplier * Bsim,), Rsim, Gridded(Linear()));
RsimGridVec = RsimInterp[Grid]
Now you can calculate a square error value between RsimGridVec and RexGridVec, e.g.
SqErr = sum((RsimGridVec - RexGridVec).^2)
If you follow this technique, then if you create a loop for a multiplier range (say 0:0.01:10), and get the square error associated with each multiplier, you can find out the multiplier for which the square error is the minimum.
In theory, if you wanted to find the optimum for a particular offset too, you can make it the outer loop for a range of offsets. Mind you, this is a brute force approach, but it should be reasonably efficient judging by the vectors in your graph.
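A brute-force scan like the one described above can be sketched as follows. It is written in R rather than Julia, to match the other examples on this page, and it uses synthetic Gaussian-shaped curves with a known scale factor in place of the real Bsim/Rsim and Bex/Rex data; every name and number below is an assumption made for illustration. Here the simulated curve is interpolated directly at the experimental B values with approx, instead of interpolating both curves onto a common grid.

```r
# Synthetic stand-ins: the "experiment" is the simulation stretched by a
# known factor, so the scan can be checked against the truth.
true_scale <- 3.85
b_sim <- seq(0, 1, length.out = 120)
r_sim <- exp(-((b_sim - 0.3) / 0.1)^2)
b_ex  <- seq(0, 4, length.out = 2000)
r_ex  <- exp(-((b_ex / true_scale - 0.3) / 0.1)^2)

# Squared error between the experiment and the rescaled simulation,
# with the simulation linearly interpolated at the experimental B values.
# rule = 2 holds the end values constant outside the simulated range.
sq_err <- function(scale) {
  r_sim_on_ex <- approx(scale * b_sim, r_sim, xout = b_ex, rule = 2)$y
  sum((r_sim_on_ex - r_ex)^2)
}

# Scan a range of candidate multipliers and keep the one with the smallest error.
scales <- seq(0.5, 10, by = 0.01)
errs <- vapply(scales, sq_err, numeric(1))
best_scale <- scales[which.min(errs)]
best_scale
```

Once the scan has located the right neighbourhood, the same sq_err function could be handed to a one-dimensional optimizer over a narrow interval to refine the multiplier.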
