Optimizing for global minimum in R

I am attempting to use optimize() to find the minimum value of n for the following function (Clopper-Pearson lower bound):
f <- function (n, p=0.5)
(1 + (n - p*n + 1) /
(p*n*qf(p= .025, df1= 2*p, df2= 2*(n - p + 1))))^-1
And here is how I attempted to optimize it:
n_clop <- optimize(f, c(300,400), maximum = FALSE, p=0.5)
n_clop
I did this over the interval [300,400] because I suspect the value lies within it, but ultimately I would like to do the optimization between 0 and infinity. It seems that this command is producing a local minimum, because no matter the interval it returns the lower bound of that interval as the minimum - which is not what I expect from Clopper-Pearson. So, my two questions are: how do I properly find a global minimum in R, and how do I do so over any interval?

I've very briefly looked over the Wikipedia page you linked and don't see any obvious typos in your formula (although I feel like it should be 0.975=1-alpha/2 rather than 0.025=alpha/2?). However, evaluating the function you've coded over a very broad scale suggests that there are no local minima that are messing you up. My strong guess would be that either your logic is wrong (i.e., n->0 is really the right answer) or that you haven't coded what you think you're coding, due to a typo (possibly in the Wikipedia article, although that seems unlikely) or a thinko.
f <- function (n, p=0.5)
(1 + (n - p*n + 1) /
(p*n*qf(p= .025, df1= 2*p, df2= 2*(n - p + 1))))^-1
Confirm that you're getting the right answer for the interval you chose:
curve(f(x), from = 300, to = 400)
Evaluating over a broad range (n = 10^-5 to 10^7, on a log10 scale):
curve(f(10^x), from = -5, to = 7)
As @MrFlick suggests, global optimization is hard. You could start with optim(..., method="SANN"), but the best answer is definitely case-specific.
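For illustration, here is a minimal SANN sketch. It assumes the f() defined above and an arbitrary starting value of n = 350; SANN is stochastic, so set a seed and expect run-to-run variation, and the control parameters would need tuning for a real problem.
set.seed(101)
# simulated annealing from a single starting point (SANN ignores bounds)
res <- optim(par = 350, fn = f, method = "SANN", control = list(maxit = 20000))
res$par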

Optimize within for loop cannot find function

I've got a function, KozakTaper, that returns the diameter of a tree trunk at a given height (DHT). There's no algebraic way to rearrange the original taper equation to return DHT at a given diameter (4 inches, for my purposes)...enter R! (using 3.4.3 on Windows 10)
My approach was to use a for loop to iterate likely values of DHT (25-100% of total tree height, HT), and then use optimize to choose the one that returns a diameter closest to 4". Too bad I get the error message Error in f(arg, ...) : could not find function "f".
Here's a shortened definition of KozakTaper along with my best attempt so far.
KozakTaper = function(Bark, SPP, DHT, DBH, HT, Planted) {
  if (Bark == 'ob' & SPP == 'AB') {
    a0_tap = 1.0693567631
    a1_tap = 0.9975021951
    a2_tap = -0.01282775
    b1_tap = 0.3921013594
    b2_tap = -1.054622304
    b3_tap = 0.7758393514
    b4_tap = 4.1034897617
    b5_tap = 0.1185960455
    b6_tap = -1.080697381
    b7_tap = 0
  } else if (Bark == 'ob' & SPP == 'RS') {
    a0_tap = 0.8758
    a1_tap = 0.992
    a2_tap = 0.0633
    b1_tap = 0.4128
    b2_tap = -0.6877
    b3_tap = 0.4413
    b4_tap = 1.1818
    b5_tap = 0.1131
    b6_tap = -0.4356
    b7_tap = 0.1042
  } else {
    a0_tap = 1.1263776728
    a1_tap = 0.9485083275
    a2_tap = 0.0371321602
    b1_tap = 0.7662525552
    b2_tap = -0.028147685
    b3_tap = 0.2334044323
    b4_tap = 4.8569609081
    b5_tap = 0.0753180483
    b6_tap = -0.205052535
    b7_tap = 0
  }
  p = 1.3/HT
  z = DHT/HT
  Xi = (1 - z^(1/3))/(1 - p^(1/3))
  Qi = 1 - z^(1/3)
  y = (a0_tap * (DBH^a1_tap) * (HT^a2_tap)) *
    Xi^(b1_tap * z^4 + b2_tap * exp(-DBH/HT) + b3_tap * Xi^0.1 +
        b4_tap * (1/DBH) + b5_tap * HT^Qi + b6_tap * Xi + b7_tap * Planted)
  return(round(y, 4))
}
HT <- .3048*85 # converting from English to metric (sorry, it's forestry)
for (i in c((HT*.25):(HT+1))) {
  d <- KozakTaper(Bark='ob', SPP='RS', DHT=i, DBH=2.54*19, HT=.3048*85, Planted=0)
  frame <- na.omit(d)
  optimize(f=abs(10.16-d), interval=frame, lower=1, upper=90,
           maximum = FALSE,
           tol = .Machine$double.eps^0.25)
}
Eventually I would like this code to iterate through a csv and return i for the best d, which will require some rearranging, but I figured I should make it work for one tree first.
When I print d I get multiple values, so it is iterating through i, but it gets held up at the optimize function.
Defining frame was my most recent tactic, because d returns one NaN at the end, but it may not be the best input for interval. I've tried interval=c((HT*.25):(HT+1)), defining KozakTaper within the for loop, and defining f prior to the optimize, but I get the same error. Suggestions for what part I should target (or other approaches) are appreciated!
-KB
Forestry Research Fellow, Appalachian Mountain Club.
MS, University of Maine
Edit with a follow-up question:
I'm now trying to run this script for each row of a csv, "Input". Each row contains the input values for KozakTaper, and I've called them like this:
Input=read.csv...
Input$Opt=0
o <- optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                     SPP='Input$Species',
                                                     DHT=x,
                                                     DBH=(2.54*Input$DBH),
                                                     HT=(.3048*Input$Ht),
                                                     Planted=0)),
              lower=Input$Ht*.25, upper=Input$Ht+1,
              maximum = FALSE, tol = .Machine$double.eps^0.25)
Input$Opt <- o$minimum
Input$Mht <- Input$Opt/.3048 # converting back to English
Input$Ht and Input$DBH are numeric; Input$Species is a factor.
However, I get the error invalid function value in 'optimize'. I get it whether I define "o" or just run optimize. Oddly, when I don't call values from the row but instead use the code from the answer, it tells me object 'HT' not found. I have the awful feeling this is due to some obvious/careless error on my part, but I'm not finding posts about this error with optimize. If you notice what I've done wrong, your explanation will be appreciated!
I'm not an expert on optimize, but I see three issues: (1) your call to KozakTaper does not iterate through the range you specify in the loop; (2) KozakTaper returns a single number, not a vector; (3) you haven't given optimize a function, but an expression.
So what is happening is that you are not giving optimize anything to iterate over.
All you should need is this:
optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                SPP='RS',
                                                DHT=x,
                                                DBH=2.54*19,
                                                HT=.3048*85,
                                                Planted=0)),
         lower=HT*.25, upper=HT+1,
         maximum = FALSE, tol = .Machine$double.eps^0.25)
$minimum
[1] 22.67713 ##Hopefully this is the right answer
$objective
[1] 0
optimize will now try values of x between lower and upper, seeking the one that minimizes the difference.
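On the follow-up error: optimize needs a function that returns a single number, but Input$DBH and Input$Ht are whole columns, so KozakTaper returns a vector (hence "invalid function value"). Also note that SPP='Input$Species' passes that literal string rather than the species code from the row. A minimal per-row sketch, assuming the column names from the question and the metric bounds used in the answer above:
Input$Opt <- NA
for (r in seq_len(nrow(Input))) {
  ht_m <- .3048 * Input$Ht[r] # this row's height in metres
  o <- optimize(function(x) abs(10.16 - KozakTaper(Bark = 'ob',
                                                   SPP = as.character(Input$Species[r]),
                                                   DHT = x,
                                                   DBH = 2.54 * Input$DBH[r],
                                                   HT = ht_m,
                                                   Planted = 0)),
                lower = ht_m * .25, upper = ht_m + 1)
  Input$Opt[r] <- o$minimum
}
Input$Mht <- Input$Opt / .3048 # converting back to English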

Least square minimization

I hope this is the right place for such a basic question. I found this and this solution quite elaborate, so they do not help me grasp the fundamentals of the procedure.
Consider a random dataset:
x <- c(1.38, -0.24, 1.72, 2.25)
w <- c(3, 2, 4, 2)
How can I find the value of μ that minimizes the least squares sum S = Sum{ i | w[i]*(x[i] - mu)^2 }?
The manipulate package lets me vary μ manually with a slider and watch the model change, but I am looking for a more precise procedure than "try values by hand until the fit looks best".
Note: if the question is not correctly posted, I would welcome constructive criticism.
You could proceed as follows:
optim(mean(x), function(mu) sum(w * (x - mu)^2), method = "BFGS")$par
# [1] 1.367273
Here mean(x) is an initial guess for mu.
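Since the problem is one-dimensional, optimize (as used elsewhere on this page) is a natural alternative to optim; a minimal sketch searching within the range of the data:
x <- c(1.38, -0.24, 1.72, 2.25)
w <- c(3, 2, 4, 2)
# minimize the weighted sum of squares over [min(x), max(x)]
optimize(function(mu) sum(w * (x - mu)^2), interval = range(x))$minimum
# [1] 1.367273 (up to tolerance)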
I'm not sure if this is what you want, but here's a little algebra:
We want to find mu to minimise
S = Sum{ i | w[i]*(x[i] - mu)*(x[i] - mu) }
Expand the square and rearrange into three summations, bringing things that don't depend on i outside the sums:
S = Sum{i|w[i]*x[i]*x[i]} - 2*mu*Sum{i|w[i]*x[i]} + mu*mu*Sum{i|w[i]}
Define
W = Sum{i|w[i]}
m = Sum{i|w[i]*x[i]} / W
Q = Sum{i|w[i]*x[i]*x[i]}/W
Then
S = W*(Q - 2*mu*m + mu*mu)
  = W*( (mu - m)*(mu - m) + Q - m*m )
(The second step is 'completing the square', a simple but very useful technique).
In the final equation, since a square is always non-negative, the value of mu to minimise S is m.
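A quick check in R: the closed form m = Sum{i|w[i]*x[i]} / W is exactly what weighted.mean computes, and it agrees with the optim result above:
x <- c(1.38, -0.24, 1.72, 2.25)
w <- c(3, 2, 4, 2)
weighted.mean(x, w) # equivalently sum(w * x) / sum(w)
# [1] 1.367273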

Harmonic series sum function in R

I am trying to write a function which takes a positive real number and keeps adding terms of the harmonic series until the total sum exceeds the initial argument.
I need my function to display the total number of terms of the series that were added.
Here's my code so far:
harmonic <- function(n){
  x <- c(0,1)
  while (length(x) < n) {
    position <- length(x)
    new <- 1/(x[position] + x[position-1])
    x <- c(x,new)
  }
  return(x)
}
I apologise for the errors in my code; I have been working with R for only a month, this is the first time I am using a while loop, and I couldn't find any useful information on this.
Thank you, I'd really appreciate your help.
Here's an attempt based on some info from this post at maths.stackexchange: https://math.stackexchange.com/q/496116
I can't speak as to whether it is highly accurate in all circumstances or even the best or an appropriate way to go about this. Caveat emptor.
harmsum.cnt <- function(x, tol=1e-09) {
  em.cons <- 0.577215664901533 # Euler-Mascheroni constant
  # asymptotic approximation: H_n ~ log(n) + gamma + 1/(2n) - 1/(12n^2)
  difffun <- function(x, n) x - (log(n) + em.cons + 1/(2*n) - 1/(12*n^2))
  ceiling(uniroot(difffun, c(1, 1e10), tol = tol, x = x)$root)
}
Seems to work alright though:
harmsum.cnt(7)
#[1] 616
harmsum.cnt(15)
#[1] 1835421
Compare:
tail(cumsum(1/1:616),1); tail(cumsum(1/1:615),1)
#7.001274
#6.999651
dput(tail(cumsum(1/1:1835421),1)); dput(tail(cumsum(1/1:1835420),1))
#15.0000003782678
#14.9999998334336
This is a partial answer, which I'll try to fill in later. On the assumption that you want an exact answer, rather than the excellent approximation formula thelatemail found, there are a few tools to consider.
First, use of a hash-table or memoise methods will allow you to save previous calculations, thus saving a lot of time.
Second, since the sum of a (finite) sequence is independent of the grouping, you can calculate, e.g., the first N terms and the next (N+1):(2N) terms independently. Use the parallel package to divide and conquer.
Third, before you get too deep into the morass, check the limits of floating-point accuracy via a call to .Machine$double.eps. Once your 1/n term comes close to that, you'll need to switch over to gmp and Rmpfr to get full accuracy in your calculations.
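As a hedged illustration of that last point (it assumes the Rmpfr package is installed; the term count and precision here are arbitrary):
library(Rmpfr)
# accumulate 1/k at 120-bit precision instead of 53-bit doubles
s <- mpfr(0, precBits = 120)
for (k in 1:10000) s <- s + 1 / mpfr(k, precBits = 120)
s # the 10000th harmonic partial sum, accurate far beyond double precision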
Now, just to clarify what you "should" be doing, a correct loop is
mylimit <- [pick a value]
harmsum <- 0
for (k in 1:N) { # N: any cap large enough to reach mylimit
  harmsum <- harmsum + 1/k
  if (harmsum >= mylimit) break
}
(or similar setup using while)
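Putting that together as a function that returns the count of terms, using while as suggested (exact but slow for large limits; compare harmsum.cnt above):
harmonic_count <- function(limit) {
  total <- 0
  k <- 0
  while (total <= limit) { # add terms until the partial sum exceeds `limit`
    k <- k + 1
    total <- total + 1/k
  }
  k
}
harmonic_count(7)
# [1] 616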

integrate a very peaked function

I am using the integrate function in R to integrate a very peaked function.
Say that function is a log-normal density:
xs <- seq(0,3,0.00001)
fun <- function(xs) dlnorm(xs, meanlog=-1.057822,sdlog=0.001861871)
plot(xs,fun(xs),type="l")
From the plot, I know that the peak is at around 0.3-0.4.
If I integrate this density function over its support (with increased abs.tol and increased subdivisions), integrate() gives me zero, which should not be true.
integrate(fun,lower=0,upper=Inf,subdivisions=10000000,abs.tol=1e-100)
0 with absolute error < 0
However, if I restrict the interval to 0.3 - 0.4, it gives me the correct answer.
integrate(fun,lower=0.3,upper=0.4,subdivisions=10000000,abs.tol=1e-100)
1 with absolute error < 1.7e-05
Is there a way to integrate this density without manually choosing the interval?
Not sure whether this is helpful -- might be too specific to dlnorm, but you can partition [0, Inf), especially if you have a good idea of where the peak will end up:
integrate.dlnorm <- function(mu=0, sd=1, width=2) {
  # split the range at exp(mu +/- width*sd), i.e. around the peak on the original scale
  integral.l <- integrate(f=dlnorm, lower=0, upper=exp(mu - width * sd), meanlog=mu, sdlog=sd)$value
  integral.m <- integrate(f=dlnorm, lower=exp(mu - width * sd), upper=exp(mu + width * sd), meanlog=mu, sdlog=sd)$value
  integral.u <- integrate(f=dlnorm, lower=exp(mu + width * sd), upper=Inf, meanlog=mu, sdlog=sd)$value
  return(integral.l + integral.m + integral.u)
}
integrate.dlnorm() # 1
integrate.dlnorm(-1.05, 10^-3) # .97
integrate.dlnorm(-1.05, 10^-3, 3) # .998
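As a sanity check, the true value is exactly 1 for any density, and for the log-normal specifically you can compare against the analytic CDF:
# total probability via the closed-form CDF rather than quadrature
plnorm(Inf, meanlog = -1.05, sdlog = 1e-3) - plnorm(0, meanlog = -1.05, sdlog = 1e-3)
# [1] 1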
From the documentation for integrate:
Like all numerical integration routines, these evaluate the function
on a finite set of points. If the function is approximately constant
(in particular, zero) over nearly all its range it is possible that
the result and error estimate may be seriously wrong.
So, the answer is no.
You really need to know something about the function to compute the integral correctly - for any automated algorithm which detects support there is a function for which it fails.
PS (7 years later): for any deterministic algorithm and any error tolerance, there is a function on which the algorithm makes that error.

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the information from the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)). The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
data(simdat)
library(spatstat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variables. Thus, the information is in the correct orientation to be fitted to the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted, your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this (a sketch follows below).
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
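For the brute-force strategy, a minimal nls2 sketch (the grid bounds here are guesses, not values tuned to this data; whether the model itself is appropriate is a separate issue, see the edit below):
library(nls2)
# a two-row start data frame gives the corners of the grid to search
st <- data.frame(B = c(0.1, 20), C = c(0.1, 20))
fit0 <- nls2(G ~ 1/(1 + (B^2)*(R^2))^((C+1)/2),
             data = GvsR_dataframe, start = st,
             algorithm = "brute-force")
# refine from the best grid point with ordinary nls
fit <- nls(G ~ 1/(1 + (B^2)*(R^2))^((C+1)/2),
           data = GvsR_dataframe, start = coef(fit0))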
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.
