Numerical precision problems in R?

I have a problem with the following function in R:
test <- function(alpha, beta, n){
result <- exp(lgamma(alpha) + lgamma(n + beta) - lgamma(alpha + beta + n) - (lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)))
return(result)
}
Now call it with the following values:
test(-0.03292708, -0.3336882, 10)
I expected this to fail and return NaN, because the same formula implemented in Excel does not return a number. The Excel implementation is simple: J32 is the cell for alpha, K32 for beta and L32 for n. The formula in the result cell is:
=EXP(GAMMALN(J32)+GAMMALN(L32+K32)-GAMMALN(J32+K32+L32)-(GAMMALN(J32)+GAMMALN(K32)-GAMMALN(J32+K32)))
So Excel seems to give the correct answer, because the function is only defined for alpha and beta greater than zero and n greater than or equal to zero. What is happening in R? I have also tried the package Rmpfr to increase the numerical accuracy, but that does not change anything.
Thanks

tl;dr log(gamma(x)) is defined more generally than you think, or than Excel thinks. If you want your function to reject non-positive values of alpha and beta and return NaN, test for them explicitly and return the appropriate value (if (alpha <= 0 || beta <= 0) return(NaN)).
It's not a numerical accuracy problem, it's a definition issue. The Gamma function is defined for negative real values: ?lgamma says:
The gamma function is defined by (Abramowitz and Stegun section 6.1.1, page 255)
Gamma(x) = integral_0^Inf t^(x-1) exp(-t) dt
for all real ‘x’ except zero and negative integers (when ‘NaN’ is returned).
Furthermore, referring to lgamma ...
... and the natural logarithm of the absolute value of the gamma function ...
(emphasis in original)
curve(lgamma(x),-1,1)
gamma(-0.1) ## -10.68629
log(gamma(-0.1)+0i) ## 2.368961+3.141593i
log(abs(gamma(-0.1))) ## 2.368961
lgamma(-0.1) ## 2.368961
Wolfram Alpha agrees with the second calculation.
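As a rough illustration of the manual check suggested above, the sketch below wraps the question's formula in a guard. The name test_guarded is made up for this example, and returning NaN (rather than raising an error) for invalid parameters is an assumption about what the asker wants.

```r
# Hypothetical guarded variant of the question's test() function.
test_guarded <- function(alpha, beta, n) {
  # The formula is only meaningful for alpha > 0, beta > 0 and n >= 0,
  # so return NaN explicitly instead of relying on lgamma() to fail.
  if (alpha <= 0 || beta <= 0 || n < 0) {
    return(NaN)
  }
  exp(lgamma(alpha) + lgamma(n + beta) - lgamma(alpha + beta + n) -
        (lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)))
}

bad  <- test_guarded(-0.03292708, -0.3336882, 10)  # invalid parameters
good <- test_guarded(1, 1, 1)                      # valid parameters
```

With this guard the negative inputs from the question give NaN, while valid inputs go through the original formula unchanged.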

Related

Conditioned random generating variables from a distribution function

My question is related to my previous one, "Generate random variables from a distribution function using inverse sampling".
Now I want to generate random variables from a distribution function using inverse sampling but the sampling should be conditioned.
For example, if the inverse of my cdf is :
invcdf <- function(y) a2 * log(a1/y - 1) + a3
I used inverse sampling to generate 10 rv as follows :
invcdf(runif(10))
Now, the problem is that I want the generated values to be greater or less than a given value.
How should I introduce this condition into the random generator?
When I use this to get values greater than 500:
invcdf(runif(10,500,1e6))
I get this error message :
Warning message:
In log((a0/y) - 1) : NaNs produced
I have already tried repeating the process until the values satisfy my constraints, but it is not efficient:
repeat{
x=invcdf(runif(1))
if(x>100){
break
}
}
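As @spf614 noted, you had better add a check to your function, for example
invcdf <- function(y) {
# NaN wherever the log argument a1/y - 1 would not be positive
out <- rep(NaN, length(y))
ok <- y > 0 & y < a1
out[ok] <- a2 * log(a1/y[ok] - 1) + a3
out
}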
Then it works for all arguments
Sampling would be
low <- ...
r <- invcdf(runif(low, a1, 1e6))
UPDATE
checking for NaNs in output
nof_nans <- sum(is.nan(r))
if (nof_nans > 0) {
....
The reason that you're getting NaNs is that R is trying to take the logarithm of a negative number. Do you want the log term to be log((a1/y)-1) or log(a1/(y-1))? You currently have the function written the first way, and when you get a very high value for y, the term a1/y approaches zero (the speed with which it approaches zero depends on the value of a1). Thus, subtracting 1 gives you a negative number inside the log function. So if the term is meant to be how you have it written (log(a1/y-1)), you simply won't be able to calculate that above certain values of y.
The simple fix is just
invcdf <- function(y){
a2 * log(a1/(y-1)) + a3
}
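Neither answer addresses the conditioning itself. One common approach, sketched here, is to draw the uniform variates only from the part of (0, 1) that maps into the wanted range. This assumes the original form log(a1/y - 1) is the intended one. The parameter values and the names cdf and rcond_greater are made up for illustration, and the sketch assumes a1 = 1 and a2 < 0 so that invcdf is a proper increasing inverse CDF.

```r
# Hypothetical parameters: with a1 = 1 and a2 < 0 the inverse CDF below
# belongs to a logistic distribution centred at a3.
a1 <- 1
a2 <- -50
a3 <- 300

invcdf <- function(y) a2 * log(a1 / y - 1) + a3

# Matching CDF, obtained by solving invcdf(y) = x for y
cdf <- function(x) a1 / (1 + exp((x - a3) / a2))

# Draw n values conditioned on X > lower: sample the uniform only on
# (cdf(lower), 1), so every draw maps above the threshold.
rcond_greater <- function(n, lower) {
  invcdf(runif(n, min = cdf(lower), max = 1))
}

set.seed(1)
x <- rcond_greater(10, 500)
```

No draws are rejected, so this avoids the repeat loop. For values below an upper limit, the same idea would use runif(n, 0, cdf(upper)).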

Optimizing a simple linear curve (constant and coefficient estimated from a regression)

I am trying to calculate the turning point of a few functions for which I have estimated the coefficient and constant from a regression. I'm using the optimize function for this as my curves are all linear.
My function looks like:
F<- function(x){
beta* x + alpha
}
Note that beta and alpha are both vectors here. When running the optimisation with optimize, I'm getting the following error:
Error in optimize(F, interval = c(10, 20), lower = (10), :
invalid function value in 'optimize'
Is this because optimize is running the optimisation mathematically, so the beta and alphas need to be single parameters? If anyone knows a better way of doing this please do contribute!
Thank you in advance :)
If the functions are linear, then each one is at its minimum at the lower end of the range when beta >= 0 and at the upper end of the range when beta <= 0, so there is no need to use optimize().
It's not entirely clear what you're expecting the code to do - if you want it to return an x for each set of parameters, look at optim() instead and have F return the sum, or run optimize on each set of parameters in turn using an apply() function or loop.
One other thing is that your syntax is a bit wonky - I imagine that you mean:
> F<- function(x){
+ beta* x + alpha
+ }
> alpha <- 1
> beta <- 2
> optimize(F,c(10,20))
$minimum
[1] 10.00006
$objective
[1] 21.00011
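To make both suggestions concrete, here is a small sketch with made-up coefficient vectors (alpha, beta, lower and upper are illustrative values, not the asker's data). It computes the minimiser of each line directly from the sign of the slope, and also runs optimize() once per parameter pair so that the objective always returns a single number.

```r
# Hypothetical coefficient vectors standing in for regression estimates
alpha <- c(1, 3, -2)
beta  <- c(2, -0.5, 1.5)
lower <- 10
upper <- 20

# Closed form: a line is minimised at the lower bound when its slope is
# non-negative and at the upper bound otherwise
x_min <- ifelse(beta >= 0, lower, upper)

# Numerical check: one optimize() call per (alpha, beta) pair
x_opt <- mapply(function(a, b) {
  optimize(function(x) b * x + a, interval = c(lower, upper))$minimum
}, alpha, beta)
```

The original error arises because optimize() needs the objective to return one value per call, which the per-pair closure guarantees.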

Function to find Negative binomial distribution in Julia

I am using the code below to evaluate the negative binomial distribution in R:
dnbinom(n11, size=p[1], prob=p[2]/(p[2]+E))
where dnbinom is the negative binomial density function and
n11 and E are vectors of integers.
Now I want to run the same code in Julia. Which function should I use in place of dnbinom?
The function must take the arguments (x, size, prob),
where x = vector of quantiles (counts).
size = target for number of successful trials, or dispersion parameter (the shape parameter of the gamma mixing distribution). Must be strictly positive, need not be integer.
prob = probability of success in each trial. 0 < prob <= 1.
Below is my full code (updated as per the answers given, but still not working):
using Distributions
data = query("Select count_a,EXP_COUNT from SM_STAT_ALGO_LOCALTRADE_SOC;")
f([0.2,0.06,1.4,1.8,0.1],data[:,1],data[:,2])
function f(x::Vector,n11,E)
return sum(-log(x[5] * pdf(NegativeBinomial(x[1], x[2]/(x[2]+E), n11)) + (1-x[5]) * pdf(NegativeBinomial(x[3], x[4]/(x[4]+E),n11))))
end
Assuming that you want the probabilities of a vector of outcomes, you can do
using Distributions
function dnbinom(x, size, prob)
dist = NegativeBinomial(size,prob)
map(y->pdf(dist,y), x)
end
@show dnbinom([3,5], 10, 0.1)
To get the equivalent of dnbinom in R
dnbinom(1, 1, 0.5)
# [1] 0.25
you can use
using Distributions
pdf(NegativeBinomial(), 1)
# 0.25000000000000006
in Julia.
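When porting between the two languages it can help to pin down exactly which parameterisation dnbinom uses. The sketch below writes the probability mass function out by hand in R (the helper name dnbinom_manual and the test values are made up for this example). It counts x failures before `size` successes, which, as far as I can tell, is also the convention of NegativeBinomial(r, p) in Distributions.jl.

```r
# Negative binomial pmf written out by hand, using the same
# parameterisation as R's dnbinom: x failures before `size` successes,
# each trial succeeding with probability `prob` (assumes prob < 1).
dnbinom_manual <- function(x, size, prob) {
  exp(lgamma(x + size) - lgamma(size) - lgamma(x + 1) +
        size * log(prob) + x * log1p(-prob))
}

x <- c(0, 1, 3, 5)
manual  <- dnbinom_manual(x, size = 2.5, prob = 0.3)
builtin <- dnbinom(x, size = 2.5, prob = 0.3)
```

If a Julia port gives different numbers from these, the likely culprit is argument order or broadcasting rather than the distribution itself.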

Parameters estimation of a bivariate mixture normal-lognormal model

I have to create a model which is a mixture of a normal and a log-normal distribution. To create it, I need to estimate the 2 covariance matrices and the mixing parameter (7 parameters in total) by maximizing the log-likelihood function. The maximization has to be performed with the nlm routine.
As I use relative data, the means are known and equal to 1.
I've already tried to do it in 1 dimension (with 1 set of relative data) and it works well. However, when I introduce the 2nd set of relative data I get illogical results for the correlation and a lot of warning messages (25 in total).
To estimate these parameters I first define the log-likelihood function using the two functions dmvnorm and dlnorm.rplus. Then I assign starting values to the parameters and finally I use the nlm routine to estimate them (see script below).
P <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_P-3000.asc", return.header = FALSE)
V <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_V-3000.asc", return.header = FALSE)
p <- c(P) # transform matrix into a vector
v <- c(V)
p <- p[!is.na(p)] # remove NA values
v <- v[!is.na(v)]
p_rel <- p/mean(p) # transform the data to relative values
v_rel <- v/mean(v)
PV <- cbind(p_rel, v_rel) # two-column matrix of observations
# negative log-likelihood of the normal/log-normal mixture
L <- function(par, p_rel, v_rel) {
return(-sum(log(
(1 - par[7]) * dmvnorm(PV, mean = c(1, 1),
sigma = matrix(c(par[1]^2, par[1]*par[2]*par[3], par[1]*par[2]*par[3], par[2]^2), nrow = 2, ncol = 2)) +
par[7] * dlnorm.rplus(PV, meanlog = c(1, 1),
varlog = matrix(c(par[4]^2, par[4]*par[5]*par[6], par[4]*par[5]*par[6], par[5]^2), nrow = 2, ncol = 2))
)))
}
par.start <- c(0.74, 0.66, 0.40, 1.4, 1.2, 0.4, 0.5) # starting values
result <- nlm(L, par.start, v_rel = v_rel, p_rel = p_rel, hessian = TRUE, iterlim = 200, check.analyticals = TRUE)
Warning messages:
1: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
2: In sqrt(2 * pi * det(varlog)) : NaNs produced
3: In nlm(L, par.start, p_rel = p_rel, v_rel = v_rel, hessian = TRUE) :
NA/Inf replaced by maximum positive value
4: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
... and so on, up to 25.
par.hat <- result$estimate
cat("sigN_p =", par.hat[1], "\n", "sigN_v =", par.hat[2], "\n", "rhoN =", par.hat[3], "\n", "sigLN_p =", par.hat[4], "\n", "sigLN_v =", par.hat[5], "\n", "rhoLN =", par.hat[6], "\n", "mixing parameter =", par.hat[7], "\n")
sigN_p = 0.5403361
sigN_v = 0.6667375
rhoN = 0.6260181
sigLN_p = 1.705626
sigLN_v = 1.592832
rhoLN = 0.9735974
mixing parameter = 0.8113369
Does someone know what is wrong with my model, or how I should go about finding these parameters in 2 dimensions?
Thank you very much for taking time to look at my questions.
Regards,
Gladys Hertzog
When I do this kind of optimization problem, I find that it's important to make sure that all the variables that I'm optimizing over are constrained to plausible values. For example, standard deviation variables have to be positive, and from knowledge of the situation that I'm modelling I'll probably be able to put an upper bound on all my standard deviation variables as well. So if s is one of my standard deviation variables, and if m is the maximum value that I want it to take, instead of working with s I'll solve for the variable z which is related to s via
s = m / (1 + e^{-z})
In that formula, z is unconstrained, but s must lie between 0 and m. This is vital because optimization routines whose variables are not constrained to plausible values will often try completely implausible values while they're trying to bound the solution. Implausible values often cause problems, e.g. with precision, which then result in NaNs. The general formula that I use for constraining a single variable x to lie between a and b is
x = a + (b - a) / (1 + e^{-z})
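As a minimal sketch of this transform in R (the function names to_bounded and to_unbounded are made up for illustration), plogis() computes 1/(1 + exp(-z)) and qlogis() is its inverse:

```r
# Map an unconstrained z to a value strictly between a and b
to_bounded <- function(z, a, b) a + (b - a) * plogis(z)

# Inverse map, useful for converting starting values to the z scale
to_unbounded <- function(x, a, b) qlogis((x - a) / (b - a))

# Example: a standard deviation restricted to (0, 2)
s_mid   <- to_bounded(0, a = 0, b = 2)      # midpoint of the interval
z_start <- to_unbounded(0.74, a = 0, b = 2) # starting value on the z scale
s_back  <- to_bounded(z_start, a = 0, b = 2)
```

The optimizer then works on z, and the likelihood function converts back to the bounded scale before using the value.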
However, for your particular problem, where you're estimating covariance matrices, a more sophisticated approach is needed than simply bounding the individual variables. Covariance matrices must be positive semi-definite, so if you optimize the individual entries directly, the optimization will probably fail (producing NaNs) whenever a matrix that isn't positive definite is fed into the likelihood function. My guess is that this is what's causing your optimization to fail. One way round the problem is to solve for the Cholesky decomposition of the covariance matrix instead of the covariance matrix itself.
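A rough sketch of the Cholesky idea for the 2x2 case is shown below. The function name chol_to_cov and the choice to exponentiate the diagonal entries are assumptions made for this illustration, not something the answer specifies.

```r
# theta holds three unconstrained numbers (hypothetical parameterisation):
# the logs of the two diagonal entries and the off-diagonal entry of the
# lower-triangular Cholesky factor.
chol_to_cov <- function(theta) {
  # matrix() fills column by column, so this is lower triangular
  Lfac <- matrix(c(exp(theta[1]), theta[2],
                   0,             exp(theta[3])), nrow = 2)
  Lfac %*% t(Lfac)  # symmetric and positive definite by construction
}

Sigma <- chol_to_cov(c(-0.3, 0.8, 0.1))
```

Inside the likelihood, each covariance matrix would be built this way from three of the optimizer's parameters, so every matrix the optimizer proposes is valid.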

cost function in cv.glm of boot library in R

I am trying to use the crossvalidation cv.glm function from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
where the first two arguments are the data and the fitted model, and K specifies the number of folds.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, with my limited R knowledge I have wasted hours on this, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost(), given above is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
I will try to explain the cost function in simple words. Let's take
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consists of many observations. Think of it as a series of numbers.
glmfit
The fitted generalized linear model. The catch is that cv.glm splits the data into K parts and refits the model K times, each time holding one part out as the test set and training on the rest. The predictions for each held-out part have the same number of elements as that part.
cost
The cost function. It takes two arguments: first the observed responses of the test set, and second the model's predictions for that test set. The default is the mean squared error function.
It averages the squared differences between each observed data point and its predicted data point (the two vectors must have the same number of elements).
K
The number of parts into which the data should be split. The default gives leave-one-out cross-validation.
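The roles of the four arguments can be sketched with a hand-rolled K-fold loop in base R. This is a simplified illustration of the idea, not the actual cv.glm implementation. The simulated data and variable names are made up, and the cost function is the one quoted from the ?cv.glm examples.

```r
set.seed(42)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(1.5 * d$x))

# Misclassification cost, as quoted from the ?cv.glm examples
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # random fold label per row

fold_cost <- sapply(1:K, function(k) {
  train <- d[fold != k, ]
  test  <- d[fold == k, ]
  fit <- glm(y ~ x, family = binomial, data = train)  # refit on K-1 parts
  p <- predict(fit, newdata = test, type = "response") # predicted probabilities
  cost(test$y, p)  # observed responses first, predictions second
})

cv_error <- mean(fold_cost)  # folds have equal size here, so a plain mean suffices
```

The cost function only ever sees two vectors of equal length, which is why it can be swapped for any function of that shape.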
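Judging from your description of the cost function, your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes) and your output (y) would be 'yes' or 'no'. Note that cv.glm passes the observed responses first and the predictions second, so the arguments would need to be swapped there. The error (e) between prediction (x) and observation (y) would then be:
cost <- function(x, y){
# x: predicted probabilities, y: observed labels ('yes' or 'no')
total <- 0
for (i in seq_along(x)){
if (x[i] > 0.5){
e <- if (y[i] == 'yes') 0 else x[i] - 0.5
} else {
e <- if (y[i] == 'no') 0 else 0.5 - x[i]
}
total <- total + e*e # accumulate the squared error
}
return(total/length(x)) # mean squared error
}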
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you want to do so, write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that returns the proportion of accurate classifications, which would look something like this:
cost <- function(labels,pred){
mean(labels==ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.
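The difference between the two prediction scales is easy to see on a small simulated fit (the data and variable names below are made up for illustration). On the link scale a binomial glm returns log-odds, which the 0.5 cutoff in the cost function does not apply to. On the response scale it returns probabilities.

```r
set.seed(7)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, size = 1, prob = plogis(d$x))
fit <- glm(y ~ x, family = binomial, data = d)

eta <- predict(fit, type = "link")      # log-odds, unbounded
p   <- predict(fit, type = "response")  # probabilities in (0, 1)

# Cost function quoted from the ?cv.glm examples
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

cost_response <- cost(d$y, p)    # misclassification rate at the 0.5 cutoff
cost_link     <- cost(d$y, eta)  # not a meaningful misclassification rate
```

The two scales are related by the inverse logit, so plogis(eta) reproduces p.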

Resources