Proportion modeling - betareg errors in R

I wonder if somebody here can help me.
I am trying to fit a beta GLM with the betareg package, since my dependent variable is a proportion (relative density of whales on a 500 m grid) ranging from 0 to 1. I have three covariates:
Depth (measured in meters, ranging from 4 to 100 m),
Distance to coast (measured in meters, ranging from 0 to 21346 m) and
Distance to boats (measured in meters, ranging from 0 to 20621 m).
My dependent variable has a lot of 0s and many values that are very close to 0 (such as 7.8e-014). When I try to fit the model, the following error shows up:
invalid dependent variable, all observations must be in (0, 1).
From what I gathered from previous discussions, this is caused by the 0s in my dataset (I should not have any 0s or 1s). When I replace all my 0s with a tiny positive value (e.g. 1e-16), the error message I get is:
Error in chol.default(K) :
the leading minor of order 2 is not positive definite
In addition: Warning messages:
1: In digamma(mu * phi) : NaNs produced
2: In digamma(phi) : NaNs produced
Error in chol.default(K) :
the leading minor of order 2 is not positive definite
In addition: Warning messages:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In digamma(mu * phi) : NaNs produced
From what I saw on several forums, it seems this is because my matrix is not positive definite. It may be indefinite (i.e. have both positive and negative eigenvalues), or it may be near singular, i.e. its smallest eigenvalue is very close to 0 (and so computationally it is 0).
My question is: since I only have this dataset, is there any way to solve these problems and run a beta regression? Or is there another model I could use, instead of the betareg package, that would work?
Here is my code:
betareg(Density ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)

When I replace all my 0s with a tiny positive value (e.g. 1e-16)
Doing this seems like a bad idea and results in the error messages you see.
betareg currently only works for data strictly inside the (0, 1) interval; here is what the package vignette has to say:
The class of beta regression models, as introduced by Ferrari and Cribari-Neto (2004), is useful for modeling continuous variables y that assume values in the open standard unit interval (0, 1). [...] Furthermore, if y also assumes the extremes 0 and 1, a useful transformation in practice is (y · (n − 1) + 0.5)/n where n is the sample size (Smithson and Verkuilen 2006).
So one way to approach this would be:
y.transf.betareg <- function(y) {
  n.obs <- sum(!is.na(y))
  (y * (n.obs - 1) + 0.5) / n.obs
}

betareg(y.transf.betareg(Density) ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)
For an alternative approach to betareg, using a binomial GLM with a logit link, see this question on Cross Validated and the linked UCLA FAQ:
How to replicate Stata's robust glm for proportion data in R?
Some will suggest using a quasibinomial GLM instead to model proportions/percentages...
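A minimal sketch of that quasibinomial approach, using the variable names from the question (just one way to set it up):
# quasibinomial GLM with a logit link; Density is the observed proportion in (0, 1), zeros included
qb <- glm(Density ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT,
          family = quasibinomial(link = "logit"), data = misti)
summary(qb)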

Instead of a beta regression, you can just run a linear model on a logit transformation of your dependent variable (with a small offset so that the 0s do not map to -Inf). Try the following:
logit_shift <- function(p) log(p / (1 - p) + 0.01)  # + 0.01 keeps log() finite when p = 0
lm(logit_shift(Density) ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)

Related

What is the current convergence criterion of glmnet?

I have attempted to reproduce the results of glmnet using the convergence criterion described in equations 1 and 2 of the vignette (Appendix 0, page 34): https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet.pdf
[equations (1) and (2) from the vignette: the weighted per-coordinate change delta_j and the stopping rule max_j delta_j < eps]
Considering that each observation has a weight of 1, this gives me:
delta[i] <- crossprod(X[, i], X[, i]) * (beta_last[i] - beta_new[i])^2
Then I check if max(delta) >= eps, as described in the vignette.
Using this criterion, I do not get the same number of iterations as glmnet (often a lag of one or two iterations), which leads me to believe it is out of date. Incidentally, the convergence criterion of the glmnet algorithm in the Gaussian case seems to have changed several times in the last few years.
Do you know what criterion is used to determine the convergence of the algorithm?
Thanks in advance for your help.
glmnet rescales the weights to sum to 1 before starting the fit, so you're missing a 1/n factor in the definition of delta[i]. With that fix, this is the criterion used in the current version of glmnet (4.1-3) and also in version 4.1-2. Keep in mind that there may be other differences, such as the active set/strong set rules, which you may not be applying in exactly the same way as glmnet does, and which can also affect the number of coordinate-descent passes you see.
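As a minimal sketch, the corrected check would look something like this (X, beta_last, beta_new and eps are the objects from your question; unit observation weights are assumed, so rescaling them to sum to 1 gives each one weight 1/n):
n <- nrow(X)
delta <- vapply(seq_len(ncol(X)), function(j) {
  # rescaling unit weights to sum to 1 contributes the 1/n factor
  as.numeric(crossprod(X[, j])) / n * (beta_last[j] - beta_new[j])^2
}, numeric(1))
converged <- max(delta) < eps  # stop the coordinate-descent loop when TRUE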

R h2o.deeplearning obtaining probabilities with classification mode

I am using h2o.deeplearning to train a neural network on a classification task.
What I have
Y ~ x1 + x2... where all x variables are continuous and Y is binary.
What I want
To be able to train a deeplearning object to predict the probability of a given row being true or false; that is, predicted values of Y restricted to between 0 and 1.
What I've tried
When Y is inputted as a numeric (i.e. 0 or 1), h2o deeplearning automatically treats it as a regression problem. This is fine, except the final layer of the NN is linear, not tanh, and the predicted values can be greater than 1 or less than 0. I've not been able to find a way to get the final layer to be a tanh.
When Y is inputted as categorical (i.e. TRUE or FALSE), h2o deeplearning automatically treats it as a classification problem. Instead of giving me the desired probability of Y being 1 or 0, it gives me its best guess of what Y is.
Is there a way around this? A trick, tweak or an overlooked parameter? I have noticed in the h2o.deeplearning documentation a 'distribution' parameter, but no further information on what that's for. My best guess is that it is some kind of link function in the same vein as GLM, but I'm not sure.
If you treat the problem as a binary classification problem, then you not only get the “prediction” of 0 or 1, but also the p0 and p1 probabilities, which add up to 1. These are the probabilities that the predicted value is the negative and positive class, respectively.
Then just use p1 directly.
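A minimal sketch (mydata, x1 and x2 are placeholder names standing in for your data):
library(h2o)
h2o.init()

train <- as.h2o(mydata)
train$Y <- as.factor(train$Y)  # categorical response => classification mode

fit <- h2o.deeplearning(x = c("x1", "x2"), y = "Y", training_frame = train)

pred <- h2o.predict(fit, train)  # columns: predict, p0, p1
head(pred$p1)                    # estimated P(Y = positive class) for each row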

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I am trying to understand the differences between the two methods, bayes and mle, in the bn.fit function of the bnlearn package.
I know about the debate between the frequentist and the Bayesian approach to understanding probabilities. On a theoretical level I suppose the maximum likelihood estimate mle is a simple frequentist approach that sets the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I already checked the bnlearn documentation, the description of the bn.fit function and some application examples, but nowhere is there a real description of what's happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
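As a minimal sketch of the formula, consider a single hypothetical binary node A (so cptsize = 2):
library(bnlearn)

d   <- data.frame(A = factor(c("yes", "yes", "yes", "no")))
net <- empty.graph("A")

fit_mle   <- bn.fit(net, d, method = "mle")             # P(A = yes) = 3/4
fit_bayes <- bn.fit(net, d, method = "bayes", iss = 2)  # P(A = yes) = (3 + 2/2) / (4 + 2) = 2/3

coef(fit_mle$A)
coef(fit_bayes$A)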
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.

Gamma GLM: NaN production and divergence errors

Intro
I'm trying to construct a GLM that models the quantity (mass) of eggs that the fish in a population lay, depending on their size and age.
Thus, the variables are:
eggW: the total mass of laid eggs, a continuous and positive variable ranging between 300 and 30000.
fishW: mass of the fish, continuous and positive, ranging between 3 and 55.
age: either 1 or 2 years.
No 0's, no NA's.
After checking the data and realising that assuming a normal distribution was probably not appropriate, I decided to use a Gamma distribution. I chose Gamma basically because the variable is positive and continuous, with variance increasing at higher values, and appears to be skewed, as you can see in the figures below.
[Figure: frequency distribution of eggW values]
[Figure: scatterplot of fishW vs eggW]
The code
myglm <- glm(eggW ~ fishW * age, family = Gamma(link = identity),
             data = data,
             start = c(mean(data$eggW), 1, 1, 1),
             maxit = 100)
I added the maxit argument after seeing it suggested in another post here as a solution to the "glm.fit: algorithm did not converge" error, and it worked.
I chose to work with link=identity because the results are more obvious and straightforward to interpret in biological terms than with an inverse or log link.
So, the code above results in the next message:
Warning messages:
1: In log(ifelse(y == 0, 1, y/mu)) : NaNs produced
2: step size truncated due to divergence
Importantly, no warnings are shown if the variable fishW is dropped and only age is kept, and none are reported if a log link is used.
Questions
If the rationale behind the design of my model is acceptable, I would like to understand why these warnings are reported and how to solve or avoid them. In any case, I would appreciate any criticism or suggestions.
You are looking to determine the mass of the eggs based on the age and mass of the fish, correct? I think you need to use:
glm(eggW ~ fishW + age, family = Gamma(link = identity))
instead of
glm(eggW ~ fishW * age, family = Gamma(link = identity))
Does your dataset have missing values?
Are your variables highly correlated?
Alternatively, turn fishW * age into a separate column and just pass that to the model.
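A minimal sketch of that last suggestion, reading "just pass that" as fitting the hand-built product term on its own (assuming age is stored numerically as 1 or 2, as described in the question):
data$fishW_age <- data$fishW * data$age  # hand-built interaction column

myglm2 <- glm(eggW ~ fishW_age, family = Gamma(link = identity),
              data = data, maxit = 100)
summary(myglm2)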

Why is rmvnorm() function returning "In sqrt(ev$values) : NaNs produced", what is this error and how can it be corrected or avoided?

I am working with financial/economic data, in case you are wondering about the large size of some of the coefficients below. My general question has to do with simulating parameter coefficients from a linear random effects model in R. I am attempting to generate a random sample of beta coefficients using the model coefficients and the variance-covariance (VCOV) matrix from the same model. My question is: why am I receiving the warning below about the square root of the eigenvalues (ev$values) when using the rmvnorm() function from the mvtnorm package? How can I deal with this warning/issue?
#Example call: lmer model with random effects by YEAR
#mlm<-lmer(DV~V1+V2+V3+V2*V3+V4+V5+V6+V7+V8+V9+V10+V11+(1|YEAR), data=dat)
#Note: 5 years (5 random effects total)
#LMER call yields the following information:
coef<-as.matrix(c(-28037800,0.8368619,2816347,8681918,-414002.6,371010.7,-26580.84,80.17909,271.417,-239.1172,3.463785,-828326))
sigma<-as.matrix(rbind(c(1834279134971.21,-415.95,-114036304870.57,-162630699769.14,-23984428143.44,-94539802675.96,
-4666823087.67,-93751.98,1735816.34,-1592542.75,3618.67,14526547722.87),
c(-415.95,0.00,41.69,94.17,-8.94,-22.11,-0.55,0.00,0.00,0.00,0.00,-7.97),
c(-114036304870.57,41.69,12186704885.94,12656728536.44,-227877587.40,-2267464778.61,
-4318868.82,8909.65,-355608.46,338303.72,-321.78,-1393244913.64),
c(-162630699769.14,94.17,12656728536.44,33599776473.37,542843422.84,4678344700.91,-27441015.29,
12106.86,-225140.89,246828.39,-593.79,-2445378925.66),
c(-23984428143.44,-8.94,-227877587.40,542843422.84,32114305557.09,-624207176.98,-23072090.09,
2051.16,51800.37,-49815.41,-163.76,2452174.23),
c(-94539802675.96,-22.11,-2267464778.61,4678344700.91,-624207176.98,603769409172.72,90275299.55,
9267.90,208538.76,-209180.69,-304.18,-7519167.05),
c(-4666823087.67,-0.55,-4318868.82,-27441015.29,-23072090.09,90275299.55,82486186.42,-100.73,
15112.56,-15119.40,-1.34,-2476672.62),
c(-93751.98,0.00,8909.65,12106.86,2051.16,9267.90,-100.73,2.54,8.73,-10.15,-0.01,-1507.62),
c(1735816.34,0.00,-355608.46,-225140.89,51800.37,208538.76,15112.56,8.73,527.85,-535.53,-0.01,21968.29),
c(-1592542.75,0.00,338303.72,246828.39,-49815.41,-209180.69,-15119.40,-10.15,-535.53,545.26,0.01,-23262.72),
c(3618.67,0.00,-321.78,-593.79,-163.76,-304.18,-1.34,-0.01,-0.01,0.01,0.01,42.90),
c(14526547722.87,-7.97,-1393244913.64,-2445378925.66,2452174.23,-7519167.05,-2476672.62,-1507.62,21968.29,
-23262.72,42.90,229188496.83)))
#Error begins here:
betas<-rmvnorm(n=1000, mean=coef, sigma=sigma)
#rmvnorm breaks, warning returned:
Warning message: In sqrt(ev$values) : NaNs produced
When I Google the search string "rmvnorm, Warning message: In sqrt(ev$values) : NaNs produced", the first result (http://www.nickfieller.staff.shef.ac.uk/sheff-only/mvatasksols6-9.pdf, page 4) indicates that this warning points to negative eigenvalues, although I have no idea, conceptually or practically, what a negative eigenvalue is or why one would be produced in this instance.
The second search result (http://www.r-tutor.com/r-introduction/basic-data-types/complex) indicates that the warning arises from an attempt to take the square root of a negative number, which returns NaN in R (sqrt() does not return a complex value unless it is given complex input).
The question remains, what is going on here with the random generation of the betas, and how can this be corrected?
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Using the following packages/versions: mvtnorm_0.9-9994, lme4_1.1-5, Rcpp_0.10.3, Matrix_1.1-2-2, lattice_0.20-23
You have a huge range of scales in your eigenvalues:
range(eigen(sigma)$values)
## [1] -1.005407e-05 1.863477e+12
I prefer to use mvrnorm from the MASS package, just because it comes installed automatically with R. It also appears to be more robust:
set.seed(1001)
m <- MASS::mvrnorm(n=1000, mu=coef, Sigma=sigma) ## works fine
Edit: the OP points out that using method = "svd" with rmvnorm also works.
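For example (same coef and sigma objects as in the question; the svd decomposition sidesteps the eigen-based square root):
betas <- mvtnorm::rmvnorm(n = 1000, mean = coef, sigma = sigma, method = "svd")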
If you print the code for MASS::mvrnorm, or debug(MASS::mvrnorm) and step through it, you see that it uses
if (!all(ev >= -tol * abs(ev[1L]))) stop("'Sigma' is not positive definite")
(where ev is the vector of eigenvalues, in decreasing order, so ev[1] is the largest eigenvalue) to decide on the positive definiteness of the variance-covariance matrix. In this case ev[1L] is about 2e12, tol is 1e-6, so this would allow negative eigenvalues up to a magnitude of about 2e6. In this case the minimum eigenvalue is -1e-5, well within tolerance.
Farther down MASS::mvrnorm uses pmax(ev,0) -- that is, if it has decided that the eigenvalues are not below tolerance (i.e. it didn't fail the test above), it just truncates the negative values to zero, which should be fine for practical purposes.
If you insisted on using rmvnorm you could use Matrix::nearPD, which tries to force the matrix to be positive definite -- it returns a list which contains (among other things) the eigenvalues and the "positive-definite-ified" matrix:
m <- Matrix::nearPD(sigma)
range(m$eigenvalues)
## [1] 1.863477e+04 1.863477e+12
The eigenvalues computed from the matrix are not quite identical -- nearPD and eigen use slightly different algorithms -- but they're very close.
range(eigen(m$mat)$values)
## [1] 1.861280e+04 1.863477e+12
More generally,
Part of the reason for the huge range of eigenvalues might be predictor variables that are scaled very differently. It might be a good idea to scale your input data, if possible, to make the variances more similar to each other (this will make all of your numerical computations more stable); you can always rescale the values once you've generated them.
It's also the case that when matrices are very close to singular (i.e. some eigenvalues are very close to zero), small numerical differences can change the sign of the eigenvalues. In particular, if you copy and paste the values, you might lose some precision and cause this problem. Using dput(vcov(fit)) or saveRDS(vcov(fit), "vcov.rds") to save the variance-covariance matrix at full precision is safer.
If you have no idea what "positive definite" means, you might want to read up on it. The Wikipedia articles on covariance matrices and positive definite matrices might be a little too technical to start with; this question on StackExchange is closer, but still a little technical. The next entry on my Google journey was this one, which looks about right.
