Understanding errors from ordinal logistic regression in R

I am using MASS::polr to run ordinal logistic regressions, but I am getting a lot of errors that I am hoping people can enlighten me about.
First, if I run this, the function fails to find starting values:
MASS::polr(as.ordered(cyl)~mpg+gear,mtcars)
Then, if I specify starting values, I get an error from optim stating 'non-finite value supplied by optim':
MASS::polr(as.ordered(cyl)~mpg+gear,mtcars,start=c(1,1,1,1))
After reading some R-help threads and previous Stack Overflow questions about this, the usual response is that there is something wrong with the data, e.g. the response variable has a category with relatively few observations, but in this instance I don't see anything wrong with mtcars.
Any guidance on how to diagnose and deal with data issues that affect MASS::polr would be appreciated.
Regards

Going on a scavenger hunt through ?polr, the starting values are to be specified "in the format c(coefficients, zeta)". Looking lower, we see that zeta is "the intercepts for the class boundaries". In the Details section, we can see that the zeta values must be ordered:
zeta_0 = -Inf < zeta_1 < ... < zeta_K = Inf
([sic], as that presumably should be a < Inf at the end.)
So you need the second zeta value to be greater than the first. This works, for example:
MASS::polr(as.ordered(cyl) ~ mpg + gear, mtcars, start = c(1, 1, 1, 2))
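As a quick sanity check (a small sketch of my own; fit$zeta is the component documented in ?polr), you can confirm that the fitted intercepts come back in the required order:
fit <- MASS::polr(as.ordered(cyl) ~ mpg + gear, mtcars, start = c(1, 1, 1, 2))
fit$zeta                  # the fitted class-boundary intercepts
all(diff(fit$zeta) > 0)   # should be TRUE: the intercepts are strictly increasing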

Related

MLE using nlminb in R - understand/debug certain errors

This is my first question here, so I will try to make it as well written as possible. Please bear with me should I make a silly mistake.
Briefly, I am trying to do maximum likelihood estimation where I need to estimate 5 parameters. The general form of the problem I want to solve is as follows: a weighted average of three copulas, each with one parameter to be estimated, where the weights are nonnegative, sum to 1, and also need to be estimated.
There are packages in R for doing MLE on single copulas or on a weighted average of copulas with fixed weights. However, to the best of my knowledge, no package exists to directly solve the problem I outlined above. Therefore I am trying to code the problem myself. There is one particular type of error I am having trouble tracing to its source. Below I have tried to give a minimal reproducible example where only one parameter needs to be estimated.
library(copula)
set.seed(150)
x <- rCopula(100, claytonCopula(250))

# Copula density
clayton_density <- function(x, theta){
  dCopula(x, claytonCopula(theta))
}

# Negative log-likelihood function
nll.clayton <- function(theta){
  theta_trans <- -1 + exp(theta) # admissible theta values for Clayton copula
  nll <- -sum(log(clayton_density(x, theta_trans)))
  return(nll)
}

# Initial guess for optimization
guess <- function(x){
  init <- rep(NA, 1)
  tau.n <- cor(x[,1], x[,2], method = "kendall")
  # Guess using method of moments
  itau <- iTau(claytonCopula(), tau = tau.n)
  # In case itau is negative, we need a conditional statement
  # Use log because it is (almost) inverse of theta transformation above
  if (itau <= 0) {
    init[1] <- log(0.1) # Ensures positive initial guess
  } else {
    init[1] <- log(itau)
  }
  init
}

estimate <- nlminb(guess(x), nll.clayton)
(parameter <- -1 + exp(estimate$par)) # Retrieve estimated parameter
fitCopula(claytonCopula(), x) # Compare with fitCopula function
This works great when simulating data with small values of the copula parameter, and gives almost exactly the same answer as fitCopula() every time.
For large values of the copula parameter, such as 250, the following error shows up when I run the line with nlminb():
Error in .local(u, copula, log, ...) : parameter is NA
Called from: .local(u, copula, log, ...)
Error during wrapup: unimplemented type (29) in 'eval'
When I run fitCopula(), the optimization is finished, but this message pops up:
Warning message:
In dlogcdtheta(copula, u) :
  dlogcdtheta() returned NaN in column(s) 1 for this explicit copula; falling back to numeric derivative for those columns
Using debug() I have been able to find out that somewhere in the optimization process of nlminb(), the parameter of interest is assigned the value NaN, which then yields this error when dCopula() is called. However, I do not know at which iteration it happens, or what nlminb() is doing when it happens. I suspect that at some iteration the objective function is evaluated at Inf/-Inf, but I do not know what nlminb() does next. Something similar seems to happen with fitCopula(), but there the optimization is still carried out to the end, only with the above-mentioned warning.
I would really appreciate any help in understanding what is going on, how I might debug it myself, and/or how I can deal with the problem. As might be evident from the question, I do not have a strong background in coding. Thank you so much in advance to anyone who takes the time to consider this problem.
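One way to see where nlminb() goes wrong (a small sketch of my own; the wrapper name nll.clayton.traced is made up for illustration) is to wrap the objective so it logs each evaluation and returns a large finite penalty instead of a non-finite value, so the optimizer can back away from that region:
# Sketch: trace each objective evaluation and guard against non-finite values
nll.clayton.traced <- function(theta) {
  val <- nll.clayton(theta)
  cat("theta =", theta, "nll =", val, "\n")
  if (!is.finite(val)) return(1e10)  # large finite penalty instead of Inf/NaN
  val
}
estimate <- nlminb(guess(x), nll.clayton.traced)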
Update:
When I run dCopula(x, claytonCopula(-1 + exp(guess(x)))) or, equivalently, clayton_density(x, -1 + exp(guess(x))), it becomes apparent that the density evaluates to 0 at several data points. Unfortunately, creating pseudo-observations with x <- pobs(x) does not solve the problem, as can be seen by repeating dCopula(x, claytonCopula(-1 + exp(guess(x)))). The result is that when applying the logarithm, we get several -Inf evaluations, which of course means the whole negative log-likelihood evaluates to Inf, as can be seen by running nll.clayton(guess(x)). Hence, in addition to the above queries, any tips on handling log(0) when doing MLE numerically are welcome and appreciated.
Second update:
Editing the second line in nll.clayton as follows seems to work okay:
nll <- -sum(log(clayton_density(x, theta_trans) + 1e-8))
However, I do not know whether this is a "good" way to circumvent the problem, in the sense of whether it can introduce large errors (though it would surprise me if it did).
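An alternative sketch (hedged: to my knowledge dCopula() accepts a log argument, and pmax() is base R) is to compute the log-density directly and floor it at a very negative value. This caps the penalty from any single zero-density observation rather than shifting every density by a constant:
# Sketch: work on the log scale and floor -Inf contributions
nll.clayton2 <- function(theta) {
  theta_trans <- -1 + exp(theta)
  logdens <- dCopula(x, claytonCopula(theta_trans), log = TRUE)
  -sum(pmax(logdens, log(.Machine$double.xmin)))
}
estimate2 <- nlminb(guess(x), nll.clayton2)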

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but I am coming up with an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there is no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example, is:
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL + ADARB1_dCTABL +
                    OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL + APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = poisson(link = log), data = stackdata, lambda = 100,
                  control = list(print.iter = TRUE, start = c(1, rep(0, 29)), q.start = 0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or if you cannot download it. Any help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL + ADARB1_dCTABL +
                    OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL + APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = binomial(),
                  data = stackdata,
                  lambda = 100,
                  control = list(print.iter = TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You copied the example, which uses Poisson regression, and you need to tweak the code to your situation. I have no idea whether the output makes sense.
Quick note: I used the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then Poisson may be reasonable (and it also converges), but you need to recode your outcome, because the two groups are coded as 1 and 2 and that will certainly mess up the Poisson regression.
In other words, do
stackdata$Group <- stackdata$Group - 1
before you run the analysis.
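One caveat (my own assumption, suggested by the Ops.factor warning in the question): if Group is stored as a factor, the subtraction above will just produce NAs with that same warning, so convert it to numeric first. A minimal sketch:
# Sketch: recode a factor with levels "1"/"2" to numeric 0/1 before a Poisson fit
stackdata$Group <- as.numeric(as.character(stackdata$Group)) - 1
table(stackdata$Group)  # should now show only 0 and 1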

lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to?

I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos in the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message
Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion
My data does not have any NAs, so what could be causing this? (Sorry if the answer is very obvious; I am new to R.)
All of my data are factors except the numerical value that I am trying to predict. Here is a small sample of my data.
sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
                      word_1, type, F2_difference,
                      stringsAsFactors = TRUE)
Edit:
Here is some more code, if it helps.
formula <- F2_difference ~ sex + nasal + type + vowelLabel +
  type * vowelLabel + nasal * type +
  (1 | speaker) + (1 | word_1)
lmer(formula, REML = FALSE, data = data.df)
Editor edit:
The OP did not provide enough test data to allow the reader to actually run the model in lmer, but this is not too big an issue. This is still a very good post!
You are slightly over-concerned with the warning message:
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
It is a warning, not an error. There is neither misuse of lmer nor mis-specification of the model formula, so you will still obtain an estimated model. But to answer your question, I shall strive to explain it.
During execution of lmer, your model formula is broken into a fixed effect formula and a random effect formula, and for each a model matrix is constructed. Construction for the fixed one is via the standard model matrix constructor model.matrix; construction for the random one is complicated but not related to your question, so I just skip it.
For your model, you can check what the fixed effect model matrix looks like by:
fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type
X <- model.matrix (fix.formula, data.df)
All your variables are factors, so X will be binary. Although model.matrix applies contrasts for each factor and their interactions, it is still possible that X does not end up with full column rank, as one column may be a linear combination of some others (either exactly or to within numerical precision). In your case, some levels of one factor may be nested within some levels of another.
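You can check the deficiency directly from X; this is also where caret::findLinearCombos is meant to be applied, i.e. on the numeric model matrix rather than on the raw data frame of factors (which is probably why the call in the question failed with "NAs introduced by coercion"). A small sketch, reusing X from above:
ncol(X)                     # number of coefficients requested by the formula
qr(X)$rank                  # number that can actually be estimated
caret::findLinearCombos(X)  # flags columns that are linear combinations of others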
Rank deficiency can arise in many different ways. The other answer shares a CrossValidated answer offering a substantial discussion of the possible causes, on which I will make some comments.
For case 1, people can actually do feature selection via, say, the LASSO.
Cases 2 and 3 are related to the data collection process. A good experimental design is the best way to prevent rank deficiency, but for many people who build models the data are already there and no improvement (like getting more data) is possible. However, I would like to stress that even for a dataset without rank deficiency, we can still run into this problem if we do not use the data carefully. For example, cross-validation is a good method for model comparison. To do this we need to split the complete dataset into a training set and a test set, but without care we may get a rank-deficient model from the training set.
Case 4 is a big problem that could be completely out of our control. Perhaps a natural choice is to reduce model complexity, but an alternative is to try penalized regression.
Case 5 is a numerical concern leading to numerical rank-deficiency and this is a good example.
Cases 6 and 7 reflect the fact that numerical computations are performed in finite precision. Usually these will not be an issue if case 5 is dealt with properly.
So, sometimes we can work around the deficiency, but it is not always possible. Thus, any well-written model-fitting routine, like lm, glm or mgcv::gam, will apply a QR decomposition to X and use only its full-rank subspace, i.e. a maximal subset of X's columns spanning a full-rank space, for estimation, fixing the coefficients associated with the remaining columns at 0 or NA. The warning you got simply reflects this. There are originally ncol(X) coefficients to estimate, but due to the deficiency only ncol(X) - 7 will be estimated, with the rest being 0 or NA. Such a numerical workaround ensures that a least-squares solution can be obtained in the most stable manner.
To better digest this issue, you can use lm to fit a linear model with fix.formula.
fix.fit <- lm(fix.formula, data.df, method = "qr", singular.ok = TRUE)
method = "qr" and singular.ok = TRUE are default, so actually we don't need to set it. But if we specify singular.ok = FALSE, lm will stop and complain about rank-deficiency.
lm(fix.formula, data.df, method = "qr", singular.ok = FALSE)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# singular fit encountered
You can then check the returned values in fix.fit.
coef <- fix.fit$coef
p <- length(coef)
no.NA <- sum(is.na(coef))
rank <- fix.fit$rank
It is guaranteed that p = ncol(X), but you should see no.NA = 7 and rank + no.NA = p.
Exactly the same thing happens inside lmer; the difference is that lm does not report the deficiency while lmer does. This is in fact informative, as all too often I see people asking why lm returns NA for some coefficients.
Update 1 (2016-05-07):
Let me see if I have this right: The short version is that one of my predictor variables is correlated with another, but I shouldn't worry about it. It is appropriate to use factors, correct? And I can still compare models with anova or by looking at the BIC?
Don't worry about the use of summary or anova. These methods are written so that the correct number of parameters (degrees of freedom) is used to produce valid summary statistics.
Update 2 (2016-11-06):
Let's also hear what the package author of lme4 has to say: rank deficiency warning mixed model lmer. Ben Bolker mentions caret::findLinearCombos there, too, particularly because the OP of that question wants to address the deficiency issue himself.
Update 3 (2018-07-27):
Rank deficiency is not a problem for valid model estimation and comparison, but it could be a hazard for prediction. I recently composed a detailed answer with simulated examples on CrossValidated: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? So, yes, in theory we should avoid rank-deficient estimation. But in reality there is no so-called "true model": we try to learn it from data. We can never compare an estimated model to the "truth"; the best bet is to choose the best one from a number of models we have built. So if the "best" model ends up rank-deficient, we can be skeptical about it, but there is probably nothing we can do immediately.
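A toy illustration of that hazard (a sketch of my own, not one of the simulated examples from the linked answer): predict.lm warns as soon as it is asked to predict from a rank-deficient fit.
# Sketch: x2 duplicates x1, so the fit is rank-deficient and one coefficient is NA
d <- data.frame(y = rnorm(10), x1 = rnorm(10))
d$x2 <- d$x1
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)                  # the x2 coefficient comes back NA
predict(fit, newdata = d)  # warns that prediction from a rank-deficient fit may be misleading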
This response does an excellent job of explaining what rank deficiency is and what its possible causes may be, viz.:
1. Too little data: you cannot uniquely estimate n parameters with fewer than n data points.
2. Too many points are replicates.
3. Information in the wrong places.
4. Complicated model (too many variables).
5. Units and scaling.
6. Variation in numbers: 12.001 vs. 12.005 and 44566 vs. 44555.
7. Data precision: even double-precision variables have limits.

Odd behavior with step()

step() and stepAIC() produce a "remove missing values" error when run on data with missing values.
Error in step(mod1, direction = "backward") :
number of rows in use has changed: remove missing values?
According to ?step:
The model fitting must apply the models to the same dataset. This may be
a problem if there are missing values and R's default of na.action = na.omit
is used. We suggest you remove the missing values first.
I have a data frame with one variable which has four NA values. However, when I run step() on the lm object, I don't get the "missing values" error even though the data have missing values. Can anyone tell me what could be going on?
> d1$Impressions
[1] NA NA NA 6924180 9313226 27888455
18213812 54557205 13495553
...
This does not produce an error message:
mod1 = lm(Leads ~ G + Con + GOO + DAY + Res + SD + ED +
ME + Impressions + Inc + Sea, data=d1)
step(mod1, direction="backward")
stepAIC(mod1)
Even with a variable that has missing values, it does not generate an error message. Any ideas what's going on?
One reason for the stated behaviour is this: step() fits the full model and hence drops 3 (as stated) observations due to the presence of NAs. As long as the variables containing those NAs remain in the model, lm() removes the same observations at each step. Those observations would only re-enter the model, and change the number of rows in the model matrix, if a variable carrying the NAs were dropped; so if stepping stops before that happens, no error is raised.
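If you want to follow the advice quoted from ?step and remove the missing values first, one way (a small sketch reusing the names from the question) is to refit on complete cases, so every candidate model is compared on the same rows:
# Sketch: keep only rows that are complete for the variables the full model uses
vars <- all.vars(formula(mod1))
d1_complete <- na.omit(d1[, vars])
mod1_complete <- update(mod1, data = d1_complete)
step(mod1_complete, direction = "backward")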
As an aside, stepwise selection like this is considered to be of somewhat dubious validity. Not least, in using it you are making a fairly bold statement that the effects of the eliminated variables are exactly zero. It also has the effect of biasing the estimated coefficients of the variables retained in the model towards larger (absolute) values.
Alternatives to this stepwise selection include shrinkage methods such as the Lasso and the Elastic Net.
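For completeness, a rough sketch of the lasso route with the glmnet package (not part of the original answer; it assumes the complete-case data d1_complete from the sketch above):
# Sketch: cross-validated lasso on the same complete-case data
library(glmnet)
X <- model.matrix(formula(mod1), data = d1_complete)[, -1]  # predictors, intercept dropped
y <- d1_complete$Leads
cv_fit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")         # coefficients at the CV-chosen lambda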

Forward procedure with BIC

I'm trying to select variables for a linear model with a forward stepwise algorithm and the BIC criterion. As the help file indicates, and as I have always done, I wrote the following:
model.forward <- lm(y ~ 1, data = donnees)
model.forward.BIC <- step(model.forward, direction = "forward", k = log(n),
                          scope = list(lower = ~1, upper = ~x1 + x2 + x3), data = donnees)
with k=log(n) indicating I'm using BIC. But R returns:
Error in extractAIC.lm(fit, scale, k = k, ...) : object 'n' not found
I never really asked myself the question before, but I think that n is supposed to be defined inside step() (it's the number of variables in the model at each iteration)... Anyway, the issue has never happened to me before! Restarting R doesn't change anything, and I admit I have no idea what can cause this error.
Here is some code to test:
y<-runif(20,0,10)
x1<-runif(20,0,1)
x2<-y+runif(20,0,5)
x3<-runif(20,0,1)-runif(20,0,1)*y
donnees<-data.frame(x1,x2,x3,y)
Any ideas?
step(model.forward, direction = "forward",
     k = log(nrow(donnees)), scope = list(lower = ~1, upper = ~x1 + x2 + x3),
     data = donnees)
or more generally ...
... k=log(nobs(model.forward)) ...
(for example, if there are NA values in your data, then nobs(model.forward) will be different from nrow(donnees). On the other hand, if you have NA values in your predictors, you're going to run into trouble when doing model selection anyway.)
