Error in fitting a model with gee(): NA/NaN/Inf in foreign function call (arg 3) - r

I'm fitting a GEE model on a dataset of 13,500 observations (here, students). Students are grouped into 52 different schools. There is evidence that students are nested within schools (low ICC), so I should adjust for this nesting effect in the variance-covariance matrix. My plan is to first fit a GEE model with an exchangeable variance-covariance structure and then, on top of that, run the Huber-White sandwich estimator, also known as the robust variance estimator. I wrote my own code for the robust variance estimator and it works perfectly. My gee statement, however, doesn't work and gives the error below:
NA/NaN/Inf in foreign function call (arg 3)
Here is my code:
STMath.OneYr.C1 = gee(postCSTMath1Yr ~ TRT1Yr + preCSTMath + preCSTENG +
post1YrGradeRef + ELLBaseLine + GENDER + ECODIS + ETHNICITY.F +
as.factor(FailedInd1Yr), data = UCI.clone[UCI.clone$COHORT0809 == "C1",],
id = post1YrSchIID, corstr = "exchangeable")
Unfortunately, the code above is not reproducible for you, which may make it difficult to figure out what the issue is.
I would appreciate any help in figuring out how to solve it.

OK, this question is quite old but I ended up here, so this might help someone eventually.
Basically, this error arises because, unlike in other libraries, the id parameter is treated as a numeric vector, so an id that cannot be converted to a number (a character label, for instance) ends up as NA.
Indeed, the gee function is casting id as a double, which I don't really understand. Here are the implicated lines (l. 119-120 of the function):
if (!(is.double(id)))
id <- as.double(id)
If your id column is a character, just cast it to a factor, or use some function (like dplyr::min_rank) to turn it to a numeric variable.
This should do the trick.
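As a minimal sketch of that coercion, assuming a hypothetical character vector school_id standing in for the real post1YrSchIID column: coercing character labels straight to double gives NA, whereas going through a factor first gives usable numeric codes.

```r
# Hypothetical school identifiers stored as character strings
# (a stand-in for a column such as post1YrSchIID).
school_id <- c("SCH-A", "SCH-B", "SCH-A", "SCH-C")

# Direct coercion to double, as in the lines quoted above:
# every label becomes NA (the warning is suppressed for brevity).
bad_id <- suppressWarnings(as.double(school_id))

# Going through a factor first gives one integer code per distinct school.
good_id <- as.integer(factor(school_id))

print(bad_id)
print(good_id)
```

The numeric codes can then be stored in the data frame and passed as id. The gee documentation also assumes that rows belonging to the same cluster are contiguous, so sorting the data by id beforehand is a sensible precaution.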

Related

lmer failing with na.pass

When I run a lmer model with lme4 using na.pass as the na.action, I get the following error:
R: NA/NaN/Inf in foreign function call (arg 1)
I run the model like this:
model1 <- lme4::lmer(agg_dv_singing ~ GMS.Musical.Training +
JAJ.ability + MDT.ability + MPT.ability + PDCT.ability +
PIAT.ability + agg_dv_long_note + demographics.age +
aggiv_entropy + aggiv_interval_complexity +
aggiv_rhythmic_complexity + aggiv_tonal_complexity +
log.freq + length + (1|p_id),
data = dat, na.action = na.pass)
summary(dat) indicates that there are no Inf or NaN values, although yes, there are many NA values.
Running na.pass outside of lmer on the same data set does not give an error:
na.pass(dat)
So what could be going wrong within lmer?
Comments to a previous question of yours attempted to explain that, in general, mixed model machinery cannot handle estimation from cases when there are missing values in the predictors; it just doesn't work that way. If you want to fit mixed models with missing data you need to do some form of imputation, i.e. filling in values for missing predictors (e.g. see the mice package, which is more or less the state of the art at least as far as the R ecosystem is concerned). Here is what the four different standard na.* actions do in the context of mixed models:
na.fail(): fail immediately if there are missing values in the data (predictors or response). This is frustrating, but alerts you immediately to the fact that you have missing data, and lets you decide what to do about it.
na.omit(): drop non-complete cases from the data before fitting.
na.exclude(): like na.omit(), but keep track of the locations of the excluded cases. When using predict() or residuals() (or any function that produces results per observation), reconstitute a complete data set with NA values for the non-complete cases in the original data set. (I usually find this setting to be the most useful default.)
na.pass(): do not remove NA values, but attempt to continue with the fitting procedure. As you found out, this usually doesn't work at all! It will just pass the NA values down through the code until something goes wrong. Typically one of two things happens at this point:
if the entire estimation procedure is written using R functions that can handle and propagate missing values, then you'll usually get a fitted model object with NA/NaN for all coefficients, likelihoods, etc. etc. (because the missing values contaminate the entire fitting procedure);
if some step of the estimation procedure can't handle NA/NaN values (as in this case), you get an inscrutable error from the first point in the procedure that fails.
If you look at the source code of na.pass() (by typing na.pass at the R prompt), you'll see that in fact all it does is return the same object, unchanged. To be honest, I'm not really sure why na.pass even exists, except for completeness ... (or compatibility with S)
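The differences between these actions are easier to see on a small example. The sketch below uses a made-up data frame d and a plain lm() fit rather than a mixed model, purely to keep it self-contained; the names d, fit_omit and fit_excl are illustrative.

```r
# Made-up data with one missing predictor value in row 3.
d <- data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2),
  x = c(1, 2, NA, 4, 5)
)

# na.pass() hands back the object unchanged; na.omit() drops row 3.
same_object <- identical(na.pass(d), d)
rows_kept <- nrow(na.omit(d))

# Both fits use the same four complete cases ...
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# ... but na.exclude pads per-observation results back to the original length.
res_omit <- residuals(fit_omit)
res_excl <- residuals(fit_excl)

print(length(res_omit))
print(res_excl)
```

With na.exclude the residual vector lines up row by row with the original data, which is convenient when adding residuals or predictions back as a column.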
Your NA value was not in a variable that is used in a random-effects term: if it had been, you would have gotten a more interpretable error message:
library(lme4)
ss <- sleepstudy
ss[1,"Days"] <- NA
lmer(Reaction ~ Days + (Days|Subject), ss, na.action=na.pass)
Error in lme4::lFormula(formula = Reaction ~ Days + (Days | Subject), :
NA in Z (random-effects model matrix): please use "na.action='na.omit'" or "na.action='na.exclude'"
If I fit a model with (1|Subject), so that the NA value only affects the fixed effects
lmer(Reaction ~ Days + (1|Subject), ss, na.action=na.pass)
then we get your error message.
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
traceback() tells me that this happens in the internal chkRank.drop.cols() function, where R is trying to figure out if any of your fixed-effect columns are collinear. There should probably be a check for missing values there ...
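The failure can be reproduced outside lmer with nothing but base R. This is a small sketch, assuming a toy two-column matrix X as a stand-in for a fixed-effects model matrix; it also shows the kind of anyNA() pre-check that catches the problem before any fitting starts.

```r
# Toy stand-in for a fixed-effects model matrix.
X <- cbind(intercept = 1, days = c(0, 1, 2, 3))
X_na <- X
X_na[1, "days"] <- NA

# Cheap check that can be run before fitting.
has_missing <- anyNA(X_na)

# qr() works on the complete matrix but stops on the one containing NA.
rank_ok <- qr(X)$rank
msg <- tryCatch(
  {
    qr(X_na)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(has_missing)
print(rank_ok)
print(msg)
```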

TukeyHSD or glht in R, ANCOVA

I'm wondering if I can use the function "TukeyHSD" to perform all pairwise comparisons for an "aov()" model with one factor (e.g., GROUP) and one continuous covariate (e.g., AGE). For example, I did:
library(multcomp)
data('litter', package = 'multcomp')
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
and I get a warning message like this:
Warning message:
In replications(paste("~", xx), data = mf): non-factor ignored: gesttime
Is this process above correct? What's the meaning of the warning message? And does "TukeyHSD" apply to badly unbalanced designs?
In addition, is there any difference between the processes above and below?
litter.mc <- glht(litter.aov, linfct = mcp(dose = 'Tukey'))
summary(litter.mc)
Best, Sue
There's no difference. TukeyHSD() is just a bit more eager to tell you about potential problems. Notice that it's a warning message, not an error, meaning that the results might not be what you expect, but they'll still be returned so you can judge for yourself.
As for what it means, it means what it says: non-factor variables are ignored. Remember that you are comparing the differences between groups, and grouping is done using factors, so factors are all TukeyHSD() cares about. In your case you explicitly tell the function to only care about dose, which is a factor, so the warning might be seen as overly cautious.
One way of avoiding the warning would be to convert gesttime into a factor, and as it consists of only four levels it makes some sense to do so.
data('litter', package = 'multcomp')
litter$gesttime <- as.factor(litter$gesttime)
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
I know this is an old thread but I'm not sure the existing answers are quite right...
I've been trying both functions with my own data and have a similar situation to Sue, where TukeyHSD gives a warning message about ignoring non-factor covariates, while glht() does not.
Contrary to the other answer, it does not appear that they are doing the same thing. The results are different, and it appears that TukeyHSD is not marginalizing over the non-factor covariates (as the warning states). It appears that glht() correctly uses the mean value of non-factor covariates to compute the marginal mean of the groups of interest, since the point estimates are the same as those obtained from lsmeans().
So it does not seem that TukeyHSD is overly cautious, it just seems that it can't handle non-factor covariates while glht is able to. So glht seems to be the correct function to use in this case, to me.
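One way to check this for yourself without any extra packages is to put the TukeyHSD() differences next to the covariate-adjusted differences taken directly from the model coefficients. The sketch below is only an illustration and not the code used above: it assumes the built-in mtcars data, with cyl as the factor and wt as the continuous covariate, in place of the litter data.

```r
# Built-in data: cyl is the grouping factor, wt the continuous covariate.
d <- mtcars
d$cyl <- factor(d$cyl)
fit <- aov(mpg ~ wt + cyl, data = d)

# TukeyHSD() warns that the non-factor wt is ignored; the warning is
# suppressed here only to keep the output short.
tukey_diff <- suppressWarnings(TukeyHSD(fit, which = "cyl"))$cyl[, "diff"]

# Covariate-adjusted differences from the 4-cylinder baseline are the
# treatment-contrast coefficients of the same model.
adjusted_diff <- coef(fit)[c("cyl6", "cyl8")]

print(tukey_diff)
print(adjusted_diff)
```

If the two sets of differences for 6 vs 4 and 8 vs 4 cylinders disagree, that is consistent with TukeyHSD() not adjusting for the covariate.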

lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to?

I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos in the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message
Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion
My data does not have any NAs. What could be causing this? (Sorry if the answer is very obvious; I am new to R.)
All of my data are factors except the numerical value that I am trying to predict. Here is a small sample of my data.
sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
word_1, type, F2_difference,
stringsAsFactors = TRUE)
Edit:
Here is some more code, if it helps.
formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type +
(1|speaker) + (1|word_1)
lmer(formula, REML = FALSE, data = data.df)
Editor edit:
The OP did not provide enough test data for the reader to actually run the model in lmer. But this is not too big an issue. This is still a very good post!
You are slightly over-concerned with the warning message:
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
It is a warning not an error. There is neither misuse of lmer nor ill-specification of model formula, thus you will obtain an estimated model. But to answer your question, I shall strive to explain it.
During execution of lmer, your model formula is broken into a fixed effect formula and a random effect formula, and for each a model matrix is constructed. Construction for the fixed one is via the standard model matrix constructor model.matrix; construction for the random one is complicated but not related to your question, so I just skip it.
For your model, you can check what the fixed effect model matrix looks like by:
fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type
X <- model.matrix (fix.formula, data.df)
All your variables are factors, so X will be binary. Although model.matrix applies contrasts to each factor and their interactions, X may still not end up with full column rank, since a column may be a linear combination of some others (either exactly or numerically close). In your case, some levels of one factor may be nested in some levels of another.
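To make the nesting point concrete, here is a small sketch on made-up data (the data frame toy and its two factors are hypothetical, not the OP's data). The vowels "i" and "o" each occur with only one type, so their dummy columns duplicate the type columns and the model matrix loses rank.

```r
# Hypothetical toy data: "i" occurs only with labial, "o" only with velar.
toy <- data.frame(
  type  = factor(c("coronal", "coronal", "labial", "labial", "velar", "velar")),
  vowel = factor(c("a", "e", "i", "i", "o", "o"))
)

X_toy <- model.matrix(~ type + vowel, data = toy)

n_cols <- ncol(X_toy)      # number of coefficients requested
x_rank <- qr(X_toy)$rank   # number that can actually be estimated

print(colnames(X_toy))
print(n_cols - x_rank)     # number of columns that would have to be dropped
```

Comparing qr(X)$rank with ncol(X) in the same way on the real model matrix shows how many columns the fitting routine will have to drop.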
Rank deficiency can arise in many different ways. The other answer shares a CrossValidated answer offering substantial discussions, on which I will make some comments.
For case 1, people can actually do a feature selection model via say, LASSO.
Cases 2 and 3 are related to the data collection process. A good design of experiment is the best way to prevent rank-deficiency, but for many people who build models, the data are already there and no improvement (like getting more data) is possible. However, I would like to stress that even for a dataset without rank-deficiency, we can still get this problem if we don't use it carefully. For example, cross-validation is a good method for model comparison. To do this we need to split the complete dataset into a training one and a test one, but without care we may get a rank-deficient model from the training dataset.
Case 4 is a big problem that could be completely out of our control. Perhaps a natural choice is to reduce model complexity, but an alternative is to try penalized regression.
Case 5 is a numerical concern leading to numerical rank-deficiency and this is a good example.
Cases 6 and 7 tell the fact that numerical computations are performed in finite precision. Usually these won't be an issue if case 5 is dealt with properly.
So sometimes we can work around the deficiency, but this is not always possible. Thus any well-written model-fitting routine, like lm, glm, or mgcv::gam, will apply a QR decomposition to X and use only its full-rank subspace, i.e., a maximal subset of X's columns that spans a full-rank space, for estimation, fixing the coefficients associated with the remaining columns at 0 or NA. The warning you got implies just this. There are originally ncol(X) coefficients to estimate, but due to the deficiency only ncol(X) - 7 will be estimated, with the rest being 0 or NA. Such a numerical workaround ensures that a least squares solution can be obtained in the most stable manner.
To better digest this issue, you can use lm to fit a linear model with fix.formula.
fix.fit <- lm(fix.formula, data.df, method = "qr", singular.ok = TRUE)
method = "qr" and singular.ok = TRUE are the defaults, so we don't actually need to set them. But if we specify singular.ok = FALSE, lm will stop and complain about rank-deficiency.
lm(fix.formula, data.df, method = "qr", singular.ok = FALSE)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# singular fit encountered
You can then check the returned values in fix.fit.
coef <- fix.fit$coef
p <- length(coef)
no.NA <- sum(is.na(coef))
rank <- fix.fit$rank
It is guaranteed that p = ncol(X), but you should see no.NA = 7 and rank + no.NA = p.
Exactly the same thing happens inside lmer. lm will not report deficiency while lmer does. This is in fact informative, as too often, I see people asking why lm returns NA for some coefficients.
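Since the OP's sample is too small to run, the same bookkeeping can be tried on simulated data. In this sketch the variables x1, x2, x3 and y are invented for illustration, and x3 is deliberately constructed as x1 + x2 so that one column is redundant.

```r
set.seed(1)

# Simulated predictors; x3 is an exact linear combination of x1 and x2.
sim <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
sim$x3 <- sim$x1 + sim$x2
sim$y <- 1 + sim$x1 - sim$x2 + rnorm(20)

# lm() fits silently and marks the redundant coefficient as NA.
sim_fit <- lm(y ~ x1 + x2 + x3, data = sim)

cf <- coef(sim_fit)
p <- length(cf)             # coefficients requested, equals ncol(model.matrix(sim_fit))
no_na <- sum(is.na(cf))     # coefficients that could not be estimated
fit_rank <- sim_fit$rank    # rank of the model matrix

print(cf)
print(c(p = p, no_na = no_na, rank = fit_rank))
```

Here rank + no_na equals p, which is the same identity described above for fix.fit.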
Update 1 (2016-05-07):
Let me see if I have this right: The short version is that one of my predictor variables is correlated with another, but I shouldn't worry about it. It is appropriate to use factors, correct? And I can still compare models with anova or by looking at the BIC?
Don't worry about the use of summary or anova. Methods are written so that the correct number of parameters (degree of freedom) will be used to produce valid summary statistics.
Update 2 (2016-11-06):
Let's also hear what the package author of lme4 has to say: rank deficiency warning mixed model lmer. Ben Bolker mentions caret::findLinearCombos too, particularly because the OP there wants to address the deficiency issue himself.
Update 3 (2018-07-27):
Rank-deficiency is not a problem for valid model estimation and comparison, but could be a hazard in prediction. I recently composed a detailed answer with simulated examples on CrossValidated: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? So, yes, in theory we should avoid rank-deficient estimation. But in reality, there is no so-called "true model": we try to learn it from data. We can never compare an estimated model to "truth"; the best bet is to choose the best one from a number of models we've built. So if the "best" model ends up rank-deficient, we can be skeptical about it but probably there is nothing we can do immediately.
This response does an excellent job of explaining what rank deficiency is, and what the possible causes may be.
Viz:
Too little data: You cannot uniquely estimate n parameters with fewer than n data points
Too many points are replicates.
Information in the wrong places.
Complicated model (too many variables)
Units and scaling
Variation in numbers: 12.001 vs. 12.005 & 44566 vs 44555
Data precision: Even Double-precision variables have limits

adding interaction to GEE - model matrix is rank deficient

I am trying to run a GEE model, using geepack. I have done this successfully, using the below call.
Call:
geeglm(formula = pdc1 ~ country + post + time_post + TIME + age + sex +
country * time_post + country * post, family = gaussian("identity"),
data = lipid_data, id = id, waves = ID, corstr = "ar1", std.err = "san.se")
where:
pdc1=numeric
country=factor
post=factor
time_post=numeric
TIME=numeric
I'm trying to run the exact same model on different data, which are in the exact same format as above. I can run the model with main effects, but not with the interactions. This is the error I get:
Error in geeglm(pdc1 ~ STATE + post + time_post + TIME + STATE * post, :
Model matrix is rank deficient; geeglm can not proceed
I have tried recoding STATE (and post) as numeric variables, but this did not prove fruitful. I don't understand what's going on: the variables hold the exact same data as in the first model and are coded the same way. Does anyone know what could be going on here?
I have recently solved a similar problem - it related to having one of the covariates constant across the dataset. For example, STATE might be the same state for each observation which may be causing the error "rank deficient". This might also explain why this error is specific to your new dataset.
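A quick way to test this explanation is to look for constant columns and compare the rank of the model matrix with its number of columns before calling geeglm. The sketch below uses made-up data in which a hypothetical numeric state column takes a single value; it is not the asker's data.

```r
# Made-up data: state is constant across all rows.
d <- data.frame(
  state     = rep(1, 6),
  post      = c(0, 0, 1, 1, 0, 1),
  time_post = c(0, 0, 1, 2, 0, 3)
)

# Columns that take only one distinct value.
is_constant <- vapply(d, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(which(is_constant))

# A constant state duplicates the intercept, and state:post duplicates post.
X <- model.matrix(~ state + post + time_post + state:post, data = d)
x_rank <- qr(X)$rank

print(constant_cols)
print(c(columns = ncol(X), rank = x_rank))
```

If the rank is smaller than the number of columns, dropping the constant variable (and any interaction involving it) from the formula should remove the deficiency.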
I am not sure if you encountered the same problem as mine. I used to fit a geeglm model in R and got the same problem.
My model is something like:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
This model runs without any problem.
However, when I tried to put another binary covariate, setting (Urban/Rural), into the model:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)+as.factor(setting)-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
R gave me the error:
Model matrix is rank deficient; geeglm can not proceed
Then I decided to create a new binary variable, setting_new (0/1), and put this new variable in the model:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)+setting_new-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
And now the problem is solved.
I got the same problem when trying to put another categorical variable into the above-mentioned model, with the syntax:
as.factor(Q12_travel)
And the model would not run until I created new dummy variables (0/1) and put them back into the model.
Maybe you have another problem, but I suggest you can try this approach to see if the problem is solved.

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
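As a self-contained illustration of the nesting syntax, the sketch below uses the built-in iris data in place of reg.df, with Species playing the role of pp. It compares the coefficients from the single nested fit with those from a separate regression on one group.

```r
# One model, separate intercept and slope for each Species.
nested_fit <- lm(Sepal.Length ~ 0 + Species / Petal.Length, data = iris)

# Separate regression on a single group, for comparison.
setosa_fit <- lm(Sepal.Length ~ Petal.Length,
                 data = subset(iris, Species == "setosa"))

# Intercept and slope for setosa from the nested fit.
nested_setosa <- coef(nested_fit)[c("Speciessetosa", "Speciessetosa:Petal.Length")]

print(nested_setosa)
print(coef(setosa_fit))
```

The two sets of coefficients agree up to numerical precision, while the standard errors can differ because the nested fit pools the residual variance across groups.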
A couple comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
Like I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, this leads me to expect that really the problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
