lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? - r

I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos in the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message
Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion
My data does not have any NAs. What could be causing this? (Sorry if the answer is very obvious; I am new to R.)
All of my data are factors except the numerical value that I am trying to predict. Here is a small sample of my data.
sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
                      word_1, type, F2_difference,
                      stringsAsFactors = TRUE)
Edit:
Here is some more code, if it helps.
formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type +
(1|speaker) + (1|word_1)
lmer(formula, REML = FALSE, data = data.df)
Editor edit:
The OP did not provide enough test data for the reader to actually run the model in lmer, but this is not too big an issue. This is still a very good post!

You are slightly over-concerned with the warning message:
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
It is a warning, not an error. There is neither misuse of lmer nor ill-specification of the model formula, so you will still obtain an estimated model. But to answer your question, I shall strive to explain it.
During execution of lmer, your model formula is broken into a fixed effect formula and a random effect formula, and for each a model matrix is constructed. Construction for the fixed one is via the standard model matrix constructor model.matrix; construction for the random one is complicated but not related to your question, so I just skip it.
For your model, you can check what the fixed effect model matrix looks like by:
fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type
X <- model.matrix (fix.formula, data.df)
All your variables are factors, so X will be binary. Though model.matrix applies contrasts for each factor and their interactions, it is still possible that X does not end up with full column rank, as a column may be a linear combination of some others (either exactly or to within numerical precision). In your case, some levels of one factor may be nested in some levels of another.
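A minimal sketch of the nesting case, using hypothetical toy data (the names toy, X_toy, rank_toy and n_dropped are made up for illustration and do not come from the question). Each vowel occurs with only one type, so some dummy columns are exact linear combinations of others and the rank falls short of the number of columns.

```r
# Hypothetical toy data: every vowel level occurs within exactly one type level,
# i.e. vowel is nested in type.
toy <- data.frame(
  type  = factor(rep(c("coronal", "labial", "velar"), each = 4)),
  vowel = factor(rep(c("a", "e", "i", "o", "u", "y"), each = 2))
)

# Fixed-effect model matrix with treatment contrasts (the R default).
X_toy <- model.matrix(~ type + vowel, data = toy)

# Compare the numerical rank with the number of columns.
rank_toy  <- qr(X_toy)$rank
n_dropped <- ncol(X_toy) - rank_toy

ncol(X_toy)   # number of coefficients requested
rank_toy      # number of coefficients that can actually be estimated
n_dropped     # number of columns a fitting routine would have to drop
```

Running the same comparison on the X built from fix.formula should reproduce the count reported in the lmer warning.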
Rank deficiency can arise in many different ways. The other answer shares a CrossValidated answer offering substantial discussions, on which I will make some comments.
For case 1, people can actually do a feature selection model via say, LASSO.
Cases 2 and 3 are related to the data collection process. A good design of experiment is the best way to prevent rank-deficiency, but for many people who build models, the data are already there and no improvement (like getting more data) is possible. However, I would like to stress that even for a dataset without rank-deficiency, we can still get this problem if we don't use it carefully. For example, cross-validation is a good method for model comparison. To do this we need to split the complete dataset into a training one and a test one, but without care we may get a rank-deficient model from the training dataset.
Case 4 is a big problem that could be completely out of our control. Perhaps a natural choice is to reduce model complexity, but an alternative is to try penalized regression.
Case 5 is a numerical concern leading to numerical rank-deficiency and this is a good example.
Cases 6 and 7 tell the fact that numerical computations are performed in finite precision. Usually these won't be an issue if case 5 is dealt with properly.
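The cross-validation caveat mentioned under cases 2 and 3 can be sketched as follows. This is only an illustration on hypothetical toy data (full, train and the X_* objects are assumed names): the full data give a full-rank design, but a training subset that happens to contain no observation of one factor level produces an all-zero column.

```r
# Hypothetical toy data: three groups, two observations each.
full <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  x = c(1, 2, 3, 4, 5, 7)
)

# The design matrix for the complete data has full column rank.
X_full <- model.matrix(~ g + x, data = full)

# A careless training split: level "c" is never observed,
# but it is still a level of the factor, so its dummy column is all zeros.
train   <- full[1:4, ]
X_train <- model.matrix(~ g + x, data = train)

# Dropping unused levels before building the matrix removes the empty column.
X_train_ok <- model.matrix(~ g + x, data = droplevels(train))

c(rank = qr(X_full)$rank,     ncol = ncol(X_full))
c(rank = qr(X_train)$rank,    ncol = ncol(X_train))
c(rank = qr(X_train_ok)$rank, ncol = ncol(X_train_ok))
```

Stratifying the split on the factor, or checking the rank after splitting, are two simple ways to guard against this.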
So, sometimes we can work around the deficiency, but this is not always possible. Thus, any well-written model fitting routine, like lm, glm, mgcv::gam, will apply a QR decomposition to X so as to use only its full-rank subspace, i.e., a maximal subset of X's columns that is linearly independent, for estimation, fixing the coefficients associated with the remaining columns at 0 or NA. The warning you got says exactly this: there are originally ncol(X) coefficients to estimate, but due to the deficiency only ncol(X) - 7 will be estimated, with the rest being 0 or NA. This numerical workaround ensures that a least squares solution can be obtained in the most stable manner.
To better digest this issue, you can use lm to fit a linear model with fix.formula.
fix.fit <- lm(fix.formula, data.df, method = "qr", singular.ok = TRUE)
method = "qr" and singular.ok = TRUE are the defaults, so we don't actually need to set them. But if we specify singular.ok = FALSE, lm will stop and complain about rank-deficiency.
lm(fix.formula, data.df, method = "qr", singular.ok = FALSE)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# singular fit encountered
You can then check the returned values in fix.fit.
coef <- fix.fit$coef
p <- length(coef)
no.NA <- sum(is.na(coef))
rank <- fix.fit$rank
It is guaranteed that p = ncol(X), but you should see no.NA = 7 and rank + no.NA = p.
Exactly the same thing happens inside lmer. lm will not report deficiency while lmer does. This is in fact informative, as too often, I see people asking why lm returns NA for some coefficients.
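Since the sample data in the question are too small to fit, here is a hedged sketch of the same bookkeeping on hypothetical toy data (toy, toy_fit and the other names are assumptions, and the response is just random noise). It shows lm silently returning NA for the aliased coefficients while rank plus the number of NAs equals the number of columns.

```r
set.seed(1)

# Hypothetical toy data with vowel nested in type, plus a random response.
toy <- data.frame(
  type  = factor(rep(c("coronal", "labial", "velar"), each = 4)),
  vowel = factor(rep(c("a", "e", "i", "o", "u", "y"), each = 2))
)
toy$y <- rnorm(nrow(toy))

# lm fits without complaint (singular.ok = TRUE by default).
toy_fit  <- lm(y ~ type + vowel, data = toy)
toy_coef <- coef(toy_fit)

p_toy <- length(toy_coef)        # equals ncol of the model matrix
n_na  <- sum(is.na(toy_coef))    # coefficients that were dropped

# Names of the aliased (dropped) coefficients.
names(toy_coef)[is.na(toy_coef)]

# The identity described in the text: rank + number of NAs = p.
toy_fit$rank + n_na == p_toy
```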
Update 1 (2016-05-07):
Let me see if I have this right: The short version is that one of my predictor variables is correlated with another, but I shouldn't worry about it. It is appropriate to use factors, correct? And I can still compare models with anova or by looking at the BIC?
Don't worry about the use of summary or anova. Methods are written so that the correct number of parameters (degree of freedom) will be used to produce valid summary statistics.
Update 2 (2016-11-06):
Let's also hear what the package author of lme4 would say: rank deficiency warning mixed model lmer. Ben Bolker has mentioned caret::findLinearCombos, too, particularly because the OP there wants to address the deficiency issue himself.
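As a hedged illustration of the caret::findLinearCombos route: the function works on a numeric matrix, so the natural input is the fixed-effect model matrix rather than a data frame of factors (passing the raw data frame is presumably what triggered the coercion error in the question). The toy data and object names below are assumptions, and the caret call is guarded so that the base R part runs even when caret is not installed.

```r
# Hypothetical toy data with vowel nested in type.
toy <- data.frame(
  type  = factor(rep(c("coronal", "labial", "velar"), each = 4)),
  vowel = factor(rep(c("a", "e", "i", "o", "u", "y"), each = 2))
)
X_toy <- model.matrix(~ type + vowel, data = toy)

# Base R: in the pivoted QR decomposition, the columns listed after the
# first `rank` pivots are the ones found to be linearly dependent.
qr_toy    <- qr(X_toy)
dependent <- colnames(X_toy)[qr_toy$pivot[-seq_len(qr_toy$rank)]]
dependent

# Optional: the same question asked of caret, on the numeric matrix.
if (requireNamespace("caret", quietly = TRUE)) {
  combos <- caret::findLinearCombos(X_toy)
  print(combos$remove)   # column indices caret suggests removing
}
```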
Update 3 (2018-07-27):
Rank-deficiency is not a problem for valid model estimation and comparison, but could be a hazard in prediction. I recently composed a detailed answer with simulated examples on CrossValidated: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? So, yes, in theory we should avoid rank-deficient estimation. But in reality, there is no so-called "true model": we try to learn it from data. We can never compare an estimated model to "truth"; the best bet is to choose the best one from a number of models we've built. So if the "best" model ends up rank-deficient, we can be skeptical about it but probably there is nothing we can do immediately.

This response does an excellent job of explaining what rank deficiency is, and what the possible causes may be.
Viz:
Too little data: You cannot uniquely estimate n parameters with fewer than n data points
Too many points are replicates.
Information in the wrong places.
Complicated model (too many variables)
Units and scaling
Variation in numbers: 12.001 vs. 12.005 & 44566 vs 44555
Data precision: Even Double-precision variables have limits

Related

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier Score for my Machine Learning Predictive models. The outcome "y" was categorical (1 or 0). Predictors are a mix of continuous and categorical variables.
I have created four models with different predictors; I will call them "model_1"-"model_4" here (apart from the predictors, all other parameters are the same). Example code for one model is:
Model_1=rfsrc(y~ ., data=TrainTest, ntree=1000,
mtry=30, nodesize=1, nsplit=1,
na.action="na.impute", nimpute=3,seed=10,
importance=T)
When I run the "Model_1" function in R, I got the results:
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package.
But when I tried "BrierScore(Model_1)", it gave me no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried the code above, but it produced many error messages.
One assumption I am making about your approach is that you want to compute the BrierScore on the data you train your model on (which is usually not the correct approach, google train-test split if you need more info there).
In general, therefore you should reflect on whether your approach is correct there.
The BrierScore method in DescTools only has a defined method for glm models, otherwise, it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(Model_1, TrainTest, na.action="na.impute")
and then compute the brier score using
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note, that we transform TrainTest$y into a numeric vector of 0's and 1's in order to compute the brier score.)
Note: The randomForestSRC package also prints a normalized brier score when you call print(prediction).
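To make the calculation itself concrete: for a binary outcome the Brier score is just the mean squared difference between the predicted probability of class 1 and the observed 0/1 outcome, so the "observed probability" for each person is simply their 0/1 outcome. The sketch below uses made-up numbers (y_fac, y_obs and p_hat are hypothetical, not taken from the question) and only base R.

```r
# Hypothetical observed outcomes stored as a factor with levels "0" and "1".
y_fac <- factor(c("1", "0", "1", "1", "0"))

# Two equivalent ways to recover the numeric 0/1 outcome from such a factor.
y_obs  <- as.numeric(as.character(y_fac))
y_obs2 <- as.numeric(y_fac) - 1

# Hypothetical predicted probabilities of class "1" for the same five people.
p_hat <- c(0.9, 0.2, 0.6, 0.7, 0.4)

# Brier score: mean squared error of the probability forecasts.
brier <- mean((p_hat - y_obs)^2)
brier
```

Note that as.numeric(y_fac) - 1 only gives the right answer when the factor levels are exactly "0" and "1" in that order; going through as.character is the safer conversion.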
In general, using one of the available workbenches for machine learning in R (mlr3, tidymodels, caret) might simplify this approach for you. This is good practice, especially if you are less experienced in ML, as it can prevent many errors of this kind.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various stack-related websites, none of them were able to provide an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing this function.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent makes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine having panel data recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# number of possible choice
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# create wide dataset
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see for additional details this tutorial). However, I am afraid that it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to explicitly deal with panel structures using appropriate estimation strategy. For this reason, I have decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which individuals belong (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help page of the function mlogit, it is stated that one can use the argument panel to apply panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables do not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly attached to mlogit; any efficient way to perform this task would be fine for me (e.g., I am ok with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

Heteroskedasticity random effects regression with the plm package (correction & how to report)

I have already checked a couple of topics and also found some help regarding heteroskedasticity in panel regressions. But unfortunately, some questions have remained unsolved.
Following example (some repeated measures, data already in long format):
Panelregr <- plm(V1~ V2 + V3 + V4, data = XY, model ="random")
Then I checked for heteroskedasticity:
B.P.Test <- bptest(V1 ~ V2 + V3 + V4, data = XY, studentize = FALSE)
The test was highly significant --> heteroskedasticity
Then I read (Link: https://www.princeton.edu/~otorres/Panel101R.pdf) about using a robust covariance matrix to account for heteroskedasticity. For the example above I used the code
coeftest(Panelregr, vcovHC)
summary(Panelregr, vcov = vcovHC)
and got the results. But I could also use
coeftest(Panelregr, vcovHC(Panelregr, type = "HC3"))
or the other types HC0 - HC4
Now some questions came up:
Which estimator of these five types do I receive when I use coeftest(Panelregr, vcovHC) instead of defining one particular HC..? Is it HC0?
How do I know which HC... fits my data? (I read some information, for example: https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich.pdf , page 4, but I'm still not sure how to decide.)
How do I describe the results when using one of these corrected estimators? Example: "In order to account for heteroskedasticity, a robust covariance matrix was used. In detail, we used the HC... estimator as ... In the following table, the results of the HC... estimator are shown."
When I correct for heteroskedasticity, the results don't include values like R-squared. Is it correct to report the corrected values (e.g. coeftest(Panelregr, vcovHC)) and to report values like R-squared from the "original" panel regression (Panelregr <- plm(V1~ V2 + V3 + V4, data = XY, model ="random"))?
1) The default one (see ?vcovHC); for plm::vcovHC that is HC0, as it is the first value listed for the argument type.
3) HC0, HC1, ... are scaling factors for the variance-covariance matrix. It is good to mention which one you used. You also want to mention the estimator, i.e. what is given by the method argument. A typical choice is the estimator by Arellano (1987), which is the default for plm::vcovHC.
4) The R^2 is not impacted by using a het.-consistent variance-covariance matrix. However, the F-statistic is. summary(Panelregr, vcov = vcovHC) gives you what you need.
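To illustrate what "scaling factor" means in point 3, here is a rough base R sketch for an ordinary lm fit on simulated cross-sectional data. This is an assumption-laden analogue, not what plm::vcovHC computes: the default Arellano method in plm additionally clusters by individual, and all names and data below are hypothetical. It only shows the classical White (HC0) sandwich and the HC1 small-sample rescaling by n / (n - k).

```r
set.seed(42)

# Simulated data with error variance that grows with x (heteroskedasticity).
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- residuals(fit)
k   <- ncol(X)

# Sandwich pieces: bread = (X'X)^{-1}, meat = sum_i e_i^2 x_i x_i'.
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)

# HC0: plain White estimator. HC1: the same matrix rescaled by n / (n - k).
vcov_hc0 <- bread %*% meat %*% bread
vcov_hc1 <- vcov_hc0 * n / (n - k)

se_hc0 <- sqrt(diag(vcov_hc0))
se_hc1 <- sqrt(diag(vcov_hc1))
rbind(HC0 = se_hc0, HC1 = se_hc1)
```

The coefficient estimates are identical in every case; only the standard errors, and hence the test statistics, change with the chosen type.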

glm summary not giving coefficients values

I'm trying to fit a glm on a given dataset, but summary(model1) is not giving me the correct output: instead of coefficient values for Estimate, Std. Error, z value, Pr(>|z|) etc., it just gives NA for each individual term.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p-values for further analysis. As far as I can tell my code is right, so how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, for every variable that is categorical, you should use as.factor() to make it into a factor. GLM doesn't know what a "string" variable is.
As MorganBall indicated, if your data truly is count data, you may consider fitting it using a Poisson GLM, instead of converting to binary and using Logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, note that some of these parameters may have very few (close to 0?) observations in them. This is a problem.
In addition, did you make sure that your data does not perfectly separate the response? Because if there are some combinations of parameters that do separate the response perfectly, the maximum likelihood estimate won't converge and theoretically goes to infinity. Practically speaking, you'll get very large standard errors for your estimates.
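The point about too few observations per parameter can be checked on a small made-up example (small and fit_small are hypothetical names, and the numbers are invented). With one distinct category per row there are more coefficients than observations, so glm can only estimate as many as the rank allows and reports NA for the rest. The fit is also perfectly separated, which is why the warnings are suppressed here.

```r
# Hypothetical data: four rows, a name that is unique per row, one numeric predictor.
small <- data.frame(
  y    = c(1, 0, 1, 0),
  name = c("A", "B", "C", "D"),
  x    = c(34, 532, 77, 213)
)

# Intercept + 3 name dummies + x = 5 coefficients, but only 4 observations.
fit_small <- suppressWarnings(
  glm(y ~ name + x, family = binomial, data = small)
)

coef(fit_small)                 # at least one coefficient is NA
fit_small$rank                  # number of estimable coefficients
length(coef(fit_small))         # number of coefficients requested
```

Dropping identifier-like columns (free text, names) from the formula before fitting is usually the first thing to try in this situation.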

Error in fitting a model with gee(): NA/NaN/Inf in foreign function call (arg 3)

I'm fitting a gee model on a dataset including 13,500 observations (here students). Students are grouped into 52 different schools. I know that there is evidence that students are nested within schools (low ICC) and therefore I should adjust for this nesting effect in the variance-covariance matrix. What I'm planning to do is to first fit a gee model with an exchangeable var-cov structure. Then, on top of that, I'll run the Huber-White sandwich estimator, also known as the robust variance estimator. I wrote my own code for the robust variance estimator and it works perfectly. My gee statement doesn't work and gives the error below:
NA/NaN/Inf in foreign function call (arg 3)
Here is my code:
STMath.OneYr.C1 = gee(postCSTMath1Yr ~ TRT1Yr + preCSTMath + preCSTENG +
post1YrGradeRef + ELLBaseLine + GENDER + ECODIS + ETHNICITY.F +
as.factor(FailedInd1Yr), data = UCI.clone[UCI.clone$COHORT0809 == "C1",],
id = post1YrSchIID, corstr = "exchangeable")
Unfortunately, the code above is not reproducible for you, so it may be difficult to figure out what the issue is.
I would appreciate it if you could help me figure out how to solve the issue.
OK, this question is quite old but I ended up here, so this might help someone eventually.
Basically, this error was caused because unlike in other libraries, the id parameter is treated as a numeric vector.
Indeed, the gee function is casting id as a double, which I don't really understand. Here are the implicated lines (l. 119-120 of the function):
if (!(is.double(id)))
id <- as.double(id)
If your id column is a character, just cast it to a factor, or use some function (like dplyr::min_rank) to turn it into a numeric variable.
This should do the trick.
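A small base R sketch of why the cast fails and how the factor route fixes it (school_id is a made-up example vector, not the asker's data):

```r
# Hypothetical character cluster ids, as they might appear in a school column.
school_id <- c("S01", "S01", "S02", "S03", "S03")

# What the as.double() cast does to non-numeric strings: every value becomes NA
# (with a coercion warning, suppressed here).
id_bad <- suppressWarnings(as.double(school_id))
id_bad

# Going through a factor first yields one integer code per distinct id.
id_ok <- as.integer(factor(school_id))
id_ok
```

The resulting integer codes can then be stored as a new column in the data and passed as the id argument.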
