I am trying to make and test a linear model as follows:
lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)
This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not in the train data frame:
factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18
However, if I check those levels, they clearly appear in both data frames:
> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92
Tabulating the column for train and test also shows they have the same factor levels:
> table(train$Product_Category_1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
Table of Contents:
A simple example for walkthrough
Suggestion for users
Helpful information that we can get from the fitted model object
OK, I see what the problem is now, but how to make predict work?
Is there a better way to avoid such problem at all?
A simple example for walkthrough
Here is a simple reproducible example that hints at what has happened.
train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d
I am fitting a model with more parameters than data points, so the model is rank-deficient (to be explained at the end). However, this does not affect how lm and predict work here.
Checking table(train$f) and table(test$f) alone is not useful, because the problem is not caused by the variable f itself but by the NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases), before model fitting. They have to do so, as otherwise the underlying FORTRAN routine for QR factorization would fail because it cannot handle NA. If you check the documentation at ?lm, you will see that this function has an argument na.action, which defaults to na.omit. You can also set it to na.exclude, but na.pass, which retains NA, will cause a FORTRAN error:
fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'
Let's remove NA from the training dataset.
train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d
f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):
## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
This is not user-controllable. The reason is that, if an unused level were included, it would generate a column of zeros in the model matrix:
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
This is undesirable, as it would produce an NA coefficient for the dummy variable fd. With drop.unused.levels = TRUE, as forced by lm and glm:
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
The fd column is gone, and
mf$f
#[1] a b c
#Levels: a b c
The now non-existent "d" level will cause the "new factor level" error in predict.
Suggestion for users
It is highly recommended that all users do the following manually when fitting models:
[No. 1] remove incomplete cases;
[No. 2] drop unused factor levels.
This is exactly the procedure recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? It makes users aware of what lm and glm do under the hood, and makes debugging much easier.
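For example, a minimal sketch of these two steps, assuming the training data frame is called train:
train <- na.omit(train)     ## No. 1: remove incomplete cases
train <- droplevels(train)  ## No. 2: drop unused factor levels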
Note, there should be another recommendation in the list:
[No. 0] do subsetting yourself
Users may occasionally use the subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, so you may get "new factor levels" errors when using predict later.
The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Make your function return an informative error rather than waiting for lm and glm to complain.
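Here is a minimal sketch of such a defensive wrapper; the function name and the exact checks are illustrative, not taken from any package:
fit_lm_safely <- function(formula, data) {
  data <- droplevels(na.omit(data))   ## steps No. 1 and No. 2 from above
  fac <- vapply(data, is.factor, logical(1))
  bad <- names(data)[fac][vapply(data[fac], nlevels, integer(1)) < 2L]
  if (length(bad) > 0)
    stop("factor(s) with fewer than 2 levels after cleaning: ",
         paste(bad, collapse = ", "))
  lm(formula, data = data)
}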
Helpful information that we can get from the fitted model object
lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.
fit$xlevels
#$f
#[1] "a" "b" "c"
So if you have not followed the recommendations above and have run into trouble with factor levels, xlevels should be the first thing to inspect.
If you want to use something like table to count how many cases there are for each factor level, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although building a model matrix can use a lot of RAM.
OK, I see what the problem is now, but how to make predict work?
If you cannot switch to a different train/test split (see the next section), you need to set the factor levels that appear in test but not in xlevels to NA. predict will then simply return NA for those incomplete cases.
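For the toy example above (where the factor is f and the fitted model is fit), one way to do this is the following sketch; for the original question, replace f with Product_Category_1:
## set levels that the model has never seen to NA, then predict
test$f[!(test$f %in% fit$xlevels$f)] <- NA
predict(fit, newdata = test)   ## rows with NA in f now get an NA prediction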
Is there a better way to avoid such problem at all?
People split data into train and test because they want to do cross-validation. The first step is to apply na.omit to your full dataset to get rid of the NA noise. Then we could do a random partitioning of what is left, but this naive way may end up with:
some factor levels in test but not in train (oops, we get "new factor level" error when using predict);
some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);
So it is highly recommended that you do some more sophisticated partitioning, such as stratified sampling.
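As a rough sketch of stratified sampling by hand (dat and f are placeholders for the cleaned full dataset and the factor to stratify on):
set.seed(1)
## take at least 80% of the rows within each level of f, so no level is missing from train
idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$f),
                     function(i) i[sample.int(length(i), ceiling(0.8 * length(i)))]))
train <- dat[idx, ]
test  <- dat[-idx, ]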
There is in fact another hazard, but not causing programming errors:
the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).
Regarding rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problems for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? However, such issues are more difficult to avoid, particularly if you have many factors, possibly with interactions.
Examples of poor binning
It's a little unclear what your data look like; you should plot your predictors to get a better idea of what you are dealing with. Here is an example of how deficiency can be an issue in general.
When you cut count data into factors, you need to ensure that you don't have degenerate classes, i.e., classes with zero or near-zero representation. Use a bar plot of your class levels. You will note in the image that several classes are problematic in how this data set splits into dummy classes. If this is how the data were collected, then you are stuck with missing data. You can try K-nearest-neighbour imputation, but if too much data is missing you will likely have to recollect it, if it is research data (redo the experiment, re-observe the process, etc.). If the data are not reproducible, you will need to remove that predictor and annotate your findings to inform your audience.
See https://www.r-bloggers.com/2016/08/data-splitting/
The function createDataPartition from the caret package can be used to create balanced splits of the data, i.e., random stratified splits.
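For example, a hedged sketch stratifying on the factor from the question (full_data is a placeholder for the combined dataset):
library(caret)
set.seed(123)
## createDataPartition samples within each level of the supplied variable
idx   <- createDataPartition(full_data$Product_Category_1, p = 0.8, list = FALSE)
train <- full_data[idx, ]
test  <- full_data[-idx, ]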
Related
I am fitting a model with fixed effects for several covariates. In the model specification, two of these covariates are nested fixed effects. The error shown below occurs when fitting the model.
library(nlme)
library(lme4)
dados$VarCat=as.factor(dados$VarCat)
dados$VarX5=as.factor(dados$VarX5)
dados$VarX6=as.factor(dados$VarX6)
modelANew <- lme(log(Resp)~log(VarX1)+log(VarX2)+(VarX3)+(VarX4)+VarX5/VarX6 ,random = ~1|VarCat,
dados, method="REML")
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
VarX6 is a dichotomous variable. It seems to be interfering with the convergence or estimation of the model. How can I solve this?
Your data are unbalanced in a way that makes the fixed-effect model rank-deficient (or multicollinear, if you prefer). When you include VarX5/VarX6 you are stating that you want to estimate effects for all combinations of VarX5 and VarX6. However:
with(dd, table(VarX6,VarX5))
VarX5
VarX6 A B H IND Q S T
0 2 9 94 155 0 1 15
1 0 0 0 0 8 0 0
Only VarX5=Q is ever measured at the VarX6=1 level, and it's never measured at the VarX6=0 level. This means the VarX6 variable, and its interaction with VarX5, are redundant information.
As pointed out in the comments, if you run this in lme4::lmer() it will automatically drop the redundant columns for you, with a message:
library(lme4)
m2 <- lmer(log(Resp)~log(VarX1)+log(VarX2)+(VarX3)+(VarX4)+
VarX5/VarX6 + (1|VarCat),
dd, REML=TRUE)
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
You can find out which columns it dropped via attr(getME(m2,"X"), "col.dropped").
Alternatively, if you fit it in lm() (I know you want to fit a mixed model, but this is a good diagnostic) you'll see that it doesn't complain, but it automatically sets all of the redundant coefficients to NA:
m3 <- lm(log(Resp)~log(VarX1)+log(VarX2)+(VarX3)+(VarX4)+
VarX5/VarX6, data=dd)
coef(m3)
(Intercept) log(VarX1) log(VarX2) VarX3 VarX4
0.46921538 0.79476848 -0.45769296 1.85386835 -2.78321092
VarX5B VarX5H VarX5IND VarX5Q VarX5S
-0.04677216 0.21896140 0.24584351 -2.00226719 0.32677006
VarX5T VarX5A:VarX61 VarX5B:VarX61 VarX5H:VarX61 VarX5IND:VarX61
0.17474369 NA NA NA NA
VarX5Q:VarX61 VarX5S:VarX61 VarX5T:VarX61
NA NA NA
This question is very similar to Singularity in backsolve at level 0, block 1 in LME model. When you have unbalanced designs like this, "what to do about it" is not a question with a single simple answer.
you could remove terms from the model yourself (e.g. in this case you can't really estimate anything about VarX6, since it is completely redundant with VarX5, so replace VarX5/VarX6 in your model with VarX5; see the sketch after this list).
you could use a function such as lmer that can automatically remove terms for you
What you can't do is actually estimate VarX5/VarX6 - your design just doesn't include that information. It's a little like saying "I want to estimate the effect of car colour on speed, but I only measured red cars".
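A minimal sketch of the first option, refitting the model from the question with the redundant nested term removed (the data frame and variable names are taken from the question; this is untested since the data are not available):
library(nlme)
## VarX6 is completely confounded with VarX5 in this design, so drop it
modelA_reduced <- lme(log(Resp) ~ log(VarX1) + log(VarX2) + VarX3 + VarX4 + VarX5,
                      random = ~ 1 | VarCat, data = dados, method = "REML")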
I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
summary(
NB_GAD_uw_int <- glm.nb(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data=eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
summary(
ZINB_GAD_uw_int <- zeroinfl(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data = eurodata,
dist = "negbin",
model = TRUE,
y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say that this is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused about these answers because: 1) In this case, none of my predictors are correlated more highly than .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which are correlated about .45. The same predictors are used in other ZINB models that have worked. 2) With 1206 participants, and having run the same ZINB model with similarly distributed count data for other disorders, I'm a little confused how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!
The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. This is usually preferred over manually dummifying categorical variables, and should be done to avoid mistakes and to ensure full rank of your model matrix (a full rank matrix is non-singular).
You can of course dummify categorical variables yourself; in that case I would use model.matrix to transform your input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
DV = rnorm(100),
IV1_num = rnorm(100),
IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = T))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
The manually dummified input data would then be
df.dummified <- cbind.data.frame(DV = df$DV, mat[, -1])
# DV IV1_num IV2_catB IV2_catC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
glm.nb(DV ~ ., data = df.dummified)
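The same pattern should carry over to zeroinfl. Here is a hedged sketch using the column names from the question (untested, since the data are not available):
library(pscl)
## build the dummified model frame for the GAD model, then refit the ZINB model on it
mm <- model.matrix(~ ZMean_winz * ZDis_winz + age_years + Nottingham_dummy +
                     Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy +
                     Paris_dummy + Dresden_dummy, data = eurodata)
eurodata_dummified <- cbind.data.frame(
  dawbac_bl_GAD_sxs_uw = eurodata$dawbac_bl_GAD_sxs_uw, mm[, -1])
ZINB_GAD_uw_int <- zeroinfl(dawbac_bl_GAD_sxs_uw ~ ., data = eurodata_dummified,
                            dist = "negbin")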
I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous (in addition to the response variable, which is also obviously categorical/binary).
When calling summary(model_name), is there a way to include a column representing the number of observations within each factor level?
I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous.
If all your covariates are factors (not including the intercept), this is fairly easy, as the model matrix only contains 0s and 1s, and the number of 1s in each column indicates how often that factor level (or interaction level) occurs in your data. So just do colSums(model.matrix(your_glm_model_object)).
Since a model matrix has column names, colSums will give you a named vector whose names are consistent with those of coef(your_glm_model_object).
The same solution applies to a linear model (by lm) and a generalized linear model (by glm) for any distribution family.
Here is a quick example:
set.seed(0)
f1 <- sample(gl(2, 50)) ## a factor with 2 levels, each with 50 observations
f2 <- sample(gl(4, 25)) ## a factor with 4 levels, each with 25 observations
y <- rnorm(100)
fit <- glm(y ~ f1 * f2) ## or use `lm`, since we use the `gaussian()` family here
colSums(model.matrix(fit))
#(Intercept) f12 f22 f23 f24 f12:f22
# 100 50 25 25 25 12
# f12:f23 f12:f24
# 12 14
Here, we have 100 observations / complete-cases (indicated under (Intercept)).
Is there a way to display the count for the baseline level of each factor?
Baseline levels are contrasted, so they don't appear in the model matrix used for fitting. However, we can generate the full model matrix (without contrasts) from your formula, not your fitted model (this also offers you a way to drop numeric variables if you have them in your model):
SET_CONTRAST <- list(f1 = contr.treatment(nlevels(f1), contrasts = FALSE),
                     f2 = contr.treatment(nlevels(f2), contrasts = FALSE))
X <- model.matrix(~ f1 * f2, contrasts.arg = SET_CONTRAST)
colSums(X)
#(Intercept) f11 f12 f21 f22 f23
# 100 50 50 25 25 25
# f24 f11:f21 f12:f21 f11:f22 f12:f22 f11:f23
# 25 13 12 13 12 13
# f12:f23 f11:f24 f12:f24
# 12 11 14
Note that it can quickly become tedious in setting contrasts when you have many factor variables.
model.matrix is definitely not the only approach for this. The conventional way may be
table(f1)
table(f2)
table(f1, f2)
but this could also get tedious when your model becomes complicated.