Generalized linear model with multiple dependent variables in R (Poisson)

I am fitting a generalized linear model with multiple variables in R. My data (young) looks like the sample below: I have 5 DVs (dv1, dv2, dv3, dv4, dv5) and three IVs (IV1, IV2, IV3) in a data frame. I keep getting the error below. Can someone please tell me what I am doing wrong?
> head(young)
  IV1 IV2 IV3 DVS
1  18   1   1 dv1
2  20   1   1 dv1
3  21   2   1 dv1
4  21   1   2 dv1
5  22   1   1 dv1
6  22   1   1 dv1
> models <- list()
> dvnames <- paste("DVS", 1:5, sep='')
> ivnames <- paste("IV", 1:3, sep='') ## for some value of 3
> for (y in dvnames){
+ form <- formula(paste(y, "~", ivnames))
+ models[[y]] <- glm(form, data=young, family='poisson') }
Error in eval(expr, envir, enclos) : object 'DVS1' not found

It is easy to see why you get the error. On the first iteration of the loop, y takes the value "DVS1". In a model formula, R will look for a variable named DVS1 in young. As your data show, there is no column with that name (nor, presumably, an object with that name within the scope of glm()), hence:
Error in eval(expr, envir, enclos) : object 'DVS1' not found
That message is quite correct.
Now the more important question becomes: what are you trying to do? You seem to be fitting a Poisson model, but you say the response variables are in a single column, DVS, which R will treat as a factor. Where are the count data that you wish to model as some function of IV1, IV2, and IV3?
R expects something it can interpret as a numeric count on the left-hand side of the formula (where you are putting y).
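If the counts are actually stored in separate columns, one per response, note that the loop has a second bug as well: paste(y, "~", ivnames) pastes element-wise over the three IV names and returns three strings, not one formula. A minimal sketch of a corrected loop, assuming hypothetical count columns dv1 through dv5 in young (your posted data does not contain them, so adapt the names):
## Hypothetical sketch: assumes young has numeric count columns dv1..dv5
dvnames <- paste0("dv", 1:5)
ivnames <- paste0("IV", 1:3)

models <- list()
for (y in dvnames) {
  ## collapse the predictors into one right-hand side: "IV1 + IV2 + IV3"
  form <- as.formula(paste(y, "~", paste(ivnames, collapse = " + ")))
  models[[y]] <- glm(form, data = young, family = "poisson")
}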

Related

Why am I getting a "factor has new levels" error while running prediction code for linear regression? [duplicate]

This question already has answers here:
"Factor has new levels" error for variable I'm not using
(3 answers)
Closed 1 year ago.
I am trying to make and test a linear model as follows:
lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)
This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not in the train data frame:
factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18
However, if I check these values, they definitely appear in both data frames:
> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92
Tabulating the column for train and test also shows that they have the same factor levels:
> table(train$Product_Category_1)
     1     2     3    4      5     6    7     8   9   10    11   12   13   14   15   16  17   18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)
    1    2    3    4     5    6   7     8  9   10   11  12   13  14   15   16  17  18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
Table of Contents:
A simple example for walkthrough
Suggestion for users
Helpful information that we can get from the fitted model object
OK, I see what the problem is now, but how to make predict work?
Is there a better way to avoid such problem at all?
A simple example for walkthrough
Here is a simple reproducible example to show what has happened.
train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d
I am fitting a model with more parameters than data points, so the model is rank-deficient (to be explained at the end). However, this does not affect how lm and predict work here.
Checking table(train$f) and table(test$f) is not useful, because the problem is not caused by the variable f but by the NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases), for model fitting. They have to, as otherwise the underlying FORTRAN routine for QR factorization would fail, because it cannot handle NA. If you check the documentation at ?lm, you will see that this function has an na.action argument, which defaults to na.omit. You can also set it to na.exclude, but na.pass, which retains NA, will cause a FORTRAN error:
fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'
Let's remove NA from the training dataset.
train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d
f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):
## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
This is not user controllable, and for good reason: if an unused level were included, it would generate a column of zeros in the model matrix.
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
This is undesirable, as it produces an NA coefficient for the dummy variable fd. With drop.unused.levels = TRUE, as forced by lm and glm:
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
The fd column is gone, and
mf$f
#[1] a b c
#Levels: a b c
Since level "d" no longer exists in the fitted model, encountering it in new data triggers the "new factor level" error in predict.
Suggestion for users
It is highly recommended that all users do the following manually when fitting models:
[No. 1] remove incomplete cases;
[No. 2] drop unused factor levels.
This is exactly the procedure recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? It makes users aware of what lm and glm do under the hood, and makes debugging much easier.
One more recommendation belongs at the top of the list:
[No. 0] do the subsetting yourself
Users may occasionally use the subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, so you may get the "new factor levels" error when using predict later.
The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust: have your function return an informative error rather than waiting for lm and glm to complain.
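A minimal sketch of recommendations [No. 0] to [No. 2], assuming a full data frame dat and a vector of training row indices idx (both placeholder names):
## placeholder names: dat is the full data frame, idx the training rows
train <- dat[idx, ]          ## [No. 0] do the subsetting yourself
train <- na.omit(train)      ## [No. 1] remove incomplete cases
train <- droplevels(train)   ## [No. 2] drop unused factor levels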
Helpful information that we can get from the fitted model object
lm and glm return an xlevels component in the fitted object. It contains the factor levels actually used for model fitting.
fit$xlevels
#$f
#[1] "a" "b" "c"
So if you have not followed the recommendations above and have run into trouble with factor levels, xlevels should be the first thing to inspect.
If you want to use something like table to count how many cases there are for each factor level, here is one way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although building a model matrix can use a lot of RAM.
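As one simple variant of that idea (counting via the model frame rather than the model matrix; interactions need the linked approach), reusing fit from the walkthrough above:
## count cases per factor level actually used in fitting
table(model.frame(fit)$f)
#a b c
#1 1 1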
OK, I see what the problem is now, but how to make predict work?
If you cannot work with a different pair of train and test datasets (see the next section), you need to set the factor levels that appear in test but not in xlevels to NA. predict will then simply return NA for such incomplete cases.
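A sketch of that fix, reusing fit and test from the walkthrough above: re-code the factor with exactly the training levels, so any unseen level becomes NA.
## re-code f using only the levels seen during fitting;
## the value "d", unseen in training, becomes NA
test$f <- factor(test$f, levels = fit$xlevels$f)
predict(fit, newdata = test)
## the row where f was "d" now yields an NA prediction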
Is there a better way to avoid such problem at all?
People split data into train and test because they want to do cross-validation. The first step is to apply na.omit to your full dataset to get rid of NA noise. Then we could do a random partition of what is left, but this naive approach may end up with
some factor levels in test but not in train (oops, we get the "new factor level" error when using predict);
some factor variables in train with only 1 level left after unused levels are removed (oops, we get the "contrasts" error when using lm and glm).
So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.
There is in fact another hazard, though one that does not cause programming errors:
the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).
Regarding rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problems for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? Such issues are harder to avoid, however, particularly if you have many factors, possibly with interactions.
Examples of poor binning
It's a little unclear what your data look like; you should plot your predictors to get a better idea of what you are dealing with. Here is an example of how deficiency can be an issue in general.
When you cut count data into factors, you need to ensure that you don't have degenerate classes, i.e., classes with zero or near-zero representation. Use a bar plot of your class levels. You will note in the image that several classes are problematic in how this dataset splits into dummy classes. If this is how the data were collected, then you are stuck with missing data. You can try K-nearest-neighbour imputation, but if too much data is missing you will likely have to recollect the data, if it is research data (redo the experiment, re-observe the process, etc.). If the data are not reproducible, you will need to remove that predictor and annotate your findings to inform your audience.
See https://www.r-bloggers.com/2016/08/data-splitting/ for background.
The createDataPartition function in the caret package can be used to create balanced splits of the data (random stratified splits).
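A sketch of such a split, assuming the stratification variable is a factor column f in a data frame dat (placeholder names):
## stratified 80/20 split with caret (placeholder names dat, f)
library(caret)
dat <- na.omit(dat)    ## remove NA noise first, as advised above
set.seed(1)
idx   <- createDataPartition(dat$f, p = 0.8, list = FALSE)
train <- droplevels(dat[idx, ])
test  <- dat[-idx, ]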

Logistic regression: prediction reports a new level in the data [duplicate]

This question already has answers here:
"Factor has new levels" error for variable I'm not using
(3 answers)
Closed 1 year ago.

How to debug the "factor has new levels" error for a linear model and prediction [duplicate]

This question already has answers here:
"Factor has new levels" error for variable I'm not using
(3 answers)
Closed 1 year ago.

"contrasts can be applied only to factors with 2 or more levels" despite having two levels [duplicate]

This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Closed 5 years ago.
I was trying out linear regression and observed this error even though all of my factor columns have at least two levels.
I tracked the error down to the column that triggers it; this is the summary of that column:
> summary(df[,30])
    0  1 <NA>
31543 14    0
> unique(df[,30])
[1] 0 1
Levels: 0 1 <NA>
I have also eliminated all rows which have an NA value by doing the following
df = na.omit(df)
Please note that the <NA> above is an additional factor level I added using the addNA function.
How do I resolve this?
EDIT: I have placed a reproducible example at my public share at http://aftabubuntu.cloudapp.net/ . Please download the reproduce.RDS file from there.
This is the code I'm using
df = readRDS('reproduce.RDS')
model = lm(formula = COL_101~.,data=traindf)
predict.lm(model, df[1:5,])
This is my output
> model = lm(formula = COL_101~.,data=df)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
This isn't quite an answer, though it possibly could be if it turns out to demonstrate the issue. I can recreate data that looks like yours but that works, as follows.
set.seed(5)
df <- data.frame(y=rnorm(100), x=addNA(rep(c(0,1), c(80,20))))
table(df$x)
## 0 1 <NA>
## 80 20 0
lm(y~x, data=df)
## Call:
## lm(formula = y ~ x, data = df)
##
## Coefficients:
## (Intercept) x1
## 0.007601 0.120172
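For completeness, here is a hedged sketch (with made-up data) of how this class of error typically arises even when a column appears to have two levels: na.omit can silently remove every row carrying one of the levels, and lm then drops that now-unused level, leaving a one-level factor.
## made-up data: level "b" of f occurs only in rows where x is NA
df <- data.frame(y = rnorm(5),
                 x = c(1, 2, 3, NA, NA),
                 f = factor(c("a", "a", "a", "b", "b")))
df <- na.omit(df)        ## removes the only rows where f == "b"
table(droplevels(df)$f)
#a
#3
## lm(y ~ x + f, data = df) now fails with
## "contrasts can be applied only to factors with 2 or more levels"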

Ancova using variable selected from loop (R)

I have 1,000 dependent variables (y) to use in ANCOVA. The independent variables (x) are the same in all models.
Yvar1 Yvar2 ... Yvar1000 x1 x2
    1     2 ...        5 11 16
    2     3 ...        6 18 23
I need R code that substitutes each Yvar's values into example$Y.to.use in turn.
I tried this:
for (i in 1:1000) {
  example$Y.to.use <- example$paste("Yvar", i, sep = "")
  # Error: attempt to apply non-function
  paste("fitted.model", i) <- lme(log(Y.to.use + 1) ~ x1*x2, data = example, random = ~1 | Exp/Person)
}
# Error in paste("fitted.model", i, sep = "") <- lme(log(Y.to.use+1) ~ x1*x2 : target of assignment expands to non-language object
I will then build a table with the coefficients from each Yvar's fitted model.
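Neither error comes from lme itself: $ cannot take a pasted name (use [[ ]] instead), and paste(...) <- ... is not a valid assignment target (store the fits in a list). A sketch under those assumptions, with Exp and Person taken from the random-effects formula in the question:
## sketch: assumes example holds Yvar1..Yvar1000, x1, x2, Exp, Person
library(nlme)

fitted.models <- vector("list", 1000)
for (i in 1:1000) {
  ## [[ ]] accepts a character column name; $ does not
  example$Y.to.use <- example[[paste0("Yvar", i)]]
  fitted.models[[i]] <- lme(log(Y.to.use + 1) ~ x1 * x2,
                            data = example, random = ~ 1 | Exp/Person)
}

## e.g. one row of fixed-effect coefficients per Yvar
coef.table <- t(sapply(fitted.models, fixef))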
