R: factor has new levels when I predict with test data

I am getting an error on my dataset; the code below reproduces the same logic. I tried increasing the amount of training data, but that didn't solve it. I have already excluded all NA values.
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor y has new levels L, X
set.seed(234)
d <- data.frame(w=abs(rnorm(50)*1000),
x=rnorm(50),
y=sample(LETTERS[1:26], 50, replace=TRUE))
train_idx <- sample(1:nrow(d), floor(0.8*nrow(d)))
train <- d[train_idx,]
test <- d[-train_idx,]
fit <- lm(w ~x + y, data=train)
predict(fit, test)

As @jdobres has already explained why this error occurs, I'll jump straight to a workaround:
Try the line of code below just before the predict statement.
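# add all values of 'y' in the 'test' dataset to fit$xlevels[["y"]] in the fit object
# (as.character() covers the case where 'y' is a character column rather than a factor)
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], unique(as.character(test[["y"]])))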
Hope this resolves your problem!

Factor and character data are treated as categorical variables. As such, models cannot form predictions for category labels they've never seen before. If you built a model to predict things about "poodle" and "pit bull", the model would fail if you gave it "golden retriever".
More specific to your example, the error is telling you that labels "L" and "X", which are in your test set, do not appear in your training set. Since they weren't in the training set, the model doesn't know what to do when it encounters these in the test.
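To see which labels are causing the error, one option is to compare the values in the two sets directly. The sketch below reuses the question's example data; the names unseen and test_known are made up for illustration, and dropping the affected rows is only one possible way to handle them.

```r
set.seed(234)
d <- data.frame(w = abs(rnorm(50) * 1000),
                x = rnorm(50),
                y = sample(LETTERS[1:26], 50, replace = TRUE))
train_idx <- sample(1:nrow(d), floor(0.8 * nrow(d)))
train <- d[train_idx, ]
test <- d[-train_idx, ]

# Labels that occur in the test set but never in the training set
unseen <- setdiff(unique(as.character(test$y)), unique(as.character(train$y)))
print(unseen)

fit <- lm(w ~ x + y, data = train)

# Predict only for the rows whose label the model has seen
test_known <- test[!(test$y %in% unseen), ]
pred <- predict(fit, test_known)
```

The rows that were filtered out still need a decision of their own (for example, a fallback prediction), since the model has no coefficient for those labels.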

Thanks Prem, and if you have many variables you can loop the line of code like this:
for(k in vars){
  if(is.factor(shop_data[,k])){
    ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]], levels(shop_data[[k]]))
  }
}
Here vars holds the names of the variables used in the model, and shop_data is the main dataset that is split into train and test.

Related

How can I include both my categorical and numeric predictors in my elastic net model?

As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput- it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error.
I have been trying to estimate an elastic net model in R using glmnet. However, I keep getting an error, and I am not sure what is causing it. The error happens when I train the model. It sounds like it has something to do with the data type and the matrix.
I have provided a sample dataset. First I set the outcomes and certain predictors to be factors, and then I label their levels. Next, I create an object, pred.names.min, with the column names of the predictors I want to use. Then I partition the data into training and test data frames: 65% in training, 35% in test. With the trainControl function, I specify a few things I want to happen with the model: random search for the lambda and alpha parameters, and the leave-one-out method. I also specify that it is a classification model (categorical outcome). In the last step, I specify the training model. I write my code to tell it to use all of the predictor variables in the pred.names.min object from the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet)
library(caret)
#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"=
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"=
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))
#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)
#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No",
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No",
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50",
">=50"))
#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]
#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]
#specifying that I want to use the leave-one-out cross-validation
#method and use "random" as the search for elastic net
tcontrol <- trainControl(method = "LOOCV",
search="random",
classProbs = TRUE)
#training model
elastic_model1 <- train(as.matrix(trainingset[,
pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a
method for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
I tried removing the as.matrix call:
elastic_model1 <- train((trainingset[, pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
It still produces a similar error.
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method
for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This
will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels that
can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The recommendation currently is to dummy code and re-code to numeric where possible:
Using LASSO in R with categorical variables
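As a rough sketch of what dummy coding can look like, base R's model.matrix expands factor columns into 0/1 indicator columns and leaves numeric columns as they are. The data frame below is a cut-down, illustrative version of the question's df (three columns, six rows), not the real data.

```r
# Cut-down stand-in for the question's data: two factors and one numeric column
df <- data.frame(
  BMIfactor = factor(c(1, 2, 3, 2, 3, 1),
                     labels = c("<40", ">=40and<50", ">=50")),
  age = c(0, 4, 8, 1, 2, 7),
  L_TartaricacidArea = factor(c(0, 1, 1, 0, 1, 1), labels = c("No", "Yes"))
)

# Expand factors into indicator columns; [, -1] drops the intercept column
x <- model.matrix(~ ., data = df)[, -1]

print(colnames(x))
print(is.numeric(x))
```

The result is an all-numeric matrix, so it could be passed as the x argument in place of as.matrix(trainingset[, pred.names.min]), with the outcome factor supplied separately as y.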

Subscript out of bound error in predict function of randomforest

I am using a random forest for prediction, and at the predict(fit, test_feature) line I get the following error. Can someone help me overcome this? I followed the same steps with another dataset and had no error, but I get an error here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies<- c()
k=10
n= floor(nrow(train_set)/k)
for(i in 1:k){
  sub1 <- ((i-1)*n+1)
  sub2 <- (i*n)
  subset <- sub1:sub2
  train <- train_set[-subset, ]
  test <- train_set[subset, ]
  test_feature <- test[ ,-487]
  True_Label <- as.factor(test[ ,487])
  fit <- randomForest(x= train[ ,-487], y= as.factor(train[ ,487]))
  prediction <- predict(fit, test_feature) #The error line
  correctlabel <- prediction == True_Label
  t <- table(prediction, True_Label)
}
I had a similar problem a few weeks ago.
To work around it, you can do this:
df$label <- factor(df$label)
Try the factor function instead of as.factor. Also, try naming your label variable first.
Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my columns, because my data was a matrix and its column names were all empty, i.e. "".
Your question is not very clear, but I'll try to help.
First of all check your data to see the distribution in levels of your various predictors and outcomes.
You may find that some of your predictor levels or outcome levels are very highly skewed, or some outcomes or predictor levels are very rare. I got that error when I was trying to predict a very rare outcome with a heavily tuned random forest, and so some of the predictor levels were not actually in the training data. Thus a factor level appears in the test data that the training data thinks is out of bounds.
Alternatively, check the names of your variables before calling predict() to make sure that they match.
Without your data files, it's hard to tell why your first example worked.
For example, you can try:
names(test) <- names(train)
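Before overwriting names, it may be worth checking how the two sets actually differ. A minimal sketch with two small made-up data frames (train_x and test_x are illustrative, not from the question):

```r
# Made-up predictor sets whose column names do not fully match
train_x <- data.frame(a = 1:3, b = 4:6)
test_x <- data.frame(a = 7:9, c = 10:12)

# Columns the model was trained on that the test set lacks
missing_in_test <- setdiff(names(train_x), names(test_x))
# Columns in the test set that the model never saw
extra_in_test <- setdiff(names(test_x), names(train_x))

print(missing_in_test)
print(extra_in_test)
```

If both results are empty, the names match and the cause is probably elsewhere. Note that names(test) <- names(train) only makes sense when the columns are the same variables in the same order.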
Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)

Setting contrasts for part of the factors in a model.matrix

I have an experimental design to which I'd like to fit a linear regression model.
Here's the design data.frame:
design.df <- data.frame(batch=rep(c(1:3,1:3),4),
species=rep(c(rep("mouse",3),rep("rat",3)),4),
sex=rep(c(rep("M",12),rep("F",12))),
stringsAsFactors = F)
design.df$species and design.df$sex are both factors:
design.df$species <- factor(design.df$species,levels=c("mouse","rat"))
design.df$sex <- factor(design.df$sex,levels=c("F","M"))
The contrast encoding of design.df$species should be contr.treatment whereas that of design.df$sex should be contr.sum.
To set it up as a model.matrix, I thought perhaps this could work:
contrasts.list <- list(batch=NA,species="contr.treatment",sex="contr.sum")
design.mat <- model.matrix(as.formula(paste0("~",paste(model.factors,collapse="+"))),contrasts=contrasts.list,data=design.df)
Obviously it doesn't work according to the error I get:
Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]) :
contrasts apply only to factors
So my question is how do I get the model.matrix from design.df according to the contrasts.list I specify?
You are using a variable model.factors that's not defined anywhere. Not sure what the goal is. If you just wanted all these values as covariates, you can do
contrasts.list <- list(species="contr.treatment", sex="contr.sum")
design.mat <- model.matrix(~., contrasts=contrasts.list, data=design.df)
Note that your contrasts.list should only have values for the factor variables. Do not include batch.
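Put together with the question's design, a self-contained version of this might look as follows. The only changes are that the factors are created right after the data frame, and that the argument is spelled out as contrasts.arg, its full name in model.matrix.

```r
design.df <- data.frame(batch = rep(c(1:3, 1:3), 4),
                        species = rep(c(rep("mouse", 3), rep("rat", 3)), 4),
                        sex = rep(c(rep("M", 12), rep("F", 12))),
                        stringsAsFactors = FALSE)
design.df$species <- factor(design.df$species, levels = c("mouse", "rat"))
design.df$sex <- factor(design.df$sex, levels = c("F", "M"))

# Contrasts are listed only for the factor columns; batch stays numeric
contrasts.list <- list(species = "contr.treatment", sex = "contr.sum")
design.mat <- model.matrix(~ ., data = design.df, contrasts.arg = contrasts.list)

print(colnames(design.mat))
print(attr(design.mat, "contrasts"))
```

With contr.sum, the sex column is coded 1 for the first level (F) and -1 for the last (M), while species keeps the usual 0/1 treatment coding.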

R, Caret, train(), predict(), GBM, Error: Error in model.frame.default(..): Factor has new levels

I have a pretty good idea of what is happening, but I'm wondering how to handle the error. I've seen other posts similar to this, but they were not specific to Gradient Boosting Machine models. They all seem to be related to GLMs, and I don't think the error is being caused by the same thing.
Here's my code:
myTuneGrid <- expand.grid(n.trees=c(100,200), interaction.depth=c(9,10,11,12), shrinkage=0.1, n.minobsinnode=10)
fitControl <- trainControl(method = "cv", number =5,verboseIter = FALSE,returnResamp = "all")
myModel <- train(as.factor(target) ~ .,data = trainingDataC.GB, method = "gbm",trControl = fitControl,tuneGrid = myTuneGrid)
myPrediction <- predict(myModel,newdata=testDataC)
Here's my error:
Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : factor 47V has new levels E, H, J
So my factor variable has a bunch of levels, but from the error I'm guessing not all of them are represented in my training set. When I go to my test set, there are new levels that were not in my training set, so I'm getting this error?
This is a supervised learning problem, I can't change the test set and move data to the training set. So it's not a sampling problem.
Anyway, does anyone know any settings or quick fixes so that this doesn't cause my program to crash?
This happens quite a bit in Kaggle competitions, and you see the following fix a lot in Kaggle scripts. You can combine the values from train and test to build a levels argument, which makes sure the factor contains all the levels in both sets.
See this very simple example based on mtcars. You just need to pass the variable name in quotes (e.g. "cyl"), and the variable will be set to a factor in both the train and test sets with all the levels available in both. This only prevents your model from giving an error. It does not mean that the model will learn anything from the factor levels that are not available in the training set.
train <- subset(mtcars, cyl < 8)
test <- subset(mtcars, cyl >= 8)
fact_train_test <- function(x) {
  levels <- unique(c(train[[x]], test[[x]]))
  train[[x]] <<- factor(train[[x]], levels=levels)
  test[[x]] <<- factor(test[[x]], levels=levels)
}
fact_train_test("cyl")
There are probably other methods of doing this, but it works.
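One such alternative, sketched here as an assumption rather than a recommendation, avoids the global assignment (<<-) by taking both data frames as arguments and returning the updated copies. The function name align_factor_levels is made up for this example.

```r
# Hypothetical helper: give one column the same factor levels in both sets
align_factor_levels <- function(train, test, col) {
  all_levels <- sort(unique(c(as.character(train[[col]]),
                              as.character(test[[col]]))))
  train[[col]] <- factor(as.character(train[[col]]), levels = all_levels)
  test[[col]] <- factor(as.character(test[[col]]), levels = all_levels)
  list(train = train, test = test)
}

train <- subset(mtcars, cyl < 8)
test <- subset(mtcars, cyl >= 8)

aligned <- align_factor_levels(train, test, "cyl")
print(levels(aligned$train$cyl))
print(levels(aligned$test$cyl))
```

The same caveat applies: the levels now agree, but the model still has no information about levels that never occur in the training rows.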

How to subset a dataset such that the test set contains

I built a linear regression model (lm.full) and I'm trying to test the model on a test data set. I'm running into an issue due to a feature / predictor with many unique values when I try to predict based on the test data. The troublesome feature is cbsa (Core Based Statistical Area).
The train and test sets have the same unique values. I'm not sure what the issue is, because if each level of the factor variable is fit in the training model, then I think I should be able to predict on the test set.
I divided the data here for the test and training sets:
sample.size<-floor(0.95*nrow(tvwm))
# Make sure that seeds different
set.seed(15)
tvwm_train_ind <- sample(seq_len(nrow(tvwm)), size = sample.size)
tvwm_train <- tvwm[tvwm_train_ind,]
tvwm_test <- tvwm[-tvwm_train_ind,]
And here is the prediction:
> predict(object=lm.full, newdata=tvwm_test, type = "response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor factor(cbsa_name) has new levels Boston-Cambridge-Newton, MA-NH, Detroit-Warren-Livonia, MI, Virginia Beach-Norfolk-Newport News, VA-NC
Try
all(levels(tvwm_test$cbsa_name) %in% levels(tvwm_train$cbsa_name))
all(levels(tvwm_train$cbsa_name) %in% levels(tvwm_test$cbsa_name))
and make sure they are both TRUE. Or, as Gregor suggested below in his comment, you can do it in one statement:
identical(levels(tvwm_test$cbsa_name), levels(tvwm_train$cbsa_name))
If they are not both TRUE, and you are certain that both the training set and the test set have the same factor levels in the data, then run the following to reset the levels:
tvwm_train$cbsa_name <- factor(tvwm_train$cbsa_name)
tvwm_test$cbsa_name <- factor(tvwm_test$cbsa_name)
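If the check shows that the test set really does contain values the training set lacks, one possible approach is to adjust the split itself: after the random split, move any test rows with an unseen value into the training set. The sketch below uses a small made-up data frame (dat) in place of tvwm, and the model would need to be refit on the adjusted training set.

```r
set.seed(15)
# Made-up stand-in for tvwm, with two rare categories
dat <- data.frame(
  cbsa_name = sample(c("A", "B", "C", "D"), 40, replace = TRUE,
                     prob = c(0.45, 0.45, 0.05, 0.05)),
  value = rnorm(40),
  stringsAsFactors = FALSE
)

train_ind <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
in_train <- seq_len(nrow(dat)) %in% train_ind

# Test rows whose category never occurs among the training rows
unseen <- !in_train & !(dat$cbsa_name %in% dat$cbsa_name[in_train])
in_train[unseen] <- TRUE  # move those rows into the training set

dat_train <- dat[in_train, ]
dat_test <- dat[!in_train, ]
```

This changes the split proportions slightly, which matters more when rare categories are common.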
