How to subset a dataset such that the test set contains - r

I built a linear regression model (lm.full) and I'm trying to test the model on a test data set. I'm running into an issue due to a feature / predictor with many unique values when I try to predict based on the test data. The troublesome feature is cbsa (Core Based Statistical Area).
The train and the test have the same unique values. I'm not sure what the issue is: if every level of the factor variable was fit in the training model, I should be able to predict values for the test set.
I divided the data here for the test and training sets:
sample.size<-floor(0.95*nrow(tvwm))
# Make sure that the seeds are different
set.seed(15)
tvwm_train_ind <- sample(seq_len(nrow(tvwm)), size = sample.size)
tvwm_train <- tvwm[tvwm_train_ind,]
tvwm_test <- tvwm[-tvwm_train_ind,]
And here is the prediction:
> predict(object=lm.full, newdata=tvwm_test, type = "response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor factor(cbsa_name) has new levels Boston-Cambridge-Newton, MA-NH, Detroit-Warren-Livonia, MI, Virginia Beach-Norfolk-Newport News, VA-NC

Try
all(levels(tvwm_test$cbsa_name) %in% levels(tvwm_train$cbsa_name))
all(levels(tvwm_train$cbsa_name) %in% levels(tvwm_test$cbsa_name))
and make sure they are both TRUE. Or, as Gregor suggested below in his comment, you can do it in one statement:
identical(levels(tvwm_test$cbsa_name), levels(tvwm_train$cbsa_name))
If they are not both TRUE, and you are certain that both the training set and the test set have the same factor levels in the data, then run the following to reset the levels:
tvwm_train$cbsa_name <- factor(tvwm_train$cbsa_name)
tvwm_test$cbsa_name <- factor(tvwm_test$cbsa_name)
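A complementary check, sketched here under assumptions: levels() is carried over when a data frame is subset, so both comparisons above can be TRUE even when some value occurs only in the test rows. The sketch below compares the values actually present and builds a split that forces at least one row of every level into the training set. The toy data frame and the helpers observed_levels and split_keep_levels are hypothetical names for illustration, not part of the original post.

```r
# Toy data: a hypothetical stand-in for tvwm with one rare category.
set.seed(15)
toy <- data.frame(
  value = rnorm(12),
  cbsa_name = factor(c(rep("Boston", 5), rep("Detroit", 5), "Norfolk", "Norfolk"))
)

# levels() survives subsetting, so compare the values that actually occur.
observed_levels <- function(f) unique(as.character(f))

# Split so that every level has at least one row in the training set.
split_keep_levels <- function(data, column, train_fraction) {
  rows <- seq_len(nrow(data))
  # One randomly chosen row per level is forced into training.
  forced <- unlist(lapply(
    split(rows, data[[column]], drop = TRUE),
    function(idx) idx[sample.int(length(idx), 1)]
  ))
  remaining <- setdiff(rows, forced)
  n_extra <- max(0, floor(train_fraction * nrow(data)) - length(forced))
  extra <- remaining[sample.int(length(remaining), min(n_extra, length(remaining)))]
  train_rows <- sort(c(forced, extra))
  list(train = data[train_rows, , drop = FALSE],
       test  = data[-train_rows, , drop = FALSE])
}

parts <- split_keep_levels(toy, "cbsa_name", 0.75)

# Values that occur in the test set but not in the training set.
missing_in_train <- setdiff(observed_levels(parts$test$cbsa_name),
                            observed_levels(parts$train$cbsa_name))
missing_in_train  # empty by construction
```

With a split like this, a model fit on parts$train has seen every category that can appear in parts$test, although a level represented by a single training row is still estimated very imprecisely.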

Related

How can I include both my categorical and numeric predictors in my elastic net model? r

As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput- it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error.
I have been trying to estimate an elastic net model in r using glmnet. However, I keep getting an error. I am not sure what is causing it. The error happens when I go to train the data. It sounds like it has something to do with the data type and matrix.
I have provided a sample dataset. Then I set the outcomes and certain predictors to be factors. After setting certain variables to be factors, I label them. Next, I create an object with the column names of the predictors I want to use. That object is pred.names.min. Then I partition the data into the training and test data frames: 65% in the training, 35% in the test. With the trainControl function, I specify a few things I want to have happen with the model: random parameters for lambda and alpha, as well as the leave-one-out method. I also specify that it is a classification model (categorical outcome). In the last step, I specify the training model. I write my code to tell it to use all of the predictor variables in the pred.names.min object for the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet)
library(caret)
#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"=
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"=
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))
#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)
#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No",
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No",
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50",
">=50"))
#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]
#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]
#specifying that I want to use the leave-one-out cross-validation method
#and use "random" as search for elasticnet
tcontrol <- trainControl(method = "LOOCV",
search="random",
classProbs = TRUE)
#training model
elastic_model1 <- train(as.matrix(trainingset[,
pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a
method for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
I tried removing the as.matrix call:
elastic_model1 <- train((trainingset[, pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
It still produces a similar error.
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method
for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This
will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels that
can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The recommendation currently is to dummy code and re-code to numeric where possible:
Using LASSO in R with categorical variables
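A minimal sketch of the dummy-coding step, under stated assumptions: the toy data frame and its column names are hypothetical, the outcome levels reuse the valid names from the question, and alpha = 0.5 is an arbitrary elastic net mixing value. model.matrix() expands each factor into 0/1 indicator columns, which gives glmnet the all-numeric matrix it expects. The fit itself only runs if glmnet is installed.

```r
# Hypothetical toy data mixing a factor and a numeric predictor.
set.seed(2)
toy <- data.frame(
  outcome = factor(rep(c("group1.2", "group3.4"), times = 10)),
  bmi     = factor(rep(c("low", "mid", "high"), length.out = 20)),
  age     = round(runif(20, 20, 60))
)

# model.matrix() turns factors into indicator columns;
# [, -1] drops the intercept column, which glmnet adds itself.
x <- model.matrix(outcome ~ ., data = toy)[, -1]
y <- toy$outcome

is.numeric(x)  # TRUE: no character or factor columns remain
colnames(x)

# Fit only when glmnet is available; alpha = 0.5 is an assumed mixing value.
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x, y, family = "binomial", alpha = 0.5)
}
```

The same numeric matrix can be passed as the x argument of caret's train(), in which case the data argument is not needed.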

Error when calculating variable importance with categorical variables using the caret package (varImp)

I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):
library(caret)
#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))
# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data,
method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)
The issue persists when using different models (like randomForest and SVM).
If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.
Thanks!
When you call varImp on lvq, it defaults to filterVarImp() because there is no specific variable importance for this model. Now if you check the help page:
For two class problems, a series of cutoffs is applied to the
predictor data to predict the class. The sensitivity and specificity
are computed for each cutoff and the ROC curve is computed.
Now if you read the source code of varImp.train(), the data it feeds into filterVarImp() is the original data frame, not whatever comes out of the preprocessing.
This means that if a variable in the original data is a factor, it cannot be cut, and an error like this is thrown:
filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
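To illustrate why cutoffs need a numeric predictor, here is a small base R sketch that computes the area under the ROC curve from ranks (the Mann-Whitney form). It is an illustration of the idea only, not caret's implementation; auc_by_rank and the toy vectors are hypothetical.

```r
# AUC from ranks: the probability that a randomly chosen positive case has a
# larger predictor value than a randomly chosen negative case.
# `auc_by_rank` is a hypothetical helper, not a caret function.
auc_by_rank <- function(predictor, label, positive) {
  stopifnot(is.numeric(predictor))  # ranking by cutoffs needs numbers
  is_pos <- label == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(predictor)              # ties get average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label        <- factor(c("a", "a", "a", "b", "b", "b"))
numeric_pred <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
factor_pred  <- factor(c(0, 0, 1, 1, 1, 0))

auc_numeric <- auc_by_rank(numeric_pred, label, positive = "b")

# A factor has to be converted to a 0/1 indicator before it can be ranked.
indicator <- as.numeric(as.character(factor_pred))
auc_indicator <- auc_by_rank(indicator, label, positive = "b")
```

Passing factor_pred directly stops at the is.numeric() check, which mirrors the type error shown above.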
So, using my example and as you have pointed out, you need to one-hot encode it:
set.seed(111)
dummy_data = data.frame(Label = factor(rep(c("a","b"),each=20)))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
ohe_data = data.frame(
Label = dummy_data$Label,
model.matrix(Label ~ 0+.,data=dummy_data))
model.lvq <- caret::train(Label~., data=ohe_data,
method="lvq", preProcess="scale",
trControl=control.lvq)
caret::varImp(model.lvq, scale=FALSE)
ROC curve variable importance
Importance
pred1 0.6575
pred20 0.6000
pred21 0.6000
If your model has no model-specific variable importance method, one option is to calculate the filter-based variable importance first and run the model after that.
Note that this problem can be circumvented by replacing each ordinal feature (with d levels) by its (d-1)-dimensional indicator encoding:
model.matrix(~dummy_data$pred2-1)[,1:(length(levels(dummy_data$pred2))-1), drop=FALSE]
However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.

R: factor as new level when I predict with test data

I am getting an error from my dataset; the code below reproduces it with the same logic. I have tried increasing the amount of training data, but that didn't solve it. I have already excluded all NA values.
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor y has new levels L, X
set.seed(234)
d <- data.frame(w=abs(rnorm(50)*1000),
x=rnorm(50),
y=sample(LETTERS[1:26], 50, replace=TRUE))
train_idx <- sample(1:nrow(d), floor(0.8*nrow(d)))
train <- d[train_idx,]
test <- d[-train_idx,]
fit <- lm(w ~x + y, data=train)
predict(fit, test)
As @jdobres has already explained why this error occurs, I'll jump straight to the solution:
Try the line of code below just before the predict statement:
#add all values of 'y' in the 'test' dataset to fit$xlevels[["y"]] in the fit object
#(as.character() also covers character columns, for which levels() returns NULL)
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], unique(as.character(test[["y"]])))
Hope this would resolve your problem!
Factor and character data are treated as categorical variables. As such, models cannot form predictions for category labels they've never seen before. If you built a model to predict things about "poodle" and "pit bull", the model would fail if you gave it "golden retriever".
More specific to your example, the error is telling you that labels "L" and "X", which are in your test set, do not appear in your training set. Since they weren't in the training set, the model doesn't know what to do when it encounters these in the test.
Thanks Prem, and if you have many variables you can loop the line of code like this:
for(k in vars){
if(is.factor(shop_data[,k])){
ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]],levels(shop_data[[k]]))
}
}
vars are the variables used in the model, shop_data is the main dataset which is split into train and test
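An alternative to widening fit$xlevels, sketched under assumptions: recode labels the model never saw as NA, so that predict() returns NA for those rows instead of stopping. The toy data are constructed so that "L" and "X" occur only in the test rows; test_safe, seen, and unseen_rows are hypothetical names.

```r
# Toy data built so that the labels "L" and "X" appear only in the test rows.
set.seed(234)
d <- data.frame(
  w = abs(rnorm(40) * 1000),
  x = rnorm(40),
  y = factor(c(rep(c("A", "B", "C"), length.out = 36), "A", "B", "L", "X"))
)
train <- d[1:36, ]
test  <- d[37:40, ]

fit <- lm(w ~ x + y, data = train)

# Levels the model actually saw (lm drops unused levels when fitting).
seen <- fit$xlevels[["y"]]
unseen_rows <- !(as.character(test$y) %in% seen)

# Recode unseen labels as NA; predict() then yields NA for those rows.
test_safe <- test
test_safe$y <- factor(as.character(test$y), levels = seen)
pred <- predict(fit, newdata = test_safe)
```

This makes it explicit which rows the model cannot score, rather than returning a number for a category it has no coefficient for.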

Should I drop unused levels when I split a data frame into training and testing set in R?

I am building a decision tree classification model. All of my feature variables and the label variable are factors. When I split my data set into training and testing sets, the two subsets contain unused levels. If I drop levels on the two subsets, the predictions are very different and the accuracy is much lower.
I am wondering what the proper way is to deal with this level issue, in predictive modeling as well as in other situations. Any suggestions?
Here is a reproducible example using the sample data solder from the rpart package. I choose Solder as my label variable. It is a balanced data set.
library(rpart) # provides the solder data
solder_data<-solder
##split training data set and test data set
set.seed(11)
g <- runif(nrow(solder_data))#set random order of data set
solder_data<- solder_data[order(g),]
ss <- sample(1:nrow(solder_data),size = 0.7*nrow(solder_data))
solder.Train <- solder_data[ss,]
solder.Test <- subset(solder_data[-ss,],Opening=='S')
dl_solder.Test <-droplevels(solder.Test) # drop unused levels in testing set
str(solder.Test) #opening has 3 levels
str(droplevels(solder.Test)) # opening has 1 level
#build model
library(RevoScaleR)
rxfit <- rxDTree(Solder ~ Opening + skips + Mask + PadType + Panel,
data = solder.Train)
#test model on test set before dropping levels
rxpred <- rxPredict(rxfit,data = solder.Test,extraVarsToWrite = "Solder")
rxpred$Predicted <- ifelse(rxpred$Thick_prob<=rxpred$Thin_prob,
"Thin","Thick")
mean(rxpred$Predicted!=rxpred$Solder) # misclassification rate is 0.1428571
#test model on test set after dropping levels
rxpred_dl <- rxPredict(rxfit,data = dl_solder.Test,
extraVarsToWrite = "Solder")
rxpred_dl$Predicted <- ifelse(rxpred_dl$Thick_prob<=rxpred_dl$Thin_prob,
"Thin","Thick")
mean(rxpred_dl$Predicted!=rxpred_dl$Solder)
# misclassification rate is 0.3714286
Why does dropping unused levels in the test data set lead to different predictions? Which one is the right way to do prediction?
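One base R fact that may be relevant, shown as a sketch: droplevels() changes the integer codes stored behind the labels. Whether rxPredict relies on those codes is an assumption the post does not confirm; the toy factor below is hypothetical and only mimics the three Opening levels.

```r
# A factor is stored as integer codes plus a levels attribute.
opening <- factor(c("L", "M", "S", "S"), levels = c("L", "M", "S"))
only_s  <- opening[opening == "S"]
dropped <- droplevels(only_s)

as.integer(only_s)   # 3 3  ("S" is the third of three levels)
as.integer(dropped)  # 1 1  ("S" is now the only level)

# Re-applying the training levels restores the original codes.
realigned <- factor(as.character(dropped), levels = levels(opening))
as.integer(realigned)  # 3 3
```

Keeping the test factors on the same levels as the training factors avoids any mismatch of this kind, which is consistent with the lower error rate observed before dropping levels.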

R, Caret, train(), predict(), GBM, Error: Error in model.frame.default(..): Factor has new levels

So I have a pretty good idea of what is happening, but I'm wondering how to handle the error. I've seen other similar posts, but they were not specific to Gradient Boosting Machine models. They all seem to be related to GLMs, and I don't think the error is caused by the same thing.
Here's my code:
myTuneGrid <- expand.grid(n.trees=c(100,200), interaction.depth=c(9,10,11,12), shrinkage=0.1, n.minobsinnode=10)
fitControl <- trainControl(method = "cv", number =5,verboseIter = FALSE,returnResamp = "all")
myModel <- train(as.factor(target) ~ .,data = trainingDataC.GB, method = "gbm",trControl = fitControl,tuneGrid = myTuneGrid)
myPrediction <- predict(myModel,newdata=testDataC)
Here's my error:
Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : factor 47V has new levels E, H, J
So my factor variable has a bunch of levels in my training set, but from the error I'm guessing not all levels are represented in my training set. When I go to my test set there are new levels that were not in my training set so I'm getting this error?
This is a supervised learning problem, I can't change the test set and move data to the training set. So it's not a sampling problem.
Anyway, does anyone know any settings or quick fixes so that this doesn't cause my program to crash?
This happens quite a bit in Kaggle competitions. You can combine the variables to create a levels argument so that the factor contains all the levels in both train and test. You see this quite a lot in Kaggle scripts.
See this very simple example based on mtcars. You just need to fill in the variable name in quotes (e.g. "cyl") and the variable will be set to a factor in both the train and test set with all the levels available in both sets. This will just prevent your model from giving an error. This does not mean that it will learn anything from the factor levels not available in the training set.
train <- subset(mtcars, cyl < 8)
test <- subset(mtcars, cyl >= 8)
fact_train_test <- function(x) {
levels <- unique(c(train[[x]], test[[x]]))
train[[x]] <<- factor(train[[x]], levels=levels)
test[[x]] <<- factor(test[[x]], levels=levels)
}
fact_train_test("cyl")
There are probably other methods of doing this, but it works.
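A variant of the same idea without global assignment, as a sketch: the function returns both data frames instead of modifying train and test through <<-. align_levels is a hypothetical name, and converting through as.character() is an assumption made so that the helper also works when the column is already a factor.

```r
# Return copies of train and test whose `column` shares one set of levels.
# `align_levels` is a hypothetical helper name.
align_levels <- function(train, test, column) {
  all_levels <- union(as.character(train[[column]]),
                      as.character(test[[column]]))
  train[[column]] <- factor(as.character(train[[column]]), levels = all_levels)
  test[[column]]  <- factor(as.character(test[[column]]),  levels = all_levels)
  list(train = train, test = test)
}

train <- subset(mtcars, cyl < 8)
test  <- subset(mtcars, cyl >= 8)

aligned <- align_levels(train, test, "cyl")
identical(levels(aligned$train$cyl), levels(aligned$test$cyl))  # TRUE
```

As with the original function, this only prevents the error; the model still learns nothing about levels that never occur in the training rows.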
