Sampling with many categories - r

I'm using a linear regression to work with a dataset with many categorical variables that each contain several categories, up to 45 categories in one of them.
I'm sampling the data this way:
## 70% of the sample size
smp_size <- floor(0.7 * nrow(plot_data))
## set the seed to make your partition reproducible
set.seed(888)
train_ind <- sample(seq_len(nrow(plot_data)), size = smp_size)
train <- plot_data[train_ind, ]
test <- plot_data[-train_ind, ]
Then I make the model like this:
linear_model = lm(dependent_variable ~ ., data = train)
The problem is that whenever I try to predict and work with the testing set, the testing set contains some categories that the training set does not.
pred_data = predict(linear_model, newdata = test)
This gives me the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor origin has new levels someCategory1, SomeCategory2
Is there a way to ensure that all the categories are in both the train and testing sets or is there a workaround on this?

I ended up removing the observations with new levels from the test set. I know it has its limitations and that the OSR2 loses reliability, but it got the job done:
test = na.omit(remove_missing_levels (fit=linear_model, test_data=test));
I found the remove_missing_levels function here.
It requires this library:
install.packages("magrittr");
library(magrittr);
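For what it's worth, the same filtering can be sketched in base R without the helper. This is a minimal sketch, not the remove_missing_levels function itself: drop_unseen_levels is a hypothetical name, the toy data is made up, and it assumes the factors enter the formula as plain columns. It relies on the fitted lm object storing the levels it saw in fit$xlevels (the same xlev that appears in the error message).

```r
# Hypothetical helper: keep only the rows of newdata whose factor values
# were seen when the model was fitted. Rows with NA in a factor are dropped too.
drop_unseen_levels <- function(fit, newdata) {
  keep <- rep(TRUE, nrow(newdata))
  for (nm in names(fit$xlevels)) {
    keep <- keep & (as.character(newdata[[nm]]) %in% fit$xlevels[[nm]])
  }
  droplevels(newdata[keep, , drop = FALSE])
}

# Toy data (made up): level "c" occurs only once, in the last row
set.seed(1)
toy <- data.frame(
  y = rnorm(20),
  origin = factor(c(rep("a", 10), rep("b", 9), "c"))
)
train_toy <- toy[1:19, ]   # never sees "c"
test_toy  <- toy[15:20, ]  # contains "c"

fit <- lm(y ~ origin, data = train_toy)

# Predicting on the raw test set fails with "factor origin has new levels c"
raw_attempt <- try(predict(fit, newdata = test_toy), silent = TRUE)

# After dropping the unseen level, prediction works
test_kept <- drop_unseen_levels(fit, test_toy)
pred <- predict(fit, newdata = test_kept)
```

As with the approach above, the dropped rows are simply not evaluated, so any out-of-sample metric only describes the categories the model has seen.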

Related

Random Forest prediction error in R "No forest component in the object"

I am attempting to use a random forest regressor to classify a raster stack, but an error prevents me from predicting "area_pct". Have I not trained the model properly?
d100 is my dataset with predictor variables d100[,4:ncol(d100)] and prediction variable d100["area_pct"].
#change na values to zero
d100[is.na(d100)] <- 0
set.seed(100)
#split dataset into training (70%) and testing (30%)
id<- sample(2,nrow(d100), replace = TRUE, prob = c(0.7,0.3))
train_100<- d100[id==1,]
test_100 <- d100[id==2,]
Train a random forest model with the randomForest package; this appears to work fine:
final_CC_rf_20 = randomForest(x=train_100[,4:ncol(train_100)], y= train_100$area_pct,
xtest=test_100[,4:ncol(test_100)], ytest=test_100$area_pct, mtry=14, importance=TRUE, ntree = 600)
Then I try to predict a raster.
New raster stack with predictor variables
sentinel_2_20 <- stack( paste(getwd(), "Sentinel_SR_clip_20.tif", sep="/") )
area_classified_20_2018 <- predict(object = final_CC_rf_20 , newdata = sentinel_2_20,type = 'response', progress = 'window')
but error pops up:
#Error in predict.randomForest(object = final_CC_rf_20, newdata = sentinel_2_20, :
# No forest component in the object
any help would be extremely useful
The arguments you are using for predict (with raster data) are not correct. The first argument, object, should be the raster data, the second argument, model, should be the fitted model. There is no argument newdata.
Another problem is that you use keep.forest=FALSE which is the default when xtest is not NULL. You could set keep.forest=TRUE but that is not a good approach, generally, as you should fit your model with all data before you make a prediction (you are no longer evaluating your model). Thus, I would suggest fitting your model without xtest, like this
rfmod <- randomForest(x=d100[,4:ncol(d100)], y=d100$area_pct,
mtry=14, importance=TRUE, ntree = 600)
And then do
p <- predict(sentinel_2_20, rfmod, type='response')
See ?raster::predict or ?terra::predict for working examples
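To see the keep.forest behaviour in isolation, here is a small sketch on the built-in mtcars data (not the raster data from the question, which I don't have). The object names are made up for illustration. It only shows that passing xtest leaves the fitted object without a forest component, while a plain fit keeps it and can be used with predict.

```r
library(randomForest)

set.seed(100)
idx <- sample(nrow(mtcars), 22)
tr <- mtcars[idx, ]
te <- mtcars[-idx, ]

# Fit with xtest/ytest: keep.forest defaults to FALSE in this case
rf_with_xtest <- randomForest(x = tr[, -1], y = tr$mpg,
                              xtest = te[, -1], ytest = te$mpg,
                              ntree = 50)
no_forest <- is.null(rf_with_xtest$forest)

# The test-set predictions are still available inside the object
test_pred <- rf_with_xtest$test$predicted

# Fit without xtest: the forest is kept, so predict() works on new data
rf_plain <- randomForest(x = tr[, -1], y = tr$mpg, ntree = 50)
p <- predict(rf_plain, newdata = te[, -1])
```

So if the goal is only to evaluate the model, the xtest version already gives you the test predictions; if the goal is to predict on new data such as a raster, fit without xtest (or with keep.forest=TRUE).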

Error connected to factor var when using logistic model to predict

Forgive me if my title is unclear, but I couldn't think of a very clear way to summarize what I'm after.
I'm working with the Titanic dataset to learn logistic regression. The idea is to develop a model to predict survival. The data includes passenger Age. Using that attribute, I factored Age like
Age_labels <- c('0-10', '11-17', '18-29', '30-39', '40-49', '50-59', '60-69', '70-79')
train_data$AgeGroup <- cut(train_data$Age, c(0, 11, 18, 30, 40, 50, 60, 70, 80), include.lowest=TRUE, labels= Age_labels)
Model completed, I'm ready to use it to predict survival on the test data--but I get an error when I try
test_data_predictions <- predict(my_model, newdata = test_data, type = "response")
Here's the error:
Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : factor AgeGroup has new levels 60-69,
70-79
Why? It seems to mean the problem is because the test data includes passengers in the 60-69 and 70-79 AgeGroup (whereas train data did not include passengers in those age ranges). Or does the error actually mean something else?
Obviously I want to use this model to predict the survival of any passenger, regardless of age.
Here is a potential clue: str() tells me that AgeGroup in my test_data is a factor w/ 8 levels, whereas in train_data it's a factor w/ 6 levels. Also, there are no NAs in either train_data or test_data.
How do I correct the error so I can move on to actual predictions? Thanks
Note: haven't included data as this question does not seem to require reproducibility to answer
UPDATE
Per suggestion by @sjp, I went back and treated AgeGroup as a continuous variable (as numeric). Doing so has adverse effects: AIC goes up, the binnedplot of residuals now looks rather poor (too many points outside the bins), and Hosmer-Lemeshow now says "Summary: model does not fit well". So, passing AgeGroup as numeric does make it possible for me to use the model to make predictions on test data, but I worry the price is too high.
tl;dr Because there are age groups present in the test set that are not in the training set (due to random sampling of small categories), R can't make predictions for those test cases. You can use caret::createDataPartition(age_group) to create a train/test split that is balanced on the age-group variable (and hence is not missing any categories). The help page (?createDataPartition) warns you that "for ... very small class sizes (<= 3) the classes may not show up in both the training and test data", but it seems to work OK here (the smallest group has n=6).
replicate the problem
tt <- transform(carData::TitanicSurvival,
age_group = cut(age,
breaks = c(0, 11, 18, 30, 40, 50, 60, 70, 80))
)
set.seed(101)
## allocate a small fraction (10%) to the training set to make
## it easier to get missing classes in the training set
split1 <- sample(c("train", "test"),
size = nrow(tt),
replace = TRUE,
prob = c(0.1, 0.9))
m1 <- glm(survived ~ age_group, family = binomial,
data = tt[split1 == "train", ])
try(predict(m1, newdata = tt[split1 == "test",]))
this gives
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor age_group has new levels (70,80]
as in the original example.
balanced sample
library(caret)
set.seed(101)
w <- createDataPartition(tt$age_group, p = 0.1)$Resample1
table(tt$age_group[w])
table(tt$age_group[-w])
m2 <- glm(survived ~ age_group, family = binomial, data = tt[w,])
predict(m2, newdata = tt[-w,])
This works OK. Using table(tt$age_group[w]) and table(tt$age_group[-w]) confirms that every age class is present in both the training and the test set, although it doesn't cause any problems if classes are missing from the test set only ...
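If you want to check for the problem before fitting, a quick base-R diagnostic along these lines may help. It is only a sketch: unseen_levels is a hypothetical helper name and the two small data frames are made up. It lists, for each factor or character column, the values that occur in the test set but not in the training set.

```r
# Hypothetical helper: values present in `test` but absent from `train`,
# for every factor/character column of `train`. NAs are ignored.
unseen_levels <- function(train, test) {
  is_cat <- vapply(train, function(col) is.factor(col) || is.character(col),
                   logical(1))
  cat_cols <- names(train)[is_cat]
  out <- lapply(cat_cols, function(nm) {
    tr_vals <- unique(as.character(train[[nm]]))
    te_vals <- unique(as.character(test[[nm]]))
    te_vals <- te_vals[!is.na(te_vals)]
    setdiff(te_vals, tr_vals)
  })
  names(out) <- cat_cols
  out[lengths(out) > 0]   # keep only columns that actually have a problem
}

# Made-up example: "senior" appears only in the test set
train_toy <- data.frame(age_group = factor(c("child", "adult", "adult")),
                        sex = c("f", "m", "f"))
test_toy  <- data.frame(age_group = factor(c("adult", "senior")),
                        sex = c("m", "f"))

res <- unseen_levels(train_toy, test_toy)
```

An empty list means every category in the test set was also seen in training, so predict should not raise the "new levels" error.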

Random Forest model yields incorrect predictions despite having accuracy of over 99 percent

For a ML course, I am supposed to build a model based on the training set to predict the variable "classe" on a validation set. I removed all unnecessary variables in the training set, used cross validation to prevent over-fitting, and made sure the validation set matched the training set in terms of which columns are removed. When I predict classe in the validation set, it yields all classe A, and I know this is incorrect.
I included the entire script below.
Where did I go wrong?
library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
train <- read.csv("./train.csv")
val <- read.csv("./test.csv")
#getting rid of columns with NAs
nas <- sapply(train, function(x) sum(is.na(x)))
train <- train[, nas<1900]
#removing near zero variance columns
remove <- nearZeroVar(train)
train <- train[, -remove]
#create partition in our training set
set.seed(8675309)
inTrain <- createDataPartition(train$classe, p = .7, list = FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
model <- train(classe ~ ., method = "rf", data = training)
confusionMatrix(predict(model, testing), testing$classe)
#make sure validation set has same features as training set
trainforvalid <- subset(training, select = -classe)
val <- val[, colnames(trainforvalid)]
predict(model, val)
#the above step yields all predictions as classe A
This might be happening because the data is unbalanced. If the data have a lot more data points for Class A than Class B, the model will simply learn to always predict Class A.
Try to use a better metric in this case like F1 score.
I also recommend using techniques like oversampling or undersampling to avoid the unbalanced data issue.
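To illustrate why accuracy can hide this, here is a small sketch that computes per-class F1 (F1 = 2 * precision * recall / (precision + recall)) in base R. f1_per_class is a hypothetical helper and the labels are made up, not taken from the question's data. A class that is never predicted has undefined precision; the sketch sets its F1 to 0 by convention.

```r
# Hypothetical helper: per-class F1 from actual and predicted labels
f1_per_class <- function(actual, predicted) {
  lv <- levels(actual)
  predicted <- factor(predicted, levels = lv)
  cm <- table(predicted, actual)      # rows: predicted, columns: actual
  tp <- diag(cm)
  precision <- tp / rowSums(cm)       # NaN if a class is never predicted
  recall <- tp / colSums(cm)
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0                 # convention: undefined F1 counts as 0
  setNames(as.numeric(f1), lv)
}

# Made-up unbalanced labels and a "model" that always predicts A
actual   <- factor(c("A", "A", "A", "A", "B", "B"))
always_a <- factor(rep("A", 6), levels = levels(actual))

accuracy <- mean(always_a == actual)
f1 <- f1_per_class(actual, always_a)
```

Here the accuracy looks acceptable even though class B is never predicted, while the F1 for B is 0, which makes the problem visible.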

R - factor examcard has new levels

I built a classification model in R using C5.0 given below:
library(C50)
library(caret)
a = read.csv("All_SRN.csv")
set.seed(123)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]
Tree <- C5.0(anatomy ~ ., data = training,
trControl = trainControl(method = "repeatedcv", repeats = 10,
classProb = TRUE))
TreePred <- predict(Tree, test)
The training set has features like examcard, coil_used, anatomy_region, bodypart_anatomy and anatomy (the target class). All the features are categorical variables. There are a total of 10k odd values, which I divided into training and test data. The learner worked great with this training and test set partitioned in a 70:30 ratio, but the problem comes when I provide the test set with new values, as given below:
TreePred <- predict(Tree, test_add)
Here, test_add contains the already present test set and a set of new values and on executing the learner fails to classify the new values and throws the following error:
Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels
I tried to merge the new factor levels with the existing one using:
Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))
But, this wasn't of much help since the code executed with the following message and didn't yield any fruitful result:
predict code called exit with value 1
The feature examcard holds a good deal of primacy in the classification and hence can't be ignored. How can this set of values be classified?
You cannot create a prediction for factor levels in your test set that are absent in your training set. Your model will not have coefficients for these new factor levels.
If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition...
... or your own stratified sample function to ensure that all levels are represented in the training set: use the "split-apply-combine" approach: split the data set by examcard, and for each subset, apply the split, then combine the training subsets and the testing subsets.
See this question for more details.
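The split-apply-combine idea can be sketched in base R roughly as follows. This is a minimal sketch under assumptions: stratified_split is a hypothetical helper name, the toy data is made up, and it stratifies on a single column. Each level contributes at least one row to the training set, so very rare levels may end up only in training. Rows with NA in the stratification column go to the test set.

```r
# Hypothetical helper: stratified train/test split on one categorical column
stratified_split <- function(data, by, p = 0.7) {
  # split: row indices grouped by the level of `by`
  idx_by_level <- split(seq_len(nrow(data)), data[[by]], drop = TRUE)
  # apply: sample within each level, keeping at least one row for training
  train_idx <- unlist(lapply(idx_by_level, function(idx) {
    n_train <- max(1, floor(p * length(idx)))
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  # combine: everything not sampled for training becomes the test set
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

# Made-up data with some very rare examcard levels
set.seed(123)
toy <- data.frame(
  examcard = factor(c(rep("e1", 50), rep("e2", 10), rep("e3", 2), "e4")),
  value = rnorm(63)
)

parts <- stratified_split(toy, "examcard", p = 0.7)
train_levels <- unique(as.character(parts$train$examcard))
test_levels  <- unique(as.character(parts$test$examcard))
```

With this split every level in the test set is also present in the training set, so predict should not hit the "new levels" error for that column. It does nothing for genuinely new values that arrive later (like test_add above); those still have to be handled separately.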

How to keep all levels of categorical variables when splitting data frame in test and train set in R

Sometimes when splitting a data frame with categorical columns into a test and train set, the train set will not contain all levels of the categorical variable. When you then train the model and try to predict on the test set, the prediction fails with a "new levels" error.
For example:
x <- data.frame(...) # data frame with columns with very dispersed categorical variables
set.seed(123)
smp_size <- floor(0.75 * nrow(x))
train_idx <- sample(seq_len(nrow(x)), size = smp_size)
train_set <- x[train_idx, ]
test_set <- x[-train_idx, ]
m <- lm(some_formula, data=train_set)
predict(m, newdata=test_set)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor xxxx has new levels yyy ...
Does anyone know a handy way to set the levels of all categorical variables in both train and test set to the levels in the original data set ?
Thank you.
The caret function createDataPartition() attempts to deal with the issue you describe.
Given your example above, you should be able to use it this way:
train_idx <- createDataPartition(y, times = 1, p = 0.75, list = FALSE)
Here is a part of the R documentation on the function createDataPartition:
"the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits."
