Column names - xgboost predict on new data - R

I have never productionised an xgboost model and am concerned about how to handle fresh-data predictions with an xgboost model, specifically when the column names do not match the trained model's sparse matrix column names - either because new columns have been added or because certain columns drop out when the fresh data is converted to a sparse matrix.
What happens if I attempt to predict with an xgboost model on new data that has extra or missing column names? I can see this happening regularly and would like to write code that accounts for it so that predictions stay correct. I would prefer to avoid hacking together a solution if more elegant ones already exist.
So, specifically: if the new data's sparse matrix has different column names, then what?
My best guess is to factorise (levels based on the training data's levels) > create the sparse matrix > then remove the non-matching columns (between the trained dataset and the new data).
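In rough code, I imagine something like the following (an untested sketch, written against a hypothetical trained_cols vector holding the training matrix's column names):
# sketch: force the new data's sparse matrix to carry exactly the training
# matrix's columns - drop columns the model never saw, add all-zero columns
# for levels missing from the new data, and restore the training column order
align_to_train <- function(new_sparse, trained_cols) {
  missing_cols <- setdiff(trained_cols, colnames(new_sparse))
  pad <- Matrix(0, nrow = nrow(new_sparse), ncol = length(missing_cols),
                dimnames = list(NULL, missing_cols), sparse = TRUE)
  keep <- intersect(colnames(new_sparse), trained_cols)
  cbind(new_sparse[, keep, drop = FALSE], pad)[, trained_cols, drop = FALSE]
}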
I have created dummy data (in below code) as an example of prediction errors given different column names.
1st step = build the model (just for illustrative purposes; I know it's a bad build).
2nd step = reshuffle the entire dataset, then predict (no problems - the predictions match).
3rd step = select only 10% of the data, then predict - this produces prediction errors due to different column names.
Here's the code:
Step 1: create dummy data and a lazy xgboost model, just for illustrative purposes.
library(xgboost) # for the xgboost algorithm
library(Matrix)  # for sparse.model.matrix

### Create dummy data
num_rows <- 100
set.seed(1234)
target <- runif(num_rows)
dummy_data <- data.frame(
  LETTER_SINGLE = sample(LETTERS, num_rows, replace = TRUE),
  DOUBLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  TRIPLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  stringsAsFactors = FALSE
)
## STEP 1: CREATE XGBOOST MODEL AND GET PREDICTED VALUES TO COMPARE WITH FUTURE DATA CUTS
model_data_01 <- dummy_data
target_01 <- target

# create sparse matrix
model_01_sparse <- sparse.model.matrix(~ . - 1, data = model_data_01)

# column names of the training matrix
colnames_trained_model <- colnames(model_01_sparse)

# train a model
xgb_fit_01 <- xgboost(data = model_01_sparse,
                      label = target_01,
                      # param = best_param,
                      nrounds = 100,
                      verbose = TRUE)

pred_01 <- predict(xgb_fit_01, newdata = model_01_sparse)
Step 2: test whether the order of observations causes differences in predictions. Spoiler - no prediction errors occur.
## STEP 2: CREATE SHUFFLED DATA (SAME ROWS, DIFFERENT ORDER) THEN PREDICT AND COMPARE
sample_order <- sample(1:num_rows)
model_data_shuffled <- dummy_data[sample_order, ]
target_shuffled <- target[sample_order]

# the row order differs
head(model_data_01)
head(model_data_shuffled)

# create sparse matrix
model_shuffled_sparse <- sparse.model.matrix(~ . - 1, data = model_data_shuffled)

# column names of the shuffled matrix
colnames_shuffled <- colnames(model_shuffled_sparse)

pred_shuffled <- predict(xgb_fit_01, newdata = model_shuffled_sparse)

# check whether the predictions differ
pred_01[sample_order] - pred_shuffled
## These match. Yay. sparse.model.matrix must sort the factor levels alphabetically
## before creating column names, so the column names are identical:
mean(colnames_trained_model == colnames_shuffled)
Step 3: sample only a handful of rows and predict, to see whether missing columns in the sparse matrix cause prediction errors.
## STEP 2 WORKED FINE, SO ON TO...
## STEP 3: RANDOMLY SAMPLE ONLY A HANDFUL OF ROWS, THEN PREDICT AND COMPARE
sample_order_02 <- sample(1:(num_rows * 0.1))
model_data_shuffled_02 <- dummy_data[sample_order_02, ]
target_shuffled_02 <- target[sample_order_02]

# create sparse matrix
model_shuffled_sparse_02 <- sparse.model.matrix(~ . - 1, data = model_data_shuffled_02)

# column names of the new sparse matrix
colnames_shuffled_02 <- colnames(model_shuffled_sparse_02)

pred_shuffled_02 <- predict(xgb_fit_01, newdata = model_shuffled_sparse_02)

# check whether the predictions differ
pred_01[sample_order_02] - pred_shuffled_02
## These did not match. Damn.
# due to different column names:
colnames_trained_model
colnames_shuffled_02
mean(colnames_trained_model == colnames_shuffled_02)
As you can see, this last attempt produces different predicted values, due solely to the missing column names in the sparse matrix.
I don't want to hack an ugly solution together if an elegant one exists for me to learn from.
So my question is... Is there an elegant way to force the new data's sparse model matrix column names to match those of the matrix the model was built on (the model used for predictions on new data)?
I have searched the web and have had no luck so far finding any best-practice solution.
If anybody could help by answering the question or pointing me in the right direction, that would be much appreciated.

What is your production environment? R, Python, Java or something else?
The idea is to use XGBoost functionality (both training and prediction) via a production environment-specific wrapper library, not directly. For example, in Python you could use the Scikit-Learn wrappers, which encapsulate feature engineering and feature selection tasks in a reusable sklearn.pipeline.Pipeline object. You would 1) fit the pipeline object (where the XGBoost estimator is the final step) in the development environment and serialize it to a pickle file, 2) move the pickle file from the development to the production environment, and 3) de-serialize it from the pickle file and use it to transform and score new data in the production environment. This is a high-level API that completely abstracts away low-level details such as the layout of XGBoost's "internal" data matrices.
For a platform-independent solution, you could export XGBoost models (and associated data pre-processing logic) in the standardized PMML representation.
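For an R-only analogue of the same "keep everything the model needs together" idea, one option is to persist the fitted booster alongside the training column names and reuse both in production. A sketch using xgboost's xgb.save()/xgb.load() and base R's saveRDS()/readRDS() (the file names are arbitrary; the objects are those from the question):
# development environment: persist the fitted booster and the column names
# of the sparse matrix it was trained on
xgb.save(xgb_fit_01, "xgb_fit_01.model")
saveRDS(colnames_trained_model, "trained_colnames.rds")

# production environment: reload both, align the new data's sparse matrix to
# the saved column names (as sketched in the question), then predict
xgb_fit_01 <- xgb.load("xgb_fit_01.model")
colnames_trained_model <- readRDS("trained_colnames.rds")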

Related

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
                      intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
library(rpart)
library(rpart.plot)

set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75 * nrow(tree_data), replace = FALSE)
I_train <- tree_data[I_index, ]
I_test <- tree_data[-I_index, ]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple of resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression that I need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a proper way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently from how you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data introduces into your analysis.
All the different datasets created by multiple imputation are plausible imputations; because of that uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
Create your m imputed datasets
Build your trees on each imputed dataset separately (see the sketch below)
Do your analysis on each tree separately
In your final paper, you can then state how much uncertainty is caused through the missing values/imputation
This means you get, for example, 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
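A minimal sketch of that workflow with the objects from the question (the rpart formula follows the question's code; this is illustrative rather than a full analysis):
library(rpart)

# fit one tree per imputed dataset and keep them side by side, so the
# variability across imputations stays visible
trees <- lapply(amelia_data$imputations, function(imp) {
  rpart(total_fatal ~ managerial_value, data = imp)
})

# inspect each tree separately, e.g. the first one
trees[[1]]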

How can I use a machine learning model to predict on data whose features differ slightly?

I have a randomForest model trained on a bunch of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with but don't quite match the features in the new data, such that when I predict on the new data I get:
Error in predict.randomForest(object = model, newdata = new_data) :
variables in the training data missing in newdata
I thought to get around this error by excluding all the features from the model which do not appear in the new data, and all the features in the new data which do not appear in the model. Putting aside for the moment the impact on model accuracy (this would significantly pare down the number of features, but there would still be plenty to predict with), I did something like this:
model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]
This worked, insofar as names(model$forest$xlevels) == colnames(new_data) returned TRUE for each feature name.
However, when I try to predict on the resulting new_data I still get the variables in the training data missing in newdata error. I am fairly certain that I'm amending the correct part of the model (model$forest$xlevels), so why isn't it working?
I think you should go the other way around; that is, add the missing columns to the new data.
When you are working with bags of words, it is common to have words that are not present in some batch of new data. These missing words should simply be encoded as columns of zeros.
# do something like this (also exclude the target variable, obviously)
names_missing <- names(traindata)[!names(traindata) %in% names(new_data)]
new_data[, names_missing] <- 0L
and then you should be able to predict
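Putting both directions together (the zero-padding above plus the dropping and re-ordering the question already does), a rough end-to-end sketch might look as follows; "target" is a placeholder for whatever the response column is called:
# keep only the model's training features, pad those absent from the new data
# with zeros, and put them in the training order before predicting
train_feats <- setdiff(names(traindata), "target")   # "target" is a placeholder
names_missing <- setdiff(train_feats, names(new_data))
if (length(names_missing) > 0) new_data[, names_missing] <- 0L
new_data <- new_data[, train_feats]
pred <- predict(model, newdata = new_data)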

KNN using R - in Production

I have some dummy data that consists of 99 rows: one column is free text and one column is the category. Each row has been categorised as either Customer Service or Not Customer Service related.
I passed the 99 rows of data into my R script, created a Corpus, cleaned and parsed the data, and converted it to a DocumentTermMatrix. I then converted the DTM to a data frame to make it easier to view and bound the category to that data frame. I then split it roughly 50/50: 50 rows into my training set and 49 into my testing set. I also pulled out the category.
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .5))
test <- (1:nrow(mat.df))[- train]
cl <- mat.df[, "category"]
I then created modeldata, the data frame with the category column stripped out, and passed it to my KNN:
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
conf.mat
I can then work out the accuracy, generate a cross table or export the predictions to test the accuracy of the model.
The bit I am struggling to get my head around at the moment is how to use the model going forward on new data.
So if I then have 10 new rows of free text data that haven't been manually classified, how do I run the KNN model I have just created to classify this additional data?
Maybe I am just misunderstanding the next step.
Thanks,
The same way you just found the hold-out test performance:
knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])
In a KNN model, your training data is intrinsically part of your model. Since it's just finding the nearest training points, how do you know which those are if you don't have their coordinates?
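One practical wrinkle not shown above: completely_new_data has to be a document-term matrix with exactly the same terms, in the same order, as the training data. A rough sketch using tm's dictionary option - here corpus_new stands for the 10 new texts after the same cleaning steps, dtm for the training DocumentTermMatrix, and modeldata, train and cl are the objects from the question:
library(tm)
library(class)

# build the new DTM restricted to the training vocabulary
dtm_new <- DocumentTermMatrix(corpus_new, control = list(dictionary = Terms(dtm)))
newdata_mat <- as.data.frame(as.matrix(dtm_new))

# pad any training terms absent from the new texts with zeros and
# match the column order of the training data frame
missing_terms <- setdiff(colnames(modeldata), colnames(newdata_mat))
if (length(missing_terms) > 0) newdata_mat[, missing_terms] <- 0
newdata_mat <- newdata_mat[, colnames(modeldata)]

# classify the new rows against the labelled training rows
knn.pred.new <- knn(modeldata[train, ], newdata_mat, cl[train])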
That said, why do you want to use a KNN model instead of something more modern (SVM, Random forest, Boosted trees, neural networks)? KNN models scale extremely poorly with the number of data points.

How do I convert an "RWeka" decision tree into a "party" tree in R?

I am using the RWeka package in R to fit M5' trees to a dataset using "M5P". I then want to convert the tree generated into a "party" tree so that I can access variable importances. The issue I am having is that I can't seem to get the function as.party to work without getting the following error:
"Error: all(sapply(split, head, 1) %in% c("<=", ">")) is not TRUE"
This error only arises when I apply the function within a for loop, but the for loop is necessary as I am running 5-fold cross validation.
Below is the code I have been running:
library(RWeka)
library(partykit)

n <- nrow(data)
k <- 5
indCV <- sample(rep(1:k, each = ceiling(n / k)), n)

for (i in 1:k) {
  # training data is all observations where indCV is not equal to i
  training_data <- data.frame(x[-which(indCV == i), ])
  training_response <- y[-which(indCV == i)]

  # test on the fifth of the data where the observation indices are equal to i
  test_data <- x[which(indCV == i), ]
  test_response <- y[which(indCV == i)]

  # fit a pruned model to the training data
  fit <- M5P(training_response ~ ., data = training_data, control = Weka_control(N = TRUE))

  # convert to party
  p <- as.party(fit)
}
The RWeka package has an example for converting M5P trees into party objects. If you run example("M5P", package = "RWeka") then the tree visualizations are actually drawn by partykit. After running the examples, see plot(m3) and as.party(m3).
However, while for J48 you can get a fully fledged constparty object, the same is not true for M5P. In the latter case, the tree structure itself can be converted to party but the linear models within the nodes are not completely straightforward to convert into lm objects. Thus, if you want to use the party representation to compute measures that only depend on the tree structure (e.g., variables used for splitting, number of splits, splitpoints, etc.) then you can do so. But if you want to compute measures that depend on the models or the predictions (e.g., mean square errors etc.) then the party class won't be of much help.
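A minimal sketch of that suggestion (m3 is one of the model objects created by the package example; partykit is needed for the conversion and the plot method):
library(RWeka)
library(partykit)

# run the package example, which fits several M5P models (including m3)
example("M5P", package = "RWeka")

# convert the tree structure to a party object and inspect it
p3 <- as.party(m3)
print(p3)
plot(p3)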

Rpart: variables were specified with different types from the fit?

I make a classification tree using rpart. The data has 10 columns, all properly labeled. Five of these columns contain information such as the day of the week in the form of "Wed" and the other five contain numeric values.
I can successfully grow a tree using rpart, but when I try to run a test set of the data through it, or even the training set that built the tree, I get a bunch of warnings saying that the variables containing characters were converted to factors, and then an error saying those same variables were specified with a different type from the fit.
Anyone know how to fix this?
My relevant code is:
library(rpart)
#read data into info
info <- data.frame(info)
set.seed(30198)
train_ind <- sample(1:2000, 1500)
training_data_info <- info[train_ind, ]
test_data_info <- info[-train_ind, ]
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
tree <- rpart(info ~ ., data = training_data_info, method = "class")
info.test.fit <- predict(tree, newdata=test_data_info) #this is where it goes wrong
You can't use character vectors in an rpart fit. You have to code them as factors. The code does this for you, but then you hit the problem that it is entirely possible for the test data to have a different set of levels from the training data used to fit the tree.
The error arises from the use of these two lines:
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
These lines are redundant; the objects are already data frames. All they achieve is to convert the character columns to factors whose levels are based only on the values present in that particular subset, so the training and test sets can end up with different level sets - and that is where the error comes from. Try it without those two lines and you should be good to go.
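To make the factor coding explicit (and guarantee identical levels in both subsets), one hedged sketch is to convert the character columns to factors on the full data before splitting; everything else, including the formula, follows the question's code:
# convert character columns to factors on the full dataset first, so the
# training and test subsets inherit the same factor levels
info[] <- lapply(info, function(x) if (is.character(x)) factor(x) else x)

set.seed(30198)
train_ind <- sample(1:2000, 1500)
training_data_info <- info[train_ind, ]
test_data_info <- info[-train_ind, ]

tree <- rpart(info ~ ., data = training_data_info, method = "class")
info.test.fit <- predict(tree, newdata = test_data_info)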
