KNN using R - in Production - r

I have some dummy data consisting of 99 rows; one column is free text and one column is the category. Each row has been categorised as either Customer Service or Not Customer Service related.
I passed the 99 rows of data into my R script, created a Corpus, cleaned and parsed my data and converted it to a DocumentTermMatrix. I then converted my DTM to a dataframe to make it easier to view. I bound the category to my new dataframe. I then split it 50/50 so 50 rows into my training set, 49 into my testing set. I also pulled out the category.
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .5))
test <- (1:nrow(mat.df))[- train]
cl <- mat.df[, "category"]
I then created a model with the stripped out category column and passed this new model to my KNN
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
conf.mat
I can then work out the accuracy, generate a cross table or export the predictions to test the accuracy of the model.
The bit I am struggling to get my head around at the moment is how to use the model going forward for new data.
So if I then have 10 new rows of free text data that haven't been manually classified, how do I run the KNN model I have just created to classify this additional data?
Maybe I am just misunderstanding the next step in the process.
Thanks,

The same way you just found the hold-out test performance:
knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])
In a KNN model, your training data is intrinsically part of your model. Since it's just finding the nearest training points, how do you know which those are if you don't have their coordinates?
That said, why do you want to use a KNN model instead of something more modern (SVM, Random forest, Boosted trees, neural networks)? KNN models scale extremely poorly with the number of data points.
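As a minimal sketch of the call above, assuming the new rows have already been converted to the same term columns as the training data (for example via the `dictionary` control option when building the DocumentTermMatrix, if tm is the package in use). The term names and counts below are made up for illustration; `train_x` and `train_cl` stand in for `modeldata[train, ]` and `cl[train]`.

```r
library(class)  # provides knn()

# Hypothetical term counts standing in for modeldata[train, ]
train_x <- matrix(
  c(2, 1, 0,
    1, 2, 0,
    3, 0, 0,
    0, 0, 2,
    0, 0, 3,
    0, 1, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("refund", "complaint", "invoice"))
)
# Hypothetical labels standing in for cl[train]
train_cl <- factor(c(rep("Customer Service", 3),
                     rep("Not Customer Service", 3)))

# New, unlabelled rows; here the columns arrive in a different order
new_x <- matrix(
  c(0, 4, 0,
    4, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("invoice", "refund", "complaint"))
)

# knn() compares columns by position, so put them in the training order first
new_x <- new_x[, colnames(train_x), drop = FALSE]

# The training rows and labels are passed again: they are the model
knn.pred.newdata <- knn(train_x, new_x, train_cl, k = 1)
knn.pred.newdata
```

The point of the sketch is that nothing is "saved" from the earlier run except the training rows and their labels, so both have to be kept around for scoring.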


Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal)
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
imps <- amelia_data$imputations[[which_imp]]
as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data introduces into your analysis.
All the datasets created by multiple imputation are plausible imputations; because of the uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
Create your m imputed datasets
Build your trees on each imputed dataset separately
Do your analysis on each tree separately
In your final paper, you can now state how much uncertainty is caused by the missing values/imputation
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. Or, if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
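The steps above could be sketched roughly as follows. To keep the example self-contained, the list `imputations` is simulated here and only stands in for `amelia_data$imputations`; the column names follow the question, and the numbers are invented.

```r
library(rpart)

set.seed(300)

# Simulated stand-in for amelia_data$imputations (a list of m = 5 data frames).
# With a real Amelia object you would use amelia_data$imputations directly.
base <- data.frame(managerial_value = runif(200, 10, 50))
base$total_fatal <- 100 - 1.5 * base$managerial_value + rnorm(200, sd = 5)
imputations <- lapply(1:5, function(i) {
  d <- base
  # pretend the imputed values differ slightly between datasets
  d$managerial_value <- d$managerial_value + rnorm(nrow(d), sd = 2)
  d
})

# One tree per imputed dataset, instead of one tree on the rbind() of all five
trees <- lapply(imputations, function(d) {
  rpart(total_fatal ~ managerial_value, data = d)
})

# Run the same analysis on every tree, e.g. the prediction at a fixed value
preds <- unname(sapply(trees, function(tr) {
  predict(tr, newdata = data.frame(managerial_value = 30))
}))

# The spread across imputations indicates the uncertainty due to missing data
range(preds)
```

If the range is narrow, the conclusions are not very sensitive to the imputation; a wide range suggests the missing values dominate the result.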

How can I use a machine learning model to predict on data whose features differ slightly?

I have a randomForest model trained on a bunch of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with but don't quite match the features in the new data, such that when I predict on the new data I get:
Error in predict.randomForest(object = model, newdata = new_data) :
variables in the training data missing in newdata
I thought to get around this error by excluding all the features from the model which do not appear in the new data, and all the features in the new data which do not appear in the model. Putting aside for the moment the impact on model accuracy (this would significantly pare down the number of features, but there would still be plenty to predict with), I did something like this:
model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]
This worked, insofar as names(model$forest$xlevels) == colnames(new_data) returned TRUE for each feature name.
However, when I try to predict on the resulting new_data I still get the variables in the training data missing in newdata error. I am fairly certain that I'm amending the correct part of the model (model$forest$xlevels), so why isn't it working?
I think you should go the other way around, that is, add the missing columns to the new data.
When you are working with bags of words, it is common for some words not to be present in a batch of new data. These missing words should just be encoded as columns of zeros.
# do something like this (also exclude the target variable, obviously)
names_missing <- names(traindata)[!names(traindata) %in% names(new_data)]
new_data[,names_missing] <- 0L
and then you should be able to predict
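A slightly fuller sketch of the same idea in base R, assuming both datasets are data frames of tf-idf values. The helper name `align_to_training` and the toy column names are hypothetical. It also drops words that the model never saw, since those cannot contribute to the prediction.

```r
# Hypothetical helper: make new_data have exactly the training feature columns
align_to_training <- function(new_data, train_cols) {
  missing_cols <- setdiff(train_cols, names(new_data))
  if (length(missing_cols) > 0) {
    # words absent from the new batch get a tf-idf of zero
    new_data[missing_cols] <- 0
  }
  # keep only the training features, in the training order
  new_data[, train_cols, drop = FALSE]
}

# Toy example: feature names of the training data (target already excluded)
train_cols <- c("apple", "banana", "cherry")
new_data <- data.frame(banana = c(0.2, 0), durian = c(0.1, 0.5))

aligned <- align_to_training(new_data, train_cols)
aligned
```

The aligned data frame can then be passed as `newdata` to `predict()`.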

Does k cross validation work well with Random Forest Model in R?

I want to build a random forest model in R, and I first want to divide my dataset into training and testing sets.
All the tutorials so far use regular sampling.
For example: training<-data[1:150,] and
testing<- data[151:700,]
I tried to use 10-fold cross-validation with my random forest model and I want to make sure I am doing the right thing.
Here's my code in R:
library(caret)         # createFolds(), confusionMatrix()
library(randomForest)  # randomForest()
# the head of the dataset after deleting the ID attribute
head(wpdc)
# k-fold cross-validation
RF_folds <- createFolds(wpdc$outcome, k = 10)  # create folds
RF_fun <- lapply(RF_folds, function(x) {
  RF_traing_folds <- wpdc[-x, ]
  RF_test_folds <- wpdc[x, ]
  RF_test_folds_class <- RF_test_folds[, 1]  # outcome is the first column
  # build the model
  RF_model <- randomForest(outcome ~ ., data = RF_traing_folds)
  # test the model (drop the outcome column from the test fold)
  RF_predict <- predict(RF_model, RF_test_folds[-1])
  # accuracy
  RF_table <- table(RF_test_folds_class, RF_predict)
  RF_confusionMatrix <- confusionMatrix(RF_table, positive = "R")  # confusion matrix of each fold
  return(RF_confusionMatrix$table)
})
RF_sum_matrices <- Reduce('+', RF_fun) / 10  # sum the 10 matrices and divide by 10 (average per fold)
RF_final_confusionMatrix <- confusionMatrix(RF_sum_matrices, positive = "R")
RF_final_confusionMatrix
I read an article that says the out-of-bag estimate in random forests is similar to k-fold cross-validation, so there is no need to use k-fold cross-validation. So my question is: is my code correct? And is k-fold cross-validation a good choice for partitioning data before building a model? If so, what is the correct way to do it?

column names - xgboost predict on new data

I have never productionised an xgboost model and am concerned about how to handle predictions on fresh data. Specifically, I am concerned about cases where column names do not match the trained model's sparse matrix column names, either because new columns are added or because certain columns are removed when fresh data is converted to a sparse matrix.
What happens if I attempt to predict with an xgboost model on new data that has extra or missing column names? I can see this definitely occurring and would like to write code to account for it so that predictions are correct. I would prefer to avoid hacking together a solution if more elegant ones already exist.
So specifically, if the new data's sparse matrix has different column names, then what?
My best guess is to factorise (levels based on trained data levels) > create sparse matrix > then remove non-matching columns (between trained dataset and new data).
I have created dummy data (in below code) as an example of prediction errors given different column names.
1st step = build model (just for illustrative purposes I know it's a bad build)
2nd step = resample entire dataset then predict (= no problems. Predictions match)
3rd step = only select from 10% of data then predict - this gets prediction errors due to different column names.
Here's the code:
Step 1 create dummy data and create a lazy xgboost model just for illustrative purposes.
library(xgboost) # for xgboost algo
library(Matrix) # for sparse matrix
### Create dummy data
num_rows <- 100
set.seed(1234)
target <- runif(num_rows)
dummy_data <- data.frame(
LETTER_SINGLE=sample(LETTERS,num_rows,replace=TRUE),
DOUBLE_LETTER=paste(sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sep=""),
TRIPLE_LETTER=paste(sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sep=""),
stringsAsFactors=FALSE
)
## STEP 1 CREATE XGBOOST MODEL AND GET PREDICTED VALUES TO COMPARE WITH FUTURE DATA CUTS.
model_data_01 <- dummy_data
target_01 <- target
# create matrix
model_01_sparse <- sparse.model.matrix(~ .-1, data = model_data_01)
# colnames model 1
colnames_trained_model <- colnames(model_01_sparse)
# train a model
xgb_fit_01 <-
xgboost(data = model_01_sparse,
label = target_01,
#param = best_param,
nrounds=100,
verbose = T
)
pred_01 <- predict(xgb_fit_01,newdata=model_01_sparse)
Step 2. Test to see if order of observations cause differences in predictions. Spoiler - no prediction errors occur.
## STEP 2 CREATE SHUFFLED DATA (SAME DATA SAMPLES BUT SHUFFLED) THEN PREDICT AND COMPARE.
sample_order <- sample(1:num_rows)
model_data_shuffled <- dummy_data[sample_order,]
target_shuffled <- target[sample_order]
# They are different
head(model_data_01)
head(model_data_shuffled)
# create matrix
model_shuffled_sparse <- sparse.model.matrix(~ .-1, data = model_data_shuffled)
# colnames model 1
colnames_shuffled <- colnames(model_shuffled_sparse)
pred_shuffled <- predict(xgb_fit_01,newdata=model_shuffled_sparse)
# check if predictions differ
pred_01[sample_order] - pred_shuffled
## This matched. Yay. sparse.model.matrix function must first sort alphabetically then create column names.
# due to same column names
mean(colnames_trained_model == colnames_shuffled)
Step 3. Only sample a select few rows and predict to see whether missing columns - in sparse matrix - cause prediction errors.
## STEP 2 WORKED FINE SO ONTO...
## STEP 3 RANDOMLY SAMPLE ONLY A HANDFUL OF ROWS PREDICT AND COMPARE.
sample_order_02 <- sample(1:(num_rows*0.1))
model_data_shuffled_02 <- dummy_data[sample_order_02,]
target_shuffled_02 <- target[sample_order_02]
# create matrix
model_shuffled_sparse_02 <- sparse.model.matrix(~ .-1, data = model_data_shuffled_02)
# colnames model 1
colnames_shuffled_02 <- colnames(model_shuffled_sparse_02)
pred_shuffled_02 <- predict(xgb_fit_01,newdata=model_shuffled_sparse_02)
# check if predictions differ
pred_01[sample_order_02] - pred_shuffled_02
## This did not match. Damn.
# Due to different column names
colnames_trained_model
colnames_shuffled_02
mean(colnames_trained_model == colnames_shuffled_02)
As you can see, this last attempt gets variance in the predicted values due solely to missing column names in the sparse matrix.
I don't want to hack an ugly solution together if an elegant one exists for me to learn from.
So my question is... Is there an elegant way to force sparse model matrix column names to match that of the built model (the one used for predictions on new data)?
I have searched the web and no luck thus far finding any best practices solution.
If anybody could help by answering the Question or pointing me in the right direction that would be much appreciated.
What is your production environment? R, Python, Java or something else?
The idea is to use XGBoost functionality (both training and prediction) via production environment-specific wrapper library, not directly. For example, in Python, you could use Scikit-Learn wrappers, which encapsulate feature engineering and -selection tasks into a reusable sklearn.pipeline.Pipeline object. You would 1) fit the pipeline object (where the XGBoost estimator is the final task) in development environment and serialize it to a pickle file, 2) move the pickle file from development to production environment, and 3) de-serialize it from the pickle file and use for transforming new data in production environment. This is a high-level API, which completely abstracts away low-level details such as the layout of XGBoost "internal" data matrices.
For a platform-independent solution, you could export XGBoost models (and associated data pre-processing logic) in the standardized PMML representation.
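If you stay in R, the approach guessed at in the question (fix the factor levels to those seen in training before building the sparse matrix) can be sketched as below. This is only a sketch under assumptions: the helper `apply_levels` and the toy data are hypothetical, and values never seen in training become NA, whose rows are dropped under R's default `na.omit` handling, so they need separate treatment.

```r
library(Matrix)  # for sparse.model.matrix()

# Toy training data with character columns, as in the question
train_df <- data.frame(a = c("A", "B", "C", "A"),
                       b = c("X", "Y", "X", "Z"),
                       stringsAsFactors = FALSE)

# Store the levels seen in training; this list must be saved with the model
train_levels <- lapply(train_df, function(col) sort(unique(col)))

# Hypothetical helper: convert every column to a factor with the training levels
apply_levels <- function(df, levels_list) {
  for (nm in names(levels_list)) {
    df[[nm]] <- factor(df[[nm]], levels = levels_list[[nm]])
  }
  df
}

train_mat <- sparse.model.matrix(~ . - 1,
                                 data = apply_levels(train_df, train_levels))

# A small batch of new data that does not contain every level
new_df <- data.frame(a = c("B", "A"),
                     b = c("X", "X"),
                     stringsAsFactors = FALSE)
new_mat <- sparse.model.matrix(~ . - 1,
                               data = apply_levels(new_df, train_levels))

# Unused levels still get (all-zero) columns, so the layouts agree
identical(colnames(train_mat), colnames(new_mat))
```

Because the column layout now depends only on the stored levels and not on which values happen to occur in a batch, `new_mat` can be passed to `predict()` with the columns in the positions the model was trained on.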

Preprocess data in R

I'm using R to create a logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
It works fine, but I would like to preprocess the data to see if that improves the classification model.
What I would like to do is divide the continuous input variables into intervals. Let's say that one variable is height in centimeters, stored as a float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into, let's say, 3 different intervals: small, medium, large.
I would like to do this with all the variables in my set; there are 32 variables.
What's more, at the end I would like to see the correlation between the values of the variables (these intervals) and the classification result class.
Is this clear?
Thank you very much in advance
A classification model creates a decision boundary, and existing algorithms are rather good at estimating it. Let's assume that you have one variable, height, and a linear decision boundary. Your algorithm can then decide between which values to put the decision boundary by estimating the error on the training set. If you perform quantization and create a few intervals, your algorithm has fewer places to put the boundary (data loss). It will likely perform worse on such a cropped dataset than on the original one. It could help if your learning algorithm is suffering from high variance (is overfitting the data), but then you could also try getting more training examples, using a smaller subset of features, or using an algorithm with regularization and increasing the regularization parameter.
There are also many questions about how to choose the number of intervals and how to divide the data into them, such as: should all intervals be equally frequent, of equal width, or chosen so that values inside each interval are as similar to each other as possible?
If you just want to experiment, use some software such as the free version of RapidMiner Studio (it can read CSV and Excel files and has some quick quantization options) to convert your data.
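If you would rather stay in R for the experiment, base R's `cut()` can do the binning. The sketch below uses the sample heights from the question and shows the two most common choices, equal-width and equal-frequency intervals; the labels follow the question and the variable names are only illustrative.

```r
height <- c(183.23, 173.43, 163.53, 153.63, 193.27)
bin_labels <- c("small", "medium", "large")

# Equal-width bins: the observed range is split into 3 intervals of equal length
height_equal_width <- cut(height, breaks = 3, labels = bin_labels)

# Equal-frequency bins: break points are the 1/3 and 2/3 quantiles
q <- quantile(height, probs = c(0, 1/3, 2/3, 1))
height_equal_freq <- cut(height, breaks = q, labels = bin_labels,
                         include.lowest = TRUE)

data.frame(height, height_equal_width, height_equal_freq)
```

To apply this to all 32 variables you could wrap the `cut()` call in `lapply()` over the numeric columns, and then cross-tabulate each binned variable against `Class` with `table()` to see how the intervals relate to the outcome.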
