Missing Categories in Validation Data - R

I built a classification model in R from a training dataset with 12 categorical predictors, each holding tens to hundreds of categories.
The problem is that in the dataset I use for validation, some of the variables have fewer categories than in the training data.
For example, if the training data has a variable v1 with 3 categories, 'a', 'b' and 'c', then in the validation dataset v1 has only 2 categories, 'a' and 'b'.
Tree-based methods like decision trees or random forests handle this without trouble, but logistic regression methods (I use LASSO) require preparing a dummy-variable matrix, and the number of columns in the training matrix and the validation matrix doesn't match. Going back to the example of v1: in the training data I get three dummy variables for v1, but in the validation data I get only two.
Any idea how to solve this?

You can avoid this problem by setting the factor levels correctly. Look at the following toy example:
set.seed(106)
thedata <- data.frame(
  y = rnorm(100),
  x = factor(sample(letters[1:3], 100, TRUE))
)
head(model.matrix(y~x, data = thedata))
thetrain <- thedata[1:7,]
length(unique(thetrain$x))
head(model.matrix(y~x, data = thetrain))
I make a dataset with an x and a y variable, where x is a factor with 3 levels. The training subset contains only 2 of those levels, but the model matrix is still constructed correctly, because R kept the level information from the original dataset:
> levels(thetrain$x)
[1] "a" "b" "c"
The problem arises when your training set is constructed using, e.g., the function data.frame() or any other method that drops the level information of the factor.
Try the following:
thetrain$x <- factor(thetrain$x) # drops the unused levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain))
You see from the second line that the level "b" has been dropped, and consequently the model matrix is no longer what you want. So make sure that all factors in your training dataset actually have all levels, e.g.:
thetrain$x <- factor(thetrain$x, levels = c("a","b","c"))
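In the asker's setting, with many categorical predictors, a minimal sketch of the same idea applied across all factors might look like this (train and valid are hypothetical data frames standing in for the training and validation sets):
for (v in names(train)[sapply(train, is.factor)]) {
  # re-code each validation factor with the levels seen in training
  valid[[v]] <- factor(valid[[v]], levels = levels(train[[v]]))
}
After this, model.matrix() produces the same dummy columns for both datasets.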
On a side note: if you build your model matrices yourself using either model.frame() or model.matrix(), the xlev argument might be of help:
thetrain$x <- factor(thetrain$x) # drops the unused levels
levels(thetrain$x)
head(model.matrix(y ~ x, data = thetrain,
                  xlev = list(x = c('a','b','c'))))
Note that this xlev argument actually comes from model.frame(), and model.matrix() doesn't call model.frame() in every case, so this solution is not guaranteed to always work, but it should for data frames.

Related

what does "~0" mean in the R model matrix

I am trying to understand the model matrix in R (model.matrix), used to convert categorical variables to dummy variables, and came across the following code:
# Option 2: use model.matrix() to convert all categorical variables in the data frame into a set of dummy variables. We must then turn the resulting data matrix back into
# a data frame for further work.
xtotal <- model.matrix(~ 0 + REMODEL, data = df)
xtotal <- as.data.frame(xtotal)
Can someone please help me understand what "~ 0" means here? And what is the code trying to do?
+ 0 means that the model will not have an intercept, i.e., a column of 1s. In the presence of a factor variable, when an intercept is present, one of the levels will be removed in order to ensure that the model matrix is full rank, which is required in ordinary least squares regression. When the intercept is absent, all levels of the factor can remain.
So, this code is a way to convert a factor to a matrix where dummy variables for all the levels are present. Omitting + 0 would replace one of the dummies with a column of 1s, which may not be useful for your purpose.
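To see the difference concretely, here is a minimal sketch (f is a throwaway factor, not from the question):
f <- factor(c("a", "b", "a", "c"))
model.matrix(~ f)     # intercept column plus dummies for "b" and "c"
model.matrix(~ 0 + f) # one dummy column per level, no intercept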

Factor scores from factor analysis on ordinal categorical data in R

I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to assess how many factors to extract and to run the factor analysis using the psych package, but I can't figure out how to get factor scores for individual participants, and I haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert to ordered factors
for (i in 1:ncol(dat)) {
  dat[, i] <- ordered(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# 2. choose number of factors
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat), var = ncol(dat),
               rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height=4,width=6,noRStudioGD = T)
plotnScree(nS) # 2 factors, maybe 1
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE,
                n.obs = NULL, rotate = "varimax", fm = "ml", sort = FALSE)
for (i in 1:ncol(dat)) { dat[, i] <- as.numeric(dat[, i]) }
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. For ordinal variables you have a few options: use an IRT-based approach (in which case you'd be using something like the Graded Response Model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
Most factor analysis packages have a method for getting factor scores, but they give you different output depending on what you use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw score data. The problem here is the requirement to use the polychoric matrix as input.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert to ordered factors
for (i in 1:ncol(dat)) {
  dat[, i] <- ordered(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
1. Calculate the polychoric correlation matrix.
2. Use that matrix to conduct the factor analysis, extracting 2 factors and their associated loadings.
3. Use the loadings from the FA and the raw (numeric) data to get your factor scores.
Both this method and the method you use in your edit treat the original data as numeric rather than as factors. I think this should be OK, because you're just taking your raw data and projecting it down onto the factors identified by the FA, and the loadings already take the ordinal nature of your variables into account (since you used the polychoric matrix as input to the FA). The post linked above cautions against this approach, however, and suggests some alternatives; this is not a straightforward problem to solve.
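For what it's worth, here is a minimal sketch of that scoring step with the numeric conversion made explicit (data.matrix() turns any factor columns into their integer codes; psych's factor.scores() returns the scores in its $scores element):
scores <- factor.scores(data.matrix(dat_orig), faPC)$scores
head(scores)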

column names - xgboost predict on new data

I have never productionised an xgboost model and am concerned about how to handle fresh-data predictions with an xgboost model, specifically when the column names do not match the trained model's sparse matrix column names, either because new columns are added or because certain columns are removed when fresh data is converted to a sparse matrix.
What if I attempt to predict with an xgboost model on new data that has extra or missing column names? I can see this definitely occurring and would like to write code to account for it so that predictions are correct. I would prefer to avoid hacking together a solution if a more elegant one already exists.
So specifically: if the new data's sparse matrix has different column names, then what?
My best guess is to factorise (with levels based on the training data's levels), create the sparse matrix, and then remove non-matching columns (between the training dataset and the new data).
I have created dummy data (in below code) as an example of prediction errors given different column names.
1st step = build the model (just for illustrative purposes; I know it's a bad build).
2nd step = resample the entire dataset, then predict (no problems; predictions match).
3rd step = select only 10% of the data, then predict; this produces prediction errors due to different column names.
Here's the code:
Step 1: create dummy data and a lazy xgboost model, just for illustrative purposes.
library(xgboost) # for xgboost algo
library(Matrix) # for sparse matrix
### Create dummy data
num_rows <- 100
set.seed(1234)
target <- runif(num_rows)
dummy_data <- data.frame(
  LETTER_SINGLE = sample(LETTERS, num_rows, replace = TRUE),
  DOUBLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  TRIPLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  stringsAsFactors = FALSE
)
## STEP 1 CREATE XGBOOST MODEL AND GET PREDICTED VALUES TO COMPARE WITH FUTURE DATA CUTS.
model_data_01 <- dummy_data
target_01 <- target
# create matrix
model_01_sparse <- sparse.model.matrix(~ .-1, data = model_data_01)
# colnames model 1
colnames_trained_model <- colnames(model_01_sparse)
# train a model
xgb_fit_01 <-
  xgboost(data = model_01_sparse,
          label = target_01,
          # param = best_param,
          nrounds = 100,
          verbose = TRUE)
pred_01 <- predict(xgb_fit_01,newdata=model_01_sparse)
Step 2: test whether the order of observations causes differences in predictions. Spoiler: no prediction errors occur.
## STEP 2 CREATE SHUFFLED DATA (SAME DATA SAMPLES BUT SHUFFLED) THEN PREDICT AND COMPARE.
sample_order <- sample(1:num_rows)
model_data_shuffled <- dummy_data[sample_order,]
target_shuffled <- target[sample_order]
# They are different
head(model_data_01)
head(model_data_shuffled)
# create matrix
model_shuffled_sparse <- sparse.model.matrix(~ .-1, data = model_data_shuffled)
# colnames model 1
colnames_shuffled <- colnames(model_shuffled_sparse)
pred_shuffled <- predict(xgb_fit_01,newdata=model_shuffled_sparse)
# check if predictions differ
pred_01[sample_order] - pred_shuffled
## These match. Yay. sparse.model.matrix() must sort the levels alphabetically before creating column names.
# due to same column names
mean(colnames_trained_model == colnames_shuffled)
Step 3: sample only a few rows and predict, to see whether missing columns in the sparse matrix cause prediction errors.
## STEP 2 WORKED FINE SO ONTO...
## STEP 3 RANDOMLY SAMPLE ONLY A HANDFUL OF ROWS PREDICT AND COMPARE.
sample_order_02 <- sample(1:(num_rows*0.1))
model_data_shuffled_02 <- dummy_data[sample_order_02,]
target_shuffled_02 <- target[sample_order_02]
# create matrix
model_shuffled_sparse_02 <- sparse.model.matrix(~ .-1, data = model_data_shuffled_02)
# colnames model 1
colnames_shuffled_02 <- colnames(model_shuffled_sparse_02)
pred_shuffled_02 <- predict(xgb_fit_01,newdata=model_shuffled_sparse_02)
# check if predictions differ
pred_01[sample_order_02] - pred_shuffled_02
## These did not match. Damn.
# Due to different column names
colnames_trained_model
colnames_shuffled_02
mean(colnames_trained_model == colnames_shuffled_02)
As you can see, this last attempt produces differences in the predicted values, due solely to missing column names in the sparse matrix.
I don't want to hack an ugly solution together if an elegant one exists for me to learn from.
So my question is: is there an elegant way to force the sparse model matrix column names to match those of the built model (the one used for predictions on new data)?
I have searched the web with no luck thus far in finding a best-practice solution.
If anybody could help by answering the Question or pointing me in the right direction that would be much appreciated.
What is your production environment? R, Python, Java or something else?
The idea is to use XGBoost functionality (both training and prediction) via a production-environment-specific wrapper library, not directly. For example, in Python you could use the Scikit-Learn wrappers, which encapsulate feature engineering and selection tasks in a reusable sklearn.pipeline.Pipeline object. You would 1) fit the pipeline object (where the XGBoost estimator is the final task) in the development environment and serialize it to a pickle file, 2) move the pickle file from the development to the production environment, and 3) de-serialize it from the pickle file and use it for transforming new data in the production environment. This is a high-level API that completely abstracts away low-level details such as the layout of XGBoost's "internal" data matrices.
For a platform-independent solution, you could export XGBoost models (and associated data pre-processing logic) in the standardized PMML representation.
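For an R-only setup, here is a rough sketch of the column-alignment idea from the question (align_columns is a hypothetical helper, not a library function; it pads the new data's sparse matrix with all-zero columns for levels unseen in the new data, drops columns unseen in training, and reorders to the training layout):
library(Matrix)
align_columns <- function(new_sparse, trained_cols) {
  missing_cols <- setdiff(trained_cols, colnames(new_sparse))
  if (length(missing_cols) > 0) {
    # all-zero columns for levels absent from the new data
    pad <- Matrix(0, nrow = nrow(new_sparse), ncol = length(missing_cols),
                  sparse = TRUE, dimnames = list(NULL, missing_cols))
    new_sparse <- cbind(new_sparse, pad)
  }
  # reorder to the training layout; columns not in trained_cols are dropped
  new_sparse[, trained_cols, drop = FALSE]
}
# e.g. predict(xgb_fit_01, align_columns(model_shuffled_sparse_02, colnames_trained_model))
This is only a sketch; the wrapper and PMML routes above avoid maintaining such code by hand.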

How does randomForest() predict for new factor levels not in training data?

When I create a training set and a test set by splitting a single data frame and build a random forest using the randomForest package, the predict() function still returns an output for factor levels that are not present in the training data. While this gives no error (which was what I was looking for in the related question), my question is: on what basis does the randomForest() model predict the value, when it ideally should have thrown the following error?
Error in predict.randomForest() :
New factor levels not present in the training data
Just out of curiosity, I want to know whether the randomForest() method makes some inherent assumption about new factor levels in the test data.
Here's a reproducible example:
seq1 <- c(5,3,1,3,1,"unwanted_char",4,2,2,3,0,4,1,1,0,1,0,1)
df1 <- matrix(seq1,6)
df1 <- as.data.frame(df1)
colnames(df1) <- c("a","b","c")
train <- df1[1:4,]
test <- df1[5:6,]
Now when we create a forest using train and run predict() on test as follows...
library(randomForest)
forest1 <- randomForest(c ~ a + b, data = train, ntree = 500)
test$prediction <- predict(forest1, test, type = 'response')
The test data frame contains a prediction of '1' for the last observation, which has a = 'unwanted_char' and b = '4'.
Please note: when you create the test and train data separately, the predict function throws the above-mentioned error instead of predicting.
My opinion is that this is a very bad example; but, here's the answer:
Your df1 has only factor variables and 4 observations. Here mtry will equal 1, meaning that roughly half your trees will be based on b alone and half on a alone. When b == "4" the classification is always 1, i.e. b == 4 perfectly predicts c == 1. Similarly, a == 1 perfectly predicts c == 0.
The reason this works when you create the data in a single dataset is that the variables are factor variables, whose possible levels exist in both train and test, even though the observed counts for some levels are 0 in train. Since "unwanted_char" is a possible level in train$a (although unobserved), it's not problematic for your prediction. If you create these as separate datasets, the factors are created independently and test has new levels.
That is to say that, essentially, your example behaves this way because of how factors work in R.
I concur with Alex that this is not a good example.
Here is the answer to your question:
str(train)
If you check the structure of your train data, you will see that variable 'a' has all 4 levels, because the levels were assigned when you created the data frame df1.
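A quick throwaway demonstration of this point (not from the question):
f <- factor(c("5", "3", "1", "unwanted_char"))
sub <- f[1:2]
levels(sub)             # all four levels survive subsetting
levels(droplevels(sub)) # only the observed levels remain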

Classification column is removed after using dummyVars in caret package - R

I am playing around with the caret package and came upon this question.
I am using dummyVars to split my categorical columns into separate dummy variables. It seems that the dummyVars code removes the classification column from the input data set. For example:
library(caret)
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "sex.female" "sex.male" "age"
[7] "sibsp" "parch"
So when I try to split the data, I get an error.
train = createDataPartition(et$survived, p=.75, list=FALSE)
Error in createDataPartition(et$survived, p = 0.75, list = FALSE) :
y must have at least 2 data points
Could anyone let me know if this is the expected behavior of caret's dummyVars? I can easily add the survived column back into the data set using, say,
et$survived<-etitanic$survived
and then train a model. But I presume there must be a better way, or else the caret package would not remove the classification column. Am I missing something here? Could someone shed more light on this, please?
Thanks
As far as I know there is no way to keep the classification column (or at least not as a factor; that is because the output is a matrix and therefore always numeric). This is because the purpose of the dummyVars function is to create dummy variables for the factor predictor variables. It is also designed to provide an alternative to the base R function model.matrix, offering more choices (model.matrix also does not keep the classification column).
Also, and maybe more importantly, functions that require the classification column to be of factor class offer either a way to provide the factor as a separate argument (like the svm function from the e1071 package) or specifically require it as a separate argument (like the knn function from the FNN package). In both cases you do not need to have the factor in your data.frame; you just need to provide it as a separate vector to the function you want to use.
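For example, a minimal sketch with e1071's svm(), passing the dummy matrix and the factor response separately (et as built above; wrapping survived in factor() just makes the classification intent explicit):
library(e1071)
fit <- svm(x = et, y = factor(etitanic$survived))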
However, there is an alternative for cases where you do not need the classification column to be of factor type, in which case you can simply do:
library(earth)
data(etitanic)
etitanic2 <- etitanic
# convert the classification column to numeric
etitanic2$survived <- as.numeric(etitanic2$survived)
# use a formula without specifying the response variable
dummies <- dummyVars(~ ., data = etitanic2, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic2))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "survived"   "sex.female" "sex.male"   "age"
[8] "sibsp"      "parch"
By converting the classification column to numeric and by not specifying a response variable in the formula, the survived column is kept in the output data.frame, but as numeric.
