Classification column is removed after using dummyVars in caret package - R

I am playing around with the caret package and came upon this question.
I am using dummyVars to split my categorical columns into separate dummy variables. It seems that dummyVars removes the classification column from the input data set. For example:
library(caret)
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "sex.female" "sex.male" "age"
[7] "sibsp" "parch"
So when I try to split the data, I get an error.
train = createDataPartition(et$survived, p=.75, list=FALSE)
Error in createDataPartition(et$survived, p = 0.75, list = FALSE) :
y must have at least 2 data points
Could anyone let me know if this is the expected behavior of caret's dummyVars? I can easily add the survived column back into the data set using, say,
et$survived <- etitanic$survived
and then train a model. But I presume there must be a better way, or the caret package would not remove the classification column. Am I missing something here? Could someone throw more light on this, please?
Thanks

As far as I know there is no way to keep the classification column (or at least not as a factor, because the output is a matrix and therefore always numeric). This is because the purpose of the dummyVars function is to create dummy variables for the factor predictor variables. It is also designed as an alternative to the base R function model.matrix, but with more choices (model.matrix does not keep the classification column either).
Also, and maybe more importantly, functions that require the classification column to be a factor (and only a factor) either offer a way to supply the factor as a separate argument (like the svm function from the e1071 package) or specifically require it as a separate argument (like the knn function from the FNN package). In both cases you do not need to have the factor in your data.frame; you just need to provide it as a separate vector to the function you want to use.
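For instance, a minimal sketch with the et data frame built in the question and e1071's svm, just to show the pattern of passing the labels separately (settings are left at their defaults):
library(e1071)
# the class labels stay out of et; they are passed as a separate factor vector
fit <- svm(x = as.matrix(et), y = as.factor(etitanic$survived))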
However, there is an alternative for cases where you do not need the classification column to be a factor, in which case you can simply do:
library(earth)
data(etitanic)
etitanic2 <- etitanic
# convert the classification column to numeric
etitanic2$survived <- as.numeric(etitanic2$survived)
# use a formula without specifying the response variable
dummies <- dummyVars(~ ., data = etitanic2, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic2))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "survived"   "sex.female" "sex.male"   "age"
[8] "sibsp"      "parch"
By converting the classification column to numeric and not specifying a response variable in the formula, the survived column is kept in the output data.frame, but as a numeric column.
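With survived kept in et, the split from the question then works directly; a quick sketch (the seed is arbitrary):
library(caret)
set.seed(42)
train_idx <- createDataPartition(et$survived, p = 0.75, list = FALSE)
et_train <- et[train_idx, ]
et_test  <- et[-train_idx, ]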

Related

XGBoost Error in R Studio ("'data' has class 'character' and length...")

I am having difficulties fitting my data to an xgboost classifier model. When I run this:
classifier = xgboost(data = as.matrix(training_set[c(4:15, 17:18, 20:28)]),
                     label = training_set$posted_ind, nrounds = 10)
R Studio tells me:
Error in xgb.DMatrix(data, label = label, missing = missing) :
'data' has class 'character' and length 1472000.
'data' accepts either a numeric matrix or a single filename.
The training set has both continuous and categorical data, but all categorical data has been encoded as such (and the same data have been fit to random forest and naive Bayes models). Is there some additional step I need to complete so that I can use these data in an xgboost model?
Make sure that your "training_set" does not have any columns that are factors. If you encoded your categorical variables as numeric but cast them as factors, you will get this error.
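A quick way to check for that, and to undo it, assuming the factor labels really are numeric codes (a hedged sketch, not a drop-in fix):
sapply(training_set, class)   # any "factor" or "character" column will trip up xgb.DMatrix
training_set[] <- lapply(training_set, function(col)
  if (is.factor(col)) as.numeric(as.character(col)) else col)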
I came across the same problem and found a complete solution. You have to use:
library(Matrix)    # for sparse.model.matrix()
library(xgboost)
sparse_matrix <- sparse.model.matrix(label_y ~ ., data = df)[, -1]
X_train_dmat <- xgb.DMatrix(sparse_matrix, label = df$label_y)
This transforms the categorical data into dummy variables. Several encoding methods exist; one-hot encoding is a common approach. The above is dummy contrast coding, which is popular because it produces a “full rank” encoding (see also this blog post by Max Kuhn).
The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.
For example, a column Treatment will be replaced by two columns, TreatmentPlacebo and TreatmentTreated, each of them binary. An observation with the value Placebo in column Treatment before the transformation will, after the transformation, have the value 1 in the new column TreatmentPlacebo and the value 0 in the new column TreatmentTreated. The column TreatmentPlacebo then disappears during the contrast encoding, as it is absorbed into a common constant intercept column.
Source: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#conversion-from-categorical-to-numeric-variables
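A tiny toy illustration of that Treatment example (the data frame here is made up):
library(Matrix)
df <- data.frame(Treatment = factor(c("Placebo", "Treated", "Placebo")))
# only TreatmentTreated survives; Placebo is the reference level absorbed by the dropped intercept
sparse.model.matrix(~ Treatment, data = df)[, -1, drop = FALSE]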
What works for me when using tidymodels is adding a recipe step for dummy encoding:
step_dummy(all_nominal_predictors(), one_hot = TRUE)
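For context, a minimal recipe sketch (the outcome and training_set names are placeholders):
library(recipes)
rec <- recipe(outcome ~ ., data = training_set) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
dummied <- bake(prep(rec, training = training_set), new_data = NULL)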

Missing Categories in Validation Data

I built a classification model in R based on a training dataset with 12 categorical predictors; each variable holds tens to hundreds of categories.
The problem is that in the dataset I use for validation, some of the variables have fewer categories than in the training data.
For example, if in the training data variable v1 has 3 categories - 'a', 'b', 'c' - then in the validation dataset v1 has only 2 categories - 'a', 'b'.
In tree-based methods like decision trees or random forests this causes no problem, but in logistic regression methods (I use LASSO) that require preparing a dummy variable matrix, the number of columns in the training data matrix and the validation data matrix don't match. Going back to the example of variable v1: in the training data I get three dummy variables for v1, and in the validation data I get only 2.
Any idea how to solve this?
You can try to avoid this problem by setting the levels correctly. Look at the following very simple example:
set.seed(106)
thedata <- data.frame(
  y = rnorm(100),
  x = factor(sample(letters[1:3], 100, TRUE))
)
head(model.matrix(y ~ x, data = thedata))

thetrain <- thedata[1:7, ]
length(unique(thetrain$x))
head(model.matrix(y ~ x, data = thetrain))
I make a dataset with an x and a y variable, where x is a factor with 3 levels. The training subset only contains 2 of the levels of x, but the model matrix is still constructed correctly. That is because R kept the level information from the original dataset:
> levels(thetrain$x)
[1] "a" "b" "c"
The problem arises when your training set is somehow constructed using, e.g., the function data.frame() or any other method that drops the level information of the factor.
Try the following:
thetrain$x <- factor(thetrain$x)  # drops the unused levels
levels(thetrain$x)
head(model.matrix(y ~ x, data = thetrain))
You see in the second line that the level "b" has been dropped, and consequently the model matrix is no longer what you want. So make sure that all factors in your training dataset actually have all levels, e.g.:
thetrain$x <- factor(thetrain$x, levels = c("a","b","c"))
On a side note: if you build your model matrices yourself using either model.frame() or model.matrix(), the xlev argument might be of help:
thetrain$x <- factor(thetrain$x)  # drops the unused levels again
levels(thetrain$x)
head(model.matrix(y ~ x, data = thetrain,
                  xlev = list(x = c('a', 'b', 'c'))))
Note that this xlev argument actually comes from model.frame, and model.matrix doesn't call model.frame in every case, so this solution is not guaranteed to always work, but it should for data frames.
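Applied to the original question, the same idea means forcing every factor in the validation data onto the training levels; a generic sketch (train_df and valid_df are placeholder names, assumed to share column names):
for (v in names(valid_df)) {
  if (is.factor(train_df[[v]])) {
    # unseen-in-validation categories are kept as empty levels; new unseen values become NA
    valid_df[[v]] <- factor(valid_df[[v]], levels = levels(train_df[[v]]))
  }
}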

Error in bn.fit predict function in bnlearn R

I have learned and fitted a Bayesian network with the bnlearn R package and I wish to predict the value of its "event" node.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, head=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of "event" node based on the evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on the number of levels of the variables in the evidence data.frame.
The dtbl2 data.frame contains only a few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I wish to use the predict function also for networks with mixed variables (both discrete and continuous), and I haven't found out how to make use of evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how should I do it right?
Thanks in advance!
The problem was that reading the evidence data.frame in
fileName = "data/dcmp.txt"
dtbl2 = read.csv(file = fileName, head = h, sep = ",")
predict(fitted, "event", dtbl2)
caused the categorical variables to become factors with a different number of levels (a subset of the levels in the original training set).
I used the following code to solve the issue.
for (i in 1:ncol(dtbl2)) {
  dtbl2[[i]] = factor(dtbl2[[i]], levels = levels(dtbl1[[i]]))
}
By the way, the bnlearn package does fit models with mixed variables and also provides functions for prediction in them.
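For example, the likelihood-weighting method of predict() can condition on both discrete and continuous evidence columns; a sketch (see ?predict.bn.fit in bnlearn for the details):
pred <- predict(fitted, node = "event", data = dtbl2, method = "bayes-lw")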

Lasso, glmnet, preprocessing of the data

I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (1/0, not ordered, ~4000 of them) except for one continuous variable.
I need to convert the predictors into a sparse matrix, since it takes forever and a day otherwise.
My question is: it seems that people use sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that, and do I need to do it here? The outcome is slightly different for the two methods.
Also, do my factors need to be coded as factors (for both the outcome and the predictors), or is it sufficient to use the sparse matrix and specify in the glmnet model that the outcome is binomial?
Here's what I'm doing so far:
# Create a random dataset: y is the outcome, x_d holds the dummies (10 here for simplicity)
# and x_c is the continuous variable
library(Matrix)
library(glmnet)
y   <- sample(c(1:0), 200, replace = TRUE)
x_d <- matrix(data = sample(c(1:0), 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)
# FIRST: scale that one continuous variable
scaled <- scale(x_c, center = TRUE, scale = TRUE)
# then bind the predictors together
x <- cbind(x_d, scaled)
# HERE'S MY MAIN QUESTION: what I currently do is
xt <- Matrix(x, sparse = TRUE)
# then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)
# which gives slightly different results from (here the outcome variable is in the x matrix too)
xt <- sparse.model.matrix(data = x, y ~ .)
# then run CV.
So to sum up, my 2 questions are:
1. Do I need to use sparse.model.matrix even if my factors are just binary and not ordered? (And if yes, what does it actually do differently from just converting the matrix to a sparse matrix?)
2. Do I need to code the binary variables as factors?
The reason I ask is that my dataset is huge; it saves a lot of time to just do it without coding them as factors.
I don't think you need sparse.model.matrix; all it really gives you over a regular sparse matrix is the expansion of factor terms, and if your predictors are already binary that won't give you anything. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix with only 1's. At the end of the day glmnet is a numerical method, so a factor will get converted to a number in the end regardless.
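In other words, with the toy objects from the question, the plain sparse matrix route is all you need; a minimal sketch:
library(Matrix)
library(glmnet)
xt <- Matrix(x, sparse = TRUE)               # no model.matrix step, no factors
cv_fit <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)
coef(cv_fit, s = "lambda.min")               # nonzero rows are the predictors the lasso kept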

lm and predict - agreement of data.frame names

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must also be present in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and, not least, the cause of the problem you are hitting. When you then call predict, it will look for a variable in testset with the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility. Getting the object name out of the formula lets you see more easily what the model actually is.
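A self-contained sketch of the whole pattern with made-up data (the row counts echo the warning above):
set.seed(1)
trainingset <- data.frame(independent = rnorm(142), dependent = rnorm(142))
testset     <- data.frame(independent = rnorm(34))
c_lm   <- lm(dependent ~ independent, data = trainingset)
c_pred <- predict(c_lm, newdata = testset)   # no name gymnastics needed
length(c_pred)                               # 34: one prediction per test row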
