The data frame df is split into train and test data frames, and the train data frame is further split into training and testing data frames. The dependent variable Y is a binary factor with values 0 and 1. I'm trying to predict the class probabilities with this code (neural network, caret package):
library(caret)
model_nn <- train(
  Y ~ ., training,
  method = "nnet",
  metric = "ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs = TRUE
  )
)
model_nn_v2 <- model_nn
nnprediction <- predict(model_nn, testing, type="prob")
cmnn <- confusionMatrix(nnprediction, testing$Y)
print(cmnn) # The confusion matrix is to assess/compare the model
However, it gives me this error:
Error: At least one of the class levels is not a valid R variable name;
This will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels
that can be used as valid R variable names (see ?make.names for help).
I don't understand what "use factor levels that can be used as valid R variable names" means. The dependent variable Y is already a factor, so how is it not a valid R variable name?
PS: The code works perfectly if you erase classProbs=TRUE in trainControl() and metric="ROC" in train(). However, "ROC" is the metric I use to compare candidate models, so I'm trying to train the model with the "ROC" metric.
EDIT: Code example:
# You have to run all of this BEFORE running the model
classes <- c("a","b","b","c","c")
floats <- c(1.5,2.3,6.4,2.3,12)
dummy <- c(1,0,1,1,0)
chr <- c("1","2","2,","3","4")
Y <- c("1","0","1","1","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
classes <- c("a","a","a","b","c")
floats <- c(5.5,2.6,7.3,54,2.1)
dummy <- c(0,0,0,1,1)
chr <- c("3","3","3,","2","1")
Y <- c("1","1","1","0","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
There are two separate issues here.
The first is the error message, which says it all: you have to use something other than "0" and "1" as values for your dependent factor variable Y.
You can do this in at least two ways after you have built your data frame df; the first one is hinted at in the error message, i.e. use make.names:
df$Y <- make.names(df$Y)
df$Y
# "X1" "X1" "X1" "X0" "X0"
The second way is to use the levels() function, which gives you explicit control over the names themselves; shown here again with the names X0 and X1:
levels(df$Y) <- c("X0", "X1")
df$Y
# [1] X1 X1 X1 X0 X0
# Levels: X0 X1
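One caveat with this second approach (assuming Y is indeed a factor): assigning with levels()<- renames the levels by position, so check the existing order first to avoid accidentally swapping the labels:
levels(df$Y)
# [1] "0" "1"   i.e. "0" becomes "X0" and "1" becomes "X1", in that order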
After adding either one of the above lines, the train() code shown will run smoothly (replacing training with df), but it will still not produce any ROC values, giving instead the warning:
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
which brings us to the second issue here: in order to use the ROC metric, you have to add summaryFunction = twoClassSummary to the trControl argument of train():
model_nn <- train(
  Y ~ ., df,
  method = "nnet",
  metric = "ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs = TRUE,
    summaryFunction = twoClassSummary # ADDED
  )
)
Running the above snippet with the toy data you have provided still gives an error (missing ROC values), but this is probably due to the very small dataset combined with the large number of CV folds, and it will not happen with your own, full dataset (it works OK if I reduce the CV folds to number=3)...
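Once training does succeed, the resampled ROC values are stored in the fitted object; a minimal sketch of how to inspect them (the ROC, Sens and Spec columns are produced by twoClassSummary):
# per-tuning-parameter resampling results
model_nn$results
# the tuning parameter combination that maximised ROC
model_nn$bestTune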
I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this using the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model, which uses discretization, throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in newdata, and mgcv then looking for random.x and finding it in the global environment. You should really gather those variables into a data frame and use the data argument when you fit your models, and try not to leave similarly named objects lying around in your global environment.
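A minimal sketch of that tidier workflow, reusing the objects generated in the question:
# keep everything in one data frame and fit via the data argument
dat <- data.frame(x = x, y = y, random.x = random.x)
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), data = dat,
            discrete = TRUE, nthreads = 2)
# supply some value for random.x, then exclude its smooth from the prediction
newdat <- data.frame(x = runif(200, 0, 1), random.x = dat$random.x[1])
test3 <- predict(fit3, newdata = newdat, exclude = c("s(random.x)"))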
I have a question regarding the feature importance function in the caret package.
I have a dataset which has both numeric and factor features.
I used the command below to get the feature importances of the model. It gives me the importance of each level (sub_feature) of the factor variables. However, I just want the importance of the feature itself, without going into detail for each level of the factor.
gbmImp <- caret::varImp(xgb1, scale = TRUE)
I will create some example data as we don't have any from your question:
library(caret)
# example data
df <- data.frame("x" = rnorm(100),
"fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
"y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)
This returns the per-level feature importance that you do not want:
# glm variable importance
#
# Overall
# facB 100.00
# facC 13.08
# x 0.00
You can convert the factor variables to numeric and do the same thing:
# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)
This returns the importance of each feature as a whole:
# glm variable importance
#
# Overall
# x 100
# fac 0
However, I do not know whether the as.numeric() operation on the factor variable changes the resulting feature importance when we run varImp(trans_model, scale = TRUE).
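A possible alternative that avoids the numeric conversion (a sketch only; whether summing per-level importances is meaningful depends on the model) is to aggregate the per-level rows back onto their parent feature:
# collapse the per-level rows (facB, facC, ...) onto the parent factor "fac"
imp <- varImp(model, scale = TRUE)$importance
imp$feature <- ifelse(grepl("^fac", rownames(imp)), "fac", rownames(imp))
aggregate(Overall ~ feature, data = imp, FUN = sum)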
Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
In fact, there is a similar question and answer, but it does not work for me; see below. The trick lies in rewriting the fit function of lmFuncs.
"Error in { : task 1 failed - "Results do not have equal lengths"", plus many warnings: "glm.fit: fitted probabilities numerically 0 or 1 occurred"
Where is the fault?
lmFuncs$fit <- function(x, y, first, last, ...)
{
  tmp <- as.data.frame(x)
  tmp$y <- y
  glm(y ~ ., data = tmp, family = binomial(link = 'logit'))
}
ctrl <- rfeControl(functions = lmFuncs, method = 'cv', number = 10)
fit.rfe <- rfe(df.preds, df.depend, rfeControl = ctrl)
Also, the rfeControl help says that the parameter 'functions' can be used with caret's train() function (caretFuncs). What does that really mean?
Any details and examples? Thanks
I was having a similar issue with customising lmFuncs.
For logistic regression, make sure you use lrFuncs and set sizes equal to the number of predictor variables. This leads to no issues.
Example (for functionality purposes only)
library(caret)
#Reproducible data
set.seed(1)
x <- data.frame(replicate(28, runif(10))) # 28 random predictor columns
x$dpen <- sample(c(0,1), replace=TRUE, size=10)
x$dpen <- factor(x$dpen)
#Splitting training set into two parts based on outcome: 80% and 20%
index <- createDataPartition(x$dpen, p=0.8, list=FALSE)
trainSet <- x[ index,]
testSet <- x[-index,]
control <- rfeControl(functions = lrFuncs,
                      method = "cv",   # cross-validation
                      verbose = FALSE) # prevents copious amounts of output from being produced
##RFE
rfe(trainSet[, 1:28],  # predictor variables
    trainSet$dpen,     # outcome variable
    sizes = c(1:28),   # subset sizes of predictors to evaluate
    rfeControl = control)
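As for what the rfeControl help means by caretFuncs: setting functions = caretFuncs makes rfe() fit a full caret::train() model at each subset size, with any extra arguments passed through to train(). A minimal sketch under that reading, reusing trainSet from above:
ctrl2 <- rfeControl(functions = caretFuncs, method = "cv", number = 5)
rfe(trainSet[, 1:28], trainSet$dpen,
    sizes = c(1, 5, 10, 20),
    rfeControl = ctrl2,
    method = "glm") # this (and any other ... argument) is handed on to train()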
How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit <- train(Purchase ~ ., method = 'lda', preProcess = c('scale', 'center'), data = train)
will fail, as the Purchase variable is missing. If I instead replace it with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0) beforehand, caret complains that this is no longer a classification but a regression problem.
At least the example code seems to have a few issues, indicated in the comments below. To answer your questions:
The result of ifelse is an integer vector, not a factor, so the train function defaults to regression
Passing the dummyVars directly to the function is done by using the train(x = , y =, ...) instead of a formula
To avoid these problems, check the class of your objects carefully.
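For instance, a quick check along those lines, using the OJ data from the question:
y_num <- ifelse(ISLR::OJ$Purchase == "CH", 1, 0)
class(y_num) # "numeric" -- train() would default to regression
y_fac <- factor(y_num, levels = c(0, 1), labels = c("MM", "CH"))
class(y_fac) # "factor" -- train() would treat this as classification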
Be aware that the option preProcess in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the proprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
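To round this off (a small addition beyond the original question), the held-out set can then be scored the same way after reattaching its target variable:
test$Purchase <- data$Purchase[-train_idx]
preds <- predict(modelFit, newdata = test)
confusionMatrix(preds, test$Purchase)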
I have trained a random forest using the caret package for a binary classification task.
library(caret)
set.seed(78)
inTrain <- createDataPartition(disambdata$Response, p=3/4, list = FALSE)
trainSet <- disambdata[inTrain,]
testSet <- disambdata[-inTrain,]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
grid_rf <- expand.grid(.mtry = c(3,5,7,9))
set.seed(78)
m_rf <- train(Response ~ ., data=trainSet,
method= "rf", metric = "Kappa", trcontrol=ctrl, tuneGrid = grid_rf)
The Response variable contains values {Valid, Invalid}.
Using the following I get the class probabilities for the testing data:
pred <- predict.train(m_rf, newdata = testSet,
type="prob", models=m_rf$finalModel)
However I am interested in obtaining the predicted class i.e. Valid or Invalid instead of class probabilities to generate a confusion matrix.
I have already tried the argument type="raw" in the predict.train function but it returns a list of NAs.
By assigning type = "prob" in the predict() function, you are specifically asking for class probabilities. Just remove it and it will return the class labels:
pred <- predict.train(m_rf, newdata = testSet,models=m_rf$finalModel)
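With the class labels in hand, the confusion matrix you are after follows directly (assuming testSet$Response is a factor with the same levels):
confusionMatrix(pred, testSet$Response)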
It seems that the caret package (caret_6.0-70) still has an issue with the formula interface. Expanding the formula from Response ~ . to one that explicitly mentions all predictors, like Response ~ MaxLikelihood + n1 + n2 + count, resolves the problem, and predict.train(m_rf, newdata = testSet) returns the predicted class.
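One more thing worth checking (an aside, not part of the original answer): train()'s control argument is spelled trControl, whereas the question's code passes trcontrol=ctrl, so the repeated-CV setup may not be applied at all. A sketch using the non-formula interface, which also sidesteps the formula issue:
m_rf2 <- train(x = trainSet[, setdiff(names(trainSet), "Response")],
               y = trainSet$Response,
               method = "rf", metric = "Kappa",
               trControl = ctrl, tuneGrid = grid_rf)
pred_class <- predict(m_rf2, newdata = testSet)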