Caret nnet: logLoss not working with twoClassSummary

I have a training dataset:
   Out  Revolver     Ratio  Num ...
0    1  0.766127  0.802982    0 ...
1    0  0.957151  0.121876    1
2    0  0.658180  0.085113    0
3    0  0.233810  0.036050    3
4    1  0.907239  0.024926    5
The outcome variable Out is binary and only takes on the values 0 or 1. Num is not a factor.
I then attempted to run nnet using caret. I want to eventually try nnGrid, but first I just want to make sure this works:
nnTrControl <- trainControl(method = "cv", number = 2, classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            verboseIter = TRUE, returnData = FALSE,
                            returnResamp = "all")
#nnGrid <- expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
Outf <- factor(training$Out)
model <- train(Outf ~ Revolver + Ratio + Num, data = training, method = 'nnet',
               trControl = nnTrControl, metric = "logLoss") #, tuneGrid = nnGrid
I get the error:
Error in train.default(x, y, weights = w, ...) :
At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
However, I've used caret and gotten this error before, which I resolved by using make.names. So when I try the below instead:
yCat <- make.names(training$Out, unique = FALSE, allow_ = TRUE)
mnn <- model.matrix(~ Revolver + Ratio + Num, data = training)
model <- train(y = yCat, x = mnn, method = 'nnet',
               trControl = nnTrControl, metric = "logLoss") #, tuneGrid = nnGrid
I then get the message:
The metric "logLoss" was not in the result set. ROC will be used instead.
But I don't understand why it's not evaluating according to logLoss.
If I then use this to predict on a test set:
probs <- predict(model, newdata = testSet, type = "prob")
I get:
Error in eval(expr, envir, enclos) : object '(Intercept)' not found
How do I fix this?
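
For what it's worth, the "not in the result set" message is expected here: twoClassSummary computes only ROC, Sens, and Spec, so metric = "logLoss" can never match anything it returns; caret ships a separate summary function, mnLogLoss, that does return a logLoss column. Below is a minimal sketch combining the two fixes; the training/testSet objects and column names follow the question, while dropping the first model.matrix column is an assumption aimed at the '(Intercept)' error, since model.matrix prepends an intercept column whose name predict() later cannot find in the test data.

library(caret)

# mnLogLoss returns a "logLoss" column, so metric = "logLoss" becomes valid
nnTrControl <- trainControl(method = "cv", number = 2, classProbs = TRUE,
                            summaryFunction = mnLogLoss, verboseIter = TRUE)

yCat <- factor(make.names(training$Out))  # "X0"/"X1" are valid R names
mnn <- model.matrix(~ Revolver + Ratio + Num, data = training)[, -1]  # drop "(Intercept)"

model <- train(y = yCat, x = mnn, method = "nnet",
               trControl = nnTrControl, metric = "logLoss")

# build the test matrix the same way so column names match the fit
testMat <- model.matrix(~ Revolver + Ratio + Num, data = testSet)[, -1]
probs <- predict(model, newdata = testMat, type = "prob")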

Related

Error in glmnet if I specify a variable to be a factor

I have a dataset in R on which I would like to run glmnet. The y variable is an originally numeric variable which takes on only the values 0 and 1. If I specify it to be a factor as follows:
df_ML_1976[, names] <- lapply(df_ML_1976[, names], factor)
and then apply glmnet after dividing into training and test sets:
library("dplyr")
df_ML_1976 %>%
select(where(~ any(. != 0)))
#df_ML_1976 <- subset(df_ML_1976, select = -c(X))
library("caret")
default_idx = createDataPartition(df_ML_1976$y_tr4, p = 0.75, list = FALSE)
default_trn = df_ML_1976[default_idx, ]
default_tst = df_ML_1976[-default_idx, ]
## Fitting elasticnet:
cv_5 = trainControl(method = "cv", number = 5)
def_elnet = train(
y_tr4 ~ ., data = default_trn,
method = "glmnet",
trControl = cv_5
)
def_elnet
an error occurs:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'drop': non-conformable arguments
which does not appear if I do not run
df_ML_1976[, names] <- lapply(df_ML_1976[, names], factor)
Why is that?
Thank you
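
One hedged guess at the cause: the lapply call converts not only y_tr4 but every column listed in names to a factor, so the formula interface dummy-codes all of those predictors, and a factor that ends up with a single level inside a resampling fold can leave glmnet with a degenerate model matrix and a non-conformable arguments error. A sketch that converts only the response and leaves the 0/1 predictors numeric (the column name y_tr4 follows the question; the level labels are assumptions):

library(caret)

# convert only the response, with levels that are valid R variable names
df_ML_1976$y_tr4 <- factor(df_ML_1976$y_tr4, levels = c(0, 1),
                           labels = c("no", "yes"))

default_idx <- createDataPartition(df_ML_1976$y_tr4, p = 0.75, list = FALSE)
default_trn <- df_ML_1976[default_idx, ]

cv_5 <- trainControl(method = "cv", number = 5)
def_elnet <- train(y_tr4 ~ ., data = default_trn,
                   method = "glmnet", trControl = cv_5)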

if (any(const_vars)) missing value where TRUE/FALSE needed error while running Lasso in R

When I tried to run a lasso regression with my 51st variable as the response and the other 50 variables as predictors, I got the following error message:
lasso_now <- cv.glmnet(x = as.matrix(scaledData[, -51]), y = as.matrix(scaledData[, 51]),
                       alpha = 1, nfolds = 5, type.measure = "mse",
                       family = binomial(link = "logit"))
Error in if (any(const_vars)) { : missing value where TRUE/FALSE needed
My response variable is either 0 or 1, so I used logistic regression. My x contains both categorical and numerical variables.
Does anyone know why this happened, or is there a way to validate the data for this issue? Thanks in advance!
Check whether you have NA values; you get this error because glmnet checks whether any of your columns have a standard deviation of zero, and an NA makes that check return NA. For example, we set the first entry of the fourth column to NA in the following dataset:
library(glmnet)
scaledData <- data.frame(v1 = rnorm(100), v2 = rnorm(100),
                         v3 = rbinom(100, 1, 0.5), v4 = rbinom(100, 1, 0.7))
scaledData[1, 4] <- NA
You can check:
glmnet:::weighted_mean_sd(as.matrix(scaledData[, -3]))
$mean
        v1         v2         v4
0.03979154 0.14547529         NA

$sd
       v1        v2        v4
0.8544635 1.0815797        NA
This runs into the same error:
lasso_now <- cv.glmnet(x = as.matrix(scaledData[, -3]),
                       y = as.matrix(scaledData[, 3]),
                       alpha = 1, nfolds = 5, type.measure = "mse",
                       family = binomial(link = "logit"))
Error in if (any(const_vars)) { : missing value where TRUE/FALSE needed
One way to remove the incomplete rows is:
scaledData <- scaledData[complete.cases(scaledData), ]
Then run it again. Note that for the binomial family you should not use "mse"; you can use "deviance", "class", or "auc" instead.
lasso_now <- cv.glmnet(x = as.matrix(scaledData[, -3]),
                       y = as.matrix(scaledData[, 3]),
                       alpha = 1, nfolds = 5, type.measure = "deviance",
                       family = binomial(link = "logit"))
lasso_now

Call: cv.glmnet(x = as.matrix(scaledData[, -3]), y = as.matrix(scaledData[, 3]),
    type.measure = "deviance", nfolds = 5, alpha = 1,
    family = binomial(link = "logit"))

Measure: GLM Deviance

     Lambda Index Measure      SE Nonzero
min 0.07643     1   1.427 0.01681       0
1se 0.07643     1   1.427 0.01681       0
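
As a quick pre-check before calling cv.glmnet, counting the missing values per column points directly at the offending variables (a small sketch on the same scaledData):

# any nonzero count here propagates NA into glmnet's internal
# mean/sd check and triggers the const_vars error above
colSums(is.na(scaledData))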

The length of trainPred is not correct in prediction function with R

This is part of my code for a Naive Bayes classifier:
trainPred <- predict(NBclassfier, newdata = train, type = "raw")
but I am getting the wrong length for trainPred, which is two times bigger than the actual size of the training set.
Even when I use
trainPred <- predict(NBclassfier, newdata = train, type = "class")
I get 0 for the length of trainPred.
So when I run the code below I get an error:
trainTable <- table(train$prog, trainPred)
The code for NBclassfier is NBclassfier <- naiveBayes(prog ~ ., data = train).
The whole code and the error:
library(caret)
library(e1071)
set.seed(25)
trainIndex <- createDataPartition(NaiveData$prog, p = 0.8)$Resample1
train <- NaiveData[trainIndex, ]
test <- NaiveData[-trainIndex, ]

# check the balance
print(table(NaiveData$prog))
  0   1
496 261

# check the train table
print(table(train$prog))
  0   1
388 218

NBclassfier <- naiveBayes(prog ~ ., data = train)
trainPred <- predict(NBclassfier, newdata = train, type = "raw")
trainTable <- table(train$prog, trainPred)
Error in table(train$prog, trainPred) : all arguments must have the same length
I just solved the problem and wanted to share the answer as well:
NBclassfier <- naiveBayes(as.factor(prog) ~ ., data = train)
confusionMatrix(as.factor(trainPred), as.factor(train$prog), mode = "prec_recall")
Just make them factors.
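
For context on why the lengths differ: predict.naiveBayes with type = "raw" returns an n x 2 matrix of class probabilities, so length() reports 2n, while table() expects a vector of class labels of length n. A short sketch of the two return types, using the same objects as the question:

NBclassfier <- naiveBayes(as.factor(prog) ~ ., data = train)

# type = "class": a factor of n predicted labels, suitable for table()
trainPred <- predict(NBclassfier, newdata = train, type = "class")
trainTable <- table(train$prog, trainPred)

# type = "raw": an n x 2 matrix of class probabilities, so length() is 2n
trainProbs <- predict(NBclassfier, newdata = train, type = "raw")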

I want to use AUPRC as the performance measure in a GBM run using the caret package. How can I use a customized metric such as AUPRC?

I am trying to use AUPRC as my custom metric for a gbm model fit because I have an imbalanced classification problem. However, when I try to incorporate the custom metric I get the error shown below, and I am not sure what I am doing wrong.
Also, auprcSummary() works on its own when I run it inline; it only gives an error when I try to incorporate it into train().
library(dplyr) # for data manipulation
library(caret) # for model-building
library(pROC)  # for AUC calculations
library(PRROC) # for Precision-Recall curve calculations

auprcSummary <- function(data, lev = NULL, model = NULL){
  index_class2 <- data$Class == "Class2"
  index_class1 <- data$Class == "Class1"
  the_curve <- pr.curve(data$Class[index_class2],
                        data$Class[index_class1],
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = auprcSummary,
                     classProbs = TRUE)

set.seed(5627)
orig_fit <- train(Class ~ .,
                  data = toanalyze.train,
                  method = "gbm",
                  verbose = FALSE,
                  metric = "AUPRC",
                  trControl = ctrl)
This is the error I am getting:
Error in order(scores.class0) : argument 1 is not a vector
Is it because pr.curve() takes only numeric vectors (scores/probabilities) as inputs?
caret has a built-in function called prSummary that computes that for you. You don't have to write your own.
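
For reference, a minimal sketch of that built-in route (prSummary requires the MLmetrics package and names its precision-recall area column "AUC", so that is the metric to request; toanalyze.train follows the question):

library(caret)
library(MLmetrics) # prSummary computes the PR area via MLmetrics

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     summaryFunction = prSummary, classProbs = TRUE)

prs_fit <- train(Class ~ ., data = toanalyze.train,
                 method = "gbm", verbose = FALSE,
                 metric = "AUC", # prSummary's name for the area under the PR curve
                 trControl = ctrl)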
I think this approach yields an appropriate custom summary function:
library(caret)
library(pROC)
library(PRROC)
library(mlbench) # for the data set
data(Ionosphere)
In the pr.curve function, the classification scores may be provided in one of two ways: either separately for the data points of each class, as scores.class0 for the positive/foreground class and scores.class1 for the negative/background class; or as the scores of all data points in scores.class0, with the labels passed as numerical values (1 for the positive class, 0 for the negative class) in weights.class0. (I copied this from the function's help page; I apologize if it is unclear.)
I opted to provide the latter: the probabilities of all data points in scores.class0 and the class assignments in weights.class0.
caret states that if the classProbs argument of the trainControl object is set to TRUE, additional columns containing the class probabilities will be present in data. So for the Ionosphere data, the columns good and bad should be present:
levels(Ionosphere$Class)
#output
[1] "bad"  "good"
To convert to 0/1 labeling one can just do:
as.numeric(Ionosphere$Class) - 1
good will become 1 and bad will become 0.
Now we have all the data needed for the custom function:
auprcSummary <- function(data, lev = NULL, model = NULL){
  prob_good <- data$good # take the probability of the good class
  the_curve <- pr.curve(scores.class0 = prob_good,
                        weights.class0 = as.numeric(data$obs) - 1, # class labels as 0/1
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}
Instead of using data$good, which works on this data set alone, one can extract the class names and use those to get the desired column:
lvls <- levels(data$obs)
prob_good <- data[, lvls[2]]
It is important to note that each time you update the summary function, you need to rebuild the trainControl object:
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = auprcSummary,
                     classProbs = TRUE)

orig_fit <- train(y = Ionosphere$Class, x = Ionosphere[, c(1, 3:34)], # omit column 2 to avoid a bunch of warnings related to the data set
                  method = "gbm",
                  verbose = FALSE,
                  metric = "AUPRC",
                  trControl = ctrl)
orig_fit$results
#output
  shrinkage interaction.depth n.minobsinnode n.trees     AUPRC    AUPRCSD
1       0.1                 1             10      50 0.9722775 0.03524882
4       0.1                 2             10      50 0.9758017 0.03143379
7       0.1                 3             10      50 0.9739880 0.03316923
2       0.1                 1             10     100 0.9786706 0.02502183
5       0.1                 2             10     100 0.9817447 0.02276883
8       0.1                 3             10     100 0.9772322 0.03301064
3       0.1                 1             10     150 0.9809693 0.02078601
6       0.1                 2             10     150 0.9824430 0.02284361
9       0.1                 3             10     150 0.9818318 0.02287886
Seems reasonable.

Perform cross-validation on randomForest with R

I am using the randomForest package for R to train a model for classification.
To compare it to other classifiers, I need a way to display all the information given by the rather verbose cross-validation method in Weka. Therefore, the R script should output something like [a] from Weka.
Is there a way to validate an R model via RWeka to produce those measures?
If not, how is a cross-validation on a random forest done purely in R?
Is it possible to use rfcv from the randomForest package here? I could not get it to work.
I do know that the out-of-bag error (OOB) used in randomForest is some kind of a cross-validation. But I need the full information for a suited comparison.
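As a side note, the OOB statistics that randomForest records can be read off the fitted object directly (a sketch; myres.rf is the fit from the code in [b] below):

myres.rf$confusion  # OOB confusion matrix with per-class error
myres.rf$err.rate   # OOB error rates, overall and per class, by number of trees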
What I tried so far using R is [b]. However, the code also produces an error on my setup [c] due to missing values.
So, can you help me with the cross-validation?
Appendix
[a] Weka:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      3059    96.712 %
Incorrectly Classified Instances     104     3.288 %
Kappa statistic                        0.8199
Mean absolute error                    0.1017
Root mean squared error                0.1771
Relative absolute error               60.4205 %
Root relative squared error           61.103  %
Coverage of cases (0.95 level)        99.6206 %
Mean rel. region size (0.95 level)    78.043  %
Total Number of Instances           3163

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0,918    0,028    0,771      0,918   0,838      0,824  0,985     0,901     sick-euthyroid
               0,972    0,082    0,991      0,972   0,982      0,824  0,985     0,998     negative
Weighted Avg.  0,967    0,077    0,971      0,967   0,968      0,824  0,985     0,989

=== Confusion Matrix ===
    a    b   <-- classified as
  269   24 |  a = sick-euthyroid
   80 2790 |  b = negative
[b] Code so far:
library(randomForest) # randomForest() and rfImpute()
library(foreign)      # read.arff()
library(caret)        # train() and trainControl()

nTrees <- 2 # 200
myDataset <- 'D:\\your\\directory\\SE.arff' # http://hakank.org/weka/SE.arff
mydb <- read.arff(myDataset)
mydb.imputed <- rfImpute(class ~ ., data = mydb, ntree = nTrees, importance = TRUE)
myres.rf <- randomForest(class ~ ., data = mydb.imputed, ntree = nTrees, importance = TRUE)
summary(myres.rf)

# specify type of resampling to 10-fold CV
fitControl <- trainControl(method = "rf", number = 10, repeats = 10)
set.seed(825)
# deal with NA | NULL values in categorical variables
#mydb.imputed[is.na(mydb.imputed)] <- 1
#mydb.imputed[is.null(mydb.imputed)] <- 1
rfFit <- train(class ~ ., data = mydb.imputed,
               method = "rf",
               trControl = fitControl,
               ## This last option is actually one
               ## for rf() that passes through
               ntree = nTrees, importance = TRUE, na.action = na.omit)
rfFit
The error is:
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) :
  attempt to set an attribute on NULL
Using traceback():
5: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
       method = models, ppOpts = preProcess, ctrl = trControl, lev = classLevels,
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(class ~ ., data = mydb.imputed, method = "rf",
       trControl = fitControl, ntree = nTrees, importance = TRUE,
       sampsize = rep(minorityClassNum, 2), na.action = na.omit)
1: train(class ~ ., data = mydb.imputed, method = "rf", trControl = fitControl,
       ntree = nTrees, importance = TRUE, sampsize = rep(minorityClassNum,
       2), na.action = na.omit) at #39
[c] R version information via sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)
[...]
other attached packages:
[1] e1071_1.6-3 caret_6.0-30 ggplot2_1.0.0 foreign_0.8-61 randomForest_4.6-7 DMwR_0.4.1
[7] lattice_0.20-29 JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 rJava_0.9-6
I don't know about Weka, but I have done randomForest modelling in R, and I have always used the predict function to do this.
Try using this function:
predict(Model, data)
Bind the output with the original values and use the table command to get the confusion matrix.
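
A side note on the error in [b]: the method argument of trainControl names the resampling scheme ("cv", "repeatedcv", "boot", ...), not the model, so trainControl(method = "rf", ...) is the likely cause of the attempt to set an attribute on NULL failure. A minimal sketch of repeated 10-fold CV with the model specified in train() instead (data objects follow the question's code in [b]):

library(caret)
library(randomForest)

# "repeatedcv" is a resampling method; the model "rf" belongs in train()
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

set.seed(825)
rfFit <- train(class ~ ., data = mydb.imputed,
               method = "rf", trControl = fitControl,
               ntree = nTrees, importance = TRUE)

# averaged cross-validated confusion matrix, comparable to Weka's summary
confusionMatrix(rfFit)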
