I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## Transform numeric data toward Gaussian
dataset_selected = dataset[,c(-13,-14)]
# shift by 0.1 to avoid zeros, then apply bestNormalize's Box-Cox column by column
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] + 0.1)) }
# View transformed dataset
str(dataset_selected)
# 'data.frame': 13905 obs. of 12 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
# $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
# $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
# $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
# $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats = 3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ .,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE)
Thanks for your help!
@phiver's comment is spot on; however, I would still like to provide a more verbose answer on this concrete example.
In order to investigate what is going on in more detail, add savePredictions = "all" and returnResamp = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ .,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE,
            form = "traditional")
Now, when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
  filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions when decay == 0 are the same (they have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
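The same NA can be reproduced with any constant prediction vector, independent of this dataset (a minimal sketch):
# zero-variance predictions make cor(), and hence R squared, undefined
obs  <- rnorm(10)
pred <- rep(0, 10)
caret::R2(pred, obs)  # NA, with the 'standard deviation is zero' warning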
Answer by @topepo (caret package main developer). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no regularization. What is happening is that the model is predicting a value very close to a constant (so that the RMSE is a little worse than the basic standard deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
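If you nevertheless want to silence the warning in this example, one option (a sketch, not from the original answers, reusing dataset_selected, metric, and control from above) is to give train a tuning grid that skips the problematic decay = 0 candidates:
# sketch: tune nnet only over decay values that regularize at least a little
grid <- expand.grid(size = c(1, 3, 5), decay = c(1e-4, 1e-2, 1e-1))
set.seed(3)
fit <- train(casual ~ .,
             data = dataset_selected,
             method = 'nnet',
             metric = metric,
             trControl = control,
             tuneGrid = grid,
             trace = FALSE)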
The answer by @missuse is already very insightful for understanding why this error happens, so I just want to add some straightforward ways to get rid of it.
If the predictions have zero variance in some cross-validation folds, the model did not converge there. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
threshold: default value 0.01. Set it to 0.3, then try lower values (0.2, 0.1, 0.05).
stepmax: default value 1e+05. Set it to 1e+08, then try lower values (1e+07, 1e+06).
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set,
                         # apply preProcess within cross-validation folds
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.3)
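If raising threshold alone does not help, stepmax can be passed through caret's ... in the same way (a sketch reusing the formula1 and training.set placeholders above; the exact values are assumptions to tune as described in the list):
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set,
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.2,  # assumed value; tune as above
                         stepmax = 1e+07)  # assumed value; tune as above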
Related
I'm trying to train a model with the caret package (Random Forest). After running the train code, I get:
Error: Can't find column RH_train in .data
Then I tried converting the dependent variable (Rendimiento) to a factor, but I get:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X0.5, X0.6, X0.65, X0.7, X0.75, X0.79, X0.8, X0.81, X0.82, X0.83, X0.85, X0.86, X0.87, X0.88, X0.9, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
library(rpart)
library(rpart.plot)
library(RWekajars)
library(randomForest)
library(party)
library(caret)
library(e1071)
library(dplyr)
library(readxl)  # for read_excel()
#### Load the dataset ####
setwd("C:/Users/Frankenstein/Downloads")
RH <- read_excel("RH.xlsx")
RH$`Año Ingreso`=NULL
RH$`Mes ingreso`=NULL
RH$`Status empleado para Gestión t`=NULL
RH$`Horario trabajo`=NULL
RH$Nacional=NULL
RH$Jefe=NULL
RH$`N Personal`=NULL
colnames(RH)
names(RH)[names(RH) == "Grado de distancia"] <- "Distancia"
names(RH)[names(RH) == "Clave para el estado civil"] <- "EstadoCivil"
names(RH)[names(RH) == "Clave de sexo"] <- "Sexo"
#### Examine the structure of the data ####
str(RH)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 325 obs. of 10 variables:
$ Rendimiento: num 0.6 0.8 0.85 0.86 0.85 0.8 1 0.86 1 0.9 ...
$ Edad : num 36 37 21 26 25 28 32 32 29 36 ...
$ Posición : chr "ANA" "ANA" "AUX" "AUX" ...
$ Sexo : num 1 1 0 1 1 1 1 1 1 1 ...
$ Distancia : num 5 3 1 1 3 2 2 4 5 4 ...
$ Estrato : num 2 5 3 3 3 3 5 2 5 6 ...
$ EstadoCivil: num 1 2 1 1 1 1 2 1 1 2 ...
$ Hijos : num 1 1 0 0 0 0 0 1 0 0 ...
$ Formación : chr "PREGRADO" "PREGRADO" "PREGRADO" "PREGRADO" ...
$ Educación : num 3 3 3 3 3 3 4 3 3 3 ...
hist(RH$Rendimiento)
summary(RH)
#### Split data into training and test sets ####
RH_train <- RH[1:243, ]
RH_test <- RH[244:325, ]
glimpse(RH_train)
glimpse(RH_test)
# Define the control
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid",
                          classProbs = TRUE)
set.seed(1234)
RH_train$Rendimiento=factor(RH_train$Rendimiento)
RendimientoFactor=factor(RH_train$Rendimiento)
# Run the model
rf_default <- train(RH_train$Rendimiento ~ RH_train$Edad + RH_train$Sexo + RH_train$Distancia + RH_train$Estrato + RH_train$EstadoCivil + RH_train$Hijos + RH_train$Educación,
data=RH_train,
method = "rf",
metric = "Accuracy",
trControl = trControl)
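A likely fix, as a hedged sketch (not part of the original post): the first error likely comes from prefixing every variable with RH_train$ inside the formula while also passing data = RH_train, and the second comes from factor levels such as 0.5 that are not valid R variable names, exactly as the error message says. Using bare column names and sanitizing the levels with make.names() addresses both:
# sketch: bare column names in the formula, valid R names as factor levels
RH_train$Rendimiento <- factor(RH_train$Rendimiento)
levels(RH_train$Rendimiento) <- make.names(levels(RH_train$Rendimiento))  # e.g. "0.5" -> "X0.5"
set.seed(1234)
rf_default <- train(Rendimiento ~ Edad + Sexo + Distancia + Estrato + EstadoCivil + Hijos + Educación,
                    data = RH_train,
                    method = "rf",
                    metric = "Accuracy",
                    trControl = trControl)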
This question has been asked before; however, it was not answered in a way that solved my problem, and the question was slightly different.
I am trying to build a decision tree model using the C50 package. I am trying to predict whether MMA fighters have championship potential (a factor with 2 levels, yes/no).
Originally this column was logical, but I converted it to a factor using:
fighters_clean$championship_potential <- as.factor(fighters_clean$championship_potential)
table(fighters_clean$championship_potential)
#Rename binary outcome
fighters_clean$championship_potential <- factor(fighters_clean$championship_potential,
levels = c("TRUE", "FALSE"), labels = c("YES", "NO"))
In my data frame it shows "Factor w/ 2 levels", which should work as the class variable for a C5.0 decision tree; however, I keep getting this error message.
Error in UseMethod("QuinlanAttributes") :
no applicable method for 'QuinlanAttributes' applied to an object of class "logical"
The code for my model is below.
#Lets use a decision tree to see what fighters have that championship potential
table(fighters_clean$championship_potential)
#FALSE TRUE
#2578 602
#create test and training data
#set.seed() makes the random number generator repeatable; the number is arbitrary.
set.seed(123)
Tree_training <- sample(3187, 2868)
str(Tree_training)
#This creates a vector of 2868 random integers.
#We use this vector to split our data into training and test sets;
#it should be a representative 90/10 split.
Tree_Train <- fighters_clean[Tree_training, ]
Tree_Test <- fighters_clean[-Tree_training, ]
#That worked, sweet.
#Now lets see if they are representative.
#Should be even number of champ potential in both data sets,
prop.table(table(Tree_Train$championship_potential))
prop.table(table(Tree_Test$championship_potential))
#Awesome, so that's a near-perfect split, with each data set having roughly 18% potential champions.
#C5.0 is Quinlan's (originally commercial) decision tree algorithm, available in R via the C50 package.
#We will use it to build a decision tree.
str(Tree_Train)
'data.frame': 2868 obs. of 12 variables:
$ name : chr "Jesse Juarez" "Milton Vieira" "Joey Gomez" "Gilbert Smith" ...
$ SLpM : num 1.71 1.13 2.93 1.09 5.92 0 0 1.2 0 2.11 ...
$ Str_Acc : num 48 35 35 41 51 0 0 33 0 50 ...
$ SApM : num 2.87 2.36 4.03 2.73 3.6 0 0 1.73 0 1.89 ...
$ Str_Def : num 52 48 53 35 55 0 0 73 0 63 ...
$ TD_Avg : num 2.69 2.67 1.15 3.51 0.44 0 0 0 0 0.19 ...
$ TD_Acc : num 33 53 37 60 33 0 0 0 0 40 ...
$ TD_Def : num 50 12 50 0 70 0 0 50 0 78 ...
$ Sub_Avg : num 0 0.7 0 1.2 0.4 0 0 0 0 0.3 ...
$ Win_percentage : num 0.667 0.565 0.875 0.714 0.8 ...
$ championship_potential: Factor w/ 2 levels "YES","NO": 2 2 1 2 2 2 1 2 2 2 ...
$ contender : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
library(C50)
DTModel <- C5.0(Tree_Train [-11], Tree_Train$championship_potential, trials = 1, costs = NULL)
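A hedged guess at the cause (not part of the original post): the error message points at a logical column, and str(Tree_Train) shows that contender is still logi. Note also that Tree_Train[-11] drops Win_percentage (column 11), so both the outcome and the logical flag remain among the predictors. Excluding them explicitly should avoid the QuinlanAttributes error:
# sketch: keep only true predictors; drop the id column, the outcome, and the logical flag
predictors <- Tree_Train[, !(names(Tree_Train) %in%
                               c("name", "championship_potential", "contender"))]
DTModel <- C5.0(predictors, Tree_Train$championship_potential, trials = 1, costs = NULL)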
Related question - 1
I have a dataset like so:
> head(training_data)
year month channelGrouping visitStartTime visitNumber timeSinceLastVisit browser
1 2016 October Social 1477775021 1 0 Chrome
2 2016 September Social 1473037945 1 0 Safari
3 2017 July Organic Search 1500305542 1 0 Chrome
4 2017 July Organic Search 1500322111 2 16569 Chrome
5 2016 August Social 1471890172 1 0 Safari
6 2017 May Direct 1495146428 1 0 Chrome
operatingSystem isMobile continent subContinent country source medium
1 Windows 0 Americas South America Brazil youtube.com referral
2 Macintosh 0 Americas Northern America United States youtube.com referral
3 Windows 0 Americas Northern America Canada google organic
4 Windows 0 Americas Northern America Canada google organic
5 Macintosh 0 Africa Eastern Africa Zambia youtube.com referral
6 Android 1 Americas Northern America United States (direct)
isTrueDirect hits pageviews positiveTransaction
1 0 1 1 No
2 0 1 1 No
3 0 5 5 No
4 1 3 3 No
5 0 1 1 No
6 1 6 6 No
> str(training_data)
'data.frame': 1000 obs. of 18 variables:
$ year : int 2016 2016 2017 2017 2016 2017 2016 2017 2017 2016 ...
$ month : Factor w/ 12 levels "January","February",..: 10 9 7 7 8 5 10 3 3 12 ...
$ channelGrouping : chr "Social" "Social" "Organic Search" "Organic Search" ...
$ visitStartTime : int 1477775021 1473037945 1500305542 1500322111 1471890172 1495146428 1476003570 1488556031 1490323225 1480696262 ...
$ visitNumber : int 1 1 1 2 1 1 1 1 1 1 ...
$ timeSinceLastVisit : int 0 0 0 16569 0 0 0 0 0 0 ...
$ browser : chr "Chrome" "Safari" "Chrome" "Chrome" ...
$ operatingSystem : chr "Windows" "Macintosh" "Windows" "Windows" ...
$ isMobile : int 0 0 0 0 0 1 0 1 0 0 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 1 2 3 3 2 4 ...
$ subContinent : chr "South America" "Northern America" "Northern America" "Northern America" ...
$ country : chr "Brazil" "United States" "Canada" "Canada" ...
$ source : chr "youtube.com" "youtube.com" "google" "google" ...
$ medium : chr "referral" "referral" "organic" "organic" ...
$ isTrueDirect : int 0 0 0 1 0 1 0 0 0 0 ...
$ hits : int 1 1 5 3 1 6 1 1 2 1 ...
$ pageviews : int 1 1 5 3 1 6 1 1 2 1 ...
$ positiveTransaction: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 …
I then define my custom RMSLE function using the Metrics package:
rmsleMetric <- function(data, lev = NULL, model = NULL){
  out <- Metrics::rmsle(data$obs, data$pred)
  names(out) <- c("rmsle")
  return(out)
}
Then, I define the trainControl:
tc <- trainControl(method = "repeatedcv",
                   number = 5,
                   repeats = 5,
                   summaryFunction = rmsleMetric,
                   classProbs = TRUE)
My grid search:
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
Finally, my model:
penalizedLogit_ridge <- train(positiveTransaction ~ .,
                              data = training_data,
                              metric = "rmsle",
                              method = "glmnet",
                              family = "binomial",
                              trControl = tc,
                              tuneGrid = tg)
When I try to run the command above, I get an error:
Something is wrong; all the rmsle metric values are missing:
rmsle
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :11
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Looking at warnings, I find:
1: In Ops.factor(1, actual) : ‘+’ not meaningful for factors
2: In Ops.factor(1, predicted) : ‘+’ not meaningful for factors
repeated 25 times
Since the same thing works fine if I change the metric to AUC (using prSummary as my summary function), I don't believe there are any issues with my data.
So I believe my function is wrong, but I don't know how to figure out why.
Any help is highly appreciated.
Your custom metric is not defined properly. If you use classProbs = TRUE and savePredictions = "final" with trainControl, you will see that data contains two columns named after your target classes, which hold the predicted class probabilities, while the data$pred column holds the predicted class. A factor cannot be passed to Metrics::rmsle, which computes log(1 + x) internally; hence the '+' not meaningful for factors warnings.
A proper way to define the function is to get the class levels and use them to extract the predicted probabilities for one of the classes:
rmsleMetric <- function(data, lev = NULL, model = NULL){
  # lev holds the outcome factor levels supplied by caret;
  # data[, lev[1]] is the predicted probability of the first class
  out <- Metrics::rmsle(ifelse(data$obs == lev[2], 0, 1),
                        data[, lev[1]])
  names(out) <- c("rmsle")
  return(out)
}
Does it work? Let's check on the Sonar data:
library(caret)
library(mlbench)
data(Sonar)
tc <- trainControl(method = "repeatedcv",
                   number = 2,
                   repeats = 2,
                   summaryFunction = rmsleMetric,
                   classProbs = TRUE,
                   savePredictions = "final")
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
penalizedLogit_ridge <- train(Class ~ .,
                              data = Sonar,
                              metric = "rmsle",
                              method = "glmnet",
                              family = "binomial",
                              trControl = tc,
                              tuneGrid = tg)
#output
glmnet
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (2 fold, repeated 2 times)
Summary of sample sizes: 105, 103, 104, 104
Resampling results across tuning parameters:
lambda rmsle
0.0 0.2835407
0.1 0.2753197
0.2 0.2768288
0.3 0.2797847
0.4 0.2827953
0.5 0.2856088
0.6 0.2881894
0.7 0.2905501
0.8 0.2927171
0.9 0.2947169
1.0 0.2965505
Tuning parameter 'alpha' was held constant at a value of 0
rmsle was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 1.
You can inspect caret::twoClassSummary - it is defined quite similarly.
I am trying to fit a logistic ridge regression model, which I developed as follows. I need help with the code for testing its accuracy and for the ROC/AUC curve with a threshold value.
My code is as follows:
Fitting the model
library(glmnet)
library(caret)
data1<-read.csv("D:\\Research\\Final2.csv",header=T,sep=",")
str(data1)
'data.frame': 154 obs. of 12 variables:
$ Earningspershare : num 12 2.69 8.18 -0.91 3.04 ...
$ NetAssetsPerShare: num 167.1 17.2 41.1 14.2 33 ...
$ Dividendpershare : num 3 1.5 1.5 0 1.25 0 0 0 0 0.5 ...
$ PE : num 7.35 8.85 6.66 -5.27 18.49 ...
$ PB : num 0.53 1.38 1.33 0.34 1.7 0.23 0.5 3.1 0.5 0.3 ...
$ ROE : num 0.08 0.16 0.27 -0.06 0.09 -0.06 -0.06 0.15 0.09 0.
$ ROA : num 0.02 0.09 0.14 -0.03 0.05 -0.04 -0.05 0.09 0.03 0
$ Log_MV : num 8.65 10.38 9.81 8.3 10.36 ..
$ Return_yearly : int 0 1 0 0 0 0 0 0 0 0 ...
$ L3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ L6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sector : int 2 2 2 2 2 1 2 2 4 1 ...
smp_size <- floor(0.8 * nrow(data1))
set.seed(123)
train_ind <- sample(seq_len(nrow(data1)), size = smp_size)
train <- data1[train_ind, ]
test <- data1[-train_ind, ]
train$Return_yearly <-as.factor(train$Return_yearly)
train$L3 <-as.factor(train$L3)
train$L6 <-as.factor(train$L6)
train$Sector <-as.factor(train$Sector)
train$L3 <-model.matrix( ~ L3 - 1, data=train)
train$L6 <-model.matrix( ~ L6 - 1, data=train)
train$Sector<-model.matrix( ~ Sector - 1, data=train)
x <- model.matrix(Return_yearly ~., train)
y <- train$Return_yearly
ridge.mod <- glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nlambda=100, lambda.min.ratio=0.0001)
set.seed(1)
cv.out <- cv.glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nfolds = 5, type.measure = "auc", nlambda=100, lambda.min.ratio=0.0001)
plot(cv.out)
best.lambda <- cv.out$lambda.min
best.lambda
[1] 5.109392
Testing the model
test$L3 <-as.factor(test$L3)
test$L6 <-as.factor(test$L6)
test$Sector <-as.factor(test$Sector)
test$Return_yearly <-as.factor(test$Return_yearly)
test$L3 <-model.matrix( ~ L3 - 1, data=test)
test$L6 <-model.matrix( ~ L6 - 1, data=test)
test$Sector<-model.matrix( ~ Sector - 1, data=test)
newx <- model.matrix(Return_yearly ~., test)
y.pred <- as.matrix(ridge.mod,newx=newx, type="class",data=test)
Comparing for accuracy testing, a warning pops up and I am unable to continue:
compare <- cbind (actual=test$Return_yearly, y.pred)
Warning message:
In cbind(actual = test$Return_yearly, y.pred) :
number of rows of result is not a multiple of vector length (arg 1)
Without a reproducible dataset here's a guess:
The train and test matrices have different columns as the result of converting L3 and L6 to factors. By default, as.factor() creates as many levels in a factor as there are unique values, so if by chance the train/test split has different unique values of L3 or L6, the number of dummy variables created by model.matrix() will be different as well.
Possible solution: convert to factors before the train/test split, or supply factor() with the complete set of levels, like:
train$L3 <- factor(train$L3, levels = unique(data1$L3))
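A sketch of the first option, assuming data1 and train_ind as defined in the question:
# convert to factors once, before splitting, so train and test share identical level sets
data1$L3     <- factor(data1$L3)
data1$L6     <- factor(data1$L6)
data1$Sector <- factor(data1$Sector)
train <- data1[train_ind, ]
test  <- data1[-train_ind, ]
Separately, note that class predictions from a glmnet fit come from predict(), not as.matrix(); a guess at what was intended is y.pred <- predict(ridge.mod, newx = newx, type = "class", s = best.lambda).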
Use the following code to plot the ROC curve (true positive rate against false positive rate):
library(ROCR)
# ROC_Pre holds the predicted probabilities; data$LSD the observed labels
pred_obj <- prediction(ROC_Pre, data$LSD)
ROC <- performance(pred_obj, "tpr", "fpr")
plot(ROC)
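If you also need the AUC value itself, ROCR can compute it from the same prediction object (a short sketch using pred_obj from above):
auc <- performance(pred_obj, "auc")@y.values[[1]]
auc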
When trying to fit models to predict the outcome "death", I am getting 100% accuracy, which is obviously wrong. Could someone tell me what I am missing?
library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)
#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
# 0, 0, 0, 0,
#C5.0 [Release 2.07 GPL Edition] Tue Aug 4 10:23:10 2015
#-------------------------------
#Class specified by attribute `outcome'
#Read 27875 cases (23 attributes) from undefined.data
#21 attributes winnowed
#Estimated importance of remaining attributes:
#-2147483648% no.subjective.fevernofever
#Rules:
#Rule 1: (26982, lift 1.0)
# no.subjective.fevernofever <= 0
# -> class no [1.000]
#Rule 2: (893, lift 31.2)
# no.subjective.fevernofever > 0
# -> class yes [0.999]
#Default class: no
#Evaluation on training data (27875 cases):
# Rules
# ----------------
# No Errors
# 2 0( 0.0%) <<
# (a) (b) <-classified as
# ---- ----
# 26982 (a): class no
# 893 (b): class yes
# Attribute usage:
# 100.00% no.subjective.fevernofever
#Time: 0.1 secs
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
For the Random Forest model
set.seed(100)
modelRF <- train(death~., data=training_Score, method="rf", trControl=control)
predictRF <- predict(modelRF,testing_Score)
confusionMatrix(predictRF, testing_Score$death)
#Confusion Matrix and Statistics
#
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
predictRFprobs <- predict(modelRF, testing_Score, type = "prob")
For the Logit model
set.seed(100)
modelLOGIT <- train(death~., data=training_Score,method="glm",family="binomial", trControl=control)
summary(modelLOGIT)
#Call:
#NULL
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -2.657e+01 7.144e+04 0.000 1.000
#age.in.months 3.554e-15 7.681e+01 0.000 1.000
#temp -1.916e-13 1.885e+03 0.000 1.000
#genderfemale 3.644e-14 4.290e+03 0.000 1.000
#no.subjective.fevernofever 5.313e+01 1.237e+04 0.004 0.997
#palloryes -1.156e-13 4.747e+03 0.000 1.000
#jaundiceyes -2.330e-12 1.142e+04 0.000 1.000
#vomitingyes 1.197e-13 4.791e+03 0.000 1.000
#diarrheayes -3.043e-13 4.841e+03 0.000 1.000
#dark.urineyes -6.958e-13 1.037e+04 0.000 1.000
#intercostal.retractionyes 2.851e-13 1.003e+04 0.000 1.000
#subcostal.retractionyes 7.414e-13 1.012e+04 0.000 1.000
#wheezingyes -1.756e-12 1.091e+04 0.000 1.000
#rhonchiyes -1.659e-12 1.074e+04 0.000 1.000
#difficulty.breathingyes 4.496e-13 6.504e+03 0.000 1.000
#deep.breathingyes 1.086e-12 7.075e+03 0.000 1.000
#convulsionsyes -1.294e-12 6.424e+03 0.000 1.000
#lethargyyes -4.338e-13 6.188e+03 0.000 1.000
#unable.to.sityes -4.284e-13 8.118e+03 0.000 1.000
#unable.to.drinkyes 7.297e-13 6.507e+03 0.000 1.000
#altered.consciousnessyes 2.907e-12 1.071e+04 0.000 1.000
#unconsciousnessyes 2.868e-11 1.505e+04 0.000 1.000
#meningeal.signsyes -1.177e-11 1.570e+04 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 7.9025e+03 on 27874 degrees of freedom
#Residual deviance: 1.6172e-07 on 27852 degrees of freedom
#AIC: 46
#Number of Fisher Scoring iterations: 25
predictLOGIT <- predict(modelLOGIT, testing_Score)
confusionMatrix(predictLOGIT, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
The data before slicing was:
str(riskFinal)
#'data.frame': 46458 obs. of 23 variables:
# $ age.in.months : num 3 3 4 2 1.16 ...
# $ temp : num 35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
EDIT: Based on the comments, I realized that the no.subjective.fever variable had exactly the same values as the target variable death, so I excluded it from the model. Then I got even stranger results:
RANDOM FOREST
set.seed(100)
nmodelRF<- train(death~.-no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF<-predict(nmodelRF,testing_Score)
confusionMatrix(npredictRF, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17988 595
# yes 0 0
#
# Accuracy : 0.968
# 95% CI : (0.9653, 0.9705)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5109
#
# Kappa : 0
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 1.000
# Specificity : 0.000
# Pos Pred Value : 0.968
# Neg Pred Value : NaN
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 1.000
# Balanced Accuracy : 0.500
#
# 'Positive' Class : no
Logit
set.seed(100)
nmodelLOGIT<- train(death~.-no.subjective.fever, data=training_Score,method="glm",family="binomial", trControl=control)
summary(nmodelLOGIT)
# Call:
# NULL
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5113 -0.2525 -0.2041 -0.1676 3.1698
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.432065 1.084942 2.242 0.024984 *
#age.in.months -0.001047 0.001293 -0.810 0.417874
#temp -0.168704 0.028815 -5.855 4.78e-09 ***
#genderfemale -0.053306 0.070468 -0.756 0.449375
#palloryes 0.282123 0.076518 3.687 0.000227 ***
#jaundiceyes 0.323755 0.144607 2.239 0.025165 *
#vomitingyes -0.533661 0.082948 -6.434 1.25e-10 ***
#diarrheayes -0.040272 0.080417 -0.501 0.616520
#dark.urineyes -0.583666 0.168787 -3.458 0.000544 ***
#intercostal.retractionyes -0.021717 0.129607 -0.168 0.866926
#subcostal.retractionyes 0.269588 0.128772 2.094 0.036301 *
#wheezingyes -0.587940 0.150475 -3.907 9.34e-05 ***
#rhonchiyes -0.008565 0.140095 -0.061 0.951249
#difficulty.breathingyes 0.397394 0.087789 4.527 5.99e-06 ***
#deep.breathingyes 0.399302 0.098761 4.043 5.28e-05 ***
#convulsionsyes 0.132609 0.094038 1.410 0.158491
#lethargyyes 0.338599 0.089934 3.765 0.000167 ***
#unable.to.sityes 0.452111 0.104556 4.324 1.53e-05 ***
#unable.to.drinkyes 0.516878 0.089685 5.763 8.25e-09 ***
#altered.consciousnessyes 0.433672 0.123288 3.518 0.000436 ***
#unconsciousnessyes 0.754012 0.136105 5.540 3.03e-08 ***
#meningeal.signsyes 0.188823 0.161088 1.172 0.241130
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7902.5 on 27874 degrees of freedom
# Residual deviance: 7148.5 on 27853 degrees of freedom
# AIC: 7192.5
#
# Number of Fisher Scoring iterations: 6
npredictLOGIT<-predict(nmodelLOGIT,testing_Score)
confusionMatrix(npredictLOGIT, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17982 592
# yes 6 3
#
# Accuracy : 0.9678
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5605
#
# Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 0.999666
# Specificity : 0.005042
# Pos Pred Value : 0.968127
# Neg Pred Value : 0.333333
# Prevalence : 0.967981
# Detection Rate : 0.967659
# Detection Prevalence : 0.999516
# Balanced Accuracy : 0.502354
#
# 'Positive' Class : no
The 100% accuracy results are probably not correct. I assume they are due to the fact that the target variable, or another variable with essentially the same entries as the target variable (as pointed out in a comment by @ulfelder), is included in both the training set and the test set. Such columns need to be removed during model building and testing, since they encode the very classification you are trying to predict; the train/test data should only contain information that (hopefully) leads to a correct classification according to the target variable.
You could try the following:
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death, p = 0.6, list = FALSE)
training_Score <- riskFinal[intrain, ]
testing_Score <- riskFinal[-intrain, ]
train_target <- training_Score$death
test_target <- testing_Score$death
# drop the target column from both predictor sets
training_Score <- training_Score[, -which(colnames(training_Score) == "death")]
testing_Score <- testing_Score[, -which(colnames(testing_Score) == "death")]
modelRF <- train(training_Score, train_target, method = "rf", trControl = control)
Then you could proceed like you did before, noting that the target "death" is stored in the variables train_target and test_target.
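As a quick sanity check for this kind of leakage, a sketch like the following (assuming the riskFinal data frame from the question) flags factor columns whose values each map to a single death outcome:
# a leaky column: every one of its values corresponds to exactly one outcome class
fac_cols <- setdiff(names(Filter(is.factor, riskFinal)), "death")
leaky <- sapply(riskFinal[fac_cols], function(col) {
  tab <- table(col, riskFinal$death)
  all(rowSums(tab > 0) == 1)
})
names(leaky)[leaky]  # should flag no.subjective.fever in this dataset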
Hope this helps.