I am trying to fit a logistic ridge regression and have developed the model as follows. I need help with the code for testing its accuracy and for producing the ROC/AUC curve with a threshold value.
My code is as follows:
Fitting the model
library(glmnet)
library(caret)
data1<-read.csv("D:\\Research\\Final2.csv",header=T,sep=",")
str(data1)
'data.frame': 154 obs. of 12 variables:
$ Earningspershare : num 12 2.69 8.18 -0.91 3.04 ...
$ NetAssetsPerShare: num 167.1 17.2 41.1 14.2 33 ...
$ Dividendpershare : num 3 1.5 1.5 0 1.25 0 0 0 0 0.5 ...
$ PE : num 7.35 8.85 6.66 -5.27 18.49 ...
$ PB : num 0.53 1.38 1.33 0.34 1.7 0.23 0.5 3.1 0.5 0.3 ...
$ ROE : num 0.08 0.16 0.27 -0.06 0.09 -0.06 -0.06 0.15 0.09 0.
$ ROA : num 0.02 0.09 0.14 -0.03 0.05 -0.04 -0.05 0.09 0.03 0
$ Log_MV : num 8.65 10.38 9.81 8.3 10.36 ..
$ Return_yearly : int 0 1 0 0 0 0 0 0 0 0 ...
$ L3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ L6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sector : int 2 2 2 2 2 1 2 2 4 1 ...
smp_size <- floor(0.8 * nrow(data1))
set.seed(123)
train_ind <- sample(seq_len(nrow(data1)), size = smp_size)
train <- data1[train_ind, ]
test <- data1[-train_ind, ]
train$Return_yearly <-as.factor(train$Return_yearly)
train$L3 <-as.factor(train$L3)
train$L6 <-as.factor(train$L6)
train$Sector <-as.factor(train$Sector)
train$L3 <-model.matrix( ~ L3 - 1, data=train)
train$L6 <-model.matrix( ~ L6 - 1, data=train)
train$Sector<-model.matrix( ~ Sector - 1, data=train)
x <- model.matrix(Return_yearly ~., train)
y <- train$Return_yearly
ridge.mod <- glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nlambda=100, lambda.min.ratio=0.0001)
set.seed(1)
cv.out <- cv.glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nfolds = 5, type.measure = "auc", nlambda=100, lambda.min.ratio=0.0001)
plot(cv.out)
best.lambda <- cv.out$lambda.min
best.lambda
[1] 5.109392
Testing the model
test$L3 <-as.factor(test$L3)
test$L6 <-as.factor(test$L6)
test$Sector <-as.factor(test$Sector)
test$Return_yearly <-as.factor(test$Return_yearly)
test$L3 <-model.matrix( ~ L3 - 1, data=test)
test$L6 <-model.matrix( ~ L6 - 1, data=test)
test$Sector<-model.matrix( ~ Sector - 1, data=test)
newx <- model.matrix(Return_yearly ~., test)
y.pred <- as.matrix(ridge.mod,newx=newx, type="class",data=test)
Comparing predictions with the actual values for accuracy testing; a warning pops up and I am unable to continue:
compare <- cbind (actual=test$Return_yearly, y.pred)
Warning message:
In cbind(actual = test$Return_yearly, y.pred) :
number of rows of result is not a multiple of vector length (arg 1)
Without a reproducible dataset here's a guess:
The train and test matrices have different columns as the result of converting L3 and L6 to factors. By default, as.factor() creates as many levels in a factor as there are unique values, so if by chance the train/test split has different unique values of L3 or L6, the number of dummy variables created by model.matrix() will be different as well.
Possible solution: call as.factor() before the train/test split, or use factor() with the complete set of levels (as.factor() has no levels argument), like
train$L3 <- factor(train$L3, levels = unique(data1$L3))
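The column mismatch described above can be reproduced on a toy data frame. This is a minimal sketch, not the asker's data: data1 here is a hypothetical six-row frame with a single Sector column, and all_levels is a name introduced only for illustration.

```r
# Hypothetical toy data: Sector takes the values 1, 2 and 4
data1 <- data.frame(Sector = c(1, 2, 4, 2, 1, 2))
train <- data1[1:4, , drop = FALSE]  # contains 1, 2 and 4
test  <- data1[5:6, , drop = FALSE]  # contains only 1 and 2

# Converting after the split: each subset derives its own levels
x_train_bad <- model.matrix(~ Sector - 1,
                            data.frame(Sector = as.factor(train$Sector)))
x_test_bad  <- model.matrix(~ Sector - 1,
                            data.frame(Sector = as.factor(test$Sector)))
ncol(x_train_bad)  # three dummy columns
ncol(x_test_bad)   # only two dummy columns

# Supplying the complete set of levels keeps the columns aligned
all_levels <- sort(unique(data1$Sector))
x_train <- model.matrix(~ Sector - 1,
                        data.frame(Sector = factor(train$Sector, levels = all_levels)))
x_test  <- model.matrix(~ Sector - 1,
                        data.frame(Sector = factor(test$Sector, levels = all_levels)))
identical(colnames(x_train), colnames(x_test))
```

With shared levels, the test matrix simply gets an all-zero column for the level it never observed.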
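Use the following code (ROCR package) to plot the ROC curve, i.e. the true positive rate against the false positive rate. On the first line, ROC_Pre on the right-hand side holds the predicted probabilities and data$LSD the actual class labels.
library(ROCR)
ROC_Pre <- prediction(ROC_Pre, data$LSD)
ROC <- performance(ROC_Pre, "tpr", "fpr")
plot(ROC)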
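For readers who want to see what sits behind accuracy at a threshold and the AUC, here is a minimal base R sketch. The vectors actual and prob are hypothetical. In the glmnet setting above, the probabilities would presumably come from something like predict(cv.out, newx = newx, s = "lambda.min", type = "response"), which is an assumption about the intended workflow rather than part of the original answer.

```r
# Hypothetical true labels and predicted probabilities
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.55, 0.90)

# Accuracy at a chosen threshold
threshold <- 0.5
predicted <- as.integer(prob >= threshold)
accuracy  <- mean(predicted == actual)

# AUC via the rank (Mann-Whitney) formulation
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
r     <- rank(prob)
auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Points of the ROC curve, sweeping the threshold from high to low
ord <- order(prob, decreasing = TRUE)
tpr <- cumsum(actual[ord] == 1) / n_pos
fpr <- cumsum(actual[ord] == 0) / n_neg
# plot(c(0, fpr), c(0, tpr), type = "s") would draw the curve
```

The rank formulation gives the same number as the area under the stepwise curve when there are no tied probabilities.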
I have difficulty calculating the C-index (UnoC with survAUC R package) for each treatment arm to assess the variable-treatment interaction.
I have a database with 4 explanatory variables X1, X2, X3, X4, as follows:
> str(data)
'data.frame': 1000 obs. of 7 variables:
$ X1 : num -0.578 0.351 0.759 -0.858 -1.022 ...
$ X2 : num -0.7897 0.0339 -1.608 -1.1642 -0.0787 ...
$ X3 : num -0.1561 -0.7147 -0.8229 -0.1519 -0.0318 ...
$ X4 : num 1.4161 -0.0688 -0.155 -0.1571 -0.649 ...
$ TRT : num 0 0 0 0 0 0 0 1 0 1 ...
$ time: num 6.52 2.15 3 1.31 1.56 ...
$ stat: num 1 1 1 1 1 1 1 1 1 1 ...
The variable X4 interacts with the treatment variable and I don't have censored data.
I would like to calculate the C-index (UnoC) for each treatment arm. I expect the C-index to be equal to 0.5 in the control arm and much higher in the experimental arm.
But, I get almost the same value for both arms!
Can anyone confirm that: if I have a strong interaction between a variable and the treatment, the C-index in the experimental arm is high and in the control arm = 0.5?
Here is my attempt:
TR <- data[1:500,]
TE <- data[501:1000,]
s <- Surv(TR$time, TR$stat)
sNew <- Surv(TE$time, TE$stat)
train.fit <- coxph(Surv(time, stat) ~ X4, data=TR)
lpnew <- predict(train.fit, newdata=TE)
# The C-index for each treatment arm
UnoC(Surv.rsp = s[TR$TRT == 1], Surv.rsp.new = sNew[TE$TRT == 1], lpnew = lpnew[TE$TRT == 1])
[1] 0.7577109
UnoC(Surv.rsp = s[TR$TRT == 0], Surv.rsp.new = sNew[TE$TRT == 0], lpnew = -lpnew[TE$TRT == 0])
[1] 0.7295202
Thank you for your Help
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
bestglm(Xy = winedata,
family = binomial, # binomial family for logistic
IC = "AIC", # Information criteria
method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error -
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that is what the manual says as well. If that's the case, why can't I find the test error on it using 10-fold CV with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.
The data was processed as -
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
bestglm rearranges your data and renames the response variable to y. If you pass its fitted model back into cv.glm, winedata does not have a column y, and everything fails after that.
It's always good to check the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You can substitute things in the call etc, but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:
best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")
best.cv.err<- cv.glm(winedata,fit,cost1, K=10)
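The formula-rebuilding step can be tried without bestglm. Below is a minimal sketch using the built-in mtcars data: best_models is a hypothetical stand-in for the $BestModels table (one logical column per predictor plus a Criterion column, with made-up criterion values), and am plays the role of the response.

```r
# Hypothetical stand-in for res.best.logistic$BestModels
best_models <- data.frame(wt = c(TRUE, TRUE),
                          hp = c(TRUE, FALSE),
                          qsec = c(FALSE, TRUE),
                          Criterion = c(10.2, 11.5))

# First row = best model; drop the criterion column, keep the selected names
sel  <- best_models[1, names(best_models) != "Criterion"]
vars <- names(sel)[unlist(sel)]

# Rebuild a formula against the original column names and refit
form <- as.formula(paste("am ~", paste(vars, collapse = " + ")))
fit  <- glm(form, data = mtcars, family = binomial)
form
```

Because fit now refers to columns that exist in the data frame, it can be handed to cv.glm together with that same data frame, as in the answer above.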
Working with R, I performed a KNN Algorithm knn <- train(x = x_train, y = y_train, method = "knn") with this dataframe:
1 0.35955056 0.62068966 0.34177215 0.27 0.7260274 0.22 MIT
2 0.59550562 0.56321839 0.35443038 0.15 0.7260274 0.22 MIT
3 0.52808989 0.35632184 0.45569620 0.13 0.7397260 0.22 NUC
4 0.34831461 0.35632184 0.34177215 0.54 0.6575342 0.22 MIT
5 0.44943820 0.31034483 0.44303797 0.17 0.6712329 0.22 CYT
6 0.43820225 0.47126437 0.34177215 0.65 0.7260274 0.22 MIT
7 0.41573034 0.36781609 0.48101266 0.20 0.7945205 0.34 NUC
8 0.49438202 0.42528736 0.56962025 0.36 0.6712329 0.22 MIT
9 0.32584270 0.29885057 0.49367089 0.15 0.7945205 0.30 CYT
10 0.35955056 0.29885057 0.41772152 0.21 0.7260274 0.27 NU
...
Obtaining this result:
k-Nearest Neighbors
945 samples
6 predictor
8 classes: 'CYT', 'ERL', 'EXC', 'ME', 'MIT', 'NUC', 'POX', 'VAC'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 945, 945, 945, 945, 945, 945, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5273630 0.3760233
7 0.5480598 0.4004283
9 0.5667651 0.4242597
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
After that, I wanted to do a confusion matrix with this code:
predictions <- predict(knn, x_test)
results <- data.frame(real = y_test, predicted = predictions)
attach(results)
confusionMatrix(real, predicted)
And I got this results:
Confusion Matrix and Statistics
Reference
Prediction CYT ERL EXC ME MIT NUC POX VAC
CYT 73 0 0 3 7 44 0 0
ERL 0 0 0 1 0 0 0 0
EXC 2 0 6 3 1 0 0 0
ME 5 0 1 68 2 11 0 0
MIT 19 0 0 8 44 14 0 0
NUC 57 0 0 6 8 74 0 0
POX 3 0 0 0 1 2 0 0
VAC 3 0 2 2 1 1 0 0
Overall Statistics
Accuracy : 0.5614
95% CI : (0.5153, 0.6068)
No Information Rate : 0.3432
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.417
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: CYT Class: ERL Class: EXC Class: ME Class: MIT Class: NUC Class: POX Class: VAC
Sensitivity 0.4506 NA 0.66667 0.7473 0.68750 0.5068 NA NA
Specificity 0.8258 0.997881 0.98704 0.9501 0.89951 0.7822 0.98729 0.98093
Pos Pred Value 0.5748 NA 0.50000 0.7816 0.51765 0.5103 NA NA
Neg Pred Value 0.7420 NA 0.99348 0.9403 0.94832 0.7798 NA NA
Prevalence 0.3432 0.000000 0.01907 0.1928 0.13559 0.3093 0.00000 0.00000
Detection Rate 0.1547 0.000000 0.01271 0.1441 0.09322 0.1568 0.00000 0.00000
Detection Prevalence 0.2691 0.002119 0.02542 0.1843 0.18008 0.3072 0.01271 0.01907
Balanced Accuracy 0.6382 NA 0.82685 0.8487 0.79350 0.6445 NA NA
I would like to know why I get these NAs in the sensitivity, for example in the class ERL.
Did I do something wrong?
What is the reason for these NAs? I can provide the complete data frame if necessary.
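Based on the confusion matrix, the Reference columns for ERL, POX, and VAC are all zero, which leads to the NAs in the Statistics by Class. Note that confusionMatrix() expects the predictions first and the true values second; because the call was confusionMatrix(real, predicted), the Reference columns here are the model's predictions, so the model never predicted ERL, POX, or VAC.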
Take a look at the Sensitivity of Class ERL for example. Sensitivity, also called the True Positive Rate, is calculated as the number of correct positive predictions divided by the total number of positives.
Positive ERL Predictions = 0
Actual ERL Classifications = 0
Sensitivity ERL = 0/0 which leads to the NA.
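The 0/0 can be checked directly from the table printed in the question. This is a small base R sketch that re-enters the counts by hand; cm and sens are names chosen here for illustration.

```r
classes <- c("CYT", "ERL", "EXC", "ME", "MIT", "NUC", "POX", "VAC")

# Counts copied from the confusion matrix above (rows = Prediction, columns = Reference)
cm <- matrix(c(73, 0, 0,  3,  7, 44, 0, 0,
                0, 0, 0,  1,  0,  0, 0, 0,
                2, 0, 6,  3,  1,  0, 0, 0,
                5, 0, 1, 68,  2, 11, 0, 0,
               19, 0, 0,  8, 44, 14, 0, 0,
               57, 0, 0,  6,  8, 74, 0, 0,
                3, 0, 0,  0,  1,  2, 0, 0,
                3, 0, 2,  2,  1,  1, 0, 0),
             nrow = 8, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Sensitivity per class = correct positives / column total
sens <- diag(cm) / colSums(cm)
round(sens, 4)

# Classes whose Reference column sums to zero give 0/0
names(sens)[is.nan(sens)]
```

Base R reports the undefined ratios as NaN, which caret prints as NA.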
I am running caret's neural network on the Bike Sharing dataset and I get the following warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
#' $ season : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
#' $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
#' $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
#' $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
#' $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
#' $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
#' $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
#' $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phiver's comment is spot on; however, I would still like to provide a more verbose answer for this concrete example.
In order to investigate what is going on in more detail one should add the argument savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 3,
returnResamp = "all",
savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
data = dataset_selected,
method = 'nnet',
metric = metric,
trControl = control,
trace = FALSE,
form = "traditional")
Now, when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice that the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions when decay == 0 are the same (have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
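The same behaviour can be seen without caret. This is a minimal base R sketch: obs reuses the first five transformed casual values shown in the question, and pred is a constant vector standing in for the decay = 0 predictions.

```r
# First five transformed 'casual' values from the str() output above
obs  <- c(-0.361, -0.588, -0.81, -1.867, -1.208)

# Constant predictions, as produced when decay = 0
pred <- rep(0, length(obs))

# R squared as a squared correlation is undefined for a zero-variance vector
r2 <- suppressWarnings(cor(obs, pred)^2)
is.na(r2)

# RMSE does not depend on the variance of pred, so it is still defined
rmse <- sqrt(mean((obs - pred)^2))
rmse
```

This is why the RMSE and MAE columns are filled in for the decay = 0 rows while Rsquared is missing.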
Answer by @topepo (main developer of the caret package), quoted from a GitHub thread:
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic st deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
The answer by @missuse already explains very well why this warning appears.
So I just want to add some straightforward ways to get rid of it.
If in some cross-validation folds the predictions get zero variance, the model didn't converge. In such cases, you can try the neuralnet package which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
method = "neuralnet",
data = training.set[,],
# apply preProcess within cross-validation folds
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
threshold = 0.3
)
I am trying to run an XGBoost model on a single test data point.
a <- data.frame(satisfaction_level=0.14,
last_evaluation=0.92,
number_project=2,
average_montly_hours=350,
time_spend_company=5,
Work_accident=0,
promotion_last_5years=1,
sales=factor("sales",levels=levels(Bdata$sales)),
salary=factor("medium",levels=levels(Bdata$salary)))
#Converting it into matrix format
str(a)
a <- as.data.frame.model.matrix(a)
I get the error below when I predict using the model:
xgb.preds = predict(xgb.model, a)
Error in xgb.DMatrix(newdata, missing = missing) :
xgb.DMatrix: does not support to construct from list
Created the model using:
xgb.model <- xgboost(param =param, data = xgb.train.data,nrounds = 1500 ,eta = 0.05,subsample = 1 )
and Bdata contains:
head(Bdata)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
1 0.38 0.53 2 157 3 0 1 0 sales low
2 0.80 0.86 5 262 6 0 1 0 sales medium
3 0.11 0.88 7 272 4 0 1 0 sales medium
4 0.72 0.87 5 223 5 0 1 0 sales low
5 0.37 0.52 2 159 3 0 1 0 sales low
6 0.41 0.50 2 153 3 0 1 0 sales low
>
You should not use as.data.frame.model.matrix. Your a object is still a data.frame. You need to use a <- as.matrix(a).
See below for a workable example using the iris dataset.
library(xgboost)
x = as.matrix(iris[, 1:4])
y = as.numeric(factor(iris[, 5]))-1
model <- xgboost(data = x, label = y, nrounds = 10)
new <- data.frame(Sepal.Length = 5.1,
Sepal.Width = 3.5,
Petal.Length = 1.4,
Petal.Width = 0.2)
#error because it is a data.frame
preds <- predict(model, newdata = new)
# Error in xgb.DMatrix(newdata, missing = missing) :
# xgb.DMatrix: does not support to construct from list
# This works because data.frame is turned into a matrix
preds <- predict(model, newdata = as.matrix(new))
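One caveat worth checking: the data frame in the question contains factor columns (sales, salary), and as.matrix() on a data frame with non-numeric columns returns a character matrix rather than a numeric one. Here is a minimal base R sketch with a hypothetical two-column frame. Whether model.matrix() is the right encoding depends on how the training matrix was built, which the question does not show, so treat this as an assumption.

```r
# Hypothetical single row with one numeric and one factor column
a <- data.frame(hours  = 350,
                salary = factor("medium", levels = c("low", "medium", "high")))

# as.matrix() falls back to character when any column is non-numeric
m_chr <- as.matrix(a)
typeof(m_chr)

# model.matrix() expands the factor into numeric dummy columns
m_num <- model.matrix(~ . - 1, data = a)
typeof(m_num)
colnames(m_num)
```

The columns of the new matrix need to match the columns used at training time in both name and order, which is why declaring the full factor levels on the single row matters.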