I want to run models on a loop via and then store the performance metrics into a table. I do not want to use the confusionMatrix function in caret, but I want to compute the precision, recall and f1 and then store those in a table. Please assist, edits to the code are welcome.
My attempt is below.
library(MASS) #will load our biopsy data
library(caret)
data("biopsy")
biopsy$ID<-NULL
names(biopsy)<-c('clump thickness','uniformity cell size','uniformity cell shape',
'marginal adhesion','single epithelial cell size','bare nuclei',
'bland chromatin','normal nuclei','mitosis','class')
sum(is.na(biopsy))
biopsy<-na.omit(biopsy)
sum(is.na(biopsy))
head(biopsy,5)
set.seed(123)
inTraining <- createDataPartition(biopsy$class, p = .75, list = FALSE)
training <- biopsy[ inTraining,]
testing <- biopsy[-inTraining,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10,repeats = 5, verboseIter = F, classProbs = T)
#CHANGING THE CHARACTERS INTO FACTORS VARAIBLES
training<- as.data.frame(unclass(training),
stringsAsFactors = TRUE)
#CHANGING THE CHARACTERS INTO FACTORS VARAIBLES
testing <- as.data.frame(unclass(testing),
stringsAsFactors = TRUE)
models<-c("svmRadial","rf")
results_table <- data.frame(models = models, stringsAsFactors = F)
for (i in models){
model_train<-train(class~., data=training, method=i,
trControl=control,metric="Accuracy")
predictions<-predict(model_train, newdata=testing)
precision_<-posPredValue(predictions,testing)
recall_<-sensitivity(predictions,testing)
f1<-(2*precision_*recall_)/(precision_+recall_)
# put that in the results table
results_table[i, "Precision"] <- precision_
results_table[i, "Recall"] <- recall_
results_table[i, "F1score"] <- f1
}
However I get an error which says Error in posPredValue.default(predictions, testing) : inputs must be factors. i do not know where I went wrong and any edits to my code are welcome.
I know that I could get precision,recall, f1 by just using the code below (B), however this is a tutorial question where I am required not to use the code example below (B):
(B)
for (i in models){
model_train<-train(class~., data=training, method=i,
trControl=control,metric="Accuracy")
predictions<-predict(model_train, newdata=testing)
print(confusionMatrix(predictions, testing$class,mode="prec_recall"))
}
A few things need to happen.
You have to change the function calls for posPredValue and sensitivity. For both, change testing to testing$class.
for the results_table, i is a word, not a value, so you're assigning results_table["rf", "Precision"] <- precision_ (This makes a new row, where the row name is "rf".)
Here is your for statement, with changes to those functions mentioned in 1) and a modification to address the issue in 2).
for (i in models){
model_train <- train(class~., data = training, method = i,
trControl= control, metric = "Accuracy")
assign("fit", model_train)
predictions <- predict(model_train, newdata = testing)
precision_ <-posPredValue(predictions, testing$class)
recall_ <- sensitivity(predictions, testing$class)
f1 <- (2*precision_ * recall_) / (precision_ + recall_)
# put that in the results table
results_table[results_table$models %in% i, "Precision"] <- precision_
results_table[results_table$models %in% i, "Recall"] <- recall_
results_table[results_table$models %in% i, "F1score"] <- f1
}
This is what it looks like for me.
results_table
# models Precision Recall F1score
# 1 svmRadial 0.9722222 0.9459459 0.9589041
# 2 rf 0.9732143 0.9819820 0.9775785
R Optim stops iterating earlier than I want. I use method="L-BFGS-B" (as I need different bounds for different parameters). I know I can set the maximum of iterations via 'control'>'maxit', but optim does not reach the max. I guess 'control'>'pgtol' and/or 'factr' should help, but apparently they do not.
I do the same optimisation with Excel solver Add-In and therefore I know that R stops iterating too early.
Here is my sample data and code:
dsg <- as.data.frame(cbind(c(0:47)
,c(0.402469136,0.368944099,0.375477721,0.391121435,0.36741817,0.366685299,0.373907486,0.409429755,0.383399692,0.412436098,0.389864409,0.411901702,0.379012346,0.383269431,0.372778178,0.397308798,0.407005188,0.396770412,0.378525076,0.38084766,0.378051956,0.376836815,0.351144888,0.387655975,0.415815896,0.39851447,0.384345349,0.40061633,0.370402697,0.373590499,0.379474943,0.378865913,0.382395269,0.365808609,0.383106843,0.35946353,0.361037542,0.36077482,0.384418935,0.362583824,0.385405581,0.348344335,0.358934922,0.379079876,0.391434446,0.354347971,0.361197833,0.372232682)
,c(0.114814815,0.118012422,0.132153971,0.137563457,0.113412879,0.113819587,0.117105297,0.117003116,0.132768529,0.114580427,0.120072809,0.116621127,0.124691358,0.118103399,0.130523309,0.13783449,0.114587233,0.10441059,0.113704754,0.109561299,0.108298377,0.118025013,0.125106438,0.106440408,0.107985517,0.127293523,0.130639958,0.113993233,0.111258799,0.113139383,0.114220436,0.094720217,0.094661712,0.119814534,0.100816305,0.10081601,0.092889949,0.100408522,0.090772039,0.090377762,0.084900005,0.092355162,0.112520582,0.097859676,0.087209055,0.1041137,0.112856553,0.090746204)
,c(9.18031601,11.09227687,9.83844379,9.64580639,10.22514748,10.23337748,10.40043161,11.42924699,13.81486345,14.13952435,13.61129849,10.83903702,6.88640782,9.04216056,12.02954886,10.72787232,9.4425759,11.13168511,10.81846319,7.78656007,9.72518025,13.7847261,12.33280119,9.26193982,9.44348187,9.84196161,11.74926408,12.84258627,11.7028168,10.15912189,9.40823422,10.91680175,13.23648902,16.4693486,14.21047788,9.13496124,7.57774394,8.51722165,11.76416064,10.1919151,11.73247567,9.81560667,8.74626473,8.28651636,12.22919798,14.78829048,12.31028928,7.84778185)
,c(32.81570128,31.82592469,38.98876493,36.76658375,38.44461603,25.63108488,24.05370986,29.96483401,35.41164119,38.10191701,40.08138389,40.88474396,30.11146104,28.32714529,38.10802983,33.06030547,30.26582152,30.81661426,19.32980669,22.1124164,39.01648731,36.54290113,42.37598936,37.80545142,35.41146597,38.03598825,44.00978984,39.49187432,42.19555313,46.46831371,28.62873468,29.05176428,53.9939235,54.82043874,46.26856583,46.39431442,39.83112353,40.50502621,39.48027012,37.93228955,42.59635965,35.06031045,30.37208461,28.13106896,38.42397418,38.90616994,42.98276083,39.79207105)
,c(3470.0,3927.0,4996.8,3148.7,3882.4,4579.9,4191.0,4328.4,4059.6,3667.6,4074.3,4220.0,3698.0,3660.2,4717.6,3687.2,4259.2,4098.1,4297.1,4321.0,3761.5,3586.7,3952.5,3630.0,3803.1,3863.9,4457.1,4280.6,4083.1,4443.2,4534.8,4510.7,3839.3,4408.0,4404.3,3633.3,4153.2,4129.3,4648.0,4947.3,5136.5,4491.6,5758.0,5423.6,5177.8,5312.8,5241.7,4551.9)
))
vs <- names(dsg)[1:5]
cr <- names(dsg)[6]
attach(dsg)
#a linear regression
minL.RSS <- function(par) {
Zws <- par[1]
for(u in 1:length(vs)) {
Zws <- Zws + par[u+1] * (get(vs[u]) ^ 1)
}
Zws <- (Zws - get(cr))^2
sum(Zws)
}
#same linear regression adding an exponent
minE.RSS <- function(par) {
Zws <- par[1]
for(u in 1:length(vs)) {
Zws <- Zws + par[u+1] * (get(vs[u]) ^ par[u+1+length(vs)])
}
Zws <- (Zws - get(cr))^2
sum(Zws)
}
#running optim for the simple regression
resultL <- optim(par = c(0,rep(0,length(vs))), fn = minL.RSS,
method="L-BFGS-B"
, lower = c(0,-Inf,0,-Inf,-Inf,-Inf)
, upper = c(Inf,rep(c(Inf),length(vs)))
, control = list(maxit = 4000)
)
resultL
#running optim for the regression with exponent, using the parameter start values found with the model before - but they dont change (but should)
resultE <- optim(par = c(resultL$par[1],resultL$par[2:(length(vs)+1)],rep(1,length(vs))), fn = minE.RSS,
method="L-BFGS-B"
, lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
, upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
, control = list(maxit = 4000, pgtol = 1e-100)
)
resultE
#using initial parameter values I received from same formula with Excel solver Add-In - the result is getting better=smaller
resultX <- optim(par = c(0,31,0,3500,2860,-31,1,1,1,0.17,1), fn = minE.RSS,
method="L-BFGS-B"
, lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
, upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
, control = list(maxit = 4000, pgtol = 1e-100)
)
resultX
detach(dsg)
resultX$value
[1] 8109259
resultL$value
[1] 8175660
resultE$value
[1] 8175660
I tried pgtol and factr with very small and very big values (1e100 / 1e-100), but resultE does not get better than resultL. And I know from Excel solver Add-In that there is a better solution (resultX).
How can I force optim to run more iterations and/or find a solution as good as Excel solver Add-In does?
It seems like factr, ndeps and maxit have been limiting in your case. You can get pretty close to resultX$value when you do:
resultE2 <- optim(par = c(resultL$par[1],resultL$par[2:(length(vs)+1)],rep(1,length(vs))), fn = minE.RSS,
method="L-BFGS-B"
, lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
, upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
, control = list(maxit = 1e4, pgtol = 0, ndeps = rep(1e-6, 11), factr=0))
resultE2$value
[1] 8109250
I'm attempting to create a genetic algorithm (not picky about library, ga and genalg produce same errors) to identify potential columns for use in a linear regression model, by minimizing -adj. r^2. Using mtcars as a play-set, trying to regress on mpg.
I have the following fitness function:
mtcarsnompg <- mtcars[,2:ncol(mtcars)]
evalFunc <- function(string) {
costfunc <- summary(lm(mtcars$mpg ~ ., data = mtcarsnompg[, which(string == 1)]))$adj.r.squared
return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
this causes:
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
Researching this error, I decided I could work around it this way:
evalFunc = function(string) {
child <- mtcarsnompg[, which(string == 1)]
costfunc <- summary(lm(as.formula(paste("mtcars$mpg ~", paste(child, collapse = "+"))), data = mtcars))$adj.r.squared
return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
but this results in:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
I know it should work, because I can evaluate the function by hand written either way, while not using ga:
solution <- c("1","1","1","0","1","0","1","1","1","0")
evalFunc(solution)
[1] -0.8172511
I also found in "A quick tour of GA" (https://cran.r-project.org/web/packages/GA/vignettes/GA.html) that using "string" in which(string == 1) is something the GA ought to be able to handle, so I have no idea what GA's issue with my function is.
Any thoughts on a way to write this to get ga or genalg to accept the function?
Turns out I didn't consider that a solution string of 0s (or indeed, a string of 0s with one 1) would cause the internal paste to read "mpg ~ " which is not a possible linear regression.