I want to run several models in a loop and then store the performance metrics in a table. I do not want to use the confusionMatrix function in caret; instead, I want to compute the precision, recall and F1 score myself and store those in a table. Please assist; edits to the code are welcome.
My attempt is below.
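library(MASS)  # provides the biopsy data
library(caret) # createDataPartition, trainControl, train, posPredValue, sensitivity
biopsy <- na.omit(biopsy[, -1]) # drop the ID column and incomplete rows so the 10 names below line up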
names(biopsy)<-c('clump thickness','uniformity cell size','uniformity cell shape',
'marginal adhesion','single epithelial cell size','bare nuclei',
'bland chromatin','normal nuclei','mitosis','class')
inTraining <- createDataPartition(biopsy$class, p = .75, list = FALSE)
training <- biopsy[ inTraining,]
testing <- biopsy[-inTraining,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE, classProbs = TRUE)
training <- as.data.frame(unclass(training),
                          stringsAsFactors = TRUE)
testing <- as.data.frame(unclass(testing),
                         stringsAsFactors = TRUE)
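models <- c("svmRadial", "rf")
results_table <- data.frame(models = models, stringsAsFactors = FALSE)
for (i in models){
  model_train <- train(class~., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  predictions <- predict(model_train, newdata = testing)
  precision_ <- posPredValue(predictions, testing) # this call raises the error quoted below
  recall_ <- sensitivity(predictions, testing)
  f1 <- (2 * precision_ * recall_) / (precision_ + recall_)
  # put that in the results table
  results_table[i, "Precision"] <- precision_
  results_table[i, "Recall"] <- recall_
  results_table[i, "F1score"] <- f1
}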
However, I get the following error: Error in posPredValue.default(predictions, testing) : inputs must be factors. I do not know where I went wrong, and any edits to my code are welcome.
I know that I could get precision, recall and F1 by just using the code below (B). However, this is a tutorial question where I am required not to use that approach (B):
for (i in models){
  model_train <- train(class~., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  predictions <- predict(model_train, newdata = testing)
  print(confusionMatrix(predictions, testing$class, mode = "prec_recall"))
}
A few things need to happen.
1) You have to change the function calls for posPredValue and sensitivity. For both, change testing to testing$class, because both functions expect two factors rather than a whole data frame.
2) For the results_table, i is a word, not a row number, so you're assigning results_table["rf", "Precision"] <- precision_. (This makes a new row, where the row name is "rf".)
Here is your for statement, with the changes to those functions mentioned in 1) and a modification to address the issue in 2).
for (i in models){
  model_train <- train(class~., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  assign("fit", model_train)
  predictions <- predict(model_train, newdata = testing)
  # compare predicted classes with the true classes (both are factors)
  precision_ <- posPredValue(predictions, testing$class)
  recall_ <- sensitivity(predictions, testing$class)
  f1 <- (2 * precision_ * recall_) / (precision_ + recall_)
  # put that in the results table, in the row that belongs to model i
  results_table[results_table$models %in% i, "Precision"] <- precision_
  results_table[results_table$models %in% i, "Recall"] <- recall_
  results_table[results_table$models %in% i, "F1score"] <- f1
}
This is what it looks like for me.
# models Precision Recall F1score
# 1 svmRadial 0.9722222 0.9459459 0.9589041
# 2 rf 0.9732143 0.9819820 0.9775785
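If the tutorial also rules out posPredValue and sensitivity, the same three numbers can be computed by hand from the counts of true positives, false positives and false negatives. The following is only a minimal sketch: prf_from_factors is a hypothetical helper name (not part of caret), the toy factors are made up for illustration, and the first factor level is treated as the positive class, which is the same default caret uses.

```r
# Hypothetical helper (not a caret function): precision, recall and F1
# computed from two factors that share the same levels.
prf_from_factors <- function(pred, ref, positive = levels(ref)[1]) {
  stopifnot(is.factor(pred), is.factor(ref))
  tp <- sum(pred == positive & ref == positive)  # true positives
  fp <- sum(pred == positive & ref != positive)  # false positives
  fn <- sum(pred != positive & ref == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(Precision = precision, Recall = recall, F1score = f1)
}

# Toy data, made up purely for illustration
lev  <- c("benign", "malignant")
ref  <- factor(c("benign", "benign", "benign", "malignant", "malignant"), levels = lev)
pred <- factor(c("benign", "benign", "malignant", "benign", "malignant"), levels = lev)

scores <- prf_from_factors(pred, ref)
print(scores)
```

Inside the loop, a call such as prf_from_factors(predictions, testing$class) would then stand in for the two caret calls.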
R optim stops iterating earlier than I want. I use method = "L-BFGS-B" (as I need different bounds for different parameters). I know I can set the maximum number of iterations via the maxit entry of control, but optim does not reach that maximum. I guessed that the pgtol and/or factr entries of control would help, but apparently they do not.
I do the same optimisation with Excel solver Add-In and therefore I know that R stops iterating too early.
Here is my sample data and code:
dsg <- as.data.frame(cbind(c(0:47)
vs <- names(dsg)[1:5]
cr <- names(dsg)[6]
#a linear regression
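minL.RSS <- function(par) {
  Zws <- par[1]
  for(u in 1:length(vs)) {
    # look the column up in dsg directly instead of relying on get()
    Zws <- Zws + par[u+1] * (dsg[[vs[u]]] ^ 1)
  }
  Zws <- (Zws - dsg[[cr]])^2
  sum(Zws) # residual sum of squares: optim needs a single number
}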
#same linear regression adding an exponent
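minE.RSS <- function(par) {
  Zws <- par[1]
  for(u in 1:length(vs)) {
    # each predictor gets its own coefficient and its own exponent
    Zws <- Zws + par[u+1] * (dsg[[vs[u]]] ^ par[u+1+length(vs)])
  }
  Zws <- (Zws - dsg[[cr]])^2
  sum(Zws) # residual sum of squares: optim needs a single number
}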
#running optim for the simple regression
resultL <- optim(par = c(0,rep(0,length(vs))), fn = minL.RSS
  , method = "L-BFGS-B"
  , lower = c(0,-Inf,0,-Inf,-Inf,-Inf)
  , upper = c(Inf,rep(c(Inf),length(vs)))
  , control = list(maxit = 4000)
)
#running optim for the regression with exponent, using the parameter start values found with the model before - but they don't change (and they should)
resultE <- optim(par = c(resultL$par[1],resultL$par[2:(length(vs)+1)],rep(1,length(vs))), fn = minE.RSS
  , method = "L-BFGS-B"
  , lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
  , upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
  , control = list(maxit = 4000, pgtol = 1e-100)
)
#using initial parameter values I received from same formula with Excel solver Add-In - the result is getting better=smaller
resultX <- optim(par = c(0,31,0,3500,2860,-31,1,1,1,0.17,1), fn = minE.RSS
  , method = "L-BFGS-B"
  , lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
  , upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
  , control = list(maxit = 4000, pgtol = 1e-100)
)
resultX$value
[1] 8109259
resultL$value
[1] 8175660
resultE$value
[1] 8175660
I tried pgtol and factr with very small and very big values (1e100 / 1e-100), but resultE does not get better than resultL. And I know from Excel solver Add-In that there is a better solution (resultX).
How can I force optim to run more iterations and/or find a solution as good as Excel solver Add-In does?
It seems like factr, ndeps and maxit have been limiting in your case. You can get pretty close to resultX$value when you do:
resultE2 <- optim(par = c(resultL$par[1],resultL$par[2:(length(vs)+1)],rep(1,length(vs))), fn = minE.RSS
  , method = "L-BFGS-B"
  , lower = c(0,-Inf,0,-Inf,-Inf,-Inf,rep(c(0.1),length(vs)))
  , upper = c(Inf,rep(c(Inf),length(vs)),rep(c(1),length(vs)))
  , control = list(maxit = 1e4, pgtol = 0, ndeps = rep(1e-6, 11), factr = 0))
resultE2$value
[1] 8109250
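For context, the documented defaults for these controls are factr = 1e7 (a tolerance of about 1e-8 on the reduction of the objective), pgtol = 0 (the projected-gradient check is switched off, so making pgtol even smaller cannot help), ndeps = 1e-3 (the step used for the finite-difference gradient) and maxit = 100. The sketch below compares default and tightened controls on the Rosenbrock function, a standard benchmark chosen here only for illustration; it is not the model from the question, and the bounds and start values are assumptions.

```r
# Rosenbrock function: a standard test problem, NOT the model from the question
rosenbrock <- function(p) 100 * (p[2] - p[1]^2)^2 + (1 - p[1])^2

start <- c(-1.2, 1)

# Default controls: factr = 1e7, pgtol = 0, ndeps = 1e-3, maxit = 100
fit_default <- optim(start, rosenbrock, method = "L-BFGS-B",
                     lower = c(-2, -2), upper = c(2, 2))

# Tightened controls, mirroring the settings used for resultE2
fit_tight <- optim(start, rosenbrock, method = "L-BFGS-B",
                   lower = c(-2, -2), upper = c(2, 2),
                   control = list(maxit = 1e4, factr = 0, pgtol = 0,
                                  ndeps = rep(1e-6, 2)))

# Compare the objective value, the number of function evaluations,
# the convergence code and the stopping message
comparison <- data.frame(
  run         = c("default", "tight"),
  value       = c(fit_default$value, fit_tight$value),
  fn_calls    = c(fit_default$counts[["function"]], fit_tight$counts[["function"]]),
  convergence = c(fit_default$convergence, fit_tight$convergence),
  message     = c(fit_default$message, fit_tight$message)
)
print(comparison)
```

The message column is worth checking in the original problem too, since it states which stopping rule ended the run.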
I'm attempting to create a genetic algorithm (not picky about library, ga and genalg produce same errors) to identify potential columns for use in a linear regression model, by minimizing -adj. r^2. Using mtcars as a play-set, trying to regress on mpg.
I have the following fitness function:
library(GA)
mtcarsnompg <- mtcars[,2:ncol(mtcars)]
evalFunc <- function(string) {
  costfunc <- summary(lm(mtcars$mpg ~ ., data = mtcarsnompg[, which(string == 1)]))$adj.r.squared
  costfunc
}
ga("binary", fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
this causes:
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
Researching this error, I decided I could work around it this way:
evalFunc = function(string) {
  child <- mtcarsnompg[, which(string == 1)]
  costfunc <- summary(lm(as.formula(paste("mtcars$mpg ~", paste(child, collapse = "+"))), data = mtcars))$adj.r.squared
  costfunc
}
ga("binary", fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
but this results in:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
I know it should work, because I can evaluate the function by hand written either way, while not using ga:
solution <- c("1","1","1","0","1","0","1","1","1","0")
[1] -0.8172511
I also found in "A quick tour of GA" (https://cran.r-project.org/web/packages/GA/vignettes/GA.html) that using "string" in which(string == 1) is something the GA ought to be able to handle, so I have no idea what GA's issue with my function is.
Any thoughts on a way to write this to get ga or genalg to accept the function?
Turns out I didn't consider that a solution string of 0s (or indeed, a string of 0s with one 1) would cause the internal paste to read "mpg ~ " which is not a possible linear regression.
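A related detail: mtcarsnompg[, which(string == 1)] returns a plain vector rather than a data frame when exactly one column is selected, which explains the "no 'data' argument" error in the first attempt. One way to guard against both edge cases is sketched below. It is only an illustration: adj_r2_fitness is a hypothetical name, and scoring an empty selection with the intercept-only model (adjusted R^2 of 0) is an assumption about what makes a sensible penalty.

```r
predictors <- setdiff(names(mtcars), "mpg")

# Hypothetical fitness function: adjusted R^2 of the selected columns.
# The formula is built from column NAMES, so one selected column works too.
adj_r2_fitness <- function(string) {
  selected <- predictors[string == 1]
  # Assumption: an empty chromosome is scored with the intercept-only model
  if (length(selected) == 0) selected <- "1"
  fit <- lm(reformulate(selected, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
}

empty_score  <- adj_r2_fitness(rep(0, length(predictors)))            # no column selected
single_score <- adj_r2_fitness(c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0))       # only wt selected
print(c(empty = empty_score, single = single_score))
```

A function like this could then be passed as the fitness argument to ga, negated first if the chosen library minimises rather than maximises.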