Genetic Algorithm Optimization - r

I asked a question a few weeks back regarding how one would do optimization in R (Optimizing for Vector Using Optimize R). Now that I have a proper grip on basic optimization in R, I would like to start employing GAs to solve for solutions.
Given the objective function:
div.ratio <- function(weight, vol, cov.mat){
  weight <- weight / sum(weight)
  dr <- (t(weight) %*% vol) / (sqrt(t(weight) %*% cov.mat %*% weight))
  return(-dr)
}
I am using the genalg package for optimizing, specifically the rbga.bin function. But it seems one cannot pass in more than one parameter, i.e. vol and cov.mat cannot be supplied. Am I missing something, or am I understanding this incorrectly?
Edit:
In the genalg package, there is a function called rbga.bin which is the one I am using.
Here is the simple code from the previous question that can get you started:
rm(list=ls())
require(RCurl)
sit = getURLContent('https://github.com/systematicinvestor/SIT/raw/master/sit.gz', binary=TRUE, followlocation = TRUE, ssl.verifypeer = FALSE)
con = gzcon(rawConnection(sit, 'rb'))
source(con)
close(con)
load.packages('quantmod')
data <- new.env()
tickers<-spl("VTI,VGK,VWO,GLD,VNQ,TIP,TLT,AGG,LQD")
getSymbols(tickers, src = 'yahoo', from = '1980-01-01', env = data, auto.assign = T)
for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)
bt.prep(data, align='remove.na', dates='1990::2013')
prices<-data$prices[,-10]
ret<-na.omit(prices/mlag(prices) - 1)
vol<-apply(ret,2,sd)
cov.mat<-cov(ret)
out <- optim(par = rep(1 / length(vol), length(vol)), # initial guess
             fn = div.ratio,
             vol = vol,
             cov.mat = cov.mat,
             method = "L-BFGS-B",
             lower = 0,
             upper = 1)
opt.weights <- out$par / sum(out$par) # optimal weights
While the above optim call works just fine, I was wondering whether it is possible to reproduce this using a GA. Then, if I am searching over multiple objectives in the future, I will be able to do this faster than with optim. (I am not sure if it is faster, but this is the step to take to find out.)
GAmodel <- rbga.bin(size = 7,              # genes
                    popSize = 200,         # initial number of chromosomes
                    iters = 100,           # number of iterations
                    mutationChance = 0.01, # chance of mutation
                    evalFunc = div.ratio)  # objective function
Doing the above produces an error, as div.ratio needs extra parameters, so I am looking for some help in structuring my problem so that it will produce the optimal answer. I hope the above edit clarifies things.
Thanks

This is what you need:
GAmodel <- rbga(stringMin = rep(0, length(vol)), stringMax = rep(1, length(vol)),
                popSize = 200,
                iters = 100,
                mutationChance = 0.01,
                evalFunc = function(weight) div.ratio(weight, vol = vol, cov.mat = cov.mat))
(see first and last lines above).
The problems were:
1. The weight and vol vectors must match in length.
2. evalFunc is called with a single argument, so the other parameters of div.ratio are missing. As I understand it, you want to optimize over the weight vector only, keeping vol and cov.mat fixed; wrapping div.ratio in an anonymous function does exactly that.
3. If you want weight to be treated as a continuous variable, use rbga instead of rbga.bin.
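After the run you still need to pull the best chromosome out of the result and renormalize it, since div.ratio scales weight internally. A minimal sketch, assuming genalg's documented return structure (a final population matrix plus its evaluations vector):
best.idx   <- which.min(GAmodel$evaluations)  # rbga minimizes evalFunc
ga.weights <- GAmodel$population[best.idx, ]
ga.weights <- ga.weights / sum(ga.weights)    # normalize, as div.ratio does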

Related

Finding the precision, recall and the f1 in R

I want to run models in a loop and then store the performance metrics in a table. I do not want to use the confusionMatrix function in caret; I want to compute the precision, recall, and F1 myself and then store those in a table. Please assist; edits to the code are welcome.
My attempt is below.
library(MASS) #will load our biopsy data
library(caret)
data("biopsy")
biopsy$ID<-NULL
names(biopsy) <- c('clump thickness','uniformity cell size','uniformity cell shape',
                   'marginal adhesion','single epithelial cell size','bare nuclei',
                   'bland chromatin','normal nuclei','mitosis','class')
sum(is.na(biopsy))
biopsy<-na.omit(biopsy)
sum(is.na(biopsy))
head(biopsy,5)
set.seed(123)
inTraining <- createDataPartition(biopsy$class, p = .75, list = FALSE)
training <- biopsy[ inTraining,]
testing <- biopsy[-inTraining,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10,repeats = 5, verboseIter = F, classProbs = T)
# CHANGING THE CHARACTERS INTO FACTOR VARIABLES
training <- as.data.frame(unclass(training),
                          stringsAsFactors = TRUE)
# CHANGING THE CHARACTERS INTO FACTOR VARIABLES
testing <- as.data.frame(unclass(testing),
                         stringsAsFactors = TRUE)
models<-c("svmRadial","rf")
results_table <- data.frame(models = models, stringsAsFactors = F)
for (i in models){
  model_train <- train(class ~ ., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  predictions <- predict(model_train, newdata = testing)
  precision_ <- posPredValue(predictions, testing)
  recall_ <- sensitivity(predictions, testing)
  f1 <- (2 * precision_ * recall_) / (precision_ + recall_)
  # put that in the results table
  results_table[i, "Precision"] <- precision_
  results_table[i, "Recall"] <- recall_
  results_table[i, "F1score"] <- f1
}
However, I get an error which says Error in posPredValue.default(predictions, testing) : inputs must be factors. I do not know where I went wrong, and any edits to my code are welcome.
I know that I could get precision, recall, and F1 by just using the code below (B), but this is a tutorial question where I am required not to use the code example below (B):
(B)
for (i in models){
  model_train <- train(class ~ ., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  predictions <- predict(model_train, newdata = testing)
  print(confusionMatrix(predictions, testing$class, mode = "prec_recall"))
}
A few things need to happen.
1. You have to change the function calls for posPredValue and sensitivity: for both, change testing to testing$class (they need factors, not a data frame).
2. In results_table, i is a model name (a string), not a row index, so results_table[i, "Precision"] <- precision_ creates a new row whose row name is "rf" instead of filling the existing row.
Here is your for statement, with changes to those functions mentioned in 1) and a modification to address the issue in 2).
for (i in models){
  model_train <- train(class ~ ., data = training, method = i,
                       trControl = control, metric = "Accuracy")
  assign("fit", model_train) # keep a copy of the last fit
  predictions <- predict(model_train, newdata = testing)
  precision_ <- posPredValue(predictions, testing$class)
  recall_ <- sensitivity(predictions, testing$class)
  f1 <- (2 * precision_ * recall_) / (precision_ + recall_)
  # put that in the results table, in the row matching the current model
  results_table[results_table$models %in% i, "Precision"] <- precision_
  results_table[results_table$models %in% i, "Recall"] <- recall_
  results_table[results_table$models %in% i, "F1score"] <- f1
}
This is what it looks like for me.
results_table
# models Precision Recall F1score
# 1 svmRadial 0.9722222 0.9459459 0.9589041
# 2 rf 0.9732143 0.9819820 0.9775785
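As a cross-check (not part of the original answer), caret also ships precision(), recall(), and F_meas() helpers that compute the same quantities from two factors, so inside the loop you could verify each row with:
precision(predictions, testing$class)
recall(predictions, testing$class)
F_meas(predictions, testing$class)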

k-fold cross validation in quanteda

I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of the fixed held-out train/test split I would like to use k-fold cross-validation. Could you point me towards the best way to implement it in this workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
# confusionMatrix() comes from caret, which is already loaded
confusionMatrix(tab_class, mode = "everything", positive = "pos")
#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post comments yet. I think your problem is caused by calling crossval() from the wrong package. The link you shared refers to crossval() from the GitHub-only quanteda/quanteda.classifiers package, not the crossval package you loaded with require(crossval). The function you actually called has a different definition and requires additional X and Y arguments; their absence is the reason for your error.
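A minimal sketch of the intended usage, assuming the GitHub-only quanteda.classifiers package (it is not on CRAN, so the install line below uses the usual remotes route):
# remotes::install_github("quanteda/quanteda.classifiers")
require(quanteda.classifiers) # instead of require(crossval)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)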

How can I force r optim to run more iterations?

R's optim stops iterating earlier than I want. I use method = "L-BFGS-B" (as I need different bounds for different parameters). I know I can set the maximum number of iterations via maxit in the control list, but optim does not reach that maximum. I guessed that pgtol and/or factr in control should help, but apparently they do not.
I do the same optimisation with the Excel Solver add-in, and from that I know that R stops iterating too early.
Here is my sample data and code:
dsg <- as.data.frame(cbind(c(0:47)
,c(0.402469136,0.368944099,0.375477721,0.391121435,0.36741817,0.366685299,0.373907486,0.409429755,0.383399692,0.412436098,0.389864409,0.411901702,0.379012346,0.383269431,0.372778178,0.397308798,0.407005188,0.396770412,0.378525076,0.38084766,0.378051956,0.376836815,0.351144888,0.387655975,0.415815896,0.39851447,0.384345349,0.40061633,0.370402697,0.373590499,0.379474943,0.378865913,0.382395269,0.365808609,0.383106843,0.35946353,0.361037542,0.36077482,0.384418935,0.362583824,0.385405581,0.348344335,0.358934922,0.379079876,0.391434446,0.354347971,0.361197833,0.372232682)
,c(0.114814815,0.118012422,0.132153971,0.137563457,0.113412879,0.113819587,0.117105297,0.117003116,0.132768529,0.114580427,0.120072809,0.116621127,0.124691358,0.118103399,0.130523309,0.13783449,0.114587233,0.10441059,0.113704754,0.109561299,0.108298377,0.118025013,0.125106438,0.106440408,0.107985517,0.127293523,0.130639958,0.113993233,0.111258799,0.113139383,0.114220436,0.094720217,0.094661712,0.119814534,0.100816305,0.10081601,0.092889949,0.100408522,0.090772039,0.090377762,0.084900005,0.092355162,0.112520582,0.097859676,0.087209055,0.1041137,0.112856553,0.090746204)
,c(9.18031601,11.09227687,9.83844379,9.64580639,10.22514748,10.23337748,10.40043161,11.42924699,13.81486345,14.13952435,13.61129849,10.83903702,6.88640782,9.04216056,12.02954886,10.72787232,9.4425759,11.13168511,10.81846319,7.78656007,9.72518025,13.7847261,12.33280119,9.26193982,9.44348187,9.84196161,11.74926408,12.84258627,11.7028168,10.15912189,9.40823422,10.91680175,13.23648902,16.4693486,14.21047788,9.13496124,7.57774394,8.51722165,11.76416064,10.1919151,11.73247567,9.81560667,8.74626473,8.28651636,12.22919798,14.78829048,12.31028928,7.84778185)
,c(32.81570128,31.82592469,38.98876493,36.76658375,38.44461603,25.63108488,24.05370986,29.96483401,35.41164119,38.10191701,40.08138389,40.88474396,30.11146104,28.32714529,38.10802983,33.06030547,30.26582152,30.81661426,19.32980669,22.1124164,39.01648731,36.54290113,42.37598936,37.80545142,35.41146597,38.03598825,44.00978984,39.49187432,42.19555313,46.46831371,28.62873468,29.05176428,53.9939235,54.82043874,46.26856583,46.39431442,39.83112353,40.50502621,39.48027012,37.93228955,42.59635965,35.06031045,30.37208461,28.13106896,38.42397418,38.90616994,42.98276083,39.79207105)
,c(3470.0,3927.0,4996.8,3148.7,3882.4,4579.9,4191.0,4328.4,4059.6,3667.6,4074.3,4220.0,3698.0,3660.2,4717.6,3687.2,4259.2,4098.1,4297.1,4321.0,3761.5,3586.7,3952.5,3630.0,3803.1,3863.9,4457.1,4280.6,4083.1,4443.2,4534.8,4510.7,3839.3,4408.0,4404.3,3633.3,4153.2,4129.3,4648.0,4947.3,5136.5,4491.6,5758.0,5423.6,5177.8,5312.8,5241.7,4551.9)
))
vs <- names(dsg)[1:5]
cr <- names(dsg)[6]
attach(dsg)
# a linear regression
minL.RSS <- function(par) {
  Zws <- par[1]
  for (u in 1:length(vs)) {
    Zws <- Zws + par[u + 1] * (get(vs[u]) ^ 1)
  }
  Zws <- (Zws - get(cr))^2
  sum(Zws)
}
# same linear regression, adding an exponent
minE.RSS <- function(par) {
  Zws <- par[1]
  for (u in 1:length(vs)) {
    Zws <- Zws + par[u + 1] * (get(vs[u]) ^ par[u + 1 + length(vs)])
  }
  Zws <- (Zws - get(cr))^2
  sum(Zws)
}
# running optim for the simple regression
resultL <- optim(par = c(0, rep(0, length(vs))), fn = minL.RSS,
                 method = "L-BFGS-B",
                 lower = c(0, -Inf, 0, -Inf, -Inf, -Inf),
                 upper = c(Inf, rep(Inf, length(vs))),
                 control = list(maxit = 4000))
resultL
# running optim for the regression with exponents, using the parameter start
# values found with the model before - but they don't change (and they should)
resultE <- optim(par = c(resultL$par[1], resultL$par[2:(length(vs) + 1)], rep(1, length(vs))), fn = minE.RSS,
                 method = "L-BFGS-B",
                 lower = c(0, -Inf, 0, -Inf, -Inf, -Inf, rep(0.1, length(vs))),
                 upper = c(Inf, rep(Inf, length(vs)), rep(1, length(vs))),
                 control = list(maxit = 4000, pgtol = 1e-100))
resultE
# using initial parameter values I received from the same formula with the
# Excel Solver add-in - the result gets better (smaller)
resultX <- optim(par = c(0, 31, 0, 3500, 2860, -31, 1, 1, 1, 0.17, 1), fn = minE.RSS,
                 method = "L-BFGS-B",
                 lower = c(0, -Inf, 0, -Inf, -Inf, -Inf, rep(0.1, length(vs))),
                 upper = c(Inf, rep(Inf, length(vs)), rep(1, length(vs))),
                 control = list(maxit = 4000, pgtol = 1e-100))
resultX
detach(dsg)
resultX$value
[1] 8109259
resultL$value
[1] 8175660
resultE$value
[1] 8175660
I tried pgtol and factr with very small and very large values (1e-100 / 1e100), but resultE does not get better than resultL. And I know from the Excel Solver add-in that there is a better solution (resultX).
How can I force optim to run more iterations and/or find a solution as good as the Excel Solver add-in does?
It seems like factr, ndeps, and maxit were the limiting factors in your case. You can get pretty close to resultX$value when you do:
resultE2 <- optim(par = c(resultL$par[1], resultL$par[2:(length(vs) + 1)], rep(1, length(vs))), fn = minE.RSS,
                  method = "L-BFGS-B",
                  lower = c(0, -Inf, 0, -Inf, -Inf, -Inf, rep(0.1, length(vs))),
                  upper = c(Inf, rep(Inf, length(vs)), rep(1, length(vs))),
                  control = list(maxit = 1e4, pgtol = 0, ndeps = rep(1e-6, 11), factr = 0))
resultE2$value
[1] 8109250
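To confirm that the optimizer really ran longer instead of stopping on a tolerance, the standard fields of optim's return value are worth inspecting:
resultE2$convergence # 0 means successful convergence; 1 means maxit was reached
resultE2$counts      # number of function and gradient evaluations
resultE2$message     # termination message from L-BFGS-B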

R "Error in terms.formula" using GA/genalg library

I'm attempting to create a genetic algorithm (I'm not picky about the library; GA and genalg produce the same errors) to identify potential columns for use in a linear regression model, by minimizing the negative adjusted R^2. Using mtcars as a play set, I am trying to regress on mpg.
I have the following fitness function:
mtcarsnompg <- mtcars[, 2:ncol(mtcars)]
evalFunc <- function(string) {
  costfunc <- summary(lm(mtcars$mpg ~ ., data = mtcarsnompg[, which(string == 1)]))$adj.r.squared
  return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
this causes:
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
Researching this error, I decided I could work around it this way:
evalFunc <- function(string) {
  child <- mtcarsnompg[, which(string == 1)]
  costfunc <- summary(lm(as.formula(paste("mtcars$mpg ~", paste(child, collapse = "+"))), data = mtcars))$adj.r.squared
  return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
but this results in:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
I know it should work, because I can evaluate the function by hand (written either way) without using ga:
solution <- c("1","1","1","0","1","0","1","1","1","0")
evalFunc(solution)
[1] -0.8172511
I also found in "A quick tour of GA" (https://cran.r-project.org/web/packages/GA/vignettes/GA.html) that using "string" in which(string == 1) is something the GA ought to be able to handle, so I have no idea what GA's issue with my function is.
Any thoughts on a way to write this to get ga or genalg to accept the function?
Turns out I didn't consider that a solution string of all 0s (or indeed, a string of 0s with a single 1) would cause the internal paste to produce "mpg ~ ", which is not a valid linear regression formula.
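A sketch of one way to guard against that, written for a minimizer such as genalg's rbga.bin (a hypothetical rewrite, not the author's posted fix): build the formula from column names rather than column values, and penalize empty selections with a poor fitness value:
evalFunc <- function(string) {
  if (sum(string == 1) == 0) return(0) # empty model: a high value is bad for a minimizer
  child <- colnames(mtcarsnompg)[string == 1]
  fml <- as.formula(paste("mpg ~", paste(child, collapse = " + ")))
  -summary(lm(fml, data = mtcars))$adj.r.squared
}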

theta.sparse error with lorDIF

I was wondering whether anyone can help me out.
I am trying to run a DIF analysis on my data but keep getting a theta.sparse error, which I am unsure how to fix. I would really appreciate any help that you can give me.
library(lordif)
dat<- read.csv2("OPSO.csv",header=TRUE)
datgender <- read.csv2("DATA.csv",header=TRUE)
group<-datgender$Gender
sink("outputDIFopso.txt")
gender.difopso <- lordif(dat, group, selection = NULL,
                         criterion = c("Chisqr", "R2", "Beta"),
                         pseudo.R2 = c("McFadden", "Nagelkerke", "CoxSnell"), alpha = 0.01,
                         beta.change = 0.1, R2.change = 0.02, maxIter = 10, minCell = 5,
                         minTheta = -4, maxTheta = 4, inc = 0.1, control = list(), model = "GRM",
                         anchor = NULL, MonteCarlo = FALSE, nr = 100)
print(gender.difopso)
summary(gender.difopso)
sink()
pdf("graphtestop.pdf")
plot(gender.difopso)
dev.off()
dev.off()
Error in lordif(dat, group, selection = NULL, criterion = c("Chisqr", :
object 'theta.sparse' not found
Thank you :)
You should check the output printed just before the error. It will probably say that you have no items flagged for DIF. When that's the case, you should just run the mirt function and extract the theta and ipar objects as necessary.
The author could add some case handling for when compare(flags, flags.matrix) is TRUE. At the very least, a warning is omitted when no items show DIF, even though the package does warn in the opposite case:
if (ndif == ni) {
  warning("all items got flagged for DIF - stopping\n")
}
There is no such handling when ndif == 0, even though compare(flags, flag.matrix) evaluates to TRUE.
The implication when all or none of the items show DIF is that you would get the same results (the same ICC plots, the same inferences, etc.) by fitting mirt to the combined sample (no DIF) or by fitting separate mirt models for each group (all DIF). So it is a valid time-saving procedure to just bypass lordif when it breaks down this way.
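A minimal sketch of that fallback, assuming no items were flagged so a single combined-sample model is appropriate (the itemtype mirrors lordif's model = "GRM"; everything else here is an assumption about your data):
library(mirt)
mod <- mirt(dat, model = 1, itemtype = "graded") # graded response model, one factor
theta <- fscores(mod)                            # person (theta) estimates
ipar <- coef(mod, simplify = TRUE)$items         # item parameter matrix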
