Efficient nonlinear discrete multivariable optimization in R

I have a function which is extremely computationally heavy (takes about 10 minutes to run once), and this function has 4 parameters.
I define the function as follows
objective <- function(parameters) {
  # ...calculate value...
  return(value)
}
It is certainly nonlinear, and these 4 parameters must be integer values. I want to search over the box from c(3, 100, 20000, 80000) to c(10, 150, 80000, 120000).
I realize that with a function this computationally heavy and a range this large, the analysis will take a long time.
I am looking for an efficient way of doing this. rgenoud seems to handle integer values, but after starting it I have no way to determine whether it is actually running: it stops at the line "Minimization problem".
Are there other optimization algorithms (for either minimization or maximization) that are very efficient and give some feedback on whether they are running?
I am sorry I could not provide a reproducible example, as I cannot provide the data set (100 GB) and this is more of a theoretical question.
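Whatever optimizer ends up being used, one simple way to see that it is making progress is to log every call to the expensive function. A minimal sketch (the wrapper name and message format are my own, not part of any package):
logged_objective <- function(parameters) {
  t0 <- Sys.time()
  value <- objective(parameters)  # the expensive function defined above
  cat(sprintf("[%s] params = %s -> value = %.6f (%.1f min)\n",
              format(Sys.time()), paste(round(parameters), collapse = ", "),
              value, as.numeric(difftime(Sys.time(), t0, units = "mins"))))
  flush.console()
  value
}
Passing logged_objective instead of objective to the optimizer prints one line per evaluation, so it is immediately visible whether anything is happening.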
EDIT:
If this helps, some algorithms I have been looking at include genetic algorithms, particle swarm optimization, simulated annealing, and Markov chain Monte Carlo. I am open to other algorithms, as I am just looking for the most efficient way to optimize these discrete variables.
EDIT 2:
Simplified version of function:
control1 <- sample(1:75, 3947398, replace=TRUE)
control2 <- sample(1:75, 28793, replace=TRUE)
control3 <- sample(1:100, 392733, replace=TRUE)
control4 <- sample(1:75, 858383, replace=TRUE)
patient1 <- sample(1:100, 28048, replace=TRUE)
patient2 <- sample(1:50, 80400, replace=TRUE)
patient3 <- sample(1:100, 48239, replace=TRUE)
control <- list(control1, control2, control3, control4)
patient <- list(patient1, patient2, patient3)
objective <- function(parameter) {
  s <- parameter[1]
  # draw s values from every control and patient sample
  control_s <- list()
  patient_s <- list()
  for (i in 1:length(control))
    control_s[[i]] <- sample(control[[i]], s)
  for (i in 1:length(patient))
    patient_s[[i]] <- sample(patient[[i]], s)
  # frequency vector of length 100 for each subsample
  controlfreq <- list()
  for (i in 1:length(control_s)) {
    controlfreq[[i]] <-
      as.data.frame(prop.table(table(factor(
        control_s[[i]], levels = 1:100
      ))))[, 2]
  }
  patientfreq <- list()
  for (i in 1:length(patient_s)) {
    patientfreq[[i]] <-
      as.data.frame(prop.table(table(factor(
        patient_s[[i]], levels = 1:100
      ))))[, 2]
  }
  # assemble the training set
  controlfreq <- t(as.data.frame(controlfreq))
  controltrainingset <- transform(controlfreq, status = "control")
  patientfreq <- t(as.data.frame(patientfreq))
  patienttrainingset <- transform(patientfreq, status = "patient")
  dataset <- rbind(controltrainingset, patienttrainingset)
  # random forest with leave-one-out CV
  library(caret)
  fitControl <- trainControl(method = "LOOCV", classProbs = TRUE, savePredictions = TRUE)
  model <- train(status ~ ., data = dataset, method = "rf", trControl = fitControl)
  selectedIndices <- model$pred$mtry == 2
  confusionmatrix <- table(model$pred$obs[selectedIndices], model$pred$pred[selectedIndices])
  # balanced accuracy
  metric <- ((confusionmatrix[1, 1] / length(control)) + (confusionmatrix[2, 2] / length(patient))) / 2
  return(-metric)  # -metric if minimizing, metric if maximizing
}
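For what it's worth, a rough sketch of how rgenoud could be pointed at this objective, using the bounds from the original question (the simplified function above only looks at the first parameter). This is untested on the real data; the small pop.size and generation limits are only there because each evaluation takes ~10 minutes, and print.level = 2 makes genoud report progress as each generation finishes:
library(rgenoud)
domains <- cbind(c(3, 100, 20000, 80000),    # lower bounds
                 c(10, 150, 80000, 120000))  # upper bounds
result <- genoud(fn = objective,             # or logged_objective from above
                 nvars = 4,
                 max = FALSE,                 # objective returns -metric, so minimize
                 Domains = domains,
                 data.type.int = TRUE,        # restrict the search to integers
                 boundary.enforcement = 2,    # never evaluate outside the bounds
                 pop.size = 20,               # tiny population: each evaluation is ~10 min
                 max.generations = 10,
                 wait.generations = 3,
                 print.level = 2)             # verbose per-generation output
result$par    # best integer parameter vector found
result$value  # corresponding objective value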

Related

R: Multiclass Matrices

I am working with the R programming language. I am trying to learn how to make a "confusion matrix" for multiclass variables (e.g. How to construct the confusion matrix for a multi class variable).
Suppose I generate some data and fit a decision tree model :
#load libraries
library(rpart)
library(caret)
#generate data
a <- rnorm(1000, 10, 10)
b <- rnorm(1000, 10, 5)
d <- rnorm(1000, 5, 10)
group_1 <- sample( LETTERS[1:3], 1000, replace=TRUE, prob=c(0.33,0.33,0.34) )
e = data.frame(a,b,d, group_1)
e$group_1 = as.factor(e$group_1)
#split data into train and test set
trainIndex <- createDataPartition(e$group_1, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- e[trainIndex,]
test <- e[-trainIndex,]
fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated once
                           repeats = 1)
#fit decision tree model
TreeFit <- train(group_1 ~ ., data = training,
                 method = "rpart2",
                 trControl = fitControl)
From here, I am able to store the results into a "confusion matrix":
pred <- predict(TreeFit,test)
table_example <- table(pred,test$group_1)
This satisfies my requirements, but this "table" requires me to manually calculate the different accuracy metrics for "A", "B" and "C" (as well as the total accuracy).
My question: Is it possible to use the caret::confusionMatrix() command for this problem?
e.g.
pred <- predict(TreeFit, test, type = "prob")
labels_example <- as.factor(ifelse(pred[,2]>0.5, "1", "0"))
con <- confusionMatrix(labels_example, test$group_1)
This way, I would be able to directly access the accuracy measurements from the confusion matrix. E.g. metric = con$overall[1]
Thanks
Is this what you're looking for?
pred <- predict(TreeFit, test)
con <- confusionMatrix(test$group_1, pred)
con
con$overall[1]
Same output as in:
table(test$group_1, pred)
Plus accuracy metrics.
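If you also want the per-class metrics without computing them by hand, they are stored in the confusionMatrix object (assuming con is the multiclass object created above):
con$byClass                          # one row per class: sensitivity, specificity, F1, ...
con$byClass[, "Balanced Accuracy"]   # per-class balanced accuracy
con$overall["Accuracy"]              # overall accuracy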

Why is the error rate from bagging trees much higher than that from a single tree?

I cross-posted this question on Cross Validated, but it seemed unlikely to receive an answer there, so I am posting it here.
I'm running the bagging tree (bootstrap aggregation) classification method and comparing its misclassification error rate with that of a single tree. We expect the result from bagging to be better than that from a single tree, i.e. the bagging error rate should be lower.
I repeat the whole procedure M = 100 times (each time randomly splitting the original data set into a training set and a test set) to obtain 100 single-tree test errors and 100 bagging test errors (using a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices) {
  current.train <- a.sample[vector.of.indices, ]
  current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
  predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M) {
  # Split the data into test/train sets
  train.idx <- sample(1:n, ntrain, replace = FALSE)
  train.set <- PimaIndiansDiabetes[train.idx, ]
  test.set <- PimaIndiansDiabetes[-train.idx, ]
  # Train a direct tree model
  fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
  pred.test <- predict(fitted.tree, test.set, type = "class")
  single.tree.error[j] <- mean(pred.test != test.set$diabetes)
  # Bootstrap estimates
  res.boot <- boot(train.set, estim.pred, R = B)
  pred.boot <- vector(length = ntest)
  for (i in 1:ntest) {
    pred.boot[i] <- ifelse(mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
  }
  bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The resulting boxplot (not reproduced here) shows that the bagging error rate is much higher than the single-tree error rate.
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not find anything unusual.
I've received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I am posting the link here to close this question and for future visitors.
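This is not the linked answer, but one detail in the loop above may be worth checking: estim.pred() returns a factor, and boot() stores the replicates in a numeric matrix, so res.boot$t likely holds the integer codes 1 and 2 rather than the strings "neg" and "pos", in which case res.boot$t[, i] == "pos" is never TRUE. A quick diagnostic sketch:
str(res.boot$t)          # likely numeric 1/2, not "neg"/"pos"
table(res.boot$t[, 1])   # shows which values were actually stored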

Reinforcement learning - Find optimal number of values to randomly sample to optimize random forest classification

I have data set up like the following:
control1 <- sample(1:75, 3947398, replace=TRUE)
control2 <- sample(1:75, 28793, replace=TRUE)
control3 <- sample(1:100, 392733, replace=TRUE)
control4 <- sample(1:75, 858383, replace=TRUE)
patient1 <- sample(1:100, 28048, replace=TRUE)
patient2 <- sample(1:50, 80400, replace=TRUE)
patient3 <- sample(1:100, 48239, replace=TRUE)
control <- list(control1, control2, control3, control4)
patient <- list(patient1, patient2, patient3)
To classify these samples as either control or patient, I want to make frequency distributions of the presence of each of the 100 variables being considered. To do this, I randomly sample "s" values from each sample and generate a frequency vector of length 100. This is how I would do it:
control_s <- list()
patient_s <- list()
for (i in 1:length(control))
  control_s[[i]] <- sample(control[[i]], s)
for (i in 1:length(patient))
  patient_s[[i]] <- sample(patient[[i]], s)
Once I do this, I generate the frequency vector of length 100 as follows:
controlfreq <- list()
for (i in 1:length(control_s)) {
  controlfreq[[i]] <-
    as.data.frame(prop.table(table(factor(
      control_s[[i]], levels = 1:100
    ))))[, 2]
}
patientfreq <- list()
for (i in 1:length(patient_s)) {
  patientfreq[[i]] <-
    as.data.frame(prop.table(table(factor(
      patient_s[[i]], levels = 1:100
    ))))[, 2]
}
controlfreq <- t(as.data.frame(controlfreq))
controltrainingset <- transform(controlfreq, status = "control")
patientfreq <- t(as.data.frame(patientfreq))
patienttrainingset <- transform(patientfreq, status = "patient")
dataset <- rbind(controltrainingset, patienttrainingset)
This is the final data frame being used in the classification algorithm. My goal in this post is to figure out how to identify the optimal "s" value so that the highest balanced accuracy is achieved. I am using "rf" from the caret package to do classification.
library(caret)
fitControl <-trainControl(method = "LOOCV", classProbs = T, savePredictions = T)
model <- train(status ~ ., data = dataset, method = "rf", trControl = fitControl)
How can I automate this so that "s" starts at 5000, is then changed to another value, and, based on the change in accuracy, keeps being adjusted towards the best possible "s" value?
Thanks!
EDIT: I am defining accuracy as follows:
selectedIndices <- model$pred$mtry == 2
confusionmatrix <- table(model$pred$obs[selectedIndices], model$pred$pred[selectedIndices])
BalancedACC = ((confusionmatrix[1,1]/length(control))+(confusionmatrix[2,2]/length(patient)))/2
This is the value I want to maximize by changing "s".
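A minimal sketch of one way to automate this, assuming the whole pipeline above is wrapped into a hypothetical function balanced_acc(s) that rebuilds dataset for a given s, refits the model, and returns BalancedACC:
# Coarse grid first; s cannot exceed the smallest sample (28,793 values in control2)
candidate_s <- seq(5000, 25000, by = 5000)
acc <- sapply(candidate_s, balanced_acc)   # one expensive model fit per candidate
best <- candidate_s[which.max(acc)]
# Refine around the best coarse value
fine_s <- seq(max(best - 4000, 1000), min(best + 4000, 28793), by = 1000)
fine_acc <- sapply(fine_s, balanced_acc)
fine_s[which.max(fine_acc)]
Because BalancedACC is itself random (it depends on the sampled subsets), it may be worth averaging a few replicates per "s" before comparing values.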

simulation study using AIC

I have to code a simulation study in R. I have explanatory variables X1,...,X15 ~ N(0,1) and Y ~ N(2 + 2*X1 + 0.8*X2 - 1.2*X15, 1), and I need to simulate n=100 values and repeat that for iter=100 times. Then, for each linear model created, I have to calculate the AIC values and, finally, find the best model. The problem is that I can't figure out how to do that for iter=100 times. I wrote the code for 1 simulation, which is the following:
set.seed(123)
n <- 100
p <- 15
iter <- 100
X <- matrix(rep(NA, n*p), ncol = p)
for (j in 1:p) {
  X[, j] <- rnorm(n = 100, mean = 0, sd = 1)
}
mu <- (2 + 2*X[, 1]) + (0.8*X[, 2]) - (1.2*X[, 15])
Y <- rnorm(n = 100, mean = mu, sd = 1)
sim <- data.frame(Y, X)
d <- lm(Y ~ X, data = sim)
But how do I do the rest, i.e. the 100 simulations and the AIC calculations? I'm really new to R, so I am quite confused.
How about this
nsim <- 100
nobs <- 100
nvar <- 15
results <- lapply(1:nsim, function(i) {
  X <- matrix(rnorm(nobs * nvar), nrow = nobs)
  y <- rnorm(nobs, mean = 2 + X[, c(1, 2, 15)] %*% c(2, .8, -1.2))
  DF <- data.frame(y, X)
  lm(y ~ X, data = DF)
})
That should give you your simulations. Now find the "best"
findbest <- which.min(sapply(results, function(i) { AIC(i) }))
results[[findbest]]
Since all data sets are simulated from the same underlying data-generating process, any variation in AIC is essentially random variation.
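To see how much the AIC actually varies across the 100 fits, you can collect the values, for example:
aics <- sapply(results, AIC)   # one AIC per simulated data set
summary(aics)
hist(aics, main = "AIC across simulations", xlab = "AIC")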

R caret: leave subject out cross validation with data subset for training?

I want to perform leave-subject-out cross validation with R caret (cf. this example) but only use a subset of the data in training for creating CV models. Still, the left-out CV partition should be used as a whole, as I need to test on all data of a left-out subject (even if it is millions of samples that cannot be used in training due to computational restrictions).
I've created a minimal 2 class classification example using the subset and index parameters of caret::train and caret::trainControl to achieve this. From my observation this should solve the problem, but I have a hard time actually ensuring that the evaluation is still done in a leave-subject-out way. Maybe someone with experience in this task could shed some light on this:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV indexes for trainControl
index <- llply(unique(d[,3]), function(cls) c(which(d[,3]!=cls)))
names(index) <- paste0('sub_', unique(d[,3]))
str(index) # indexes used for training models with CV = OK
m3 <- train(x = d[,indexes_data],
            y = d[,indexes_labels],
            method = 'glm',
            metric = 'ROC',
            # does this subset the data used for training and obtaining models,
            # but not the left out partition used for estimating CV performance?
            subset = sample(nrow(d), 5000),
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = T,
                                     classProbs = T,
                                     summaryFunction = twoClassSummary,
                                     index = index))
str(m3$resample) # all samples used once = OK
# performance over all subjects
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# performance for individual subjects
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
  pred_sub <- m3$pred[m3$pred$Resample == cls,]
  myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
  plot(myRoc, main = cls)
})
Thanks for your time!
Using both the index and indexOut parameters in caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
str(diamonds)
# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[,indexes_populationIndependence_subjects]), function(cls) c(which(d[,indexes_populationIndependence_subjects]!=cls)))
names(index) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))
m3 <- train(x = d[,indexes_data],
            y = d[,indexes_labels],
            method = 'glm',
            metric = 'ROC',
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = T,
                                     classProbs = T,
                                     summaryFunction = twoClassSummary,
                                     index = index,
                                     indexOut = indexOut))
m3$resample # seems OK
str(m3$pred) # seems OK
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
pred_sub <- m3$pred[m3$pred$Resample==cls,]
myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
plot(myRoc, main = cls)
} )
Still, I'm not absolutely sure whether this actually does the estimation in a population-independent way, so if anybody has knowledge about the details, please share your thoughts!
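One way to reassure yourself (a sketch; it only checks the partitioning, not caret's internals) is to verify that every held-out partition, and every set of held-out predictions, comes from a single color:
# Each held-out partition should contain exactly one color, matching its name
for (nm in names(indexOut)) {
  cat(nm, "->", as.character(unique(d$color[indexOut[[nm]]])), "\n")
}
# Cross-tabulate held-out predictions by subject: each Resample column should
# have non-zero counts for exactly one color
table(d$color[m3$pred$rowIndex], m3$pred$Resample)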
