I found R code for hyperparameter tuning with a genetic algorithm (GA), but it does not show the result I expected, namely the prediction accuracy. I have included the output it produces at the end of the question, but I expected RMSE values like 0.44, 0.23, 0.1, etc.
The code is as follows:
library(caret)
library(GA)
library(farff)   # readARFF() comes from the farff package

d <- readARFF("soft.arff")
index <- createDataPartition(d$Effort, p = .70, list = FALSE)
tr <- d[index, ]
ts <- d[-index, ]

# Fitness function: fit an SVM for a single (C, sigma) pair and return the
# negated cross-validated RMSE, since ga() maximizes its fitness function
svm_fit <- function(x) {
  mod <- train(Effort ~ ., data = tr,   # predict Effort, the column used for partitioning
               method = "svmRadial",
               preProc = c("center", "scale"),
               trControl = trainControl(method = "cv"),
               tuneGrid = data.frame(C = 2^x[1], sigma = exp(x[2])))
  -getTrainPerf(mod)[, "TrainRMSE"]
}
svm_ga_obj <- GA::ga(type = "real-valued",
                     fitness = svm_fit,
                     min = c(-5, -5),   # search log2(C) in [-5, 5] and log(sigma) in [-5, 0]
                     max = c(5, 0),     # (recent GA releases name these arguments lower/upper)
                     popSize = 50,
                     maxiter = 2,
                     seed = 16478,
                     keepBest = TRUE,
                     monitor = NULL,
                     elitism = 2)
summary(svm_ga_obj)
The code runs without error, but instead of an RMSE value, summary(svm_ga_obj) shows the following output:
GA settings:
Type = real-valued
Population size = 50
Number of generations = 2
Elitism = 2
Crossover probability = 0.8
Mutation probability = 0.1
Search domain =
         x1  x2
lower    -5  -5
upper     5   0
GA results:
Iterations = 2
Fitness function value = -6309.072
Solution =
          x1        x2
[1,] 4.80478 -4.202595
What is the problem, and how can I get the RMSE value?
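Note that svm_fit returns the negated cross-validated RMSE, so the GA maximizes -RMSE; the reported fitness function value of -6309.072 therefore corresponds to a training RMSE of about 6309 on the scale of the response. A minimal sketch of recovering the RMSE and the tuned parameters from the fitted object, using the GA package's standard S4 slots:
# The GA maximizes -RMSE, so negate the fitness value to recover RMSE
rmse_at_optimum <- -svm_ga_obj@fitnessValue

# Back-transform the solution to the original parameter scale
best_C     <- 2^svm_ga_obj@solution[1, 1]
best_sigma <- exp(svm_ga_obj@solution[1, 2])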
I am implementing a Support Vector Machine with a Radial Basis Function kernel ('svmRadial') with caret. As far as I understand the documentation and the source code, caret uses an analytical formula to get a reasonable estimate of sigma and fixes it to that value (according to the output: Tuning parameter 'sigma' was held constant at a value of 0.1028894). In addition, caret cross-validates over a set of cost parameters C (by default, 3 values).
However, if I now want to set my own grid of cost parameters (tuneGrid), I have to additionally specify a value of sigma. Otherwise the following error appears:
Error: The tuning parameter grid should have columns sigma, C
How can I fix sigma based on the analytical formula and still use my own grid of cost parameters C?
Here is an MWE:
library(caret)
library(mlbench)
data(BostonHousing)
set.seed(1)
index <- sample(nrow(BostonHousing), nrow(BostonHousing) * 0.75)
Boston.train <- BostonHousing[index,]
Boston.test <- BostonHousing[-index,]
# without tuneGrid
set.seed(1)
svmR <- train(medv ~ .,
              data = Boston.train,
              method = "svmRadial",
              preProcess = c("center", "scale"),
              trControl = trainControl(method = "cv", number = 5))

# with tuneGrid (gives the error message)
set.seed(1)
svmR <- train(medv ~ .,
              data = Boston.train,
              method = "svmRadial",
              preProcess = c("center", "scale"),
              tuneGrid = expand.grid(C = c(0.01, 0.1)),
              trControl = trainControl(method = "cv", number = 5))
If you look at the model information, you can see how the grid is generated when you don't provide one:
getModelInfo("svmRadial")$svmRadial$grid
function(x, y, len = NULL, search = "grid") {
  sigmas <- kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)
  if (search == "grid") {
    out <- expand.grid(sigma = mean(as.vector(sigmas[-2])),
                       C = 2^((1:len) - 3))
  } else {
    rng <- extendrange(log(sigmas), f = .75)
    out <- data.frame(sigma = exp(runif(len, min = rng[1], max = rng[2])),
                      C = 2^runif(len, min = -5, max = 10))
  }
  out
}
So the way to get it is to estimate sigma with kernlab::sigest. First we pull out the model info for svmRadial:
models <- getModelInfo("svmRadial", regex = FALSE)[[1]]
Set up the input x and y since you are providing a formula:
preProcValues = preProcess(Boston.train, method = c("center", "scale"))
processData = predict(preProcValues,Boston.train)
x = model.matrix(medv ~ .,data=processData)[,-1]
y = processData$medv
Then we call the grid function for this model, which you can see produces the same sigma as in your output:
set.seed(1)
models$grid(x,y,3)
sigma C
1 0.1028894 0.25
2 0.1028894 0.50
3 0.1028894 1.00
And it matches the sigma in the results of the first model (fitted without tuneGrid):
svmR$results
sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.1028894 0.25 5.112750 0.7591398 2.982241 0.8569208 0.05387213 0.4032354
2 0.1028894 0.50 4.498887 0.8046234 2.594059 0.7823051 0.05357678 0.3644430
3 0.1028894 1.00 4.055564 0.8349416 2.402248 0.8403222 0.06825159 0.3732571
And this is what happens underneath:
set.seed(1)
sigmas = kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)
# from the code, you can see it takes the mean of the two extreme quantiles
mean(sigmas[-2])
[1] 0.1028894
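Putting this together, one way to answer the original question is to compute sigma once with kernlab::sigest and plug it into a custom tuneGrid. A minimal sketch along the lines of the code above, reusing the same x matrix and Boston.train split:
set.seed(1)
sigma_est <- mean(kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)[-2])
svmR <- train(medv ~ .,
              data = Boston.train,
              method = "svmRadial",
              preProcess = c("center", "scale"),
              tuneGrid = expand.grid(sigma = sigma_est, C = c(0.01, 0.1)),
              trControl = trainControl(method = "cv", number = 5))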
I would like to ask for help, please. I use this code to run an XGBoost model with the caret package. However, I want to use a validation split based on time: 60% training, 20% validation, 20% testing. I have already split the data, but I do not know how to handle the validation data if it is not used for cross-validation.
Thank you,
xgb_trainControl <- trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)

xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1)

set.seed(123)
xgb1 <- train(sale ~ ., data = trans_train,
              trControl = xgb_trainControl,
              tuneGrid = xgb_grid,
              method = "xgbTree")
xgb1
pred <- predict(xgb1, trans_test)
The validation partition should not be used when you are creating the model. It should be 'set aside' until the model has been trained and tuned using the 'training' and 'tuning' partitions; then you can apply the model to predict the outcome of the validation dataset and summarise how accurate the predictions were.
For example, in my own work I create three partitions: training (75%), tuning (10%) and testing/validation (15%) using
# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75,
                                  list = FALSE,
                                  times = 1)

# Split the dataset using the defined partition
train_data <- data[trainIndex, , drop = FALSE]
tune_plus_val_data <- data[-trainIndex, , drop = FALSE]

# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)

# Split the remaining ~25% of the data: 40% (tune) and 60% (val)
tune_data <- tune_plus_val_data[-tune_plus_val_index, , drop = FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, , drop = FALSE]

# Outcome of this section is that the data (100%) is split into:
#   training   (~75%)
#   tuning     (~10%)
#   validation (~15%)
These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the 'training' partition to train models and the 'tuning' partition to tune hyperparameters (e.g. random grid search) and evaluate model training (e.g. cross validation). This is ~equivalent to the code in your question.
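A minimal sketch of that conversion, assuming the response column is named response and is numeric (the helper name as_dmatrix is illustrative, not part of the original code):
library(xgboost)

# Hypothetical helper: turn a data.frame into an xgb.DMatrix,
# using every column except the response as a predictor
as_dmatrix <- function(df, response = "response") {
  xgb.DMatrix(data = as.matrix(df[, setdiff(names(df), response)]),
              label = df[[response]])
}

dtrain <- as_dmatrix(train_data)
dtune  <- as_dmatrix(tune_data)
dval   <- as_dmatrix(val_data)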
# lrn and mytune come from an earlier mlr tuning step (not shown here)
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)
params2 <- list(booster = "gbtree",
                objective = lrn_tune$par.vals$objective,
                eta = lrn_tune$par.vals$eta, gamma = 0,
                max_depth = lrn_tune$par.vals$max_depth,
                min_child_weight = lrn_tune$par.vals$min_child_weight,
                subsample = 0.8,
                colsample_bytree = lrn_tune$par.vals$colsample_bytree)
xgb2 <- xgb.train(params = params2,
                  data = dtrain, nrounds = 50,
                  watchlist = list(val = dtune, train = dtrain),
                  print_every_n = 10, early_stopping_rounds = 50,
                  maximize = FALSE, eval_metric = "error")
Once the model is trained I apply the model to the validation data with predict():
xgbpred2_keep <- predict(xgb2, dval)
xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val_data),
                      "Response" = val_data$response)
# Reorder Patients according to Response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])

library(ggplot2)
ggplot(xg2_val, aes(x = Patient, y = Prediction, fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title = paste("Patient predictions (xgb2) for the validation dataset (n = ",
                     length(rownames(val_data)), ")", sep = ""),
       subtitle = "Above 0.5 = Non-Responder, Below 0.5 = Responder",
       caption = paste("JM", Sys.Date(), sep = " "),
       x = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                   hjust = 1, size = 8)) +
  # Distance from red line = confidence of prediction
  geom_hline(yintercept = 0.5, colour = "red")
# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5, 1, 0)

# Results matrix (i.e. true positives/negatives & false positives/negatives);
# labels_tv holds the true binary labels for the validation set (defined elsewhere)
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))

# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val_data),
                                 label = labels_tv,
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results
This provides you with a summary of how well the model 'works' on your validation data.
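For completeness, caret itself can also evaluate on a single, fixed validation split rather than cross-validation, via the index/indexOut arguments of trainControl. A minimal sketch, assuming trans_train from the question holds the rows in time order and that train_idx/val_idx are the row indices of the 60% training and 20% validation portions (these index vectors and object names are illustrative):
# One "resample" whose held-out rows are exactly the validation block
fixed_split_control <- trainControl(
  method = "cv",
  index = list(TimeSplit = train_idx),    # rows used for fitting
  indexOut = list(TimeSplit = val_idx),   # rows used for evaluation
  returnData = FALSE
)

set.seed(123)
xgb_fixed <- train(sale ~ ., data = trans_train,
                   trControl = fixed_split_control,
                   tuneGrid = xgb_grid,
                   method = "xgbTree")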
I want to construct a histogram and CI for the RMSE of XGBoost regression in H2O. The code below is reproducible and works fine for a small number of iterations, up to about 100. However, I want to iterate at least 1000 times and the dataset is large. With a larger number of iterations, even 500, I get the following error message. Can I apply lapply to make it faster, or is there another way to solve this problem?
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : Unexpected CURL error: Empty reply from server
library(h2o)
h2o.init()

boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv"
response <- "medv"
# convert the chas column to a factor (1 if tract bounds river; 0 otherwise)
boston["chas"] <- as.factor(boston["chas"])
rmse <- c()
for (i in 1:50) {
  boston.splits <- h2o.splitFrame(data = boston, ratios = c(0.6, 0.2))
  train <- boston.splits[[1]]
  valid <- boston.splits[[2]]
  test  <- boston.splits[[3]]
  my_xgb1 <- h2o.xgboost(x = predictors,
                         y = response,
                         nfolds = 0,
                         training_frame = train,
                         validation_frame = valid,
                         stopping_tolerance = 0.005,
                         max_runtime_secs = 3600,
                         seed = 1601,
                         distribution = "AUTO",
                         tweedie_power = 1.5,
                         categorical_encoding = "AUTO",
                         quiet_mode = TRUE,
                         ntrees = 100,
                         max_depth = 6,
                         min_rows = 0.01,
                         min_child_weight = 1,
                         learn_rate = 0.05,
                         grow_policy = "depthwise",
                         booster = "gbtree",
                         reg_lambda = 0.1,
                         reg_alpha = 1)
  pred <- h2o.predict(my_xgb1, test)
  perf <- h2o.performance(my_xgb1, newdata = test)
  rmse[i] <- h2o.rmse(perf)
  print(rmse)
}
mean(rmse)
ci <- quantile(rmse, c(.05, .95))
windows()
# Plot histogram of scores.
hist(rmse, density=35, main = "Test RMSE over 50 Samples", xlab = "Value of Obtained RMSE", col="blue", border="black")
abline(v=mean(rmse), lwd=3, col="red")
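On the lapply question: lapply by itself will not reduce the work H2O does per iteration, but the loop body can be wrapped in a function, and removing per-iteration objects from the cluster may ease memory pressure. A sketch under those assumptions (run_once is an illustrative name; h2o.rm() is H2O's standard call for deleting cluster objects, and whether this resolves the CURL error is not verified here):
run_once <- function(i) {
  splits <- h2o.splitFrame(data = boston, ratios = c(0.6, 0.2))
  model <- h2o.xgboost(x = predictors, y = response,
                       training_frame = splits[[1]],
                       validation_frame = splits[[2]],
                       ntrees = 100, max_depth = 6,
                       learn_rate = 0.05, seed = 1601)
  out <- h2o.rmse(h2o.performance(model, newdata = splits[[3]]))
  # free cluster-side objects before the next iteration
  h2o.rm(model)
  for (s in splits) h2o.rm(s)
  out
}
rmse <- vapply(1:50, run_once, numeric(1))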
I am trying to train a neural network using the train function with neuralnet as my method parameter to predict the times table. I am scaling my training data set as well.
Even though I've tried different learning rates, stepmax values, and thresholds for my neuralnet, each time I train the network with the train function one of the k folds fails, saying:
1: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
2: predictions failed for Fold05.Rep1: layer1=8, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
I am guessing that this is because the initial weights are random, so sometimes I happen to get weights that will not converge.
Is there any way of preventing this? Maybe re-training the particular fold which has failed with different weights?
Here is my code:
library(caret)
library(neuralnet)

# Create the dataset: all multiplier/multiplicand pairs from 1 to 10
tt <- data.frame(multiplier = rep(1:10, times = 10),
                 multiplicand = rep(1:10, each = 10))
tt <- cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))

# Splitting
indexes <- createDataPartition(tt$product,
                               times = 1,
                               p = 0.7,
                               list = FALSE)
tt.train <- tt[indexes, ]
tt.test <- tt[-indexes, ]

# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes, ]
tt.preProcessed.test <- tt.preProcessed[-indexes, ]

# Train
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)
tune.grid <- expand.grid(layer1 = 8,
                         layer2 = 0,
                         layer3 = 0)
tt.cv <- train(product ~ .,
               data = tt.preProcessed.train,
               method = 'neuralnet',
               tuneGrid = tune.grid,
               trControl = train.control,
               linear.output = TRUE,
               algorithm = 'backprop',
               learningrate = 0.01,
               stepmax = 500000,
               lifesign = 'minimal',
               threshold = 0.01)
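One thing that can at least make the failures reproducible (an assumption worth stating: this controls the RNG used per resample, not convergence itself) is trainControl's seeds argument, which fixes the random state for each fold so a failing fold can be reproduced and investigated. A sketch for the repeated-CV setup above, with a single row in the tuning grid:
# 10 folds x 3 repeats = 30 resamples; seeds needs one integer vector per
# resample (length = number of tuning combinations, here 1) plus a final
# single integer for the last model fit.
n_resamples <- 10 * 3
set.seed(1)
seeds <- c(lapply(seq_len(n_resamples), function(i) sample.int(10000, 1)),
           sample.int(10000, 1))
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              seeds = seeds)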
I am using Bayesian optimization to tune the parameters of an SVM for a regression problem. In the following code, what should the value of init_grid_dt = initial_grid be? I have the upper and lower bounds of the sigma and C parameters of the SVM, but I don't know what the initial grid should be.
In one of the examples on the web, they used random-search results as the input to the initial grid. The code is as follows:
ctrl <- trainControl(method = "repeatedcv", repeats = 5)

svm_fit_bayes <- function(logC, logSigma) {
  ## Use the same model code but for a single (C, sigma) pair.
  txt <- capture.output(
    mod <- train(y ~ ., data = train_dat,
                 method = "svmRadial",
                 preProc = c("center", "scale"),
                 metric = "RMSE",
                 trControl = ctrl,
                 tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
  )
  list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
}

lower_bounds <- c(logC = -5, logSigma = -9)
upper_bounds <- c(logC = 20, logSigma = -0.75)
bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
               logSigma = c(lower_bounds[2], upper_bounds[2]))

## Create a grid of values as the input into the BO code
## (rand_search is a caret model tuned by random search; see the sketch below)
initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
initial_grid$C <- log(initial_grid$C)
initial_grid$sigma <- log(initial_grid$sigma)
initial_grid$RMSE <- -initial_grid$RMSE
names(initial_grid) <- c("logC", "logSigma", "Value")
library(rBayesianOptimization)
ba_search <- BayesianOptimization(svm_fit_bayes,
                                  bounds = bounds,
                                  init_grid_dt = initial_grid,
                                  init_points = 0,
                                  n_iter = 30,
                                  acq = "ucb",
                                  kappa = 1,
                                  eps = 0.0,
                                  verbose = TRUE)
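The snippet references rand_search without defining it; presumably it is a caret model tuned with random search over (C, sigma), whose resampled results then seed the optimizer. A minimal sketch of how such an object could be produced, assuming the same train_dat as above (the tuneLength of 20 and the object names are illustrative):
set.seed(42)
rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5, search = "random")
rand_search <- train(y ~ ., data = train_dat,
                     method = "svmRadial",
                     metric = "RMSE",
                     preProc = c("center", "scale"),
                     tuneLength = 20,   # 20 random (C, sigma) candidates
                     trControl = rand_ctrl)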