liquidSVM parameter selection clarification in R

This is my first question and I'm only a basic "programmer", so I'm sorry if I do not make myself clear enough.
I'm currently using liquidSVM 1.2.1 on R 3.5.0 and, despite its great potential, I do not understand some of its technicalities, as the help is not explanatory enough for me and I cannot find anything on the internet.
More specifically, I'd like to understand how the parameter selection works.
The final liquidSVM model does contain information on gammas and lambdas, but I cannot tell whether all of these parameters are being used in different cells or whether a single final pair has been chosen for the final model.
This leads to two sub-questions:
If all the values are used, how can I disable grid_choice and select only one value for each parameter?
If the algorithm selects a final pair of values, how can I find out which one it is?
This is the setting I've been using so far:
model = liquidSVM::svm(formula, TRAIN, threads = 3, predict.prob = T, random_seed = 123,
                       folds = 5, scale = F, d = 1, partition_choice = 5, grid_choice = -1)
I tried different things, for example:
setting gamma = 0.01 and lambda = 0.1;
setting max_gamma = 0.01 and min_gamma = 0.01;
setting grid_choice = NULL or grid_choice = list(gamma = 0.01, lambda = 0.01);
but it still does a grid selection on its own.
If only I could understand how to disable this grid search and provide my chosen parameters, I'd code a grid search myself (and thus know what the code is doing).
Thank you in advance.

The question is somewhat older now; however, in case someone is still looking for a solution:
You can define the grid that is searched for the best matching values by using the arguments gammas and lambdas. In this case you set each of them to a single value.
For example:
model <- svm(x1 ~ ., train, display = 1, folds = 5, mc_type = "OvA_ls",
             gammas = 0.01,
             lambdas = 0.1)
would set gamma to only 0.01 and lambda to 0.1.
However, this is not a grid search anymore, and you should expect to get two hands full of warning messages. If you provide a vector of gammas and a vector of lambdas, it will search that grid rather than the default one; a sketch follows below. Hence the arguments can be handy if, for example, you want to compare liquidSVM with other packages.
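For illustration, a minimal sketch of such a custom grid (the specific values are made up for the example, not taken from the original answer):
# Hypothetical custom grid: liquidSVM searches over these values
# instead of its default grid.
model <- svm(x1 ~ ., train, display = 1, folds = 5, mc_type = "OvA_ls",
             gammas = c(0.001, 0.01, 0.1, 1),
             lambdas = c(0.001, 0.01, 0.1))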
Best of luck.

Related

Error in optim: L-BFGS-B needs finite value of fn

I am trying to run the impute_errors() function of the imputeTestBench package for a series of values. I am using six user-defined methods for selecting the best imputation method. Below is my code:
# (The enclosing call around these arguments, e.g. a cbind()/data.frame(), was truncated in the question.)
correctedSalesHistoryMatrix[, 1:2],
matrix(unlist(apply(X = as.matrix(correctedSalesHistoryMatrix[, -c(1, 2)]),
                    MARGIN = 1,
                    FUN = impute_errors,
                    smps = "mcar",
                    methods = c("imputationMethod1",
                                "imputationMethod2",
                                "imputationMethod3",
                                "imputationMethod4",
                                "imputationMethod5",
                                "imputationMethod6"),
                    methodPath = "C:\\Documents\\Imputations.R",
                    errorParameter = "mape",
                    missPercentFrom = 10,
                    missPercentTo = 10)),
       nrow = nrow(correctedSalesHistoryMatrix), byrow = T
)
)
When I am using a small dataset, the function executes successfully. When I am using a large dataset, I get the following error:
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
Called from: optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,
np + 1L), upper = rep(Inf, np + 1L), control = optim.control)
I don't think this is an easy fix.
The error is probably not caused by imputeTestBench itself, but rather by one of your user-defined imputation methods.
Run impute_errors as before, but add only na_mean as the method instead of your user-defined methods (impute_errors(..., methods = 'na_mean')) to see whether this suggestion is true.
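A minimal sketch of that check on a single series (the input object and the other arguments are just carried over from the question and may need adjusting):
# Use only the built-in na_mean method; if this runs without the optim error,
# the problem lies in one of the user-defined methods, not in imputeTestBench.
library(imputeTestBench)
impute_errors(dataIn = as.numeric(correctedSalesHistoryMatrix[1, -c(1, 2)]),
              smps = "mcar",
              methods = "na_mean",
              errorParameter = "mape",
              missPercentFrom = 10,
              missPercentTo = 10)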
The error itself occurs quite often and has to do with stats::optim receiving inputs it can't deal with. Quite likely you are not calling stats::optim directly in your user-defined imputation methods (so you can't easily fix the input). More likely, a package you are using does some calculations and then calls stats::optim. Or, even worse, a package you are using relies on another package that uses stats::optim.
In the answers to this question you can see an explanation of the underlying problem. Overall, it seems to occur especially for large datasets, when the fn input to stats::optim becomes Inf.
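As a minimal illustration (not from the original answer), L-BFGS-B raises exactly this error whenever the objective evaluates to a non-finite value:
# An objective that returns Inf makes optim() with method "L-BFGS-B" stop with
# "L-BFGS-B needs finite values of 'fn'".
optim(par = 1, fn = function(x) Inf, method = "L-BFGS-B")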
Here are some examples of the problem also occurring for different R packages and functions (which all use stats::optim somewhere internally): 1, 2, 3
There is not too much you can do overall if you don't want to go extremely deep into the underlying packages.
If you are using the imputeTS package for one of your user-supplied imputation methods, a workaround is proposed in this GitHub issue, which might help if the error occurs within the na_kalman or na_seadec method.

Controlling Step Size in 'optim' R function

I am using R to optimize a function with the 'optim' function. However, the true values of the variables I am optimizing over are spaced apart by at least 10^-5 or so, whereas, as I understand it, the default step size (i.e. how much optim adds to each control variable to see how that changes the objective function) is on the order of 10^-8.
Is there any easy way to tell the 'optim' function to increase the step size to 10^-5 or perhaps higher?
For reference my code is here:
Optimal <- optim(par = starting, fn = expectedSeats,
                 propensities = propsShocked, n = NumberofDistricts,
                 shockType = "normal", shockSD = 0.1,
                 method = "L-BFGS-B",
                 lower = rep(0, NumberofDistricts), upper = rep(1, NumberofDistricts),
                 control = list(factr = 1e12))
I have looked around and can't seem to figure this out. Thanks!
As I understand your question, I believe you can specify the step value within the ndeps argument of the control= option. According to the documentation, the default is 1e-3 (not 1e-8).
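A hedged sketch of what that could look like for the call above (ndeps expects one entry per element of par; the 1e-5 value is only an illustration):
# Increase the finite-difference step size to 1e-5 for every parameter via ndeps.
Optimal <- optim(par = starting, fn = expectedSeats,
                 propensities = propsShocked, n = NumberofDistricts,
                 shockType = "normal", shockSD = 0.1,
                 method = "L-BFGS-B",
                 lower = rep(0, NumberofDistricts), upper = rep(1, NumberofDistricts),
                 control = list(factr = 1e12,
                                ndeps = rep(1e-5, length(starting))))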

How to choose the nrounds using `catboost`?

If I understand catboost correctly, we need to tune nrounds just like in xgboost, using CV. I see the following code in the official tutorial, In [8]:
params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)
Which results in best iterations = 211.
My questions are:
Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
Catboost is doing cross validation to determine the optimum number of iterations. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write
train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'
train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])
When you execute catboost.train(train_pool, test_pool, params_with_od) train_pool is used for training and test_pool is used to determine the optimum number of iterations via cross validation.
Now you are right to be confused, since later on in the tutorial they again use test_pool and the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector IncToDec):
prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')
This might be bad practice. Now they might get away with it with their IncToDec overfitting detector - I am not familiar with the mathematics behind it - but for the Iter type overfitting detector you would need to have separate train, validation and test data sets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used and how.
Here is a link with a little more detail on the overfitting detectors:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/
It is a very poor decision to base your number of iterations on one test_pool and on the best iteration from catboost.train(). In doing so, you are tuning your parameters to one specific test set and your model will not work well with new data. You are therefore correct in presuming that, like with XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement by using the early_stopping_rounds parameter. Unlike LightGBM, catboost unfortunately doesn't seem to have the option of automatically returning the optimal number of boosting rounds after CV to apply in catboost.train(), so it requires a bit of a workaround. Here is an example which should work:
library(catboost)
library(data.table)
# n_cores must be defined beforehand, e.g. n_cores <- parallel::detectCores() - 1
parameter = list(
  thread_count = n_cores,
  loss_function = "RMSE",
  eval_metric = c("RMSE", "MAE", "R2"),
  iterations = 10^5,           # Train up to 10^5 rounds
  early_stopping_rounds = 100  # Stop after 100 rounds of no improvement
)
# Apply 6-fold CV
model = catboost.cv(
  pool = train_pool,
  fold_count = 6,
  params = parameter
)
# Transform the CV output to a data.table
setDT(model)
model[, iterations := .I]
# Order from lowest to highest test RMSE
setorder(model, test.RMSE.mean)
# Select iterations with lowest RMSE
parameter$iterations = model[1, iterations]
# Train model with optimal iterations
model = catboost.train(
  learn_pool = train_pool,
  test_pool = test_pool,
  params = parameter
)
I think this is a general question for xgboost and catboost.
The choice of nround goes hand in hand with the choice of learning rate.
Thus, I recommend a higher number of rounds (1000+) and a low learning rate.
After you find the best hyper-parameters, retry with a lower learning rate to check that the hyper-parameters you chose are stable.
And I find #nikitxskv 's answer misleading.
In the R tutorial, In [12] just chooses learning_rate = 0.1 without multiple choices, so there is no hint for nround tuning.
Actually, In [12] just uses the function expand.grid to find the best hyper-parameters. It works on the selection of depth, gamma and so on.
And in practice, we don't use this way to find a proper nround (it takes too long).
And now for the two questions.
Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
Yes, but you can use CV.
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
It depends on you. If you have a great aversion to overfitting in boosting, I recommend you try it. There are a lot of packages that can solve this problem; I recommend the tidymodels packages.

r Nomad categorical optimisation (snomadr)

I am trying to use the Nomad technique for blackbox optimisation from the crs package (a C implementation), which is called via the snomadr function. The method works for straight numerical optimisation, but errors when categorical features are included. However, the help for categorical optimisation is not very well documented, so I am struggling to see where I am going wrong. Reproducible code below:
library(crs)
library(randomForest)
Illustrating this on randomForest & the iris dataset.
Creating the randomForest model (leaving the last row out as starting points for the optimizer)
rfIris <- randomForest(x=iris[-150,-c(1)], y=unlist(iris[-150,1]))
The objective function (functions we want to optimize)
objFn <- function(x0, model){
  preds <- predict(object = model, newdata = x0)
  as.numeric(preds)
}
Test to see if the objective function works (should return ~6.37)
objOut <- objFn(x0=unlist(iris[150,-c(1)]),model = rfIris)
Creating initial conditions, options list, and upper/lower bounds for Nomad
x0 <- iris[150,-c(1)]
x0 <- unlist(x0)
options <- list("MAX_BB_EVAL" = 10000,
                "MIN_MESH_SIZE" = 0.001,
                "INITIAL_MESH_SIZE" = 1,
                "MIN_POLL_SIZE" = 0.001,
                "NEIGHBORS_EXE" = c(1, 2, 3),
                "EXTENDED_POLL_ENABLED" = 'yes',
                "EXTENDED_POLL_TRIGGER" = 'r0.01',
                "VNS_SEARCH" = '1')
up <- c(10,10,10,10)
low <- c(0,0,0,0)
Calling the optimizer
opt <- snomadr(eval.f = objFn, n = 4, bbin = c(0, 0, 0, 2), bbout = 0, x0 = x0,
               model = rfIris, opts = options, ub = up, lb = low)
and I get an error about the NEIGHBORS_EXE parameter in the options list. It seems as if I need to supply NEIGHBORS_EXE with a file corresponding to a set of 'extended poll' coordinates, but it is not clear what exactly these are.
The method works when setting "EXTENDED_POLL_ENABLED" = 'no' in the options list, as it then ignores the categorical variables and defaults to numerical optimisation, but this is not what I want.
I also managed to pull up some additional information for NEIGHBORS_EXE using
snomadr(information = list("help" = "-h NEIGHBORS_EXE"))
and again, I do not understand what the 'neighbours.exe' is meant to be.
Any help would be much appreciated!
This is the response from Zhenghua, who coded the R interface:
The issue is that he did not configure the parameter "NEIGHBORS_EXE" properly. He needs to prepare an executable file for defining the neighbours, put the executable file in the folder where R is called, and then set the parameter "NEIGHBORS_EXE" to the executable file name.
You can contact us at nomad@gerad.ca if you wish to continue the discussion.
For the NEIGHBORS_EXE parameter you can refer to section 7.1 of the NOMAD user guide:
https://www.gerad.ca/nomad/Downloads/user_guide.pdf
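Based on that description, a minimal sketch of how the option might be set (the file name neighbors.exe is an assumption; the executable itself, which lists the categorical neighbours of a point as described in section 7.1 of the user guide, still has to be written):
# Assumes an executable called "neighbors.exe" sits in the directory R was
# started from (getwd()) and implements the neighbour definition from
# section 7.1 of the NOMAD user guide.
options <- list("MAX_BB_EVAL" = 10000,
                "EXTENDED_POLL_ENABLED" = 'yes',
                "EXTENDED_POLL_TRIGGER" = 'r0.01',
                "NEIGHBORS_EXE" = 'neighbors.exe')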

Using mRMRe in R

I am currently working on a project where I have to do some feature selection for building a predictive model. I was led to an R package called mRMRe. I am just trying to work through the example but cannot get it working. The example can be found here - http://www.inside-r.org/packages/cran/mRMRe/docs/mRMR.ensemble.
Here is my code -
data(cgps)
data <- data.frame(target=cgps.ic50, cgps.ge)
mRMR.ensemble(data, 1, rep.int(1, 30))
When I run this code I get the error -
Error in .local(.Object, ...) : data must be of type mRMRe.Data.
I dug a little further and found that you actually have to convert the data to the mRMRe.Data type, so I made this update -
# Update
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data, 1, rep.int(1, 30))
but I still get the same error. When I look at the class I have -
> class(data)
[1] "mRMRe.Data"
attr(,"package")
[1] "mRMRe"
So the data is of the requested type, but the code is still not working.
My question is whether anyone has experience using this package; any help or comments would be appreciated!
Also, I want to note that, in the example from the link, when I load the data I have to map
cgps_ic50 -> cgps.ic50
cgps_ge -> cgps.ge
as the names of the loaded data aren't the same as in the example.
With the code you wrote:
data(cgps)
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data, 1, rep.int(1, 30))
The function mRMR.ensemble is receiving the data as its first parameter, but the first parameter of this function is actually solution_count.
I understand that your intention in executing that example is to find 30 relevant and non-redundant features using the classic mRMR feature selection algorithm, so try this:
data(cgps)
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data = data, target_indices = 1,
feature_count = 30, solution_count = 1)
The target_indices are the positions in the original data.frame of the features whose relevance is to be maximized (measured by correlation or another quality measure), so the features selected in the end will be good for explaining the features indicated in target_indices.
For example, in a classification problem, we would choose the position of the class variable as the value for the target_indices parameter.
The feature_count parameter indicates the number of variables to be chosen.
solution_count is not a parameter of the classic mRMR. It indicates the number of mRMR algorithms to be ensembled to obtain the final feature selection, so if it is set to 1 it performs only one classic mRMR.
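To see which features were actually selected, a small follow-up sketch (it assumes the mRMRe accessors solutions() and featureNames() and reuses the objects from the example above):
# Run the selection and map the returned feature indices back to column names.
fs <- mRMR.ensemble(data = data, target_indices = 1,
                    feature_count = 30, solution_count = 1)
selected <- solutions(fs)[[1]]           # indices of the selected features
featureNames(data)[as.vector(selected)]  # their names in the original data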
