Parameters for hyperparameter grid search functions in tidymodels tuning

I’m using {workflowsets} from {tidymodels} for the first time, and I’m following along with a chapter in Tidy Modeling with R.
In the book, the authors use a fixed, regular grid hyperparameter search:
grid_results <-
  all_workflows %>%
  workflow_map(
    fn = "tune_grid",
    seed = 1503,
    resamples = concrete_folds,
    grid = 25, # See note below
    control = grid_ctrl
  )
Help for the tune::tune_grid() function specifies that “[a]n integer denotes the number of candidate parameter sets to be created automatically.”
While this has worked for me, I wanted to try a space-filling design, such as those created with dials::grid_max_entropy(). However, this function (much like all grid-generating functions in {dials}) requires a param or parameter set object.
How can I extract parameters en bloc from a workflow set, given that there is no parameters() method for workflow sets (yet)?
Thanks!
Edit: I added the fn argument to workflow_map for clarity, even though it’s not present in the book (tune_grid is the default value for that argument).

I'm quoting Max Kuhn's answer in this GitHub issue:
The best approach is to use option_add(). There is an example in the upcoming blog post that does this (but to define a custom parameter range for the grid). You could do:
wflow_set <-
  wflow_set %>%
  option_add(grid = some_grid, id = "some wflow_id")
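To build that some_grid in the first place, you can extract the parameter set from the individual workflow. A minimal sketch, assuming a recent tidymodels release where extract_workflow() and extract_parameter_set_dials() are available ("some wflow_id" is a placeholder for one of your real workflow IDs):

library(tidymodels)
library(workflowsets)

wflow <- extract_workflow(wflow_set, id = "some wflow_id")  # pull one workflow out of the set
param_set <- extract_parameter_set_dials(wflow)             # its tunable parameters
some_grid <- grid_max_entropy(param_set, size = 25)         # space-filling design
wflow_set <- wflow_set %>%
  option_add(grid = some_grid, id = "some wflow_id")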
Also, I found this presentation by Max Kuhn very useful; slide 24 in particular illustrates "Passing individual options to models". Here is the YouTube link to the presentation: https://www.youtube.com/watch?v=2OfTEakSFXQ

Related

To find valid arguments for a function in R's help document (meaning of ...)

This question may seem basic, but it has bothered me for quite a while. The help documents for many functions list ... as one of the arguments, but somehow I can never get my head around this ... thing.
For example, suppose I have created a model, say model_xgboost, and want to make a prediction on a dataset, say data_tbl, using the predict() function, and I want to know the syntax. So I look at its help document, which says:
?predict
**Usage**
predict(object, ...)
**Arguments**
object a model object for which prediction is desired.
... additional arguments affecting the predictions produced.
The usage line and its examples didn't really enlighten me, as I still had no idea what the valid syntax/arguments for the function are. An online course uses something like the following, which works:
data_tbl %>%
  predict(model_xgboost, new_data = .)
However, looking through the help doc I cannot find a new_data argument. Instead, it mentions a newdata argument in its Details section, which actually doesn't work if I replace the new_data = . with newdata = .:
Error in `check_pred_type_dots()`:
! Did you mean to use `new_data` instead of `newdata`?
My questions are:
1) How do I know exactly what argument(s)/syntax can be used for a function like this?
2) Why new_data but not newdata in this example?
3) I might be missing something here, but is there any reference/resource on how to use/interpret a help document, in plain English? (A lot of documentation, including R help files, seems to give just a brief sentence like "additional arguments affecting the predictions produced".)
@CarlWitthoft's answer is good; I want to add a little nuance about this particular function. The reason the help page for ?predict is so vague is an unfortunate consequence of the fact that predict() is a generic function in R: that is, it's a function that can be applied to a variety of different object types, using slightly different (but appropriate) methods in each case. As such, the ?predict help page only lists object (which is required as the first argument in all methods) and ..., because different predict methods can take very different arguments/options.
If you call methods("predict") in a clean R session (before loading any additional packages) you'll see a list of 16 methods that base R knows about. After loading library("tidymodels"), the list expands to 69 methods. I don't know what class your object is (check with class(model_xgboost)), but assuming that it's of class model_fit, we look at ?predict.model_fit to see
predict(object, new_data, type = NULL, opts = list(), ...)
This tells us that we need to call the new data new_data (and, reading a bit farther down, that it needs to be "A rectangular data object, such as a data frame").
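Putting the lookup process together (model_xgboost and data_tbl are the objects from the question):

class(model_xgboost)   # find the object's class, e.g. "model_fit"
methods("predict")     # list the predict methods currently registered
?predict.model_fit     # read the help page for the matching method

So the working call, without the pipe, is simply:

predict(model_xgboost, new_data = data_tbl)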
The help page for predict says
Most prediction methods which are similar to those for linear models have an argument ‘newdata’ specifying the first place to look for explanatory variables to be used for prediction
(emphasis added). I don't know why the parsnip authors (the predict.model_fit method comes from the parsnip package) decided to use new_data rather than newdata, presumably in line with the tidyverse style guide, which says
Use underscores (_) (so called snake case) to separate words within a name.
In my opinion this might have been a mistake, but you can see that the parsnip/tidymodels authors realized that people are likely to make this mistake and added an informative error message, as shown in your example and noted e.g. here
Among other things, the existence of ... in a function definition means you can pass in whatever additional arguments (values, functions, etc.) you want. In some cases the main function does not even use the ... itself but passes them on to functions called inside it. A simple example:
foo <- function(x, ...) {
  y <- x^2
  plot(x, y, ...)  # everything in ... is handed straight to plot()
}
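Anything extra passed to foo() is forwarded to plot(); for example (col, pch and main are ordinary plot() arguments, used here purely as illustration):

foo(1:10, col = "red", pch = 19, main = "Squares")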
I know of functions which accept a function as an input argument; in that case, the items to include via ... are specific to the selected input function.

How to include all elements of a vector in an exams2moodle or exams2pdf output?

I am working on some simple code to find the square roots of the following elements:
dat <- c(4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225)
num <- sample(dat, 1, replace = FALSE)
Parts of the seed file are configured like this:
examen01 <- c("SinRad.Rmd")
semilla <- sample(100:1000, 1)
set.seed(semilla)
exams2moodle(examen01, n = 14, svg = TRUE, name = "SinRadConRad",
             encoding = "UTF-8", dir = "salida", edir = "ejercicios",
             mchoice = list(shuffle = TRUE, answernumbering = "ABCD",
                            solution = FALSE,
                            eval = list(partial = TRUE, rule = "none")))
My intention is that, with n = 14, the output includes responses for all of the options in the dat vector, but I see that some responses are repeated.
How can I get 14 answers covering the 14 possibilities, without repeating or missing any?
Thank you very much
The exams2xyz() functions have been written to draw large numbers of random variations from sets of exercises. There is no dedicated functionality that draws a small number of deterministic variations. So in your case I would just draw, say, a hundred variations from the exercise template even if it can only yield 14 distinct versions. Sure, this wastes a bit of memory but not so much that I would worry about this.
Having said that, it is possible to set up a temporary file with a specific version of an exercise by using the expar() function. For example, expar("SinRad.Rmd", num = 4) would yield an exercise where the num parameter has been fixed to 4. In the same way you can cycle through the other 13 numbers. In the following post we also provide an expargrid() function that does this for all possible combinations of parameters: Making deterministic versions of a parametrized question
Then you can run exams2moodle() on the resulting 14 deterministic exercise files.
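As a minimal sketch of that workflow, assuming expar() returns the path of the temporary copy it writes (and that the exercise file sits in the ejercicios directory, as in the question):

library(exams)
files <- sapply(dat, function(v) expar("ejercicios/SinRad.Rmd", num = v))
exams2moodle(files, n = 1, svg = TRUE, name = "SinRadConRad",
             encoding = "UTF-8", dir = "salida")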

How to use RWeka classifiers function attribute "options"?

In RWeka classifiers, there is an argument "options" in the classifier's function call, e.g. Bagging(formula, data, subset, na.action, control = Weka_control(), options = NULL). Could someone please give an example (a sample of R code) of how to define these options?
I would be interested in passing some options (such as the number of iterations and the size of each bag) to the Bagging meta learner of RWeka. Thanks in advance!
You can get at the features that you mentioned, but not through options.
First, what does options do? According to the help page ?Bagging
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
So options simply stores more information in the returned result. To get at the features that you want, you need to use control. You construct the value for control using the function Weka_control. Without some help, it is hard to know how to use that, but luckily help is available through WOW, the Weka Option Wizard. Because there are many options, the output is long. I am going to truncate it to just the part about the features that you mentioned, the number of iterations and the size of each bag, but do look at what else is available.
WOW(Bagging)
-P Size of each bag, as a percentage of the training set size. (default 100)
-I <num>
Number of iterations. (current value 10)
Number of arguments: 1.
Repeating: I have truncated the output to show just these two options.
Example: Iris data
Suppose that I wanted to use bagging with the iris data with the bag size being 90% of the data (instead of the default 100%) and with 20 iterations (instead of the default 10). First, I would build the Weka_control, then include that in my call to Bagging.
WC <- Weka_control(P = 90, I = 20)
BagOfIrises <- Bagging(Species ~ ., data = iris, control = WC)
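As a quick, purely illustrative sanity check, you can predict back onto the training data and tabulate the results:

preds <- predict(BagOfIrises, newdata = iris)
table(preds, iris$Species)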
I hope that this helps.

Exclude specific tensors being updated by optimizer in TensorFlow

I have two graphs which I want to train independently, which means I have two different optimizers, but at the same time one of them uses the tensor values of the other graph. As a result, I need to be able to stop specific tensors from being updated while training one of the graphs. I have assigned two different name scopes to my tensors and am using this code to control which variables each optimizer updates:
mentor_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentor")
train_op_mentor = mnist.training(loss_mentor, FLAGS.learning_rate, mentor_training_vars)
mentee_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentee")
train_op_mentee = mnist.training(loss_mentee, FLAGS.learning_rate, mentee_training_vars)
The var_list argument is used as below, in the training method of the mnist object:
def training(loss, learning_rate, var_list):
    # Add a scalar summary for the snapshot loss.
    tf.summary.scalar('loss', loss)
    # Create the gradient descent optimizer with the given learning rate.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Create a variable to track the global step.
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Use the optimizer to apply the gradients that minimize the loss
    # (and also increment the global step counter) as a single training step.
    train_op = optimizer.minimize(loss, global_step=global_step, var_list=var_list)
    return train_op
I'm using the var_list argument of optimizer.minimize() in order to control which variables are updated by the optimizer.
Right now I'm not sure whether I have done this appropriately, and whether there is any way to check that an optimizer only updates part of a graph.
I would appreciate it if anyone could help me with this issue.
Thanks!
I have had a similar problem and used the same approach as you, i.e. via the var_list argument of the optimizer. I then checked whether the variables not intended for training stayed the same using:
the_var_np = sess.run(tf.get_default_graph().get_tensor_by_name('the_var:0'))
assert np.equal(the_var_np, pretrained_weights['the_var']).all()
pretrained_weights is a dictionary returned by np.load('some_file.npz') which I used to store the pre-trained weights to disk.
Just in case you need that as well, here is how you can override a tensor with a given value:
value = pretrained_weights['the_var']
variable = tf.get_default_graph().get_tensor_by_name('the_var:0')
sess.run(tf.assign(variable, value))

Avoiding using global objects when building an R package with multiple separate functions

I have built an R package that runs a complex Bayesian model (a Dirichlet process mixture model on spatial data), including MCMC, thinning and validation, and an interface with Google Maps. I'm very happy with its performance and it runs without problems. The only issue is that I would like to get it onto CRAN, and it will be rejected because I extensively use global variables.
The package is built around the use of 8 core functions (which the user interacts with):
1) LoadData: Loads in data, extracts key information and sets up a series of global matrices as well as other small list objects.
2) ModelParameters: Sets model parameters, option to plot prior on parameter sigma on Googlemap. Calculates a hyper-prior at this point and saves a large matrix to the global environment
3) GraphicParameters: Sets graphic parameters of maps and plots (see code below)
4) CreateMaps: Creates the prior surface on source location tau and plots the data on a Google map. Keeps a number of global objects saved for repeated plotting of this map.
5) RunMCMC: Runs the bulk of the analysis using MCMC (a time intensive step), creates many global objects.
6) ThinandAnalsye: Thins the posterior samples and constructs the geoprofile (a time intensive step)
7) PlotGP: Plots the data and overlays the geoprofile onto a Google map
8) reporthitscores: OPTIONAL if source data is imported, calculates the hit scores of potential sources
Each one is run in turn before the next, and I pass global variables out which are used by one or more of the other functions.
I built it this way for a reason: the user must stop and evaluate the results of these functions before rushing ahead to the later ones.
Each of these functions passes not just fixed parameters, but also large map objects, lists and matrices as global objects. I thought it was a nice simple solution with a smooth workflow (you can check the results in your main working environment before moving on, possibly applying transformations etc) and I have given all the objects unique and informative names.
How do I get around this, and pass the checks of CRAN whilst keeping my user friendly workflow of a series of interacting functions?
I don't want to post a lot of code (the MCMC part alone is several hundred lines long), but I will include one of the simple examples. GraphicParameters is one of my simple parameter-setting functions, which comes with default values set. There are much more complex ones in the package; for example, there is a model parameters function that pulls many of its variables from an existing data loading function.
GraphicParameters <-
  function(Guardrail = 0.05, nring = 20, transp = 0.4, gridsize = 640,
           gridsize2 = 300, MapType = "roadmap", Location = getwd(),
           pointcol = "black") {
    Guardrail <<- Guardrail
    nring <<- nring
    transp <<- transp
    gridsize <<- gridsize
    gridsize2 <<- gridsize2
    MapType <<- MapType
    Location <<- Location
    pointcol <<- pointcol
  }
Most of the material I have seen about avoiding global objects revolves around a single function that does all the work. I want to keep my step-by-step, multi-function approach, but lose the global objects.
Any help would be greatly appreciated.
I understand this may mean a major reworking of the code (which currently runs to several thousand lines), so I would also love solutions that minimally affect the overall structure of the package.
P.S. I wish I had known about CRAN's displeasure with global objects before I started!!!
Your problem is very amenable to OOP-style design. You can use reference classes or S4 to export a single global, e.g., a MapAnalysis class generator. The idea is then that someone creates this using
ma <- new('MapAnalysis', option1 = ..., option2 = ..., ...) # S4
# or
ma <- MapAnalysis$new(option1 = ..., ...) # refClass
and can then call your methods with
ma$loadData(...)
ma$setParameters(...)
with the object doing any bookkeeping of options and auxiliary objects internally. It should not be that much work to refactor. If you read the page I linked to at the top of this post, you should see that it's probably possible to just wrap all your functions in a setRefClass("MapAnalysis", fields = list(...), methods = list(...)) call with few further modifications. (Although it would do you a lot of good down the road to re-think the architecture in OOP terms.)
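For concreteness, here is a minimal sketch of such a wrapper; the field and method names are invented for illustration and would map onto your LoadData()/GraphicParameters() functions:

MapAnalysis <- setRefClass("MapAnalysis",
  fields = list(
    data   = "ANY",   # what LoadData() used to leave in the global environment
    params = "list"   # what the parameter-setting functions used to set globally
  ),
  methods = list(
    loadData = function(file) {
      data <<- read.csv(file)  # <<- in a method assigns to the field,
                               # not to the user's global environment
    },
    setGraphicParameters = function(Guardrail = 0.05, nring = 20) {
      params <<- modifyList(params, list(Guardrail = Guardrail, nring = nring))
    }
  )
)

ma <- MapAnalysis$new(params = list())
ma$loadData("mydata.csv")
ma$setGraphicParameters(nring = 30)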
