How to use RWeka classifiers function attribute "options"? - r

In RWeka classifiers, there is an attribute "options" in the classifier's function call, e.g. Bagging(formula, data, subset, na.action, control = Weka_control(), options = NULL). Could some one please give an example (a sample R code) on how to define these options?
I would be interested in passing on some options (such as the number of iterations and size of each bag) to Bagging meta learner of RWeka. Thanks in advance!

You can get at the features that you mentioned, but not through options.
First, what does options do? According to the help page ?Bagging
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
So options simply stores more information in the returned result. To get at the features that you want, you need to use control. You will need to construct the value for control using the function Weka_control. Without some help, it is hard to know how to use that, but luckily, help is available through WOW the Weka Option Wizard. Because there are many options, the output is long. I am going to truncate it to just the part about the features that you mentioned - the number of iterations and size of each bag. But do look at what else is available.
WOW(Bagging)
-P Size of each bag, as a percentage of the training set size. (default 100)
-I <num>
Number of iterations. (current value 10)
Number of arguments: 1.
Repeating: I have truncated the output to show just these two options.
Example: Iris data
Suppose that I wanted to use bagging with the iris data with the bag size being 90% of the data (instead of the default 100%) and with 20 iterations (instead of the default 10). First, I would build the Weka_control, then include that in my call to Bagging.
WC = Weka_control(P=90, I=20)
BagOfIrises = Bagging(Species ~ ., data=iris, control=WC)
I hope that this helps.

Related

Parameters for hyperparameter grid search functions in tidymodels tuning

I’m using {workflowsets} from {tidymodels} for the first time, and I’m following along chapter in Tidy Modeling with R.
In the book, the authors use a fixed, regular grid hyperparameter search:
grid_results <-
all_workflows %>%
workflow_map(
fn = "tune_grid",
seed = 1503,
resamples = concrete_folds,
grid = 25, # See note below
control = grid_ctrl
)
Help for the tune::tune_grid() function specifies that “[a]n integer denotes the number of candidate parameter sets to be created automatically.”
While this has worked for me, I wanted to try a space-filling design, such as those created with dials::grid_max_entropy. However, this function (much like all grid-generating functions in {tune}) requires a param or parameter set object.
How can I extract parameters en bloc from a workflow set, seeing as how there is no parameters method for workflowsets (yet)?
Thanks!
Edit: added fn argument to workflow_map for clarity, even though it’s not present in the book (tune_grid is the default value for the arg).
I'm quoting the answer of Max Kuhn in this GitHub issue
The best approach is to use option_add(). There is an example in the upcoming blog post that does this (but to define a custom parameter range for the grid). You could do:
wflow_set <-
wflow_set %>%
option_add(grid = some_grid, id = "some wflow_id")
Also, I found this presentation by Max Kuhn very useful. Namely slide 24 illustrates "Passing individual options to models". This is the YouTube link of the presentation (https://www.youtube.com/watch?v=2OfTEakSFXQ)

Estimation to plot person-item map not feasible because items "have no 0-responses" in data matrix

I am trying to create a person item map that organizes the questions from a dataset in order of difficulty. I am using the eRm package and the output should looks like follows:
[person-item map] (https://hansjoerg.me/post/2018-04-23-rasch-in-r-tutorial_files/figure-html/unnamed-chunk-3-1.png)
So one of the previous steps, before running the function that outputs the map, I have to fit the data set to have a matrix which is the object that the plotting functions uses to create the actual map, but I am having an error when creating that matrix
I have already tried to follow and review some documentation that might be useful if you want to have some extra-information:
[Tutorial] https://hansjoerg.me/2018/04/23/rasch-in-r-tutorial/#plots
[Ploting function] https://rdrr.io/rforge/eRm/man/plotPImap.html
[Documentation] https://eeecon.uibk.ac.at/psychoco/2010/slides/Hatzinger.pdf
Now, this is the code that I am using. First, I install and load the respective libraries and the data:
> library(eRm)
> library(ltm)
Loading required package: MASS
Loading required package: msm
Loading required package: polycor
> library(difR)
Then I fit the PCM and generate the object of class Rm and here is the error:
*the PCM function here is specific for polytomous data, if I use a different one the output says that I am not using a dichotomous dataset
> res <- PCM(my.data)
>Warning:
The following items have no 0-responses:
AUT_10_04 AUN_07_01 AUN_07_02 AUN_09_01 AUN_10_01 AUT_11_01 AUT_17_01
AUT_20_03 CRE_05_02 CRE_07_04 CRE_10_01 CRE_16_02 EFEC_03_07 EFEC_05
EFEC_09_02 EFEC_16_03 EVA_02_01 EVA_07_01 EVA_12_02 EVA_15_06 FLX_04_01
... [rest of items]
>Responses are shifted such that lowest
category is 0.
Warning:
The following items do not have responses on
each category:
EFEC_03_07 LC_07_03 LC_11_05
Estimation may not be feasible. Please check
data matrix
I must clarify that all the dataset has a range from 1 to 5. Is a Likert polytomous dataset
Finally, I try to use the plot function and it does not have any output, the system just keep loading ad-infinitum with no answer
>plotPImap(res, sorted=TRUE)
I would like to add the description of that particular function and the arguments:
>PCM(X, W, se = TRUE, sum0 = TRUE, etaStart)
#X
Input data matrix or data frame with item responses (starting from 0);
rows represent individuals, columns represent items. Missing values are
inserted as NA.
#W
Design matrix for the PCM. If omitted, the function will compute W
automatically.
#se
If TRUE, the standard errors are computed.
#sum0
If TRUE, the parameters are normed to sum-0 by specifying an appropriate
W.
If FALSE, the first parameter is restricted to 0.
#etaStart
A vector of starting values for the eta parameters can be specified. If
missing, the 0-vector is used.
I do not understand why is necessary to have a score beginning from 0, I think that that what the error is trying to say but I don't understand quite well that output.
I highly appreciate any hint that you can provide me
Feel free to ask for any information that could be useful to reach the solution to this issue
The problem is not caused by the fact that there are no items with 0-responses. The model automatically corrects this by centering the response scale categories on zero. (You'll notice that the PI-map that you linked to is centered on zero. Also, I believe the map you linked to is of dichotomous data. Polytomous data should include the scale categories on the PI-map, I believe.)
Without being able to see your data, it is impossible to know the exact cause though.
It may be that the model is not converging. That may be what this error was alluding to: Estimation may not be feasible. Please check data matrix. You could check by entering > res at the prompt. If the model was able to converge you should see something like:
Conditional log-likelihood: -2.23709
Number of iterations: 27
Number of parameters: 8
...
Does your data contain answers with decimal numbers? I found the same error, I solved it by using dplyr::dense_rank() function:
df_ranked <- sapply(df_decimal_data, dense_rank)
Worked.

Exclude specific tensors being updated by optimizer in TensorFlow

I have two graphs, which I suppose to train them independently, which means I have two different optimizers, but at the same time one of them is using the tensor values of the other graph. As a result, I need to be able to stop specific tensors being updated while training one of the graphs. I have assigned two different namescopes two my tensors and using this code to control updates over tensors for different optimizers:
mentor_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentor")
train_op_mentor = mnist.training(loss_mentor, FLAGS.learning_rate, mentor_training_vars)
mentee_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentee")
train_op_mentee = mnist.training(loss_mentee, FLAGS.learning_rate, mentee_training_vars)
the vars variable is being used like below, in the training method of mnist object:
def training(loss, learning_rate, var_list):
# Add a scalar summary for the snapshot loss.
tf.summary.scalar('loss', loss)
# Create the gradient descent optimizer with the given learning rate.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Create a variable to track the global step.
global_step = tf.Variable(0, name='global_step', trainable=False)
# Use the optimizer to apply the gradients that minimize the loss
# (and also increment the global step counter) as a single training step.
train_op = optimizer.minimize(loss, global_step=global_step, var_list=var_list)
return train_op
I'm using the var_list attribute of the optimizer class in order to control vars being updated by the optimizer.
Right now I'm confused whether I have done what I supposed to do appropriately, and even if there is anyway to check if any optimizer would only update partial of a graph?
I would appreciate if anyone can help me with this issue.
Thanks!
I have had a similar problem and used the same approach as you, i.e. via the var_list argument of the optimizer. I then checked whether the variables not intended for training stayed the same using:
the_var_np = sess.run(tf.get_default_graph().get_tensor_by_name('the_var:0'))
assert np.equal(the_var_np, pretrained_weights['the_var']).all()
pretrained_weights is a dictionary returned by np.load('some_file.npz') which I used to store the pre-trained weights to disk.
Just in case you need that as well, here is how you can override a tensor with a given value:
value = pretrained_weights['the_var']
variable = tf.get_default_graph().get_tensor_by_name('the_var:0')
sess.run(tf.assign(variable, value))

How to find out which seed the MICE R-package chose for multiple imputation when using seed=NA?

I´m doing a multiple imputation for a dataframe named "mydata" with this code:
library(mice)
imp<-mice(mydata,pred=pred,method="pmm", m=10)
Because the default argument for this is function is "seed=NA", the seed-number is chosen randomly. I would like to keep it like this, because i don´t know which number i should choose as a seed. But for replication i would like to know which seed this function chose for me. Is there a possibility to inspect the mids-object "imp" for the seed-value? Or should i just use a random number generator and set the seed to a generated value?
If you look at the documentation, there is no such thing as a set.seed argument for mice function. There is, however a seed argument which takes an integer. If left alone, the integer is generated randomly.
An integer that is used as argument by the `set.seed()` for offsetting the random
number generator. Default is to leave the random number generator alone
You can choose your own integer. If you're stuck at what to choose, try your lucky number, or some random integer, with sky or architecture of your system being the limit.
The function sets seed in the following manner, which translates to "set seed only if specified, otherwise leave alone" as mentioned in the documentation.
if (!is.na(seed))
set.seed(seed) ## FEH 1apr02

Accessing class values in R's poLCA

I am trying my hand at learning Latent Component Analysis, while also learning R. I'm using the poLCA package, and am having a bit of trouble accessing the attributes. I can run the sample code just fine:
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds = within(ds, (cesdcut = ifelse(cesd>20, 1, 0)))
library(poLCA)
res2 = poLCA(cbind(homeless=homeless+1,
cesdcut=cesdcut+1, satreat=satreat+1,
linkstatus=linkstatus+1) ~ 1,
maxiter=50000, nclass=3,
nrep=10, data=ds)
but in order to make this more useful, I'd like to access the attributes within the objects created by the poLCA class as such:
attr(res2, 'Nobs')
attr(res2, 'maxiter')
but they both come up as 'Null'. I expect Nobs to be 453 (determined by the function) and maxiter to be 50000 (dictated by my input value).
I'm sure I'm just being naive, but I could use any help available. Thanks a lot!
Welcome to R. You've got the model-fitting syntax right, in that you can get a model out (don't know how latent component analysis works, so can't speak to the statistical validity of your result). However, you've mixed up the different ways in which R can store information pertaining to a model.
poLCA returns an object of class poLCA, which is
a list containing the following elements:
(. . .)
Nobs number of fully observed cases (less than or equal to N).
maxiter maximum number of iterations through which the estimation algorithm was set
to run.
Since it's a list, you can extract individual elements from your model object using the $ operator:
res2$Nobs # number of observations
res2$maxiter # maximum iterations
In some cases, there might be extractor functions to get this information without having to do low-level indexing. For example, many model-fitting functions will have a fitted method, which pulls out the vector of fitted values on the training data; and similarly residuals pulls out the vector of residuals. You should check whether there are such extractor functions provided by the poLCA package and use them if possible; that way, you're not making assumptions about the structure of the model object that might be broken in the future.
This is distinct to getting the attributes of an object, which is what you use attr for. Attributes in R are what you might call metadata: they contain R-specific information about an object itself, rather than information about whatever it is the object relates to. Examples of common attributes include class (the class of an object), dim (the dimensions of an array or matrix), names (names of individual elements of a vector/list/array) and so on.

Resources