DRAGNN local/global training - syntaxnet

There is a parameter that specifies whether SyntaxNet models are trained locally or globally; however, I can't find the corresponding parameter in the DRAGNN models. Can someone help me?

Related

Should I tune a learner with the whole dataset?

Recently I have been learning to use the mlr3 package, and I have run into some problems:
(1) I read "The 'Cross-Validation - Train/Predict' misunderstanding". My takeaway is that I should focus on the average, unbiased performance estimate of the model produced by nested resampling. To me, tuning means adding CV within the training set (i.e., finding the best hyperparameters and the feature-importance ranking), just like the figure below.
However, many examples in the mlr3gallery (e.g., tuning an SVM; tuning a graph) seem to tune their learners on the whole dataset, which confuses me. Is it appropriate to use all of the data to tune a learner?
(2) I wonder if there is any way to combine feature selection and model tuning. I found the feature-filtering-with-tuning functionality in mlr, but not in mlr3. I guess a pipeline graph might help, but I could not find a tutorial.
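To show what I have in mind, here is a rough, untested sketch that tries to do both (1) and (2) at once: an mlr3pipelines graph that filters features and fits a tree, wrapped in an AutoTuner for the inner CV, and evaluated with an outer CV for the unbiased estimate. The sonar toy task, the anova filter, and all parameter ranges are placeholders I made up, so please correct me if this is not the intended usage.

library(mlr3)
library(mlr3pipelines)
library(mlr3filters)
library(mlr3tuning)
library(paradox)

# Graph: keep the top nfeat features, then fit a decision tree
graph <- po("filter", filter = flt("anova")) %>>% po("learner", lrn("classif.rpart"))
glrn  <- as_learner(graph)

# Tune the number of kept features and the tree's cp jointly
search_space <- ps(
  anova.filter.nfeat = p_int(2, 20),
  classif.rpart.cp   = p_dbl(0.001, 0.1)
)

at <- AutoTuner$new(
  learner      = glrn,
  resampling   = rsmp("cv", folds = 3),      # inner CV, used only for tuning
  measure      = msr("classif.ce"),
  search_space = search_space,
  terminator   = trm("evals", n_evals = 20),
  tuner        = tnr("random_search")
)

# Outer CV around the AutoTuner = nested resampling -> unbiased estimate
rr <- resample(tsk("sonar"), at, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.ce"))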
Thanks a lot! Any help would be appreciated!

When to use Train Validation Test sets

I know this question is quite common but I have looked at all the questions that have been asked before and I still can't understand why we also need a validation set.
I know sometimes people only use a train set and a test set, so why do we also need a validation set?
And how do we use it?
For example, when imputing missing data, should I impute these three different sets separately or not?
Thank you!
I will try to answer with an example.
If I'm training a neural network or fitting a linear regression using only train and test data, I can check the test loss at each iteration and stop when the test loss begins to grow, or keep a snapshot of the model with the lowest test loss.
In some sense this is "overfitting" to my test data, since I decide when to stop based on it.
If I instead use train, validation, and test data, I can run the same process as above on the validation data rather than the test data; then, once I decide my model is done training, I can evaluate it on the never-before-seen test data to get a less biased score of my model's predictions.
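To make that concrete, here is a minimal, hedged sketch of the same workflow using xgboost's early stopping in place of the neural network (my own toy example on iris; argument names may differ slightly across xgboost versions):

library(xgboost)

set.seed(1)
idx <- sample(nrow(iris))
train_idx <- idx[1:90]; valid_idx <- idx[91:120]; test_idx <- idx[121:150]

# Made-up binary target purely for illustration
y <- as.numeric(iris$Species == "virginica")
X <- as.matrix(iris[, 1:4])

dtrain <- xgb.DMatrix(X[train_idx, ], label = y[train_idx])
dvalid <- xgb.DMatrix(X[valid_idx, ], label = y[valid_idx])
dtest  <- xgb.DMatrix(X[test_idx, ],  label = y[test_idx])

# Early stopping watches the *validation* loss, never the test loss
bst <- xgb.train(
  params = list(objective = "binary:logistic", eval_metric = "logloss"),
  data = dtrain, nrounds = 500,
  watchlist = list(train = dtrain, valid = dvalid),
  early_stopping_rounds = 10, verbose = 0
)

# The untouched test set gives the final, less biased score
pred <- predict(bst, dtest)
mean((pred > 0.5) == y[test_idx])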
For the second part of the question, I would suggest treating at least the test data as independent and imputing its missing values separately, but it depends on the situation and the data.
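A minimal sketch of what I mean, with hypothetical data frames train_df, valid_df, and test_df; each set's fill values come only from that set, so the test data stays independent:

impute_median <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)  # fill with that set's own median
    x
  })
  df
}

train_df <- impute_median(train_df)
valid_df <- impute_median(valid_df)
test_df  <- impute_median(test_df)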

R XGBoost - xgb.save or xgb.load loss of data

After training an XGBoost model in R, I am presented with a model object called xgb which is a list of 7.
When I save the model using xgb.save and then reload using xgb.load, I am presented with what seems to be a 'smaller' model object which is a list of 2.
Obviously I can't share the code as you would need the training data which is massive, so all I can really show is a picture of the variable editor.
Below is model object xgb which is the original model after training, vs. the model object test1 which is the same model but saved and reloaded:
Why does this happen and am I losing valuable information upon saving/loading my models?
Any help is appreciated.
Maybe late, but I was having the same problem and found a solution.
Saving the xgb model as an "rds" file doesn't lose any information, and the reloaded model xgb_ generated the same forecast values as the original xgb when I tested it. Hope that helps!
saveRDS(xgb, "model.rds")     # serialize the complete R object, attributes included
xgb_ <- readRDS("model.rds")  # restore it unchanged
all.equal(xgb, xgb_)          # should report no differences
You are losing information due to rounding errors after save/load. See this issue. I believe it is currently a bug.
As to why the loaded model is a smaller list, see here. So again, you are losing information such as the callbacks and parameters, but these are not essential for prediction and are not portable to e.g. Python anyway.
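As a quick, untested check (reusing your xgb object and assuming a DMatrix such as dtrain from the original training run), you can confirm that the booster itself survives the round trip even though the R-side attributes are dropped:

xgb.save(xgb, "model.xgb")   # native format: drops callbacks, parameters, etc.
test1 <- xgb.load("model.xgb")
# Predictions should match (up to the rounding issue mentioned above)
all.equal(predict(xgb, dtrain), predict(test1, dtrain))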

rpart: Is training data required

I have a problem understanding some basics, so I'm stuck with a regression tree.
I use a classification tree built with rpart to check the influence of environmental parameters on a tree growth factor I measured.
Long story short:
What is the purpose of splitting the data into training and test sets, and (when) do I need it? My searches turned up examples that either do it or don't, but I can't find the reasoning behind it. Is it just to verify the pruning?
Thanks in advance!
You need to split into training and test data before training the model. The training data helps the model learn, while the test data helps validate the model.
The split is done before fitting the model, and the model must be retrained whenever there is any fine-tuning or other change.
As you might know, the general process for post-pruning is the following:
1) Split the data into training and test (validation) sets.
2) Build the decision tree from the training set.
3) For every non-leaf node N, prune the subtree rooted at N and replace it with the majority class, then test the accuracy on a validation set. This validation set may or may not be the one defined before.
This all means that you are probably on the right track and that yes, the whole dataset has probably been used to test the accuracy of the pruning.
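As a small illustration with rpart itself (a hedged sketch on its built-in kyphosis data; note that rpart prunes via its cost-complexity cp table rather than the subtree-replacement scheme above, but the train/test logic is the same):

library(rpart)

set.seed(42)
idx   <- sample(nrow(kyphosis), floor(0.7 * nrow(kyphosis)))
train <- kyphosis[idx, ]
test  <- kyphosis[-idx, ]

# Grow a deliberately large tree on the training data only
fit <- rpart(Kyphosis ~ Age + Number + Start, data = train,
             method = "class", control = rpart.control(cp = 0.001))

# Pick the cp with the lowest cross-validated error and prune
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Judge the pruned tree on the held-out test set
pred <- predict(pruned, newdata = test, type = "class")
mean(pred == test$Kyphosis)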

set.seed publishing model in R

Do you include set.seed() as part of the final model when you are ready to distribute/expose your model to the real world? Or do you only use set.seed() during training and validation?
Once you decide on your method, parameters, etc., do you refit the model without setting the seed, and that is the model you launch? Or do you pick the seed that performed best during validation?
Thanks!
I doubt different seeds give results that vary much unless your model involves random sampling of features or observations. Still, I believe you answered your own question: to reproduce the exact result, you would need to include the seed value along with your final model. It is not of great significance, however, so you don't need to worry about it too much.
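To make that concrete, here is a small hedged sketch (using the randomForest package purely as an example of a learner with internal randomness) of what recording the seed alongside the final model buys you:

library(randomForest)

set.seed(2024)   # the seed you would record alongside the published model
fit1 <- randomForest(Species ~ ., data = iris, ntree = 300)

set.seed(2024)   # same seed -> same bootstrap samples -> the exact same forest
fit2 <- randomForest(Species ~ ., data = iris, ntree = 300)

identical(predict(fit1, iris), predict(fit2, iris))  # TRUE: fully reproducible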
