I notice that the H2O package documentation mentions that it:
preprocesses the data to be standardized for compatibility with the
activation functions (recall Table 1’s summary of each activation
function’s target space). Since the activation function does not
generally map into the full spectrum of real numbers, R, we first
standardize our data to be drawn from N (0, 1). Standardizing again
after network propagation allows us to compute more precise errors in
this standardized space, rather than in the raw feature space. For
autoencoding, the data is normalized (instead of standardized) to the
compact interval of U(−0.5, 0.5), to allow bounded activation
functions like Tanh to better reconstruct the data.
However, I don't fully understand. My impression was (here, and here) that the categorical variables should be broken into 1-of-C dummies and the continuous data normalised. Then, everything should be standardised to [-1,1].
I also don't see a way of specifying the neurons for the read-out layer. I thought that if we have a categorical output variable then we want to use softmax activation function (and encode as 1-of-C) / if we have a continuous output (e.g. price) then we scale that to [-1,1] and use 'tanh' / if we have a single binary output then we can use logistic and code it as [0,1]
For classification and regression (i.e., supervised mode), H2O Deep Learning does the following:
The input into the first neural network layer is indeed 1-of-C dummies (either 0 or 1) for categorical features. Continuous features are standardized (not normalized): de-meaned and scaled by 1/(standard deviation).
For regression, the response variable is also standardized internally, to allow the (single) output neuron's activation value to be compared against it. However, for presentation to the user during scoring, the predictions are de-standardized into the original space.
For classification, we use Softmax to get probabilities for the C classes, even for binary classification.
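To make the supervised-mode steps above concrete, here is a minimal sketch in plain Python. It is not H2O's code: the helper names (`one_hot`, `standardize`, `destandardize`) and the toy data are made up for illustration, and it assumes the usual definition of standardization (subtract the mean, divide by the population standard deviation).

```python
from statistics import mean, pstdev

def one_hot(value, levels):
    """Return a 1-of-C list: 1.0 at the position of `value`, 0.0 elsewhere."""
    return [1.0 if value == level else 0.0 for level in levels]

def standardize(values):
    """De-mean and divide by the standard deviation; also return (mean, sd)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values], mu, sd

def destandardize(z, mu, sd):
    """Map a standardized value back to the original units."""
    return z * sd + mu

# Hypothetical toy data: one categorical feature, one continuous feature,
# and a continuous response (regression).
colours = ["red", "green", "blue", "green"]
ages = [20.0, 30.0, 40.0, 50.0]
prices = [100.0, 150.0, 200.0, 250.0]

levels = sorted(set(colours))                    # ['blue', 'green', 'red']
encoded = [one_hot(c, levels) for c in colours]  # 1-of-C dummies
ages_std, _, _ = standardize(ages)               # standardized input
prices_std, price_mu, price_sd = standardize(prices)  # standardized response

# The output neuron works in standardized space; for presentation the
# prediction is mapped back. Here we pretend the net predicts row 0 exactly.
network_output = prices_std[0]
prediction = destandardize(network_output, price_mu, price_sd)
```

The point of keeping `price_mu` and `price_sd` around is that errors are computed in standardized space during training, while the user only ever sees values in the original units.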
The documentation you cited also refers to unsupervised autoencoding (by enabling the autoencoder flag). In that case, the input is normalized (i.e., scaled by 1/(max-min)) instead of being standardized. That is needed to allow the auto-encoder to have fully overlapping input and output spaces.
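As a rough illustration of the difference, normalization rescales by the observed range rather than by the spread. The sketch below is an assumption about the general idea, not H2O's exact formula: the hypothetical `normalize` helper shifts by the minimum so that the result lands exactly in [-0.5, 0.5].

```python
def normalize(values):
    """Shift and scale by 1/(max - min) so that values land in [-0.5, 0.5]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span - 0.5 for v in values]

raw = [2.0, 4.0, 6.0, 10.0]
scaled = normalize(raw)

# Tanh outputs lie in (-1, 1), so every target in [-0.5, 0.5] is reachable
# by the output layer, which is what lets the autoencoder reconstruct it.
```

Standardized data, by contrast, is unbounded, so a bounded output activation could never reproduce the extreme values.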
H2O achieves the effect of 1-of-C dummy encoding, without the cost. The exact details vary by algorithm, but there's always an obvious algorithmic optimization that gives the predictive strength of a dummy encoding, without the memory or speed costs.
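One way to see why the dummy columns never need to be materialised (a sketch of the general idea only, not H2O's actual implementation; the function names are hypothetical): multiplying a 1-of-C row by a weight vector just picks out the weight of the active level, so an index lookup gives the same number.

```python
def dense_contribution(one_hot_row, weights):
    """Dot product of an explicit 1-of-C row with the weights."""
    return sum(x * w for x, w in zip(one_hot_row, weights))

def lookup_contribution(level_index, weights):
    """Same result without building the dummy row: index into the weights."""
    return weights[level_index]

weights = [0.3, -1.2, 0.7, 2.0]   # one weight per level of a 4-level factor
level_index = 2                   # the observed level for this row

one_hot_row = [1.0 if i == level_index else 0.0 for i in range(len(weights))]
dense = dense_contribution(one_hot_row, weights)
sparse = lookup_contribution(level_index, weights)
```

The dense version costs C multiplications and C stored values per row; the lookup costs one integer per row and one memory read.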
Cliff
I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?
# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)
# checking model assumptions
performance::check_model(model)
Created on 2022-08-23 by the reprex package (v2.0.1)
Alternative: Is downsampling OK? In an ML workflow I'd downsample for tuning, feature selection and feature engineering, for example. But I don't know if that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity in a downsampled data set and then estimate the coefficients with the full sample?)
Speeding up check_model
The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:
For models with many observations, or for more complex models in
general, generating the plot might become very slow. One reason might
be that the underlying graphic engine becomes slow for plotting many
data points. In such cases, setting the argument show_dots = FALSE
might help. Furthermore, look at the check argument and see if some of
the model checks could be skipped, which also increases performance.
Accordingly, you can turn off plotting of the individual data points with check_model(model, show_dots = FALSE). You can also restrict the output to specific checks, which reduces computation time, if you are not interested in all of them. For example, you could get only the posterior predictive check with check_model(model, check = "pp_check").
Implications of Downsampling
Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and post-estimation summaries conditioning on the data will change. Just how much it will change depends on variability of your observations and sample size. With millions of observations, it's probably unlikely to change much -- but maybe some rare data patterns can heavily influence your results during (post)-estimation.
Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, but your mileage may vary because the models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective, it would also be puzzling to do so: the residuals are part of estimation -- if you can fit the model, you can presumably also show the residuals in various ways.
That being said, these plots are also approximations to the actual distribution of interest anyway (i.e. you're implicitly estimating test statistics with some of these plots), and since the central limit theorem applies, things would look roughly the same if you cut out some observations, provided your data set is sufficiently large.
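A middle road consistent with the points above is to fit on the full data and subsample only what gets drawn. The sketch below uses plain Python with simulated data and a hypothetical `fit_simple_ols` helper, purely to show that the coefficients and residuals come from the full model while the plot receives a small random subset. The same idea carries over to R by sampling rows of fitted(model) and residuals(model).

```python
import random

def fit_simple_ols(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Simulated stand-in for a large data set (true intercept 1, true slope 2).
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100_000)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

# Estimate on ALL observations, so the coefficients are unaffected.
intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Subsample only the points handed to the plotting code.
plot_idx = rng.sample(range(len(x)), 2_000)
plot_points = [(fitted[i], residuals[i]) for i in plot_idx]
```

Because the residuals belong to the model you actually estimated, the diagnostic still describes that model; only the density of dots on screen changes.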
I am building in Python a credit scorecard using this public dataset: https://www.kaggle.com/sivakrishna3311/delinquency-telecom-dataset
It's a binary classification problem:
Target = 1 -> Good applicant
Target = 0 -> Bad applicant
I only have numeric continuous predictive characteristics.
In the credit industry it is a legal requirement to explain why an applicant got rejected (or didn't even get the maximum score): to meet that requirement, Adverse Codes are produced.
In a classic logistic regression approach, one would do this:
1. Calculate the Weight-of-Evidence (WoE) for each predictive
characteristic (forcing a monotonic relationship between the feature
values and the WoE or log(odds)). In the following example, the
higher the Network Age, the higher the Weight-of-Evidence (WoE):
2. Replace the data values with the corresponding Weight-of-Evidence.
For example, a value of 250 for Network Age would be replaced by
0.04 (which is the corresponding WoE).
3. Train a logistic regression.
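For readers unfamiliar with the first two steps, here is a minimal sketch in plain Python. The bin counts and bin edges are invented for illustration (they are not taken from the dataset), and it assumes the common convention WoE = ln(share of goods in the bin / share of bads in the bin); some texts use the inverse ratio.

```python
import bisect
import math

def weight_of_evidence(bins):
    """bins: list of (n_good, n_bad) per bin -> list of WoE values.

    Assumes every bin has at least one good and one bad applicant.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    return [math.log((g / total_good) / (b / total_bad)) for g, b in bins]

def replace_with_woe(value, edges, woe):
    """Look up the bin that `value` falls into and return that bin's WoE."""
    return woe[bisect.bisect_right(edges, value)]

# Hypothetical bins for a feature such as network age:
# below 200, 200 up to 400, 400 and above.
edges = [200, 400]
bins = [(100, 60), (300, 30), (600, 10)]   # (goods, bads) per bin

woe = weight_of_evidence(bins)             # increases from bin to bin here
transformed = replace_with_woe(250, edges, woe)
```

With these made-up counts the WoE rises monotonically across the bins, which is the property the binning step is meant to enforce before the logistic regression is trained.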
After some linear transformations you'd get something like this:
And therefore it'd be straightforward to assign the Adverse Codes, so that the bin with the maximum score doesn't return an Adverse Code. For example:
Now, I want to train an XGBoost model (which typically outperforms a logistic regression on imbalanced, low-noise data). XGBoost models are very predictive but need to be explained (typically via SHAP).
What I have read is that in order to make the model decision explainable you must ensure that the monotonic constraints are applied.
Question 1. Does it mean that I need to train the XGBoost on the Weight-of-Evidence transformed data like it's done with the Logistic Regression (see point 2 above)?
Question 2. In Python, the XGBoost package offers the option to set monotonic constraints (via the monotone_constraints option). If I don't transform the data by replacing the values with their Weight-of-Evidence (therefore removing the enforced monotonic relationship), does it still make sense to use "monotone_constraints" in XGBoost for a binary problem? I mean, does it make sense to use monotone_constraints with an XGBClassifier at all?
Thanks.
I am working on a problem where I want to see if a measure (test) is a good predictor of the outcome variable (performance). Performance is a bounded variable between 0 and 100. I am only thinking about the methodology for now and not working with the data yet.
I am aware that there are different models and methods that deal with bounded dependent variables, but from my understanding these are useful if one is interested in predictions?
I am interested in how much variance of the dependent variable (performance) is explained by my measure (test). I am not interested in predicting specific outcomes.
Is it OK to just use normal regression?
Do I need to account for the bounded dependent variable somehow?
You can scale your dependent variable to the [0, 1] interval and run a logistic regression, which maps every predicted value into that range.
If you can, you can use fractional logit models, typically used to predict continuous outputs in the [0, 1] interval.
Alternatively, if you are into Machine Learning, you can implement a Neural Network regressor with one output node with a sigmoid activation function.
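To show what the options above have in common, here is a small sketch of a fractional logit fitted by gradient ascent in plain Python. The data, the function names, and the learning-rate settings are all assumptions made for illustration; a real analysis would use a statistics package. The key property is that predictions can never leave the 0-100 range.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_fractional_logit(x, y01, lr=0.5, steps=5000):
    """Gradient ascent on the Bernoulli quasi-log-likelihood.

    y01 may be any fraction in [0, 1], not just 0 or 1.
    Returns (intercept, slope).
    """
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [yi - sigmoid(a + b * xi) for xi, yi in zip(x, y01)]
        a += lr * sum(errors) / n
        b += lr * sum(e * xi for e, xi in zip(errors, x)) / n
    return a, b

# Hypothetical data: centred test scores and performance on a 0-100 scale.
test_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
performance = [10.0, 25.0, 50.0, 80.0, 95.0]

y01 = [p / 100.0 for p in performance]      # scale to [0, 1]
a, b = fit_fractional_logit(test_scores, y01)
predicted = [100.0 * sigmoid(a + b * x) for x in test_scores]  # back to 0-100
```

A single sigmoid output node in a neural network with no hidden layers and this loss is the same model, which is why the three suggestions are closely related.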
Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, and there we can specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes|reference is No, or predicted No|reference is Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the Answer by Yueguoguo, there are also three more solutions: the standard Wrapper approach, hyperplane tuning, and the one in e1071.
The Wrapper approach (available out of the box, for example, in weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. A model trained to optimise accuracy on the resampled data is then (approximately) optimal under those costs.
The second idea is frequently used in text mining. Classifications in SVMs are derived from the distance to the hyperplane. For linearly separable problems, this distance is +1 or -1 for the support vectors. The classification of a new example then basically depends on whether the distance is positive or negative. However, one can also shift this threshold and not make the decision at 0 but move it, for example, towards 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken.
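The threshold-shifting idea can be sketched without any SVM library. In the plain-Python example below the decision values, labels, and unit costs are invented, and `classify` and `total_cost` are hypothetical helpers; the cost function plays the same role as the cost argument mentioned in the question.

```python
def classify(decision_values, threshold=0.0):
    """Predict 1 when the signed distance to the hyperplane exceeds threshold."""
    return [1 if d > threshold else 0 for d in decision_values]

def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum of unit costs: cost_fp per false positive, cost_fn per false negative."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return cost_fp * fp + cost_fn * fn

# Hypothetical signed distances from a trained SVM, with the true labels.
decision_values = [-1.5, -0.6, -0.2, 0.3, 0.9, 1.4]
y_true = [0, 0, 1, 0, 1, 1]

# False negatives are assumed to be five times as costly as false positives.
default_cost = total_cost(y_true, classify(decision_values, 0.0),
                          cost_fp=1, cost_fn=5)
shifted_cost = total_cost(y_true, classify(decision_values, -0.5),
                          cost_fp=1, cost_fn=5)
```

Here the default threshold of 0 costs 6 (one false positive plus one false negative), while moving the threshold to -0.5 removes the false negative and costs 1. In practice the threshold would be chosen on held-out data, not on the training set.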
The loss function for the SVM hyperplane parameters is fixed by the theoretical foundation of the algorithm, so it is not something you tune directly. What is tuned, via cross-validation, are the hyperparameters. Say an RBF kernel is used: cross-validation then selects the combination of C (cost) and gamma (kernel parameter) that gives the best performance, measured by a chosen metric (e.g., mean squared error). In e1071, the performance can be obtained by using the tune function, where the range of hyperparameters as well as the cross-validation settings (i.e., 5-, 10- or more fold cross-validation) can be specified.
To obtain comparative cross-validation results by using Area-Under-Curve type of error measurement, one can train different models with different hyperparameter configurations and then validate the model against sets of pre-labelled data.
Hope the answer helps.
In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, eg sensible (?) treatment of factors, missing values, putting variables into the correct order, etc.
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions before I do so:
Are there any issues that complicate using the formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and wide datasets maybe requiring sparse model matrices.)
Is there any existing package that does this?
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from Github: https://github.com/Hong-Revo/glmnetUtils
Or in R, using devtools::install_github:
install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)
From the readme:
Some quality-of-life functions to streamline the process of fitting
elastic net models with glmnet, specifically:
glmnet.formula provides a formula/data frame interface to glmnet.
cv.glmnet.formula does a similar thing for cv.glmnet.
Methods for predict and coef for both the above.
A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in
the help page for cv.glmnet. Optionally does the cross-validation in
parallel.
Methods for plot, predict and coef for the above.
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) roughly a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
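The gigabyte figure is easy to check with back-of-the-envelope arithmetic. The snippet below assumes the matrix is stored as 4-byte integers and ignores the extra row for the response and any interaction columns, so it is a lower bound.

```python
p = 16_000                 # number of variables in the model
bytes_per_cell = 4         # assumption: cells stored as 4-byte integers

size_bytes = p * p * bytes_per_cell
size_gb = size_bytes / 1e9  # decimal gigabytes
```

That gives 1,024,000,000 bytes, i.e. roughly 1 GB for a p x p matrix alone, and the cost grows quadratically as p increases.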
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
Update Oct-2016
I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.
Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.
The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
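To illustrate the difference between the two factor treatments, here is a sketch in plain Python of building a model matrix one variable at a time. It is not the glmnetUtils implementation; `build_model_matrix`, its `drop_baseline` flag, and the toy data are hypothetical and exist only to contrast N columns per factor with the N-1 columns of treatment contrasts.

```python
def build_model_matrix(data, predictors, drop_baseline=False):
    """Build a model matrix term by term.

    data: dict mapping column name -> list of values.
    Numeric columns are copied as-is; other columns are expanded into
    indicator columns, one per level (or N-1 if drop_baseline is True).
    Returns (column_names, rows).
    """
    n_rows = len(next(iter(data.values())))
    names = []
    rows = [[] for _ in range(n_rows)]
    for name in predictors:
        values = data[name]
        if all(isinstance(v, (int, float)) for v in values):
            names.append(name)
            for row, v in zip(rows, values):
                row.append(float(v))
        else:
            levels = sorted(set(values))
            if drop_baseline:
                levels = levels[1:]      # first level becomes the baseline
            names.extend(f"{name}{lvl}" for lvl in levels)
            for row, v in zip(rows, values):
                row.extend(1.0 if v == lvl else 0.0 for lvl in levels)
    return names, rows

data = {"x1": [1.5, 2.0, 3.5], "region": ["north", "south", "west"]}

# Default here: one column per level, no baseline (suits regularised fits).
full_names, full_rows = build_model_matrix(data, ["x1", "region"])

# Treatment-contrast style: the first level is dropped, as model.matrix would.
contrast_names, contrast_rows = build_model_matrix(
    data, ["x1", "region"], drop_baseline=True)
```

With the full set of indicator columns, no level is singled out, so shrinking each coefficient towards zero treats all levels symmetrically.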
The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.
Update Apr-2017
After a few hiccups, this is finally on CRAN.