Minimal differences between R and PMML XGBoost probability outputs - r

I have built an XGBoost model in R and exported it to PMML (with r2pmml).
I have tested the same dataset in R and with the PMML (in Java); the probability outputs are very close, but they all differ by something between 1e-8 and 1e-10.
These differences are too small to be caused by an issue with the input data.
Is this classic rounding behaviour between different languages/software, or did I make a mistake somewhere?

the probability outputs are very close, but they all differ by something between 1e-8 and 1e-10.
The XGBoost library uses float32 data type (single-precision floating-point), which has a "natural precision" of around 1e-7 .. 1e-8 in this range (probability values, between 0 and 1).
So, your observed difference is less than the "natural precision", and should not be a cause for further concern.
The (J)PMML representation carries out exactly the same computations (summing booster float values and applying a normalization function to the result) as the native XGBoost representation.
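As a rough illustration of that "natural precision" (a NumPy sketch, not the actual XGBoost or JPMML code): float32 simply cannot represent a probability in this range more finely than a few 1e-8.

import numpy as np

p = 0.7310585786300049               # a probability computed in double precision (e.g. on the R side)
p32 = np.float32(p)                  # the same value stored as float32 (as XGBoost does internally)
print(abs(p - float(p32)))           # on the order of 1e-8 (at most half the float32 spacing)
print(np.spacing(np.float32(0.73)))  # gap between adjacent float32 values near 0.73: about 6e-8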

Related

Which loss function is used in R package gbm for multinomial distribution?

I am using the R package gbm to fit probabilistic classifiers in a dataset with > 2 classes. I am using distribution = "multinomial" as the argument; however, I have had difficulty finding out what is actually implemented by that choice.
The help function for gbm states that
Currently available options are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), classes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right censored observations), "quantile", or "pairwise" (ranking measure using the LambdaMart algorithm).
and does not list multinomial, whereas the paragraph preceding the one I copied states that
... If not specified, gbm will try to guess: ... if the response is a factor, multinomial is assumed; ...
I would like to know which loss function is implemented if I specify distribution = "multinomial". The documentation in the vignette which can be accessed via
utils::browseVignettes("gbm")
does not contain the word "multinomial" or any descriptions of what that argument implies.
I have tried to look at the package source code, but I can't find the information there either. It seems that the relevant things happen in the C++ functions in the file /src/multinomial.cpp; however, my knowledge of C++ is too limited to understand what is going on there.

Meaning of confidence factor in J48

I am trying to use the J48 classifier from the RWeka library in R (the C4.5 algorithm). I can parametrize this classifier with the C parameter, which is the 'confidence factor'. What does this value exactly mean? I know that a bigger value means I trust more that my learning set is a good representation of the whole population, and that the algorithm will be less likely to prune. But what does it mean exactly? Is there any formula for how to interpret this value?

H2O in R - Automatic Data Processing

I notice that the H2O package mentions that it:
preprocesses the data to be standardized for compatibility with the activation functions (recall Table 1's summary of each activation function's target space). Since the activation function does not generally map into the full spectrum of real numbers, R, we first standardize our data to be drawn from N(0, 1). Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space. For autoencoding, the data is normalized (instead of standardized) to the compact interval U(−0.5, 0.5), to allow bounded activation functions like Tanh to better reconstruct the data.
However, I don't fully understand. My impression was (here, and here) that the categorical variables should be broken into 1-of-C dummies and the continuous data normalised; then everything should be standardised to [-1, 1].
I also don't see a way of specifying the neurons for the read-out layer. I thought that if we have a categorical output variable then we want to use a softmax activation function (and encode it as 1-of-C); if we have a continuous output (e.g. price) then we scale that to [-1, 1] and use tanh; and if we have a single binary output then we can use logistic and code it as [0, 1].
For classification and regression (i.e., supervised mode), H2O Deep Learning does the following:
The input into the first neural network layer is indeed 1-of-C dummies (either 0 or 1) for categorical features. Continuous features are standardized (not normalized): de-meaned and scaled to unit variance.
For regression, the response variable is also standardized internally, to allow the (single) output neuron's activation value to be compared against it. However, for presentation to the user during scoring, the predictions are de-standardized into the original space.
For classification, we use Softmax to get probabilities for the C classes, even for binary classification.
The documentation you cited also refers to unsupervised autoencoding (by enabling the autoencoder flag). In that case, the input is normalized (i.e., scaled by 1/(max-min)) instead of being standardized. That is needed to allow the auto-encoder to have fully overlapping input and output spaces.
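A rough NumPy sketch of the two preprocessing schemes described above (this is just an illustration of the idea, not H2O's actual implementation; the feature values and category levels are made up):

import numpy as np

x = np.array([3.0, 8.0, 1.0, 6.0])                   # a continuous feature

# Supervised mode: standardize (de-mean, scale to unit variance)
standardized = (x - x.mean()) / x.std()

# Autoencoder mode: normalize to a compact interval, here [-0.5, 0.5]
normalized = (x - x.min()) / (x.max() - x.min()) - 0.5

# Categorical features enter the first layer as 1-of-C indicator columns (0 or 1)
levels = np.array(["red", "green", "blue"])
cat = np.array(["green", "blue", "green", "red"])
one_hot = (cat[:, None] == levels[None, :]).astype(int)

print(standardized)
print(normalized)
print(one_hot)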
H2O achieves the effect of 1-of-C dummy encoding, without the cost. The exact details vary by algorithm, but there's always an obvious algorithmic optimization that gives the predictive strength of a dummy encoding, without the memory or speed costs.
Cliff

calibration and liftchart with caret R package

I am comparing various predictive models on a binary classification task using the caret R package, with respect to their predictive performance (liftChart) and prediction accuracy (calibration plot). I found the following issues:
1. Sometimes the lift function is very slow when the number of observations is quite big or there are several competing classifiers.
2. In addition, I wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced model (the average probability is 5%) and the calibration plot function assumes evenly spaced cuts.
The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
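To see why that scales badly, here is a minimal sketch of a lift-style calculation evaluated at every unique predicted probability (plain NumPy, not caret's implementation; the data are simulated):

import numpy as np

rng = np.random.RandomState(0)
prob = rng.uniform(size=20_000)                         # predicted probabilities
y = (rng.uniform(size=20_000) < 0.05).astype(int)       # imbalanced outcome, ~5% positives

cuts = np.unique(prob)                                  # one evaluation point per unique probability
base_rate = y.mean()
lift = [y[prob >= t].mean() / base_rate for t in cuts]  # O(n) work at each of ~n cuts
print(len(cuts), lift[0])                               # ~20,000 cuts -> noticeably slow already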
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now, but those shouldn't be a big deal to change (you could always contribute solutions too).
Max

Why use softmax as opposed to standard normalization?

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution: p_i = exp(q_i) / sum_j exp(q_j), where the q_i are the raw output scores.
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
There is one nice attribute of softmax compared with standard normalisation.
It reacts to low stimulation of your neural net (think blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, think crisp image) with probabilities close to 0 and 1.
Standard normalisation, by contrast, does not care as long as the proportions are the same.
Have a look at what happens when softmax gets a 10-times-larger input, i.e. your neural net got a crisp image and a lot of neurons were activated:
>>> softmax([1,2]) # blurry image of a ferret
[0.26894142, 0.73105858] # it is a cat perhaps !?
>>> softmax([10,20]) # crisp image of a cat
[0.0000453978687, 0.999954602] # it is definitely a CAT !
And then compare it with standard normalisation:
>>> std_norm([1,2]) # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
>>> std_norm([10,20]) # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
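For reference, minimal definitions that reproduce the numbers above up to rounding (softmax and std_norm are just the helper names used in this example, not library functions):

import math

def softmax(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def std_norm(x):
    total = sum(x)
    return [v / total for v in x]     # plain normalization: divide each output by the sum

print(softmax([1, 2]))     # ≈ [0.26894142, 0.73105858]
print(softmax([10, 20]))   # ≈ [0.0000453979, 0.9999546]
print(std_norm([1, 2]))    # [0.3333333333333333, 0.6666666666666666]
print(std_norm([10, 20]))  # [0.3333333333333333, 0.6666666666666666]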
I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpreted the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016), in section 6.2.2.
Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as softmax(z)_i = exp(z_i) / sum_j exp(z_j).
Very Short Explanation
The exp in the softmax function roughly cancels out the log in the cross-entropy loss, causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient when the model is wrong, allowing it to correct itself quickly. Thus, a wrong saturated softmax does not cause a vanishing gradient.
Short Explanation
The most popular method to train a neural network is Maximum Likelihood Estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset and thus the sum of the log-likelihoods of the samples indexed by k: sum_{k=1..m} log P(y^(k) | x^(k); theta).
Now, we only focus on the softmax here, with z already given, so we can replace P(y = i | x; theta) with softmax(z)_i, where i is the correct class of the kth sample. Now we see that when we take the logarithm of the softmax to calculate the sample's log-likelihood, we get log softmax(z)_i = z_i - log sum_j exp(z_j), which for large differences in z roughly approximates to z_i - max_j(z_j).
First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:
If the model is correct, then max(z) will be z_i. Thus, the log-likelihood asymptotically approaches zero (i.e. a likelihood of 1) as the difference between z_i and the other entries in z grows.
If the model is incorrect, then max(z) will be some other z_j > z_i. So, the addition of z_i does not fully cancel out -z_j and the log-likelihood is roughly (z_i - z_j). This clearly tells the model what to do to increase the log-likelihood: increase z_i and decrease z_j.
We see that the overall log-likelihood will be dominated by samples where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate. It is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for the Mean Squared Error, for example.
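A small numerical check of that last claim (my own sketch, not from the book): take a badly wrong, saturated prediction and compare the gradient of the cross-entropy loss with the gradient of a squared-error loss with respect to z.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.0, 10.0])           # the model strongly (and wrongly) favors class 1
y = np.array([1.0, 0.0])            # the true class is 0 (one-hot target)
s = softmax(z)

grad_ce = s - y                     # gradient of cross-entropy w.r.t. z
J = np.diag(s) - np.outer(s, s)     # Jacobian ds/dz of the softmax
grad_mse = J.T @ (s - y)            # gradient of 0.5*||s - y||^2 w.r.t. z

print(grad_ce)    # roughly [-1, 1]: a large, useful signal despite the saturated softmax
print(grad_mse)   # on the order of 1e-4: the gradient has all but vanished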
Long Explanation
If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:
Why sigmoid function instead of anything else?
The softmax is the generalization of the sigmoid for multi-class problems justified analogously.
I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.
On the surface, the softmax algorithm seems to be a simple non-linear normalization (we are spreading the data with an exponential). However, there is more to it than that.
Specifically, there are a couple of different views (same link as above):
Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.
Probabilistic View - from this perspective we are in fact looking at the log-probabilities; when we exponentiate them we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).
In summary, even though the softmax equation seems like it could be arbitrary, it is NOT. It is actually a rather principled way of normalizing the classifications to minimize the cross-entropy/negative log-likelihood between predictions and the truth.
The values of q_i are unbounded scores, sometimes interpreted as log-likelihoods. Under this interpretation, in order to recover the raw probability values, you must exponentiate them.
One reason that statistical algorithms often use log-likelihood loss functions is that they are more numerically stable: a product of probabilities may be represented by a very small floating-point number. Using a log-likelihood loss function, a product of probabilities becomes a sum.
Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
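A quick numerical illustration of the stability point (a NumPy sketch):

import numpy as np

probs = np.full(1000, 0.01)      # a thousand small per-sample likelihoods

print(np.prod(probs))            # 0.0 -- the product underflows even in float64
print(np.sum(np.log(probs)))     # about -4605.17 -- the log-likelihood is perfectly representable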
We are looking at a multiclass classification problem. That is, the predicted variable y can take one of k categories, where k > 2. In probability theory, this is usually modelled by a multinomial distribution. The multinomial distribution is a member of the exponential family of distributions. We can reconstruct the probability P(y = k | x) using properties of exponential family distributions, and it coincides with the softmax formula.
If you believe the problem can be modelled by another distribution, other than multinomial, then you could reach a conclusion that is different from softmax.
For further information and a formal derivation please refer to CS229 lecture notes (9.3 Softmax Regression).
Additionally, a useful trick usually applied to softmax is: softmax(x) = softmax(x + c), i.e. softmax is invariant to constant offsets in the input.
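That invariance is what makes a numerically stable implementation possible; a common sketch is to choose c = -max(x), so the largest exponent becomes exp(0) = 1:

import numpy as np

def softmax_naive(x):
    e = np.exp(np.asarray(x, dtype=float))
    return e / e.sum()

def softmax_stable(x):
    z = np.asarray(x, dtype=float)
    e = np.exp(z - z.max())              # uses softmax(x) == softmax(x + c) with c = -max(x)
    return e / e.sum()

print(softmax_naive([1000.0, 1001.0]))   # [nan nan] -- exp(1000) overflows
print(softmax_stable([1000.0, 1001.0]))  # [0.26894142 0.73105858]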
The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.
From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" https://arxiv.org/abs/1511.05042
The authors explored some other functions, among which are a Taylor expansion of exp and the so-called spherical softmax, and found that sometimes they might perform better than the usual softmax.
I think one of the reasons is to deal with negative numbers and division by zero, since exp(x) is always positive and greater than zero.
For example, for a = [-2, -1, 1, 2] the sum will be 0; we can use softmax to avoid division by zero.
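A quick check of that example (NumPy):

import numpy as np

a = np.array([-2.0, -1.0, 1.0, 2.0])
print(a / a.sum())                  # division by zero: [-inf -inf inf inf] (plus a runtime warning)
print(np.exp(a) / np.exp(a).sum())  # a valid distribution: all entries positive, summing to 1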
Adding to Piotr Czapla's answer: the greater the input values, the greater the probability of the maximum input, for the same proportion relative to the other inputs:
Suppose we change the softmax function so the output activations are given by a^L_j = exp(c z^L_j) / sum_k exp(c z^L_k),
where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c→∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
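A quick numerical sketch of that limit (my own code, not from the source): as c grows, the output distribution concentrates on the largest input, which is exactly the "softened max" behaviour.

import numpy as np

def softmax_c(z, c):
    z = np.asarray(z, dtype=float)
    e = np.exp(c * (z - z.max()))   # shift by the max so large c does not overflow
    return e / e.sum()

z = [1.0, 2.0, 3.0]
for c in [1, 5, 50]:
    print(c, softmax_c(z, c))
# c=1  -> roughly [0.090 0.245 0.665]
# c=5  -> roughly [0.000 0.007 0.993]
# c=50 -> essentially [0 0 1], the one-hot indicator of the maximum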
While it is indeed somewhat arbitrary, the softmax has desirable properties such as:
being easily differentiable (df/dx = f*(1-f))
when used as the output layer for a classification task, the in-fed scores are interpretable as log-odds
