How is xgboost quality calculated? - r

Could someone explain how the Quality column in the xgboost R package is calculated in the xgb.model.dt.tree function?
In the documentation it says that Quality "is the gain related to the split in this specific node".
When you run the following code, given in the xgboost documentation for this function, Quality for node 0 of tree 0 is 4000.53, yet I calculate the Gain as 2002.848
data(agaricus.train, package='xgboost')
train <- agarics.train
X = train$data
y = train$label
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
xgb.model.dt.tree(agaricus.train$data#Dimnames[[2]], model = bst)
p = rep(0.5,nrow(X))
L = which(X[,'odor=none']==0)
R = which(X[,'odor=none']==1)
pL = p[L]
pR = p[R]
yL = y[L]
yR = y[R]
GL = sum(pL-yL)
GR = sum(pR-yR)
G = sum(p-y)
HL = sum(pL*(1-pL))
HR = sum(pR*(1-pR))
H = sum(p*(1-p))
gain = 0.5 * (GL^2/HL+GR^2/HR-G^2/H)
gain
I understand that Gain is given by the following formula:
Since we are using log loss, G is the sum of p-y and H is the sum of p(1-p) - gamma and lambda in this instance are both zero.
Can anyone identify where I am going wrong?

OK, I think I've worked it out. The value for reg_lambda is not 0 by default as given in the documentation, but is actually 1 (from param.h)
Also, it appears that the factor of a half is not applied when calculating the gain, so the Quality column is double what you would expect. Lastly, I also don't think gamma (also called min_split_loss) is applied to this calculation either (from update_hitmaker-inl.hpp)
Instead, gamma is used to determine whether to invoke pruning, but is not reflected in the gain calculation itself, as the documentation suggests.
If you apply these changes, you do indeed get 4000.53 as the Quality for node 0 of tree 0, as in the original question. I'll raise this as an issue to the xgboost guys, so the documentation can be changed accordingly.

Related

Calculating MDE for a difference-in-difference clustered randomized trial (in R)

I'm looking to calculate the Minimum Detectable Effect for a potential Difference-in-Differences design where treatment and control are assigned at the cluster level and my outcome at the individual level is dichotomous. To do this I'm working in R and using the clusterPower package, specifically I'm using the cpa.did.binary function. In the help file for this function it notes that d is "The expected absolute difference." I'm interpreting this as being a Minimum Detectable Effect, is that correct? If this is the MDE, is this output the expected difference in logits?
Thanks to anyone that can help. Alternatively, if you have a better package or way of calculating MDE that is also welcome.
# Input
cpa.did.binary(power = .8,
nclusters = 10,
nsubjects = 100,
p = .5,
d = NA,
ICC = .04,
rho_c = 0,
rho_s = 0)
# Output
> d
>0.2086531

stop xgboost based on eval_metric

I am trying to run xgboost for a problem with very noisy features and interested in stopping the number of rounds based on a custom eval_metric that I have defined.
Based on domain knowledge I know that when the eval_metric (evaluated on the training data) goes above a certain value xgboost is overfitting. And I would like to just take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this ?
It would be somewhat in line with the early stopping criteria but not exactly.
Alternately, if there is a possibility to get the model from an intermediate round ?
Here is an example to better explain by question. (Using the toy example that comes with xgboost help docs and using the default eval_metric)
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now lets say from domain knowledge I know that once the train error goes below 0.015 (third round in this case), any further rounds only lead to over fitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction over a different dataset ?
I need to run the training process over many different datasets and I have no sense of how many rounds it might take to train to get the error below a fixed number, hence I can't set the nrounds argument to a predetermined value. Only intuition I have is that once the training error goes below a number I need to stop further training rounds.
In the absence of any code you have tried or any data you are using then try something like this:
require(xgboost)
library(Metrics) # for rmse to calculate errors
# Assume you have a training set db.train and have some
# feature indices of interest and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100 but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
for (rounds in seq(1, 100, 1)) {
# train
bst <- xgboost(data = as.matrix(db.train[,predictors]),
label = db.train[,outcomeName],
max.depth = depth,
nround = rounds,
eval_metric = "logloss",
objective = "binary:logistic",
verbose=TRUE)
gc()
# predict
predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors]),
outputmargin = TRUE))
err <- rmse(as.numeric(db.test[, outcomeName]), as.numeric(predictions))
if (err < smallestError) {
smallestError = err
print(paste(depth,rounds,err))
}
}
}
You could adapt this code for your particular evaluation metric and print this out to suit your situation. Similarly you could introduce a break in the code when some specified number of rounds is reached that satisfies some condition you seek to achieve.

Use nls with fixed parameters?

I've been trying to use the nls function to fit experimental data to a model that I have, expressed by a function of 3 parameters, let's say a, b and c. However, I would like to keep b and c fixed, since I know their true value, and fit only the parameter a:
nls(formula=pattern~myfunction(a, b, c), start=list(a=estimate_a), control=list(maxiter=50, tol=5e-8, warnOnly=T), algorithm="port", weights=sqrt(pattern), na.action=na.exclude, lower=0, upper=1)
But apparently this does not work... How can I tell R that b and c are fixed?
To fix a parameter (1) set it before running nls and (2) do not include it in start. Here is a self contained example showing the fixing of a to 0 :
a <- 0
nls(demand ~ a + b * Time, BOD, start = list(b = 1))
A quick solution:
my_new_function <- function(a) myfunction(a, b = b_true, c = c_true)
nls(formula = pattern ~ my_new_function(a), start = list(a = estimate_a),
control = list(maxiter = 50, tol = 5e-8, warnOnly = TRUE), algorithm = "port",
weights = sqrt(pattern), na.action = na.exclude, lower = 0, upper = 1)
The issue of fixed (or MASKED) parameters has been around a long time. Ron Duggleby of U. of Queensland introduced me to the term "masked" when I was on sabbatical there in 1987, and I have had masks in my own software for nonlinear optimization and nonlinear least squares since. In particular, the CRAN package "nlsr" or the developmental "nlsr2" (https://gitlab.com/nashjc/improvenls/-/tree/master/nlsr-rox) handle fixed parameters reliably.
Another approach is to use "nls()" with the "port" algorithm and set upper and lower bounds equal for the fixed parameters. I'm not sure if this is pushing the envelope, and have only tried a couple of examples. For those examples, "minpack.lm::nlsLM()" using the same equal bounds approach seems to give incorrect results sometimes.
John Nash

Separating circles using kernel PCA

I am trying to reproduce a simple example of using kernel PCA. The objective is to separate out the points from two concentric circles.
Creating the data:
circle <- data.frame(radius = rep(c(0, 1), 500) + rnorm(1000, sd = 0.05),
phi = runif(1000, 0, 2 * pi),
group = rep(c("A", "B"), 500))
#
circle <- transform(circle,
x = radius * cos(phi),
y = radius * sin(phi),
z = rnorm(length(radius))) %>% select(group, x, y, z)
TFRAC = 0.75
#
train <- sample(1:1000, TFRAC * 1000)
circle.train <- circle[train,]
circle.test <- circle[-train,]
> head(circle.train)
group x y z
491 A -0.034216 -0.0312062 0.70780
389 A 0.052616 0.0059919 1.05942
178 B -0.987276 -0.3322542 0.75297
472 B -0.808646 0.3962935 -0.17829
473 A -0.032227 0.0027470 0.66955
346 B 0.894957 0.3381633 1.29191
I have split the data up into training and testing sets because I have the intention (once I get this working!) of testing the resulting model.
In principal kernel PCA should allow me to separate out the two classes. Other discussions of this example have used the Radial Basis Function (RBF) kernel, so I adopted this too. In R kernel PCA is implemented in the kernlab package.
library(kernlab)
circle.kpca <- kpca(~ ., data = circle.train[, -1], kernel = "rbfdot", kpar = list(sigma = 10), features = 1)
I requested only the first component and specified the RBF kernel. This is the result:
There has definitely been a major transformation of the data, but the transformed data is not what I was expecting (which would be a nice, clean separation of the two classes). I have tried fiddling with the value of the parameter sigma and, although the results do vary dramatically, I still didn't get what I was expecting. I assume that sigma is related to the parameter gamma mentioned here, possibly via the relationship given here (without the negative sign?).
I'm pretty sure that I am making a naive rookie error here and I would really appreciate any pointers which would get me onto the right track.
Thanks,
Andrew.
Try sigma = 20. I think you will get the answer you are looking for. The sigma in kernlab is actually what is usually referred to as gamma for rbf kernel so they are inversely related.

how to optimize parameters of of the support vector machine by using the genetic algorithm with R

In order to learn the support vector machine, we must determine various parameters.
For example, there are parameters such as cost and gamma.
I am trying to determine sigma and gamma parameters of SVM Using "GA" package and "kernlab" package of R.
I use accuracy as the evaluation function of the genetic algorithm.
I have created the following code, and I ran it.
library(GA)
library(kernlab)
data(spam)
index <- sample(1:dim(spam)[1])
spamtrain <- spam[index[1:floor(dim(spam)[1]/2)], ]
spamtest <- spam[index[((ceiling(dim(spam)[1]/2)) + 1):dim(spam)[1]], ]
f <- function(x)
{
x1 <- x[1]
x2 <- x[2]
filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot",kpar=list(sigma=x1),C=x2,cross=3)
mailtype <- predict(filter,spamtest[,-58])
t <- table(mailtype,spamtest[,58])
return(t[1,1]+t[2,2])/(t[1,1]+t[1,2]+t[2,1]+t[2,2])
}
GA <- ga(type = "real-valued", fitness = f, min = c(-5.12, -5.12), max = c(5.12, 5.12), popSize = 50, maxiter = 2)
summary(GA)
plot(GA)
However, When I call the GA function,the following error is returned.
"No Support Vectors found. You may want to change your parameters"
I can not understand why the code is bad.
Using GA for SVM parameters is not a good idea - it should be sufficient to just do a regular grid search ( two for loops, one for C and one for gamma values).
In Rs library e1071 (which also provides SVMs) there is a methodtune.svm` which looks for best parameters using a grid search.
Example
data(iris)
obj <- tune.svm(Species~., data = iris, sampling = "fix",
gamma = 2^c(-8,-4,0,4), cost = 2^c(-8,-4,-2,0))
plot(obj, transform.x = log2, transform.y = log2)
plot(obj, type = "perspective", theta = 120, phi = 45)
Which also shows one important thing - you should look for a good C and gamma values in a geometric manner, so eg. 2^x for x in {-10,-8,-6,-6,-4,-2,0,2,4}.
GA is an algorithm for meta optimisation, where the parameters space is huge, and there is no easy correlation between parameters and the optimising function. It requires tuning of much more parameters then SVM (number of generations, size of the population, mutation probability, crossing probability, mutation operator, crossing operator ...) so it completely useless approach here.
And of course - as it was earlier stated in comments - C and Gamma have to be strictly positive.
For more details about using e1071 take a look at the CRAN document: http://cran.r-project.org/web/packages/e1071/e1071.pdf

Resources