I have not got a clear idea about how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar laber indicating the index of class probability output is one option, while another is a 2D label where the rows are class probabilities, or one-hot encoded variable, like c(1, 0, 0).
What puzzles me though is that:
I can use sclalar label values that go beyong indexing, like 4 in my
example below -- without warning or error. Why is that?
When my label is a negative scalar or an array with a negative value,
the model converges to uniform probablity distribution over classes.
For example, is this expected that actor_train.y = matrix(c(0, -1,v0), ncol = 1) results in equal probabilities in the softmax output?
I try to use softmax MXNET classifier to produce the policy gradient
reifnrocement learning, and my negative rewards lead to the issue
above: uniform probability. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data') actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
- sum(label * log(pred)) }
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected in case of negative targets. I made simulations in R using a simple NN and optim function to get the same result.
Looks like adam or SGD may be buggy or whatever in case of multinomial classification... I also used to get stuck at the fact those optimizers did not converge to a perfect solution on just 1 example, while rmsprop does! Be aware!
Related
General description of my problem
I am performing a Poisson regression using LightGBM in R.
I am using an "offset" for the training, similar to using log(time) in a GLM as the offset when modelling insurance claims because we want to ensure that expected value of the response is proportional to time. I do this using the init_score parameter within lab.train().
I am using the "continue training" option in lgb.train (where you specify a value for init_model). This is because I want to build a "stumps" model first, and then continue training with a more complex model. This is to help me identify potential interaction terms in the data. This is just for background why I am doing this - not relevant to the specific issue described below.
However, when I continue training, the offset originally specified in the first model I build is no longer used by the fitting process. I think init_model overrides any value of init_score, but init_model does NOT itself contain or allow for init_score. So, as far as I can see, the init_score is totally lost from the fitting process once you continue training using init_model.
This means that the "starting point" when continuing to train a model is not the "finishing point" from the original model build. e.g. in my example below, I want the poisson log-likelihood error metric for models 2 and 3 to "start" from where model 1 finished. This isn't the case - but surely that is what "continue training" should deliver?
I have entered comments into the code below to explain the issue more clearly.
Reproducible example
library(lightgbm)
library(data.table)
# simulate some data
# z follows a Poisson distribution
# the mean of z is given by t * exp(x+y), where t is the "time exposed to risk"
# t is uniform(0,10)
# x and y are uniform(0,1)
# I want to specify log(t) using init_score in the lightGBM
# i.e. just like Poisson regression in insurance where log(t) is the offset in a GLM or GBM
n <- 10000 # number of rows
set.seed(42)
d <- data.table(t = runif(n,0,10), x = runif(n,0,1), y = runif(n,0,1))
d[, z := rpois(n, t * exp(x+y))]
# check weighted mean looks about right
# should get actual = 2.957188 and
# underlying = 2.939975
d[, list(actual = sum(z)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# build a lightGBM using 100 rounds and specify log(t) as init_score
feature_cols <- c('x','y')
dm <- as.matrix(d[, ..feature_cols])
l_train <- lgb.Dataset(dm, label=d[,z], free_raw_data = FALSE)
setinfo(l_train, "init_score", log(d$t))
params <- list(objective='poisson', metric = 'poisson')
lgbm_1 <- lgb.train(params = params,
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_1 <- lgb.get.eval.result(lgbm_1, "train", 'poisson')
# get the model predictions and check that they are close to expected
# remember that we need to manually apply the init_score to get the prediction
# i.e. we need to add log(t) onto the raw score, or multiply the scaled prediction by t
# the predictions are all very close
d[, lgbm_predicted_1 := t*predict(lgbm_1, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# save the model
lgb.save(lgbm_1, 'lgbm_1.txt')
# ATTEMPT A - CONTINUE TRAINING FROM MODEL 1
# don't change the init_score
# note iterations in console start at 101 because we are continuing training
# however, the error metric (poisson log likelihood)
# start from a totally different value to where the first model ended
lgbm_2 <- lgb.train(params = params,
init_model = 'lgbm_1.txt',
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_2 <- lgb.get.eval.result(lgbm_2, "train", 'poisson')
# check predictions - predicted_2 are WAY TOO HIGH now!
# I think this is because lightGBM uses the predictions from the first model
# as the starting point for training
# but the predictions from model 1 DO NOT ALLOW FOR THE log(t) being the offset to the original model!
d[, lgbm_predicted_2 := t*predict(lgbm_2, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# ATTEMPT B - try init_score = 0?
# doesn't seem to make any difference
# so my hypothesis is that init_score is being ignored
# and over-written by the init_model
# but... how does the original init_score ever get back into the fitting process?
# init_score + init_model is a good stating point
# init_model on it's own is not
setinfo(l_train, "init_score", rep(0, nrow(d)))
lgbm_3 <- lgb.train(params = params,
valids = list(train = l_train),
init_model = 'lgbm_1.txt',
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_3 <- lgb.get.eval.result(lgbm_3, "train", 'poisson')
# check predictions - models 2 and 3 are identical, the init_score made no difference
d[, lgbm_predicted_3 := t*predict(lgbm_3, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
predicted_3 = sum(lgbm_predicted_3)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# compare training logs
# question - why do V2 and V3 not start from the "finishing" point of V1?
# it's because the init_model is wrong, because it doesn't allow for the init_score
logs <- data.table(v1 = train_log_1, v2 = train_log_2, v3 = train_log_3)
I'm confused with the eigenstrat function. Do I use the output top k eigenvectors in succeeding regressions for ancestry adjustment? (since it is n x k) But the papers say the adjustment using 4 principal components (solving for this will make the matrices non-conformable - assuming I followed the input genofile where the rows are the p SNPs and the n subjects are the columns). Basically, I'm confused on the usage of eigenvector & principal components in papers and documentation. I think people use it interchangeably.
I tried following the documentation so I used the output eigenvectors in the succeeding regressions.
(basically what's on the documentation)
write.table(t(genotype), file = "riskalleles.txt", quote = FALSE, sep = "", row.names = FALSE, col.names = FALSE)
pca.eg <- eigenstrat(genoFile = "riskalleles.txt", outFile.Robj = "eigenstrat.result.list" outFile.txt = "riskalleles.result.txt", rm.marker.index = NULL,rm.subject.index = NULL, miss.val = 9, num.splits = 10,topK = NULL, signt.eigen.level = 0.01, signal.outlier = FALSE, iter.outlier = 5, sigma.thresh = 6)
ev <- pca.eg$topK.eigenvectors
gamma <- t(ev)%*%genotype
adjusted <- t(genotype - ev%*%gamma)
(solving for PCs using formula from my multivariate class)
pc <- t(genotype)%*%ev
I´m working on a STM Model (topicmodelling) and i´d like to evaluate and verify the model, but i´m not sure how to do it. My code is:
Corpus.STM <- readCorpus(dtm, type = "slam")
Model choice:
BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
str(BestM1.)
plot.searchK(BestM1.)
plot.STM(BestM2)
plot.searchK(BestM3.)
#27 seems to be a good choice
#Heldout
set.seed(1)
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1)
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100 )
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing)
heldout.evaluation
#evaluation heldout
labelTopics(stm.mod1)
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25)
cloud(stm.mod1, topic=5)
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)
I´m not sure how to interpret the output of "eval.heldout". Additional I want to make sure that the model doesn´t overfit, but i´m not sure how it could work.
eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is the heldout.evaluation$expected.heldout which is the average of the held-out log-likelihood values for each document. Unfortunately there is no unambiguous measure of whether or not the model is "overfit." The plot.searchK() call you have will give you a plot of the held-out log-likelihood over different values of K and certainly if that number is decreasing as K goes up one explanation is overfitting.
Sorry to not have a clearer answer but unfortunately there are no hard and fast rules here.
I'm having a lot of trouble figuring out how to correctly set the num_classes for xgboost.
I've got an example using the Iris data
df <- iris
y <- df$Species
num.class = length(levels(y))
levels(y) = 1:num.class
head(y)
df <- df[,1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change the num_class to 2 I get the same error. If I increase the num_class to 4 then the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.
label must be in [0, num_class)
in your script add y<-y-1 before model <-...
I ran into this rather weird problem as well. It seemed in my class to be a result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.
I re-encoded the labels as integers and then num_class worked fine when set to N.
# Convert classes to integers for xgboost
class <- data.table(interest_level=c("low", "medium", "high"), class=c(0,1,2))
t1 <- merge(t1, class, by="interest_level", all.x=TRUE, sort=F)
and
param <- list(booster="gbtree",
objective="multi:softprob",
eval_metric="mlogloss",
#nthread=13,
num_class=3,
eta_decay = .99,
eta = .005,
gamma = 1,
max_depth = 4,
min_child_weight = .9,#1,
subsample = .7,
colsample_bytree = .5
)
For example.
I was seeing the same error, my issue was that I was using an eval_metric that was only meant to be used for multiclass labels when my data had binary labels. See eval_metric in the Learning Class Parameters section of the XGBoost docs for a list of all of the options.
I had this problem and it turned out that I was trying to subtract 1 from my predictor which was already in the units of 0 and 1. Probably a novice mistake, but in case anyone else is running into this with a binary response variable that is already 0 and 1 it is something to make note of.
Tutorial said:
label = as.integer(iris$Species)-1
What worked for me (response is high_end):
label = as.integer(high_end)
I know that the smoothing parameter(lambda) is quite important for fitting a smoothing spline, but I did not see any post here regarding how to select a reasonable lambda (spar=?), I was told that spar normally ranges from 0 to 1. Could anyone share your experience when use smooth.spline()? Thanks.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
agstudy provides a visual way to choose spar. I remember what I learned from linear model class (but not exact) is to use cross validation to pick "best" spar. Here's a toy example borrowed from agstudy:
x = seq(1:18)
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4))
splineres <- function(spar){
res <- rep(0, length(x))
for (i in 1:length(x)){
mod <- smooth.spline(x[-i], y[-i], spar = spar)
res[i] <- predict(mod, x[i])$y - y[i]
}
return(sum(res^2))
}
spars <- seq(0, 1.5, by = 0.001)
ss <- rep(0, length(spars))
for (i in 1:length(spars)){
ss[i] <- splineres(spars[i])
}
plot(spars, ss, 'l', xlab = 'spar', ylab = 'Cross Validation Residual Sum of Squares' , main = 'CV RSS vs Spar')
spars[which.min(ss)]
R > spars[which.min(ss)]
[1] 0.381
Code is not neatest, but easy for you to understand. Also, if you specify cv=T in smooth.spline:
R > xyspline <- smooth.spline(x, y, cv=T)
R > xyspline$spar
[1] 0.3881
From the help of smooth.spline you have the following:
The computational λ used (as a function of \code{spar}) is λ = r *
256^(3*spar - 1)
spar can be greater than 1 (but I guess no too much). I think you can vary this parameters and choose it graphically by plotting the fitted values for different spars. For example:
spars <- seq(0.2,2,length.out=10) ## I will choose between 10 values
dat <- data.frame(
spar= as.factor(rep(spars,each=18)), ## spar to group data(to get different colors)
x = seq(1:18), ## recycling here to repeat x and y
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4)))
xyplot(y~x|spar,data =dat, type=c('p'), pch=19,groups=spar,
panel =function(x,y,groups,...)
{
s2 <- smooth.spline(y,spar=spars[panel.number()])
panel.lines(s2)
panel.xyplot(x,y,groups,...)
})
Here for example , I get best results for spars = 0.4
If you don't have duplicated points at the same x value, then try setting GCV=TRUE - the Generalized Cross Validation (GCV) procedure is a clever way of selecting a pretty good stab at picking a good value for lambda (span). One neat detail about the GCV is that it doesn't actually have to go to the trouble of doing the calculations for every single set of one-left-out points - as highlighted in Simon Wood's book. For lots of detail on this have a look at the notes on Simon Wood's web page on MGCV.
Adrian Bowman's (sm) r-package has a function h.select() which is intended specifically for going the grunt work for choosing a value of lambda (though I'm not 100% sure that it is compatible with the smooth.spline() function in the base package.