What does 'seed' do in 'ldatuning' to determine LDA topic frequency (in R)?

I have been trying out different ways of determining topic frequency in LDA (in R) and have stumbled across the very useful-looking package ldatuning but cannot really figure out the control parameter and particularly the example value for seed.
Here is the example code from the website:
library("ldatuning")
library("topicmodels")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
I played around with the parameters a little bit and noticed that changes in the value for seed change the output graphs quite significantly. Can someone please explain what the 77 in this case stands for and how the value for seed should be selected?
Also, I couldn't find any other options for what to enter for control and what effect that has on the result. If anyone can provide some guidance here that would be great.

seed:
Object of class "integer"; used to set the seed in the external code for VEM estimation and to call set.seed for Gibbs sampling. For Gibbs sampling it can also be set to NA (default) to avoid changing the seed of the random number generator in the model fitting call.
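To make the role of the seed concrete, here is a minimal base-R sketch (it does not call ldatuning). The value 77 has no special meaning; it only fixes the starting state of the random number generator so that a Gibbs run can be reproduced. The draw_sample() helper is hypothetical and stands in for any stochastic fitting routine.

```r
# Hypothetical stand-in for a stochastic fit such as Gibbs sampling
draw_sample <- function(seed) {
  set.seed(seed)  # fix the RNG state, as control = list(seed = ...) does for Gibbs
  runif(5)
}

same_seed_a <- draw_sample(77)
same_seed_b <- draw_sample(77)
other_seed  <- draw_sample(78)

identical(same_seed_a, same_seed_b)  # same seed, same draws
identical(same_seed_a, other_seed)   # different seed, different draws
```

Since any integer is as valid as another, large swings in the metric curves between seeds probably say more about the tiny 10-document subset than about the seed itself. One reasonable practice is to repeat the run with a few seeds and look for topic counts that stay stable.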


What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model with R using lightGBM,
and I'm getting a bit confused by some functions and when/how to use them.
The first question is the one in the title: what's the difference between lgb.train() and lightgbm()?
The description in the documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
Second one is about a function lgb.cv(), regarding a cross validation in lightGBM. How do you apply the output of lgb.cv() to a model?
As I understand from the documentation I've linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model; they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
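A common way to act on that estimate is to let cross-validation choose the number of boosting rounds and then fit one final model on all the training data. The sketch below assumes the per-round mean validation metric is stored under record_evals[["valid"]][["l2"]][["eval"]]; the round counts and learning rate are arbitrary illustration values.

```r
library(lightgbm)

data(agaricus.train, package = "lightgbm")
dtrain <- lgb.Dataset(agaricus.train$data, label = agaricus.train$label)
params <- list(objective = "regression", metric = "l2", learning_rate = 0.1)

# Step 1: cross-validate to estimate the error after each boosting round
cv <- lgb.cv(params = params, data = dtrain, nrounds = 30L, nfold = 3L, verbose = -1L)

# Mean validation l2 per round, averaged over the folds
cv_l2 <- unlist(cv$record_evals[["valid"]][["l2"]][["eval"]])
best_nrounds <- which.min(cv_l2)

# Step 2: fit a single final model on all training data with that many rounds
final_model <- lgb.train(params = params, data = dtrain,
                         nrounds = best_nrounds, verbose = -1L)
```

The final model is an ordinary "lgb.Booster", so predict() works on it as usual, while the cross-validated error remains the estimate of how well it should generalize.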

How to handle a skewed response in H2O algorithms

In my dataset the response variable is extremely skewed to the left. I have tried to fit the model with h2o.randomForest() and h2o.gbm() as below. I can tune min_split_improvement and min_rows to avoid overfitting in these two cases. But with these models, I see very high errors on the tail observations. I have tried using weights_column to oversample the tail observations and undersample other observations, but it does not help.
h2o.model <- h2o.gbm(x = predictors, y = response, training_frame = train,valid = valid, seed = 1,
ntrees =150, max_depth = 10, min_rows = 2, model_id = "GBM_DD", balance_classes = T, nbins = 20, stopping_metric = "MSE",
stopping_rounds = 10, min_split_improvement = 0.0005)
h2o.model <- h2o.randomForest(x = predictors, y = response, training_frame = train,valid = valid, seed = 1,ntrees =150, max_depth = 10, min_rows = 2, model_id = "DRF_DD", balance_classes = T, nbins = 20, stopping_metric = "MSE",
stopping_rounds = 10, min_split_improvement = 0.0005)
I have tried the h2o.automl() function of h2o package for the problem for better performance. However, I see significant overfitting. I don't know of any parameters in h2o.automl() to control overfitting.
Does anyone know of a way to avoid overfitting with h2o.automl()?
EDIT
Following the suggestion from Erin, the distribution of the log-transformed response is given below.
EDIT2:
Distribution of original response.
H2O AutoML uses H2O algos (e.g. RF, GBM) underneath, so if you're not able to get good models there, you will suffer from the same issues using AutoML. I am not sure that I would call this overfitting -- it's more that your models are not doing well at predicting outliers.
My recommendation is to log your response variable -- that's a useful thing to do when you have a skewed response. In the future, H2O AutoML will try to detect a skewed response automatically and take the log, but that's not a feature of the current version (H2O 3.16.*).
Here's a bit more detail if you are not familiar with this process. First, create a new column, e.g. log_response, as follows and use that as the response when training (in RF, GBM or AutoML):
train[,"log_response"] <- h2o.log(train[,response])
Caveats: If you have zeros in your response, you should use h2o.log1p() instead. Make sure not to include the original response in your predictors. In your case, you don't need to change anything because you are already explicitly specifying the predictors using a predictors vector.
Keep in mind that when you log the response, your predictions and model metrics will be on the log scale. If you need to convert your predictions back to the original scale, do it like this:
model <- h2o.randomForest(x = predictors, y = "log_response",
training_frame = train, valid = valid)
log_pred <- h2o.predict(model, test)
pred <- h2o.exp(log_pred)
This gives you the predictions, but if you also want to see the metrics, you will have to compute those using the h2o.make_metrics() function using the new preds rather than extracting the metrics from the model.
perf <- h2o.make_metrics(predicted = pred, actual = test[,response])
h2o.mse(perf)
You can try this using RF like I showed above, or a GBM, or with AutoML (which should give better performance than a single RF or GBM).
Hopefully that helps improve the performance of your models!
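The round trip described above can be checked in base R without an H2O cluster. This is only a sketch with made-up numbers: it shows that the back-transform has to match the forward transform (exp for log, expm1 for log1p) and that the error is then computed on the original scale. If h2o.log1p() was used for training, the matching inverse would be an expm1-style function rather than h2o.exp(); H2O appears to provide h2o.expm1() for this, but check your version.

```r
# Toy right-skewed response that contains a zero (made-up numbers)
y <- c(0, 1, 3, 10, 250, 4000)

log_y <- log1p(y)      # log(1 + y), defined at y = 0
back  <- expm1(log_y)  # inverse of log1p
all.equal(back, y)

# Hypothetical model predictions on the log scale
log_pred <- log_y + c(0.1, -0.1, 0.05, -0.05, 0.2, -0.2)

# Back-transform first, then compute the error on the original scale
pred <- expm1(log_pred)
mse_original_scale <- mean((pred - y)^2)
mse_original_scale
```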
When your target variable is skewed, MSE is not a good metric to use. I would try changing the loss function, because GBM fits the model to the gradient of the loss function and you want to make sure that you are using the correct distribution. If you have a spike at zero and a right-skewed positive target, Tweedie would probably be a better option.

How to choose the nrounds using `catboost`?

If I understand catboost correctly, we need to tune nrounds just like in xgboost, using CV. I see the following code in the official tutorial, In [8]:
params_with_od <- list(iterations = 500,
loss_function = 'Logloss',
train_dir = 'train_dir',
od_type = 'Iter',
od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)
which results in best iterations = 211.
My questions are:
Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
CatBoost uses the held-out test_pool as a validation set to determine the optimum number of iterations; this is not k-fold cross-validation. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write
train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
column_description_vector[i] <- 'factor'
train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])
When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used as a validation set to determine the optimum number of iterations.
Now you are right to be confused, since later on in the tutorial they again use test_pool and the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector IncToDec):
prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')
This might be bad practice. Now they might get away with it with their IncToDec overfitting detector - I am not familiar with the mathematics behind it - but for the Iter type overfitting detector you would need to have separate train, validation and test data sets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used and how.
Here is a link to a little more detail on the overfitting detectors:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/
It is a very poor decision to base your number of iterations on one test_pool and from the best iterations of catboost.train(). In doing so, you are tuning your parameters to one specific test set and your model will not work well with new data. You are therefore correct in presuming that like XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement by using the parameter early_stopping_rounds. Unfortunately, unlike LightGBM, catboost doesn't seem to have an option that automatically returns the optimal number of boosting rounds after CV for use in catboost.train(). Therefore, it requires a bit of a workaround. Here is an example which should work:
library(catboost)
library(data.table)
n_cores = parallel::detectCores() # Number of CPU threads to use
parameter = list(
thread_count = n_cores,
loss_function = "RMSE",
eval_metric = c("RMSE","MAE","R2"),
iterations = 10^5, # Train up to 10^5 rounds
early_stopping_rounds = 100 # Stop after 100 rounds of no improvement
)
# Apply 6-fold CV
model = catboost.cv(
pool = train_pool,
fold_count = 6,
params = parameter
)
# Transform output to a data.table and record the iteration number of each row
setDT(model)
model[, iterations := .I]
# Order from lowest to highest RMSE
setorder(model, test.RMSE.mean)
# Select iterations with lowest RMSE
parameter$iterations = model[1, iterations]
# Train model with optimal iterations
model = catboost.train(
learn_pool = train_pool,
test_pool = test_pool,
params = parameter
)
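The selection step of that workaround can be seen in isolation with plain base R. The data frame below is made up; it only mimics a CV result with one row per boosting iteration, and the column name test.RMSE.mean is taken from the code above rather than verified against every catboost version.

```r
# Made-up CV results, one row per boosting iteration
cv_results <- data.frame(
  test.RMSE.mean  = c(5.0, 3.2, 2.6, 2.4, 2.5, 2.7),
  train.RMSE.mean = c(4.8, 3.0, 2.2, 1.8, 1.5, 1.2)
)

# The row index is the iteration count; pick the one with the lowest CV error
best_iter <- which.min(cv_results$test.RMSE.mean)
best_iter  # 4
```

Note that the training error keeps falling after iteration 4 while the cross-validated error rises again, which is exactly the overfitting that the selection is meant to avoid.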
I think this is a general question for xgboost and catboost.
The choice of nrounds goes hand in hand with the choice of learning rate.
Thus, I recommend a high number of rounds (1000+) and a low learning rate.
After you find the best hyperparameters, retry with a lower learning rate to check that the hyperparameters you chose are stable.
I also find @nikitxskv's answer misleading.
In the R tutorial, In [12] just chooses learning_rate = 0.1 without trying multiple values. Thus, it gives no hint for tuning nrounds.
Actually, In [12] just uses the function expand.grid to find the best hyperparameters. It searches over depth, gamma and so on.
And in practice, we don't use this approach to find a proper nrounds (it takes too long).
And now for the two questions.
Is it correct that: this command use the test_pool to choose the best iterations instead of using cross-validation?
Yes, but you can use CV.
If yes, does catboost provide a command to choose the best iterations from CV, or I need to do it manually?
That is up to you. If you are strongly averse to overfitting in boosting, I recommend you try it. There are a lot of packages that solve this problem. I recommend the tidymodels packages.

The xgboost package and the random forests regression

The xgboost package allows you to build a random forest (in fact, it chooses a random subset of columns for the whole tree rather than for each node, as the classical version of the algorithm does, but that can be tolerated). But it seems that for regression only one tree from the forest (maybe the last one built) is used.
To check that, consider just a standard toy example.
library(xgboost)
library(randomForest)
data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
label = agaricus.train$label)
bst = xgb.train(data = dtrain,
nround = 1,
subsample = 0.8,
colsample_bytree = 0.5,
num_parallel_tree = 100,
verbose = 2,
max_depth = 12)
answer1 = predict(bst, dtrain);
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)
forest = randomForest(x = as.matrix(agaricus.train$data), y = agaricus.train$label, ntree = 50)
answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)
Yes, of course, the default version of the xgboost random forest uses not a Gini score function but just the MSE; it can be changed easily. Also, it is not correct to do validation this way, and so on. That does not affect the main problem. Regardless of which sets of parameters are tried, the results are surprisingly bad compared with the randomForest implementation. This holds for other data sets as well.
Could anybody provide a hint on such strange behaviour? When it comes to the classification task the algorithm does work as expected.
#
Well, all trees are grown and all are used to make a prediction. You may check that using the parameter 'ntreelimit' for the 'predict' function.
The main problem remains: is the specific form of the Random Forest algorithm that is produced by the xgboost package valid?
Cross-validation, parameter tuning and the like have nothing to do with that; everyone may add the necessary corrections to the code and see what happens.
You may specify the 'objective' option like this:
mse = function(predict, dtrain)
{
real = getinfo(dtrain, 'label')
return(list(grad = 2 * (predict - real),
hess = rep(2, length(real))))
}
This ensures that you use the MSE when choosing a variable for the split. Even after that, results are surprisingly bad compared to those of randomForest.
Maybe the problem is of an academic nature and concerns how the random subset of features for a split is chosen. The classical implementation chooses a subset of features (the size is specified with 'mtry' for the randomForest package) for EVERY split separately, and the xgboost implementation chooses one subset per tree (specified with 'colsample_bytree').
So this fine difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
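Newer xgboost releases document a colsample_bynode parameter that samples columns at every split, which is closer in spirit to mtry. The sketch below is an assumption-laden comparison on simulated data (a pure interaction of two features, chosen here for illustration), not a statement about how the original poster's data behave. With two columns and a ratio of 0.5, per-tree sampling gives each tree a single feature, while per-node sampling lets one tree use both.

```r
library(xgboost)

set.seed(1)
n <- 2000
X <- matrix(rnorm(n * 2), ncol = 2)
y <- X[, 1] * X[, 2] + rnorm(n) * 0.5  # pure interaction plus noise
dtrain <- xgb.DMatrix(X, label = y)

# Fit one round of parallel trees (random-forest style) with extra sampling options
fit_forest <- function(extra_params) {
  params <- c(list(objective = "reg:squarederror",
                   eta = 1,            # no shrinkage for a single bagging round
                   max_depth = 12,
                   subsample = 0.5,
                   num_parallel_tree = 50),
              extra_params)
  xgb.train(params = params, data = dtrain, nrounds = 1)
}

rf_bytree <- fit_forest(list(colsample_bytree = 0.5))  # one column per tree
rf_bynode <- fit_forest(list(colsample_bynode = 0.5))  # one column per split

# Training error only; use held-out data before drawing conclusions
rmse <- function(model) sqrt(mean((predict(model, dtrain) - y)^2))
c(bytree = rmse(rf_bytree), bynode = rmse(rf_bynode))
```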
xgboost(random forest style) does use more than one tree to predict. But there are many other differences to explore.
I myself am new to xgboost, but curious. So I wrote the code below to visualize the trees. You can run the code yourself to verify or explore other differences.
Your data set of choice is a classification problem as labels are either 0 or 1. I like to switch to a simple regression problem to visualize what xgboost does.
true model: $y = x_1 * x_2$ + noise
If you train a single tree or multiple trees with the code examples below, you can see that the learned model structure does contain more than one tree. You cannot tell how many trees were trained from the prediction accuracy alone.
Maybe the predictions are different because the implementations are different. None of the ~5 RF implementations I know of are exactly alike, and this xgboost (RF style) is at best a distant "cousin".
I observe that colsample_bytree is not equivalent to mtry, as the former uses the same subset of variables/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if each tree uses only x1 or only x2. Thus in this case colsample_bytree must be set to 1 to use both variables in all trees. Regular RF could model this problem with mtry=1, as each node would use either X1 or X2.
I see your randomForest predictions are not out-of-bag cross-validated. If you draw any conclusions from predictions you must cross-validate, especially for fully grown trees.
NB: You need to fix the function vec.plot, as it does not support xgboost out of the box, because xgboost does not accept a data.frame as valid input. The instructions in the code should be clear.
library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2,rnorm(5000)))
Data$y = Data$X1*Data$X2 + rnorm(5000)*.5
gradientByTarget =fcol(Data,3)
plot3d(Data,col=gradientByTarget) #true data structure
fix(vec.plot) #change these two line in the function, as xgboost do not support data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))
#1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget,grid=200)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget)
#clearly just one tree
#100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=100,params = list(max.depth=16,eta=.5,subsample=.6))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget) ##predictions are not OOB cross-validated!
#20 deep trees (bagging), half of the columns sampled per tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250,
num_parallel_tree=20,colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2]))) #terrible fit!!
#problem, colsample_bytree is NOT mtry as columns are only sampled once
# (this could be raised as an issue on their github page, that this does not mimic RF)
#200 deep trees (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=500,
num_parallel_tree=200,colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])))
#voila model can fit data

Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.
Following the example in the following link I created a topic model like so:
Gibbs = LDA(JSS_dtm, k = 4,
method = "Gibbs",
control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))
We can then get the topics using topics(Gibbs,1), terms using terms(Gibbs,10) and even the topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was also hoping that a similar solution may exist in R.
If anyone knows how this can be achieved, please let us know.
Many thanks!
EDIT:
For the benefit of the others, I thought I'd share my current workaround. If I knew term probabilities, I'd be able to visualise them and give the viewer a better understanding of what each topic means, but without the probabilities, I'm simply breaking down my data by each topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.
See the below code:
JSS_text <- sapply(1:length(JSS_papers[,"description"]), function(x) unlist(JSS_papers[x,"description"]))
jss_df <- data.frame(text=JSS_text,topic=topics(Gibbs, 1))
jss_dec_df <- data.frame()
for(i in unique(topics(Gibbs, 1))){
jss_dec_df <- rbind(jss_dec_df,data.frame(topic = i,
text = paste(jss_df[jss_df$topic==i,"text"],collapse=" ")))
}
corpus <- Corpus(VectorSource(jss_dec_df$text))
JSS_dtm <- TermDocumentMatrix(corpus,control = list(stemming = TRUE,
stopwords = TRUE,
minWordLength = 3,
removeNumbers = TRUE,
removePunctuation = TRUE,
weighting = function(x)weightSMART(x,spec="bnc")))
(JSS_dtm = removeSparseTerms(JSS_dtm,0.1)) # note the sparsity parameter
library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm),random.order=F,max.words=100,
scale=c(6,0.6),colours=4,title.size=2)
Figured it out -- to get the term weights, use posterior(lda_object)$terms. Turned out to be much easier than I thought!
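A minimal sketch of that call, using the AssociatedPress data that ships with topicmodels instead of the JSS corpus; the subset size, k, and Gibbs settings are arbitrary choices to keep the run short.

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

lda_fit <- LDA(AssociatedPress[1:20, ], k = 2, method = "Gibbs",
               control = list(seed = 1, burnin = 100, iter = 200))

# Matrix of term probabilities: one row per topic, one column per term
term_probs <- posterior(lda_fit)$terms
dim(term_probs)
rowSums(term_probs)  # each topic's probabilities sum to 1

# Five most probable terms for topic 1, with their probabilities
head(sort(term_probs[1, ], decreasing = TRUE), 5)
```

The same information is stored on the log scale in the beta slot of the fitted object, so exp(lda_fit@beta) should give matching values.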
