Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.
Following the example in the linked article, I created a topic model like so:
Gibbs = LDA(JSS_dtm, k = 4,
            method = "Gibbs",
            control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))
we can then get the topics using topics(Gibbs,1), terms using terms(Gibbs,10) and even the topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was also hoping that a similar solution might exist in R.
If anyone knows how this can be achieved, please let us know.
Many thanks!
EDIT:
For the benefit of the others, I thought I'd share my current workaround. If I knew term probabilities, I'd be able to visualise them and give the viewer a better understanding of what each topic means, but without the probabilities, I'm simply breaking down my data by each topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.
See the code below:
JSS_text <- sapply(1:length(JSS_papers[,"description"]), function(x) unlist(JSS_papers[x,"description"]))
jss_df <- data.frame(text=JSS_text,topic=topics(Gibbs, 1))
jss_dec_df <- data.frame()
for(i in unique(topics(Gibbs, 1))){
  jss_dec_df <- rbind(jss_dec_df,
                      data.frame(topic = i,
                                 text = paste(jss_df[jss_df$topic == i, "text"], collapse = " ")))
}
corpus <- Corpus(VectorSource(jss_dec_df$text))
JSS_dtm <- TermDocumentMatrix(corpus, control = list(stemming = TRUE,
                                                     stopwords = TRUE,
                                                     minWordLength = 3,
                                                     removeNumbers = TRUE,
                                                     removePunctuation = TRUE,
                                                     weighting = function(x) weightSMART(x, spec = "bnc")))
(JSS_dtm = removeSparseTerms(JSS_dtm, 0.1)) # note the sparsity parameter
library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm), random.order = FALSE, max.words = 100,
                 scale = c(6, 0.6), colors = 4, title.size = 2)

Figured it out -- to get the term weights, use posterior(lda_object)$terms. Turned out to be much easier than I thought!
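For anyone who finds this later, a minimal sketch of what that call returns, using the Gibbs model fitted above (object names follow the question):
term_probs <- posterior(Gibbs)$terms   # one row per topic, one column per term
rowSums(term_probs)                    # each row sums to 1, i.e. a probability distribution over terms
head(sort(term_probs[1, ], decreasing = TRUE), 10)   # the ten most probable terms for topic 1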

Related

Normalized topic document probabilities text2vec R

I am trying to find out the topic-document probabilities after running the LDA model using the text2vec package in R.
Following commands generate the model:
lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = quantdfm, n_iter = 2000, convergence_tol = 0.00001, n_check_convergence = 10, progressbar = FALSE)
quantdfm is the DTM built with the quanteda package, which I am plugging into the $fit_transform method.
I noticed that doc_topic_distr already contains the topic-document probabilities (without even asking for normalization). Is this correct? I ask because in a previous post, How to get topic probability table from text2vec LDA, Dmitriy Selivanov suggested deriving such probabilities using:
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
whereas when I use that command, doc_topic_distr and doc_topic_prob have the same values (I thought the former would contain integer counts as opposed to the fractions in the latter).
Please suggest whether this is the expected behavior of the code, or whether I have missed something here.
Thanks.
According to the up-to-date documentation, LDA's fit_transform already returns normalized topic probabilities.
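A quick sanity check of this, using the doc_topic_distr object from the question:
range(doc_topic_distr)     # values should lie in [0, 1], not raw topic counts
rowSums(doc_topic_distr)   # each row should already sum to (approximately) 1
If both checks hold, the extra normalize() call is simply a no-op on rows that are already normalized.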

How to choose the nrounds using `catboost`?

If I understand catboost correctly, we need to tune nrounds just as in xgboost, using CV. I see the following code in the official tutorial, In [8]:
params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)
which results in best iterations = 211.
My questions are:
Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
Catboost is doing cross validation to determine the optimum number of iterations. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write
train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'
train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])
When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used to determine the optimum number of iterations via cross validation.
Now you are right to be confused, since later on in the tutorial they again use test_pool and the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector IncToDec):
prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')
This might be bad practice. They might get away with it with their IncToDec overfitting detector - I am not familiar with the mathematics behind it - but for the Iter type overfitting detector you would need to have separate train, validation and test data sets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used for what.
Here is a link with a little more detail on the overfitting detectors:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/
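To make the suggested split concrete, here is a minimal sketch using only the arguments already shown in this tutorial; train_pool, validation_pool and final_test_pool are hypothetical names for three disjoint datasets:
params <- list(iterations = 10000,        # generous upper bound; the detector stops earlier
               loss_function = 'Logloss',
               od_type = 'Iter',          # stop after od_wait rounds without improvement
               od_wait = 30)
# the validation pool is consumed only by the overfitting detector
model <- catboost.train(train_pool, validation_pool, params)
# the final test pool is touched exactly once, for the final evaluation
prediction <- catboost.predict(model, final_test_pool, type = 'Probability')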
It is a very poor decision to base your number of iterations on a single test_pool and on the best iteration reported by catboost.train(). In doing so, you are tuning your parameters to one specific test set and your model will not work well with new data. You are therefore correct in presuming that, like XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost, catboost.cv. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement by using the parameter early_stopping_rounds. Unfortunately, unlike LightGBM, catboost doesn't seem to offer the option of automatically returning the optimal number of boosting rounds after CV to apply in catboost.train(). Therefore, it requires a bit of a workaround. Here is an example which should work:
library(catboost)
library(data.table)

parameter = list(
  thread_count = n_cores,              # n_cores: number of CPU threads to use
  loss_function = "RMSE",
  eval_metric = c("RMSE", "MAE", "R2"),
  iterations = 10^5,                   # train up to 10^5 rounds
  early_stopping_rounds = 100          # stop after 100 rounds of no improvement
)
# Apply 6-fold CV
model = catboost.cv(
  pool = train_pool,
  fold_count = 6,
  params = parameter
)
# Transform the CV output to a data.table
setDT(model)
model[, iterations := .I]
# Order from lowest to highest RMSE
setorder(model, test.RMSE.mean)
# Select the number of iterations with the lowest RMSE
parameter$iterations = model[1, iterations]
# Train the final model with the optimal number of iterations
model = catboost.train(
  learn_pool = train_pool,
  test_pool = test_pool,
  params = parameter
)
I think this is a general question for xgboost and catboost.
The choice of nrounds goes hand in hand with the choice of learning rate.
Thus, I recommend a higher number of rounds (1000+) and a low learning rate.
After you find the best hyperparameters, retry with a lower learning rate to check that the hyperparameters you chose are stable.
And I find #nikitxskv's answer misleading.
In the R tutorial, In [12] just chooses learning_rate = 0.1 without trying multiple values, so there is no hint about nrounds tuning.
Actually, In [12] just uses the function expand.grid to find the best hyperparameters. It works on the selection of depth, gamma and so on.
And in practice, we don't use this approach to find a proper nrounds (it takes too long).
And now for the two questions.
Is it correct that this command uses the test_pool to choose the best iterations instead of using cross-validation?
Yes, but you can use CV.
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
That depends on you. If you are very wary of boosting overfitting, I recommend you try it. There are a lot of packages to solve this problem; I recommend the tidymodels packages.

What does 'seed' do in 'ldatuning' to determine LDA topic frequency (in R)?

I have been trying out different ways of determining topic frequency in LDA (in R) and have stumbled across the very useful-looking package ldatuning but cannot really figure out the control parameter and particularly the example value for seed.
Here is the example code from the website:
library("topicmodels")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 15, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
I played around with the parameters a little bit and noticed that changes in the value for seed change the output graphs quite significantly. Can someone please explain what the 77 in this case stands for and how the value for seed should be selected?
Also, I couldn't find any other options for what to enter for control and what effect that has on the result. If anyone can provide some guidance here that would be great.
seed:
Object of class "integer"; used to set the seed in the external code for VEM estimation and to call set.seed for Gibbs sampling. For Gibbs sampling it can also be set to NA (default) to avoid changing the seed of the random number generator in the model fitting call.
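So the 77 is nothing special: it is simply the integer handed to set.seed() before Gibbs sampling, and different seeds start different random sampling chains, which is why your graphs change. Any fixed value makes the run reproducible. A minimal sketch, reusing the dtm from above:
res1 <- FindTopicsNumber(dtm, topics = 2:15, metrics = "Griffiths2004",
                         method = "Gibbs", control = list(seed = 77))
res2 <- FindTopicsNumber(dtm, topics = 2:15, metrics = "Griffiths2004",
                         method = "Gibbs", control = list(seed = 77))
identical(res1, res2)   # should be TRUE: same seed, same chain, same metric values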

Seeding words into an LDA topic model in R

I have a dataset of news articles that have been collected based on the criterion that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I therefore want to seed these words into the model, and I am not sure exactly how to do that.
I see that the package topicmodels allows for an argument called seedwords, which "can be specified as a matrix or an object of class simple_triplet_matrix", but there are no other instructions. It seems that a simple_triplet_matrix only takes integers, and not strings - does anyone know how I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?
Here is a shortened version of the code:
library("quanteda")
library("lda")
##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))
##Create document feature matrix
UKdfm2 <- dfm(UKcorp, ngrams = 1, verbose = TRUE, toLower = TRUE,
              removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
              removeTwitter = FALSE, stem = TRUE,
              ignoredFeatures = stopwords(kind = "english"), keptFeatures = NULL,
              language = "english", thesaurus = NULL, dictionary = NULL,
              valuetype = "fixed")
##Convert to lda model
UKlda2 <- convert(UKdfm2, to = "lda")
##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,
                                      num.iterations = 1500, alpha = .1, eta = .01,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = TRUE, trace = 0L,
                                      freeze.topics = FALSE)
"Seeding" words in the topicmodels package is a different procedure, as it allows you when estimating through the collapsed Gibbs sampler to attach prior weights for words. (See for instance Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models (pp. 204–213). Association for Computational Linguistics.) But this is part of an estimation strategy for topics, not a way of ensuring that key words of interest remain in your fitted topics. Unless you have set a threshold for removing them based on sparsity, before calling lad::lda.collapsed.gibbs.sampler(), then *every* term in yourUKlda2$vocab` vector will be assigned probabilities across topics.
Probably what is happening here is that your words are of such low frequency that they are hard to locate near the top of any of your topics. It's also possible that stemming has changed them, e.g.:
quanteda::char_wordstem("euroscepticism")
## [1] "eurosceptic"
I suggest you first make sure that your words exist in the dfm, through:
colSums(UKdfm2)["eurosceptic"]
And then you can look at the fitted distribution of topic proportions for this word and others in the fitted topic model object.
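For example, a minimal sketch of that last step, assuming the UKmod2 fit from above and that the stemmed form "eurosceptic" survived preprocessing:
## UKmod2$topics is a K x V matrix of word-to-topic assignment counts
topic_counts <- UKmod2$topics[, "eurosceptic"]
## turn the counts into within-topic proportions
topic_props <- topic_counts / rowSums(UKmod2$topics)
round(topic_props, 4)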

Predicting topics with LDA

I am trying to extract topic assignments from a fit I built with R's 'lda' package. I created a fit:
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = G, alpha = alpha, eta = eta,
                                   initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
...and would like to extract a probability for each topic-document assignment or simply the most likely topic for each document. With the 'topicmodels' package I can just call
topics(fit)
to get that (as in LDA with topicmodels, how can I see which topics different documents belong to?)
How can I get the same with 'lda'?
I haven't used R's 'lda' package, but I use the 'topicmodels' package in R.
I can create the LDA fit for, let's say, 5 topics, using
topic.fit <- LDA(document-term matrix, 5)
now if you want to extract the probability of each topic-document assignment, use
topic.fit@gamma[1:5, ] (gamma contains the document-topic matrix)
and to get the most likely topic you can use
most.likely.topic <- topics(topic.fit, 1)
hope this answers your question.
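For the 'lda' package the question actually uses, a minimal sketch along the same lines, assuming the fit object created above: fit$document_sums is a K x D matrix of per-document topic assignment counts, so both quantities can be derived from it.
## per-document topic proportions (one column per document)
doc_topic_probs <- apply(fit$document_sums, 2, function(x) x / sum(x))
## most likely topic for each document
most_likely_topic <- apply(fit$document_sums, 2, which.max)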
