Normalized topic document probabilities text2vec R

I am trying to find the topic-document probabilities after fitting an LDA model with the text2vec package in R.
The following commands generate the model:
lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = quantdfm, n_iter = 2000, convergence_tol = 0.00001, n_check_convergence = 10, progressbar = FALSE)
quantdfm is the document-term matrix built with the quanteda package, which I pass to the $fit_transform method.
I noticed that doc_topic_distr already contains the topic-document probabilities (without my asking for normalization). Is this correct? In a previous post, "How to get topic probability table from text2vec LDA", Dmitriy Selivanov suggested deriving such probabilities using:
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
However, when I run that command, doc_topic_distr and doc_topic_prob have the same values (I expected the former to contain integer counts and the latter fractions).
Please let me know whether this is the expected behavior, or whether I have missed something.
Thanks.

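According to the up-to-date documentation, LDA's fit_transform returns topic probabilities, so the rows are already normalized and applying normalize(..., norm = "l1") leaves them unchanged.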
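A small base R sketch of why the two matrices can come out identical: L1-normalizing rows that already sum to 1 changes nothing. The helper l1_normalize_rows and the toy counts are made up for illustration and only stand in for text2vec's normalize(..., norm = "l1"); this is not the package's implementation.

```r
# Hypothetical helper: divide each row by its sum
# (L1 normalization for rows with non-negative entries).
l1_normalize_rows <- function(m) m / rowSums(m)

# Toy document-topic counts (3 documents x 2 topics), invented for illustration.
counts <- matrix(c(8, 2,
                   1, 9,
                   5, 5), nrow = 3, byrow = TRUE)

probs <- l1_normalize_rows(counts)       # rows now sum to 1
probs_again <- l1_normalize_rows(probs)  # normalizing a second time is a no-op

rowSums(probs)
isTRUE(all.equal(probs, probs_again))
```

If doc_topic_distr already has rows summing to 1, the same reasoning explains why doc_topic_prob matches it.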

Related

What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model in R using LightGBM,
and I'm getting a bit confused about some functions and when/how to use them.
The first question is the one in the title: what's the difference between lgb.train() and lightgbm()?
The description in the documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
The second question is about lgb.cv(), which performs cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model?
As I understand from the documentation linked above, the output of both lgb.cv and lgb.train seems to be a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model; they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has the friendlier interface. Both will produce a single trained LightGBM model (class "lgb.Booster").
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
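One common way to act on this, sketched under assumptions rather than taken from the docs: use lgb.cv() only to estimate the error and pick a number of rounds, then fit a single final model on all of the training data with lgb.train(). The variable names (dtrain_cv, dtrain_full, best_rounds, final_model) are illustrative, and passing learning_rate and min_data_in_leaf through params instead of as keyword arguments is my choice here.

```r
library(lightgbm)

data(agaricus.train, package = "lightgbm")
params <- list(objective = "regression", metric = "l2",
               learning_rate = 1.0, min_data_in_leaf = 1L)

# Step 1: cross-validate to estimate generalization error and choose nrounds.
dtrain_cv <- lgb.Dataset(agaricus.train$data, label = agaricus.train$label)
cv <- lgb.cv(params = params, data = dtrain_cv,
             nrounds = 5L, nfold = 3L, verbose = -1L)
best_rounds <- cv$best_iter  # round with the best mean validation score

# Step 2: train one final model on the full training set.
dtrain_full <- lgb.Dataset(agaricus.train$data, label = agaricus.train$label)
final_model <- lgb.train(params = params, data = dtrain_full,
                         nrounds = best_rounds, verbose = -1L)
```

final_model is an ordinary "lgb.Booster", so it can be passed to predict() like any model returned by lgb.train() or lightgbm().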

What does 'seed' do in 'ldatuning' to determine LDA topic frequency (in R)?

I have been trying out different ways of determining topic frequency in LDA (in R) and have stumbled across the very useful-looking package ldatuning but cannot really figure out the control parameter and particularly the example value for seed.
Here is the example code from the website:
library("topicmodels")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
I played around with the parameters a little bit and noticed that changes in the value for seed change the output graphs quite significantly. Can someone please explain what the 77 in this case stands for and how the value for seed should be selected?
Also, I couldn't find any other options for what to enter for control and what effect that has on the result. If anyone can provide some guidance here that would be great.
seed:
Object of class "integer"; used to set the seed in the external code for VEM estimation and to call set.seed for Gibbs sampling. For Gibbs sampling it can also be set to NA (default) to avoid changing the seed of the random number generator in the model fitting call.
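To make the role of the seed concrete, here is a minimal base R sketch (not ldatuning or topicmodels code; the variable names are illustrative). The seed fixes the starting state of the random number generator, so the same seed reproduces the same pseudo-random draws, and a different seed generally produces different ones.

```r
# Same seed, same pseudo-random draws.
set.seed(77)
run_a <- sample(1:100, 5)
set.seed(77)
run_b <- sample(1:100, 5)
identical(run_a, run_b)

# A different seed starts the generator from a different state.
set.seed(78)
run_c <- sample(1:100, 5)
```

Gibbs sampling is a randomized procedure, so the same logic applies: the value 77 has no special meaning beyond making a run repeatable, and comparing results across a few seeds is a reasonable way to see how stable the metrics are.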

How to reproduce topic modelling results with LDA package in R

I am using the lda package in R to perform Latent Dirichlet Allocation modelling. However, each time I run the program I get a different output.
Using set.seed() doesn't seem to help like with the topicmodels package.
Assuming an identical input, is there a way to ensure that identical topics are found on subsequent executions of the code?
I execute the function as follows:
set.seed(11)
fit1 <- lda.collapsed.gibbs.sampler(documents = documents, K = topics, vocab = vocab,
num.iterations = iterations, alpha = alpha,
eta = eta, initial = NULL, burnin = 500,
compute.log.likelihood = TRUE)

Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.
Following the example in the following link I created a topic model like so:
Gibbs = LDA(JSS_dtm, k = 4,
method = "Gibbs",
control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))
We can then get the topics using topics(Gibbs,1), terms using terms(Gibbs,10) and even the topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was also hoping that a similar solution may exist in R.
If anyone knows how this can be achieved, please let us know.
Many thanks!
EDIT:
For the benefit of the others, I thought I'd share my current workaround. If I knew term probabilities, I'd be able to visualise them and give the viewer a better understanding of what each topic means, but without the probabilities, I'm simply breaking down my data by each topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.
See the below code:
JSS_text <- sapply(1:length(JSS_papers[,"description"]), function(x) unlist(JSS_papers[x,"description"]))
jss_df <- data.frame(text=JSS_text,topic=topics(Gibbs, 1))
jss_dec_df <- data.frame()
for(i in unique(topics(Gibbs, 1))){
jss_dec_df <- rbind(jss_dec_df,data.frame(topic = i,
text = paste(jss_df[jss_df$topic==i,"text"],collapse=" ")))
}
library(tm)
corpus <- Corpus(VectorSource(jss_dec_df$text))
# Build a term-document matrix with one column per topic
JSS_dtm <- TermDocumentMatrix(corpus,control = list(stemming = TRUE,
stopwords = TRUE,
minWordLength = 3,
removeNumbers = TRUE,
removePunctuation = TRUE,
weighting = function(x)weightSMART(x,spec="bnc")))
(JSS_dtm = removeSparseTerms(JSS_dtm,0.1)) # note the sparsity parameter
library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm),random.order=F,max.words=100,
scale=c(6,0.6),colours=4,title.size=2)
Figured it out -- to get the term weights, use posterior(lda_object)$terms. Turned out to be much easier than I thought!
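A short sketch of what that looks like in practice, assuming the topicmodels package and its bundled AssociatedPress data rather than the JSS data from the question; the subset size, k, iteration count and variable names are arbitrary choices for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:20, ], k = 2, method = "Gibbs",
           control = list(seed = 1, iter = 100))

# posterior()$terms is a topics x vocabulary matrix of term probabilities.
term_probs <- posterior(fit)$terms
dim(term_probs)
rowSums(term_probs)  # each topic's term probabilities sum to 1

# Ten most probable terms for topic 1, with their probabilities.
top_terms_topic1 <- sort(term_probs[1, ], decreasing = TRUE)[1:10]
top_terms_topic1
```

These per-topic probabilities can be fed directly into a word cloud or bar chart in place of the binary weights used in the workaround.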

Predicting topics with LDA

I am trying to extract topic assignments from a fit I build with R's 'lda' package. I created a fit:
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
num.iterations = G, alpha = alpha, eta = eta, initial = NULL,
burnin = 0, compute.log.likelihood = TRUE)
...and would like to extract a probability for each topic-document assignment or simply the most likely topic for each document. With the 'topicmodels' package I can just call
topics(fit)
to get that (as in LDA with topicmodels, how can I see which topics different documents belong to?)
How can I get the same with 'lda'?
I haven't used the 'lda' package, but I do use the 'topicmodels' package in R.
I can create the LDA fit for, let's say, 5 topics using
topic.fit <- LDA(dtm, k = 5) # dtm is your document-term matrix
Now, if you want to extract the probability of each topic-document assignment, use
topic.fit@gamma[1:5, ] # gamma contains the document-topic matrix
and to get the most likely topic you can use
most.likely.topic <- topics(topic.fit, 1)
Hope this answers your question.
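For the 'lda' package itself, a possible approach (a sketch, not tested against the asker's data) relies on the document_sums element returned by lda.collapsed.gibbs.sampler, a K x D matrix counting how many words in each document were assigned to each topic. Normalizing each column gives per-document topic proportions. The matrix below is a made-up stand-in for fit$document_sums so the snippet runs without the package.

```r
# Toy stand-in for fit$document_sums: K topics (rows) x D documents (columns).
# With a real fit you would use: document_sums <- fit$document_sums
document_sums <- matrix(c(12, 0, 3,
                           1, 7, 3,
                           2, 1, 9), nrow = 3, byrow = TRUE)

# Transpose to D x K and divide each document's counts by its total.
topic_props <- t(document_sums) / colSums(document_sums)

# Most likely topic for each document.
most_likely <- apply(topic_props, 1, which.max)
most_likely
```

These are raw assignment proportions; adding the prior alpha to the counts before normalizing is a common smoothing variant.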
