R LDA Topic Model: How to get the posterior for delta

I ran LDA using the R package topicmodels and I have been trying to get the value for delta, which is, in my understanding, the parameter of the Dirichlet prior for words over topics. However, I was not able to access the value.
I only managed to get the initial value using
LDA@control@delta
or
slot(LDA@control, "delta")
I know how to get alpha (the parameter of the Dirichlet for topics over documents) for the posterior distribution, which is simply slot(LDA, "alpha"), but how do I get delta?
Thanks a lot!

topicmodels uses a list of control parameters for the sampling method, here Gibbs sampling. By default, values of alpha = 50/k and delta = 0.1 are assumed in control_LDA_Gibbs; you may, of course, specify other values. Maybe you have not specified your controls correctly. In any case, here is a short code example that should show the delta prior in the output. I hope that helps and solves your issue.
library(text2vec)
library(topicmodels)
library(slam) # to convert the dtm to a simple triplet matrix for topicmodels

ntopics    <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter      <- 1000
seedpar    <- 0
docssubset <- 1:500
docs <- movie_review$review[docssubset]

# Generate document-term matrix with text2vec
tokens = docs %>%
  tolower %>%
  word_tokenizer
it = itoken(tokens, ids = movie_review$id[docssubset], progressbar = FALSE)
vocab = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

control_Gibbs_topicmodels <- list(
  alpha  = alphaprior,
  delta  = deltaprior,
  iter   = niter,
  burnin = 100,
  keep   = 50,
  nstart = 1,
  best   = TRUE,
  seed   = seedpar
)

ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm),
                      k = ntopics,
                      method = "Gibbs",
                      control = control_Gibbs_topicmodels)

str(ldatopicmodels)
ldatopicmodels@control@delta
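As a side note, alpha and delta live in different places on the fitted object. As far as I understand, with method = "Gibbs" delta is not estimated at all; it simply stays at the prior value given in the control list. A minimal sketch of reading both values from the ldatopicmodels object above:

# alpha is stored on the fitted model itself
slot(ldatopicmodels, "alpha")                   # same as ldatopicmodels@alpha

# delta only exists in the control slot and keeps its prior value
slot(slot(ldatopicmodels, "control"), "delta")  # same as ldatopicmodels@control@delta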


Do you need to tokenize your text to visualize data from an LDA topic model?

I'm currently using the textmineR package to run LDA topic models on news articles from 2016-2019.
However, I am quite new to R and I don't know how to display the results from the model.
I want to show the prevalence of the 8 topics my model finds over the time period for which I have collected data. The data is structured in a data frame, with dates defined at the daily level as %y-%m-%d.
My LDA model is made like this:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
                 ngram_window = c(1, 2),
                 doc_names = dat$names,
                 stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                 lower = T, # lowercase - this is the default value
                 remove_punctuation = T, # punctuation - this is the default
                 remove_numbers = T, # numbers - this is the default
                 verbose = T,
                 cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]

############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################

# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
                     k = 8,
                     iterations = 500,
                     burnin = 200,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 4)

# model log-likelihood
plot(model$log_likelihood, type = "l")

# topic coherence
summary(model$coherence)
hist(model$coherence,
     col = "blue",
     main = "Histogram of probabilistic coherence")

# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)

# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
Can anyone tell me how to plot the most prevalent topics the model finds over time?
Do I need to tokenize the text or something like that?
I hope this makes sense.
Best,
Tokenization happens in the CreateDtm function. So, it doesn't sound like that's your issue.
You can get the prevalence of topics over a set of documents by taking a mean over the columns of theta, a matrix that's part of the resulting model.
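In its simplest form that is a one-liner; a minimal sketch, assuming model is the fitted textmineR model from your code above:

# mean probability of each topic across all documents
prevalence <- colMeans(model$theta)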
I can't give you an exact answer with your data, but I can show you a similar example with the nih_sample data that ships with textmineR.
library(textmineR) # for CreateDtm / FitLdaModel

# load the NIH sample data
data(nih_sample)

# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)

# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)

topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x) {
  if (is.null(nrow(x))) {
    # if there is only one row, it gets converted to a vector;
    # just return that vector
    return(x)
  } else { # if there are multiple rows, aggregate
    return(colMeans(x))
  }
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
# the year values used as INDICES become the rownames; turn them into a column
topic_by_year$year <- as.numeric(rownames(topic_by_year))

# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
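If you want all topics in one figure rather than a single one, one option (my addition, not part of the original answer) is matplot over the topic columns:

# plot every topic's mean prevalence by year in one figure;
# textmineR names the theta columns t_1 ... t_k
topic_cols <- grep("^t_", names(topic_by_year), value = TRUE)
matplot(topic_by_year$year, topic_by_year[, topic_cols],
        type = "l", lty = 1, xlab = "year", ylab = "mean topic probability")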

MXNET softmax output: label shape confusion

I have not got a clear idea about how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar label indicating the index of the class probability output is one option, while another is a 2D label where the rows are class probabilities, or a one-hot encoded variable, like c(1, 0, 0).
What puzzles me, though, is that:

1. I can use scalar label values that go beyond indexing, like 4 in my example below, without warning or error. Why is that?
2. When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
3. I try to use the softmax MXNET classifier to produce policy-gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probability. Is that expected?
require(mxnet)

actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
                                    factor_type = "avg",
                                    magnitude = 0.0001)

actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')

device.cpu <- mx.cpu()

# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
  data = actor_nn_data,
  num_hidden = 3
)
actor_output <- mx.symbol.SoftmaxOutput(
  data = actor_fc3,
  label = actor_nn_label,
  name = 'actor'
)

crossentfunc <- function(label, pred)
{
  -sum(label * log(pred))
}
actor_loss <- mx.metric.custom(
  feval = crossentfunc,
  name = "log-loss"
)

# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 # 1 # 2 # 3 # -3 # matrix(c(0, 0, -1), ncol = 1)

rm(actor_model)

actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'adam',
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor"
)

predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected in the case of negative targets. I ran simulations in R using a simple NN and the optim function and got the same result.
It looks like adam or SGD may be buggy, or whatever, in the case of multinomial classification... I also used to get stuck on the fact that those optimizers did not converge to a perfect solution on just one example, while rmsprop does! Be aware!
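For what it's worth, here is a minimal sketch of the kind of plain-R check described above (my reconstruction, not the original simulation code): a single linear layer with a softmax, fitted to one example with a negative target via optim. With a negative entry in the label, the cross-entropy is unbounded below, so the minimizer drives that class's probability towards zero rather than towards a uniform distribution:

# one input example and a target with a negative entry, as in the question
set.seed(42)
x <- rnorm(11)
y <- c(0, -1, 0)

softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }

# cross-entropy of a single softmax layer with a 3 x 11 weight matrix;
# probabilities are clipped so the loss stays finite
crossent <- function(w) {
  p <- softmax(matrix(w, nrow = 3) %*% x)
  -sum(y * log(pmax(p, 1e-12)))
}

fit <- optim(par = rnorm(33, sd = 0.01), fn = crossent,
             method = "BFGS", control = list(maxit = 200))

# the negatively-labeled class is pushed towards probability 0, not 1/3
round(softmax(matrix(fit$par, nrow = 3) %*% x), 3)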

GloVe word embedding model parameters using text2vec in R, and displaying training output (epochs) after every n iterations

I am using the text2vec package in R for training word embeddings (GloVe model) as follows:
library(text2vec)
library(tm)

prep_fun = tolower
tok_fun = word_tokenizer

tokens = docs %>% # docs: a collection of text documents
  prep_fun %>%
  tok_fun
it = itoken(tokens, progressbar = FALSE)

stopword <- tm::stopwords("SMART")
vocab = create_vocabulary(it, stopwords = stopword)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 6)

x_max <- min(50, max(10, ceiling(length(vocab$doc_count) / 100)))
glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab,
                                 x_max = x_max, learning_rate = 0.1)
word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)
When I run this code I get training output printed for every epoch.
My questions are:

1. Is it possible to get output only after every n iterations, i.e. output for epochs 50, 100, 150 and so on?
2. Are there any suggestions for optimal values of word_vectors_size, x_max and learning_rate? For example, for 10,000 documents, what are the best values for those parameters?

I appreciate your response.
Many thanks,
Sam
There is a member of the GlobalVectors class called n_dump_every. You can set it to some number and the history of the word embeddings will be saved. It can then be retrieved with the get_history() function:
glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab,
                                 x_max = 100, learning_rate = 0.1)
glove_model$n_dump_every = 10
word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)
trace = glove_model$get_history()
Regarding the second question:

- You may try to vary the learning rate a bit (usually decrease it), but the default should be fine; keep track of the value of the cost function.
- The more data you have, the larger the value you can use for word_vectors_size. For Wikipedia-scale corpora, 300 is usually enough. For smaller datasets you may start with 20-50. You really need to experiment with this.
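As a hedged illustration of that kind of experiment (my sketch, reusing vocab and tcm from the question; the sizes are arbitrary examples):

# try a few embedding sizes and compare the per-epoch cost that
# fit_transform prints; smaller corpora usually plateau at smaller sizes
for (size in c(25, 50, 100)) {
  gm <- GlobalVectors$new(word_vectors_size = size, vocabulary = vocab,
                          x_max = 100, learning_rate = 0.1)
  wv <- gm$fit_transform(tcm, n_iter = 50, convergence_tol = 0.001)
}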

Get phi, theta, doc.length, vocab, term.frequency from mallet LDA object?

I am trying to use a mallet topic model with the LDAvis package. To do so you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency.
The mallet documentation makes no mention of these parameters. How can I extract them from a topic.model object generated from data using mallet.import() and MalletLDA()?
So far, I've used mallet to fit the topic model:
id_numbers <- as.integer(c(1, 2, 3))
comments <- c("words to be used for text mining",
              "that may or may not be interesting",
              "but could serve as a good example")
df <- data.frame(id_numbers, comments, stringsAsFactors = F)

# Set up topic model
library(mallet)
stoplist <- c("to", "be", "or")
write.csv(stoplist, file = "example_stoplist.csv")
mallet.instances <- mallet.import(
  as.character(df$id_numbers),
  as.character(df$comments),
  "example_stoplist.csv",
  FALSE,
  token.regexp = "[\\p{L}']+")
topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(40, 80) # tweak optimization interval and burn-in iterations
topic.model$train(400)

topic.words.m <- mallet.topic.words(topic.model, smoothed = TRUE,
                                    normalized = TRUE)
dim(topic.words.m)
vocabulary <- topic.model$getVocabulary()
colnames(topic.words.m) <- vocabulary

doc.topics.m <- mallet.doc.topics(topic.model, smoothed = T,
                                  normalized = T)
doc.topics.df <- as.data.frame(doc.topics.m)
doc.topics.df <- cbind(id_numbers, doc.topics.df)
doc.topic.means.df <- aggregate(doc.topics.df[, 2:ncol(doc.topics.df)],
                                list(doc.topics.df[, 1]),
                                mean)
Out of this I now need to generate the JSON for LDAvis. I tried the following:
# LDAvis
library(LDAvis)

phi <- t(mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE))
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)

topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
topic.counts <- rowSums(topic.words)
topic.proportions <- topic.counts / sum(topic.counts)
vocab <- topic.model$getVocabulary()

doc.tokens <- data.frame(id = c(1:nrow(doc.topics.m)), tokens = 0)
for (i in vocab) {
  # count the word if it appears in the text
  matched <- grepl(i, df$comments)
  doc.tokens[matched, 2] = doc.tokens[matched, 2] + 1
}

createJSON(phi = phi,
           theta = doc.topics.m,
           doc.length = doc.tokens,
           vocab = vocab,
           term.frequency = apply(phi.count, 1, sum))
However, this gives me the following error message:
Error in createJSON(phi = phi, theta = doc.topics.m, doc.length = doc.tokens, :
  Number of rows of phi does not match number of columns of theta;
  both should be equal to the number of topics in the model.
So I seem to be generating the phi and theta matrices in the wrong way.
Try removing the matrix transpose function t() from the line where you create phi.
RMallet returns these matrices in the format expected by LDAvis: topics are columns for document topics (theta) and rows for topic words (phi). Sometimes it makes sense to flip one of them so that either rows or columns always mean topics, but not here.
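For completeness, a sketch of how the corrected call might look (untested on this data; passing doc.length as a plain vector and computing term.frequency per term rather than per topic are my assumptions, since createJSON expects numeric vectors for both):

phi <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE) # no t()
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)

createJSON(phi = phi,
           theta = doc.topics.m,
           doc.length = doc.tokens$tokens,      # numeric vector of token counts, not the data frame
           vocab = vocab,
           term.frequency = colSums(phi.count)) # one value per term, not per topic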

R caret: leave subject out cross validation with data subset for training?

I want to perform leave-subject-out cross-validation with R caret (cf. this example) but only use a subset of the data in training for creating the CV models. Still, the left-out CV partition should be used as a whole, as I need to test on all data of a left-out subject (no matter if it's millions of samples that cannot be used in training due to computational restrictions).
I've created a minimal 2-class classification example using the subset and index parameters of caret::train and caret::trainControl to achieve this. From my observation this should solve the problem, but I have a hard time actually ensuring that the evaluation is still done in a leave-subject-out way. Maybe someone with experience in this task could shed some light on this:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'), ] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1, 5, 6, 8:10)
indexes_labels <- 2

# population independent CV indexes for trainControl
index <- llply(unique(d[, 3]), function(cls) c(which(d[, 3] != cls)))
names(index) <- paste0('sub_', unique(d[, 3]))
str(index) # indexes used for training models with CV = OK

m3 <- train(x = d[, indexes_data],
            y = d[, indexes_labels],
            method = 'glm',
            metric = 'ROC',
            # does this subset the data used for training and obtaining models,
            # but not the left out partition used for estimating CV performance?
            subset = sample(nrow(d), 5000),
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = T,
                                     classProbs = T,
                                     summaryFunction = twoClassSummary,
                                     index = index))
str(m3$resample) # all samples used once = OK

# performance over all subjects
myRoc <- roc(predictor = m3$pred[, 3], response = m3$pred$obs)
plot(myRoc, main = 'all')

# performance for individual subjects
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
  pred_sub <- m3$pred[m3$pred$Resample == cls, ]
  myRoc <- roc(predictor = pred_sub[, 3], response = pred_sub$obs)
  plot(myRoc, main = cls)
})
Thanks for your time!
Using both the index and indexOut parameters in caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)

str(diamonds)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'), ] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1, 5, 6, 8:10)
indexes_labels <- 2

# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[, indexes_populationIndependence_subjects]),
               function(cls) c(which(d[, indexes_populationIndependence_subjects] != cls)))
names(index) <- paste0('sub_', unique(d[, indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[, indexes_populationIndependence_subjects]))

# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))

m3 <- train(x = d[, indexes_data],
            y = d[, indexes_labels],
            method = 'glm',
            metric = 'ROC',
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = T,
                                     classProbs = T,
                                     summaryFunction = twoClassSummary,
                                     index = index,
                                     indexOut = indexOut))
m3$resample # seems OK
str(m3$pred) # seems OK

myRoc <- roc(predictor = m3$pred[, 3], response = m3$pred$obs)
plot(myRoc, main = 'all')

# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
  pred_sub <- m3$pred[m3$pred$Resample == cls, ]
  myRoc <- roc(predictor = pred_sub[, 3], response = pred_sub$obs)
  plot(myRoc, main = cls)
})
Still, I'm not absolutely sure whether this actually does the estimation in a population-independent way, so if anybody knows the details, please share your thoughts!
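One way to at least sanity-check the partitions (my addition, not a proof of population independence): every held-out partition should contain exactly one subject, and no training partition should overlap its held-out partition.

# each indexOut partition should cover exactly one color (= subject); expect all 1s
sapply(indexOut, function(io) length(unique(d[[indexes_populationIndependence_subjects]][io])))

# training and held-out partitions should be disjoint; expect all 0s
mapply(function(tr, te) length(intersect(tr, te)), index, indexOut)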
