Seeding words into an LDA topic model in R

I have a dataset of news articles that have been collected based on the criteria that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I want to therefore seed these words into the model, and I am not sure exactly how to do that.
I see that the package topicmodels allows for an argument called seedwords, which "can be specified as a matrix or an object of class simple_triplet_matrix", but there are no other instructions. It seems that a simple_triplet_matrix only takes integers, and not strings - does anyone know how I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?
Here is a shortened version of the code:
library("quanteda")
library("lda")
##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))
##Create document feature matrix
UKdfm2 <- dfm(UKcorp, ngrams = 1, verbose = TRUE, toLower = TRUE,
              removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
              removeTwitter = FALSE, stem = TRUE,
              ignoredFeatures = stopwords(kind = "english"), keptFeatures = NULL,
              language = "english", thesaurus = NULL, dictionary = NULL,
              valuetype = "fixed")
##Convert to lda model
UKlda2 <- convert(UKdfm2, to = "lda")
##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,
                                      num.iterations = 1500, alpha = .1, eta = .01,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = TRUE, trace = 0L,
                                      freeze.topics = FALSE)

"Seeding" words in the topicmodels package is a different procedure, as it allows you when estimating through the collapsed Gibbs sampler to attach prior weights for words. (See for instance Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models (pp. 204–213). Association for Computational Linguistics.) But this is part of an estimation strategy for topics, not a way of ensuring that key words of interest remain in your fitted topics. Unless you have set a threshold for removing them based on sparsity, before calling lad::lda.collapsed.gibbs.sampler(), then *every* term in yourUKlda2$vocab` vector will be assigned probabilities across topics.
Probably what is happening here is that your words are of such low frequency that they do not appear near the top of any of your topics. It's also possible that stemming has changed them, e.g.:
quanteda::char_wordstem("euroscepticism")
## [1] "eurosceptic"
I suggest you first make sure that your words exist in the dfm, through:
colSums(UKdfm2)["eurosceptic"]
And then you can look at the fitted distribution of topic proportions for this word and others in the fitted topic model object.
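For example, a minimal sketch of that check, using the UKmod2 object fitted above (the $topics element returned by lda.collapsed.gibbs.sampler() is a K x V matrix of topic-word assignment counts, with columns named by the vocabulary):
# Per-topic word probabilities: normalise each topic's counts to sum to one
topic_word <- UKmod2$topics / rowSums(UKmod2$topics)
# Distribution of the (stemmed) term of interest across the 15 topics
topic_word[, "eurosceptic"]
# Top words per topic, for comparison
lda::top.topic.words(UKmod2$topics, num.words = 10, by.score = TRUE)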

Related

Question about pre-processing Russian text for stm in R

I am trying to run a structural topic model in R using the stm package. The corpus is a collection of Russian-language speeches. The problem I am having is that the Russian words are not being pre-processed correctly. Here is the code I have written thus far:
library(stm) # Package for structural topic modeling
library(igraph) # Package for network analysis and visualisation
library(stmCorrViz) # Package for hierarchical correlation view of STMs
data <- read.csv("convocation4.csv") # Load data
stopwordsRU <- readLines("stop_words_russian.txt", encoding = "UTF-8") # Custom stopwords
processed <- textProcessor(
  data$text,
  metadata = data,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  stem = TRUE,
  language = "ru",
  customstopwords = stopwordsRU
)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta) # prepare documents for stm
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
fit <- stm(out$documents, out$vocab, K = 20, prevalence = ~party_id,
           max.em.its = 75, data = out$meta, init.type = "Spectral",
           seed = 8458159)
Here is an example of the problem. I have included one Russian stopword, уважаемый, in my list of custom stopwords. However, the word уважаемые, the plural form, comes out as one of the top words after running the model. Why is this happening? How might I solve it?
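A small diagnostic sketch (assuming the SnowballC stemmer that textProcessor() relies on): check whether the stemmer maps the two forms to the same stem, and whether the stemmed form is actually covered by your custom stopword list.
library(SnowballC)
# Do the singular and plural forms share a stem after Russian stemming?
wordStem(c("уважаемый", "уважаемые"), language = "russian")
# Is the stemmed plural covered by the (stemmed) custom stopword list?
wordStem("уважаемые", language = "russian") %in% wordStem(stopwordsRU, language = "russian")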

Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm)

First things first. I've saved my workspace and you can load it with the following command:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.
I want to generate trends in expected topic proportion over time, like the authors of the stm vignette do.
I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the "Covariate matrix is singular" warning, and the resulting trend plots do not look as expected.
In the documentation, the authors note that
The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.
I've tried a variety of different models, using a 2-topic solution all the way up to 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the splined data. I just don't know what exactly.
Again, here's a link to my workspace:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
And here is the code I'm using to get there:
library(udpipe)
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM
rm(list = ls())
abstracts <- read_delim("Data/5528_demand_ta.txt",
                        delim = "\t", escape_double = FALSE,
                        col_names = TRUE, trim_ws = TRUE)
abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)
# Download english dictionary
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)
# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
rename(term = lemma) %>%
select(doc_id, term , year)
# Prepare text
processed <- textProcessor(documents = data$term,
                           metadata = data,
                           lowercase = TRUE,
                           removestopwords = TRUE,
                           removenumbers = TRUE,
                           removepunctuation = TRUE,
                           stem = FALSE)
out <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     lower.thresh = 20,   # term must appear in at least n docs to matter
                     upper.thresh = 1000) # I've been using about 1/3 of documents as an upper thresh
# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
               vocab = out$vocab,
               K = 0,
               init.type = "Spectral",
               prevalence = ~ s(year),
               data = out$meta,
               max.em.its = 500, # Max number of runs to attempt
               seed = 831)
###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################
# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year),
                       stmobj = stm_mod,
                       metadata = out$meta)
# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep,
                    "year",
                    method = "continuous",
                    model = stm_mod,
                    topics = 5,
                    printlegend = TRUE,
                    xaxt = "n",
                    xlab = "Years")
A singular matrix simply means that you have linearly dependent rows or columns. The first thing you could do is check the determinant of the matrix - a singular matrix implies a zero determinant - which means the matrix can't be inverted.
The next thing would be to identify the linearly dependent rows (or columns); you can do so using Smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows, and Smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You MAY be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution - probably an unstable solution, but a solution.
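A minimal diagnostic sketch along those lines, assuming the out$meta object from the question's code and approximating stm's s() (a B-spline basis) with splines::bs(); the df value here is only an illustrative choice:
library(splines)
library(Smisc)
# Rebuild (approximately) the design matrix that s(year) expands to
X <- model.matrix(~ bs(year, df = 10), data = out$meta)
det(crossprod(X))                          # (near) zero implies a singular design
findDepMat(X, rows = FALSE, tol = 1e-10)   # TRUE flags linearly dependent columns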

Normalized topic document probabilities text2vec R

I am trying to find out the topic-document probabilities after running an LDA model using the text2vec package in R.
Following commands generate the model:
lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = quantdfm, n_iter = 2000, convergence_tol = 0.00001, n_check_convergence = 10, progressbar = FALSE)
quantdfm is the dtm built with the quanteda package, which I am plugging into the $fit_transform method.
I noticed that doc_topic_distr already contains the topic-document probabilities (without even asking for normalization). Is this correct? Because in a previous post, How to get topic probability table from text2vec LDA, Dmitriy Selivanov suggested deriving such probabilities using:
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
whereas when I use the same command as above, doc_topic_distr and doc_topic_prob have the same values (I thought the former contained integers, as opposed to fractions in the latter).
Please suggest whether this is the expected behavior of the code, or whether I have missed something here.
Thanks.
According to the up-to-date documentation, LDA's fit_transform() already returns topic probabilities, which is why normalizing the result again leaves it unchanged.
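A quick way to verify that, using the objects from the question (a sketch):
# If fit_transform() already returns probabilities, each row should sum to 1
summary(rowSums(doc_topic_distr))
# ...and an explicit l1 normalisation should change nothing
doc_topic_prob <- text2vec::normalize(doc_topic_distr, norm = "l1")
all.equal(doc_topic_distr, doc_topic_prob)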

Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.
Following the example in the following link I created a topic model like so:
Gibbs = LDA(JSS_dtm, k = 4,
method = "Gibbs",
control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))
We can then get the topics using topics(Gibbs, 1), the terms using terms(Gibbs, 10), and even the document-topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was also hoping that a similar solution might exist in R.
If anyone knows how this can be achieved, please let us know.
Many thanks!
EDIT:
For the benefit of the others, I thought I'd share my current workaround. If I knew term probabilities, I'd be able to visualise them and give the viewer a better understanding of what each topic means, but without the probabilities, I'm simply breaking down my data by each topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.
See the below code:
JSS_text <- sapply(1:length(JSS_papers[,"description"]), function(x) unlist(JSS_papers[x,"description"]))
jss_df <- data.frame(text=JSS_text,topic=topics(Gibbs, 1))
jss_dec_df <- data.frame()
for(i in unique(topics(Gibbs, 1))){
  jss_dec_df <- rbind(jss_dec_df,
                      data.frame(topic = i,
                                 text = paste(jss_df[jss_df$topic == i, "text"], collapse = " ")))
}
corpus <- Corpus(VectorSource(jss_dec_df$text))
JSS_dtm <- TermDocumentMatrix(corpus,
                              control = list(stemming = TRUE,
                                             stopwords = TRUE,
                                             minWordLength = 3,
                                             removeNumbers = TRUE,
                                             removePunctuation = TRUE,
                                             weighting = function(x) weightSMART(x, spec = "bnc")))
(JSS_dtm = removeSparseTerms(JSS_dtm, 0.1)) # note the sparsity parameter
library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm), random.order = FALSE, max.words = 100,
                 scale = c(6, 0.6), colors = 4, title.size = 2)
Figured it out -- to get the term weights, use posterior(lda_object)$terms. Turned out to be much easier than I thought!
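For example, with the Gibbs model fitted above (posterior() is from topicmodels; its $terms element is a topics x terms matrix whose rows sum to one):
term_probs <- posterior(Gibbs)$terms
# Ten most probable terms in topic 1, with their probabilities
sort(term_probs[1, ], decreasing = TRUE)[1:10]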

Predicting topics with LDA

I am trying to extract topic assignments from a fit I build with R's 'lda' package. I created a fit:
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = G, alpha = alpha, eta = eta,
                                   initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
...and would like to extract a probability for each topic-document assignment, or simply the most likely topic for each document. With the 'topicmodels' package I can just call
topics(fit)
to get that (as in LDA with topicmodels, how can I see which topics different documents belong to?)
How can I get the same with 'lda'?
I haven't used R's 'lda' package, but I do use the 'topicmodels' package in R.
I can create the LDA fit for, let's say, 5 topics, using
topic.fit <- LDA(dtm, k = 5)  # dtm is your document-term matrix
Now if you want to extract the probability of each topic-document assignment, use
topic.fit@gamma[1:5, ]  # gamma contains the document-topic matrix
and to get the most likely topic you can use
most.likely.topic <- topics(topic.fit, 1)
Hope this answers your question.
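That said, the 'lda' fit itself contains what the question asks for. A brief sketch, assuming the fit object created above ($document_sums is a K x D matrix of per-document topic assignment counts):
doc_topic_counts <- fit$document_sums  # K topics x D documents
doc_topic_prob <- sweep(doc_topic_counts, 2, colSums(doc_topic_counts), "/")  # column-normalise to proportions
# Most likely topic for each document
most_likely_topic <- apply(doc_topic_prob, 2, which.max)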
