Question about pre-processing Russian text for stm in R

I am trying to run a structural topic model in R using the stm package. The corpus is a collection of Russian-language speeches. The problem I am having is that the Russian words are not being pre-processed correctly. Here is the code I have written thus far:
library(stm) # Package for structural topic modeling
library(igraph) # Package for network analysis and visualisation
library(stmCorrViz) # Package for hierarchical correlation view of STMs
data <- read.csv("convocation4.csv") # Load data
stopwordsRU <- readLines("stop_words_russian.txt", encoding = "UTF-8") # Custom stopwords
processed <- textProcessor(
  data$text,
  metadata = data,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  stem = TRUE,
  language = "ru",
  customstopwords = stopwordsRU
)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta) # Prepare documents for stm
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
fit <- stm(out$documents, out$vocab, K = 20, prevalence = ~party_id,
           max.em.its = 75, data = out$meta, init.type = "Spectral",
           seed = 8458159)
Here is an example of the problem. I have included one Russian stopword, уважаемый, in my list of custom stopwords. However, the word уважаемые, the plural form, comes out as one of the top words after running the model. Why is this happening? How might I solve it?
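For what it's worth, here is a rough diagnostic sketch, not part of my original pipeline: it assumes the SnowballC package is installed and that its Russian stemmer matches what textProcessor() applies. Stemming the custom stopword list shows whether the inflected forms collapse to a stem that is actually on the list, and the grep() call shows which related forms survive pre-processing.
library(SnowballC)
stopwordsRU_stemmed <- unique(wordStem(stopwordsRU, language = "russian")) # stem the custom stopword list
wordStem(c("уважаемый", "уважаемые"), language = "russian")                # do both forms share a stem?
grep("уважа", processed$vocab, value = TRUE)                               # which related forms remain in the vocabulary?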


Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm)

First things first. I've saved my workspace and you can load it with the following command:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.
I want to generate trends in expected topic proportion over time, as the authors of the stm vignette do.
I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the "covariate matrix is singular" warning, and the resulting trend plots look clearly off.
In the documentation, the authors note that
The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.
I've tried a variety of different models, from a 2-topic solution all the way up to a 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the splined data; I just don't know what exactly.
Again, here's a link to my workspace:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
And here is the code I'm using to get there:
library(udpipe)
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM
rm(list = ls())
abstracts <- read_delim("Data/5528_demand_ta.txt",
delim = "\t", escape_double = FALSE,
col_names = TRUE, trim_ws = TRUE)
abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)
# Download and load the English udpipe model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Annotate abstracts, assuming they are in English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)
# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
rename(term = lemma) %>%
select(doc_id, term , year)
# Prepare text
processed <- textProcessor(documents = data$term,
metadata = data,
lowercase = TRUE,
removestopwords = TRUE,
removenumbers = TRUE,
removepunctuation = TRUE,
stem = FALSE)
out <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = 20, # term must appear in at least n docs to matter
upper.thresh = 1000) # I've been using about 1/3 of documents as an upper threshold
# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
vocab = out$vocab,
K = 0,
init.type = "Spectral",
prevalence = ~ s(year),
data = out$meta,
max.em.its = 500, # Max number of runs to attempt
seed = 831)
###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################
# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year),
stmobj = stm_mod,
metadata = out$meta)
# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep,
"year",
method = "continuous",
model = stm_mod,
topics = 5,
printlegend = TRUE,
xaxt = "n",
xlab = "Years")
A singular matrix simply means that you have linearly dependent rows or columns. The first thing you could do is check the determinant of the matrix: a singular matrix implies a zero determinant, which means the matrix can't be inverted.
The next step would be to identify the linearly dependent rows (or columns); you can do so using Smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows, and Smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You may be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution, probably an unstable solution, but a solution.
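As a rough sketch of that check: here I rebuild a spline basis for year with splines::bs() as a stand-in for stm's s(), and df = 10 is a guess on my part, so the matrix will not match estimateEffect()'s internal design matrix exactly.
# Hedged diagnostic sketch (assumptions: splines::bs() approximates stm's s(); df = 10 is a guess)
library(splines)
X <- model.matrix(~ bs(out$meta$year, df = 10))   # spline design matrix for year
qr(X)$rank < ncol(X)                              # TRUE means X is rank deficient (singular)
det(crossprod(X))                                 # near zero means X'X cannot be inverted
Smisc::findDepMat(X, rows = FALSE, tol = 1e-10)   # flag linearly dependent columns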

Use a pre-trained model with text2vec?

I would like to use a pre-trained model with text2vec. My understanding was that the benefit here is that these models have already been trained on a huge volume of data, e.g. the Google News model.
Reading the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model with it:
library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
The documentation then proceeds to show one how to create tokens and a vocab:
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
Then, this looks like the step to fit the model:
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
My question is: is the well-known Google pre-trained word2vec model usable here, without the need to rely on my own vocabulary or my own local device to train the model? If yes, how could I read it in and use it in R?
I think I'm misunderstanding or missing something here? Can I use text2vec for this task?
At the moment text2vec doesn't provide any functionality for downloading or manipulating pre-trained word embeddings.
I have drafts to add such utilities to the next release.
On the other hand, you can easily do it manually with just standard R tools. For example, here is how to read fastText vectors:
con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
con = gzcon(con)
wv = readLines(con, n = 10)
Then you just need to parse it; strsplit and rbind are your friends. A rough sketch follows.
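For example, a minimal parsing sketch, assuming the usual fastText .vec layout (a header line with counts, then one line per word of the form "word v1 v2 ..."); the object names below are mine:
# Hedged sketch: parse the lines read above into a word-by-dimension matrix
lines <- wv[-1]                                             # drop the "n_words n_dims" header line
parts <- lapply(strsplit(lines, " ", fixed = TRUE),
                function(p) p[nzchar(p)])                   # split on spaces, drop empty fields
words <- vapply(parts, `[[`, character(1), 1)               # first field is the token
vecs  <- do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(vecs) <- words                                     # one row per word vector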
This comes a bit late, but it might be of interest to other users. Taylor Van Anne provides a short tutorial on how to use pretrained GloVe vector models with text2vec here:
https://gist.github.com/tjvananne/8b0e7df7dcad414e8e6d5bf3947439a9

STM: estimating metadata/topic relationships when starting from dfm

After running an STM model based on a Quanteda dfm, I want to estimate my covariates' effects on certain topics.
Running the STM model went fine, producing the topics as expected, but when using estimateEffect (the final step in the script below) the R session aborts with a "fatal error".
How can I estimate my covariates' effects when starting from a dfm? The stm manual advises on running an STM model from a dfm, but I couldn't find how to work with the covariates after this stage.
Here's the code:
# Read texts with Quanteda
texts <- (readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM"))
mycorpus <- corpus(texts)
tokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1)
mydfm <- dfm(tokens, remove = stopwords("english"), stem = TRUE)
# Run the STM model - Metadata is called with 'data = docvars(mycorpus)'
stm_from_dfm <- stm(mydfm, K = 10, prevalence =~ Date.of.Publication + source, gamma.prior='L1', data = docvars(mycorpus))
# Estimate effects
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm,
meta = docvars(mycorpus), uncertainty = "Global")
Alternatively, I made an STM corpus from my dfm, using STMcorpus <- asSTMCorpus(mydfm). But then I couldn't run the STM model, as it didn't recognize my metadata. Would it be better to follow this alternative strategy? (If so, I would need to associate the metadata with the STMcorpus in some way after running STMcorpus <- asSTMCorpus(mydfm).)
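For reference, one route I am considering is a rough sketch only, and it assumes quanteda's convert() with to = "stm" carries the docvars along as the meta data frame (this is not taken from the accepted answer below):
# Hedged sketch: convert the dfm to stm's input format so the docvars travel along as meta
stm_input <- quanteda::convert(mydfm, to = "stm")
stm_from_dfm <- stm(stm_input$documents, stm_input$vocab, K = 10,
                    prevalence = ~ Date.of.Publication + source,
                    gamma.prior = "L1", data = stm_input$meta)
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm,
                       metadata = stm_input$meta, uncertainty = "Global")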
We worked through this by email, but I'll add the answer here for others who might encounter some form of the problem.
There is a bug in the matrixStats package which causes R to crash with large matrices on Windows only. The bug and solution are detailed here: https://github.com/HenrikBengtsson/matrixStats/issues/104. The issue contains both a simple test of the problem and instructions for how to install the development version of matrixStats, which fixes it. This is an issue in matrixStats 0.52.2 and will presumably be resolved in the next CRAN release.
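If it helps, the development version can usually be installed straight from GitHub; the remotes call below is a suggestion on my part, and the linked issue has the authoritative instructions.
# Hedged sketch: install the development version of matrixStats from GitHub
remotes::install_github("HenrikBengtsson/matrixStats")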

Plotting the effect of document pruning on text corpus in R text2vec

Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package?
Here is an example of reading in a dataset and pruning the vocabulary:
library(text2vec)
library(data.table)
library(tm)
#Load movie review dataset
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
#Tokenize
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(movie_review$review,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = movie_review$id,
progressbar = FALSE)
#Generate vocabulary
vocab = create_vocabulary(it_train, stopwords = tm::stopwords())
#Prune vocabulary
#How do I ascertain how many documents got kicked out of my training set because of the pruning criteria?
pruned_vocab = prune_vocabulary(vocab,
term_count_min = 10,
doc_proportion_max = 0.5,
doc_proportion_min = 0.001)
# create document term matrix with new pruned vocabulary vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
Is there an easy way to understand how aggressively the term_count_min and doc_proportion_min parameters are cutting down my text corpus? I am trying to do something similar to what the stm package offers with its plotRemoved() function, which plots the number of documents, words, and tokens removed at different thresholds.
vocab$vocab is a data.table which contains a lot of statistics about your corpus. prune_vocabulary() with the term_count_min and doc_proportion_min parameters just filters this data.table. For example, here is how you can calculate the number of removed tokens:
total_tokens = sum(vocab$vocab$terms_counts)
total_tokens
# 1230342
# now let's prune
v2 = prune_vocabulary(vocab, term_count_min = 10)
total_tokens - sum(v2$vocab$terms_counts)
# 78037
# effectively this will remove 78037 tokens
Alternatively, you can create document-term matrices with different vocabularies and check various statistics with functions from the Matrix package: colMeans(), colSums(), rowMeans(), rowSums(), etc. I'm sure you can obtain any of the metrics above that way.
For example, here is how to find empty documents:
doc_word_count = Matrix::rowSums(dtm)
indices_empty_docs = which(doc_word_count == 0)
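And to answer the original question more directly, here is a rough sketch (reusing it_train, vocab, and pruned_vocab from your code) that compares how many documents end up empty before and after pruning:
# Hedged sketch: empty-document counts with the unpruned and pruned vocabularies
dtm_full   = create_dtm(it_train, vocab_vectorizer(vocab))
dtm_pruned = create_dtm(it_train, vocab_vectorizer(pruned_vocab))
sum(Matrix::rowSums(dtm_full) == 0)    # documents already empty before pruning
sum(Matrix::rowSums(dtm_pruned) == 0)  # documents emptied out by the pruning thresholds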

Seeding words into an LDA topic model in R

I have a dataset of news articles that have been collected based on the criteria that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I want to therefore seed these words into the model, and I am not sure exactly how to do that.
I see that the topicmodels package allows for an argument called seedwords, which "can be specified as a matrix or an object class of simple_triplet_matrix", but there are no other instructions. It seems that a simple_triplet_matrix only takes integers, not strings; does anyone know how I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?
Here is a shortened version of the code:
library("quanteda")
library("lda")
##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))
##Create document feature matrix
UKdfm2 <- dfm(UKcorp, ngrams = 1, verbose = TRUE, toLower = TRUE,
              removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
              removeTwitter = FALSE, stem = TRUE,
              ignoredFeatures = stopwords(kind = "english"),
              keptFeatures = NULL, language = "english",
              thesaurus = NULL, dictionary = NULL, valuetype = "fixed")
##Convert to lda model
UKlda2 <- convert(UKdfm2, to = "lda")
##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,
                                      num.iterations = 1500, alpha = .1, eta = .01,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = TRUE,
                                      trace = 0L, freeze.topics = FALSE)
"Seeding" words in the topicmodels package is a different procedure, as it allows you when estimating through the collapsed Gibbs sampler to attach prior weights for words. (See for instance Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models (pp. 204–213). Association for Computational Linguistics.) But this is part of an estimation strategy for topics, not a way of ensuring that key words of interest remain in your fitted topics. Unless you have set a threshold for removing them based on sparsity, before calling lad::lda.collapsed.gibbs.sampler(), then *every* term in yourUKlda2$vocab` vector will be assigned probabilities across topics.
What is probably happening here is that your words are of such low frequency that they do not appear near the top of any of your topics. It's also possible that stemming has changed them, e.g.:
quanteda::char_wordstem("euroscepticism")
## [1] "eurosceptic"
I suggest you make sure that your words exist in the dfm first, for example through:
colSums(UKdfm2)["eurosceptic"]
And then you can look at the fitted distribution of topic proportions for this word and others in the fitted topic model object; a rough sketch of that follows.
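For example, under the assumption that the stemmed form is "eurosceptic" (as char_wordstem() suggests above) and using the UKmod2 object from your code:
# Hedged sketch: how a given (stemmed) term is distributed across the K topics.
# lda.collapsed.gibbs.sampler() returns a $topics matrix of per-topic word counts,
# with one column per vocabulary term.
term <- "eurosceptic"                      # assumed stemmed form
term_counts <- UKmod2$topics[, term]       # count of this term assigned to each topic
round(term_counts / sum(term_counts), 3)   # share of the term's occurrences per topic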
