quanteda convert to topicmodels retaining docvars - r

I'm using the awesome quanteda package to convert my dfm to a topicmodels format. However, in the process I'm losing my docvars, which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem because the topicmodels package (like STM) keeps only documents with non-zero counts, so the number of documents in the original dfm and in the model output differ. Is there any way for me to correctly identify the documents in this case?

I checked your output. Because of your select statement you have no features left in dfm_speeches. Convert that to the "dtm" format used by topicmodels and you indeed get a document-term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm that still has features, and you then convert it to the dtm format, you will see the docvars appearing.
library("quanteda")
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE,
                    remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))
docvars(dfm_speeches)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm and you have no features, you will still see the documents appearing in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.
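As a quick sanity check (a sketch, not part of the original answer), you can see up front which documents would be dropped by the conversion because they have no features left after trimming and selecting:
# documents with zero remaining features are removed by convert()
docnames(dfm_speeches)[ntoken(dfm_speeches) == 0]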

I don't think the problem is described clearly, but I believe I understand what it is.
A topic model's document-feature matrix cannot contain empty documents, so the model returns a named vector of topics without them. But you can still work with it if you match the topics to the document names:
# mx is a quanteda's dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
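For context, a minimal end-to-end sketch of where that named vector comes from (lda_model and k = 10 are placeholders, assuming the topicmodels package):
library(topicmodels)
# fit an LDA model on the converted dfm; empty documents are dropped at this step
lda_model <- LDA(convert(mx, to = "topicmodels"), k = 10)
# topics() returns a named vector: names are document names, values are the most likely topic
topic <- topics(lda_model)
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]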

Sorry, here's an example.
library("quanteda")
library("topicmodels")
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE,
                    remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))
dfmlda <- convert(dfm_speeches, to = "topicmodels")
dfmlda
As you can see, the dfmlda object is empty because the dfm_select() call left no features in my dfm.

Related

kwic() function returns fewer rows than it should

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
library("dplyr")
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
It's hard to say for certain without a reproducible example (your input full_data, specifically), but here's my best guess. Your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead.
Fix it this way:
test_kwic <- kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
                  window = 5)
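For intuition on why "regex" recovers more hits than the default "glob": a glob pattern without wildcards must match a whole token, while a regex pattern matches anywhere inside a token, which is much closer to what grepl() does on the raw speech text. A toy sketch with a made-up sentence (not your data):
toks <- tokens("Der Trabifahrer parkte seinen Trabi vor dem Intershop")
kwic(toks, pattern = "Trabi", valuetype = "glob")   # matches only the exact token "Trabi"
kwic(toks, pattern = "Trabi", valuetype = "regex")  # also matches "Trabifahrer"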

Why does quanteda drop some documents when converting to topicmodels format?

I'm working with quanteda here, and finding that when I convert from a document-feature matrix to topic models I lose some documents. Does anyone know why this is, or how I can prevent it? It is causing me some problems in a later section of the analysis. This code goes from the construction of the dfm to the conversion. When I run nrow(dfm_counts2), I get 199,560 rows, but after converting to dtm_lda there are only 198,435. Why?
dfm_counts <- corpus_raw %>%
  dfm(tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
      remove = stopwords_and_single, stem = FALSE,
      remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE)
docnames(dfm_counts) <- dfm_counts@docvars$index

## trim tokens that are too common or too rare, to improve efficiency of modeling
dfm_counts2 <- dfm_trim(dfm_counts, max_docfreq = 0.95, min_docfreq = 0.005,
                        docfreq_type = "prop")
dtm_lda <- convert(dfm_counts2, to = "topicmodels")
That's because after your trimming, some documents now consist of zero features. convert(x, to = "topicmodels") removes empty documents, since you cannot fit them in a topic model, and topicmodels::LDA() produces an error if you try.
In the dfm_trim() call, 199,560 - 198,435 = 1,125 documents must have consisted only of features that fall outside your docfreq range, and so became empty.
I suspect that this will be true:
sum(ntoken(dfm_counts2) == 0) == 1125
By the way, you can rename the documents with:
docnames(dfm_counts) <- dfm_counts$index
It's better to use the $ operator than to access the object's internals directly.
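If you need the document counts to line up (for instance to attach topics back to your metadata later), one option is to drop the now-empty documents from the dfm before converting, so the dfm and the dtm describe exactly the same documents. A sketch, not from the original answer:
# remove documents that became empty after trimming, keeping docvars aligned
dfm_counts2 <- dfm_subset(dfm_counts2, ntoken(dfm_counts2) > 0)
dtm_lda <- convert(dfm_counts2, to = "topicmodels")
nrow(dtm_lda) == ndoc(dfm_counts2)  # should now be TRUE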

How do I subset my SOTU dfm to Presidents Wilson and later in quanteda?

I'm working with quanteda.corpora's SOTU corpus and need to subset it to look at roughly the last century of SOTU speeches. I'm coming from tm though, so I'm not super familiar with managing dfm objects.
I've learned how to preprocess the corpus when in dfm format, but I'm not certain what to do next. This is what I have right now. To my understanding, this code ought to subset my corpus to include only documents that were delivered after 1913.
library(quanteda)
library(quanteda.corpora)
dfmat_sotu <- dfm(data_corpus_sotu, tolower = TRUE, remove = stopwords("english"), remove_numbers = TRUE, remove_punct = TRUE)
dfmat_sotu <- dfm_wordstem(dfmat_sotu, language = quanteda_options("language_stemmer"))
dfmat_sotu <- dfm_subset(dfmat_sotu, Date > 1913-12-02)
wf_sotu <- textmodel_wordfish(dfmat_sotu)
textplot_scale1d(wf_sotu)
The issue is that when I run this code as well as wordfish, it becomes clear that I haven't subset the corpus as intended -- it seems to only include speeches from 1978 and later. What do I need to do differently?
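A likely cause (a guess, since no answer was posted in this thread): the unquoted 1913-12-02 in dfm_subset() is evaluated as the arithmetic expression 1913 - 12 - 2 = 1899 rather than as a date, so the comparison is against the number 1899 instead of 2 December 1913. A minimal sketch of the subsetting step with an explicit Date, assuming the Date docvar is stored as a Date:
dfmat_sotu <- dfm_subset(dfmat_sotu, Date > as.Date("1913-12-02"))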

hash vectorizer in R text2vec package with stopwords removal option

I am using R text2vec package for creating document-term-matrix. Here is my code:
library(lime)
library(text2vec)

# load data
data(train_sentences, package = "lime")

tokens <- train_sentences$text %>%
  word_tokenizer()
it <- itoken(tokens, progressbar = FALSE)
stop_words <- c("in", "the", "a", "at", "for", "is", "am")  # stopwords
vocab <- create_vocabulary(it, ngram = c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")
Another method is to use hash_vectorizer() instead of vocab_vectorizer(), as in:
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)
But when I use hash_vectorizer(), there is no option for stop-word removal or for pruning the vocabulary. In my use case, hash_vectorizer() works better than vocab_vectorizer(). I know one can remove stopwords after creating the dtm, or even when creating the tokens. Are there any other options, similar to what is available when the vocab_vectorizer is created? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().
I appreciate your responses.
Thanks, Sam
This is not possible. The whole point of using hash_vectorizer() and feature hashing is to avoid hashmap lookups (getting the index of a given word), and removing stop-words is essentially exactly that: checking whether a word is in the set of stop-words.
Usually hash_vectorizer() is recommended only if your dataset is very big and building a vocabulary takes a lot of time/memory. Otherwise, in my experience, vocab_vectorizer() with prune_vocabulary() will perform at least as well.
Also, if you use hash_vectorizer() with a small hash_size, it acts as a dimensionality-reduction step and hence can reduce variance for your dataset. So if your dataset is not very big, I suggest using vocab_vectorizer() and playing with the prune_vocabulary() parameters to reduce the vocabulary and document-term matrix size.
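As a workaround in the spirit of the advice above (a sketch, not part of the original answer): stop-words can be filtered out of the token lists before they ever reach hash_vectorizer(), which keeps the hashing pipeline but still removes the unwanted terms:
# drop stop-words at the token stage, before hashing
stop_words <- c("in", "the", "a", "at", "for", "is", "am")
tokens <- word_tokenizer(train_sentences$text)
tokens <- lapply(tokens, function(x) x[!tolower(x) %in% stop_words])
it <- itoken(tokens, progressbar = FALSE)
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)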

R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm)  # version 0.6, you seem to be using an older version

corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)

# A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}

a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows frequencies for each author, possibly aggregating multiple documents per author, rather than for each document? It would be useful to do this at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a simple triplet matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just adding up the term counts across that author's documents. Consider formatting the matrix as a standard R matrix and using standard subsetting / aggregating methods:
# Format the document-term matrix as a standard matrix.
# The rownames of m are the document IDs; the colnames are the individual terms.
m <- as.matrix(dtm)

# Group the document rows by Author (assumes metadata rows are in the same
# order as the documents in dtm) and aggregate the column sums (word
# frequencies) for each author, resulting in a list with one element per author.
author.list <- by(m, metadata$Author, colSums)

# Format the list as a matrix: one row per author, one column per term.
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)

# Add column names (terms) and row names (authors)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)

# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
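A more compact alternative (not from the original answer, just standard base R) is rowsum(), which sums matrix rows by a grouping vector, so no list handling is needed:
# one row per author, one column per term; assumes metadata$Author is aligned with the rows of m
author.dtm <- rowsum(m, group = metadata$Author)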
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus that is already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector) {
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector))
  # paste together all texts belonging to the same author
  for (i in 1:length(temp$Author)) {
    temp$Content[i] <- paste(as.character(text_vector[author_vector ==
      levels(author_vector)[i]]), collapse = " ")
  }
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))
  custom_matrix <<- DocumentTermMatrix(mycorpus,
                                       control = list(removePunctuation = TRUE))
}
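A hypothetical usage of this helper (metadata$Author and metadata$Text are assumed column names; the result is written to custom_matrix in the global environment via <<-):
custom_term_matrix(metadata$Author, metadata$Text)
inspect(custom_matrix[1:3, 1:5])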
There is probably a function internal to tm that I haven't been able to find, so I will be grateful for any help!
