Why does quanteda drop some documents when converting to topicmodels format? - r

I'm working with quanteda here, and finding that when I convert from a document-feature matrix to the topicmodels format I lose some documents. Does anyone know why this happens or how I can prevent it? It is causing me problems in a later section of the analysis. The code below runs from the construction of the dfm through the conversion. When I run nrow(dfm_counts2) I get 199,560 rows, but after converting to dtm_lda there are only 198,435. Why?
dfm_counts <- corpus_raw %>%
  dfm(tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
      remove = stopwords_and_single, stem = FALSE,
      remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE)
docnames(dfm_counts) <- dfm_counts@docvars$index
## trimming tokens too common or too rare to improve efficiency of modeling
dfm_counts2 <- dfm_trim(dfm_counts, max_docfreq = 0.95, min_docfreq = 0.005, docfreq_type = "prop")
dtm_lda <- convert(dfm_counts2, to = "topicmodels")

That's because after your trimming, some documents now consist of zero features. convert(x, to = "topicmodels") removes empty documents, since you cannot fit them in a topic model, and topicmodels::LDA() produces an error if you try.
In the dfm_trim() call, 199560 - 198435 = 1125 documents must have consisted entirely of features that fall outside your docfreq range.
I suspect that this will be true:
sum(ntoken(dfm_counts2) == 0) == 1125
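If you need the document sets to line up downstream, a minimal sketch (using quanteda's dfm_subset() and ntoken() on the objects above) is to drop the empty documents yourself before converting, so the dfm and the dtm describe exactly the same documents:
# drop documents that became empty after trimming, then convert
dfm_counts2 <- dfm_subset(dfm_counts2, ntoken(dfm_counts2) > 0)
dtm_lda <- convert(dfm_counts2, to = "topicmodels")
nrow(dfm_counts2)  # should now match the number of documents in dtm_lda (198,435)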
By the way, you can rename the documents with:
docnames(dfm_counts) <- dfm_counts$index
It's better to use this accessor than to reach into the object's internals.

Related

How to extract entity names with spacyr using custom data?

Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths and to tag the parts of speech (POS). For that purpose I have been using the tm and udpipe libraries, given the length of the database.
The other task I need to perform is to identify the entities. I tried the SpacyR library, but it does not correctly identify the name of the organizations, so I want to train a custom NER model based on a few documents from the corpus, which I have personally validated.
How could I "spacy_extract_entity()" with custom data? Or maybe with quanteda and spacyr?
Thanks in advance.
I have done the POS task in this way. I generated a couple of functions.
suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))
# load the corpus
tm_corpus <- VCorpus(DirSource("working_path", pattern = ".pdf"),
                     readerControl = list(reader = readPDF, language = 'es-419'))
# load udpipe
library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)
# functions to annotate the corpus
f_udpipe_anot <- function(n){
  txt <- as.character(tm_corpus[[n]]) %>% # flatten the document into a plain character vector
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  y <- as.data.frame(y)
  y
}
pinkillazo <- function(desde, hasta){
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item)
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}
leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
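As an aside, a rough sketch of an alternative (assuming the document texts can be pulled out of tm_corpus as a character vector): udpipe_annotate() accepts a vector of texts plus a doc_id, so the document-by-document loop can usually be replaced by a single call:
# annotate the whole corpus in one call instead of looping
txts <- vapply(seq_along(tm_corpus),
               function(i) paste(as.character(tm_corpus[[i]]), collapse = "\n"),
               character(1))
leyes_udpipe_POS <- as.data.frame(
  udpipe_annotate(udmodel_spanish, x = txts, doc_id = names(tm_corpus), trace = TRUE)
)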
To identify the named entities, I have tried this:
spacyr::spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)
organiz <- spacy_extract_entity(
  quan_corpus,
  output = c("data.frame", "list"),
  type = c("all", "named", "extended"),
  multithread = TRUE
)
I am getting the wrong organization names as well as other misspecifications. I thought multithreading would make this task easier, but that is not the case.
If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which provide Conditional Random Fields and Maximum Entropy models respectively and can be used alongside the udpipe annotation.
If you want deep learning models, you might have to look into torch, alongside tokenisers like sentencepiece and embedding techniques like word2vec, to implement your own modelling flow (e.g. a BiLSTM).
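To make the crfsuite route concrete, here is a rough, hedged sketch. It assumes your hand-validated entity tags sit in a column called chunk_entity (a hypothetical name), aligned token by token with the udpipe annotation from above; new documents would be annotated with udpipe in the same way before prediction:
library(crfsuite)
# leyes_udpipe_POS: the udpipe annotation; chunk_entity: your hand-labelled entity tags (hypothetical column)
ner_model <- crf(y = leyes_udpipe_POS$chunk_entity,
                 x = leyes_udpipe_POS[, c("token", "upos")],
                 group = leyes_udpipe_POS$doc_id,
                 method = "lbfgs",
                 options = list(max_iterations = 50))
# tag new, udpipe-annotated documents (new_udpipe_POS is a placeholder)
scores <- predict(ner_model,
                  newdata = new_udpipe_POS[, c("token", "upos")],
                  group = new_udpipe_POS$doc_id)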

kwic() function returns less rows than it should

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...
#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")
#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")
ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")
test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
It's something of a guess without a reproducible example (namely, your input full_data), but here's my best guess: your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead.
Fix it this way:
kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
window = 5
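As a quick sanity check, a sketch reusing the objects above: compare how many hits each pattern gets under the default glob matching versus regex matching. Your grepl() filter with ignore.case = TRUE also counts partial matches (e.g. forms like "Trabis" or compounds containing a keyword), which regex matching reproduces but glob matching does not:
# count kwic hits per pattern under glob vs regex matching
table(kwic(qtd_tokens, pattern = ostalgie_words, window = 5)$pattern)
table(kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex", window = 5)$pattern)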

hash vectorizer in R text2vec package with stopwords removal option

I am using R text2vec package for creating document-term-matrix. Here is my code:
library(lime)
library(text2vec)
# load data
data(train_sentences, package = "lime")
#
tokens <- train_sentences$text %>%
  word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
stop_words <- c("in", "the", "a", "at", "for", "is", "am") # stopwords
vocab <- create_vocabulary(it, c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")
Another method is hash_vectorizer() instead of vocab_vectorizer() as:
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it,h_vectorizer)
But when I use hash_vectorizer, there is no option for stopword removal or vocabulary pruning. In a case study, hash_vectorizer worked better than vocab_vectorizer for me. I know one can remove stopwords after creating the dtm or even when creating the tokens. Are there any other options, similar to what vocab_vectorizer offers? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().
I appreciate your responses.
Thanks, Sam
This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hashmap lookups (getting the index of a given word). Removing stop words is essentially that very lookup: checking whether a word is in the set of stop words.
Usually it is recommended to use hash_vectorizer only if your dataset is very big and if it takes a lot of time/memory to build the vocabulary. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary will perform at least as well.
Also, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and hence can reduce the variance of your dataset. So if your dataset is not very big, I suggest using vocab_vectorizer and playing with the prune_vocabulary parameters to reduce the vocabulary and the document-term-matrix size.
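If you do want to stick with feature hashing, the stop words can still be removed at the token stage, as the question notes. A sketch reusing the objects from the question: strip them from the token lists before building the itoken iterator, so the hashed dtm never sees them:
# drop stop words from the token lists, then hash as before
tokens_clean <- lapply(tokens, function(x) x[!(tolower(x) %in% stop_words)])
it_clean <- itoken(tokens_clean, progressbar = FALSE)
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it_clean, h_vectorizer)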

quanteda convert to topicmodels retaining docvars

I'm using the awesome quanteda package to convert my dfm to the topicmodels format. However, in the process I'm losing my docvars, which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem given that the topicmodels package (as does STM) keeps only documents with non-zero counts, so the number of documents in the original dfm and in the model output differ. Is there any way for me to correctly identify the documents in this case?
I checked your outcome. Because of your select statement, you have no features left in dfm_speeches. Convert that to the "dtm" format used by topicmodels and you indeed get a document-term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm with features, and you then convert it into the dtm format, you will see the docvars appear.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))
docvars(dfm_speeches)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm and you have no features, you will still see the documents appear in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.
I don't think the problem is described clearly, but I believe I understand what it is.
A topic model's document-term matrix cannot contain empty documents, so LDA returns a named vector of topics without them. But you can still live with it if you match the topics back to the document names:
# mx is a quanteda's dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
Sorry, here's an example.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))
library(topicmodels)
dfmlda <- convert(dfm_speeches, to = "topicmodels") %>%
dfmlda
As you can see, the dfmlda object is empty because the fact that I modified my dfm by removing specific words.

Removing words from a DocumentTermMatrix

My friend and I are working on transforming some tweets we collected into a dtm in order to be able to run a sentiment analysis using machine learning in R. The task must be performed in R, because it is for an exam at our university, where R is required to be used as a tool.
Initially we have collected a smaller sample, in order to test if our code was working, before we would start coding a larger dataset. Our problem is that we can't seem to figure out how to remove custom words from the dtm. Our code so far looks something like this (we are primarily using the tm package):
file <- read.csv("Tmix.csv",
row.names = NULL, sep=";", header=TRUE) #just for loading the dataset
tweetsCorpus <- Corpus(VectorSource(file[,1]))
tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
control = list(verbose = TRUE,
asPlain = TRUE,
stopwords = TRUE,
tolower = TRUE,
removeNumbers = TRUE,
stemWords = FALSE,
removePunctuation = TRUE,
removeSeparators = TRUE,
removeTwitter = TRUE,
stem = TRUE,
stripWhitespace = TRUE,
removeWords = c("customword1", "customword2", "customword3")))
We've also tried removing the words before converting to a dtm, using the removeWords command together with all of the "removeXXX" commands in the tm package, but it doesn't seem to work.
It is important that we don't simply remove all words with, say, 5 or fewer occurrences. We need all observations except the ones we want to remove, for instance https-addresses and the like.
Does anyone know how we do this?
And a second question: is there any easier way to remove all words that start with https, instead of having to write all of the addresses individually into the code? Right now, for instance, we are writing "httpstcokozcejeg", "httpstcolskjnyjyn", "httpstcolwwsxuem" as individual custom words to remove from the data.
NOTE: We know that removeWords is a terrible solution to our problem, but we can't figure out how else to do it.
You can use regular expressions, for example:
gsub("http[a-z]*","","httpstcolwwsxuem here")
[1] " here"
Assuming that you removed punctuation/digits in tweetsCorpus, you can use the following:
1- Direct gsub
tweetsCorpus <- gsub("http[a-z]*","",tweetsCorpus[[1]][[1]])
OR
2- tm::tm_map, content_transformer
library(tm)
RemoveURL <- function(x){
  gsub("http[a-z]*", "", x)
}
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))
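To cover the custom-word part of the question as well, a short sketch using the same corpus and the placeholder custom words from the question: remove them with tm_map() and removeWords before building the dtm, rather than inside the control list:
# strip URLs and custom words on the corpus, then build the dtm
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("customword1", "customword2", "customword3"))
tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(tolower = TRUE, removeNumbers = TRUE,
                                               removePunctuation = TRUE, stopwords = TRUE))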
