hash vectorizer in R text2vec package with stopwords removal option - r

I am using R text2vec package for creating document-term-matrix. Here is my code:
library(lime)
library(text2vec)
# load data
data(train_sentences, package = "lime")
#
tokens <- train_sentences$text %>%
word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
stop_words <- c("in","the","a","at","for","is","am") # stopwords
vocab <- create_vocabulary(it, c(1L, 2L), stopwords = stop_words) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab )
dtm <- create_dtm(it , vectorizer, type = "dgTMatrix")
Another method is hash_vectorizer() instead of vocab_vectorizer() as:
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it,h_vectorizer)
But when I use hash_vectorizer, there is no option for stopword removal or vocabulary pruning. In one case study, hash_vectorizer works better than vocab_vectorizer for me. I know one can remove stopwords after creating the dtm, or even when creating the tokens. Are there any other options, similar to how vocab_vectorizer is created? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().
I appreciate your responses.
Thanks, Sam

This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hashmap lookups (getting the index of a given word). Removing stop-words is essentially such a lookup: checking whether a word is in the set of stop-words.
It is usually recommended to use hash_vectorizer only if your dataset is very big and building a vocabulary takes a lot of time or memory. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary will perform at least as well.
Also, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and hence can reduce variance for your dataset. So if your dataset is not very big, I suggest using vocab_vectorizer and playing with the prune_vocabulary parameters to reduce the vocabulary and document-term-matrix size.
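The workaround the question already mentions, dropping stop-words at the token stage, could look roughly like the sketch below. The toy documents and the helper drop_stop_words() are made up for illustration, and the text2vec part only runs if the package is installed. Note that bigrams built after filtering will span the removed words.

```r
# Hypothetical toy documents; replace with your own text column
docs <- c("The cat is in the hat", "A dog is at the door")
stop_words <- c("in", "the", "a", "at", "for", "is", "am")

# Tokenise on whitespace after lower-casing
# (text2vec's word_tokenizer would work here as well)
tokens <- strsplit(tolower(docs), "\\s+")

# Illustrative helper: keep only tokens that are not stop-words
drop_stop_words <- function(x, stop_words) x[!x %in% stop_words]
tokens_clean <- lapply(tokens, drop_stop_words, stop_words = stop_words)

# Feed the filtered tokens to the hash vectorizer, if text2vec is available
if (requireNamespace("text2vec", quietly = TRUE)) {
  library(text2vec)
  it <- itoken(tokens_clean, progressbar = FALSE)
  h_vectorizer <- hash_vectorizer(hash_size = 2^10, ngram = c(1L, 2L))
  dtm <- create_dtm(it, h_vectorizer)
  print(dim(dtm))
}
```

This keeps the hashing step itself free of lookups; the stop-word check happens once per token before hashing. There is no equivalent shortcut for prune_vocabulary(), since pruning needs the term counts that hashing avoids collecting.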

Related

How to extract entities names with SpacyR with personalized data?

Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths and to tag the parts of speech (POS). For that purpose, given the size of the database, I was using the tm and udpipe libraries.
The other task I need to perform is to identify the entities. I tried the SpacyR library, but it does not correctly identify the names of the organizations, so I want to train a custom NER model based on a few documents from the corpus, which I have personally validated.
How could I use spacy_extract_entity() with custom data? Or maybe with quanteda and spacyr?
Thanks in advance.
I have done the POS task in this way. I generated a couple of functions.
suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))
# load the corpus
tm_corpus <- VCorpus(DirSource(
"working_path",
pattern = ".pdf"), readerControl = list(reader = readPDF, language = 'es-419'))
# load udpipe
library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)
# functions to annotate the corpus
f_udpipe_anot <- function(n){
txt <- as.character(tm_corpus[[n]]) %>% #lista simia
unlist()
y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
y <- as.data.frame(y)
}
pinkillazo <- function(desde, hasta){
resultado <- data.frame()
for (item in desde:hasta){
print(item)
resultado <- rbind(resultado, f_udpipe_anot(item))
}
return(resultado)
}
leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
To identify the named entities, I have tried this:
library(quanteda)
library(spacyr)
spacy_initialize(model = "es_core_news_sm")
# convert the tm corpus to a quanteda corpus
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)
# extract entities of all types as a data frame
organiz <- spacy_extract_entity(
quan_corpus,
output = "data.frame",
type = "all",
multithread = TRUE
)
I am getting the wrong organizations' names as well as other misclassifications. I thought that multithread would make this task easier, but that is not the case.
If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which implement Conditional Random Fields and Maximum Entropy Models respectively, and which can be used alongside the udpipe annotation.
If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).

quanteda convert to topicmodels retaining docvars

I'm using the awesome quanteda package to convert my dfm to a topicmodels format. However, in the process I'm losing my docvars which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem given that topicmodels package (as does STM) only selects non-zero counts. The number of documents in the original dfm and the model output hence differ. Is there any way for me to correctly identify the documents in casu?
I checked your outcome. Because of your select statement you have no features left in dfm_speeches. Convert that to the "dtm" format as used by the topicmodels and you indeed get a document term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm with features and you then convert it into a dtm format you will see docvars appearing.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))
docvars(dfm_speeches)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm and you have no features, you will still see the documents appearing in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.
I don't think the problem is described clearly, but I believe I understand what it is.
Topic models' document-feature matrix cannot contain empty documents, so they return a named vector of topics without them. But you can still work with it if you match the topics to the document names:
# mx is a quanteda dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
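To see what the match() call above does, here is a small base-R sketch with made-up document names and topic assignments (doc_names and topic are hypothetical stand-ins for docnames(mx) and the LDA output). Documents that were dropped as empty simply get NA.

```r
# Hypothetical document names, standing in for docnames(mx)
doc_names <- c("doc1", "doc2", "doc3", "doc4")

# Hypothetical named topic vector from a topic model;
# doc2 is missing because it was empty and got dropped
topic <- c(doc1 = 2L, doc3 = 1L, doc4 = 2L)

# Align topics to the full list of documents; unmatched documents become NA
aligned <- topic[match(doc_names, names(topic))]

result <- data.frame(doc = doc_names, topic = unname(aligned))
print(result)
```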
Sorry, here's an example.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))
library(topicmodels)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
dfmlda
As you can see, the dfmlda object is empty because I modified my dfm by removing specific words.

stri_replace_all_fixed slow on big data set - is there an alternative?

I'm trying to stem ~4000 documents in R by using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and therefore the Porter Stemmer Algorithm is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: Look at each word in each document -> If word = word from voc-table, then replace with tran-word.
##Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep=";")
#Using the library 'stringi' to make the stemming
library(stringi)
#Split the voc corpus and put the word and stem column into different corpus
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
#Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn
To make dictionary matching fast, you need to implement some clever data structure such as a prefix tree. Running 300,000 separate search-and-replace passes just does not scale.
I don't think this will be efficient in pure R; you will probably need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you when trying to do this in pure R.
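As a rough illustration of the lookup idea, the sketch below swaps in a different technique from the prefix tree suggested above: it splits each document into words and looks every word up once with match(), which uses a hash table internally. It replaces whole words only, as in the logic described in the question. The tiny voc table and the function name stem_document() are made up for the example.

```r
# Toy dictionary standing in for the 300k-row `voc` table (hypothetical data)
voc <- data.frame(
  Word = c("abandonnere", "abandonnerede", "abandonnerende"),
  Stem = c("abandonner", "abandonner", "abandonner"),
  stringsAsFactors = FALSE
)

# Illustrative helper: replace every whole word found in voc$Word by its stem
stem_document <- function(text, voc) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  idx <- match(tokens, voc$Word)     # one hashed lookup per token
  hit <- !is.na(idx)                 # tokens that exist in the dictionary
  tokens[hit] <- voc$Stem[idx[hit]]  # swap in the stem, leave the rest alone
  paste(tokens, collapse = " ")
}

stem_document("han vil abandonnere og abandonnerede", voc)
# → "han vil abandonner og abandonner"
```

The cost here grows with the number of words in the documents rather than with the number of dictionary entries times the number of documents. Real text would need proper tokenisation (punctuation, case) before the lookup.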

R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm) # version 0.6, you seem to be using an older version
corpus <-Corpus(DirSource("/directory-with-texts"),
readerControl = list(language="lat"))
metadata <- data.frame(DocID, Author)
#A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <-DocumentTermMatrix(corpus)
How would you build a matrix that shows frequencies for each author possibly aggregating multiple documents instead of documents? It would be useful to do this just at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a Simple Triplet Matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just adding up the rows that belong to the same author. Consider formatting the matrix as a standard R matrix and using standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m become the document Id's
# The colnames of m become the individual terms
m <- as.matrix(dtm)
# Group the rows (documents) by Author with the "by" operator.
# metadata$Author must have one entry per document, in the same order as the rows of m
# Aggregate column sums (word frequencies) for each author, resulting in a list.
author.list <- by(m, metadata$Author, colSums)
# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)
# Add column names (term) and row names (author)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
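The same aggregation can also be sketched with base R's rowsum(), which sums the rows of a matrix within groups. The small matrix m and the author vector below are made-up stand-ins for as.matrix(dtm) and metadata$Author.

```r
# Hypothetical document-term counts: rows are documents, columns are terms
m <- matrix(
  c(1, 0, 2,
    0, 3, 1,
    4, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("arma", "virum", "cano"))
)

# Hypothetical author of each document, in the same order as the rows of m
author <- c("Vergil", "Ovid", "Vergil")

# Sum the term counts of all documents that share an author
author_dtm <- rowsum(m, group = author)
print(author_dtm)
```

The result has one row per author (sorted by name) and one column per term. For a large corpus, converting the whole DocumentTermMatrix to a dense matrix may use a lot of memory, so this is only a sketch for moderate sizes.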
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus in 'tm' format, but it may be handy in other cases. Feel free to improve it!
custom_term_matrix <- function(author_vector, text_vector)
{
author_vector <- factor(author_vector)
temp <- data.frame(Author = levels(author_vector))
# Concatenate all texts of one author into a single document
for (i in seq_along(temp$Author)){
temp$Content[i] <- paste(c(as.character(text_vector[author_vector ==
levels(author_vector)[i]])), sep=" ", collapse=" ")
}
m <- list(id = "Author", content = "Content")
myReader <- readTabular(mapping = m)
# Build the corpus from the per-author table created above
mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))
custom_matrix <<- DocumentTermMatrix(mycorpus, control =
list(removePunctuation = TRUE))
}
There probably is a function internal to tm, that I haven't been able to find, so I will be grateful for any help!

Text Retrieval using R

I have been using R's text mining package and it's really a great tool. I have not found retrieval support, or maybe there is functionality I am missing.
How can a simple VSM model be implemented using R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences,fortune(i)$quote)
d <- data.frame(textCol =sentences )
ds <- DataframeSource(d)
dsc<-Corpus(ds)
dtm<- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm,1,function(x,y=dictQry){length(intersect(names(x)[x!= 0],y))})
Assuming VSM = Vector Space Model, you can go about a simple retrieval system in the following manner:
1. Create a Document Term Matrix of your collection/corpus.
2. Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions. RSiteSearch should help in finding them.
3. Convert your query to a Document Term Matrix (which will have 1 row and is mapped using the same dictionary as used for the first step).
4. Compute similarity with the query and the matrix from the first step.
5. Rank the results and choose the top n.
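The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the three documents, the query, the helper names (tokenize, to_vector, cosine) and the choice of cosine similarity are all made up for the example, and it does not use the tm package.

```r
# Hypothetical mini-corpus and query
docs <- c(d1 = "lets stand up and be counted",
          d2 = "it seems to work but is undocumented",
          d3 = "the weather is nice today")
query <- "stand up and be counted"

# Lower-case and split on anything that is not a letter
tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")[[1]]

# Step 1: document-term matrix over a shared vocabulary
doc_tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(doc_tokens)))
to_vector <- function(tokens) as.numeric(table(factor(tokens, levels = vocab)))
dtm <- t(sapply(doc_tokens, to_vector))
colnames(dtm) <- vocab

# Step 2: similarity measure (cosine, chosen here as an example)
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

# Step 3: map the query onto the same vocabulary (unknown words are dropped)
q <- to_vector(tokenize(query))

# Steps 4 and 5: score every document against the query and rank
scores <- apply(dtm, 1, cosine, b = q)
ranking <- sort(scores, decreasing = TRUE)
head(ranking, 2)
```

With a real corpus you would typically swap the raw counts for tf-idf weights and keep the matrix sparse.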
A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using the tsvector querying methods, you can have a very fast retrieval system.
