Use a pre-trained model with text2vec?

I would like to use a pre-trained model with text2vec. My understanding is that the benefit here is that these models have already been trained on a huge volume of data, e.g. the Google News model.
Reading the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model with it:
library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
The documentation then proceeds to show how to create tokens and a vocabulary:
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
Then, this looks like the step to fit the model:
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
My question is: is the well-known pre-trained Google word2vec model usable here, without needing my own vocabulary or my own local machine to train the model? If yes, how could I read it in and use it in R?
I think I'm misunderstanding or missing something here. Can I use text2vec for this task?

At the moment text2vec doesn't provide any functionality for downloading/manipulating pre-trained word embeddings.
I have drafts for adding such utilities in the next release.
On the other hand, you can easily do it manually with just standard R tools. For example, here is how to read fastText vectors:
con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
con = gzcon(con)
wv = readLines(con, n = 10)
Then you just need to parse it; strsplit and rbind are your friends.
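For illustration, a minimal sketch of that parsing step (assuming `wv` holds the lines read above, and that the first line is the usual "word-count dimension" header of a .vec file):
wv = wv[-1]                                      # drop the "n_words n_dims" header line
tokens = strsplit(wv, " ", fixed = TRUE)
tokens = lapply(tokens, function(x) x[x != ""])  # guard against trailing spaces
words = vapply(tokens, function(x) x[1], character(1))
vectors = do.call(rbind, lapply(tokens, function(x) as.numeric(x[-1])))
rownames(vectors) = words                        # one row per word, one column per dimension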

This comes a bit late, but it might be of interest to other users. Taylor Van Anne provides a little tutorial on how to use pretrained GloVe vector models with text2vec here:
https://gist.github.com/tjvananne/8b0e7df7dcad414e8e6d5bf3947439a9
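As a rough sketch of how such a matrix of pretrained vectors can then be used with text2vec (assuming `vectors` is a word-by-dimension matrix with words as row names, e.g. built as in the previous answer, and that the hypothetical query word "king" actually appears in it):
library(text2vec)
query = vectors["king", , drop = FALSE]                        # hypothetical query word
cos_sim = sim2(x = vectors, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)                 # nearest neighbours of the query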

Related

How do I build a model using GloVe word embeddings and predict on test data using text2vec in R?

I am building a model to classify text data into two categories (i.e. classifying each comment into one of 2 categories) using GloVe word embeddings. I have two columns, one with textual data (comments) and the other a binary target variable (whether a comment is actionable or not). I was able to generate GloVe word embeddings for the textual data using the following code from the text2vec documentation.
glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main + t(word_vectors_context)
How do I build a model and generate predictions on test data?
text2vec has a standard predict method (like most R libraries) that you can use in a straightforward fashion: have a look at the documentation.
To make a long story short, just use
predictions <- predict(fitted_model, data)
Got it.
glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main + t(word_vectors_context)
After creating the word embeddings, build an index that maps words (strings) to their vector representations (numeric vectors):
# (assumes `lines` holds the lines of a pretrained GloVe .txt file, read in with readLines)
embeddings_index <- new.env(parent = emptyenv())
for (line in lines) {
  values <- strsplit(line, ' ', fixed = TRUE)[[1]]
  word <- values[[1]]
  coefs <- as.numeric(values[-1])
  embeddings_index[[word]] <- coefs
}
Next, build an embedding matrix of shape (max_words, embedding_dim) which can be loaded into an embedding layer.
embedding_dim <- 50  # number of dimensions used to represent each word
# word_index is assumed to be the word-to-integer mapping from your tokenizer
embedding_matrix <- array(0, c(max_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector)) {
      # words not found in the embedding index will all be zeros
      embedding_matrix[index + 1, ] <- embedding_vector
    }
  }
}
We can then load this embedding matrix into the embedding layer, build a model and then generate predictions.
model_pretrained <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(model_pretrained)
# Load the GloVe embeddings into the model
get_layer(model_pretrained, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()
model_pretrained %>% compile(optimizer = "rmsprop",
                             loss = "binary_crossentropy",
                             metrics = c("accuracy"))
history <- model_pretrained %>% fit(x_train, y_train,
                                    validation_data = list(x_val, y_val),
                                    epochs = num_epochs, batch_size = 32)
Then use the standard predict function to generate predictions.
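For example, a sketch of the prediction step (assuming `x_test` holds test comments preprocessed exactly like x_train, i.e. tokenized with the same word index and padded to the same length):
pred_probs <- model_pretrained %>% predict(x_test)   # sigmoid outputs in [0, 1]
pred_labels <- as.integer(pred_probs > 0.5)          # threshold into the two classes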
Check the following links.
Use word embeddings to build a model in Keras
Pre-trained word embeddings

Normalized topic document probabilities text2vec R

I am trying to find the topic-document probabilities after running an LDA model using the text2vec package in R.
The following commands generate the model:
lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = quantdfm, n_iter = 2000, convergence_tol = 0.00001, n_check_convergence = 10, progressbar = FALSE)
quantdfm is the DTM built with the quanteda package, which I am plugging into the $fit_transform method.
I noticed that doc_topic_distr already contains the topic-document probabilities (without even asking for normalization). Is this correct? I ask because on a previous post, How to get topic probability table from text2vec LDA, Dmitriy Selivanov suggested deriving such probabilities using:
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
whereas when I use that command, doc_topic_distr and doc_topic_prob have the same values (I thought the former would contain counts, as opposed to the fractions in the latter).
Please suggest whether this is the expected behavior of the code, or whether I have missed something here.
Thanks.
According to the up-to-date documentation, LDA's fit_transform returns topic probabilities.
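A quick sanity check you could run (a sketch using the objects from the question): if fit_transform already returns probabilities, every row of doc_topic_distr should sum to roughly 1, and l1 normalization should leave it unchanged.
summary(rowSums(doc_topic_distr))                                    # should be ~1 for every document
all.equal(doc_topic_distr, normalize(doc_topic_distr, norm = "l1"))  # TRUE if already normalized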

STM: estimating metadata/topic relationships when starting from dfm

After running an STM model based on a Quanteda dfm, I want to estimate my covariates' effects on certain topics.
Running the STM model went fine, producing the topics as expected, but when I use estimateEffect (the final step in the script below) the R session is aborted with a 'fatal error'.
How can I estimate my covariates' effects when starting from a dfm? The STM manual advises on running an STM model from a dfm, but I couldn't find how to work with the covariates after this stage.
Here's the code:
# Read texts with Quanteda
texts <- readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
                  docvarsfrom = "filenames", dvsep = "_",
                  docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
                  encoding = "UTF-8-BOM")
mycorpus <- corpus(texts)
tokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1)
mydfm <- dfm(tokens, remove = stopwords("english"), stem = TRUE)
# Run the STM model - metadata is passed with 'data = docvars(mycorpus)'
stm_from_dfm <- stm(mydfm, K = 10, prevalence = ~ Date.of.Publication + source,
                    gamma.prior = 'L1', data = docvars(mycorpus))
# Estimate effects
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm,
                       meta = docvars(mycorpus), uncertainty = "Global")
Alternatively, I made an STM corpus from my dfm, using STMcorpus <- asSTMCorpus(mydfm). But then I couldn't run the STM model, as it didn't recognize my metadata. Would it be better to follow this alternative strategy? (If so, I would need to associate the metadata with STMcorpus in some way after running STMcorpus <- asSTMCorpus(mydfm).)
We worked through this by email, but I'll add the answer here for others who might encounter some form of this problem.
There is a bug in the matrixStats package which causes R to crash with large matrices on Windows only. The bug and solution are detailed here: https://github.com/HenrikBengtsson/matrixStats/issues/104. The issue contains both a simple test of the problem and instructions for installing the development version of matrixStats, which fixes it. This affects matrixStats version 0.52.2 and will presumably be resolved in the next CRAN release.
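Until the fixed version reaches CRAN, one way to install the development version is via remotes (a sketch; see the linked issue for the exact instructions):
install.packages("remotes")
remotes::install_github("HenrikBengtsson/matrixStats")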

Plotting the effect of document pruning on text corpus in R text2vec

Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package?
Here is an example of reading in a dataset and pruning the vocabulary:
library(text2vec)
library(data.table)
library(tm)
# Load movie review dataset
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
# Tokenize
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(movie_review$review,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = movie_review$id,
                  progressbar = FALSE)
# Generate vocabulary
vocab = create_vocabulary(it_train, stopwords = tm::stopwords())
# Prune vocabulary
# How do I ascertain how many documents got kicked out of my training set because of the pruning criteria?
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# Create document-term matrix with the new pruned-vocabulary vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
Is there an easy way to understand how aggressively the term_count_min and doc_proportion_min parameters act on my text corpus? I am trying to do something similar to what the stm package offers with its plotRemoved function, which plots how many documents, words, and tokens are removed at different pruning thresholds.
vocab$vocab is a data.table that contains a lot of statistics about your corpus. prune_vocabulary with the term_count_min and doc_proportion_min parameters just filters this data.table. For example, here is how you can calculate the number of removed tokens:
# v is the vocabulary returned by create_vocabulary() (called `vocab` in the question)
total_tokens = sum(v$vocab$terms_counts)
total_tokens
# 1230342
# now let's prune
v2 = prune_vocabulary(v, term_count_min = 10)
total_tokens - sum(v2$vocab$terms_counts)
# 78037
# effectively this will remove 78037 tokens
On the other hand, you can create document-term matrices with different vocabularies and check various statistics with functions from the Matrix package: colMeans(), colSums(), rowMeans(), rowSums(), etc. I'm sure you can obtain any of the metrics above that way.
For example, here is how to find empty documents:
doc_word_count = Matrix::rowSums(dtm)
indices_empty_docs = which(doc_word_count == 0)
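Putting that together for the example in the question (a sketch, reusing it_train, vocab, and dtm_train from above), you can count how many documents end up empty, i.e. are effectively dropped by the pruning:
dtm_full = create_dtm(it_train, vocab_vectorizer(vocab))   # DTM built with the unpruned vocabulary
empty_before = sum(Matrix::rowSums(dtm_full) == 0)
empty_after = sum(Matrix::rowSums(dtm_train) == 0)         # dtm_train uses the pruned vocabulary
empty_after - empty_before                                 # documents emptied out by pruning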

Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.
Following the example in the following link I created a topic model like so:
Gibbs = LDA(JSS_dtm, k = 4,
            method = "Gibbs",
            control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))
We can then get the topics using topics(Gibbs, 1), the terms using terms(Gibbs, 10), and even the topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get the term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was also hoping that a similar solution might exist in R.
If anyone knows how this can be achieved, please let us know.
Many thanks!
EDIT:
For the benefit of others, I thought I'd share my current workaround. If I knew the term probabilities, I'd be able to visualise them and give the viewer a better understanding of what each topic means; without the probabilities, I'm simply breaking my data down by topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.
See the below code:
JSS_text <- sapply(1:length(JSS_papers[, "description"]),
                   function(x) unlist(JSS_papers[x, "description"]))
jss_df <- data.frame(text = JSS_text, topic = topics(Gibbs, 1))
jss_dec_df <- data.frame()
for (i in unique(topics(Gibbs, 1))) {
  jss_dec_df <- rbind(jss_dec_df,
                      data.frame(topic = i,
                                 text = paste(jss_df[jss_df$topic == i, "text"], collapse = " ")))
}
corpus <- Corpus(VectorSource(jss_dec_df$text))
JSS_dtm <- TermDocumentMatrix(corpus,
                              control = list(stemming = TRUE,
                                             stopwords = TRUE,
                                             minWordLength = 3,
                                             removeNumbers = TRUE,
                                             removePunctuation = TRUE,
                                             weighting = function(x) weightSMART(x, spec = "bnc")))
(JSS_dtm = removeSparseTerms(JSS_dtm, 0.1))  # note the sparsity parameter
library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm), random.order = FALSE, max.words = 100,
                 scale = c(6, 0.6), colours = 4, title.size = 2)
Figured it out -- to get the term weights, use posterior(lda_object)$terms. Turned out to be much easier than I thought!
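For the Gibbs model above, that looks roughly like this (a sketch):
term_probs <- posterior(Gibbs)$terms                 # one row per topic, one column per term
rowSums(term_probs)                                  # each row sums to 1 (a probability distribution)
head(sort(term_probs[1, ], decreasing = TRUE), 10)   # top 10 terms and their weights for topic 1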
