Classic king - man + woman = queen example with pretrained word-embedding and word2vec package in R

I am really desperate: I just cannot reproduce the allegedly classic example of king - man + woman = queen with the word2vec package in R and any (!) pre-trained embedding model (as a .bin file).
I would be very grateful if anybody could provide working code to reproduce this example, including a link to a pre-trained model that is actually downloadable (many are not!).
Thank you very much!

An overview of using word2vec with R is available at https://www.bnosac.be/index.php/blog/100-word2vec-in-r, which even shows an example of king - man + woman = queen.
Just following the instructions there, I downloaded the first English 300-dim word2vec embedding model I came across at http://vectors.nlpl.eu/repository (trained on the British National Corpus), unzipped model.bin to my drive, inspected the terms in the model (words are apparently appended with POS tags), got the word vectors, displayed them, computed king - man + woman, and looked for the closest vector to that result: it gives ... queen.
> library(word2vec)
> model <- read.word2vec("C:/Users/jwijf/OneDrive/Bureaublad/model.bin", normalize = TRUE)
> head(summary(model, type = "vocabulary"), n = 10)
[1] "vintage-style_ADJ" "Sinopoli_PROPN" "Yarrell_PROPN" "en-1_NUM" "74°–78°F_X"
[6] "bursa_NOUN" "uni-male_ADJ" "37541_NUM" "Menuetto_PROPN" "Saxena_PROPN"
> wv <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN"), type = "embedding")
> head(t(wv), n = 10)
king_NOUN man_NOUN woman_NOUN
[1,] -0.4536242 -0.47802860 -1.03320265
[2,] 0.7096733 1.40374041 -0.91597748
[3,] 1.1509652 2.35536361 1.57869458
[4,] -0.2882653 -0.59587735 -0.59021348
[5,] -0.2110678 -1.05059254 -0.64248675
[6,] 0.1846713 -0.05871651 -1.01818573
[7,] 0.5493720 0.13456300 0.38765019
[8,] -0.9401053 0.56237948 0.02383301
[9,] 0.1140556 -0.38569298 -0.43408644
[10,] 0.3657919 0.92853492 -2.56553030
> wv <- wv["king_NOUN", ] - wv["man_NOUN", ] + wv["woman_NOUN", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king_NOUN 0.9332663 1
2 queen_NOUN 0.7813236 2
3 coronation_NOUN 0.7663506 3
4 kingship_NOUN 0.7626975 4
Do you prefer to build your own model based on your own text or on a larger corpus, e.g. the text8 file? Follow the instructions shown at https://www.bnosac.be/index.php/blog/100-word2vec-in-r.
Get a text file, use the R package word2vec to build the model, wait until the model has finished training, and then interact with it.
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip("text8.zip", files = "text8")
> library(word2vec)
> set.seed(123456789)
> model <- word2vec(x = "text8", type = "cbow", dim = 100, window = 10, lr = 0.05, iter = 5, hs = FALSE, threads = 2)
> wv <- predict(model, newdata = c("king", "man", "woman"), type = "embedding")
> wv <- wv["king", ] - wv["man", ] + wv["woman", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king 0.9743692 1
2 queen 0.8295941 2

You haven't shown which pretrained models you've tried, what data you used in your attempts, what training-then-probing code you ran, or how your attempt failed. So it's hard to help without writing you a whole tutorial... and there are already plenty of word2vec tutorials online.
But note:
word2vec is a data-hungry algorithm, and its useful qualities (including analogy-solving capabilities) only become reliably demonstrable with adequately large training sets
that said, most pretrained models from competent teams should easily show the classic man : king :: woman : queen analogy solution, when using the same kind of vector arithmetic & candidate-answer ranking (eliminating all words in the question) as the original work
if I recall correctly, the mere 100MB of uncompressed text in the text8 dataset from http://mattmahoney.net/dc/textdata will often succeed, or come close to succeeding, on man : king :: woman : queen, though the related text9, which is 1GB of data, tends to do much better. Both, though, are a bit small for making strong general word vectors. For contrast, the GoogleNews vectors Google released circa 2013, at the same time as the original word2vec papers, were said to be trained on something like 100GB of news articles.
beware, though: the text8 & text9 datasets, by stripping all punctuation/linebreaks, may need to be chunked before being passed to word2vec implementations that require training texts to fit within certain limits. For example, Python's Gensim expects training texts to be no longer than 10,000 tokens each. text8 is 17 million words on one line; if you pass that one line of 17 million tokens to Gensim as one training text, 99.94% of them will be ignored as beyond the 10,000-token limit. Your R implementation may have a similar, or even tighter, limit (see the chunking sketch below).
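For example, here is a minimal sketch of chunking text8 into pieces of at most 10,000 tokens before training with the R word2vec package; the 10,000-token figure follows Gensim's documented limit, and whether the R package needs this at all is an assumption, so treat it as a precaution rather than a requirement.
library(word2vec)
# text8 is a single line of ~17 million space-separated tokens; split it into
# chunks of at most 10,000 tokens so each "sentence" stays within typical limits
tokens <- strsplit(readLines("text8", warn = FALSE), " ")[[1]]
chunks <- split(tokens, ceiling(seq_along(tokens) / 10000))
texts  <- vapply(chunks, paste, character(1), collapse = " ")
set.seed(123456789)
model <- word2vec(x = texts, type = "cbow", dim = 100, window = 10, iter = 5)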

Related

topicmodels has inverted functions $topics and $terms. Is it reliable?

I have a vector of strings (which represent preprocessed documents) on which I want to estimate an LDA model through R. I use functions in the topicmodels library.
To make the problem easy to reproduce, I create a vector with three documents and impose 5 topics in the LDA model. The full code is as follows:
#install.packages("tm")
library("tm")
#install.packages("topicmodels")
library("topicmodels")
vector_of_speeches<- c("feder reserv commit use full rang tool support us economi challeng time therebi promot maxemploy pricest goal", "progress strong polici support indic economicact employ continu strengthen sector advers affect pandem improv recent month continu affect covid job gain solid recent month unemploymentr declin substanti suppli demand imbal relat pandem economi continu contribut elev level inflat overal financialcondit remain accommod part reflect polici measur support economi flow credit us household busi","path economi continu depend cours viru progress eas suppli expect support continu gain economicact employ reduct inflat risk economicoutlook remain includ new viru")
df <- as.data.frame(vector_of_speeches)
myCorpus <- Corpus(VectorSource(df$vector_of_speeches))
dtm <- TermDocumentMatrix(myCorpus)
inspect(dtm) # 3 documents and 68 different words
#LDA prep
burnin <- 4000
iter <- 4000
keep <- 50
k<-5
delta_gibbs <- 0.025
alpha_gibbs <- 50/k
seed=0
fomc_LDA <- LDA(dtm, k=k, method = "Gibbs", control = list(seed=seed, burnin = burnin, iter = iter, keep = keep))
str(as.matrix(posterior(fomc_LDA)$terms)) #dimension is 5 x 3, so the number of topics is being related with the number of documents
str(as.matrix(posterior(fomc_LDA)$topics)) #dimension is 68 x 5, so the number of unique words is being related with the number of topics
The function that extracts the topic distribution per document is $topics, and the one that extracts the vocabulary distribution per topic is $terms. However, they are clearly inverted in the above code (the topic distribution is actually extracted via $terms). Why is this occurring, and is it safe to use the topic distributions per document that are being returned by $terms?
When I use the full vector of documents (almost 2000), I tried transposing the document-term matrix, writing dtm <- t(dtm), but then running the LDA model yields the following error:
Error in LDA(dtm, k = k, method = "Gibbs", control = list(seed = seed, :
Each row of the input matrix needs to contain at least one non-zero entry
Why does this occur? It is weird that $topics and $terms seem inverted in the output they deliver, and I am not sure whether I can rely on $terms to obtain the correct topic distributions per document (which is what I need).
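One plausible explanation is that TermDocumentMatrix() returns a terms-by-documents matrix while LDA() expects documents in the rows, which would make the $topics and $terms outputs look swapped. A minimal sketch of that reading, building a DocumentTermMatrix instead and dropping documents left empty by preprocessing (the usual trigger of the non-zero-entry error), reusing the control values defined above:
library(tm)
library(topicmodels)
# LDA() expects a documents-by-terms matrix, not a terms-by-documents one
dtm <- DocumentTermMatrix(myCorpus)
# drop documents that lost all their terms during preprocessing, otherwise
# LDA() stops with the "at least one non-zero entry" error
dtm <- dtm[slam::row_sums(dtm) > 0, ]
fomc_LDA <- LDA(dtm, k = k, method = "Gibbs",
                control = list(seed = seed, burnin = burnin, iter = iter, keep = keep))
str(posterior(fomc_LDA)$topics) # documents x topics
str(posterior(fomc_LDA)$terms)  # topics x vocabulary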

How to assign the topics retrieved via LDA in R using the "textmineR" package to the specific documents

I have got 787 documents (speeches as text files). Using the "textmineR" package I got the topics for them. I have got 3 topics, as below:
topic label coherence prevalence top_terms
t_1 policy 0.092 37.374 policy, inflation, monetary, rate, federal, economic
t_2 financial 0.030 37.677 financial, banks, risk, capital, market, not
t_3 community 0.004 24.949 community, federal, reserve, more, return, mortgage
Can someone please suggest how I can assign each topic to the relevant document and create a data table for it:
Document Number Topic
1 t_1
and so on.
Glad you found the solution yourself and sorry I didn't see it sooner.
If you need to assign topics to new documents you can also use predict.
Here's a reproducible example using your solution and predict.
library(textmineR)
# 'mycorpus' and `newcorpus` are disjoint character vectors of documents
mycorpus <- nih_sample$ABSTRACT_TEXT
newcorpus <- nih_sample$PROJECT_TITLE
# create a document term matrix for training
dtm <- CreateDtm(mycorpus)
# train an LDA topic model
lda <- FitLdaModel(dtm, k = 10, iterations = 200, burnin = 150)
# get the topic document assignments for your training data
lda$theta
# create a new document term matrix for new documents
new_dtm <- CreateDtm(newcorpus)
# predict handles vocabulary (mis)alignment for you
new_theta <- predict(lda, new_dtm, iterations = 200, burnin = 150)
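To build the "Document Number / Topic" table asked for, one option is to take the most probable topic per row of theta; a minimal sketch, assuming the lda object fitted above:
# assign each training document its most probable topic
assignments <- data.frame(
  document = seq_len(nrow(lda$theta)),
  topic    = colnames(lda$theta)[max.col(lda$theta)]
)
head(assignments)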
Found it: one can use the theta matrix generated as a result of FitLdaModel; that is the significance of each topic in each speech (document).

How to train a model with large categorical features :: RStudio crashes

I have a dataset with over 800K rows and 66 columns/features. I am training an xgboost model with caret using 5-fold cross-validation. However, due to the following two columns my R session always crashes, even though I am using an Amazon EC2 instance with the following specs:
m5.4xlarge: 16 vCPUs, 64 GiB RAM, EBS-only storage, up to 10 Gbps network bandwidth, 3,500 Mbps EBS bandwidth
# A tibble: 815,885 x 66
first_tile last_tile
<fct> <fct>
1 Filly Brown Body of Evidence
2 The Dish The Hunger Games
3 Waiting for Guffman Hell's Kitchen N.Y.C.
4 The Age of Innocence The Lake House
5 Malevolence In the Name of the Father
6 Old Partner Desperate Measures
7 Lady Jane The Invasion
8 Mad Dog Time Eye of the Needle
9 Beauty Is Embarrassing Funny Lady
10 The Snowtown Murders Alvin and the Chipmunks
11 Superman II Pina
12 Leap of Faith Capote
13 The Royal Tenenbaums Dead Men Don't Wear Plaid
14 School for Scoundrels Tarzan
15 Rhinestone Cocoon: The Return
16 Burn After Reading Death Defying Acts
17 The Doors Half Baked
18 The Wood Dance of the Dead
19 Jason X Around the World in 80 Days
20 Dragon Wars LOL
## Model Training
library(caret)
set.seed(42)
split <- 0.8
train_index <- createDataPartition(data_tbl$paid, p = split, list = FALSE)
data_train <- data_tbl[train_index, ]
data_test <- data_tbl[-train_index, ]
## Summarise The Target Variable
table(data_train$paid) / nrow(data_train)
## Create train/test indexes, preserving class distribution
set.seed(42)
my_folds <- createFolds(data_train$paid, k = 5)
# Compare class distribution
i <- my_folds$Fold1
table(data_train$paid[i]) / length(i)
## Reusing trainControl
my_control <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = my_folds
)
model_xgb <- train(
  paid ~ .,
  data = data_train,
  metric = "ROC",
  method = "xgbTree",
  trControl = my_control
)
Can you suggest a way to get around this memory problem?
Is there a way I can do some sort of one-hot encoding for these features?
Is there a way to know how big a machine I need?
I would appreciate any suggestion or help.
Thanks in advance
There are different ways to tackle such issues in the world of ML.
Do you really need all 66 features? Have you performed any feature selection? Have you tried getting rid of features that do not contribute to your prediction in any way? Check out some feature selection mechanisms for R here:
https://dataaspirant.com/2018/01/15/feature-selection-techniques-r/
Assuming you need most or all of your features and now want to encode these categorical variables, one-hot encoding is a popular choice, but there are other encoding techniques out there too. One of my choices would be binary encoding. Other encoding techniques you can explore are covered here: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159 (a sparse one-hot sketch follows after this list)
xgboost also has a subsampling mechanism. Did you try training with a sample of your data? Check out the subsampling parameters of xgboost here: https://xgboost.readthedocs.io/en/latest/parameter.html
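If you go the one-hot route, a sparse encoding keeps the two high-cardinality title columns from blowing up memory. A minimal sketch that bypasses caret and uses xgboost's own cross-validation, assuming the data_train from the question and that paid is a two-level factor; the parameter values are illustrative only:
library(Matrix)
library(xgboost)
# sparse one-hot encoding: the factor columns become sparse indicator columns
# instead of a dense design matrix, keeping memory usage modest
X <- sparse.model.matrix(paid ~ . - 1, data = data_train)
y <- as.numeric(data_train$paid) - 1  # assumes 'paid' is a two-level factor -> 0/1
dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(objective = "binary:logistic", eval_metric = "auc",
               subsample = 0.8, colsample_bytree = 0.8)
# 5-fold cross-validation directly in xgboost, without caret's dense model matrix
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)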

Why is LSA in text2vec producing different results every time?

I was using latent semantic analysis in the text2vec package to generate word vectors, and using transform to fit new data, when I noticed something odd: the spaces were not lining up even when trained on the same data.
There appears to be some inconsistency (or randomness?) in the method. Namely, even when re-running an LSA model on the exact same data, the resulting word vectors are wildly different, despite identical input. When looking around I only found these old closed GitHub issues (link, link) and a mention in the changelog about LSA being cleaned up. I reproduced the behaviour using the movie_review dataset and (slightly modified) code from the documentation:
library(text2vec)
packageVersion("text2vec") # ‘0.5.1’
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# edit: make tcm symmetric:
tcm = tcm + Matrix::t(Matrix::triu(tcm))
n_topics = 10
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)
# despite being trained on the same data, words have completely different vectors:
sim2(d1["film",,drop=F], d2["film",,drop=F])
# yields values like -0.993363 but sometimes 0.9888435 (should be 1)
mean(diag(sim2(d1, d2)))
# e.g. -0.2316826
hist(diag(sim2(d1, d2)), main="self-similarity between models")
# note: these numbers are different every time!
# But: within each model, results seem consistent and reasonable:
# top similar words for "film":
head(sort(sim2(d1, d1["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873934 0.9803280 0.9732380 0.9680488 0.9668800
# same in the second model:
head(sort(sim2(d2, d2["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873935 0.9803279 0.9732364 0.9680495 0.9668819
# transform works:
sim2(d2["film",,drop=F], transform(tcm["film",,drop=F], lsa_2 )) # yields 1
# LSA in quanteda doesn't have this problem, same data => same vectors
library(quanteda)
d1q = textmodel_lsa(as.dfm(tcm), 10)
d2q = textmodel_lsa(as.dfm(tcm), 10)
mean(diag(sim2(d1q$docs, d2q$docs))) # yields 1
# the top synonyms for "film" are also a bit different with quanteda's LSA
# film movie hunk show territory bay
# 1.0000000 0.9770574 0.9675766 0.9642915 0.9577723 0.9573138
What's the deal, is it a bug, is this intended behaviour for some reason, or am I having a massive misunderstanding? (I'm kind of hoping for the latter...). If it's intended, why would quanteda behave differently?
The issue is that your matrix seems ill-conditioned and hence you have numerical stability issues.
library(text2vec)
library(magrittr)
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# condition number
kappa(tcm)
# Inf
Now if you do a truncated SVD (the algorithm behind LSA) you will notice that some of the singular vectors are very close to zero:
library(irlba)
truncated_svd = irlba(tcm, 10)
str(truncated_svd)
# $ d : num [1:10] 2139 1444 660 559 425 ...
# $ u : num [1:4387, 1:10] -1.44e-04 -1.62e-04 -7.77e-05 -8.44e-04 -8.99e-04 ...
# $ v : num [1:4387, 1:10] 6.98e-20 2.37e-20 4.09e-20 -4.73e-20 6.62e-20 ...
# $ iter : num 3
# $ mprod: num 50
Hence the sign of the embeddings is not stable, and the cosine angle between them is not stable either.
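Since much of the instability is a sign indeterminacy of the singular vectors, you can check this by flipping the columns of one embedding to match the other before comparing; a minimal sketch, assuming d1 and d2 from the question:
# align the sign of each column of d2 with the corresponding column of d1,
# then recompute the self-similarity between the two models
signs <- sign(colSums(d1 * d2))
d2_aligned <- sweep(d2, 2, signs, `*`)
mean(diag(sim2(d1, d2_aligned)))  # much closer to 1 if only the signs differed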
Similar to TruncatedSVD in Python's sklearn, the truncated SVD routine used in R has a random component built in. That is part of what makes it so powerful for large model building, but somewhat awkward for smaller uses. If you call set.seed() before the SVD is computed, you shouldn't have an issue. This used to terrify me when doing LSA.
Let me know if that helps!

R - LDA Topic Model Output Data

I'm working on building some topic models in R using the 'topicmodels' package. After pre-processing and creating a document-term matrix, I am applying the following LDA Gibbs model. This may have a simple answer but I'm a newbie to R, so here it goes. Is there a way I can export the topics and term lists, along with their probabilities, to a text file or Excel file? I can print them in R (as below), but don't know how to export :(
This is mainly so I can do some visualisation, which I'm sure can be done in Excel, but like I mentioned I'm a newbie and don't have much time to learn visualisation techniques in R. Hope this makes sense.
k = 33
burnin = 1000
iter = 1000
keep = 50
seed = 2003
model_lda <- LDA(myDtm, k = k, method = "Gibbs",control = list(seed = seed, burnin = burnin, iter = iter, keep = keep))
print(model_lda)
save(model_lda, file = "LDA_Output.RData")
topics(model_lda, 5)
terms(model_lda, 15)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
[1,] "seat" "dialogu" "websit" "census" "northern" "growth" "hse"
[2,] "resum" "church" "partnership" "disabl" "univers" "adjust" "legisl"
[3,] "suspend" "congreg" "nesc" "cso" "peac" "forecast" "die"
[4,] "adjourn" "school" "site" "statist" "unemploy" "bernard" "legal"
[5,] "fisheri" "survivor" "nesf" "survey" "polic" "burton" "child"
You can tidy the model with the tidytext R package and then write the results out with readr. For example:
library(tidytext)
readr::write_csv(tidy(model_lda, matrix = "beta"), "beta.csv")
readr::write_csv(tidy(model_lda, matrix = "gamma"), "gamma.csv")
The above code should save your beta matrix and gamma matrix in beta.csv and gamma.csv, respectively.
You can also find a chapter that was helpful for me here: http://tidytextmining.com/topicmodeling.html
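If you also want the top terms per topic with their probabilities in a single sheet, here is a minimal sketch using dplyr on the tidied beta matrix, assuming the model_lda object from above:
library(dplyr)
library(tidytext)
# top 15 terms per topic, with their beta probabilities, ready for Excel
top_terms <- tidy(model_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
readr::write_csv(top_terms, "top_terms.csv")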
