Calculate Cosine Similarity for a word2vec model in R

I'm working with the "word2vec" package in R and have run into a problem. I want to figure out which words are the closest synonyms to "uncertainty" and "economy", as in the paper by Azqueta-Gavaldon (2020), "Economic policy uncertainty in the euro area: An unsupervised machine learning approach". So I used the word2vec() function of the word2vec package to create my own word2vec model. With predict(object, ...) I can create a table that shows me the words closest to the words I'm interested in. The problem is that the similarity returned by this function is defined as sqrt(sum(x . y) / ncol(x)), which is not the cosine similarity.
I know that I can use the function cosine(x, y), but it only calculates the cosine similarity between two vectors and can't produce an output like the predict function described above.
Does anyone know how to determine the cosine similarity of each word in my word2vec model to every other word, and to output the most similar words to a given word based on these values?
This would really help me a lot and I am already grateful for your answers.
Kind regards,
Tom

The following GitHub gist explains how you can use cosine similarity with word2vec models in R:
https://gist.github.com/adamlauretig/d15381b562881563e97e1e922ee37920
You can apply this function to any matrix in R, and therefore to any word2vec model built in R.
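For reference, here is a minimal sketch along the same lines (my own illustration, not the gist verbatim). It assumes the embedding matrix can be extracted from the fitted model with as.matrix() and then ranks the whole vocabulary by cosine similarity to a chosen word; object names like txt and emb are placeholders.

library(word2vec)

# model <- word2vec(x = txt, dim = 50, iter = 20)   # 'txt' is your corpus (assumed)
# emb   <- as.matrix(model)                         # rows = words, columns = dimensions

nearest_by_cosine <- function(emb, word, top_n = 10) {
  v <- emb[word, ]
  sims <- as.vector(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))  # cosine to every row
  names(sims) <- rownames(emb)
  head(sort(sims[names(sims) != word], decreasing = TRUE), top_n)
}

# nearest_by_cosine(emb, "uncertainty")
# nearest_by_cosine(emb, "economy")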
Kind Regards,
Tom

Related

Is there a function to calculate the scatter matrix in the R language?

Recently I have been trying to use an optimizer to select features for clustering. I need a fitness function to tell the optimizer which feature set is better, so I am using the criteria described in "Introduction to Statistical Pattern Recognition", 2nd Ed., Chapter 10.2, by Keinosuke Fukunaga. The content is shown below.
I have found a function in Matlab, ScatterMatrices(), that calculates the value J, as shown below.
However, I couldn't find any function similar to ScatterMatrices() in R. I would appreciate it if you could help me 🙏.
The DiscriMiner package provides withinSS(), the within-class sum of squares matrix: "Calculates within-class sum of squares and cross product matrix (a.k.a. within-class scatter matrix)". The package is available in the CRAN archive (Index of /src/contrib/Archive/DiscriMiner); see "How do I install a package that has been archived from CRAN" for installation instructions.
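If installing the archived package is a problem, the within-class scatter matrix can also be computed directly in base R. Here is a minimal sketch (my own illustration, not DiscriMiner's implementation): it sums the class-wise cross-product matrices of the centered observations.

within_scatter <- function(X, groups) {
  X <- as.matrix(X)
  Reduce(`+`, lapply(split(as.data.frame(X), groups), function(Xg) {
    Xg <- scale(as.matrix(Xg), center = TRUE, scale = FALSE)  # center within the class
    crossprod(Xg)                                             # t(Xg) %*% Xg
  }))
}

# Example: within-class scatter matrix of the iris measurements by species
# Sw <- within_scatter(iris[, 1:4], iris$Species)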

Preventing underforecasting of support vector regression in R

I'm currently using the e1071 package in R to forecast product demand with support vector regression via the package's svm function. While support vector regression yields much higher forecast accuracy for my data than other methods (e.g. ARIMA, simple exponential smoothing), my results show that the svm function tends to underforecast. In my particular case, underforecasting is worse and much more expensive than overforecasting. Therefore, I want to implement something in R that tells support vector regression to penalize underforecasting much more than overforecasting.
Unfortunately, I can't really find any way to do this. There seems to be nothing on this in the e1071 package. The kernlab package has a support vector function (ksvm) that implements an 'eps-bsvr bound-constraint svm regression', but I can't find any information on what is meant by bound-constraint or how to define that bound.
Has anyone seen any examples of how to do this in R? I'm only finding very mathematical papers on asymmetric loss functions for support vector regression, and I don't have the skills to translate those into R code, so I'm looking for an already existing solution in R.
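For what it's worth, one thing that can be done without touching the optimizer itself is to score candidate models with an asymmetric error measure, so that tuning at least favours models that underforecast less. A minimal sketch, assuming e1071's eps-regression and a demand data set as described above (the weights, data frames and column names are placeholders):

library(e1071)

# fit  <- svm(demand ~ ., data = train, type = "eps-regression")  # assumed training data
# pred <- predict(fit, newdata = test)

asymmetric_loss <- function(actual, predicted, under_weight = 5, over_weight = 1) {
  err <- actual - predicted                                   # positive error = underforecast
  mean(ifelse(err > 0, under_weight * err, over_weight * -err))
}

# asymmetric_loss(test$demand, pred)   # compare models/tunings on this score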

Area under ROC in R

Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4]), method = "average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4]), method = "average"), 3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)
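For the binary case, a rank-based (Mann-Whitney) AUC can be written in a few lines of base R; note that it needs per-observation scores rather than a single confusion matrix. A minimal sketch (my own illustration, not necessarily the linked answer's exact method):

auc_binary <- function(scores, labels) {    # labels: logical, TRUE = positive class
  r  <- rank(scores)
  n1 <- sum(labels)
  n0 <- sum(!labels)
  (sum(r[labels]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# auc_binary(scores = c(0.9, 0.4, 0.7, 0.2), labels = c(TRUE, FALSE, TRUE, FALSE))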

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated to each other. Here comes the problem:
quanteda has the function textstat_simil, but it says "similarity". So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. Is the phi correlation (from the chi-squared statistic) more appropriate in this case? Can I calculate it via quanteda?
Do you have any resources, other than the GitHub source code, that explain in more detail the methods used to calculate the similarity or distance measures? (I couldn't understand them from this code, sorry.)
Thanks for your patience!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson's correlation would not be the most appropriate for binary data, and we currently do not implement Spearman's or other correlation methods more appropriate for categorical or ordinal data. However, you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman's.
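A short sketch of that suggestion (object names are placeholders; in recent quanteda versions textstat_simil() lives in the quanteda.textstats package):

library(quanteda)
# library(quanteda.textstats)   # needed for textstat_simil() in newer quanteda versions

# sims     <- textstat_simil(my_dfm, method = "correlation", margin = "features")
# m        <- as.matrix(my_dfm)                   # coerce the dfm to an ordinary matrix
# spearman <- cor(m, method = "spearman")         # feature-by-feature Spearman correlations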
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.

Smarter than an Eighth grader? Kaggle AI Challenge. R

I am working on the Allen AI Science Challenge currently up on Kaggle.
The idea behind the challenge is to train a model using the training data provided (a set of eighth-grade-level science questions, each with four answer options and the correct answer indicated) along with any additional knowledge sources (Wikipedia, science textbooks, etc.) so that it can answer science questions as well as an (average?) eighth grader can.
I'm thinking of taking a first crack at the problem in R (I'm proficient only in R and C++, and I don't think C++ will be a very useful language for this problem). After exploring the Kaggle forums, I decided to use the topicmodels, tm and RWeka packages together with Latent Dirichlet Allocation (LDA).
My current approach is to build a text predictor of some sort which, on reading the question posed to it, outputs a string of text; I then compute the cosine similarity between this output text and the four options given in the test set and predict the correct answer to be the one with the highest cosine similarity.
I will train the model using the training data and a Wikipedia corpus, along with a few science textbooks, so that the model does not overfit.
I have two questions here:
Does the overall approach make sense?
What would be a good starting point for building this text predictor? Will converting the corpus (training data, Wikipedia and textbooks) to a term-document/document-term matrix help? I think forming n-grams for all the sources would help, but I don't know what the next step would be, i.e. how exactly the model will predict and produce a string of text (of, say, size n) on reading a question.
I have tried implementing part of the approach: finding the optimum number of topics and performing LDA over the training set. Here's the code:
library(topicmodels)
library(RTextTools)

# Read the cleaned training set and make sure the text columns are characters.
data <- read.delim("cleanset.txt", header = TRUE)
data$question <- as.character(data$question)
data$answerA <- as.character(data$answerA)
data$answerB <- as.character(data$answerB)
data$answerC <- as.character(data$answerC)
data$answerD <- as.character(data$answerD)

# Build a term-frequency document-term matrix from the questions and the four answer options.
dtm <- create_matrix(cbind(as.vector(data$question), as.vector(data$answerA),
                           as.vector(data$answerB), as.vector(data$answerC),
                           as.vector(data$answerD)),
                     language = "english", removeNumbers = FALSE,
                     stemWords = TRUE, weighting = tm::weightTf)

# Fit LDA models for k = 2..25 topics and pick k by log-likelihood.
best.model <- lapply(seq(2, 25, by = 1), function(k) LDA(dtm, k))
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics = 2:25,
                                   LL = as.numeric(as.matrix(best.model.logLik)))
best.k <- best.model.logLik.df$topics[which.max(best.model.logLik.df$LL)]

# Refit with the selected number of topics (the original hard-coded 25 topics here).
best.model.lda <- LDA(dtm, best.k)
Any help will be appreciated!
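For the cosine-similarity step described above, here is a minimal sketch (my own illustration, with hypothetical inputs): build term-frequency vectors for the generated text and the four options with the tm package and pick the option with the highest cosine similarity.

library(tm)

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

score_options <- function(generated, options) {
  corpus <- VCorpus(VectorSource(c(generated, options)))
  dtm <- as.matrix(DocumentTermMatrix(corpus))   # row 1 = generated text, rows 2..5 = options
  sapply(seq_along(options) + 1, function(i) cosine_sim(dtm[1, ], dtm[i, ]))
}

# sims <- score_options(predicted_text, c(answerA, answerB, answerC, answerD))
# which.max(sims)   # index of the predicted correct option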
