What are the pre-processing requirements for cosine similarity?

The input to cosine similarity is two vectors representing the two pieces of data I want to compare. Is there any requirement on the semantics of the vectors? Can a vector simply be the byte representation of each file, or the frequency of each byte value? Does that make sense? Or should the vectorization of the file produce dimensions that are not raw pieces of data from the file but derived features, such as the frequency of each term (for text files) or a tf-idf encoding? To put it another way: for cosine similarity to be "correct", does it require a complex pre-processing step, or can I give it integer values representing each byte of my data, or just the frequency of each byte, without text in mind?

The "semantics" of the data is critical. For example, say you are comparing English text documents. For large documents, the frequency of occurence of the various letters will be roughly the same, so if the elements of your vector represent the counts of letters, you will have trouble distinguishing documents. If the elements of your vector represent the counts of words, you will get better results. If the elements of your vector represent the counts of "stemmed" words, even better. Etc.
Cosine similarity is a "dumb" statistical measure - it is up to you to give it something meaningful to compare.
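As a rough illustration (a small numpy sketch, not part of the original answer, with two invented example strings): the cosine formula itself never changes; only the features you choose to count do.
from collections import Counter
import numpy as np

def cosine(counts_a, counts_b):
    # Cosine similarity between two sparse count dictionaries.
    keys = sorted(set(counts_a) | set(counts_b))
    a = np.array([counts_a.get(k, 0) for k in keys], dtype=float)
    b = np.array([counts_b.get(k, 0) for k in keys], dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = "the cat sat on the mat"
doc2 = "the dog sat on the log"

# Letter counts: English letter frequencies are similar across texts,
# so the score tends to be high regardless of content.
print(cosine(Counter(doc1.replace(" ", "")), Counter(doc2.replace(" ", ""))))

# Word counts: the score now reflects which words the two texts actually share.
print(cosine(Counter(doc1.split()), Counter(doc2.split())))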

Related

How to define the difference between the similarity (e.g.: measured with cosine) of entities from two given sets?

I have two groups of entities: a "valid" group, which is expected to contain similar entities, and a "random" group, which is expected to contain less similar entities compared to the valid group. I have a long list of similarity/distance measures, like cosine, Canberra, Manhattan, etc., that are all computed within the groups to produce an average similarity score for the entities in each. My question is: how should the difference between the scores of the two groups be defined? I know comparing results from different metrics is not advised, but this problem mainly stems from the fact that the entities have two different descriptions (a vector-based one and a knowledge-graph-based one) that require different similarity metrics, and my aim is to compare the two types of description through the difference of average similarities between the valid and random groups of entities.
My first intuition was to simply take
similarity_gain = (valid_score - random_score) / np.abs(random_score)
However, this largely ignores the scale of the metrics: e.g. the dot-product measure resulted in huge gains due to the large differences in the vectors, while other measures with logarithmic-like scales showed little difference. Would a simple normalization to [0, 1] be a proper solution? Or is there another way to represent the gain across the different metrics for a proper comparison?
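One possible reading of the normalization idea, sketched in Python (the score arrays are hypothetical raw per-pair similarities for a single metric, and min-max scaling is only one option):
import numpy as np

def normalized_gain(valid_scores, random_scores):
    # Min-max scale the pooled raw scores of one metric to [0, 1], then take the
    # difference of the group means so that gains are comparable across metrics.
    pooled = np.concatenate([valid_scores, random_scores])
    scaled = (pooled - pooled.min()) / (pooled.max() - pooled.min())
    return scaled[:len(valid_scores)].mean() - scaled[len(valid_scores):].mean()

# Hypothetical raw scores for one metric:
valid = np.array([0.82, 0.75, 0.91])
random_ = np.array([0.40, 0.55, 0.31])
print(normalized_gain(valid, random_))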

Scoring texts based on dictionaries other than positive / negative in Quanteda?

I would like to be able to give the texts in a very large corpus a score based on the number of words they contain from either 'Dictionary 1' (positive score), or 'Dictionary 2' (negative score). Can anybody help me to figure out how to do this?
I have read up on the use of Quanteda for sentiment analysis https://tutorials.quanteda.io/advanced-operations/targeted-dictionary-analysis/ and it seems like that is essentially just scoring texts based on whether they contain words from lists of 'positive' and 'negative' words.
I would like to use Quanteda for this, as I am familiar with the package.
How can I make a modified dictionary which doesn't contain two lists of positive and negative words, but two lists of other kinds of words, e.g. 'right wing' and 'left wing'? I already know the words I wish to include in each list.
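For reference, the underlying computation is just counting hits against the two custom word lists; here is a rough Python sketch of that scoring logic (the word lists are invented placeholders, and this is not Quanteda's API):
right_wing = {"deregulation", "tradition", "tariff"}      # placeholder word list
left_wing = {"redistribution", "collective", "welfare"}   # placeholder word list

def score(text):
    # +1 for every token from the first list, -1 for every token from the second.
    tokens = text.lower().split()
    return sum(t in right_wing for t in tokens) - sum(t in left_wing for t in tokens)

print(score("The party promised deregulation and lower tariff barriers"))   # 2
print(score("They campaigned on welfare and collective bargaining"))        # -2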

How can a sentence or a document be converted to a vector?

We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
1) Skip gram method: paper here and the tool that uses it, google word2vec
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do that. One can just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and then form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
which vector model you're using
what is the purpose of the model
your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
Wiki2Vec: https://github.com/idio/wiki2vec
Or you can do what some people do, i.e. sum the vectors of all content words in the document and divide by the number of content words, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: lines 17-18 there are a hack to reduce noise):
import numpy as np

def sent_vectorizer(sent, model):
    # Sum the vectors of every word the model knows (400-dimensional vectors here),
    # skipping out-of-vocabulary words, then L2-normalise the result.
    sent_vec = np.zeros(400)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:
            pass   # word not in the model's vocabulary
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells, comparison here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence and train it to produce the output you want. The activations of the network after being fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
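A minimal sketch of that idea, assuming PyTorch and an invented two-class task (the answer above does not name a framework):
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    # Feed the sentence in as word indices, train on the task you actually care about,
    # and read the final hidden state off as the sentence representation.
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # could be initialised from GloVe/word2vec
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (1, batch, hidden_dim)
        sentence_vec = h_n[-1]                     # the sentence representation
        return self.classify(sentence_vec), sentence_vec

model = SentenceEncoder(vocab_size=10000)
logits, vec = model(torch.randint(0, 10000, (4, 12)))   # a batch of 4 sentences, 12 tokens each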
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is usually used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average vector of all the words in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
We can take the normalised average of the word vectors as the sentence's vector representation.
The problem is the compositional nature of sentences. If you take the average of the word vectors as above, these two sentences have the same vector representation:
vegetables eat vegetarians .
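A tiny sketch makes the problem visible (the 2-d word vectors below are invented purely for illustration): averaging is insensitive to word order.
import numpy as np

# Toy word vectors, invented only for illustration.
vec = {
    "vegetarians": np.array([1.0, 0.0]),
    "eat":         np.array([0.0, 1.0]),
    "vegetables":  np.array([1.0, 1.0]),
}

def average_vector(sentence):
    return np.mean([vec[w] for w in sentence.split()], axis=0)

s1 = average_vector("vegetarians eat vegetables")
s2 = average_vector("vegetables eat vegetarians")
print(np.allclose(s1, s2))   # True: the averaged representation loses word order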
There is a lot of research in the distributional fashion on learning tree structures through corpus processing, for example Parsing With Compositional Vector Grammars. This video also explains the method.
Again, I want to emphasise interpretation. These sentence vectors probably have their own meanings in your application. For instance, in the sentiment analysis project at Stanford, the meaning being sought is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical debates that these are not the actual meanings of sentences if you cannot judge the truth condition (David Lewis, "General Semantics", 1970). That's why there are lines of work focusing on computer vision (this paper or this paper). My point is that it depends entirely on your application and your interpretation of the vectors.
Hope you welcome an implementation. I faced a similar problem when converting movie plots for analysis; after trying many other solutions, I stuck with an implementation that made my job easier. The code snippet is attached below.
Install 'spaCy' from the following link.
import spacy
nlp = spacy.load('en')        # English model with word vectors (newer spaCy releases use full names such as 'en_core_web_md')
doc = nlp(YOUR_DOC_HERE)      # YOUR_DOC_HERE is the text of your document
vec = doc.vector              # the document vector: the average of the token vectors
Hope this helps.

Markov Algorithm for Random Writing

I have a little problem conceptually understanding the structure of a random-writing program that takes input in the form of a text file and uses a Markov algorithm to create somewhat sensible output.
The data structure I am using is based on cases ranging from 0-10. At case 0, I count the number of times each letter, symbol, or digit appears and base my new text on this to simulate the input. I have already implemented this using a Map that holds each unique letter in the input text and an array recording how many times it occurs, so I can simply ask for the size of the array for a specific letter and easily create output text this way.
But now I need to create case 1/2/3 and so on... case 1 also holds which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?
There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array where each index is the following character in the chain and the stored value is the count.
// Two-character (bigram) sample:
int[][] counts = new int[26][26];
// ... all entries start at zero
// 'a' => 0, 'b' => 1, ... 'z' => 25
// For example, for the string "apple"
// (written out explicitly to show the result; in practice this would be a loop or function):
counts['a' - 'a']['p' - 'a']++;   // a -> p
counts['p' - 'a']['p' - 'a']++;   // p -> p
counts['p' - 'a']['l' - 'a']++;   // p -> l
counts['l' - 'a']['e' - 'a']++;   // l -> e
Then to randomly generate text you would count the total number of outcomes for a given character (e.g. 2 outcomes for 'p' in the previous example) and pick a weighted random number for one of the possible outcomes.
For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using A-Z) there are 26^N entries for an N-length chain.
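A small sketch of the weighted generation step described above, in Python rather than the array notation of the snippet (counts is the 26x26 table built as shown):
import random
import string

def next_char(counts, current):
    # Pick the next character, weighted by how often it followed `current` in the training text.
    row = counts[ord(current) - ord('a')]
    if sum(row) == 0:
        return random.choice(string.ascii_lowercase)   # character never seen: fall back to uniform
    return random.choices(string.ascii_lowercase, weights=row, k=1)[0]

# Build the counts for 'apple' and sample what follows 'p':
counts = [[0] * 26 for _ in range(26)]
for a, b in zip("apple", "apple"[1:]):
    counts[ord(a) - ord('a')][ord(b) - ord('a')] += 1
print(next_char(counts, 'p'))   # 'p' or 'l', each with probability 1/2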
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know this has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, in the way I am using it, or whether I should change my approach. Roughly speaking, the system is supposed to split all the sentences of an article and find similar sentences among the other articles that are fed to the system.
I am using cosine similarity with tf-idf weights, and here is how I did it.
1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them (should I?).
2- I compute the tf-idf weights of trigrams and create vectors for all sentences.
3- I calculate the dot product and the magnitudes of the original sentence vector and of the sentence vector to be compared, then calculate the cosine similarity.
However, the system does not work as I expected. Here are the questions I have in mind.
As far as I have read about tf-idf weights, I guess they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables in the tf and idf definitions (instead of documents, I tried to come up with a sentence-based definition):
tf = number of occurrences of trigram in sentence / number of all trigrams in sentence
idf = number of all sentences in all articles / number of sentences where trigram appears
Do you think it is ok to use such a definition for this problem?
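For what it's worth, here is one way those sentence-level definitions could look in code (a rough sketch that assumes word trigrams and skips edge cases such as very short sentences; the standard definition usually also wraps the idf ratio in a logarithm):
from collections import Counter

def trigrams(sentence):
    # Word trigrams; substitute character trigrams if that is what you extract.
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tf(trigram, sentence_trigrams):
    # occurrences of the trigram in the sentence / number of all trigrams in the sentence
    return Counter(sentence_trigrams)[trigram] / len(sentence_trigrams)

def idf(trigram, all_sentences_trigrams):
    # number of all sentences in all articles / number of sentences where the trigram appears
    containing = sum(1 for s in all_sentences_trigrams if trigram in s)
    return len(all_sentences_trigrams) / containing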
Another thing is that I have seen normalization mentioned many times in connection with calculating cosine similarity. I am guessing this is important because the trigram vectors might not be the same size (which they rarely are in my case). If one trigram vector has size x and the other x+1, I treat the first vector as if it were of size x+1 with the last value being 0. Is this what is meant by normalization? If not, how do I do the normalization?
Besides this, if I have chosen the wrong algorithm, what else can be used for such a problem (preferably with an n-gram approach)?
Thank you in advance.
I am not sure why you are sorting the trigrams for every sentence. All you need to care about when computing cosine similarity is whether the same trigram occurred in the two sentences or not, and with what frequencies. Conceptually speaking, you define a fixed, common order over all possible trigrams. Remember the order has to be the same for all sentences. If the number of possible trigrams is N, then for each sentence you obtain a vector of dimensionality N. If a certain trigram does not occur, you set the corresponding value in the vector to zero. You don't really need to store the zeros, but you have to take care of them when you define the dot product.
Having said that, trigrams are not a good choice, as the chances of a match are a lot sparser. For high k you will have better results from bags of k consecutive words rather than from k-grams. Note that the ordering does not matter inside a bag; it's a set. You are using k=3, which seems to be on the high side, especially for sentences. Either drop down to bigrams or use bags of different lengths, starting from 1. Preferably use both.
I am sure you have noticed that sentences which do not share an exact trigram have 0 similarity in your method. Bags of k words will alleviate the situation somewhat but not solve it completely, because you still need sentences to share actual words, and two sentences may be similar without using the same words. There are a couple of ways to fix this: either use LSI (Latent Semantic Indexing) or clustering of the words, and use the cluster labels to define your cosine similarity.
In order to compute the cosine similarity between vectors x and y you compute the dot product and divide by the norms of x and y.
The 2-norm of a vector x can be computed as the square root of the sum of its squared components. However, you should also try your algorithm without any normalization, for comparison. Usually it works fine, because you are already taking care of the relative sizes of the sentences when you compute the term frequencies (tf).
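A compact sketch of both points above, with made-up trigram counts: sparse per-sentence counts where absent trigrams are implicit zeros, and cosine computed as the dot product divided by the product of the 2-norms.
import math
from collections import Counter

def cosine(counts_a, counts_b):
    # Only trigrams present in both sentences contribute to the dot product;
    # everything else is an implicit zero, so it never needs to be stored.
    dot = sum(c * counts_b[t] for t, c in counts_a.items() if t in counts_b)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

a = Counter([("the", "cat", "sat"), ("cat", "sat", "down")])
b = Counter([("the", "cat", "sat"), ("cat", "sat", "still")])
print(cosine(a, b))   # 0.5: one shared trigram out of two in each sentence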
Hope this helps.
