Solving Huffman Code Tree - math

I was hoping for a little help with my discrete maths problem. Is there any way to shorten the binary tree, or do I have to construct it for the whole data set below?
Construct a Huffman code for the letters of the English alphabet
where the frequencies of letters in typical English
text are as shown in this table.

For a true Huffman code, and to satisfy the requirements of the exercise, you do need a binary tree. There are alternative ways of doing it, some of which are discussed at https://en.wikipedia.org/wiki/Huffman_coding.
As a shortcut, you could put all the low-frequency letters in the same bin and use a simple fixed-length scheme for the 8 lowest-frequency letters: B=000, J=001, K=010, P=011, Q=100, V=101, X=110, Z=111. Build the Huffman code with BJKPQVXZ treated as a single character, then append these 3-bit codes to the codeword of that bin.
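The binning idea above can be sketched in Python roughly as follows. This is only an illustration: the weights in `demo_freqs` are made-up placeholders (the exercise's table is not reproduced here), and the names `huffman_code`, `binned_code` and the `"<rare>"` marker are hypothetical helpers, not part of any library.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bitstring} for a {symbol: weight} mapping."""
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

def binned_code(freqs, rare="BJKPQVXZ"):
    """Huffman-code the common letters plus one bin; 3-bit suffix inside the bin."""
    reduced = {s: w for s, w in freqs.items() if s not in rare}
    reduced["<rare>"] = sum(freqs[ch] for ch in rare)
    codes = huffman_code(reduced)
    prefix = codes.pop("<rare>")
    for i, ch in enumerate(rare):
        codes[ch] = prefix + format(i, "03b")  # B=000, J=001, ..., Z=111
    return codes

# Placeholder weights for illustration only, NOT the table from the exercise.
demo_freqs = {"E": 12.0, "T": 9.0, "A": 8.0, "O": 7.5, "N": 7.0,
              "B": 1.5, "J": 0.2, "K": 0.8, "P": 1.9, "Q": 0.1,
              "V": 1.0, "X": 0.2, "Z": 0.1}

if __name__ == "__main__":
    for symbol, bits in sorted(binned_code(demo_freqs).items()):
        print(symbol, bits)
```

The result is still a prefix code, but it is generally not an optimal (true) Huffman code, so for the exercise itself you would call `huffman_code` on the full 26-letter table.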

Related

Is there any tool to determine a pattern from several sequences of numbers and characters, i.e. to find out how they were calculated?

I have several alphanumeric sequences generated by some system and want to find the pattern so I could add new ones. They consist of 11 numbers and letters. Similarly I'd like to find out how to deconstruct only numerical data like that. Is there any tool, formula or algorithm for that online?

Checking for similarity of text in two text strings

I have two strings of text (typically two paragraphs). I am looking to check for the "similarity" between them, e.g. check if one paragraph is a plagiarised version of the other. Ideally I need a similarity score, as well as an indication of where the similarities are. I prefer to do this fully in R. Any suggestions please?
The difference between strings can be measured with the Levenshtein distance (or concepts that build on it). The main idea is to quantify the "edit distance" between strings: how many letters need to be inserted, deleted or changed (which types of edit are allowed depends on the algorithm). An R package for this task is fuzzyjoin.
To locate the similarities, you could split both texts (the original and the suspected plagiarism) into sentences and build the fuzzy joins on those; then you can filter for the best matches. The topic is a bit tricky, so I recommend trying out different measures (Jaccard distance, Damerau-Levenshtein, etc.). An introduction to the topic can be found here: https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html
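To make the idea concrete, here is a minimal sketch of the Levenshtein distance and a sentence-by-sentence best match. It is written in plain Python purely to illustrate the technique; it is not the fuzzyjoin API, and the helper names `levenshtein`, `similarity` and `best_matches` are assumptions made for this example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Scale the distance to a score in [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def best_matches(original_sentences, suspect_sentences):
    """For each suspect sentence, return (suspect, closest original, score)."""
    results = []
    for s in suspect_sentences:
        score, match = max((similarity(s, o), o) for o in original_sentences)
        results.append((s, match, score))
    return results
```

The per-sentence scores show where the overlap is, and their average can serve as a rough overall similarity score.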

Google Ngram Viewer - English One Million

I'm training a language model in PyTorch and I need the one million most common English words to serve as a dictionary.
From what I've understood, the Google Ngram English One Million (1-grams) dataset might suit this task, but after downloading every part (0-9) of it and running tail on each part to check that it was what I expected, I found that no part contains words beyond the letter F.
As far as I understand, every Version 1 file has its n-grams sorted alphabetically and chronologically, so I'm wondering whether it is possible that the one million most common words do not go beyond F.
Or am I missing the point of this dataset, and it isn't the one million most common words?
Try shuf <file> to get a random ordering and you will see that the data covers all letters. What you see at the end of the files is not an f but the ligature ﬂ (U+FB02), which sorts after z in code-point order.
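The effect is easy to reproduce. The sketch below uses a made-up word list, not the Ngram data, and the helper name `fold_ligatures` is hypothetical; it shows that a word starting with the ligature sorts last and that NFKC normalisation folds it back to plain letters.

```python
import unicodedata

# "\ufb02" is the single character LATIN SMALL LIGATURE FL.
words = ["zebra", "apple", "\ufb02oor", "floor"]

# Plain sorting compares code points, so the ligature word lands after "zebra".
by_code_point = sorted(words)

def fold_ligatures(word):
    """Replace compatibility characters such as the fl ligature with plain letters."""
    return unicodedata.normalize("NFKC", word)

folded = sorted(fold_ligatures(w) for w in words)

if __name__ == "__main__":
    print(by_code_point)
    print(folded)
```

Normalising tokens like this before building the dictionary avoids keeping separate entries for the ligature spellings.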

How to replace English abbreviated forms with their dictionary forms

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the maxent tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task: is there some similar work or a whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: take the words that do not exist in the dictionary and match them against existing dictionary words by edit distance.
II) The key feature of many of your examples is that they are only 1 character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, I suggest adding each English letter to the front or back and looking up the resulting word in a dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table together with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus. For the words that the language model treats as unknown (meaning it has not seen them in the training phase), check which word is most probable according to the language model. Most likely the correctly spelled word will be among the language model's top-10 predictions.
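Idea II could look roughly like this. It is a sketch under stated assumptions: `vocabulary` stands for whatever dictionary word set you load, and `expand_by_one_letter` and `normalise` are hypothetical helper names. With a real dictionary many tokens will have several candidates (for example "re" could become "are" or "red"), which is where the language model from idea III would help to choose.

```python
import string

def expand_by_one_letter(token, vocabulary):
    """Dictionary words reachable by adding one letter to the front or back."""
    core = token.lower().strip("'")  # drop the apostrophes of 're, sayin', ...
    candidates = set()
    for ch in string.ascii_lowercase:
        for candidate in (ch + core, core + ch):
            if candidate in vocabulary:
                candidates.add(candidate)
    return sorted(candidates)

def normalise(token, vocabulary, lookup):
    """Return the standard form of token, caching every decision in lookup."""
    if token in vocabulary:
        return token
    if token not in lookup:
        candidates = expand_by_one_letter(token, vocabulary)
        # Only accept an unambiguous expansion; otherwise keep the token as is.
        lookup[token] = candidates[0] if len(candidates) == 1 else token
    return lookup[token]
```

A possible usage, with a tiny stand-in vocabulary:

```python
vocabulary = {"are", "is", "having", "saying"}
lookup = {}
print([normalise(t, vocabulary, lookup) for t in ["'re", "'s", "havin", "sayin'"]])
```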

How can a sentence or a document be converted to a vector?

We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
1) Skip gram method: paper here and the tool that uses it, google word2vec
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do that. One can just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and then form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
which vector model you're using
what is the purpose of the model
your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
Wiki2Vec: https://github.com/idio/wiki2vec
Or you can do what some people do, i.e. sum the vectors of all content words in the document and divide by the number of content words (the snippet below scales the sum by its L2 norm instead), e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: line 17-18 is a hack to reduce noise):
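import numpy as np

def sent_vectorizer(sent, model):
    # Sum the vectors of all in-vocabulary words, then scale to unit length.
    sent_vec = np.zeros(400)  # 400 = dimensionality of the word vectors used here
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:
            # Skip words that are missing from the model.
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))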
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells, comparison here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence and train it to produce the output you want. The activations of the network after it has been fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is usually used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average vector of all words in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
We can take the normalised vector as vector representation:
The problem lies in the compositional nature of sentences. If you average the word vectors as above, the sentence above and the following one have the same vector representation:
vegetables eat vegetarians .
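A small sketch makes this concrete. The 3-dimensional vectors below are invented toy values, not output from word2vec, and `average_vector` is a hypothetical helper; the point is only that averaging ignores word order.

```python
# Invented toy vectors for illustration; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "vegetarians": [1.0, 0.25, 0.0],
    "eat":         [0.25, 1.0, 0.5],
    "vegetables":  [0.75, 0.0, 0.5],
    ".":           [0.0, 0.0, 0.25],
}

def average_vector(tokens, vectors):
    """Component-wise mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

first = average_vector("vegetarians eat vegetables .".split(), toy_vectors)
second = average_vector("vegetables eat vegetarians .".split(), toy_vectors)

if __name__ == "__main__":
    print(first)
    print(second)
```

Both calls sum the same four vectors, so the two sentences become indistinguishable even though their meanings differ.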
There is a lot of research in the distributional tradition on learning tree structures through corpus processing. For example: Parsing With Compositional Vector Grammars. This video also explains the method.
Again, I want to emphasise interpretation. These sentence vectors probably have their own meanings in your application. For instance, in sentiment analysis in this project at Stanford, the meaning being sought is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical arguments that these are not the actual meanings of sentences if you cannot judge their truth conditions (David Lewis "General Semantics" 1970). That's why there are lines of work focusing on computer vision (this paper or this paper). My point is that it can depend completely on your application and on how you interpret the vectors.
I hope an implementation is welcome. I faced a similar problem when converting movie plots for analysis; after trying many other solutions, I stuck with an implementation that made my job easier. The code snippet is attached below.
Install 'spaCy' from the following link.
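import spacy
# The 'en' shortcut works in spaCy 2.x; spaCy 3.x removed shortcuts, so pass a full
# pipeline name there (for example 'en_core_web_md', which ships with word vectors).
nlp = spacy.load('en')
doc = nlp(YOUR_DOC_HERE)  # YOUR_DOC_HERE is a placeholder for your text string
vec = doc.vector  # by default, the average of the token vectors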
Hope this helps.
