How to concatenate the word embedding for special tokens and words in HuggingFace? - bert-language-model

I tried to add an extra dimension to the Huggingface pre-trained BERT tokenizer. The extra column represents the extra label. For example, if the original embedding of the word “dog” was [1,1,1,1,1,1,1], then I might add a special column with index 2 to represent ‘noun’. Thus, the new embedding becomes [1,1,1,1,1,1,1,2]. Then, I will feed the new input [1,1,1,1,1,1,1,2] into the Bert model. How can I do this in Huggingface?
There is something called tokenizer.add_special_tokens which extends the original vocabulary with new tokens. However, I want to concatenate the embedding of the original vocabulary with the embedding of the tokenizer. For example, I want the Bert model to understand that Dog is a noun by connecting the embedding of dog to the embedding of noun. Should I even change the input word embedding of a pre-trained model? Or should I somehow enhance the attention on “dog” and “noun” in the middle layer?
Here is an example of using tokenizer.add_special_tokens:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
special_tokens_dict = {'cls_token': ''}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.cls_token == ''

I found a solution here: https://discuss.huggingface.co/t/how-to-use-additional-input-features-for-ner/4364
Basically, you need to modify the BertEmbeddings module in such a way that the embedding of the POS tag is added to the embedding of the word.
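To make the idea concrete, here is a minimal, dependency-free sketch of what the modified embedding lookup does. The vocabularies, dimensions, and random tables below are all hypothetical toy stand-ins; real code would subclass BertEmbeddings and use the tokenizer's actual ids and trained weight matrices.

```python
import random

random.seed(0)
DIM = 8  # toy embedding size; BERT-base uses 768

# Hypothetical toy vocabularies; real code would use the BERT tokenizer's ids.
word_vocab = {"dog": 0, "barks": 1}
pos_vocab = {"NOUN": 0, "VERB": 1}

def make_table(n, dim):
    """Build a random embedding table as a list of n vectors of length dim."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

word_emb = make_table(len(word_vocab), DIM)
pos_emb = make_table(len(pos_vocab), DIM)

def embed(word, pos):
    """Sum the word embedding and the POS-tag embedding, mirroring how
    BertEmbeddings already sums token, position and segment embeddings."""
    w = word_emb[word_vocab[word]]
    p = pos_emb[pos_vocab[pos]]
    return [wi + pi for wi, pi in zip(w, p)]

vec = embed("dog", "NOUN")
```

Summing (rather than concatenating) keeps the hidden size unchanged, which is why it fits into a pre-trained model without resizing every downstream layer; the POS embedding table is the only newly trained parameter.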

Related

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addressess and telephone numbers in several languages.
These will usually be preceded with a word "Address", "telephone number", "name", "company", "hospital", "deliverer". I will have a dictionary of these words.
I am wondering if text mining tools would be perfect for the job.
I would like to create a corpus for all these documents and then find text that meets specific criteria (I am thinking regex criteria) to the right of or below the given dictionary entry.
Is there such a syntax in text mining packages in R, i.e., to get the strings to the right of or below the wordlist entry that match a specific pattern?
If not, what would be a more suitable tool in R for the job?
Two options with quanteda come to mind:
Use kwic with your list of target patterns, with a window big enough to capture the amount of text after the term that you want. This will return a data.frame whose keyword and post columns you can use for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then focus on the new post docvar, which will contain the text you want.
Use corpus_segment where you use the target word list to create a "tag" type, and anything following this tag, until the next tag, will be reshaped into a new document. This works well but is a bit trickier to configure, since you will need to get the regex right for the tag.

keras embedding vector back to one-hot

I am using keras on an NLP problem. A question about word embeddings comes up when I try to predict the next word from the previous words. I have already turned the one-hot words into word vectors via the keras Embedding layer like this:
word_vector = Embedding(input_dim=2000, output_dim=100)(word_one_hot)
I use this word_vector to do something, and the model produces another word_vector at the end. But I need to see what the predicted word actually is. How can I turn the word_vector back into word_one_hot?
This question is old but seems to be linked to a common point of confusion about what embeddings are and what purpose they serve.
First off, you should never convert to one-hot if you're going to embed afterwards. This is just a wasted step.
Starting with your raw data, you need to tokenize it. This is simply the process of assigning a unique integer to each element in your vocabulary (the set of all possible words/characters [your choice] in your data). Keras has convenience functions for this:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
max_words = 100  # just an example: the number of most frequently
# occurring words in your data set that you want to use in your model.
tokenizer = Tokenizer(num_words=max_words)
# This builds the word index
tokenizer.fit_on_texts(df['column'])
# This turns strings into lists of integer indices.
train_sequences = tokenizer.texts_to_sequences(df['column'])
# This is how you can recover the word index that was computed
print(tokenizer.word_index)
Embeddings generate a representation. Later layers in your model use earlier representations to generate more abstract representations. The final representation is used to generate a probability distribution over the number of possible classes (assuming classification).
When your model makes a prediction, it provides a probability estimate for each of the integers in the word_index. So, if 'cat' is the most likely next word and your word_index has something like {cat: 666}, ideally the model would assign a high likelihood to 666 (not to 'cat'). Does this make sense? The model never predicts an embedding vector; the embedding vectors are intermediate representations of the input data that are (hopefully) useful for predicting the integer associated with a word/character/class.
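The recovery step the answer describes is just an argmax over the output distribution followed by a reverse lookup in the word index. A minimal sketch, using a hypothetical hand-made word_index in place of one produced by a fitted Tokenizer:

```python
# Hypothetical word index, as tokenizer.word_index would provide.
word_index = {"the": 1, "cat": 2, "sat": 3}
index_word = {i: w for w, i in word_index.items()}  # invert the mapping

# Suppose the model's softmax output over the vocabulary looks like this
# (index 0 is reserved for padding in Keras, so it gets probability 0).
probs = [0.0, 0.1, 0.7, 0.2]

predicted_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
predicted_word = index_word[predicted_index]
```

With a real model you would apply the same inversion to the index returned by the output layer's argmax; no inverse of the Embedding layer is ever needed.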

Stopwords eliminating and vector making

In text2vec, the only function I could find relating to stopwords is create_vocabulary. But in a text mining task, we usually need to eliminate the stopwords from the source documents before building the corpus or doing further processing. How do we use stopwords when building the corpus, DTM and TCM with text2vec?
I have used tm for text mining before. It has a function for analyzing PDF documents, but it reads one paper as several vectors (one line, one vector), not each document as a single vector as I expect. Furthermore, the format conversion functions in tm have encoding (garbled character) problems with Chinese. If I use text2vec to read documents, can it read one paper into a single vector? (i.e., is a vector element large enough for a paper published in a journal?) Also, are the corpus and vectors built in text2vec compatible with those built in tm?
There are two ways to create document-term matrix:
Using feature hashing
Using vocabulary
See text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary - the set of words/ngrams which will be used in all downstream tasks. create_vocabulary creates a vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all observed words in the corpus. As you can see, you need to provide the stopwords only once; all the downstream tasks will work with the vocabulary.
On the second question:
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need is to read the full articles with some function and reshape them into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.

How to replace english abbreviated form to their dictionary form

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to accomplish: is there similar work, or a whole project, that I could use?
Ideas:
I) Use string edit distance on a subset of your text: try to match words that do not exist in the dictionary against existing dictionary words by edit distance.
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for the words you fail to match against a dictionary entry, try adding each English character to the front or back and look up the resulting word in the dictionary. This is expensive at first, but if you keep track of the misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g., newspaper articles), then run it over your entire corpus. For the words that the language model considers unknown (i.e., it did not see them during training), check which word is most probable according to the model. Most likely the language model's top-10 predictions will include the correctly spelled word.
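Ideas II and III can be combined into one normalization pass: try the one-character expansion first, and fall back to a bigram language model when that fails. The sketch below uses a tiny hypothetical dictionary and training text purely for illustration; a real system would use a full dictionary and a large clean corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for a real dictionary and clean corpus.
dictionary = {"are", "is", "having", "saying"}
clean_text = "we are going home . they are saying hello .".split()

def expand(word):
    """Idea II: add one character to the front or back and look it up."""
    for ch in "abcdefghijklmnopqrstuvwxyz":
        for candidate in (ch + word, word + ch):
            if candidate in dictionary:
                return candidate
    return None

# Idea III: a word-level bigram model trained on the clean text.
bigrams = defaultdict(Counter)
for prev, nxt in zip(clean_text, clean_text[1:]):
    bigrams[prev][nxt] += 1

def normalize(prev, word):
    """Return a standard form for `word`, given the previous word."""
    if word in dictionary:
        return word                 # already a standard form
    fixed = expand(word)            # idea II: one-character expansion
    if fixed is not None:
        return fixed
    if bigrams[prev]:               # idea III: back off to the bigram model
        return bigrams[prev].most_common(1)[0][0]
    return word                     # give up; keep the original form
```

For example, normalize("they", "re") finds "are" via expansion, while normalize("they", "r") has no one-character fix and falls back to the most likely word after "they" in the bigram model. Caching the (misspelling, correction) pairs, as idea II suggests, would make repeated lookups cheap.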

Distinguish word count by document number in mapper - Hadoop?

I'm writing a mapper function in R (using Rhipe for map-reduce). The mapper function is supposed to read the text file and create a corpus. Now, R already has a package called tm which does text mining and creates a DocumentTermMatrix. If you want to know more about `tm`, have a look here.
But the problem with using this package in map-reduce is that the matrix is converted to a list, and it is difficult to reassemble a matrix in the reduce step from this jumbled-up list. I found an algorithm for creating a corpus using map-reduce on this website, but I'm slightly confused as to how I could find the name or some unique identification of the document a mapper is processing.
For the 196 MB text file that I have, Hadoop spawned 4 mappers (block size = 64 MB). How can I construct the key-value pair such that the mapper emits ((word#document), 1)? The article explains it beautifully. However, I'm having a little trouble understanding how a mapper can distinguish the number of the document it's reading when there are multiple mappers. As far as I understand, the mapper counter is specific to the corresponding mapper. Anyone care to elaborate, or provide some suggestions as to what I should do?
I think I came up with my own solution. Instead of looking at mapper counts, I appended a tag with a number to the end of each line, as in "This is a text, n:1". I used gsub to create the increment. In the mapper, while I read the line, I also read the value n:1. Since n increases with each line, no matter which mapper reads which line, it gets the correct value of n. I then use the value of n to create a new key for each line (document), as in ((word#doc=n), 1), where n is the line number.
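The key scheme is language-agnostic, so here is a minimal Python sketch of the mapper logic described above (the original uses R with Rhipe; the tag format "…, n:1" and the function name map_line are taken from the description):

```python
def map_line(line):
    """Emit ((word#doc=n), 1) pairs for a line tagged with its number,
    e.g. 'This is a text, n:1'. The tag travels with the line, so any
    mapper that receives it recovers the correct document number."""
    text, tag = line.rsplit(", n:", 1)  # split off the trailing tag
    doc = int(tag)
    return [(f"{word}#doc={doc}", 1) for word in text.split()]

pairs = map_line("hello world, n:3")
```

The reducer then only needs to sum the counts per (word#doc) key to obtain the per-document word counts, regardless of how the lines were distributed across mappers.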
