Unable to find the word that I added to the Huggingface Bert tokenizer vocabulary - bert-language-model

I tried to add new words to the BERT tokenizer vocab. I can see that the length of the vocab is increasing; however, I can't find the newly added word in the vocab.
tokenizer.add_tokens(['covid', 'wuhan'])
v = tokenizer.get_vocab()
print(len(v))
'covid' in tokenizer.vocab
Output:
30524
False

tokenizer.vocab and tokenizer.get_vocab() are two different things. The first contains only the base vocabulary, without the added tokens, while the second contains the base vocabulary plus the added tokens.
from transformers import BertTokenizer
t = BertTokenizer.from_pretrained('bert-base-uncased')
print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())
t.add_tokens(['covid'])
print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())
Output:
30522
30522
{}
30522
30523
{'covid': 30522}
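The behavior above can be pictured with plain dictionaries. This is only a sketch of the idea, not the actual transformers internals: the base vocabulary and the added tokens live in separate mappings, and get_vocab() returns the merged view.

```python
# A sketch of a tokenizer keeping base and added vocab in separate mappings.
base_vocab = {"the": 0, "dog": 1}   # stands in for tokenizer.vocab
added_vocab = {}                    # stands in for tokenizer.get_added_vocab()

def add_tokens(tokens):
    # New tokens get ids starting after the end of the base vocabulary.
    for tok in tokens:
        if tok not in base_vocab and tok not in added_vocab:
            added_vocab[tok] = len(base_vocab) + len(added_vocab)

def get_vocab():
    # The merged view: base vocabulary plus added tokens.
    return {**base_vocab, **added_vocab}

add_tokens(["covid"])
print("covid" in base_vocab)   # False, like 'covid' in tokenizer.vocab
print("covid" in get_vocab())  # True
```

So to check for a token you added, look it up in the merged view, not in the base mapping.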

Related

Use pre-trained model vocabulary in an appropriate way with allennlp

When using a Huggingface pre-trained model, I passed a tokenizer and indexer for my TextField in my DatasetReader, and I want to use the same tokenizer and indexer in my model. What is the appropriate way to do this in AllenNLP? (Using a config file?)
Here is my code; I think this is a bad solution. Please give me some suggestions.
In my DatasetReader:
self._tokenizer = PretrainedTransformerTokenizer(
    "microsoft/DialoGPT-small",
    tokenizer_kwargs={'cls_token': '[CLS]',
                      'sep_token': '[SEP]',
                      'bos_token': '[BOS]'})
self._tokenindexer = {"tokens": PretrainedTransformerIndexer(
    "microsoft/DialoGPT-small",
    tokenizer_kwargs={'cls_token': '[CLS]',
                      'sep_token': '[SEP]',
                      'bos_token': '[BOS]'})}
In my Model:
self.tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
num_added_tokens = self.tokenizer.add_special_tokens({'bos_token':'[BOS]','sep_token': '[SEP]','cls_token':'[CLS]'})
self.emb_dim = len(self.tokenizer)
self.embeded_layer = self.encoder.resize_token_embeddings(self.emb_dim)
I have created two tokenizers, one for the DatasetReader and one for the model, and both share the same vocabulary and special tokens. But when I add the three special tokens in the same order, they end up with different indices, so I switched the order in the model's code to get the same indices (stupid but effective).
Is there a way to pass the tokenizer or vocab from the DatasetReader to the Model?
What is the appropriate way to solve this problem in AllenNLP?
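The index mismatch described above is simply assignment order: each new token gets the next free id, so adding the same special tokens in a different order produces different ids. A minimal illustration with plain dicts (50257 is GPT-2's base vocabulary size; the helper is hypothetical, not a transformers API):

```python
def add_in_order(base_size, tokens):
    # Each new token receives the next unused id, in the order given.
    return {tok: base_size + i for i, tok in enumerate(tokens)}

a = add_in_order(50257, ["[BOS]", "[SEP]", "[CLS]"])
b = add_in_order(50257, ["[CLS]", "[SEP]", "[BOS]"])
print(a)  # {'[BOS]': 50257, '[SEP]': 50258, '[CLS]': 50259}
print(b)  # {'[CLS]': 50257, '[SEP]': 50258, '[BOS]': 50259}
```

This is why adding the tokens in the same order in both places (or sharing one tokenizer object) keeps the indices consistent.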

How to find all the related keywords for a root word?

I am trying to figure out a way to find all the keywords that come from the same root word (in some sense the opposite action of stemming). Currently, I am using R for coding, but I am open to switching to a different language if it helps.
For instance, I have the root word "rent" and I would like to be able to find "renting", "renter", "rental", "rents" and so on.
Try this code in Python:
from pattern.en import lexeme
print(lexeme("rent"))
The output is a list of the inflected forms of "rent".
Installation, if pattern is not set up yet:
pip install pattern
pip install nltk
Then open a terminal, type python, and run the code below to download the required NLTK data:
import nltk
nltk.download(["wordnet", "wordnet_ic", "sentiwordnet"])
After the installation is done, run the pattern code again.
You want to find the opposite of stemming, but stemming can be your way in.
Look at this example in Python:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["renting", "renter", "rental", "rents", "apple"]
all_rents = {}
for word in words:
    # Group each word under its stem
    stem = stemmer.stem(word)
    if stem not in all_rents:
        all_rents[stem] = []
    all_rents[stem].append(word)
print(all_rents)
Result:
{'rent': ['renting', 'rents'], 'renter': ['renter'], 'rental': ['rental'], 'appl': ['apple']}
There are several other algorithms to choose from. However, keep in mind that stemmers are rule-based and are not "smart" to the point where they will select all related words (as seen above, "renter" and "rental" are not grouped with "rent"). You can even implement your own rules (extend the stemmer API from NLTK).
Read more about all available stemmers in NLTK (the module that was used in the above example) here: https://www.nltk.org/api/nltk.stem.html
You can implement your own algorithm as well. For example, you can implement Levenshtein distance (as proposed in @noski's comment) to find words sharing a long common prefix. However, you have to do your own research on this one, since it is a complex process.
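If you do want to experiment with that route, a standard dynamic-programming implementation of Levenshtein distance is short enough to sketch here (plain Python, no libraries assumed):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("rent", "renting"))  # 3
print(levenshtein("rent", "apple"))    # 5
```

A small distance (or a long shared prefix) can then serve as a relatedness signal, with the same caveat as stemming: it will happily match unrelated look-alikes like "brent".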
For an R answer, you can try these functions as a starting point. d.b gives grepl as an example, here are a few more:
words = c("renting", "renter", "rental", "rents", "apple", "brent")
grepl("rent", words) # TRUE TRUE TRUE TRUE FALSE TRUE
startsWith(words, "rent") # TRUE TRUE TRUE TRUE FALSE FALSE
endsWith(words, "rent") # FALSE FALSE FALSE FALSE FALSE TRUE

How can I create a large file with random but sensible English words?

I want to test my wordcount software based on the MapReduce framework with a very large file (over 1GB), but I don't know how to generate one.
Are there any tools to create a large file with random but sensible English sentences?
Thanks
A simple Python script can create a pseudo-random document of words. Here is one I wrote for just such a task a year ago:
import random

file1 = open("test.txt", "a")
PsudoRandomWords = ["Apple ", "Banana ", "Tree ", "Pickle ", "Toothpick ", "Coffee ", "Done "]
# Increase the range to make a bigger file
for x in range(150000000):
    # Change the end of the randint range below if you add more words
    index = random.randint(0, 6)
    file1.write(PsudoRandomWords[index])
    # Start a new line every 20 words
    if x % 20 == 0:
        file1.write('\n')
file1.close()
Just add more words to the list to make it more varied, and widen the randint range to match. I just tested it, and it should create a document named test.txt of just about one gigabyte, containing words from the list in random order separated by a newline every 20 words.
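A variant of the same idea that targets a byte size instead of a fixed word count may be more convenient. This is just a sketch; the filename and the 1 MiB target are arbitrary (raise the target to 2**30 for ~1 GB):

```python
import os
import random
import tempfile

words = ["Apple", "Banana", "Tree", "Pickle", "Toothpick", "Coffee", "Done"]
target = 1024 * 1024  # 1 MiB for a quick run; use 2 ** 30 for ~1 GB
path = os.path.join(tempfile.gettempdir(), "random_words.txt")

written = 0
with open(path, "w") as f:
    while written < target:
        # 20 random words per line, mirroring the newline-every-20-words scheme
        line = " ".join(random.choice(words) for _ in range(20)) + "\n"
        f.write(line)
        written += len(line)

print(written >= target)  # True
```

Tracking the byte count as you write avoids having to guess how many words add up to the size you want.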
I wrote this simple Python script that scrapes the Project Gutenberg site and writes the text (encoding: us-ascii; if you want to use others, see http://www.gutenberg.org/files/) to a local text file. This script can be used in combination with https://github.com/c-w/gutenberg to do more accurate filtering (by language, by author, etc.)
from __future__ import print_function
import requests
import sys

if len(sys.argv) != 2:
    print("[---------- ERROR ----------] Usage: scraper <number_of_files>", file=sys.stderr)
    sys.exit(1)
number_of_files = int(sys.argv[1])
text_file = open("big_file.txt", 'w+')
for i in range(number_of_files):
    url = 'http://www.gutenberg.org/files/' + str(i) + '/' + str(i) + '.txt'
    resp = requests.get(url)
    if resp.status_code != 200:
        print("[X] resp.status_code =", resp.status_code, "for", url)
        continue
    print("[V] resp.status_code = 200 for", url)
    try:
        content = resp.text
        # dummy cleaning of the text
        splitted_content = content.split("*** START OF THIS PROJECT GUTENBERG EBOOK")
        splitted_content = splitted_content[1].split("*** END OF THIS PROJECT GUTENBERG EBOOK")
        print(splitted_content[0], file=text_file)
    except:
        continue
text_file.close()

Documenter.jl: #ref a specific method of a function

Let's say I have two methods
"""
f(x::Integer)
Integer version of `f`.
"""
f(x::Integer) = println("I'm an integer!")
"""
f(x::Float64)
Float64 version of `f`.
"""
f(x::Float64) = println("I'm floating!")
and produce doc entries for those methods in my documentation using Documenter.jl's @autodocs or @docs.
How can I refer (with @ref) to one of the methods?
I'm searching for something like [integer version](@ref f(::Integer)) (which unfortunately does not work) rather than just [function f](@ref f) or [f](@ref).
Note that for generating the doc entries, @docs has a similar feature. From the guide page:
[...] include the type in the signature with
```@docs
length(::T)
```
Thanks in advance.
x-ref: https://discourse.julialang.org/t/documenter-jl-ref-a-specific-method-of-a-function/8792
x-ref: https://github.com/JuliaDocs/Documenter.jl/issues/569#issuecomment-362760811
As pointed out by @mortenpi on Discourse and GitHub:
You would normally refer to a function with [`f`](@ref), with the name
of the function being referred to between backticks in the text part of
the link. You can then also refer to specific signatures, e.g. with
[`f(::Integer)`](@ref).
The @ref section in the docs should be updated to mention this
possibility.

looking for example for QCompleter with segmented completion / tree models

The PySide docs include this section on QCompleter with tree models:
PySide.QtGui.QCompleter can look for completions in tree models, assuming that any item (or sub-item or sub-sub-item) can be unambiguously represented as a string by specifying the path to the item. The completion is then performed one level at a time.
Let’s take the example of a user typing in a file system path. The model is a (hierarchical) PySide.QtGui.QFileSystemModel . The completion occurs for every element in the path. For example, if the current text is C:\Wind , PySide.QtGui.QCompleter might suggest Windows to complete the current path element. Similarly, if the current text is C:\Windows\Sy , PySide.QtGui.QCompleter might suggest System .
For this kind of completion to work, PySide.QtGui.QCompleter needs to be able to split the path into a list of strings that are matched at each level. For C:\Windows\Sy , it needs to be split as “C:”, “Windows” and “Sy”. The default implementation of PySide.QtGui.QCompleter.splitPath() , splits the PySide.QtGui.QCompleter.completionPrefix() using QDir.separator() if the model is a PySide.QtGui.QFileSystemModel .
To provide completions, PySide.QtGui.QCompleter needs to know the path from an index. This is provided by PySide.QtGui.QCompleter.pathFromIndex() . The default implementation of PySide.QtGui.QCompleter.pathFromIndex() , returns the data for the edit role for list models and the absolute file path if the mode is a PySide.QtGui.QFileSystemModel.
But I can't seem to find an example showing how to do this. Can anyone point me at an example I can use as a starting point? (In my investigation it looks like maybe the hard part is the tree model rather than the QCompleter)
It looks like you would need to provide these functions:
the ability to split a string into segments (for the example given, C:\Windows\Sy to ['C:', 'Windows', 'Sy'])
the ability to list the items that complete the last segment (e.g. all the items under ['C:', 'Windows'])
I found an example for the basic functionality of QCompleter and have been able to tweak the basics fine (see below), I just don't know how to go about implementing a tree model type application.
'''based on
http://codeprogress.com/python/libraries/pyqt/showPyQTExample.php?index=403&key=QCompleterQLineEdit'''
from PySide.QtGui import *
from PySide.QtCore import *
import sys

def main():
    app = QApplication(sys.argv)
    edit = QLineEdit()
    strList = '''
Germany;Russia;France;
french fries;frizzy hair;fennel;fuzzball
frayed;fickle;Frobozz;fear;framing;frames
Franco-American;Frames;fancy;fire;frozen yogurt
football;fnord;foul;fowl;foo;bar;baz;quux
family;Fozzie Bear;flinch;fizzy;famous;fellow
friend;fog;foil;far;flower;flour;Florida
'''.replace('\n', ';').split(";")
    strList.sort(key=lambda s: s.lower())
    completer = QCompleter(strList, edit)
    completer.setCaseSensitivity(Qt.CaseInsensitive)
    edit.setWindowTitle("PySide QLineEdit Auto Complete")
    edit.setCompleter(completer)
    edit.show()
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()
I couldn't find a good example for what I wanted, but I figured out how to adapt the Qt TreeModel example to using a QCompleter:
https://gist.github.com/jason-s/9dcef741288b6509d362
The QCompleter is the easy part: you just have to tell it how to split a path into segments, and then how to get from a particular entry in the model back to a path:
class MyCompleter(QtGui.QCompleter):
    def splitPath(self, path):
        # Split the completion prefix into one segment per tree level
        return path.split('/')

    def pathFromIndex(self, index):
        # Walk from the item up to the root, collecting each level's text
        result = []
        while index.isValid():
            result = [self.model().data(index, QtCore.Qt.DisplayRole)] + result
            index = index.parent()
        return '/'.join(result)
Aside from that, you have to configure the QCompleter properly, telling it how to get from a model item to a text string. Here I set it up to use the DisplayRole and to use column 0.
edit = QtGui.QLineEdit()
completer = MyCompleter(edit)
completer.setModel(model)
completer.setCompletionColumn(0)
completer.setCompletionRole(QtCore.Qt.DisplayRole)
completer.setCaseSensitivity(QtCore.Qt.CaseInsensitive)
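The segmented-completion logic itself can be sketched without Qt at all: split the prefix into segments, walk a nested mapping down past the complete segments, and prefix-match the final partial one (the tree here is hypothetical toy data):

```python
# Hypothetical toy tree standing in for a hierarchical model.
tree = {"C:": {"Windows": {"System": {}, "System32": {}}, "Wind": {}}}

def complete(prefix, sep="\\"):
    # Split the prefix into segments, like QCompleter.splitPath()
    parts = prefix.split(sep)
    node = tree
    for part in parts[:-1]:        # walk down the complete segments
        node = node.get(part, {})
    last = parts[-1]               # prefix-match the partial final segment
    return [name for name in node if name.startswith(last)]

print(complete("C:\\Windows\\Sy"))  # ['System', 'System32']
```

This is essentially what the QCompleter does one level at a time against the tree model, with splitPath() providing the segments and the model providing each level's children.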
As the documentation for QCompleter says, you can provide either of two kinds of model: a list model or a tree model.
Example for a list model, following your example:
from PySide import QtGui
app = QtGui.QApplication([])
edit = QtGui.QLineEdit()
strList = "Germany;Russia;France;Norway".split(";")
completer = QtGui.QCompleter(strList)
edit.setCompleter(completer)
edit.show()
app.exec_()
This works:
And as a tree model:
from PySide import QtGui, QtCore
app = QtGui.QApplication([])
edit = QtGui.QLineEdit()
model = QtGui.QFileSystemModel()
model.setFilter(QtCore.QDir.AllDirs | QtCore.QDir.Drives)
model.setRootPath('')
completer = QtGui.QCompleter(model, edit)
edit.setCompleter(completer)
edit.show()
app.exec_()
For some strange reason nothing is displayed here. I will investigate later.