Are special tokens [CLS] and [SEP] absolutely necessary when fine-tuning BERT?

I am following the tutorial https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/ to do Named Entity Recognition with BERT.
While fine-tuning, before feeding the tokens to the model, the author does:
input_ids = pad_sequences(
    [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
    maxlen=MAX_LEN, dtype="long", value=0.0,
    truncating="post", padding="post")
According to my tests, this doesn't add the special tokens to the IDs. So am I missing something, or is it not always necessary to include [CLS] (101) and [SEP] (102)?

I'm also following this tutorial. It worked for me without adding these tokens; however, I found in another tutorial (https://vamvas.ch/bert-for-ner) that it is better to add them, because the model was trained in this format.
[Update]
Actually, I just checked it, and it turns out that accuracy improved by 20% after adding the tokens. But note that I am using it on a different dataset.
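For reference, here is a minimal sketch of what adding the special tokens could look like, done by hand on already-converted ID lists so it slots in before the padding step. The helper function is hypothetical; 101 and 102 are the standard [CLS]/[SEP] IDs in BERT's vocabulary:

```python
CLS_ID, SEP_ID = 101, 102  # standard [CLS] and [SEP] IDs in BERT's vocabulary

def add_special_tokens(id_sequences, max_len):
    """Wrap each sequence of token IDs with [CLS] ... [SEP]."""
    wrapped = []
    for ids in id_sequences:
        # Reserve two positions so the wrapped sequence still fits in max_len.
        ids = ids[:max_len - 2]
        wrapped.append([CLS_ID] + ids + [SEP_ID])
    return wrapped
```

Note that with the transformers library, `tokenizer.encode(text)` adds these tokens for you, while `convert_tokens_to_ids` (as used in the tutorial) does not.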

Related

Adding more custom entities into pretrained custom NER Spacy3

I have a huge amount of textual data and want to add around 50 different entities. When I initially started working with it, I was getting a memory error. As we know, spaCy can handle about 100,000 tokens per GB, up to a maximum of 1,000,000. So I chunked my dataset into 5 sets and, using an annotator, created multiple JSON files from it. I started with one JSON file and successfully created the model; now I want to add more data to it so that I don't miss any tags and a good variety of data is used while training the model. Please guide me on how to proceed next.
I mentioned some points of confusion in a comment, but assuming that your issue is how to load a large training set into spaCy, the solution is pretty simple.
First, save your training data as multiple .spacy files in one directory. You do not have to make JSON files, that was standard in v2. For details on training data see the training data section of the docs. In your config you can specify this directory as the training data source and spaCy will use all the files there.
Next, to avoid keeping all the training data in memory, you can specify max_epochs = -1 (see the docs on streaming corpora). Using this feature means you will have to specify your labels ahead of time as covered in the docs there. You will probably also want to shuffle your training data manually.
That's all you need to train with a lot of data.
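Put together, the relevant parts of the training config might look something like the sketch below. The paths and the NER label file are placeholders; see the spaCy docs on streaming corpora for the authoritative format:

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
# directory containing your .spacy files
path = "corpus/train"

[training]
# -1 streams the corpus instead of loading it all into memory
max_epochs = -1

[initialize.components.ner.labels]
# labels must be provided up front when streaming
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
```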
The title of your question mentions adding entities to the pretrained model. It's usually better to train from scratch instead to avoid catastrophic forgetting, but you can see a guide to doing it here.

How do I access the list of generated cuts after presolving the model?

I would like to access the model information after presolving. Specifically, I would like to see the list of cuts that were added to the model during the presolving phase. Is there a way to access this list of cuts?
If cuts get added to the model during presolving, then they would be added as constraints (since the LP has not been created at that stage). Note that this does not happen very often in presolving.
I would suggest simply looking at the transformed problem and checking which constraints have been added.

Using response data from one scenario to another

With Karate, I'm looking to simulate an end-to-end test structure where I do the following:
Make a GET request to specific data
Store a value as a def variable
Use that information for a separate scenario
This is what I have so far:
Scenario: Search for asset
Given url "https://foo.bar.buzz"
When method get
Then status 200
* def responseItem = $.items[0].id // variable initialized from the response
Scenario: Modify asset found
Given url "https://foo.bar.buzz/" + responseItem
// making request payload
When method put.....
I tried reading the documentation for reusing information, but that seemed to be for more in-depth testing.
Thoughts?
It is highly recommended to model flows like this as one scenario. Please refer to the documentation: https://github.com/intuit/karate#script-structure
Variables set using def in the Background will be re-set before every Scenario. If you are looking for a way to do something only once per Feature, take a look at callonce. On the other hand, if you are expecting a variable in the Background to be modified by one Scenario so that later ones can see the updated value - that is not how you should think of them, and you should combine your 'flow' into one scenario. Keep in mind that you should be able to comment out a Scenario or skip some via tags without impacting any others. Note that the parallel runner will run Scenario-s in parallel, which means they can run in any order.
That said, maybe the Background or hooks is what you are looking for: https://github.com/intuit/karate#hooks
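Concretely, the two scenarios from the question collapsed into a single flow would look something like this (URLs and the JSON path are kept from the question; the payload and the final status check are placeholders):

```gherkin
Scenario: Search for asset, then modify it
  Given url "https://foo.bar.buzz"
  When method get
  Then status 200
  * def responseItem = $.items[0].id

  Given url "https://foo.bar.buzz/" + responseItem
  # build the request payload here
  When method put
  Then status 200
```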

How to add custom tokens to Solr to change its tokenization behaviour during indexing

It's a Drupal site with Solr for search. Mainly, I am not satisfied with the current search results for Chinese. The tokenizer has broken the words into what it supposes are small pieces. Most of them are reasonable, but it still makes mistakes: it fails to treat some phrases as valid tokens, either breaking them into pieces or not breaking them at all.
Assume I am writing in Chinese now: big data analysis is one word that shouldn't be broken up, so my search for it should find it. I also want people to find AI and big data analysis training as the first hit when they search for that exact phrase.
So I want a way to intervene in, or compensate for, the current tokens to make the search smarter.
Maybe there is a file in Solr that allows me to manually write these tokens down and relate them to certain phrases? Then, every time it indexes, Solr could use it as a reference.
There are different steps you can take to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
big data analysis is one word which shouldn't be broken, so my search on it should find it -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax [1] query parser with phrase boosting at various levels to boost subsequent tokens or phrases (pf, pf2, pf3 ... ps, ps2, ps3 ...).
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
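As a sketch, a request using those phrase-boost parameters might look like the following. The field names and boost values are made up for illustration; in practice they depend on your schema:

```text
q=AI and big data analysis training
defType=edismax
qf=title content
pf=title^10 content^5    (boost documents matching the whole phrase)
pf2=content^3            (boost documents matching consecutive word pairs)
pf3=content^2            (boost documents matching consecutive word triples)
ps=2                     (phrase slop applied to pf)
```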

How do you track usage of QRcodes?

I don't know if this is even possible, but here goes: how do you track usage of QR codes? ("track scans" may be more accurate)
I'm not primarily looking for code snippets (they are of course welcome), but a method for doing this.
Thanks for all your answers :)
Lars
Assuming that the QR codes in question are URLs, put an extra query parameter at the end of the URL containing some means of identifying the QR code. Then, have the target page check for the presence of this query parameter. If the value in the parameter is in your database, increment its counter. You'll obviously have to generate and distribute multiple QR codes with differing values for your query parameter.
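A minimal sketch of the counting side of this approach, assuming URLs like https://example.com/landing?qr=flyer-03. The parameter name qr and the in-memory counter are illustrative; a real deployment would use your web framework and a database:

```python
from urllib.parse import urlparse, parse_qs

scan_counts = {}  # QR-code id -> number of scans (stand-in for a DB table)

def record_scan(url):
    """Increment the counter for the QR-code id carried in the URL."""
    params = parse_qs(urlparse(url).query)
    code_ids = params.get("qr")
    if code_ids:  # only count URLs that actually carry the parameter
        code = code_ids[0]
        scan_counts[code] = scan_counts.get(code, 0) + 1
        return code
    return None

record_scan("https://example.com/landing?qr=flyer-03")
record_scan("https://example.com/landing?qr=flyer-03")
record_scan("https://example.com/landing")  # no qr parameter, not counted
```

Each distinct value of the parameter corresponds to one printed QR code, so the counter tells you which placement was scanned how often.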
From my experience, the easiest way to track usage of your QR codes is to encode all URLs you are planning to put into QR codes via a URL shortener like bit.ly.
By doing so you will achieve two results:
"free" clicks/scans analytics from bit.ly
you get really short URLs, which allows you to use smaller QR codes (easier to scan with low-res camera phones) or better error correction
