How to add a new special token to the tokenizer? - bert-language-model

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?
Apart from this I have two more inputs.
I was wondering if I should put special tokens in the conversation to make it more meaningful to the BERT model, like:
[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]
But I am not able to add a new [EOT] special token.
Or should I use [SEP] token for this?
EDIT: steps to reproduce
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids) # --> [100, 102, 0, 101, 103]
num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer)) # --> Embedding(30523, 768)
tokenizer.convert_tokens_to_ids('[EOT]') # --> 30522
text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''
enc = tokenizer.encode_plus(
    text_to_encode,
    max_length=128,
    add_special_tokens=True,
    return_token_type_ids=False,
    return_attention_mask=False,
)['input_ids']
print(tokenizer.convert_ids_to_tokens(enc))
Result:
['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question',
'.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away',
'.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the',
'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it',
'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query',
':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']',
'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything',
'else', '?', '[SEP]']

Since the intention of the [SEP] token is to act as a separator between two sentences, it fits your objective of using [SEP] to separate sequences of QUERY and ANSWER.
You could also try adding different tokens to mark the beginning and end of each turn, such as <BOQ> and <EOQ> for QUERY, and likewise <BOA> and <EOA> for ANSWER.
Sometimes, using existing tokens works much better than adding new tokens to the vocabulary, since learning a new token embedding requires a large number of training iterations as well as a lot of data.
However, if your application demands a new token, it can be added as follows:
num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True)  # updated: special_tokens=True
model.resize_token_embeddings(len(tokenizer))
# The tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
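The effect of special_tokens=True can be checked directly: once [EOT] is registered as a special token, the tokenizer keeps it as a single piece instead of splitting it into '[', 'e', '##ot', ']'. A minimal sketch, assuming transformers is installed and bert-base-uncased can be downloaded:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(['[EOT]'], special_tokens=True)

# '[EOT]' now survives tokenization as one token
print(tokenizer.tokenize("Okay, nice to know. [EOT]"))
# → ['okay', ',', 'nice', 'to', 'know', '.', '[EOT]']
```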


Returning multiple values from one step using 'select', 'by' in Gremlin

My graph schema looks like this:
(Location)<-[:INVENTOR_LOCATED_IN]-(Inventor)-[:INVENTOR_OF]->(Patent)
I'm trying to return multiple values from each step in the query path. Here's the query I have so far that runs correctly:
g.V().and(has('Location', 'city', textContains('Bloomington')), has('Location','state',textContains('IN'))).as('a').
bothE().bothV().hasLabel('Inventor').as('b').
bothE().bothV().has('Patent', 'title', textContains('Lid')).as('c').
select('a', 'b', 'c').
by('state').by('name_first').by('title').
fold();
What I would like to do is for each step return two node properties. I tried the following but it returns an error:
g.V().and(has('Location', 'city', textContains('Bloomington')), has('Location', 'state',textContains('IN'))).as('a').
bothE().bothV().hasLabel('Inventor').as('b').
bothE().bothV().has('Patent', 'title', textContains('Lid')).as('c').
select('a', 'b', 'c').
by('city', 'state').by('name_first', 'name_last').by('title', 'abstract').
fold();
Can anyone suggest syntax that will allow me to return multiple properties from each node in the path?
The by(key) is meant as a sort of shorthand for values(key), which means that if you have more than one value you could do:
g.V().and(has('Location', 'city', textContains('Bloomington')), has('Location', 'state',textContains('IN'))).as('a').
bothE().bothV().hasLabel('Inventor').as('b').
bothE().bothV().has('Patent', 'title', textContains('Lid')).as('c').
select('a', 'b', 'c').
by(values('city', 'state').fold()).
by(values('name_first', 'name_last').fold()).
by(values('title', 'abstract').fold()).
fold()
You might also consider forms of elementMap(), valueMap(), or project() as alternatives. Since by() takes a Traversal you have a lot of flexibility.

Obtain lists of plot marker and line styles in Octave

Is there a way to programmatically obtain the list of marker and line styles available for plotting in Octave?
Ideally, I would do something like
mslist = whatever_function_for_marker_styles;
lslist = whatever_function_for_line_styles;
for i = 1:np
  plot(x, y(i,:), 'marker', mslist(i), 'linestyle', lslist(i))
endfor
Notes:
I would add some mod functions to cycle across the lists.
I know the size of both lists may not be the same, so they may shift from one another upon cycling.
The easiest way would be to get the symbols from the manual and put them in a cell array:
mslist = {'+', 'o', '*', '.', 'x', 's', 'd', '^', 'v', '>', '<', 'p', 'h'};
lslist = {'-', '--', ':', '-.'};
You can loop over them with a standard for-loop and access them by index using curly brackets, e.g. lslist{i}. The symbols are in Section 15.2.1 of the manual (https://octave.org/doc/v6.1.0/Two_002dDimensional-Plots.html#Two_002dDimensional-Plots). An ordinary vector would work for mslist instead of a cell array as all the symbols are single characters, but not for lslist where some of them are two characters long.
I agree with the other answer that doing it 'entirely' programmatically is probably overkill.
However, if you do want to do that, my bet would be to parse the 'help' output for the 'plot' command, which is guaranteed to mention these points, and has a reasonable guarantee that it will remain in the same format even if more markers are added in the future etc.
I won't parse the whole thing, but if you were to do this, you'd probably start like this:
plotdoc = help('plot');
[plotdoc_head , plotdoc_rest] = deal( strsplit( plotdoc , ' linestyle' ){:} );
[plotdoc_lines , plotdoc_rest] = deal( strsplit( plotdoc_rest, ' marker' ){:} );
[plotdoc_markers, plotdoc_rest] = deal( strsplit( plotdoc_rest, ' color' ){:} );
[plotdoc_colors , plotdoc_rest] = deal( strsplit( plotdoc_rest, '";displayname;"' ){:} );
or something along those lines, and then use regexp or strfind / strtok / strsplit creatively to obtain the necessary tokens in each category.
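As a side note, the mod-based cycling the question describes is straightforward once the two lists are in hand. Sketched here in Python purely to illustrate the indexing (the style lists are the ones quoted from the Octave manual above):

```python
# Marker and line styles as listed in the Octave manual (Section 15.2.1)
mslist = ['+', 'o', '*', '.', 'x', 's', 'd', '^', 'v', '>', '<', 'p', 'h']
lslist = ['-', '--', ':', '-.']

def style_for(i):
    # mod-based cycling; since the lists differ in length (13 vs 4),
    # the pairings shift against each other as i grows
    return mslist[i % len(mslist)], lslist[i % len(lslist)]

print(style_for(0))   # → ('+', '-')
print(style_for(4))   # → ('x', '-')
```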

Returning text between a starting and ending regular expression

I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted. The full text of each article starts with a well-defined phrase, ^Full text:. The ending of the full text, however, is not demarcated. The best I can figure is that the full text ends with a variety of metadata tags that look like Subject: , CREDIT:, or Credit.
So, I can certainly get the start of the article. But, I am having a great deal of difficulty finding a way to select the text between the start and the end of the full text.
This is complicated by two factors. First, the ending string varies, although I feel like I could settle on something like '^[[:alnum:]]{5,}: ' and that would capture the ending. The other complicating factor is that similar tags appear prior to the start of the full text. How do I get R to return only the text between the Full text regex and the ending regex?
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
My current attempt is here:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thank you.
This just searches for the element matching 'Full text:', then the next element after that matching ':'
get_text <- function(x){
  start <- grep('Full text:', x)
  end <- grep(':', x)
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
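The same slicing logic ports directly to other languages. Here is a sketch of the equivalent in Python, purely for illustration, with the question's character class written out in full:

```python
import re

def get_text(lines):
    # index of the element where the full text starts
    start = next(i for i, s in enumerate(lines) if s.startswith('Full text:'))
    # first later element that looks like a metadata tag, e.g. 'Subject: ...'
    end = next(i for i in range(start + 1, len(lines))
               if re.match(r'[A-Za-z]{5,}: ', lines[i]))
    return lines[start:end]

test = ['Document 1', 'Article title', 'Author: Author Name', 'https://a/url',
        'Abstract: none', 'Full text: some article text that I need to capture',
        'the second line of the article that I need to capture',
        'Subject: A subject', 'Publication: Publication', 'Location: A country']
print(get_text(test))  # the two 'Full text' / 'second line' elements
```

Note that scanning only from start + 1 onward is what keeps earlier tags such as Abstract: from terminating the slice.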

creating multi-dimensional array from textfile in perl [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
The file temp.txt has contents like this:
ABC 1234 56 PQR
XYZ 8672 12 RQP
How to store the temp.txt file into a two dimensional array, so that I can access them through the array index?
use File::Slurp;
use Data::Dumper;
my @arr = map [split], read_file("temp.txt");
print Dumper \@arr;
output
$VAR1 = [
          [
            'ABC',
            '1234',
            '56',
            'PQR'
          ],
          [
            'XYZ',
            '8672',
            '12',
            'RQP'
          ]
        ];
At a minimum, you could do this
my @file = load_file($filename);
sub load_file {
    my $filename = shift;
    open my $fh, "<", $filename or die "load_file cannot open $filename: $!";
    my @file = map [ split ], <$fh>;
    return @file;
}
This will read the given file, split each line on whitespace, put the pieces inside an array ref (one per line), and return the list of array refs. On exiting the subroutine, the file handle is closed.
This is a somewhat clunky solution, in some ways. It loads the entire file into memory, it does not have a particularly fast lookup when you are looking for a specific value, etc. If you have a unique key in each row, you can use a hash instead of an array, to make lookup faster:
my %file = map { my ($key, @vals) = split; $key => \@vals; } <$fh>;
Note that the keys must be unique, or they will overwrite each other.
Or you can use Tie::File to only look up the values you want:
use Tie::File;
tie my @file, 'Tie::File', $filename or die "Cannot tie file: $!";
my $line = [ split ' ', $file[0] ];
Or if you have a specific delimiter on the lines of your file, and a format that complies with the CSV format, you can use Tie::Array::CSV:
use Tie::Array::CSV;
tie my @file, 'Tie::Array::CSV', $filename, sep_char => ' '
    or die "Cannot tie file: $!";
my $line = $file[0];
Note that using this module might be overkill, and it might cause problems if you do not have a strict CSV format. Also, Tie::File has a reputation for poor performance. Which solution is best depends largely on your needs and preferences.
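For comparison only, the keyed-lookup variant (first column as a unique key, remaining columns as the value list) can be sketched in Python; the function name is illustrative:

```python
def load_keyed(path):
    # first whitespace-separated field becomes the key,
    # the remaining fields become the value list
    # (duplicate keys overwrite earlier entries, as noted above)
    table = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            table[fields[0]] = fields[1:]
    return table
```

With the temp.txt from the question, load_keyed('temp.txt')['ABC'] would give ['1234', '56', 'PQR'].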

How to sanitize the post_name value before inserting in WordPress?

How to sanitize the post_name value before inserting in WordPress?
Simple:
$post_title = sanitize_title_with_dashes($post_title);
But WordPress does this for you already. I assume you need it for something different?
I'm guessing you're sanitizing by direct SQL insertion. Instead, consider using wp_insert_post() in your insertion script.
$new_post_id = wp_insert_post(array(
    'post_title' => "This <open_tag insane title thing<b>LOL!;drop table `bobby`;"
));
At this point, you just worry about your title - and not the slug, post name, etc. WP will take care of the rest, including (at least security) sanitization. The resulting slug ends up fairly usable.
This function can be used by simply doing include( "wp-config.php" ); and going about your business without any other PHP overhead.
If you are dealing with some funky titles to begin with, a simple strip_tags(trim()) might do the trick. Otherwise, you've got other problems to deal with ;-)
Some solution might be found at http://postedpost.com/2008/06/23/ultimate-wordpress-post-name-url-sanitize-solution/
Also, you might want to do it as follows:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
$post_name = str_replace(' ', '-', str_replace($special_chars, '', strtolower($post_name)));
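The same strip-then-dash idea, sketched in Python for illustration (the character list is the one above; this mirrors the str_replace approach, not WordPress's own sanitizer):

```python
def sanitize_slug(name):
    # drop the special characters, lower-case, then turn spaces into dashes
    special = set('?[]/\\=<>:;,\'"&$#*()|~`!{}')
    cleaned = ''.join(ch for ch in name.lower() if ch not in special)
    return cleaned.replace(' ', '-')

print(sanitize_slug("Some Funky Title!"))  # → 'some-funky-title'
```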
