How to map token indices from the SQuAD data to tokens from the BERT tokenizer?

I am using the SQuAD dataset for answer span selection. After tokenizing the passages with BertTokenizer, the start and end indices of the answer no longer match the real answer span position in the passage tokens for some samples. How can I solve this problem? One way would be to modify the answer indices (which are also the training targets) accordingly, but how do I do that?

The tokenization in the original dataset is different from how BERT tokenizes the input. In BERT, less frequent words get split into subword units. You can easily find out the character offsets of the tokens in the original dataset.
In newer versions of Transformers, the fast tokenizers (such as BertTokenizerFast) accept the option return_offsets_mapping. If it is set to True, the tokenizer returns the character offsets of each token as a tuple (char_start, char_end). If you have the character offsets in the original text, you can map them to the output of the tokenizer.
from transformers import BertTokenizerFast
tok = BertTokenizerFast.from_pretrained("bert-base-cased")
tok("I am a tokenizer.", return_offsets_mapping=True)
The output:
{'input_ids': [101, 146, 1821, 170, 22559, 17260, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
'offset_mapping': [(0, 0), (0, 1), (2, 4), (5, 6), (7, 12), (12, 16), (16, 17), (0, 0)]}
The (0, 0) spans correspond to technical tokens, in the case of BERT [CLS] and [SEP].
Once you have the offsets under both the original tokenization and the BERT tokenization, you can work out the indices of the answer in the re-tokenized string.
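As a minimal sketch of that mapping, assume the answer is given as a character offset into the passage (as with SQuAD's answer_start) plus the answer text. The helper char_span_to_token_span is a hypothetical name, not part of Transformers; it walks an offset_mapping and returns the first and last token whose character span overlaps the answer. The offsets are copied from the output above so that the snippet runs without downloading a model.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Map the character span [char_start, char_end) to inclusive token indices.

    `offsets` is a list of (start, end) tuples such as the 'offset_mapping'
    returned by a fast tokenizer. Hypothetical helper for illustration only.
    """
    token_start = token_end = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # empty span: special token such as [CLS] or [SEP]
        # Keep every token that overlaps the answer's character range.
        if tok_end > char_start and tok_start < char_end:
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end


text = "I am a tokenizer."
offsets = [(0, 0), (0, 1), (2, 4), (5, 6), (7, 12), (12, 16), (16, 17), (0, 0)]

answer = "tokenizer"
char_start = text.index(answer)       # stands in for SQuAD's answer_start
char_end = char_start + len(answer)

span = char_span_to_token_span(offsets, char_start, char_end)
print(span)  # (4, 5): the sub-word tokens covering "tokenizer"
```

When the question and passage are encoded together as a pair, the offsets are relative to each sequence separately, so real training code would also need to restrict the search to the passage tokens.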

Related

Rebalancing table with average value

I'm looking for an algorithm that solves the following problem:
Given a table with values: [6, 2, 3, 1]
Compute the average value per slice: 12/4 = 3
Re-balance the values in the slice. I should get formulas something like: 1) 0, 2) 6-1, 3) 0, 4) 5-2, i.e. deduct from the highest value the amount needed to reach the average. As a result, I should get the slice [3, 3, 3, 3].
Thanks
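One possible approach, as a rough sketch rather than a definitive answer: move the surplus from the entries above the average to the entries below it, starting with the largest entry. The function name rebalance and the (donor, receiver, amount) transfer format are assumptions made for illustration, and the sketch assumes integer values whose total divides evenly by the number of entries.

```python
def rebalance(values):
    """Return (balanced_values, transfers) for a list of integers.

    Each transfer is a (donor_index, receiver_index, amount) tuple.
    Assumes the total divides evenly; hypothetical helper for illustration.
    """
    if not values:
        return [], []
    total = sum(values)
    if total % len(values) != 0:
        raise ValueError("total must divide evenly by the number of entries")
    average = total // len(values)

    balanced = list(values)
    # Entries above the average give, largest first; entries below it receive.
    donors = sorted(
        (i for i, v in enumerate(balanced) if v > average),
        key=lambda i: balanced[i],
        reverse=True,
    )
    receivers = [i for i, v in enumerate(balanced) if v < average]

    transfers = []
    d = 0  # position of the current donor in `donors`
    for r in receivers:
        while balanced[r] < average:
            donor = donors[d]
            amount = min(balanced[donor] - average, average - balanced[r])
            balanced[donor] -= amount
            balanced[r] += amount
            transfers.append((donor, r, amount))
            if balanced[donor] == average:
                d += 1  # this donor has no surplus left
    return balanced, transfers


balanced, transfers = rebalance([6, 2, 3, 1])
print(balanced)   # [3, 3, 3, 3]
print(transfers)  # [(0, 1, 1), (0, 3, 2)]
```

The two transfers correspond to the "6-1" and "5-2" steps in the question: the first entry gives 1 to the second entry, then 2 to the fourth.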

How to avoid this error: only 0's may be mixed with negative subscripts?

I have this example:
x=c(NA, 2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -2, 2, -14, -15, -16, -17, 2, -19, -20)
g= head(x[!is.na(x)], 13)
I want to exclude values that were already used for g.
y=x[-(head(x[!is.na(x)], 13))]
Is there a better way to do this?
I got this error:
Error in x[-(head(x[!is.na(x)], 13))] :
only 0's may be mixed with negative subscripts
any idea why?
You can use %in% to check which values of x are contained in g, and then drop every position whose value is in g:
x[!(x %in% (head(x[!is.na(x)], 13))) | (1:length(x)) > which(cumsum(!is.na(x)) == 13)]
Your error occurs because you are mixing positive and negative indices when subsetting x. You do not need to work with indices at all; instead, create a logical vector that marks every position whose value is not contained in g.
EDIT: I added a second logical vector which makes sure that values after the index of the 13th non-NA value cannot be removed, since they can never be contained in g (because g is a subset of the first 13 non-NA values of x). There may be an easier solution, but this should do it.
It depends on the problem you are solving. If you want to delete the non-NA elements at positions 1-10, use:
pos = which(!is.na(x[1:10]))
y = x[-pos]
If you want to delete every element whose value also appears in g, use:
setdiff(x, g)
As far as the error is concerned, in your case the values you use as subscripts are a mix of positive and negative numbers. For example,
letters[c(-1, -3, 5, 6, 7)]
Error: only 0's may mix with negative subscripts
A brilliant description is given here: https://www.stat.berkeley.edu/~nolan/stat133/Fall05/notes/RSubset.html

How do you count the number of results?

Write a program that reads a series of numbers, ending with 0, and then tells you how
many numbers you have keyed in (other than the last 0). For example, if you keyed in
the numbers 5, -10, 50, 22, -945, 12, 0 it would output ‘You have entered 6 numbers.’.
I'm doing my homework and can't get this one to work.
What stumps me: I understand adding the numbers to get the sum total, but what do I call to get the number of numbers?
Thanks
Python strings have a simple method that could be used here, str.count(). Given that the numbers are separated by commas, you can count the commas to get the number of numbers (not including the final 0, which has no comma after it). An example of this in use would be
numbers = "5, -10, 50, 22, -945, 12, 0"
number_of_numbers = numbers.count(',')
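If the numbers are keyed in one at a time rather than as a single comma-separated line, a running counter does the job. The sketch below is one way to write it; count_until_zero is a hypothetical helper name, and it takes any iterable of numbers so that it can be tried without typing input by hand.

```python
def count_until_zero(values):
    """Count the numbers that appear before the first 0 (the 0 is not counted)."""
    count = 0
    for value in values:
        if value == 0:
            break  # the terminating 0 ends the series
        count += 1
    return count


entered = [5, -10, 50, 22, -945, 12, 0]
count = count_until_zero(entered)
print(f"You have entered {count} numbers.")  # You have entered 6 numbers.
```

To read from the keyboard instead, the same function could be fed a generator that yields int(input()) for each line typed.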

Increment number stored as array of digit-counters

I'm trying to store a counter that can become very large (well over 32 and probably 64-bit limits), but rather than use a single integer, I'd like to store it as an array of counters for each digit. This should be pretty language-agnostic.
In this form, 0 would be [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] (one zero, none of the other digits up to 9). 1 would be [0, 1, 0, ...] and so on. 10 would therefore be [1, 1, 0, ...].
I can't come up with a way to keep track of which digits should be decremented (moving from 29 to 30, for example) and how those should be moved. I suspect that it can't be done without another counter, either a single value representing the last cell touched, or an array of 10 more counters to flag when each digit should be touched.
Is it possible to represent a number in this fashion and count up without using a simple integer value?
No, this representation by itself would be useless because it fails to encode digit position, leading to many numbers having the same representation (e.g. 121 and 211).
Either use a bignum library or 80 bits' worth of raw binary (which is sufficient to store your declared range of 10e23).
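The collision is easy to check with a short sketch. The helper digit_counts is a hypothetical name used only for this illustration; it builds the per-digit counter array described in the question.

```python
def digit_counts(n):
    """Return a 10-element list: how many times each digit 0-9 occurs in n."""
    counts = [0] * 10
    for ch in str(n):
        counts[int(ch)] += 1
    return counts


print(digit_counts(10))                        # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(digit_counts(121) == digit_counts(211))  # True: the two numbers collide
```

Because different numbers share one representation, there is no way to tell which number comes next from the counts alone.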

Getting Unique Number Combinations

Is it possible without using exponentiation to have a set of numbers that when added together, always give unique sum?
I know it can be done with exponentiation (see first answer): The right way to manage user privileges (user hierarchy)
But I'm wondering if it's possible without exponentiation.
No, you can only use exponentiation, because the sum of the lower values has to be less than the new number for every sum to be unique: 1+2=3 < 4, 1+2+4=7 < 8.
[EDIT:]
This is a layman's explanation; of course there are other possibilities, but none as efficient as using powers of 2.
There's a chance it can be done without exponentiation (I'm no math expert), but not in any way that's more efficient than exponentiation. This is because it only takes one bit of storage space per possible value, and as an added plus you can use boolean operators to do useful stuff with the values.
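To illustrate the point about one bit per value and boolean operators, here is a small sketch. The names READ, WRITE and DELETE are made up for the example; any set of distinct powers of two behaves the same way.

```python
# Hypothetical privilege flags: each one is a distinct power of two,
# so each occupies exactly one bit.
READ, WRITE, DELETE = 1, 2, 4

granted = READ | DELETE             # combine flags with bitwise OR
can_write = bool(granted & WRITE)   # test a flag with bitwise AND
can_delete = bool(granted & DELETE)

print(granted)     # 5, which can only come from READ + DELETE
print(can_write)   # False
print(can_delete)  # True
```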
If you restrict yourself to integers, the numbers have to grow at least as fast as an exponential function. If you find a function that grows faster (like, oh, maybe the Ackermann function) then the numbers produced by that will probably work too.
With floating-point numbers, you can keep adding unique irreducible roots of primes (sqrt(2), sqrt(3), sqrt(5), ...) and you will always get something unique, up until you hit the limits of floating-point precision. Not sure how many unique numbers you could squeeze out of it - maybe you should try it.
No. To see this directly, think about building up the set of basis values by considering at each step the smallest possible positive integer that could be included as the next value. The next number to add must be different from all possible sums of the numbers already in the set (including the empty sum, which is 0), and can't combine with any combination of numbers already present to produce a duplicate. So...
{} : all possible sums = {0}, smallest possible next = 1
{1} : all possible sums = {0, 1}, smallest possible next = 2
{1, 2} : all possible sums = {0, 1, 2, 3}, smallest possible next = 4
{1, 2, 4} : a.p.s. = {0, 1, 2, 3, 4, 5, 6, 7}, s.p.n. = 8
{1, 2, 4, 8} ...
And, of course, we're building up the binary powers. You could start with something other than {1, 2}, but look what happens, using the "smallest possible next" rule:
{1, 3} : a.p.s. = {0, 1, 3, 4}, s.p.n. = 5 (because 2 could be added to 1 giving 3, which is already there, and 4 is already there itself)
{1, 3, 5} : a.p.s. = {0, 1, 3, 4, 5, 6, 8, 9}, s.p.n. = 10
{1, 3, 5, 10} ...
This sequence is growing faster than the binary powers, term by term.
If you want a nice Project-Euler-style programming challenge, you could write a routine that takes a set of positive integers and determines the "smallest possible next" positive integer, under the "sums must be unique" constraint.
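A brute-force sketch of such a routine follows. The function names are made up for this example, and the approach is only practical for small sets because the number of subset sums doubles with every element.

```python
def subset_sums(values):
    """Return the list of all subset sums, including the empty sum 0."""
    sums = [0]
    for v in values:
        sums += [s + v for s in sums]
    return sums


def smallest_possible_next(values):
    """Smallest positive integer that keeps all subset sums unique."""
    sums = subset_sums(values)
    existing = set(sums)
    if len(existing) != len(sums):
        raise ValueError("the given set already has duplicate subset sums")
    candidate = 1
    while True:
        # Adding `candidate` creates the sums s + candidate; none of them
        # may collide with a sum that already exists.
        if existing.isdisjoint(s + candidate for s in existing):
            return candidate
        candidate += 1


basis = []
for _ in range(5):
    basis.append(smallest_possible_next(basis))
print(basis)  # [1, 2, 4, 8, 16]
```

Starting from the empty set, the rule reproduces the binary powers, as described in the answer above.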
