BERT Heads Count - bert-language-model

From the literature I read,
Bert Base has 12 encoder layers and 12 attention heads. Bert Large has 24 encoder layers and 16 attention heads.
Why is Bert large having 16 attentions heads ?

The number of attention heads is irrespective of the number of (encoder) layers.
However, there is an inherent tie between the hidden size of each model (768 for bert-base, and 1024 for bert-large), which is explained in the original Transformers paper.
Essentially, the choice made by the authors is that the self-attention block size (d_k) equals the hidden dimension (d_hidden), divided by the number of heads (h), or formally
d_k = d_hidden / h
Since the standard choice seems to be d_k = 64, we can infer the final size from our parameters:
h = d_hidden / d_k = 1024 / 64 = 16
which is exactly the value you are looking at in bert-large.

Related

Competing risk survival random forest with large data

I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R-package randomForestSRC is great for it, however, it is impossible to use more than 100,000 rows due to memory limitation (100'000 uses 40GB of RAM) even though I limit my number of predictors to 15 to 20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at h2o and spark mllib, both of which do not support survival random forests.
Ideally I am looking for a somewhat R-based solution but I am happy to explore anything else if anyone knows a way to use large data + competing risk random forest.
In general, the memory profile for an RF-SRC data set is n x p x 8 on a 64-bit machine. With n=500,000 and p=20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of interanal/external nodes will approximately be 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why we have [5] columns in the $nativeArray to tag the split parameter, and split value. Memory usage for the forest will be thus be 2 * n / 3 * ntree * 5 * 8 = 1.67GB.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In a worst case scenario, if all your times are distinct, and there are no censored events, time.interest = n. So each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue. Then increase it gradually and observe your RAM usage. Note that if you have missing data, then RAM usage will be much larger. Also note that I did not mention other arrays such as the terminal node statistics that also lead to heavy RAM usage. I only considered the ensembles, but the reality is that each terminal node will contain arrays of dimension [time.interest] x 2 for the node specific estimator of the CSH and CIF that is used in creating the forest ensemble.
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.

Seeding square roots on FPGA in VHDL for Fixed Point

I'm attempting to create a fixed-point square root function for a Xilinx FPGA (hence real types are out, and David Bishops ieee_proposed library is also unsupported for XST synthesis).
I've settled on a Newton-Raphson method to calculate the reciprocal square root (as it involves fewer divisions).
One of the remaining dilemmas I have is how to generate the initial seed. I looked at the Fast Inverse Square Root, but it only appears to work for floating point arithmetic.
My best thoughts at the moment are, to take the length of the input value (ie. find the index of the most significant, non-zero bit), halve it crudely and use that as the power for a seed.
I wrote a short test script to quickly check the accuracy (its in Matlab but that's just so I could plot a graph...)
x = 1:2^24;
gen_result = zeros(1,length(x));
seed_vals = zeros(1,length(x));
for i = 1:length(x)
result = 2^-ceil(log2(x(i))/2); %effectively creates seed value from top bit index
seed_vals(i) = 1/result; %Store seed value
for j = 1:6
result = result*(1.5-0.5*x(i)*result^2); %reciprocal root
end
gen_result(i) = 1/result; %single division at the end
end
And unsurprisingly, the seed becomes wildly inaccurate each time a number increases in size, and this increases as the magnitude of the input increases. As a graph this can be seen as:
The red line is the value of the seed, and as can be seen, is increasing increasing in powers of 2.
My question very simple: Are there any other simple methods I could use to generate a seed value for fixed point square root values in VHDL, ideally which don't cause ever increasing amounts of inaccuracy (and hence require more iterations each time the input increases in size).
Any other incidental advise on how to approach finding fixed points square roots in VHDL would be gratefully received!
I realize this is an old question but I did end up here and this was kind of useful so I want to add my bit.
Assuming your Xilinx chip has an embedded multiplier, you could consider this approach to help get a better starting seed. The basic premise is to convert the input integer to fixed point with all fraction bits, and then use the embedded multiplier to scale half of your initial seed value by 0.X (which in hindsight is probably what people mean when they say "normalize to the region [0.5..1)", now that I think about it). It's basically piecewise linear interpolation of your existing seed method. The steps below should translate relatively easily to RTL, as they're just bit-shifts, adds, and one unsigned multiply.
1) Begin with your existing seed value (e.g. for x=9e6, you would generate s=4096 as the seed for your first guess with your "crude halving" method)
2) Right-shift the existing seed value by 1 to get the previous seed value (s_half = s >> 1 = 2048)
3) Left-shift the input until the most significant bit is a 1. In the event you are sqrting 32-bit ints, x_scale would then be 2304000000 = 0x89544000
4) Slice the upper e.g. 18 bits off of x_scale and multiply by an 18-bit version of s_half (I suggest 18 because I happen to know some Xilinx chips have embedded 18x18 multipliers). For this case, the result, x_scale(31 downto 14) = 140625 = 0x22551.
At least, that's what the multiplier thinks - we're going to use fixed point so that it's actually 0b0.100010010101010001 = 0.53644 instead of 140625.
The result of this multiplication will be s_scale = s_half * x_scale(31 downto 14) = 2048 * 140625 = 288000000, but this output is in 18.18 format (18 integer bits, 18 fraction bits). Take the upper 18 bits, and you get s_scale(35 downto 18) = 1098
5) Add the upper 18 bits of s_scale to s_half to get your improved seed, in this case s_improved = 1098+2048 = 3146
Now you can do a few iterations of Newton-Raphson with this seed. For x=9e6, your crude halving approach would give an initial seed of 4096, the fixed-point scale outlined above gives you 3146, and the actual sqrt(9e6) is 3000. This value is half-way between your seed steps, and my napkin math suggests it saved about 3 iterations of Newton-Raphson

Calculate original set size after hash collisions have occurred

You have an empty ice cube tray which has n little ice cube buckets, forming a natural hash space that's easy to visualize.
Your friend has k pennies which he likes to put in ice cube trays. He uses a random number generator repeatedly to choose which bucket to put each penny. If the bucket determined by the random number is already occupied by a penny, he throws the penny away and it is never seen again.
Say your ice cube tray has 100 buckets (i.e, would make 100 ice cubes). If you notice that your tray has c=80 pennies, what is the most likely number of pennies (k) that your friend had to start out with?
If c is low, the odds of collisions are low enough that the most likely number of k == c. E.g. if c = 3, then it's most like that k was 3. However, the odds of a collision are increasingly likely, after say k=14 then odds are there should be 1 collision, so maybe it's maximally likely that k = 15 if c = 14.
Of course if n == c then there would be no way of knowing, so let's set that aside and assume c < n.
What's the general formula for estimating k given n and c (given c < n)?
The problem as it stands is ill-posed.
Let n be the number of trays.
Let X be the random variable for the number of pennies your friend started with.
Let Y be the random variable for the number of filled trays.
What you are asking for is the mode of the distribution P(X|Y=c).
(Or maybe the expectation E[X|Y=c] depending on how you interpret your question.)
Let's take a really simple case: the distribution P(X|Y=1). Then
P(X=k|Y=1) = (P(Y=1|X=k) * P(X=k)) / P(Y=1)
= (1/nk-1 * P(X=k)) / P(Y=1)
Since P(Y=1) is normalizing constant, we can say P(X=k|Y=1) is proportional to 1/nk-1 * P(X=k).
But P(X=k) is a prior probability distribution. You have to assume some probability distribution on the number of coins your friend has to start with.
For example, here are two priors I could choose:
My prior belief is that P(X=k) = 1/2k for k > 0.
My prior belief is that P(X=k) = 1/2k - 100 for k > 100.
Both would be valid priors; the second assumes that X > 100. Both would give wildly different estimates for X: prior 1 would estimate X to be around 1 or 2; prior 2 would estimate X to be 100.
I would suggest if you continue to pursue this question you just go ahead and pick a prior. Something like this would work nicely: WolframAlpha. That's a geometric distribution with support k > 0 and mean 10^4.

Random number generation without using bit operations

I'm writing a vertex shader at the moment, and I need some random numbers. Vertex shader hardware doesn't have logical/bit operations, so I cannot implement any of the standard random number generators.
Is it possible to make a random number generator using only standard arithmetic? the randomness doesn't have to be particularly good!
If you don't mind crappy randomness, a classic method is
x[n+1] = (x[n] * x[n] + C) mod N
where C and N are constants, C != 0 and C != -2, and N is prime. This is a typical pseudorandom generator for Pollard Rho factoring. Try C = 1 and N = 8051, those work ok.
Vertex shaders sometimes have built-in noise generators for you to use, such as cg's noise() function.
Use a linear congruential generator:
X_(n+1) = (a * X_n + c) mod m
Those aren't that strong, but at least they are well known and can have long periods. The Wikipedia page also has good recommendations:
The period of a general LCG is at most
m, and for some choices of a much less
than that. The LCG will have a full
period if and only if:
1. c and m are relatively prime,
2. a - 1 is divisible by all prime factors of m,
3. a - 1 is a multiple of 4 if m is a multiple of 4
Believe it or not, I used newx = oldx * 5 + 1 (or a slight variation of it) in several videogames. The randomness is horrible--it's more of a scrambled sequence than a random generator. But sometimes that's all you need. If I recall correctly, it goes through all numbers before it repeats.
It has some terrible characteristics. It doesn't ever give you the same number twice in a row. A few of us did a bunch of tests on variations of it and we used some variations in other games.
We used it when there was no good modulo available to us. It's just a shift by two and two adds (or a multiply by 5 and one add). I would never use it nowadays for random numbers--I'd use an LCG--but maybe it would work OK for a shader where speed is crucial and your instruction set may be limited.

Generating sorted random ints without the sort? O(n)

Just been looking at a code golf question about generating a sorted list of 100 random integers. What popped into my head, however, was the idea that you could generate instead a list of positive deltas, and just keep adding them to a running total, thus:
deltas: 1 3 2 7 2
ints: 1 4 6 13 15
In fact, you would use floats, then normalise to fit some upper limit, and round, but the effect is the same.
Although it wouldn't make for shorter code, it would certainly be faster without the sort step. But the thing I have no real handle on is this: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
Edit: A sample script:
import random,sys
running = 0
max = 1000
deltas = [random.random() for i in range(0,11)]
floats = []
for d in deltas:
running += d
floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Whose output (fair dice roll) was:
[24, 71, 133, 261, 308, 347, 499, 543, 722, 852]
UPDATE: Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.
So you are asking if the numbers generated in this way are going to be uniformly distributed.
You are generating a series:
yj = ∑i=0j ( xi / A )
where A is the sum of all xi. xi is the list of (positive) deltas.
This can be done iff xi are exponentially distributed (with any fixed mean). So, if xi are uniformly distributed, the resulting yj will not be uniformly distributed.
Having said that, it's fairly easy to generate exponential xi values.
One example would be:
sum := 0
for I = 1 to N do:
X[I] = sum = sum - ln(RAND)
sum = sum - ln(RAND)
for I = 1 to N do:
X[I] = X[I]/sum
and you will have your random numbers sorted in the range [0, 1).
Reference: Generating Sorted Lists of Random Numbers. The paper has other (faster) algorithms as well.
Of course, this generates floating-point numbers. For uniform distribution of integers, you can replace sum above by sum/RANGE in the last step (i.e., the R.H.S becomes X[I]*RANGE/sum, and then round the numbers to the nearest integer).
A uniform distribution has an upper and a lower bound. If you use your proposed method, and your deltas happen to be chosen large enough that you run into the upper bound before you have generated all your numbers, what would your algorithm do next?
Having said that, you may want to investigate the Poisson distribution, which is the distribution of interval times between random events occurring with a given average frequency.
If you take the number range of being 1 to 1000, and you have to use 100 of these numbers, the delta will have to be as a minimum 10, otherwise you can not reach the 1000 mark. How about some working to demonstrate it in action...
The chance of any given number in an evenly distributed random selection is 100/1000 e.g. 1/10 - no shock there, take that as the basis.
Assuming you start using a delta and that delta is just 10.
The odds of getting the number 1 is 1/10 - seems fine.
The odds of getting the number 2 is 1/10 + (1/10 * 1/10) (because you could hit 2 deltas of 1 in a row, or just hit a 2 as the first delta.)
The odds of getting the number 3 is 1/10 + (1/10 * 1/10 * 1/10) + (1/10 * 1/10) + (1/10 * 1/10)
The first case was a delta of 3, the second was hitting 3 deltas of 1 in a row, the third case would be a delta of 1 followed by a 2, and the fourth case was a delta of 2 followed by a 1.
For the sake of my fingers typing, we won't generate the combinations that hit 5.
Immediately the first few numbers have a greater percentage chance than the straight random.
This could be altered by changing the delta value so the fractions are all different, but I do not believe you could find a delta that produced identical odds.
To give an analogy that might just sink it, if you consider your delta as just 6 and you run that twice it is the equivalent of throwing 2 dice - each of the deltas is independant, but you know that 7 has a higher chance of being selected than 2.
I think it will be extremely similar but the extremes will be different because of the normalization. For example, 100 numbers chosen at random between 1 and 100 could all be 1. However, 100 numbers created using your system could all have deltas of 0.01 but when you normalize them you'll scale them up to be in the range 1 -> 100 which will mean you'll never get that strange possibility of a set of very low numbers.
Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.
So the new version of the code sample in the question would be:
import random,sys
running = 0
max = 1000
deltas = [random.expovariate(1.0) for i in range(0,11)]
floats = []
for d in deltas:
running += d
floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Note the use of random.expovariate(1.0), a Python exponential distribution random number generator (very useful!). Here it's called with a mean of 1.0, but since the script normalises against the last number in the sequence, the mean itself doesn't matter.
Output (fair dice roll):
[11, 43, 148, 212, 249, 458, 539, 725, 779, 871]
Q: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
A: Each delta will be uniformly distributed. The central limit theorem tells us that the distribution of a sum of a large number of such deviates (since they have a finite mean and variance) will tend to the normal distribution. Hence the later deviates in your sequence will not be uniformly distributed.
So the short answer is "no". Afraid I cannot give a simple solution without doing algebra I don't have time to do today!
The reference (1979) in Alok's answer is interesting. It gives an algorithm for generating the uniform order statistics not by addition but by successive multiplication:
max = 1.
for i = N downto 1 do
out[i] = max = max * RAND^(1/i)
where RAND is uniform on [0,1). This way you don't have to normalize at the end, and in fact don't even have to store the numbers in an array; you could use this as an iterator.
The Exponential distribution: theory, methods and applications
By N. Balakrishnan, Asit P. Basu gives another derivation of this algorithm on page 22 and credits Malmquist (1950).
You can do it in two passes;
in the first pass, generate deltas between 0 and (MAX_RAND/n)
in the second pass, normalise the random numbers to be within bounds
Still O(n), with good locality of reference.

Resources