Weights & Biases - How can I interpret the graphs when training BERT - bert-language-model

Can someone help me to understand the amazing graphs generated by Weights & Biases tools when you are training a BERT model?
How can I interpret the above image? I don't know what the dispersion grey means, nor if the concentration in the blue region is good or bad.
Thanks in advance.

So those charts show the histograms of the gradients, per time step.
Take the leftmost chart, layer.10 weights. In very first slice at Step 0, the grey shading tells you that the gradients for that layer had values between ~ -40 and +40. The blue parts however tell you that most of those gradients were between -2 and +2 (roughly).
So the shading represents the count of gradients in that particular histogram bin, for that particular time step.
Now interpreting gradients can be tricky sometimes, but generally I find these plots useful to check that your gradients haven't exploded (big values on the y-axis) or collapsed (concentrated blue around 0 with little to no deviation). For example if you try train with a very high learning rate you should see the values on the y-axis go into the 100s or 1000s, indicating that your gradients are huge.
One final tip would be to focus more on the gradients from the weights as opposed to the biases as this can be more informative about what your model is doing.

Related

pointwise envelopes not including Theoretical line Foxall J

I am computing pointwise envelopes for the Foxall's J function to investigate the whether some point patterns of interest are clustered, avoid or are independent from other point patterns or polygons.
I am doing this using spatstat with a syntax similar to this:
evelope(my_pattern_of_interest, fun=Jfox, funargs=list(Y=my_other_patten),...)
I am calculating the envelopes for several replicated patterns (i.e., different transects) and then plotting the pooled (i.e., pool) the envelopes.
So far, whenever the Jfox is calculated between 2 point patterns, the shaded area representing the simulation envelope includes the theoretical line (i.e., represented by red dashes), as in this example:
Figure 1
Instead, when Jfox is calculated between a point pattern and polygons, it is frequent that the envelope area does not include the Theroretical line (at least in some regions). Like in this example:
Figure 2
What does this means?
From my understanding so far, if - or rather "where" since these are pointwise envelopes - the observed line (solid black) is within the shaded area, it means that the observed pattern (say clustering as in Figure 2) is not significantly different from what could be observed at random. Alternatively, if/where the observed line is out of the shaded area, then I have a significant difference.
What does the fact that the theoretical line lays outside the shaded area mean? Is it giving me another piece of information? How should I interpret this?
Thank you.

fitting a L shape (corner) to points to remove outliers

I am trying to extract the length and width from a set lidar sensor points (pink) as shown in the image below. The points circled in blue and white are actually noise which I wish to eliminated. [The orange box is the length and width I currently have calculated from the points. As seen, the calculated width is much 1/3 wider than it is supposed to be, due to the noisy points i blue and white]
I've read some approaches to do corner/rectangle fitting, then discarding x% of the poorest fitting points. This would help me to get rid of the circled points. But so far, after many searches I still cannot find any concrete implementations on how to do this fitting.
Can any provide any suggestions how I can go about doing this?

How to smooth hand drawn lines?

So I am using Kinect with Unity.
With the Kinect, we detect a hand gesture and when it is active we draw a line on the screen that follows where ever the hand is going. Every update the location is stored as the newest (and last) point in a line. However the lines can often look very choppy.
Here is a general picture that shows what I want to achieve:
With the red being the original line, and the purple being the new smoothed line. If the user suddenly stops and turns direction, we think we want it to not exactly do that but instead have a rapid turn or a loop.
My current solution is using Cubic Bezier, and only using points that are X distance away from each other (with Y points being placed between the two points using Cubic Bezier). However there are two problems with this, amongst others:
1) It often doesn't preserve the curves to the distance outwards the user drew them, for example if the user suddenly stop a line and reverse the direction there is a pretty good chance the line won't extend to point where the user reversed the direction.
2) There is also a chance that the selected "good" point is actually a "bad" random jump point.
So I've thought about other solutions. One including limiting the max angle between points (with 0 degrees being a straight line). However if the point has an angle beyond the limit the math behind lowering the angle while still following the drawn line as best possible seems complicated. But maybe it's not. Either way I'm not sure what to do and looking for help.
Keep in mind this needs to be done in real time as the user is drawing the line.
You can try the Ramer-Douglas-Peucker algorithm to simplify your curve:
https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm
It's a simple algorithm, and parameterization is reasonably intuitive. You may use it as a preprocessing step or maybe after one or more other algorithms. In any case it's a good algorithm to have in your toolbox.
Using angles to reject "jump" points may be tricky, as you've seen. One option is to compare the total length of N line segments to the straight-line distance between the extreme end points of that chain of N line segments. You can threshold the ratio of (totalLength/straightLineLength) to identify line segments to be rejected. This would be a quick calculation, and it's easy to understand.
If you want to take line segment lengths and segment-to-segment angles into consideration, you could treat the line segments as vectors and compute the cross product. If you imagine the two vectors as defining a parallelogram, and if knowing the area of the parallegram would be a method to accept/reject a point, then the cross product is another simple and quick calculation.
https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/Determinant/Determinant/node4.html
If you only have a few dozen points, you could randomly eliminate one point at a time, generate your spline fits, and then calculate the point-to-spline distances for all the original points. Given all those point-to-spline distances you can generate a metric (e.g. mean distance) that you'd like to minimize: the best fit would result from eliminating points (Pn, Pn+k, ...) resulting in a spline fit quality S. This technique wouldn't scale well with more points, but it might be worth a try if you break each chain of line segments into groups of maybe half a dozen segments each.
Although it's overkill for this problem, I'll mention that Euler curves can be good fits to "natural" curves. What's nice about Euler curves is that you can generate an Euler curve fit by two points in space and the tangents at those two points in space. The code gets hairy, but Euler curves (a.k.a. aesthetic curves, if I remember correctly) can generate better and/or more useful fits to natural curves than Bezier nth degree splines.
https://en.wikipedia.org/wiki/Euler_spiral

computer vision: segmentation setup. Graph cut potentials

I have been trying to teach myself some simple computer vision algorithms and am trying to solve a problem where I have some noise corrupted image and all I am trying to do is separate the black background from the foreground which has some signal. Now, the background RGB channels are not all completely zero as they can have some noise. However, the human eye can easily discern the foreground from the background.
So, what I did was use the SLIC algorithm to break the image down into super pixels. The idea being that since the image is noise corrupted, doing statistics on the patches might result in better classification of background and foreground because of higher SNR.
After this, I get around 100 patches which should have similar profile and the result of SLIC seems reasonable. I have been reading about graph cuts (the Kolmogorov paper) and it seemed like something nice to try for the binary problem I have. So, I constructed a graph which is a first order MRF and I have edges between the immediate neighbours (4-connected graph).
Now, I was wondering what possible unary and binary terms I can use here to do my segmentation. So, I was thinking for the unary term, I can model it as a simple Gaussian where the background should have a zero mean intensity and the foreground should have some non-zero mean. Although, I am struggling to figure out how to encode this. Should I just assume some noise variance and compute probabilities directly using patch statistics?
Similarly, for neighbouring patches I do want to encourage them to take similar label but I am not sure what binary term I can design that reflects that. Seems just the difference between the label (1 or 0) seems weird...
Sorry for the long-winded question. Hoping someone can give some helpful hint on how to start.
You could build your CRF model over superpixels, such that a superpixel has a connection to another superpixel if it is a neighbour of it.
For your statistical model Pixel Wise Posteriors are simple and cheap to compute.
So, I suggest the following for the unary terms of the CRF:
Build foreground and background histograms over texture per pixel(assuming you have a mask, or reasonable amount of marked foreground pixels(note, not superpixels)).
For each superpixel, make an independence assumption over pixels within it, such that a superpixels likelihood of being either foreground or background is the product over each observation in the superpixel(in practice, we sum logs). The individual likelihood terms come from the histograms that you generated.
Compute the posterior for foreground as the cumulative likelihood described above for foreground divided by the sum of the cumulative likelihoods of both. Similar for background.
The pairwise terms between superpixels can be as simple as the difference between the mean observed textures(pixelwise) for each passed through a kernel, such as the Radial Basis Function.
Alternatively, you could compute histograms over each superpixels observed texture(again, pixel wise) and compute the Bhattacharyya Distance between each neighbouring pair of superpixels.

Minimising interpolation error between two data sets

In the top of the diagrams below we can see some value (y-axis) changing over time (x-axis).
As this happens we are sampling the value at different and unpredictable times, also we are alternating the sampling between two data sets, indicated by red and blue.
When computing the value at any time, we expect that both red and blue data sets will return similar values. However as shown in the three smaller boxes this is not the case. Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
Initially I used linear interpolation to obtain a value, next I tried using Catmull-Rom interpolation. The former results in a values come close together and then drift apart between each data point; the latter results in values which remain closer, but where the average error is greater.
Can anyone suggest another strategy or interpolation method which will provide greater smoothing (perhaps by using a greater number of sample points from each data set)?
I believe what you ask is a question that does not have a straight answer without further knowledge on the underlying sampled process. By its nature, the value of the function between samples can be merely anything, so I think there is no way to assure the convergence of the interpolations of two sample arrays.
That said, if you have a prior knowledge of the underlying process, then you can choose among several interpolation methods to minimize the errors. For example, if you measure the drag force as a function of the wing velocity, you know the relation is square (a*V^2). Then you can choose polynomial fitting of the 2nd order and have pretty good match between the interpolations of the two serieses.
Try B-splines: Catmull-Rom interpolates (goes through the data points), B-spline does smoothing.
For example, for uniformly-spaced data (not your case)
Bspline(t) = (data(t-1) + 4*data(t) + data(t+1)) / 6
Of course the interpolated red / blue curves depend on the spacing of the red / blue data points,
so cannot match perfectly.
I'd like to quote Introduction to Catmull-Rom Splines to suggest not using Catmull-Rom for this interpolation task.
One of the features of the Catmull-Rom
spline is that the specified curve
will pass through all of the control
points - this is not true of all types
of splines.
By definition your red interpolated curve will pass through all red data points and your blue interpolated curve will pass through all blue points. Therefore you won't get a best fit for both data sets.
You might change your boundary conditions and use data points from both data sets for a piecewise approximation as shown in these slides.
I agree with ysap that this question cannot be answered as you may be expecting. There may be better interpolation methods, depending on your model dynamics - as with ysap, I recommend methods that utilize the underlying dynamics, if known.
Regarding the red/blue samples, I think you have made a good observation about sampled and interpolated data sets and I would challenge your original expectation that:
When computing the value at any time, we expect that both red and blue data sets will return similar values.
I do not expect this. If you assume that you cannot perfectly interpolate - and particularly if the interpolation error is large compared to the errors in samples - then you are certain to have a continuous error function that exhibits largest errors longest (time) from your sample points. Therefore two data sets that have differing sample points should exhibit the behaviour you see because points that are far (in time) from red sample points may be near (in time) to blue sample points and vice versa - if staggered as your points are, this is sure to be true. Thus I would expect what you show, that:
Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
(If you do not have information about underlying dynamics (except frequency content), then Giacomo's points on sampling are key - however, you need not interpolate if looking at info below Nyquist.)
When sampling the original continuous function, the sampling frequency should comply to the Nyquist-Shannon sampling theorem, otherwise the sampling process introduces an error (also known as aliasing). The error, being different in the two datasets, results in a different value when you interpolate.
Therefore, you need to know the highest frequency B of the original function and then collect samples with a frequency at least 2B. If your function has very high frequencies and you cannot sample that fast, you should at least try to filter them away before sampling.

Resources