Github page
Looking generate_anchor_base method, which is Faster R-CNN util method in ChainerCV.
What is the base_size = 16? I saw in the Documentation that it is
The width and the height of the reference window.
But what does "reference window" mean?
Also it says that anchor_scales=[8, 16, 32] are the areas of the anchors but I thought that that the areas are (128, 256, 512)
Another question:
If the base size is 16 and h = 128 and w=128, Does that mean anchor_base[index, 0] = py - h / 2 is a negative value?
since py = 8 and and h/2 = 128/2
The method is a util function of Faster R-CNN, so I assume you understood what is the "anchor" proposed in Faster R-CNN.
"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" https://arxiv.org/abs/1506.01497
base_size and anchor_scales determines the size of the anchor.
For example, when base_size=16 and anchor_scales=[8, 16, 32] (and ratio=1.0), height and width of the anchor will be 16 * [8, 16, 32] = (128, 256, 512), as you expected.
ratio determines the height and width aspect ratio.
(I might be wrong in below paragraph, please correct if I'm wrong.)
I think base_size need to be set as the size of the current hidden layer's scale. In the chainercv Faster R-CNN implementation, extractor's feature is fed into rpn (region proposal network) and generate_anchor_base is used in rpn. So you need to take care what is the feature of extractor's output. chainercv uses VGG16 as the feature extractor, and conv5_3 layer is used as extracted feature (see here), this layer is a place where max_pooling_2d is applied 4 times, which results 2^4=16 times smallen feature.
For the another question, I think your understanding is correct, py - h / 2 will be negative value. But this anchor_base value is just a relative value. Once anchor_base is prepared at the initialization of model (here), actual (absolute value) anchor is created in each forward call (here) in _enumerate_shifted_anchor method.
Related
I am monitoring an audio source and visualizing the power of each channel. I get a number out of the api (averagePowerForChannel, but the language/platform shouldn't be important for this problem).
When I add both numbers together I have a scale from -240...0. This makes sense as this is the decibel range.
I transform this scale to a linear representation of the same numbers from 0...1. (I understand that decibels are logarithmic, I leave that alone and just map the scale linearly)
I then give the 0...1 value to an alpha channel that nicely represents the audio being played.
The problem is it's not showing enough change aesthetically. The value shifts slightly and usually hovers around 80:
alpha: 0.820713937282562
alpha: 0.816978693008423
alpha: 0.830410122871399
...
As you might imagine, this just creates a mild flicker.
Instead I'd like to accentuate the peaks of the audio. I have thrown some different methods at it:
// var alpha = 1 / (1 + exp(1-linear)) // never gets fully bright, sits at about .45
// var alpha = 1 - exp2(-linear) // stays around .45
// var alpha = linear / linear + 1
These do not get me a good result, but then again I don't have any idea what I was trying to do.
Goal:
low values on the range get pushed to zero or near zero (could even shift the range down 0.2 after the curve is calculated)
Mid values are pushed lower
High values have their differences accentuated (eg: .83 is shifted very close to 1, but .81 is shifted to .5)
I think I might want an exponential curve? I'm not sure. This is a very specific problem with known inputs so a magic number solution is acceptable.
I get a satisfactory visual by shifting the range to an interesting area, then using an exponential curve to emphasize changes from there on:
var alpha = volume / maxVolume
alpha = alpha - 0.5 // Shift the range over to the area with interesting differences in our source tracks
alpha = pow(alpha, 3) // Emphases the changes in this range
alpha *= 10 // Fix the decimal place
Will accept a better/more pure answer--for this I just wiggled numbers until they got me a good visual result. I'm sorry for grossing out the CS folks here :)
The best answer may be frequency isolation, but there is enough interesting difference to make a good visual without it.
Not sure why you don't invert the logarthmic scale:
decibels = 10 * log10( value );
the inverse is just algebra:
value = pow(10.0, decibels/10);
Observe that since your decibels are between -240 and 0, the value is between 0 (exclusive) and 1 (inclusive). That should ensure your values are more sparsely distributed. However if it still isn't, then your audio configuration might not be detecting significant changes in average amplitude - a not-so-unlikely possibility. In that case you might have to look at decomposing the audio into particular frequencies and looking at the amplitude of each frequency.
I'm trying to teach myself some machine learning, and have been using the MNIST database (http://yann.lecun.com/exdb/mnist/) do so. The author of that site wrote a paper in '98 on all different kinds of handwriting recognition techniques, available at http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf.
The 10th method mentioned is a "Tangent Distance Classifier". The idea being that if you place each image in a (NxM)-dimensional vector space, you can compute the distance between two images as the distance between the hyperplanes formed by each where the hyperplane is given by taking the point, and rotating the image, rescaling the image, translating the image, etc.
I can't figure out enough to fill in the missing details. I understand that most of these are indeed linear operators, so how does one use that fact to then create the hyperplane? And once we have a hyperplane, how do we take its distance with other hyperplanes?
I will give you some hints. You need some background knowledge in image processing. Please refer to 2,3 for details.
2 is a c implementation of tangent distance
3 is a paper that describes tangent distance in more details
Image Convolution
According to 3, the first step you need to do is to smooth the picture. Below we show the result of 3 different smooth operations (check section 4 of 3) (The left column shows the result images, the right column shows the original images and the convolution operators). This step is to map the discrete vector to continuous one so that it is differentiable. The author suggests to use a Gaussian function. If you need more background about image convolution, here is an example.
After this step is done, you have calculated the horizontal and vertical shift:
Calculating Scaling Tangent
Here I show you one of the tangent calculations implemented in 2 - the scaling tangent. From 3, we know the transformation is as below:
/* scaling */
for(k=0;k<height;k++)
for(j=0;j<width;j++) {
currentTangent[ind] = ((j+offsetW)*x1[ind] + (k+offsetH)*x2[ind])*factor;
ind++;
}
In the beginning of td.c in 2's implementation, we know the below definition:
factorW=((double)width*0.5);
offsetW=0.5-factorW;
factorW=1.0/factorW;
factorH=((double)height*0.5);
offsetH=0.5-factorH;
factorH=1.0/factorH;
factor=(factorH<factorW)?factorH:factorW; //min
The author is using images with size 16x16. So we know
factor=factorW=factorH=1/8,
and
offsetH=offsetW = 0.5-8 = -7.5
Also note we already computed
x1[ind] = ,
x2[ind] =
So that, we plug in those constants:
currentTangent[ind] = ((j-7.5)*x1[ind] + (k-7.5)*x2[ind])/8
= x1 * (j-7.5)/8 + x2 * (k-7.5)/8.
Since j(also k) is an integer between 0 and 15 inclusive (the width and the height of the image are 16 pixels), (j-7.5)/8 is just a fraction number between -0.9375 to 0.9375.
So I guess (j+offsetW)*factor is the displacement for each pixel, which is proportional to the horizontal distance from the pixel to the center of the image. Similarly you know the vertical displacement (k+offsetH)*factor.
Calculating Rotation Tangent
Rotation tangent is defined as below in 3:
/* rotation */
for(k=0;k<height;k++)
for(j=0;j<width;j++) {
currentTangent[ind] = ((k+offsetH)*x1[ind] - (j+offsetW)*x2[ind])*factor;
ind++;
}
Using the conclusion from previous, we know (k+offsetH)*factor corresponds to y. Similarly - (j+offsetW)*factor corresponds to -x. So you know that is exactly the formula used in 3.
You can find all other tangents described in 3 implemented at 2. I like the below image from 3, which clearly shows the displacements effect of different transformation tangents.
Calculating the tangent distance between images
Just follow the implementation in tangentDistance function:
// determine the tangents of the first image
calculateTangents(imageOne, tangents, numTangents, height, width, choice, background);
// find the orthonormal tangent subspace
numTangentsRemaining = normalizeTangents(tangents, numTangents, height, width);
// determine the distance to the closest point in the subspace
dist=calculateDistance(imageOne, imageTwo, (const double **) tangents, numTangentsRemaining, height, width);
I think the above should be enough to get you started and if anything is missing, please read 3 carefully and see corresponding implementations in 2. Good luck!
"Simple" question that I can't find the answer to -- What does -webkit-perspective actually do mathematically? (I know the effect it has, it basically acts like a focal-length control) e.g. what does -webkit-perspective: 500 mean?!?
I need to find the on-screen location of something that's been moved using, among other things, -webkit-perspective
The CSS 3D Transforms Module working draft gives the following explanation:
perspective(<number>)
specifies a perspective projection matrix. This matrix maps a viewing cube onto a pyramid whose base is infinitely far away from the
viewer and whose peak represents the viewer's position. The viewable
area is the region bounded by the four edges of the viewport (the
portion of the browser window used for rendering the webpage between
the viewer's position and a point at a distance of infinity from the
viewer). The depth, given as the parameter to the function, represents
the distance of the z=0 plane from the viewer. Lower values give a
more flattened pyramid and therefore a more pronounced perspective
effect. The value is given in pixels, so a value of 1000 gives a
moderate amount of foreshortening and a value of 200 gives an extreme
amount. The matrix is computed by starting with an identity matrix and
replacing the value at row 3, column 4 with the value -1/depth. The
value for depth must be greater than zero, otherwise the function is
invalid.
This is something of a start, if not entirely clear. The first sentence leads me to believe the perspective projection matrix article on Wikipedia might be of some help, although in the comments on this post it is revealed there might be some slight differences between the CSS Working Group's conventions and those found in Wikipedia, so please check those out to save yourself a headache.
Check out http://en.wikipedia.org/wiki/Perspective_projection#Diagram
After reading the previous comments and doing some research and testing I'm pretty sure this is correct.
Notice that this is same for the Y coord too.
Transformed X = Original X * ( Perspective / ( Perspective - Z translation ) )
eg.
Div is 500px wide
Perspective is 10000px
Transform is -5000px in Z direction
Transformed Width = 500 * ( 10000 / ( 10000 - ( -5000 ) )
Transformed Width = 500 * ( 10000 / 15000) = 500 * (2/3) = 333px
#Domenic Oddly enough, the description "The matrix is computed by starting with an identity matrix and replacing the value at row 3, column 4 with the value -1/depth." has already been removed from the The CSS 3D Transforms Module working draft now. Perhaps there might have been some inaccuracies with this description.
Well, as to the question what does the number in perspective(<number>) means, I think it could be seen as the distance between the position of the imagined camera and your computer screen.
This return below is defined as a gaussian falloff. I am not seeing e or powers of 2, so I am not sure how this is related to the Gaussian falloff, or if it is the wrong kind of fallout for me to use to get a nice smooth deformation on my mesh:
Mathf.Clamp01 (Mathf.Pow (360.0, -Mathf.Pow (distance / inRadius, 2.5) - 0.01))
where Mathf.Clamp01 returns a value between 0 and 1.
inRadius is the size of the distortion and distance is determined by:
sqrMagnitude = (vertices[i] - position).sqrMagnitude;
// Early out if too far away
if (sqrMagnitude > sqrRadius)
continue;
distance = Mathf.Sqrt(sqrMagnitude);
vertices is a list of mesh vertices, and position is the point of mesh manipulation/deformation.
My question is two parts:
1) Is the above actually a Gaussian falloff? It is expontential, but there does not seem to be the crucial e or power of 2... (Updated - I see how the graph seems to decrease smoothly in a Gaussian-like way. Perhaps this function is not the cause for problem 2 below)
2) My mesh is not deforming smoothly enough - given the above parameters, would you recommend a different Gaussian falloff?
Don't know about meshes etc. but lets see that math:
f=360^(-0.1- ((d/r)^2.5) ) looks similar enough to gausian function to make a "fall off".
i'll take the exponent apart to show a point:
f= 360^( -(d/r)^2.5)*360^(-0.1)=(0.5551)*360^( -(d/r)^2.5)
if d-->+inf then f-->0
if d-->+0 then f-->(0.5551)
the exponent of 360 is always negative (assuming 'distance' and 'inRadius' are always positive) and getting bigger (more negative) almost cubicly ( power of 2.5) with distance thus the function is "falling off" and doing it pretty fast.
Conclusion: the function is not Gausian because it behaves badly for negative input and probably for other reasons. It does exibits the "fall off" behavior you are looking for.
Changing r will change the speed of the fall-off. When d==r the f=(1/360)*0.5551.
The function will never go over 0.5551 and below zero so the "clipping" in the code is meaningless.
I don't see any see any specific reason for the constant 360 - changing it changes the slope a bit.
cheers!
I need to place 1 to 100 nodes (actually 25px dots) on a html5 canvas. I need to make them look randomly distributed so using some kind of grid is out. I also need to ensure these dots are not touching or overlapping. I would also like to not have big blank areas. Can someone tell me what this kind of algorithm is called? A reference to an open source project that does this would also be appreciated.
Thanks all
Guido
What you are looking for is called a Poisson-disc distribution. It occurs in nature in the distribution of photoreceptor cells on your retina. There is a great article about this by Mike Bostock (StackOverflow profile) called Visualizing Algorithms. It has JavaScript demos and a lot of code to look at.
In the interest of doing more then dropping a link into the answer, I will try to give a brief summary of the article:
Mitchell's best-candidate algorithm
A simple approximation known as Mitchell’s best-candidate algorithm. It is easy to implement both crowds some spaces and leaves gaps in other. The algorithm adds new points one at a time. For each new sample, the best-candidate algorithm generates a fixed number of candidates, say 10. The point furthest from any other point is added to the set and the process is repeated until the desired density is achieved.
Bridson's Algorithm
Bridson’s algorithm for Poisson-disc sampling (original paper pdf) scales linearly and is easy to implement as well. This algorithm grows from an initial point and (IMHO) is quite fun to watch (again see Mike Bostock's article). All points in the set are either active or inactive. all points are added as active. One point is chosen from the active set and some number of candidate points are generated in the annulus (a.k.a ring) that extends from the sample with the inner circle having a radius r and the outer circle having a radius 2r. Candidate sample less then r distance away from any point in the FinalSet are rejected. Once a sample is found that is not rejected it is added the the FinalSet. If all the candidate sample are rejected the original point is marked as inactive on the assumption that is has so many neighboring points that no more can be added around it. When all samples are inactive the algorithm terminates.
A grid of size r/√2 can be used to greatly increase the speed of checking candidate points. Only one point may ever be in a grid square and only a limited number of adjacent squares need to be checked.
The easiest way would be to just generate random (x, y) coordinates for each one, repeating if they are touching or overlapping.
Pseudocode:
do N times
{
start:
x = rand(0, width)
y = rand(0, height)
for each other point, p
if distance(p.x, p.y, x, y) < radius * 2
goto start
add_point(x, y);
}
This is O(n^2), but if n is only going to be 100 then that's fine.
I don't know if this is a named algorithm, but it sounds like you could assign each node a position on a “grid”, then pick a random offset. That would give the appearance of some chaos while still guaranteeing that there are no big empty spaces.
For example:
node.x = node.number / width + (Math.random() - 0.5) * SOME_SCALE;
node.y = node.number % height + (Math.random() - 0.5) * SOME_SCALE;
Maybe you could use a grid of circles and place one 25px-dot in every circle? Wouldn't really be random, but look good.
Or you could place dots randomly and then make empty areas attract dots and give dots a limited-range-repulsion, but that is maybe too complicated and takes too much CPU time for this simple task.