What is the difference between the mAP from Mask R-CNN utils.py and the mAP from (MS)COCO?

I trained my Mask R-CNN network with my own data, which I transformed into COCO style for my thesis, and now I want to evaluate my results. I found two methods to do that. The first method is the evaluation from COCO itself; Mask R-CNN shows how to evaluate with the COCO metric in its coco.py file:
https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/coco.py
I basically use the evaluate_coco(...) function in line 342. As a result I get the COCO metric output with average precision and average recall at different settings (see the images below). For the parameter eval_type I use eval_type="segm".
The mAP is what interests me. I know that mAP50 uses an IoU (Intersection over Union) threshold of 0.5, and that the standard COCO mAP is averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
The second method comes from Mask R-CNN itself, in its utils.py:
https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/utils.py
The function is compute_ap(...) in line 715, and it is documented as using an IoU threshold of 0.5. The function returns an mAP value.
This raises the question of which type compute_ap() evaluates. With COCO you can choose between "bbox" and "segm".
I also want to know the difference between the mAP value from the compute_ap(...) function and the mAP50 from COCO, because with the same data I get different results.
Unfortunately I can't upload a better image right now, because I can only get to my university on weekdays and on Friday I was in a hurry and took pictures without checking them, but the mean value over all APs from compute_ap() is 0.91 and I am pretty sure that the AP50 from COCO is 0.81.
Does someone know the difference, or is there no difference?
Is it because of the maxDets=100 parameter? See:
Understanding COCO evaluation "maximum detections"
My pictures have only 4 categories and a maximum of 4 instances per picture.
Or is the way the mAP is calculated different?
EDIT: Now I have a better image with the whole COCO metric for type "segm" and type "bbox", and I compare it with the result from compute_ap(), which is "mAP: 0,41416...". By the way, if there is a "-1" in the COCO result, does this mean there were no instances of this type?
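For reference, here is a minimal sketch of how compute_ap() is typically averaged over a validation set at a single IoU threshold of 0.5; it assumes a model, dataset and config are already set up as in the matterport samples, so those variable names are placeholders. As far as I can tell from utils.py, compute_ap() matches predictions to ground truth via mask overlaps, so it is closest to the "segm" type, but it produces a per-image AP that is then averaged, whereas COCO accumulates detections over the whole dataset per category, which may account for part of the difference:

import numpy as np
from mrcnn import utils
import mrcnn.model as modellib

# Sketch: average per-image AP at IoU 0.5 (assumes `model`, `dataset`,
# and `config` already exist, as in the matterport samples).
APs = []
for image_id in dataset.image_ids:
    # Load the ground truth for this image.
    image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
        dataset, config, image_id, use_mini_mask=False)
    # Run detection on this image.
    r = model.detect([image], verbose=0)[0]
    # Per-image AP at a single IoU threshold of 0.5.
    AP, precisions, recalls, overlaps = utils.compute_ap(
        gt_bbox, gt_class_id, gt_mask,
        r["rois"], r["class_ids"], r["scores"], r["masks"],
        iou_threshold=0.5)
    APs.append(AP)

print("mean of per-image APs:", np.mean(APs))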

Related

Setting the "tpow" and "expcost" arguments in TraMineR::seqdist

I'm actually working on the pathways of inpatients during their hospital stay. These pathways are represented as states sequences (the current medical unit at each time unit) and I'm trying to find typical pathways through clustering algorithms.
I create the distance matrix by using the seqdist function from the R package TraMineR, with the method "OMspell". I've already read the R documentation and the related articles, but I can't find how to set the arguments tpow and expcost.
As the time unit is an hour, I don't want small differences in duration to have a big impact on the clustering result (unlike a transfer to another medical unit, for example). But I don't want duration to have no impact at all either...
Also, is there a proper way to choose their values? Or do I just keep groping around for a good configuration? (I'm using the Dunn, Davies-Bouldin and Silhouette criteria to compare the results of hierarchical clustering, besides the medical opinion on the resulting clusters.)
The parameter tpow is an exponential coefficient applied to transform the actual spell lengths (durations). The default value is 1, for which the spell lengths are taken as they are. With tpow=0 you would just ignore spell durations, and with tpow=0.5 you would consider the square root of the spell lengths.
The expcost parameter is the expansion cost, i.e. the cost for expanding a (transformed) spell length by one unit. In other words, when in the editing of one sequence into the other a spell of length t1 has to be expanded to length t2, it would cost expcost * |t2^tpow - t1^tpow|. With expcost=0 spells in a same state (e.g. AA and AAAAA) would be equivalent whatever their lengths.
With tpow=.5, for example, increasing a spell length from 1 to 2 costs more than increasing a spell length from 3 to 4. If you do not want to give too much importance to small differences in spell lengths, use a low expcost. However, note that expcost applies to the transformed spell lengths, so you may want to adjust it when you change the tpow value.
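As a small numeric illustration of the expansion-cost formula above (plain Python rather than TraMineR code, with made-up values):

# Illustration of the spell-expansion cost described above:
# cost = expcost * |t2**tpow - t1**tpow|
def expansion_cost(t1, t2, tpow=1.0, expcost=0.5):
    """Cost of expanding a spell of length t1 to length t2."""
    return expcost * abs(t2**tpow - t1**tpow)

# With tpow=0.5, going from 1 to 2 costs more than going from 3 to 4:
print(expansion_cost(1, 2, tpow=0.5))  # ~0.21
print(expansion_cost(3, 4, tpow=0.5))  # ~0.13

# With tpow=0, durations are ignored entirely:
print(expansion_cost(1, 5, tpow=0.0))  # 0.0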

What if the FD steps vary w.r.t. output/input?

I am using the finite difference scheme to find gradients.
Let's say I have 2 outputs (y1, y2) and 1 input (x) in a single component, and I know in advance that the sensitivity of y1 with respect to x is not the same as the sensitivity of y2 to x. Thus I could potentially use two different steps for them, as in:
self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')
There is nothing that stops me from doing this (algorithmically), but it is not clear what exactly OpenMDAO's gradient calculation would do in this case.
Does it exchange information between the two cases by looking at the step ratios, or does it simply treat them independently and therefore double the computational time?
I just tested this, and it does the finite difference twice with the two different step sizes, and only saves the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, since the reason for using different step sizes to resolve individual outputs is that you don't trust the accuracy of the outputs at the smaller (or larger) step size.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference and 2 function calls for central difference.
However in this case, you have asked for two different step sizes for two different outputs, both with central difference. So here, you'll end up with 4 function calls to compute all the derivatives. dy1_dx will be computed using the step size of .01 and dy2_dx will be computed with a step size of .05.
There is no crosstalk between the two different FD calls, and you do end up with more function calls than you would have if you just specified a single step size via:
self.declare_partials(of='*', wrt='x', method='fd', step=0.05, form='central')
If the cost is something you can bear, and you get improved accuracy, then you could use this method to get different step sizes for different outputs.
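For concreteness, here is a minimal runnable sketch of the setup being discussed; the component and its outputs are hypothetical, just to show the per-output declare_partials calls with different step sizes:

import openmdao.api as om

class TwoOutputComp(om.ExplicitComponent):
    """Hypothetical component with one input and two outputs of different sensitivity."""
    def setup(self):
        self.add_input('x', val=1.0)
        self.add_output('y1', val=0.0)
        self.add_output('y2', val=0.0)

    def setup_partials(self):
        # A different FD step size for each output, as discussed above.
        self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
        self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')

    def compute(self, inputs, outputs):
        x = inputs['x']
        outputs['y1'] = x ** 2          # smoother output
        outputs['y2'] = 1e3 * x ** 3    # output with a very different scale

prob = om.Problem()
prob.model.add_subsystem('comp', TwoOutputComp(), promotes=['*'])
prob.setup()
prob.run_model()

# Each derivative is computed with its own central-difference step,
# so four extra evaluations of compute() are needed in total.
print(prob.compute_totals(of=['y1', 'y2'], wrt=['x']))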

Cost function of convolutional neural network not serving its intended purpose

So I have built a CNN and now I am trying to get the training of my network to work effectively despite my lack of a formal education on the topic. I have decided to use stochastic gradient descent with a standard mean squared error cost function. As stated in the title, the problem seems to lie within the cost function.
When I use a couple of training examples, I calculate the squared error for each, take the mean, and use that as the full error. There are two output neurons, one for face and one for not-a-face; whichever is higher is the class that is yielded. Essentially, if a training example yields the wrong classification, I calculate the error (with the desired value being the value of the class that was yielded).
Example:
Input an image of a face--->>>
Face: 500
Not face: 1000
So in this case, the network says that the image isn't a face, when in fact it is. The error comes out to:
500 - 1000 = -500
(-500)^2 = 250000 <<-- error
(correct me if I'm doing anything wrong)
As you can see, the desired value is set to the value of the incorrect class that was selected.
Now this is all good (from what I can tell), but here is my issue:
As I perform backprop on the network multiple times, the mean cost over the entire training set falls to 0, but only because all of the weights in the network are going to 0, so all outputs always become 0.
After training:
Input not face->
Face: 0
Not face: 0
--note that if the classes are the same, the first one is selected
(0-0)^2 = 0 <<-- error
So the error is being minimized to 0 (which is good I guess), but obviously not the way we want.
So my question is this:
How do I minimize the gap between the classes when the classification is wrong, but also get the correct class to overshoot the incorrect class so that the correct class is yielded?
//example
Had this: (for input of face)
Face: 100
Not face: 200
Got this:
Face: 0
Not face: 0
Want this: (or something similar)
Face: 300
Not face: 100
I hope this question wasn't too vague...
But any help would be much appreciated!!!
The way you're computing the error doesn't correspond to the standard 'mean squared error'. But, even if you were to fix it, it makes more sense to use a different type of outputs and error that are specifically designed for classification problems.
One option is to use a single output unit with a sigmoid activation function. This neuron would output the probability that the input image is a face. The probability that it's not a face is given by 1 minus this value. This approach will work for binary classification problems. Softmax outputs are another option. You'd have two output units: the first outputs the probability that the input image is a face, and the second outputs the probability that it's not a face. This approach will also work for multi-class problems, with one output unit for each class.
In either case, use the cross entropy loss (also called log loss). Here, you have a target value (face or no face), which is the true class of the input image. The error is the negative log probability that the network assigns to the target.
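As a toy illustration of softmax outputs with cross-entropy loss (plain NumPy, using the made-up scores from the question, not the asker's actual network):

import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, target_index):
    # Negative log of the probability assigned to the true class.
    return -np.log(probs[target_index])

logits = np.array([500.0, 1000.0])            # raw scores: [face, not-face]
probs = softmax(logits)                       # ~[0.0, 1.0]
loss = cross_entropy(probs, target_index=0)   # true class is "face"
print(probs, loss)  # the loss is large because the wrong class got nearly all the probability

Minimizing this loss pushes probability toward the correct class instead of driving all outputs to zero, which is the failure mode described in the question.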
Most neural nets that perform classification work this way. You can find many good tutorials here, and read this online book.

How do I calculate the likelihood that each element in a route traversing a probability graph is correct?

I have an asymmetric directed graph with a set of probabilities (so the likelihood that a person will move from point A to B, or point A to C, etc). Given a route through all the points, I would like to calculate the likelihood that each choice made in the route is a good choice.
As an example, suppose a graph of just 2 points.
// In a matrix, the probabilities might look like this (rows = from, columns = to):
//        A     B
//  A  [ 0.0   0.9 ]
//  B  [ 0.1   0.0 ]
So the probability of moving from A to B is 0.9 and from B to A is 0.1. Given the route A->B, how correct is the first point (A), and how correct is the second point (B)?
Suppose I have a bigger matrix with a route that goes A->B->C->D. So, some examples of what I would like to know:
How likely is it that A comes before B,C, & D
How likely is it that B comes after A
How likely is it that C & D come after B
Basically, at each point, I want to know the likelihood that the previous points come before the current and also the likelihood that the following points come after. I don't need something that is statistically sound. Just an indicator that I can use for relative comparisons. Any ideas?
Update: I see that this question is not useful to everyone, but the answer is really useful to me, so I've tried to make the description of the problem clearer and will include my answer shortly in case it helps someone.
I don't think that's possible efficiently. If there was an algorithm to calculate the probability that a point was in the wrong position, you could simply work out which position was least wrong for each point, and thus calculate the correct order. The problem is essentially the same as finding the optimal route.
The subsidiary question is what the probability is 'of', here. Can the probability be 100%? How would you know?
Part of the reason the travelling salesman problem is hard is that there is no way to know that you have the optimal solution except looking at all the solutions and finding that it is the shortest.
Replace the probability matrix (p) with -log(p); finding the shortest path in that matrix would then solve your problem.
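A quick sketch of that idea (using networkx purely for illustration; the graph and probabilities below are made up):

import math
import networkx as nx

# Turn transition probabilities into -log(p) edge weights so that the most
# probable route becomes the shortest path (this example graph is made up).
probs = {('A', 'B'): 0.9, ('B', 'A'): 0.1,
         ('B', 'C'): 0.6, ('A', 'C'): 0.2}

G = nx.DiGraph()
for (u, v), p in probs.items():
    G.add_edge(u, v, weight=-math.log(p))

path = nx.shortest_path(G, source='A', target='C', weight='weight')
print(path)  # ['A', 'B', 'C'] -- the product 0.9 * 0.6 beats the direct 0.2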
After much thought, I came up with something that suits my needs. It still has the same problem that getting an accurate answer would require checking every possible route. However, in my case, checking the direct routes and the first-level indirect routes is enough to give an idea of how "correct" my answer is.
First I need the confidence for each probability. This is a separate calculation and is contained in a separate matrix (that maps 1-to-1 to the probability matrix). I just take 1.0 - confidenceInterval for each probability.
If I have a route A->B->C->D, I calculate a "correctness indicator" for a point. It looks like I am getting some sort of average of a direct route and the first level of indirect routes.
Some examples:
Denote P(A,B) as probability that A comes before B
Denote C(A,B) as confidence in the probability that A comes before B
Denote P'(A,C) as the probability that A comes before C based on the indirect route A->B->C
At point B, likelihood that A comes before it:
indicator = P(A,B)*C(A,B)/C(A,B)
At point C, likelihood that A & B come before:
P'(A,C) = P(A,B)*P(B,C)
C'(A,C) = C(A,B)*C(B,C)
indicator = [P(A,C)*C(A,C) + P(B,C)*C(B,C) + P'(A,C)*C'(A,C)]/[C(A,C)+C(B,C)+C'(A,C)]
So this gives me an indicator that is always between 0 and 1 and takes the first-level indirect routes into account (from -> indirectPoint -> to). It seems to provide the rough estimation I was looking for. It is not a great answer, but it does provide some estimate and, since nothing else provides anything better, it is suitable.
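Here is a rough sketch of that indicator in Python (the probability and confidence matrices below are made-up examples, and the function name is mine):

import numpy as np

def indicator_before(route, pos, prob, conf):
    """Confidence-weighted indicator that the points route[:pos] come before
    route[pos], using direct probabilities plus first-level indirect routes."""
    current = route[pos]
    num, den = 0.0, 0.0
    for k, prev in enumerate(route[:pos]):
        # Direct term: P(prev, current) weighted by its confidence.
        num += prob[prev][current] * conf[prev][current]
        den += conf[prev][current]
        # First-level indirect terms prev -> mid -> current.
        for mid in route[k + 1:pos]:
            p_ind = prob[prev][mid] * prob[mid][current]
            c_ind = conf[prev][mid] * conf[mid][current]
            num += p_ind * c_ind
            den += c_ind
    return num / den if den > 0 else 0.0

# Example route A->B->C->D encoded as indices 0..3, with made-up values.
prob = np.array([[0.0, 0.9, 0.8, 0.7],
                 [0.1, 0.0, 0.6, 0.8],
                 [0.2, 0.4, 0.0, 0.9],
                 [0.3, 0.2, 0.1, 0.0]])
conf = np.full((4, 4), 0.9)

route = [0, 1, 2, 3]
print(indicator_before(route, 2, prob, conf))  # likelihood that A and B come before C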

Oh no, another Big-O one

I've been doing Big-O recently, and I get the formula OK, but I've written a piece of code that takes an input and returns the time taken to complete a sort. So I have the input size and the time; how do I use this to classify what sort of Big-O it is? I've made graphs and can see which sort they are, but I can't do it using the formula. I'm not strong on maths, which I think is my problem here!
For instance I get:
Size Time Operations
200 2 163648
400 1 162240
800 15 2489456
1600 6 10247376
3200 19 40858160
6400 79 165383984
12800 318 656588080
25600 1274 2624318128
51200 5059 10476803408
102400 20333 41969291968
I know that this is O(n^2) by looking at the graph and comparing, but how do I prove it?
Yes, you can sample a thousand different input sizes, and then try to derive a Big-O value from that, but you shouldn't - not only because it doesn't actually prove anything, but because that isn't the point.
The way to prove O(n^2) is to prove it on the code itself, not through experiments. The actual running time isn't important, because Big-O notation doesn't say anything about that - in simple terms, it only specifies the dominant term of whatever formula you would use to calculate the exact running time, in the sense of the number of operations executed for that function. Constants are thrown away, and so are smaller terms - the actual running time of a function might be 1000n^2+1000000n, but that's still O(n^2).
You can't mathematically prove anything from this table; the complexity might be O(1) if Time remains at 20333 for all larger values.
The best you can do is try fitting several curves to this table and selecting the best fit according to Occam's razor.
You can't prove it by looking at the timings, you can only prove it by analysing the code to see how many steps are performed. The reason for this is that the time taken is a function not only of your program but many other things outside of your control as well.
For example, who can say whether your machine didn't spend an inordinate amount of time in other processes during one particular test run of your program? This sort of thing can be minimised to a point using statistical methods, but the proof requires solid data.
What you can do is look at some of your data points to get support for the contention that it's O(n^2). Have a look at the last four entries:
Input     Time
12800     318
25600     1274     1274 / 318  = 4.006
51200     5059     5059 / 1274 = 3.971
102400    20333    20333 / 5059 = 4.019
You can see that each doubling of the input size has a multiplier effect on the time of about 4, which would tend to indicate an O(n^2) property.
But this is support only. It applies only to that particular range of input values and, as stated, is subject to factors outside your control. Note also that the support would be harder to see if the time function were not a simple one. For example, if the time function were t = n^2/10 + 123n + 123456789, it would be a little harder to figure out.
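As a small sketch of that doubling-ratio check (plain Python, using the timings from the question; the small inputs are too noisy to mean much):

import math

# When the input size doubles, time(2n) / time(n) ~ 2^k for an O(n^k)
# algorithm, so log2 of the ratio estimates the exponent k.
sizes = [200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 102400]
times = [2, 1, 15, 6, 19, 79, 318, 1274, 5059, 20333]

for (n1, t1), (n2, t2) in zip(zip(sizes, times), zip(sizes[1:], times[1:])):
    ratio = t2 / t1
    print(f"n: {n1:>6} -> {n2:>6}   time ratio: {ratio:6.2f}   "
          f"estimated exponent: {math.log2(ratio):5.2f}")

# The early rows are dominated by timer noise, but the later rows settle
# near an exponent of 2, consistent with O(n^2).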
Just comparing the values may not make much sense. However, if you plot a graph using these values (x-axis: input, y-axis: time), you will get a curve, a linear shape, or whatever. Using this information you can predict the Big-O value of that function. Of course there may be (though not always) some interrupts that affect the running of the process, but they do not last for the whole period; that slight overhead should not affect the result.
In order to predict the Big-O value, you will need some calculus knowledge to make the analogy between the shape and the Big-O result.
For example, let's say you got a linear shape and you know that it means O(n). You reached that result because you know the shape of a linear function's graph and your graph looks like it. To get closer to a real proof, you have to draw both your function's curve and the graph of the mathematical function whose shape is closest to your graph.
There are other notations, like Big-Theta and small-omega, that bound your function from above or below. The mathematical function could be both of them, but as a result, your Big-O function is the one closest to that shape.
