Precision-recall plot (information retrieval)

I'd like to make sure that I plotted the precision-recall curve correctly. I have the following data:
recall = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
precision = [1.00, 1.00, 0.80, 0.70, 0.80, 0.65, 0.60, 0.72, 0.60, 0.73, 0.75]
interpolated_precision = [1.00, 1.00, 0.80, 0.80, 0.80, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75]
and prepared the graph shown below:
[figure: my precision-recall curve]
I'm not sure it is correct, since I have seen figures with jiggles. An example is here:
[figure: example precision-recall curve with a sawtooth pattern]
I would be glad if anyone can confirm whether it is wrong or not.

The jagged lines / sawtooth pattern you usually see is more common with more data points (the example figure has at least 20 or so, versus exactly 11 in yours) that come from actual search results. You said nothing about where your data points come from.
The reason the P-R figure often looks jagged is that every increase in recall is usually accompanied by a reduction in precision, at least temporarily, due to the likely addition of false positives. This is also the case in your figure; however, your dips are smaller and your precision remains high throughout.
However, there are two clear errors in your figure, in the downward shifts of both precision and interpolated precision: you are drawing the downward shifts as diagonal lines.
For precision, any downward shift should always be a vertical line. You will not get this from a simple x-y plot of the points you described, e.g. in Excel. These vertical lines contribute to the "jagged" look.
For interpolated precision, the graph will always consist of straight segments that are either horizontal or vertical. The definition of interpolated precision essentially requires this: the interpolated precision at recall level r is the maximum precision at any recall level r' >= r (see e.g. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html for the correct definition of interpolated precision at any point of recall).
The key here is to realize that the data you are describing should not be graphed as independent observations, but rather as defining the P-R values for the rest of the graph in a particular way.
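As a minimal sketch in base R (using the data from the question; the step rendering via type = "s" is my choice of plotting device, not something prescribed by the question), the interpolated curve can be drawn so that all transitions come out horizontal or vertical:

# Data from the question
recall <- seq(0, 1, by = 0.1)
precision <- c(1, 1, 0.8, 0.7, 0.8, 0.65, 0.6, 0.72, 0.6, 0.73, 0.75)
# Interpolated precision at recall r = max precision at any recall r' >= r
interp <- rev(cummax(rev(precision)))
plot(recall, precision, ylim = c(0, 1), xlab = "Recall", ylab = "Precision")
# type = "s" draws stair steps: horizontal runs joined by vertical drops
lines(recall, interp, type = "s", col = "red")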


How do I make an xspline curve symmetric in R?

I'm trying to draw a shape using the xspline function in R.
Using a set of control points, I can get the shape but it is asymmetric even though the points and shape values are all symmetric.
How do I draw this shape symmetrically?
This draws the approximate shape but the lines show how it is asymmetric.
curve <- data.frame(
  x=c(-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1),
  y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
xspline(curve,shape=1,open=F)
lines(x=c(-0.15,0.15),y=c(0.15,0.15),col="red")
lines(x=c(-0.15,0.15),y=c(-0.15,-0.15),col="red")
I have tried changing the shape values for each node but with no success.
Your question is actually two questions in one:
1. Is the curve (as a mathematical object) symmetric with respect to the x-axis?
2. Does it seem so in the picture?
Answer 2
Even if the answer to Question 1 were "Yes" (which I doubt, see below), I think the answer here is "No." Judging from the documentation, what xspline does is evaluate the curve at many points and then plot a polyline connecting them. You can convince yourself of this: with draw set to F, the following gives you two arrays, one of x- and one of y-values.
curve <- data.frame(
  x=c(-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1),
  y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
pts=xspline(curve,shape=1,open=F,draw=F)
pts
I don't think there is any way of controlling the number or density of the evaluation points. So even if your curve (as a mathematical object, blue in my figure) is symmetric, its polyline rendering (black) is not necessarily symmetric.
This alone might explain the small differences noted in @Mike's comment.
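As a rough numeric check (my own sketch, reusing the pts computed above; it is not part of the original answer), you can mirror the evaluated points about the x-axis and measure how far the polyline is from its own mirror image:

# For each evaluated point, distance to the nearest point of the
# mirrored set (x, -y); nonzero values mean the rendering is not
# exactly symmetric about the x-axis.
d <- sapply(seq_along(pts$x), function(i)
  min(sqrt((pts$x - pts$x[i])^2 + (-pts$y - pts$y[i])^2)))
max(d)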
Answer 1
We don't know exactly how R enforces the curve being closed. The documentation says:
For open X-splines, the start and end control points must have a shape of 0 (and non-zero values are silently converted to zero).
I suppose that it adds another control point at the very end, makes it equal to the first one, and sets the shape of both of them to zero. But this is different from what your control points on the right-hand side of the picture look like! Your control point (0.1, 0.1) is repeated only twice (not three times, as (-0.1, 0.1) is) and its shape is 1, not 0 (caveat: with the control point repeated three times, maybe this has no influence; we would have to check the paper linked from the documentation).
I have adapted this and plotted the curve and its mirrored version so that we see the difference.
curve <- data.frame(
x=c(-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1),
y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
xswap <- data.frame(
x=c( 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1,-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1),
y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
xspline(curve,shape=c(0,1,1,1,1,1,1,0,1,1,1,1,1,1),open=F)
xspline(xswap,shape=c(0,1,1,1,1,1,1,0,1,1,1,1,1,1),open=F)
lines(x=c(-0.15,0.15),y=c(0.15,0.15),col="red")
lines(x=c(-0.15,0.15),y=c(-0.15,-0.15),col="red")
To me they seem pretty much overlapping, especially taking the effects from Answer 2 into account.
General recommendations
Unless absolutely necessary, do not repeat control points. This tends to have funny effects on the underlying parametrization. More often than not (in my opinion), repeated control points come from people confusing control points with knots.
When you want symmetry with respect to the x-axis, put the first (and last) control point on it. Then you don't have to worry about finding the corresponding control point whose shape you have to set to 0.
For example,
curve <- data.frame(
x=c(0, 1, 1, 0, -1, -1),
y=c(1, 1, -1, -1, -1, 1))
curve_x_swapped <- data.frame(
x=c(0, -1, -1, 0, 1, 1),
y=c(1, 1, -1, -1, -1, 1))
plot(curve)
xspline(curve, shape=1,open=F)
xspline(curve_x_swapped,shape=1,open=F,border="red")

R: rgl quads3d with only 3 unique vertices does not show color

R 3.5.1
rgl 0.99.16
Windows 10 version 1809 build 17763.195
Under some circumstances (I think when the duplicates are the two interior points), if two of the four points provided to quads3d() are the same, then the resulting shape does not display its assigned color; instead it is black. In the following example, note that the second and third points are the same:
library(rgl)
q1 <- matrix(c(-0.35, 0, -0.5,
0.35, -0.5, 0,
0.35, -0.5, 0,
-0.35, 0, 0.5),
byrow=TRUE,
ncol=3,
dimnames=list(c("C0", "Cl", "Dl", "D0"), c("x", "y", "z")))
quads3d(x=q1[,"x"], y=q1[,"y"], z=q1[,"z"], color="blue", alpha=1)
This code produces a triangle (as it should, see the screen shot), but it's always black when the object should be blue. Changing just the coordinates produces a blue shape.
I can work around this, but I would call it a bug. I think quads3d should work properly when only three of the four points passed to it are unique; that does not violate the documentation (it's all in one plane and convex). I'm reporting it here in case anyone has helpful information about it, and for future searchers.
Thanks.
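One possible workaround, sketched under the assumption (mine, not confirmed anywhere in rgl's documentation) that the black face is a lighting artifact of a degenerate normal at the duplicated vertex: draw the three distinct vertices with triangles3d() instead.

library(rgl)
# Keep only the three distinct vertices and render the triangle directly
t1 <- q1[c("C0", "Cl", "D0"), ]
triangles3d(x=t1[,"x"], y=t1[,"y"], z=t1[,"z"], color="blue", alpha=1)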

How to "jitter" a vector of numbers?

The concept of jittering in graphical plotting is intended to make sure points do not overlap. I want to do something similar with a vector.
Imagine I have a vector like this:
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.8)
As you can see, it is a vector ordered by increasing numbers, with the caveat that some of the numbers are exactly the same. How can I "jitter" them by adding a very small value, so that they can be ordered in strictly increasing order? I would like to achieve something like this:
0.5, 0.50001, 0.55, 0.60, 0.71, 0.71001, 0.8
How can I achieve this in R?
If the solution allows me to adjust the size of the "added value" it's a bonus!
Jitter and then sort (jitter's amount argument lets you control the size of the perturbation):
sort(jitter(v))
The function rle gets you the run lengths of repeated elements in a vector. Using this information, you can then create a sequence along each run, multiply it by your verySmallNumber, and add it to v.
# New vector to illustrate a triplet
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.71, 0.8)
# Define the amount you wish to add
verySmallNumber <- 0.00001
# Get the rle
rv <- rle(v)
# Create the sequence, multiply and subtract the verySmallNumber, then add
sequence(rv$lengths) * verySmallNumber - verySmallNumber + v
# [1] 0.50000 0.50001 0.55000 0.60000 0.71000 0.71001 0.71002 0.80000
Of course, a very long run of repeats might eventually produce a value equal to (or beyond) the next real value. Adding a check based on the longest run guards against that, as sketched below.
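A minimal sketch of that check (my own addition, reusing v, rv and verySmallNumber from above): shrink verySmallNumber whenever the largest offset added to a run could reach the smallest gap between distinct values.

gaps <- diff(rv$values)    # gaps between consecutive distinct values
maxRun <- max(rv$lengths)  # longest run of repeats
if ((maxRun - 1) * verySmallNumber >= min(gaps)) {
  verySmallNumber <- min(gaps) / maxRun
}
sequence(rv$lengths) * verySmallNumber - verySmallNumber + v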

perspective * lookAt transformation in Qt OpenGL behaving unexpectedly, not even keeping w == 1

In an OpenGL window, I am attempting to identify whether or not the mouse pointer is situated within a given 3D rectangle visible in the rendered scene.
The story so far
I already have the rectangle's coordinates as world coordinates.
I also have the matrix
QMatrix4x4 camera = perspective(..) * lookAt(..)
And it seems fine within my vertex shader. After all, using
gl_Position = camera * v_vertices;
has the rectangle displayed on the screen just like I want it.
The problem
What I truly want is the screen coordinates of the rectangle's corners, (xj, yj) in [-1,1]^2, on the CPU.
Reassured by my experience with the vertex shader, I first grab
QVector4D w = (world coordinates of such a visible vertex) = (xw, yw, zw, 1)
and the values in w look good in the debugger (gdb). Next I try to obtain the screen coordinates directly using
QVector4D s = camera * w
since the rectangle actually is rendered on the screen with this very same transformation, and since I firmly believe that all visible OpenGL points live in [-1,1]^3, I really would expect
s in ([-1,1], [-1,1], [-1,1], 1)
however, I get stuff like
w == {xp = 0.5, yp = 1.5, zp = 2, wp = 1}
s == {xp = 1.53, yp = -6.43, zp = 1.81, wp = 2.60}
where not even the s.wp value stayed at 1.
I guess the question boils down to: how can a vertex that is visible on the screen lead to s NOT in ([-1,1], [-1,1], [-1,1], 1) with my CPU-side reconstruction workflow?
The particulars
The matrix product camera*w itself is correct. Given that camera is row-major, I can reproduce it in Octave:
camera = [1.54, -0.31, -0.29, -0.28;
0.51, 0.95, 0.87, 0.87;
0.00, 2.20, -0.41, -0.41;
0.00, -12.09, 1.48, 2.27]';
w=[.5,1.5,2,1]';
(camera*w)' == 1.5350 -6.4200 1.8200 2.6150
Viewer data:
site = [0, 0, 5.5]
direction_of_view = [-0.28, 0.87, -0.41]
dir_up = [0,0,1]
near = 0.40
far = 200
fov_v = 45 degrees
fov_h = approx. 65 degrees
You need to divide each vector by its w component.
This is done automatically during the rasterization step in OpenGL, but if you calculate locations manually you need to do it yourself.
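A minimal sketch of that divide (mirroring the Octave numbers above in R, only because the rest of this page uses R):

# camera is stored row-major, hence the transpose, as in the Octave check
camera <- t(matrix(c(1.54,  -0.31, -0.29, -0.28,
                     0.51,   0.95,  0.87,  0.87,
                     0.00,   2.20, -0.41, -0.41,
                     0.00, -12.09,  1.48,  2.27),
                   ncol=4, byrow=TRUE))
w <- c(0.5, 1.5, 2, 1)
s <- camera %*% w   # clip coordinates: 1.535 -6.420 1.820 2.615
ndc <- s / s[4]     # perspective divide; the w component is now 1
ndc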

Compare two user defined curves and score their similarity

I have a set of 2 curves (each with a few hundred to a couple of thousand data points) that I want to compare to get some similarity "score". Actually, I have >100 of those sets to compare... I am familiar with R (or at least Bioconductor) and would like to use it.
I tried the ccf() function but I'm not too happy about it.
For example, if I compare c1 to the following curves:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5) # perfect match! ideally score of 1
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5) # total opposite, ideally score of -1? (what would 0 be though?)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9) #pretty good, score of ???
Note that the vectors don't have the same size, and the comparison needs to be normalized somehow... Any idea?
If you look at those 2 lines, they are fairly similar, and I think that as a first step, measuring the area under the 2 curves and subtracting would do. I looked at the post "Shaded area under 2 curves in R" but that is not quite what I need.
A second issue (optional) is that for lines that have the same profile but different amplitude, I would like to score those as very similar even though the area under them would be big:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) # very good, score of ??
I hope that a biologist trying to formulate a problem for programmers is OK...
I'd be happy to provide some real life examples if needed.
Thanks in advance!
They don't form curves in the usual meaning of paired x-y values unless they are of equal length. The first three are of equal length, and after packaging them in a matrix, the rcorr function in the Hmisc package returns:
> library(Hmisc)
> dfrm <- data.frame(c1, c1b, c1c)
> rcorr(as.matrix(dfrm))[[1]]
     c1 c1b c1c
c1    1   1  -1
c1b   1   1  -1
c1c  -1  -1   1
# as desired if you scaled them to 0-1
The correlation of the c1 and c4 vectors:
> cor( c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5),
c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) )
[1] 0.9874975
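To handle the unequal lengths (e.g. c1 vs. c2), one option, sketched here as a hypothetical helper (common_cor is my name, not from the answer), is to resample both curves onto a common grid with approx() before correlating:

common_cor <- function(a, b, n = 100, method = "spearman") {
  # resample each curve onto n evenly spaced points over [0, 1]
  ga <- approx(seq(0, 1, length.out = length(a)), a,
               xout = seq(0, 1, length.out = n))$y
  gb <- approx(seq(0, 1, length.out = length(b)), b,
               xout = seq(0, 1, length.out = n))$y
  cor(ga, gb, method = method)
}
common_cor(c1, c2)   # works even though c1 and c2 differ in length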
I do not have a very good answer, but I did face a similar question in the past, probably on more than one occasion. My approach is to ask myself what makes my curves similar when I subjectively evaluate them (the scientific term here is "eye-balling" :). Is it the area under the curve? Do I count linear translation, rotation, or scaling (zoom) of my curves as contributing to dissimilarity? If not, I take out all the factors I do not care about by a suitable normalization (e.g. scale the curves to cover the same ranges in x and y).
I am confident that there is a rigorous mathematical theory for this topic; I would search for the words "affinity" and "affine". That said, my primitive/naive methods usually sufficed for the work I was doing.
You may want to ask this question on some math forum.
If the proteins you compare are reasonably close orthologs, you should be able to obtain alignments, either for each pair you want to score or a multiple alignment for the entire bunch. Depending on the application, I think the latter will be more rigorous. I would then extract the folding score of only those amino acids that are aligned, so that all profiles have the same length, and calculate correlation measures or squared normalized dot products of the profiles as a similarity measure. The squared normalized dot product or the Spearman rank correlation will be less sensitive to amplitude differences, which is what you seem to want. That will make sure you are comparing elements which are reasonably paired (to the extent the alignment is reasonable), and will let you answer questions like: "Are corresponding residues in the compared proteins generally folded to a similar extent?".
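A small sketch of the squared normalized dot product mentioned above (sq_norm_dot is my own helper name, applied to the c1/c4 vectors from the question):

sq_norm_dot <- function(a, b) {
  # squared cosine similarity: insensitive to constant amplitude scaling
  (sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))))^2
}
sq_norm_dot(c1, c4)   # close to 1 for profiles differing mainly in amplitude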
