I'm trying to calculate f(0.5) and f(2) for this set of data:
Precision: 0.6
Recall: 0.45
My results:
f(1) : 0.51
f(0.5): 0.56 (wrong)
f(2) : 0.47 (wrong)
I calculated the f(1) measure using this formula:
(2x(PxR))/(P+R)
But when I try to calculate f(2) or f(0.5), my results are slightly off:
f(0.5) should be 0.54
f(2) should be 0.49
I used the following formula:
(b^2 + 1) x ((P x R)/(b^2)+R)
b = the beta value of the F-measure I'm using, either 0.5 or 2
What am I doing wrong?
And if possible, could someone calculate the f(0.5) and f(2) measure for me and confirm that I am wrong?
Any help is appreciated; I will do my best to make this question as clear as possible. Please leave a comment if it's not clear enough and I will try to add to it.
Thanks
Fortunately, Wikipedia is searchable
The correct equation (on the Wikipedia page it has real math formatting, which is easier to read) is:
F(β) = (1 + β²) ⋅ (P⋅R) / (β²⋅P + R)
Or in Python:
>>> def F(beta, precision, recall):
... return (beta*beta + 1)*precision*recall / (beta*beta*precision + recall)
...
>>> F(1, .6, .45)
0.5142857142857143
>>> F(2, .6, .45)
0.4736842105263158
>>> F(0.5, .6, .45)
0.5625000000000001
That looks pretty close to the values you are getting, and not very similar to the ones you say are "correct". So it seems worth asking "Where do the supposedly correct values come from?"
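Working the two disputed cases through by hand with P = 0.6 and R = 0.45 gives the same numbers:
F(0.5) = (1 + 0.25) ⋅ (0.6 ⋅ 0.45) / (0.25 ⋅ 0.6 + 0.45) = 0.3375 / 0.6 = 0.5625
F(2) = (1 + 4) ⋅ (0.6 ⋅ 0.45) / (4 ⋅ 0.6 + 0.45) = 1.35 / 2.85 ≈ 0.4737
So 0.56 and 0.47 (rounded) are what the formula gives, not 0.54 and 0.49.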
I am working on a function in C and would like to create a function based on the given data points, but I cannot seem to get something that fits this curve. See graph here:
My program will primarily use this function in the 0-500°F range, so it is important that this range is accurate.
Using this graph I have determined the data points to be approximately:
Temp(F), Factor
(-300, 1.57)
(-200, 1.33)
(-100, 1.16)
(0, 1.05)
(100, 0.98)
(200, 0.94)
(300, 0.915)
(400, 0.865)
I have found that y = 0.00000244x^2 - 0.001x + 1.05 is a close fit for the -300 to 100°F range but becomes very inaccurate for x > 100°F.
y = 1.6904761904745*10^-6 x^2 - 0.00109048x + 1.0628 seems to be closer.
I figure that I need a cubic function to model this well, but I can't figure out what it would be. Any recommendations? I was also thinking that I could model T < 200°F and T > 200°F as separate functions.
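For reference, here is roughly what I mean, sketched in Python just to explore the fit (my actual program is in C, and I haven't checked how well the resulting cubic extrapolates past 400°F):
import numpy as np

# Tabulated data points read off the graph
temps = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

# Least-squares cubic fit: coefficients come back highest power first
coeffs = np.polyfit(temps, factors, 3)
cubic = np.poly1d(coeffs)
print(coeffs)
print(cubic(250))  # evaluate the fitted cubic at 250°F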
EDIT1
I have found a set of linear piecewise functions that fit the dataset, though they are still probably inaccurate for the 400-500°F range.
1.) y = -0.0024x + 0.85 {-300 < x < -200}
2.) y = -0.0017x + 0.99 {-200 < x < -100}
3.) y = -0.0011x + 1.05 {-100 < x < 0}
4.) y = -0.0007x + 1.05 {0 < x < 100}
5.) y = -0.0004x + 1.02 {100 < x < 200}
6.) y = -0.00025x + 0.99 {200 < x < 300}
7.) y = -0.0005x + 1.065 {300 < x < 400}
8.) y = -0.00065x + 1.125 {400 < x < 500} (estimated the factor to be 0.8 at 500°F)
EDIT2
I was able to model this pretty well with the help of JJacquelin's answer below. I have settled on using a piecewise set of two functions:
1.) y = 0.83583 + 0.218653e^{-0.00404x} {-400 < x < 185}
2.) y = -0.0000014x^2 + 0.000465x + 0.9027 {185 < x < 500}
Interactive graph here
EDIT3
JAlex has a good point about using cubic spline interpolation. This is the method I ended up using. I found an Arduino library and adapted it to my project: sakov/csa-c (Cubic Spline Approximation).
You can see that the cubic spline approximation (CSA) fits the original dataset quite well. Keep in mind the points for T = 500°F and T = 600°F were estimated using equation 2 from EDIT2.
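For illustration, the same idea sketched in Python with SciPy's CubicSpline (my actual implementation uses the adapted csa-c Arduino library, so treat this only as a stand-in):
import numpy as np
from scipy.interpolate import CubicSpline

# Original data points; the interpolating spline passes through each of them exactly
temps = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

spline = CubicSpline(temps, factors)
print(spline(250))  # interpolated factor at 250°F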
Another convenient model (exponential function):
Method of fitting from https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
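The linked document derives its fit from a regression based on integral equations; purely as a plainer stand-in, an ordinary nonlinear least-squares fit of the same model form y = a + b·exp(c·x) can be sketched with SciPy (starting values borrowed from the asker's EDIT2, so this is an illustration, not the linked method):
import numpy as np
from scipy.optimize import curve_fit

temps = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

def model(x, a, b, c):
    # Exponential model y = a + b*exp(c*x)
    return a + b * np.exp(c * x)

# Initial guesses taken from the EDIT2 fit above
params, _ = curve_fit(model, temps, factors, p0=(0.84, 0.22, -0.004))
print(params)  # fitted a, b, c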
I am currently using Measurements.jl for error propagation and LsqFit.jl for fitting functions to data. Is there a simple way to fit a function to data with errors? It would be no problem to use another package if that makes things easier.
Thanks in advance for your help.
While in principle it should be possible to make these packages work together, the implementation of LsqFit.jl does not seem to play nicely with the Measurement type. However, if one writes a simple least-squares linear regression directly:
# Generate test data, with noise
x = 1:10
y = 2x .+ 3
using Measurements
x_observed = (x .+ randn.()) .± 1
y_observed = (y .+ randn.()) .± 1
# Simple least-squares linear regression
# for an equation of the form y = a + bx
# using `\` for matrix division
linreg(x, y) = hcat(fill!(similar(x), 1), x) \ y
(a, b) = linreg(x_observed, y_observed)
then
julia> (a, b) = linreg(x_observed, y_observed)
2-element Vector{Measurement{Float64}}:
3.9 ± 1.4
1.84 ± 0.23
This ought to be able to work with either x uncertainties, y uncertainties, or both.
If you need a nonlinear least-squares fit, it should also be possible to extend the above approach to nonlinear least squares -- though for the latter it may be easier to just find where the incompatibility is in LsqFit.jl and make a PR.
The following equation, f(t) = b*pow(1-exp(-k*t), b-1) - (b-1)*pow(1-exp(-k*t), b), is used to describe a curve.
When f(t)=0.5 with known b and k, how to calculate it in R?
b*pow(1-exp(-k*t), b-1) - (b-1)*pow(1-exp(-k*t), b) = 0.5
e.g. when b=5, k=0.5, t=?
You could use uniroot, but first plot the function to check for roots, if any. I subtract 0.5 from the function, since that is what you want to solve for. Plotting shows that there are two roots, so you have to play with the interval in the uniroot function. I'll leave that to you; let me know if you struggle with it.
f <- function(x) {
  b <- 5
  k <- 0.5
  return(b * (1 - exp(-k*x))^(b-1) - (b-1) * (1 - exp(-k*x))^b - 0.5)
}
uniroot(f, interval = c(0, 1e+08))
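Purely as a cross-check in another language, restricting to t > 0, the same root can be bracketed and found with SciPy's brentq (just an illustration; the uniroot call above is all you need in R):
import math
from scipy.optimize import brentq

def f(t, b=5, k=0.5):
    u = 1 - math.exp(-k * t)
    return b * u**(b - 1) - (b - 1) * u**b - 0.5

# f is negative near t = 0 and positive for large t, so a root is bracketed
print(brentq(f, 1e-9, 50))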
I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b) - 1)] <- 0 # initialize (the result has length(a) + length(b) - 1 elements)
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that used three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process flow question, part two of my question is whether any libraries exist to model operations flow so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the Fast Fourier Transform, or FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1],
               a[1]*b[2] + a[2]*b[1],
               a[1]*b[3] + a[2]*b[2] + a[3]*b[1],
               a[1]*b[4] + a[2]*b[3] + a[3]*b[2],
               a[1]*b[5] + a[2]*b[4] + a[3]*b[3],
               a[2]*b[5] + a[3]*b[4],
               a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
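For anyone more comfortable in Python, the same numbers fall out of numpy.convolve (just a cross-check of the R result above, not part of the R answer):
import numpy as np

a = [0.1, 0.4, 0.5]
b = [0.1, 0.2, 0.3, 0.3, 0.1]
print(np.convolve(a, b))  # matches the convolve(a, rev(b), type="open") result above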
See the package distr. Choosing the term "multiply" is unfortunate, since the situation described is not one where the contributions to the probabilities are independent (where multiplication of probabilities would be the natural term to use). It's rather some sort of sequential addition, and that is exactly what the distr package provides as its interpretation of what "+" should mean when used as a symbolic manipulation of two discrete distributions.
A <- DiscreteDistribution(setNames(0:2, c("Zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("Zero2", "one2", "two2", "three2", "four2")), b)
?'operators-methods' # operations on two DiscreteDistribution objects are defined as convolution
plot(A+B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
> environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
Seems like there should be a method for displaying the probabilities, and since I'm not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos ... which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, where the vignette is 100+ pages long. And it shouldn't have required any effort to find, either, since that advice is in the messages that print when the package is loaded ... except, in my defense, there were a couple of pages of messages, so it was more tempting to jump into coding and using the help pages.
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust solution for this problem.
You are looking for a method to estimate the distribution of a process composed of n steps (in your case 2), which might not be as easy to compute as in your example.
The approach I would use is a simulation of 10k observations drawn from the underlying distributions, and then calculating the density function of the simulated results.
Using your example, we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
z[x>=0 & x<0.1, a_days:=0]
z[x>=0.1 & x<0.5, a_days:=1]
z[x>=0.5 & x<=1, a_days:=2]
z[y>=0 & y<0.1, b_days:=0]
z[y>=0.1 & y<0.3, b_days:=1]
z[y>=0.3 & y<0.6, b_days:=2]
z[y>=0.6 & y<0.9, b_days:=3]
z[y>=0.9 & y<=1, b_days:=4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
This will result in a very good proxy of the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you'd use the rexp function to generate b_days directly.
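The same simulation idea, sketched in Python with direct sampling from the two discrete distributions (only an illustration of the approach above, not a replacement for the R code):
import numpy as np

rng = np.random.default_rng(0)
a = [0.1, 0.4, 0.5]            # P(A takes 0, 1, 2 days)
b = [0.1, 0.2, 0.3, 0.3, 0.1]  # P(B takes 0..4 days)

# Draw 10k durations for each step and add them
a_days = rng.choice(len(a), size=10_000, p=a)
b_days = rng.choice(len(b), size=10_000, p=b)
total = a_days + b_days

# Empirical pmf of the total; should be close to the exact convolution
print(np.bincount(total, minlength=len(a) + len(b) - 1) / total.size)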
file.data has the following values to fit with a Weibull distribution:
x y
2.53 0.00
0.70 0.99
0.60 2.45
0.49 5.36
0.40 9.31
0.31 18.53
0.22 30.24
0.11 42.23
Following the Weibull distribution function f(x) = 1.0 - exp(-lambda*x**n), the command
fit f(x) 'data.dat' via lambda, n
gives an error, and plotting f(x) together with the x-y data shows a large discrepancy.
Any feedback would be highly appreciated. Thanks!
Several things:
You must skip the first line (if it really is x y).
You must use the correct function (the pdf and not the CDF, see http://en.wikipedia.org/wiki/Weibull_distribution, like you did in https://stackoverflow.com/q/20336051/2604213)
You must use an additional scaling parameter, because your data are not normalized
You must select adequate initial values for the fitting.
The following works fine:
f(x) = (x < 0 ? 0 : a*(x/lambda)**(n-1)*exp(-(x/lambda)**n))
n = 0.5
a = 100
lambda = 0.15
fit f(x) 'data.dat' every ::1 via lambda, n, a
set encoding utf8
plot f(x) title sprintf('λ = %.2f, n = %.2f', lambda, n), 'data.dat' every ::1
That gives a good fit (with gnuplot 4.6.4).
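For a cross-check outside gnuplot, the same scaled Weibull density can be fitted with SciPy's curve_fit (a sketch only; it reuses the starting values above, and, as with gnuplot, convergence depends on those initial guesses):
import numpy as np
from scipy.optimize import curve_fit

# Data from the question (the 'x y' header line dropped)
x = np.array([2.53, 0.70, 0.60, 0.49, 0.40, 0.31, 0.22, 0.11])
y = np.array([0.00, 0.99, 2.45, 5.36, 9.31, 18.53, 30.24, 42.23])

def f(x, a, lam, n):
    # Scaled Weibull density, same form as the gnuplot f(x) above
    return a * (x / lam)**(n - 1) * np.exp(-(x / lam)**n)

# Keep the parameters positive so fractional powers stay real
params, _ = curve_fit(f, x, y, p0=(100, 0.15, 0.5), bounds=(0, np.inf))
print(params)  # fitted a, lambda, n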
If that's the actual command you provided to gnuplot, it won't work because you haven't yet defined f(x).