My file data.dat contains the following values, which I want to fit with a Weibull distribution:
x y
2.53 0.00
0.70 0.99
0.60 2.45
0.49 5.36
0.40 9.31
0.31 18.53
0.22 30.24
0.11 42.23
Using the Weibull distribution function f(x) = 1.0 - exp(-lambda*x**n), the following command gives an error:
fit f(x) 'data.dat' via lambda, n
and plotting f(x) together with the data shows a large discrepancy.
Any feedback would be highly appreciated. Thanks!
Several things:
You must skip the first line (if it really is x y).
You must use the correct function (the PDF, not the CDF; see http://en.wikipedia.org/wiki/Weibull_distribution), as you did in https://stackoverflow.com/q/20336051/2604213.
You must use an additional scaling parameter, because your data are not normalized.
You must select adequate initial values for the fitting.
The following works fine:
f(x) = (x < 0 ? 0 : a*(x/lambda)**(n-1)*exp(-(x/lambda)**n))
n = 0.5
a = 100
lambda = 0.15
fit f(x) 'data.dat' every ::1 via lambda, n, a
set encoding utf8
plot f(x) title sprintf('λ = %.2f, n = %.2f', lambda, n), 'data.dat' every ::1
That gives (with gnuplot 4.6.4):
If that's the actual command you provided to gnuplot, it won't work because you haven't yet defined f(x). You must define the function before fitting, e.g. by putting f(x) = 1.0 - exp(-lambda*x**n) ahead of the fit command.
I was trying to fit this dataset:
#Mydataset damped sine wave data
#X ---- Y
45.80 320.0
91.60 -254.0
137.4 198.0
183.2 -156.0
229.0 126.0
274.8 -100.0
320.6 80.0
366.4 -64.0
412.2 52.0
458.0 -40.0
503.8 34.0
549.6 -26.0
595.4 22.0
641.2 -18.0
which, as you can see in the plot below, has the classic shape of a damped sine wave:
So I first defined the function for the fit:
f(x) = exp(-a*x)*sin(b*x)
and then ran the fit:
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 1.10e-19 1.000000e+00 1.000000e+00
Current data point
=========================
# = 1 out of 14
x = -5.12818e+20
z = 320
Current set of parameters
=========================
a = -5.12818e+20
b = -1.44204e+20
Function evaluation yields NaN ("not a number")
which yields a NaN. So I looked around on Stack Overflow and remembered that I have had problems in the past when fitting exponentials: their fast growth/decay requires you to set initial parameters in order not to get this error (as I asked here). So I set the starting parameters a and b to the expected values, a = 9000 and b = 146000, but the result was even more frustrating than before:
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 0.00e+00 9.000000e+03 1.460000e+05
Singular matrix in Givens()
I thought: "these numbers are much too large, let's try smaller ones."
So I entered new values for a and b and started the fit again:
a = 0.01
b = 2
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7429059500e+05 0.00e+00 1.71e+01 1.000000e-02 2.000000e+00
1 2.7346318324e+05 -3.03e+02 1.71e+00 1.813940e-02 -9.254913e-02
* 1.0680927157e+137 1.00e+05 1.71e+01 -2.493611e-01 5.321099e+00
2 2.7344431789e+05 -6.90e+00 1.71e+00 1.542835e-02 4.310193e+00
* 6.1148639318e+81 1.00e+05 1.71e+01 -1.481123e-01 -1.024914e+01
3 2.7337226343e+05 -2.64e+01 1.71e+00 1.349852e-02 -9.008087e+00
* 6.4751980241e+136 1.00e+05 1.71e+01 -2.458835e-01 -4.089511e+00
4 2.7334273482e+05 -1.08e+01 1.71e+00 1.075319e-02 -4.346296e+00
* 1.8228530731e+121 1.00e+05 1.71e+01 -2.180542e-01 -1.407646e+00
* 2.7379223634e+05 1.64e+02 1.71e+02 8.277720e-03 -1.440256e+00
* 2.7379193486e+05 1.64e+02 1.71e+03 1.072342e-02 -3.706519e+00
5 2.7326800742e+05 -2.73e+01 1.71e+02 1.075288e-02 -4.338196e+00
* 2.7344116255e+05 6.33e+01 1.71e+03 1.069793e-02 -3.915375e+00
* 2.7327905718e+05 4.04e+00 1.71e+04 1.075232e-02 -4.332930e+00
6 2.7326776014e+05 -9.05e-02 1.71e+03 1.075288e-02 -4.338144e+00
iter chisq delta/lim lambda a b
After 6 iterations the fit converged.
final sum of squares of residuals : 273268
rel. change during last iteration : -9.0493e-07
degrees of freedom (FIT_NDF) : 12
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 150.905
variance of residuals (reduced chisquare) = WSSR/ndf : 22772.3
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.0107529 +/- 3.114 (2.896e+04%)
b = -4.33814 +/- 3.678 (84.78%)
correlation matrix of the fit parameters:
a b
a 1.000
b 0.274 1.000
It produced a result, so I thought everything was OK, but my happiness lasted only seconds, until I plotted the output:
Wow. A really good one.
And I'm still here wondering what's wrong and how to get a proper fit of a damped sine wave dataset with gnuplot.
Hope someone knows the answer :)
The function you are fitting the data to is not a good match for the data. The envelope of the data is a decaying function, so you want a positive damping parameter a. But then your fitting function cannot be bigger than 1 for positive x, unlike your data. Also, by using a sine function in your fit you assume something about the phase behavior -- the fitted function will always be zero at x=0. However, your data looks like it should have a large, negative amplitude.
So let's choose a better fitting function, and give gnuplot a hand by choosing some reasonable initial guesses for the parameters:
f(x) = c*exp(-a*x)*cos(b*x)   # decaying envelope times a cosine, with free amplitude c
a = 1./500                    # initial guess: slow decay over a few hundred x units
b = 2*pi/100.                 # initial guess: the data alternate in sign every ~45.8, so the period is ~90-100
c = -400.                     # initial guess: large negative amplitude, matching the data near x = 0
fit f(x) 'data.txt' via a,b,c
plot f(x), 'data.txt' w p
gives
I'm trying to calculate f(0.5) and f(2) for this set of data:
Precision: 0.6
Recall: 0.45
My results:
f(1) : 0.51
f(0.5): 0.56 (wrong)
f(2) : 0.47 (wrong)
I calculated the f(1) measure using this formula:
(2 * (P * R)) / (P + R)
But when I try to calculate f(2) or f(0.5) my results are slightly off
f(0.5) should be 0.54
f(2) should be 0.49
I used the following formula:
(b^2 + 1) * ((P * R) / (b^2) + R)
where b is the β value I'm using, either 0.5 or 2.
What am I doing wrong?
And if possible, could someone calculate the f(0.5) and f(2) measure for me and confirm that I am wrong?
Any help is appreciated; I have done my best to make this question as clear as possible. Please leave a comment if it's not clear enough and I will try to add to it.
Thanks
Fortunately, Wikipedia is searchable.
The correct equation (the Wikipedia page has real math formatting, which is easier to read) is:
F(β) = (1 + β²) · (P · R) / (β² · P + R)
Or in Python:
>>> def F(beta, precision, recall):
... return (beta*beta + 1)*precision*recall / (beta*beta*precision + recall)
...
>>> F(1, .6, .45)
0.5142857142857143
>>> F(2, .6, .45)
0.4736842105263158
>>> F(0.5, .6, .45)
0.5625000000000001
That looks pretty close to the values you are getting, and not very similar to the ones you say are "correct". So it seems worth asking: where do the supposedly correct values come from?
I'm working with the Rssa package to decompose time series, which works fine except that I can't get the percentage of explained variance for each eigenvector (if those are the right words for it). However, these percentages are shown on top of one of the graphs I can plot with this package.
Let me give an example:
d=rnorm(200,10,3)
plot(d,type="l")
ssa = ssa(d, L = 100, digits = 0)
plot(ssa,type="vector") #the percentage I want is in the title of each individual graph
# to reconstruct the trend and the residuals
res <- reconstruct(ssa, groups = list(1))
trend <- res$F1
How do I get these percentages in a vector? Especially since I want to loop over multiple series.
Thank you!
It seems that the code computing the weighted norm of the series per component is hidden in the package. I extracted it from Rssa:::.plot.ssa.vectors.1d.ssa and wrapped it in a small function:
component_wnorm <-
  function(x) {
    # percentage contribution of the first (up to) 10 components
    idx <- seq_len(min(nsigma(x), 10))
    total <- wnorm(x)^2
    round(100 * x$sigma[idx]^2 / total, digits = 2)
  }
component_wnorm(ssa)
[1] 92.02 0.35 0.34 0.27 0.27 0.25 0.22 0.20 0.20 0.18
Recent versions of Rssa have the function contributions.
Therefore, you can use
> s <- ssa(d, L=100)
> c <- contributions(s)*100
> print(c[1:10], digits = 2)
[1] 92.41 0.28 0.26 0.26 0.26 0.23 0.23 0.21 0.20 0.20
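Since you want to loop over multiple series, here is a minimal sketch building on contributions (the list of series is just stand-in data for illustration):
library(Rssa)
series <- list(a = rnorm(200, 10, 3), b = rnorm(200, 12, 2))  # stand-in data
pct <- lapply(series, function(v) {
  s <- ssa(v, L = 100)
  contributions(s) * 100  # percentage of explained variance per component
})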
I use R to calculate the ecdf of some data, and I want to use the results in other software. R just does the 'work' here; it does not produce the final diagram for my thesis.
Example Code
# Plotting a built-in sample dataset
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
Output
> # Plotting a built-in sample dataset
> plot(cars$speed)
> # Assigning the data to a new variable name
> myData = cars$speed
> # Calculating the ecdf
> myResult = ecdf(myData)
> myResult
Empirical CDF
Call: ecdf(myData)
x[1:19] = 4, 7, 8, ..., 24, 25
> # Plotting the ecdf
> plot(myResult)
Questions
Question 1
How do I get the raw information needed to plot the ecdf diagram in other software (e.g., Excel, Matlab, LaTeX)? For the histogram function I can just write
res = hist(...)
and I find all the information like
res$breaks
res$counts
res$density
res$mids
res$xname
Question 2
How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data are car speeds).
Update
Thanks to user777's answer I have more information now. If I use
> myResult(0:25)
[1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00
I get the values for 0 to 25 mph. But I do not know where to draw each data point, and I want to reproduce the R plot exactly.
Here I have a data point every 1 mph.
Here I do not have a data point every 1 mph; I only have a data point where data are available.
Solution
# Plotting a built-in sample dataset
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
# Look at the probabilities for 0 to 25 mph
myResult(0:25)
# Look at the probabilities only where there are data points
myResult(unique(myData))
# Saving the results to a file
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")
The file myResult.txt looks like
"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1
Meaning: V1 holds the unique speeds and V2 the corresponding cumulative probabilities.
Note: I have a German Excel, so the decimal separator is a comma instead of a dot.
The output of ecdf is a function, among other things. You can verify this with class(myResult), which displays the classes of the object myResult.
If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData and prints the result to the console. To save the output, enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv"), adjusting the file path as needed.
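As a side note, assuming a reasonably current base R, the jump locations can also be read directly off the ecdf object via knots, which avoids the unique(myData) detour; a minimal sketch:
xs <- knots(myResult)  # the sorted unique data values, i.e. the jump locations
ys <- myResult(xs)     # the cumulative probabilities at those values
write.csv(data.frame(x = xs, y = ys), file = "myResult.csv", row.names = FALSE)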
The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it states the probability that a randomly selected car from your data set is at or below the threshold. If you're interested in the number of cars satisfying some criterion, just count them: myData[myData<=10] returns the matching data elements, and length(myData[myData<=10]) tells you how many there are.
Assuming you mean the sample probability that a randomly selected car from your data has a speed at or below 10 mph, that is the value given by myResult(10).
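If by "inverse ecdf" you instead mean the quantile sense (given a probability, find the speed), base R's quantile with type = 1 is the inverse of the empirical distribution function; a minimal sketch:
# Inverse ecdf: the smallest observed x with ecdf(x) >= p
quantile(myData, probs = c(0.25, 0.5, 0.75), type = 1)
# Counting observations at or below a threshold, as described above:
sum(myData <= 10)  # same as length(myData[myData <= 10])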
As I see it, your main requirement is to reproduce the jumps at each x value. Try this:
> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")
The first 50 (x, y) pairs are the beginnings of the jumps, the next 50 are the ends, and the last two give the starting and ending values at $(x_1-3,0)$ and $(x_{50}+3,1)$. The order() call then sorts the values in increasing order of $x$.
I have data of this form:
x y
1 0.19
2 0.26
3 0.40
4 0.58
5 0.59
6 1.24
7 0.68
8 0.60
9 1.12
10 0.80
11 1.20
12 1.17
13 0.39
I'm currently plotting a kernel-smoothed estimate of y versus x using this code:
smoothed = ksmooth( d$resi, d$score, bandwidth = 6 )
plot( smoothed )
I simply want a plot of the x versus smoothed(y) values.
However, the documentation for ksmooth suggests that this isn't the best kernel-smoothing package available:
This function is implemented purely for compatibility with S, although it is nowhere near as slow as the S function. Better kernel smoothers are available in other packages.
What other kernel smoothers are better and where can these smoothers be found?
If you "simply want a plot of the x versus smoothed(y)", then I recommend considering loess in package stats - it's simple, fast and effective. If instead you really want a regression based on kernel smoothing, then you could try locpoly in package KernSmooth or npreg in package np.