How to properly fit a non-linear equation to a set of datapoints?

I have a curve of net longwave radiation (QL) data, which are calculated as follows:
QL = a*Ta^4 - b*Ts^4
where a and b are constants, Ta is the air temperature, and Ts is the surface temperature.
If I plot QL versus Ta-Ts, what type of equation should I use to fit the data as y = f(x), where x = (Ta-Ts)?
Thanks
Ta-Ts (deg C)   QL (W m-2)
-20.5 -176.683672
-19.5 -171.0655836
-18.5 -165.8706233
-17.5 -158.9990897
-16.5 -154.2715535
-15.5 -147.5376901
-14.5 -141.2410818
-13.5 -135.3387669
-12.5 -129.3971791
-11.5 -122.0777208
-10.5 -117.475907
-9.5 -111.107148
-8.5 -104.5999237
-7.5 -99.82769298
-6.5 -93.43215832
-5.5 -87.6278432
-4.5 -81.85415752
-3.5 -76.5997892
-2.5 -70.26308516
-1.5 -65.49437303
-0.5 -60.78052134
0.5 -56.32077454
1.5 -51.74037492
2.5 -47.30542394
3.5 -42.92298839
4.5 -38.13260904
5.5 -34.22676827
6.5 -30.49502686
7.5 -26.89383663
8.5 -22.259631
The complete data: https://docs.google.com/spreadsheets/d/1e3gNCKQesrGe9ESrEIUcQw3umERzNRt0/edit?usp=sharing&ouid=115727378140347660572&rtpof=true&sd=true
In the spreadsheet:
TS = surface temperature (degrees Celsius);
TA = air temperature (degrees Celsius);
Lin = longwave in (0.8 * 5.67E-8 * (TA+273.15)^4) (W m-2);
Lout = longwave out (0.97 * 5.67E-8 * (TS+273.15)^4) (W m-2);
QL = Lin - Lout (W m-2);
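Under those definitions, QL can be reproduced directly; a quick sketch in R with purely hypothetical temperatures (the linked spreadsheet has the real ones):
Ta <- 20                          # hypothetical fixed air temperature (deg C)
Ts <- seq(0, 40, by = 0.5)        # hypothetical surface temperatures (deg C)
Lin  <- 0.80 * 5.67e-8 * (Ta + 273.15)^4   # longwave in  (W m-2)
Lout <- 0.97 * 5.67e-8 * (Ts + 273.15)^4   # longwave out (W m-2)
QL <- Lin - Lout
plot(Ta - Ts, QL, xlab = "Ta - Ts (deg C)", ylab = "QL (W m-2)")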

The notation QL = y = f(x) is fallacious because QL doesn't depend on one variable only; it depends on two independent variables, Ta and Ts.
So one has to write y = F(Ta,Ts), or equivalently y = g(x,Ta), or equivalently y = h(x,Ts), with x = Ta-Ts.
Any one of those functions can be determined by nonlinear regression if we have data in the form of a table of three columns (not only two), for example:
(Ta,Ts,y) to find the function F(Ta,Ts)
or (x,Ta,y) to find the function g(x,Ta)
or (x,Ts,y) to find the function h(x,Ts)
In fact, one cannot definitively answer your question as posed: something is missing, either measurements of another parameter or another relationship between the parameters in addition to x = Ta-Ts.
Of course one can compute (for example) the coefficients A, B, C, ... for a polynomial regression of the kind f(x) = A + Bx + Cx^2 + ... and get a very good fit:
The coefficients A, B, C are purely mathematical, without physical significance. The coefficients a and b in QL = a*Ta^4 - b*Ts^4 cannot be derived from A, B, C without more physical information, as already pointed out.
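As an illustration of the three-column approach: in this particular case the model is nonlinear in the temperatures but linear in the unknown coefficients, so even ordinary least squares works. A minimal sketch, assuming a data frame dat with columns Ta, Ts, y (hypothetical names), temperatures in degrees Celsius:
# Fit y = a*(Ta+273.15)^4 - b*(Ts+273.15)^4; linear in (a, b), so lm suffices
fit <- lm(y ~ 0 + I((Ta + 273.15)^4) + I((Ts + 273.15)^4), data = dat)
coef(fit)   # first coefficient is a, second is -b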

I took your data and did a 4th order polynomial fit. Here's the result:
QL = -58.607 + x*(4.8336 + x*(-0.0772 + x*(-2e-5 + x*8e-5)))
R^2 = 0.9999
x = (Ta - Ts)
If you want the equation to be in terms of Ta and Ts instead of the difference you should substitute and do the algebra.
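If it helps, the same fit is a couple of lines in R; here x and QL are assumed to hold the two columns listed above:
fit <- lm(QL ~ poly(x, 4, raw = TRUE))   # 4th-order polynomial in x
coef(fit)                 # A, B, C, D, E in A + B*x + C*x^2 + D*x^3 + E*x^4
summary(fit)$r.squared    # should land near the 0.9999 quoted above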

Related

Noisy data presented in tabular form are given. Fit and build a model curve

Noisy data are given in tabular form. Fit and build a model curve, choosing a functional dependence of the form:
Here is the Scilab code for your first example; you will be able to adapt it to the other cases. To evaluate the obtained polynomial F, use the Scilab function horner (see the associated help page by typing help horner).
gd = [1.2 1.4 1.6 1.8 2.1 2.2 2.4 2.6 2.8 3.1]';  // abscissae
Fd = [1.55 2.73 3.91 5.51 7.11 9.12 11.12 12.91 15.45 17.91]';  // noisy ordinates
c = [ones(gd), gd, gd.^2, gd.^3]\Fd;  // least-squares cubic: Vandermonde matrix \ rhs
F = poly(c,"g","coeff");  // build the polynomial in the variable g from its coefficients
disp(F)
the above script displays
7.6508968 -14.91761g +9.8440484g² -1.2735795g³
You can plot the graph of the polynomial and the original noisy data like this:
g = linspace(1.2,3.1,100);
plot(g, horner(F,g),gd,Fd,'o')

Mixed integer programming R: Least absolute deviation with cost associated with each regressor

I have been presented with a problem regarding the minimization of the absolute error, known as LAD (least absolute deviation), but since each regressor is the result of an expensive test with an associated cost, one should refrain from using regressors that don't explain variance to a high degree. The objective is:
minimize sum_i(E_i) + lambda * sum_j(C_j * z_j)
where N is the total number of observations, E_i the deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C_j the cost associated with performing test j.
So far, I have proceeded as usual. To make it linear, I split the absolute value into two errors, e^+ and e^-, where e_i = y_i - (B_0 + sum_j(B_j*X_ij)), with the following constraints:
z_j = {0,1}, a binary value indicating whether regressor j enters my model.
B_j <= M*z_j; B_j >= -M*z_j
e^+, e^- >= 0
A toy subset of data I'm working has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrices
y <- read.csv(file="PATH\\y.csv", header = TRUE, sep = ",") #dim=100*1
regresores <- read.csv(file="PATH\\regressors.csv", header = TRUE, sep = ",") #dim=100*11
cost <- read.csv(file="PATH\\cost.csv", header = TRUE, sep = ",") #dim=1*11
for (i in seq(0, 1, by = 0.1)){ # so as to have a collection of models with different penalties
  obj.fun <- c(1,1,i*coste)
  constr <- matrix(
    c(y,regresores,-regresores),
    c(-y,-regresores,regresores),
    sum(regresores), ncol = , byrow = TRUE)
  constr.dir <- c("<=",">=","<=","==")
  rhs <- c(regresores,-regresores,1,binary)
  sol <- lp("min", obj.fun, constr, constr.tr, rhs)
  sol$objval
  sol$solution}
I know there is a LAD function in R, but for consistency's sake with my colleagues, as well as a pretty annoying PhD tutor, I have to do this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax, or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regresores, -regresores), c(-y, -regresores, regresores), : non-numeric matrix extent".
Mainly, I intended for it to build the weighted LAD model and return its solution for the different values of lambda, from 0 to 1 in steps of 0.1.
Thanks in advance, and sorry for any inconvenience; neither English nor R is my native language.
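For reference, here is a minimal sketch of how the linearized model described above could be laid out for lpSolve; it uses the toy data from the tables, and the big-M constant, the variable ordering, and all object names are assumptions rather than the poster's intended code:
library(lpSolve)

# Toy data from the tables above
X <- matrix(c(7.5, 0.610, 0.26,
              5.6, 0.540, 0.04,
              7.4, 0.965, 0.00,
              6.7, 0.460, 0.24,
              6.1, 0.400, 0.16,
              9.7, 0.690, 0.32), ncol = 3, byrow = TRUE)
y    <- c(5, 5, 5, 6, 7, 5)
cost <- c(0.26, 0.6, 0.52)
n <- nrow(X); S <- ncol(X)
M  <- 100                 # big-M bound on |B_j| (an assumption)
nv <- 2 + 2*S + 2*n + S   # variables: B0+, B0-, B+ (S), B- (S), e+ (n), e- (n), z (S)

# lpSolve variables are nonnegative, hence the +/- splits of B0 and B.
# Rows 1..n:      B0 + X %*% B + e+ - e- = y       (defines the residuals)
# Rows n+1..n+2S: B+_j <= M*z_j and B-_j <= M*z_j  (z_j = 0 forces B_j = 0)
constr <- rbind(
  cbind(1, -1, X, -X, diag(n), -diag(n), matrix(0, n, S)),
  cbind(0,  0, diag(S), matrix(0, S, S), matrix(0, S, 2*n), -M * diag(S)),
  cbind(0,  0, matrix(0, S, S), diag(S), matrix(0, S, 2*n), -M * diag(S)))
constr.dir <- c(rep("=", n), rep("<=", 2*S))
rhs <- c(y, rep(0, 2*S))

for (lambda in seq(0, 1, by = 0.1)) {
  obj.fun <- c(0, 0, rep(0, 2*S), rep(1, 2*n), lambda * cost)  # sum(e+ + e-) + lambda*cost'z
  sol <- lp("min", obj.fun, constr, constr.dir, rhs,
            binary.vec = (nv - S + 1):nv)
  cat("lambda =", lambda, "objective =", sol$objval, "\n")
}
The key point is that lp() expects a single numeric constraint matrix whose columns line up with the objective vector; the matrix() call in the question passes data frames as the dimension arguments instead, which is exactly what the "non-numeric matrix extent" error complains about.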

Fitting a damped sine wave dataset with gnuplot, getting lots of errors

I was trying to fit this dataset:
#Mydataset damped sine wave data
#X ---- Y
45.80 320.0
91.60 -254.0
137.4 198.0
183.2 -156.0
229.0 126.0
274.8 -100.0
320.6 80.0
366.4 -64.0
412.2 52.0
458.0 -40.0
503.8 34.0
549.6 -26.0
595.4 22.0
641.2 -18.0
which, as you can see by the plot below, has the classical trend of a damped sine wave:
So I first defined the function for the fit
f(x) = exp(-a*x)*sin(b*x)
then I ran the fit
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 1.10e-19 1.000000e+00 1.000000e+00
Current data point
=========================
# = 1 out of 14
x = -5.12818e+20
z = 320
Current set of parameters
=========================
a = -5.12818e+20
b = -1.44204e+20
Function evaluation yields NaN ("not a number")
getting a NaN as a result. So I looked around on Stack Overflow and remembered that I had already run into problems in the past when fitting exponentials, due to their fast growth/decay, which requires setting initial parameters to avoid this error (as I asked here). So I tried setting the expected values, a = 9000 and b = 146000, as starting parameters, but the result was more frustrating than the one before:
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 0.00e+00 9.000000e+03 1.460000e+05
Singular matrix in Givens()
I thought: "these numbers are way too large, let's try smaller ones".
So I entered new values for a and b and started the fit again:
a = 0.01
b = 2
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7429059500e+05 0.00e+00 1.71e+01 1.000000e-02 2.000000e+00
1 2.7346318324e+05 -3.03e+02 1.71e+00 1.813940e-02 -9.254913e-02
* 1.0680927157e+137 1.00e+05 1.71e+01 -2.493611e-01 5.321099e+00
2 2.7344431789e+05 -6.90e+00 1.71e+00 1.542835e-02 4.310193e+00
* 6.1148639318e+81 1.00e+05 1.71e+01 -1.481123e-01 -1.024914e+01
3 2.7337226343e+05 -2.64e+01 1.71e+00 1.349852e-02 -9.008087e+00
* 6.4751980241e+136 1.00e+05 1.71e+01 -2.458835e-01 -4.089511e+00
4 2.7334273482e+05 -1.08e+01 1.71e+00 1.075319e-02 -4.346296e+00
* 1.8228530731e+121 1.00e+05 1.71e+01 -2.180542e-01 -1.407646e+00
* 2.7379223634e+05 1.64e+02 1.71e+02 8.277720e-03 -1.440256e+00
* 2.7379193486e+05 1.64e+02 1.71e+03 1.072342e-02 -3.706519e+00
5 2.7326800742e+05 -2.73e+01 1.71e+02 1.075288e-02 -4.338196e+00
* 2.7344116255e+05 6.33e+01 1.71e+03 1.069793e-02 -3.915375e+00
* 2.7327905718e+05 4.04e+00 1.71e+04 1.075232e-02 -4.332930e+00
6 2.7326776014e+05 -9.05e-02 1.71e+03 1.075288e-02 -4.338144e+00
iter chisq delta/lim lambda a b
After 6 iterations the fit converged.
final sum of squares of residuals : 273268
rel. change during last iteration : -9.0493e-07
degrees of freedom (FIT_NDF) : 12
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 150.905
variance of residuals (reduced chisquare) = WSSR/ndf : 22772.3
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.0107529 +/- 3.114 (2.896e+04%)
b = -4.33814 +/- 3.678 (84.78%)
correlation matrix of the fit parameters:
a b
a 1.000
b 0.274 1.000
I saw it produced a result, so I thought everything was OK, but my happiness lasted only seconds, until I plotted the output:
Wow. A really good one.
And I'm still here wondering what's wrong and how to get a proper fit of a damped sine wave dataset with gnuplot.
Hope someone knows the answer :)
The function you are fitting the data to is not a good match for the data. The envelope of the data is a decaying function, so you want a positive damping parameter a. But then your fitting function cannot be bigger than 1 for positive x, unlike your data. Also, by using a sine function in your fit you assume something about the phase behavior -- the fitted function will always be zero at x=0. However, your data looks like it should have a large, negative amplitude.
So let's choose a better fitting function, and give gnuplot a hand by choosing some reasonable initial guesses for the parameters:
f(x)=c*exp(-a*x)*cos(b*x)   # cosine allows a nonzero value at x=0; c sets the amplitude
a=1./500                    # decay rate, guessed from the envelope of the data
b=2*pi/100.                 # angular frequency: the sign flips every ~46 in x, so the period is ~92
c=-400.                     # large negative amplitude, as the data suggest
fit f(x) 'data.txt' via a,b,c
plot f(x), "data.txt" w p
gives

Distance origin to point in the space

I want to compute the distance from the origin to every point, where the points are given by a data frame with two coordinates.
I have all the points as:
x y
1 0.0 0.0
2 -4.0 -2.8
3 -7.0 -6.5
4 -9.0 -11.1
5 -7.7 -16.9
6 -4.2 -22.4
7 -0.6 -27.7
8 3.0 -32.5
9 5.6 -36.7
10 8.4 -40.8
To get the distances I apply the Euclidean formula to each point. I have tried this:
distance <- function(trip) {
distance = lapply(trip, function (x) sqrt( (trip[x,]-trip[1,] )^2+ trip[,x]-trip[,1] )^2))
return(distance)
}
and this as well:
distance = apply(trip,1, function (x) sqrt( (trip[x,]-trip[1,] )^2+ (trip[,x]-trip[,1] )^2))
return(distance)
There's no need to loop through the individual rows of your data with the apply function. You can compute all the distances in one shot with vectorized arithmetic in R:
(distance <- sqrt((trip$x - trip$x[1])^2 + (trip$y - trip$y[1])^2))
# [1] 0.000000 4.882622 9.552487 14.290206 18.571484 22.790349 27.706497 32.638168 37.124790 41.655732
Computing all the distances at once with vectorized operations will be much quicker in cases where you have many points.
There is also a built-in function for distance matrix computation:
dist(trip, method = "euclidean")
If you don't want a full distance matrix but only the distance from each point to the first one, you can take the first row: as.matrix(dist(trip, method = "euclidean"))[1,]

KS test for power law

I'm attempting to fit a power-law distribution to a data set, using the method outlined by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman in their paper "Power-Law Distributions in Empirical Data".
I've found code to compare to my own, but I'm a bit mystified about where some of it comes from. The story thus far:
to identify a suitable xmin for the power-law fit, we take each possible xmin, fit a power law to the data above it, compute the corresponding exponent (a) and then the KS statistic (D) between the fit and the observed data, and finally find the xmin that corresponds to the minimum of D. The KS statistic is computed as follows:
cx <- c(0:(n-1))/n # n is the sample size for the data >= xmin
cf <- 1-(xmin/z)^a # the cdf for a powerlaw z = x[x>=xmin]
D <- max(abs(cf-cx))
What I don't get is where cx comes from; surely we should be comparing the empirical distribution against the calculated distribution, something along the lines of:
cx = ecdf(sort(z))
cf <- 1-(xmin/z)^a
D <- max(abs(cf-cx(z)))
I think I'm just missing something very basic, but please do correct me!
The answer is that they are (almost) the same. The easiest way to see this is to generate some data:
xmin = 1   # any positive value will do for this illustration
z = sort(runif(5, xmin, 10*xmin))
n = length(z)
Then examine the values of the two CDFs
R> (cx1 = c(0:(n-1))/n)
[1] 0.0 0.2 0.4 0.6 0.8
R> (cx2 = ecdf(z)(z))
[1] 0.2 0.4 0.6 0.8 1.0
Notice that they are almost the same: essentially cx1 corresponds to "greater than or equal to" (1 - cx1 is the empirical P(X >= x)), whilst cx2 corresponds to "greater than" (1 - cx2 is P(X > x)).
The advantage of the top approach is that it is very efficient and quick to calculate. The disadvantage is that if your data isn't truly continuous, e.g. z = c(1, 1, 2), cx1 is wrong. But then you shouldn't be fitting your data to a continuous distribution in that case anyway.
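To tie this back to the xmin search described in the question, here is a rough sketch of the full scan for the continuous case, following Clauset et al.; scan_xmin and the input x are made-up names:
scan_xmin <- function(x) {
  xmins <- head(sort(unique(x)), -1)   # drop the largest: need >1 point above xmin
  D <- sapply(xmins, function(xmin) {
    z <- sort(x[x >= xmin])
    n <- length(z)
    a <- n / sum(log(z / xmin))        # MLE; the usual alpha is 1 + a
    cx <- (0:(n - 1)) / n              # empirical CDF, as in the snippet above
    cf <- 1 - (xmin / z)^a             # fitted power-law CDF
    max(abs(cf - cx))
  })
  xmins[which.min(D)]
}
On real data you would also want the goodness-of-fit bootstrap the paper describes before trusting the resulting fit.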
