Noisy data presented in tabular form are given. Fit and build a model curve - scilab

Noisy data presented in tabular form are given. Fit and build a model curve. Choose a functional dependence of the form F(g) = c0 + c1*g + c2*g^2 + c3*g^3 (a cubic polynomial, as fitted in the code below).

Here is the Scilab code for your first example; you will be able to adapt it to other cases. To evaluate the obtained polynomial F, use the Scilab function horner (see the associated help page by typing help horner).
gd = [1.2 1.4 1.6 1.8 2.1 2.2 2.4 2.6 2.8 3.1]'; // abscissae of the noisy data
Fd = [1.55 2.73 3.91 5.51 7.11 9.12 11.12 12.91 15.45 17.91]'; // ordinates of the noisy data
c = [ones(gd), gd, gd.^2, gd.^3]\Fd; // least-squares solution of the Vandermonde system
F = poly(c,"g","coeff"); // build the polynomial in g from its coefficients
disp(F)
The above script displays:
7.6508968 - 14.91761g + 9.8440484g² - 1.2735795g³
You can plot the graph of the polynomial and the original noisy data like this:
g = linspace(1.2,3.1,100);
plot(g, horner(F,g),gd,Fd,'o')

Related

How to properly fit a non-linear equation to a set of datapoints?

I have a curve of net longwave radiation (QL) data, which are calculated as follows:
QL = a*Ta^4 - b*Ts^4
where a and b are constants, Ta is the air temperature, and Ts is the surface temperature.
If I plot a curve of QL versus Ta - Ts, what type of equation should I use to fit the data in the form y = f(x), where x = Ta - Ts?
Thanks
Ta-Ts      QL (W m-2)
-20.5 -176.683672
-19.5 -171.0655836
-18.5 -165.8706233
-17.5 -158.9990897
-16.5 -154.2715535
-15.5 -147.5376901
-14.5 -141.2410818
-13.5 -135.3387669
-12.5 -129.3971791
-11.5 -122.0777208
-10.5 -117.475907
-9.5 -111.107148
-8.5 -104.5999237
-7.5 -99.82769298
-6.5 -93.43215832
-5.5 -87.6278432
-4.5 -81.85415752
-3.5 -76.5997892
-2.5 -70.26308516
-1.5 -65.49437303
-0.5 -60.78052134
0.5 -56.32077454
1.5 -51.74037492
2.5 -47.30542394
3.5 -42.92298839
4.5 -38.13260904
5.5 -34.22676827
6.5 -30.49502686
7.5 -26.89383663
8.5 -22.259631
The complete data are at https://docs.google.com/spreadsheets/d/1e3gNCKQesrGe9ESrEIUcQw3umERzNRt0/edit?usp=sharing&ouid=115727378140347660572&rtpof=true&sd=true, where:
TS = surface temperature (degrees Celsius);
TA = air temperature (degrees Celsius);
Lin = longwave in (0.8 * 5.67E-8 * (TA+273.15)^4) (W m-2);
Lout = longwave out (0.97 * 5.67E-8 * (TS+273.15)^4) (W m-2);
QL = Lin - Lout (W m-2);
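These definitions translate directly into R (a minimal sketch; sigma = 5.67E-8 is the Stefan-Boltzmann constant used in the formulas above, and the function name QL is my own):
sigma <- 5.67e-8                            # Stefan-Boltzmann constant (W m-2 K-4)
QL <- function(TA, TS) {                    # TA, TS in degrees Celsius
  Lin  <- 0.80 * sigma * (TA + 273.15)^4    # longwave in
  Lout <- 0.97 * sigma * (TS + 273.15)^4    # longwave out
  Lin - Lout                                # QL (W m-2)
}
QL(10, 25)                                  # example call: air at 10 C, surface at 25 C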
The notation QL = y = f(x) is fallacious because QL does not depend on one variable only; it depends on two independent variables, Ta and Ts.
So one has to write y = F(Ta,Ts), or equivalently y = g(x,Ta), or equivalently y = h(x,Ts), with x = Ta - Ts.
Any one of those functions can be determined by nonlinear regression if we have data in the form of a table of three columns (not only two columns), for example:
(Ta,Ts,y) to find the function F(Ta,Ts),
or (x,Ta,y) to find the function g(x,Ta),
or (x,Ts,y) to find the function h(x,Ts).
In fact one cannot definitively answer your question, in which something is missing: either measurements of another parameter, or another relationship between the parameters in addition to the relationship x = Ta - Ts.
Of course one can compute (for example) the coefficients A, B, C, ... of a polynomial regression of the kind f(x) = A + B*x + C*x^2 + ... and get a very good fit.
The coefficients A, B, C are purely mathematical, without physical significance. The coefficients a and b in QL = a*Ta^4 - b*Ts^4 cannot be derived from the coefficients A, B, C without more physical information, as already pointed out.
I took your data and did a 4th order polynomial fit. Here's the result:
QL = -58.607 + x*(4.8336 + x*(-0.0772 + x*(-2e-5 + x*8e-5)))
R² = 0.9999
x = (Ta - Ts)
If you want the equation to be in terms of Ta and Ts instead of the difference, you should substitute and do the algebra.
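For reference, such a fit can be reproduced in R with lm (a sketch, not necessarily the tool the answerer used; only six of the (x, QL) pairs from the table above are inlined here, so append the remaining rows the same way):
d <- read.table(header = TRUE, text = "
x      QL
-20.5  -176.683672
-19.5  -171.0655836
-18.5  -165.8706233
-17.5  -158.9990897
 7.5   -26.89383663
 8.5   -22.259631
")
fit <- lm(QL ~ poly(x, 4, raw = TRUE), data = d)   # 4th-order polynomial in x
coef(fit)                 # the coefficients A, B, C, D, E of A + B*x + ... + E*x^4
summary(fit)$r.squared    # coefficient of determination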

Mixed integer programming R: Least absolute deviation with cost associated with each regressor

I have been presented with a problem regarding the minimization of absolute error, the problem known as LAD (least absolute deviation); but, each regressor being the result of an expensive test with an associated cost, one should refrain from using regressors that do not explain variance to a high degree. It takes the following form:
minimize sum_{i=1..N} E_i + lambda * sum_{j=1..S} C_j*z_j
where N is the total number of observations, E_i the absolute deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C_j the cost associated with performing test j.
So far, I have proceeded as usual. To make it linear, I split the absolute value into two errors, e^+ and e^-, where e = y_i - (B_0 + sum(B_j*X_ij)), and added the following constraints:
z_j in {0,1}, a binary variable indicating whether regressor j enters my model;
B_j <= M*z_j; B_j >= -M*z_j (big-M bounds forcing B_j = 0 whenever z_j = 0);
e^+, e^- >= 0.
A toy subset of data I'm working has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrices
y <- read.csv(file="PATH\\y.csv", header = TRUE, sep = ",") # dim = 100*1
regresores <- read.csv(file="PATH\\regressors.csv", header = TRUE, sep = ",") # dim = 100*11
cost <- read.csv(file="PATH\\cost.csv", header = TRUE, sep = ",") # dim = 1*11
for (i in seq(0, 1, by = 0.1)){#so as to have a collection of models with different penalties
obj.fun <- c(1,1,i*coste)
constr <- matrix(
c(y,regresores,-regresores),
c(-y,-regresores,regresores),
sum(regresores),ncol = ,byrow = TRUE)
constr.dir <- c("<=",">=","<=","==")
rhs<-c(regresores,-regresores,1,binary)
sol<- lp("min", obj.fun, constr, constr.tr, rhs)
sol$objval
sol$solution}
I know there is a LAD function in R, but for consistency with my colleagues, as well as a pretty annoying PhD tutor, I have to implement this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax, or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regressors, -regressors), c(-y, -regressors, regressors), : non-numeric matrix extent".
Mainly, I intended it to create said weighted LAD model and return the solution for each value of lambda, from 0 to 1 in steps of 0.1.
Thanks in advance, and sorry for any inconvenience; neither English nor R is my native language.
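For what it's worth, a working lpSolve formulation along the lines described above might look like this (a sketch under my own assumptions: the toy data from the question are inlined, M = 1000 and lambda = 0.5 are arbitrary choices, and the intercept and coefficients are split into positive and negative parts because lp() assumes all variables are non-negative):
library(lpSolve)

# toy data from the question
y <- c(5, 5, 5, 6, 7, 5)
X <- cbind(fixed.acidity    = c(7.5, 5.6, 7.4, 6.7, 6.1, 9.7),
           volatile.acidity = c(0.610, 0.540, 0.965, 0.460, 0.400, 0.690),
           citric.acid      = c(0.26, 0.04, 0.00, 0.24, 0.16, 0.32))
C <- c(0.26, 0.6, 0.52)    # cost of each regressor
n <- nrow(X); p <- ncol(X)
M <- 1000                  # big-M bound on |B_j|
lambda <- 0.5              # one penalty value; loop over seq(0, 1, by = 0.1) as intended

# variable order: B0+, B0-, B+ (p), B- (p), e+ (n), e- (n), z (p)
nv  <- 2 + 2*p + 2*n + p
obj <- c(0, 0, rep(0, 2*p), rep(1, 2*n), lambda * C)

# residual definition rows: B0 + X %*% B + e+ - e- = y
A1 <- cbind(1, -1, X, -X, diag(n), -diag(n), matrix(0, n, p))
# big-M link rows: B_j+ + B_j- - M*z_j <= 0
A2 <- cbind(0, 0, diag(p), diag(p), matrix(0, p, 2*n), -M * diag(p))

sol <- lp("min", obj, rbind(A1, A2),
          c(rep("=", n), rep("<=", p)),
          c(y, rep(0, p)),
          binary.vec = (nv - p + 1):nv)
sol$objval     # minimized objective
sol$solution   # variable values in the order listed above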

Support Vector Machine on R and WEKA

My data generated strange results with svm on R from the e1071 package, so I tried to check whether the R svm can generate the same results as WEKA (or Python), since I've been using WEKA in the past.
I googled the question and found one that has the exact same confusion as mine, but without an answer. This is the question.
So I hope that I can get an answer here.
To make things easier, I'm also using the iris data set, training a model (SMO in WEKA, and svm from the R package e1071) on the whole iris data, and testing on itself.
WEKA parameters:
weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V 10 -W 1 -K "weka.classifiers.functions.supportVector.RBFKernel -G 0.01 -C 250007"
Other than the defaults, I changed the kernel to RBFKernel to make it consistent with the R function.
The result is:
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 46 4 | b = Iris-versicolor
0 7 43 | c = Iris-virginica
R script:
library(e1071)
model <- svm(iris[,-5], iris[,5], kernel="radial", epsilon=1.0E-12)
res <- predict(model, iris[,-5])
table(pred = res, true = iris[,ncol(iris)])
The result is:
true
pred setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 2
virginica 0 2 48
I'm not a machine learning person, so I'm guessing the default parameters are very different for these two methods. For example, e1071 has 0.01 as the default epsilon while WEKA has 1.0E-12. I tried to read through the manuals and wanted to make all parameters identical, but many of the parameters do not seem comparable to me.
Thanks.
Refer to http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html for the RWeka parameters of SMO, and use ?svm to find the corresponding parameters of the e1071 svm implementation.
As per ?svm, the R e1071 svm is an interface to libsvm and seems to use standard QP solvers.
For multiclass-classification with k levels, k > 2, libsvm uses the 'one-against-one' approach, in which k(k-1)/2 binary classifiers are trained; the appropriate class is found by a voting scheme.
libsvm internally uses a sparse data representation, which is also supported at a high level by the package SparseM.
By contrast, ?SMO in RWeka
implements John C. Platt's sequential minimal optimization algorithm for training a support vector classifier, using polynomial or RBF kernels. Multi-class problems are solved using pairwise classification.
So, these two implementations are different in general (so the results may differ a little). Still, if we set the corresponding hyper-parameters to the same values, the confusion matrices are almost the same:
library(RWeka)
model.smo <- SMO(Species ~ ., data = iris,
control = Weka_control(K = list("RBFKernel", G=2), C=1.0, L=0.001, P=1.0E-12, N=0, V=10, W=1234))
res.smo <- predict(model.smo, iris[,-5])
table(pred = res.smo, true = iris[,ncol(iris)])
true
pred setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 1
virginica 0 3 49
library(e1071)
set.seed(1234)
model.svm <- svm(iris[,-5], iris[,5], kernel="radial", cost=1.0, tolerance=0.001, epsilon=1.0E-12, scale=TRUE, cross=10)
res.svm <- predict(model.svm, iris[,-5])
table(pred = res.svm, true = iris[,ncol(iris)])
true
pred setosa versicolor virginica
setosa 50 0 0
versicolor 0 49 1
virginica 0 1 49
Also refer to https://stats.stackexchange.com/questions/130293/svm-and-smo-main-differences and https://www.quora.com/Whats-the-difference-between-LibSVM-and-LibLinear

How to fit an exponential curve such as a*b^t in R?

input <- "
t y
1 5.3
2 7.2
3 9.6
4 12.9
5 17.1
6 23.2"
dat <- read.table(textConnection(input), header = TRUE, sep = "")
t <- dat[,1]
y <- dat[,2]
y = 3.975*(1.341^t) is the result of the fit; how can I use the nls function to get it?
Maybe the problem is how to express the formula?
nls(y~(a*b^t))
Error in getInitial.default(func, data, mCall = as.list(match.call(func, :
no 'getInitial' method found for "function" objects
Try
nls(y ~ a*b^t, start = c(a = 4, b = 1))
nls needs a good starting value for each parameter.
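If guessing starting values is difficult, a common trick is to take them from a log-linear fit, since log(y) = log(a) + t*log(b) (a sketch using dat from the question above):
lin <- lm(log(y) ~ t, data = dat)        # linear fit on the log scale
start <- list(a = exp(coef(lin)[[1]]),   # intercept -> a
              b = exp(coef(lin)[[2]]))   # slope     -> b
fit <- nls(y ~ a*b^t, data = dat, start = start)
coef(fit)                                # approximately a = 3.975, b = 1.341, as stated above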

Statistics Question: Kernel Smoothing in R

I have data of this form:
x y
1 0.19
2 0.26
3 0.40
4 0.58
5 0.59
6 1.24
7 0.68
8 0.60
9 1.12
10 0.80
11 1.20
12 1.17
13 0.39
I'm currently plotting a kernel-smoothed estimate of y as a function of x using this code:
smoothed = ksmooth( d$resi, d$score, bandwidth = 6 )  # d$resi, d$score hold the x, y columns shown above
plot( smoothed )
I simply want a plot of the x versus smoothed(y) values.
However, the documentation for ksmooth suggests that this isn't the best kernel-smoothing package available:
This function is implemented purely for compatibility with S, although it is nowhere near as slow as the S function. Better kernel smoothers are available in other packages.
What other kernel smoothers are better and where can these smoothers be found?
If you "simply want a plot of the x versus smoothed(y)", then I recommend considering loess in package stats - it's simple, fast and effective. If instead you really want a regression based on kernel smoothing, then you could try locpoly in package KernSmooth or npreg in package np.
