NIST non-linear fitting benchmark Misra1a - numerical

I have been toying with some code to solve this with a Levenberg-Marquardt solver, but I am puzzled by what is given here:
http://www.itl.nist.gov/div898/strd/nls/data/LINKS/DATA/Misra1a.dat
When I plug in the certified values and look at the predicted values, they look quite different from the actual Y values... surely I am doing something wrong here - what?
Y = b1*(1 - exp(-b2*X))
b1 = 238.94212918
b2 = 0.00055015643181
X Y Y-estimate
10.07 77.6 1.32E+00
14.73 114.9 1.93E+00
17.94 141.1 2.35E+00
23.93 190.8 3.13E+00
29.61 239.9 3.86E+00
35.18 289 4.58E+00
40.02 332.8 5.20E+00
44.82 378.4 5.82E+00
50.76 434.8 6.58E+00
55.05 477.3 7.13E+00
61.01 536.8 7.89E+00
66.4 593.1 8.57E+00
75.47 689.1 9.72E+00
81.78 760 1.05E+01
I thought maybe the base was 10 and tried power(10, ...) instead of exp, but that does not seem to be the problem.

Your X and Y columns are swapped (really, go look closely at their table again). Using the first certified X value, 77.6:
238.94212918*(1 - exp(-0.00055015643181*77.6)) = 9.98626636447322174420
versus the 10.07 you used as X:
238.94212918*(1 - exp(-0.00055015643181*10.07)) = 1.32009728485679509663
which reproduces your first Y-estimate exactly. Incidentally, their model also includes a final + e (error) term.
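A quick check in R, using the certified parameters and the table above with the columns read the right way around (the first NIST column is Y, the second is X):
b1 <- 238.94212918
b2 <- 0.00055015643181
x <- c(77.6, 114.9, 141.1, 190.8, 239.9, 289.0, 332.8, 378.4, 434.8, 477.3, 536.8, 593.1, 689.1, 760.0)
y <- c(10.07, 14.73, 17.94, 23.93, 29.61, 35.18, 40.02, 44.82, 50.76, 55.05, 61.01, 66.40, 75.47, 81.78)
yhat <- b1 * (1 - exp(-b2 * x))
cbind(y, yhat, resid = y - yhat)  # residuals are now small, as expected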


How to properly fit a non-linear equation to a set of datapoints?

I have a curve of net longwave radiation (QL) data, calculated as follows:
QL = a*Ta^4 - b*Ts^4
where a and b are constants, Ta is the air temperature, and Ts is the surface temperature.
If I plot QL versus Ta-Ts, what type of equation should I use to fit the data as y = f(x), where x = (Ta-Ts)?
Thanks
Ta-Ts  QL
-20.5 -176.683672
-19.5 -171.0655836
-18.5 -165.8706233
-17.5 -158.9990897
-16.5 -154.2715535
-15.5 -147.5376901
-14.5 -141.2410818
-13.5 -135.3387669
-12.5 -129.3971791
-11.5 -122.0777208
-10.5 -117.475907
-9.5 -111.107148
-8.5 -104.5999237
-7.5 -99.82769298
-6.5 -93.43215832
-5.5 -87.6278432
-4.5 -81.85415752
-3.5 -76.5997892
-2.5 -70.26308516
-1.5 -65.49437303
-0.5 -60.78052134
0.5 -56.32077454
1.5 -51.74037492
2.5 -47.30542394
3.5 -42.92298839
4.5 -38.13260904
5.5 -34.22676827
6.5 -30.49502686
7.5 -26.89383663
8.5 -22.259631
The complete data: https://docs.google.com/spreadsheets/d/1e3gNCKQesrGe9ESrEIUcQw3umERzNRt0/edit?usp=sharing&ouid=115727378140347660572&rtpof=true&sd=true. The columns are:
TS = surface temperature (degrees Celsius);
TA = air temperature (degrees Celsius);
Lin = longwave in (0.8 * 5.67E-8 * (TA+273.15)^4) (W m-2);
Lout = longwave out (0.97 * 5.67E-8 * (TS+273.15)^4) (W m-2);
QL = Lin - Lout (W m-2);
The notation QL = y = f(x) is fallacious because QL does not depend on one variable only; it depends on two independent variables, Ta and Ts.
So one has to write y = F(Ta,Ts), or equivalently y = g(x,Ta), or equivalently y = h(x,Ts), with x = Ta-Ts.
Any one of these functions can be determined by nonlinear regression if we have data in the form of a table of three columns (not only two), for example:
(Ta, Ts, y) to find the function F(Ta,Ts),
or (x, Ta, y) to find the function g(x,Ta),
or (x, Ts, y) to find the function h(x,Ts).
In fact, one cannot definitively answer your question as posed: something is missing, either measurements of another parameter or another relationship between the parameters in addition to x = Ta-Ts.
Of course, one can compute (for example) the coefficients A, B, C, ... of a polynomial regression of the kind f(x) = A + B*x + C*x^2 + ... and get a very good fit.
The coefficients A, B, C are purely mathematical, without physical significance. The coefficients a and b in F(Ta,Ts) = a*Ta^4 - b*Ts^4 cannot be derived from A, B, C without more physical information, as already pointed out.
I took your data and did a 4th-order polynomial fit. Here's the result:
QL = -58.607 + x*(4.8336 + x*(-0.0772 + x*(-2e-5 + x*8e-5)))
R^2 = 0.9999
x = (Ta - Ts)
If you want the equation in terms of Ta and Ts instead of the difference, substitute and do the algebra.
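For reference, a minimal sketch of that polynomial fit in R, assuming the two posted columns have been read into a data frame d with names x and QL:
fit <- lm(QL ~ poly(x, 4, raw = TRUE), data = d)
summary(fit)$r.squared  # ~0.9999 on the posted data
coef(fit)               # the intercept and the four polynomial coefficients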

Mixed integer programming R: Least absolute deviation with cost associated with each regressor

I have been presented with a problem regarding the minimization of absolute error, known as LAD (least absolute deviations); but, since each regressor is the result of an expensive test with an associated cost, one should refrain from using regressors that don't explain variance to a high degree. The problem takes the following form:
min sum_{i=1..N} E_i + lambda * sum_{j=1..S} C_j*z_j
where N is the total number of observations, E_i the absolute deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C_j the cost associated with performing test j.
So far, I have proceeded as usual. To make it linear, I split the absolute value into two errors, e^+ and e^-, where e_i = y_i - (B_0 + sum_j(B_j*X_ij)), with the following constraints:
z_j in {0,1}, a binary value for whether regressor j enters my model;
B_j <= M*z_j; B_j >= -M*z_j;
e^+, e^- >= 0.
A toy subset of data I'm working has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrices
library(lpSolve)
y <- read.csv(file = "PATH\\y.csv", header = TRUE, sep = ",")                    # dim = 100*1
regresores <- read.csv(file = "PATH\\regressors.csv", header = TRUE, sep = ",")  # dim = 100*11
cost <- read.csv(file = "PATH\\cost.csv", header = TRUE, sep = ",")              # dim = 1*11
for (i in seq(0, 1, by = 0.1)) {  # so as to have a collection of models with different penalties
  obj.fun <- c(1, 1, i * cost)
  constr <- matrix(
    c(y, regresores, -regresores),
    c(-y, -regresores, regresores),
    sum(regresores), ncol = , byrow = TRUE)
  constr.dir <- c("<=", ">=", "<=", "==")
  rhs <- c(regresores, -regresores, 1, binary)
  sol <- lp("min", obj.fun, constr, constr.dir, rhs)
  sol$objval
  sol$solution
}
I know there is a LAD function in R but, for consistency's sake with my colleagues, as well as a pretty annoying PhD tutor, I have to perform this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax, or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regressors, -regressors), c(-y, -regressors, regressors), : non-numeric matrix extent".
Mainly, I intended for it to create said weighted LAD model and return results for the different values of lambda, from 0 to 1 in steps of 0.1.
Thanks in advance, and sorry for any inconvenience; neither English nor R is my native language.
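For what it's worth, the reported error comes from the matrix() call: its second argument is nrow, and it is being handed a list built from data frames (c(-y, -regresores, regresores)), hence "non-numeric matrix extent". Below is a minimal lpSolve sketch of the formulation described above; it is one possible reading, not a definitive implementation: the coefficients are split into positive and negative parts because lp() makes all variables non-negative, and M is a user-chosen big-M bound.
library(lpSolve)
y <- as.numeric(read.csv("PATH\\y.csv")[, 1])     # n observations
X <- as.matrix(read.csv("PATH\\regressors.csv"))  # n x S
cost <- as.numeric(read.csv("PATH\\cost.csv"))    # length S
n <- nrow(X); S <- ncol(X)
M <- 1e3  # big-M bound on the coefficients (problem-specific choice)
# variables, in column order: B0+, B0-, B+ (S), B- (S), e+ (n), e- (n), z (S)
for (lambda in seq(0, 1, by = 0.1)) {
  obj <- c(0, 0, rep(0, 2 * S), rep(1, 2 * n), lambda * cost)
  # fit constraints: B0 + X %*% B + e+ - e- = y, with B = B+ - B-, B0 = B0+ - B0-
  fit <- cbind(1, -1, X, -X, diag(n), -diag(n), matrix(0, n, S))
  # big-M links: B <= M*z and B >= -M*z
  up <- cbind(0, 0, diag(S), -diag(S), matrix(0, S, 2 * n), -M * diag(S))
  lo <- cbind(0, 0, diag(S), -diag(S), matrix(0, S, 2 * n),  M * diag(S))
  sol <- lp("min", obj, rbind(fit, up, lo),
            c(rep("=", n), rep("<=", S), rep(">=", S)),
            c(y, rep(0, 2 * S)),
            binary.vec = (2 + 2 * S + 2 * n + 1):(2 + 2 * S + 2 * n + S))
  cat("lambda =", lambda, "objective =", sol$objval, "\n")
}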

Poisson GLM with categorical data

I'm trying to fit a Poisson generalized linear model to counts of categorical data labeled s and v. Since the data were collected in sessions of different duration (see session_dur_s), I want to include this information as an offset in the glm model.
Here is my table:
label session counts session_dur_s
s 1 587 6843
s 2 203 2095
s 3 187 1834
s 4 122 1340
s 5 40 1108
s 6 64 476
s 7 60 593
v 1 147 6721
v 2 57 2095
v 3 58 1834
v 4 22 986
v 5 8 1108
v 6 12 476
v 7 11 593
My data:
label <- c("s","s","s","s","s","s","s","v","v","v","v","v","v","v")
session <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7)
counts <- c(587,203,187,122,40,64,60,147,54,58,22,8,12,11)
session_dur_s <-c(6843,2095,1834,1340,1108,476,593,6721,2095,1834,986,1108,476,593)
sv_dur <- data.frame(label,session,counts,session_dur_s)
That's my code:
sv_dur_mod <- glm(counts ~ label * session, data = sv_dur, family = "poisson", offset = session_dur_s)
summary(sv_dur_mod)
plot(allEffects(sv_dur_mod),type="response")
I can't execute the glm function because I receive this beautiful error:
Error: no valid set of coefficients has been found: please supply starting values
I'm not sure how to go about it. I would be really happy if a smart head could point out what I can do to make it work.
If there is a better model for predicting the counts over time for both the s and v labels, I'm more than open to going for it.
Many thanks for comments and suggestions!
P.S. I'm running it in an R Markdown script using the packages tidyverse, effects, and dplyr.
A Poisson GLM uses a log link by default. That is, it can be executed as:
sv_dur_mod <- glm(counts ~ label * session,
data = sv_dur,
family = poisson("log"))
Accordingly, a log offset is generally appropriate:
sv_dur_mod <- glm(counts ~ label * session,
data = sv_dur,
offset = log(session_dur_s),
family = poisson("log"))
This executes as expected. See the answer here for more information on using a log offset: https://stats.stackexchange.com/a/237980/70372
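If predictions for new session durations are needed later, one option is to move the offset into the formula so that predict() computes it from newdata. A sketch, assuming the sv_dur data frame from the question; session 8 and the 1000 s duration are made up for illustration:
sv_dur_mod <- glm(counts ~ label * session + offset(log(session_dur_s)),
                  data = sv_dur, family = poisson("log"))
# expected counts for a hypothetical 1000-second session 8 with label "s":
predict(sv_dur_mod,
        newdata = data.frame(label = "s", session = 8, session_dur_s = 1000),
        type = "response")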

Predict Future values by using OPERA package in R

I have been trying to understand opera, "Online Prediction by Expert Aggregation", by Pierre Gaillard and Yannig Goude. I read two posts, by Pierre Gaillard (http://pierre.gaillard.me/opera.html) and Rob Hyndman (https://robjhyndman.com/hyndsight/forecast-combinations/). However, I do not understand how to predict future values. In Pierre's example, newY = Y represents the test data set (Y <- data_test$Load), weekly observations of the French electric load. As shown below, the data end in December 2009. Now, how can I forecast, say, 2010 values? What will newY be then?
> tail(electric_load,5)
Time Day Month Year NumWeek Load Load1 Temp Temp1 IPI IPI_CVS
727 727 30 11 2009 0.9056604 63568.79 58254.42 7.220536 10.163839 91.3 88.4
728 728 7 12 2009 0.9245283 63977.13 63568.79 6.808929 7.220536 90.1 87.7
729 729 14 12 2009 0.9433962 78046.85 63977.13 -1.671280 6.808929 90.1 87.7
730 730 21 12 2009 0.9622642 66654.69 78046.85 4.034524 -1.671280 90.1 87.7
731 731 28 12 2009 0.9811321 60839.71 66654.69 7.434115 4.034524 90.1 87.7
I noticed that by multiplying the weights of MLpol0 by X, we get the same outputs as the online prediction values:
> weights <- predict(MLpol0, X, Y, type='weights')
> w<-weights[,1]*X[,1]+weights[,2]*X[,2]+weights[,3]*X[,3]
> predValues <- predict(MLpol0, newexpert = X, newY = Y, type='response')
Test_Data predValues w
620 65564.29 65017.11 65017.11
621 62936.07 62096.12 62096.12
622 64953.83 64542.44 64542.44
623 61580.44 60447.63 60447.63
624 71075.52 67622.97 67622.97
625 75399.88 72388.64 72388.64
626 65410.13 67445.63 67445.63
627 65815.15 62623.64 62623.64
628 65251.90 64271.97 64271.97
629 63966.91 61803.77 61803.77
630 64893.42 65793.14 65793.14
631 69226.32 67153.80 67153.80
But I am still not sure how to generate the weights without newY. Maybe we can use the final coefficients that are the output of MLpol to predict future values?
(c<-summary(MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square"))$coefficients)
[1] 0.585902 0.414098 0.000000
I am sorry I may be way off on this and my question may not make sense at all, but I really appreciate any help/insight.
The idea of the opera package is a bit different from classical batch machine learning methods with a training set and a testing set. The goal is to make sequential predictions. At each round t = 1,...,n:
1) the algorithm receives the experts' predictions for round t,
2) it makes a prediction for this time step by combining them,
3) it updates the weights used for the combination once the new observation arrives.
If you have out-of-sample forecasts (i.e., expert forecasts of future values without the corresponding observations), the best you can do is to take the last coefficients and make predictions with:
newexperts %*% model$coefficients
In practice, you may also want to use the averaged coefficients. You can obtain the same result with:
predict(object,        # for example, mixture(model = 'FS', loss.type = 'square')
        newexperts = , # matrix of out-of-sample expert predictions
        online = FALSE,
        type = 'response')
With online = FALSE the model does not need any newY and does not update itself. When you do provide newY, the algorithm does not cheat: it does not use the value at round t to make the prediction at round t. The values of newY are only used to update the coefficients step by step, as if the predictions were being made sequentially.
I hope this helps.
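Putting the two options together, a minimal sketch; newX stands for a hypothetical matrix of the experts' out-of-sample forecasts for the 2010 weeks, produced outside opera (e.g., by refitting the expert models on all data up to the end of 2009):
library(opera)
MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square")
fut1 <- newX %*% MLpol$coefficients  # combine with the last coefficients
fut2 <- predict(MLpol, newexperts = newX,
                online = FALSE, type = "response")  # same result via predict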

Understanding loess errors in R

I'm trying to fit a model using loess, and I'm getting errors such as "pseudoinverse used at 3", "neighborhood radius 1", and "reciprocal condition number 0". Here's a MWE:
x = 1:19
y = c(NA,71.5,53.1,53.9,55.9,54.9,60.5,NA,NA,NA
,NA,NA,178.0,180.9,180.9,NA,NA,192.5,194.7)
fit = loess(formula = y ~ x,
            control = loess.control(surface = "direct"),
            span = 0.3, degree = 1)
x2 = seq(0, 20, 0.1)
library(ggplot2)
qplot(x = x2,
      y = predict(fit, newdata = data.frame(x = x2)),
      geom = "line")
I realize I can fix these errors by choosing a larger span value. However, I'm trying to automate this fit, as I have about 100,000 time series (each of length about 20) similar to this one. Is there a way to automatically choose a span value that will prevent these errors while still providing a fairly flexible fit to the data? Or can anyone explain what these errors mean? I did a bit of poking around in the loess() and simpleLoess() functions, but I gave up at the point where C code was called.
Compare fit$fitted to y; you'll notice that something is wrong with your regression. Choose an adequate bandwidth, otherwise loess will just interpolate the data. With too few data points in a small bandwidth, the local linear function is nearly constant and triggers collinearity, hence the warnings about pseudoinverses and singularities. You won't see such errors with degree = 0 or with ksmooth. One intelligible, data-driven choice of span is cross-validation, about which you can ask on Cross Validated.
> fit$fitted
[1] 71.5 53.1 53.9 55.9 54.9 60.5 178.0 180.9 180.9 192.5 194.7
> y
[1] NA 71.5 53.1 53.9 55.9 54.9 60.5 NA NA NA NA NA 178.0
[14] 180.9 180.9 NA NA 192.5 194.7
You see an over-fit (a perfect fit) because the number of parameters in your model is as large as the effective sample size:
fit
#Call:
#loess(formula = y ~ x, span = 0.3, degree = 1, control = loess.control(surface = "direct"))
#Number of Observations: 11
#Equivalent Number of Parameters: 11
#Residual Standard Error: Inf
Or you might as well just use the automated geom_smooth() (again, setting geom_smooth(span = 0.3) throws the warnings):
ggplot(data=data.frame(x, y), aes(x, y)) +
geom_point() + geom_smooth()
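As for a data-driven span: here is a minimal leave-one-out cross-validation sketch (my own illustration, using the question's x and y); spans where loess fails outright are simply skipped, though loess may still emit the warnings above for small spans:
cv_span <- function(x, y, spans = seq(0.3, 1, by = 0.05)) {
  d <- data.frame(x = x, y = y)[is.finite(y), ]
  errs <- sapply(spans, function(s) {
    se <- sapply(seq_len(nrow(d)), function(i) {
      fit <- try(loess(y ~ x, data = d[-i, ], span = s, degree = 1,
                       control = loess.control(surface = "direct")),
                 silent = TRUE)
      if (inherits(fit, "try-error")) return(NA_real_)
      (d$y[i] - predict(fit, newdata = d[i, ]))^2
    })
    mean(se, na.rm = TRUE)  # leave-one-out MSE for this span
  })
  spans[which.min(errs)]
}
cv_span(x, y)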
