Inferring/expressing the polynomial equation of a fitted smoothing spline? - r

If I smooth a data vector with a smoothing cubic spline, my understanding is that each 'segment' between knots should be representable as a cubic polynomial.
Is it possible to infer the equation of each segment from the spline coefficients after, e.g., fitting with the smooth.spline function in R?
This is straightforward for an interpolating spline, as the array of polynomial coefficients is generated explicitly. However, I've not been able to find an answer as to whether this is also possible for a smoothing spline or regression spline.
The reason for wanting this is, in turn, to obtain an analytical expression for the derivative of a segment of the spline.
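One possible route (a sketch of my own, not from the original question, using made-up data x and y): since each segment between adjacent knots of a cubic smoothing spline is a single cubic, you can recover the coefficients of a segment by interpolating the fitted curve exactly at four points inside it, and then differentiate that polynomial analytically.
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.2)
fit <- smooth.spline(x, y)
# knots are stored rescaled to [0, 1]; map them back to the x scale
kn <- unique(fit$fit$min + fit$fit$range * fit$fit$knot)
# pick one interior segment [kn[i], kn[i + 1]] and evaluate the fit at 4 points
i  <- 5
xs <- seq(kn[i], kn[i + 1], length.out = 4)
ys <- predict(fit, xs)$y
# solve the 4 x 4 Vandermonde system: ys = a1 + a2*xs + a3*xs^2 + a4*xs^3
a <- solve(outer(xs, 0:3, "^"), ys)
# analytical derivative on this segment: a[2] + 2*a[3]*x + 3*a[4]*x^2
# (predict(fit, xs, deriv = 1) should agree numerically)
dcoef <- a[2:4] * (1:3)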

Related

How can I do a multiple non-linear regression in R?

I used Curve Expert to get a non-linear equation for one y.
Now I have two X variables, but I don't know how to merge them into one equation.
In other words, I want to know how I can do a multiple non-linear regression with two non-linear equations in R.
Is there a package for doing this, or can I use the nls function for it?
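For illustration only (my own sketch with a made-up model form, not from the thread): nls can fit a single equation that involves both predictors at once, for example
set.seed(1)
x1 <- runif(100, 0, 2)
x2 <- runif(100, 0, 2)
y  <- 2 * exp(0.8 * x1) + 1.5 * x2^2 + rnorm(100, sd = 0.1)
# hypothetical combined model y ~ a*exp(b*x1) + c*x2^d; substitute your own equation
fit <- nls(y ~ a * exp(b * x1) + c * x2^d,
           start = list(a = 1, b = 1, c = 1, d = 2))
summary(fit)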

confidence interval of estimates in a fitted hybrid model by spatstat

Hybrid Gibbs models are flexible for fitting spatial point pattern data; however, I am confused about how to get confidence intervals for the fitted model's estimates. For instance, I fitted a hybrid Geyer model including a hardcore and a Geyer saturation component and got the estimates:
Mo.hybrid<-Hybrid(H=Hardcore(), G=Geyer(81,1))
my.hybrid<-ppm(my.X~1,Mo.hybrid, correction="bord")
#beta = 1.629279e-06
#Hard core distance: 31.85573
#Fitted G interaction parameter gamma: 10.241487
What I am interested in is gamma, which represents the aggregation of points. Obviously, the data X is a sample, e.g. of cells in an anatomical image. In order to report a statistical result, a confidence interval for gamma is needed; however, I do not have replicates of the image data.
Can I simulate the fitted hybrid model 10 times and then refit to get a confidence interval for the estimate? Something like:
mo.Y<-rmhmodel(cif=c("hardcore","geyer"),
par=list(list(beta=1.629279e-06,hc=31.85573),
list(beta=1, gamma=10.241487,r=81,sat=1)), w=my.X)
Y1<-rmh(model=mo.Y, control = list(nrep=1e6,p=1, fixall=TRUE),
start=list(n.start=c(npoint(my.X))))
Y1.fit<-ppm(Y1~1, Mo.hybrid,rbord=0.1)
# simulate and fit Y2,Y3,...Y10 in same way
or:
Y10<-simulate(my.hybrid,nsim=10)
Y1.fit<-ppm(Y10[1]~1, Mo.hybrid,rbord=0.1)
# fit Y2,Y3,...Y10 in same way
Certainly the algorithms are different: rmh() can control the simulated intensity while simulate() does not.
Now the questions are:
Is it right to use simulation to get a confidence interval for the estimate?
Or does the fitted model provide an interval estimate that could be extracted?
If simulation is OK, which algorithm is better in my case?
The function confint calculates confidence intervals for the canonical parameters of a statistical model. It is defined in the standard stats package. You can apply it to fitted point process models in spatstat: in your example just type confint(my.hybrid).
You wanted a confidence interval for the non-canonical parameter gamma. The canonical parameter is theta = log(gamma), so if you compute exp(confint(my.hybrid)) you can read off the confidence interval for gamma.
Confidence intervals and other forms of inference for fitted point process models are discussed in detail in the spatstat book chapters 9, 10 and 13.
The confidence intervals described above are the asymptotic ones (based on the asymptotic variance matrix using the central limit theorem).
If you really wanted to estimate the variance-covariance matrix by simulation, it would be safer and easier to fit the model using method='ho' (which performs the simulation) and then apply confint as before (which would then use the variance of the simulations rather than the asymptotic variance).
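A short sketch of both routes, reusing the objects from the question (my own illustration, not from the original answer):
ci.theta <- confint(my.hybrid)   # CIs for the canonical parameters (gamma is on the log scale)
exp(ci.theta)                    # back-transform; the Geyer interaction row gives the CI for gamma
# refit with the Huang-Ogata simulation-based method, then apply confint as before
my.hybrid.ho <- ppm(my.X ~ 1, Mo.hybrid, correction = "bord", method = "ho")
confint(my.hybrid.ho)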
rmh.ppm and simulate.ppm are essentially the same algorithm, apart from some book-keeping. The differences observed in your example occur because you passed different arguments. You could have passed the same arguments to either of these functions.

Degrees of freedom in the smoothing spline fit in a GAM formula

If the smoothing spline is simply a natural cubic spline with knots at every unique value of x_i, then why does gam::s() in R need a degrees-of-freedom argument?
Because it is smoothing, not interpolation. The degrees of freedom tell the fit how much complexity you want in the fitted spline.
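A tiny illustration (my own, with simulated data) of how df controls the wiggliness of the smooth term in the gam package:
library(gam)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
fit.smooth <- gam(y ~ s(x, df = 4))    # fewer degrees of freedom: smoother curve
fit.wiggly <- gam(y ~ s(x, df = 20))   # more degrees of freedom: wigglier curve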

High (or very high) order polynomial regression in R (or alternatives?)

I would like to fit a (very) high-order polynomial regression to a set of data in R; however, the poly() function appears to have a limit of order 25.
For this application I need an order on the range of 100 to 120.
model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
# OK
Polynomials and orthogonal polynomials
poly(x) has no hard-coded limit for degree. However, there are two numerical constraints in practice.
Basis functions are constructed at the unique x values. A polynomial of degree k has k + 1 basis functions and coefficients. poly generates the basis without the intercept term, so degree = k implies k basis functions and k coefficients. If there are n unique x values, we must have k < n, otherwise there is simply not enough information to construct the polynomial. Inside poly(), the following line checks this condition:
if (degree >= length(unique(x)))
stop("'degree' must be less than number of unique points")
The correlation between x^k and x^(k+1) gets closer and closer to 1 as k increases; how quickly it does so of course depends on the x values. poly first generates the ordinary polynomial basis, then performs a QR factorization to find an orthogonal span. If numerical rank-deficiency occurs between x^k and x^(k+1), poly will also stop and complain:
if (QR$rank < degree)
stop("'degree' must be less than number of unique points")
But the error message is not informative in this case. Furthermore, this does not have to be an error; it could be a warning instead, after which poly could reset degree to the rank and proceed. Maybe R core can improve on this bit?
Your trial-and-error shows that you can't construct a polynomial of degree greater than 25. You can first check length(unique(q)); if a degree smaller than this still triggers the error, you know for sure it is due to numerical rank-deficiency. A quick diagnostic along these lines is sketched below.
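A sketch of that diagnosis, assuming the asker's vector q (this roughly mirrors what poly() checks internally):
length(unique(q))                    # degree must be strictly less than this
X <- outer(q - mean(q), 0:30, "^")   # raw (centred) basis up to degree 30
qr(X)$rank                           # if the numerical rank falls short, poly(q, 30) cannot succeed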
But what I want to say is that a polynomial of degree greater than 3-5 is never useful! The critical reason is Runge's phenomenon; in statistical terms, a high-order polynomial always badly overfits the data! Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No: the polynomials of degree k form a vector space, so whatever basis you use for representation, they have the same span!
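To see the overfitting concretely, here is a small made-up example (my own, not from the question) where a degree-20 polynomial chases the noise and oscillates near the boundaries:
set.seed(1)
q <- seq(-1, 1, length.out = 30)
noisy.y <- sin(2 * pi * q) + rnorm(length(q), sd = 0.2)
fit20 <- lm(noisy.y ~ poly(q, 20))
newq <- seq(-1, 1, length.out = 500)
plot(q, noisy.y)
lines(newq, predict(fit20, newdata = data.frame(q = newq)))   # wild wiggles, especially near the ends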
Splines: piecewise cubic polynomials and their use in regression
Polynomial regression is indeed helpful, but we often want piecewise polynomials instead. The most popular choice is the cubic spline. Just as there are different representations for polynomials, there are plenty of representations for splines:
truncated power basis
natural cubic spline basis
B-spline basis
The B-spline basis is the most numerically stable, as it has compact support. As a result, the cross-product matrix X'X is banded, so solving the normal equations (X'X) b = X'y is very stable.
In R, we can use the bs function from the splines package (one of the base R packages) to get a B-spline basis. For bs(x), the only numerical constraint on the degrees of freedom df is that we can't have more basis functions than length(unique(x)).
I am not sure what your data look like, but perhaps you can try
library(splines)
model <- lm(noisy.y ~ bs(q, df = 10))
Penalized smoothing / regression splines
A regression spline is still likely to overfit your data if you keep increasing the degrees of freedom. Finding the best model then comes down to choosing the best degrees of freedom.
A great approach is to use a penalized smoothing spline or penalized regression spline, so that model estimation and selection of the degrees of freedom (i.e., "smoothness") are integrated.
The smooth.spline function in the stats package can do both. Unlike what its name seems to suggest, most of the time it fits a penalized regression spline rather than a smoothing spline. Read ?smooth.spline for more. For your data, you may try
fit <- smooth.spline(q, noisy.y)
(Note, smooth.spline has no formula interface.)
Additive penalized splines and Generalized Additive Models (GAM)
Once we have more than one covariate, we need additive models to overcome the "curse of dimensionality" while remaining sensible. Depending on the representation of the smooth functions, a GAM can come in various forms. The most popular, in my opinion, is the mgcv package, which is a recommended package shipped with R.
You can still fit a univariate penalized regression spline with mgcv:
library(mgcv)
fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))

ggplot2 geom_smooth() in R... loess, gam, splines, etc

Hi I'm looking for some clarification here.
Context: I want to draw a line in a scatterplot that doesn't appear parametric, therefore I am using geom_smooth() in a ggplot. It automatically prints the message: geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method. I gather gam stands for generalized additive model, and that it uses a cubic spline here.
Are the following perceptions correct?
- Loess estimates the response at specific values.
- Splines are approximations that connect the different piecewise functions fitted to the data (which make up the generalized additive model), and cubic splines are the specific type of spline used here.
Lastly, when should splines be used, and when should loess be used?
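For what it's worth, here is a small sketch (my own, with simulated data) that overlays the two smoothers so they can be compared directly; the gam formula is the one reported in the message above:
library(ggplot2)
set.seed(1)
d <- data.frame(x = runif(500))
d$y <- sin(3 * d$x) + rnorm(500, sd = 0.3)
ggplot(d, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, colour = "red") +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), se = FALSE, colour = "blue")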
