How to fit the best hybrid Gibbs model with spatstat? - r

I want to fit the best hybrid model to a ppp object with more than 100 points in a rectangular window, so I am proceeding by trial and error. It is natural to find the significant r for a Geyer model by looking at the centred L(r) plot, where there are several peaks and troughs. So I draw the centred L(r) function first:
L.inhom <- Linhom(my.X)
plot(L.inhom, iso - r ~ r, legend = FALSE)
Then I decide whether to include a hardcore component (the first trough suggests a hard core) and a Geyer component (suggested by the following peaks or troughs).
According to the spatstat book, one can always use profilepl() to find the best (r, sat) for a Geyer saturation model:
df <- expand.grid(r = seq(45, 200, by = 1), sat = c(1, 2, 3, 4, 5))
p.Geyer <- profilepl(df, Geyer, my.X ~ 1, correction = "translate", aic = TRUE)
as.ppm(p.Geyer)  # extract the fitted model with the optimal (r, sat)
I feel that the (r, sat) found by profilepl() is good enough for the global pattern of my ppp (is that right?). Indeed, a hybrid model including a hardcore component and a Geyer component with these (r, sat) parameters has a much lower AIC than the corresponding pure hardcore model and Geyer saturation model.
After this, I fitted a second hybrid model with a hardcore component and two Geyer components, which is even better according to AIC. I think this second hybrid model is also valid for my data, as the L(r) plot does present more than one peak and trough.
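For concreteness, here is a minimal sketch of fitting such hybrid models in spatstat, using the built-in cells pattern in place of my.X; the Geyer (r, sat) values are illustrative placeholders, not estimates for your data:

```r
library(spatstat)

X <- cells  # built-in inhibited pattern, standing in for my.X

# Hardcore radius left unspecified so spatstat estimates it;
# r = 0.1 and sat = 2 are placeholder values, not fitted ones
fit.hc  <- ppm(X ~ 1, Hardcore())
fit.gey <- ppm(X ~ 1, Geyer(r = 0.1, sat = 2))
fit.hyb <- ppm(X ~ 1, Hybrid(H = Hardcore(), G = Geyer(r = 0.1, sat = 2)))

# compare the three fits by AIC, as described above
sapply(list(hardcore = fit.hc, geyer = fit.gey, hybrid = fit.hyb), AIC)
```

Note that for non-Poisson models AIC is computed from the log pseudolikelihood, so comparisons should be treated as indicative rather than exact.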
I am thinking of using the envelope of the fitted model to judge its validity: if the fitted L(r) is bounded by the simulated envelope, then the model is good enough for the data.
env.hybrid.final <- envelope(hybrid.1, Linhom,
                             simulate = expression(rpoispp(den.X)),
                             use.theory = TRUE, nsim = 19, global = TRUE)
plot(env.hybrid.final, . - r ~ r, shade = NULL, legend = FALSE,
     xlab = "r", ylab = "L(r)-r", main = "")
Here come the questions:
Q1: Is it right to use the L(r) envelope plot to judge the validity of the fitted model, as I did? If not, how should one judge the best fitted model, or even a "good" model?
Q2: When you can fit several models to a dataset, how do you decide when a model is good enough, especially when no clear mechanism is known? AIC?

Related

Regression: Generalized Additive Model

I have started to work with GAM in R and I’ve acquired Simon Wood’s excellent book on the topic. Based on one of his examples, I am looking at the following:
library(mgcv)
data(trees)
ct1 <- gam(log(Volume) ~ Height + s(Girth), data = trees)
I have two general questions to this example:
How does one decide when a variable in the model should be parametric (such as Height) or when it should be smooth (such as Girth)? Does one hold an advantage over the other, and is there a way to determine what the optimal type for a variable is? If anybody has any literature about this topic, I'd be happy to know of it.
Say I want to look closer at the weights of ct1: ct1$coefficients. Can I use them as the gam procedure outputs them, or do I have to transform them before analyzing them, given that I am fitting to log(Volume)? In the latter case, I guess I would have to use exp(ct1$coefficients).
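One empirical way to approach the first question (a sketch, not a prescription from Wood's book): fit the variable both ways and let the data decide; if the smooth term's effective degrees of freedom come out near 1, it is effectively linear and the parametric form suffices. For the second question, since the model is fitted on the log scale, predictions are back-transformed with exp():

```r
library(mgcv)
data(trees)

ct1 <- gam(log(Volume) ~ Height    + s(Girth), data = trees)  # Height parametric
ct2 <- gam(log(Volume) ~ s(Height) + s(Girth), data = trees)  # Height smooth

summary(ct2)   # if s(Height) has edf near 1, it is effectively linear
AIC(ct1, ct2)  # or compare the two specifications by AIC

# predictions are on the log(Volume) scale, so back-transform with exp()
exp(predict(ct1, newdata = trees[1:3, ]))
```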

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, and there we can specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes | reference No, or predicted No | reference Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the answer by Yueguoguo, there are also three more solutions: the standard wrapper approach, hyperplane tuning, and the one in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under those costs.
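A base-R sketch of that wrapper idea (the data and the 5:1 cost ratio are made up for illustration): replicate each training row in proportion to the cost of misclassifying it, then train any accuracy-optimising learner on the inflated data.

```r
set.seed(1)
n  <- 200
df <- data.frame(x = rnorm(n, mean = rep(c(0, 1), each = n / 2)),
                 y = factor(rep(c("No", "Yes"), each = n / 2)))

cost.fn <- 5  # assumed cost of predicting No when the reference is Yes
cost.fp <- 1  # assumed cost of predicting Yes when the reference is No

# oversample: replicate each row as many times as its misclassification cost
w     <- ifelse(df$y == "Yes", cost.fn, cost.fp)
df.cs <- df[rep(seq_len(nrow(df)), times = w), ]

# glm stands in here for an arbitrary accuracy-optimising classifier
fit <- glm(y ~ x, data = df.cs, family = binomial)
```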
The second idea is frequently used in text mining. In an SVM, the classification is derived from the distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically whether the distance is positive or negative. However, one can also shift this threshold: instead of making the decision at 0, move it, for example, towards 0.8. That way the classifications are shifted in one direction or the other, while the general shape of the data is not altered.
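A toy base-R illustration of that shift (the decision values here are invented, not computed from a real SVM):

```r
# signed distances of six hypothetical test points to the hyperplane
d <- c(-1.4, -0.3, 0.2, 0.6, 0.9, 1.7)

pred.default <- ifelse(d > 0,   "Yes", "No")  # usual rule: sign of the distance
pred.shifted <- ifelse(d > 0.8, "Yes", "No")  # cutoff moved to 0.8, as above

pred.default  # "No"  "No"  "Yes" "Yes" "Yes" "Yes"
pred.shifted  # "No"  "No"  "No"  "No"  "Yes" "Yes"
```

Raising the cutoff trades false positives for false negatives without refitting the model.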
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term "cost" is already taken by the C regularisation argument of svm().
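Assuming the e1071 package is installed, class.weights is passed as a named vector; the two-class iris subset and the 5:1 weighting below are purely illustrative:

```r
library(e1071)

# two-class subset of iris as a stand-in binary problem
d <- droplevels(subset(iris, Species != "setosa"))

# weight errors on "versicolor" five times as heavily (assumed ratio)
m <- svm(Species ~ ., data = d,
         class.weights = c(versicolor = 5, virginica = 1))

table(predicted = predict(m), reference = d$Species)
```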
The loss function for the SVM hyperplane parameters is tuned automatically, thanks to the algorithm's theoretical foundation. SVM applies cross-validation for tuning hyperparameters: say an RBF kernel is used; cross-validation then selects the combination of C (cost) and gamma (kernel parameter) that gives the best performance, measured by some metric (e.g., mean squared error). In e1071, this can be done with the tune method, where the range of hyperparameters as well as the cross-validation scheme (i.e., 5-fold, 10-fold, or more) can be specified.
To obtain comparable cross-validation results using an area-under-curve type of error measure, one can train models with different hyperparameter configurations and then validate each against sets of pre-labelled data.
Hope the answer helps.

variable selection in package earth R for MARS

I am working on a MARS model using earth package in R. I want to force all the variables in the model. I have 14 predictors but in the result, I am getting only 13 predictors. Below is my code.
mdl <- earth(x, y, nprune = 10000, nk = 1000, degree = 1, thresh = 1e-7,
             linpreds = c(1:14), penalty = -1, ponly = TRUE, trace = 0)
Here are my questions
Is it possible for me to force variables in, instead of having the algorithm select them? If yes, how?
Once I start exploring hinges in the data, is it possible for me to manually fix the knots and then get the estimates on the basis of them?
You cannot force predictors into the earth/MARS model. Fundamental to the MARS algorithm is automatic selection of predictors.
But it is possible to increase the likelihood of all predictors entering the model by subverting the normal algorithm: set thresh=0 and penalty=-1. See the earth help page, and search for "thresh" and "penalty" in the earth vignette for details and examples.
However, to quote that vignette: "It's usually best not to subvert the standard MARS algorithm by toying with tuning parameters such as thresh, penalty, and endspan. Remember that we aren't seeking a model that best fits the training data, but rather a model that best fits the underlying distribution."
You can't manually fix the knots. Once again, automatic selection of knots is inherent in the MARS algorithm. However, the minspan and endspan arguments give you some flexibility in the automatic placement of knots; for example, minspan=-3 will allow up to 3 equally spaced knots per predictor.
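A sketch of both suggestions, assuming the earth package is installed and using the built-in trees data in place of the questioner's 14 predictors:

```r
library(earth)
data(trees)

# encourage all predictors to enter, per the settings named above
m1 <- earth(Volume ~ ., data = trees, thresh = 0, penalty = -1)

# loosen automatic knot placement instead of fixing knots by hand:
# minspan = -3 allows up to 3 equally spaced knots per predictor
m2 <- earth(Volume ~ ., data = trees, minspan = -3)

evimp(m1)  # which predictors actually entered the model
```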

Comparing nonlinear regression models

I want to compare the curve fits of three models by their r-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they do give "residual std error" and "residual sum of squares", though.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
R-squared is not really meaningful for non-linear regression, which is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data are binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
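A base-R sketch of both cases on simulated data:

```r
set.seed(1)
x <- runif(100)

# count response -> Poisson GLM
y.count  <- rpois(100, lambda = exp(0.5 + 1.5 * x))
fit.pois <- glm(y.count ~ x, family = poisson)

# binary response -> binomial GLM, i.e. logistic regression
y.bin     <- rbinom(100, size = 1, prob = plogis(-1 + 2 * x))
fit.logit <- glm(y.bin ~ x, family = binomial)

coef(fit.pois)
coef(fit.logit)
```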
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
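Those two checks look like this in base R (with a simulated lm fit standing in for your model):

```r
set.seed(1)
x   <- 1:50
y   <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)           # stand-in for your fitted model

r <- rstandard(fit)        # standardized residuals

plot(fitted(fit), r)       # fanning out or in => non-constant variance
abline(h = 0, lty = 2)

qqnorm(r); qqline(r)       # points far off the line => non-normal residuals
```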
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
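In R, the F-based comparison of two nested least-squares fits is done with anova(); here is a sketch with lm() on simulated data, though the same logic carries over to nested nls() models:

```r
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- 1 + 2 * x + rnorm(60)

fit1 <- lm(y ~ x)            # simpler model
fit2 <- lm(y ~ x + I(x^2))   # one extra parameter

anova(fit1, fit2)            # F-test: does the extra parameter earn its keep?
summary(fit1)$fstatistic     # overall F-statistic of the simpler fit
```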
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.

When to choose nls() over loess()?

If I have some (x,y) data, I can easily draw straight-line through it, e.g.
f=glm(y~x)
plot(x,y)
lines(x,f$fitted.values)
But for curvy data I want a curvy line. It seems loess() can be used:
f=loess(y~x)
plot(x,y)
lines(x,f$fitted)
This question has evolved as I've typed and researched it. I started off wanting a simple function to fit curvy data (where I know nothing about the data), and wanting to understand how to use nls() or optim() to do that. That was what everyone seemed to be suggesting in similar questions I found. But then I stumbled upon loess() and I'm happy. So now my question is: why would someone choose to use nls or optim instead of loess (or smooth.spline)? Using the toolbox analogy, is nls a screwdriver and loess a power screwdriver (meaning I'd almost always choose the latter as it does the same thing but with less of my effort)? Or is nls a flat-head screwdriver and loess a cross-head screwdriver (meaning loess is a better fit for some problems, but for others it simply won't do the job)?
For reference, here is the play data I was using that loess gives satisfactory results for:
x=1:40
y=(sin(x/5)*3)+runif(x)
And:
x=1:40
y=exp(jitter(x,factor=30)^0.5)
Sadly, it does less well on this:
x=1:400
y=(sin(x/20)*3)+runif(x)
Can nls(), or any other function or library, cope with both this and the previous exp example, without being given a hint (i.e. without being told it is a sine wave)?
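For what it's worth, nls() does cope with this last example, but only when given the hint: you must supply the sine form and rough starting values yourself (the parameterisation a * sin(x/b) + c is an assumption about the generating curve):

```r
set.seed(1)
x <- 1:400
y <- (sin(x / 20) * 3) + runif(x)

# nls needs the functional form plus starting values near the truth
fit <- nls(y ~ a * sin(x / b) + c,
           start = list(a = 2, b = 20, c = 0))

coef(fit)  # roughly a = 3, b = 20, c = 0.5 (the runif noise has mean 0.5)
```

Start the frequency parameter b far from its true value and the iteration will usually wander into a wrong local minimum, which is exactly the "hint" problem the question raises.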
UPDATE: Some useful pages on the same theme on stackoverflow:
Goodness of fit functions in R
How to fit a smooth curve to my data in R?
smooth.spline "out of the box" gives good results on my 1st and 3rd examples, but terrible (it just joins the dots) on the 2nd example. However f=smooth.spline(x,y,spar=0.5) is good on all three.
UPDATE #2: gam() (from mgcv package) is great so far: it gives a similar result to loess() when that was better, and a similar result to smooth.spline() when that was better. And all without hints or extra parameters. The docs were so far over my head I felt like I was squinting at a plane flying overhead; but a bit of trial and error found:
#f=gam(y~x) #Works just like glm(). I.e. pointless
f=gam(y~s(x)) #This is what you want
plot(x,y)
lines(x,f$fitted)
Nonlinear least squares is a means of fitting a model that is non-linear in the parameters. By fitting a model, I mean there is some a priori specified form for the relationship between the response and the covariates, with some unknown parameters that are to be estimated. As the model is non-linear in these parameters, NLS is a means to estimate values for those coefficients by minimising a least-squares criterion in an iterative fashion.
LOESS was developed as a means of smoothing scatterplots. It has a much less well-defined concept of a "model" that is fitted (IIRC there is no "model"). LOESS works by trying to identify patterns in the relationship between response and covariates without the user having to specify what form that relationship takes; it works out the relationship from the data themselves.
These are two fundamentally different ideas. If you know the data should follow a particular model then you should fit that model using NLS. You could always compare the two fits (NLS vs LOESS) to see if there is systematic variation from the presumed model etc - but that would show up in the NLS residuals.
Instead of LOESS, you might consider Generalized Additive Models (GAMs) fitted via gam() in recommended package mgcv. These models can be viewed as a penalised regression problem but allow for the fitted smooth functions to be estimated from the data like they are in LOESS. GAM extends GLM to allow smooth, arbitrary functions of covariates.
loess() is non-parametric, meaning you don't get a set of coefficients you can use later - it's not a model, just a fit line. nls() will give you coefficients you could use to build an equation and predict values with a different but similar data set - you can create a model with nls().
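A quick demonstration of that difference, using the first play dataset from the question: nls() returns coefficients that define a reusable equation, while predict() on a loess fit returns NA outside the range of the original x:

```r
set.seed(1)
x <- 1:40
y <- (sin(x / 5) * 3) + runif(x)

# NLS: an explicit equation with estimated coefficients
f.nls <- nls(y ~ a * sin(x / b) + c, start = list(a = 2, b = 5, c = 0))
coef(f.nls)                      # a, b, c define a portable formula

new <- data.frame(x = 41:60)     # x values beyond the data
predict(f.nls, newdata = new)    # the equation simply evaluates there

# LOESS: no equation to carry over
f.lo <- loess(y ~ x)
predict(f.lo, newdata = new)     # all NA outside the fitted range
```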
