gam() in R: Is it a spline model with automated knot selection?

I am running an analysis where I need to plot a nonlinear relation between two variables. I read about spline regression, where one challenge is to find the number and the positions of the knots. So I was happy to read in this book that generalized additive models (GAM) fit "spline models with automated selection of knots". Thus, I started to read how to do a GAM analysis in R and I was surprised to see that the gam() function has a knots argument.
Now I am confused. Does the gam() function in R run a GAM which automatically finds the best knots? If so, why should we provide the knots argument? Also, the documentation says "If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers. For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value". This does not sound like a very elaborate algorithm for knot selection.
I could not find another source validating the statement that a GAM is a spline model with automated knot selection. So is the gam() function the same as pspline() with degree 3 (cubic), with the difference that gam() sets some default for the df argument?

The term GAM covers a broad church of models and approaches to solve the smoothness selection problem.
mgcv uses penalized regression spline bases, with a wiggliness penalty to choose the complexity of the fitted smooth(s). As such, it doesn't choose the number of knots as part of the smoothness selection.
Basically, you as the user choose how large a basis to use for each smooth function (by setting the argument k in the s(), te(), etc. functions used in the model formula). The value(s) of k set the upper limit on the wiggliness of the smooth function(s). The penalty measures the wiggliness of the function (typically the integrated squared second derivative of the smooth over the range of the covariate). The model then estimates the coefficients for the basis functions representing each smooth and chooses the smoothness parameter(s) by maximizing a penalized log-likelihood criterion: the log likelihood minus some amount of penalty for the wiggliness of each smooth.
Basically, you set the upper limit of expected complexity (wiggliness) for each smooth and when the model is fitted, the penalty(ies) shrink the coefficients behind each smooth so that excess wiggliness is removed from the fit.
In this way, the smoothness parameters control how much shrinkage happens and hence how complex (wiggly) each fitted smooth is.
This approach avoids the problems of choosing where to put the knots.
This doesn't mean the bases used to represent the smooths don't have knots. In the cubic regression spline basis you mention, the value you give to k sets the dimensionality of the basis, which implies a certain number of knots. These knots are placed at the boundaries of the covariate involved in the smooth and then evenly over the range of the covariate, unless the user supplies a different set of knot locations. However, once the number of knots and their locations are set, thus forming the basis, they are fixed, with the wiggliness of the smooth being controlled by the wiggliness penalty, not by varying the number of knots.
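To make this concrete, here is a minimal sketch using mgcv (the simulated data and the choice k = 10 are my own illustrative assumptions, not from the question):

library(mgcv)
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
# cubic regression spline basis of dimension k = 10; the knots are placed
# evenly over x unless supplied via gam()'s knots argument
m <- gam(y ~ s(x, k = 10, bs = "cr"), data = dat, method = "REML")
summary(m)  # the effective degrees of freedom will typically sit well below k - 1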
You also have to be careful in R, as there are two packages providing a gam() function. The original gam package provides an R version of the software and approach described in the original GAM book by Hastie and Tibshirani. This package doesn't fit GAMs using penalized regression splines as I describe above.
R ships with the mgcv package, which fits GAMs using penalized regression splines as I outline above. You control the size (dimensionality) of the basis for each smooth using the argument k. There is no argument df.
Like I said, GAMs are a broad church and there are many ways to fit them. It is important to know what software you are using and what methods that software employs to estimate the GAM. Once you have that info in hand, you can home in on specific material for that particular approach to estimating GAMs. In this case, you should look at Simon Wood's book Generalized Additive Models: An Introduction with R, as it describes the mgcv package and is written by the package's author.

Related

R quantile regression monotone cubic spline

I am running a quantile regression model on some data with one natural cubic spline, which needs to be monotonically decreasing (because it cannot physically increase at any point). To start with I used the ns() function from the splines package to achieve this but quickly found that it won't do (unsurprisingly). So I found the function mSpline from the package splines2 which is supposed to fit monotonic splines, but it also does not work. Below is an example of the two functions and how they fail on mtcars.
How can I achieve my goal of obtaining monotonically decreasing splines either with my approach or some other approach?
Bonus points if additional variables can be added to the model, which are not splined.
library(quantreg)
library(splines)   # ns()
library(splines2)  # mSpline()

# natural cubic spline fit
mod <- rq(mpg ~ ns(hp, df = 3), data = mtcars, tau = 0.99)
# mSpline basis (intended to be monotone, but it is not)
mod <- rq(mpg ~ mSpline(hp, df = 3), data = mtcars, tau = 0.99)

preds <- predict(mod)
plot(mtcars$mpg ~ mtcars$hp)
points(preds ~ mtcars$hp, col = 2, cex = 1, pch = 16)
First part of your question: quantile regression with smoothing splines and monotonicity restrictions can be implemented using splineDesign() from the splines package together with quantreg (option method = "fnc" for the rq() function).
R Code is available from the supplement to https://doi.org/10.1002/jae.2347 in the open archive http://qed.econ.queensu.ca/jae/datasets/haupt002/
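As a rough, hedged sketch of the idea (the knot placement and constraint construction below are my own illustration, not the paper's exact code): for a B-spline basis, a decreasing coefficient sequence is sufficient for a decreasing fit, and rq() with method = "fnc" accepts linear inequality constraints of the form R b >= r.

library(quantreg)
library(splines)
x <- mtcars$hp
# cubic B-spline design matrix with interior knots at the quartiles
kn <- c(rep(min(x), 4), quantile(x, c(0.25, 0.5, 0.75)), rep(max(x), 4))
B <- splineDesign(knots = kn, x = x, ord = 4)
p <- ncol(B)
R <- -diff(diag(p))  # rows encode b[i] - b[i+1] >= 0, i.e. decreasing coefficients
r <- rep(0, nrow(R))
# intercept dropped because the B-spline basis functions sum to one
fit <- rq(mtcars$mpg ~ B - 1, tau = 0.99, method = "fnc", R = R, r = r)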
Second part of your question: the implemented model is additive and allows adding further spline, parametric, or semiparametric components.

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, where we can specify the cost function via the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes when the reference is No, or predicted No when the reference is Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost (loss) function instead of using the built-in loss function?
Besides the answer by Yueguoguo, there are also three more solutions: the standard wrapper approach, shifting the hyperplane decision threshold, and the built-in option in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under those costs.
The second idea is frequently used in text mining. In SVMs, the classification is derived from the distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically a question of whether the distance is positive or negative. However, one can also shift this threshold and not make the decision at 0 but move it, for example, towards 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
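A hedged sketch of that shifting idea with e1071 (the 0.8 threshold and the two-class subset of iris are illustrative choices):

library(e1071)
d <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ ., data = d)
f <- attr(predict(fit, d, decision.values = TRUE), "decision.values")
# decide at 0.8 instead of 0; check colnames(f) to see which sign maps to which class
pred <- ifelse(f[, 1] > 0.8, levels(d$Species)[1], levels(d$Species)[2])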
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken (by the regularisation parameter).
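For that third option, a minimal sketch (the weights are illustrative; the names must match the factor levels):

library(e1071)
d <- droplevels(subset(iris, Species != "setosa"))
# errors on virginica cost five times as much as errors on versicolor
fit <- svm(Species ~ ., data = d, class.weights = c(versicolor = 1, virginica = 5))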
The loss function for the SVM hyperplane parameters is automatically tuned thanks to the beautiful theoretical foundation of the algorithm. SVM applies cross-validation for tuning hyperparameters. Say an RBF kernel is used: cross-validation selects the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by certain metrics (e.g., mean squared error). In e1071, this can be done with the tune() method, where the range of hyperparameters as well as the attributes of cross-validation (i.e., 5-, 10-, or more-fold) can be specified.
To obtain comparable cross-validation results using an area-under-curve type of error measurement, one can train different models with different hyperparameter configurations and then validate each model against sets of pre-labelled data.
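A small sketch of that tuning workflow (the grid values and 10-fold cross-validation are arbitrary choices):

library(e1071)
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = 2^(-1:3), gamma = 2^(-2:1)),
              tunecontrol = tune.control(cross = 10))
tuned$best.parameters  # the C/gamma combination with the lowest CV error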
Hope the answer helps.

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008), and you can find the original source in the paper's supplementary material. The paper describes the procedure very briefly. Basically, the model predictions are obtained over a grid of the two predictors, setting all other predictors at their means. The model predictions are then regressed onto the grid. The mean squared residual of this model is then multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H-statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. It is essentially the departure from additivity for any two predictors averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
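For a side-by-side check, a hedged sketch (assuming fit is a boosted regression tree model produced by dismo::gbm.step on a data frame dat; both are placeholders):

library(gbm)
library(dismo)
# Friedman's H for the first two predictors: bounded in [0, 1]
interact.gbm(fit, data = dat, i.var = c(1, 2), n.trees = fit$n.trees)
# dismo's grid-based statistic (1000 * mean squared residual): can exceed 1
gbm.interactions(fit)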

Comparing nonlinear regression models

I want to compare the curve fits of three models by their r-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they do give the "residual std error" and "residual sum of squares".
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
R-squared is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
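For example (purely illustrative; d, counts, treatment, success, and dose are hypothetical names):

# count response: Poisson GLM
cfit <- glm(counts ~ treatment, family = poisson, data = d)
# binary response: binomial GLM, i.e. logistic regression
bfit <- glm(success ~ dose, family = binomial, data = d)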
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after fitting by looking at a plot of standardized residuals vs. fitted y, and by looking at a normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
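A short sketch of these checks, assuming a fitted model object fit (from lm, nls, etc.):

r <- residuals(fit)
plot(fitted(fit), r)   # scatter widening or narrowing with y signals non-constant variance
abline(h = 0, lty = 2)
qqnorm(r); qqline(r)   # strong curvature away from the line signals non-normal residuals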
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio SSR/SSE corrected for the degrees of freedom (dof) in the regression (R) and the residuals (E). A model with more parameters will generally have a smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.

Goodness of fit functions in R

What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
The first part of that question alone can fill entire books. Some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models, e.g. time series, micro-econometrics, mixed-effects models, and much more. Several of the CRAN Task Views, e.g. Econometrics, discuss this in more detail. As for goodness of fit, that is also something one could easily spend an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. In fact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to seriously misspecified models (see Harrell's book "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi-squared (the sum of the squared residuals) is the quantity that is minimized in that case, but it is not normalized, so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately, I'm not sure of an automated way to do that.
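A minimal nls() sketch on simulated data (my own toy example), showing the quantity it minimizes:

set.seed(42)
x <- 1:50
y <- 5 * exp(-0.1 * x) + rnorm(50, sd = 0.2)
fit <- nls(y ~ a * exp(b * x), start = list(a = 4, b = -0.05))
deviance(fit)  # the residual sum of squares that nls minimizes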
The Quick R site has a reasonably good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
"The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that."
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
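As a rough sketch of that idea (assuming r holds the residuals of a fitted model; note that shapiro.test() in base R already automates a normality check):

qq <- qqnorm(r, plot.it = FALSE)  # qq$x: theoretical quantiles, qq$y: sample quantiles
cor(qq$x, qq$y)                   # values very close to 1 suggest approximate normality
shapiro.test(r)                   # built-in formal test of normality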
Best to keep it simple and see if linear methods work well enough. You can judge your goodness of fit GENERALLY by looking at the R-squared AND the F-statistic together, never separately. Adding variables to your model that have no bearing on your dependent variable can increase R2, so you must also consider the F-statistic.
You should also compare your model to other nested, simpler models. Do this using a log-likelihood ratio test, so long as the dependent variables are the same.
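A quick sketch of such a comparison (toy example on mtcars; lm used for brevity):

m1 <- lm(mpg ~ hp, data = mtcars)        # simpler model
m2 <- lm(mpg ~ hp + wt, data = mtcars)   # nested extension
lr <- as.numeric(2 * (logLik(m2) - logLik(m1)))
pchisq(lr, df = 1, lower.tail = FALSE)   # likelihood-ratio p-value for the extra term
anova(m1, m2)                            # F-test alternative for nested linear models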
The Jarque–Bera test is good for testing the normality of the residual distribution.
