R quantile regression monotone cubic spline

I am running a quantile regression model on some data with one natural cubic spline, which needs to be monotonically decreasing (because it cannot physically increase at any point). To start with, I used the ns() function from the splines package, but quickly found that it won't do (unsurprisingly). So I found the mSpline() function from the splines2 package, which is supposed to fit monotone splines, but it does not work either. Below is an example of the two functions and how they fail on mtcars.
How can I achieve my goal of obtaining monotonically decreasing splines either with my approach or some other approach?
Bonus points if additional variables can be added to the model, which are not splined.
library(quantreg)
library(splines)   # ns()
library(splines2)  # mSpline()

mod=rq(mpg~ns(hp,df=3),data=mtcars,tau=0.99)      # natural cubic spline
mod=rq(mpg~mSpline(hp,df=3),data=mtcars,tau=0.99) # intended monotone spline
preds=predict(mod)

plot(mtcars$mpg~mtcars$hp)
points(preds~mtcars$hp,col=2,cex=1,pch=16)

First part of your question: quantile regression with smoothing splines and monotonicity restrictions can be implemented using splineDesign() from the splines package together with quantreg (option method = "fnc" of the rq() function, which accepts linear inequality constraints on the coefficients).
R code is available from the supplement to https://doi.org/10.1002/jae.2347 in the open archive http://qed.econ.queensu.ca/jae/datasets/haupt002/
Second part of your question: the implemented model is additive and allows adding further spline, parametric, or semiparametric components.
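A minimal sketch of that idea (not the paper's code; the knot placement and the constraint grid below are illustrative assumptions): build a cubic B-spline basis with splineDesign(), evaluate its first-derivative basis on a grid, and have rq(method = "fnc") enforce R b >= r with R = -D, so the fitted function is non-increasing.
library(quantreg)
library(splines)

x <- mtcars$hp
y <- mtcars$mpg

# cubic B-spline basis (ord = 4): boundary knots repeated 4 times, interior knots at quantiles
knots <- c(rep(min(x), 4), quantile(x, c(0.25, 0.5, 0.75)), rep(max(x), 4))
B <- splineDesign(knots, x, ord = 4)

# first-derivative basis on a grid; requiring -D %*% beta >= 0 forces a non-increasing fit
grid <- seq(min(x), max(x), length.out = 50)
D <- splineDesign(knots, grid, ord = 4, derivs = rep(1, length(grid)))

fit <- rq(y ~ B - 1, tau = 0.99, method = "fnc", R = -D, r = rep(0, nrow(D)))

# extra non-splined covariates would need additional (zero) columns appended to R
plot(x, y)
o <- order(x)
lines(x[o], fitted(fit)[o], col = 2)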

Related

How to check overfitting of a point pattern model on a linear network using spatstat

I have been using lppm (point pattern on a linear network) in spatstat with a bunch of covariates, fitting a log-linear model, but I couldn't see how to check for over-fitting. Is there a quick way to do it?
It depends on what you want.
What tool would you use to check overfitting in (say) a linear model?
To identify whether individual observations may have been over-fitted, you could use influence.lppm (from the spatstat.linnet package).
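For example, a rough sketch of that first check, assuming influence() dispatches to the influence.lppm method for a fitted model fit of class lppm:
library(spatstat.linnet)
infl <- influence(fit)   # per-observation leverage/influence diagnostics
plot(infl)               # look for points with unusually large influence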
To identify collinearity in the covariates, currently we do not provide a dedicated function in spatstat, but you could use the following trick. If fit is your fitted model of class lppm, first extract the corresponding GLM using
g <- getglmfit(as.ppm(fit))
Next install the package faraway and use the vif function to calculate the variance inflation factors
library(faraway)
vif(g)

ugarchroll() with EVT-COPULA-GARCH Model

How can I fit an EVT-copula-GARCH model with the ugarchroll() function to obtain Value-at-Risk forecasts? Thanks.
I have already obtained standardized residuals from my fitted GARCH models; next I would need to apply the kernel and fit a copula. I know how I would do the individual steps from a technical perspective, but I cannot wrap my head around how to do it straightforwardly with the "rugarch" or "rmgarch" package.
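(Not an answer to the ugarchroll() part, but for reference, the per-step pipeline described above might look roughly like the following hedged sketch; returns_matrix is a hypothetical T x k matrix of returns, and pobs() stands in here for the kernel/EVT transform of the margins.)
library(rugarch)
library(copula)

spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(1, 0)))

# univariate GARCH fit per series, then standardized residuals
fits <- apply(returns_matrix, 2, function(r) ugarchfit(spec, r))
z <- sapply(fits, function(f) as.numeric(residuals(f, standardize = TRUE)))

# transform margins to (0,1); a kernel/EVT semi-parametric CDF would go here instead
u <- pobs(z)

# fit a copula to the transformed residuals
cop <- fitCopula(normalCopula(dim = ncol(u), dispstr = "un"), u, method = "mpl")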

gam() in R: Is it a spline model with automated knots selection?

I am running an analysis where I need to plot a nonlinear relation between two variables. I read about spline regression, where one challenge is to find the number and the positions of the knots. So I was happy to read in this book that generalized additive models (GAMs) fit "spline models with automated selection of knots". Thus, I started to read how to do a GAM analysis in R and was surprised to see that the gam() function has a knots argument.
Now I am confused. Does the gam() function in R run a GAM which automatically finds the best knots? If so, why should we provide the knots argument? Also, the documentation says "If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers. For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value". This does not sound like a very elaborate algorithm for knot selection.
I could not find another source validating the statement that GAM is a spline model with automated knot selection. So is the gam() function the same as pspline() with degree 3 (cubic), the only difference being that gam() sets some default for the df argument?
The term GAM covers a broad church of models and approaches to solve the smoothness selection problem.
mgcv uses penalized regression spline bases, with a wiggliness penalty to choose the complexity of the fitted smooth(s). As such, it doesn't choose the number of knots as part of the smoothness selection.
Basically, you as the user choose how large a basis to use for each smooth function (by setting the argument k in the s(), te(), etc. functions used in the model formula). The value(s) of k set the upper limit on the wiggliness of the smooth function(s). The penalty measures the wiggliness of the function (typically the squared second derivative of the smooth, integrated over the range of the covariate). The model then estimates the coefficients of the basis functions representing each smooth and chooses the smoothness parameter(s) by maximizing the penalized log-likelihood criterion, which is the log-likelihood minus some multiple of the wiggliness penalty for each smooth.
Basically, you set the upper limit of expected complexity (wiggliness) for each smooth and when the model is fitted, the penalty(ies) shrink the coefficients behind each smooth so that excess wiggliness is removed from the fit.
In this way, the smoothness parameters control how much shrinkage happens and hence how complex (wiggly) each fitted smooth is.
This approach avoids the problems of choosing where to put the knots.
This doesn't mean the bases used to represent the smooths don't have knots. In the cubic regression spline basis you mention, the value you give to k sets the dimensionality of the basis, which implies a certain number of knots. These knots are placed at the extremes of the covariate involved in the smooth and then evenly through the covariate values, unless the user supplies a different set of knot locations. However, once the number of knots and their locations are set, thus forming the basis, they are fixed: the wiggliness of the smooth is controlled by the wiggliness penalty, not by varying the number of knots.
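A minimal mgcv illustration of the points above (the simulated data and the choice k = 20 are just for demonstration):
library(mgcv)

set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# k sets the basis dimension (the upper limit on wiggliness); the penalty,
# whose strength is chosen here by REML, shrinks any excess wiggliness away
fit <- gam(y ~ s(x, bs = "cr", k = 20), method = "REML")

summary(fit)  # the effective degrees of freedom come out well below k - 1
plot(fit)     # the fitted smooth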
You also have to be careful in R, as there are two packages providing a gam() function. The original gam package provides an R version of the software and approach described in the original GAM book by Hastie and Tibshirani. That package doesn't fit GAMs using penalized regression splines as I describe above.
R ships with the mgcv package, which fits GAMs using penalized regression splines as I outline above. You control the size (dimensionality) of the basis for each smooth using the argument k. There is no argument df.
Like I said, GAMs are a broad church and there are many ways to fit them. It is important to know what software you are using and what methods that software employs to estimate the GAM. Once you have that information in hand, you can home in on material specific to that particular approach. In this case, you should look at Simon Wood's book Generalized Additive Models: An Introduction with R, as it describes the mgcv package and is written by that package's author.

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions, and that the H-statistic is on the scale [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008), and you can find the original source in the supplementary material. The paper describes the procedure only briefly. Basically, model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The predictions are then regressed onto the grid (main effects of the two gridded predictors only), and the mean squared error of this regression, multiplied by 1000, is the reported statistic. It indicates departures of the model predictions from an additive combination of the two predictors, suggesting a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original fitted gbm model with all but the two predictors under consideration set at their means.
This is different from Friedman's H-statistic (Friedman & Popescu, 2005), which is estimated via their formula (44) for any pair of predictors. That statistic is essentially the departure from additivity for the two predictors averaged over the values of the other variables, NOT with the other variables set at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (i.e. the model-implied predictions), so it will always be between 0 and 1.
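For comparison, a small hedged example of computing Friedman's H with gbm::interact.gbm (the toy model and tuning values here are assumptions for illustration, not taken from either manual):
library(gbm)

set.seed(1)
fit <- gbm(mpg ~ hp + wt + disp, data = mtcars, distribution = "gaussian",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.01,
           bag.fraction = 1, n.minobsinnode = 5)

# Friedman's H for the hp:wt pair; bounded between 0 and 1
interact.gbm(fit, data = mtcars, i.var = c("hp", "wt"), n.trees = 500)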

When to choose nls() over loess()?

If I have some (x, y) data, I can easily draw a straight line through it, e.g.
f=glm(y~x)
plot(x,y)
lines(x,f$fitted.values)
But for curvy data I want a curvy line. It seems loess() can be used:
f=loess(y~x)
plot(x,y)
lines(x,f$fitted)
This question has evolved as I've typed and researched it. I started off wanting a simple function to fit curvy data (where I know nothing about the data), and wanting to understand how to use nls() or optim() to do that, which is what everyone seemed to be suggesting in similar questions I found. But now that I've stumbled upon loess(), I'm happy. So, now my question is: why would someone choose nls or optim instead of loess (or smooth.spline)? Using the toolbox analogy, is nls a screwdriver and loess a power screwdriver (meaning I'd almost always choose the latter, as it does the same thing with less effort)? Or is nls a flat-head screwdriver and loess a cross-head screwdriver (meaning loess is a better fit for some problems, but for others it simply won't do the job)?
For reference, here is the play data I was using that loess gives satisfactory results for:
x=1:40
y=(sin(x/5)*3)+runif(x)
And:
x=1:40
y=exp(jitter(x,factor=30)^0.5)
Sadly, it does less well on this:
x=1:400
y=(sin(x/20)*3)+runif(x)
Can nls(), or any other function or library, cope with both this and the previous exp example, without being given a hint (i.e. without being told it is a sine wave)?
UPDATE: Some useful pages on the same theme on stackoverflow:
Goodness of fit functions in R
How to fit a smooth curve to my data in R?
smooth.spline "out of the box" gives good results on my 1st and 3rd examples, but terrible results (it just joins the dots) on the 2nd. However, f=smooth.spline(x,y,spar=0.5) is good on all three.
UPDATE #2: gam() (from mgcv package) is great so far: it gives a similar result to loess() when that was better, and a similar result to smooth.spline() when that was better. And all without hints or extra parameters. The docs were so far over my head I felt like I was squinting at a plane flying overhead; but a bit of trial and error found:
library(mgcv)
#f=gam(y~x)    # works just like glm(), i.e. pointless here
f=gam(y~s(x))  # this is what you want
plot(x,y)
lines(x,f$fitted)
Nonlinear least squares (NLS) is a means of fitting a model that is non-linear in the parameters. By fitting a model, I mean there is some a priori specified form for the relationship between the response and the covariates, with some unknown parameters that are to be estimated. As the model is non-linear in those parameters, NLS estimates their values by minimising a least-squares criterion in an iterative fashion.
LOESS was developed as a means of smoothing scatterplots. It has a much less well-defined concept of a "model" that is fitted (IIRC there is no "model"). LOESS tries to identify the pattern in the relationship between response and covariates without the user having to specify what form that relationship takes; it works out the relationship from the data themselves.
These are two fundamentally different ideas. If you know the data should follow a particular model then you should fit that model using NLS. You could always compare the two fits (NLS vs LOESS) to see if there is systematic variation from the presumed model etc - but that would show up in the NLS residuals.
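To make the distinction concrete with the play data above: the sine form handed to nls() (and its starting values) is an a priori assumption supplied by the analyst, which is exactly the point, whereas loess() needs no such form.
set.seed(1)
x <- 1:40
y <- (sin(x/5)*3) + runif(x)

# nls() requires a model form; starting values chosen near the data-generating values
f_nls <- nls(y ~ a * sin(x/b) + c, start = list(a = 3, b = 5, c = 0.5))

# loess() infers the shape from the data
f_loess <- loess(y ~ x)

plot(x, y)
lines(x, fitted(f_nls), col = 2)    # parametric fit of the assumed form
lines(x, fitted(f_loess), col = 4)  # nonparametric local fit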
Instead of LOESS, you might consider Generalized Additive Models (GAMs), fitted via gam() in the recommended package mgcv. These models can be viewed as a penalised regression problem, but they allow the fitted smooth functions to be estimated from the data, as they are in LOESS. GAMs extend GLMs to allow smooth, arbitrary functions of covariates.
loess() is non-parametric, meaning you don't get a set of coefficients you can use later: it's not a model, just a fitted line. nls() gives you coefficients you could use to build an equation and predict values with a different but similar data set; you can create a model with nls().
