mgcv: How to get spline equations - r

Probably I'm not the only one to ask the following question concerning estimated equations in mgcv::gam.
The thing is, I've searched the internet in vain for an explicit answer on how to turn the following output into a full equation that I can then take to other analytical software, especially GIS software, in order to map/project the equation onto geographic space using the predictors X1 & X2:
family = gaussian(link = "identity")
smooth class = p-spline
The following is a transposed output of spline function coefficients:
**Intercept** 2.121
**s(X1).1** -1.23E-07
**s(X1).2** 1.86E-07
**s(X1).3** -7.33E-08
**s(X2).1** -2.51E-08
**s(X2).2** 3.08E-07
**s(X2).3** -3.00E-08
It's clear that the output means:
Y = 2.121 + (-1.231e-07 * s(X1).1) + (1.856e-07 * s(X1).2) + (-7.331e-08 * s(X1).3)…..
How can I mathematically interpret s(Xi).j? In other words, could you please advise how to extract the exact full equations for a P-spline from mgcv::gam?

This thread gives some necessary background on P-splines in mgcv: mgcv: how to extract knots, basis, coefficients and predictions for P-splines in adaptive smooth? Your question is not a duplicate, though, and very likely you have already seen that thread.
The exact mathematical formula is ugly, because the B-spline basis is constructed recursively.
Another thing is that mgcv imposes a numerical centering constraint on the smooth functions. This reparametrization happens at run time, so there is no tidy formula for the transformed basis even if the original basis has one.
Also, mgcv is written for R, so model estimation and prediction are expected to be handled in R. There are handy generic functions for this in my linked answer, but they are not exportable to your intended software.
A possible remedy I can think of is to approximate those basis functions by linear interpolation on a fine grid and export the interpolation tables.
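For example, here is a minimal sketch of that idea, assuming a fitted model of the form m <- gam(Y ~ s(X1, bs = "ps") + s(X2, bs = "ps"), data = dat) (the names m and dat are placeholders, not from your output): evaluate each smooth on a fine grid with predict(..., type = "terms") and export the grid, so the GIS software can reconstruct the surface by linear interpolation.
library(mgcv)
m <- gam(Y ~ s(X1, bs = "ps") + s(X2, bs = "ps"), data = dat)  # hypothetical fit
# evaluate s(X1) on a fine grid; the value held for X2 does not affect the s(X1) column
grid <- data.frame(X1 = seq(min(dat$X1), max(dat$X1), length.out = 200),
                   X2 = median(dat$X2))
sm <- predict(m, newdata = grid, type = "terms")
# lookup table and interpolation function for s(X1); repeat analogously for s(X2)
sX1 <- approxfun(grid$X1, sm[, "s(X1)"])
write.csv(data.frame(X1 = grid$X1, sX1 = sm[, "s(X1)"]), "sX1_lookup.csv", row.names = FALSE)
# in the GIS software, Y is then approximately coef(m)[1] (the intercept) plus the
# interpolated values of s(X1) and s(X2)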

Related

binomial()$linkinv(fixef()) and binomial_pred_ci() functions: what exactly are these functions for when applied to a generalized mixed-model analysis?

I was working on a dataset, trying to re-run an already performed statistical analysis, and I came across the following function:
binomial()$linkinv(fixef(m))
after running the following model
summary((m = glmer(T1.ACC ~ COND + (COND | ID), d9only, family = binomial)))
My first question is: what exactly is this function for? Because in other command lines the complementary code, as well as slightly modified code based on it, is also used:
1) 1- binomial()$linkinv(fixef())
2) d9only$fit = binomial()$linkinv(model.matrix(m) %*% fixef(m)) # the meaning of the %*% operator is quite mysterious too.
Moreover, another function that appears is the following:
binomial_pred_ci()
To be honest, I've searched through the whole script and found neither a custom definition of this function nor the package it might be called from. Does anyone know where it may come from? Maybe the package 'runjags'? Just in case, any advice on how to download it?
Thanks for your answers
I agree with most of @Oliver's answer. I will add a few comments (since I had an answer partly composed already).
I would be very wary of the script you are following: some parts look wrong (I could obviously be mistaken since these bits are taken completely out of context ...)
binomial()$linkinv refers to the inverse link function for the model used. By default (which applies in this case, since no optional link= argument has been specified), this is the inverse-logit or logistic function. A nearly equivalent function is available via plogis(), but using $linkinv can be better in some cases since it generalizes to binomial analyses done with other link functions (e.g. probit or cloglog).
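As a quick illustration (a sketch, not taken from the original script), the two functions agree numerically:
eta <- c(-2, 0, 1.5)      # some values on the link (log-odds) scale
binomial()$linkinv(eta)   # inverse link taken from the family object
plogis(eta)               # base-R logistic CDF; both compute 1 / (1 + exp(-eta))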
As @Oliver mentions, applying the inverse link function to the coefficients is at the very least weird; I would even say wrong. Researchers often exponentiate coefficients estimated on the logit/log-odds scale to obtain odds ratios, but applying the inverse link (usually the logistic function) to them is rarely correct.
binomial()$linkinv(model.matrix(m) %*% fixef(m)) is indeed computing the predicted values on the link scale and converting them back to the data (= probability) scale. You can get the same results more reliably (handling missing values, etc.) by using predict(m, type = "response", re.form = ~0) (this extends @Oliver's answer to a form that also applies the inverse-link function for you).
I don't know what binomial_pred_ci is either, but I would suggest you look at predictInterval() from the merTools package ...
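For instance (a hedged sketch, not from the original script; merTools must be installed, and the argument choices here are assumptions):
library(merTools)
# simulation-based prediction intervals, back-transformed to the probability scale
ci <- predictInterval(m, newdata = d9only, type = "probability")
head(ci)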
PS: these answers don't have much to do with runjags, which uses an entirely different model structure. Presumably the glmer models are being fitted for comparison ...
help(binomial) describes the link function and inverse link function and their uses. binomial()$linkinv is the binomial inverse-link function (the sigmoid function) prob(y | eta) = 1 / (1 + exp(-eta)), where eta is the linear predictor. Using this on the coefficients (or fixed effects) is a bit odd, but it is not unusual as a way to get an idea of how large the effect of each coefficient is. I would not encourage it, however.
%*% is the matrix multiplication operator, while model.matrix(m) (for lme4) extracts the fixed-effects model matrix. So model.matrix(m) %*% fixef(m) is the linear predictor using only the fixed effects; it is the same as predict(m, re.form = ~0). This is often used when you want only the fixed-effects part of the model, either because you want to correct for between-group variation or because you are predicting new data.
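Put together, a small sketch using the model m from the question (the two routes should agree, assuming no rows were dropped for missing values):
# linear predictor from the fixed effects only, two equivalent ways
eta_manual  <- drop(model.matrix(m) %*% fixef(m))
eta_predict <- predict(m, re.form = ~0)            # link (log-odds) scale
all.equal(unname(eta_manual), unname(eta_predict))
# back-transform to the probability scale, again two equivalent ways
p_manual  <- binomial()$linkinv(eta_manual)
p_predict <- predict(m, re.form = ~0, type = "response")
all.equal(unname(p_manual), unname(p_predict))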
binomial_pred_ci: no idea. My guess is that it is a function for computing confidence intervals around predictions.

Get the AIC or BIC criterion from gamm, gam, and lme models: how in mgcv? And how can I trust the result? [closed]

I am new to GAMMs and GAMs, so this question may be a bit basic; I'd appreciate your help on this very much.
I am using the following code:
M.gamm <- gamm(bsigsi ~ s(summodpa, sed, k = 1, fx = TRUE, bs = "tp") +
                 s(sumlightpa, sed, k = 1, fx = TRUE, bs = "tp"),
               random = list(school = ~1), method = "ML",
               na.action = na.omit, data = Pilot_fitbit2)
The code runs, but gives me this feedback:
Warning messages:
1: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
2: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
Questions:
My major question, however, is how I can get an AIC or BIC from this.
I've tried BIC(M.gamm$gam) and BIC(M.gamm$lme), since a gamm fit consists of both parts (lme and gam). For the latter (lme) I do get a value, but for the first one I don't. Does anyone know why, and how I can get one?
The issue is that I would like to compare this value to the BIC value of a gam model, and I am not sure which one (BIC(M.gamm$lme) or BIC(M.gamm$gam)) would be the correct one. It is possible for me to derive a BIC and AIC for the gam and lme models.
If I am able to get the AIC or BIC for the gamm model, how can I know I can trust the result? What do I need to be careful with so that I interpret the result correctly? Currently, I am using ML in all models and also use the same package (mgcv) to estimate the lme, gam, and gamm models, to establish comparability.
Any help/ advice or ideas on this would be greatly appreciated!!
Best wishes,
Noemi
These warnings come as a result of requesting a smooth basis of a single function for each of your two smooths; this doesn't make any sense, as both such bases would only contain the equivalent of constant functions, both of which are unidentifiable given that you have another constant term (the intercept) in your model. Once mgcv applies identifiability constraints, the two smooths would be dropped entirely from the model.
Hence the warnings; mgcv didn't do what you wanted. Instead it set k to the smallest value possible. Set k to something larger; you might as well leave it at the default and not specify it in s() at all if you want a low-rank smooth. Also, unless you really want an unpenalized spline fit, don't use fx = TRUE.
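For example, a sketch of the same call with those two changes and everything else left as in the question:
M.gamm <- gamm(bsigsi ~ s(summodpa, sed, bs = "tp") + s(sumlightpa, sed, bs = "tp"),
               random = list(school = ~1), method = "ML",
               na.action = na.omit, data = Pilot_fitbit2)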
I'm not really familiar with any theory for BIC applied to GAM(M)s that corrects for smoothness selection. The AIC method for gam() models estimated using REML smoothness selection does have some theory behind it, including a recent paper by Simon Wood and colleagues.
The mgcv FAQ has the following two things to say
How can I compare gamm models? In the identity-link, normal-errors case, AIC and hypothesis-testing-based methods are fine. Otherwise it is best to work out a strategy based on the summary.gam output. Alternatively, simple random effects can be fitted with gam, which makes comparison straightforward. Package gamm4 is an alternative, which allows AIC-type model selection for generalized models.
When using gamm or gamm4, the reported AIC is different for the gam object and the lme or lmer object. Why is this? There are several reasons for this. The most important is that the models being used are actually different in the two representations. When treating the GAM as a mixed model, you are implicitly assuming that if you gathered a replicate dataset, the smooths in your model would look completely different to the smooths from the original model, except for having the same degree of smoothness. Technically you would expect the smooths to be drawn afresh from their distribution under the random effects model. When viewing the gam from the usual penalized regression perspective, you would expect smooths to look broadly similar under replication of the data, i.e. you are really using a Bayesian model for the smooths, rather than a random effects model (it's just that the frequentist random effects and Bayesian computations happen to coincide for computing the estimates). As a result of the different assumptions about the data generating process, AIC model comparisons can give rather different answers depending on the model adopted. Which you use should depend on which model you really think is appropriate. In addition, the computations of the AICs are different. The mixed model AIC uses the marginal likelihood and the corresponding number of model parameters. The gam model uses the penalized likelihood and the effective degrees of freedom.
So, I'd probably stick to AIC and not use BIC. I'd be thinking about which interpretation of the GAM(M) I was most interested in. I'd also likely fit the random effects you have here using gam(), if they are this simple. An equivalent model would include + s(school, bs = 're') in the main formula and exclude the random part whilst using gam():
gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
Do be careful with 2D isotropic smooths: sed, summodpa, and sumlightpa need to be in the same units and to have the same degree of wiggliness in each smooth. If they aren't in the same units or have different wigglinesses, use te() instead of s() for the 2D terms.
Also be careful with variables that appear in two or more smooths like this; mgcv will do its best to make the model identifiable, but you can easily get into computational problems even so. A better modelling approach would be to estimate the marginal effects of sed and the other terms plus their 2nd-order interactions, by decomposing the effects in the two 2D smooths as follows:
gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
ti(summodpa, sed) + ti(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
where the ti() smooths are tensor-product interaction bases in which the main effects of the two marginal variables have been removed from the basis. Hence you can treat them as pure smooth interaction terms. In this way, the main effect of sed is contained in a single smooth term.
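If you do then want to compare the two formulations, a minimal sketch (assuming both models are fitted to the same rows of Pilot_fitbit2):
m1 <- gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
            s(school, bs = 're'), data = Pilot_fitbit2, method = 'REML')
m2 <- gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
            ti(summodpa, sed) + ti(sumlightpa, sed) +
            s(school, bs = 're'), data = Pilot_fitbit2, method = 'REML')
AIC(m1, m2)  # AIC as computed by mgcv for gam fits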

Regression: Generalized Additive Model

I have started to work with GAM in R and I’ve acquired Simon Wood’s excellent book on the topic. Based on one of his examples, I am looking at the following:
library(mgcv)
data(trees)
ct1<-gam(log(Volume) ~ Height + s(Girth), data=trees)
I have two general questions to this example:
How does one decide when a variable in the model should enter as a parametric term (such as Height) and when it should enter as a smooth (such as Girth)? Does one hold an advantage over the other, and is there a way to determine what the optimal type for a variable is? If anybody has any literature about this topic, I'd be happy to know of it.
Say I want to look closer at the weights of ct1: ct1$coefficients. Can I use them as the gam procedure outputs them, or do I have to transform them before analyzing them, given that I am fitting to log(Volume)? In the latter case, I guess I would have to use exp(ct1$coefficients).

Poisson regression on gravity model

For a university project, I am trying to fit a regression model for demand to a number of independent variables. I tried to include a small example, but it didn't work as a figure (as I am new to this). Instead, see the following link to view a sample of the dataset that I am using:
In this table, the first column indicates country pairs, columns 2 to 6 are independent variables, and the final column is the dependent variable. What I would like to do is perform regression analysis, assuming that this data can be described by a gravity equation.
I know that people frequently use log-linearisation to solve this. However, as I am dealing with zeros in my data (and don't want to distort the data by adding small constants), and as I assume there is heteroskedasticity in the data, I would like to use a different method. Based on what Santos Silva and Tenreyro (2006) described in their paper "The Log of Gravity", I would like to use Poisson pseudo-maximum-likelihood estimation.
However, I am fairly new to using R (or any statistical software), and I can't figure out how to do this in R. Can anybody help me with this? The only thing I've found so far is that the glm families poisson and quasipoisson could be used (https://stat.ethz.ch/pipermail/r-help/2010-September/252476.html).
I've searched the documentation on glm, but I don't understand how to use the glm function to fit this gravity model. How do I specify that I want the model in the form
DEM = RP^alpha1 * GDPC_O^alpha2 * GDPC_D^alpha3 * POP_O^alpha4 * ..., and then use the regression to solve for the different alphas?
Hard to say precisely without more detail, but
glm(DEM ~ log(RP) + log(GDPC_O) + log(GDPC_D) + log(POP_O),
data=your_data,
family=quasipoisson(link="log"))
should work reasonably well. The intercept will be the log of the proportionality constant; the other coefficients will be the exponents of the respective terms. (This works because the log link says that log(DEM) = beta_0 + beta_1*log(RP) + ...; if you exponentiate both sides you get DEM = exp(beta_0) * exp(log(RP)*beta_1) * ..., i.e. DEM = C * RP^beta_1 * ....)
PS it is not necessary, but it may be helpful for interpretation to scale and center your predictor variables.
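To recover the multiplicative gravity form from the fit, a small sketch (your_data and the column names are the placeholders used above):
fit <- glm(DEM ~ log(RP) + log(GDPC_O) + log(GDPC_D) + log(POP_O),
           data = your_data, family = quasipoisson(link = "log"))
exp(coef(fit)["(Intercept)"])  # the proportionality constant C
coef(fit)[-1]                  # the alpha exponents for RP, GDPC_O, GDPC_D, POP_O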

When to choose nls() over loess()?

If I have some (x,y) data, I can easily draw a straight line through it, e.g.
f=glm(y~x)
plot(x,y)
lines(x,f$fitted.values)
But for curvy data I want a curvy line. It seems loess() can be used:
f=loess(y~x)
plot(x,y)
lines(x,f$fitted)
This question has evolved as I've typed and researched it. I started off wanting a simple function to fit curvy data (where I know nothing about the data), and wanting to understand how to use nls() or optim() to do that. That was what everyone seemed to be suggesting in similar questions I found. But now that I've stumbled upon loess(), I'm happy. So, now my question is: why would someone choose to use nls or optim instead of loess (or smooth.spline)? Using the toolbox analogy, is nls a screwdriver and loess a power screwdriver (meaning I'd almost always choose the latter as it does the same thing but with less of my effort)? Or is nls a flat-head screwdriver and loess a cross-head screwdriver (meaning loess is a better fit for some problems, but for others it simply won't do the job)?
For reference, here is the play data I was using that loess gives satisfactory results for:
x=1:40
y=(sin(x/5)*3)+runif(x)
And:
x=1:40
y=exp(jitter(x,factor=30)^0.5)
Sadly, it does less well on this:
x=1:400
y=(sin(x/20)*3)+runif(x)
Can nls(), or any other function or library, cope with both this and the previous exp example, without being given a hint (i.e. without being told it is a sine wave)?
UPDATE: Some useful pages on the same theme on stackoverflow:
Goodness of fit functions in R
How to fit a smooth curve to my data in R?
smooth.spline "out of the box" gives good results on my 1st and 3rd examples, but terrible (it just joins the dots) on the 2nd example. However f=smooth.spline(x,y,spar=0.5) is good on all three.
UPDATE #2: gam() (from the mgcv package) is great so far: it gives a similar result to loess() when that was better, and a similar result to smooth.spline() when that was better. And all without hints or extra parameters. The docs were so far over my head I felt like I was squinting at a plane flying overhead; but a bit of trial and error found:
library(mgcv)
#f=gam(y~x)    #Works just like glm(). I.e. pointless
f=gam(y~s(x))  #This is what you want
plot(x,y)
lines(x,f$fitted)
Nonlinear least squares (NLS) is a means of fitting a model that is non-linear in the parameters. By fitting a model, I mean there is some a priori specified form for the relationship between the response and the covariates, with some unknown parameters that are to be estimated. As the model is non-linear in these parameters, NLS is a means to estimate values for those coefficients by minimising a least-squares criterion in an iterative fashion.
LOESS was developed as a means of smoothing scatterplots. It has a much less well-defined concept of a "model" being fitted (IIRC there is no explicit "model"). LOESS works by trying to identify the pattern in the relationship between response and covariates without the user having to specify what form that relationship takes. LOESS works out the relationship from the data themselves.
These are two fundamentally different ideas. If you know the data should follow a particular model then you should fit that model using NLS. You could always compare the two fits (NLS vs LOESS) to see if there is systematic variation from the presumed model etc - but that would show up in the NLS residuals.
Instead of LOESS, you might consider Generalized Additive Models (GAMs), fitted via gam() in the recommended package mgcv. These models can be viewed as a penalised regression problem, but they allow the fitted smooth functions to be estimated from the data as in LOESS. GAMs extend the GLM to allow smooth, arbitrary functions of covariates.
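To make the contrast concrete, a sketch using the third example from the question (the sine form and starting values handed to nls() are exactly the kind of "hint" the question asks about; gam() needs no such hint):
library(mgcv)
set.seed(1)
x <- 1:400
y <- sin(x/20)*3 + runif(x)
# NLS: the functional form and starting values must be supplied by the user
fit_nls <- nls(y ~ a * sin(x / b) + c, start = list(a = 3, b = 20, c = 0.5))
# GAM: the shape of the smooth is estimated from the data
fit_gam <- gam(y ~ s(x))
plot(x, y)
lines(x, fitted(fit_nls), col = "red")
lines(x, fitted(fit_gam), col = "blue")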
loess() is non-parametric, meaning you don't get a set of coefficients you can use later - it's not a model, just a fit line. nls() will give you coefficients you could use to build an equation and predict values with a different but similar data set - you can create a model with nls().
