Is it possible to let the precision/variance parameter in a beta regression via GAM vary with the predictor as well? - r

I want to fit a spatiotemporal model where my dependent variable is in the range [>0,<1].
A beta regression seems suitable for this case.
I tried the betareg package, that works like a charm, but to my knowledge I cannot include complex interaction terms that occur e.g. in spatiotemporal datasets to account for autocorrelation.
I know that GAMs e.g. package mgcv support beta regression via the betar() family. To my knowledge the precision/variance parameter is held constant though and only the mean (mu) changes as a function of the predictors.
my model looks like this (it is conceptual so no example data needed):
mgcv::gam(Y~ te(latitude,longitude,day)+s(X1)+s(X2)+s(X3),family=betar())
The problem is that only mu is modelled but not phi / precision
In the betareg I can let vary phi with my predictors:
betareg::betareg(Y ~ X1+X2+X3+latitude+longitude | X1+X2+X3+latitude+longitude)
but this doesn´t let me model the spatiotemporal term as needed, because simple additive effects are not suitable for that and I need something like what is supported with the te() functionality from mgcv or any other kind of interaction term.
Is there any work around or a way to model phi but account for my spatiotemporal term either via mgcv or betareg or any other R package?
Thanks a lot!

Related

how check overfitting on point pattern on a linear network using spatstat

I have been using lppm (point pattern on a linear network) on spatstat with bunch of covariates and fitting a log-linear model but I couldn't see how to check over-fitting. Is there a quick way to do it?
It depends on what you want.
What tool would you use to check overfitting in (say) a linear model?
To identify whether individual observations may have been over-fitted, you could use influence.lppm (from the spatstat.linnet package).
To identify collinearity in the covariates, currently we do not provide a dedicated function in spatstat, but you could use the following trick. If fit is your fitted model of class lppm, first extract the corresponding GLM using
g <- getglmfit(as.ppm(fit))
Next install the package faraway and use the vif function to calculate the variance inflation factors
library(faraway)
vif(g)

R: Using relative importance (relaimpo package) to build a linear model for prediction?

I have a huge dataset and I'm trying to build a good predictive linear model using the relaimpo package.
Using the calc.relimp function with type="lmg, i get an output of variables which are of relative importance. Although the proportion of variance explained by the model is only at 52%, I want to go and build a linear model using these variables.
Is there a way to build a lm model using these variables and somehow take into account the relative importance values into the model?
I'm not too familiar with this and was thinking maybe something along the lines of weighting each variable based on its relative importance value...?
I'm not a statistician, so I won't give you any Greek symbols, but I think you are confusing a few things.
As you correctly say, the relative importances based on the LMG method are more or less some sort of variance decomposition in case of correlated predictor variables, i.e. it tells you how much of your variance in the model is explained by which predictor.
However, this doesn't have to do anything with the lm function and its estimation itself. In fact, the R² of your lm model is exactly the same as you'll get by summing up the relative importances from calc.relimp.
There is no way to tell the lm function to pay more attention to a certain predictor during prediction/estimation.
What you probably want to do is an elastic net (which is a combination of LASSO and RIDGE regression), which basically does what you want, i.e. it shrinks the impact of "unimportant"/small predictors and emphasizes the impact of important/large predictors: https://en.wikipedia.org/wiki/Elastic_net_regularization (Lasso and Ridge regression are linked in the Wikipedia article).
I think this one here is the original package from Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al.: https://cran.r-project.org/web/packages/glmnet/index.html

Get the AIC or BIC citerium from a gamm, gam, and lme models: How in mgcv? And how can I trust the result? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 3 years ago.
Improve this question
I am new to Gamms and gams, so this question may be a bit basic, I'd appreciate your help on this very much:
I am using the following code:
M.gamm <- gamm (bsigsi ~ s(summodpa, sed,k= 1, fx= TRUE, bs="tp") + s(sumlightpa, sed, k=1, fx= TRUE, bs="tp") , random = list(school=~ 1) , method= "ML", na.action= na.omit, data= Pilot_fitbit2)
The code runs, but gives me this feedback:
Warning messages: 1: In smooth.construct.tp.smooth.spec(object,
dk$data, dk$knots) : basis dimension, k, increased to minimum
possible
2: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
basis dimension, k, increased to minimum possible
Questions:
My major question is however how I can get an AIC or BIC from this?
I've tried BIC(M.gamm$gam) and BIC(M.gamm$lme), since gamm exists of both parts (lme and gam), and for the latter one (with lme) I do get a value, bot for the first one, I don'get a value. Does anyone know why and how I can get one?
The issue is that I would like to compare this value to the BIC value of a gam model, and I am not sure which one (BIC(M.gamm$lme) or BIC(M.gam$gam)) would be the correct one. It is possible for me to derive a BIC and AIC for the gam and lme model.
If I'd be able to get the AIC or BIC for the gamm model - how can I know I can trust the results? What do I need to be careful with so I interpret the result correctly? Currently, I am using ML in all models and also use the same package (mgcv) to estimate lme, gam, and gamm to estabilish comparability.
Any help/ advice or ideas on this would be greatly appreciated!!
Best wishes,
Noemi
Thank you very much for this!
This warnings come as a result of requesting a smoother basis of a single function for each of your two smooths; this doesn't make any sense as both such bases would only contain equivalent of constant functions, both of which are are unidentifiable given you have another constant term (the intercept) in your model. Once mgcv applies identifiable to constraints the two smooths would get dropped entirely from the model.
Hence the warnings; mgcv didn't do what you wanted. Instead it set k to be the smallest values possible. Set k to something larger; you might as well leave it at the default and not specify it in the s() if you want a low rank smooth. Also, unless you really want an unpenalized spline fit, don't use fix = TRUE.
I'm not really familiar with any theory for BIC applied to GAM(M)s that corrects for smoothness selection. The AIC method for gam() models estimated using REML smoothness selection does have some theory beyond it, including a recent paper by Simon Wood and colleagues.
The mgcv FAQ has the following two things to say
How can I compare gamm models? In the identity link normal errors case, then AIC and hypotheis testing based methods are fine. Otherwise it is best to work out a strategy based on the summary.gam Alternatively, simple random effects can be fitted with gam, which makes comparison straightforward. Package gamm4 is an alternative, which allows AIC type model selection for generalized models.
When using gamm or gamm4, the reported AIC is different for the gam object and the lme or lmer object. Why is this? There are several reasons for this. The most important is that the models being used are actually different in the two representations. When treating the GAM as a mixed model, you are implicitly assuming that if you gathered a replicate dataset, the smooths in your model would look completely different to the smooths from the original model, except for having the same degree of smoothness. Technically you would expect the smooths to be drawn afresh from their distribution under the random effects model. When viewing the gam from the usual penalized regression perspective, you would expect smooths to look broadly similar under replication of the data. i.e. you are really using Bayesian model for the smooths, rather than a random effects model (it's just that the frequentist random effects and Bayesian computations happen to coincide for computing the estimates). As a result of the different assumptions about the data generating process, AIC model comparisons can give rather different answers depending on the model adopted. Which you use should depend on which model you really think is appropriate. In addition the computations of the AICs are different. The mixed model AIC uses the marginal likelihood and the corresponding number of model parameters. The gam model uses the penalized likelihood and the effective degrees of freedom.
So, I'd probably stick to AIC, not use BIC. I'd be thinking about which interpretation of the GAM(M) I was interested most in. I'd also likely fit the random effects you have here using gam() if they are this simple. An equivalent model would include + s(school, bs = 're') in the main formula and exclude the random bit whilst using gam()
gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
Do be careful with 2D isotopic smooths; both sed and summodpa and sumlightpa need to be in the same units have the same degrees of wiggliness in each smooth. If these aren't in the same units or have different wigglinesses, use te() instead of s() for the 2D terms.
Also be careful with variables that appear in two or more smooths like this; mgcv will do it's best to make the models identifiable, but you can easily get into computational problems even so. A better modelling approach would to be estimate the marginal effects of sed and the other terms plus their 2nd order interactions by decomposing the effects in the two 2d smooths as follows:
gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
ti(summodpa, sed) + ti(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
where the ti() smooths are tensor product interaction bases, when're the main effects of the two marginal variables have been removed from the basis. Hence you can treat them as a pure smooth interaction term. In this way, the main effect of sed is contained in a single smooth term.

Negative binomial in GEE

For R packages implementing GEE such as gee, geepack, it seems that the negative binomial family is not included. I have two questions:
Are there any other R packages for GEE that I am not aware of?
If not, is there a simple step to allow the creation of a family, i.e providing the link function (log mu) and the variance function (mu + mu^2/theta), assuming theta is specified (otherwise the NB is not a GLM) and then to let the gee or geepack codes do the business in a similar fashion to glm?
You should be able to use the negative.binomial family defined in the MASS package to do this (set up a NB family with a specified theta value). It looks like geepack::geese (at least) will accept family specifications in this form. To estimate theta you might try embedding the GEE fit with a fixed theta into a loop, or make a geefit_NB(theta) function and optimize over theta.
If negative.binomial did not already exist in MASS, you could define your own family (this is admittedly a bit advanced -- I would start by downloading the source code of the MASS package and looking at the file R/neg.bin.R).

Goodness of fit functions in R

What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models as e.g. time series, micro-econometrics, mixed-effects and much more. Several of the Task Views as e.g. Econometrics discuss this in more detail. As for goodness of fit, that is also something one can spend easily an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. Infact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to seriously misspecified model (see Harrell's book on "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth a reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi squared (the sum of the squared residuals) is the metric that is optimized in that case, but it is not normalized so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
The Quick R site has a reasonable good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
The main thing you should ensure is
that your residuals are normally
distributed. Unfortunately I'm not
sure of an automated way to do that.
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
Best to keep simple, and see if linear methods work "well enuff". You can judge your goodness of fit GENERALLY by looking at the R squared AND F statistic, together, never separate. Adding variables to your model that have no bearing on your dependant variable can increase R2, so you must also consider F statistic.
You should also compare your model to other nested, or more simpler, models. Do this using log liklihood ratio test, so long as dependant variables are the same.
Jarque–Bera test is good for testing the normality of the residual distribution.

Resources