Specify varying-coefficient and factor-level replicate in mgcv - R

I am using the mgcv package in R to fit smooths. I am interested in fitting a varying-coefficient model, where the varying-coefficient smooth also varies based on a factor variable.
According to the mgcv documentation, when specifying a smooth with s(), the by argument can take either a numeric variable, in which case a varying-coefficient model is fitted, or a factor variable, in which case a replicate of the smooth is produced for each factor level. However, the documentation does not say how to specify a model with a varying-coefficient effect AND have that effect differ across multiple factor levels. I don't see any reason why this wouldn't be possible, so it is somewhat odd that these two different effects are specified by the same argument.

I reached out to the creator of the mgcv package with this question, and here is the response I got:
This should be possible, but there may be an identifiability problem.
Suppose f is a factor with levels "a", "b", "c", x is your smoothing
variable and z your covariate of interest, you want a model something like
y = z s_f(x) + noise
Then
da <- as.numeric(f == "a") * z
db <- as.numeric(f == "b") * z
dc <- as.numeric(f == "c") * z
gam(y ~ s(x, by = da) + s(x, by = db) + s(x, by = dc) - 1)
would fit the model. The only problem with this is the -1. We need it
because otherwise the intercepts of the smooths are confounded with the
overall intercept. -1 deals with the identifiability in this simple
case, but won't in some other situations - for example if you include a
factor variable parametrically in the model, or have a second set of
smooths conditional on factor levels in this way.
A possibility in these cases would be to use the select=TRUE argument
to gam, which will formally remove the identifiability problem.
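Putting that advice together, here is a minimal runnable sketch; the simulated data and variable names are my own additions for illustration, while the dummy-variable construction and the -1 / select = TRUE options come from the answer above:
library(mgcv)

set.seed(1)  # simulate some illustrative data
n <- 300
f <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x <- runif(n)
z <- rnorm(n)
y <- z * (sin(2 * pi * x) + as.numeric(f)) + rnorm(n)

# one dummy-weighted copy of the covariate z per factor level
da <- as.numeric(f == "a") * z
db <- as.numeric(f == "b") * z
dc <- as.numeric(f == "c") * z

# -1 drops the overall intercept so the smooths' intercepts are identifiable;
# select = TRUE is the more robust option when other terms are present
m <- gam(y ~ s(x, by = da) + s(x, by = db) + s(x, by = dc) - 1,
         select = TRUE)
summary(m)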

Related

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a percentage ranging between 0-1. My data is nested by taxonomy (evolutionary relationship), say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that
does each level of fac respond differently with regard to temp (linearly, so slope)
is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)
does each level of my taxonomy respond differently with regard to temp (linearly, so slope)
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and do a linear beta regression for each species i as betareg(Y(i) ~ temp). Then extract the slope and intercept for each species, group them to a higher taxonomic level per fac, and compare the distribution of slopes (or intercepts), say via Kullback-Leibler divergence, to a distribution obtained by bootstrapping my Y values. Or compare the distribution of slopes (or intercepts) just between taxonomic levels or the levels of my factor fac, respectively. Or just compare mean slopes and intercepts between taxonomic levels or factor levels.
Not sure if this is a good idea. I am also not sure how to answer the question of how much variance is explained by my taxonomy levels, as in nested random/mixed effect models.
Another option may be just such mixed models, but how can I include all the aspects I want to test in one model?
Say I could use the gamlss package to do:
library(gamlss)
model <- gamlss(Y ~ temp*fac + re(random = ~1|phylum/genus/family/species), family = BE)
But here I see no way to incorporate a random slope. Or can I do:
model <- gamlss(Y ~ re(random = ~temp*fac|phylum/genus/family/species), family = BE)
but the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that supports nested structures and beta regression?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
        data = ...,
        family = beta_family())
If you have zero values, you will need to do something about them. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc.
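For reference, the Smithson and Verkuilen "squeeze" is just a linear compression of y from [0, 1] into (0, 1); here is a minimal sketch (the function and variable names are mine):
# Smithson & Verkuilen (2006) transformation: y'' = (y*(n - 1) + 0.5)/n
squeeze <- function(y, n = length(y)) (y * (n - 1) + 0.5) / n
dat$Y_sq <- squeeze(dat$Y)  # `dat` stands in for your data frame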
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

GAM with only Categorical/Logical

I'm currently trying to use a GAM to build a rough expected-goals model based purely on commentary data from ESPN. However, all the data is either a categorical variable or a logical vector, so I'm not sure if there's a way to smooth, or if I should just use the factor names.
Here are my variables:
shot_where (factor): shot location (e.g. right side of the box)
assist_class (factor): type of assist (cross, through ball, pass)
follow_corner (logical): whether the shot follows a corner
shot_with (factor): right foot, left foot, header
follow_set_piece (logical): whether the shot follows a set piece
I think I should just use the variable names directly in the formula:
model <- bam(is_goal ~ shot_where + assist_class + follow_set_piece + shot_where + follow_corner + shot_where:shot_with, data = model_data, family = "binomial", method = "REML")
The shot_where:shot_with term would incorporate any interactions between these two variables.
However, I was told I could smooth factor variables as well using the below structure.
model <- bam(is_goal ~ s(shot_where, bs = 'fs') + s(assist_class, bs = 'fs') + as.logical(follow_set_piece) +
as.logical(follow_corner) + s(shot_with, bs = 'fs'), data = model_data, family = "binomial", method = "REML")
This worked for creating a model, but I want to make sure this is a correct method of building the model. I've yet to see any information on using only factor/logical variables in a GAM model, so I thought it was worth asking.
If you only have categorical covariates then you aren't fitting a GAM, whether you fit the model with gam(), bam(), or something else.
What you are doing when you pass factor variables to s() using the fs basis like this
s(f, bs = 'fs')
is creating a random intercept for each level of the factor f.
There's no smoothing going on here at all; the model is simply exploiting the equivalence of the Bayesian view of smoothing with random effects.
Given that none of your covariates could reasonably be considered random in the sense of a mixed effects model, the only justification for doing what you're doing might be as a computational trick.
Your first model is just a simple GLM (note the typo: shot_where is repeated twice in the formula).
It's not clear to me why you are using bam() to fit this model; you're losing the computational efficiency that bam() provides by using method = 'REML'; it should be 'fREML' for bam() models. But as there is no smoothness selection going on in the first model, you'd likely be better off using glm() to fit it. If the issue is large sample sizes, there are several packages that can fit GLMs to large data, for example biglm and its bigglm() function.
In the second model there is no smoothing going on, but there is penalisation, which shrinks the estimates of the random intercepts toward zero. You're likely to get better performance on big data by using the lme4 package, or TMB and the glmmTMB package, to fit what is effectively a GLMM.
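To make that equivalence concrete, here is a hedged sketch of both alternatives, reusing the question's model_data and keeping the corrected version of the first formula; treat it as illustrative rather than the answer's definitive code:
library(lme4)

# the first model as a plain GLM (duplicate shot_where removed)
m_glm <- glm(is_goal ~ shot_where + assist_class + follow_set_piece +
               follow_corner + shot_where:shot_with,
             data = model_data, family = binomial)

# the second model as the GLMM it implicitly is: each s(f, bs = 'fs')
# term corresponds to a random intercept per level of that factor
m_glmm <- glmer(is_goal ~ follow_set_piece + follow_corner +
                  (1 | shot_where) + (1 | assist_class) + (1 | shot_with),
                data = model_data, family = binomial)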
This is more of a theoretical question than about R, but let me provide a brief answer. Essentially, the most flexible model you could estimate would be one where you used the variables as factors. It also produces a model that is reasonably easily interpreted - where each coefficient gives you the expected difference in y between the reference level and the level represented by the dummy regressor.
Smoothing splines try to strike the appropriate bias-variance tradeoff. If you've got lots of data and relatively few categories in the categorical variables, there will be no real loss in efficiency for including all of the dummy regressors representing the categories and the bias will also be as small as possible. To the extent that the smoothing spline model is different from the one treating everything as factors, it is likely inducing bias without a corresponding increase in efficiency. If it were me, I would stick with a model that treats all of the categorical variables as factors.

Is it necessary to include a factor (fitted as a smooth-factor interaction) as a parametric term within a gam?

I am interested in investigating a non-linear temporal trend in a data set and so I would like to use the R package mgcv to fit the following GAM:
model1 <- gam(Variable ~ s(Date, by = Site.Factor), data = data)
where Variable is the continuous variable of interest, Site.Factor is a factor with two levels and Date is a continuous variable.
I have read that, because of the inclusion of the by factor within the smoothing function, differences in the means of the two factor levels are not accounted for. I should therefore include Site.Factor as a parametric term like so:
model2 <- gam(Variable ~ Site.Factor + s(Date, by = Site.Factor), data = data)
However, whilst I might expect the influence of Site.Factor on the smooth to be significant, I do not expect the difference in means between the factor levels to be significant. Do I still need to include the factor separately within the model as in model2, or would model1 be okay?
Unless you know that the populations from which your data are drawn have exactly the same mean, then yes, you should include Site.Factor as a parametric (fixed effect) term, whether or not that difference is significant in your sample.

interaction contrast with glmer

I am running a model with a similar structure:
model <- glmer(protest ~ factora*factorb*numeric + factora + factorb + numeric + 1 + (1 + factora|level1) + (1|level2), data = data, family = binomial(link = 'logit'))
where factora and factorb are factor variables, numeric is a numerical variable.
I am curious about the statistical significance of the contrast in the interaction: holding factora constant at 3, comparing two values of factorb (1-5) across the range of the numeric variable.
I have tried the following options with no luck:
library(psycho)
contrasts <- get_contrasts(model, formula = "factora:factorb:numeric", adjust = "tukey")
View(contrasts$contrasts)
this works, but unfortunately the results hold numeric constant and only vary factora and factorb. Therefore, it does not answer my question.
I have also tried:
library(multcomp)
test <- glht(model, linfct = mcp("factora:factorb:numeric" = "Tukey"))
this yields the error of
Error in mcp2matrix(model, linfct = linfct) :
Variable(s) ‘factora:factorb:numeric’ have been specified in ‘linfct’ but cannot be found in ‘model’!
regardless of how I specify the interaction, and despite other functions like get_contrasts finding the interaction when specified the same way.
I have also tried:
library(emmeans)
contrast(m.3[[2]], interaction = c("factora", "factorb", "numeric"))
This, however, does not support glmer.
Any ideas?
There are a couple of issues here that are tripping you up.
One is that we don't really apply contrasts to numeric predictors. Numeric predictors have slopes, not contrasts; and if you have a model where a numeric predictor interacts with a factor, that means that the slope for the numeric predictor is different for each level of the factor. The function emtrends() in the emmeans package can help you estimate those different slopes.
The second is that the interaction argument in emmeans::contrast() needs a specification for the type of contrasts to use, e.g., "pairwise". The factors to apply them to are those in the emmGrid object in the first argument.
So... I think maybe you want to try something like this:
emt <- emtrends(model, ~ factora*factorb, var = "numeric")
emt # list the estimated slopes
contrast(emt, interaction = "consec")
# obtain interaction contrasts comparing consecutive levels of the two factors
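Since the goal was specifically the comparison with factora held at 3, something along these lines may also help; the level name "3" and the use of emmeans' at argument are my assumptions, so check levels(data$factora) first:
# slopes of `numeric` for each level of factorb, with factora fixed at "3"
emt3 <- emtrends(model, ~ factorb, var = "numeric", at = list(factora = "3"))
pairs(emt3)  # pairwise comparisons of those slopes across factorb levels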

r glm function for binomial when x has more than 2 levels

I know
glm(formula = y ~ x1 + x2 + x3, family = binomial(logit), data = birds_poop)
will give parameter estimates for each of x1, x2, and x3.
What if a predictor has more than 2 levels? How do I set the reference level, and how can I get the glm function to display parameter estimates for each level?
For example, let's say x1 has 4 levels: how can I get parameter estimates for x1-level2, x1-level3, and x1-level4, assuming x1-level0 is the reference level?
R uses treatment contrasts by default. You can change these to other types of contrasts; further information can be found in help(contrasts) and help(C). With the default settings, the (Intercept) term is the estimate on the log(odds) scale for a case with x1-level0 and all other variables at their base levels (or at 0 for numeric variables). R will report parameter estimates for the factor levels that are not the reference level. A common question is how to change the reference level, which is easily done with the factor() or relevel() functions. You could also use predict() to get estimates for specific combinations of levels.
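As a minimal sketch, reusing the birds_poop data frame and the level names from the question (whether x1 is already stored as a factor is an assumption):
# make x1 a factor and set "level0" as the reference level
birds_poop$x1 <- relevel(factor(birds_poop$x1), ref = "level0")
fit <- glm(y ~ x1 + x2 + x3, family = binomial(logit), data = birds_poop)
summary(fit)  # coefficients for x1level2, x1level3, x1level4: each is a
              # log-odds difference from the reference level0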
You should read the help file for glm and all of its links. The other wonderful feature of the help pages is that they often contain worked examples; if you have further questions and are having difficulty formulating a good example, then you may find that modifying one of those will make the discussion more concrete.
