Does anyone have an idea how to do stepwise regression with Tweedie in R?
I found the mgcv package, which apparently treats the power parameter of the Tweedie distribution as yet another parameter to be estimated. This seems an improvement over having to estimate the power outside the glm with tweedie.profile, and encouraging for an automated stepwise approach. But I haven't been able to figure out whether the package also offers a stepwise function. The package manual has this to say (I got lost in the talk about smooths):
There is no step.gam in package mgcv.
To facilitate fully automatic model selection the package implements two smooth modification techniques
which can be used to allow smooths to be shrunk to zero as part of smoothness selection.
I would appreciate your help. Thanks.
Your question is not specific to the Tweedie family; it concerns a general mgcv feature for model selection.
mgcv does not use step.gam for model selection. I think your confusion comes from another package, gam, which uses step.gam to sequentially add/drop terms and report AIC. When you type ?step.gam in mgcv, it refers you to ?gam.selection; ?step.gam is intentionally kept there in case people search for it, but all the details are given in ?gam.selection.
There is no need for step.gam in mgcv: model estimation and model selection are integrated. For a penalized regression/smoothing spline, as the smoothing parameter goes to infinity (becomes very large), the smooth's second derivative is penalized to zero, leaving a simple linear term. For example, if we specify a model like:
y ~ s(x1, bs = 'cr') + s(x2, bs = 'cr')
while s(x2) is a spurious term that should not be in the model, then mgcv::gam/bam will shrink s(x2) to x2 during estimation, resulting in a model like:
y ~ s(x1) + x2
This means, when you use plot.gam() to inspect the estimated smooth function for each model term, s(x1) is a curve, but s(x2) is a straight line.
Now this is not entirely satisfying. For a complete, successful model selection, we want to drop x2 as well, i.e., shrink s(x2) to 0, to get notationally a model:
y ~ s(x1)
But this is not difficult to achieve. We can use the shrinkage smooth classes bs = 'ts' (shrinkage thin plate regression spline, as opposed to the ordinary 'tp') or bs = 'cs' (shrinkage cubic regression spline, as opposed to the ordinary 'cr'), and mgcv::gam/bam should then be able to shrink s(x2) to 0. The math behind this is that mgcv modifies the eigenvalues of the penalty null space (the linear part) from 0 to 0.1, a small but positive number, so that penalization also acts on the linear term. As a result, when you do plot.gam(), you will see that s(x2) is a horizontal line at 0.
bs = 'cs' or bs = 'ts' are specified inside s(); alternatively, mgcv lets you leave bs = 'cr' or bs = 'tp' untouched in s() and instead set select = TRUE in gam() or bam(). select = TRUE is the more general treatment, since shrinkage smooths currently exist only for the 'cs' and 'ts' classes, while select = TRUE works for all kinds of smooth specifications. Both do essentially the same thing: increase the zero eigenvalues to 0.1.
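Tying this back to the Tweedie part of your question, here is a minimal sketch of both routes (hypothetical data frame mydata with response y and candidate covariates x1 and x2; tw() is the mgcv Tweedie family that estimates the power parameter during fitting):
library(mgcv)
## Route 1: shrinkage smooths
b1 <- gam(y ~ s(x1, bs = 'ts') + s(x2, bs = 'ts'),
          family = tw(), method = 'REML', data = mydata)
## Route 2: ordinary smooths plus select = TRUE
b2 <- gam(y ~ s(x1) + s(x2),
          family = tw(), method = 'REML', select = TRUE, data = mydata)
summary(b1); summary(b2)  ## terms whose edf is shrunk to ~0 are effectively dropped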
The following example is taken from the example under ?gam.selection. Note how select = TRUE shrinks several terms to 0, giving an informative model selection.
library(mgcv)
set.seed(3);n<-200
dat <- gamSim(1,n=n,scale=.15,dist="poisson") ## simulate data
dat$x4 <- runif(n, 0, 1);dat$x5 <- runif(n, 0, 1) ## spurious
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,
family=poisson,select=TRUE,method="REML")
summary(b)
plot.gam(b,pages=1)
Note that the p-values in summary.gam() also give evidence for this selection:
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x0) 1.7655119 9 5.264 0.0397 *
s(x1) 1.9271039 9 65.356 <2e-16 ***
s(x2) 6.1351372 9 156.204 <2e-16 ***
s(x3) 0.0002618 9 0.000 0.4088
s(x4) 0.0002766 9 0.000 1.0000
s(x5) 0.1757146 9 0.195 0.2963
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.545 Deviance explained = 51.6%
-REML = 430.78 Scale est. = 1 n = 200
As a follow-up to the question Creating confidence intervals for regression curve in GLMM using Bootstrapping, I am interested in getting the correct values of a regression curve and the associated confidence interval curves.
Consider a case where in a GLMM, there is one response variable, two continuous fixed effects and one random effect. Here is some fake data:
library (dplyr)
set.seed (1129)
x1 <- runif(100,0,1)
x2 <- rnorm(100,0.5,0.4)
f1 <- gl(n = 5,k = 20)
rnd1<-rnorm(5,0.5,0.1)
my_data <- data.frame(x1=x1, x2=x2, f1=f1)
modmat <- model.matrix(~x1+x2, my_data)
fixed <- c(-0.12,0.35,0.09)
y <- (modmat%*%fixed+rnd1)
my_data$y <- ((y - min (y))/max(y- min (y))) %>% round (digits = 1)
rm (y)
The GLMM that I fit looks like this:
library(lme4)
m1 <- glmer(y ~ x1 + x2 + (1|f1), data = my_data, family = "binomial")
summary (m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot (y~x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to invert the link function (in this case, the logit):
y = 1/(1+exp(-(Intercept+b*x1+c*x2)))
In order to plot a regression curve of x1 in a two-dimensional space, I set x2 = mean(x2) in the formula (which also seems important - the red line in the following plots ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq (from = 0, to = 1, length.out = 100)
yy <- 1/(1+exp(-(-9.668+12.855*xx+4.875*mean(x2))))
yyy <- 1/(1+exp(-(-9.668+12.855*xx)))
lines (yy ~ xx, col = "blue")
lines (yyy~ xx, col = "red")
I think the blue line does not look very good (and the red line worse, of course). So, as a side question: is y = 1/(1+exp(-(Intercept+b*x1+c*x2))) always the right choice for back-transforming the logit link? I ask because I found https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason for the model not to fit so well? Maybe my data creation process is somewhat 'bad'.
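As a quick check of the algebra (a minimal sketch using the objects defined above; eta is just a helper name): 1/(1+exp(-x)) is the inverse logit, which R provides as plogis(), so the hand-coded curve can be verified against it:
eta <- -9.668 + 12.855*xx + 4.875*mean(x2)  ## linear predictor along xx
all.equal(1/(1 + exp(-eta)), plogis(eta))   ## TRUE: the two give identical curves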
What I need now is to add the 95% confidence interval to the curve. I think that bootstrapping using the bootMer function should be a good approach. However, all examples that I found were for models with a single fixed effect. @Jamie Murphy asked a similar question, but he was interested in models containing a continuous and a categorical variable as fixed effects, here: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous variable as fixed effects, I get lost. Perhaps someone can help solve this issue, possibly with a modification of the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/
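For what it is worth, here is a minimal sketch (not from the cited tutorial) of how bootMer might be used with two continuous fixed effects, holding x2 at its mean as above; nsim = 200 is only an illustrative value:
library(lme4)
newd <- data.frame(x1 = xx, x2 = mean(my_data$x2))
## population-level predicted probabilities (random effects excluded via re.form = NA)
pred_fun <- function(fit) predict(fit, newdata = newd, re.form = NA, type = "response")
bb <- bootMer(m1, pred_fun, nsim = 200, seed = 1)  ## parametric bootstrap
## pointwise 95% percentile interval at each value of x1
ci <- apply(bb$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
plot(y ~ x1, my_data)
lines(xx, pred_fun(m1), col = "blue")
lines(xx, ci[1, ], lty = 2); lines(xx, ci[2, ], lty = 2)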
I have the following situation:
My fixed-effect model finds a main effect of Relation_PenultimateLast in the group of participants called 'composers'. I therefore want to find which levels of Relation_PenultimateLast differ statistically from the others.
library(lmerTest)  ## assumed: the summary below reports Satterthwaite df and p-values
f.e.model.composers = lmer(Score ~ Relation_PenultimateLast + (1|TrajectoryType) + (1|StimulusType) + (1|Relation_FirstLast) + (1|LastPosition), data=datasheet.complete.composers)
summary(f.e.model.composers)
Random effects:
Groups Name Variance Std.Dev.
TrajectoryType (Intercept) 0.005457 0.07387
LastPosition (Intercept) 0.036705 0.19159
Relation_FirstLast (Intercept) 0.004298 0.06556
StimulusType (Intercept) 0.019197 0.13855
Residual 1.318116 1.14809
Number of obs: 2200, groups:
TrajectoryType, 25; LastPosition, 8; Relation_FirstLast, 4; StimulusType, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.90933 0.12476 14.84800 23.320 4.15e-13 ***
Relation_PenultimateLast 0.09987 0.02493 22.43100 4.006 0.000577 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have to make a Tukey comparison of my lmer() model.
Now, I have found two methods for comparing the Relation_PenultimateLast levels (from here: https://stats.stackexchange.com/questions/237512/how-to-perform-post-hoc-test-on-lmer-model):
summary(glht(f.e.model.composers, linfct = mcp(Relation_PenultimateLast = "Tukey")), test = adjusted("holm"))
and
lsmeans(f.e.model.composers, list(pairwise ~ Relation_PenultimateLast), adjust = "holm")
These do not work.
The former reports:
Variable(s) ‘Relation_PenultimateLast’ of class ‘integer’ is/are not contained as a factor in ‘model’
The latter:
Relation_PenultimateLast lsmean SE df lower.CL upper.CL
2.6 3.168989 0.1063552 8.5 2.926218 3.41176
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
$` of contrast`
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Can somebody help me understand why I have this result?
First, it's important to realize that the model you have fitted is inappropriate. It uses Relation_PenultimateLast as a numeric predictor; thus it fits a linear trend to its values 1, 2, 3, and 4, rather than separate estimates for each level of this as a factor. I also wonder, given the plot you show, why Test is not in the model; it looks like it should be (again as a factor, not a numeric predictor). I suggest that you get some statistical consulting help to check that you are using appropriate models in your research. Perhaps you could give a graduate student in statistics some grounding in practical applications -- a win-win proposition.
To model Relation_PenultimateLast as a factor, one way is to replace it in the model formula with factor(Relation_PenultimateLast). That will work for lsmeans() but not glht(). A better way is probably to change it in the dataset:
datasheet.complete.composers = transform(datasheet.complete.composers,
Relation_PenultimateLast = factor(Relation_PenultimateLast))
f.e.model.composers = lmer(...) ### (as before, assuming Test isn't needed)
(BTW, you must be a heck of a better typist than I am; I'd use shorter names, though I do applaud using informative ones.)
(Note: is f.e.model.composers supposed to suggest a fixed-effects model? It isn't one; it is a mixed model. Again, a consultant...)
The lsmeans package is destined to be deprecated, so I suggest you use its continuation, the emmeans package:
library(emmeans)
emmeans(f.e.model.composers, pairwise ~ Relation_PenultimateLast)
I suggest using the default "tukey" adjustment rather than Holm for this application.
If indeed Test should be in the model, then it looks like you need to include the interaction; so it'd go something like this:
model.composers = lmer(Score ~ Relation_PenultimateLast * factor(Test) + ...)
### A plot like the one shown, but based on the model predictions:
emmip(model.composers, Relation_PenultimateLast ~ Test)
### Estimates and comparisons of Relation_PenultimateLast for each Test:
emmeans(model.composers, pairwise ~ Relation_PenultimateLast | Test)
I am using geepack for R to estimate a logistic marginal model with geeglm(), but I am getting garbage estimates: they are about 16 orders of magnitude too large. However, the p-values seem similar to what I expected. This means that the fitted response essentially becomes a step function. See the attached plot.
Here is the code that generates the plot:
require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
Here is the regression table:
Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data,
id = defacto, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -7.38e+15 1.47e+15 25.1 5.4e-07 ***
power 2.05e+13 1.60e+12 164.4 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1.03e+15 1.65e+37
Correlation: Structure = exchangeable Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.196 3.15e+21
Number of clusters: 3 Maximum cluster size: 381
Hoping for some help. Thanks!
Kind regards,
Marius
I will give three procedures, each of which is a marginalized random intercept model (MRIM). These MRIMs have coefficients with marginal logistic interpretations and are of smaller magnitude than the GEE:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| `L_N` | -1.050| 0.00267| -270.1|
| `LLB` | -0.668| 0.00343| -273.8|
| `LPN` | -1.178| 0.00569| -266.4|
compared to a glm that doesn't account for any correlation, for reference:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| strt | -0.207| 0.00216| -317.1|
A marginalized random intercept model (MRIM) is worth exploring because you want a marginal model with exchangeable correlation structure for the clustered data, and that is the type of structure MRIMs exhibit.
The code (especially the R script with comments) and PDFs of the literature are in the GitHub repo. I detail the code and literature below.
The concept of MRIM has been around since 1999, and some background reading on it is in the GitHub repo. I suggest reading Swihart et al. 2014 first because it reviews the other papers.
In chronological order --
L_N Heagerty (1999): this approach fits a random intercept logistic model with a normally distributed random intercept. The trick is that the predictor in the random intercept model is nonlinearly parameterized with marginal coefficients, so that the resulting marginal model has a marginal logistic interpretation. Its code is the lnMLE R package (not on CRAN, but on Patrick Heagerty's website here). This approach is denoted L_N in the code to indicate logit (L) on the marginal scale, no interpretation (_) on the conditional scale, and a normally (N) distributed random intercept.
LLB Wang & Louis (2003): this approach fits a random intercept logistic model with a bridge-distributed random intercept. Unlike Heagerty 1999, where the trick is a nonlinear predictor for the random intercept model, here the trick is a special random-effects distribution (the bridge distribution) that allows both the random intercept model and the resulting marginal model to have a logistic interpretation. Its code is implemented with gnlmix4MMM.R (in the repo), which uses the rmutil and repeated R packages. This approach is denoted LLB in the code to indicate logit (L) on the marginal scale, logit (L) on the conditional scale and a bridge (B) distributed intercept.
LPN Caffo and Griswold (2006): the approach fits a random intercept probit model with a normally distributed random intercept, whereas Heagerty 1999 used a logit random intercept model. This substitution makes computations easier and still yields a marginal logit model. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. This approach is denoted LPN in the code to indicate logit (L) on the marginal, probit (P) on the conditional scale and a normally (N) distributed intercept.
Griswold et al (2013): another review / practical introduction.
Swihart et al 2014: This is a review paper for Heagerty 1999 and Wang & Louis 2003 as well as others and generalizes the MRIM method. One of the most interesting generalizations is allowing the logistic CDF (equivalently, logit link) in both the marginal and conditional models to instead be a stable distribution that approximates a logistic CDF. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. I denote this SSS in the R script with comments to indicate stable (S) on the marginal, stable (S) on the conditional scale and a stable (S) distributed intercept. It is included in the R script but not detailed in this post on SO.
Prep
#code from OP Question: edit `data` to `d`
require(geepack)
d = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=d, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=d)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
#get some starting values from glm():
strt <- coef(glm(moden ~ power, family = binomial, data=d))
strt
#I'm so sorry but these methods use attach()
attach(d)
L_N Heagerty (1999)
# marginally specifies a logit link and has a nonlinear conditional model
# the following code will not run if lnMLE is not successfully installed.
# See https://faculty.washington.edu/heagerty/Software/LDA/MLV/
library(lnMLE)
L_N <- logit.normal.mle(meanmodel = moden ~ power,
logSigma= ~1,
id=defacto,
model="marginal",
data=d,
beta=strt,
r=10)
print.logit.normal.mle(L_N)
Prep for LLB and LPN
library("gnlm")
library("repeated")
source("gnlmix4MMM.R") ## see ?gnlmix; in GITHUB repo
y <- cbind(d$moden,(1-d$moden))
LLB Wang and Louis (2003)
LLB <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~ 1/(1+exp(-(a0 + a1*power)*sqrt(1+3/pi/pi*exp(pmix)) - sqrt(1+3/pi/pi*exp(pmix))*log(sin(pi*pnorm(rand/sqrt(exp(pmix)))/sqrt(1+3/pi/pi*exp(pmix)))/sin(pi*(1-pnorm(rand/sqrt(exp(pmix))))/sqrt(1+3/pi/pi*exp(pmix)))))),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LLB$code
print("coefficients")
LLB$coeff
print("se")
LLB$se
LPN Caffo and Griswold (2006)
LPN <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~pnorm(qnorm(1/(1+exp(-a0 - a1*power)))*sqrt(1+exp(pmix)) + rand),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LPN$code
print("coefficients")
LPN$coeff
print("se")
LPN$se
coefficients from 3 approaches:
rbind("L_N"=L_N$beta, "LLB" = LLB$coefficients[1:2], "LPN"=LPN$coefficients[1:2])
max log likelihood for 3 models:
rbind("L_N"=L_N$logL, "LLB" = -LLB$maxlike, "LPN"=-LPN$maxlike)
I have serial data formatted as follows:
time milk Animal_ID
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
. . .
In general, 300 animals have milk records at different time points over a short period. However, if we pool their data together and ignore Animal_ID, we get a curve of milk ~ time like the line in the figure below:
The figure also shows the data for one example animal; each animal's series is short and highly variable. My purpose is to smooth each animal's data, but it would be good if the model also allowed a general pattern learned from the whole data set to be included. I tried different smooths (ns, bs, smooth.spline) with the following format, but it just did not work:
library(nlme); library(splines)  ## lme() from nlme, bs() from splines
mod <- lme(milk ~ bs(time, df = 3), data = dat, random = ~ 1 | Animal_ID)
I am hoping that somebody who has already dealt with this problem can give me some advice. Thanks.
The full dataset can be accessed from here:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0
I would suggest you use the mgcv package. This is one of R's recommended packages, and it fits a class of models called generalized additive mixed models. You can load it with library(mgcv). It is a very powerful library, which can handle everything from the simplest linear regression, to generalized linear models, additive models and generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via
ls("package:mgcv")
And you can see there are many of them.
For your specific data and problem, you may use a model with formula:
model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
In mgcv, s() sets up a smooth function, represented by the spline basis implied by bs. "cr" is the cubic regression spline basis, which is exactly what you want. k is the basis dimension (roughly, the number of knots); it should be chosen with an eye on the number of unique values of the variable time in your data set. If you set k to exactly that number, you end up with a smoothing spline, while any smaller value gives a regression spline. However, both are penalized (if you know what penalization means). I read your data in:
dat <- na.omit(read.csv("data.txt", header = TRUE)) ## I saved you data into file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat) ## 12624 observations
length(unique(dat$time)) ## 157 unique time points
length(ID <- levels(dat$Animal_ID)) ## 355 cows
There are 157 unique values, so I reckon k = 100 is possibly appropriate.
For Animal_ID (coerced to a factor), we need a model term for the random effect. "re" is a special class for i.i.d. random effects. It is passed to bs for internal matrix-construction reasons (so this is not really a smooth function!).
Now, to fit a GAM, you can call the legacy gam or the actively developed bam (gam for big data). I think you will want the latter. They share a calling convention similar to lm and glm. For example, you can do:
fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
As you can see, bam allows multi-core parallel computation via nthreads, while discrete is a newer feature that speeds up matrix formation.
Finally, since you are dealing with time-series data, you might consider temporal autocorrelation. mgcv allows an AR(1) correlation structure, whose correlation coefficient is passed via the bam argument rho. However, you need an extra indicator, AR.start, to tell mgcv how the time series breaks up into pieces: whenever a new Animal_ID starts, AR.start gets a TRUE to mark a new segment of the series. See ?bam for details.
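A minimal sketch of that setup (rho = 0.5 is only an illustrative value, not one estimated from these data; the data must be ordered by cow and then time):
dat <- dat[order(dat$Animal_ID, dat$time), ]          ## sort by cow, then time
fit_ar <- bam(milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're'),
              data = dat, family = "gaussian",
              rho = 0.5,                               ## assumed AR(1) coefficient
              AR.start = !duplicated(dat$Animal_ID))   ## TRUE at each cow's first record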
mgcv also provides
summary.gam function for model summary
gam.check for basic model checking
plot.gam function for plotting individual terms
predict.gam (or predict.bam) for prediction on new data.
For example, the summary of the above suggested model is:
> summary(fit)
Family: gaussian
Link function: identity
Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.1950 0.2704 96.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 10.81 13.67 5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.805 Deviance explained = 81.1%
fREML = 29643 Scale est. = 5.5681 n = 12624
The edf (effective degrees of freedom) may be thought of as a measure of the degree of non-linearity. We put in k = 100, yet end up with edf = 10.81, which suggests that the spline s(time) has been heavily penalized. You can view what s(time) looks like with:
plot.gam(fit, page = 1)
Note that the random effect s(Animal_ID) also has a "smooth": a cow-specific constant. For random effects, a Gaussian Q-Q plot is returned instead.
The diagnostic figures returned by
invisible(gam.check(fit))
look OK, so I think the model is acceptable (I am not doing model selection for you, so think up a better model if you believe there is one).
If you want to make prediction for Animal_ID = 26, you may do
newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = "link", se.fit = TRUE)
Note that
You need to include both variables in newd (otherwise mgcv complains about a missing variable).
Since you have only one spline smooth, s(time), and the random effect term s(Animal_ID) is a constant per Animal_ID, it is OK to use type = 'link' for individual prediction. By the way, type = 'terms' is slower than type = 'link'.
If you want to make prediction for more than one cows, try something like this:
pred.ID <- ID[1:10] ## predict the first 10 cows
newd <- data.frame(time = rep(1:150, times = length(pred.ID)),
                   Animal_ID = factor(rep(pred.ID, each = 150)))
oo <- predict.bam(fit, newd, type = "link", se.fit = TRUE)
Note that
I have used predict.bam here, as we now have 150 * 10 = 1500 data points to predict, and we also require se.fit = TRUE. That is rather expensive, and predict.bam is faster than predict.gam. In particular, if you have fitted your model with bam(..., discrete = TRUE), you can use predict.bam(..., discrete = TRUE). Prediction goes through the same matrix-formation steps as model fitting (see ?smoothCon, used in model fitting, and ?PredictMat, used in prediction, if you are keen to know more about the internals of mgcv).
I specified Animal_ID as a factor, because it is a random effect.
For more on mgcv, you can refer to the package manual. Check especially ?mgcv, ?gam, ?bam and ?s.
Final update
Though I said that I would not help you with model selection, I think this model is better (it gives a higher adjusted R-squared) and is also more sensible:
model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
The last term imposes a random slope. This means we assume that each individual cow has its own increasing/decreasing pattern of milk production, which is a more sensible assumption for your problem; the earlier model with only a random intercept is not sufficient. After adding the random slope, the smooth term s(time) looks smoother. This is a good sign, not a bad one, because we want a simple explanation for s(time), don't we? Compare the s(time) you get from both models and see what you discover.
I have also reduced k from 100 to 20. As we saw in the previous fit, the edf for this term is about 10, so k = 20 is quite sufficient.
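As a sketch (reusing the bam settings from before), the refit and the suggested comparisons would look like this:
fit2 <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
summary(fit2)                     ## compare adjusted R-squared with the earlier fit
plot(fit2, pages = 1, scale = 0)  ## compare s(time) between the two fits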
Using a penalized spline in mgcv, I want to obtain effective degrees of freedom (EDF) of 10 per year in the example data (60 for the entire period).
library(mgcv)
library(dlnm)
df <- chicagoNMMAPS
df1<-subset(df, as.Date(date) >= '1995-01-01')
mod1 <-gam(resp ~ s(time,bs='cr',k=6*15, fx=F)+ s(temp,k=6, bs='cr') + as.factor(dow)
,family=quasipoisson,na.action=na.omit,data=df1)
In the example data the edf for time is 56.117, which is less than 10 per year.
summary(mod1)
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 56.117 67.187 5.369 <2e-16 ***
s(temp) 2.564 3.204 0.998 0.393
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.277 Deviance explained = 28.2%
GCV score = 1.1297 Scale est. = 1.0959 n = 2192
Manually, I can change the edf by supplying smoothing parameters. The smoothing parameters from the fitted model are:
mod1$sp
s(time) s(temp)
23.84809 17.23785
Then I plug the sp values into a new model and rerun it, altering only the smoothing parameter for time, until I obtain an edf of around 60.
I will start with a lower value and check the edf:
mod1a <-gam(resp ~ s(time,bs='cr',k=6*15, fx=F)+ s(temp,k=6, bs='cr') + as.factor(dow)
,family=quasipoisson,na.action=na.omit,data=df1, sp= c(12.84809, 17.23785
))
summary(mod1a)
# edf 62.997
I have to increase the smoothing parameter for time to bring the edf down to around 60.
mod1b <-gam(resp ~ s(time,bs='cr',k=6*15, fx=F)+ s(temp,k=6, bs='cr') + as.factor(dow)
,family=quasipoisson,na.action=na.omit,data=df1, sp= c(14.84809, 17.23785
))
summary(mod1b)
# edf 61.393 ## edf still too large, so I have to increase the sp further
mod1c <-gam(resp ~ s(time,bs='cr',k=6*15, fx=F)+ s(temp,k=6, bs='cr') + as.factor(dow)
,family=quasipoisson,na.action=na.omit,data=df1, sp=c(16.8190989, 17.23785))
summary(mod1c)
# edf 60.005 ## This is what I want as the final model.
How can one achieve this final result with an efficient code?
I don't understand the details of your model, but if you are looking to hit a target edf for models fitted with different sp, optim will do the job. First, create a function that returns the distance of the edf from the target (60) for a given value of sp.
edf.by.sp<-function(sp) {
model <-gam(resp ~ s(time,bs='cr',k=6*15, fx=F)+ s(temp,k=6, bs='cr') +
as.factor(dow),
family=quasipoisson,
na.action=na.omit,
data=df1,
sp = c(sp, 17.23785) # Not sure if this is quite right.
)
abs(summary(model)$s.table['s(time)','edf'] - 60) # Absolute distance from 60, so edf = 60 gives the minimum.
}
Now, you can just run optim to minimize this distance:
# You could pick any reasonable starting sp value.
# Many optimization methods are available, but in your case
# they work equally well.
best<-optim(12,edf.by.sp,method='BFGS')$par
best
# 16.82708
and, substituting back in, the function returns nearly 0 (i.e., an edf of almost exactly 60):
edf.by.sp(best) # 2.229869e-06
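To finish, a sketch of plugging the optimized value back into the model (keeping the second smoothing parameter fixed at the value used above):
mod.final <- gam(resp ~ s(time, bs='cr', k=6*15, fx=F) + s(temp, k=6, bs='cr') + as.factor(dow),
                 family = quasipoisson, na.action = na.omit, data = df1,
                 sp = c(best, 17.23785))
summary(mod.final)$s.table['s(time)', 'edf']  ## approximately 60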
Why use a penalized spline and then modify its smoothing parameters to create a fixed regression spline? That makes no sense to me.
A fixed df cubic regression spline with 60 edf is fitted like this:
mod1 <-gam(resp ~ s(time,bs='cr',k=61,fx=TRUE)+
s(temp,k=6, bs='cr') + as.factor(dow)
,family=quasipoisson,na.action=na.omit,data=df1)
This gives exactly the 60 edf you want:
> summary(mod1)
Family: quasipoisson
Link function: log
...
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 60.000 60.000 6.511 <2e-16 ***
s(temp) 2.505 3.165 0.930 0.427
If you want a penalized spline, then use a penalized spline and accept that the core idea of penalization is exactly that you do NOT have a fixed edf.