Cubic spline method for longitudinal series data? - r

I have serial data formatted as follows:
time milk Animal_ID
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
. . .
In general, about 300 animals have milk records at different time points over a short period. If we pool the data together and ignore Animal_ID, we get a curve of milk ~ time, shown as the line in the figure below:
The figure also shows the data for one example animal: each animal's series is short and highly variable. My purpose is to smooth each animal's data, but it would be good if the model also allowed the general pattern learned from the whole data set to be included. I tried different smoothers (ns, bs, smooth.spline) in the following format, but it just did not work:
mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID)
I am hoping that somebody who has already dealt with this problem can give me some advice. Thanks.
The full dataset can be accessed from here:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0

I would suggest you use the mgcv package. It is one of the recommended R packages and fits a class of models called generalized additive mixed models. You can load it with library(mgcv). It is a very powerful package, handling everything from the simplest linear regression model to generalized linear models, additive models, generalized additive models, and models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via
ls("package:mgcv")
And you can see there are many of them.
For your specific data and problem, you may use a model with formula:
model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
In mgcv, s() sets up a smooth function, represented by the spline basis implied by bs. "cr" is the cubic regression spline basis, which is exactly what you want. k is the number of knots, and it should be chosen according to the number of unique values of time in your data set. If you set k to exactly this number, you end up with a smoothing spline, while any smaller value gives a regression spline. Both will be penalized, however (if you know what penalization means). I read your data in:
dat <- na.omit(read.csv("data.txt", header = TRUE)) ## I saved your data into the file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat) ## 12624 observations
length(unique(dat$time)) ## 157 unique time points
length(ID <- levels(dat$Animal_ID)) ## 355 cows
There are 157 unique values, so I reckon k = 100 is possibly appropriate.
For Animal_ID (coerced to a factor), we need a model term for the random effect. "re" is a special class for i.i.d. random effects. It is passed to bs for internal matrix-construction reasons (so this is not a smooth function!).
Now, to fit a GAM, you can call either the standard gam or the actively developed bam ("GAM for big data"). I think you will want the latter. Both have a calling convention similar to lm and glm. For example, you can do:
fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
As you can see, bam allows multi-core parallel computation via nthreads, while discrete is a newly developed feature that speeds up matrix formation.
Since you are dealing with time-series data, you might finally consider some temporal autocorrelation. mgcv allows an AR1 correlation structure, whose correlation coefficient is passed via the bam argument rho. However, you need an extra indicator, AR.start, to tell mgcv how the time series breaks up into pieces: for example, whenever a new Animal_ID is reached, AR.start gets a TRUE to mark the start of a new segment of the time series. See ?bam for details, and a small sketch below.
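For example, a minimal sketch (the rho value here is only a placeholder; in practice you would choose it from the residual autocorrelation of an initial fit, e.g. via acf() of the residuals):
dat <- dat[order(dat$Animal_ID, dat$time), ]   ## make each animal's series contiguous
AR.start <- !duplicated(dat$Animal_ID)         ## TRUE at the first row of each animal
fit_ar <- bam(model, data = dat, family = "gaussian",
              rho = 0.5, AR.start = AR.start)  ## rho = 0.5 is just a placeholder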
mgcv also provides
summary.gam function for model summary
gam.check for basic model checking
plot.gam function for plotting individual terms
predict.gam (or predict.bam) for prediction on new data.
For example, the summary of the above suggested model is:
> summary(fit)
Family: gaussian
Link function: identity
Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.1950 0.2704 96.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 10.81 13.67 5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.805 Deviance explained = 81.1%
fREML = 29643 Scale est. = 5.5681 n = 12624
The edf (effective degrees of freedom) may be thought of as a measure of the degree of non-linearity. We put in k = 100, yet ended up with edf = 10.81; this suggests that the spline s(time) has been heavily penalized. You can view what s(time) looks like by:
plot.gam(fit, pages = 1)
Note that the random effect s(Animal_ID) also has a "smooth": it is a cow-specific constant. For random effects, a Gaussian QQ plot is returned instead.
The diagnostic figures returned by
invisible(gam.check(fit))
look OK, so I think the model is acceptable (I am not offering you model selection here, so think up a better model if you believe there is one).
If you want to make a prediction for Animal_ID = 26, you may do:
newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = "link", se.fit = TRUE)
Note that:
You need to include both variables in newd (otherwise mgcv complains about a missing variable).
Since you have only one spline smooth, s(time), and the random-effect term s(Animal_ID) is a constant per Animal_ID, it is OK to use type = "link" for individual prediction. By the way, type = "terms" is slower than type = "link". A small plotting sketch follows these notes.
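For example, a minimal sketch of turning this prediction into a plotted curve with approximate 95% bands (purely illustrative; oo$fit and oo$se.fit come from the predict.gam() call above):
plot(newd$time, oo$fit, type = "l", xlab = "time", ylab = "predicted milk")
lines(newd$time, oo$fit + 2 * oo$se.fit, lty = 2)   ## upper ~95% limit
lines(newd$time, oo$fit - 2 * oo$se.fit, lty = 2)   ## lower ~95% limit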
If you want to make predictions for more than one cow, try something like this:
pred.ID <- ID[1:10] ## predict first 10 cows
newd <- data.frame(time = rep(1:150, times = length(pred.ID)), Animal_ID = factor(rep(pred.ID, each = 150)))
oo <- predict.bam (fit, newd, type = "link", se.fit = TRUE)
Note that
I have used predict.bam here, as we now have 150 * 10 = 1500 data points to predict, and we also require se.fit = TRUE. This is rather expensive, so predict.bam is faster than predict.gam. In particular, if you fitted your model with bam(..., discrete = TRUE), you can use predict.bam(..., discrete = TRUE). Prediction goes through the same matrix-formation steps as model fitting (see ?smoothCon, used in model fitting, and ?PredictMat, used in prediction, if you are keen to learn more about the internal structure of mgcv).
I specified Animal_ID as a factor, because it is a random effect.
For more on mgcv, you can refer to the package manual. Check especially ?mgcv, ?gam, ?bam and ?s.
Final update
Although I said I would not help you with model selection, I think this model is better (it gives a higher adjusted R-squared) and is also more reasonable, in the following sense:
model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
The last term imposes a random slope. This means we are assuming that each individual cow has a different pattern of increase/decrease in milk production, which is a more sensible assumption for your problem; the earlier model with only a random intercept is not sufficient. After adding this random slope, the smooth term s(time) looks smoother. That is a good sign, not a bad one, because we want a simple explanation for s(time), don't we? Compare the s(time) you get from both models and see what you discover.
I have also reduced k from 100 to 20. As we saw in the previous fit, the edf for this term was about 10, so k = 20 is quite sufficient.
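A minimal sketch of fitting this updated model and comparing it with the earlier one (same bam() setup as above; model2/fit2 are just names I chose here):
model2 <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
fit2 <- bam(model2, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
summary(fit2)               ## compare adjusted R-squared with the earlier fit
plot.gam(fit2, pages = 1)   ## compare the two versions of s(time)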

Related

emmeans: regrid() for binomial GLMM with user-defined link function

I have fitted a binomial GLMM in R with a modified link function with a fixed guessing probability as suggested in this thread - except that the guessing probability is 1/2 and not 1/3. Therefore the sigmoidal activation in my case becomes:
P(correct) = 0.5 + 0.5*(exp(term)/(1 + exp(term))).
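Just to make the mapping concrete, a quick sketch of this inverse link in plain R (plogis() is the standard logistic; this is only an illustration, not the mafc.logit() code itself):
m <- 2
eta <- seq(-4, 4, length.out = 101)    # the linear predictor ("term" above)
p <- 1/m + (1 - 1/m) * plogis(eta)     # P(correct) = 0.5 + 0.5 * logistic(eta)
range(p)                               # stays within [0.5, 1]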
My model looks like this:
library(lme4)
m = 2
mod = glmer(correct ~ group*stim_strength + (stim_strength|subject) ,
family=binomial(link=mafc.logit(m)), data=obs_data)
where: guessing probability is 1/m; correct is a categorical variable indicating correct/incorrect response; group is a factor with two levels; stim_strength is numerical with values in [0,1]; mafc.logit is the function suggested in the thread.
I'm essentially fitting separate psychometric curves as a function of stimulus strength (stim_strength) for the two groups, while taking into account inter-subject fluctuations in slope and intercept (the random-effect structure (stim_strength|subject)).
This is what I get:
plot_model(mod, type = 'emm', terms = c('stim_strength', 'group'))
---> plot
The model describes the data nicely, and I now want to perform some post-hoc analyses on it. Specifically, I want to run for example:
mod.emm = emmeans(mod, ~group|stim_strength, at=list(stim_strength=c(.25,.75)))
confint(regrid(mod.emm))
contrast(regrid(mod.emm), 'pairwise', simple = 'group', combine = TRUE, adjust = 'holm')
i.e. compute confidence intervals for the %correct of the two groups at some specified values of stim_strength, and compare the %correct of the two groups at these values.
Note that I'm using regrid(), because I want the analyses to be done on the back-transformed values, not on the linear part of the model!
However, regrid() won't work with a user-defined link function. In fact, regrid() is just ignored here, as you can see, e.g., from the output of the confint() call above (estimates are labelled prob but they are clearly not transformed to [.5, 1]):
stim_strength = 0.25:
group prob SE df asymp.LCL asymp.UCL
1 -1.329 0.173 Inf -1.716 -0.942
2 -0.553 0.161 Inf -0.913 -0.192
stim_strength = 0.75:
group prob SE df asymp.LCL asymp.UCL
1 1.853 0.372 Inf 1.018 2.687
2 3.375 0.395 Inf 2.489 4.261
Similarly, when adding type='response' in emmeans, I get the message:
Unknown transformation "mafc.logit(2)": no transformation done
Any workaround?
Thanks!
Looking at the linked suggestion, it appears that mafc.logit() is a function that returns a list with all the information needed to implement the transform. All you need to do is update the emmGrid object with that information:
mod.emm <- update(mod.emm, tran = mafc.logit(2))
confint(regrid(mod.emm), adjust = 'holm')
# etc...
See, for example, this vignette section and possibly other parts of that vignette.
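Putting it together, a sketch of the full post-hoc workflow with the transformation attached (same calls as in the question; mafc.logit() comes from the linked thread):
library(emmeans)
mod.emm <- emmeans(mod, ~ group | stim_strength, at = list(stim_strength = c(.25, .75)))
mod.emm <- update(mod.emm, tran = mafc.logit(2))   # attach the custom link information
confint(regrid(mod.emm))                           # CIs on the probability scale
contrast(regrid(mod.emm), 'pairwise', simple = 'group', combine = TRUE, adjust = 'holm')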

Extend doesn't show effects in R package simr

I'm trying to reproduce the example of Green and MacLeod in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504.
library(simr)
model1 <- glmer(z ~ x + (1|g), family="poisson", data=simdata)
summary(model1)
fixef(model1)["x"] <- -0.05 # Specify desired effect size
model3 <- extend(model1, along="g", n=15) # Add more groups
summary(model3)
However, in the output of model 3, the number of groups is not extended (same result as in model 1):
Random effects:
Groups Name Variance Std.Dev.
g (Intercept) 0.08345 0.2889
Number of obs: 30, groups: g, 3
I know that by checking the rows I get different results, but why isn't that part of the regression? Am I doing something wrong? How can I extend the number of groups so that I can calculate a proper powerCurve?
> nrow(getData(model1))
[1] 30
> nrow(getData(model3))
[1] 150
First off, well done on checking that the functions are doing what you expect. I hadn't expected users to be quite this thorough.
model3 now has two datasets attached to it: the original one that lme4 knows about, and a newData attribute that simr checks instead.
So print and summary, which are part of lme4, give you values from the old dataset.
But getData and powerSim and powerCurve in simr will use the new extended dataset:
> powerSim(model3, nsim=10, progress=FALSE)
Power for predictor 'x', (95% confidence interval):
80.00% (44.39, 97.48)
Test: z-test
Effect size for x is -0.050
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 150
Time elapsed: 0 h 0 m 0 s
I've opened an issue on the GitHub repository; this is probably a bug that should be fixed.
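So the extended model can be used directly for the power curve you are after; a minimal sketch (nsim is kept very small here for speed, increase it for stable estimates):
pc <- powerCurve(model3, along = "g", nsim = 10)   # power at increasing numbers of groups
print(pc)
plot(pc)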

Repeated measures ANOVA and link to mixed-effect models in R

I have a problem when performing a two-way repeated-measures ANOVA in R on the following data (link: https://drive.google.com/open?id=1nIlFfijUm4Ib6TJoHUUNeEJnZnnNzO29):
subjectnbr is the id of the subject, blockType and linesTTL are the independent variables, and RT2 is the dependent variable.
I first performed the rm ANOVA through using ezANOVA with the following code:
ANOVA_RTS <- ezANOVA(
data=castRTs
, dv=RT2
, wid=subjectnbr
, within = .(blockType,linesTTL)
, type = 2
, detailed = TRUE
, return_aov = FALSE
)
ANOVA_RTS
The result is correct (I double-checked using Statistica).
However, when I perform the rm ANOVA using the lme function, I do not get the same answer, and I have no clue why.
Here is my code:
lmeRTs <- lme(
RT2 ~ blockType*linesTTL,
random = ~1|subjectnbr/blockType/linesTTL,
data=castRTs)
anova(lmeRTs)
Here are the outputs of both ezANOVA and lme.
I hope I have been clear enough and have given you all the information needed.
I'm looking forward to your help, as I have been trying to figure this out for at least 4 hours!
Thanks in advance.
Here is a step-by-step example of how to reproduce the ezANOVA results with nlme::lme.
The data
We read in the data and ensure that all categorical variables are factors.
# Read in data
library(tidyverse);
df <- read.csv("castRTs.csv");
df <- df %>%
mutate(
blockType = factor(blockType),
linesTTL = factor(linesTTL));
Results from ezANOVA
As a check, we reproduce the ez::ezANOVA results.
## ANOVA using ez::ezANOVA
library(ez);
model1 <- ezANOVA(
data = df,
dv = RT2,
wid = subjectnbr,
within = .(blockType, linesTTL),
type = 2,
detailed = TRUE,
return_aov = FALSE);
model1;
# $ANOVA
# Effect DFn DFd SSn SSd F p
#1 (Intercept) 1 13 2047405.6654 34886.767 762.9332235 6.260010e-13
#2 blockType 1 13 236.5412 5011.442 0.6136028 4.474711e-01
#3 linesTTL 1 13 6584.7222 7294.620 11.7348665 4.514589e-03
#4 blockType:linesTTL 1 13 1019.1854 2521.860 5.2538251 3.922784e-02
# p<.05 ges
#1 * 0.976293831
#2 0.004735442
#3 * 0.116958989
#4 * 0.020088855
Results from nlme::lme
We now run nlme::lme
## ANOVA using nlme::lme
library(nlme);
model2 <- anova(lme(
RT2 ~ blockType * linesTTL,
random = list(subjectnbr = pdBlocked(list(~1, pdIdent(~blockType - 1), pdIdent(~linesTTL - 1)))),
data = df))
model2;
# numDF denDF F-value p-value
#(Intercept) 1 39 762.9332 <.0001
#blockType 1 39 0.6136 0.4382
#linesTTL 1 39 11.7349 0.0015
#blockType:linesTTL 1 39 5.2538 0.0274
Results/conclusion
We can see that the F test results from both methods are identical. The somewhat complicated structure of the random effect definition in lme arises from the fact that you have two crossed random effects. Here "crossed" means that for every combination of blockType and linesTTL there exists an observation for every subjectnbr.
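For readers more at home with lme4, a roughly equivalent way to write the same random-effect structure is sketched below (I have not run this against the data above, so treat it as a sketch rather than a verified reproduction):
library(lme4)
model3 <- lmer(RT2 ~ blockType * linesTTL +
                 (1 | subjectnbr) +
                 (1 | subjectnbr:blockType) +
                 (1 | subjectnbr:linesTTL),
               data = df)
anova(model3)   # F statistics; plain lme4 does not report denominator-df p-values here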
Some additional (optional) details
To understand the role of pdBlocked and pdIdent, we need to take a look at the corresponding two-level mixed-effects model (a sketch follows below).
The predictor variables are your categorical variables blockType and linesTTL, which are generally encoded using dummy variables.
The variance-covariance matrix of the random effects can take different forms, depending on the underlying correlation structure of your random-effect coefficients. To be consistent with the assumptions of a two-level repeated-measures ANOVA, we must specify a block-diagonal variance-covariance matrix, pdBlocked, where we create diagonal blocks for the intercept ~1 and for the categorical predictor variables blockType (pdIdent(~blockType - 1)) and linesTTL (pdIdent(~linesTTL - 1)), respectively. Note that we need to remove the intercept from the last two blocks (since it has already been accounted for).
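As a rough sketch in my own notation (not taken verbatim from the original answer), the two-level model and the implied block-diagonal random-effects covariance are:
y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad b_i \sim \mathcal{N}(0, \Psi), \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2 I),
\Psi = \begin{pmatrix} \sigma_0^2 & 0 & 0 \\ 0 & \sigma_B^2 I_2 & 0 \\ 0 & 0 & \sigma_L^2 I_2 \end{pmatrix}
where \sigma_0^2 is the random-intercept variance and \sigma_B^2, \sigma_L^2 are the common variances within the blockType and linesTTL blocks: pdBlocked builds the block-diagonal shape and pdIdent forces each block to be a scaled identity.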
Some relevant/interesting resources
Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, Springer (2000)
Potvin and Schutz, Statistical power for the two-factor repeated measures ANOVA, Behavior Research Methods, Instruments & Computers, 32, 347-356 (2000)
Deming Mi, How to understand and apply mixed-effect models, Department of Biostatistics, Vanderbilt University

How to perform a K-fold cross validation and understanding the outputs

I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
This produces the following output (summarized version):
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
I do not understand what the ms value refers to, as I have seen different interpretations on the internet. My understanding is that k-fold cross-validation produces an overall RMSE value for a specified model (which is what I am trying to obtain for my research).
I also don't understand why the results report an "Overall (Sum over all 5 folds)" when I have specified a 10-fold cross-validation in the code.
If anyone can help it would be much appreciated.
When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (I got 3.67):
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
Notice that the average, as well as most folds, is higher than the "Residual standard error" (1.814). This is what we would expect, as the CV error represents likely model performance on "test" data (not the data used to train the model). For instance, on fold 10, notice that the residuals are calculated on the predicted observations (5 observations) that were not used in training that model:
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
It appears the warning we received may be common too -- I also saw it in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
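If what you ultimately want is a single overall RMSE, one way is to work with the data frame CVlm() returns (it carries the cross-validated predictions in a cvpred column); a minimal sketch, assuming that return structure:
cv.out <- CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR))
cv.mse <- mean((cv.out$Height - cv.out$cvpred)^2)   # overall mean squared prediction error
cv.rmse <- sqrt(cv.mse)                             # RMSE on the same scale as Height
cv.rmse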
One thing I can suggest that may be useful to you: in the case of linear regression there is a closed-form solution for leave-one-out cross-validation (LOOCV) that does not require actually fitting multiple models.
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this value is 0.574 vs. the original R-squared of 0.6422.
To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:
hist(predictedresiduals)
RMSE can then be calculated as:
sqrt(mean(predictedresiduals^2))
which is close to sd(predictedresiduals) when the mean predicted residual is near zero.

mgcv: How to do stepwise regression with a Tweedie response model?

Does anyone have an idea how to do stepwise regression with Tweedie in R?
I found the mgcv package, which apparently treats the power parameter of Tweedie as yet another parameter to be estimated. This seems to improve on having to use tweedie.profile to estimate the power outside the glm, so it seems encouraging for using an automated stepwise function to do the regression. But I haven't been able to figure out if the package also offers a stepwise function. The package manual has this to say.
I got lost in the talk about smooths:
There is no step.gam in package mgcv.
To facilitate fully automatic model selection the package implements two smooth modification techniques which can be used to allow smooths to be shrunk to zero as part of smoothness selection.
I would appreciate your help. Thanks.
Your question is not specific to the Tweedie family; it concerns a general mgcv feature for model selection.
mgcv does not use step.gam for model selection. I think your confusion comes from another package, gam, which uses step.gam to sequentially add/drop terms and report AIC. When you type ?step.gam in mgcv, it refers you to ?gam.selection; ?step.gam is intentionally left there in case people search for it, but all the details are provided in ?gam.selection.
There is no need for step.gam in mgcv. Model estimation and model selection are integrated there. For a penalized regression/smoothing spline, when the smoothing parameter goes to infinity (becomes very large), the second derivative is penalized to zero, leaving a simple linear term. For example, if we specify a model like:
y ~ s(x1, bs = 'cr') + s(x2, bs = 'cr')
where s(x2) is a spurious model term that should not be included, then mgcv's gam/bam will shrink s(x2) to x2 after estimation, resulting in a model like:
y ~ s(x1) + x2
This means, when you use plot.gam() to inspect the estimated smooth function for each model term, s(x1) is a curve, but s(x2) is a straight line.
Now this is not entirely satisfying. For a complete, successful model selection, we want to drop x2 as well, i.e., shrink s(x2) to 0, giving (notationally) the model:
y ~ s(x1)
But this is not difficult to achieve. We can use the shrinkage smooth classes bs = 'ts' (shrinkage thin plate regression spline, as opposed to the ordinary 'tp') or bs = 'cs' (shrinkage cubic regression spline, as opposed to the ordinary 'cr'), and mgcv's gam/bam should then be able to shrink s(x2) to 0. The math behind this is that mgcv modifies the eigenvalues of the linear term (i.e., the null space) from 0 to 0.1, a small but positive number, so that penalization also takes effect on the linear term. As a result, when you do plot.gam(), you will see that s(x2) is a horizontal line at 0.
bs = 'cs' and bs = 'ts' are supposed to be put inside s(); alternatively, mgcv lets you leave bs = 'cr' or bs = 'tp' untouched in s() and instead set select = TRUE in gam() or bam(). select = TRUE is the more general treatment, as shrinkage smooths currently only come in the classes 'cs' and 'ts', while select = TRUE works for all kinds of smooth specifications. They essentially do the same thing, by increasing the zero eigenvalues to 0.1.
The following example is taken from the example under ?gam.selection. Note how select = TRUE shrinks several terms to 0, giving an informative model selection.
library(mgcv)
set.seed(3);n<-200
dat <- gamSim(1,n=n,scale=.15,dist="poisson") ## simulate data
dat$x4 <- runif(n, 0, 1);dat$x5 <- runif(n, 0, 1) ## spurious
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,
family=poisson,select=TRUE,method="REML")
summary(b)
plot.gam(b,pages=1)
Note that the p-values in summary.gam() also give evidence for this selection:
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x0) 1.7655119 9 5.264 0.0397 *
s(x1) 1.9271039 9 65.356 <2e-16 ***
s(x2) 6.1351372 9 156.204 <2e-16 ***
s(x3) 0.0002618 9 0.000 0.4088
s(x4) 0.0002766 9 0.000 1.0000
s(x5) 0.1757146 9 0.195 0.2963
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.545 Deviance explained = 51.6%
-REML = 430.78 Scale est. = 1 n = 200
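Applied to your Tweedie problem, a minimal sketch (y, x0, x1, x2 and mydata are placeholders for your own response, predictors and data; tw() lets mgcv estimate the Tweedie power parameter during fitting):
library(mgcv)
fit <- gam(y ~ s(x0, bs = 'ts') + s(x1, bs = 'ts') + s(x2, bs = 'ts'),
           family = tw(), method = "REML", data = mydata)
summary(fit)   ## terms whose edf is shrunk to ~0 can be treated as dropped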
