Prediction in a linear mixed model in R

Consider the sleepstudy data in the lme4 package, shown below. It contains 18 subjects with repeated measurements of the response Reaction taken on different days.
library("lme4")
head(sleepstudy)
Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
The following code fits a linear mixed model with a random intercept.
fit1 = lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
We can obtain the subject-specific random intercepts using "ranef(fit1)". Also, one can use "predict(fit1)" to give predictions of the response at all the time points in the original data.
However, I would like to predict the response (Reaction) in R for the 18 subjects at Day 12 and Day 14, which are days not in the original data.
That is, I should end up with a dataset that looks like this.
Days Subject Predicted_Response
12 308
12 309
...
12 371
12 372
14 308
14 309
...
14 371
14 372

We can accomplish this with the "newdata" argument of the predict method:
library("lme4")
fit1 = lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
newdata <- expand.grid(
Days = c(12, 14),
Subject = unique(sleepstudy$Subject)
)
newdata$Predicted_Response <- predict(fit1, newdata = newdata)
Days Subject Predicted_Response
1 12 308 417.7962
2 14 308 438.7308
3 12 309 299.1630
4 14 309 320.0976
5 12 310 313.9040
6 14 310 334.8385
7 12 330 381.4190
8 14 330 402.3536
9 12 331 387.2287
10 14 331 408.1633
11 12 332 385.2338
12 14 332 406.1683
... etc ...
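Two predict.merMod arguments may also be useful here (a hedged aside, beyond the original answer): re.form = NA gives population-level predictions that ignore the random intercepts, and allow.new.levels = TRUE lets you predict for Subject levels that were not in the fitting data (those fall back to the population-level prediction).
# population-level predictions (random effects set to zero)
newdata$pop_pred <- predict(fit1, newdata = newdata, re.form = NA)
# predictions that tolerate Subject levels absent from sleepstudy
pred_new <- predict(fit1, newdata = newdata, allow.new.levels = TRUE)
# reorder to match the layout in the question (all Day 12 rows first)
newdata <- newdata[order(newdata$Days, newdata$Subject), ]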


Creating and plotting confidence intervals

I have fitted a Gaussian GLM to my data, and I now wish to create 95% CIs and plot them with my data. I'm having a couple of issues when plotting: I can't get the intervals to capture my data; they just seem to plot the same line as the model without enclosing the data points. I'm also unsure that I've created my CIs for the mean the correct way. I entered my data and code below if anyone knows how to fix this.
data used
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
The outcome I currently get shows no data points; the model line is correct here, but the CI isn't, and since I have no data points I think the CIs are made incorrectly somewhere.
Edit: response to the OP providing the full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a Poisson GLM is just not a good model. To say it differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one, actually). The code below uses the segmented package to fit a segmented Poisson GLM with 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred :=
  predict(
    segmented(glm(cases ~ date, .SD, family = poisson), seg.Z = ~date, npsi = 1),
    type = 'response', se.fit = TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the count of breakpoints, but not where they are - the algorithm figures that out for you. So here, we see a regime prior to 3Q87 which is well modeled by a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se') := get.pred(x, .SD)][, npsi := x] }))
ggplot(gg.dt, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))+
facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the Poisson GLM, the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
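As an informal sketch (same aids data as above; note that BIC on a segmented fit may not fully count the breakpoint parameters in the degrees of freedom), you can also compare candidate breakpoint counts directly:
fits <- lapply(1:4, \(n) segmented(glm(cases ~ date, aids, family = poisson), seg.Z = ~date, npsi = n))
sapply(fits, BIC) # smaller is better, penalizing the extra parameters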
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()
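One more note on the code in the question itself: with family = poisson(link = 'log') the inverse link is exp(), so the inverse-logit transform exp(u)/(1+exp(u)) is the wrong back-transformation; also, the second plot() call replaces the first one, which is why the data points disappear. A minimal corrected sketch:
my_preds <- predict(model3, se.fit = TRUE, type = "link")
plot(aids$date, aids$cases, xlab = 'Date', ylab = 'Cases', pch = 20)
lines(aids$date, exp(my_preds$fit)) # fitted mean
lines(aids$date, exp(my_preds$fit + 1.96*my_preds$se.fit), lwd = 2, lty = 2)
lines(aids$date, exp(my_preds$fit - 1.96*my_preds$se.fit), lwd = 2, lty = 2)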

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences numbers over 10^5, hence the need for computational efficiency.
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <-
do.call(cbind, lapply(1:p, function(i)
m1[seq(i, nrow(m1), by = p), ]))
I want to cluster output (which is now a set of 65-dimensional vectors) in an efficient way.
I was originally using mini-batch k-means, but I noticed the results were very inconsistent when I repeated it. I need a consistent clustering approach.
I was also concerned about the curse of dimensionality, considering that at 65 dimensions Euclidean distance may not work well.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there are no noise points or outliers.
In addition to this, feature selection will not work, as every property of every amino acid is relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here - at least the implementation is quite fast, so you will know soon enough whether it is helpful or not.
Using the data from the OP, along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters in advance:
fit <- trainSOM(x.data=output , dimension = c(5, 5), nb.save = 10, maxit = 2000,
scaling="none", radius.type = "gaussian")
nb.save stores intermediate steps so you can explore how the training developed during the iterations:
plot(fit, what ="energy")
It seems like more iterations are in order; see the retraining sketch below.
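For instance, a retraining sketch with a larger maxit (same data and settings as above):
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10,
maxit = 10000, scaling = "none", radius.type = "gaussian")
plot(fit, what = "energy") # check that the energy curve has levelled off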
Check the frequency of clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
Predict clusters based on new data:
predict(fit, output[1:20,])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
Check which variables were important for the clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
Find the optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in the SOMbrero vignette.
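On the consistency concern in the question: SOM training is stochastic (random initialization and presentation order), so runs differ unless you fix the RNG seed first. A minimal sketch:
set.seed(42) # any fixed seed makes the stochastic training reproducible
fit <- trainSOM(x.data = output, dimension = c(5, 5), maxit = 2000,
scaling = "none", radius.type = "gaussian")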

A concise way to extract some elements of a "survfit" object into a data frame

I load a data set from the survival library, and generate a survfit object:
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
This object is a list:
> str(fit)
List of 13
$ n : int 228
$ time : int [1:186] 5 11 12 13 15 26 30 31 53 54 ...
$ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
$ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
...
Now I specify some members (all same length) that I want to turn into a data frame:
members <- c("time", "n.risk", "n.event")
I'm looking for a concise way to make a data frame with the three list members as columns, with the columns named time, n.risk, n.event (not fit$time, fit$n.risk, fit$n.event)
Thus the resulting data frame should look like this:
time n.risk n.event
[1,] 5 228 1
[2,] 11 227 3
[3,] 12 224 1
...
This is OK (unclass() is needed because survfit has its own "[" method, which selects curves rather than list elements):
data.frame(unclass(fit)[members])
Another (more canonical) way is
with(fit, data.frame(time, n.risk, n.event))
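A generic variant of the same idea (a sketch assuming all the requested components have equal length): extract them by name with sapply, which keeps the names as column names:
df <- as.data.frame(sapply(members, function(m) fit[[m]]))
head(df)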
The broom package contains functions to tidy up the results of regression models and present them in an object of class data.frame. For those unfamiliar with the tidy philosophy, please see Tidy Data [1].
library(broom)
#create tidy dataframe and subset by the columns saved in members
df <- tidy(fit)[,members]
head(df)
# time n.risk n.event
#1 5 228 1
#2 11 227 3
#3 12 224 1
#4 13 223 2
#5 15 221 1
#6 26 220 1
[1] Wickham, Hadley. "Tidy Data." Journal of Statistical Software 59.10 (2014): 1-23.
Use cbind to bind the data frames, then names() to rename the columns:
time=as.data.frame(fit$time)
n.risk=as.data.frame(fit$n.risk)
n.event=as.data.frame(fit$n.event)
members2=cbind(time,n.risk,n.event)
names(members2)=c("time","n.risk","n.event")
head(members2)
time n.risk n.event
1 5 228 1
2 11 227 3
3 12 224 1
4 13 223 2
5 15 221 1
6 26 220 1
Or build the data frame directly from the components:
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
str(fit)
members<-data.frame(time=fit$time,n.risk=fit$n.risk,n.event=fit$n.event)
members

MICE package in R: passive imputation

I aim to handle missing values with multiple imputation and then analyse with a linear mixed model.
I am stuck on passive imputation for "BMI" (body mass index) and "BMI category". "BMI" was calculated from height and weight and then categorized into "BMI category".
How to impute 'BMI category'?
The database looks like below:
sub_eu_surf[1:5, 3:12]
age gender smoking exercise education sbp dbp height weight bmi
1 41 1 1 2 18 120 80 185 107 31.26370
2 46 1 3 2 18 130 70 182 102 30.79338
3 46 1 3 2 18 130 70 182 102 30.79338
4 47 1 1 2 14 130 80 178 78 24.61810
5 47 1 1 1 14 150 80 175 85 27.75510
Since 'bmi category' is not a predictor in my imputation, I decided to create it after imputation. Details are below.
1. Define the method and predictor matrix
ini<-mice(sub_eu_surf, maxit=0)
meth<-ini$meth
meth["bmi"]<-"~I(weight/(height/100)^2)"
pred <- ini$predictorMatrix
pred[c("pm25_global", "pm25_eu", "pm10_eu", "no2_eu"), ]<-0
pred[,c("bmi", "hba1c", "pm25_eu", "pm10_eu")]<-0
pred[,"tc"]<-0
pred[c("smoking", "exercise", "hdl", "glucose"), "tc"]<-1
pred[c("smoking", "exercise", "hdl", "glucose"), "ldl"]<-0
vis <- ini$vis
imp_eu<-mice(sub_eu_surf, meth=meth, pred=pred, vis=vis, seed=200, print=F, m=5, maxit=5)
long_eu<- complete(imp_eu, "long", include=TRUE)
long_eu$bmi_category<-cut(as.numeric(long_eu$bmi), breaks=c(0, 18.5, 25, 30, 72))
complete_eu<-as.mids(long_eu)
But I received an error when analyzing my data:
test1<-with(imp_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
Error in eval(expr, envir, enclos) : object 'bmi_category' not found
Why does this happen?
You are running your analyses on the original mids object imp_eu, not on the modified complete_eu. Try:
test1<-with(complete_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
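An untested sketch of an alternative (the breaks are the ones from the question): make bmi_category part of the imputation itself via passive imputation, the same way bmi is derived. The column must already exist in the data before calling mice, and it should be excluded as a predictor to avoid circularity:
sub_eu_surf$bmi_category <- cut(sub_eu_surf$bmi, breaks = c(0, 18.5, 25, 30, 72))
meth["bmi_category"] <- "~cut(bmi, breaks = c(0, 18.5, 25, 30, 72))"
pred[, "bmi_category"] <- 0 # don't use the derived category as a predictor
imp_eu <- mice(sub_eu_surf, meth = meth, pred = pred, seed = 200, print = F, m = 5, maxit = 5)
Then with(imp_eu, ...) should find bmi_category directly.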

Using glmer.nb(), the error message "(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate" is returned

When using glmer.nb, we just get the error message:
> glm1 <- glmer.nb(Jul ~ scale(A7) + Maylg + (1|Year), data=bph.df)
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning message:
In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
Who can help me? Thanks very much!
My data are listed below.
Year Jul A7 Maylg L7b
331 1978 1948 6 1.322219 4
343 1979 8140 32 2.678518 2
355 1980 106896 26 2.267172 2
367 1981 36227 25 4.028205 2
379 1982 19085 18 2.752816 2
391 1983 26010 32 2.086360 3
403 1984 1959 1 2.506505 4
415 1985 8025 18 2.656098 0
427 1986 9780 20 1.939519 0
439 1987 48235 29 4.093912 1
451 1988 7473 30 2.974972 2
463 1989 2850 25 2.107210 2
475 1990 10555 18 2.557507 3
487 1991 70217 30 4.843563 0
499 1992 2350 31 1.886491 2
511 1993 3363 32 2.956649 4
523 1994 5140 37 1.934498 4
535 1995 14210 36 2.492760 1
547 1996 3644 27 1.886491 1
559 1997 9828 29 1.653213 1
571 1998 3119 41 2.535294 4
583 1999 5382 10 2.472756 3
595 2000 690 5 1.886491 2
607 2001 871 13 NA 2
619 2002 12394 27 0.845098 5
631 2003 4473 36 1.342423 2
You're going to have a lot of problems with this data set, among other things, because you have an observation-level random effect (you only have one data point per Year) and are trying to fit a negative binomial model. That essentially means you're trying to fit the overdispersion in two different ways at the same time.
If you fit the Poisson model, you can see that the results are strongly underdispersed (for a Poisson model, the residual deviance should be approximately equal to the residual degrees of freedom).
library("lme4")
glm0 <- glmer(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
family="poisson")
print(glm0)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
Data: bph.df
AIC BIC logLik deviance df.resid
526.4904 531.3659 -259.2452 518.4904 21
Random effects:
Groups Name Std.Dev.
Year (Intercept) 0.9555
Number of obs: 25, groups: Year, 25
Fixed Effects:
(Intercept) scale(A7) Maylg
7.3471 0.3363 0.6732
deviance(glm0)/df.residual(glm0)
## [1] 0.0003479596
Or alternatively:
library("aods3")
gof(glm0)
## D = 0.0073, df = 21, P(>D) = 1
## X2 = 0.0073, df = 21, P(>X2) = 1
glmmADMB does manage to fit it, but I don't know how far I would trust the results (the dispersion parameter is very large, indicating that the model has basically converged to a Poisson distribution anyway).
bph.df <- na.omit(transform(bph.df,Year=factor(Year)))
glmmadmb(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
family="nbinom")
GLMM's in R powered by AD Model Builder:
Family: nbinom
alpha = 403.43
link = log
Fixed effects:
Log-likelihood: -259.25
AIC: 528.5
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
(Intercept) scale(A7) Maylg
7.3628472 0.3348105 0.6731953
Random effects:
Structure: Diagonal matrix
Group=Year
Variance StdDev
(Intercept) 0.9105 0.9542
Number of observations: total=25, Year=25
The results are essentially identical to the Poisson model from lme4 above.
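Given that diagnosis (one observation per Year, so the random intercept is an observation-level effect competing with the negative binomial's overdispersion parameter), a hedged alternative is to drop the random effect and fit a plain negative binomial GLM with MASS::glm.nb:
library(MASS)
# with one observation per Year, (1|Year) duplicates the NB overdispersion,
# so model the overdispersion once, via the NB theta parameter
glm2 <- glm.nb(Jul ~ scale(A7) + Maylg, data = na.omit(bph.df))
summary(glm2)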
