For my graduate research I'm using the cplm package (specifically the cpglmm function) to account for zero-inflated data (Tweedie compound Poisson distribution) in a data set looking at the effects of logging on breeding bird densities. cplm isn't as widely used as lme4, nlme, etc., so the model validation tools that work with those packages generally do not support cpglmm objects.
I'm currently at the stage of describing the fit of my models and am trying to calculate R-squared values, both marginal and conditional. Unfortunately I cannot use the r2glmm package or MuMIn to calculate R-squared values because they do not support cpglmm. Therefore, I've had to calculate those values manually, following an example found here (Appendix 6, under cpglmm parasite models, pg. 33). Here's the script from that example:
# Fit null model without fixed effects (but including all random effects)
parmodCPr <- cpglmm(Parasite ~ 1 + (1 | Population) + (1 | Container), data = DataAll)
# Fit alternative model including fixed and all random effects
parmodCPf <- cpglmm(Parasite ~ Sex + Treatment + Habitat + (1 | Population) +
(1 | Container), data = DataAll)
# Calculation of the variance in fitted values
VarF <- var(as.vector(model.matrix(parmodCPf) %*% fixef(parmodCPf)))
# getting the observation-level variance Null model
phiN <- parmodCPr@phi # the dispersion parameter
pN <- parmodCPr@p # the index parameter
mu <- exp(fixef(parmodCPr) + 0.5 * (VarCorr(parmodCPr)$Population[1] + VarCorr(parmodCPr)$Container[1]))
VarOdN <- phiN * mu^(pN - 2) # the delta method
# Full model
phiF <- parmodCPf@phi # the dispersion parameter
pF <- parmodCPf@p # the index parameter
VarOdF <- phiF * mu^(pF - 2) # the delta method
# R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
R2glmmM <- VarF/(VarF + sum(as.numeric(VarCorr(parmodCPf))) + VarOdF)
# R2[GLMM(c)] - conditional R2[GLMM] for full model
R2glmmC <- (VarF + sum(as.numeric(VarCorr(parmodCPf))))/(VarF + sum(as.numeric(VarCorr(parmodCPf))) +
VarOdF)
What I would like to be able to do is write a function in R using this code that outputs both the marginal and conditional R-squared values (R2glmmM and R2glmmC) with my models as the input. I'd greatly appreciate any help with this problem. Hopefully I have supplied enough information.

Believe I figured it out. Here's an example I wrote up:
R2glmm <- function(model){
# Calculation of the variance in fitted values
VarALT <- var(as.vector(model.matrix(model) %*% fixef(model)))
# getting the observation-level variance Null model
phiNULL <- NULLmodel$phi # the dispersion parameter
pNULL <- NULLmodel$p # the index parameter
mu <- exp(fixef(NULLmodel) + 0.5 * (VarCorr(NULLmodel)$YEAR[1]))
VarOdNULL <- phiNULL * mu^(pNULL - 2) # the delta method
# Alternate model
phiALT <- model$phi # the dispersion parameter
pALT <- model$p # the index parameter
VarOdALT <- phiALT * mu^(pALT - 2) # the delta method
# R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
R2glmmM <- VarALT/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
# R2[GLMM(c)] - conditional R2[GLMM] for full model
R2glmmC <- (VarALT + sum(as.numeric(VarCorr(model))))/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
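return(c(R2glmmM, R2glmmC))
}
Variables containing ALT refer to the alternate model. "model" represents any cpglmm model you need to run through the function. Note that the function also expects the null model (random effects only) to already exist in the workspace as NULLmodel, and that YEAR is the random-effect grouping factor in my data.
Hope this helps someone out. I've been working on this problem and other related ones for ages now.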
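The arithmetic inside the function above can be separated from the model objects. The sketch below is not from the appendix; it assumes the three variance components have already been extracted (the helper name and the argument names var_fixed, var_random and var_obs are hypothetical) and only shows how the marginal and conditional values are assembled from them.

```r
# Hypothetical helper: combine variance components into marginal and
# conditional R2, using the same formulas as R2glmmM and R2glmmC above.
r2_from_components <- function(var_fixed, var_random, var_obs) {
  # var_fixed : variance of the fixed-effect linear predictor
  # var_random: vector of random-effect variances (one per grouping factor)
  # var_obs   : observation-level variance, e.g. phi * mu^(p - 2)
  total <- var_fixed + sum(var_random) + var_obs
  c(marginal    = var_fixed / total,
    conditional = (var_fixed + sum(var_random)) / total)
}

# Made-up numbers, purely for illustration
r2 <- r2_from_components(var_fixed = 0.5, var_random = c(0.2, 0.3), var_obs = 1)
print(r2)
```

Because the random-effect variances only appear in the numerator of the conditional value, the conditional R2 can never be smaller than the marginal one.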


Including an offset when using cph {rms} for validation of a Cox model

I am externally validating and updating a Cox model in R. The model predicts 5 year risk. I don't have access to the original data, just the equation for the linear predictor and the value of the baseline survival probability at 5 years.
I have assessed calibration and discrimination of the model in my dataset and found that the model needs to be updated.
I want to update the model by adjusting baseline risk only, so I have been using a Cox model with the linear predictor ("beta.sum") included as an offset term, to restrict its coefficient to be 1.
I want to be able to use cph instead of coxph as it makes internal validation by bootstrapping much easier. However, when including the linear predictor as an offset I get the error:
"Error in exp(object$linear.predictors) :
non-numeric argument to mathematical function"
Is there something I am doing incorrectly, or does the cph function not allow an offset within the formula? If so, is there another way to restrict the coefficient to 1?
My code is below:
### Predicted risk ###
# linear predictor (LP)
k$beta.sum <- -0.2201 * ((k$age/10)-7.036) + 0.2467 * (k$male - 0.5642) - 0.5567 * ((k$epi/5)-7.222) +
0.4510 * (log(k$acr_mgmmol/0.113)-5.137)
k$pred <- 1 - 0.9365^exp(k$beta.sum)
# Recalibrated model
# Using coxph (the object names fit_coxph and fit_cph are assumed)
fit_coxph <- coxph(Surv(time, rrt) ~ offset(beta.sum), data = k, x=TRUE, y=TRUE)
# new baseline survival at 5 years
predictSurvProb(fit_coxph, newdata=data.frame(beta.sum=0), times = 5) #baseline = 0.9570
# Using cph
fit_cph <- cph(Surv(time, rrt) ~ offset(beta.sum), data=k, x=TRUE, y=TRUE, surv=TRUE)
The model will run without surv=TRUE included, but this means a lot of the commands I want to use cannot work, such as calibrate, validate and predictSurvProb.
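Once a new baseline survival has been estimated with the linear predictor held fixed as an offset, the updated risks follow from the same formula used for k$pred. A minimal sketch of that step: pred_risk and the example linear predictor values are hypothetical, while 0.9365 and 0.9570 are the baseline values quoted above.

```r
# Predicted 5-year risk from a Cox linear predictor and a baseline
# survival probability: risk = 1 - S0^exp(lp)
pred_risk <- function(lp, s0) {
  1 - s0^exp(lp)
}

lp_example <- c(-1, 0, 1)                              # made-up linear predictor values
risk_original <- pred_risk(lp_example, s0 = 0.9365)   # original baseline survival
risk_updated  <- pred_risk(lp_example, s0 = 0.9570)   # re-estimated baseline survival
print(cbind(lp_example, risk_original, risk_updated))
```

A higher baseline survival lowers every predicted risk while leaving the ranking of individuals unchanged, which is why this kind of update affects calibration but not discrimination.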
Here is a way to reproduce the error:
library(rms)       # cph
library(survival)  # Surv
library(purrr)     # rbernoulli
n <- 1000
status <- as.numeric(rbernoulli(n, p=0.1))
time <- -5* log(runif(n))
lp <- rnorm(n, mean=-2.7, sd=1)
mydata <- data.frame(status, time, lp)
test <- cph(Surv(time, status) ~ offset(lp), data=mydata, surv=TRUE)

Interpretation of output and prediction from cpglmm

I am helping a colleague fit a compound Poisson generalized linear mixed model in R, using the cpglmm function from the cplm package. The model involves a three-way interaction and I would like to compute some interpretable quantities. So far, I have tried to calculate some odds ratios, but I am not sure this is the right way to do it.
# Fit model with three-way interaction in fixed effects #
m <- cpglmm(ncs ~ diversity_index*diversity_speciality*n_authors + selfcit +
n_refs + (1|region), data = diversity)
# Calculate Odds-ratio #
se <- sqrt(diag(vcov(m)))
tab <- cbind(Est = m$fixef,
S.E. = se,
LL = m$fixef - 1.96 * se,
UL = m$fixef + 1.96 * se)
print(exp(tab), digits=3)
I also want to compute some predicted values, e.g. predicted probabilities or the like, but I can't get predict() to work for the cpglmm model. Are there any functions I could use?
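One point that may help with interpretation: cpglmm uses a log link by default, so exponentiated fixed effects are ratios of expected values (multiplicative effects on the mean response) rather than odds ratios, and a population-level prediction is exp() of the linear predictor. The sketch below illustrates this in base R with made-up coefficients; beta and X are hypothetical and are not taken from the model above.

```r
# Made-up fixed effects on the log scale (intercept and one covariate)
beta <- c("(Intercept)" = 0.2, x = 0.5)

# Design matrix for two hypothetical observations that differ by one unit in x
X <- cbind("(Intercept)" = 1, x = c(1, 2))

# Population-level predicted means: inverse link applied to X %*% beta
mu_hat <- exp(drop(X %*% beta))

# The ratio of the two predicted means equals exp(beta["x"])
ratio <- mu_hat[2] / mu_hat[1]
print(c(ratio = ratio, exp_beta = unname(exp(beta["x"]))))
```

With a three-way interaction in the model, a single exponentiated coefficient is a ratio only at the reference values of the other interacting variables, so comparing predicted means at chosen covariate values is usually easier to read.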

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data)
but with a set of given parameters:
a <- rnorm() is measured at the subject level (e.g. nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm() is measured at the observation level and correlated at a given r with a
the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
the effect size of the fixed effects can be set at a predefined level (e.g. Cohen's d = 0.5)
I played with various packages like powerlmm, simstudy or simr, but still failed to find a working solution that accommodates the number of parameters I'd like to define beforehand.
Also, for my learning purposes, I'd prefer a base R method to a package solution.
The closest example I found is a blog post by Ben Ogorek, "Hierarchical linear models and lmer", which looks great, but I can't figure out how to control the parameters listed above.
Any help would be appreciated.
Also, if there is a package that I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubject) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma), e.g. if sigma=2, theta=2, then the residual std dev is 2 and the among-subject std dev is 4
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- betavec[-1]*sdvec ## scale parameter values by sd
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
## generate a, b variables
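## arguments after the sample size are reconstructed (assumed) from the definitions above
mm <- MASS::mvrnorm(nSubjects*nObs,
mu=c(a=0,b=0),
Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))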
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
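## requires library(lme4); arguments after the formula are reconstructed (assumed)
## from the parameters defined above
dd$y <- simulate(~a+b+(1|Subject),
newdata=dd,
newparams=list(beta=c(betavec["intercept"],beta_sc),theta=theta,sigma=sigma),
family=gaussian)[[1]]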
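Since the question asks for a base R route, here is a minimal sketch of the same design without MASS or lme4. It swaps in a different technique from the one above: b is built directly from a so that their correlation is rho in expectation (not exactly, as an empirical draw would give), and the random intercept and residual noise are added by hand. All object names are hypothetical and the parameter values simply mirror those defined above.

```r
set.seed(101)
nSubjects <- 50
nObs      <- 7
rho       <- 0.5      # target correlation between a and b
beta      <- c(intercept = 0, a = 1, b = 2)
sd_subj   <- 0.4      # among-subject SD (theta * sigma)
sigma     <- 1        # residual SD

subj <- rep(seq_len(nSubjects), each = nObs)

# a: one standard-normal value per subject, repeated for each observation
a_subj <- rnorm(nSubjects)
a <- a_subj[subj]

# b: observation-level, constructed so that cor(a, b) is rho in expectation
b <- rho * a + sqrt(1 - rho^2) * rnorm(nSubjects * nObs)

# random intercept per subject plus residual noise
u <- rnorm(nSubjects, sd = sd_subj)
y <- beta["intercept"] + beta["a"] * a + beta["b"] * b +
  u[subj] + rnorm(nSubjects * nObs, sd = sigma)

dd_base <- data.frame(Subject = factor(subj), a = a, b = b, y = unname(y))
str(dd_base)
```

The among-subject share of the random variation is sd_subj^2 / (sd_subj^2 + sigma^2), so changing sd_subj relative to sigma is how a 50/50 or 10/90 split would be set in this sketch.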

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that the r squared is not defined for non-linear models, or at least very tricky, quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
It sounds like f are your predicted values. So take the summed squared distance from them to the actual values, divided by n * variance of y, and subtract that from 1,
so something like
1 - sum((y - f)^2) / (length(y) * var(y))
should give you a quasi R-squared value, so long as your model is reasonably close to a linear model and n is pretty big.
As a direct answer to the question asked (rather than argue that R2/pseudo R2 aren't useful) the nagelkerke function in the rcompanion package will report various pseudo R2 values for nonlinear least square (nls) models as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion)  # provides nagelkerke() and the BrendonSmall data set
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
data = BrendonSmall,
start = list(a = 519,
b = 0.359,
clx = 2304))
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
data = BrendonSmall,
start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
adj.R.squared = adj.r.sq)
which is also the pseudo R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit is than a flat horizontal line drawn through the data. This can make sense for nls models if your null model is an intercept-only model. It can also make sense for particular other nonlinear models. E.g. for a scam model that uses strictly increasing splines (bs="mpi" in the spline term), the fitted model in the worst possible scenario (e.g. where your data were strictly decreasing) would be a flat line, and hence would result in an R2 of zero. The adjusted R2 additionally penalizes models with larger numbers of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above (besides, if one swears by using information criteria to do model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
I think would also make sense if you have normal (Gaussian) errors, i.e. the correlation between the observed and fitted y (clipped at zero, so that a negative relationship would imply zero predictive power) squared, and then adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction, this would be the R2 and adjusted R2 value as reported for a regular linear model. To me this would make perfect sense at least, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1-residual deviance/null deviance
See here and here for some useful discussion.
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
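The two pseudo R2 definitions discussed in these answers (1 - RSS/TSS and the squared correlation between observed and fitted values) can be computed side by side in base R. A minimal sketch, using the mtcars nls fit that also appears elsewhere in this thread; the object names r2_rss and r2_cor are my own.

```r
# nls fit on mtcars (ships with base R)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))

y       <- mtcars$mpg
yfitted <- predict(nls_model)

# Pseudo R2 as 1 - RSS/TSS
rss <- sum((y - yfitted)^2)
tss <- sum((y - mean(y))^2)
r2_rss <- 1 - rss / tss

# Pseudo R2 as the squared correlation between observed and fitted values
r2_cor <- cor(y, yfitted)^2

print(c(r2_rss = r2_rss, r2_cor = r2_cor))
```

For this particular model the two values agree closely because a / wt + b is linear in a and b and includes an intercept; for a genuinely nonlinear fit they can differ.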
As an alternative, I have several times used the following procedure:
compute a fit on the data with the nls function
use the resulting model to make predictions
plot the data against the values predicted by the model (if the model is good, the points should lie near the bisector, i.e. the line y = x)
compute the R2 of the linear regression of the observed values on the predicted values.
Best wishes to all. Patrick.
With the modelr package
modelr::rsquare(nls_model, data)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
library(magrittr)  # for %>%
lm(mpg ~ predict(nls_model), data = mtcars) %>% broom::glance()
# 0.795
Like they say, it's only an approximation.

R: Predict (0,1) in logistic regression in glm()

I am trying to model a "what if" situation in a binary logit model. I am estimating the probability of passing a test, given the level of difficulty of the test (1=easiest, 5=toughest), with gender as control. (The data is here). Students are administered a test which is generally tough ("HIGH" in the data). From this we can estimate the impact of test-difficulty on the likelihood of passing:
model = glm(PASS ~ as.factor(SEX) + as.factor(HIGH), family=binomial(link="logit"), data=df)
We can also get the predicted probabilities of passing with:
predict.high = predict(model, type="response")
The question is, what if the "LOW" test were given instead? To get the new probabilities, we can do:
newdata = rename.vars(subset(df, select=c(-HIGH)), 'LOW','HIGH')
predict.low = predict(model, newdata=newdata, type="response")
But how do I know how many additional students would have passed in this case? Is there an obvious switch in glm() I am not seeing?
I have not yet tried to dig out my code for prediction that I wrote based on Gelman and Hill (2006) who, I seem to recall used simulation. I still intend to do that. One aspect of your question that seemed unique in my limited experience was that I was accustomed to predicting for a single observation (in this case a single student taking a single test). You, however, seem to want to predict a difference between two sets of predictions. In other words, you want to predict how many more students will pass if given a set of 5 easy exams rather than a set of 5 hard exams.
I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.
I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.
Covariance among predicted observations might have to be assumed to be 0 to implement that approach.
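One way to turn two sets of predicted probabilities into a count, in line with the averaging idea above: the expected number of passes in a group is the sum of the individual predicted probabilities, so the expected number of additional passes is the difference between the two sums. A minimal sketch on simulated data follows; dat, sex, hard and all coefficient values are made up, and it gives a point estimate only, with no confidence interval.

```r
set.seed(42)
n <- 1000
dat <- data.frame(
  sex  = rbinom(n, 1, 0.5),   # hypothetical gender indicator
  hard = rbinom(n, 1, 0.7)    # 1 = took the hard test, 0 = took the easy test
)
# Simulated outcomes: the hard test lowers the log-odds of passing by 1
dat$pass <- rbinom(n, 1, plogis(0.5 - 0.2 * dat$sex - 1 * dat$hard))

fit <- glm(pass ~ sex + hard, family = binomial, data = dat)

# Predicted probabilities under the tests actually taken
p_observed <- predict(fit, type = "response")

# "What if" data: everyone takes the easy test instead
dat_easy <- transform(dat, hard = 0)
p_easy <- predict(fit, newdata = dat_easy, type = "response")

# Expected numbers of passes are sums of probabilities
expected_extra_passes <- sum(p_easy) - sum(p_observed)
print(expected_extra_passes)
```

For a logistic regression with an intercept, sum(p_observed) reproduces the observed number of passes, so the difference can be read as additional passes relative to what actually happened.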
If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.
I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.
Below is code that generates fake data and analyzes that data using a frequentist and a Bayesian approach. I have not yet added the code to implement the above ideas for prediction. I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below, you can change the number of students, n, to any positive multiple of 6.
# Bayesian_logistic_regression_June2012.r
# June 24, 2012
# create fake data for n students and three tests
n <- 1200
# create factors for n/6 students in each of 6 categories
gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2 <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3 <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))
# assign slopes to factors
B0 <- 0.4
Bgender <- -0.2
Btest2 <- 0.6
Btest3 <- 1.2
# estimate probability of passing test
p.pass <- ( exp(B0 + Bgender * gender +
Btest2 * test2 +
Btest3 * test3) /
(1 + exp(B0 + Bgender * gender +
Btest2 * test2 +
Btest3 * test3)))
# identify which students passed their test, 0 = fail, 1 = pass
passed <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1
# use frequentist approach in R to estimate probability
# of passing test
m.freq <- glm(passed ~ as.factor(gender) +
as.factor(test2) +
as.factor(test3) ,
family = binomial)
# predict(m.freq, type = "response")
# use OpenBUGS to analyze same data set
library(R2WinBUGS)  # provides bugs(); package assumed from the program = 'OpenBUGS' argument below
# Define model (written to the text file that bugs() reads below)
sink("c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt")
cat("
model {
# Priors
alpha ~ dnorm(0,0.01)
bgender ~ dnorm(0,0.01)
btest2 ~ dnorm(0,0.01)
btest3 ~ dnorm(0,0.01)
# Likelihood
for (i in 1:n) {
passed[i] ~ dbin(p[i], 1)
logit(p[i]) <- (alpha + bgender * gender[i] +
btest2 * test2[i] +
btest3 * test3[i])
}
# Derived parameters
p.g.t1 <- exp(alpha) / (1 + exp(alpha))
p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))
p.g.t2 <- ( exp(alpha + btest2) /
(1 + exp(alpha + btest2)))
p.b.t2 <- ( exp(alpha + bgender + btest2) /
(1 + exp(alpha + bgender + btest2)))
p.g.t3 <- ( exp(alpha + btest3) /
(1 + exp(alpha + btest3)))
p.b.t3 <- ( exp(alpha + bgender + btest3) /
(1 + exp(alpha + bgender + btest3)))
}
", fill = TRUE)
sink()
# Bundle data (the list name win.data is assumed)
win.data <- list(passed = passed,
gender = gender,
test2 = test2,
test3 = test3,
n = length(passed))
# Inits function
inits <- function(){ list(alpha = rlnorm(1),
bgender = rlnorm(1),
btest2 = rlnorm(1),
btest3 = rlnorm(1)) }
# Parameters to estimate
params <- c("alpha", "bgender", "btest2", "btest3",
"p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
"p.g.t3", "p.b.t3")
# MCMC settings
nc <- 3
ni <- 2000
nb <- 500
nt <- 2
# Start Gibbs sampling
out <- bugs(data = win.data, inits = inits, parameters.to.save = params,
model.file = "c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
program = 'OpenBUGS',
n.thin = nt, n.chains = nc,
n.burnin = nb, n.iter = ni, debug = TRUE)
print(out, dig = 5)
Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:
# specify number of girls taking each test and
# number of boys taking each test
g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)
# specify probability of individuals in each of the
# 6 groups passing their test
p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70
# identify which individuals in each group passed their test
g.t1[1:(p.g1.t1 * length(g.t1))] = 1
b.t1[1:(p.b1.t1 * length(b.t1))] = 1
g.t2[1:(p.g1.t2 * length(g.t2))] = 1
b.t2[1:(p.b1.t2 * length(b.t2))] = 1
g.t3[1:(p.g1.t3 * length(g.t3))] = 1
b.t3[1:(p.b1.t3 * length(b.t3))] = 1
# determine the weighted average probability of passing
# on test day for all individuals as a class
wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3) ) /
(length(g.t1) + length(b.t1) + length(g.t2) +
length(b.t2) + length(g.t3) + length(b.t3)))
# determine the expected number of individuals passing
# their test in the class as a whole
exp.num.pass <- wt.ave.p * (length(g.t1) + length(b.t1) +
length(g.t2) + length(b.t2) +
length(g.t3) + length(b.t3))
# determine the number of individuals passing
num.passing <- (sum(g.t1) + sum(b.t1) +
sum(g.t2) + sum(b.t2) +
sum(g.t3) + sum(b.t3) )
# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a
# given test, within rounding error
identical(round(exp.num.pass), round(num.passing))
Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.
EDIT - June 27, 2012
I have not forgotten about this. Rather, I have encountered several problems:
With logistic regression it is possible to predict: a) the probability, p, that students in a given group pass a test and b) the outcome of a given student taking a test (0 or 1). All of the 0's and 1's are then averaged. I am not sure which of these to use. The point estimate and SD of the predicted p is identical to the estimated p for known test outcomes. The point estimate of the average of the predicted 0's and 1's is a little different and the SD of the averaged 0's and 1's is much larger. I believe I want b, the average of the predicted 0's and 1's. However, I am attempting to examine various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that worked example includes a half-dozen variables including 2 interactions and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.
With lots of derived parameters the program is taking a long time to run.
Apparently OpenBUGS has been crashing frequently, I believe, even without prediction code. I am not sure whether that is because of something I am doing wrong or because of changes in the recent versions of R or changes in recent versions of R packages or maybe because I am trying to run the code with a 64-bit R or something else.
I will try to post the prediction code soon, but all of the above issues have slowed me down.
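You can easily use this approach to find a cutoff:
cutoff <- runif(length(predicted_probabilities))
Comparing each predicted probability with its own random cutoff gives a stochastic (not deterministic) decision, similar in spirit to the acceptance step in Metropolis-Hastings.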
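To make the random-cutoff idea concrete, here is a minimal sketch: each observation gets its own uniform draw and is classified as 1 when its predicted probability exceeds that draw, which is the same as a Bernoulli draw with that probability. The probabilities and the name simulated_outcome are made up for illustration.

```r
set.seed(1)
predicted_probabilities <- c(0.05, 0.50, 0.95)   # made-up values

# One uniform draw per observation serves as that observation's cutoff
cutoff <- runif(length(predicted_probabilities))

# An observation is classified as 1 when its probability exceeds its cutoff,
# which happens with probability equal to the predicted probability
simulated_outcome <- as.integer(predicted_probabilities > cutoff)
print(simulated_outcome)
```

Because the cutoffs are random, repeating the draw gives different 0/1 outcomes; averaging over many repetitions recovers the predicted probabilities.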
