MCMCglmm binomial model prior

I want to estimate a binomial model with the R package MCMCglmm. The model should include an intercept and a slope, both as fixed and random parts. How do I specify a valid prior? (Note: there is a similar question, but in a much more complicated setting.)
Assume the data have the following form:
  y           x cluster
1 0 -0.56047565       1
2 1 -0.23017749       1
3 0  1.55870831       1
4 1  0.07050839       1
5 0  0.12928774       1
6 1  1.71506499       1
In fact, the data have been generated by
set.seed(123)
nj <- 15 # number of individuals per cluster
J <- 30 # number of clusters
n <- nj * J
x <- rnorm(n)
y <- rbinom(n, 1, prob = 0.6)
cluster <- factor(rep(1:J, each = nj))
dat <- data.frame(y = y, x = x, cluster = cluster)

The information about the model in the question suggests specifying fixed = y ~ 1 + x and random = ~ us(1 + x):cluster. With us() you allow the random effects to be correlated (cf. section 3.4 and table 2 in Hadfield's 2010 jstatsoft article).
First of all, as you only have one response variable (y), the G part of the prior (cf. equation 4 and section 3.6 in Hadfield's 2010 jstatsoft article) for the random-effects variances only needs one list element, called G1. This list element is not the actual prior distribution - that is fixed by MCMCglmm to be an inverse-Wishart distribution. With G1 you specify the parameters of this inverse-Wishart distribution: the scale matrix (Ψ in Wikipedia notation, V in MCMCglmm notation) and the degrees of freedom (ν in Wikipedia notation, nu in MCMCglmm notation). As you have two random effects (the intercept and the slope), V has to be a 2 x 2 matrix. A frequent choice is the two-dimensional identity matrix diag(2). Hadfield often uses nu = 0.002 for the degrees of freedom (cf. his course notes).
You also have to specify the R part of the prior, for the residual variance. Again the distribution is an inverse-Wishart, and the user only specifies its parameters. As there is only one residual variance, V has to be a scalar (let's say V = 0.5). An optional element of R is fix. With this element you specify whether the residual variance is fixed to a certain value (then fix = TRUE or fix = 1) or not (then fix = FALSE or fix = 0). Note that you do not fix the residual variance at 0.5 by writing fix = 0.5! So when you find fix = 1 in Hadfield's course notes, read it as fix = TRUE and check to which value of V the variance was fixed.
Altogether we set up the prior as follows:
prior0 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
               R = list(V = 0.5, nu = 0.002, fix = FALSE))
With this prior we can run MCMCglmm:
library("MCMCglmm") # for MCMCglmm()
set.seed(123)
mod0 <- MCMCglmm(fixed = y ~ 1 + x,
                 random = ~ us(1 + x):cluster,
                 data = dat,
                 family = "categorical",
                 prior = prior0)
The draws from the Gibbs sampler for the fixed effects are stored in mod0$Sol, and the draws for the variance parameters in mod0$VCV.
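As a quick sketch of how to inspect the output (standard coda/MCMCglmm tooling, not part of the original answer):
summary(mod0$Sol)  # posterior summaries of intercept and slope
summary(mod0$VCV)  # posterior summaries of the (co)variance components
plot(mod0$Sol)     # trace and density plots for convergence diagnostics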
Normally a binomial model with binary data requires the residual variance to be fixed, because it is not identified; so we fix the residual variance at 0.5:
set.seed(123)
prior1 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
               R = list(V = 0.5, nu = 0.002, fix = TRUE))
mod1 <- MCMCglmm(fixed = y ~ 1 + x,
                 random = ~ us(1 + x):cluster,
                 data = dat,
                 family = "categorical",
                 prior = prior1)
The difference can be seen by comparing mod0$VCV[, 5] to mod1$VCV[, 5]. In the latter case all entries are 0.5, as specified.
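As a quick check (my own addition; the residual-variance column of VCV is named "units"):
all(mod1$VCV[, "units"] == 0.5)  # TRUE - every draw equals the fixed value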


Why are the predicted values of my GLM cyclical?

I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:
    v   n river_dist
1 102 256       1040
2   1  11        720
3  19  24        475
4  12  15        611
Which I fit to this model:
library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size = ig$n, prob = 1/(1 + exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig)
This produces a coefficient which, when back-transformed, suggests about a 0.4% decrease in the odds of igneous stone per meter from the river (exp(br) = 0.996):
exp(coef(m_r))
That's all good. But when I try to predict new values, I get this odd cycling of values:
newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)
Example of predicted values:
   river_dist          v
1    475.0000 216.855114
2    480.7071   9.285536
3    486.4141  20.187424
4    492.1212  12.571487
5    497.8283 213.762248
6    503.5354   9.150584
7    509.2424  19.888471
8    514.9495  12.381805
9    520.6566 210.476312
10   526.3636   9.007289
11   532.0707  19.571218
12   537.7778  12.180629
Why are the values cycling up and down like that, creating crazy spikes when graphed?
In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:
library(bbmle)
m_r <- mle2(v ~ dbinom(size = n, prob = 1/(1 + exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig)
At this point, as @user20650 suggests, you'll also have to specify a value (or values) for n in newdata.
This model appears to be identical to binomial regression: is there a reason not to use
glm(cbind(v, n - v) ~ river_dist, data = ig, family = binomial)
? (bbmle::mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)
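For a quick side-by-side comparison (my own addition, not part of the original answer):
g_r <- glm(cbind(v, n - v) ~ river_dist, data = ig, family = binomial)
cbind(mle2 = coef(m_r), glm = coef(g_r))  # should agree once mle2 has converged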
Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling river_dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.
m_r <- mle2(v ~ dbinom(size = n,
                       prob = 1/(1 + exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig,
            method = "Nelder-Mead")
pframe <- data.frame(river_dist = seq(500, 1000, length = 51), n = 1)
pframe$prop <- predict(m_r, newdata = pframe, type = "response")
CIs <- lapply(seq(nrow(ig)),
              function(i) prop.test(ig[i, "v"], ig[i, "n"])$conf.int)
ig2 <- data.frame(ig, setNames(as.data.frame(do.call(rbind, CIs)),
                               c("lwr", "upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2, aes(river_dist, v/n)) +
  geom_point(aes(size = n)) +
  geom_linerange(aes(ymin = lwr, ymax = upr)) +
  geom_smooth(method = "glm",
              method.args = list(family = binomial),
              aes(weight = n)) +
  geom_line(data = pframe, aes(y = prop), colour = "red")
Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).

Approach for comparing linear, non-linear and different parameterization non-linear models

I am looking for an approach to compare linear models, non-linear models, and non-linear models with different parameterizations. To set this up:
#Packages
library(nls2)
library(minpack.lm)
# Data set - Diameter as a function of Feature and Age
Feature<-sort(rep(c("A","B"),22))
Age<-c(60,72,88,96,27,
36,48,60,72,88,96,27,36,48,60,72,
88,96,27,36,48,60,27,27,36,48,60,
72,88,96,27,36,48,60,72,88,96,27,
36,48,60,72,88,96)
Diameter<-c(13.9,16.2,
19.1,19.3,4.7,6.7,9.6,11.2,13.1,15.3,
15.4,5.4,7,9.9,11.7,13.4,16.1,16.2,
5.9,8.3,12.3,14.5,2.3,5.2,6.2,8.6,9.3,
11.3,15.1,15.5,5,7,7.9,8.4,10.5,14,14,
4.1,4.9,6,6.7,7.7,8,8.2)
d <- data.frame(Feature, Age, Diameter)
str(d)
I will create three different models: two non-linear models with specific parameterizations and one linear model. For this example, suppose that all coefficients of each model are significant (the real results don't matter here).
# Model 1: non-linear
e1 <- Diameter ~ a1 * Age^a2
# fitted with the Levenberg-Marquardt algorithm
m1 <- nlsLM(e1, data = d,
            start = list(a1 = 0.1, a2 = 10),
            control = nls.control(maxiter = 1000))
# Model 2: linear
m2 <- lm(Diameter ~ Age, data = d)
# Model 3: another non-linear model
e2 <- Diameter ~ a1^(-Age/a2)
m3 <- nls2(e2, data = d, alg = "brute-force",
           start = data.frame(a1 = c(-1, 1), a2 = c(-1, 1)),
           control = nls.control(maxiter = 1000))
Now I would like to decide which model is "better", despite the different nature of each model. My idea is a proportional measure: for each model I compare its error sum of squares to the total sum of squares of the data set (like a pseudo-R2 approach). Comparing models 1 and 2:
## Sum-of-squares approach (like a pseudo-R2)
# Model 1
SQEm1 <- summary(m1)$sigma^2 * summary(m1)$df[2]     # error sum of squares of the model
SQTm1 <- var(d$Diameter) * (length(d$Diameter) - 1)  # total sum of squares in the data set
R1 <- 1 - SQEm1/SQTm1
R1
# Model 2
SQEm2 <- summary(m2)$sigma^2 * summary(m2)$df[2]     # error sum of squares of the model
R2 <- 1 - SQEm2/SQTm1
R2
In my humble opinion model 1 is "better" than model 2. My question is: does this approach sound correct? Is there another way to compare these model types?
Thanks in advance!
#First cross-validation approach ------------------------------------------
#Cross-validation model 1
set.seed(123) # for reproducibility
n <- nrow(d)
frac <- 0.8
ix <- sample(n, frac * n) # indexes of in sample rows
e1 <- Diameter ~ a1 * Age^a2
# fitted with the Levenberg-Marquardt algorithm
m1 <- nlsLM(e1, data = d,
            start = list(a1 = 0.1, a2 = 10),
            control = nls.control(maxiter = 1000), subset = ix)  # in-sample model
BOD.out <- d[-ix, ] # out of sample data
pred <- predict(m1, new = BOD.out)
act <- BOD.out$Diameter
RSS1 <- sum( (pred - act)^2 )
RSS1
#[1] 56435894734
#Cross-validation model 2
m2 <- lm(Diameter ~ Age, data = d, subset = ix)  # in-sample model
BOD.out2 <- d[-ix, ] # out of sample data
pred <- predict(m2, new = BOD.out2)
act <- BOD.out2$Diameter
RSS2 <- sum( (pred - act)^2 )
RSS2
#[1] 19.11031
# Sum of squares approach -----------------------------------------------
deviance(m1)
#[1] 238314429037
deviance(m2)
#[1] 257.8223
Based on gfgm's and G. Grothendieck's comments: RSS2 is much lower than RSS1, and likewise deviance(m2) is much lower than deviance(m1), so model 2 is better than model 1.
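To make this comparison less dependent on a single random split, the cross-validation can be repeated; a minimal sketch (my own addition, shown for the linear model only - the nlsLM fit can fail to converge on some subsets and would need error handling):
set.seed(123)
cv_rss <- replicate(50, {
  ix <- sample(n, frac * n)
  fit <- lm(Diameter ~ Age, data = d, subset = ix)
  sum((predict(fit, newdata = d[-ix, ]) - d$Diameter[-ix])^2)
})
mean(cv_rss)  # average out-of-sample RSS over 50 splits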

`nls` fails to estimate parameters of my model

I am trying to estimate the constants of Heaps' law. I have the following dataset, novels_collection:
  Number of novels DistinctWords WordOccurrences
1                1         13575          117795
2                1         34224          947652
3                1         40353         1146953
4                1         55392         1661664
5                1         60656         1968274
Then I built the following function:
# Function for Heaps' law
heaps <- function(K, n, B){
  K * n^B
}
heaps(2, 117795, .7)  # just to test that it works
So n = WordOccurrences, and K and B are constants that should be estimated in order to predict DistinctWords.
I tried this but it gives me an error:
fitHeaps <- nls(DistinctWords ~ heaps(K, WordOccurrences, B),
                data = novels_collection[, 2:3],
                start = list(K = .1, B = .1), trace = T)
# Error in numericDeriv(form[[3L]], names(ind), env) :
#   Missing value or an infinity produced when evaluating the model
Any idea in how could I fix this or a method to fit the function and get the values for K and B?
If you take the log of both sides of y = K * n^B, you get log(y) = log(K) + B * log(n). This is a linear relationship between log(y) and log(n), hence you can fit a linear regression model to find log(K) and B.
logy <- log(novels_collection$DistinctWords)
logn <- log(novels_collection$WordOccurrences)
fit <- lm(logy ~ logn)
para <- coef(fit)        ## log(K) and B
para[1] <- exp(para[1])  ## now K and B
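A one-line sketch (my own addition) for getting predictions back on the original scale from this log-log fit:
pred_distinct <- exp(predict(fit))  # predicted DistinctWords on the original scale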
With minpack.lm we can fit the non-linear model directly. I suspect it will be more prone to overfitting than a linear model on the log-transformed variables (as done by Zheyuan), but we could compare the residuals of the linear and non-linear models on some held-out dataset to get an empirical answer, which would be interesting to see.
library(minpack.lm)
fitHeaps <- nlsLM(DistinctWords ~ heaps(K, WordOccurrences, B),
                  data = novels_collection[, 2:3],
                  start = list(K = .01, B = .01))
coef(fitHeaps)
#         K         B
# 5.0452566 0.6472176
plot(novels_collection$WordOccurrences, novels_collection$DistinctWords, pch = 19)
lines(novels_collection$WordOccurrences,
      predict(fitHeaps, newdata = novels_collection[, 2:3]), col = 'red')
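With only five data points a held-out comparison is shaky, but as a rough in-sample sanity check (my own addition) the residual sums of squares of both fits can be compared on the original scale:
sum((novels_collection$DistinctWords - exp(fitted(fit)))^2)  # log-log lm, back-transformed
deviance(fitHeaps)  # residual sum of squares of the nlsLM fit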

Does the function y = at^b * exp(-ct) have a name? Can it be linearized? How can I estimate a, b, c?

I am trying to fit a non-linear model, but cannot find any good examples online.
Does this function have a name?
Can it be linearized?
I've attempted to estimate the parameters a, b, and c, with a random effect g (as in group), as a function of time t, below. I can fit the model using nls without a random effect, but I am having trouble getting the model to converge with one. Suggestions are welcome (preferably within R, but any suitable package will do).
## time, repeated 16 times for 4 replicates from each of 4 groups
t <- rep(1:20, 16)
## g, group
g <- rep(1:4, each = 80)
## starting to create an example dataset,
## to see if I can recover known parameters
a <- rep(c(3.5, 4, 4.1, 5), each = 80)
b <- rep(c(1.1, 1.4, 1.8, 2.5), each = 80)
c <- rep(c(0.125, 0.25), each = 160)
## error to add to above parameters
set.seed(1)
e_a <- runif(320, -0.5, 0.5)
e_b <- runif(320, -0.1, 0.1)
e_c <- runif(320, -0.02, 0.02)
## this is my function
f <- function(t, a, b, c) a * (t^b) * exp(-c * t)
## simulate y
y <- f(t = t, a + e_a, b + e_b, c + e_c)
mydata <- data.frame(t = t, y = y, g = g)
library(nlme)
## now fit the model to estimate a, b, c
fm1 <- nlme(y ~ a * (t^b) * exp(-c * t),
            data = mydata,
            fixed = a + b + c ~ 1,
            random = a + b + c ~ 1 | g,
            start = c(a = 4, b = 1, c = 0.25),
            method = "REML")
In physics (and some other areas) I've seen this or variants of it called a Hoerl curve or Hoerl function e.g. here, though it has other names. If c is negative and a and b are positive it's a scaled gamma density.
When you ask about linearizing it, you have to be careful; the equation y = a·t^b·exp(c·t) is not actually what you mean - the observations y(i) are not exactly equal to a·t(i)^b·exp(c·t(i)) (otherwise almost any 3 observations would give you the exact parameter values).
So the noise has to enter your model for y somehow. Is it additive? multiplicative, or something else? (Also important, but for other reasons: does its size change in some way as t changes, or not? Are the noise terms for different observations independent?)
If your actual model is y(i) = a·t(i)^b·exp(c·t(i)) + ε(i), that's not linearizable.
If your actual model is y(i) = a·t(i)^b·exp(c·t(i))·ε(i), with ε(i) = exp(η(i)) for some (hopefully zero-mean) η(i), that is linearizable.
Taking the second form,
log(y(i)) = log(a) + b·log(t(i)) + c·t(i) + η(i)
or
y*(i) = a* + b·log(t(i)) + c·t(i) + η(i), where y*(i) = log(y(i))
which is linear in the parameters a* = log(a), b and c, with error term η(i); so if you're prepared to make that sort of assumption about the error, you should be able to fit it with methods suitable for such linear models. In that case you may wish to ponder the parenthetical questions about the error term above, since they affect how you model it.
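A minimal sketch of that linearized fit (my own addition, using mydata from the question and assuming multiplicative error):
fit_lin <- lm(log(y) ~ log(t) + t, data = mydata)
a_hat <- exp(coef(fit_lin)[1])  # back-transform a* = log(a)
b_hat <- coef(fit_lin)[2]
c_hat <- coef(fit_lin)[3]       # coefficient on t (negative here, since the simulation used exp(-c*t))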

MCMCglmm multinomial model in R

I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, other, and village are random effects, r and present are predictor variables, and resp1-resp5 are outcome variables that count observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R = list(V = diag(2), nu = 0.08, fix = 2),
                         G = list(G1 = list(V = 1, nu = 0.08),
                                  G2 = list(V = 1, nu = 0.08),
                                  G3 = list(V = 1, nu = 0.08),
                                  G4 = list(V = 1, nu = 0.08)))
m1 <- MCMCglmm(teaching ~ trait - 1 + at.level(trait, 1):r + at.level(trait, 1):present,
               random = ~ idh(at.level(trait, 1)):focal + idh(at.level(trait, 1)):other +
                         idh(at.level(trait, 1)):X + idh(at.level(trait, 1)):village,
               rcov = ~ idh(trait):units, family = "zipoisson", prior = prior_overdisp_i,
               data = data, nitt = nitt.1, thin = 50, burnin = 15000,
               pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch. 5) give an example of a multinomial model that uses a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/. This is not quite what I'm doing, but it contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait - 1 +
                 at.level(trait, 1):r + at.level(trait, 1):present,
               random = ~ idh(at.level(trait, 1)):focal + idh(at.level(trait, 1)):other +
                         idh(at.level(trait, 1)):X + idh(at.level(trait, 1)):village,
               rcov = ~ idh(trait):units,
               family = cbind("zipoisson", "zipoisson", "zipoisson", "zipoisson", "zipoisson"),
               prior = prior_overdisp_i,
               data = data, nitt = nitt.1, thin = 50, burnin = 15000,
               pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope - where I thought I should get a different slope for each resp variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for the prior
k <- 5  # number of outcome categories (in the HLP post: length(levels(dative$SemanticClass)))
I <- diag(k-1)  # (k-1) x (k-1) identity matrix
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))  # (k-1) x (k-1) matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
  R = list(fix = 1, V = 0.5 * (I + J), n = 4),
  G = list(
    G1 = list(V = diag(4), n = 4)))
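For context, a hedged sketch of how this prior might be passed to a multinomial5 call (my own construction following the HLP post; untested, and the random-effects term is purely illustrative):
m_multi <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait - 1,
                    random = ~ us(trait):focal,
                    rcov = ~ us(trait):units,
                    family = "multinomial5",
                    prior = prior, data = data)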
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).
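As a quick way to confirm that each response gets its own slope (my own sketch), look at the fixed-effect posterior summary, which has one row per trait:predictor combination:
summary(m)$solutions  # rows like traitResp1:predictorvariable, traitResp2:predictorvariable, ...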
