I am analyzing gene expression data in R. I would like to test for differences in expression while accounting for phylogenetic effects.
I can run a GLM with a negative binomial distribution and the normalization factor as an offset:
library(MASS)
glm.nb(expression ~ Group + offset(log(normFactor)), data=data)
However, I don't know how to include phylogenetic effect in this model. I can obtain a variance-covariance or correlation matrix from my phylogeny:
library(ape)
tree <- read.tree("tree.nwk")
varCovMatrix <- vcv(tree, model = "Brownian", corr = FALSE)
I found that lmekin allows one to specify the variance-covariance structure of the random effects:
library(coxme)
lmekin(expression ~ Group + (1 | animal) + offset(log(normFactor)), data = data, varlist = varCovMatrix)
But I cannot specify a negative binomial distribution, and it isn't clear whether it honours the offset.
The same problem arises with MCMCglmm.
Please help me put all three of the following into one GLMM (my closest attempt so far is sketched after the list):
the variance-covariance matrix
normalization factor as an offset
negative binomial distribution
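The closest I have come is MCMCglmm, working around both limitations. This is only a sketch under assumptions I have not verified: the data frame needs an animal column matching the tree's tip labels, Group is assumed to have two levels (three fixed-effect columns in total), the offset is emulated by fixing the coefficient of log(normFactor) at 1 through its prior, and family = "poisson" plus the residual term gives a Poisson-lognormal model as a stand-in for the negative binomial:
library(MCMCglmm)
library(ape)
tree <- read.tree("tree.nwk")
Ainv <- inverseA(tree)$Ainv  # sparse inverse of the phylogenetic covariance
# Fixed-effect prior: mean 1 and near-zero variance pin the log(normFactor)
# coefficient at 1, mimicking an offset (MCMCglmm has no offset argument).
# The mu/V entries assume the columns: intercept, Group, log(normFactor).
prior <- list(
  B = list(mu = c(0, 0, 1), V = diag(c(1e10, 1e10, 1e-9))),
  G = list(G1 = list(V = 1, nu = 0.002)),  # phylogenetic variance
  R = list(V = 1, nu = 0.002)              # residual variance = overdispersion
)
fit <- MCMCglmm(expression ~ Group + log(normFactor),
                random   = ~ animal,            # must match the tree's tip labels
                ginverse = list(animal = Ainv),
                family   = "poisson",           # log link; Poisson-lognormal
                prior    = prior,
                data     = data)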
I have created a custom covariance matrix (converted to a correlation matrix) that I wish to use in nlme. How do I fit my model as a random-effects model where gen, rep:row and rep:col are all considered random effects? See my gls model below:
model.gls <- gls(y ~ gen + rep:row + rep:col - 1, correlation = C, data = dat)
C is my custom correlation matrix.
Any help would be deeply appreciated.
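For what it's worth, the closest I have got is wrapping C in a fixed corSymm structure, since gls() expects a corStruct object rather than a raw matrix. This is only a sketch: it assumes C is n x n with rows and columns in the same order as the rows of dat, and that C[lower.tri(C)] matches the ordering corSymm() expects (check ?corSymm before trusting this):
library(nlme)
# hold the supplied correlations fixed rather than estimating them
csFix <- corSymm(value = C[lower.tri(C)], form = ~ 1, fixed = TRUE)
model.gls <- gls(y ~ gen + rep:row + rep:col - 1,
                 correlation = csFix, data = dat)
This still treats gen, rep:row and rep:col as fixed effects, though, so it does not solve the random-effects part of the question.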
I want to use a mixed model without a random intercept but with a correlation structure. The reason is to get the AIC to help choose the best correlation structure (e.g., autoregressive versus compound symmetry). So it is essentially a GEE, but GEEs don't allow estimation of the AIC. They are also called covariance pattern models.
The code below simulates random data with a compound symmetry correlation. The model fits both a random intercept and a variance-covariance matrix. Is there any way to switch off the random intercept?
library(MASS)
library(nlme)
Sigma = toeplitz(c(1,0.5,0.5,0.5))
data = data.frame(mvrnorm(n=10, mu=1:4, Sigma=Sigma))
data$id = 1:nrow(data)
long = reshape(data, direction='long', varying=list(1:4), v.names='Y')
cs = corCompSymm(0.5, form = ~ 1 | id)
model = lme(Y ~ time, random = list(~ 1 | id), data = long, correlation = cs)
summary(model)
If you are solely interested in comparing correlation structures, then I am pretty sure your goal could be served by a generalized least squares model fit with gls:
model = gls(Y ~ time, data = long, correlation = cs)
summary(model)
AIC(model)
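Because gls estimates everything by (restricted) maximum likelihood, you can fit one model per candidate correlation structure and compare their AICs directly. A small sketch along those lines (the corAR1 call is my addition, using the observation order within each id):
m_cs  <- gls(Y ~ time, data = long, correlation = corCompSymm(form = ~ 1 | id))
m_ar1 <- gls(Y ~ time, data = long, correlation = corAR1(form = ~ 1 | id))
AIC(m_cs, m_ar1)  # lower AIC = preferred correlation structure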
Otherwise, a linear mixed effects model fit with lme must have random effects specified.
I ran a mixed-model logistic regression, adjusting my model with a genetic relationship matrix, using the R package GMMAT (function glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function in order to compute this "by hand", and I got stuck trying to compute the AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm(), where model1$aic is 4013.232, but using the Stack Overflow answer I found, I obtained 30613.03.
My question is -- does anyone know how to compute log likelihood from a logistic regression by hand using the output that I have listed above in R?
No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the model (if you did, you would need to include those weights in the model object).
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n)  # or s_model$prior.weights if that field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients))  # or s_model$rank if that field exists
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  log_lik <- mod_rank - aic / 2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors", "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667
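For a glmmkin fit you would, I think, build the same minimal list from your output, passing the original 0/1 phenotype rather than the working vector Y (the manual says Y is the final working vector, not the response). A hypothetical sketch, with gmmat_fit and pheno01 standing in for your fitted object and binary outcome:
gk <- list(y             = pheno01,                  # original 0/1 response (hypothetical name)
           fitted.values = gmmat_fit$fitted.values,  # fitted means on the original scale
           coefficients  = gmmat_fit$coefficients)
get_logLik(gk)
Note this evaluates the GLM-style binomial likelihood at the fitted means and ignores the random effect, so treat it as an approximation rather than the mixed-model likelihood.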
Can anyone tell me why the slope coefficients differ between those extracted from an lmer model with a random slope and those from an lmList model fitted to the same dataset?
Thanks...
After some digging I found the answer in Doug Bates' book on lme4. Paraphrasing: when the individual linear fit at the subject level is poor, the linear mixed-effects model coefficients tend to exhibit what is called "shrinkage" (see http://lme4.r-forge.r-project.org/lMMwR/lrgprt.pdf) towards the population-level value (i.e. the fixed effect). In this case the uncertainty in the subject-level coefficient is large (i.e. our confidence in our estimate of its precise value is low). To balance fidelity to the data, measured by the residual sum of squares, against simplicity of the model, the mixed-effects model smooths out the between-subject differences in the predictions by bringing them closer to a common set of predictions, but not at the expense of dramatically increasing the sum of squared residuals.
Note that the "shrinkage" might be a good thing assuming some degree of similarity among your subjects (or observational units), for example if you assume they are drawn from the same population, because it makes the model more robust to outliers at the individual level.
You can quantify the increase in the sum of squared residuals by computing an overall coefficient of determination for the mixed-effects model and the within-subject fits. I am doing it here for the sleepstudy dataset contained in the lme4 package.
> library(lme4)
> mm <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy) # mixed-effects
> ws <- lmList(Reaction ~ Days |Subject, data = sleepstudy) # within-subject
>
> # coefficient of determination for mixed-effects model
> summary(lm(sleepstudy$Reaction ~ predict(mm)))$r.squared
[1] 0.8271702
>
> # coefficient of determination for within subjects fit
> require(nlme)
> summary(lm(sleepstudy$Reaction ~ predict(ws)))$r.squared
[1] 0.8339452
You can check that the decrease in the proportion of variability explained by the mixed-effects model with respect to the within-subject fits is quite small: 0.8339452 - 0.8271702 = 0.006775.
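You can also see the shrinkage itself by putting the two sets of per-subject slopes side by side (a quick sketch reusing the fits above):
> # shrunken (mixed-effects) vs unshrunken (within-subject) slopes
> head(cbind(mixed = coef(mm)$Subject$Days, within = coef(ws)$Days))
The mixed-model slopes sit closer to the common fixed-effect slope than the within-subject ones do.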
I need to plot a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit from which residuals can be extracted?
This is the code I used
library(MASS)  # polr
library(arm)   # binnedplot
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method = 'logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object res is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects), or you could compute a table of true versus predicted categories. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction... but I can't see that there's any unique way to define the residuals in the first place.
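For instance, one ad hoc check along the most-likely-outcome lines (a sketch reusing mod1 from the question):
# cross-tabulate observed categories against modal predictions
table(observed = model.frame(mod1)[, 1], predicted = predict(mod1))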
polr() provides no function that returns residuals, so you would have to calculate them manually from whatever definition you choose.
There are actually plenty of ways to get residuals from an ordinal probit/logit. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals described in Vella (1993). As far as I know these are currently not available in R (although I am working on that), but they are available in Stata (using predict gr, score).
Residuals in VGLM
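A minimal sketch of the VGAM route, reusing the variables from the question (the cumulative(parallel = TRUE) family mirrors the proportional-odds model fitted by polr; the available residual types are listed in ?residualsvglm):
library(VGAM)
vfit <- vglm(as.ordered(y) ~ x1 + x2 + x3,
             family = cumulative(parallel = TRUE), data = data)
res_working <- resid(vfit, type = "working")  # working residuals
res_pearson <- resid(vfit, type = "pearson")  # Pearson residuals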
Surrogate residuals for polr
You can use the sure package to calculate surrogate residuals with resids(). The package is based on this paper in the Journal of the American Statistical Association.
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
df1 <- df1    # copy sure's example data into the workspace
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y   # df2 and df3 are also example data sets from sure
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data=df1, method='probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim = 1000, method = "latent"), the outcome does not converge to a stable result.
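If reproducibility is the immediate concern, a small mitigation (not a fix for the underlying sampling variability) is to fix the random seed before each call:
set.seed(101)        # make the surrogate sampling reproducible
res <- resids(mod1)  # identical draws on every re-run with this seed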