I am using the phylolm function (in package phylolm) to conduct PGLS phylogenetic analysis and am having some trouble interpreting the model output.
I am running a phylolm model with a continuous (log-transformed) response variable and one predictor variable, which is a factor with two groups. When I change the reference group (from condition A to B) and rerun the same model, the estimates change accordingly but the standard errors do not seem to. The standard error for the new reference group remains very high - high enough that I don't see how the difference between groups can be significant (which the p-value indicates it is). I was under the impression that phylolm standard errors can be interpreted in the same way as for ordinary linear regression - am I mistaken?
Since you have a binary variable, changing the reference category should only flip the sign of the beta estimate - should it not? This is what happens in your models.
It might help to think of the coefficient for the conditions as the difference between the group means. The difference between the means will be the same size, but the sign will change depending on whether you are comparing condition A to condition B or condition B to condition A.
# Set a sample size and seed for reproducibility
set.seed(1)
n = 100
# Create a binary variable
x = sample(c(0,1), n, replace = TRUE)
# and create the opposite variable (i.e. changing the reference level)
x.rev = +(!x)
# add some error to the model
error = runif(n)
# create the continuous response variable
y = 2 + 2 * x + error
df = data.frame(y, x)
# look at the group means between each condition
group_means = tapply(y, x, mean)
group_means[1] - group_means[2]
group_means[2] - group_means[1]
# compare those to the coefficients in the two models
summary(lm(y ~ x))
summary(lm(y ~ x.rev))
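For completeness, the same sign-flip check can be run in phylolm itself. A minimal sketch, assuming the ape and phylolm packages and an arbitrary simulated tree (the simulated y carries no particular phylogenetic signal; this only illustrates the coefficient behavior):
library(ape)
library(phylolm)
tree = rcoal(n)  # random coalescent tree with n tips
# match the simulated variables to the tip labels
d = data.frame(y, x, x.rev, row.names = tree$tip.label)
# the slope estimate flips sign; its standard error is unchanged
summary(phylolm(y ~ x, data = d, phy = tree))
summary(phylolm(y ~ x.rev, data = d, phy = tree))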
I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured out how) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn a Poisson mixed-effect regression into a quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to interpret the pooled results for mixed models (I don't see the random effects in the pooled output; although I've found this page which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite a similar structure to coef(summary(someGLM))
# but I don't see where the random effects are?
# and more importantly, I wanted a quasi-Poisson model, not just a Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
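For a quick look at the adjusted table (column names as produced by the mipo summary):
qctab[, c("term", "estimate", "std.error", "statistic", "p.value")]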
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
We have tried to do two (similar) panel regressions in R.
1) One with time and individual fixed effects (the usual intercept dummies) using plm(). However, we are mostly interested in a "slope coefficient" or beta for each individual and not one for all of the individuals:
Regression 1: income_{it} = \alpha_i + \gamma_t + \beta_i \sum_{s=0}^{3} X_{i,t-s} + \epsilon_{it}
where \alpha_i is the individual fixed effect and \gamma_t is the time fixed effect. The sum of X is the variable X and its three lags:
Sum X variable: \sum_{s=0}^{3} X_{i,t-s}
We have already included the lagged X variables as new columns in our dataset so in our specification in the code we simply treat them as four different variables:
This is an attempt at using plm() and including our own dummy variables to get a beta for each individual:
plm(income ~ (factor(firmid)-1)*(expense_rate + lag1 + lag2 + lag3), data = data1,
effect = c("time"), model = c("within"), index = c("name", "date"))
The lag1, lag2, lag3 are the lagged variables of expense_rate.
Data1 is in the form of a data frame.
"(factor(firmid)-1)" is an attempt at introducing dummies to get a beta for each individual instead of one for all individuals.
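To illustrate the idea, here is a minimal sketch with plain lm() on toy data (hypothetical variable names, not our real dataset), showing that interacting a factor with a covariate yields one slope per individual:
set.seed(1)
toy = data.frame(firmid = rep(1:3, each = 20), expense_rate = rnorm(60))
toy$income = with(toy, firmid * expense_rate + rnorm(60))
# one intercept and one expense_rate slope per firm
coef(lm(income ~ 0 + factor(firmid) + factor(firmid):expense_rate, data = toy))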
2) The second (and simpler) regression is:
Regression 2: income_{it} = \alpha_i + \beta_i \sum_{s=0}^{3} X_{i,t-s} + \epsilon_{it}
This is an example of our attempt at using pvcm
pvcm1 <- pvcm(income ~ expense_rate + lag1 + lag2 + lag3, data = data1,
effect = "individual", model = "within")
Our question is what specific code and/or packages/functions would be suitable for these regressions. We have tried pvcm() to no avail, running into errors such as:
“Error in table(index[1], index[2], useNA = "ifany") :
attempt to make a table with >= 2^31 elements”
and
“Error: cannot allocate vector of size 599.7 Gb”
Furthermore, pvcm() does not seem to be able to cope with both individual and time fixed effects as in 1).
I am trying to impute data in a dataset with a longitudinal design. There are two predictors (experimental group and time) and one outcome variable (score). The clustering variable is id.
Here is the toy data:
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- rep(0:3, each = 8, length = 32)
group <- rep(c("A","B"), times =2, each = 4, length = 32)
df <- data.frame(id = id, group = group, time = time, score = score)
# plots
library(ggplot2)
(ggplot(df, aes(x = time, y = score, group = group)) +
    stat_summary(fun = "mean", geom = "line", aes(linetype = group)) +
    stat_summary(fun = "mean", geom = "point", aes(shape = group), size = 3) +
    coord_cartesian(ylim = c(0,18)))
# now place some NAs
df[sample(1:nrow(df), 10, replace = F),"score"] <- NA
df
If I understand this post correctly, in the predictor matrix I should specify the id clustering variable with a -2 and the two fixed predictors time and group with a 1. Like so
library(mice)
(ini <- mice(df, maxit=0))
(pred <- ini$predictorMatrix)
(pred["score",] <- c(-2, 1, 1, 0))
(imp <- mice(df,
method = c("", "", "", "2l.pan"),
pred = pred,
maxit = 1,
seed = 71152))
What I would like to know is:
Is this a longitudinal random-intercepts imputation model? Specifying the id variable as -2 designates it as a 'class' variable, but this mice primer suggests that for multilevel models you should create a variable of all 1's in the dataframe as a constant, which is then specified as the random intercept via 2 in the predictor matrix. However, that advice is based on the 2l.norm function rather than the 2l.pan function, so I am not really sure where I stand here. Does the 2l.pan function not require this column, or the specification of random effects?
Is there any way to specify a longitudinal random-slopes model, and, if so, how?
This answer is probably a bit late for you, but it may be able to help some people who read this in the future:
How to work with 2l.pan
Below are some details about specifying multilevel imputation models with mice. Because the application is longitudinal, I use the term "persons" to refer to units at Level 2. These are the most relevant arguments for 2l.pan as mentioned in the mice documentation:
type
Vector of length ncol(x) identifying random and class variables.
Random effects are identified by a 2. The group variable (only one
is allowed) is coded as -2. Random effects also include the fixed
effect. If for a covariates X1 group means shall be calculated and
included as further fixed effects choose 3. In addition to the
effects in 3, specification 4 also includes random effects of
X1.
There are 5 different codes you can use in the predictor matrix for variables imputed with 2l.pan. The person identifier is coded as -2 (this is different from 2l.norm). To include predictor variables with fixed or random effects, these variables are coded with 1 or 2, respectively. If coded as 2, the corresponding fixed effect is automatically included.
In addition, 2l.pan offers the codes 3 and 4, which have similar meanings as 1 and 2 but will include an additional fixed effect for the person mean of that variable. This is useful if you're trying to model within- and between-person effects of time-varying predictor variables.
intercept
Logical determining whether the intercept is automatically added.
By default, 2l.pan includes the intercept as both a fixed and a random effect. For this reason, it is not required to include a constant term in the predictor matrix. If one sets intercept=FALSE, this behavior is changed, and the intercept is dropped from the imputation model.
groupcenter.slope
If TRUE, in case of group means (type is 3 or 4) group mean
centering for these predictors are conducted before doing imputations.
Default is FALSE.
Using this option, it is possible to center predictor variables around the person mean instead of including the predictor variable "as is" (i.e., without centering). This only applies to variables coded as 3 or 4. For predictors coded as 3, this is not very important because the models with and without centering are identical.
However, when predictor variables are coded as 4 (i.e., with a random slope), then centering alters the meaning of the random effect so that the random slope no longer applies to the variable "as is" but to the within-person deviation of that variable.
In your example, you can include a simple random slope for time as follows:
library(mice)
ini <- mice(df, maxit=0)
# predictor matrix (following 'type')
pred <- ini$predictorMatrix
pred["score",] <- c(-2, 1, 2, 0)
# imputation method
meth <- c("", "", "", "2l.pan")
imp <- mice(df, method=meth, pred=pred, maxit=10, m=10)
In this example, coding time as 3 or 4 wouldn't make a lot of sense because the person means of time are identical for all persons. However, if you have time-varying covariates that you want to include as predictor variables in the imputation model, 3 and 4 can be useful.
The additional arguments like intercept and groupcenter.slope can be specified directly in the call to mice(), for example:
imp <- mice(df, ..., groupcenter.slope=TRUE)
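For illustration, suppose df additionally contained a hypothetical time-varying covariate stress (column order: id, group, time, stress, score). Coding it as 4 together with groupcenter.slope=TRUE would add its person mean as a fixed effect plus a random slope for the within-person deviation:
# hypothetical sketch; 'stress' is not part of the original df
pred["score", ] <- c(-2, 1, 2, 4, 0)
meth <- c("", "", "", "", "2l.pan")
imp <- mice(df, method=meth, pred=pred, groupcenter.slope=TRUE, maxit=10, m=10)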
Regarding your Questions
So, to answer your questions as stated in the post:
Yes, 2l.pan provides a multilevel (or rather two-level) imputation model. The intercept is included as both a fixed and a random effect by default (can be changed with intercept=FALSE) and need not be specified in the predictor matrix (this is in contrast to 2l.norm).
Yes, you can specify random slopes with 2l.pan. To do that, predictors with random slopes are coded as 2 or 4 in the predictor matrix. If coded as 2, the random slope is included. If coded as 4, the random slope is included as well as an additional fixed effect for the person means of that variable. If coded as 4, the meaning of the random slope may be altered by making use of groupcenter.slope=TRUE (see above).
This article also includes some worked examples of how to work with 2l.pan and other functions for multilevel imputation: [Link]
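Finally, once the imputations are generated, the analysis model can be fitted on each completed dataset and pooled in the usual way. A sketch, assuming the lme4 and broom.mixed packages are available:
library(lme4)
# random intercept and random slope for time, by person
fit <- with(imp, lmer(score ~ time * group + (1 + time | id)))
# pool() combines the fixed effects; random effects are not pooled
summary(pool(fit))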
The pan library doesn't require an intercept term.
You can dig into the function using
library(pan)
?pan
That said, mice uses a wrapper around pan called mice.impute.2l.pan; with the mice library loaded you can look at the help for that function. It states that the function has a parameter called intercept, which is a "Logical determining whether the intercept is automatically added." It is TRUE by default, and the intercept is defined as a random intercept. I found this out after browsing the R code for the mice wrapper:
if (intercept) {
x <- cbind(1, as.matrix(x))
type <- c(2, type)
}
Here the pan function parameter type is a "Vector of length ncol(x) identifying random and class variables." The intercept is added by default and defined as a random effect.
They do provide an example, as you stated, with a 1 for "x" in the predictor matrix for fixed effects.
It also states for 2l.norm, The random intercept is automatically added in mice.impute.2l.norm().
It has a few examples with descriptions.
The CRAN documentation for pan might help you.
I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
size=df$n,theta = df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond,data=df,
family=betabinomialff(zero=2)
## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean (this doesn't give you the difference of level 3 from the mean directly, but by construction it should be (-cond1-cond2) ...)
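A quick sketch of recovering that level-3 effect from the fit (assuming the model above is stored in an object, here called fit):
fit <- vglm(cbind(k, n - k) ~ cond, data = df,
            family = betabinomialff(zero = 2))
cc <- coef(fit)
## difference of level 3 ("C") from the mean, via the sum-to-zero constraint
-(cc["cond1"] + cc["cond2"])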
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
size=n, theta=exp(ltheta)),
parameters=list(lprob~cond),
data=df,
start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
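That claim can be checked by letting both parameters vary across conditions in the bbmle parameterization (a sketch along the lines of the model above):
mle2(k ~ dbetabinom(k, prob = plogis(lprob),
                    size = n, theta = exp(ltheta)),
     parameters = list(lprob ~ cond, ltheta ~ cond),
     data = df,
     start = list(lprob = 0, ltheta = 0))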
Effects must be estimated relative to some base level. The effect of having any of the 3 conditions at all would just be absorbed into the constant of the regression.
Since the intercept is the expected mean value when cond is 0 for both estimated levels (i.e. "B" and "C"), it is the mean value of the reference group only (i.e. "A").
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus the comparison group's coefficient. The comparison groups' coefficients, as you know, therefore give you the effect of having the comparison group = 1 (bearing in mind that each level of your categorical variable is a dummy variable which = 1 when that level is present) relative to the reference group.
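As a quick sketch with the glm.nb model from the question (default treatment contrasts; the coefficient names condB and condC are assumed):
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
cc <- coef(fit)
cc["(Intercept)"]                # log mean rate of reference group A (per unit n)
cc["(Intercept)"] + cc["condB"]  # log mean rate of group B
cc["(Intercept)"] + cc["condC"]  # log mean rate of group C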
So your results give you the means and relative effects of each level. You can of course switch the reference level according to your preference.
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.
I would like to estimate some panel data models in R using the plm package. Because of my limited theoretical knowledge, I am strictly following the instructions from "econometrics academy" (code here). I customized that code with respect to my data (own dependent/independent variables), but did not change any other syntax/formulas.
Now here's the problem:
All models can be estimated and their results can also be summarized and interpreted except for the random effects model. Here I get the following error message:
Error in solve.default(crossprod(X.m)) :
system is computationally singular: reciprocal condition number = 9.57127e-023
Is there anybody who can give me a hint what this error does actually mean? What might be the underlying reason and how do I have to correct the code in order to get results?
Edit:
To be more precise, here's the part of R code I used:
# read in data
mydata<- read.csv2("Panel.csv")
attach(mydata)
# define dependent variable
sd1 <- cbind(sd)
# define independent variables
x <- cbind(ratio1, ratio2, ratio3, ratio4, mean)
# Set data as panel data
pdata <- plm.data(mydata, index=c("id","t"))
# Pooled OLS estimator
pooling <- plm(sd1 ~ x, data=pdata, model= "pooling")
summary(pooling)
# Between estimator
between <- plm(sd1 ~ x, data=pdata, model= "between")
summary(between)
# First differences estimator
firstdiff <- plm(sd1 ~ x, data=pdata, model= "fd")
summary(firstdiff)
# Fixed effects or within estimator
fixed <- plm(sd1 ~ x, data=pdata, model= "within")
summary(fixed)
# Random effects estimator
random <- plm(sd1 ~ x, data=pdata, model= "random")
summary(random)
Due to policy restrictions I am not allowed to upload the data. But I can provide the information that it is balance sheet data. The dependent variable is the standard deviation of a balance sheet position over time, which should be explained by different balance sheet positions. These are mainly ratios of the type "position a / mean" (ratios 1 to 4). As an additional independent variable, the average sum of the assets on the balance sheet is considered.
Again: everything works, only the last model (random) produces the stated error.
Possibly the problem is caused by the definition of the ratios? They are defined using the variable "mean" (which is itself also an independent variable).
Edit: Traceback-Code
> random <- plm(sd1 ~ x, data=pdata, model= "random")
Error in solve.default(crossprod(X.m)) :
system is computationally singular: reciprocal condition number = 1.65832e-022
> traceback()
8: solve.default(crossprod(X.m))
7: solve(crossprod(X.m))
6: diag(solve(crossprod(X.m)) %*% crossprod(X.sum))
5: swar(object, data, effect)
4: ercomp.formula(formula, data, effect, method = random.method)
3: ercomp(formula, data, effect, method = random.method)
2: plm.fit(formula, data, model, effect, random.method, inst.method)
1: plm(sd1 ~ x, data = pdata, model = "random")
If your model matrix contains very large values as well as very small values, solve might not be able to solve the system of linear equations computationally. Thus, have a look at model.matrix(sd1 ~ x, data=pdata) to see if this is the case. If so, try rescaling some variables (e.g. multiply or divide by 100 or 1000; also log() makes sense sometimes). Take care: the interpretation of the coefficients changes due to the change of scales!
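A minimal sketch of that check and fix, following the attach()-based setup from the question (the factor of 1000 is just an example):
# inspect the magnitudes of the design-matrix columns
X <- model.matrix(sd1 ~ x, data = pdata)
apply(abs(X), 2, range)
# if some columns span wildly different orders of magnitude,
# rescale the offending variable and rebuild x:
mean_scaled <- mean / 1000  # 'mean' here is the attached balance-sheet column
x <- cbind(ratio1, ratio2, ratio3, ratio4, mean_scaled)
random <- plm(sd1 ~ x, data = pdata, model = "random")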