Different mixed effect models with lme4 and glmer2stan - R

I'm trying the following model with the lme4 package:
library(nlme) # for the data
data("Machines") # the data
library(lme4)
# the model:
fit1 <- lmer(score ~ -1 + Machine + (1|Worker), data=Machines)
> summary(fit1)
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ -1 + Machine + (1 | Worker)
   Data: Machines

REML criterion at convergence: 286.9

Scaled residuals:
    Min      1Q  Median      3Q     Max
-2.7249 -0.5233  0.1328  0.6513  1.7559

Random effects:
 Groups   Name        Variance Std.Dev.
 Worker   (Intercept) 26.487   5.147
 Residual              9.996   3.162
Number of obs: 54, groups:  Worker, 6

Fixed effects:
         Estimate Std. Error t value
MachineA   52.356      2.229   23.48
MachineB   60.322      2.229   27.06
MachineC   66.272      2.229   29.73

Correlation of Fixed Effects:
         MachnA MachnB
MachineB 0.888
MachineC 0.888  0.888
I now try to fit the same model using rstan through the glmer2stan package:
library(glmer2stan)
Machines$Machine_idx <- as.numeric(Machines$Machine)
Machines$Worker_idx <- as.numeric(as.character(Machines$Worker))
fit3 <- lmer2stan(score ~ -1 + Machine_idx + (1|Worker_idx), data=Machines)
This is the result:
> stanmer(fit3)
glmer2stan model: score ~ -1 + Machine_idx + (1 | Worker_idx) [gaussian]

Level 1 estimates:
            Expectation StdDev 2.5% 97.5%
Machine_idx        7.04   0.55 5.95  8.08
sigma              3.26   0.35 2.66  4.02

Level 2 estimates:
(Std.dev. and correlations)

Group: Worker_idx (6 groups / imbalance: 0)
    (Intercept) 55.09 (SE 15.82)

DIC: 287   pDIC: 7.9   Deviance: 271.3
I don't think that's the same model. Is my glmer2stan specification wrong?
I know that glmer2stan is not actively developed any more but it should handle this simple model, shouldn't it?
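A side note on the specification above: Machine_idx enters the formula as a single numeric predictor, so the model estimates one common slope across the machine codes 1, 2, 3 instead of one mean per machine, which is why stanmer() reports a single Machine_idx coefficient. A minimal lme4 check on the same coding (fit2 is an illustrative name) makes the mismatch visible:
fit2 <- lmer(score ~ -1 + Machine_idx + (1 | Worker), data = Machines)
fixef(fit2)  # a single slope, not the three machine means of fit1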
UPDATE:
Thanks to the tip by Roland, I changed the Machine factor levels to dummy variables and it now works:
Machines$Worker <- as.numeric(as.character(Machines$Worker))
m <- model.matrix(~ 0 + ., Machines)
m <- as.data.frame(m)
fit3 <- lmer2stan(score ~ -1 + (1|Worker) + MachineA + MachineB + MachineC, data=m, chains=2)
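As a sanity check, an lme4 fit on the same dummy-coded data frame should reproduce the fixed effects of fit1 (a minimal sketch; fit_check is an illustrative name):
fit_check <- lmer(score ~ -1 + MachineA + MachineB + MachineC + (1 | Worker), data = m)
fixef(fit_check)  # expect approximately 52.36, 60.32, 66.27, as in fit1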

Related

glm in R, give all comparisons

Simple logistic regression example.
set.seed(1)
df <- data.frame(out = c(0, 1, 0, 1, 0, 1, 0, 1, 0),
                 y = rep(c('A', 'B', 'C'), 3))
result <- glm(out ~ factor(y), family = 'binomial', data = df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#
#Deviance Residuals:
#    Min      1Q  Median      3Q     Max
#-1.4823 -0.9005 -0.9005  0.9005  1.4823
#
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01  1.225e+00  -0.566    0.571
#factor(y)B   1.386e+00  1.732e+00   0.800    0.423
#factor(y)C   3.950e-16  1.732e+00   0.000    1.000
#
#(Dispersion parameter for binomial family taken to be 1)
#
#    Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and when C are the reference category. One can change the reference level manually by using levels = in factor(), but this would require fitting three models. Is it possible to do this in one go? Or what would be a more efficient approach?
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out = c(0, 1, 0, 1, 0, 1, 0, 1, 0),
                 y = rep(c('A', 'B', 'C'), 3))
# y is already a factor; if not, coerce before the model fit
result <- glm(out ~ y, family = 'binomial', data = df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
#             Estimate Std. Error z value Pr(>|z|)
#B - A == 0  1.386e+00  1.732e+00     0.8    0.703
#C - A == 0  1.923e-16  1.732e+00     0.0    1.000
#C - B == 0 -1.386e+00  1.732e+00    -0.8    0.703
#(Adjusted p values reported -- single-step method)
# letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a number of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
    data = df, index = c('region', 'year'),
    model = 'within', effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not include time fixed effects and instead use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
                 region = rep(1:100, each = 100),
                 year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691  0.1422052   -0.368   0.7132
# a            1.9982165  0.0101501  196.866   <2e-16 ***
# b           -1.4787359  0.0101666 -145.450   <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
#    Estimate Std. Error t-value  Pr(>|t|)
# a  1.998217   0.010150  196.87 < 2.2e-16 ***
# b -1.478736   0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# a  1.99822    0.01015   196.9   <2e-16 ***
# b -1.47874    0.01017  -145.4   <2e-16 ***
This does not seem to be a data issue.
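To make the agreement explicit, you can line up the slope estimates from the three fits side by side (a sketch using the objects created above):
# the three estimators should return essentially identical a and b coefficients
cbind(lm   = coef(model.a)[c("a", "b")],
      plm  = coef(model.b),
      felm = coef(model.c))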
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python):
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
                            OLS Regression Results
==============================================================================
Dep. Variable:             diff_lrent   R-squared:                       0.322
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                     9.510
Date:                Sun, 05 Nov 2017   Prob (F-statistic):           3.14e-05
Time:                        00:46:55   Log-Likelihood:                 65.272
No. Observations:                  64   AIC:                            -122.5
Df Residuals:                      60   BIC:                            -113.9
Df Model:                           3
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.3855      0.037     10.469      0.000       0.312       0.459
diff_lpop        0.0722      0.088      0.818      0.417      -0.104       0.249
diff_lavginc     0.3100      0.066      4.663      0.000       0.177       0.443
diff_pctstu      0.0112      0.004      2.711      0.009       0.003       0.019
==============================================================================
Omnibus:                        2.653   Durbin-Watson:                   1.655
Prob(Omnibus):                  0.265   Jarque-Bera (JB):                2.335
Skew:                           0.467   Prob(JB):                        0.311
Kurtosis:                       2.934   Cond. No.                         23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
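For what it's worth, a likely culprit is the effect argument: effect = "time" sweeps out year means (time fixed effects only), while the exercise calls for city fixed effects, and with T = 2 the within (city FE) estimator coincides with first differencing. A sketch, assuming the data are indexed by the city and year variables of the Wooldridge rental file and that its y90 year dummy is included:
# city (individual) fixed effects; with two periods this should match the FD fit
modelfe <- plm(lrent ~ y90 + lpop + lavginc + pctstu,
               data = data, index = c("city", "year"),
               model = "within", effect = "individual")
summary(modelfe)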

Nested random effects in `lme {nlme}`

I have to fit an LMM with an interaction random effect but without the marginal random effect, using the lme command. That is, I want to fit the model in oats.lmer (see below) but using the function lme from the nlme package.
The code is
require("nlme")
require("lme4")
oats.lmer <- lmer(yield~nitro + (1|Block:Variety), data = Oats)
summary(oats.lmer)
#Linear mixed model fit by REML ['lmerMod']
#Formula: yield ~ nitro + (1 | Block:Variety)
#   Data: Oats
#
#REML criterion at convergence: 598.1
#
#Scaled residuals:
#     Min       1Q   Median       3Q      Max
#-1.66482 -0.72807 -0.00079  0.56416  1.85467
#
#Random effects:
# Groups        Name        Variance Std.Dev.
# Block:Variety (Intercept) 306.8    17.51
# Residual                  165.6    12.87
#Number of obs: 72, groups:  Block:Variety, 18
#
#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)   81.872      4.846   16.90
#nitro         73.667      6.781   10.86
#
#Correlation of Fixed Effects:
#      (Intr)
#nitro -0.420
I started playing with this
oats.lme <- lme(yield~nitro, data = Oats, random = (~1|Block/Variety))
summary(oats.lme)
#Linear mixed-effects model fit by REML
# Data: Oats
#       AIC      BIC    logLik
#  603.0418 614.2842 -296.5209
#
#Random effects:
# Formula: ~1 | Block
#        (Intercept)
#StdDev:    14.50596
#
# Formula: ~1 | Variety %in% Block
#        (Intercept) Residual
#StdDev:    11.00468 12.86696
#
#Fixed effects: yield ~ nitro
#               Value Std.Error DF  t-value p-value
#(Intercept) 81.87222  6.945273 53 11.78819       0
#nitro       73.66667  6.781483 53 10.86291       0
# Correlation:
#      (Intr)
#nitro -0.293
#
#Standardized Within-Group Residuals:
#        Min          Q1         Med          Q3         Max
#-1.74380770 -0.66475227  0.01710423  0.54298809  1.80298890
#
#Number of Observations: 72
#Number of Groups:
#              Block Variety %in% Block
#                  6                 18
but the problem is that it also includes a marginal random effect for Block, which I want to omit.
The question is: how do I specify the random effects in oats.lme such that oats.lme is identical (at least structurally) to oats.lmer?
It can be as simple as the following:
library(nlme)
data(Oats)
## construct an auxiliary factor `f` for interaction / nesting effect
Oats$f <- with(Oats, Block:Variety)
## use `random = ~ 1 | f`
lme(yield ~ nitro, data = Oats, random = ~ 1 | f)
#Linear mixed-effects model fit by REML
#  Data: Oats
#  Log-restricted-likelihood: -299.0328
#  Fixed: yield ~ nitro
#(Intercept)       nitro
#   81.87222    73.66667
#
#Random effects:
# Formula: ~1 | f
#        (Intercept) Residual
#StdDev:    17.51489 12.86695
#
#Number of Observations: 72
#Number of Groups: 18
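To verify that this is the same model as oats.lmer, compare the variance components and REML log-likelihoods (a sketch; oats.lme2 is an illustrative name):
oats.lme2 <- lme(yield ~ nitro, data = Oats, random = ~ 1 | f)
VarCorr(oats.lme2)  # StdDev about 17.51 and 12.87, matching the lmer fit
logLik(oats.lme2)   # about -299.03; lmer's REML criterion 598.1 is -2 times this, up to rounding
logLik(oats.lmer)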

Glmer Results Differences Through Pure R and rmagic

I am trying to duplicate the results of pure R code that uses lme4's glmer with a Pandas -> R -> glmer pipeline. The original R code is:
%load_ext rpy2.ipython
%R library(lme4)
%R data("respiratory", package = "HSAUR2")
%R write.csv(respiratory, 'respiratory2.csv')
%R resp <- subset(respiratory, month > "0")
%R resp$baseline <- rep(subset(respiratory, month == "0")$status,rep(4, 111))
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = resp)
%R -o resp_lmer_summary resp_lmer_summary = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print resp_lmer_summary
print exp_res
The output is
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
    (1 | subject)
   Data: resp

     AIC      BIC   logLik deviance df.resid
   446.6    487.6   -213.3    426.6      434

Scaled residuals:
    Min      1Q  Median      3Q     Max
-2.5855 -0.3609  0.1430  0.3640  2.2119

Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 3.779    1.944
Number of obs: 444, groups:  subject, 111

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.65460    0.77621  -2.132   0.0330 *
baselinegood        3.08897    0.59859   5.160 2.46e-07 ***
month.L            -0.20348    0.27957  -0.728   0.4667
month.Q            -0.02821    0.27907  -0.101   0.9195
month.C            -0.35571    0.28085  -1.267   0.2053
treatmenttreatment  2.16620    0.55157   3.927 8.59e-05 ***
gendermale          0.23836    0.66606   0.358   0.7204
age                -0.02557    0.01994  -1.283   0.1997
centre2             1.03850    0.54182   1.917   0.0553 .
...
On the other hand, when I read the file through Pandas and pass it to glmer in R through rmagic, I get
import pandas as pd
df = pd.read_csv('respiratory2.csv',index_col=0)
baseline = df[df['month'] == 0][['subject','status']].set_index('subject')
df['status'] = (df['status'] == 'good').astype(int)
df['baseline'] = df.apply(lambda x: baseline.ix[x['subject']],axis=1)
df['centre'] = df['centre'].astype(str)
%R -i df
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = df)
%R -o res res = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print res
Output
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
    (1 | subject)
   Data: df

     AIC      BIC   logLik deviance df.resid
   539.2    573.7   -261.6    523.2      547

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.8025 -0.4123  0.1637  0.4295  2.4482

Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 1.829    1.353
Number of obs: 555, groups:  subject, 111

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         1.39252    0.60229   2.312   0.0208 *
baselinepoor       -3.42262    0.46095  -7.425 1.13e-13 ***
month               0.12730    0.08465   1.504   0.1326
treatmenttreatment  1.59332    0.39981   3.985 6.74e-05 ***
gendermale          0.12915    0.49291   0.262   0.7933
age                -0.01833    0.01480  -1.238   0.2157
centre2             0.70520    0.39676   1.777   0.0755 .
The results are somewhat different. When R reads the file itself, it turns month into an ordered factor (hence the month.L, month.Q and month.C polynomial contrasts); coming from Pandas -> R, this column is treated as numeric, so maybe that's the difference? I believe I was able to duplicate the derived column baseline correctly; I did have to turn status into a 1/0 numeric value, however, whereas pure R can work with this column as a string (good/poor).
Note: correction. I missed the filtering condition in the Python part, where only rows with month > 0 are kept. Once that is done with
df = df[df['month'] > 0]
the treatmenttreatment coefficient becomes 2.16, close to the pure R result. R still displays a positive baselinegood coefficient whereas Pandas -> R displays baselinepoor with a negative coefficient, but I guess this is mostly the same effect with the reference level flipped.
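If you also want the remaining coding differences to disappear, you can restore the R-side encodings before refitting (a sketch to run via %R in the notebook; resp_lmer2 is an illustrative name). ordered() recreates the month.L/.Q/.C polynomial contrasts, and relevel() makes "poor" the reference so the baseline coefficient has the same sign as in pure R:
df$month <- ordered(df$month)                              # ordered factor -> polynomial contrasts
df$baseline <- relevel(factor(df$baseline), ref = "poor")  # report baselinegood, as in pure R
resp_lmer2 <- glmer(status ~ baseline + month + treatment + gender + age +
                      centre + (1 | subject), family = binomial(), data = df)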

p-values of mu parameter in gamlss

I'm trying to fit an inflated beta regression model to proportion data. I'm using the package gamlss and specifying the family BEINF. I'm wondering how I can extract the p-values of the $mu.coefficients. When I typed the command fit.3$mu.coefficients (as shown at the bottom of my R code), it gave me only the estimates of the mu coefficients. The following is an example of my data.
mydata = data.frame(y = c(0.014931087, 0.003880983, 0.006048048, 0.014931087,
                          0.016469269, 0.013111447, 0.012715517, 0.007981377),
                    index = c(1, 1, 2, 2, 3, 3, 4, 4))
mydata
            y index
1 0.004517611     1
2 0.004351405     1
3 0.007952064     2
4 0.004517611     2
5 0.003434018     3
6 0.003602046     4
7 0.002370690     4
8 0.002993016     4
> library(gamlss)
> fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
> summary(fit.3)
*******************************************************************
Family:  c("BEINF", "Beta Inflated")

Call:  gamlss(formula = y ~ factor(index), family = BEINF, data = mydata)

Fitting method: RS()
-------------------------------------------------------------------
Mu link function:  logit
Mu Coefficients:
               Estimate Std. Error t value  Pr(>|t|)
(Intercept)     -5.3994     0.1204 -44.858 1.477e-06
factor(index)2   0.2995     0.1591   1.883 1.329e-01
factor(index)3  -0.2288     0.1805  -1.267 2.739e-01
factor(index)4  -0.5017     0.1952  -2.570 6.197e-02
-------------------------------------------------------------------
Sigma link function:  logit
Sigma Coefficients:
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -4.456     0.2514  -17.72 4.492e-07
-------------------------------------------------------------------
Nu link function:  log
Nu Coefficients:
            Estimate Std. Error   t value Pr(>|t|)
(Intercept)   -21.54      10194 -0.002113   0.9984
-------------------------------------------------------------------
Tau link function:  log
Tau Coefficients:
            Estimate Std. Error   t value Pr(>|t|)
(Intercept)   -21.63      10666 -0.002028   0.9984
-------------------------------------------------------------------
No. of observations in the fit:  8
Degrees of Freedom for the fit:  7
      Residual Deg. of Freedom:  1
                      at cycle:  12

Global Deviance:     -93.08548
            AIC:     -79.08548
            SBC:     -78.52938
*******************************************************************
fit.3$mu.coefficients
   (Intercept) factor(index)2 factor(index)3 factor(index)4
    -5.3994238      0.2994738     -0.2287571     -0.5016511
I really appreciate all your help.
Use the save option in summary.gamlss, like this for your model above:
fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
sfit.3 <- summary(fit.3, save = TRUE)
sfit.3$mu.coef.table
sfit.3$sigma.coef.table
# to get a list of all the slots in the object
str(sfit.3)
fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
sfit.3 <- summary(fit.3, save = TRUE)
fit.3$mu.coefficients
sfit.3$coef.table  # index the saved table with brackets []
estimate.pval <- data.frame(
  Intercept = sfit.3$coef.table[1, 1], pvalue = sfit.3$coef.table[1, 4],
  "factor(index)2" = sfit.3$coef.table[2, 1], pvalue2 = sfit.3$coef.table[2, 4],
  "factor(index)3" = sfit.3$coef.table[3, 1], pvalue3 = sfit.3$coef.table[3, 4],
  "factor(index)4" = sfit.3$coef.table[4, 1], pvalue4 = sfit.3$coef.table[4, 4])
