I am trying to duplicate the results of pure R code that uses lme4's glmer, going instead through Pandas -> R -> glmer. The original R code is
%load_ext rpy2.ipython
%R library(lme4)
%R data("respiratory", package = "HSAUR2")
%R write.csv(respiratory, 'respiratory2.csv')
%R resp <- subset(respiratory, month > "0")
%R resp$baseline <- rep(subset(respiratory, month == "0")$status,rep(4, 111))
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = resp)
%R -o resp_lmer_summary resp_lmer_summary = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print(resp_lmer_summary)
print(exp_res)
The output is
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
(1 | subject)
Data: resp
AIC BIC logLik deviance df.resid
446.6 487.6 -213.3 426.6 434
Scaled residuals:
Min 1Q Median 3Q Max
-2.5855 -0.3609 0.1430 0.3640 2.2119
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 3.779 1.944
Number of obs: 444, groups: subject, 111
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.65460 0.77621 -2.132 0.0330 *
baselinegood 3.08897 0.59859 5.160 2.46e-07 ***
month.L -0.20348 0.27957 -0.728 0.4667
month.Q -0.02821 0.27907 -0.101 0.9195
month.C -0.35571 0.28085 -1.267 0.2053
treatmenttreatment 2.16620 0.55157 3.927 8.59e-05 ***
gendermale 0.23836 0.66606 0.358 0.7204
age -0.02557 0.01994 -1.283 0.1997
centre2 1.03850 0.54182 1.917 0.0553 .
...
On the other hand, when I read the file through Pandas and pass it to glmer in R through rmagic, I get
import pandas as pd
df = pd.read_csv('respiratory2.csv',index_col=0)
baseline = df[df['month'] == 0][['subject','status']].set_index('subject')
df['status'] = (df['status'] == 'good').astype(int)
df['baseline'] = df.apply(lambda x: baseline.loc[x['subject'], 'status'], axis=1)
df['centre'] = df['centre'].astype(str)
%R -i df
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = df)
%R -o res res = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print(res)
Output
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
(1 | subject)
Data: df
AIC BIC logLik deviance df.resid
539.2 573.7 -261.6 523.2 547
Scaled residuals:
Min 1Q Median 3Q Max
-3.8025 -0.4123 0.1637 0.4295 2.4482
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 1.829 1.353
Number of obs: 555, groups: subject, 111
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.39252 0.60229 2.312 0.0208 *
baselinepoor -3.42262 0.46095 -7.425 1.13e-13 ***
month 0.12730 0.08465 1.504 0.1326
treatmenttreatment 1.59332 0.39981 3.985 6.74e-05 ***
gendermale 0.12915 0.49291 0.262 0.7933
age -0.01833 0.01480 -1.238 0.2157
centre2 0.70520 0.39676 1.777 0.0755 .
The results are somewhat different. When R reads the file itself, it turns month into an ordered factor (hence the month.L, month.Q, month.C polynomial contrasts in the first output), whereas from Pandas -> R this column arrives as a numeric value; maybe that's the difference? I believe I duplicated the derived baseline column correctly; I did have to turn status into a 1/0 numeric value, however, whereas pure R can work with this column as a string (good/poor).
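If the ordered-factor coding is the culprit, one untested fix is to convert month to an ordered factor on the R side after the -i transfer, so that the same month.L/month.Q/month.C polynomial contrasts are generated:
%R df$month <- ordered(df$month)
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject), family = binomial(), data = df)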
Note: Correction - I missed the filtering condition in the Python part, where only rows with month > 0 are kept. Once that is done
df = df[df['month'] > 0]
Then the treatmenttreatment coefficient is 2.16, close to pure R. R still displays a positive baselinegood coefficient whereas Pandas -> R displays baselinepoor with a negative coefficient, but that is just the opposite reference level for the same contrast.
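To make the Pandas run report baselinegood like the pure R fit, one option (again untested, assuming the status factor in HSAUR2 has poor as its first level) is to fix the level order of baseline on the R side before refitting:
%R df$baseline <- factor(df$baseline, levels = c("poor", "good"))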
Related
I am using lmerTest::lmer() to perform linear regression with repeated measures data.
My model contains a fixed effect (factor with 5 levels) and a random effect (subject):
library(lmerTest)
model_lm <- lmer(likertscore ~ task.f + (1 | subject), data = df_long)
I would like to include the total number of observations, the number of subjects, total R^2, and the R^2 of the fixed effects in the regression table which I generate with modelsummary().
I tried to extract these and build a gof_map as described by the author of the package but did not succeed.
Below are my model output from lmerTest::lmer() and the performance measures obtained:
Linear mixed model fit by REML ['lmerModLmerTest']
Formula: likertscore ~ factor + (1 | subject)
Data: df_long
REML criterion at convergence: 6674.915
Random effects:
Groups Name Std.Dev.
subject (Intercept) 1.076
Residual 1.514
Number of obs: 1715, groups: subject, 245
Fixed Effects:
(Intercept) factor1 factor2
3.8262 1.5988 0.3388
factor3 factor4 factor5
-0.7224 -0.1061 -1.1102
library("performance")
performance::model_performance(my_model)
# Indices of model performance
AIC | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma
-----------------------------------------------------------------
6692.91 | 6741.94 | 0.46 | 0.18 | 0.34 | 1.42 | 1.51
The problem is that one of your statistics is not available by default in glance or performance, which means that you will need to do a bit of legwork to customize the output.
First, we load the libraries and estimate the model:
library(modelsummary)
library(lmerTest)
mod <- lmer(mpg ~ hp + (1 | cyl), data = mtcars)
Then, we check what goodness-of-fit statistics are available out-of-the-box using the get_gof function from the modelsummary package:
get_gof(mod)
#> aic bic r2.conditional r2.marginal icc rmse sigma nobs
#> 1 181.8949 187.7578 0.6744743 0.1432201 0.6200592 2.957141 3.149127 32
You'll notice that there is no N (subjects) statistic there, so we need to add it manually. One reproducible way to do this is to leverage the glance_custom mechanism described in the modelsummary documentation. To do this, we need to know the class of our model:
class(mod)[1]
#> [1] "lmerModLmerTest"
Then, we need to define a method for this class name. This method should be called glance_custom.CLASSNAME. In lmerModLmerTest models, the number of groups can be retrieved by getting the ngrps object in the summary. So we do this:
glance_custom.lmerModLmerTest <- function(x, ...) {
s <- summary(x)
out <- data.frame(ngrps = s$ngrps)
out
}
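If the method is picked up, calling get_gof(mod) again should now include an ngrps column alongside the defaults (a quick sanity check; output omitted):
# ngrps should now appear next to aic, bic, r2.conditional, etc.
get_gof(mod)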
Finally, we use the gof_map argument to format the result how you want it:
gm <- list(
list(raw = "nobs", clean = "N", fmt = 0),
list(raw = "ngrps", clean = "N (subjects)", fmt = 0),
list(raw = "r2.conditional", clean = "R2 (conditional)", fmt = 0),
list(raw = "r2.marginal", clean = "R2 (marginal)", fmt = 0),
list(raw = "aic", clean = "AIC", fmt = 3)
)
modelsummary(mod, gof_map = gm)
                   Model 1
(Intercept)        24.708
                   (3.132)
hp                 -0.030
                   (0.015)
N                  32
N (subjects)       3
R2 (conditional)   1
R2 (marginal)      0
AIC                181.895
I'm performing a polynomial regression and testing a linear combination of the coefficients, but I run into problems when I try to test that linear combination.
LnModel_1 <- lm(formula = PROF ~ UI_1+UI_2+I(UI_1^2)+UI_1:UI_2+I(UI_2^2))
summary(LnModel_1)
It outputs the values below:
Call:
lm(formula = PROF ~ UI_1 + UI_2 + I(UI_1^2) + UI_1:UI_2 + I(UI_2^2))
Residuals:
Min 1Q Median 3Q Max
-3.4492 -0.5405 0.1096 0.4226 1.7346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.66274 0.06444 72.354 < 2e-16 ***
UI_1 0.25665 0.07009 3.662 0.000278 ***
UI_2 0.25569 0.09221 2.773 0.005775 **
I(UI_1^2) -0.15168 0.04490 -3.378 0.000789 ***
I(UI_2^2) -0.08418 0.05162 -1.631 0.103643
UI_1:UI_2 -0.02849 0.05453 -0.522 0.601621
Then I use names(coef()) to extract the coefficient names
names(coef(LnModel_1))
output:
[1] "(Intercept)" "UI_1"        "UI_2"        "I(UI_1^2)"   "I(UI_2^2)"   "UI_1:UI_2"
For some reason, when I use glht(), it gives me an error on UI_2^2:
slope <- glht(LnModel_1, linfct = c("UI_2 + UI_1:UI_2*2.5 + 2*2.5*I(UI_2^2) = 0"))
Output:
Error: multcomp:::expression2coef::walkCode::eval: within ‘UI_2^2’, the term
‘UI_2’ must not denote an effect. Apart from that, the term must evaluate to
a real valued constant
I don't know why it gives this error message. How can I pass the I(UI_2^2) coefficient to glht()?
Thank you very much.
The issue seems to be that I(UI_2^2) is interpreted as an expression by R, in the same fashion as when you wrote LnModel_1 <- lm(formula = PROF ~ UI_1 + UI_2 + I(UI_1^2) + UI_1:UI_2 + I(UI_2^2))
Therefore, you need to tell R that I(UI_2^2) is a single coefficient name, by backquoting it inside the string:
slope <- glht(LnModel_1, linfct = c("UI_2 + UI_1:UI_2*2.5 + 2*2.5*`I(UI_2^2)` = 0"))
Check my example (since I cannot reproduce your problem):
library(multcomp)
library(data.table)  # for copy() and setnames()
cars <- copy(mtcars)
setnames(cars, "disp", "UI_2")
model <- lm(mpg ~ I(UI_2^2), data = cars)
names(coef(model))
slope <- glht(model, linfct = c("2*2.5*`I(UI_2^2)` = 0"))
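Applied to your original model, the full linear combination would then look something like this (untested; the interaction coefficient UI_1:UI_2 may also need backquoting so that the colon is not parsed as an operator):
slope <- glht(LnModel_1, linfct = c("UI_2 + 2.5*`UI_1:UI_2` + 2*2.5*`I(UI_2^2)` = 0"))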
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I drop the time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow running the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects estimation reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
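For reference, this is the fixed-effects specification I understand part (iv) to be asking for (an untested sketch; with only two years, city fixed effects plus a year dummy should reproduce the first-difference slope estimates, assuming the panel index variables are city and year):
modelfe <- plm(lrent ~ factor(year) + lpop + lavginc + pctstu,
               data = data, index = c("city", "year"),
               model = "within", effect = "individual")
summary(modelfe)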
This is my first time asking here.
I have trouble generating the slope dummy variables only (without intercept dummies).
However, if I multiply the dummy variable by the independent variable as shown below, both slope-dummy and intercept-dummy terms are reported.
I want to incorporate only the slope dummies and exclude the intercept dummies.
I will appreciate your help.
Best,
yjkim
reg <- lm(year ~ as.factor(age)*log(v1269))
Call:
lm(formula = year ~ as.factor(age) * log(v1269))
Residuals:
Min 1Q Median 3Q Max
-6.083 -1.177 1.268 1.546 3.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.18076 2.16089 2.398 0.0167 *
as.factor(age)2 1.93989 2.75892 0.703 0.4821
as.factor(age)3 2.46861 2.39393 1.031 0.3027
as.factor(age)4 -0.56274 2.30123 -0.245 0.8069
log(v1269) -0.06788 0.23606 -0.288 0.7737
as.factor(age)2:log(v1269) -0.15628 0.29621 -0.528 0.5979
as.factor(age)3:log(v1269) -0.14961 0.25809 -0.580 0.5622
as.factor(age)4:log(v1269) 0.16534 0.24884 0.664 0.5065
You just need a -1 within the formula:
reg <- lm(year ~ as.factor(age)*log(v1269) -1)
If you want to estimate a different slope in each level of age, then you can use the %in% operator in the formula:
set.seed(1)
df <- data.frame(age = factor(sample(1:4, 100, replace = TRUE)),
v1269 = rlnorm(100),
year = rnorm(100))
m <- lm(year ~ log(v1269) %in% age, data = df)
summary(m)
This gives (for this entirely random, dummy, silly data set):
> summary(m)
Call:
lm(formula = year ~ log(v1269) %in% age, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.93108 -0.66402 -0.05921 0.68040 2.25244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02692 0.10705 0.251 0.802
log(v1269):age1 0.20127 0.21178 0.950 0.344
log(v1269):age2 -0.01431 0.24116 -0.059 0.953
log(v1269):age3 -0.02588 0.24435 -0.106 0.916
log(v1269):age4 0.06019 0.21979 0.274 0.785
Residual standard error: 1.065 on 95 degrees of freedom
Multiple R-squared: 0.01037, Adjusted R-squared: -0.0313
F-statistic: 0.2489 on 4 and 95 DF, p-value: 0.9097
Note that this fits a single constant term plus 4 different effects of log(v1269), one per level of age. Visually, this is sort of what the model is doing
pred <- with(df,
expand.grid(age = factor(1:4),
v1269 = seq(min(v1269), max(v1269), length = 100)))
pred <- transform(pred, fitted = predict(m, newdata = pred))
library("ggplot2")
ggplot(df, aes(x = log(v1269), y = year, colour = age)) +
geom_point() +
geom_line(data = pred, mapping = aes(y = fitted)) +
theme_bw() + theme(legend.position = "top")
Clearly, this would only be suitable if there was no significant difference in the mean values of year (the response) in the different age categories.
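One quick way to check that (a sketch using the same simulated df) is to compare the slope-only model against one that also lets the intercept vary with age:
# does allowing per-age intercepts significantly improve the fit?
m_int <- lm(year ~ age + log(v1269) %in% age, data = df)
anova(m, m_int)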
Note that a different parameterisation of the same model can be achieved via the / operator:
m2 <- lm(year ~ log(v1269)/age, data = df)
> m2
Call:
lm(formula = year ~ log(v1269)/age, data = df)
Coefficients:
(Intercept) log(v1269) log(v1269):age2 log(v1269):age3
0.02692 0.20127 -0.21559 -0.22715
log(v1269):age4
-0.14108
Note that now, the first log(v1269) term is the slope for age == 1, whilst the other terms are the adjustments that must be applied to the log(v1269) term to get the slope for the indicated group:
coef(m)[-1]
coef(m2)[2] + c(0, coef(m2)[-(1:2)])
> coef(m)[-1]
log(v1269):age1 log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
> coef(m2)[2] + c(0, coef(m2)[-(1:2)])
log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
But they work out to the same estimated slopes.
I'm trying the following model with the lme4 package:
library(nlme) # for the data
data("Machines") # the data
library(lme4)
# the model:
fit1 <- lmer(score ~ -1 + Machine + (1|Worker), data=Machines)
summary(fit1)
> summary(fit1)
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ -1 + Machine + (1 | Worker)
Data: Machines
REML criterion at convergence: 286.9
Scaled residuals:
Min 1Q Median 3Q Max
-2.7249 -0.5233 0.1328 0.6513 1.7559
Random effects:
Groups Name Variance Std.Dev.
Worker (Intercept) 26.487 5.147
Residual 9.996 3.162
Number of obs: 54, groups: Worker, 6
Fixed effects:
Estimate Std. Error t value
MachineA 52.356 2.229 23.48
MachineB 60.322 2.229 27.06
MachineC 66.272 2.229 29.73
Correlation of Fixed Effects:
MachnA MachnB
MachineB 0.888
MachineC 0.888 0.888
I now try to fit the same model using rstan through the glmer2stan package:
library(glmer2stan)
Machines$Machine_idx <- as.numeric(Machines$Machine)
Machines$Worker_idx <- as.numeric(as.character(Machines$Worker))
fit3 <- lmer2stan(score ~ -1 + Machine_idx + (1|Worker_idx), data=Machines)
this is the result
> stanmer(fit3)
glmer2stan model: score ~ -1 + Machine_idx + (1 | Worker_idx) [gaussian]
Level 1 estimates:
Expectation StdDev 2.5% 97.5%
Machine_idx 7.04 0.55 5.95 8.08
sigma 3.26 0.35 2.66 4.02
Level 2 estimates:
(Std.dev. and correlations)
Group: Worker_idx (6 groups / imbalance: 0)
(Intercept) 55.09 (SE 15.82)
DIC: 287 pDIC: 7.9 Deviance: 271.3
I don't think that's the same model. Is my glmer2stan specification wrong?
I know that glmer2stan is not actively developed any more but it should handle this simple model, shouldn't it?
UPDATE:
Thanks to the tip by Roland, I changed the Machine factor levels to dummies and it now works:
Machines$Worker <- as.numeric(as.character(Machines$Worker))  # keep Worker numeric so model.matrix() does not expand it
m <- model.matrix(~ 0 + ., Machines)  # expands Machine into MachineA/MachineB/MachineC dummy columns
m <- as.data.frame(m)
fit3 <- lmer2stan(score ~ -1 + (1|Worker) + MachineA + MachineB + MachineC, data = m, chains = 2)