Confidence intervals for predicted probabilities in mixed-effect regression? - r

I'm working with mixed-effect logistic regression models with a single random effect (fitted with glmer), and I am struggling to find a way to produce predicted probabilities and their 95% CIs. I have been able to do this for fixed-effect models using the following type of code:
Call:
glm(formula = survive/trials ~ class, family = binomial(logexp(vespdata$expos)),
data = vespdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6823 0.2621 0.4028 0.4540 0.6935
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.6774 0.5796 8.069 7.07e-16 ***
class2 -1.3236 0.6957 -1.903 0.0571 .
class3 -0.5751 0.9170 -0.627 0.5306
class4 -1.0806 0.9217 -1.172 0.2411
class5 -1.2889 0.6564 -1.964 0.0496 *
class6 -1.5379 0.6508 -2.363 0.0181 *
class8 -1.2078 0.6957 -1.736 0.0825 .
vesppredict2 = with(vespdata, data.frame(class = gl(7,1)))
vesppredict2 = cbind(vesppredict2, predict(vespclass.exp, newdata = vesppredict2,
type = "link", se = TRUE))
vesppredict2 = within(vesppredict2,
{PredictedProb = (plogis(fit))^23
LL = (plogis(fit - (1.96 * se.fit)))^23
UL = (plogis(fit + (1.96 * se.fit)))^23
ErrorBar = (UL-PredictedProb)
})
The problem I'm having is that predict() cannot use the argument se = TRUE for mixed-effect models. I tried adding the argument re.form = NA but to no avail. I'd be grateful for any tips!
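One common workaround (sketched below under the assumption that the mixed model is called vespclass.glmer, uses the same fixed-effect formula, and keeps the exposure exponent 23 from the code above; the names and the random term are my assumptions, not from the original) is to build the interval by hand from the fixed-effects design matrix and vcov(), which ignores the uncertainty in the random effects (i.e. a re.form = NA style prediction):
library(lme4)
## hypothetical fit with one random intercept, mirroring the glm call above:
## vespclass.glmer <- glmer(survive/trials ~ class + (1 | site),
##                          family = binomial(logexp(vespdata$expos)), data = vespdata)
X    <- model.matrix(~ class, data = vesppredict2)         # fixed-effects design matrix for the new data
beta <- fixef(vespclass.glmer)                             # fixed-effect estimates
fit  <- as.vector(X %*% beta)                              # predictions on the link (logit) scale
se   <- sqrt(diag(X %*% vcov(vespclass.glmer) %*% t(X)))   # SEs from the fixed-effects covariance matrix only
vesppredict2$PredictedProb <- plogis(fit)^23
vesppredict2$LL <- plogis(fit - 1.96 * se)^23
vesppredict2$UL <- plogis(fit + 1.96 * se)^23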

Related

plot coefficients of a model in R

I am fitting training data with glm() and want to plot the coefficients. However, I have no idea how to produce the plot I want (linked above). Here is what I have so far:
set.seed(1)
trn_index = createDataPartition(y = development$EQUAL_PAY, p = 0.80, list = FALSE)
trn_pay = development[trn_index, ]
tst_pay = development[-trn_index, ]
trn_pay_f <- trn_pay %>%
mutate(EQUAL_PAY = relevel(factor(EQUAL_PAY),ref = "YES"))
pay_lgr = train(EQUAL_PAY ~ .- EQUAL_WORK - COUNTRY, method = "glm", family = binomial(link = "logit"), data = trn_pay_f,trControl = trainControl(method = 'cv', number = 10))
summary(pay_lgr)
##Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315 *
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757 .
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578 .
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289 *
## extract all parameters in a dataframe
pay_lgrFrame <- data.frame(COEFFICIENT = rownames(summary(pay_lgr)$coef),
p_value = summary(pay_lgr)$coef[,4],
z_value = summary(pay_lgr)$coef[,3],
SE = summary(pay_lgr)$coef[,2],
Estimate = summary(pay_lgr)$coef[,1])
## and I am stuck on how to make a plot like the image linked above.
Pulling in your summary table (you can get this directly as ss <- coef(summary(pay_lgr)), but I don't have your data set):
ss <- read.delim(header=TRUE,check.names=FALSE,text="
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289")
Convert row names to a column called term:
ss2 <- tibble::rownames_to_column(ss,"term")
Draw the barplot:
library(ggplot2)
ggplot(ss2, aes(term,Estimate))+
geom_bar(stat="identity")+
coord_flip()
ggsave("bar.png")
As others have commented, there are probably better (both easier and preferable in terms of visual communication) ways to plot the coefficients. The dotwhisker::dwplot() function does several convenient things:
automatically extracts coefficients and plots them
automatically scales continuous predictors by 2*std dev, to enable comparison between coefficients (use by_2sd=FALSE if you don't want this)
automatically leaves out the intercept, which is on a different scale from the other parameters and is rarely of inferential interest
library(dotwhisker)
dwplot(lm(Murder/Population ~ ., data=as.data.frame(state.x77)))
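If you prefer to stay with plain ggplot2, here is a rough sketch of a dot-and-whisker style plot built from the ss2 table above, using approximate Wald 95% intervals (Estimate ± 1.96 * Std. Error); the lwr/upr columns are my additions, not part of the original answer:
library(ggplot2)
ss2$lwr <- ss2$Estimate - 1.96 * ss2[["Std. Error"]]
ss2$upr <- ss2$Estimate + 1.96 * ss2[["Std. Error"]]
ggplot(ss2, aes(x = Estimate, y = term)) +
  geom_pointrange(aes(xmin = lwr, xmax = upr)) +   # point estimate plus interval
  geom_vline(xintercept = 0, linetype = 2)         # reference line at zero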

glm in R, give all comparisons

Simple logistic regression example.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
result <-glm(out~factor(y), family = 'binomial', data=df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.4823 -0.9005 -0.9005 0.9005 1.4823
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01 1.225e+00 -0.566 0.571
#factor(y)B 1.386e+00 1.732e+00 0.800 0.423
#factor(y)C 3.950e-16 1.732e+00 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and C are the reference category. One can change the reference manually, e.g. with relevel() or the levels = argument of factor(), but this would require fitting three separate models. Is it possible to do this in one go, or what would be a more efficient approach?
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
#y is already a factor, if not, coerce before the model fit
result <-glm(out~y, family = 'binomial', data=df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
# Estimate Std. Error z value Pr(>|z|)
#B - A == 0 1.386e+00 1.732e+00 0.8 0.703
#C - A == 0 1.923e-16 1.732e+00 0.0 1.000
#C - B == 0 -1.386e+00 1.732e+00 -0.8 0.703
#(Adjusted p values reported -- single-step method)
#letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"

Dummy variable as slope shifter without intercept

This is my first time asking here.
I am having trouble generating slope dummy variables only (without intercept dummies).
When I interact the dummy variable with the independent variable as shown below,
both slope-dummy and intercept-dummy terms appear in the results.
I want to include only the slope dummies and exclude the intercept dummies.
I would appreciate your help.
Best,
yjkim
reg <- lm(year ~ as.factor(age)*log(v1269))
Call:
lm(formula = year ~ as.factor(age) * log(v1269))
Residuals:
Min 1Q Median 3Q Max
-6.083 -1.177 1.268 1.546 3.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.18076 2.16089 2.398 0.0167 *
as.factor(age)2 1.93989 2.75892 0.703 0.4821
as.factor(age)3 2.46861 2.39393 1.031 0.3027
as.factor(age)4 -0.56274 2.30123 -0.245 0.8069
log(v1269) -0.06788 0.23606 -0.288 0.7737
as.factor(age)2:log(v1269) -0.15628 0.29621 -0.528 0.5979
as.factor(age)3:log(v1269) -0.14961 0.25809 -0.580 0.5622
as.factor(age)4:log(v1269) 0.16534 0.24884 0.664 0.5065
You just need a -1 within the formula:
reg <- lm(year ~ as.factor(age)*log(v1269) -1)
If you want to estimate a different slope in each level of age, then you can use the %in% operator in the formula:
set.seed(1)
df <- data.frame(age = factor(sample(1:4, 100, replace = TRUE)),
v1269 = rlnorm(100),
year = rnorm(100))
m <- lm(year ~ log(v1269) %in% age, data = df)
summary(m)
This gives (for this entirely random, dummy, silly data set):
> summary(m)
Call:
lm(formula = year ~ log(v1269) %in% age, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.93108 -0.66402 -0.05921 0.68040 2.25244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02692 0.10705 0.251 0.802
log(v1269):age1 0.20127 0.21178 0.950 0.344
log(v1269):age2 -0.01431 0.24116 -0.059 0.953
log(v1269):age3 -0.02588 0.24435 -0.106 0.916
log(v1269):age4 0.06019 0.21979 0.274 0.785
Residual standard error: 1.065 on 95 degrees of freedom
Multiple R-squared: 0.01037, Adjusted R-squared: -0.0313
F-statistic: 0.2489 on 4 and 95 DF, p-value: 0.9097
Note that this fits a single constant term plus 4 different effects of log(v1269), one per level of age. Visually, this is sort of what the model is doing
pred <- with(df,
expand.grid(age = factor(1:4),
v1269 = seq(min(v1269), max(v1269), length = 100)))
pred <- transform(pred, fitted = predict(m, newdata = pred))
library("ggplot2")
ggplot(df, aes(x = log(v1269), y = year, colour = age)) +
geom_point() +
geom_line(data = pred, mapping = aes(y = fitted)) +
theme_bw() + theme(legend.position = "top")
Clearly, this would only be suitable if there was no significant difference in the mean values of year (the response) in the different age categories.
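(If the group means do differ, a natural extension, sketched here as my own addition rather than part of the original answer, is to add age as a main effect as well, giving each age group its own intercept and its own slope:)
m3 <- lm(year ~ age + log(v1269) %in% age, data = df)  # separate intercept and separate slope per age group
summary(m3)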
Note that a different parameterisation of the same model can be achieved via the / operator:
m2 <- lm(year ~ log(v1269)/age, data = df)
> m2
Call:
lm(formula = year ~ log(v1269)/age, data = df)
Coefficients:
(Intercept) log(v1269) log(v1269):age2 log(v1269):age3
0.02692 0.20127 -0.21559 -0.22715
log(v1269):age4
-0.14108
Note that now, the first log(v1269) term is the slope for age == 1, whilst the other terms are the adjustments that need to be applied to the log(v1269) term to get the slope for the indicated group:
coef(m)[-1]
coef(m2)[2] + c(0, coef(m2)[-(1:2)])
> coef(m)[-1]
log(v1269):age1 log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
> coef(m2)[2] + c(0, coef(m2)[-(1:2)])
log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
But they work out to the same estimated slopes.

How do you predict outcomes from a new dataset using a model created from a different dataset in R?

I could be missing something about prediction -- but my multiple linear regression is seemingly working as expected:
> bigmodel <- lm(score ~ lean + gender + age, data = mydata)
> summary(bigmodel)
Call:
lm(formula = score ~ lean + gender + age, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-25.891 -4.354 0.892 6.240 18.537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.96455 3.85275 18.419 <2e-16 ***
lean 0.62463 0.05938 10.518 <2e-16 ***
genderM -2.24025 1.40362 -1.596 0.1121
age 0.10783 0.06052 1.782 0.0764 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9 on 195 degrees of freedom
Multiple R-squared: 0.4188, Adjusted R-squared: 0.4098
F-statistic: 46.83 on 3 and 195 DF, p-value: < 2.2e-16
> head(predict(bigmodel),20)
1 2 3 4 5 6 7 8 9 10
75.36711 74.43743 77.02533 78.76903 79.95515 79.09251 80.38647 81.65807 80.14846 78.96234
11 12 13 14 15 16 17 18 19 20
82.39052 82.04468 81.05187 81.26753 84.50240 81.80667 80.92169 82.40895 81.76197 82.94809
But I can't wrap my head around the prediction after reading ?predict.lm. This output looks good to me for my original dataset -- but what if I want to run the prediction against a different dataset than the one I used to create bigmodel?
For example, if I import a .csv file into R called newmodel with 200 people complete with leans, gender, and age -- how can I use the regression formula from bigmodel to produce predictions for newmodel?
Thanks!
If you read the documentation for predict.lm, you will see the following. So, use the newdata argument to pass the newmodel data you imported to get predictions.
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Arguments
object
Object of class inheriting from "lm"
newdata
An optional data frame in which to look for variables with which to predict.
If omitted, the fitted values are used.
UPDATE. On the question of exporting data with predictions, here is how you can do it.
predictions = cbind(newmodel, pred = predict(bigmodel, newdata = newmodel))
write.csv(predictions, 'predictions.csv', row.names = F)
UPDATE 2. A full minimal reproducible example:
bigmodel <- lm(mpg ~ wt, data = mtcars)
newdata = data.frame(wt = runif(20, min = 1.5, max = 6))
cbind(
newdata,
mpg = predict(bigmodel, newdata = newdata)
)
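If you also want uncertainty around those new predictions, predict.lm takes the interval argument shown in the usage above; for example, pointwise 95% prediction intervals:
cbind(
newdata,
predict(bigmodel, newdata = newdata, interval = "prediction")  # columns fit, lwr, upr
)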

p-values of mu parameter in gamlss

I'm trying to fit an inflated beta regression model to proportional data. I'm using the package gamlss and specifying the family BEINF. I'm wondering how I can extract the p-values of the $mu.coefficients. When I typed the command fit.3$mu.coefficients (as shown at the bottom of my R code), it gave me only the estimates of the Mu coefficients. The following is an example of my data.
mydata = data.frame(y = c(0.014931087, 0.003880983, 0.006048048, 0.014931087,
0.016469269, 0.013111447, 0.012715517, 0.007981377), index = c(1,1,2,2,3,3,4,4))
mydata
y index
1 0.004517611 1
2 0.004351405 1
3 0.007952064 2
4 0.004517611 2
5 0.003434018 3
6 0.003602046 4
7 0.002370690 4
8 0.002993016 4
> library(gamlss)
> fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
> summary(fit.3)
*******************************************************************
Family: c("BEINF", "Beta Inflated")
Call:
gamlss(formula = y ~ factor(index), family = BEINF, data = mydata)
Fitting method: RS()
-------------------------------------------------------------------
Mu link function: logit
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.3994 0.1204 -44.858 1.477e-06
factor(index)2 0.2995 0.1591 1.883 1.329e-01
factor(index)3 -0.2288 0.1805 -1.267 2.739e-01
factor(index)4 -0.5017 0.1952 -2.570 6.197e-02
-------------------------------------------------------------------
Sigma link function: logit
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.456 0.2514 -17.72 4.492e-07
-------------------------------------------------------------------
Nu link function: log
Nu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.54 10194 -0.002113 0.9984
-------------------------------------------------------------------
Tau link function: log
Tau Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.63 10666 -0.002028 0.9984
-------------------------------------------------------------------
No. of observations in the fit: 8
Degrees of Freedom for the fit: 7
Residual Deg. of Freedom: 1
at cycle: 12
Global Deviance: -93.08548
AIC: -79.08548
SBC: -78.52938
*******************************************************************
fit.3$mu.coefficients
(Intercept) factor(index)2 factor(index)3 factor(index)4
-5.3994238 0.2994738 -0.2287571 -0.5016511
I really appreciate all your help.
Use the save option in summary.gamlss, like this for your model above:
fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
sfit.3<-summary(fit.3, save=TRUE)
sfit.3$mu.coef.table
sfit.3$sigma.coef.table
#to get a list of all the slots in the object
str(sfit.3)
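If you only need the mu estimates and p-values, you can also index the saved mu table directly (a small sketch; I'm assuming the columns are in the order printed in the summary: Estimate, Std. Error, t value, Pr(>|t|)):
mu.tab <- sfit.3$mu.coef.table
data.frame(term = rownames(mu.tab),
estimate = mu.tab[, 1],   # first column: Estimate
p.value = mu.tab[, 4])    # fourth column: Pr(>|t|)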
fit.3 = gamlss(y ~ factor(index), family = BEINF, data = mydata)
sfit.3 <- summary(fit.3, save=TRUE)
fit.3$mu.coefficients
sfit.3$coef.table # index the table with brackets [] to pull out single values
estimate.pval <- data.frame(Intercept = sfit.3$coef.table[1,1], pvalue1 = sfit.3$coef.table[1,4],
"factor(index)2" = sfit.3$coef.table[2,1], pvalue2 = sfit.3$coef.table[2,4],
"factor(index)3" = sfit.3$coef.table[3,1], pvalue3 = sfit.3$coef.table[3,4],
"factor(index)4" = sfit.3$coef.table[4,1], pvalue4 = sfit.3$coef.table[4,4])
