Set minimum and maximum for plotting predicted values of regression - r

I ran a regression that looks as follows:
fit <- lmer(support ~ income + (1 | country), data = df)
When using summary(df), it shows me that for income, the minimum is -2.4 and the maximum is 2.6.
I would like to plot the predicted values. I tried by using the following code:
library(ggeffects)
library(ggplot2)
p1 <- ggpredict(fit, terms = "income")
ggplot(p1, aes(x, predicted)) + geom_line() + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1)
However, the plot goes from -3 to 3. How can I set the minimum and maximum values for the plot? I tried with min and max, but it did not work.

By default, for continuous variables, a "pretty" range is chosen for the x-axis. This may include values that don't appear in the data. Using [all] might work instead: in the example below, the x values in the second call range from 0.1 to 2.5 (the observed range) instead of 0 to 2.6.
library(ggeffects)
data(iris)
m <- lm(Sepal.Length ~ Petal.Width, data = iris)
ggpredict(m, "Petal.Width")
#>
#> # Predicted values of Sepal.Length
#> # x = Petal.Width
#>
#> x predicted std.error conf.low conf.high
#> 0.0 4.778 0.073 4.635 4.921
#> 0.4 5.133 0.057 5.022 5.244
#> 0.6 5.311 0.050 5.213 5.408
#> 1.0 5.666 0.040 5.587 5.745
#> 1.4 6.022 0.040 5.943 6.101
#> 1.6 6.199 0.044 6.113 6.286
#> 2.0 6.555 0.057 6.444 6.666
#> 2.6 7.088 0.082 6.927 7.248
ggpredict(m, "Petal.Width [all]")
#>
#> # Predicted values of Sepal.Length
#> # x = Petal.Width
#>
#> x predicted std.error conf.low conf.high
#> 0.1 4.866 0.069 4.732 5.001
#> 0.4 5.133 0.057 5.022 5.244
#> 0.6 5.311 0.050 5.213 5.408
#> 1.2 5.844 0.039 5.767 5.920
#> 1.5 6.110 0.042 6.028 6.193
#> 1.7 6.288 0.047 6.197 6.380
#> 2.0 6.555 0.057 6.444 6.666
#> 2.5 6.999 0.077 6.847 7.151
Created on 2019-03-29 by the reprex package (v0.2.1)
This vignette could be helpful, too.
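If the grid should span exactly the observed range, another option is to build the prediction grid by hand. The following is a minimal sketch with a plain lm and base R's predict(), not the ggeffects approach from the answer. For an lmer model, predict() does not return confidence intervals, so this only illustrates the idea of controlling the x range yourself.

```r
# Sketch: predictions over exactly the observed range of the predictor.
m <- lm(Sepal.Length ~ Petal.Width, data = iris)

# Grid from the observed minimum to the observed maximum
x_rng <- range(iris$Petal.Width)
grid <- data.frame(Petal.Width = seq(x_rng[1], x_rng[2], length.out = 50))

# predict.lm() returns the columns fit, lwr and upr for interval = "confidence"
pred <- predict(m, newdata = grid, interval = "confidence")
p <- cbind(grid, as.data.frame(pred))

range(p$Petal.Width)
```

The resulting data frame can be passed to ggplot with aes(Petal.Width, fit) and geom_ribbon(aes(ymin = lwr, ymax = upr)), in the same way as the ggpredict output in the question.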

Related

Missing standard errors and confidence for mixed model and ggeffects

I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
x = seq(-0.1, 0.1, length.out = n),
g = floor(runif(n) * 3),
f = letters[1 + g] %>% as.factor(),
y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list.
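To check whether the link-to-response translation is the culprit, as the question suggests, intervals can be computed on the link scale by hand and then back-transformed. The sketch below uses a plain glm on simulated data (an assumption for illustration; it has no random effect and is not the asker's model), so it shows the mechanics only.

```r
# Sketch: confidence intervals on the link (logit) scale, back-transformed
# to probabilities. Simulated data, fixed effects only.
set.seed(1)
n <- 200
d <- data.frame(x = seq(-0.1, 0.1, length.out = n))
d$z <- rbinom(n, 1, plogis(-4.5 * d$x))

m <- glm(z ~ x, data = d, family = binomial)

new <- data.frame(x = c(-0.1, 0, 0.1))
lp <- predict(m, newdata = new, type = "link", se.fit = TRUE)

crit <- qnorm(0.975)  # normal quantile for a 95% interval
out <- data.frame(
  x = new$x,
  predicted = plogis(lp$fit),
  conf.low  = plogis(lp$fit - crit * lp$se.fit),
  conf.high = plogis(lp$fit + crit * lp$se.fit)
)
out
```

If intervals computed this way are complete for every row, the missing values are more likely to come from how the package assembles standard errors than from the back-transformation itself.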

Mantel Haenszel stratified estimates in survival data in R?

I've got survival data with an outcome, an exposure, and a variable I'd like to stratify on, but I can't find a function that computes a Mantel-Haenszel rate ratio. For example, with the lung dataset from survival, I'd like to look at the outcome status by sex, stratified by age. I've set up age brackets with
library(tidyverse)
library(survival)
lung2 <- lung %>%
mutate(agecat = as.factor(case_when(age < 50 ~ 0,
age < 70 ~ 1,
age >= 70 ~ 2)))
epi.2by2 from epiR gets me close with
library(epiR)
epi.2by2(table(as.factor(lung2$status),
as.factor(lung2$sex),
lung2$agecat),
method = "cohort.count")
#> Outcome + Outcome - Total Inc risk * Odds
#> Exposed + 26 37 63 41.3 0.703
#> Exposed - 112 53 165 67.9 2.113
#> Total 138 90 228 60.5 1.533
#>
#>
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio (crude) 0.61 (0.44, 0.83)
#> Inc risk ratio (M-H) 0.61 (0.45, 0.84)
#> Inc risk ratio (crude:M-H) 0.99
#> Odds ratio (crude) 0.33 (0.18, 0.61)
#> Odds ratio (M-H) 0.34 (0.19, 0.63)
#> Odds ratio (crude:M-H) 0.97
#> Attrib risk in the exposed (crude) * -26.61 (-40.70, -12.52)
#> Attrib risk in the exposed (M-H) * -25.88 (-42.58, -9.19)
#> Attrib risk (crude:M-H) 1.03
#> -------------------------------------------------------------------
#> M-H test of homogeneity of PRs: chi2(2) = 0.191 Pr>chi2 = 0.909
#> M-H test of homogeneity of ORs: chi2(2) = 0.394 Pr>chi2 = 0.821
#> Test that M-H adjusted OR = 1: chi2(1) = 12.299 Pr>chi2 = <0.001
#> Wald confidence limits
#> M-H: Mantel-Haenszel; CI: confidence interval
#> * Outcomes per 100 population units
But it doesn't take time-to-event data (rates) into consideration. It has a method = "cohort.time" option, but I can't seem to get it to work. Ultimately, the output I'd like would be similar to Stata's stmh sex, by(agecat), which would give risk ratio estimates for each stratum with upper and lower 95% confidence limits, as well as an overall estimate of the risk ratio with chi-square and p-value. mhor from epiDisplay gives output close to what I'm looking for
library(epiDisplay)
mhor(lung2$status,
lung2$sex,
lung2$agecat, design = "cohort",
graph = FALSE)
#>
#> Stratified analysis by Var3
#> OR lower lim. upper lim. P value
#> Var3 0 0.476 0.0532 3.786 0.653417
#> Var3 1 0.363 0.1642 0.788 0.006387
#> Var3 2 0.243 0.0424 1.222 0.060217
#> M-H combined 0.344 0.1880 0.631 0.000453
#>
#> M-H Chi2(1) = 12.3 , P value = 0
#> Homogeneity test, chi-squared 2 d.f. = 0.38 , P value = 0.828
but it only gives odds ratios and not rate ratios.
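One way to get at least the pooled point estimate is to compute the Mantel-Haenszel rate ratio directly from events and person-time per stratum. The sketch below is a hand-rolled helper (mh_rate_ratio is a hypothetical name, not a function from epiR or epiDisplay), run on made-up aggregated numbers. It returns only the point estimate, with no confidence interval or test.

```r
# Sketch: Mantel-Haenszel pooled rate ratio, exposed vs. unexposed.
# 'event' is 0/1 (or a count), 'time' is person-time, 'exposed' is logical.
mh_rate_ratio <- function(event, time, exposed, stratum) {
  strata <- split(data.frame(event, time, exposed), stratum)
  parts <- sapply(strata, function(s) {
    a  <- sum(s$event[s$exposed])    # events among exposed
    b  <- sum(s$event[!s$exposed])   # events among unexposed
    t1 <- sum(s$time[s$exposed])     # person-time, exposed
    t0 <- sum(s$time[!s$exposed])    # person-time, unexposed
    c(num = a * t0 / (t1 + t0), den = b * t1 / (t1 + t0))
  })
  sum(parts["num", ]) / sum(parts["den", ])
}

# Made-up aggregated data: two strata, one row per exposure group
toy <- data.frame(
  event   = c(2, 4, 1, 3),
  time    = c(100, 100, 50, 150),
  exposed = c(TRUE, FALSE, TRUE, FALSE),
  stratum = c("A", "A", "B", "B")
)
rr <- mh_rate_ratio(toy$event, toy$time, toy$exposed, toy$stratum)
rr
```

With the data from the question, a call along the lines of mh_rate_ratio(lung2$status == 2, lung2$time, lung2$sex == 2, lung2$agecat) should work, assuming status == 2 codes death and sex == 2 is the group treated as exposed.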

ROC curves using pROC on R: Calculating lab value a threshold equates to

I am using pROC to provide the ROC analysis of blood tests. I have calculated the ROC curve, AUC and am using the ci.coords function to provide the spec, sens, PPV and NPV at a provided specificity (with 95% CI).
I would like to be able to say at what value of the blood test this is, for instance: at 1.2 the sens is x, spec is y, NPV is c, PPV is d. Ideally I would have the data for a table like:
Lab value | Sens | Spec | NPV | PPV
I don't seem to be able to get this from the methodology I am currently using?
Does anyone have any suggestions?
Many thanks
Currently
spred1 = predict(smodel1)
sroc1 = roc(EditedDF1$any_abnormality, spred1)
ci.coords(sroc1, x=0.95, input="sensitivity", transpose = FALSE, ret=c("sensitivity","specificity","ppv","npv"))
As you gave no reproducible example let's use the one that comes with the package
library(pROC)
data(aSAH)
roc1 <- roc(aSAH$outcome, aSAH$s100b)
The package comes with the function coords, which lists specificity and sensitivity at different thresholds:
> coords(roc1)
threshold specificity sensitivity
1 -Inf 0.00000000 1.00000000
2 0.035 0.00000000 0.97560976
3 0.045 0.06944444 0.97560976
4 0.055 0.11111111 0.97560976
5 0.065 0.13888889 0.97560976
6 0.075 0.22222222 0.90243902
7 0.085 0.30555556 0.87804878
8 0.095 0.38888889 0.82926829
9 0.105 0.48611111 0.78048780
10 0.115 0.54166667 0.75609756
...
From there you can use the function ci.coords that you already have used to complete the table by whatever data you desire.
library(tidyverse)
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
data(aSAH)
roc <- roc(aSAH$outcome, aSAH$s100b,
levels = c("Good", "Poor")
)
#> Setting direction: controls < cases
tibble(threshold = seq(0, 1, by = 0.1)) %>%
mutate(
data = threshold %>% map(~ {
res <- roc %>% ci.coords(x = .x, ret = c("sensitivity", "specificity", "ppv", "npv"))
# element 3 of each entry is the upper (97.5%) bound; use [[2]] for the median
list(
sens = res$sensitivity[[3]],
spec = res$specificity[[3]],
ppv = res$ppv[[3]],
npv = res$npv[[3]]
)
})
) %>%
unnest_wider(data)
#> # A tibble: 11 x 5
#> threshold sens spec ppv npv
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 0.363 NA
#> 2 0.1 0.927 0.5 0.5 0.917
#> 3 0.2 0.780 0.903 0.784 0.867
#> 4 0.3 0.634 0.917 0.769 0.8
#> 5 0.4 0.561 0.958 0.85 0.782
#> 6 0.5 0.439 1 1 0.755
#> 7 0.6 0.366 1 1 0.735
#> 8 0.7 0.317 1 1 0.72
#> 9 0.8 0.195 1 1 0.686
#> 10 0.9 0.122 1 1 0.667
#> 11 1 0.0732 1 1 0.655
Created on 2021-09-10 by the reprex package (v2.0.1)
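To see what each row of such a table means, the four measures can also be computed by hand for a given lab-value cutoff. This is a minimal base R sketch on made-up values (diag_at_threshold is a hypothetical helper, not part of pROC), assuming that values above the threshold count as test-positive.

```r
# Sketch: sensitivity, specificity, PPV and NPV at a given cutoff.
# 'marker' is the lab value, 'case' is TRUE for diseased subjects.
diag_at_threshold <- function(marker, case, threshold) {
  pos <- marker > threshold          # assumption: higher values indicate disease
  tp <- sum(pos & case)
  fp <- sum(pos & !case)
  fn <- sum(!pos & case)
  tn <- sum(!pos & !case)
  data.frame(
    threshold = threshold,
    sens = tp / (tp + fn),
    spec = tn / (tn + fp),
    ppv  = tp / (tp + fp),
    npv  = tn / (tn + fn)
  )
}

# Made-up data for illustration only
marker <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
case   <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)

tab <- do.call(rbind, lapply(c(0.25, 0.45, 0.65), diag_at_threshold,
                             marker = marker, case = case))
tab
```

Each row gives one "Lab value | Sens | Spec | PPV | NPV" line of the kind asked for in the question. The values are point estimates without confidence intervals.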

Mediated Moderation Model in R (Lavaan)

Suppose you have an outcome variable (Y; continuous), an independent variable (X; dummy), and a moderator (W; dummy). Suppose that you would like to test whether another variable (M; continuous) mediates the link between X and W. How would you go about coding this test in R (using lavaan)?
The closest post to mine is: Creating a first stage mediated moderation model, syntax issues
However, the offered answer deals with a question different from mine. My question is about mediating a moderation, whereas the answer deals with moderating a mediation.
Assuming that both X and W are dummy variables, you can use the : operator:
library(lavaan)
#> This is lavaan 0.6-7
#> lavaan is BETA software! Please report any bugs.
df <- data.frame(id=1:301)
df$w <- dummies::dummy(HolzingerSwineford1939$school)[,1]
#> Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
#> non-list contrasts argument ignored
df$x <- dummies::dummy(HolzingerSwineford1939$sex)[,1]
#> Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
#> non-list contrasts argument ignored
df$y <- HolzingerSwineford1939$x9
df$m <- HolzingerSwineford1939$agemo
model <- "
#x9 will be your Y
#sex will be your X
#school will be your W
#agemo will be your M
y ~ x + w + c*x:w + b*m
m ~ a*x:w
# indirect effect (a*b)
ab := a*b
# total effect
total := c + (a*b)
"
fit <- sem(model = model, data = df)
summary(object = fit, std=T)
#> lavaan 0.6-7 ended normally after 33 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of free parameters 7
#>
#> Number of observations 301
#>
#> Model Test User Model:
#>
#> Test statistic 0.041
#> Degrees of freedom 2
#> P-value (Chi-square) 0.980
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> y ~
#> x -0.131 0.161 -0.812 0.417 -0.131 -0.065
#> w -0.130 0.162 -0.805 0.421 -0.130 -0.065
#> x:w (c) 0.086 0.232 0.373 0.709 0.086 0.037
#> m (b) 0.008 0.017 0.478 0.633 0.008 0.027
#> m ~
#> x:w (a) -0.238 0.465 -0.511 0.609 -0.238 -0.029
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> .y 1.010 0.082 12.268 0.000 1.010 0.995
#> .m 11.865 0.967 12.268 0.000 11.865 0.999
#>
#> Defined Parameters:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> ab -0.002 0.005 -0.349 0.727 -0.002 -0.001
#> total 0.085 0.232 0.364 0.716 0.085 0.036
Created on 2021-03-16 by the reprex package (v0.3.0)
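The logic of the a, b and c paths can also be sketched without lavaan, using an explicit product term and two ordinary regressions. The data below are simulated and the variable names (xw, fit_m, fit_y) are chosen for illustration. For a simple path model with observed variables, the point estimates should closely match the lavaan ones, but the standard error of the indirect effect is not computed here.

```r
# Sketch: product-of-coefficients version of the model above, base R only.
set.seed(42)
n <- 300
d <- data.frame(x = rbinom(n, 1, 0.5), w = rbinom(n, 1, 0.5))
d$xw <- d$x * d$w                      # explicit interaction term (x:w)
d$m  <- 0.5 * d$xw + rnorm(n)          # simulated mediator
d$y  <- 0.3 * d$x + 0.2 * d$w + 0.4 * d$m + rnorm(n)

fit_m <- lm(m ~ xw, data = d)              # a path: interaction -> mediator
fit_y <- lm(y ~ x + w + xw + m, data = d)  # b path (m) and c path (xw)

a <- unname(coef(fit_m)["xw"])
b <- unname(coef(fit_y)["m"])
c_direct <- unname(coef(fit_y)["xw"])

effects <- c(ab = a * b, total = c_direct + a * b)
effects
```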

adding labels to diagnostic plots in R

I have run a beta regression in R and would like to assess the residual diagnostics. I have used the plot function and obtained plots, however, the potential outliers are not labelled. How can I add the corresponding labels to the outliers?
library(betareg)
breg.full <- betareg(Percentage ~ Total_testscore + Campus + Programme +
Gender + SE_track + Hours_Math_SE, data = starters, # [-c(53, 24, 35), ]
link = "logit") # , , link.phi = NULL, type = "ML"
summary(breg.full)
par(mfrow = c(2,3))
plot(breg.full, which = 1:6)
EDIT:
I want to have something like this (without the actual pink box, but with the ID number.)
The author provides a link for this code (http://www.de.ufpe.br/~cribari/betareg_example.zip), but it no longer works.
Explanation
I couldn't see your data anywhere here, but I will use the iris dataset to demonstrate how this can be achieved. I'll stick to only two examples because this takes some time to code, but once you see two examples I think it will become fairly quick to recognize what is going on. I will supply a reference at the end that will be helpful too.
Fitting Model Data
First we can fit a regression using the iris data, then turn the data into a tibble with model data using both fortify and as_tibble. I have added an index column for one of the plots later.
#### Load Library ####
library(tidyverse)
#### Fit Model ####
fit <- lm(Petal.Width ~ Petal.Length,
data = iris)
#### Turn Model into Data Frame ####
fit.data <- fortify(fit) %>%
as_tibble() %>%
mutate(.index = 1:150)
fit.data
Which gives you this:
# A tibble: 150 × 9
Petal…¹ Petal…² .hat .sigma .cooksd .fitted .resid .stdr…³ .index
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 1
2 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 2
3 0.2 1.3 0.0197 0.207 1.23e-4 0.177 0.0226 0.111 3
4 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 4
5 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 5
6 0.4 1.7 0.0158 0.207 6.06e-4 0.344 0.0563 0.275 6
7 0.3 1.4 0.0186 0.207 1.49e-3 0.219 0.0810 0.396 7
8 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 8
9 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 9
10 0.1 1.5 0.0176 0.207 5.53e-3 0.261 -0.161 -0.785 10
# … with 140 more rows, and abbreviated variable names ¹​Petal.Width,
# ²​Petal.Length, ³​.stdresid
# ℹ Use `print(n = ...)` to see more rows
You can see here that it gives you a lot of valuable information: residuals, fitted values, Cook's distance, etc. This makes it easy to plot them in ggplot2.
Plotting
The first example will be a Cook's distance plot. This takes the index of each data point and plots a column for its Cook's distance using the geom_col function. The key ingredient here is the geom_text portion. Simply subset the data and nudge the labels a little so they don't totally overlap, and you can essentially label whatever you want:
#### Cooks Distance ####
fit.data %>%
ggplot(aes(x=.index,
y=.cooksd,
label=.index))+
geom_col()+
labs(x="Index",
y="Cook's Distance",
title = "Cook's Distance")+
geom_text(data=subset(fit.data,
.cooksd > .05),
nudge_y = .003)
Giving you this plot:
Another example using a similar method below plots fitted values versus their respective residuals, with an arbitrary label placed here as well:
#### Fitted vs Residuals ####
ggplot(fit.data,
aes(.fitted,
round(.resid,2),
label=round(.resid,2))) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)+
labs(x="Fitted",
y="Residual",
title = "Fitted vs Residuals")+
geom_text(data=subset(fit.data,
.resid > .5 | .resid < -.5),
nudge_x = .09)
A slew of other examples of how to do this can be seen at this link. The customization will be up to you, but it should give you a fair idea of how to hand tailor some of these base R plots you are getting.
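If you would rather stay with base graphics, the same idea works with text(). The sketch below again uses the iris model rather than a beta regression, and labels the three largest Cook's distances with their row numbers. The choice of three points is an assumption for illustration.

```r
# Sketch: label the most influential points in a base R Cook's distance plot.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
cd <- cooks.distance(fit)

# Indices of the three largest Cook's distances
flagged <- order(cd, decreasing = TRUE)[1:3]

plot(cd, type = "h",
     xlab = "Index", ylab = "Cook's distance", main = "Cook's distance")

# Put the observation ID just above each flagged spike
text(x = flagged, y = cd[flagged], labels = names(cd)[flagged],
     pos = 3, cex = 0.8, xpd = TRUE)
```

Any other rule can replace the top-three selection, for example which(cd > 0.05) to match the cutoff used in the ggplot2 example, as long as you skip the text() call when nothing is selected.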

Resources