I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
x = seq(-0.1, 0.1, length.out = n),
g = floor(runif(n) * 3),
f = letters[1 + g] %>% as.factor(),
y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list ?
Related
I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.
The reproducible data set:
smoking_status
alcohol_drinks
1
2
0
5
1
2
0
1
1
0
1
0
0
0
1
9
1
6
1
5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
Alcohol drinks, mean (95%CI)
p-values
Smokers
X (X - X)
0.492
Non-smokers
X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, a logistic regression is out of the question. Since the response is counts data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
Edit
The edit to the question states that the problem was reversed, what is asked is to find out the effect of alcohol drinking on smoking status. And with a binary response, individuals can be smokers or not, a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired this in this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)
Want to preface this with heaps of appreciate for gtsummary -- wonderful package.
After using tidymodels, GLM, and gtsummary for a while, I've been trying to understand gtsummary's computations for GLM model performance and confidence intervals.
Can the anyone and/or Dr. Sjoberg + gtsummary team explain the following questions 1 & 2
Question 1: Why are standard errors different when using broom::tidy() vs. parameters::model_parameters() functions to extract model residual data?
(Bolded text in print outs shows differences)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
trial2 <- trial %>% select(age, grade, response, trt) %>%
drop_na()
model_trial2 <- glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit"))
broom::tidy(model_trial2, exponentiate = TRUE)
# # A tibble: 5 × 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 0.184 **0.630** -2.69 0.00715
# 2 age 1.02 0.0114 1.67 0.0952
# 3 gradeII 0.852 **0.395** -0.406 0.685
# 4 gradeIII 1.01 0.385 0.0199 0.984
# 5 trtDrug B 1.13 **0.321** 0.387 0.699
preadmission_model_parameters <- model_trial2 %>% parameters::model_parameters(exponentiate = TRUE)
preadmission_model_parameters
# Parameter | Odds Ratio | SE | 95% CI | z | p
# ---------------------------------------------------------------
# (Intercept) | 0.18 | **0.12** | [0.05, 0.61] | -2.69 | 0.007
# age | 1.02 | 0.01 | [1.00, 1.04] | 1.67 | 0.095
# grade [II] | 0.85 | **0.34** | [0.39, 1.85] | -0.41 | 0.685
# grade [III] | 1.01 | 0.39 | [0.47, 2.15] | 0.02 | 0.984
# trt [Drug B] | 1.13 | **0.36** | [0.60, 2.13] | 0.39 | 0.699
Question 2: (a) What method does gtsummary use to produce confidence intervals? (b) can the user define (stratified or unstratified) k-fold cross-validation or bootstraps to produce confidence intervals?
(Bolded differences in confidence intervals for the reg_intervals() bootstrapped confidence intervals and the unknown method gtsummary tbl_regression() confidence intervals.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
trial2 <- trial %>% select(age, grade, response, trt) %>%
drop_na()
bootstraps(trial2, times = 10)
trial_bootrapped_confidence_intervals <- reg_intervals(response ~ age + grade + trt,
data = trial2,
model_fn = "glm",
keep_reps = TRUE,
family=binomial(link="logit"))
trial_bootrapped_confidence_intervals_exp <- trial_bootrapped_confidence_intervals %>%
select(term:.alpha) %>%
mutate(across(.cols = c(.lower, .estimate, .upper), ~exp(.))) %>%
as_tibble()
trial_bootrapped_confidence_intervals_exp
# # A tibble: 4 × 5
# term .lower .estimate .upper .alpha
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 age 0.997 1.02 1.04 0.05
# 2 gradeII **0.400** 0.846 **1.86** 0.05
# 3 gradeIII 0.473 1.01 2.10 0.05
# 4 trtDrug B 0.600 1.14 2.22 0.05
model_trial2_tbl_regression <-
glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit")) %>%
tbl_regression(
exponentiate = T
) %>%
add_global_p()
model_trial2_tbl_regression_metrics <- model_trial2_tbl_regression$table_body %>%
select(
label,
estimate,
std.error,
statistic,
conf.low ,
conf.high,
p.value
)
model_trial2_tbl_regression_metrics
# A tibble: 8 × 7
# label estimate std.error statistic conf.low conf.high p.value
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Age 1.02 0.0114 1.67 0.997 1.04 0.0909
# 2 Grade NA NA NA NA NA 0.894
# 3 I NA NA NA NA NA NA
# 4 II 0.852 0.395 -0.406 **0.389** **1.85** NA
# 5 III 1.01 0.385 0.0199 0.472 2.15 NA
# 6 Chemotherapy Treatment NA NA NA NA NA 0.699
# 7 Drug A NA NA NA NA NA NA
# 8 Drug B 1.13 0.321 0.387 0.603 2.13 NA
The issue is with the exponentiation (applied as the family is binomial). Broom::tidy does not exponentiate the standard errors but parameters does. You can see this with broom::tidy(model_trial2, exponentiate = TRUE) and broom::tidy(model_trial2, exponentiate = FALSE), which return the same standard errors. parameters::model_parameters(exponentiate = TRUE) and parameters::model_parameters(exponentiate = FALSE) return different standard errors. When exponentiate is FALSE for parameters, the standard errors match. This is discussed in Check exponentiate behavior in tidy methods #422
To create a custom tidier for gtsummary, see FAQ + Gallery
I am using pROC to provide the ROC analysis of blood tests. I have calculated the ROC curve, AUC and am using the ci.coords function to provide the spec, sens, PPV and NPV at a provided specificity (with 95% CI).
I would like to be able to say at what value of blod test this is, for instance at 1.2 the sens is x, spec is y, NPV is c, PPV is d. Ideally I ould have the data for a table like:
Lab value | Sens | Spec | NPV | PPV
I don't seem to be able to get this from the methodology I am currently using?
Does anyone have any suggestions?
Many thanks
Currently
spred1 = predict(smodel1)
sroc1 = roc(EditedDF1$any_abnormality, spred1)
ci.coords(sroc1, x=0.95, input="sensitivity", transpose = FALSE, ret=c("sensitivity","specificity","ppv","npv"))```
As you gave no reproducible example let's use the one that comes with the package
library(pROC)
data(aSAH)
roc1 <- roc(aSAH$outcome, aSAH$s100b)
The package comes with the function coords which lists specificity and sensititivity at different thresholds:
> coords(roc1)
threshold specificity sensitivity
1 -Inf 0.00000000 1.00000000
2 0.035 0.00000000 0.97560976
3 0.045 0.06944444 0.97560976
4 0.055 0.11111111 0.97560976
5 0.065 0.13888889 0.97560976
6 0.075 0.22222222 0.90243902
7 0.085 0.30555556 0.87804878
8 0.095 0.38888889 0.82926829
9 0.105 0.48611111 0.78048780
10 0.115 0.54166667 0.75609756
...
From there you can use the function ci.coords that you already have used to complete the table by whatever data you desire.
library(tidyverse)
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
data(aSAH)
roc <- roc(aSAH$outcome, aSAH$s100b,
levels = c("Good", "Poor")
)
#> Setting direction: controls < cases
tibble(threshold = seq(0, 1, by = 0.1)) %>%
mutate(
data = threshold %>% map(~ {
res <- roc %>% ci.coords(x = .x, ret = c("sensitivity", "specificity", "ppv", "npv"))
# 97.5%
list(
sens = res$sensitivity[[3]],
spec = res$specificity[[3]],
ppv = res$ppv[[3]],
npv = res$npv[[3]]
)
})
) %>%
unnest_wider(data)
#> # A tibble: 11 x 5
#> threshold sens spec ppv npv
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 0.363 NA
#> 2 0.1 0.927 0.5 0.5 0.917
#> 3 0.2 0.780 0.903 0.784 0.867
#> 4 0.3 0.634 0.917 0.769 0.8
#> 5 0.4 0.561 0.958 0.85 0.782
#> 6 0.5 0.439 1 1 0.755
#> 7 0.6 0.366 1 1 0.735
#> 8 0.7 0.317 1 1 0.72
#> 9 0.8 0.195 1 1 0.686
#> 10 0.9 0.122 1 1 0.667
#> 11 1 0.0732 1 1 0.655
Created on 2021-09-10 by the reprex package (v2.0.1)
Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but provdide deeper understanding of what stata is doing (by digging into the source of R's lm() function. In the following lines I replicate what lm() does, but jumping over sanity checks and options such as weights, contrasts, etc...
(I cannot yet fully understand why in the second regression (with NO CONSTANT) the dFALSE coefficient captures the effect of the intercept in the default regression (with constant)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
Using R...
I have a data.frame with five variables.
One of the variables colr has values ranging from 1 to 5.
Defined as an integer with values 1, 2, 3, 4, and 5.
Problem: I would like to build a regression model where the values within colr, the integers 1,2,3,4, and 5 are reported as independent variables with the following names.
1 = Silver,
2 = Blue,
3 = Pink,
4 = Other than Silver, Blue or Pink,
5 = Color Not Reported.
Question: Is there a way to extract or rename these values in a way that is different from the following (as this process does not rename, eg. 1 to Silver in the summary regression output):
lm(dependent variable ~ + I(colr.f == 1) +
I(colr.f == 2) +
I(colr.f == 3) +
I(colr.f == 4) +
I(colr.f == 5),
data = df)
I am open to any method that would allow me to create and name these different values independently but would prefer to see if there is a way to do so using the tidyverse or dplyr as this is something I have to do frequently when building multivariate models.
Thank you for any help.
If you have this:
df <- data.frame(int = sample(5, 20, TRUE), value = rnorm(20))
df
#> int value
#> 1 3 -0.62042198
#> 2 4 0.85009260
#> 3 5 -1.04971518
#> 4 1 -2.58255471
#> 5 1 0.62357772
#> 6 4 0.00286785
#> 7 4 -0.05981318
#> 8 4 0.72961261
#> 9 4 -0.03156315
#> 10 1 -2.05486209
#> 11 5 1.77099554
#> 12 1 1.02790956
#> 13 1 -0.70354012
#> 14 1 0.27353731
#> 15 2 -0.04817215
#> 16 2 0.17151374
#> 17 5 -0.54824346
#> 18 2 0.41123284
#> 19 5 0.05466070
#> 20 1 -0.41029986
You can do this:
library(tidyverse)
df <- df %>% mutate(color = factor(c("red", "green", "orange", "blue", "pink"))[int])
df
#> int value color
#> 1 3 -0.62042198 orange
#> 2 4 0.85009260 blue
#> 3 5 -1.04971518 pink
#> 4 1 -2.58255471 red
#> 5 1 0.62357772 red
#> 6 4 0.00286785 blue
#> 7 4 -0.05981318 blue
#> 8 4 0.72961261 blue
#> 9 4 -0.03156315 blue
#> 10 1 -2.05486209 red
#> 11 5 1.77099554 pink
#> 12 1 1.02790956 red
#> 13 1 -0.70354012 red
#> 14 1 0.27353731 red
#> 15 2 -0.04817215 green
#> 16 2 0.17151374 green
#> 17 5 -0.54824346 pink
#> 18 2 0.41123284 green
#> 19 5 0.05466070 pink
#> 20 1 -0.41029986 red
Which allows a regression like this:
lm(value ~ color, data = df) %>% summary()
#>
#> Call:
#> lm(formula = value ~ color, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.03595 -0.33687 -0.00447 0.46149 1.71407
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2982 0.4681 0.637 0.534
#> colorgreen -0.1200 0.7644 -0.157 0.877
#> colororange -0.9187 1.1466 -0.801 0.436
#> colorpink -0.2413 0.7021 -0.344 0.736
#> colorred -0.8448 0.6129 -1.378 0.188
#>
#> Residual standard error: 1.047 on 15 degrees of freedom
#> Multiple R-squared: 0.1451, Adjusted R-squared: -0.0829
#> F-statistic: 0.6364 on 4 and 15 DF, p-value: 0.6444
Created on 2020-02-16 by the reprex package (v0.3.0)
I'm not sure I'm understanding your question the right way, but can't you just use
library(dplyr)
df <- df %>%
mutate(color=factor(colr.f, levels=c(1:5), labels=c("silver", "blue", "pink", "not s, b, p", "not reported"))
and then just run the regression on color only.
/edit for clarification. Making up some data:
df <- data.frame(
x=rnorm(100),
color=factor(rep(c(1,2,3,4,5), each=20),
labels=c("Silver", "Blue", "Pink", "Not S, B, P", "Not reported")),
y=rnorm(100, 4))
m1 <- lm(y~x+color, data=df)
m2 <- lm(y~x+color-1, data=df)
summary(m1)
Call:
lm(formula = y ~ x + color, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93238 0.19312 20.362 <2e-16 ***
x 0.13588 0.09856 1.379 0.171
colorBlue -0.07862 0.27705 -0.284 0.777
colorPink -0.02167 0.27393 -0.079 0.937
colorNot S, B, P 0.15238 0.27221 0.560 0.577
colorNot reported 0.14139 0.27230 0.519 0.605
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.0268, Adjusted R-squared: -0.02496
F-statistic: 0.5177 on 5 and 94 DF, p-value: 0.7623
summary(m2)
Call:
lm(formula = y ~ x + color - 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.13588 0.09856 1.379 0.171
colorSilver 3.93238 0.19312 20.362 <2e-16 ***
colorBlue 3.85376 0.19570 19.692 <2e-16 ***
colorPink 3.91071 0.19301 20.262 <2e-16 ***
colorNot S, B, P 4.08477 0.19375 21.083 <2e-16 ***
colorNot reported 4.07377 0.19256 21.156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9551
F-statistic: 355.5 on 6 and 94 DF, p-value: < 2.2e-16
The first model is a model with intercept, therefore one of the factor levels must be dropped to avoid perfect multicollinearity. In this case, the "effect" of silver is the value of the intercept, while the "effect" of the other colors is the intercept coefficient value + their respective coefficient value.
The second model is estimated without intercept (without constant), so you can see the individual effects. However, you should probably know what you are doing before estimating the model without intercept.
With base R.
labels <- c("Silver", "Blue", "Pink", "Other Color", "Color Not Reported")
df$colr.f2 <- factor(colr.f, labels = labels, levels = seq_along(labels))