R - Logistic Regression with Control Variables - r

I just started to get into R for data analysis (previously I just used SPSS or Excel).
Currently, I am trying to run a logistic regression with one dependent and 5 independent while controlling for 3 variables.
My current attempt is:
reg_model <- glm(formula = Dependent ~ Independent1 + Independent2 + Independent3 + Independent4 + Independent5, family = binomial(), data = df)
I am not sure how (or where) to insert the 3 control variables into the model because just adding the 3 control variables as independent variables into the model seems wrong to me (or am I wrong here?).

You can control for potential confounders by adding them as independent variables into the model on the right-hand side of the formula.
Note that the estimate (effect size) of the Graduate Record Exam (GRE) score is lower in the second model after controlling for the grade point average (GPA), which is correlated to GRE:
library(readr)
# gre: Graduate Record Exam scores
# gpa: grade point average
data <- read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> admit = col_double(),
#> gre = col_double(),
#> gpa = col_double(),
#> rank = col_double()
#> )
data
#> # A tibble: 400 x 4
#> admit gre gpa rank
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 380 3.61 3
#> 2 1 660 3.67 3
#> 3 1 800 4 1
#> 4 1 640 3.19 4
#> 5 0 520 2.93 4
#> 6 1 760 3 2
#> 7 1 560 2.98 1
#> 8 0 400 3.08 2
#> 9 1 540 3.39 3
#> 10 0 700 3.92 2
#> # … with 390 more rows
model1 <- glm(admit ~ gre, data = data, family = "binomial")
summary(model1)
#>
#> Call:
#> glm(formula = admit ~ gre, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1623 -0.9052 -0.7547 1.3486 1.9879
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
#> gre 0.003582 0.000986 3.633 0.00028 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 486.06 on 398 degrees of freedom
#> AIC: 490.06
#>
#> Number of Fisher Scoring iterations: 4
# gre and gpa are correlated. Lets's control for them!
cor(data)
#> admit gre gpa rank
#> admit 1.0000000 0.1844343 0.17821225 -0.24251318
#> gre 0.1844343 1.0000000 0.38426588 -0.12344707
#> gpa 0.1782123 0.3842659 1.00000000 -0.05746077
#> rank -0.2425132 -0.1234471 -0.05746077 1.00000000
model2 <- glm(admit ~ gre + gpa, data = data, family = "binomial")
summary(model2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Created on 2021-10-01 by the reprex package (v2.0.1)

Related

How do I extract variables that have a low p-value in R

I have a logistic model with plenty of interactions in R.
I want to extract only the variables and interactions that are either interactions or just predictor variables that are significant.
It's fine if I can just look at every interaction that's significant as I can still look at which non-significant fields were used to get them.
Thank you.
This is the most I have
broom::tidy(logmod)[,c("term", "estimate", "p.value")]
Here is a way. After fitting the logistic model use a logical condition to get the significant predictors and a regex (logical grep) to get the interactions. These two index vectors can be combined with &, in the case below returning no significant interactions at the alpha == 0.05 level.
fit <- glm(am ~ hp + qsec*vs, mtcars, family = binomial)
summary(fit)
#>
#> Call:
#> glm(formula = am ~ hp + qsec * vs, family = binomial, data = mtcars)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.93876 -0.09923 -0.00014 0.05351 1.33693
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 199.02697 102.43134 1.943 0.0520 .
#> hp -0.12104 0.06138 -1.972 0.0486 *
#> qsec -10.87980 5.62557 -1.934 0.0531 .
#> vs -108.34667 63.59912 -1.704 0.0885 .
#> qsec:vs 6.72944 3.85348 1.746 0.0808 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 43.230 on 31 degrees of freedom
#> Residual deviance: 12.574 on 27 degrees of freedom
#> AIC: 22.574
#>
#> Number of Fisher Scoring iterations: 8
alpha <- 0.05
pval <- summary(fit)$coefficients[,4]
sig <- pval <= alpha
intr <- grepl(":", names(coef(fit)))
coef(fit)[sig]
#> hp
#> -0.1210429
coef(fit)[sig & intr]
#> named numeric(0)
Created on 2022-09-15 with reprex v2.0.2

Extract confidence interval for both values of binary variable for glm()?

I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.
The reproducible data set:
smoking_status
alcohol_drinks
1
2
0
5
1
2
0
1
1
0
1
0
0
0
1
9
1
6
1
5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
Alcohol drinks, mean (95%CI)
p-values
Smokers
X (X - X)
0.492
Non-smokers
X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, a logistic regression is out of the question. Since the response is counts data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
Edit
The edit to the question states that the problem was reversed, what is asked is to find out the effect of alcohol drinking on smoking status. And with a binary response, individuals can be smokers or not, a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired this in this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)

How to automatically inserting default key in prais winsten estimation?

I got an error message saying
Error in prais_winsten(agvprc.lm1, data = data) :
argument "index" is missing, with no default
How can I avoid inputting all the features
agvprc.lm1=lm(log(avgprc) ~ mon+tues+wed+thurs+t+wave2+wave3)
summary(agvprc.lm1)
agvprc.lm.pw = prais_winsten(agvprc.lm1, data=data)
summary(agvprc.lm.pw)
Not sure if I've understood the question correctly, but to avoid the "Error in prais_winsten(agvprc.lm1, data = data) : argument "index" is missing, with no default" error you need to provide an 'index' to the function (which is a character variable of "ID" and "time"). Using the inbuilt mtcars dataset as an example, with "cyl" as "time":
library(tidyverse)
#install.packages("prais")
library(prais)
#> Loading required package: sandwich
#> Loading required package: pcse
#>
#> Attaching package: 'pcse'
#> The following object is masked from 'package:sandwich':
#>
#> vcovPC
ggplot(mtcars, aes(x = cyl, y = mpg, group = cyl)) +
geom_boxplot() +
geom_jitter(aes(color = hp), width = 0.2)
agvprc.lm1 <- lm(log(mpg) ~ cyl + hp, data = mtcars)
summary(agvprc.lm1)
#>
#> Call:
#> lm(formula = log(mpg) ~ cyl + hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.35699 -0.09882 0.01111 0.11948 0.24118
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7829495 0.1062183 35.615 < 2e-16 ***
#> cyl -0.1072513 0.0279213 -3.841 0.000615 ***
#> hp -0.0011031 0.0007273 -1.517 0.140147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 29 degrees of freedom
#> Multiple R-squared: 0.7503, Adjusted R-squared: 0.7331
#> F-statistic: 43.57 on 2 and 29 DF, p-value: 1.83e-09
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
data = mtcars,
index = c("hp", "cyl"))
#> Iteration 0: rho = 0
#> Iteration 1: rho = 0.6985
#> Iteration 2: rho = 0.7309
#> Iteration 3: rho = 0.7285
#> Iteration 4: rho = 0.7287
#> Iteration 5: rho = 0.7287
#> Iteration 6: rho = 0.7287
#> Iteration 7: rho = 0.7287
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = c("hp",
#> "cyl"))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
Created on 2022-02-28 by the reprex package (v2.0.1)
# To present the index without having to write out all of the variables
# perhaps you could use:
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
data = mtcars,
index = names(agvprc.lm1$coefficients)[3:2])
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = names(agvprc.lm1$coefficients)[3:2])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
NB. there are a number of assumptions here that may not apply to your actual data; with more information, such as a minimal reproducible example, you will likely get a better/more informed answer on stackoverflow

Plot predicted probabilities (logit)

I am currently trying to plot the predicted probabilities of my logit model in r. I have followed the approach from this link: https://stats.idre.ucla.edu/r/dae/logit-regression/.
I have successfully made plots for Brussels office given the interest group type. However, I seek to only plot the individual effects: for example, I want to plot the predicted probability for Brussels office on Meetings with MEPs (that is, what is the probability of having meetings with MEPs when you have a Brussels office?). Also, I want to see the effect of staff size and/or organisational form on the dependent variable.
I have not found such an approach yet. Any advice?
Thank you in advance.
My variables:
Meetings with MEPS (dependent variable, dummy)
1 Yes
0 No
Interest group type (categorical)
1 Business
2 Consultancies
3 NGOs
4 Public authorities
5 Institutions
6 Trade union/prof.org.
7 Other
Brussels office
1 Yes
0 No
Organisational form
1 Individual org.
2 National association
3 European association
4 Other
Staff size (count variable, presented in full time equivalent)
Ranges from 0.25 to 40
Picking up from yesterday.
library(ggplot2)
# mydata <- read.csv("binary.csv")
str(mydata)
#> 'data.frame': 400 obs. of 4 variables:
#> $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
#> $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
#> $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
#> $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
We're going to graph GPA on the x axis let's generate some points
range(mydata$gpa) # using GPA for your staff size
#> [1] 2.26 4.00
gpa_sequence <- seq(from = 2.25, to = 4.01, by = .01) # 177 points along x axis
This is in the IDRE example but they made it complicated. Step one build a data frame that has our sequence of GPA points, the mean of GRE for every entry in that column, and our 4 factors repeated 177 times.
constantGRE <- with(mydata, data.frame(gre = mean(gre), # keep GRE constant
gpa = rep(gpa_sequence, each = 4), # once per factor level
rank = factor(rep(1:4, times = 177)))) # there's 177
str(constantGRE)
#> 'data.frame': 708 obs. of 3 variables:
#> $ gre : num 588 588 588 588 588 ...
#> $ gpa : num 2.25 2.25 2.25 2.25 2.26 2.26 2.26 2.26 2.27 2.27 ...
#> $ rank: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
Make predictions for every one of the 177 GPA values * 4 factor levels. Put that prediction in a new column called theprediction
constantGRE$theprediction <- predict(object = mylogit,
newdata = constantGRE,
type = "response")
Plot one line per level of rank, color the lines uniquely. NB the lines are not straight, nor perfectly parallel nor equally spaced.
ggplot(constantGRE, aes(x = gpa, y = theprediction, color = rank)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You might be tempted to just average the lines. Don't. If you want to know GPA by GRE not including Rank build a new model because (0.6357521 + 0.4704174 + 0.3136242 + 0.2700262) / 4 is not the proper answer.
Let's do it.
# leave rank out call it new name
mylogit2 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial")
summary(mylogit2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Repeat the rest of the process to get one line
constantGRE2 <- with(mydata, data.frame(gre = mean(gre),
gpa = gpa_sequence))
constantGRE2$theprediction <- predict(object = mylogit2,
newdata = constantGRE2,
type = "response")
ggplot(constantGRE2, aes(x = gpa, y = theprediction)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Since you didn't provide your data I'll use the dataset from the example you're familiar with from UCLA. Are you trying to do this (assuming Rank to be like one of your variables...
library(ggplot2)
mydata <- read.csv("binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1
#> gre gpa rank
#> 1 587.7 3.3899 1
#> 2 587.7 3.3899 2
#> 3 587.7 3.3899 3
#> 4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
#> gre gpa rank rankP
#> 1 587.7 3.3899 1 0.5166016
#> 2 587.7 3.3899 2 0.3522846
#> 3 587.7 3.3899 3 0.2186120
#> 4 587.7 3.3899 4 0.1846684
ggplot(newdata1, aes(x = rank, y = rankP)) +
geom_col()

Renaming integers within a data.frame

Using R...
I have a data.frame with five variables.
One of the variables colr has values ranging from 1 to 5.
Defined as an integer with values 1, 2, 3, 4, and 5.
Problem: I would like to build a regression model where the values within colr, the integers 1,2,3,4, and 5 are reported as independent variables with the following names.
1 = Silver,
2 = Blue,
3 = Pink,
4 = Other than Silver, Blue or Pink,
5 = Color Not Reported.
Question: Is there a way to extract or rename these values in a way that is different from the following (as this process does not rename, eg. 1 to Silver in the summary regression output):
lm(dependent variable ~ + I(colr.f == 1) +
I(colr.f == 2) +
I(colr.f == 3) +
I(colr.f == 4) +
I(colr.f == 5),
data = df)
I am open to any method that would allow me to create and name these different values independently but would prefer to see if there is a way to do so using the tidyverse or dplyr as this is something I have to do frequently when building multivariate models.
Thank you for any help.
If you have this:
df <- data.frame(int = sample(5, 20, TRUE), value = rnorm(20))
df
#> int value
#> 1 3 -0.62042198
#> 2 4 0.85009260
#> 3 5 -1.04971518
#> 4 1 -2.58255471
#> 5 1 0.62357772
#> 6 4 0.00286785
#> 7 4 -0.05981318
#> 8 4 0.72961261
#> 9 4 -0.03156315
#> 10 1 -2.05486209
#> 11 5 1.77099554
#> 12 1 1.02790956
#> 13 1 -0.70354012
#> 14 1 0.27353731
#> 15 2 -0.04817215
#> 16 2 0.17151374
#> 17 5 -0.54824346
#> 18 2 0.41123284
#> 19 5 0.05466070
#> 20 1 -0.41029986
You can do this:
library(tidyverse)
df <- df %>% mutate(color = factor(c("red", "green", "orange", "blue", "pink"))[int])
df
#> int value color
#> 1 3 -0.62042198 orange
#> 2 4 0.85009260 blue
#> 3 5 -1.04971518 pink
#> 4 1 -2.58255471 red
#> 5 1 0.62357772 red
#> 6 4 0.00286785 blue
#> 7 4 -0.05981318 blue
#> 8 4 0.72961261 blue
#> 9 4 -0.03156315 blue
#> 10 1 -2.05486209 red
#> 11 5 1.77099554 pink
#> 12 1 1.02790956 red
#> 13 1 -0.70354012 red
#> 14 1 0.27353731 red
#> 15 2 -0.04817215 green
#> 16 2 0.17151374 green
#> 17 5 -0.54824346 pink
#> 18 2 0.41123284 green
#> 19 5 0.05466070 pink
#> 20 1 -0.41029986 red
Which allows a regression like this:
lm(value ~ color, data = df) %>% summary()
#>
#> Call:
#> lm(formula = value ~ color, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.03595 -0.33687 -0.00447 0.46149 1.71407
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2982 0.4681 0.637 0.534
#> colorgreen -0.1200 0.7644 -0.157 0.877
#> colororange -0.9187 1.1466 -0.801 0.436
#> colorpink -0.2413 0.7021 -0.344 0.736
#> colorred -0.8448 0.6129 -1.378 0.188
#>
#> Residual standard error: 1.047 on 15 degrees of freedom
#> Multiple R-squared: 0.1451, Adjusted R-squared: -0.0829
#> F-statistic: 0.6364 on 4 and 15 DF, p-value: 0.6444
Created on 2020-02-16 by the reprex package (v0.3.0)
I'm not sure I'm understanding your question the right way, but can't you just use
library(dplyr)
df <- df %>%
mutate(color=factor(colr.f, levels=c(1:5), labels=c("silver", "blue", "pink", "not s, b, p", "not reported"))
and then just run the regression on color only.
/edit for clarification. Making up some data:
df <- data.frame(
x=rnorm(100),
color=factor(rep(c(1,2,3,4,5), each=20),
labels=c("Silver", "Blue", "Pink", "Not S, B, P", "Not reported")),
y=rnorm(100, 4))
m1 <- lm(y~x+color, data=df)
m2 <- lm(y~x+color-1, data=df)
summary(m1)
Call:
lm(formula = y ~ x + color, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93238 0.19312 20.362 <2e-16 ***
x 0.13588 0.09856 1.379 0.171
colorBlue -0.07862 0.27705 -0.284 0.777
colorPink -0.02167 0.27393 -0.079 0.937
colorNot S, B, P 0.15238 0.27221 0.560 0.577
colorNot reported 0.14139 0.27230 0.519 0.605
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.0268, Adjusted R-squared: -0.02496
F-statistic: 0.5177 on 5 and 94 DF, p-value: 0.7623
summary(m2)
Call:
lm(formula = y ~ x + color - 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.13588 0.09856 1.379 0.171
colorSilver 3.93238 0.19312 20.362 <2e-16 ***
colorBlue 3.85376 0.19570 19.692 <2e-16 ***
colorPink 3.91071 0.19301 20.262 <2e-16 ***
colorNot S, B, P 4.08477 0.19375 21.083 <2e-16 ***
colorNot reported 4.07377 0.19256 21.156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9551
F-statistic: 355.5 on 6 and 94 DF, p-value: < 2.2e-16
The first model is a model with intercept, therefore one of the factor levels must be dropped to avoid perfect multicollinearity. In this case, the "effect" of silver is the value of the intercept, while the "effect" of the other colors is the intercept coefficient value + their respective coefficient value.
The second model is estimated without intercept (without constant), so you can see the individual effects. However, you should probably know what you are doing before estimating the model without intercept.
With base R.
labels <- c("Silver", "Blue", "Pink", "Other Color", "Color Not Reported")
df$colr.f2 <- factor(colr.f, labels = labels, levels = seq_along(labels))

Resources