SAS proc glm random effects model with contrasts translated into R

My apologies for any errors; I only recently began learning SAS. I was given this SAS code (the code below is a reprex, not the exact code) that uses proc glm, presumably to fit a random effects model. Instead of using color directly, the SAS code uses contrasts on idnumber to map onto color indirectly.
I would like to know how to replicate this in R. Several attempts using lme4 for random effects and contrasts using MASS::ginv were unsuccessful, so I may need to use a package I am unfamiliar with.
I would also like to know the difference between red-blue and red-blue2 and why the output is different. Thank you for your help.
data df1;
input idnumber color $ value1;
datalines;
1001 red 189
1002 red 145
1003 red 210
1004 red 194
1005 red 127
1006 red 189
1007 blue 145
1008 red 210
1009 red 194
1010 red 127
;
proc glm data=df1;
class idnumber;
model value1=idnumber/noint solution clparm;
contrast 'red vs. blue' idnumber 1 1 1 1 1 1 -9 1 1 1;
estimate 'red-blue' idnumber 1 1 1 1 1 1 -9 1 1 1/ divisor=10;
estimate 'red-blue2' idnumber .111 .111 .111 .111 .111 .111 -.999 .111 .111 .111;
run;
Below are a few attempts at replication.
idnumber <- c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010)
color <- c('red', 'red', 'red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red')
value1 <- c(189, 145, 210, 194, 127, 189, 145, 210, 194, 127)
df1 <- data.frame(idnumber, color, value1)
library(lme4)
library(MASS)
library(tidyverse)
options(contrasts = c(factor = "contr.SAS", ordered = "contr.poly"))
# attempt 1
mod1 <- lme4::lmer(value1 ~ (1|idnumber), data = df1) # error
#> Error: number of levels of each grouping factor must be < number of observations (problems: idnumber)
# attempt 2
mod2 <- lme4::lmer(value1 ~ (1|color), data = df1) # singular
#> boundary (singular) fit: see help('isSingular')
summary(mod2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: value1 ~ (1 | color)
#> Data: df1
#>
#> REML criterion at convergence: 90.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3847 -0.8429 0.4816 0.6321 1.1138
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> color (Intercept) 0 0.00
#> Residual 1104 33.22
#> Number of obs: 10, groups: color, 2
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 173.00 10.51 16.47
#> optimizer (nloptwrap) convergence code: 0 (OK)
#> boundary (singular) fit: see help('isSingular')
# attempt 3
mat1 <- rbind(c(-0.5, 0.5))
cMat1 <- MASS::ginv(mat1)
mod3 <- lm(value1 ~ color, data = df1, contrasts = list(color = cMat1))
summary(mod3)
#>
#> Call:
#> lm(formula = value1 ~ color, data = df1, contrasts = list(color = cMat1))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -49.11 -23.33 12.89 17.89 33.89
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 160.56 17.74 9.052 1.78e-05 ***
#> color1 15.56 17.74 0.877 0.406
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33.65 on 8 degrees of freedom
#> Multiple R-squared: 0.08771, Adjusted R-squared: -0.02633
#> F-statistic: 0.7691 on 1 and 8 DF, p-value: 0.4061
# attempt 4
con <- c(.1, .1, .1, .1, .1, .1, -.9, .1, .1, .1)
mod4 <- lm(value1 ~ idnumber, data = df1, contrasts = list(idnumber = con)) # error, but unsure how to fix
#> Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]): contrasts apply only to factors
Created on 2022-02-08 by the reprex package (v2.0.1)

Answering the part of this that is answerable: what is going on with the two different estimates.
The estimate statement includes a list of coefficients. Those are multiplied by the model's parameter estimates (here, with one observation per idnumber and /noint, each parameter is just that observation's value1) and then summed, giving the result. The reason the two estimates differ is simply that their coefficients differ: after the divisor is applied, the first uses 0.1 for each red and -0.9 for blue, while the second uses 0.111 (roughly one ninth) and -0.999, which is effectively the same contrast with a divisor of 9 instead of 10. Hence, the math is different.
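A minimal R sketch of that arithmetic, using the df1 built above (note this is a fixed-effects, no-intercept fit, not a random-effects model):
# a minimal sketch of what the ESTIMATE statements compute;
# with class idnumber and /noint, each coefficient equals that row's value1
mod <- lm(value1 ~ 0 + factor(idnumber), data = df1)
w <- c(1, 1, 1, 1, 1, 1, -9, 1, 1, 1)
sum(w * coef(mod)) / 10                                # 'red-blue'  -> 28
sum(c(rep(.111, 6), -.999, rep(.111, 3)) * coef(mod))  # 'red-blue2' -> 31.08
# for comparison, the plain difference in group means:
mean(value1[color == "red"]) - mean(value1[color == "blue"])  # 31.11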
I'm also not sure about your reprex: it doesn't really make sense to use idnumber as the class variable; it seems more likely you'd use color. Is it possible this is just bad SAS code? I'm not a GLM expert, but it seems odd to me to use GLM with the classification variable being the ID number (assuming it's a unique ID, anyway).

Can I use a summary function on lm?

I am working with election survey data; the dataset is loaded into R and I am working in the tidyverse. I am trying to run a regression with male and another variable. However, male is a level of gender, and I am trying to isolate just male from gender overall. In the data, male is coded 1 and female 2.
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
I do get coefficients, but then I try to get summary:
gender_disgust_set<-lm(gender~dem_disgusted, data=my_data_set)
summary(gender_disgust_set)
then I get this error (plus a warning):
Error in quantile.default(resid) : (unordered) factors are not allowed
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
lm(gender~dem_disgusted, data=my_data_set)
subset(my_data_set, gender = male)
male_total<-subset(my_data_set, gender = male)
summary(male_total)
lm(gender~dem_disgusted, data=my_data_set)
The lm() function is designed for linear regression, which generally assumes a continuous response.
From the lm() details page:
A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
Your gender variable is a factor (not continuous; more information about data types here). If you really wanted to predict gender (a factor), you would need to use glm() for logistic regression.
Yes, you can use summary() on lm objects, but whether linear (or logistic) regression is best for your specific research question is a different question.
library(tidyverse)
set.seed(123)
gender <- sample(1:2, 10, replace = TRUE) %>% factor()
x1 <- sample(1:12, 10, replace = TRUE) %>% as.numeric()
x2 <- sample(1:100, 10, replace = TRUE) %>% as.numeric()
x3 <- sample(50:75, 10, replace = TRUE) %>% as.numeric()
my_data_set <- data.frame(gender, x1, x2, x3)
sapply(my_data_set, class)
#> gender x1 x2 x3
#> "factor" "numeric" "numeric" "numeric"
# error
# gender_disgust_set <- lm(gender ~ x1, data = my_data_set)
# summary(gender_disgust_set)
# logistic regression
gender_disgust_set1 <- glm(gender ~ x1, data = my_data_set, family = "binomial")
summary(gender_disgust_set1)
#>
#> Call:
#> glm(formula = gender ~ x1, family = "binomial", data = my_data_set)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2271 -0.9526 -0.8296 1.1571 1.5409
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.6530 1.7983 0.363 0.717
#> x1 -0.1342 0.2149 -0.625 0.532
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 13.46 on 9 degrees of freedom
#> Residual deviance: 13.06 on 8 degrees of freedom
#> AIC: 17.06
#>
#> Number of Fisher Scoring iterations: 4
# or flip it around
# while this model works, please look into dummy-coding before using
# factors to predict continuous responses
gender_disgust_set2 <- lm(x1 ~ gender, data = my_data_set)
summary(gender_disgust_set2)
#>
#> Call:
#> lm(formula = x1 ~ gender, data = my_data_set)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.500 -2.438 0.500 2.688 3.750
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.500 1.371 6.199 0.00026 ***
#> gender2 -1.250 2.168 -0.577 0.58010
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.359 on 8 degrees of freedom
#> Multiple R-squared: 0.03989, Adjusted R-squared: -0.08012
#> F-statistic: 0.3324 on 1 and 8 DF, p-value: 0.5801

Extract confidence interval for both values of binary variable for glm()?

I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.
The reproducible data set:
smoking_status  alcohol_drinks
1               2
0               5
1               2
0               1
1               0
1               0
0               0
1               9
1               6
1               5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
              Alcohol drinks, mean (95% CI)   p-value
Smokers       X (X - X)                       0.492
Non-smokers   X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, so a logistic regression is out of the question. Since the response is count data, I will fit a Poisson model.
To get confidence intervals for each value of smoking_status, coerce it to a factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
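To assemble the requested results table from pois_fit (a sketch; under the log link, the exponentiated coefficients are the estimated mean drinks per group):
# a sketch: collect exponentiated means and profile CIs into one table
means <- exp(coef(pois_fit))
cis <- exp(confint(pois_fit))
data.frame(group = c("Non-smokers", "Smokers"),
           mean_drinks = round(means, 2),
           ci_low = round(cis[, 1], 2),
           ci_high = round(cis[, 2], 2),
           row.names = NULL)
#>         group mean_drinks ci_low ci_high
#> 1 Non-smokers        2.00   0.79    4.05
#> 2     Smokers        3.43   2.23    4.99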
Edit
The edit to the question states that the problem was reversed: what is asked is the effect of alcohol drinking on smoking status. With a binary response (individuals can be smokers or not), a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired by this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)

No P or F values in Two Way ANOVA on R

I'm doing an assignment for university and have copied and pasted the R code, so I know it's right, but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data but I don't know what.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
"50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
"100 17 1792.54")," +"), paste0, collapse=",") %>%
read_csv()
model <- lm(Area ~ Temperature * as.factor(Food), data = df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), so there is no variation left to compute a residual SSQ, and hence the ANOVA doesn't include F/P values etc.
It's fine for temperature and food to be categorical (factor) predictors; that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction, you need more replication. A minimal illustration:
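# a sketch of the saturation problem, reusing the df built above:
# factor Temperature (3 levels) x factor Food (2 levels) with interaction
# gives 6 parameters for 6 observations
m <- aov(Area ~ as.factor(Temperature) * as.factor(Food), data = df)
df.residual(m)  # 0 -- no residual degrees of freedom, hence no F or p values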

Why does a predict.glm() not create predicted values in the expected manner?

I am trying to get my head around what the predict.glm() function does for a project at work which uses it.
To do this, I first looked at the example code found in the documentation for ?predict.glm(). This has given me the sense that it can take a glm and predict response values for a given input vector. However, I found it very difficult to customise that "budworm" example. So I created an exceptionally simple model of my own to try and see how it works. Spoiler: I'm still failing to get it to work.
a<-c(1,2,3,4,5)
b<-c(2,3,4,5,6)
result<-glm(b~a,family=gaussian)
summary(result)
plot(c(0,10), c(0,10), type = "n", xlab = "dose",
ylab = "response")
xvals<-seq(0,10,0.1)
data.frame(xinputs=xvals)
predict.glm(object=result,newdata= data.frame(xinputs=xvals),type='terms')
#lines(xvals, predict.glm(object=result,newdata = xvals, type="response" ))
When I run predict.glm(object=result,newdata= data.frame(xinputs=xvals),type='terms') I get the error message:
Warning message:
'newdata' had 101 rows but variables found have 5 rows
From what I understand, it shouldn't matter that the input GLM only used 5 rows... it should use the statistics of that GLM to predict response values to each of the 101 entries of the new data?
Column names in the newdata data frame must match column names from the data you used to fit the model. Thus,
predict.glm(object=result,newdata= data.frame(a=xvals),type='terms')
will resolve your issue.
a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5, 6)
result <- glm(b ~ a, family = gaussian)
summary(result)
#>
#> Call:
#> glm(formula = b ~ a, family = gaussian)
#>
#> Deviance Residuals:
#> 1 2 3 4 5
#> -1.776e-15 -8.882e-16 -8.882e-16 0.000e+00 0.000e+00
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.000e+00 1.317e-15 7.591e+14 <2e-16 ***
#> a 1.000e+00 3.972e-16 2.518e+15 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.577722e-30)
#>
#> Null deviance: 1.0000e+01 on 4 degrees of freedom
#> Residual deviance: 4.7332e-30 on 3 degrees of freedom
#> AIC: -325.47
#>
#> Number of Fisher Scoring iterations: 1
plot(c(0, 10),
c(0, 10),
type = "n",
xlab = "dose",
ylab = "response")
xvals <- seq(0, 10, 0.1)
head(data.frame(xinputs = xvals))
#> xinputs
#> 1 0.0
#> 2 0.1
#> 3 0.2
#> 4 0.3
#> 5 0.4
#> 6 0.5
head(predict.glm(object = result,
newdata = data.frame(a = xvals),
type = 'terms'))
#> a
#> 1 -3.0
#> 2 -2.9
#> 3 -2.8
#> 4 -2.7
#> 5 -2.6
#> 6 -2.5
Created on 2020-09-15 by the reprex package (v0.3.0)
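As a follow-up, the lines() call commented out in the question also works once the newdata column is named a; a sketch on the response scale:
# hypothetical continuation of the plot above: draw the fitted line
lines(xvals,
      predict(result, newdata = data.frame(a = xvals), type = "response"),
      col = "red")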

Plot predicted probabilities (logit)

I am currently trying to plot the predicted probabilities of my logit model in R. I have followed the approach from this link: https://stats.idre.ucla.edu/r/dae/logit-regression/.
I have successfully made plots for Brussels office given the interest group type. However, I seek to only plot the individual effects: for example, I want to plot the predicted probability for Brussels office on Meetings with MEPs (that is, what is the probability of having meetings with MEPs when you have a Brussels office?). Also, I want to see the effect of staff size and/or organisational form on the dependent variable.
I have not found such an approach yet. Any advice?
Thank you in advance.
My variables:
Meetings with MEPs (dependent variable, dummy)
1 Yes
0 No
Interest group type (categorical)
1 Business
2 Consultancies
3 NGOs
4 Public authorities
5 Institutions
6 Trade union/prof.org.
7 Other
Brussels office
1 Yes
0 No
Organisational form
1 Individual org.
2 National association
3 European association
4 Other
Staff size (count variable, presented in full time equivalent)
Ranges from 0.25 to 40
Picking up from yesterday.
library(ggplot2)
# mydata <- read.csv("binary.csv")
str(mydata)
#> 'data.frame': 400 obs. of 4 variables:
#> $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
#> $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
#> $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
#> $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
We're going to graph GPA on the x axis, so let's generate some points.
range(mydata$gpa) # using GPA for your staff size
#> [1] 2.26 4.00
gpa_sequence <- seq(from = 2.25, to = 4.01, by = .01) # 177 points along x axis
This is in the IDRE example, but they made it complicated. Step one: build a data frame that has our sequence of GPA points, the mean of GRE for every entry in that column, and our 4 factor levels repeated 177 times.
constantGRE <- with(mydata, data.frame(gre = mean(gre), # keep GRE constant
gpa = rep(gpa_sequence, each = 4), # each GPA value once per factor level
rank = factor(rep(1:4, times = 177)))) # ranks 1-4 for each of the 177 GPA values
str(constantGRE)
#> 'data.frame': 708 obs. of 3 variables:
#> $ gre : num 588 588 588 588 588 ...
#> $ gpa : num 2.25 2.25 2.25 2.25 2.26 2.26 2.26 2.26 2.27 2.27 ...
#> $ rank: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
Make predictions for every one of the 177 GPA values at each of the 4 factor levels, and put the prediction in a new column called theprediction.
constantGRE$theprediction <- predict(object = mylogit,
newdata = constantGRE,
type = "response")
Plot one line per level of rank, coloring the lines uniquely. NB: the lines are not straight, perfectly parallel, or equally spaced.
ggplot(constantGRE, aes(x = gpa, y = theprediction, color = rank)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You might be tempted to just average the lines. Don't. If you want the predicted probability by GPA (holding GRE at its mean) without including rank, build a new model, because (0.6357521 + 0.4704174 + 0.3136242 + 0.2700262) / 4 is not the proper answer.
Let's do it.
# leave rank out call it new name
mylogit2 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial")
summary(mylogit2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Repeat the rest of the process to get one line
constantGRE2 <- with(mydata, data.frame(gre = mean(gre),
gpa = gpa_sequence))
constantGRE2$theprediction <- predict(object = mylogit2,
newdata = constantGRE2,
type = "response")
ggplot(constantGRE2, aes(x = gpa, y = theprediction)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Since you didn't provide your data, I'll use the dataset from the UCLA example you're familiar with. Are you trying to do this, assuming rank to be like one of your variables?
library(ggplot2)
mydata <- read.csv("binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1
#> gre gpa rank
#> 1 587.7 3.3899 1
#> 2 587.7 3.3899 2
#> 3 587.7 3.3899 3
#> 4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
#> gre gpa rank rankP
#> 1 587.7 3.3899 1 0.5166016
#> 2 587.7 3.3899 2 0.3522846
#> 3 587.7 3.3899 3 0.2186120
#> 4 587.7 3.3899 4 0.1846684
ggplot(newdata1, aes(x = rank, y = rankP)) +
geom_col()
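If you also want uncertainty on those bars, one option (a sketch) is to predict on the link scale with standard errors and back-transform:
# a sketch: approximate 95% intervals for the predicted probabilities,
# computed on the link scale and mapped back with plogis()
pr <- predict(mylogit, newdata = newdata1, type = "link", se.fit = TRUE)
newdata1$lo <- plogis(pr$fit - 1.96 * pr$se.fit)
newdata1$hi <- plogis(pr$fit + 1.96 * pr$se.fit)
ggplot(newdata1, aes(x = rank, y = rankP)) +
  geom_col() +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.2)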
