ggplot2: Logistic Regression - plot probabilities and regression line - r

I have a data.frame containing a continuous predictor and a dichotomous response variable.
> head(df)
position response
1 0 1
2 3 1
3 -4 0
4 -1 0
5 -2 1
6 0 0
I can easily compute a logistic regression by means of the glm()-function, no problems up to this point.
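For reference, a minimal sketch of that step (assuming the data.frame and column names shown above):
fit <- glm(response ~ position, data = df, family = binomial) # logistic regression of the 0/1 response on position
summary(fit)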
Next, I want to create a plot with ggplot that contains both the empirical probabilities for each of the 11 predictor values and the fitted regression line.
I went ahead and computed the probabilities with cast() and saved them in another data.frame
> probs
position prob
1 -5 0.0500
2 -4 0.0000
3 -3 0.0000
4 -2 0.2000
5 -1 0.1500
6 0 0.3684
7 1 0.4500
8 2 0.6500
9 3 0.7500
10 4 0.8500
11 5 1.0000
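In case cast() is unavailable, a base-R sketch of that step (assuming the df shown above) would be:
probs <- aggregate(response ~ position, data = df, FUN = mean) # mean of the 0/1 response per position
names(probs)[2] <- "prob"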
I plotted the probabilities:
p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()
But when I try to add the fitted regression line
p <- p + stat_smooth(method="glm", family="binomial", se=F)
it returns a warning: non-integer #successes in a binomial glm!.
I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?

There are basically three solutions:
Merging the data.frames
The easiest option, once you have your data in two separate data.frames, is to merge them by position:
mydf <- merge( mydf, probs, by="position")
Then you can call ggplot on this data.frame without warnings, letting the point layer show prob while the smoother is fitted on the original 0/1 response:
ggplot( mydf, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(aes(y = response),
              method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
Avoiding the creation of two data.frames
In the future you could avoid creating two separate data.frames that you then have to merge. Personally, I like to use the plyr package for that:
library(plyr)
mydf <- ddply( mydf, "position", mutate, prob = mean(response) )
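An equivalent sketch with dplyr, for those who prefer it over the older plyr package (not part of the original answer):
library(dplyr)
mydf <- mydf %>%
  group_by(position) %>%
  mutate(prob = mean(response)) %>% # empirical probability per position
  ungroup()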
Edit: Use different data for each layer
I forgot to mention that you can use a different data.frame for each layer, which is a strong advantage of ggplot2:
ggplot( probs, aes(x=position, y=prob)) +
geom_point() +
geom_smooth(data = mydf, aes(x = position, y = response),
method = "glm", method.args = list(family = "binomial"),
se = FALSE)
As an additional hint: avoid using the variable name df, since assigning to it masks the built-in function stats::df.

Related

visualizing clusters extracted from MClust using ggplot2

I am analysing the distribution of my data using mclust (follow-up to Clustering with Mclust results in an empty cluster)
Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
First, I evaluate the clusters present in my data:
library(reshape2)
library(mclust)
library(ggplot2)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Now having them identified, I would like to overlay the total distribution with distributions of the individual components. To do this, I tried to extract the assignment of each value to the respective cluster using:
df <- as.data.frame(data)
df$classification <- as.factor(df$value[fit$classification])
ggplot(df, aes(value, fill= classification)) +
geom_density(aes(col=classification, fill = NULL), size = 1)
As a result, I get the following:
It looks like it worked; however, I wonder:
a) where the descriptions (1602, 1639 and 1823) of the individual classifications come from
b) how I can scale the individual densities as a fraction of the total (for example 1823 contributes only 91 values out of 3258 observations; see above)
c) if it makes sense to alternatively use predicted normal distributions based on the mean + SD obtained?
Any help or suggestions are highly appreciated!
I think you could get what you want in the following way:
library(dplyr) # mutate() and the %>% pipe come from dplyr
data_melt <- data_melt %>% mutate(class = as.factor(fit$classification))
ggplot(data_melt, aes(x=value, colour=class, fill=class)) +
geom_density(aes(y=..count..), alpha=.25)
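The y=..count.. mapping scales each density curve by the number of observations in its class, so the component curves appear as fractions of the total rather than each integrating to 1, which addresses point b). Using fit$classification directly, instead of indexing value by it as in the question, also answers point a): the odd labels (1602, 1639, 1823) most likely came from df$value[fit$classification], which picks data values rather than cluster numbers. In current ggplot2 versions the same mapping is written after_stat(count).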

Can geom_smooth accept logical variables for glm?

I have a tibble with numerical and logical variables, e.g. like this:
x f y
<dbl> <int> <dbl>
1 -2 1 -0.801
2 -1.96 0 -2.27
3 -1.92 0 -1.75
4 -1.88 0 -2.44
5 -1.84 1 -0.123
...
For reproducibility, it can be generated using:
library(tidyverse)
set.seed(0)
tb1 = tibble(
x=(-50:50)/25,
p=plogis(x),
f=rbinom(p, 1, p),
y = x+f+rnorm(x, 0, .5)
) %>% select(-p)
I'd like to plot the points and draw regression lines, once taking x as the predictor and f as the outcome (logistic regression), and once taking x and f as predictors and y as the outcome (linear regression). This works well for the logistic regression.
ggplot(tb1, aes(x, f)) +
geom_point() +
geom_smooth(method="glm", method.args=list(family="binomial"))
produces:
but:
ggplot(tb1, aes(x, y, colour=f)) +
geom_point() +
geom_smooth(method="lm")
produces:
which is wrong. I want f treated as a factor, producing two regression lines, and a discrete legend instead of the continuous colour scale. I can force f manually to a logical value:
tb2 = tb1 %>% mutate(f = f>0)
and obtain the correct linear regression graph:
but now I cannot plot the logistic regression. I get the
Warning message:
Computation failed in stat_smooth():
y values must be 0 <= y <= 1
For some reason, both lm() and glm() have no problems:
summary(glm(f ~ x, binomial, tb1))
summary(lm(y ~ x + f, tb1))
summary(glm(f ~ x, binomial, tb2))
summary(lm(y ~ x + f, tb2))
all produce reasonable results, and the results are identical for tb1 and tb2, as they should be. So is there a way of convincing geom_smooth to accept logical variables, or must I use two redundant variables, with identical values but of a different type, e.g. f.int and f.lgl?
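One workaround, offered here only as a sketch (it is not from the original thread): keep f as a single integer column and coerce it inside aes() only where a factor is needed.
# Logistic fit: f stays numeric 0/1, so the binomial smooth is happy
ggplot(tb1, aes(x, f)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
# Linear fit: coercing to a factor inside aes() gives two lines and a discrete legend
ggplot(tb1, aes(x, y, colour = factor(f))) +
  geom_point() +
  geom_smooth(method = "lm")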

Dummy variables for Logistic regression in R

I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
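To turn these four log-odds into predicted probabilities, apply the inverse logit (plogis() in R); the numbers line up with the predict() output shown further below:
plogis(1.9062)                            # N, Other:        ~0.87
plogis(1.9062 - 1.3582)                   # P, Other:        ~0.63
plogis(1.9062 + 0.9842)                   # N, Shoplifting:  ~0.95
plogis(1.9062 - 1.3582 + 0.9842 - 0.5513) # P, Shoplifting:  ~0.73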
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to use Priorconv*Crime, which is more compact.

Plotting one predictor of a model that has several predictors with ggplot

Here is a typical example of a linear model and a ggplot:
require(ggplot2)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
family = gaussian, data = anorexia)
coef(anorex.1)
(Intercept) Prewt TreatCont TreatFT
49.7711090 -0.5655388 -4.0970655 4.5630627
ggplot(anorexia, aes(y=Postwt, x=Prewt)) + geom_point() + geom_smooth(method='lm', se=F)
My problem is that the regression made by geom_smooth(...) is not the same model as anorex.1, but rather:
coef(lm(Postwt ~ Prewt, data=anorexia))
(Intercept) Prewt
42.7005802 0.5153804
How can I plot the model anorex.1 with ggplot?
Could I just take the intercept (49.77) and the Prewt estimate (-0.5655) from anorex.1 and plot them with geom_abline()? Would that be correct? Is there a simpler solution?
Since your model contains two predictors (different intercept values for the Treat levels) as well as an offset variable, it won't be possible to include it directly in geom_smooth(). One way would be to make a new data frame, new.dat, that contains values of Prewt for all three levels of Treat. Then use this new data frame to predict Postwt values for all levels with your model and add the predicted values to the new data frame.
new.dat<-data.frame(Treat=rep(levels(anorexia$Treat),each=100),
Prewt=rep(seq(70,95,length.out=100),times=3))
anorexia.2<-data.frame(new.dat,Pred=predict(anorex.1,new.dat))
head(anorexia.2)
Treat Prewt Pred
1 CBT 70.00000 80.18339
2 CBT 70.25253 80.29310
3 CBT 70.50505 80.40281
4 CBT 70.75758 80.51253
5 CBT 71.01010 80.62224
6 CBT 71.26263 80.73195
Now plot the original points from the original data frame and add lines using the new data frame that contains the predictions.
ggplot(anorexia,aes(x=Prewt,y=Postwt,color=Treat))+geom_point()+
geom_line(data=anorexia.2,aes(x=Prewt,y=Pred,color=Treat))
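As for the geom_abline() idea raised in the question: it can work, but because of the offset(Prewt) term the slope on the response scale is 1 plus the Prewt coefficient, and each Treat level only shifts the intercept. A sketch of that alternative (not part of the original answer):
cf <- coef(anorex.1)
lines.df <- data.frame(Treat = levels(anorexia$Treat),
                       intercept = cf["(Intercept)"] + c(0, cf["TreatCont"], cf["TreatFT"]),
                       slope = 1 + cf["Prewt"]) # the offset adds 1 to the Prewt slope
ggplot(anorexia, aes(x = Prewt, y = Postwt, colour = Treat)) +
  geom_point() +
  geom_abline(data = lines.df, aes(intercept = intercept, slope = slope, colour = Treat))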

Fitting a function in R

I have a few datapoints (x and y) that seem to have a logarithmic relationship.
> mydata
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77
> qplot(x, y, data=mydata, geom="line")
Now I would like to find an underlying function that fits the graph and allows me to infer other datapoints (e.g. at x = 3 or x = 82). I read about lm and nls, but I'm not really getting anywhere.
At first, I created a function of which I thought it resembled the plot the most:
f <- function(x, a, b) {
a * exp(b *-x)
}
x <- seq(0:100)
y <- f(seq(0:100), 1,1)
qplot(x,y, geom="line")
Afterwards, I tried to generate a fitting model using nls:
> fit <- nls(y ~ f(x, a, b), data=mydata, start=list(a=1, b=1))
Error in numericDeriv(form[[3]], names(ind), env) :
Missing value or an Infinity produced when evaluating the model
Can someone point me in the right direction on what to do from here?
Follow up
After reading your comments and googling around a bit further, I adjusted the starting parameters for a, b and c, and then suddenly the model converged.
fit <- nls(y~f(x,a,b,c), data=data.frame(mydata), start=list(a=1, b=30, c=-0.3))
x <- seq(0,120)
fitted.data <- data.frame(x=x, y=predict(fit, list(x=x)))
ggplot(mydata, aes(x, y)) + geom_point(color="red", alpha=.5) + geom_line(alpha=.5) + geom_line(data=fitted.data)
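For that three-parameter call to run, f has to be extended with a third argument; a plausible form, assumed here since the follow-up does not show it, is exponential decay towards an asymptote c:
f <- function(x, a, b, c) {
  a * exp(b * -x) + c # decays from a + c at x = 0 towards c for large x (with b > 0)
}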
Maybe using a cubic specification for your model and estimating via lm would give you a good fit.
# Importing your data
dataset <- read.table(text='
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77', header=T)
# I think one possible specification would be a cubic linear model
y.hat <- predict(lm(y~x+I(x^2)+I(x^3), data=dataset)) # estimating the model and obtaining the fitted values from the model
qplot(x, y, data=dataset, geom="line") # your plot black lines
last_plot() + geom_line(aes(x=x, y=y.hat), col=2) # the fitted values red lines
# It fits well.
Try taking the log of your response variable and then using lm to fit a linear model:
fit <- lm(log(y) ~ x, data=mydata)
The adjusted R-squared is 0.8486, which at face value isn't bad. You can look at the fit using plot, for example:
plot(fit, which=2)
But perhaps it's not such a good fit after all:
last_plot() + geom_line(aes(x=x, y=exp(fit$fitted.values)))
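If the goal is to infer new points (the x = 3 or x = 82 mentioned in the question), predictions from this model have to be back-transformed from the log scale; a short sketch:
exp(predict(fit, newdata = data.frame(x = c(3, 82)))) # undo the log() applied to y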
Check this document out: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
In brief, first you need to decide on the model to fit onto your data (e.g., exponential) and then estimate its parameters.
Here are some widely used distributions:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
