I am trying to plot a simple logistic regression in R.
I am following this tutorial to conduct the logistic regression and calculate a P-value (https://mgimond.github.io/Stats-in-R/Logistic.html). I am trying to use ggplot2 and ggpmisc to plot the regression. I have been trying to use this guide (http://cran.nexr.com/web/packages/ggpmisc/vignettes/user-guide-1.html#stat_fit_glance) to stat_fit_glance to display a p-value
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset=read.table('input.txt', header=TRUE)
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x=ancestry, y=variable)) +
geom_point(alpha=.5, color=dataset$colorsite) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial)) + stat_fit_glance(method = "glm", method.args = list(formula = formula), geom = "text", aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))
ggsave("output.pdf")
The output however comes out as
> source("C:/Users/Deven/Desktop/logistic/script.R")
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_glance()`:
object of type 'closure' is not subsettable
I have also tried stat_cor from ggpubr, but that seem to be generating different p-values and R^2 values from what I have calculated.
UPDATE BASED ON COMMENTS:
+ stat_poly_eq(formula = y ~ x, method="glm", aes(x = ancestry, y = variable, label = paste(..p.value.label..,sep = "~~~~")),parse = TRUE) fails due to
1: Computation failed in `stat_poly_eq()`:
Method 'glm' not yet implemented.
If I remove method it defaults to a linear regresssion (and gives p values that do not correspond to a logistic regression).
SECOND UPDATE
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x=ancestry, y=variable)) +
geom_point(alpha=.5, color=dataset$colorsite) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial)) + stat_fit_tidy(method = "glm",method.args = list(family=binomial,formula=y~x), mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",after_stat(x_estimate),after_stat(x_p.value))))
ggsave("variable.pdf")
yields the following error:
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_tidy()`:
no applicable method for 'tidy' applied to an object of class "c('glm', 'lm')"
YET ANOTHER UPDATE
library(ggplot2)
library(ggpmisc)
da =read.table('data.txt', header=TRUE)
model = glm(variable ~ ancestry,family=binomial,data=da)
summary(model)
ggplot(da,aes(x = ancestry,y = variable)) + geom_point() +
stat_smooth(method="glm",se=FALSE,method.args = list(family=binomial)) +
stat_fit_tidy(method = "glm",method.args = list(family=binomial,formula=y~x),
mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
after_stat(x_estimate),after_stat(x_p.value))))
ggsave("test.pdf")
works in theory, but the p-value it gives me is very different from the p value that I calculated manually (which corresponds to the one from lrm(variable ~ ancestry, dataset))...
Not sure at all what is going on here...
There is a table on ggpmisc help page that specifies what can be applied to each type of models.
You have a glm, so glance() from tidy will not give you a p-value. Using an example:
library(ggplot2)
library(ggpmisc)
da = MASS::Pima.tr
da$label = as.numeric(da$type=="Yes")
model = glm(label ~ bmi,family=binomial,data=da)
summary(model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.11156 0.92806 -4.430 9.41e-06 ***
bmi 0.10482 0.02738 3.829 0.000129 ***
You can see glance will not give you a p-value :
broom::glance(model)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 256. 199 -120. 244. 251. 240. 198 200
You need to use tidy() and as #JonSpring mentioned in the comment, provide the formula, so something like this:
ggplot(da,aes(x = bmi,y = label)) + geom_point() +
stat_smooth(method="glm",se=FALSE,method.args = list(family=binomial)) +
stat_fit_tidy(method = "glm",method.args = list(family=binomial,formula=y~x),
mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
after_stat(x_estimate),after_stat(x_p.value))))
Thank you for all the help, but unfortunately nothing automated worked, so I came up with this instead
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset=read.table('data.txt', header=TRUE)
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
M1 <- glm(variable ~ ancestry, dataset, family = binomial)
M1
M1$null.deviance
M1$deviance
modelChi <- M1$null.deviance - M1$deviance
pseudo.R2 <- modelChi / M1$null.deviance
pseudo.R2
test <-lrm(variable ~ ancestry, dataset)
Chidf <- M1$df.null - M1$df.residual
chisq.prob <- 1 - pchisq(modelChi, Chidf)
chisq.prob
#plot logistic regression curve
all_variable <- ggplot(dataset, aes(x=ancestry, y=variable)) +
geom_point(alpha=.5, color=dataset$colorsite) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial)) + annotate("text", x=-Inf, y=Inf, hjust = 0, vjust = 2.5, label=paste("p-value: ",signif(chisq.prob, digits = 3),"\nR2: ",signif(pseudo.R2, digits = 3),sep="") )+
ggtitle("Title not relevant to Stack Overflow")
ggsave("variable.pdf")
Related
i try to search about this problem but i couldn't find,that i try to apply ggplot for the relation between the two variables (predictor and predicted x,y), the method was linear regression model (lm) but i got this error
Error in FUN(X[[i]], ...) : object 'predicted' not found
and this my code:
# Install tidymodels if you haven't done so
install.packages("rlang")
install.packages("tidymodels")
install.packages("dplyr")
# Library for modeling
library(tidymodels)
# Load tidyverse
library(tidyverse)
library(dplyr)
URL <- 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'
download.file (URL, destfile='noaa-weather-sample-data.tar.gz')
untar('noaa-weather-sample-data.tar.gz',tar = 'internal')
dataset<- read.csv ('noaa-weather-sample-data/jfk_weather_sample.csv')
head(dataset)
glimpse(dataset)
subset_data <- data.frame(dataset$HOURLYRelativeHumidity,dataset$HOURLYDRYBULBTEMPF,dataset$HOURLYStationPressure,dataset$HOURLYWindSpeed,dataset$HOURLYPrecip)
subset_data<-setNames(subset_data,c('HOURLYRelativeHumidity','HOURLYDRYBULBTEMPF','HOURLYStationPressure','HOURLYWindSpeed', 'HOURLYPrecip'))
head(subset_data,10)
unique(subset_data$HOURLYPrecip)
subset_data <- subset_data %>%mutate(HOURLYPrecip = replace(HOURLYPrecip, HOURLYPrecip == "T", "0.0"))
View(subset_data)
install.packages('stringr ')
library(stringr)
subset_data$HOURLYPrecip <- str_remove(subset_data$HOURLYPrecip, pattern = 's')
unique(subset_data$HOURLYPrecip)
subset_data$HOURLYPrecip <- as.numeric(subset_data$HOURLYPrecip)
subset_data$HOURLYRelativeHumidity <- as.numeric(subset_data$HOURLYRelativeHumidity)
subset_data$HOURLYDRYBULBTEMPF <- as.numeric(subset_data$HOURLYDRYBULBTEMPF)
subset_data$HOURLYWindSpeed <- as.numeric(subset_data$HOURLYWindSpeed)
str(subset_data1)
subset_data1 <- setNames(subset_data,c('relative_humidity','dry_bulb_temp_f','station_pressure','wind_speed', 'precip'))
#library(rsample)
set.seed(1234)
#split_data <- initial_split(subset_data1)
dt<- sort(sample(nrow(subset_data1), nrow(subset_data1)*.8))
train_data<- subset_data1 [dt,]
test_data <- subset_data1 [-dt,]
install.packages('ggplot')
library(ggplot2)
hist(train_data$relative_humidity)
hist(train_data$dry_bulb_temp_f)
hist(train_data$station_pressure)
hist(train_data$wind_speed)
hist(train_data$precip)
install.packages('stats')
library(stats)
linear_reg1 <- lm (precip ~ relative_humidity, train_data)
linear_reg2 <- lm (precip ~ dry_bulb_temp_f, train_data)
linear_reg3 <- lm (precip ~ wind_speed, train_data)
linear_reg4 <- lm (precip ~ station_pressure, train_data)
#ggplot(train_data, aes(x = train_data$relative_humidity ,y= train_data$precip)) + geom_point()+ stat_smooth(method = 'lm', col ='red')
ggplot(train_data, aes(x= relative_humidity, y= precip))+
geom_smooth(method='lm',se=FALSE ,color='red')+
geom_segment(aes(xend= relative_humidity ,yend = predicted),alpha=.2)+
geom_point()+
geom_point(aes(y = predicted),shape=1)+
theme_bw()
You simply don't have a column called predicted in your data set. Presumably you want this to be the predicted value of precipitation based on humidity. In that case, you can easily create the column in your data frame using your first linear model:
train_data$predicted <- predict(linear_reg1, train_data)
ggplot(train_data, aes(x = relative_humidity, y = precip)) +
geom_segment(aes(xend = relative_humidity, yend = predicted), alpha = 0.2) +
geom_point(alpha = 0.1) +
geom_smooth(method='lm', se = FALSE , color = 'red') +
geom_point(aes(y = predicted), shape = 1, alpha = 0.1) +
theme_bw()
As you can see, a linear model is no good here. It predicts negative precipitation at low humidity and underestimates the precipitation at high humidity.
I am trying to fit a model to my data, which has a dependent variable that can be 0 or 1.
I tried to fit a binomial glmer to the data, but the fit is pretty bad as you can see below. This puzzles me because this is quite a sigmoid so I thought I would get a great fit with that kind of model? Am I using the wrong model?
(color is my data, black is the fit)
Here is the code I used on r
library(lme4)
library(ggplot2)
exdata <- read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
model=glmer(VD~ as.factor(VI2)*VI1 + (1|ID),exdata,
family=binomial(link = "logit"),
control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))
summary(model)
exdata$fit=predict(model, type = "response")
ggplot(exdata,aes(VI1, VD, color=as.factor(VI2),
group=as.factor(VI2))) +
stat_summary(geom="line", size=0.8) +
stat_summary(aes(y=(fit)),geom="line", size=0.8, color="black") +
theme_bw()
And I tried without the random effect to see if it would change but no...
ggplot(exdata, aes(x=VI1, y=VD, color=as.factor(VI2),
group=as.factor(VI2))) +
stat_summary(fun.data=mean_se, geom="line", size=1)+
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial), color='black')
Here is the data:https://github.com/FlorianLeprevost/dummydata/blob/main/exdata.csv
tl;dr I don't think these data are as sigmoidal as you think. In particular, a logistic regression estimates a sigmoid curve that ranges from 0 to 1, whereas yours levels out (sort of) at 0.9. In much more detail:
slightly streamlined data import/model fitting
library(lme4)
library(ggplot2)
library(dplyr)
exdata <- (read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
|> mutate(across(VI2, factor))
)
model <- glmer(VD~ VI2*VI1 + (1|ID),
exdata,
family=binomial(link = "logit"))
compute data summary and predictions
This can also be done with stat_summary(), but I like the finer control of doing it myself. In particular, I like to get Clopper-Pearson CIs on the proportions (could also do this with prop.test() to get score-test CIs). I'm also computing predictions across a wider VI1-range than the data (see why below).
ddsum <- (exdata
|> group_by(VI1, VI2)
|> summarise(
num = n(),
pos = sum(VD),
VD = mean(VD),
lwr = binom.test(pos, num)$conf.int[1],
upr = binom.test(pos, num)$conf.int[2],
.groups = "drop")
)
pframe <- expand.grid(
VI1 = seq(-50, 50, length = 101),
VI2 = unique(exdata$VI2))
pframe$VD <- predict(model, newdata = pframe, re.form = ~0, type = "response")
plot
gg0 <- ggplot(ddsum,aes(x=VI1, y=VD, color=VI2)) +
geom_pointrange(position = position_dodge(width = 0.3),
aes(ymin = lwr, ymax = upr, size = num), alpha = 0.5) +
scale_size_area(max_size = 4) +
theme_bw()
gg1 <- gg0 + geom_line(data = pframe)
ggsave(g1, file = "gglogist1.png")
Conclusion: the sharp increase from x=0 to x=15 combined with the saturation below 1.0 makes it hard to fit with a logistic curve.
We could try a quadratic-logistic fit:
model2 <- update(model, . ~ poly(VI1,2)*VI2 + (1|ID))
pframe$VD2 <- predict(model2, newdata = pframe, re.form = ~0, type = "response")
gg2 <- gg1 + geom_line(data=pframe, aes(y=VD2), linetype = 2)
ggsave(gg2, file = "gglogist2.png")
This fits better (it couldn't fit worse), but might not make sense for your application.
In principle we could fit a logistic that saturated at a value <1, but it's a bit tricky ...
#linear regression
fit1 <- lm(temp ~ usage ,data= electemp)
#polynomial regression
fit2 <- lm(temp ~ poly(electemp$usage,degree), data = electemp)
ggplot(data=electemp, aes(x=temp,y=usage))+geom_point()+
stat_smooth(method="lm",col="red"). #linear regression
ggplot(electemp, aes(usage, temp) ) +
geom_point() +
stat_smooth(method = lm, formula=temp~ poly(electemp$usage, 3, raw=TRUE))
I am using the same ggplot for my polynomial regression but getting "Error: Aesthetics must be either length 1 or the same as the data (55): x".
You need to use x and y in the formula you pass to geom_smooth, not the variable names in your data frame.
Here's an example using some dummy data (though the structure and names are the same, so it should work on your own data):
library(ggplot2)
fit1 <- lm(temp ~ usage ,data= electemp)
fit2 <- lm(temp ~ poly(usage, 3), data = electemp)
ggplot(electemp, aes(usage, temp)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
ggplot(electemp, aes(usage, temp) ) +
geom_point() +
stat_smooth(method = lm, formula= y ~ poly(x, 3))
Data
set.seed(1)
electemp <- data.frame(usage = 1:60,
temp = 20 + .2 * 1:60 - 0.02*(1:60)^2 +
0.0005 * (1:60)^3 + rnorm(60, 0, 5))
Created on 2020-11-24 by the reprex package (v0.3.0)
I have a dataset of choices on a task (either 1 or 0) given a variable x. To use mtcars as an example
#binomial_smooth() from https://ggplot2.tidyverse.org/reference/geom_smooth.html
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
#plot
ggplot(data = mtcars, aes(x = disp, y = am)) + geom_point(alpha = 0.5) + binomial_smooth()
#create a model
model <- glm(am ~ disp, family = "binomial", data = mtcars)
where am would be the subjects choices and disp the x variable. I'd like to work out the value of x +/- SE (which I imagine is what binomial_smooth is plotting though I could be wrong) for the binary variable = 0.5.
Using mtcars, I'd like to find out for which disp +/- SE am = 0.5. Looked around and just been getting more confused so any help would be much appreciated!
best,
Ok, so I figured this out after following a rabbit hole from Roman Luštrik (cheers!).
Uses the MASS package and a function used to calculate the LD50. Also allow for manually choosing a p value to look for.
library(ggplot2)
library(MASS)
#binomial_smooth() from https://ggplot2.tidyverse.org/reference/geom_smooth.html
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
#create a model
model <- glm(am ~ disp, family = "binomial", data = mtcars)
#get the 'LD50'- the point at which the binomial regression crosses 50%
LD50 <- dose.p(model, p = 0.5)
#print the details
print(LD50)
#replot the figure with the LD50 vlines
ggplot(data = mtcars, aes(x = disp, y = am)) +
geom_point(alpha = 0.5) +
binomial_smooth() +
geom_vline(xintercept = LD50[[1]])
I'm new with R and I have fit 3 models for my data as follows:
Model 1: y = a(x) + b
lm1 = lm(data$CBI ~ data$dNDVI)
Model 2: y = a(x)2 + b(x) + c
lm2 <- lm(CBI ~ dNDVI + I(dNDVI^2), data=data)
Model 3: y = x(a|x| + b)–1
lm3 = nls(CBI ~ dNDVI*(a*abs(dNDVI) + b) - 1, start = c(a = 1.5, b = 2.7), data = data)
Now I would like to plot all these three models in R but I could not find the way to do it, can you please help me? I have tried with the first two models as follow and it work but I don't know how to add the Model 3 on it:
ggplot(data = data, aes(x = dNDVI, y = CBI)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, size = 1, se = FALSE) +
geom_smooth(method = lm, formula = y ~ x + I(x^2), size = 1, se = FALSE ) +
theme_bw()
I also would like to add a legend which show 3 different colours or types of lines/curves for the 3 models as well. Can you please guide me how to make it in the figure?
Using iris as a dummy set to represent the three models:
new.dat <- data.frame(Sepal.Length=seq(min(iris$Sepal.Length),
max(iris$Sepal.Length), length.out=50)) #new data.frame to predict the fitted values for each model
m1 <- lm(Petal.Length ~ Sepal.Length, iris)
m2 <- lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2), data=iris)
m3 <- nls(Petal.Length ~ Sepal.Length*(a*abs(Sepal.Length) + b) - 1,
start = c(a = 1.5, b = 2.7), data = iris)
new.dat$m1.fitted <- predict(m1, new.dat)
new.dat$m2.fitted <- predict(m2, new.dat)
new.dat$m3.fitted <- predict(m3, new.dat)
new.dat <- new.dat %>% gather(var, val, m1.fitted:m3.fitted) #stacked format of fitted data of three models (to automatically generate the legend in ggplot)
ggplot(new.dat, aes(Sepal.Length, val, colour=var)) +
geom_line()