Logistic regression for non-linear data - r

I have data with a continuous independent variable and a binary dependent variable, so I was trying to apply logistic regression to it. However, in contrast to the classical case with a single S-shaped transition, my data has two transitions.
Here is an example of what I mean:
library(ggplot2)
library(visreg)
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
data = classic.data,
family = "binomial")
summary(model.classic)
visreg(model.classic,
partial = FALSE,
scale = "response",
alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
data = my.data,
family = "binomial")
summary(model.my)
visreg(model.my,
partial = FALSE,
scale = "response",
alpha = 0)
The blue line on each plot is the glm fit, while the red line is what I want to have.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?

In your second model, the log-odds of y are not a linear function of x. When you write y ~ x you assume that as x increases, the probability of y = 1 only increases or only decreases, depending on the sign of the coefficient. That is not the case here: the probability rises and then falls, making the average effect of x zero (hence the flat, straight line). You therefore need a non-linear function of x. You could get that with a gam from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
plot(m)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")
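If you would rather stay within glm, a quadratic term is another way to let the fitted probability rise and then fall. The following is a minimal sketch on simulated data (the data, the seed and the names sim, m_quad and grid are assumptions for illustration, not part of the question). Noisy data are used because the example data in the question are perfectly separated by a quadratic, which typically makes glm warn about fitted probabilities of 0 or 1.

```r
set.seed(42)
# Simulated data (assumption): P(y = 1) peaks in the middle of the x range
sim <- data.frame(x = runif(300, min = 0, max = 15))
sim$y <- rbinom(300, size = 1, prob = plogis(3 - 0.3 * (sim$x - 7.5)^2))

# Logistic regression with a linear and a quadratic term in x
m_quad <- glm(y ~ x + I(x^2), data = sim, family = binomial)

# Predicted probabilities on a regular grid
grid <- data.frame(x = seq(0, 15, by = 0.1))
grid$p <- predict(m_quad, newdata = grid, type = "response")

plot(sim$x, sim$y, xlab = "x", ylab = "y")
lines(grid$x, grid$p, col = "red")
```

A quadratic forces a symmetric, single-peaked shape on the log-odds scale, so the gam above remains the more flexible choice when the shape is unknown.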

Related

Mixed model plotting with R - showing the data points

I have run a mixed effects binary model using the following code:
model = glmer(A ~ B + (1|C), data = data, family = "binomial")
summary(model)
I am now plotting the marginal fixed effects for a variable of interest (B). I have taken the code from the nice page on:
https://cran.r-project.org/web/packages/ggeffects/vignettes/practical_logisticmixedmodel.html
To produce the graph I have used:
ggpredict(model, "B")
plot(ggpredict(model, "B"))
This produces the following plot, which I like. But I also want the data points for variable B to show on the graph. How can I add them? Thanks.
Welcome to Stack Overflow :)
Sadly, I don't know how (or whether it is possible) to add points to your plot of the ggpredict object, since I am no good with ggplot :/
But I can offer a workaround with base plot. The only thing missing is the grey confidence band, which may be crucial for good looks? :D
Cheers
#using the example data from the link you provided:
library(magrittr)
library(ggeffects)
library(sjmisc)
library(lme4)
library(splines)
set.seed(123)
#creating the data:
dat <- data.frame(
outcome = rbinom(n = 100, size = 1, prob = 0.35),
var_binom = as.factor(rbinom(n = 100, size = 1, prob = 0.2)),
var_cont = rnorm(n = 100, mean = 10, sd = 7),
group = sample(letters[1:4], size = 100, replace = TRUE)
)
dat$var_cont <- sjmisc::std(dat$var_cont)
#model creation:
m1 <- glmer( outcome ~ var_binom + var_cont + (1 | group),
data = dat,
family = binomial(link = "logit")
)
#save results:
m1_results <- ggpredict(m1, "var_cont")
#same plot you did:
plot(m1_results)
#workaround using baseplot:
#plotting the raw data:
plot(dat$outcome~dat$var_cont,
pch = 16,
ylab = "outcome",
xlab = "var_cont",
yaxt = "n")
#adding yaxis with percentages:
axis(2, at = pretty(dat$outcome), lab=paste0(pretty(dat$outcome) * 100," %"), las = TRUE)
#adding the model taken from ggpredict:
lines(m1_results$predicted~m1_results$x,
type = "l")
#upper and lower conf intervals:
lines(m1_results$conf.low~m1_results$x,
lty=2)
lines(m1_results$conf.high~m1_results$x,
lty=2)
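Since the ggpredict() result has the columns x, predicted, conf.low and conf.high (as used in the base plot above), another option is to build the ggplot yourself and add the raw points as an extra layer, which also keeps the grey confidence band. A sketch, assuming simulated data loosely similar to the example above; the object names dat2, shift, m2, pred and pred_df are made up for illustration:

```r
library(lme4)
library(ggeffects)
library(ggplot2)

set.seed(123)
# Simulated data (assumption), loosely following the example above
dat2 <- data.frame(
  var_cont = rnorm(200),
  group = sample(letters[1:4], size = 200, replace = TRUE)
)
# Group-specific intercept shifts so the random effect is not degenerate
shift <- c(a = -0.8, b = -0.3, c = 0.3, d = 0.8)
dat2$outcome <- rbinom(
  200, size = 1,
  prob = plogis(-0.5 + 0.8 * dat2$var_cont + shift[dat2$group])
)

m2 <- glmer(outcome ~ var_cont + (1 | group), data = dat2, family = binomial)

# Predicted probabilities for var_cont, copied into a plain data frame
pred <- ggpredict(m2, "var_cont")
pred_df <- data.frame(x = pred$x, predicted = pred$predicted,
                      conf.low = pred$conf.low, conf.high = pred$conf.high)

p <- ggplot(pred_df, aes(x = x, y = predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +  # band
  geom_line() +                                                       # fit
  geom_point(data = dat2, aes(x = var_cont, y = outcome), alpha = 0.4) +
  labs(x = "var_cont", y = "Predicted probability of outcome")
print(p)
```

Recent ggeffects versions may also offer an argument to plot() for showing the raw data, but its name has changed between releases, so check ?plot.ggeffects for your installed version.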

Weighting using predict function

I have used 'predict' to find a fit line for a linear model (lm) I have created. Because the lm was built on only 2 data points and needs to have a positive slope, I have forced it to go through the origin (0,0). I have also weighted the model by the number of observations underlying each data point.
Question 1: (SOLVED - see comment by @Gregor)
Why does the predicted line lie so much closer to my second data point (B) than my first data point (A), when B has fewer underlying observations? Did I code something wrong here when weighting the model?
Question 2:
I am plotting a GLM (link = logit) now, but how can I still force this through 0,0? I've tried adding formula = y~0+x in several places, none of which seem to work.
M <- data.frame("rate" = c(0.4643,0.2143), "conc" = c(300,6000), "nr_dead" = c(13,3), "nr_surv" = c(15,11), "region" = c("A","B"))
M$tot_obsv <- (M$nr_dead+M$nr_surv)
M_conc <- M$conc
M_rate <- M$rate
M_tot_obsv <- M$tot_obsv
#**linear model of data, force 0,0 intercept, weighted by nr. of observations of each data point.**
M_lm <- lm(data = M, rate~0+conc, weights = tot_obsv)
#**plot line using "predict" function**
x_conc <-c(600, 6700)
y_rate <- predict(M_lm, list(conc = x_conc), weights = tot_obsv, type = 'response')
plot(x = M$conc, y = M$rate, pch = 16, ylim = c(0, 0.5), xlim = c(0,7000), xlab = "conc", ylab = "death rate")
lines(x_conc, y_rate, col = "red", lwd = 2)
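#**EDIT 1:**
library(dplyr)   # tibble(), %>% and left_join()
library(ggplot2) # ggplot(), geom_smooth(), geom_point()
M_glm <- glm(cbind(nr_dead, nr_surv) ~ (0+conc), data = M, family = "binomial")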
#*plot using 'predict' function*
binomial_smooth <- function(formula = (y ~ 0+x),...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), formula = (y ~ 0+x), ...)
}
tibble(x_conc = c(seq(300, 7000, 1), M$conc), y_rate = predict.glm(M_glm, list(conc = x_conc), type = "response")) %>% left_join(M, by = c('x_conc' = 'conc')) %>%
ggplot(aes(x = x_conc, y = y_rate)) + xlab("concentration") + ylab("death rate") +
geom_point(aes(y = rate, size = tot_obsv)) + binomial_smooth(formula = (y ~ 0+x)) + theme_bw()
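Two properties of these models may help with both questions. In a weighted least-squares fit through the origin, the slope is sum(w * x * y) / sum(w * x^2), so a point at a large conc pulls the line much harder than its observation count alone suggests. And in a binomial model with a logit link, dropping the intercept fixes the linear predictor at 0 when conc = 0, which corresponds to a probability of 0.5 rather than 0. The sketch below checks both with the data from the question (the names w, slope_manual and p_at_zero are introduced here for illustration, and this is not necessarily the explanation given in the comment mentioned above):

```r
M <- data.frame(rate = c(0.4643, 0.2143), conc = c(300, 6000),
                nr_dead = c(13, 3), nr_surv = c(15, 11),
                region = c("A", "B"))
M$tot_obsv <- M$nr_dead + M$nr_surv

# Question 1: closed-form slope of a weighted regression through the origin
M_lm <- lm(rate ~ 0 + conc, data = M, weights = tot_obsv)
w <- M$tot_obsv
slope_manual <- sum(w * M$conc * M$rate) / sum(w * M$conc^2)
all.equal(unname(coef(M_lm)), slope_manual)

# Pull of each point on the slope: weight times squared conc
w * M$conc^2

# Question 2: a logit model without intercept predicts 0.5 at conc = 0
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = binomial)
p_at_zero <- predict(M_glm, newdata = data.frame(conc = 0), type = "response")
p_at_zero
```

Here w * conc^2 is 2.52e6 for A and 5.04e8 for B, so B dominates the fit despite having half the observations. If the curve has to reach a death rate of 0 at conc = 0, a logit link without an intercept will not do that; a different link or a transformed predictor would be needed, and which one is appropriate depends on the subject matter.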

Plotting multiple lift curves

I am new to R and trying to learn. I am trying to plot the lift curves of multiple classifiers in one graph, but I can't figure out a way to do it. I know the two classifiers below are essentially the same, but they give different graphs and I just want to combine the two. Below is the code I tried. Could someone please point me in the right direction?
fullmod = glm(Response ~ page_views_90d+win_visits+osx_visits+mc_1+mc_2+mc_3+mc_4+mc_5+mc_6+store_page+orders+orderlines+bookings+purchase, data=training, family=binomial)
summary(fullmod)
fullmod.results <- predict(fullmod, newdata = testing, type='response')
plotLift(fullmod.results, test_data_full$class, cumulative = TRUE,col="orange", n.buckets = 5)
redmod1 = glm(Response ~ win_visits+osx_visits+mc_2+mc_4+mc_6+store_page+orders+orderlines+bookings+purchase, data=training, family=binomial)
redmod1.results <- predict(redmod1, newdata = testing, type = 'response')
plotLift(redmod1.results, test_data_full$class, cumulative = TRUE,col="orange", n.buckets = 5)
# Attempt to plot multiple classifiers
plotLift((redmod1.results, fullmod.results), test_data_full$class, cumulative = TRUE,col="orange", n.buckets = 5)
Here is a way to plot multiple lift curves using the caret library. But first some data:
set.seed(1)
for_lift <- data.frame(Class = factor(rep(1:2, each = 50)),
model1 = sort(runif(100), decreasing = TRUE),
model2 = runif(100),
model3 = runif(100))
Here the Class column holds the true classes, model1 holds the predicted probabilities from the first model, and so on.
Now create a lift object from the data using:
library(caret)
lift_curve <- lift(Class ~ model1 + model2, data = for_lift)
and plot it
xyplot(lift_curve, auto.key = list(columns = 3))
If you would like to plot with ggplot:
library(ggplot2)
ggplot(lift_curve$data)+
geom_line(aes(CumTestedPct, CumEventPct, color = liftModelVar))+
xlab("% Samples tested")+
ylab("% Samples found")+
scale_color_discrete(guide = guide_legend(title = "method"))+
geom_polygon(data = data.frame(x = c(0, lift_curve$pct, 100, 0),
y = c(0, 100, 100, 0)),
aes(x = x, y = y), alpha = 0.1)
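If you prefer not to depend on caret, the cumulative gain curve behind these plots can be computed by hand: sort the cases by decreasing score and track the share of events found so far. A base R sketch, assuming the first level of Class is the event of interest; the helper cum_gain is hypothetical and not part of any package:

```r
set.seed(1)
for_lift <- data.frame(Class = factor(rep(1:2, each = 50)),
                       model1 = sort(runif(100), decreasing = TRUE),
                       model2 = runif(100))

# Cumulative share of events found when cases are ranked by decreasing score
cum_gain <- function(score, is_event) {
  ord <- order(score, decreasing = TRUE)
  data.frame(pct_tested = 100 * seq_along(ord) / length(ord),
             pct_found  = 100 * cumsum(is_event[ord]) / sum(is_event))
}

is_event <- for_lift$Class == "1"  # assumption: level "1" is the event
g1 <- cum_gain(for_lift$model1, is_event)
g2 <- cum_gain(for_lift$model2, is_event)

# One plot, one line per classifier, plus the no-skill diagonal
plot(g1$pct_tested, g1$pct_found, type = "l", col = "orange",
     xlab = "% Samples tested", ylab = "% Samples found")
lines(g2$pct_tested, g2$pct_found, col = "blue")
abline(0, 1, lty = 2)
legend("bottomright", legend = c("model1", "model2"),
       col = c("orange", "blue"), lty = 1)
```

The same pattern extends to any number of classifiers: call cum_gain once per vector of predicted probabilities and add one lines() call for each.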

Marginal effects of logit in weighted survey data using R

I'm trying to estimate the marginal effects of a logit model in which I have several dichotomous explanatory variables.
Let's say the model is estimated by
logit<- svyglm ( if_member ~ if_female + dummy_agegroup_2 + dummy_agegroup_3 + dummy_education_2 + dummy_education_3 + dummy_education_4, family = quasibinomial(link = "logit"), design = survey_design)
I know about the marginpred function in the survey package, but I am not very familiar with it. I have only dichotomous variables in the model, so I am wondering how to estimate marginal effects with this function. In particular, I am not sure about the predictat argument (a data frame giving values of the variables in the model to predict at).
Are you looking for marginal effects or marginal predictions?
As the name implies, the marginpred() function returns predictions. The argument for predictat is a data frame with both the control variables and the variables that are in the model. Let me emphasize that: control variables should be left out of the model.
library("survey")
odds2prob <- function(x) x / (x + 1)
prob2odds <- function(x) x / (1 - x)
expit <- function(x) odds2prob(exp(x))
logit <- function(x) log(prob2odds(x))
set.seed(1)
survey_data <- data.frame(
if_female = rbinom(n = 100, size = 1, prob = 0.5),
agegroup = factor(sample(x = 1:3, size = 100, replace = TRUE)),
education = NA_integer_,
if_member = NA_integer_)
survey_data["agegroup"] <- relevel(survey_data$agegroup, ref = 3)
# Different probabilities between female and male persons
survey_data[survey_data$if_female == 0, "education"] <- sample(
x = 1:4,
size = sum(survey_data$if_female == 0),
replace = TRUE,
prob = c(0.1, 0.1, 0.5, 0.3))
survey_data[survey_data$if_female == 1, "education"] <-sample(
x = 1:4,
size = sum(survey_data$if_female == 1),
replace = TRUE,
prob = c(0.1, 0.1, 0.3, 0.5))
survey_data["if_member"] <- rbinom(n = 100, size = 1, prob =
expit((survey_data$education - 3)/2))
survey_data["education"] <- factor(survey_data$education)
survey_data["education"] <- relevel(survey_data$education, ref = 3)
survey_design <- svydesign(ids = ~ 1, data = survey_data)
logit <- svyglm(if_member ~ if_female + agegroup + education,
family = quasibinomial(link = "logit"),
design = survey_design)
exp(cbind(`odds ratio` = coef(logit), confint(logit)))
newdf <- data.frame(if_female = 0:1, education = c(3, 3), agegroup = c(3, 3))
# Fails
mp <- marginpred(model = logit, adjustfor = ~ agegroup + education,
predictat = newdf, se = TRUE, type = "response")
logit2 <- svyglm(if_member ~ if_female,
family = quasibinomial(link = "logit"),
design = survey_design)
mp <- marginpred(model = logit2, adjustfor = ~ agegroup + education,
predictat = newdf, se = TRUE, type = "response")
# Probability for male and for female persons controlling for agegroup and education
cbind(prob = mp, confint(mp))
That's how I estimate marginal effects with the survey package:
# Probability difference between female and male persons
# when agegroup and education are set to 3
svycontrast(logit, quote(
(exp(`(Intercept)` + if_female) / (exp(`(Intercept)` + if_female) + 1)) -
(exp(`(Intercept)`) / (exp(`(Intercept)`) + 1))))
# Can't use custom functions like expit :_(
There are probably smarter ways, but I hope it helps.
Please note that the difference between the probabilities predicted by marginpred() is not the same as the difference estimated by svycontrast(). The probabilities predicted by marginpred() don't seem to be affected by changing the value of the control variables (for example, education = c(4, 4) instead of education = c(3, 3)), but the estimates from svycontrast() are affected, as implied by the regression model.
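To see what the svycontrast() expression above computes, it can help to compare it with the same quantity built directly from the coefficients with plogis(). A sketch on simulated data with equal weights; the data and the names d, des, fit, sc and by_hand are assumptions for illustration, and only the point estimate is compared:

```r
library(survey)

set.seed(1)
# Simulated data (assumption): membership is more likely for if_female = 1
d <- data.frame(if_female = rbinom(200, size = 1, prob = 0.5))
d$if_member <- rbinom(200, size = 1, prob = plogis(-0.5 + 0.7 * d$if_female))
d$w <- 1  # equal weights (assumption), only to keep the example small

des <- svydesign(ids = ~1, weights = ~w, data = d)
fit <- svyglm(if_member ~ if_female, design = des,
              family = quasibinomial(link = "logit"))

# Difference in predicted probability (if_female = 1 minus if_female = 0),
# with a delta-method standard error from svycontrast()
sc <- svycontrast(fit, quote(
  exp(`(Intercept)` + if_female) / (exp(`(Intercept)` + if_female) + 1) -
    exp(`(Intercept)`) / (exp(`(Intercept)`) + 1)))
sc

# Same point estimate computed by hand from the coefficients
b <- coef(fit)
by_hand <- plogis(b[["(Intercept)"]] + b[["if_female"]]) -
  plogis(b[["(Intercept)"]])
by_hand
```

In a model with further covariates, this contrast holds them at their reference levels, which is consistent with the svycontrast() estimate changing when the control variables change.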

Plot a smooth and extrapolated curve using an nls model with several fitted parameters

I feel that I am close to finding the answer to my problem, but somehow I just cannot manage to do it. I have used the nls function to fit 3 parameters of a rather complicated function describing the fertilization success of eggs (y-axis) over a range of sperm concentrations (x-axis) (Styan's model [1], [2]). Fitting the parameters works fine, but I cannot manage to plot a smoothed, extrapolated curve using the predict function (see the end of this post). I guess it is because the x-axis variable is not one of the fitted parameters. My question is: how do I plot a smoothed and extrapolated curve from a model fitted with nls, using x-axis values that were not in the fitted data?
Here is an example:
library(ggplot2)
data.nls <- structure(list(S0 = c(0.23298, 2.32984, 23.2984, 232.98399, 2329.83993,
23298.39926), fert = c(0.111111111111111, 0.386792452830189,
0.158415841584158, 0.898648648648649, 0.616, 0.186440677966102
), speed = c(0.035161615379406, 0.035161615379406, 0.035161615379406,
0.035161615379406, 0.035161615379406, 0.035161615379406), E0 = c(6.86219803476946,
6.86219803476946, 6.86219803476946, 6.86219803476946, 6.86219803476946,
7.05624476582978), tau = c(1800, 1800, 1800, 1800, 1800, 1800
), B0 = c(0.000102758645352932, 0.000102758645352932, 0.000102758645352932,
0.000102758645352932, 0.000102758645352932, 0.000102758645352932
)), .Names = c("S0", "fert", "speed", "E0", "tau", "B0"), row.names = c(NA,
6L), class = "data.frame")
## Model S
modelS <- function(Fe, tb, Be) with (data.nls,{
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)})
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- cbind(data.nls, data.frame(pred = predict(modelS.fitted)))
## Plot
ggplot(model.data) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
I have tried following joran's example here, but it has no effect, maybe because I did not fit S0:
r <- range(model.data$S0)
S0.ext <- seq(r[1],r[2],length.out = 200)
predict(modelS.fitted, newdata = list(S0 = S0.ext))
# [1] 0.002871585 0.028289057 0.244399948 0.806316161 0.705116868 0.147974213
Your function should take the parameters (S0, E0, B0, tau, Fe, tb, Be). nls will look for these names in the data.frame passed to its data argument and only try to fit those it doesn't find there (provided that starting values are given). There is no need for the with() construct in your function (with shouldn't be used inside functions anyway; it's meant for interactive use). In predict, newdata must contain all variables, that is S0, E0, B0, and tau.
Try this:
modelS <- function(S0,E0,B0,tau,Fe, tb, Be) {
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)}
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(S0,E0,B0,tau,Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- data.frame(
S0=seq(min(data.nls$S0),max(data.nls$S0),length.out=1e5),
E0=seq(min(data.nls$E0),max(data.nls$E0),length.out=1e5),
B0=seq(min(data.nls$B0),max(data.nls$B0),length.out=1e5),
tau=seq(min(data.nls$tau),max(data.nls$tau),length.out=1e5))
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(data.nls) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(data=model.data,aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
Obviously, this might not be what you want, since the function has multiple variables and more than one of them varies in newdata. Normally one would vary only one and keep the others constant for such a plot.
So this might be more appropriate:
S0 <- seq(min(data.nls$S0),max(data.nls$S0),length.out=1e4)
E0 <- seq(1,20,length.out=20)
B0 <- unique(data.nls$B0)
tau <- unique(data.nls$tau)
model.data <- expand.grid(S0,E0,B0,tau)
names(model.data) <- c("S0","E0","B0","tau")
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(model.data) +
geom_line(aes(x = S0, y = pred, color=interaction(E0,B0,tau)), lwd = 1.3) +
geom_point(data=data.nls,aes(x = S0, y = fert), size = 2) +
scale_x_log10()
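Because the x-axis is logarithmic, a grid that is evenly spaced on the log scale usually gives a smoother curve at low concentrations than seq() on the raw scale, and the same grid can extend beyond the observed range for extrapolation. A self-contained sketch with a simple saturating model standing in for Styan's model (the function sat, the simulated data and all object names are assumptions for illustration):

```r
library(ggplot2)

set.seed(1)
# Simulated data (assumption): saturating response plus a little noise
toy <- data.frame(S0 = 10^seq(-1, 4, length.out = 12))
toy$fert <- 0.9 * toy$S0 / (50 + toy$S0) + rnorm(12, sd = 0.02)

# Variables and parameters are all explicit function arguments
sat <- function(S0, Vmax, K) Vmax * S0 / (K + S0)
fit <- nls(fert ~ sat(S0, Vmax, K), data = toy,
           start = list(Vmax = 1, K = 30))

# Log-spaced grid, one decade beyond the data on each side
grid <- data.frame(S0 = 10^seq(-2, 5, length.out = 200))
grid$pred <- predict(fit, newdata = grid)

ggplot(toy, aes(x = S0, y = fert)) +
  geom_point(size = 2) +
  geom_line(data = grid, aes(y = pred)) +
  scale_x_log10()
```

Predictions outside the observed S0 range rest entirely on the assumed functional form, so the extrapolated parts of the curve deserve more caution than the interpolated ones.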
