Finding a logistic curve with nls in R

I have a problem fitting a logistic curve in R to panel data, which are available here:
https://docs.google.com/spreadsheets/d/1SO3EzFib7T3XqTz1xZCU2bZFTB0Ddhou/edit?usp=sharing&ouid=110784858039906954607&rtpof=true&sd=true
I tried:
log <- nls(lrc~SSlogis(time, Asym, xmid, scal), data = data_log)
This gives an error: 'qr.solve(QR.B, cc)': singular matrix 'a' in solve.
What can I do?

I get a different error:
Error in nls(y ~ 1/(1 + exp((xmid - x)/scal)), data = xy, start = list(xmid = aux[[1L]], :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
(it's not surprising that different platforms will get slightly different results on numerically "difficult" problems ...)
However, here's what plot(lrc ~ time, data = dd) produces when I use your data:
It seems optimistic to think that you could fit a logistic curve to these data, or that the fit would make very much sense ...
I did find that I could fit a logistic to the logged data, i.e. nls(log(lrc) ...)
plot(log(lrc) ~ time, data =dd)
tvec <- seq(2008, 2016, by = 0.1)
lines(tvec, predict(m2, newdata=data.frame(time=tvec)), col=2, lwd=2)
If I really needed logistic coefficients for this plot (e.g. to compare with other cases) I would fit a linear model to the data, assume that the midpoint was equal to the midpoint of the data, derive the scaling parameter from the linear slope coefficient using the divide-by-4 rule (a standard rule of thumb for the logistic: the steepest slope of the curve, at its midpoint, equals the asymptote divided by 4 times the scale), and say that the asymptotic value could not be estimated.
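The divide-by-4 rule can be checked numerically. The sketch below uses the same parameterization as SSlogis(); the values of Asym, xmid and scal are made up for illustration and are not estimates from the data in the question. It compares a finite-difference slope at the midpoint with Asym / (4 * scal).

```r
## Logistic curve in the SSlogis() parameterization
logis_curve <- function(x, Asym, xmid, scal) {
  Asym / (1 + exp((xmid - x) / scal))
}

## Illustrative (assumed) parameter values
Asym <- 10
xmid <- 2012
scal <- 1.5

## Central finite-difference slope at the midpoint
h <- 1e-5
num_slope <- (logis_curve(xmid + h, Asym, xmid, scal) -
              logis_curve(xmid - h, Asym, xmid, scal)) / (2 * h)

## Analytic maximum slope of the logistic: Asym / (4 * scal)
rule_slope <- Asym / (4 * scal)

print(c(numerical = num_slope, rule = rule_slope))
```

Rearranged, a linear slope b near the midpoint implies scal is roughly Asym / (4 * b), so the scale can only be pinned down once a value for the asymptote is assumed.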
Plotting all of your data unit-by-unit doesn't make me much more hopeful that you will be able to fit logistic curves unit-by-unit either. While there might be a few units where a logistic curve is a sensible description of the data, in most cases it is not. You might have to back up and reconsider your analytical strategy, i.e., what are you hoping to learn from these data, and how might you go about it? (If you can frame the question suitably, you could post on CrossValidated. If you are a student/trainee, you might want to ask your supervisor/teacher/mentor for advice.)
library(readxl)
library(tidyverse)
## read the wide spreadsheet and reshape to long format (one row per unit-year)
(dd <- read_excel("data2.xlsx")
    %>% pivot_longer(-code, names_transform = as.numeric,
                     names_to = "year")
    ## indices to break groups into chunks/facets
    %>% mutate(grp_cat = factor(as.numeric(factor(code)) %% 30))
)
gg1 <- ggplot(dd, aes(year, value, group = code,
                      colour = code)) + geom_point() +
    facet_wrap(~grp_cat, scales = "free_y") +
    expand_limits(y = 0) +
    theme_bw() +
    theme(legend.position = "none",
          panel.spacing = grid::unit(0, "lines"),
          axis.text.x = element_blank())
gg1 + geom_smooth(se = FALSE)
ggsave("all.png", width = 8, height = 8)

Related

ROC for Logistic regression in R

I would like to ask for help with my project. My goal is to get ROC curve from existing logistic regression.
First of all, here is what I'm analyzing.
glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)
glm.probs <- predict(glm.fit,type = "response")
glm.probs = predict(glm.fit, newdata = test_data, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "1", "0")
This part works fine, I am able to show a table of prediction and mean result. But here comes the problem for me, I'm using pROC library, but I am open to use anything else which you can help me with. I'm using test_data with approximately 975 rows, but variable proc has only 3 sensitivities/specificities values.
library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs)
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
ggplot(test_data, aes(x=spec, y=sens)) + geom_line()
Here's what I get as a result:
With Warning message:
Removed 972 row(s) containing missing values (geom_path).
As I found out, proc has only 3 values as I said.
You can't (and shouldn't) assign the sensitivity and specificity to the data. They are summary data and exist in a different dimension than your data.
Specifically, these two lines are wrong and make no sense at all:
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
Instead you must either save them to a new data.frame, or use some of the existing functions like ggroc:
ggroc(proc)
If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your dataframe. It provides summary statistics of your model performance (sensitivity, specificity) evaluated on your dataset for different thresholds in your prediction.
Usually you would expect some more nuance on the curve (more than the 3 data points at thresholds -Inf, 0.5, Inf). You can look at the distribution of your glm.probs: this ROC curve indicates that all predictions are either 0 or 1, with very little in between (hence only one threshold at 0.5 on your curve). [This could also mean that you unintentionally used your binary glm.pred for calculating the ROC curve, and not glm.probs as shown in the question (?)]
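To illustrate why hard 0/1 predictions give only three points, here is a minimal base-R sketch (not pROC's implementation) that computes sensitivity and specificity at every distinct threshold. The helper roc_points() and the simulated truth/prob vectors are hypothetical stand-ins for the data in the question.

```r
## Sensitivity/specificity at each distinct threshold (illustrative helper)
roc_points <- function(truth, score) {
  u <- sort(unique(score))
  ## thresholds: -Inf, midpoints between neighbouring unique scores, Inf
  thr <- c(-Inf, (u[-length(u)] + u[-1]) / 2, Inf)
  data.frame(
    threshold   = thr,
    sensitivity = sapply(thr, function(t) mean(score[truth == 1] > t)),
    specificity = sapply(thr, function(t) mean(score[truth == 0] <= t))
  )
}

## Simulated example data (assumed, not from the question)
set.seed(1)
truth <- rbinom(200, 1, 0.5)
prob  <- plogis(rnorm(200, mean = 2 * truth - 1))  # continuous predictions
hard  <- as.numeric(prob > 0.5)                    # thresholded 0/1 predictions

n_prob <- nrow(roc_points(truth, prob))
n_hard <- nrow(roc_points(truth, hard))
print(c(continuous = n_prob, binary = n_hard))
```

The number of ROC points is tied to the number of distinct predicted values, not to the number of rows in the data, which is why the result cannot be assigned back as a column of test_data.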
This seems to be more an issue with your model than with your code. Here is an example from a random different dataset, using the same steps you took (glm(..., family = binomial), then predict(..., type = "response")). This produces a ROC curve with 333 steps for ~1300 data points.
PS: (Ignore the fact that this is evaluated on training data; the point is that the code looks alright up to the point of generating the ROC curve.)
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
myroc <- roc(dftitanic$survived,predict(m1, dftitanic, type = "response"))
plot(myroc)

Not getting a smooth curve using ggplot2

I am trying to fit a mixed effects model using the lme4 package. Unfortunately I cannot share the data that I am working with, and I couldn't find a toy data set that is relevant to my problem. So here are the steps that I have followed so far:
First I plotted the overall trend of the data as follows:
p21 <- ggplot(data = sub_data, aes(x = age_cent, y = y))
p21+ geom_point() + geom_smooth()
Based on this, there seems to be a nonlinear trend in the data. Hence I tried to fit a quadratic model as follows:
sub_data$age_cent=sub_data$age-mean((sub_data)$age)
sub_data$age_centsqr=(sub_data$age-mean((sub_data)$age))^2
m1= lmer(y ~ 1 + age_cent + age_centsqr +(1 | id) , sub_data, REML = TRUE)
In the above model I only included a random intercept because I don't have enough data to include both a random slope and a random intercept. Then I extracted the population-level predictions of this model as follows:
pred1=predict(m1,re.form=NA)
Next I plotted these predictions along with a smooth quadratic function like this
p21+ geom_point() + geom_smooth(method = "lm", formula = y ~ I(x) + I(x^2)
,col="red")+geom_line(aes(y=pred1,group = id) ,col="blue", lwd = 0.5)
In the above plot, the curve that corresponds to the predictions is not smooth. Can anyone help me figure out the reason for that?
Am I doing anything wrong here?
Update:
As eipi10 pointed out, this may be due to fitting different curves for different people.
But when I tried the same thing using a toy data set from the lme4 package, I got the same curve for each person, as follows:
m1 <- lmer(Reaction ~ 1+I(Days) + (1+ Days| Subject) , data = sleepstudy)
pred1new1=predict(m1,re.form=NA)
p21 <- ggplot(data = sleepstudy, aes(x = Days, y = Reaction))
p21+ geom_point() + geom_smooth()
p21+ geom_point() + geom_smooth()+ geom_line(aes(y=pred1new1,group = Subject) ,col="red", lwd = 0.5)
What may be the reason for the different results? Is this due to the data being unbalanced?
The data I used were collected at 3 time steps, and some people don't have measurements at all 3 time steps. The toy data set, on the other hand, is balanced.
Thank you
tl;dr use expand.grid() or something like it to generate a balanced/evenly spaced sample for every group (if you have a strongly nonlinear curve you may want to generate a larger/more finely spaced set of x values than in the original data)
You could also take a look at the sjPlot package, which does a lot of this stuff automatically ...
You need both an unbalanced data set and a non-linear (e.g. polynomial) model for the fixed effects to see this effect:
if the model is linear, then you don't notice missing values, because the linear interpolation done by geom_line() works perfectly;
if the data are balanced, then there are no gaps to get weirdly filled by linear interpolation.
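The two conditions can be seen in a small base-R sketch. It uses approx(), which joins points with straight lines in the same way the segments drawn by geom_line() do; the observation grid and the two curves are invented for illustration.

```r
## A (hypothetical) subject observed at x = 1, 2, 5: x = 3 and x = 4 are missing
x_obs <- c(1, 2, 5)

quad <- function(x) x^2        # non-linear population curve
lin  <- function(x) 2 * x + 1  # linear population curve

## Straight-line interpolation across the gap, evaluated at x = 3
quad_interp <- approx(x_obs, quad(x_obs), xout = 3)$y  # 11, but quad(3) is 9
lin_interp  <- approx(x_obs, lin(x_obs),  xout = 3)$y  # 7, same as lin(3)

print(c(quadratic_gap = quad_interp - quad(3),
        linear_gap    = lin_interp - lin(3)))
```

The interpolated value is off only for the quadratic curve, which is why the kinks appear only when the model is non-linear and some x values are missing.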
Generate an example with quadratic effects and an unbalanced data set; fit the model
library(lme4)
set.seed(101)
dd <- expand.grid(id = factor(1:10), x = 1:10)
dd$y <- simulate(~ poly(x, 2) + (poly(x, 2) | id),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = c(0, 0, 0.1),
                                  theta = rep(0.1, 6),
                                  sigma = 1))[[1]]
## subsample randomly (missing values)
dd <- dd[sort(sample(nrow(dd), size = round(0.7 * nrow(dd)))), ]
m1 <- lmer(y ~ poly(x, 2) + (poly(x, 2) | id), data = dd)
Naive prediction and plot:
dd$pred1 <- predict(m1,re.form=NA)
library(ggplot2)
p11 <- (ggplot(data = dd, aes(x = x, y = y))
+ geom_point() + geom_smooth(method="lm",formula=y~poly(x,2))
)
p11 + geom_line(aes(y=pred1,group = id) ,col="red", lwd = 0.5)
Now generate a balanced data set. This version generates 51 evenly spaced points between the min and max - this will be useful if the original data are unevenly spaced. If you have NA values in your x variable, don't forget na.rm=TRUE ...
pframe <- with(dd, expand.grid(id = levels(id), x = seq(min(x), max(x), length.out = 51)))
Make predictions, and overlay them on the original plot:
pframe$pred1 <- predict(m1,newdata=pframe,re.form=NA)
p11 + geom_line(data=pframe,aes(y=pred1,group = id) ,col="red", lwd = 0.5)

Different slope in 'regression' between ggplot (using geom_smooth(method = "lm")) and the lm function

I am using a data-set (Panel).
With this data-set I conduct the following:
1)
ols <- lm(CapNormChange ~ Policychanges, data=Panel)
summary(ols)
plot(Panel$CapNormChange, Panel$Policychanges,
pch=19, xlab="CapNormChange", ylab="Policychanges")
abline(lm(Panel$CapNormChange~Panel$Policychanges),lwd=3, col="blue")
and 2)
p2 <- ggplot(data = Panel, mapping = aes(x = CapNormChange, y = Policychanges))
p2 + geom_point(alpha=0.3) + geom_smooth(method = "lm", se=F, color="orange")
I thought that the slopes of the geom_smooth line and the abline in the first plot would be the same, and would also correspond to the coefficient of the independent variable (Policychanges) in the OLS regression.
However, this is not the case! Instead, the ggplot line has a higher intercept (I tested this on different datasets). I really don't understand this; could somebody please give some advice?
In 1) you use CapNormChange as y-variable and Policychanges as x-variable. It's always y ~ x. This doesn't match what you do in the plot command. In 2) you do it the other way around.
OLS regression assumes that only y-values have associated errors. Thus, swapping x and y changes the fit. If you want the same results from both, you'd need orthogonal regression.
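A small simulation makes the asymmetry concrete. The data below are made up (x, y and the true slope of 0.5 are assumptions for illustration); the sketch fits both regressions and redraws the second one on the same axes.

```r
## Simulated data (illustrative only)
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

slope_yx <- unname(coef(lm(y ~ x))[2])  # y regressed on x
slope_xy <- unname(coef(lm(x ~ y))[2])  # x regressed on y

## With x on the horizontal axis, the x ~ y line has slope 1 / slope_xy
print(c(y_on_x         = slope_yx,
        x_on_y_redrawn = 1 / slope_xy,
        r_squared      = cor(x, y)^2))
```

The product of the two fitted slopes equals the squared correlation, so the two lines coincide only when the points lie exactly on a straight line.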

Plotting a multiple logistic regression for binary and continuous values in R

I have a data frame of mammal genera. Each row of the data frame is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).
I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.
Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[ ,"Basin"], family="binomial")
I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).
I know how to do this for a single variable regression. For example, I could use a plot like this one if I wanted to see the relationship between just geographic range size and presence in the fossil record.
However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?
Thanks for the help!
Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package
##generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
Basin = rbinom(ssize, 1,.4),
GeographicRange = rnorm(ssize,10,2)
)
dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.
##getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
Basin = (0:1))
plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))
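As a side note, applying plogis() to the default (link-scale) predictions should give the same values as asking predict() for type = "response" directly. A minimal check, using the built-in mtcars data and the model am ~ wt purely as stand-ins:

```r
## Stand-in model on built-in data (not the fossil-record data)
fit_demo <- glm(am ~ wt, family = binomial(), data = mtcars)
grid_demo <- data.frame(wt = seq(1.5, 5.5, by = 0.5))

## Default predict() for a glm returns the linear predictor (log-odds)
link_pred <- predict(fit_demo, newdata = grid_demo)
## type = "response" returns probabilities directly
resp_pred <- predict(fit_demo, newdata = grid_demo, type = "response")

print(cbind(grid_demo, via_plogis = plogis(link_pred), via_response = resp_pred))
```

Either form works for a binomial model with the default logit link; type = "response" has the advantage of also being correct if the link function is changed.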
Now you can plot the predicted results:
##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl +
geom_point( ) +
ggtitle("Predicted FossilRecord by GeoRange and Basin") +
ggplot2::ylab("Predicted FossilRecord")
This will produce a figure like this:
You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:
library(ggplot2)
# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
geom_line()
UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.
Here's an example using the built-in mtcars data frame and a logistic regression with one categorical and two continuous predictor variables:
m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)
Now we create a new data frame with the unique values of cyl, five quantiles of hp and a continuous sequence of mpg, which we'll put on the x-axis (you could also of course do quantiles of mpg and use hp as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=20),
hp = quantile(hp)))
newdata$prob = predict(m1, newdata, type="response")
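When there are too many continuous predictors to show at once, one of them can be held at a single value, as suggested above. A minimal sketch that refits the same mtcars model and fixes hp at its median (the choice of hp and of the median is only for illustration; the name newdata_med is hypothetical):

```r
## Same model as above, refit here so the sketch is self-contained
m1 <- glm(vs ~ cyl + mpg + hp, data = mtcars, family = binomial)

## Vary cyl and mpg, hold hp at its median
newdata_med <- with(mtcars, expand.grid(
  cyl = unique(cyl),
  mpg = seq(min(mpg), max(mpg), length.out = 20),
  hp  = median(hp)
))
newdata_med$prob <- predict(m1, newdata_med, type = "response")

head(newdata_med)
```

The resulting data frame has one curve per cyl value and can be plotted with a single geom_line() layer, without faceting by hp.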
Here are three potential graphs, with varying degrees of legibility.
ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
geom_line() +
facet_grid(. ~ hp)
ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
geom_line()
ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
geom_line() +
facet_grid(. ~ cyl)
And here's another approach using geom_tile to include two continuous dimensions in each plot panel.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=100),
hp =seq(min(hp),max(hp),length=100)))
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(mpg, hp, fill=prob)) +
geom_tile() +
facet_grid(. ~ cyl) +
scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
limits=c(0,1))
If you're looking for a canned solution, the visreg package might work for you.
An example using @eipi10's data
library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
visreg(m1, "gre", by = "rank")
Many more options are described in the documentation.

Errors Plotting a Restricted Cubic Spline with ggplot2

I would like to use ggplot2 to illustrate a fit using a restricted cubic spline using geom_smooth() but it seems to be working incorrectly. Here is a short example:
# rms package Contains Restricted Cubic Splines (RCS)
library(rms)
library(ggplot2)
# Load Data
data(cars)
# Model Fit with RCS
fit <- lm(speed ~ rcs(dist, 5), data=cars)
# Obtain Diagnostic Data
plot.dat <- cbind(cars, fitted=fitted(fit))
# Compare Smooth to Actual
ggplot(data=plot.dat) +
geom_point(aes(x=dist, y=speed)) +
geom_smooth(aes(x=dist, y=speed), method="lm",
formula=y ~ rcs(x, 5), se=FALSE, colour="blue") +
geom_line(aes(y=fitted, x=dist), size=1.25, colour="red")
This results in the following image (Comparison of Splines):
I am not sure why geom_smooth() is not giving the correct results. Clearly there is a work-around (as illustrated above), but is there a way to make geom_smooth() produce the correct results?
I don't know how to integrate this with geom_smooth but I can do it with ggplot.Predict from the rms package:
ddist <- datadist(cars)
options(datadist='ddist')
fit <- ols(speed~ rcs(dist,5),data=cars,
x=TRUE, y=TRUE)
ggplot(Predict(fit))+geom_point(data=cars, aes(x=dist, y=speed))
It has been a long time, but I finally recognized the problem, and I thought I would post it here for those interested. Internally, geom_smooth() will create a sequence of the predictor at which to plot the predicted response. As this sequence is spaced out across the range of the x-axis, the knot points selected by rcs() inside of geom_smooth() will differ from the knot points selected by rcs() on the original data. To address this, you need to pass in the correct knot points.
# rms package Contains Restricted Cubic Splines (RCS)
library(rms)
library(ggplot2)
# Load Data
data(cars)
# Model Fit with RCS
fit <- lm(speed ~ rcs(dist, 5), data=cars)
# Obtain Diagnostic Data
plot.dat <- cbind(cars, fitted=fitted(fit))
# Compare Smooth to Actual
ggplot(data=plot.dat) +
geom_point(aes(x=dist, y=speed)) +
geom_smooth(aes(x=dist, y=speed), method="lm",
formula=y ~ rcs(x, quantile(plot.dat$dist, probs = c(0.05, 0.275, 0.5, 0.725, 0.95))), se=FALSE, colour="blue") +
geom_line(aes(y=fitted, x=dist), colour="red")
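To see how far apart the two sets of knots can be, the sketch below compares quantiles of the observed dist values with quantiles of an evenly spaced grid over the same range. It assumes an 80-point grid and reuses the five probabilities from the code above; it uses only base R, so it illustrates the quantile placement rather than calling rcs() itself.

```r
## Knot probabilities taken from the code above
probs <- c(0.05, 0.275, 0.5, 0.725, 0.95)

## Knots computed from the observed predictor values
knots_data <- quantile(cars$dist, probs = probs)

## Knots computed from an evenly spaced prediction grid (80 points assumed)
grid <- seq(min(cars$dist), max(cars$dist), length.out = 80)
knots_grid <- quantile(grid, probs = probs)

print(rbind(data = knots_data, grid = knots_grid))
```

Because the observed distances are right-skewed, the grid-based quantiles sit noticeably higher than the data-based ones, so a spline basis rebuilt on the grid is a different basis from the one the coefficients were estimated for.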
