Not getting a smooth curve using ggplot2 - r

I am trying to fit a mixed-effects model using the lme4 package. Unfortunately I cannot share the data I am working with, and I couldn't find a toy data set relevant to my problem, so here are the steps I have followed so far.
First I plotted the overall trend of the data:
p21 <- ggplot(data = sub_data, aes(x = age_cent, y = y))
p21 + geom_point() + geom_smooth()
Based on this, there seems to be some nonlinear trend in the data, so I tried to fit a quadratic model:
sub_data$age_cent <- sub_data$age - mean(sub_data$age)
sub_data$age_centsqr <- (sub_data$age - mean(sub_data$age))^2
m1 <- lmer(y ~ 1 + age_cent + age_centsqr + (1 | id), sub_data, REML = TRUE)
In the above model I only included a random intercept, because I don't have enough data to include both a random slope and a random intercept. Then I extracted the population-level predictions from this model:
pred1=predict(m1,re.form=NA)
Next I plotted these predictions along with a smooth quadratic fit:
p21 + geom_point() +
  geom_smooth(method = "lm", formula = y ~ I(x) + I(x^2), col = "red") +
  geom_line(aes(y = pred1, group = id), col = "blue", lwd = 0.5)
In the above plot, the curve corresponding to the predictions is not smooth. Can anyone help me figure out the reason for that?
Am I doing anything wrong here?
Update :
As eipi10 pointed out, this may be due to fitting different curves for different people.
But when I tried the same thing using a toy data set from the lme4 package, I got the same curve for each person:
m1 <- lmer(Reaction ~ 1 + I(Days) + (1 + Days | Subject), data = sleepstudy)
pred1new1 <- predict(m1, re.form = NA)
p21 <- ggplot(data = sleepstudy, aes(x = Days, y = Reaction))
p21 + geom_point() + geom_smooth()
p21 + geom_point() + geom_smooth() +
  geom_line(aes(y = pred1new1, group = Subject), col = "red", lwd = 0.5)
What may be the reason for the different results? Is this due to the unbalancedness of the data?
My data were collected at 3 time points, and some people do not have measurements at all 3 time points, whereas the toy data set is balanced.
Thank you

tl;dr use expand.grid() or something like it to generate a balanced/evenly spaced sample for every group (if you have a strongly nonlinear curve you may want to generate a larger/more finely spaced set of x values than in the original data)
You could also take a look at the sjPlot package, which does a lot of this stuff automatically ...
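For example, a minimal sketch (assuming the quadratic is specified inline, e.g. m1 <- lmer(y ~ poly(x, 2) + (1 | id), ...), so the prediction machinery can reconstruct the curve; with hand-built columns like age_centsqr it cannot know they are linked to age_cent):
library(sjPlot)
## population-level (fixed-effects) predictions across the observed x values
plot_model(m1, type = "pred", terms = "x [all]")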
You need both an unbalanced data set and a non-linear (e.g. polynomial) model for the fixed effects to see this effect:
- if the model is linear, you don't notice missing values, because the linear interpolation done by geom_line() works perfectly;
- if the data are balanced, there are no gaps to get weirdly filled by linear interpolation.
Generate an example with quadratic effects and an unbalanced data set; fit the model
library(lme4)
set.seed(101)
dd <- expand.grid(id = factor(1:10), x = 1:10)
dd$y <- simulate(~ poly(x, 2) + (poly(x, 2) | id),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = c(0, 0, 0.1),
                                  theta = rep(0.1, 6),
                                  sigma = 1))[[1]]
## subsample randomly (missing values)
dd <- dd[sort(sample(nrow(dd), size = round(0.7*nrow(dd)))), ]
m1 <- lmer(y ~ poly(x, 2) + (poly(x, 2) | id), data = dd)
Naive prediction and plot:
dd$pred1 <- predict(m1, re.form = NA)
library(ggplot2)
p11 <- (ggplot(data = dd, aes(x = x, y = y))
        + geom_point()
        + geom_smooth(method = "lm", formula = y ~ poly(x, 2))
)
p11 + geom_line(aes(y = pred1, group = id), col = "red", lwd = 0.5)
Now generate a balanced data set. This version generates 51 evenly spaced points between the min and max - this will be useful if the original data are unevenly spaced. If you have NA values in your x variable, don't forget na.rm=TRUE ...
pframe <- with(dd, expand.grid(id = levels(id), x = seq(min(x), max(x), length.out = 51)))
Make predictions, and overlay them on the original plot:
pframe$pred1 <- predict(m1, newdata = pframe, re.form = NA)
p11 + geom_line(data = pframe, aes(y = pred1, group = id), col = "red", lwd = 0.5)

Related

LMM on ChickWeight data

I would like to write a model with random intercepts and random slopes with respect to time. I am not sure if my code is correct.
model4<-lmer(weight~Time + Diet + Time*Diet + (1+Time|Chick), data = Data, REML = TRUE)
summary(model4)
Yes, that is the correct specification for those random effects. You can check this by fitting a similar model, temporarily removing the fixed effect of diet and the interaction between time and diet:
model4<-lmer(weight~Time + (1+Time|Chick), data = ChickWeight, REML = TRUE)
Column-bind the predictions from this simpler model onto the original data, and select five random chicks to plot:
weight_hat <- predict(model4)
cw <- cbind(ChickWeight, weight_hat)
random_chicks <- sample(unique(cw$Chick), 5)
ggplot(cw[cw$Chick %in% random_chicks, ], aes(Time, color = Chick)) +
  geom_point(aes(y = weight), size = 2) +
  geom_line(aes(y = weight_hat), size = 1.5) +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 1))
You can see that the intercept and slope differ for each Chick.
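To see those per-chick differences directly, you can also inspect the estimated random effects (a quick check; ranef() is the lme4 accessor):
## one intercept deviation and one Time slope deviation per Chick
head(ranef(model4)$Chick)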

ROC for Logistic regression in R

I would like to ask for help with my project. My goal is to get a ROC curve from an existing logistic regression.
First of all, here is what I'm analyzing.
glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)
glm.probs <- predict(glm.fit,type = "response")
glm.probs = predict(glm.fit, newdata = test_data, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "1", "0")
This part works fine; I am able to show a table of predictions and the mean result. But here comes the problem for me. I'm using the pROC library, but I am open to using anything else you can help me with. My test_data has approximately 975 rows, but the variable proc has only 3 sensitivity/specificity values.
library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs)
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
ggplot(test_data, aes(x=spec, y=sens)) + geom_line()
Here's what I have as a result:
With Warning message:
Removed 972 row(s) containing missing values (geom_path).
As I found out, proc has only 3 values as I said.
You can't (and shouldn't) assign the sensitivity and specificity to the data. They are summary data and exist in a different dimension than your data.
Specifically, these two lines are wrong and make no sense at all:
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
Instead you must either save them to a new data.frame, or use some of the existing functions like ggroc:
ggroc(proc)
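If you prefer to build the plot yourself, a sketch of the new-data.frame route (the roc object stores one sensitivity/specificity pair per threshold, not per observation):
roc_df <- data.frame(sens = proc$sensitivities,
                     spec = proc$specificities)
ggplot(roc_df, aes(x = spec, y = sens)) +
  geom_line() +
  scale_x_reverse()  # ROC convention: specificity decreasing left to right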
If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your dataframe. It provides summary statistics of your model performance (sensitivity, specificity) evaluated on your dataset for different thresholds in your prediction.
Usually you would expect some more nuance on the curve (more than the 3 data points at thresholds -Inf, 0.5, Inf). You can look at the distribution of your glm.probs: this ROC curve indicates that all predictions are either 0 or 1, with very little in between (hence only one threshold at 0.5 on your curve). [This could also mean that you unintentionally used your binary glm.pred for calculating the ROC curve, and not glm.probs as shown in the question (?)]
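A quick way to check both possibilities (a sketch using the objects from the question):
hist(glm.probs)           # should be spread over [0, 1], not two spikes at 0 and 1
length(unique(glm.probs)) # very few unique values suggests the wrong input was used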
This seems to be more an issue with your model than with your code. Here is an example from a random different dataset, using the same steps you took (glm(..., family = binomial) and predict(..., type = "response")). This produces a ROC curve with 333 steps for ~1300 data points.
PS: Ignore the fact that this is evaluated on training data; the point is that the code looks alright up to the point of generating the ROC curve.
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
myroc <- roc(dftitanic$survived,predict(m1, dftitanic, type = "response"))
plot(myroc)
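As a follow-up, pROC will also report the area under this curve directly:
auc(myroc)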

How to derive the equation for a non-linear time series regression model built in R?

I've built a non-linear time series regression model in R that I would like to write down as an equation, so that I can back-test against my data in an Excel spreadsheet. I've created a .ts object and created a model using the tslm function, as shown below:
model16 <- tslm(production ~ date + I(date^2) + I(date^3) +
                  I(temp_neg_32^3) +
                  I(humidity_avg^3) +
                  I(dew_avg^3) +
                  below_freezing_min,
                data = production_temp_no_outlier.ts)
I find the coefficients for each variable in the model by using the following code:
summary(model16)
Based on the coefficients in the summary output, my understanding is that the equation of my model should be:
y = -7924000000 + 1268000*date - 67.62*date^2 + 0.001202*date^3 +
    0.04395*temp_neg_32^3 + 0.008658*humidity_avg^3 - 0.03762*dew_avg^3 -
    11930*below_freezing_min
However, whenever I plug the data into this equation, the output is completely off; it has nothing in common with the fitted-curve visualization that I built in R from this model. So I am clearly doing something wrong. I will be very grateful if someone could help point out my errors!
This use of regression doesn't give you an exact fit; it gives you the line of best fit. What is the coefficient of determination (a.k.a. explained variance, or R^2)?
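For an lm-style fit such as the one tslm returns, you can read it straight off the model summary (assuming summary() dispatches to the lm method for the model16 object from the question):
summary(model16)$r.squared  # proportion of variance explained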
Take a look at this set of data (somewhat modeled after your example).
library(forecast)
library(tidyverse)
data("us_change", package = "fpp3")
fit <- tslm(Production ~ Savings + I(Savings^2) + I(Savings^3) + I(Income^3) + Unemployment,
            data = ts(us_change))
summary(fit)
Here I extracted the coefficients, so I can show you a bit more of what I mean. Then I created a function that calculates the outcome of the regression equation.
cFit <- coefficients(fit)
# (Intercept) Savings I(Savings^2) I(Savings^3) I(Income^3) Unemployment
# 5.221684e-01 6.321979e-03 -2.472784e-04 -6.376422e-06 7.029079e-03 -3.144743e+00
regFun <- function(cFit, data){
  ## evaluate the regression equation; note that every term, including
  ## Unemployment, needs to be multiplied by its coefficient
  with(data,
       cFit[[1]] + cFit[[2]] * Savings + cFit[[3]] * Savings^2 +
         cFit[[4]] * Savings^3 + cFit[[5]] * Income^3 + cFit[[6]] * Unemployment)
}
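As a sanity check on the transcribed equation, you can compare it with R's own fitted values; if the two columns match, the coefficients were copied correctly (a sketch using the objects above):
## fitted(fit) evaluates the same regression equation internally
head(cbind(byHand = regFun(cFit, as.data.frame(us_change)),
           byR    = fitted(fit)))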
Here are some examples of the predicted outcome versus the actual outcome.
fitOne <- regFun(cFit, us_change[1,])
# [1] 1.455793
us_change[1,]$Production
# [1] -2.452486
fitTwo <- regFun(cFit, us_change[2,])
# [1] 1.066338
us_change[2,]$Production
# [1] -0.5514595
fitThree <- regFun(cFit, us_change[3,])
# [1] 1.08083
us_change[3,]$Production
# [1] -0.3586518
You can tell from the variance here that the production volume is not explained very well by the inputs I provided.
Now look at what happens when I graph this:
plt <- ggplot(data = us_change %>%
                mutate(Regression = regFun(cFit, us_change)),
              aes(x = Production)) +
  geom_point(aes(y = Savings, color = "Savings")) +
  geom_point(aes(y = Savings^2, color = "Savings^2")) +
  geom_point(aes(y = Savings^3, color = "Savings^3")) +
  geom_point(aes(y = Unemployment, color = "Unemployment")) +
  geom_line(aes(y = Regression, color = "Regression")) + # regression line
  scale_color_viridis_d(end = .8) +
  theme_bw()
plotly::ggplotly(plt)
The regression equation output is the black line. It's the best fit, but there are values that are not represented all that well.
If you look closer, it's not a straight line either.

R Finding logistic curve with nls

I have a problem fitting a logistic curve in R to panel data, which are here:
https://docs.google.com/spreadsheets/d/1SO3EzFib7T3XqTz1xZCU2bZFTB0Ddhou/edit?usp=sharing&ouid=110784858039906954607&rtpof=true&sd=true
I tried:
log <- nls(lrc~SSlogis(time, Asym, xmid, scal), data = data_log)
I get an error: Error in qr.solve(QR.B, cc): singular matrix 'a' in solve.
What can I do?
I get a different error:
Error in nls(y ~ 1/(1 + exp((xmid - x)/scal)), data = xy, start = list(xmid = aux[[1L]], :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
(it's not surprising that different platforms will get slightly different results on numerically "difficult" problems ...)
However, here's what plot(lrc ~ time, data = dd) produces when I use your data:
It seems optimistic to think that you could fit a logistic curve to these data, or that the fit would make very much sense ...
I did find that I could fit a logistic to the logged data, i.e. something like m2 <- nls(log(lrc) ~ SSlogis(time, Asym, xmid, scal), data = dd):
plot(log(lrc) ~ time, data = dd)
tvec <- seq(2008, 2016, by = 0.1)
lines(tvec, predict(m2, newdata = data.frame(time = tvec)), col = 2, lwd = 2)
If I really needed logistic coefficients for this plot (e.g. to compare with other cases) I would fit a linear model to the data, assume that the midpoint was equal to the midpoint of the data, set the scaling parameter equal to the linear slope coefficient divided by 4 (this is a standard rule of thumb for the logistic), and say that the asymptotic value could not be estimated.
Plotting all of your data unit by unit doesn't make me much more hopeful that you will be able to fit logistic curves unit by unit either. While there might be a few units where a logistic curve is a sensible description of the data, it isn't in most cases. You might have to back up and consider your analytical strategy, i.e., what are you hoping to learn from these data, and how might you go about it? (If you can frame the question suitably you could post on CrossValidated. If you are a student/trainee, you might want to ask your supervisor/teacher/mentor for advice ...)
library(readxl)
library(tidyverse)
(dd <- read_excel("data2.xlsx")
  %>% pivot_longer(-code, names_transform = as.numeric,
                   names_to = "year")
  ## indices to break groups into chunks/facets
  %>% mutate(grp_cat = factor(as.numeric(factor(code)) %% 30))
)
gg1 <- ggplot(dd, aes(year, value, group = code, colour = code)) +
  geom_point() +
  facet_wrap(~ grp_cat, scales = "free_y") +
  expand_limits(y = 0) +
  theme_bw() +
  theme(legend.position = "none",
        panel.spacing = grid::unit(0, "lines"),
        axis.text.x = element_blank())
gg1 + geom_smooth(se = FALSE)
ggsave("all.png", width = 8, height = 8)

Plotting a multiple logistic regression for binary and continuous values in R

I have a data frame of mammal genera, in which each row is a different genus. There are three columns: the genus's geographic range size (a continuous variable), whether the genus is found inside or outside of river basins (a binary variable), and whether the genus is found in the fossil record (a binary variable).
I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.
Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[ ,"Basin"], family="binomial")
I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).
I know how to do this for a single-variable regression. For example, I could use a simple scatterplot with a fitted curve if I wanted to see the relationship between just geographic range size and presence in the fossil record.
However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?
Thanks for the help!
Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line of predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package:
## generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
  Basin = rbinom(ssize, 1, .4),
  GeographicRange = rnorm(ssize, 10, 2)
)
dat$FossilRecord <- rbinom(ssize, 1, (.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.
## getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from = 0, to = 20, by = 0.1),
                            Basin = 0:1)
plotting_dfm$preds <- plogis(predict(fit, newdata = plotting_dfm))
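Note that plogis(predict(...)) back-transforms from the logit scale by hand; asking predict() for the response scale is equivalent:
## equivalent: let predict() apply the inverse link itself
plotting_dfm$preds <- predict(fit, newdata = plotting_dfm, type = "response")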
Now you can plot the predicted results:
##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl +
geom_point( ) +
ggtitle("Predicted FossilRecord by GeoRange and Basin") +
ggplot2::ylab("Predicted FossilRecord")
This will produce a figure like this:
You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:
library(ggplot2)
# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
geom_line()
UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.
Here's an example using the built-in mtcars data frame and a logistic regression with one categorical and two continuous predictor variables:
m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)
Now we create a new data frame with the unique values of cyl, five quantiles of hp and a continuous sequence of mpg, which we'll put on the x-axis (you could also of course do quantiles of mpg and use hp as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=20),
hp = quantile(hp)))
newdata$prob = predict(m1, newdata, type="response")
Here are three potential graphs, with varying degrees of legibility.
ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
geom_line() +
facet_grid(. ~ hp)
ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
geom_line()
ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
geom_line() +
facet_grid(. ~ cyl)
And here's another approach using geom_tile to include two continuous dimensions in each plot panel.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=100),
hp =seq(min(hp),max(hp),length=100)))
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(mpg, hp, fill=prob)) +
geom_tile() +
facet_grid(. ~ cyl) +
scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
limits=c(0,1))
If you're looking for a canned solution, the visreg package might work for you.
An example using @eipi10's data:
library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
visreg(m1, "admit", by = "rank")
Many more options are described in the visreg documentation.
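By default visreg plots on the linear-predictor (logit) scale; for predicted probabilities, ask for the response scale:
visreg(m1, "gre", by = "rank", scale = "response")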
