Why do I get multiple prediction lines for my binomial GLM? - r

My data are masses of offspring in kg, and a column of 1's and 0's to represent whether a mother was in her terminal year or not.
Chick Mass Terminal Effect
3.4 0
3.1 1
2.4 1
3.6 0
etc..
So I have fitted a model to assess whether mass (in kg) has an effect on mortality (binomial):
m10 <- glm(Terminal_Effect ~ chick_mass, data = cranesData, family = binomial(link = "logit"))
summary(m10)
plot(cranesData$Terminal_Effect~cranesData$chick_mass, xlab = "Chick Mass (kg)", ylab = "Probability of Mother Death", pch = 19)
When I plot this, there are multiple lines on my plot. Is there a way to change this to a single line?
Any help would be appreciated :)

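Sort your predictor values before plotting. lines() joins the points in the order they appear in the data, so with unsorted x values the fitted curve doubles back on itself and looks like several lines.
Using the iris dataset with some variation: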
set.seed(111)
dat = iris
dat$Species = as.numeric(dat$Species=="setosa")
dat$Petal.Width = dat$Petal.Width + rnorm(nrow(dat))
Order the data by the predictor, in this case Petal.Width:
dat = dat[order(dat$Petal.Width),]
fit = glm(Species ~ Petal.Width,data=dat,family="binomial")
plot(dat$Species ~ dat$Petal.Width)
lines(dat$Petal.Width,fit$fitted.values,col="blue")
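An alternative sketch that leaves the data in its original order: predict on an evenly spaced grid of predictor values, which is sorted by construction, and draw that instead of the fitted values. It reuses the simulated iris variant from above; the names grid and prob are illustrative and not part of the original answer.

```r
set.seed(111)
dat <- iris
dat$Species <- as.numeric(dat$Species == "setosa")
dat$Petal.Width <- dat$Petal.Width + rnorm(nrow(dat))

# Fit on the data in its original (unsorted) row order
fit <- glm(Species ~ Petal.Width, data = dat, family = "binomial")

# Evenly spaced grid over the observed predictor range (already sorted)
grid <- data.frame(
  Petal.Width = seq(min(dat$Petal.Width), max(dat$Petal.Width), length.out = 200)
)
# Predicted probabilities on the response scale
grid$prob <- predict(fit, newdata = grid, type = "response")

plot(Species ~ Petal.Width, data = dat)
lines(grid$Petal.Width, grid$prob, col = "blue")
```

Because the grid is dense and ordered, the curve is drawn once from left to right and also looks smooth where the observed data are sparse.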

Related

Fitted vs observed plot with line from glmer model

I have a data frame that is made up of individuals in different treatment groups, with a 1 if they survived and a 0 if they died, and a third column indicating which dish they were in. I ran a glmer model using the lme4 package with Dish_ID as my random effect. I have a piece of base R plotting code which plots the mortality rate against treatment group using the line from my glmer model. How can I draw the same observed vs. fitted plot in ggplot? I have tried looking online but can't seem to find an answer that explains the process.
I want to get the line from my binomial model (manually, not using geom_smooth) and then plot my observed points in red in ggplot2. Thanks for the help.
library(tidyverse)
library(ggplot2)
library(lme4)
mortality_data$Dish_ID <- as.factor(mortality_data$Dish_ID)
mortality_model <- glmer(Survived ~ Treatment + (1|Dish_ID), data = mortality_data, family = "binomial")
summary(mortality_model)
plot(mortality_data$Treatment, 1 - fitted(mortality_model), ylim = c(0,1))
plot(mortality_data$Treatment, 1 - fitted(mortality_model), ylim = c(0,1), type = "l", xlab = "Concentration of Cu2SO4", ylab = "Mortality rate")
tv <- unique(mortality_data$Treatment)
#observed in red
for (i in tv) {
points(i, y = 1 - mean(mortality_data$Survived[mortality_data$Treatment == i]), col = "red")
}
My data frame looks something like this, if it is of any use. There are 540 individuals, 90 for each treatment group.
Treatment Survived Dish_ID
0.05 1 Dish_1
0.04 0 Dish_3

Plot interaction effect in sem model with observed variables in R

I am estimating an SEM model that has observed variables. I am using SEM to handle missing data using FIML. My model has an interaction term to test for moderation. Here is a toy example that illustrates the issue.
library(lavaan)
library(car)
library(dplyr)
data(starwars)
sw2 <- starwars %>% mutate(
male = Recode(sex, "'male' = 1; NA=NA; else = 0"),
human = Recode(species, "'Human' = 1; NA=NA; else = 0"),
maleXby = male * birth_year,
)
mod <- 'mass ~ height + human + male + birth_year + maleXby'
fit <- sem(mod, data = sw2, missing="fiml.x")
summary(fit)
What I want to do is plot the interaction term like a margin plot, to visualize the interaction effect. But packages like interactions do not work with an object of class lavaan. How could I visualize this? Is there a package (like interactions) that makes this easier?
You could fit this model using lm(), but I think you want to be able to use FIML estimates, yes? In that case, you could use the emmeans package, which can work on lavaan-class objects if you have the semTools package loaded.
You didn't say which predictor was focal vs. moderator, but I assume you want to treat male as moderator because it is a grouping variable. The example below can be adapted by switching their roles in the pairs() function, as well as by selecting different birth_year levels at= which to probe the effect of male. When birth_year is the focal predictor, its linear effect will be the same regardless of which levels are chosen, so I chose the full range() below.
library(emmeans)
library(semTools)
## for ease of use, fit model using colon operator
mod <- 'mass ~ height + human + male + birth_year + male:birth_year'
fit <- sem(mod, data = sw2, missing = "fiml.x")
## calculate expected marginal means for multiple
## levels of male (1:0) and birth_year
BYrange <- range(sw2$birth_year, na.rm = TRUE)
em.mass <- emmeans(fit, specs = ~ birth_year | male,
at = list(male = 1:0, birth_year = BYrange),
# because SEMs can have multiple DVs:
lavaan.DV = "mass")
em.mass
## probe effect of year across sex
rbind(pairs(em.mass))
## plot effect of year across sex
emmip(em.mass, male ~ birth_year) # 2 lines in same plot
emmip(em.mass, ~ birth_year | male) # in separate panels

(R) Adding Confidence Intervals To Plots

I am using R. I am following this tutorial (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is fitted to a dataset and then used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
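#(each curve is taken from "pred", the predictions for the 3 held-out rows; unlist() turns a data.table row into a numeric vector)
plot(r_fit$unique.death.times, unlist(pred[1,]), type = "l", col = "red")
lines(r_fit$unique.death.times, unlist(pred[2,]), type = "l", col = "green")
lines(r_fit$unique.death.times, unlist(pred[3,]), type = "l", col = "blue")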
From here, I would like to try to add a confidence interval (confidence region) to each of these 3 curves, so that they look something like this:
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot ) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data=...)+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )
You can put + after geom_line and repeat the same steps for each observation.
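As a concrete sketch of the pattern above, the curves can be stored in one long data frame and drawn with a single geom_ribbon and geom_line pair, grouped by observation. Everything here is a made-up placeholder: the data frame curves, its columns, the exponential curves, and especially the plus or minus 0.1 band, which is not a real confidence interval. How to obtain actual interval bounds for a ranger survival forest is not covered here.

```r
library(ggplot2)

# Hypothetical long-format data: one row per (observation, time) pair.
times <- seq(0, 1000, by = 50)
curves <- data.frame(
  id   = factor(rep(1:3, each = length(times))),
  time = rep(times, times = 3),
  surv = c(exp(-times / 400), exp(-times / 600), exp(-times / 900))
)
# Placeholder band, clamped to [0, 1]; NOT a real confidence interval.
curves$lower <- pmax(curves$surv - 0.1, 0)
curves$upper <- pmin(curves$surv + 0.1, 1)

# One ribbon and one line per id, matched by colour and fill
p <- ggplot(curves, aes(x = time, y = surv, colour = id, fill = id)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, colour = NA) +
  geom_line() +
  labs(x = "Time", y = "Survival probability")
print(p)
```

Mapping colour and fill to id avoids repeating geom_line and geom_ribbon once per observation.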
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html

Get marginal effect and predicted probability for glmer model in R

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.
So my questions are:
1) Is there a package/method to get both of these sets of values?
2) If I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other?
3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time?
4) How can I get margins() to work with a large dataset?
Here is some sample data:
### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100),
state = rep(c("montana", "idaho",
"colorado", "wyoming", "utah"),
times = 10, each = 20),
site_id = as.factor(rep(1:100, times = 10)),
cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
ind_cont_variable = rnorm(100, mean = 20, sd = 5),
event_occurred = as.factor(sample(c(0, 1),
replace = TRUE,
size = 1000)))
### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df,
select_columns = "state",
remove_first_dummy = TRUE)
I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:
library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
cat_variable*year*state +
(1|site_id),
data = df,
family = binomial(link = "logit"),
nAGQ = 0,
control = glmerControl(optimizer = "nloptwrap"))
I can use ggeffects to get the predicted probability values for each state and category combination over time:
library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state,
terms = c("year [all]",
"cat_variable",
"state")))
head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
#      x predicted std.error  conf.low conf.high group    facet
# 1 2001 0.2835665 0.3981910 0.1535170 0.4634655     0 colorado
# 2 2001 0.5911911 0.3762090 0.4089121 0.7514289     0    idaho
# 3 2001 0.5038673 0.3719418 0.3288209 0.6779708     0  montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101     0     utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088     0  wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719     1 colorado
This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):
library(margins)
fit_state_me <- summary(margins(fit_state,
at = list(year = 2001:2010),
variables = "cat_variable"))
head(fit_state_me)
# factor year AME SE z p lower
# cat_variable1 2001.0000 0.0224 0.0567 0.3953 0.6926 -0.0887
# cat_variable1 2002.0000 0.0146 0.0490 0.2978 0.7659 -0.0814
# cat_variable1 2003.0000 0.0062 0.0418 0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845
The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:
Error: cannot allocate vector of size 369.5 Gb
Any advice for getting around this issue so that I can use this function on my data?
I'd be grateful for any tips-- packages I should check out, mistakes I'm making in my code or my conceptual understanding, etc. Thank you!
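One way to see how the two kinds of quantity relate is to compute both by hand from a single model. The sketch below assumes a plain glm rather than the glmer model above, uses simulated data, and all names (toy, d0, d1, p0, p1, ame) are hypothetical. For a two-level factor, the average marginal effect is the mean difference between the predicted probabilities with every row set to level 1 and with every row set to level 0.

```r
set.seed(1)
n <- 500
# Simulated toy data: one continuous predictor, one two-level factor
toy <- data.frame(
  x   = rnorm(n),
  grp = factor(rbinom(n, 1, 0.5))
)
toy$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * toy$x + 0.6 * (toy$grp == "1")))

fit <- glm(y ~ x + grp, data = toy, family = binomial)

# Counterfactual copies of the data: everyone in level 0, everyone in level 1
d0 <- toy
d0$grp <- factor(rep("0", n), levels = levels(toy$grp))
d1 <- toy
d1$grp <- factor(rep("1", n), levels = levels(toy$grp))

# Predicted probabilities under each scenario
p0 <- predict(fit, newdata = d0, type = "response")
p1 <- predict(fit, newdata = d1, type = "response")

# Average marginal effect of grp: mean discrete change in predicted probability
ame <- mean(p1 - p0)
ame
```

Both sets of numbers come from the same predict() calls here, so they are consistent by construction. Whether two separate packages agree depends on what each holds fixed or averages over for the other covariates, and for mixed models on how the random effects are handled.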

Sorting the x axes in R

I built a logistic regression model (called 'mylogit') using the glm function in R as follows:
mylogit <- glm(answer ~ as.factor(gender) + age, data = mydata, family = "binomial")
where age is numeric and gender is categorical (male and female).
I then proceeded to make predictions with the model built.
pred <- predict(mylogit, type = "response")
I can easily make a time series plot of the predictions by doing:
plot.ts(ts(pred))
which gives a plot of the predictions against time:
[Figure: Plot of Time against Predictions]
My question is this:
Is it possible to put the x axis in segments according to gender (male or female) which was specified in the glm? In other words, can I have predictions on the y axis and have gender (divided into male and female) on the x axis?
A sample of the data I want to plot from is similar to this:
I did:
bind = cbind(mydata, pred)
'bind' looks like this:
pred age gender
0.9461198 32 male
0.9463577 45 female
0.9461198 45 female
0.9461198 37 female
0.9477645 40 male
0.8304513 32 female
Check out #4 on this blog post, "4. How To Create Two Different X- or Y-axes".
My suggestion to you is that you look at some of the dedicated R plotting tools, like ggplot2.
I don't think you need to use ts and plot.ts because the data you have is not a time series, right? Just sort the predictions by gender before plotting.
# Get data
str <- "pred,age,gender
0.9461198,32,male
0.9463577,45,female
0.9461198,45,female
0.9461198,37,female
0.9477645,40,male
0.8304513,32,female"
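# stringsAsFactors = TRUE keeps gender as a factor (needed for col = in R >= 4.0)
bind <- read.csv(textConnection(str), stringsAsFactors = TRUE)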
# Plot
bind <- bind[order(bind$gender),]
plot(bind$pred, col = bind$gender)
library(ggplot2)
ggplot(bind, aes(x = gender, y = pred)) +
geom_point(position = position_jitter(width = .3))
Or without creating bind you could do plot(pred[order(mydata$gender)]).
