I have created a GLMM and plotted the predicted probabilities for each factor level. However, I cannot work out how to create confidence intervals using the bootMer function. I keep getting the error message "cannot simulate from non integer prior weights".
I'm hoping someone can help. Thanks in advance.
library(lme4)
# timeseries has to exist before the model that uses it is fitted
dat$timeseries <- dat$Study_Date - 1998
glmm1 <- glmer(cbind(Total_Dead, Total_Collected - Total_Dead) ~
Species + timeseries + (1|Location),
data = dat, family= "binomial")
plot(predict(glmm1, data.frame(Species="An. Arab", timeseries= dat$timeseries),
type="response", re.form = NA) ~
dat$timeseries, frame=FALSE, bty="n", pch="", xaxt="n", ylim=c(0, 0.5),
ylab="Predicted mortality", xlab="Year",
main = "Predicted mortality by species",
cex.lab=1.6, yaxt="n")
axis(1, at=c(1:17), labels=1999:2015, cex.axis=1.8)
axis(2, las=2, at=seq(0, 1, 0.2), labels=seq(0, 1, 0.2), cex.axis=1.8)
COLS <- c("blue", "red", "purple", "aquamarine3", "orange")
PCH <- c(17, 15, 19, 20, 5)
for(i in seq_along(levels(dat$Species))){
points((dat$Total_Dead[dat$Species == levels(dat$Species)[i]] /
dat$Total_Collected[dat$Species == levels(dat$Species)[i]]) ~
dat$timeseries[dat$Species == levels(dat$Species)[i]],
pch=PCH[i], col=COLS[i])
lines(predict(glmm1, data.frame(Species=levels(dat$Species)[i],
timeseries = dat$timeseries), type="response",
re.form = NA) ~ dat$timeseries, lty=4, col=COLS[i])
}
bootstrap <- bootMer(x=glmm1, FUN= fixef, nsim=200)
For some reason bootMer has problems with that; you can use the merTools package instead:
library(merTools)
preds <- predictInterval(glmm1, newdata = your.dataframe, n.sims = 1000)
I would then use the preds data.frame to plot. The resulting data.frame has the fit and the upper and lower limits, so you can use geom_ribbon to plot it. If you need more help, let me know.
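As a side note on the error itself: the message suggests that simulating binomial responses needs whole-number totals. The sketch below checks for that. It is an assumption about the cause, not a confirmed diagnosis; `dat_demo` and its values are made up, and only the column names are borrowed from the question.

```r
# Hypothetical data: one row has a non-integer total (e.g. from averaging replicates)
dat_demo <- data.frame(
  Total_Dead      = c(3, 10, 0, 7),
  Total_Collected = c(20, 25.5, 12, 30)
)

# TRUE when every value is a whole number (within floating-point tolerance)
is_whole <- function(x) all(abs(x - round(x)) < 1e-8)

dead_ok  <- is_whole(dat_demo$Total_Dead)
total_ok <- is_whole(dat_demo$Total_Collected)

# Rows that would need attention before simulating binomial responses
bad_rows <- which(abs(dat_demo$Total_Collected - round(dat_demo$Total_Collected)) >= 1e-8)
bad_rows
```

If a check like this flags rows in the real data, it may be worth looking at how those counts were recorded before deciding between bootMer and predictInterval.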
Now bear with me: you actually want to make a new standardized dataset for your graph. If you use this code it will work:
glmm1 <- glmer(cbind(Total_Dead, Total_Collected - Total_Dead) ~
Species + timeseries + (1|Location),
data = dat,family= "binomial")
Fit your model, then create your new dataset. It will contain the timeseries values 1 to 16 for each species at your first location (Akron). Note that if you want a graph for each location you will have to repeat this per location, which you can do by changing the number between [] from 1 to 2, and so on up to your 17 locations:
new.data <-data.frame(timeseries = rep(1:16, times = 5), Species = rep(unique(dat$Species), each = 16), Location = rep(unique(dat$Location)[1], each = 80))
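A grid like this can also be built with expand.grid(), which avoids counting repetitions by hand. This is only a sketch: the species and location labels below are hypothetical stand-ins for unique(dat$Species) and unique(dat$Location).

```r
# Hypothetical labels standing in for unique(dat$Species) and unique(dat$Location)
species_levels  <- c("sp_A", "sp_B", "sp_C", "sp_D", "sp_E")
location_levels <- c("loc_1", "loc_2")

# One row per combination of year index and species, for the first location only
new_data_demo <- expand.grid(
  timeseries = 1:16,
  Species    = species_levels,
  Location   = location_levels[1],
  stringsAsFactors = FALSE
)

nrow(new_data_demo)  # 16 time points x 5 species
```

When the grid is used for prediction, the Species and Location values need to match the levels the model was fitted with.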
Then predict the values and intervals for that dataset:
preds <- predictInterval(glmm1, newdata = new.data, n.sims = 1000)
Now join this prediction to your new.data:
new.data <- cbind(new.data, preds)
And finally plot it with a different color for each species:
ggplot(new.data, aes(x = timeseries, y = fit)) + geom_ribbon(aes(ymax = upr, ymin = lwr, fill = Species), alpha = 0.9) + geom_line(aes(color = Species))
If you don't understand something, don't hesitate to keep asking. Currently your standard errors are quite big, so first check whether you like this version better:
ggplot(new.data, aes(x = timeseries, y = fit)) + geom_line(aes(color=Species))
I have two questions:
I am using the pROC package to calculate the CI of the ROC curve for a logistic regression model and a random forest model. What I cannot understand is which algorithm is used for this computation. Is it the vertical averaging algorithm? Tom Fawcett's paper mentions, "Confidence intervals of the mean of tp rate are computed using the common assumption of a binomial distribution." Does he mean the normal approximation? Moreover, is the curve that I am plotting the average curve?
forest <- randomForest(factor(extreme, levels = c("Yes", "No"))~ tas + X0+X1+X2+X3+X4+X5+X8,
train_df, ntree = 500, na.action = na.omit)
Random_Forest <- predict(forest, test_df, type = "prob")[,2]
roc <- roc(test_df$extry, Random_Forest , plot=TRUE, legacy.axes=TRUE)
Logistic_Regression <- predict(model,test_df, type='response')
roc <- roc(test_df$extry, Logistic_Regression, plot=TRUE,legacy.axes=TRUE)
roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, .1), boot.n=2000, stratified=TRUE, conf.level=0.95,parallel = TRUE)
dat.ci.list <- lapply(ci.list, function(ciobj)
data.frame(x = as.numeric(rownames(ciobj)),
lower = ciobj[, 1],
upper = ciobj[, 3]))
p <- ggroc(roc.list,legacy.axes=TRUE,aes = c("linetype")) +
labs(x = "False Positive Rate", y = "True Positive Rate", linetype="Model")+
scale_linetype_discrete(labels=c("Logistic Regression","Random Forest"))+
theme_classic() +
geom_abline(slope=1, intercept = 0, linetype = "dashed", alpha=0.7, color = "grey") +
coord_equal()
for(i in 1:2) {
p <- p + geom_ribbon(
data = dat.ci.list[[i]],
aes(x = 1-x, ymin = lower, ymax = upper),
fill = i + 1,
alpha = 0.2,
inherit.aes = FALSE)
}
p
Can I use the pROC package to calculate the CI on the test datasets obtained from cross-validation? For example, if I use 10-fold cross-validation for the logistic regression model, I will have 10 ROC curves. This part of the code, roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE), will not work since the data are not the same in the 10 different test datasets. Any ideas?
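One common way around the cross-validation problem is to pool the out-of-fold predictions into a single vector that lines up with the observed outcomes, and then build one ROC curve from it. The sketch below uses made-up data and a plain logistic regression; `demo`, `fold` and `oof_pred` are illustrative names, and whether pooling suits this analysis is a judgment call, since the pooled curve mixes predictions from ten slightly different models.

```r
set.seed(1)

# Made-up data: binary outcome y and one predictor x
n    <- 200
x    <- rnorm(n)
y    <- rbinom(n, 1, plogis(0.5 + 1.2 * x))
demo <- data.frame(x = x, y = y)

# Assign each row to one of 10 folds of equal size
k    <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Collect out-of-fold predicted probabilities in the original row order
oof_pred <- numeric(n)
for (j in seq_len(k)) {
  test_idx <- fold == j
  fit_j    <- glm(y ~ x, data = demo[!test_idx, ], family = binomial)
  oof_pred[test_idx] <- predict(fit_j, newdata = demo[test_idx, ], type = "response")
}

# demo$y and oof_pred now line up row by row, so they can be passed to a
# single ROC call, e.g. pROC::roc(demo$y, oof_pred), followed by ci.se()
```

The alternative is to compute a curve and interval per fold and summarise those, which keeps the folds separate at the cost of fewer observations per curve.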
So, I have my Poisson regression model (MVdiff = market value difference, participations = participations in the World Cup):
mod <- glm(goals~participations+MVdiff, family = poisson)
plot(MVdiff, jitter(goals , 0.2), pch=17)
Now I would like to add the regression line to my plot as a function of MVdiff, with participations fixed at 9.2 (the mean number of World Cup participations).
Here is my attempt:
curve(exp(mod$coefficients[1]+mod$coefficients[2]*x+9.2+mod$coefficients[3]),
lwd=3,col="red",add=TRUE)
But this doesn't quite work out. Is there a way to properly plug the value 9.2 into the participations term?
To get the model output at a fixed value of an independent variable, use predict with the value of that variable fixed:
pred_df <- data.frame(MVdiff = seq(-1500, 1500), participations = 9.2)
predictions <- predict(mod, newdata = pred_df , type = "response")
Now plot this as a line over your data:
plot(df$MVdiff, jitter(df$goals , 0.2), pch = 16)
lines(pred_df$MVdiff, predictions, col = "red")
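For comparison, the same line can be written out from the coefficients by hand. The sketch below uses simulated stand-in data (not the asker's) and shows that each coefficient multiplies its own variable, with participations held at 9.2; `demo`, `fit` and `x_seq` are illustrative names.

```r
set.seed(42)

# Simulated stand-in data: goals depend on MVdiff on the log scale
demo <- data.frame(
  participations = rpois(300, 9.2),
  MVdiff         = rnorm(300, 0, 600)
)
demo$goals <- rpois(300, exp(0.2 + 0.0005 * demo$MVdiff))

fit <- glm(goals ~ participations + MVdiff, data = demo, family = poisson)
b   <- coef(fit)

# Each coefficient multiplies its own variable; participations is held at 9.2
x_seq  <- seq(-1500, 1500, by = 50)
manual <- exp(b["(Intercept)"] + b["participations"] * 9.2 + b["MVdiff"] * x_seq)

# Same values via predict()
via_predict <- predict(fit,
                       newdata = data.frame(MVdiff = x_seq, participations = 9.2),
                       type = "response")
```

In the attempt from the question, 9.2 was added rather than multiplied by the participations coefficient, and x was attached to the participations coefficient instead of the MVdiff one.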
Data used
Obviously, we don't have your data, so I had to create my own for the example with the following code:
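set.seed(123)
participations <- rpois(500, 9.2)
MVdiff <- rnorm(500, 0, 600)
goals <- rpois(500, 1 + MVdiff/3000)
# Collect the variables in the data frame used in the plot call and fit the model
df <- data.frame(goals, participations, MVdiff)
mod <- glm(goals ~ participations + MVdiff, data = df, family = poisson)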
I have run a mixed effects binary model using the following code:
model = glmer(A ~ B + (1|C), data = data, family = "binomial")
summary(model)
I am now plotting the marginal fixed effects for a variable of interest (B). I have taken the code from the nice page on:
https://cran.r-project.org/web/packages/ggeffects/vignettes/practical_logisticmixedmodel.html
To produce the graph I have used:
ggpredict(model, "B")
plot(ggpredict(model, "B"))
This creates the following plot, which I like. But I also want the data points for variable B to show on the graph. How can I add these? Thanks.
Welcome to Stack Overflow :)
Sadly, I don't know how (or whether it is possible) to add points to your plot of the ggpredict object, since I am no good with ggplot :/
But I can offer a workaround with base plot. The only thing missing is the grey confidence band... which may be crucial for good looks? :D
Cheers
#using the example data from the link you provided:
library(magrittr)
library(ggeffects)
library(sjmisc)
library(lme4)
library(splines)
set.seed(123)
#creating the data:
dat <- data.frame(
outcome = rbinom(n = 100, size = 1, prob = 0.35),
var_binom = as.factor(rbinom(n = 100, size = 1, prob = 0.2)),
var_cont = rnorm(n = 100, mean = 10, sd = 7),
group = sample(letters[1:4], size = 100, replace = TRUE)
)
dat$var_cont <- sjmisc::std(dat$var_cont)
#model creation:
m1 <- glmer( outcome ~ var_binom + var_cont + (1 | group),
data = dat,
family = binomial(link = "logit")
)
#save results:
m1_results <- ggpredict(m1, "var_cont")
#same plot you did:
plot(m1_results)
#workaround using baseplot:
#plotting the raw data:
plot(dat$outcome~dat$var_cont,
pch = 16,
ylab = "outcome",
xlab = "var_cont",
yaxt = "n")
#adding yaxis with percentages:
axis(2, at = pretty(dat$outcome), labels = paste0(pretty(dat$outcome) * 100," %"), las = 1)
#adding the model taken from ggpredict:
lines(m1_results$predicted~m1_results$x,
type = "l")
#upper and lower conf intervals:
lines(m1_results$conf.low~m1_results$x,
lty=2)
lines(m1_results$conf.high~m1_results$x,
lty=2)
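The grey band that the base-plot workaround leaves out can be drawn with polygon(). The sketch below is self-contained and therefore uses a plain glm on stand-in data rather than the mixed model; `pred` is a hypothetical table that only mimics the x, predicted, conf.low and conf.high columns used above.

```r
set.seed(123)

# Stand-in data and a plain glm (not the mixed model above), just to get a curve with bounds
demo <- data.frame(outcome = rbinom(100, 1, 0.35), var_cont = rnorm(100))
fit  <- glm(outcome ~ var_cont, data = demo, family = binomial)

# Hypothetical table with the same column names as the ggpredict() result used above
grid <- data.frame(var_cont = seq(min(demo$var_cont), max(demo$var_cont), length.out = 50))
lp   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
pred <- data.frame(
  x         = grid$var_cont,
  predicted = plogis(lp$fit),
  conf.low  = plogis(lp$fit - 1.96 * lp$se.fit),
  conf.high = plogis(lp$fit + 1.96 * lp$se.fit)
)

# Raw data, then a shaded band (upper edge left to right, lower edge back), then the line
plot(outcome ~ var_cont, data = demo, pch = 16, ylab = "outcome")
polygon(c(pred$x, rev(pred$x)), c(pred$conf.high, rev(pred$conf.low)),
        col = adjustcolor("grey", alpha.f = 0.4), border = NA)
lines(predicted ~ x, data = pred)
```

With the real model, the same polygon() call should work on m1_results directly, since it has these four columns.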
I have predicted values, via:
glm0 <- glm(use ~ as.factor(decision), data = decision_use, family = binomial(link = "logit"))
predicted_glm <- predict(glm0, newdata = decision_use, type = "response", interval = "confidence", se = TRUE)
predict <- predicted_glm$fit
predict <- predict + 1
head(predict)
1 2 3 4 5 6
0.3715847 0.3095335 0.3095335 0.3095335 0.3095335 0.5000000
Now when I plot a box plot using ggplot2,
ggplot(decision_use, aes(x = decision, y = predict)) +
geom_boxplot(aes(fill = factor(decision)), alpha = .2)
I get a box plot with one horizontal line per category. If you look at the predicted values, they are the same within each category, so that makes sense.
But I want a box plot that shows the range. How can I get that? When I use "use" instead of predict, I get boxes stretching from end to end (1 to 0), so I suppose that's not it. Thank you in advance.
To clarify, predicted_glm includes se.fit values. I wonder how to incorporate those.
It doesn't really make sense to do a boxplot here. A boxplot shows the range and spread of a continuous variable within groups. Your dependent variable is binary, so the values are all 0 or 1. Since you are plotting predictions for each group, your plot would have just a single point representing the expected value (i.e. the probability) for each group.
The closest you can come is probably to plot the prediction with 95% confidence bars around it.
You haven't provided any sample data, so I'll make some up here:
set.seed(100)
df <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)), var1 = rep(c("A", "B"), 100))
Now we'll create our model and get the prediction for each level of my predictor variable using the newdata parameter of predict. I'm going to specify type = "link" because I want the log odds, and I'm also going to specify se.fit = TRUE so I can get the standard error of these predictions:
mod <- glm(outcome ~ var1, data = df, family = binomial)
prediction <- predict(mod, list(var1 = c("A", "B")), se.fit = TRUE, type = "link")
Now I can work out the 95% confidence intervals for my predictions:
prediction$lower <- prediction$fit - prediction$se.fit * 1.96
prediction$upper <- prediction$fit + prediction$se.fit * 1.96
Finally, I transform the fit and confidence intervals from log odds into probabilities:
prediction <- lapply(prediction, function(logodds) exp(logodds)/(1 + exp(logodds)))
plotdf <- data.frame(Group = c("A", "B"), fit = prediction$fit,
upper = prediction$upper, lower = prediction$lower)
plotdf
#> Group fit upper lower
#> 1 A 0.13 0.2111260 0.07700412
#> 2 B 0.92 0.9594884 0.84811360
Now I am ready to plot. I will use geom_point for the probability estimates and geom_errorbar for the confidence intervals:
library(ggplot2)
ggplot(plotdf, aes(x = Group, y = fit, colour = Group)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2, width = 0.5) +
geom_point(size = 3, colour = "black") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability estimate with 95% CI", y = "Probability")
Created on 2020-05-11 by the reprex package (v0.3.0)
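The back-transformation step in this answer can also be written with plogis(), the inverse logit in base R, applied only to the three quantities of interest. This is a sketch on the same kind of made-up data; `demo`, `fit` and `ci` are illustrative names, and qnorm(0.975) replaces the hard-coded 1.96.

```r
set.seed(100)

# Same kind of made-up data as in the answer
demo <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)),
                   var1    = rep(c("A", "B"), 100))
fit  <- glm(outcome ~ var1, data = demo, family = binomial)
p    <- predict(fit, data.frame(var1 = c("A", "B")), se.fit = TRUE, type = "link")

z <- qnorm(0.975)  # about 1.96

# Build the interval on the log-odds scale, then map each bound to a probability
ci <- data.frame(
  Group = c("A", "B"),
  fit   = plogis(p$fit),
  lower = plogis(p$fit - z * p$se.fit),
  upper = plogis(p$fit + z * p$se.fit)
)
ci
```

Building the interval on the link scale and transforming afterwards keeps both bounds inside 0 and 1, which an interval computed directly on the probability scale does not guarantee.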
I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of 2 columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is sort the data into ranges, for example all data where WindSpeed is between x and y, then find the average power generated for each range and graph the resulting curve.
I then want to recalculate the average based on data which falls within one or two standard deviations of that average (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph but, like I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
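Since the question mentions ignoring outliers, it may be worth knowing that loess() accepts family = "symmetric", which switches from least squares to a robust fit that downweights extreme points. The sketch below compares the two on simulated data; the outliers set to 3 are made up to stand in for faulty readings.

```r
set.seed(1)

# Simulated power curve with a handful of gross outliers added
w_sp  <- runif(1000, 0, 100)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)
power[sample(1000, 20)] <- 3   # hypothetical faulty readings

x_grid <- seq(1, 99, length.out = 100)

# Default least-squares fit versus the robust fit
fit_ls     <- loess(power ~ w_sp)
fit_robust <- loess(power ~ w_sp, family = "symmetric")

pred_ls     <- predict(fit_ls, data.frame(w_sp = x_grid))
pred_robust <- predict(fit_robust, data.frame(w_sp = x_grid))

plot(w_sp, power, col = "grey")
lines(x_grid, pred_ls, col = "red", lwd = 2)
lines(x_grid, pred_robust, col = "blue", lwd = 2)
```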
Throw this version, similar in motivation to @hadley's, into the mix: an additive model with an adaptive smoother, fitted with package mgcv.
Dummy data first, as used by @hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
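noOutliers = function(x, power, w_sp, bin_incr) {
# Values in the bin [x, x + bin_incr)
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
# Keep only the values within two standard deviations of the bin mean
d_trim = d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))]
return(mean(d_trim))
}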
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
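The same binning can be sketched with cut() and tapply(), which label each observation with its bin and then summarise per label. The data are simulated in the same spirit as above, and `bin`, `trimmed_mean`, `bin_mean` and `bin_trimmed` are illustrative names.

```r
set.seed(1)

# Simulated data in the same spirit as above
w_sp  <- runif(1000, 0, 100)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

# Label each observation with its 5-unit bin: [0,5), [5,10), ..., [95,100)
bin <- cut(w_sp, breaks = seq(0, 100, by = 5), right = FALSE)

# Mean of the values lying within two standard deviations of their bin mean
trimmed_mean <- function(d) mean(d[abs(d - mean(d)) < 2 * sd(d)])

bin_mean    <- tapply(power, bin, mean)
bin_trimmed <- tapply(power, bin, trimmed_mean)
```

The results are named by bin label, so they can be plotted against the bin midpoints in the same way as y_mean.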
Here are some examples of fitted curves (weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
# shape = 16 is a small filled circle
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
# ggtitle() replaces opts(title = ...), which is no longer available in ggplot2
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
ggtitle("Wind speed and power")