I'm a statistics student and would kindly like to request some assistance. I would like to plot predicted values together with actual values over the course of 100 days in my dataset:
Sample Data:
set.seed(1010)
count<-rpois(100, lambda = 5)
mood<- rbinom(100, size = 1, prob = .7)
temp<-rnorm(100, mean = 20, sd = 5)
wind<-rbinom(100, size = 3, prob = .7)
days<-seq(1,100,by=1)
df<-data.frame(count,mood,temp,wind,days)
Plotting actual values during 100 days:
plot(count~days,type="l")
Regression:
poisson <- glm(count ~ mood + wind + temp)
Conditioning on my predictors and obtaining predicted values:
hyp<- c(1,1,3,20)
coeff.p1 <- poisson$coefficients
XB <- hyp%*%coeff.p1
predv.y <- exp(XB)
predv.y
Maybe there is a way to predict values for all observations, for example:
coeff.p1 <- poisson$coefficients
XB <- c(,2:4)%*%t(coeff.p1)
I intend to multiply by columns 2:4 but always get
Error in c(, 2:4) : argument 1 is empty
Here I'm stuck. In the end I would like to obtain predicted values and actual values for the 100 days on one plot.
Thank you
Your object poisson is of class glm, so it has a predict method that you can call with any new data.
poisson <- glm(count ~ mood + temp + wind, data = df)
df$pred <- predict(poisson, df[, 2:4])             # fitted values for every observation
plot(df$days, df$count)                            # actual counts as points
lines(df$days, df$pred, type = "l", col = "blue")  # predicted values as a line
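If you do want the manual matrix route the question was attempting, a rough sketch (the design matrix needs an intercept column and the predictors in the same order as the formula; since the model above was fit with the default gaussian family, the linear predictor is already on the response scale and no exp() is applied):
X <- cbind(1, as.matrix(df[, c("mood", "temp", "wind")]))       # intercept + predictors in formula order
XB <- X %*% coef(poisson)                                       # linear predictor for every observation
head(cbind(manual = drop(XB), predict = predict(poisson, df)))  # the two columns should agree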
Try this:
set.seed(1010)
count<-rpois(100, lambda = 5)
mood<- rbinom(100, size = 1, prob = .7)
temp<-rnorm(100, mean = 20, sd = 5)
wind<-rbinom(100, size = 3, prob = .7)
days<-seq(1,100,by=1)
df<-data.frame(count,mood,temp,wind,days)
poisson <- glm(count ~ mood + wind + temp,
               family = poisson(),  # specify your model type
               data = df)
# Calculate the predicted probabilities (predprob() comes from the pscl package)
library(pscl)
phat.poisson <- predprob(poisson)                # for each observation, prob of observing each count
phat.poisson.mn <- apply(phat.poisson, 2, mean)  # mean predicted probs
#your plot of observed vs. predicted
hist(count, prob = TRUE, col = "grey60", breaks=seq(-0.5, 12.5, 1), xlab = "Counts",main = NULL, ylim=c(0, .20))
lines(x = seq(0, 12, 1), y = phat.poisson.mn, lty=2, lwd=2, col="red")
points(x = seq(0, 12, 1), y = phat.poisson.mn, cex=1, pch=16, col="red")
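To tie this back to the original goal of showing predicted and actual values over the 100 days, a small follow-up sketch using the same family = poisson() fit (predictions on the response scale):
df$pred.pois <- predict(poisson, type = "response")                # predicted mean count per day
plot(df$days, df$count, type = "l", xlab = "Day", ylab = "Count")  # actual counts
lines(df$days, df$pred.pois, col = "red")                          # predicted counts
legend("topright", legend = c("actual", "predicted"), col = c("black", "red"), lty = 1)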
I am trying to run a simulation to illustrate the unbiasedness of the OLS slope in simple linear regression when Y is affected by different levels of error variance. The following code seems to use only the first element of sd <- c(1, 10). I guess something is wrong with how sd is used within the sapply() function, but I can't find the actual error.
set.seed(6578)
n <- 200 # sample size
n_samples <- 5000 # Number of samples drawn
sd <- c(1, 10) # Varying levels of error
x <- rnorm(n) # Covariate values
regressions <- function(x, sd) {
  z <- .4 * x + rnorm(n, 0, sd)
  fit_lm <- lm(z ~ x)
  slopes <- fit_lm$coefficients[2]
  return(slopes)
}
sample_slopes_varying_error <- sapply(sd, function(f) replicate(n_samples, regressions(x, sd)))
plot(density(sample_slopes_varying_error[, 1]), col = "red", main = "", xlab = "Slope estimates")
lines(density(sample_slopes_varying_error[, 2]), col = "blue")
abline(v = mean(sample_slopes_varying_error[, 1]), lty = "dashed", col = "red")
abline(v = mean(sample_slopes_varying_error[, 2]), lty = "dotted", col = "red")
apply(sample_slopes_varying_error, 2, sd) # Almost identical SDs.
#> [1] 0.5282542 0.5353999
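(For what it's worth, the likely culprit is that the anonymous function ignores its argument f and passes the whole sd vector to regressions(), so rnorm() just recycles c(1, 10) within every sample. A sketch of the presumably intended call:)
sample_slopes_varying_error <- sapply(sd, function(f) replicate(n_samples, regressions(x, f)))
apply(sample_slopes_varying_error, 2, sd)  # the two columns should now have clearly different SDs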
I have run a mixed effects binary model using the following code:
model = glmer(A ~ B + (1|C), data = data, family = "binomial")
summary(model)
I am now plotting the marginal fixed effects for a variable of interest (B). I have taken the code from the nice page on:
https://cran.r-project.org/web/packages/ggeffects/vignettes/practical_logisticmixedmodel.html
To produce the graph I have used:
ggpredict(model, "B")
plot(ggpredict(model, "B"))
The following plot is created, which I like. But I also want the data points for variable B to show on the graph. How can I add these? Thanks.
Welcome to Stack Overflow :)
Sadly, I don't know how (or whether it is possible) to add points to your plot of the ggpredict object, since I am no good with ggplots :/
But I can offer a workaround with base plots. The only thing missing is the grey confidence band... which may be crucial for good looks? :D
Cheers
#using the example data from the link you provided:
library(magrittr)
library(ggeffects)
library(sjmisc)
library(lme4)
library(splines)
set.seed(123)
#creating the data:
dat <- data.frame(
  outcome = rbinom(n = 100, size = 1, prob = 0.35),
  var_binom = as.factor(rbinom(n = 100, size = 1, prob = 0.2)),
  var_cont = rnorm(n = 100, mean = 10, sd = 7),
  group = sample(letters[1:4], size = 100, replace = TRUE)
)
dat$var_cont <- sjmisc::std(dat$var_cont)
#model creation:
m1 <- glmer(outcome ~ var_binom + var_cont + (1 | group),
            data = dat,
            family = binomial(link = "logit"))
#save results:
m1_results <- ggpredict(m1, "var_cont")
#same plot you did:
plot(m1_results)
#workaround using baseplot:
#plotting the raw data:
plot(dat$outcome ~ dat$var_cont,
     pch = 16,
     ylab = "outcome",
     xlab = "var_cont",
     yaxt = "n")
#adding a y-axis with percentages:
axis(2, at = pretty(dat$outcome),
     lab = paste0(pretty(dat$outcome) * 100, " %"), las = TRUE)
#adding the model fit taken from ggpredict:
lines(m1_results$predicted ~ m1_results$x, type = "l")
#upper and lower confidence intervals:
lines(m1_results$conf.low ~ m1_results$x, lty = 2)
lines(m1_results$conf.high ~ m1_results$x, lty = 2)
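If you do want to stay in ggplot2, an alternative sketch that keeps the shaded confidence band; it reuses the same m1_results columns (x, predicted, conf.low, conf.high) as the base-plot code above and overlays the raw 0/1 outcomes:
library(ggplot2)
ggplot(m1_results, aes(x = x, y = predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +            # grey confidence band
  geom_line() +
  geom_point(data = dat, aes(x = var_cont, y = outcome), inherit.aes = FALSE)   # raw data points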
I have data with a continuous independent variable and a binary dependent variable. Therefore I was trying to apply logistic regression to these data. However, in contrast to the classical case with a single S-shaped transition, I have two transitions.
Here is an example of what I mean:
library(ggplot2)
library(visreg)
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                          y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
                    data = classic.data,
                    family = "binomial")
summary(model.classic)
visreg(model.classic,
       partial = FALSE,
       scale = "response",
       alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
               data = my.data,
               family = "binomial")
summary(model.my)
visreg(model.my,
       partial = FALSE,
       scale = "response",
       alpha = 0)
The blue lines on both plots are the glm fits, while the red line is what I want to have.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?
In your second model, y is not a linear function of x. When you write y ~ x you assume that as x increases, y will increase or decrease depending on the sign of the coefficient. That is not the case here: the response first increases and then decreases, making the average effect of x zero (hence the straight line). You therefore need a non-linear function of x. You could do that with a GAM from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
                     y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
plot(m)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")
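If you would rather stay within glm(), another sketch is to allow a non-monotone effect of x by adding a quadratic term; whether that is flexible enough depends on your data:
m2 <- glm(y ~ x + I(x^2), data = my.data, family = binomial)
my.data$prediction2 <- predict(m2, type = "response")
lines(my.data$x, my.data$prediction2, col = "blue", lty = 2)  # add to the plot above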
I have these curves below:
These curves were generated using a library called discreteRV.
library(discreteRV)
placebo.rate <- 0.5
mmm.rate <- 0.3
mmm.power <- power.prop.test(p1 = placebo.rate, p2 = mmm.rate, power = 0.8, alternative = "one.sided")
n <- as.integer(ceiling(mmm.power$n))
patients <- seq(from = 0, to = n, by = 1)
placebo_distribution <- dbinom(patients, size = n, prob = placebo.rate)
mmm_distribution <- dbinom(patients, size = n, prob = mmm.rate)
get_pmf <- function(p1, p2) {
  X1 <- RV(patients, p1, fractions = FALSE)
  X2 <- RV(patients, p2, fractions = FALSE)
  pmf <- joint(X1, X2, fractions = FALSE)
  return(pmf)
}
extract <- function(string) {
  ints <- unlist(strsplit(string, ","))
  x1 <- as.integer(ints[1])
  x2 <- as.integer(ints[2])
  return(x1 - x2)
}
diff_prob <- function(pmf) {
  diff <- unname(sapply(outcomes(pmf), FUN = extract) / n)
  probabilities <- unname(probs(pmf))
  df <- data.frame(diff, probabilities)
  df <- aggregate(. ~ diff, data = df, FUN = sum)
  return(df)
}
most_likely_rate <- function(x) {
  x[which(x$probabilities == max(x$probabilities)), ]$diff
}
mmm_rate_diffs <- diff_prob(get_pmf(mmm_distribution,placebo_distribution))
placebo_rate_diffs <- diff_prob(get_pmf(placebo_distribution,placebo_distribution))
plot(mmm_rate_diffs$diff,mmm_rate_diffs$probabilities * 100, type = "l", lty = 2, xlab = "Rate difference", ylab = "# of trials per 100", main = paste("Trials with",n,"patients per treatment arm",sep = " "))
lines(placebo_rate_diffs$diff, placebo_rate_diffs$probabilities * 100, lty = 1, xaxs = "i")
abline(v = c(most_likely_rate(placebo_rate_diffs), most_likely_rate(mmm_rate_diffs)), lty = c(1,2))
legend("topleft", legend = c("Alternative hypothesis", "Null hypothesis"), lty = c(2,1))
Basically, I took two binomial discrete random variables, created their joint probability mass function, determined the probability of each possible rate difference, and then plotted those probabilities to show the distribution of rate differences over 100 identical trials under the null hypothesis and under the alternative hypothesis.
Now I want to illustrate the 5th percentile on the null hypothesis curve. Unfortunately, I don't know how to do this. If I simply use quantile(x = placebo_rate_diffs$diff, probs = 0.05), I get -0.377027, which can't be correct looking at the graph. I want to calculate the 5th percentile the way I would with pbinom(), but I don't know how to do that with a curve built from what are essentially just x and y vectors.
Maybe I can approximate these two curves as binomial since they appear to be, but I am still not sure how to do this.
Any help would be appreciated.
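One way to read a percentile off a discrete distribution stored as (value, probability) pairs, rather than calling quantile() on the raw values, is to accumulate the probabilities and take the first rate difference whose cumulative probability reaches 5%. A sketch only, assuming the probabilities sum to 1 and that diff is sorted ascending (which aggregate() provides):
cum.p <- cumsum(placebo_rate_diffs$probabilities)
cutoff <- placebo_rate_diffs$diff[which(cum.p >= 0.05)[1]]  # 5th percentile of the null curve
abline(v = cutoff, lty = 3)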
I have created a GLMM and plotted the predicted probabilities for each factor. However, I cannot work out how to create confidence intervals using the bootMer function. I keep getting the error message "cannot simulate from non integer prior weights".
I'm hoping someone would be able to help? Thanks in advance.
dat$timeseries <- dat$Study_Date - 1998
glmm1 <- glmer(cbind(Total_Dead, Total_Collected - Total_Dead) ~
                 Species + timeseries + (1|Location),
               data = dat, family = "binomial")
plot(predict(glmm1, data.frame(Species="An. Arab", timeseries= dat$timeseries),
type="response", re.form = NA) ~
dat$timeseries, frame=FALSE, bty="n", pch="", xaxt="n", ylim=c(0, 0.5),
ylab="Predicted mortality", xlab="Year",
main = "Predicted mortality by species",
cex.lab=1.6, yaxt="n")
axis(1, at=c(1:17), labels=1999:2015, cex.axis=1.8)
axis(2, las=2, at=seq(0, 1, 0.2), labels=seq(0, 1, 0.2), cex.axis=1.8)
COLS <- c("blue", "red", "purple", "aquamarine3", "orange")
PCH <- c(17, 15, 19, 20, 5)
for(i in 1:length(unique(levels(dat$Species)))){
points((dat$Total_Dead[dat$Species == levels(dat$Species)[i]] /
dat$Total_Collected[dat$Species == levels(dat$Species)[i]]) ~
dat$timeseries[dat$Species == levels(dat$Species)[i]],
pch=PCH[i], col=COLS[i])
lines(predict(glmm1, data.frame(Species=levels(dat$Species)[i],
timeseries = dat$timeseries), type="response",
re.form = NA) ~ dat$timeseries, lty=4, col=COLS[i])
}
bootstrap <- bootMer(x=glmm1, FUN= fixef, nsim=200)
For some reason bootMer has problems with that; you can use the merTools package instead:
library(merTools)
preds <- predictInterval(glmm1, newdata = your.dataframe, n.sims = 1000)
I would then use the preds data.frame to plot: the resulting data.frame has the fit and the upper and lower limits, so you can use geom_ribbon to plot it. If you need more help, let me know.
Now bear with me: you actually want to build a new, standardized dataset for your graph. If you use this code it will work. First fit your model:
glmm1 <- glmer(cbind(Total_Dead, Total_Collected - Total_Dead) ~
Species + timeseries + (1|Location),
data = dat,family= "binomial")
Then create your new dataset. This will have your timeseries from 1 to 16 for each species, in your first location (Akron). Note that you will have to repeat this for each location if you want a graph per location; you can do that by changing the index inside [] from 1 to 2, and so on up to your 17 locations.
new.data <- data.frame(timeseries = rep(1:16, times = 5),
                       Species = rep(unique(dat$Species), each = 16),
                       Location = rep(unique(dat$Location)[1], each = 80))
Then predict the values and intervals for that dataset:
preds <- predictInterval(glmm1, newdata = new.data, n.sims = 1000)
Now join these predictions to your new.data:
new.data <- cbind(new.data, preds)
And finally plot it with different colors for each species:
library(ggplot2)
ggplot(new.data, aes(x = timeseries, y = fit)) +
  geom_ribbon(aes(ymax = upr, ymin = lwr, fill = Species), alpha = 0.9) +
  geom_line(aes(color = Species))
If you don't understand something, don't hesitate to keep asking. Currently your standard errors are quite big, so first check whether you like this version (without the ribbon) better:
ggplot(new.data, aes(x = timeseries, y = fit)) + geom_line(aes(color=Species))
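If you also want the observed mortality on the same panel, a hedged addition using the column names from the question (Total_Dead and Total_Collected):
ggplot(new.data, aes(x = timeseries, y = fit)) +
  geom_ribbon(aes(ymax = upr, ymin = lwr, fill = Species), alpha = 0.3) +
  geom_line(aes(color = Species)) +
  geom_point(data = dat, aes(x = timeseries, y = Total_Dead / Total_Collected, color = Species))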