After fitting the cumulative distribution in R creating the normal distribution from fitted parameters - r

After successfully fitting my cumulative data with Gompertz function, I need to create normal distribution from fitted function.
This is the code so far:
df <- data.frame(x = c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
options(scipen = 10) #to avoid scientific notation in x axis
plot(df$x, predict(fm),type = "l", log = "x",col = "blue",
main = "Cumulative function distribution",xlab = "x", ylab = "y")
points(df,col = "red")
legend("topleft", inset = .05,legend = c("exp","fit")
,lty = c(NA,1), col = c("red", "blue"), pch = c(1,NA), lwd=1, bty = "n")
summary(fm)
And this is the following plot:
My idea is now to transform somehow this cumulative fit to the normal distribution. Is there any idea how could I do that?

While your original intention might be non-parametric, I suggest using parametric estimation method: method of moments, which is widely used for problems like this, because you have a certain parametric distribution (normal distribution) to fit. The idea is quite simple, from the fitted cumulative distribution function, you can calculate the mean (E1 in my code) and variance (square of SD in my code), and then the problem is solved, because normal distribution can be totally determined by mean and variance.
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
options(scipen = 10) #to avoid scientific notation in x axis
plot(df$x, predict(fm),type="l", log = "x",col="blue", main="Cumulative distribution function",xlab="x", ylab="y")
points(df,col="red")
E1 <- sum((df$x[-1] + df$x[-length(df$x)]) / 2 * diff(predict(fm)))
E2 <- sum((df$x[-1] + df$x[-length(df$x)]) ^ 2 / 4 * diff(predict(fm)))
SD <- sqrt(E2 - E1 ^ 2)
points(df$x, pnorm((df$x - E1) / SD), col = "green")
legend("topleft", inset = .05,legend= c("exp","fit","method of moment")
,lty = c(NA,1), col = c("red", "blue", "green"), pch = c(1,NA), lwd=1, bty="n")
summary(fm)
And the estimation results:
## > E1 (mean of fitted normal distribution)
## [1] 65.78474
## > E2 (second moment of fitted normal distribution)
##[1] 5792.767
## > SD (standard deviation of fitted normal distribution)
## [1] 38.27707
## > SD ^ 2 (variance of fitted normal distribution)
## [1] 1465.134
Edit: updated method to calculate moments from cdf fitted by drc. The function moment defined below calculates moment estimation using the moment formula for continuous r.v. E(X ^ k) = k * \int x ^ {k - 1} (1 - cdf(x)) dx. These are the best estimation I can get from the fitted cdf. And the fit is not very good when x is near zero because of the reason in original datasets as I discussed in comments.
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
moment <- function(k){
f <- function(x){
x ^ (k - 1) * pmax(0, 1 - predict(fm, data.frame(x = x)))
}
k * integrate(f, lower = min(df$x), upper = max(df$x))$value
}
E1 <- moment(1)
E2 <- moment(2)
SD <- sqrt(E2 - E1 ^ 2)

I was thinking of the cumdiff (for lack of a better term). The link helped a lot.
EDIT
plot(df$x[-1], Mod(df$y[-length(df$y)]-df$y[-1]), log = "x", type = "b",
main = "Normal distribution for original data",
xlab = "x", ylab = "y")
yielding:
ADDITION
In order to get the Gaussian from the fittedfunction:
df$y_pred<-predict(fm)
plot(df$x[-1], Mod(df$y_pred[-length(df$y_pred)]-df$y_pred[-1]), log = "x",
type = "b", main="Normal distribution for fitted function",
xlab = "x", lab = "y")
yielding:

Related

Force GAM model fit to be monotonic and go through a fixed point (x0, y0) with R mgcv

I am trying to fit a GAM model to data under two constraints simultatenously: (1) the fit is monotonic (increasing), (2) the fit goes through a fixed point, say, (x0,y0).
So far, I managed to have these two constraints work separately:
For (1), based on mgcv::pcls() documentation examples, by using mgcv::mono.con() to get linear constraints sufficient for monotonicity, and estimate model coefs via mgcv::pcls(), using the constraints.
For (2), based on this post, by setting the value of spline at knot location x0 to 0 + using offset term in the model formula.
However, I struggle to combine these two constraints simultaneously. I guess a way to go is mgcv::pcls(), but I could work out neither (a) doing a similar trick of setting the value of spline at knot location x0 to 0 + using offset nor (b) setting equality constraint(s) (which I think could yield my (2) constraint setup).
I also note that the approach for setting the value of spline at knot location x0 to 0 for my constrain condition (2) yields weirdly wiggly outcome (as compared to unconstrained GAM fit) -- as showed below.
Attempt so far: fit a smooth function to data under two constraints separately
Simulate some data
library(mgcv)
set.seed(1)
x <- sort(runif(100) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(100) * 0.1
dat <- data.frame(x=x, y=y)
GAM unconstrained (for comparison)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
GAM constrained: (1) the fit is monotonic (increasing)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= dat$y * 0 + 1 # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
Blue line: constrained; red line: unconstrained
GAM constrained: (2) fitted go through (x0,y0)=(-1, -0.1)
k <- 13
## Create a spline basis and penalty
## Make sure there is a knot at the constraint point (here: -1)
knots <- data.frame(x = seq(-1,3,length=k))
# explicit construction of a smooth term in a GAM
sm <- smoothCon(s(x,k=k,bs="cr"), dat, knots=knots)[[1]]
## 1st parameter is value of spline at knot location -1, set it to 0 by dropping
knot_which <- which(knots$x == -1)
X <- sm$X[, -knot_which] ## spline basis
S <- sm$S[[1]][-knot_which, -knot_which] ## spline penalty
off <- dat$y * 0 + (-0.1) ## offset term to force curve through (x0, y0)
## fit spline constrained through (x0, y0)
gam_1 <- gam(y ~ X - 1 + offset(off), paraPen = list(X = list(S)))
# predict (add offset of -0.1)
newdata_tmp <- Predict.matrix(sm, data.frame(x = newdata$x))
newdata_tmp <- newdata_tmp[, -knot_which]
newdata$y_pred_fit1 <- (newdata_tmp %*% coef(gam_1))[, 1] + (-0.1)
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit1 ~ x, data = newdata, col = 3, lwd = 2)
# lines at cross of which the plot should go throught
abline(v=-1, col = 3); abline(h=-0.1, col = 3)
Green line: constrained; red line: unconstrained
I think you could augment the data vectors x and y with (x0, y0) and then put a (really) high weight on the first observation (i.e. add a weight vector to your G list).
Alternatively to the simple weighting strategy, we can write the quadratic programming problem starting from the results of the preliminary smoothing. This is illustrated in the second R-code below (in this case I used p-spline smoothers, see Eilers and Marx 1991).
Hope this helps a bit (a similar problem is discussed here).
Rcode example 1 (weight strategy)
set.seed(123)
N = 100
x <- sort(runif(N) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(N) * 0.1
x = c(-1, x)
y = c(-0.1, y)
dat = data.frame(x = x, y= y)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= c(1e8, 1:N * 0 + 1) # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
abline(v = -1)
abline(h = -0.1)
rm(list = ls())
library(mgcv)
library(pracma)
library(colorout)
set.seed(123)
N = 100
x = sort(runif(N) * 4 - 1)
f = exp(4*x)/(1+exp(4*x))
y = f + rnorm(N) * 0.1
x0 = -1
y0 = -0.1
dat = data.frame(x = x, y= y)
k = 50
# Show regular spline fit (and save fitted object)
f.ug = gam(y~s(x,k=k,bs="ps"))
# explicitly construct smooth term's design matrix
sm = smoothCon(s(x,k=k,bs="ps"), dat,knots=NULL)[[1]]
# Build quadprog to estimate the coefficients
scf = sapply(f.ug$smooth, '[[', 'S.scale')
lam = f.ug$sp / scf
Xp = rbind(sm$X, sqrt(lam) * f.ug$smooth[[1]]$D)
yp = c(dat$y, rep(0, k - 2))
X0 = Predict.matrix(sm, data.frame(x = x0))
sm$deriv = 1
X1 = Predict.matrix(sm, data.frame(x = dat$x))
coef_mono = pracma::lsqlincon(Xp, yp, Aeq = X0, beq = y0, A = -X1, b = rep(0, N))
# fitted values
fit = sm$X %*% coef_mono
sm$deriv = 0
xf = seq(-1, 3, len = 1000)
Xf = Predict.matrix(sm, data.frame(x = xf))
fine_fit = Xf %*% coef_mono
# plot
par(mfrow = c(2, 1), mar = c(3,3,3,3))
plot(dat$x, dat$y, pch = 1, main= 'Data and fit')
lines(dat$x, f.ug$fitted, lwd = 2, col = 2)
lines(dat$x, fit, col = 4, lty = 1, lwd = 2)
lines(xf, fine_fit, col = 3, lwd = 2, lty = 2)
abline(h = -0.1)
abline(v = -1)
plot(dat$x, X1 %*% coef_mono, type = 'l', main = 'Derivative of the fit', lwd = 2)
abline(h = 0.0)
The following package seems to implement what you are looking for:
The proposed shape constrained smoothing has been incorporated into generalized
additive models with a mixture of unconstrained and shape restricted smooth terms
(mono-GAM). [...]
The proposed modelling approach has been implemented in an R package monogam.
The model setup is the same as in mgcv(gam) with the addition of shape constrained
smooths. In order to be consistent with the unconstrained GAM, the package provides
key functions similar to those associated with mgcv(gam).
Additive models with shape constraints

Calculate 5th quantile of curve generated from vectors of X, Y points

I have these curves below:
These curves were generated using a library called discreteRV.
library(discreteRV)
placebo.rate <- 0.5
mmm.rate <- 0.3
mmm.power <- power.prop.test(p1 = placebo.rate, p2 = mmm.rate, power = 0.8, alternative = "one.sided")
n <- as.integer(ceiling(mmm.power$n))
patients <- seq(from = 0, to = n, by = 1)
placebo_distribution <- dbinom(patients, size = n, prob = placebo.rate)
mmm_distribution <- dbinom(patients, size = n, prob = mmm.rate)
get_pmf <- function(p1, p2) {
X1 <- RV(patients,p1, fractions = F)
X2 <- RV(patients,p2, fractions = F)
pmf <- joint(X1, X2, fractions = F)
return(pmf)
}
extract <- function(string) {
ints <- unlist(strsplit(string,","))
x1 <- as.integer(ints[1])
x2 <- as.integer(ints[2])
return(x1-x2)
}
diff_prob <- function(pmf) {
diff <- unname(sapply(outcomes(pmf),FUN = extract)/n)
probabilities <- unname(probs(pmf))
df <- data.frame(diff,probabilities)
df <- aggregate(. ~ diff, data = df, FUN = sum)
return(df)
}
most_likely_rate <- function(x) {
x[which(x$probabilities == max(x$probabilities)),]$diff
}
mmm_rate_diffs <- diff_prob(get_pmf(mmm_distribution,placebo_distribution))
placebo_rate_diffs <- diff_prob(get_pmf(placebo_distribution,placebo_distribution))
plot(mmm_rate_diffs$diff,mmm_rate_diffs$probabilities * 100, type = "l", lty = 2, xlab = "Rate difference", ylab = "# of trials per 100", main = paste("Trials with",n,"patients per treatment arm",sep = " "))
lines(placebo_rate_diffs$diff, placebo_rate_diffs$probabilities * 100, lty = 1, xaxs = "i")
abline(v = c(most_likely_rate(placebo_rate_diffs), most_likely_rate(mmm_rate_diffs)), lty = c(1,2))
legend("topleft", legend = c("Alternative hypothesis", "Null hypothesis"), lty = c(2,1))
Basically, I took two binomial discrete random variables, created a joint probability mass function, determined the probability of any given rate difference then plotted them to demonstrate a distribution of those rate differences if the null hypothesis was true or if the alternative hypothesis was true over 100 identical trials.
Now I want to illustrate the 5% percentile on the null hypothesis curve. Unfortunately, I don't know how to do this. If I simply use quantile(x = placebo_rate_diffs$diff, probs = 0.05, I get -0.377027. This can't be correct looking at the graph. I want to calculate the 5th percentile like I would using pbinom() but I don't know how to do that with a graph created from essentially what are just x and y vectors.
Maybe I can approximate these two curves as binomial since they appear to be, but I am still not sure how to do this.
Any help would be appreciated.

Plot the observed and fitted values from a linear regression using xyplot() from the lattice package

I can create simple graphs. I would like to have observed and predicted values (from a linear regression) on the same graph. I am plotting say Yvariable vs Xvariable. There is only 1 predictor and only 1 response. How could I also add linear regression curve to the same graph?
So to conclude need help with:
plotting actuals and predicted both
plotting regression line
Here is one option for the observed and predicted values in a single plot as points. It is easier to get the regression line on the observed points, which I illustrate second
First some dummy data
set.seed(1)
x <- runif(50)
y <- 2.5 + (3 * x) + rnorm(50, mean = 2.5, sd = 2)
dat <- data.frame(x = x, y = y)
Fit our model
mod <- lm(y ~ x, data = dat)
Combine the model output and observed x into a single object for plott
res <- stack(data.frame(Observed = dat$y, Predicted = fitted(mod)))
res <- cbind(res, x = rep(dat$x, 2))
head(res)
Load lattice and plot
require("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)
The resulting plot should look similar to this
To get just the regression line on the observed data, and the regression model is a simple straight line model as per the one I show then you can circumvent most of this and just plot using
xyplot(y ~ x, data = dat, type = c("p","r"), col.line = "red")
(i.e. you don't even need to fit the model or make new data for plotting)
The resulting plot should look like this
An alternative to the first example which can be used with anything that will give coefficients for the regression line is to write your own panel functions - not as scary as it seems
xyplot(y ~ x, data = dat, col.line = "red",
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(coef = coef(mod), ...) ## using mod from earlier
}
)
That gives a plot from Figure 2 above, but by hand.
Assuming you've done this with caret then
mod <- train(y ~ x, data = dat, method = "lm",
trControl = trainControl(method = "cv"))
xyplot(y ~ x, data = dat, col.line = "red",
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(coef = coef(mod$finalModel), ...) ## using mod from caret
}
)
Will produce a plot the same as Figure 2 above.
Another option is to use panel.lmlineq from latticeExtra.
library(latticeExtra)
set.seed(0)
xsim <- rnorm(50, mean = 3)
ysim <- (0 + 2 * xsim) * (1 + rnorm(50, sd = 0.3))
## basic use as a panel function
xyplot(ysim ~ xsim, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmlineq(x, y, adj = c(1,0), lty = 1,xol.text='red',
col.line = "blue", digits = 1,r.squared =TRUE)
})

R: predict.glm equivalent for MCMCpack::MCMClogit

I am running a Bayesian logit with MCMCpack::MCMClogit. The syntax is easy and follows lm() or glm(), but I can't find any equivalent of the predict.glm function. Is there any way of predicting the probabilities of the outcomes in MCMClogit for each unit of observation in the dataframe? predict() is especially useful for validating training data from new data, which is what I ultimately have to do.
df = read.csv("http://dl.dropbox.com/u/1791181/MCMC.csv")#Read in data
model.glm = glm(SECONDARY.LEVEL ~ AGE + SEX, data=df, family=binomial(link=logit))
glm.predict = predict(model.glm, type="response")
For MCMClogit():
model.mcmc = MCMClogit(SECONDARY.LEVEL ~ AGE + SEX, data=df, mcmc=1000)
You could use the posterior distribution of model parameters produced by MCMC to get a distribution of predictions, using the logistic function.
For instance, if your model formula is y ~ x1 + x2 + x3, and your MCMC output is stored in the variable posterior.mcmc, then you could use
function(x1, x2, x3) 1 / (1 + exp(-posterior.mcmc %*% rbind(1, x1, x2, x3)))
to give the distribution analogous to predict.glm(., 'response')
More detailed example for the case of a single input variable:
library(extraDistr)
library(MCMCpack)
# Take x uniformly distributed between -100 and 100
x <- runif(2000, min=-100, max=100)
# Generate a response which is logistic with some noise
beta <- 1/8
eps <- rnorm(length(x), 0, 1)
p <- function(x, eps) 1 / (1 + exp(-beta*x + eps))
p.x <- p(x, eps)
y <- sapply(p.x, function(p) rbern(1, p))
df1 <- data.frame(x, y)
# Fit by logistic regression
glm.logistic <- glm(y ~ x, df1, family=binomial)
# MCMC gives a distribution of values for the model parameters
posterior.mcmc <- MCMClogit(y ~ x, df1, verbose=2000)
densplot(posterior.mcmc)
# Thus, we have a distribution of model predictions for each x
predict.p.mcmc <- function(x) 1 / (1 + exp(-posterior.mcmc %*% rbind(1,x)))
interval.p.mcmc <- function(x, low, high) apply(predict.p.mcmc(x), 2,
function(x) quantile(x, c(low, high)))
predict.y.mcmc <- function(x) posterior.mcmc %*% rbind(1,x)
interval.y.mcmc <- function(x, low, high) apply(predict.y.mcmc(x), 2,
function(x) quantile(x, c(low, high)))
## Plot the data and fits ##
plot(x, p.x, ylab = 'Pr(y=1)', pch = 20, cex = 0.5, main = 'Probability vs x')
# x-values for prediction
x_test <- seq(-100, 100, 0.01)
# Blue line is the logistic function we used to generate the data, with noise removed
p_of_x_test <- p(x_test, 0)
lines(x_test, p_of_x_test, col = 'blue')
# Green line is the prediction from logistic regression
lines(x_test, predict(glm.logistic, data.frame(x = x_test), 'response'), col = 'green')
# Red lines indicates the range of model predictions from MCMC
# (for each x, 95% of the distribution of model predictions lies between these bounds)
interval.p.mcmc_95 <- interval.p.mcmc(x_test, 0.025, 0.975)
lines(x_test, interval.p.mcmc_95[1,], col = 'red')
lines(x_test, interval.p.mcmc_95[2,], col = 'red')
# Similarly for the log-odds
plot(x, log(p.x/(1 - p.x)), ylab = 'log[Pr(y=1) / (1 - Pr(y=1))]',
pch = 20, cex = 0.5, main = 'Log-Odds vs x')
lines(x_test, log(p_of_x_test/(1 - p_of_x_test)), col = 'blue')
lines(x_test, predict(glm.logistic, data.frame(x = x_test)), col = 'green')
interval.y.mcmc_95 <- interval.y.mcmc(x_test, 0.025, 0.975)
lines(x_test, interval.y.mcmc_95[1,], col = 'red')
lines(x_test, interval.y.mcmc_95[2,], col = 'red')
The description of the function says :
This function generates a sample from the posterior distribution of a logistic regression model using a random walk Metropolis algorithm.
I think therefore that your model.mcmc already contains the points that MCMClogit() has simulated.
You can use str to see what it contains and summary and plot functions on it like in the example there : http://cran.r-project.org/web/packages/MCMCpack/MCMCpack.pdf

R Language - Sorting data into ranges; averaging; ignore outliers

I am analyzing data from a wind turbine, normally this is the sort of thing I would do in excel but the quantity of data requires something heavy-duty. I have never used R before and so I am just looking for some pointers.
The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.
From this average I want recalculate the average based on data which falls within one of two standard deviations of the average (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested I am trying to create a graph similar to this. Its a pretty standard type of graph but like I said the shear quantity of data requires something heavier than excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
Throw this version, similar in motivation as #hadley's, into the mix using an additive model with an adaptive smoother using package mgcv:
Dummy data first, as used by #hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from #hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
d_trim = mean(d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))])
return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
Here are some examples of fitted curves (weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.pacakges("ggplot2", dependencies = T)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = "dot")
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
opts(title = "Wind speed and power")

Resources