I am trying to modify Kyle Gorman's autoloess function to be callable as a method in ggplot2's stat_smooth. autoloess is a simple wrapper that runs loess through an optimiser to find the value of span which minimises AICc.
I have created something which runs successfully, but only by using a global variable. Is there a more elegant, idiomatic way of programming this?
My code:
AICc.loess <- function(fit) {
# compute AIC_C for a LOESS fit, from:
#
# Hurvich, C.M., Simonoff, J.S., and Tsai, C. L. 1998. Smoothing
# parameter selection in nonparametric regression using an improved
# Akaike Information Criterion. Journal of the Royal Statistical
# Society B 60: 271–293.
#
# @param fit loess fit
# @return 'aicc' value
stopifnot(inherits(fit, 'loess'))
# parameters
n <- fit$n
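# trace.hat: trace of the smoother ("hat") matrix, i.e. the effective number of parameters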
trace <- fit$trace.hat
sigma2 <- sum(resid(fit) ^ 2) / (n - 1)
return(log(sigma2) + 1 + 2 * (2 * (trace + 1)) / (n - trace - 2))
}
.autoloess.magic.w <- NULL
autoloess <- function(formula, data, weights, span=c(0.01, 2.0)) {
.autoloess.magic.w <- ~weights
fit <- loess(formula=formula,
data=data,
weights=.autoloess.magic.w)
stopifnot(length(span) == 2)
# loss function in form to be used by optimize
f <- function(span) AICc.loess(update(fit, span=span))
# find best loess according to loss function
res <- update(fit, span=optimize(f, span)$minimum)
cat(paste("Optimal span:", res$pars$span, "\n"))
return(res)
}
And a quick test:
# Test
library(ggplot2)
set.seed(1984)
# Create a cubic curve
df <- data.frame(x=1:2500, y=500000 +
(-1000*(1:2500)) +
((1:2500)^2) +
-0.00025*((1:2500)^3) +
rnorm(2500, sd=60000),
ww=runif(2500, min=0, max=10))
# Use loess span
ggplot(df, aes(x=x, y=y, weight=ww)) + geom_point() + stat_smooth(method="loess")
# Use autoloess
ggplot(df, aes(x=x, y=y, weight=ww)) + geom_point() + stat_smooth(method="autoloess")
You can use the weight variable (it seems to be available in the environment when the function is called):
autoloess <- function(formula, data, weights, span=c(0.01, 2.0)) {
fit <- loess(formula = formula,
data = data,
weights=weight)
stopifnot(length(span) == 2)
# loss function in form to be used by optimize
f <- function(span) AICc.loess(update(fit, span=span))
# find best loess according to loss function
res <- update(fit, span=optimize(f, span)$minimum)
cat(paste("Optimal span:", res$pars$span, "\n"))
return(res)
}
I have made an ODE model in R using the package deSolve. Currently the output of the model gives me the "observed" prevalence of a disease (i.e. the prevalence not accounting for diagnostic imperfection).
However, I want to adjust the model to output the "true" prevalence, using a simple adjustment formula called the Rogan-Gladen estimator (http://influentialpoints.com/Training/estimating_true_prevalence.htm):
True prevalence =
(Apparent prev. + (Specificity-1)) / (Specificity + (Sensitivity-1))
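For reference, here is that formula as a small R helper (a sketch with made-up numbers, not part of the model code below):
rogan_gladen <- function(apparent, se, sp) {
  (apparent + (sp - 1)) / (sp + (se - 1))
}
rogan_gladen(0.30, se = 0.95, sp = 0.92)  # true prevalence implied by an apparent 30%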
As you will see in the code below, I have attempted to adjust only one of the differential equations (diggP).
Running the model without adjustment gives an expected output (a proportion between 0 and 1). However, attempting to adjust the model using the RG-estimator gives a spurious output (a proportion less than 0).
Any advice on what might be going wrong here would be very much appreciated.
# Load required packages
library(tidyverse)
library(broom)
library(deSolve)
# Set time (age) for function
time = 1:80
# Defining exponential decay of lambda over age
y1 = 0.003 + (0.15 - 0.003) * exp(-0.05 * time) %>% jitter(10)
df <- data.frame(t = time, y = y1)
fit <- nls(y ~ SSasymp(time, yf, y0, log_alpha), data = df)
fit
# Values of lambda over ages 1-80 years
data <- as.matrix(0.003 + (0.15 - 0.003) * exp(-0.05 * time))
lambda<-as.vector(data[,1])
t<-as.vector(seq(1, 80, by=1))
foi<-cbind(t, lambda)
foi[,1]
# Making lambda varying by time useable in the ODE model
input <- approxfun(x = foi[,1], y = foi[,2], method = "constant", rule = 2)
# Model
ab <- function(time, state, parms) {
with(as.list(c(state, parms)), {
# lambda, changing by time
import<-input(time)
# Derivatives
# RG estimator:
#True prevalence = (apparent prev + (sp-1)) / (sp + (se-1))
diggP<- (((import * iggN) - iggR * iggP) + (sp_igg-1)) / (sp_igg + (se_igg-1))
diggN<- (-import*iggN) + iggR*iggP
dtgerpP<- (0.5*import)*tgerpN -tgerpR*tgerpP
dtgerpN<- (0.5*-import)*tgerpN + tgerpR*tgerpP
# Return results
return(list(c(diggP, diggN, dtgerpP, dtgerpN)))
})
}
# Initial values
yini <- c(iggP=0, iggN=1,
tgerpP=0, tgerpN=1)
# Parameters
pars <- c(iggR = 0, tgerpR = (1/8)/12,
se_igg = 0.95, sp_igg = 0.92)
# Solve model
results<- ode(y=yini, times=time, func=ab, parms = pars)
# Plot results
plot(results, xlab="Time (years)", ylab="Proportion")
I am trying to solve for the parameters of a gamma distribution that is convolved with both normal and lognormal distributions. I can experimentally derive parameters for both the normal and lognormal components, hence, I just want to solve for the gamma params.
I have attempted 3 approaches to this problem:
1) generating convolved random datasets (i.e. rnorm()+rlnorm()+rgamma()) and using least-squares regression on the linear- or log-binned histograms of the data (not shown, but was very biased by RNG and didn't optimize well at all.)
2) "brute-force" numerical integration of the convolving functions (example code #1)
3) numerical integration approaches w/ the distr package. (example code #2)
I have had limited success with all three approaches. Importantly, these approaches seem to work well for "nominal" values of the gamma parameters, but they all begin to fail when k (shape) is low and theta (scale) is high, which is where my experimental data resides. Please find the examples below.
Straight-up numerical Integration
# make the functions
f.N <- function(n) dnorm(n, N[1], N[2])
f.L <- function(l) dlnorm(l, L[1], L[2])
f.G <- function(g) dgamma(g, G[1], scale=G[2])
# make convolved functions
f.Z <- function(z) integrate(function(x,z) f.L(z-x)*f.N(x), -Inf, Inf, z)$value # L+N
f.Z <- Vectorize(f.Z)
f.Z1 <- function(z) integrate(function(x,z) f.G(z-x)*f.Z(x), -Inf, Inf, z)$value # G+(L+N)
f.Z1 <- Vectorize(f.Z1)
# params of Norm, Lnorm, and Gamma
N <- c(0,5)
L <- c(2.5,.5)
G <- c(2,7) # this distribution is the one we ultimately want to solve for.
# G <- c(.5,10) # 0<k<1
# G <- c(.25,5e4) # ballpark params of experimental data
# generate some data
set.seed(1)
rN <- rnorm(1e4, N[1], N[2])
rL <- rlnorm(1e4, L[1], L[2])
rG <- rgamma(1e4, G[1], scale=G[2])
Z <- rN + rL
Z1 <- rN + rL + rG
# check the fit
hist(Z,freq=F,breaks=100, xlim=c(-10,50), col=rgb(0,0,1,.25))
hist(Z1,freq=F,breaks=100, xlim=c(-10,50), col=rgb(1,0,0,.25), add=T)
z <- seq(-10,50,1)
lines(z,f.Z(z),lty=2,col="blue", lwd=2) # looks great... convolution performs as expected.
lines(z,f.Z1(z),lty=2,col="red", lwd=2) # this works perfectly so long as k(shape)>=1
# I'm guessing the failure to compute when shape 0 < k < 1 is due to
# numerical integration problems, but I don't know how to fix it.
integrate(dgamma, -Inf, Inf, shape=1, scale=1) # ==1
integrate(dgamma, 0, Inf, shape=1, scale=1) # ==1
integrate(dgamma, -Inf, Inf, shape=.5, scale=1) # !=1
integrate(dgamma, 0, Inf, shape=.5, scale=1) # != 1
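# Aside (a sketch, for scale = 1): the failure above is the integrable x^(k-1)
# singularity of dgamma at 0 when 0 < k < 1; substituting u = x^k removes it,
# and integrate() then recovers 1 as expected:
k.test <- 0.5
integrate(function(u) exp(-u^(1/k.test)) / (k.test * gamma(k.test)), 0, Inf)  # ~= 1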
# Let's try to estimate gamma anyway, supposing k>=1
optimFUN <- function(par, N, L) {
print(par)
-sum(log(f.Z1(Z1[1:4e2])))
}
f.G <- function(g) dgamma(g, par[1], scale=par[2])
fitresult <- optim(c(1.6,5), optimFUN, N=N, L=L)
par <- fitresult$par
lines(z,f.Z1(z),lty=2,col="green3", lwd=2) # not so great... likely better w/ more data,
# but it is SUPER slow and I observe large step sizes.
Attempting convolving via distr package
# params of Norm, Lnorm, and Gamma
N <- c(0,5)
L <- c(2.5,.5)
G <- c(2,7) # this distribution is the one we ultimately want to solve for.
# G <- c(.5,10) # 0<k<1
# G <- c(.25,5e4) # ballpark params of experimental data
# make the distributions and "convolvings'
dN <- Norm(N[1], N[2])
dL <- Lnorm(L[1], L[2])
dG <- Gammad(G[1], G[2])
d.NL <- d(convpow(dN+dL,1))
d.NLG <- d(convpow(dN+dL+dG,1)) # for large values of theta, no matter how I change
# getdistrOption("DefaultNrFFTGridPointsExponent"), grid size is always wrong.
# Generate some data
set.seed(1)
rN <- r(dN)(1e4)
rL <- r(dL)(1e4)
rG <- r(dG)(1e4)
r.NL <- rN + rL
r.NLG <- rN + rL + rG
# check the fit
hist(r.NL, freq=F, breaks=100, xlim=c(-10,50), col=rgb(0,0,1,.25))
hist(r.NLG, freq=F, breaks=100, xlim=c(-10,50), col=rgb(1,0,0,.25), add=T)
z <- seq(-10,50,1)
lines(z,d.NL(z), lty=2, col="blue", lwd=2) # looks great... convolution performs as expected.
lines(z,d.NLG(z), lty=2, col="red", lwd=2) # this appears to work perfectly
# for most values of K and low values of theta
# this is looking a lot more promising... how about estimating gamma params?
optimFUN <- function(par, dN, dL) {
tG <- Gammad(par[1],par[2])
d.NLG <- d(convpow(dN+dL+tG,1))
p <- d.NLG(r.NLG)
p[p==0] <- 1e-15 # because sometimes very low probabilities evaluate to 0...
# ...and logs don't like that.
-sum(log(p))
}
fitresult <- optim(c(1,1e4), optimFUN, dN=dN, dL=dL)
fdG <- Gammad(fitresult$par[1], fitresult$par[2])
fd.NLG <- d(convpow(dN+dL+fdG,1))
lines(z,fd.NLG(z), lty=2, col="green3", lwd=2) ## this works perfectly when ~k>1 & ~theta<100... but throws
## "Error in validityMethod(object) : shape has to be positive" when k decreases and/or theta increases
## (boundary subject to RNG).
Can I speed up the integration in example 1? Can I increase the grid size in example 2 (the distr package)? How can I address the k<1 problem? Can I rescale the data in a way that will better facilitate evaluation at high theta values?
Is there a better way all-together?
Help!
Well, convolution of a function with a Gaussian kernel calls for Gauss–Hermite quadrature. In R it is implemented in the gaussquad package: https://cran.r-project.org/web/packages/gaussquad/gaussquad.pdf
UPDATE
For convolution with a Gamma distribution this package might be useful as well, via Gauss–Laguerre quadrature.
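For illustration, here is a minimal Gauss–Laguerre sketch (assuming laguerre.quadrature.rules() in gaussquad returns data frames of nodes x and weights w, analogous to the Hermite rules used below): the rule's exp(-x) weight absorbs the exponential factor of the Gamma density, so only the remaining power term is evaluated at the nodes.
library(gaussquad)
n.quad <- 30
lrule <- laguerre.quadrature.rules(n.quad)[[n.quad]]   # nodes lrule$x, weights lrule$w
k <- 2   # Gamma shape, scale = 1
# dgamma(x, k, scale = 1) = x^(k-1) * exp(-x) / gamma(k); exp(-x) is the Laguerre weight
sum(lrule$w * lrule$x^(k - 1) / gamma(k))   # should be close to 1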
UPDATE II
Here is some quick code to convolve a Gaussian with a lognormal; hopefully there are not too many bugs, and it prints a reasonable-looking graph:
library(gaussquad)
n.quad <- 170 # integration order
# get the particular weights/abscissas as data frame with 2 observables and n.quad observations
rule <- ghermite.h.quadrature.rules(n.quad, mu = 0.0)[[n.quad]]
# test function - integrate 1 over exp(-x^2) from -Inf to Inf
# should get sqrt(pi) as an answer
f <- function(x) {
1.0
}
q <- ghermite.h.quadrature(f, rule)
print(q - sqrt(pi))
# convolution of lognormal with gaussian
# because of the G-H rules, we have to make our own function
# for simplicity, sigmas are one and mus are zero
sqrt2 <- sqrt(2.0)
c.LG <- function(z) {
#print(z)
f.LG <- function(x) {
t <- (z - x*sqrt2)
q <- 0.0
if (t > 0.0) {
l <- log(t)
q <- exp( - 0.5*l*l ) / t
}
q
}
ghermite.h.quadrature(Vectorize(f.LG), rule) / (pi*sqrt2)
}
library(ggplot2)
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = Vectorize(c.LG))
p <- p + xlim(-1.0, 5.0)
print(p)
I've had great experiences asking for help here before and I'm hoping to get some help again.
I'm estimating a rather large mixed effects model in which one of the random effects has over 150 different levels. That would make a standard caterpillar plot quite unreadable.
I would like, if at all possible, to get a caterpillar plot of just the levels of the random effect that are, for lack of a better term, "significant". That is: I want a caterpillar plot in which either the random intercept or the random slope for a varying coefficient has a "confidence interval" (I know that's not quite what it is) that does not include zero.
Consider this standard model from the sleepstudy data that is standard with lme4.
library(lme4)
fit <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
ggCaterpillar(ranef(fit,condVar=TRUE), QQ=FALSE, likeDotplot=TRUE, reorder=FALSE)[["Subject"]]
I would get this caterpillar plot.
The caterpillar plot I use comes from this code. Do note I tend to use less conservative bounds for the intervals (i.e. 1.645*se and not 1.96*se).
Basically, I want a caterpillar plot that would just include the levels for 308, 309, 310, 330, 331, 335, 337, 349, 350, 352, and 370 because those levels had either intercepts or slopes whose intervals did not include zero. I ask because my caterpillar plot of over 150 different levels is kind of unreadable and I think this might be a worthwhile solution to it.
Reproducible code follows. I genuinely appreciate any help.
# https://stackoverflow.com/questions/34120578/how-can-i-sort-random-effects-by-value-of-the-random-effect-not-the-intercept
ggCaterpillar <- function(re, QQ=TRUE, likeDotplot=TRUE, reorder=TRUE) {
require(ggplot2)
f <- function(x) {
pv <- attr(x, "postVar")
cols <- 1:(dim(pv)[1])
se <- unlist(lapply(cols, function(i) sqrt(pv[i, i, ])))
if (reorder) {
ord <- unlist(lapply(x, order)) + rep((0:(ncol(x) - 1)) * nrow(x), each=nrow(x))
pDf <- data.frame(y=unlist(x)[ord],
ci=1.645*se[ord],
nQQ=rep(qnorm(ppoints(nrow(x))), ncol(x)),
ID=factor(rep(rownames(x), ncol(x))[ord], levels=rownames(x)[ord]),
ind=gl(ncol(x), nrow(x), labels=names(x)))
} else {
pDf <- data.frame(y=unlist(x),
ci=1.645*se,
nQQ=rep(qnorm(ppoints(nrow(x))), ncol(x)),
ID=factor(rep(rownames(x), ncol(x)), levels=rownames(x)),
ind=gl(ncol(x), nrow(x), labels=names(x)))
}
if(QQ) { ## normal QQ-plot
p <- ggplot(pDf, aes(nQQ, y))
p <- p + facet_wrap(~ ind, scales="free")
p <- p + xlab("Standard normal quantiles") + ylab("Random effect quantiles")
} else { ## caterpillar dotplot
p <- ggplot(pDf, aes(ID, y)) + coord_flip()
if(likeDotplot) { ## imitate dotplot() -> same scales for random effects
p <- p + facet_wrap(~ ind)
} else { ## different scales for random effects
p <- p + facet_grid(ind ~ ., scales="free_y")
}
p <- p + xlab("Levels of the Random Effect") + ylab("Random Effect")
}
p <- p + theme(legend.position="none")
p <- p + geom_hline(yintercept=0)
p <- p + geom_errorbar(aes(ymin=y-ci, ymax=y+ci), width=0, colour="black")
p <- p + geom_point(aes(size=1.2), colour="blue")
return(p)
}
lapply(re, f)
}
library(lme4)
fit <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
ggCaterpillar(ranef(fit,condVar=TRUE), QQ=FALSE, likeDotplot=TRUE, reorder=FALSE)[["Subject"]]
ggsave(file="sleepstudy.png")
First, thanks for putting "significant" in quotation marks ... everyone reading this should remember that significance doesn't have any statistical meaning in this context (it might be better to use a Z-statistic (value/std.error) criterion such as |Z|>1.5 or |Z|>1.75 instead, just to emphasize that this is not an inferential threshold ...)
I ended up getting a little carried away ... I decided that it would be better to refactor/modularize things a little bit, so I wrote an augment method (designed to work with the broom package) that constructs useful data frames from ranef.mer objects ... once this is done, the manipulations you want are pretty easy.
I put the augment.ranef.mer code at the end of my answer -- it's a bit long (you'll need to source it before you can run the code here). Update: this augment method has been part of the broom.mixed package for a while now ...
library(broom)
library(reshape2)
library(plyr)
Apply the augment method to the RE object:
rr <- ranef(fit,condVar=TRUE)
aa <- augment(rr)
names(aa)
## [1] "grp" "variable" "level" "estimate" "qq" "std.error"
## [7] "p" "lb" "ub"
Now the ggplot code is pretty basic. I'm using geom_errorbarh(height=0) rather than geom_pointrange()+coord_flip() because ggplot2 can't use coord_flip with facet_wrap(...,scales="free") ...
## Q-Q plot:
g0 <- ggplot(aa,aes(estimate,qq,xmin=lb,xmax=ub))+
geom_errorbarh(height=0)+
geom_point()+facet_wrap(~variable,scale="free_x")
## regular caterpillar plot:
g1 <- ggplot(aa,aes(estimate,level,xmin=lb,xmax=ub))+
geom_errorbarh(height=0)+
geom_vline(xintercept=0,lty=2)+
geom_point()+facet_wrap(~variable,scale="free_x")
Now find the levels you want to keep:
aa2 <- ddply(aa,c("grp","level"),
transform,
keep=any(p<0.05))
aa3 <- subset(aa2,keep)
Update caterpillar plot with only levels with "significant" slopes or intercepts:
g1 %+% aa3
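As a variation, here is a sketch of the |Z| criterion mentioned at the top of this answer (same ddply/transform pattern; the 1.5 cutoff is arbitrary):
aa2z <- ddply(aa, c("grp","level"),
              transform,
              keep=any(abs(estimate/std.error) > 1.5))
g1 %+% subset(aa2z, keep)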
If you only wanted to highlight "significant" levels rather than removing "non-significant" levels entirely:
ggplot(aa2,aes(estimate,level,xmin=lb,xmax=ub,colour=factor(keep)))+
geom_errorbarh(height=0)+
geom_vline(xintercept=0,lty=2)+
geom_point()+facet_wrap(~variable,scale="free_x")+
scale_colour_manual(values=c("black","red"),guide=FALSE)
##' @importFrom reshape2 melt
##' @importFrom plyr ldply name_rows
augment.ranef.mer <- function(x,
ci.level=0.9,
reorder=TRUE,
order.var=1) {
tmpf <- function(z) {
if (is.character(order.var) && !order.var %in% names(z)) {
order.var <- 1
warning("order.var not found, resetting to 1")
}
## would use plyr::name_rows, but want levels first
zz <- data.frame(level=rownames(z),z,check.names=FALSE)
if (reorder) {
## if numeric order var, add 1 to account for level column
ov <- if (is.numeric(order.var)) order.var+1 else order.var
zz$level <- reorder(zz$level, zz[, ov], FUN=identity)
}
## Q-Q values, for each column separately
qq <- c(apply(z,2,function(y) {
qnorm(ppoints(nrow(z)))[order(order(y))]
}))
rownames(zz) <- NULL
pv <- attr(z, "postVar")
cols <- 1:(dim(pv)[1])
se <- unlist(lapply(cols, function(i) sqrt(pv[i, i, ])))
## n.b.: depends on explicit column-major ordering of se/melt
zzz <- cbind(melt(zz,id.vars="level",value.name="estimate"),
qq=qq,std.error=se)
## reorder columns:
subset(zzz,select=c(variable, level, estimate, qq, std.error))
}
dd <- ldply(x,tmpf,.id="grp")
ci.val <- -qnorm((1-ci.level)/2)
transform(dd,
p=2*pnorm(-abs(estimate/std.error)), ## 2-tailed p-val
lb=estimate-ci.val*std.error,
ub=estimate+ci.val*std.error)
}
Here is some code I am using to auto-generate some regression fits:
require(ggplot2)
# Prep data
nPts = 200
prepared=runif(nPts,0,10)
rich=5-((prepared-5)^2)/5 + 5*runif(length(prepared))
df <- data.frame(rich=rich, prepared=prepared)
deg = 1 # User variable
lm <- lm(df$rich ~ poly(df$prepared, deg, raw=T))
# Create expression
coefs <- lm$coefficients
eq <- paste0(round(coefs,2),'*x^', 0:length(coefs), collapse='+') # (1)
pl <- ggplot(df, aes(x=prepared, y=rich)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x,deg), size = 1) +
ggtitle(eq) # (2)
print(pl)
This code should run (with ggplot2 installed). The problem is in the lines marked (1) and (2): (1) generates a string representation of the polynomial, and (2) sets that string as the plot title.
As it stands, my title is "6.54*x^0+0.09*x^1+6.54*x^2". However, I want a more attractive rendering, so that (2) looks more like what would be produced by:
ggtitle(expression(6.54*x^0+0.09*x^1+6.54*x^2)) # (2')
i.e., powers raised, multiplication signs dropped, etc. Any help much appreciated!
Here's a function that I built to solve my problem:
poly_expression <- function(coefs){
# build the string
eq <- paste0(round(coefs,2),'*x^', (1:length(coefs)-1), collapse='+')
# some cleaning
eq <- gsub('\\+\\-','-', eq) # +-n -> -n
eq <- gsub('\\*x\\^0','', eq) # n*x^0 -> n
eq <- gsub('x\\^1','x', eq) # n*x^1 -> n*x
eq <- parse(text=eq) # return expressions
return(eq)
}
Then ggtitle(poly_expression(coefs)) renders as required.
I'm trying to fit two gaussian peaks to my density plot data, using the following code:
model <- function(coeffs,x)
{
(coeffs[1] * exp( - ((x-coeffs[2])/coeffs[3])**2 ))
}
y_axis <- data.matrix(den.PA$y)
x_axis <- data.matrix(den.PA$x)
peak1 <- c(1.12e-2,1075,2) # guess for peak 1
peak2 <- c(1.15e-2,1110,2) # guess for peak 2
peak1_fit <- model(peak1,den.PA$x)
peak2_fit <- model(peak2,den.PA$x)
total_peaks <- peak1_fit + peak2_fit
err <- den.PA$y - total_peaks
fit <- nls(y_axis~coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2 ),start=list(coeffs2=1.12e-2, coeffs3=1075, coeffs4=2))
fit2<- nls(y_axis~coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2 ),start=list(coeffs2=1.15e-2, coeffs3=1110, coeffs4=2))
fit_coeffs = coef(fit)
fit2_coeffs = coef(fit2)
a <- model(fit_coeffs,den.PA$x)
b <- model(fit2_coeffs,den.PA$x)
plot(den.PA, main="Cytochome C PA", xlab= expression(paste("Collision Cross-Section (", Å^2, ")")))
lines(results2,a, col="red")
lines(results2,b, col="blue")
This gives me the following plot:
This is where I have my problem. I calculate the fits independently of each other, and the Gaussian peaks are overlaid on top of each other. I need to feed the err variable into nls, which should return 6 coefficients from which I can then re-model the Gaussian peaks to fit the plot.
The answer came to me as soon as I posted the question. Changing fit to this:
fit <- nls(y_axis~(coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2)) + (coeffs5 * exp( - ((x_axis-coeffs6)/coeffs7)**2)), start=list(coeffs2=1.12e-2, coeffs3=1075, coeffs4=2,coeffs5=1.15e-2, coeffs6=1110, coeffs7=2))
Gives:
An inelegant solution, but it does the job.
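For completeness, a sketch (not from the original post) of overlaying the summed two-peak curve from the combined fit, reusing the model form and the den.PA data assumed above:
fit_cf <- coef(fit)
two_peaks <- function(cf, x) {
  cf["coeffs2"] * exp(-((x - cf["coeffs3"]) / cf["coeffs4"])^2) +
    cf["coeffs5"] * exp(-((x - cf["coeffs6"]) / cf["coeffs7"])^2)
}
lines(den.PA$x, two_peaks(fit_cf, den.PA$x), col="green3", lwd=2)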