Why do my beta priors produce unexpected results from rethinking::ulam()? - r

I've been trying to run a binomial MCMC model through rethinking::ulam() using beta priors (0.5,5), but the resulting posterior prediction density plots start from 0.5 on the x axis, rather than 0. For some background context, my data is in the format of binary animal behaviour observations, where I've recorded instances of uncommon behaviours using instantaneous sampling every 30 seconds (i.e. a lot of zeros, and a few 1s). I'm not very familiar with beta distributions, but something like dbeta(0.5, 5) seems like it would lean more towards zero, which makes more sense for my data than a flat normal prior.
For all other prior predictive checks I've ever carried out in ulam(), the density plot scale starts from zero. However, seemingly in this case only, the plot starts from 0.5 and the probability is heaped around 0.5, even if I change the first shape parameter to something else. However, when I run a simulated distribution with the same shape (0.5,5), it looks as expected.
I have run out of ideas, so any help would be hugely appreciated!
Below is a simplified version of my model, using the beta distribution for my prior (to install the rethinking package, see https://www.rdocumentation.org/packages/rethinking/versions/2.13):
library(rethinking)
# Dummy data
behaviour <- rbinom(39, 1, 0.1)
data <- list(behaviour = behaviour)
# Model structure
model <- ulam(alist(
behaviour ~ dbinom(1, p),
logit(p) <- a,
a ~ dbeta(0.5, 5)),
data = data, chains = 4, cores = 4, log_lik = TRUE)
The density plot of the prior distribution starts from 0.5 on the x axis, like this:
# Extracting prior
set.seed(1999)
prior <- extract.prior(model, n = 1e4)
# Converting parameter to the outcome scale
p <- inv_logit(prior$a)
# Density plot of prior distribution
dens(p, adj = 0.1)
However, the distribution should look like this (simulated data), starting from 0 on the x axis:
dens(rbeta(1000, 0.5, 5))
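A likely explanation, sketched in base R without rethinking: in the model above the beta prior is placed on a, which lives on the logit scale. Beta draws are confined to the interval from 0 to 1, so inv_logit(a) can only land between inv_logit(0) = 0.5 and inv_logit(1), roughly 0.73. The sketch below uses plogis() as a stand-in for inv_logit() (both are the logistic function) and rbeta() as a stand-in for the draws returned by extract.prior(); those substitutions are assumptions made to keep the example self-contained.

```r
# Stand-in for extract.prior(): draw the parameter a directly from Beta(0.5, 5)
set.seed(1999)
a <- rbeta(1e4, 0.5, 5)   # every draw lies between 0 and 1

# Stand-in for inv_logit(): the logistic function 1 / (1 + exp(-a))
p <- plogis(a)

# Because a is never negative, p can never fall below plogis(0) = 0.5,
# and because a is never above 1, p can never exceed plogis(1)
range(p)
c(lower = plogis(0), upper = plogis(1))
```

If the intention is for the probability itself to follow Beta(0.5, 5), one option may be to put the prior on p directly rather than on a logit-scale parameter; whether that suits the full model is a judgment call.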

Related

Segmented Regression with two zero constraints at beginning and end boundaries

I am having trouble with this segmented regression as it requires two constraints and so far I have only treated single constraints.
Here is an example of some data I am trying to fit:
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
yields <- c(-0.131, 0.533, -0.397, -0.429, -0.593, -0.778, -0.92, -0.987, -1.113, -1.314, -0.808, -1.534, -1.377, -1.459, -1.818, -1.686, -1.73, -1.221, -1.595, -1.568, -1.883, -1.53, -1.64, -1.396, -1.679, -1.782, -1.033, -0.539, -1.207, -1.437, -1.521, -0.691, -0.879, -0.974, -1.816, -1.854, -1.752, -1.61, -0.602, -1.364, -1.303, -1.186, -1.336)
maturities <- c(2.824657534246575, 2.9013698630136986, 3.106849315068493, 3.1534246575342464, 3.235616438356164, 3.358904109589041, 3.610958904109589, 3.654794520547945, 3.778082191780822, 3.824657534246575, 3.9013698630136986, 3.9863013698630136, 4.153424657534247, 4.273972602739726, 4.32054794520548, 4.654794520547945, 4.778082191780822, 4.986301369863014, 5.153424657534247, 5.32054794520548, 5.443835616438356, 5.572602739726028, 5.654794520547945, 5.824425480949174, 5.941911819746988, 6.275245153080321, 6.4063926940639275, 6.655026573845348, 6.863013698630137, 7.191780821917808, 7.32054794520548, 7.572602739726028, 7.693150684931507, 7.901369863013699, 7.986301369863014, 8.32054794520548, 8.654794520547945, 8.986301369863014, 9.068493150684931, 9.32054794520548, 9.654794520547945, 9.903660453626769, 10.155026573845348)
off_2 <- 2.693277939965566
off_10 <- 10.655026573845348
bond_data = data.frame(yield_change = yields, maturity = maturities)
I am trying to fit a segmented model (with a formula of "yield_change~maturity") that has the following constraints:
At maturity = 2 I want the yield_change to be zero
At maturity = 10, I want the yield_change to be zero
I want breakpoints(fixed in x) at the 3, 5 and 7-year maturity values.
The off_2 and off_10 variables are the offsets I must use (to set the yields to zero at the 2 and 10-year mark)
As I mentioned before, my previous regressions only required one initial constraint, with a single offset value. In that case I did the following:
I subtracted the offset value from the maturity vector (for example, I had maturity = c(10.8, 10.9, 11, 14, 16, 18, ...), subtracted the offset, which is always lower than the first value of the vector, e.g. 10.4, and then fitted an lm with an origin constraint).
From there I could use the segmented package and fit as many breakpoints as I wanted.
That was possible because the segmented() function requires an lm object as input.
However, in this case I cannot use the previous approach: I have two offsets, and subtracting off_2 or off_10 from all the values would fix one point at zero but not the other.
What I have tried doing is the following:
Separate the dataset into maturities below 5 and maturities over 5, and essentially apply a segmented model to each of these (with only one breakpoint each, at 3 or 7).
The issue is that I need to have the 5 year point yield the same for the two models.
I have done this:
bond_data_sub5 <- bond_data[bond_data$maturity < 5,]
bond_data_over5 <- bond_data[bond_data$maturity > 5,]
bond_data_sub5["maturity"] <- bond_data_sub5$maturity - off_2
#2 to 5 year model
model_sub5 <- lm(yield_change~maturity+0, data = bond_data_sub5)
plot(bond_data_sub5$maturity,bond_data_sub5$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12))
abline(model_sub5)
Which gives me the following graph:
The fact that the maturities have an offset of off_2 is not a problem as when I input my predictions to the function I will create, I will then subtract them by off_2.
The worrying thing is that the 5-year prediction is not at all close to where the actual 5 year should be. Looking at the scatter plot of all maturities we can see this:
five_yr_yield <- predict(model_sub5,data.frame(maturity = 5 - off_2))
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12), ylim = c(-3,0.5))
points(5,five_yr_yield, pch=16, col = "red")
Gives:
The issue with this method is that if I set the model_sub5 5-year prediction as the beginning constraint of model_over5, I will have the exact same problem I am trying to resolve: two constraints in one lm, but this time the constraints are (5, five_yr_yield) and (10, 0).
Isn't there a way to fit a lm with no slope and zero as an intercept from (2,0) to (10,0) and then apply the segmented function with breakpoints at 3,5 and 7?
If that isn't possible how would I make the logic I am trying to apply work? Or is there another way of doing this?
If anyone has any suggestions I would greatly appreciate them!
Thank you very much!
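One way to think about the two-constraint problem, offered as a sketch rather than a tested solution: when the breakpoints are fixed in x, a continuous piecewise-linear curve that is zero at both ends is fully described by its heights at the interior breakpoints. Those heights can be estimated with a plain lm() and no intercept, using "hat" (tent) basis functions. This does not use segmented() at all. The helper name hat_basis is hypothetical, the data are simulated for illustration only, and the zeros are assumed to sit at off_2 and off_10.

```r
# Hypothetical helper: one tent function per interior knot.
# Each column is 1 at its own knot, 0 at every other knot, linear in between.
hat_basis <- function(x, knots) {
  interior <- seq_along(knots)[-c(1, length(knots))]
  cols <- lapply(interior, function(j) {
    y <- numeric(length(knots))
    y[j] <- 1
    approx(knots, y, xout = x, rule = 2)$y
  })
  do.call(cbind, cols)
}

off_2  <- 2.693277939965566
off_10 <- 10.655026573845348
knots  <- c(off_2, 3, 5, 7, off_10)   # zero ends plus fixed breakpoints

# Simulated stand-in data (NOT the bond data from the question)
set.seed(1)
maturity     <- sort(runif(40, 2.8, 10.2))
yield_change <- -1.5 * sin(pi * (maturity - off_2) / (off_10 - off_2)) +
  rnorm(40, sd = 0.2)

# No intercept: the coefficients are the fitted heights at 3, 5 and 7
H   <- hat_basis(maturity, knots)
fit <- lm(yield_change ~ 0 + H)

# Evaluate the fitted curve at arbitrary maturities
predict_curve <- function(x) drop(hat_basis(x, knots) %*% coef(fit))
predict_curve(c(off_2, 3, 5, 7, off_10))
```

Because every basis column is zero at both end knots, the fitted curve is zero there by construction, and the 5-year value is shared by the segments on either side of it.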

Estimating PDF with monotonically declining density at tails

tldr: I am numerically estimating a PDF from simulated data and I need the density to monotonically decrease outside of the 'main' density region (as x-> infinity). What I have yields a close to zero density, but which does not monotonically decrease.
Detailed Problem
I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability distribution function of some random variable (the probability of which cannot be analytically derived) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them to not have spurious local maxima.
Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution function and applying some non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.
I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:
#some example simulation
Simulated_stats_ <- runif(n=500, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
#approximation for x
approxfun(density(Simulated_stats_))(x)
This works well within the range of the simulated values in Simulated_stats_ (see image: Example PDF).
The problem is that I need to be able to evaluate the PDF far from the range of the simulated data.
So in the image above, I would need to evaluate the PDF at, say, x=50:
approxfun(density(Simulated_stats_))(50)
> [1] NA
So instead I use the from and to arguments of the density function, which give near-zero densities in the tails, for example:
approxfun(
density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
)(50)
[1] 1.924343e-18
Which is great, under one condition - I need the density to go to zero the further out from the range x is. That is, if I evaluated at x=51 the result must be strictly smaller. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. the extrapolated region).
To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the probability 'jumps' up and down. Always very close to zero, but NOT monotonically decreasing.
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x){approxfun(
density(Simulated_stats_,from = 0, to = max(Simulated_stats_)*10)
)(x)})
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
Result:
Non-monotonic log density far from density mass
My question
Does this happen because of the kernel estimation in density() or is it inaccuracies in approxfun()? (or something else?)
What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
Or - how can I manually change the approximated PDF to monotonically decline the further I am from the density mass? I would happily stick some linear trend that goes to zero...
Thanks!
One possibility is to estimate the CDF using a beta regression model; numerical estimate of the derivative of this model could then be used to estimate the pdf at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.
Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
Generate your data
set.seed(123)
Simulated_stats_ <- runif(n=5000, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
Function to estimate CDF using gam beta regression model
get_mod <- function(ss,p = seq(0.02, 0.98, 0.02)) {
qp = quantile(ss, probs=p)
betamod = mgcv::gam(p~s(qp, bs="cs"), family=mgcv::betar())
return(betamod)
}
betamod <- get_mod(Simulated_stats_)
Very basic estimate of PDF at val given model that estimates CDF
est_pdf <- function(val, betamod, tol=0.001) {
xvals = c(val,val+tol)
yvals = predict(betamod,newdata=data.frame(qp = xvals), type="response")
as.numeric((yvals[1] - yvals[2])/(xvals[1] - xvals[2]))
}
Let's check if monotonically increasing below min of Simulated_stats_
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))
[1] TRUE
Let's check if monotonically decreasing above max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))
[1] TRUE
Additional thoughts 3/5/22
As discussed in comments, using the betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly, there is another possible shortcut.
Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
Use the approach above that you were initially using, i.e. a linear interpolation across the density, but rather than doing this over the density outcome, instead do over k (i.e. over the above estimates from the beta model)
lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x:
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968
and it is very fast
microbenchmark::microbenchmark(
list = alist("betamod" = est_pdf(38, betamod),"lin_int" = lin_int(38)),times=100
)
Unit: microseconds
expr min lq mean median uq max neval
betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8 100
lin_int 1.7 2.25 3.503 4.35 4.50 10.5 100
Finally, let's check the same plot you did before, but using lin_int() instead of approxfun(density(....))
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
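On where the jumps come from: as far as I know, density() does not sum the kernels exactly. It bins the data on a grid and convolves with the kernel using a fast Fourier transform, so values very far out in the tails are dominated by floating-point noise. A minimal alternative sketch, assuming a Gaussian kernel and the bw.nrd0 bandwidth rule (log_kde is a hypothetical helper name, not a function from any package), is to evaluate the kernel sum directly on the log scale. Beyond the largest simulated value every Gaussian term is decreasing, so their sum is decreasing too.

```r
set.seed(123)
sims <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
h0   <- bw.nrd0(sims)   # rule-of-thumb bandwidth, as used by density() by default

# Hypothetical helper: log of the exact Gaussian kernel density estimate.
# Uses the log-sum-exp trick so far-tail values do not underflow to zero.
log_kde <- function(x, data = sims, h = h0) {
  vapply(x, function(xi) {
    lk <- dnorm(xi, mean = data, sd = h, log = TRUE)  # log kernel contributions
    m  <- max(lk)
    m + log(sum(exp(lk - m))) - log(length(data))
  }, numeric(1))
}

# Check the right tail, from the largest simulated value out to x = 100
x_tail <- seq(max(sims), 100, by = 0.5)
ld     <- log_kde(x_tail)
all(diff(ld) < 0)
plot(x_tail, ld, type = "l", xlab = "x", ylab = "log density")
```

Each evaluation touches every simulated point, so this is slower than interpolating a precomputed grid; the same grid-plus-approxfun() shortcut used above for the beta model could be applied to these exact values.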

Posterior distribution missing from plots

I'm trying to use R to calculate a posterior distribution and produce a triplot graph for my prior, likelihood and posterior distribution. I have the prior distribution π_1(θ) = Beta(1.5, 1.5).
Here is my R code:
n <- 25
X <- 16
a <- 1.5
b <- 1.5
grid <- seq(0,1,.01)
like <- dbinom(X,n,grid)
like
like <- like/sum(like)
like
prior <- dbeta(grid,a,b)
prior1 <- prior/sum(prior)
post <- like*prior
post <- post/sum(post)
It does give me a triplot, but I also want to get the values of my posterior distribution, and it seems something is missing in my code.
To clarify, I am looking for the posterior distribution of θ for the above prior distribution
In addition, I have tried:
install.packages("LearnBayes")
library("LearnBayes")
prior = c( a= 1.5, b = 1.5 )
data = c( s = 25, f = 16 )
triplot(prior,data)
It gives me a perfect Triplot, but again no value for posterior.
It's there; it's just that the prior is so weakly informative (Beta[a=1.5, b=1.5] is nearly uniform) that the likelihood function differs very little from the posterior. An intuitive way to think about this is that a+b-2 is 1, meaning the prior is effectively supported by only 1 previous observation, whereas N is 25, meaning the data is supported by 25 observations. This leads to the data dominating the posterior in terms of contributing information.
Changing the prior to be stronger will make the difference more apparent:
prior <- c(a=10, b=10)
data <- c(s=25, f=16)
triplot(prior, data)
Note, there is nothing wrong with using a weakly informative prior, if that is all the information that is available. When the observed data is large enough, it should dominate the posterior.
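To get actual numbers for the posterior rather than only the plot, the beta prior with a binomial likelihood has a closed form: a Beta(a, b) prior combined with X successes in n trials gives a Beta(a + X, b + n - X) posterior. The sketch below uses the values from the question's first code block (n = 25, X = 16) and compares the closed form with the grid approximation. The variable names a_post, b_post and post_grid are introduced here for illustration. As a side note, my understanding is that the data argument of LearnBayes::triplot() is a vector of successes and failures, which for these values would be c(s = 16, f = 9).

```r
n <- 25
X <- 16
a <- 1.5
b <- 1.5

# Conjugate update: posterior is Beta(a + X, b + n - X)
a_post <- a + X
b_post <- b + n - X
c(a_post = a_post, b_post = b_post)

# Posterior summaries from the closed form
post_mean <- a_post / (a_post + b_post)
post_ci   <- qbeta(c(0.025, 0.975), a_post, b_post)  # central 95% interval
post_mean
post_ci

# Grid approximation, as in the question, for comparison
grid      <- seq(0, 1, 0.01)
post_grid <- dbinom(X, n, grid) * dbeta(grid, a, b)
post_grid <- post_grid / sum(post_grid)
sum(grid * post_grid)   # grid estimate of the posterior mean
```

The vector post_grid holds the posterior probabilities on the grid, which is what the post object in the question already contains; dbeta(grid, a_post, b_post) gives the matching density values.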

Convert uniform draws to normal distributions with known mean and std in R

I apply the sensitivity package in R. In particular, I want to use sobolroalhs as it uses a sampling procedure for inputs that allow for evaluations of models with a large number of parameters. The function samples uniformly [0,1] for all inputs. It is stated that desired distributions need to be obtained as follows
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero means and standard deviations for some input parameters that I want to sample from a normal distribution. For others, I want to sample uniformly within a defined range (e.g. [0.03, 0.07] instead of [0, 1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package, and it resulted in odd values for the sensitivity indices. Hence, I think I need to stick to the uniform draws of the sobolroalhs function and use the sampled values in [0, 1] when drawing from the desired distribution (as a density draw, I think?). Does this make sense to anyone, and/or does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
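A small base-R sketch of both routes, using a matrix of plain runif() draws as a stand-in for x$X (an assumption made so that the example runs without the sensitivity package). The quantile functions map each uniform value to the matching quantile of the target distribution, which keeps the structure of the original design intact.

```r
set.seed(42)
u <- matrix(runif(2000), ncol = 2)   # stand-in for the [0, 1] design in x$X

# Route 1: quantile functions with the target parameters
x1 <- qnorm(u[, 1], mean = 579, sd = 21)      # normal with mean 579, sd 21
x2 <- qunif(u[, 2], min = 0.03, max = 0.07)   # uniform on [0.03, 0.07]

# Route 2: transform standard draws by hand
z     <- qnorm(u[, 1])            # standard normal quantiles
x1_alt <- 579 + 21 * z
x2_alt <- 0.03 + 0.04 * u[, 2]

# Both routes agree up to floating-point error
all.equal(x1, x1_alt)
all.equal(x2, x2_alt)
range(x2)
```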

R: How to read Nomograms to predict the desired variable

I am using Rstudio. I have created nomograms using function nomogram from package rms using following code (copied from the example code of the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram but there is no line connecting each scale (called isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have the isopleth for prediction (example from wikipedia). But here, how do I predict the variable value?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that predicting is supposed to be done without the isopleth, from the scale of points. but how? Can someone please elaborate with this example on how I can read the nomograms to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that nomogram function does not provide isopleth and has points scale instead.
From the documentation, the nomogram is used to manually obtain a prediction:
In the top part of the plot (above Total Points), you draw a vertical line from each of the variable values of your patient (for example age=40, cholesterol=220 (and sex=male), blood.pressure=172) up to the Points scale.
Then you sum the three values you read on the Points scale (40+60+3=103) to obtain Total Points.
Finally, you draw a vertical line from the Total Points scale (103) down to read the Risk of Death (0.55).
These are regression nomograms, and work in a different way to classic nomograms. A classic nomogram will perform a full calculation. For these nomograms you drop a line from each predictor to the scale at the bottom and add your results.
The only way to have a classic 'isopleth' nomogram working on a regression model would be to either (1) have just two predictors or (2) have a complex multi-step nomogram.
