Beta Distribution Fitting in R -- Various Attempts

I need to fit a custom probability density (based on the symmetric beta distribution B(shape, shape), where the two parameters shape1 and shape2 are identical) to my data.
The trouble is that I run into problems even when dealing with the plain vanilla symmetric beta distribution.
Please consider the code at the end of the post.
In the code, dbeta1 is the density of the beta distribution for shape1=shape2=shape.
In the code, dbeta2 is the same quantity written explicitly, without the normalization factor (which should not matter at all if we talk about maximizing a quantity).
I then generate some random numbers according to Beta(0.2, 0.2) and I try to estimate the shape parameter using
1) fitdistr from MASS
2) mle from stats4
Results: generally speaking, I get nonsensical estimates of the shape parameter when I use dbeta2 instead of dbeta1, and I do not understand why.
On top of that, mle crashes with dbeta2, and I often run into numerical problems depending on how I seed the x sequence of random numbers.
I must be misunderstanding something, so any suggestion is appreciated.
library(MASS)
library(stats4)
dbeta1 <- function(x, shape, ...)
dbeta(x, shape, shape, ...)
dbeta2 <- function(x, shape){
res <- x^(shape-1)*(1-x)^(shape-1)
return(res)
}
LL1 <- function(shape){
R <- dbeta1(x, shape)
res <- -sum(log(R))
return(res)
}
LL2 <- function(shape){
R <- dbeta2(x, shape)
res <- -sum(log(R))
return(res)
}
set.seed(124)
x <- rbeta(1000, 0.2, 0.2)
fit_dbeta1 <- fitdistr( x , dbeta1, start=list(shape=0.5) , method="Brent", lower=c(0), upper=c(1))
print("estimate of shape from fit_dbeta1 is")
print(fit_dbeta1$estimate)
fit_dbeta2 <- fitdistr( x , dbeta2, start=list(shape=0.5) , method="Brent", lower=c(0), upper=c(1))
print("estimate of shape from fit_dbeta2 is")
print(fit_dbeta2$estimate)
fit_LL1 <- mle(LL1, start=list(shape=0.5))
print("estimate of from fit_LL1")
print(summary(fit_LL1))
## this does not work
fit_LL2 <- mle(LL2, start=list(shape=0.5))

Well, I figured out the problem. Omitting the normalisation factor in dbeta2 was the issue, because that factor also depends on shape, so it does matter when maximising over shape.
If I use
dbeta2 <- function(x, shape){
res <- x^(shape-1)*(1-x)^(shape-1)/beta(shape, shape)
return(res)
}
then the results are consistent.
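As a quick sanity check (a minimal sketch reusing the objects defined above), refitting with the corrected dbeta2 should now give estimates that agree with the dbeta1 fit:
## refit with the normalised dbeta2; the estimates should now match fit_dbeta1
fit_dbeta2 <- fitdistr(x, dbeta2, start=list(shape=0.5), method="Brent", lower=0, upper=1)
fit_LL2 <- mle(LL2, start=list(shape=0.5))
c(fit_dbeta1$estimate, fit_dbeta2$estimate)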

Related

How to fix code in RMarkdown for simulation that will not run due to "Error: Discrete value supplied to continuous scale"?

I am wondering what is wrong with the following R code (R Markdown). I keep getting an error message for the last line that says "Error in h(x.n, df = N - 2) : unused argument (df = N - 2)". I am very confused because my TA looked at my code and told me that it should run perfectly.
For context, this is the problem I am working on:
library(MASS)
library(tidyverse)
library(hypergeo)
set.seed(1)
rm(list=ls())
N=7
Nsim=10000
rho=0
Sigma=matrix(c(1,rho,rho,1),2,2)
Sigma
mu=c(0,0)
r_vec=matrix(NaN,nrow=1,ncol=Nsim)
#have function mvrnorm-->simulate from multivariate normal distribution. N=7 Correlation matrix sigma. before X was fixed but now is random and formal dependence from Y that I can control. Compute rho hat and see if on average it gives me correct rho. Check how serious bias is when the expected value of rho hat isn't equal to rho. I want a feeling about whether this is something I should worry about or not
for (i in 1:Nsim){
data=mvrnorm(N, mu, Sigma)
r_vec[i]=cor(data[,1],data[,2])
}
mean(r_vec)
update.packages("deSolve")
x.n=seq(-1,1,0.1)
sim_rho0<-function(Nsim,N,rho){
rho=rho
mu=c(0,0)
Sigma=matrix(c(1,rho,rho,1),nrow=2)
r_vec=matrix(NaN,nrow=Nsim)
for (i in 1:Nsim){
data=mvrnorm(N, mu, Sigma)
r_vec[i]=cor(data[,1],data[,2])
}
# here we compute t, which should have a t_{N-2} distribution. This is different here and trying to reconstruct the .Not a mathematical proof. Might be a mistake*****
#range of values and plotting density for each one
h<- function(N,rho,x.n){
rho=rho
a <- ((N-2)*(gamma(N-1))*(1-rho^2)^(N-1)/2*(1-x.n^2)^(N-4)/2)/((2*pi)*(sqrt(N-1/2))((1-x.n*rho)^(N-3/2)))
b <- hypergeo(1/2, 1/2, (2*N-1/2), ((x.n*rho)+1)/2)
h2 = a*b
return(h2)
}
t=r_vec*sqrt(N-2)/(1-r_vec^2)
x.n=seq(-1,1,0.1)
y.n= h(N=10, rho=0.8, x.n=x.n)
df=tibble(X=t)
df2=tibble(x=x.n,y=y.n)
ggplot()+geom_histogram(data=df, aes(x=X,y=..density..),binwidth=0.2,
color="black", fill="white")+ geom_line(data = df2, aes(x = x, y = y),
color = "red")+xlim(-5,5)
}
rho=0.8
Nsim=3000
N=10
sim_rho0(Nsim,N,rho)
You've defined the function h with the arguments N, rho and x.n. You then try to call it with the argument df, which h does not have, so you get the error. You need to call h with the correct arguments (i.e. don't leave out N and rho, and if the value x.n should be passed to the function argument x.n, specify it by name rather than positionally), as sketched below. I also recommend following a style guide, e.g. https://style.tidyverse.org/
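For example, inside sim_rho0 (where h is defined), the call should look like the second line below; the commented line shows the kind of call that triggers the error:
## y.n <- h(x.n, df = N - 2)             # error: h() has no argument named df
y.n <- h(N = N, rho = rho, x.n = x.n)    # pass only the arguments h actually declares, by name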

nls peak fitting with a mixed normal and lognormal dataset (R)

I'm trying to use nls() to curve-fit a dataset consisting of a mixture of normally and lognormally distributed values. However, the normally distributed subset contains negative values that the lognormal function cannot tolerate. Using nls(), is there a way to constrain the values over which a PORTION of the fitted curve is evaluated? (e.g. let the normal function evaluate across 0 and force the lognormal function to evaluate only for x > 0)
Here's the test case I've been playing with:
test <- rnorm(5000, 2, 2)
test2 <- rlnorm(10000,2,2)
test3 <- append(test, test2)
bins <- seq(min(test3),100, .1)
tops <- data.frame(bin=bins, count=NA)
for (i in 1:nrow(tops)) { tops[i,2] <- length(test3[which(test3>=tops[i,1] &
test3<tops[i+1,1])]) }
fit <- nls(count ~ exp(-(bin-n.mu)^2/(2*n.sd^2))/(sqrt(2*pi)*n.sd)*C1 +
exp(-(log(bin)-l.mu)^2/(2*l.sd^2))/(sqrt(2*pi)*l.sd*bin)*C2,
data=tops, start=list(n.mu=2, n.sd=2, C1=500, l.mu=2, l.sd=2, C2=1000),
algorithm="port", trace=T)
coef(fit)
topsfit <- data.frame(bin=seq(-3, 100, 0.1))
topsfit$fit <- predict(fit, newdata=topsfit)
ggplot() + geom_point(data=tops, aes(x=(bins), y=count), shape=1, size=4) +
geom_path(data=topsfit, aes(x=(bin), y=fit), colour="red", size=1.5)
Very simply, I'm fitting a normal PDF + lognormal PDF. The problem is that log(bin) in the lognormal PDF does not play nice with negative numbers... but I don't want to crop negative values because that affects the calculations for the underlying, normally distributed values. I just want the lognormal half of my curve to ignore them.
Alternatively, is there a different approach to accomplishing this task that doesn't rely on nls()?
Seems like NO ONE wants to touch this topic, so I'll post a solution that I figured out with the help of a non-internet comrade-- the linchpin of my problem was in generating the functions that would comprise my curve. Writing the lognormal function separately allows conditional evaluation of x values, which is what I needed. Once I figured out that the nls() function operates on vectors and wrote my function to match, things shaped up quite nicely.
normal <- function(x, mu, sd, C) {
ans <- vector(length = length(x), mode = "numeric")
for (i in 1:length(x)) {
value <- exp(-(x[i]-mu)^2/(2*sd^2))/(sqrt(2*pi)*sd)*C
ans[i] <- value
}; return(ans) }
lognormal <- function(x, mu, sd, C) {
ans <- vector(length = length(x), mode = "numeric")
for (i in 1:length(x)) {
if (x[i]>0) {
value <- exp(-(log10(x[i])-mu)^2/(2*sd^2))/(sqrt(2*pi)*sd*x[i])*C
ans[i] <- value
} else { ans[i] <- 0 } }; return(ans) }
fit <- nls(count ~ normal(bin, n.mu, n.sd, C1) + lognormal(bin, l.mu, l.sd, C2),
data=tops, start=list(n.mu=30, n.sd=30, C1=5000,
l.mu=4, l.sd=2, C2=5000), algorithm="port", trace=T)
...and just like that, you can solve for mixed normal and lognormal distributions.
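As a side note (not part of the original answer): since nls() only needs the model function to be vectorised, the explicit loop in lognormal() can be dropped. A minimal sketch, keeping the same log10 parameterisation as above:
lognormal <- function(x, mu, sd, C) {
  ## evaluate the lognormal PDF only where x > 0; return 0 elsewhere
  res <- numeric(length(x))
  pos <- x > 0
  res[pos] <- exp(-(log10(x[pos])-mu)^2/(2*sd^2))/(sqrt(2*pi)*sd*x[pos])*C
  res
}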

Observation in a bivariate Ellipse

I am trying to find the probability that a point lies within an ellipse.
For example, if I plot bivariate data (x, y) for 300 draws together with a 95% ellipse region, how do I calculate how many of the 300 points fall inside the ellipse?
Here's the code I am using:
library(MASS)
set.seed(1234)
x<-NULL
k<-1
Sigma2 <- matrix(c(.72,.57,.57,.46),2,2)
Sigma2
rho <- Sigma2[1,2]/sqrt(Sigma2[1,1]*Sigma2[2,2])
rho
eta1<-replicate(300,mvrnorm(k, mu=c(-1.59,-2.44), Sigma2))
library(car)
dataEllipse(eta1[1,],eta1[2,], levels=c(0.05, 0.95))
Thanks for your help.
I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.
It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.
Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.
within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
if(missing(y) && is.matrix(x) && ncol(x) == 2)
{
y <- x[,2]
x <- x[,1]
}
if(plot.ellipse)
dataEllipse(x, y, levels=0.95)
d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}
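Applied to the data in the question (eta1 as generated with replicate above), the proportion and approximate count of the 300 points inside the 95% ellipse are:
p <- within.ellipse(eta1[1,], eta1[2,])   # proportion of points inside the 95% ellipse
round(300 * p)                            # approximate count out of 300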
It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly bad, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised Pareto distribution, for example:
library(evd) # for rgpd
# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[1] 0.97451
z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808

Generate a random number from a density object (or more broadly from a set of numbers)

Let's say I have a set of numbers that I suspect come from the same distribution.
set.seed(20130613)
x <- rcauchy(10)
I would like a function that randomly generates a number from that same unknown distribution. One approach I have thought of is to create a density object, compute the CDF from it, and then apply the inverse CDF to a random uniform variable (see Wikipedia).
den <- density(x)
#' Generate n random numbers from density() object
#'
#' @param n The total random numbers to generate
#' @param den The density object from which to generate random numbers
rden <- function(n, den)
{
diffs <- diff(den$x)
# Making sure we have equal increments
stopifnot(all(abs(diff(den$x) - mean(diff(den$x))) < 1e-9))
total <- sum(den$y)
den$y <- den$y / total
ydistr <- cumsum(den$y)
yunif <- runif(n)
indices <- sapply(yunif, function(y) min(which(ydistr > y)))
x <- den$x[indices]
return(x)
}
rden(1, den)
## [1] -0.1854121
My questions are the following:
Is there a better (or built into R) way to generate a random number from a density object?
Are there any other ideas on how to generate a random number from a set of numbers (besides sample)?
To generate data from a density estimate, you just randomly choose one of the original data points and add a random "error" piece based on the kernel from the density estimate. For the default Gaussian kernel, this just means choosing a random element from the original vector and adding a random normal with mean 0 and sd equal to the bandwidth used:
den <- density(x)
N <- 1000
newx <- sample(x, N, replace=TRUE) + rnorm(N, 0, den$bw)
Another option is to fit a density using the logspline function from the logspline package (uses a different method of estimating a density), then use the rlogspline function in that package to generate new data from the estimated density.
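A minimal sketch of that logspline alternative (assuming the logspline package is installed):
library(logspline)
lsfit <- logspline(x)            # estimate the density with log-splines
newx2 <- rlogspline(1000, lsfit) # draw new values from the fitted density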
If all you need is to draw values from your existing pool of numbers, then sample is the way to go.
If you want to draw from the presumed underlying distribution, then use density, fit that to your presumed distribution to get the necessary coefficients (mean, sd, etc.), and use the appropriate R distribution function.
Beyond that, I'd take a look at Chapter 7.3 ("rejection method") of Numerical Recipes in C for ways to "selectively" sample according to any distribution. The code is simple enough to be easily translated into R.
My bet is someone already has done so and will post a better answer than this.
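For what it's worth, here is one rough translation of the rejection idea to a density() estimate (my own sketch, not from the book): propose uniformly over the range of den$x and accept proposals that fall under the interpolated density curve.
rejden <- function(n, den) {
  f <- approxfun(den$x, den$y, yleft = 0, yright = 0)  # interpolated density estimate
  M <- max(den$y)                                      # height of the uniform envelope
  out <- numeric(0)
  while (length(out) < n) {
    prop <- runif(n, min(den$x), max(den$x))   # uniform proposals over the support
    keep <- runif(n) * M < f(prop)             # accept with probability f(prop)/M
    out <- c(out, prop[keep])
  }
  out[seq_len(n)]
}
rejden(5, den)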
Greg Snow's answer was helpful to me, and I realized that the output of the density function has all the data needed to create random numbers from the input distribution. Building on his example, you can do the following to get random values using the density output.
x <- rnorm(100) # or any numeric starting vector you desire
dens <- density(x)
N <- 1000
newx <- sample(x = dens$x, N, prob = dens$y, replace=TRUE) + rnorm(N, 0, dens$bw)
You can even create a simple random number generating function
rdensity <- function(n, dens) {
return(sample(x = dens$x, n, prob = dens$y, replace=TRUE) + rnorm(n, 0, dens$bw))
}
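For example, reusing dens from above:
newx <- rdensity(1000, dens)
hist(newx, breaks = 50)   # should roughly reproduce the shape of the original sample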

Superimposing gamma distribution curve to a plot

I have to superimpose a gamma distribution curve onto a plot of other power-law-like curves.
I first plot the points of the histogram on a log-log scale:
plot(log(pp$mids),log(pp$density))
Then I would like to superimpose my gamma distribution curve by calling an external function gamma():
gamma <- function(X)
{
n <- length(X)
theta<-var(hh2$V1)/mean(hh2$V1)
kappa<-mean(hh2$V1)/theta
y<-rgamma(n,kappa,theta)
xx<-hist(y,plot=F)
curve(log(xx$density),add=T,col='violet',type='l')
return( c(kappa) )
}
but this returns an error because curve() expects a function, or an expression in x, rather than a vector of values. How can I do this?
Here is a somewhat-working variant of your code:
Generate an example structured as (I guess) your data are:
library(rmutil) ## for rpareto
set.seed(101)
hh2 <- data.frame(V1=rpareto(1000, m=1, s=1.5))
Initial histogram calculation:
pp <- hist(hh2$V1,plot=FALSE)
Function (better not to call it gamma as that masks a built-in function):
ghistfun <- function(x) {
n <- length(x)
scalepar <- var(x)/mean(x)
shapepar <- mean(x)^2/var(x)
y <- rgamma(n,shape=shapepar,scale=scalepar)
xx <- hist(y,plot=FALSE)
lines(log(xx$mids),log(xx$density),col="red")
curve(dgamma(exp(x),shape=shapepar,scale=scalepar,log=TRUE),
add=TRUE,col="blue")
shapepar
}
It might be better to use a very large number for n rather than just using the length of the data, unless you are particularly interested in seeing the random fluctuations in a data set of precisely the same size. Alternatively you could just use curve(dgamma(x,...)), as shown (I initially thought that you would have to allow for scaling from a density of x to a density of log(x), but because of the way you have computed a histogram on the original (unlogged) scale and then transformed the bin midpoints, you don't have to ...)
plot(log(pp$mids),log(pp$density))
ghistfun(hh2$V1)
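If you only want the theoretical curve without the resampling noise, the curve() call from ghistfun can also be used on its own with the same moment-based estimates (a sketch assuming hh2 and pp as defined above):
scalepar <- var(hh2$V1)/mean(hh2$V1)
shapepar <- mean(hh2$V1)^2/var(hh2$V1)
plot(log(pp$mids), log(pp$density))
curve(dgamma(exp(x), shape=shapepar, scale=scalepar, log=TRUE), add=TRUE, col="blue")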
