Fitting values to a Binomial Distribution with Julia

I am new to Julia and want to understand how to fit some values to a binomial distribution and get its parameters:
d = Distributions.fit_mle(Binomial, [1, 1.1, 1.2, 1.4, 2.0, 1.4, 1.3, 1.1, 1.2, 1.5, 2.0, 2.2, 2.6, 2.9, 3.2, 2.8, 2.5, 2.0, 1.6, 1.0])
When I run this I get the following error:
suffstats is not implemented for (Binomial, Array{Float64,1}).
Well, I know that when you run other distributions like the Normal you do get the parameters. So there are two questions: first, how do I fit the above data to a Binomial distribution? Second, why can't I use Binomial with fit or fit_mle from Distributions?

For starters, the binomial distribution is usually defined over the integers, and you gave it an array of floats (and Distributions does expect integers here as well). What does that data of yours even mean? If you are interested in a binomial distribution over a finite set of non-integer values, I think the best alternative would be to map your data to unique integers and fit the distribution on them.
Secondly, there is no MLE in terms of sufficient statistics for the size parameter of the binomial distribution (it is an exponential family only over p, not N). You must pass the size parameter to fit_mle yourself. I didn't think of this myself, but found it out by looking at the respective methods of suffstats; for example:
julia> methods(suffstats)
...
[7] suffstats(::Type{#s29} where #s29<:Binomial, n::Integer, x::AbstractArray{T,N} where N) where T<:Integer in Distributions at /home/philipp/.julia/packages/Distributions/dTXqn/src/univariate/discrete/binomial.jl:195
...
Combining both requirements:
julia> data = rand(Binomial(5, 0.2), 10)
10-element Array{Int64,1}:
2
0
1
1
0
0
2
1
1
1
julia> fit_mle(Binomial, 5, data)
Binomial{Float64}(n=5, p=0.18)
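As for the first point (mapping non-integer data onto integers), a minimal sketch of what that recoding could look like, applied to the data from the question; note that both the recoding and the choice of the size parameter (here the largest recoded value) are assumptions on my part, not something fit_mle can infer:
using Distributions
vals = [1, 1.1, 1.2, 1.4, 2.0, 1.4, 1.3, 1.1, 1.2, 1.5, 2.0, 2.2, 2.6, 2.9, 3.2, 2.8, 2.5, 2.0, 1.6, 1.0]
# Recode each float as the (0-based) index of its value among the sorted unique values
levels = sort(unique(vals))
codes = [findfirst(==(v), levels) - 1 for v in vals]
# The size parameter still has to be supplied by hand; the largest code is one arbitrary choice
d = fit_mle(Binomial, maximum(codes), codes)
Whether a binomial distribution is a sensible model for such recoded data is a separate question; the sketch only shows how to satisfy the two requirements above.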

Related

Probability that the mean of k observations out of N is under a certain value

I'm trying to estimate the probability that the mean of 3 observations from a population is under a certain value.
Let's say I want to know what's the probability that the mean of 3 people's heights is under 1.8m
Population = c(1.7, 1.9, 1.6, 1.76, 1.8, 1.72, 1.99, 2, 1.66, 1.89)
If I pick randomly 3 observations (x_i, x_j, x_k)... What's the probability that the mean of these 3 observations is under 1.8m?
Thanks in advance.
Since the distribution of sums of variables is given by convolutions, you could approximate the convolutions with FFTs, using a fine enough sampling grid based on the gaps between your population values.
If k is large, you can use the central limit theorem to approximate:
sqrt(k) * (sum_k(x_k)/k - population_mean) / population_stdev as Normal(0, 1),
which makes it very easy to evaluate:
P(sum_k(x_k)/k < Val) = P( sqrt(k)*(sum_k(x_k)/k - population_mean)/population_stdev < sqrt(k)*(Val - population_mean)/population_stdev )
= Phi(sqrt(k)*(Val - population_mean)/population_stdev), where Phi is the standard normal CDF.
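To make the formula concrete, here is a minimal numeric check written in Julia rather than R (the variable names are mine), for the population above with k = 3 and Val = 1.8:
using Statistics, Distributions
population = [1.7, 1.9, 1.6, 1.76, 1.8, 1.72, 1.99, 2, 1.66, 1.89]
k, val = 3, 1.8
mu = mean(population)
sigma = std(population, corrected=false)   # uncorrected (population) standard deviation
# CLT approximation: P(mean of k draws < val) ≈ Phi(sqrt(k)*(val - mu)/sigma)
p = cdf(Normal(), sqrt(k) * (val - mu) / sigma)
Since 1.8 is essentially the population mean here, the approximation comes out just below one half; and with k as small as 3 the normal approximation is rough, so it is best treated as a sanity check.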

Truncated normal with a given mode

I would like to generate a random truncated normal distribution with the following properties.
lower bound= 3
upper bound = 5
using a fixed set of values with one decimal place (3.0, 3.1, 3.2, ..., 4.9, 5.0)
With mode=3
the probability of 5 occurring should be 50% of the probability of 3 occurring.
I was able to deal with the first three requirements, but I am struggling to find a way to set mode = 3 and to establish a fixed relationship between the occurrences of the upper and lower bounds.
library(truncnorm)
library(ggplot2)
set.seed(123)
dist<- as.data.frame(list(truncnorm=round(rtruncnorm(10000, a=3, b=5, mean=3.3, sd=1),1)))
ggplot(dist,aes(x=truncnorm))+
geom_histogram(bins = 40)+
theme_bw()
As you can see I can create truncated normal with the desired boundaries.
There are two problems with this distribution.
First, I want truncnorm == 3.0 to be the mode (i.e. the most frequent value of my distribution), while in this case the mode is truncnorm == 3.2.
Second I want the count of 5.0 values to be 50% of the 3.0 values. For example, if I generated a distribution with 800 observations with truncnorm=3.0, there should be approximately 400 observations with truncnorm=5.0.
Luckily, all your requirements are achievable using a truncated normal distribution.
Let low = 3 and high = 5.
Simply evaluate the density (at discrete points such as 3.0, 3.1, ..., 4.9, 5.0) of a normal distribution with mean low and standard deviation sqrt(2)*(high-low)/(2*sqrt(ln(2))).
This standard deviation is found by taking the following function proportional to a normal density with mean 0 and standard deviation z:
f(x) = exp(-(x-0)**2/(2*z**2))
Since f(0) = 1, we must find the necessary standard deviation z such that f(x) = 1/2. The solution is:
g(x) = sqrt(2)*x/(2*sqrt(ln(2)))
And plugging high-low into x leads to the final standard deviation given above.
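A quick numerical check of that standard deviation (sketched in Julia rather than R, just to verify the arithmetic; the final sampling step is my own addition and not part of the derivation):
using Distributions, StatsBase
low, high = 3.0, 5.0
z = sqrt(2) * (high - low) / (2 * sqrt(log(2)))   # ≈ 1.70
vals = 3.0:0.1:5.0                                # the allowed one-decimal values
w = pdf.(Normal(low, z), vals)                    # unnormalized probabilities
w[end] / w[1]                                     # ≈ 0.5: density at 5 is half the density at 3
draws = sample(vals, Weights(w ./ sum(w)), 10_000)
Because the mean of the (untruncated) normal sits at the lower bound, the density is monotonically decreasing on [3, 5], so 3.0 automatically becomes the mode of the discretized distribution.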

Can and how to run mutual information testing in R on a pair of samples from a normal distribution?

I would like to ask for an example calculating the mutual information of two samples selected from a standard normal distribution in R, because so far all the posts I have found are about discrete variables. This includes the documentation for the package entropy:
# load entropy library
library("entropy")
# joint distribution of two discrete variables
freqs2d = rbind( c(0.2, 0.1, 0.15), c(0.1, 0.2, 0.25) )
# corresponding mutual information
mi.plugin(freqs2d)
# MI computed via entropy
H1 = entropy.plugin(rowSums(freqs2d))
H2 = entropy.plugin(colSums(freqs2d))
H12 = entropy.plugin(freqs2d)
H1+H2-H12
# and corresponding (half) chi-squared divergence of independence
0.5*chi2indep.plugin(freqs2d)
The pdf of the normal distribution is called for, but does the function call for discretization, for example binning the values in the samples?
Mutual information may be defined just in terms of the random variables rather than a sample, but then it wouldn't really be as versatile as correlation.
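There is no accepted answer recorded here, but the binning idea raised in the question can be sketched directly. The sketch below is in Julia rather than R, purely so that the plug-in estimator mirroring mi.plugin is spelled out explicitly; the bin count is an arbitrary choice of mine:
using Random
Random.seed!(1)
x, y = randn(10_000), randn(10_000)        # two independent standard normal samples
nbins = 20
edgesx = range(minimum(x), maximum(x); length=nbins + 1)
edgesy = range(minimum(y), maximum(y); length=nbins + 1)
binof(v, edges) = clamp(searchsortedlast(edges, v), 1, nbins)
# Joint frequency table of the binned samples
counts = zeros(nbins, nbins)
for (xi, yi) in zip(x, y)
    counts[binof(xi, edgesx), binof(yi, edgesy)] += 1
end
p = counts ./ sum(counts)
px, py = sum(p, dims=2), sum(p, dims=1)
# Plug-in mutual information: sum of p_ij * log(p_ij / (p_i * p_j))
mi = sum(p[i, j] * log(p[i, j] / (px[i] * py[j])) for i in 1:nbins, j in 1:nbins if p[i, j] > 0)
For independent samples the true mutual information is 0; the plug-in estimate comes out slightly positive, and by how much depends on the number of bins, which is exactly the discretization choice the question is asking about.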

How to select top N values and sum them in mathematical optimization

I have an array of values (probabilities) as below,
a = [0, 0.1, 0.9, 0.2, 0, 0.8, 0.7, 0]
To select the maximum value from this array I can use the generalized mean equation. But what should I do if I want to select say top N values and sum them?
e.g. summing top 2 values will give me 0.9 + 0.8 = 1.7
But I don't need an implementation/algorithm for it. I need a mathematical equation (like the generalized mean for selecting the max value), because I want to optimize a function which includes this selection.
An effective way to get the top K values is to use a K-select algorithm.
For example, a fast and reliable one is Quickselect (based on the same partition procedure as QuickSort), which has average complexity O(N).
The generalized mean as a way of selecting the maximum seems like a purely mathematical conception, doesn't it?
So I doubt that a sum of infinite powers is applicable in real-life programming.
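To illustrate the K-select point (even though the question asks for an equation rather than an algorithm), a one-liner in Julia, whose partialsort uses a partial quicksort, i.e. the kind of K-select routine meant above:
a = [0, 0.1, 0.9, 0.2, 0, 0.8, 0.7, 0]
N = 2
# Partially sort so only the N largest elements are placed, then sum them
top_n_sum = sum(partialsort(a, 1:N, rev=true))   # 0.9 + 0.8 = 1.7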

Meta-analysis from odds ratios and confidence intervals, using metafor package in r

I'm trying to do a meta-analysis from a number of Odds ratios and their confidence intervals. The source articles do not report standard errors.
In order to use rma.uni() from the metafor package, I need to supply variances (through vi = " ") or standard errors (through sei = " "). So I calculated the standard errors in the following way (logor = log(odds ratio), UL = CI upper limit, LL = CI lower limit):
se1<-(log(UL)-logor)/1.96
se2<-(log(OR)-log(LL))/1.96
My problem is, the standard errors derived in this way differ a little bit, although they should be the same. I think this is due to the fact that the CI's were rounded by the authors. My solution was to take the average of these as the standard errors in the model.
However, when I fit the model and plot the forest plot, the resulting confidence intervals differ quite a bit from the ones I started with.
dmres<-rma.uni(yi=logor, sei=se, data=dm2)
forest(dmres, atransf=exp, slab=paste(dm2$author))
Is there a better way to do this?
Maybe a function that I can put confidence intervals in directly?
Thanks a lot for your comments.
Update
Example data and code:
dm<-structure(list(or = c(1.6, 4.4, 1.14, 1.3, 4.5), cill = c(1.2,
2.9, 0.45, 0.6, 3.2), ciul = c(2, 6.9, 2.86, 2.7, 6.1)), .Names = c("or",
"cill", "ciul"), class = "data.frame", row.names = c(NA, -5L))
dm$logor<-log(dm$or)
dm$se1<-(log(dm$ciul)-dm$logor)/1.96
dm$se2<-(dm$logor-log(dm$cill))/1.96
dm$se<-(dm$se1+dm$se2)/2
library(metafor)
dmres<-rma.uni(yi=logor, sei=se, data=dm)
forest(dmres, atransf=exp)
Since the confidence interval bounds (on the log scale) are not symmetric to begin with, you get these discrepancies. You can use the forest.default() function, supplying the CI bounds directly and then add the summary polygon with the addpoly() function. Using your example:
forest(dm$logor, ci.lb=log(dm$cill), ci.ub=log(dm$ciul), atransf=exp, rows=7:3, ylim=c(.5,10))
addpoly(dmres, row=1, atransf=exp)
abline(h=2)
will ensure that the CI bounds in the dataset are exactly the same as in the forest plot.
