Plotting mixtures of distributions with Julia

I want to plot mixtures of two 1D Gaussian distributions with Julia, and I am not sure what the best way to do it is. I am trying to use Distributions.jl and have tried two approaches:
Define two Gaussians using d1 = Normal(0.0, 1.0) and d2 = Normal(1.0, 1.8)
Define a mixture using
MixtureModel(Normal[ Normal(-2.0, 1.2), Normal(0.0, 1.0), Normal(3.0, 2.5)], [0.1, 0.6, 0.3])
With the first approach, I do not know how to define the weights. My question is therefore: how do I simply generate and draw samples from this mixture?
I would like to plot them and also use these samples in order to perform parameter estimation.

I'm not sure I fully understand the question - why are you defining d1 and d2?
To answer your bold question: just use rand() to draw from your mixture distribution:
julia> using Distributions
julia> mm = MixtureModel([Normal(-2.0, 1.2), Normal(), Normal(3.0, 2.5)], [0.1, 0.6, 0.3])
MixtureModel{Normal{Float64}}(K = 3)
components[1] (prior = 0.1000): Normal{Float64}(μ=-2.0, σ=1.2)
components[2] (prior = 0.6000): Normal{Float64}(μ=0.0, σ=1.0)
components[3] (prior = 0.3000): Normal{Float64}(μ=3.0, σ=2.5)
julia> rand(mm)
1.882130062980293
Note that here I have used Normal() instead of Normal(0.0, 1.0), as Normal() already returns the standard normal distribution.
To plot:
julia> using Plots
julia> histogram(rand(mm, 100_000), normalize = true, xlabel = "Value", ylabel = "Density", label = "Mixture model")
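For intuition about what drawing from the mixture involves: pick a component according to the weights, then draw from that component. Below is a minimal sketch of that idea in Python, using only the standard library. The helper name sample_mixture is made up for illustration and is not part of Distributions.jl.

```python
import random
import statistics

# Components (mean, standard deviation) and weights copied from the mixture above.
COMPONENTS = [(-2.0, 1.2), (0.0, 1.0), (3.0, 2.5)]
WEIGHTS = [0.1, 0.6, 0.3]


def sample_mixture(n, rng):
    """Draw n samples: pick a component by weight, then draw from that Gaussian."""
    picks = rng.choices(COMPONENTS, weights=WEIGHTS, k=n)
    return [rng.gauss(mu, sigma) for mu, sigma in picks]


rng = random.Random(0)
samples = sample_mixture(100_000, rng)

# The mixture mean is the weighted average of the component means.
expected_mean = sum(w * mu for w, (mu, _) in zip(WEIGHTS, COMPONENTS))
print(expected_mean, statistics.fmean(samples))
```

The sample mean should land close to the weighted average of the component means, which is a cheap check before using the samples for parameter estimation.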

Related

How can I plot the difference between two histograms? (Julia)

Some time ago I asked the same question here and someone answered with exactly what I wanted.
Using fit(Histogram, ...) and weights you can do it (just like the picture below).
julia> using StatsBase, Random; Random.seed!(0);
julia> x1, x2 = rand(100), rand(100);
julia> h1 = fit(Histogram, x1, 0:0.1:1);
julia> h2 = fit(Histogram, x2, 0:0.1:1);
julia> using Plots
julia> p1 = plot(h1, α=0.5, lab="x1") ; plot!(p1, h2, α=0.5, lab="x2")
julia> p2 = bar(0:0.1:1, h2.weights - h1.weights, lab="diff")
julia> plot(p1, p2)
The problem is that I can't use fit; I need to use Histogram(...), and that one doesn't have .weights.
How can I do this using Histogram?
This is what I'm using:
using Plots
using StatsBase
h1 = histogram(Group1, bins= B, normalize =:probability, labels = "Group 1")
h2 = histogram(Group2, bins = B, normalize = :probability, labels = "Group 2")
Technically there is no Histogram function in any common Julia package; perhaps you mean either the Histogram (capital h) type provided by StatsBase, or the histogram (lowercase h) function provided by Plots.jl? In either case though, the answer is "you can't".
If you mean histogram from Plots.jl there is unfortunately no practical way to access that underlying data. If you mean Histogram from StatsBase on the other hand, that only works with fit (it's a type, not a function that can be used on its own).
There are other histogram packages though if for any reason you cannot or do not want to use StatsBase and fit, including FastHistograms.jl and NaNStatistics.jl, both of which are additionally somewhat faster than StatsBase for simple cases. So, for example
using NaNStatistics, Plots
a,b = rand(100), rand(100)
dx = 0.1
binedges = 0:dx:1
aw = histcounts(a, binedges)
bw = histcounts(b, binedges)
bar(binedges, aw-bw, label="difference", bar_width=dx)
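The idea behind both versions is the same: count the two samples over identical bin edges, then subtract the counts bin by bin. A minimal sketch in Python (standard library only), where bin_counts is a hypothetical helper written for illustration and not the API of StatsBase or NaNStatistics:

```python
import random


def bin_counts(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore out-of-range values
        # A linear scan keeps the sketch simple; bisect would be faster for many bins.
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


rng = random.Random(0)
a = [rng.random() for _ in range(100)]
b = [rng.random() for _ in range(100)]
edges = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

aw = bin_counts(a, edges)
bw = bin_counts(b, edges)
diff = [x - y for x, y in zip(aw, bw)]  # one signed value per bin
print(diff)
```

Because both samples have the same size and share the same edges, the differences sum to zero; with normalized histograms the same holds for the probabilities.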

Drawing random numbers from a power law distribution in R

I am using the R package "poweRlaw" to estimate and subsequently draw from discrete power law distributions, however the distribution drawn from the fit does not seem to match the data. To illustrate, consider this example from a guide for this package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf. Here we first download an example dataset from the package and then fit a discrete power law.
library("poweRlaw")
data("moby", package = "poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
The fit looks like a good one, as we can't reject the hypothesis that this data is drawn from a power law distribution (p-value > 0.05):
bs = bootstrap_p(m_pl, threads = 8)
bs$p
However, when we draw from this distribution using the built in function dist_rand(), the resulting distribution is shifted to the right of the original distribution:
set.seed(1)
randNum = dist_rand(m_pl, n = length(moby))
plot(density(moby), xlim = c(0, 1000), ylim = c(0, 1), xlab = "", ylab = "", main = "")
par(new=TRUE)
plot(density(randNum), xlim = c(0, 1000), ylim = c(0, 1), col = "red", xlab = "x", ylab = "Density", main = "")
I am probably misunderstanding what it means to draw from a power law distribution, but does this happen because we only fit the tail of the experimental distribution (so we only draw values above the parameter Xmin)? If something like this is happening, is there any way I can compensate for it so that the fitted distribution resembles the experimental distribution?
So there's a few things going on here.
As you hinted at in your question, if you want to compare distributions, you need to truncate moby, so moby = moby[moby >= m_pl$getXmin()]
Using density() is a bit fraught. It is a kernel density smoother that places Normal kernels over the discrete points. As the power law has a very long tail, the result is suspect.
Comparing the tails of two powerlaw distributions is tricky (simulate some data and see).
Anyway, run the following:
set.seed(1)
x = dist_rand(m_pl, n = length(moby))
# Cut off the tail for visualisation
moby = moby[moby >= m_pl$getXmin() & moby < 100]
plot(density(moby), log = "xy")
x = x[ x < 100]
lines(density(x), col = 2)
This gives something fairly similar.
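The shift seen in the question follows from how a fitted power law is sampled: the model is only defined for x >= xmin, so no simulated value can fall below it, while the raw data still contains smaller values. A rough sketch in Python, using a continuous power law with inverse-transform sampling as a stand-in for the discrete model. The xmin and alpha values and the toy data are made up for illustration; they are not the estimates from the moby fit.

```python
import random


def sample_power_law(n, xmin, alpha, rng):
    """Inverse-transform sampling from a continuous power law p(x) ~ x**(-alpha), x >= xmin."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


rng = random.Random(1)
xmin, alpha = 7.0, 1.95  # illustrative values only
draws = sample_power_law(10_000, xmin, alpha, rng)

# A toy "raw" data set that, like real data, also has values below xmin.
raw = [1, 1, 2, 3, 5, 8, 13, 40, 200]
tail = [v for v in raw if v >= xmin]  # the part the power law actually models

print(min(draws) >= xmin)  # simulated values never fall below xmin
print(tail)
```

Comparing the simulated draws against the truncated data rather than the full data is the like-for-like comparison described above.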

QQ plot in R from the TASSEL pipeline

For my GWAS analysis I am using the TASSEL pipeline. In my GWAS I am studying two correlated traits.
I want to draw a QQ plot for the two traits in one plot, like the one that can be obtained from the TASSEL program.
Does anyone have a suggestion for which R package I can use to do that?
With the qq() command from the qqman package I can draw the QQ plots in separate plots, but I want a single plot that includes both of my traits, as I did in TASSEL.
Any suggestions?
A QQ-Plot in your case compares quantiles of the empirical distribution of your result to quantiles of the distribution that you'd expect theoretically if the null hypothesis is true.
If you have n data points, it makes sense to compare the n-quantiles, because then the actual quantiles of your empirical distribution are just your data points, ordered.
The theoretical distribution of p-values is the uniform distribution. Think of it, that's exactly the reason why they exist. If a measurement is assigned for example a p-value of 0.05, you'd expect this or a more extreme measurement by pure chance (null hypothesis) in only 5% of your experiments, if you repeat that experiment very often. A measurement with p=0.5 is expected in 50% of the cases. So, generalizing to any value p, your cumulative distribution function is
CDF(p) = P[measurement with p-value of ≤ p] = p.
Look in Wikipedia: that's the CDF for the uniform distribution between 0 and 1.
Therefore, the expected n-quantiles for your QQ-Plot are {1/n, 2/n, ... n/n}. (They represent the case that the null hypothesis is true)
So, now we have the theoretical quantiles (x-axis) and the actual quantiles. In R code, this is something like
expected_quantiles <- function(pvalues){
  n = length(pvalues)
  actual_quantiles = sort(pvalues)
  expected_quantiles = seq_along(pvalues)/n
  data.frame(expected = expected_quantiles, actual = actual_quantiles)
}
You can take the -log10 of these values and plot them, for example like so
testdata1 <- c(runif(98,0,1), 1e-4, 2e-5)
testdata2 <- c(runif(96,0,1), 1e-3, 2e-3, 2e-4)
qq <- lapply(list(d1 = testdata1, d2 = testdata2), expected_quantiles)
xlim <- rev(-log10(range(rbind(qq$d1, qq$d2)$expected))) * c(1, 1.1)
ylim <- rev(-log10(range(rbind(qq$d1, qq$d2)$actual))) * c(1, 1.1)
plot(NULL, xlim = xlim, ylim = ylim)
points(x = -log10(qq$d1$expected) ,y = -log10(qq$d1$actual), col = "red")
points(x = -log10(qq$d2$expected) ,y = -log10(qq$d2$actual), col = "blue")
abline(a = 0, b = 1)

Smooth curve through points and include the origin in R

I am a beginner in R and started with graphics recently.
I have managed to program a working empirical cumulative distribution function (user-generated, not using the standard ecdf() function) and to generate a plot. However, the plot is not as it should be, there are two issues with it and I am not sure on how to solve them (I have done my 'research' but have not found a solution).
This is my code:
set.seed(1)
n = 50
x = rpois(n, 2.2)
cdf = function(x,n)
{
v=c()
for(z in 1:max(x))
{
a = length(x[x<=z])/n
v = c(v, a)
}
plot(v,type="l", main="empirical cumulative distribution function", xlab="x", ylab="cumulative probability", xlim=c(0,6), ylim=c(0,1.0))
}
cdf(x, n)
There are two issues with this plot:
The lines are straight but it should be a smooth curve through all points.
The origin is not included (now the curve starts at x = 1).
How can these issues be resolved in an elegant way?
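Try the following spline interpolator. Note that v is local to cdf(), so either place this call inside cdf() in place of the existing plot() call or have cdf() return v. The x-coordinates are passed explicitly so that the prepended 0 is drawn at x = 0 rather than at index 1:
plot(spline(0:length(v), c(0, v)), type = "l")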
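On the second issue, the curve starts at x = 1 only because the loop runs over 1:max(x). Evaluating the empirical CDF on a grid that starts at 0 includes the leftmost point directly. A minimal sketch of that idea in Python; the data and the helper name ecdf_on_grid are illustrative only:

```python
def ecdf_on_grid(data):
    """Return (z, fraction of data <= z) for every integer z from 0 to max(data)."""
    n = len(data)
    return [(z, sum(1 for v in data if v <= z) / n) for z in range(0, max(data) + 1)]


# Small made-up count data set, standing in for the Poisson draws in the question.
data = [2, 1, 4, 2, 0, 3, 2, 5, 1, 2]
points = ecdf_on_grid(data)
for z, p in points:
    print(z, p)
```

For count data the value at 0 is the share of zeros, which is usually greater than 0, so forcing the curve through the point (0, 0) is a presentation choice rather than a property of the empirical CDF.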

Using user-defined functions within "curve" function in R graphics

I need to produce normal density plots with different areas under each curve (the areas summing to 1 in total). Using the following function, I can specify lambda, which gives the relative area:
sdnorm <- function(x, mean=0, sd=1, lambda=1){lambda*dnorm(x, mean=mean, sd=sd)}
I then want to plot up the function using different parameters. Using ggplot2, this code works:
require(ggplot2)
qplot(x, geom="blank") + stat_function(fun=sdnorm,args=list(mean=8,sd=2,lambda=0.7)) +
stat_function(fun=sdnorm,args=list(mean=18,sd=4,lambda=0.30))
but I really want to do this in base R graphics, for which I think I need to use the "curve" function. However, I am struggling to get this to work.
If you take a look at the help file for ?curve, you'll see that the first argument can be a number of different things:
The name of a function, or a call or an expression written as a function of x which will evaluate to an object of the same length as x.
This means you can specify the first argument as either a function name or an expression, so you could just do:
curve(sdnorm)
to get a plot of the function with its default arguments. Otherwise, to recreate your ggplot2 representation you would want to do:
curve(sdnorm(x, mean=8,sd=2,lambda=0.7), from = 0, to = 30)
curve(sdnorm(x, mean=18,sd=4,lambda=0.30), add = TRUE)
You can do the following in base R
x <- seq(0, 50, 1)
plot(x, sdnorm(x, mean = 8, sd = 2, lambda = 0.7), type = 'l', ylab = 'y')
lines(x, sdnorm(x, mean = 18, sd = 4, lambda = 0.30))
EDIT I added ylab = 'y' and updated the picture to have the y-axis re-labeled.
This should get you started.
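As a sanity check on the lambda scaling used in the answers above: each scaled curve has area lambda under it, so the two curves together have area 0.7 + 0.3 = 1. A small Python sketch (standard library only) checks this numerically with the trapezoidal rule. Here sdnorm is a re-implementation of the R function for illustration, with the argument renamed to lam because lambda is a reserved word in Python.

```python
import math


def sdnorm(x, mean=0.0, sd=1.0, lam=1.0):
    """Normal density scaled by lam, so the area under the curve equals lam."""
    z = (x - mean) / sd
    return lam * math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def trapezoid(f, lo, hi, steps=20_000):
    """Approximate the integral of f over [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


# Integrate over a range wide enough to cover essentially all of both curves.
area1 = trapezoid(lambda x: sdnorm(x, mean=8, sd=2, lam=0.7), -20, 60)
area2 = trapezoid(lambda x: sdnorm(x, mean=18, sd=4, lam=0.3), -20, 60)
print(area1, area2, area1 + area2)
```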

Resources