Testing First-Order Stochastic Dominance Using R

I ran a simulation and generated two random variables, X and Y.
I would like to test whether X first-order stochastically dominates Y in R.
That is, how can I check whether X's empirical CDF lies to the right of (i.e., at or below) Y's empirical CDF over the entire support?
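One concrete way to run that check (a minimal sketch with made-up data; comparing the two ECDFs on the pooled sample points is one reading of the in-sample dominance condition, not an authoritative test):

# X first-order dominates Y in-sample iff X's ECDF is <= Y's ECDF at
# every jump point, i.e. at every pooled sample value.
set.seed(42)
X <- rnorm(1000, mean = 1)           # hypothetical simulation output
Y <- rnorm(1000, mean = 0)

grid <- sort(unique(c(X, Y)))        # all jump points of both ECDFs
all(ecdf(X)(grid) <= ecdf(Y)(grid))  # TRUE -> X weakly dominates Y in-sample

For a formal hypothesis test rather than an in-sample check, ks.test() also offers one-sided two-sample alternatives; see ?ks.test, noting that the direction convention of its alternative argument is easy to get backwards.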

Related

Generate two negative binomial distributed random variables with predefined correlation

Assume I have a negative binomial distributed variable X1 with NB(mu=MU1, size=s1) and another negative binomial distributed variable X2 with NB(mu=MU2, size=s2).
I fitted a negative binomial regression to estimate the mu and size parameters from my data.
I can use the rnbinom() function in R to generate random draws from these distributions.
X1model <- rnbinom(n = 1000, mu = MU1fitted, size = s1fitted)
X2model <- rnbinom(n = 1000, mu = MU2fitted, size = s2fitted)
Those draws are independent. However, how can I draw from these distributions so that they exhibit a predefined correlation r, namely the correlation I observe between my original data X1 and X2, so that:
cor(X1, X2, method="spearman") = r = cor(X1model, X2model, method="spearman")
Or, even better, how can I draw from them with any arbitrary preset correlation r?
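One standard way to do this (not from the original post) is a Gaussian copula: draw correlated bivariate normals, map them to uniforms, and push the uniforms through qnbinom(). A minimal sketch with hypothetical fitted values; with discrete margins the achieved Spearman correlation only approximates the latent rho, so rho may need tuning against the target r:

library(MASS)                      # for mvrnorm
MU1fitted <- 5;  s1fitted <- 2     # hypothetical fitted parameters
MU2fitted <- 10; s2fitted <- 3
rho <- 0.6                         # latent normal correlation (tune to hit r)

Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
Z <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
U <- pnorm(Z)                      # correlated uniforms on (0,1)

X1model <- qnbinom(U[, 1], mu = MU1fitted, size = s1fitted)
X2model <- qnbinom(U[, 2], mu = MU2fitted, size = s2fitted)
cor(X1model, X2model, method = "spearman")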

How to generate a random variable from two different distributions in R

Suppose a random variable Z is drawn randomly, with equal probability, from two different distributions: a standard normal N(0,1) and an exponential exp(1) with rate = 1. I want to generate the random variable Z.
My approach in R was Z = 0.5X + 0.5Y, so that Z comes from the joint distribution of N(0,1) and exp(1). The R code would be:
x <- rnorm(1)
y <- rexp(1)
z <- 0.5*x + 0.5*y
My question is: can I obtain Z by just adding up X and Y weighted by their probabilities, or do I have to account for the correlation between the variables?
Unfortunately not. You need another variable U, a Bernoulli random variable with p = 0.5 that is independent of X and Y. Define Z = U*X + (1-U)*Y. In R, you can do:
x <- rnorm(1)
y <- rexp(1)
u <- rbinom(1, 1, 0.5)
z <- u*x + (1-u)*y
Averaging X and Y results in a totally different distribution, not the mixture of distributions you want.
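For many draws, the same two-step idea vectorizes directly (a small sketch, not part of the original answer):

n <- 10000
u <- rbinom(n, 1, 0.5)                          # component indicator per draw
z <- u * rnorm(n) + (1 - u) * rexp(n, rate = 1) # mixture sample
hist(z, breaks = 50)                            # the mixture, not an average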

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution from N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting the distribution to a particular family.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate is the empirical distribution function (see Wikipedia). For this estimator there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is the array of observed samples, then dataset[rand(1:length(dataset), sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it can be written more readably, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also a good choice, but it requires choosing parameters (the kernel and its width), which amounts to a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each draw with a draw from the kernel.
For example, if the kernel is a Normal distribution of width w, then the perturbed sample can be computed as:
# resample from the data, then jitter each draw with Normal(0, w) noise
new_sample = dataset[rand(1:length(dataset), sample_size)] + w*randn(sample_size)

Generating random values from non-normal and correlated distributions

I have a random variable X that is a mixture of a binomial and two normals,
and I have another random variable Y of similar shape but with different values for each normally distributed side.
X and Y are also correlated. Here's an example of data that could be plausible:
     X    Y
1.   0  -20
2.  -5    2
3. -30    6
4.   7   -2
5.   7    2
As you can see, this is simply meant to show that my random variables are either a small positive value (often) or a large negative one (rare), and that they have a certain covariance.
My problem is: I would like to be able to sample correlated random values from these two distributions.
I could use a Cholesky decomposition to generate correlated normally distributed random variables, but the random variables we are talking about here are not normal but rather a mixture of a binomial and two normals.
Many thanks!
Note: you don't have a mixture of a binomial and two normals, but rather a mixture of two normals. Even though for some reason in your previous post you did not want to use a two-step generation process (first generate a Bernoulli variable telling you which component to sample from, then sample from that component), that is typically what you would want to do with a mixture distribution. This process generalizes naturally to a mixture of two bivariate normal distributions: first pick a component, then generate a pair of correlated normal values, as in the sketch below. Your description does not make it clear whether you are fitting data with this distribution or just trying to simulate such a distribution; how difficult it is to obtain the covariance matrices for the two components will depend on your situation.
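A minimal sketch of that two-step sampler; the component weights, means, and covariance matrices below are made-up illustrations, not fitted values:

library(MASS)                               # for mvrnorm
n <- 1000
p <- 0.8                                    # P(small-positive component)
mu1 <- c(5, 2);     Sigma1 <- matrix(c(4, 1.5, 1.5, 4), 2, 2)
mu2 <- c(-25, -15); Sigma2 <- matrix(c(60, 20, 20, 60), 2, 2)

k <- rbinom(n, 1, p)                        # which component each pair uses
xy <- t(sapply(k, function(ki)
  if (ki == 1) mvrnorm(1, mu1, Sigma1) else mvrnorm(1, mu2, Sigma2)))
X <- xy[, 1]; Y <- xy[, 2]
cor(X, Y)                                   # induced correlation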

Generating means from a bivariate gaussian distribution

I am reading The Elements of Statistical Learning (ESL II), and in chapter 2 they use a Gaussian mixture data set to illustrate some learning algorithms. To generate this data set, they first generate 10 means from a bivariate Gaussian distribution N((1,0)', I). I am not sure what they mean by this.
How can you generate 10 means from a bivariate distribution having mean (1,0)?
Each of the means generated from the bivariate Gaussian distribution is simply a single point, sampled in exactly the same way as any other random point from that distribution. The fact that these generated points are then used as the means of new distributions is irrelevant.
Let's say that each of the 10 means is then used to construct a new bivariate Gaussian.
means ~ N( (1,0), I)
Where ~ indicates a value being drawn from the distribution. Since the distribution being sampled from is a bivariate Gaussian, each sampled data point will be a 2-dimensional point (xi, yi).
Each of these points sampled from the original distribution can then be used to make a new distribution.
Example:
means = [ (x1,y1), (x2,y2), ..., (x10,y10) ]
To build new bivariate Gaussians:
N1((x1,y1), I), N2((x2,y2), I), ..., N10((x10,y10), I)
They are just using the initial bivariate Gaussian distribution N((1,0), I) as an easy way to pick 10 random means that are distributed normally.
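In R, that construction might look like this (a minimal sketch; the second step's N(m_k, I/5) with a uniformly chosen mean follows the book's recipe, while the sample sizes are illustrative):

library(MASS)                                         # for mvrnorm
means <- mvrnorm(10, mu = c(1, 0), Sigma = diag(2))   # 10 means, one per row
# ESL then generates each observation by picking one of the 10 means at
# random and drawing from N(m_k, I/5):
k <- sample(1:10, 100, replace = TRUE)
obs <- means[k, ] + mvrnorm(100, mu = c(0, 0), Sigma = diag(2) / 5)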
