Generating means from a bivariate Gaussian distribution in R

I am reading The Elements of Statistical Learning (ESL II), and in chapter 2 the authors use a Gaussian mixture data set to illustrate some learning algorithms. To generate this data set, they first generate 10 means from a bivariate Gaussian distribution N((1,0)', I). I am not sure what this means.
How can you generate 10 means from a bivariate distribution having mean (1,0)?

Each of the means generated from the bivariate Gaussian distribution is simply a single point, sampled in exactly the same way as any other random point drawn from that distribution. The fact that these generated points are then used as the means of new distributions is irrelevant to how they are sampled.
Let's say that each of the 10 means is then used to construct a new bivariate Gaussian.
means ~ N( (1,0), I)
where ~ indicates a value drawn from the distribution. Since the distribution being sampled from is a bivariate Gaussian, each sampled data point is a 2-dimensional point (xi, yi).
Each of these points sampled from the original distribution can then be used to make a new distribution.
Example:
means = [ (x1,y1), (x2,y2), ..., (x10,y10) ]
To build new bivariate Gaussians:
N1((x1,y1), I), N2((x2,y2), I), ..., N10((x10,y10), I)
They are just using the initial bivariate Gaussian distribution N((1,0), I) as an easy way to pick 10 random means that are distributed normally.
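A minimal R sketch of this two-step idea (using MASS::mvrnorm for the multivariate normal draws; the unit covariance I for the new Gaussians follows the example above):
library(MASS)  # provides mvrnorm for multivariate normal sampling

set.seed(1)
# Step 1: draw 10 means from the bivariate Gaussian N((1,0)', I)
means <- mvrnorm(10, mu = c(1, 0), Sigma = diag(2))

# Each row of `means` is one 2-dimensional point (xk, yk); step 2 uses
# it as the centre of a new bivariate Gaussian N((xk, yk), I)
k <- sample(1:10, 1)                                # pick one of the 10 means
point <- mvrnorm(1, mu = means[k, ], Sigma = diag(2))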

Related

Generate two negative binomial distributed random variables with predefined correlation

Assume I have a negative binomial distributed variable X1 with NB(mu = MU1, size = s1) and a negative binomial distributed variable X2 with NB(mu = MU2, size = s2).
I fitted a negative binomial regression to estimate the mu's and sizes from my data.
I can use the rnbinom() function in R to generate random draws from this distribution.
X1model <- rnbinom(n = 1000, mu = MU1fitted, size = s1fitted)
X2model <- rnbinom(n = 1000, mu = MU2fitted, size = s2fitted)
These draws are independent. However, how can I draw from those distributions so that they exhibit a predefined correlation r, namely the correlation I observe between my original data X1 and X2,
so that:
cor(X1, X2, method="spearman") = r = cor(X1model, X2model, method="spearman")
Or, even better, how can I draw from them with any arbitrary preset correlation r?
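One standard technique for this (not from the original thread, so treat it as a sketch) is a Gaussian copula: draw correlated normals, push them through pnorm to get correlated uniforms, and map those through the negative binomial quantile function. The target rho below is illustrative, and the achieved Spearman correlation will be close to, but not exactly, the preset value:
library(MASS)  # mvrnorm for correlated normal draws

# assumes MU1fitted, s1fitted, MU2fitted, s2fitted from the fitted regression
rho <- 0.6                                  # illustrative target correlation
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
z <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
u <- pnorm(z)                               # correlated uniforms
X1model <- qnbinom(u[, 1], mu = MU1fitted, size = s1fitted)
X2model <- qnbinom(u[, 2], mu = MU2fitted, size = s2fitted)
cor(X1model, X2model, method = "spearman")  # approximately rho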

How to generate a random variable from two different distributions in R

Suppose a random variable Z is taken randomly from two different distributions with equal probability: a standard normal N(0,1) and an exponential exp(1) with rate = 1. I want to generate the random variable Z.
So in R, my approach is Z = 0.5X + 0.5Y, so that Z comes from the joint distribution of N(0,1) and exp(1). The R code would be:
x <- rnorm(1)
y <- rexp(1)
z <- 0.5*x + 0.5*y
My question is: can I obtain Z by just adding up X and Y weighted by their probabilities, or do I have to consider the correlation between the variables?
Unfortunately not. You need another variable U, a Bernoulli random variable with p = 0.5 that is independent of X and Y. Define Z = U*X + (1-U)*Y. In R, you can do:
x <- rnorm(1)
y <- rexp(1)
u <- rbinom(1, 1, 0.5)
z <- u*x + (1-u)*y
Averaging X and Y results in a totally different distribution, not the mixture of distributions you want.
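A vectorized sketch makes the difference easy to see (the sample size of 10000 is arbitrary):
n <- 10000
u <- rbinom(n, 1, 0.5)                  # component indicator
z_mix <- u*rnorm(n) + (1 - u)*rexp(n)   # correct: 50/50 mixture
z_avg <- 0.5*rnorm(n) + 0.5*rexp(n)     # incorrect: a weighted sum
# hist(z_mix); hist(z_avg)  # the two shapes differ clearly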

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is an array of the observed samples, then dataset[rand(1:length(dataset), sample_size)] is a set of new samples from the empirical distribution. With the StatsBase package this can be written more readably:
using StatsBase
new_sample = sample(dataset, sample_size)
Finally, kernel density estimation is also good, but it requires choosing parameters (the kernel and its width), which expresses a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each point with a draw from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset), sample_size)] + w*randn(sample_size)

How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function from the minpack.lm package to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's Distance of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's Distance or hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model: it removes each observation, one at a time, and refits the model to see, roughly speaking, how much the model coefficients change with or without a given observation.
Reproducible example:
# 100 x-values: 10 design points, each repeated 10 times
xs = rep(1:10, times = 10)
# deterministic exponential-decay response
ys = 3 + 2*exp(-0.5*xs)
# jitter the x-values with N(0, sd = 2) noise, so the noise enters through x
for (i in 1:100) {
  xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data = df1, start = c(a = 3, b = 2, d = -0.5))
require(nlstools)
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it flags observations above a certain threshold as influential. The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion; it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.
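Since nlsLM from minpack.lm returns an object that inherits from class nls, nlsJack should accept it directly; a sketch reusing df1 from above:
library(minpack.lm)  # provides nlsLM
library(nlstools)

# nlsLM output inherits from class "nls", so nlstools diagnostics apply
nls2 <- nlsLM(ys ~ a + b*exp(d*xs), data = df1, start = c(a = 3, b = 2, d = -0.5))
jack2 <- nlsJack(nls2)
plot(jack2)     # percentage change in coefficients per removed observation
summary(jack2)  # lists the observations flagged as influential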

Generating random values from non-normal and correlated distributions

I have a random variable X that is a mixture of a binomial and two normals (see the first chart for what the probability density function would look like), and I have another random variable Y of a similar shape but with different values for each normally distributed side.
X and Y are also correlated; here's an example of data that could be plausible:
      X    Y
1.    0  -20
2.   -5    2
3.  -30    6
4.    7   -2
5.    7    2
As you can see, this is simply meant to show that my random variables take either a small positive value (often) or a large negative value (rare), and that they have a certain covariance.
My problem is : I would like to be able to sample correlated and random values from these two distributions.
I could use Cholesky decomposition for generating correlated normally distributed random variables, but the random variables we are talking about here are not normal, but rather a mixture of a binomial and two normals.
Many thanks!
Note that you don't have a mixture of a binomial and two normals, but rather a mixture of two normals. Even though, for some reason, in your previous post you did not want to use a two-step generation process (first generate a Bernoulli variable telling you which component to sample from, and then sample from that component), that is typically what you would want to do with a mixture distribution. This process naturally generalizes to a mixture of two bivariate normal distributions: first pick a component, then generate a pair of correlated normal values. Your description does not make it clear whether you are fitting some data with this distribution or just trying to simulate it; the difficulty of obtaining the covariance matrices for the two components will depend on your situation.
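A minimal R sketch of that two-step process for a mixture of two bivariate normals (the means, covariance matrices, and mixing probability below are made-up illustrations, not fitted values):
library(MASS)  # mvrnorm for bivariate normal draws

n <- 1000
p <- 0.8                                       # illustrative mixing probability
mu1 <- c(5, 3);     Sigma1 <- matrix(c(4, 2, 2, 4), 2, 2)      # frequent, small positive
mu2 <- c(-25, -15); Sigma2 <- matrix(c(36, 20, 20, 36), 2, 2)  # rare, large negative

k <- rbinom(n, 1, p)                           # Bernoulli component indicator
samples <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("X", "Y")))
samples[k == 1, ] <- mvrnorm(sum(k), mu1, Sigma1)
samples[k == 0, ] <- mvrnorm(n - sum(k), mu2, Sigma2)
cor(samples[, "X"], samples[, "Y"])            # correlation of the mixture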
