Scaling of covariance matrices - math

For the question "Ellipse around the data in MATLAB", in the answer given by Amro, he says the following:
"If you want the ellipse to represent
a specific level of standard
deviation, the correct way of doing is
by scaling the covariance matrix"
and the code to scale it was given as
STD = 2; %# 2 standard deviations
conf = 2*normcdf(STD)-1; %# covers around 95% of population
scale = chi2inv(conf,2); %# inverse chi-squared with dof=#dimensions
Cov = cov(X0) * scale;
[V D] = eig(Cov);
I don't understand the first three lines of the above code snippet. How is the scale calculated by chi2inv(conf,2), and what is the rationale behind multiplying it with the covariance matrix?
Additional Question:
I also found that if I scale it with 1.5 STD, i.e. the 86th percentile, the ellipse covers all of the points (my point sets are clumped together) in almost all cases. On the other hand, if I scale it with 3 STD, i.e. the 99th percentile, the ellipse is far too big. How can I choose an STD that just tightly covers the clumped points?
Here is an example:
The inner ellipse corresponds to 1.5 STD and the outer to 2.5 STD. Why does 1.5 STD tightly cover the clumped white points? Is there any approach or reason for choosing it?

The objective of displaying an ellipse around the data points is to show a confidence region, or in other words, "how much of the data is within a certain standard deviation away from the mean".
In the above code, he has chosen to display an ellipse that covers about 95% of the data points. For a normal distribution, ~68% of the data lies within 1 s.d. of the mean, ~95% within 2 s.d. and ~99.7% within 3 s.d. (you can easily verify these numbers by calculating the area under the curve). Hence the value STD = 2; you'll find that conf is approximately 0.95.
The distance of a data point from the centroid of the data goes something like (xi^2 + yi^2)^0.5, ignoring coefficients. Sums of squares of Gaussian random variables follow a chi-square distribution, so to get the corresponding 95th percentile he uses the inverse chi-square function with 2 degrees of freedom, as there are two variables.
Lastly, the rationale behind multiplying by the scaling constant follows from the fact that for a square matrix A with eigenvalues a1,...,an, the eigenvalues of kA, where k is a scalar, are simply ka1,...,kan. The eigenvalues determine the lengths of the ellipse's major/minor axes (the semi-axis lengths are the square roots of the eigenvalues), so scaling the ellipse out to the 95th percentile is equivalent to multiplying the covariance matrix by the scaling factor.
EDIT
Cheng, although you might already know this, I suggest that you also read this answer to a question on randomness. Consider a Gaussian random variable with zero mean, unit variance. The PDF of a collection of such random variables looks like this
Now, if I were to take two such collections of random variables, square them separately and add them to form a single collection of a new random variable, its distribution looks like this
This is the chi-square distribution with 2 degrees of freedom (since we added two collections).
The equation of the ellipse in the above code can be written as x^2/a^2 +y^2/b^2=k, where x,y are the two random variables, a and b are the major/minor axes, and k is some scaling constant that we need to figure out. As you can see, the above can be interpreted as squaring and adding two collections of Gaussian random variables, and we just saw above what its distribution looks like. So, we can say that k is a random variable that is chi-square distributed with 2 degrees of freedom.
Now all that needs to be done is to find a value of k such that 95% of the data is within it. Just like the 1 s.d., 2 s.d., 3 s.d. percentiles that we're familiar with for Gaussians, the percentile of the chi-square distribution with 2 degrees of freedom corresponding to conf (≈ 0.954 for 2 s.d.) is around 6.18. This is what Amro obtains from the chi2inv function. He could just as well have written scale = chi2inv(0.95,2) (≈ 5.99) and the result would have been nearly the same. It's just that talking in terms of n s.d. away from the mean is intuitive.
Just to illustrate, here's a PDF of the chi-square distribution above, with 95% of the area < some x shaded in red. This x is ~6.18.
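If it helps to see the whole procedure in one place, here is a minimal R sketch of the same idea; X0 and the plotting details are made up for illustration, and pnorm/qchisq play the roles of normcdf/chi2inv:
set.seed(1)
X0 <- cbind(rnorm(200), rnorm(200, sd = 2))   # example bivariate data
STD   <- 2                       # 2 standard deviations
conf  <- 2 * pnorm(STD) - 1      # ~0.954 of the population
scale <- qchisq(conf, df = 2)    # inverse chi-square with dof = #dimensions, ~6.18
S <- cov(X0) * scale             # scaled covariance matrix
e <- eigen(S)                    # eigenvectors give axis directions, eigenvalues the squared semi-axis lengths
theta <- seq(0, 2 * pi, length.out = 200)
ell <- t(colMeans(X0) + e$vectors %*% diag(sqrt(e$values)) %*% rbind(cos(theta), sin(theta)))
plot(X0, asp = 1)
lines(ell, col = "red", lwd = 2)  # ellipse covering roughly 95% of the points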
Hope this helped.

Related

Preferentially Sampling Based upon Value Size

So, this is something I think I'm complicating far too much, but it also has some of my colleagues stumped as well.
I've got a set of areas represented by polygons, and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right-skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the (x - min(x)) / (max(x) - min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest area is almost always the one sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure on how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n= 500
df <- data.frame("ID" = 1:n,"AREA" = replicate(n,sum(rexp(n=8,rate=0.1))))
# Generate the sampling probability somehow based upon the AREA values with smaller areas having higher sample probability::
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size=1, prob=prob_vector)
There is no one best solution for this question as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right skewed distribution of areas (0-100 units) and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000,1,3)*40
area.dist[area.dist>100]=100
hist(area.dist,main="Probability functions")
area = seq(0,100,0.1)
prob_vector1 = 1-(area-min(area))/(max(area)-min(area)) ## linear
prob_vector2 = .8-(.6*(area-min(area))/(max(area)-min(area))) ## low slope
prob_vector3 = 1/(1+((area-min(area))/(max(area)-min(area))))**4 ## strong curve
prob_vector4 = .4/(.4+((area-min(area))/(max(area)-min(area)))) ## low curve
legend("topright",c("linear","low slope","strong curve","low curve"), col = c("red","green","blue","orange"),lwd=1)
lines(area,prob_vector1*500,col="red")
lines(area,prob_vector2*500,col="green")
lines(area,prob_vector3*500,col="blue")
lines(area,prob_vector4*500,col="orange")
The output is:
The red line is your intuitive (linear) solution; the other ones are adjustments that weaken it. Just change the numbers in the probability function until you get one that fits your expectations.
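Applied to the questioner's df, the "strong curve" variant (or any of the other curves) translates directly into the prob_vector asked for; a sketch, with the exponent 4 an arbitrary choice:
n <- 500
df <- data.frame(ID = 1:n, AREA = replicate(n, sum(rexp(n = 8, rate = 0.1))))
scaled_area <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))  # rescale to [0, 1]
prob_vector <- 1 / (1 + scaled_area)^4   # "strong curve": smaller areas get higher weight
s <- sample(df$ID, size = 1, prob = prob_vector)   # sample() normalizes the weights itself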

Transforming a variable (with many data points close to the max and min) into uniform distribution?

I need to build some models in R and am having trouble with some of my predictors. They are distributed between 0 and 1 and give the percentage of land-cover types, e.g. 0.3 means 30% of the area is covered by forest.
Here are a histogram and a density plot of one of them:
I want to transform these predictors towards a uniform distribution within R (it does not have to be perfect). I don't know what transformation to use since there are many data points close to the maximum and the minimum of them.
Any help is appreciated, thanks!
It's not clear to me why you need to do this - most statistical methods don't make demands about the distribution of the predictor variables - but
rank(x)/(length(x)+1)
will give you a new variable that's uniformly distributed between 0 and 1 (and is never exactly 0 or 1)
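A quick way to see that this works, using made-up data with the described pile-up near 0 and 1 (the rbeta call is just an assumed stand-in for the land-cover fractions):
x <- rbeta(1000, 0.4, 0.4)       # toy predictor: many values close to 0 and 1
u <- rank(x) / (length(x) + 1)   # rank-based transform; never exactly 0 or 1
hist(u)                          # roughly uniform on (0, 1)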

Probability Density (pdf) extraction in R

I am attempting to reproduce the above function in R. The numerator has the product of the probability density function (pdf) of "y" at time "t". The omega_t is simply the weight (which for now lets ignore). The i stands for each forecast of y (along with the density) derived for model_i, at time t.
The denominator is the integral of the above product. My question is: how do I estimate the densities? To get the density of a variable one needs some data points. So far I have this:
y<-c(-0.00604,-0.00180,0.00292,-0.0148)
forecastsy_model1<-c(-0.0183,0.00685) # respectively time t=1 and t=2 of the forecasts
forecastsy_model2<-c(-0.0163,0.00931) # similarly
all.y.1<-c(y,forecastsy_model1) #together in one vector
all.y.2<-c(y,forecastsy_model2) #same
However, I am not sure how to extract the density of x1 for time t=1 or t=6 in order to compute the products. I have considered finding the estimated density using this:
dy1<-density(all.y.1)
which(dy1$x==0.00685)
integer(0)  # length(dy1$x) is 512
with dy1$x containing the n coordinates of the points where the density is estimated, according to the documentation. Shouldn't n be 6, or shouldn't dy1$x at least contain the points of y that I supplied? What is the correct way to extract the density (pdf) of y?
There is an n argument in density which defaults to 512. density returns estimated density values on a relatively dense grid so that you can plot the density curve. The grid points are determined by the range of your data (plus some extension) and the value of n, and they form an evenly spaced grid. Your sampling locations will generally not lie exactly on this grid.
You can use linear interpolation to get density value anywhere covered by this grid:
Find the probability density of a new data point using "density" function in R
Exact kernel density value for any point in R
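For example, a small sketch of that interpolation using the questioner's vectors and approxfun (the linked answers describe the same idea in more detail):
y <- c(-0.00604, -0.00180, 0.00292, -0.0148)
forecastsy_model1 <- c(-0.0183, 0.00685)
all.y.1 <- c(y, forecastsy_model1)
dy1 <- density(all.y.1)           # density estimated on a 512-point grid
f1 <- approxfun(dy1$x, dy1$y)     # linear interpolation between the grid points
f1(0.00685)                       # estimated density at the forecast value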

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focused forum. However, as I'm planning to implement this in R, I figured it would be best to post it here. Apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning each a continuous (non-integer) value in line with a kernel density estimate derived from empirical (discrete) data, with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
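As a sketch of that last step, continuing from the dat, dens and dens.func objects above, you could evaluate dens.func on an arbitrary fine grid of your own choosing and sample from that:
new.x <- seq(min(dat), max(dat), length.out = 1e5)   # your own grid, not tied to dens$x
samp <- sample(new.x, 250000, replace = TRUE, prob = dens.func(new.x))
plot(dens, lwd = 2)
lines(density(samp), col = "blue", lwd = 2)          # the samples reproduce the estimated density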

Combining two normal random variables

Suppose I have the following two random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two and knowing that : X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z : 0.9 * 6 + 0.1 * -42 = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
GenerateRandomNormalValue(6, 3.5);
else
GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add up 5 consecutive values drawn from Z. What would be the probability of ending up with a number higher than 10?
"But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal."
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Alternatively you can do rejection sampling (or variants, e.g. importance sampling), but this is costly and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (using quasi-random sequences, for instance).
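A sketch of that mixture density rho in R, with the means and standard deviations taken from the question:
rho <- function(z) 0.9 * dnorm(z, mean = 6, sd = 3.5) + 0.1 * dnorm(z, mean = -42, sd = 5)
curve(rho, from = -60, to = 20)   # bimodal: a big bump near 6 and a small one near -42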
If a random variable is denoted x=(mean,stdev) then the following algebra applies
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
Link: Normal Distribution (see the Miscellaneous section, item 1).
PS: Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
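A quick simulation check of this algebra in R; note that it describes the weighted sum 0.9*X + 0.1*Y of independent normals, which is not the same thing as the mixture described in the question (see the next answer):
z <- 0.9 * rnorm(1e6, mean = 6, sd = 3.5) + 0.1 * rnorm(1e6, mean = -42, sd = 5)
mean(z)   # approximately 1.2
sd(z)     # approximately sqrt((0.9*3.5)^2 + (0.1*5)^2) = 3.19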
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number), generate a histogram of the results by splitting them into bins, and divide the count in each bin by N (1,000,000 in my example). This will leave you with an approximation of the PDF of Z at every bin.
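A sketch of that simulation in R, using the parameters from the question:
N <- 1e6
z <- ifelse(runif(N) < 0.9, rnorm(N, 6, 3.5), rnorm(N, -42, 5))  # draw from the mixture
h <- hist(z, breaks = 200, plot = FALSE)
pmf.per.bin <- h$counts / N   # probability of landing in each bin
pdf.approx <- h$density       # same thing divided by the bin width, approximating the PDF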
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Say you have a distribution function f(x) whose support ranges from 0 to 1. You could calculate a random number based on the distribution by computing the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now generate a random number between 0 and A; call that number r. Then find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can calculate analytic solutions for all of this, but in the worst case you could use numerical techniques to approximate the effect.
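A rough numerical version of this inverse-CDF idea for the mixture density from the earlier answer (the grid range and step size are arbitrary choices):
rho <- function(z) 0.9 * dnorm(z, 6, 3.5) + 0.1 * dnorm(z, -42, 5)
grid <- seq(-70, 25, by = 0.01)
cdf <- cumsum(rho(grid)) * 0.01                   # crude numerical integral of the density
inv <- approxfun(cdf / max(cdf), grid, rule = 2)  # invert the (normalized) CDF by interpolation
z <- inv(runif(10000))                            # inverse-transform samples from Z
hist(z, breaks = 100)                             # reproduces the bimodal shape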

Resources