Interpreting metaMDS procrustes rotation - r

Does the metaMDS function in vegan rotate the ordination solution so that the first axis explains the most variance? If not, is there a way to achieve this? Here is part of the output from my run:
Run 20 stress 0.09957583
... Procrustes: rmse 0.0001349268 max resid 0.0009665635
... Similar to previous best
I am also unsure about how to interpret the procrustes data. What do the values for RMSE and max residual represent?
Thanks!

Yes, it does. Documentation (?metaMDS) expresses it this way:
"Principal components rotate the configuration so that the variance of points is maximized on first dimension"
The output also prints when PC rotation was used:
Scaling: centring, PC rotation, halfchange scaling
However, this has nothing to do with the Procrustes rotation that is used to assess the similarity of solutions during the iteration steps. PC rotation only concerns the final returned result.
About the interpretation of the RMSE and max residual: they are statistics that compare two solutions during iteration. RMSE is a kind of average difference between the configurations, and the max residual is the largest difference for any single point. If they are small, two iterations yielded essentially the same result. To see what they mean, run metaMDS with the option plot = TRUE, and every Procrustes comparison will be plotted as the analysis runs. The blue arrows in those plots show the differences between solutions: RMSE is an averaged arrow length, and the max residual is the length of the longest arrow. If you hardly see any arrows, the two solutions are so similar that the differences are not visible.
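To make this concrete, here is a minimal sketch using the dune dataset that ships with vegan (the dataset and seed are my own choices, not from the question):
library(vegan)
data(dune)                     # example community data bundled with vegan
set.seed(42)                   # arbitrary seed, for reproducibility
## plot = TRUE traces a Procrustes overlay each time a new random start
## is compared against the best solution found so far
sol1 <- metaMDS(dune, k = 2, trymax = 20, plot = TRUE)
## the same comparison can be done explicitly for two independent fits:
sol2 <- metaMDS(dune, k = 2, trymax = 20)
pro <- procrustes(sol1, sol2)
plot(pro)                      # blue arrows show the per-point differences
sqrt(mean(residuals(pro)^2))   # the "rmse" figure from the trace
max(residuals(pro))            # the "max resid" figure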

Related

How to average graph functions and find the confidence band in R

I'm using the 'spatstat' package in R and have obtained a set of Ripley's K functions (or L functions). I want a good way to average this set of curves into a single mean line, and to plot the standard deviation or a confidence band around that average.
So far I've tried:
env.A <- envelope(A, fun=Lest, correction=c("Ripley"), nsim=99, rank=1, global=TRUE)
Aa <- env.A
avg <- eval.fv((Aa+Bb+Cc+Dd+Ee+Ff+Gg+Hh+Ii+Jj+Kk+Ll+Mm+Nn+Oo+Pp+Qq+Rr+Ss+Tt+Uu+Vv+Ww+Xx)/24)
plot(avg, xlim=c(0,200), . - r ~ r, ylab='', legend='')
With this, I got the average line from the data set.
However, I'm now stuck on finding the confidence interval around this average line.
Does anyone know a good way to do this?
The help file for envelope explains how to do this.
E <- envelope(A, Lest, correction="Ripley", nsim=100, VARIANCE=TRUE)
plot(E, . - r ~ r)
See help(envelope) for more explanation.
In this example, the average or middle curve is computed using a theoretical formula, because the simulations are generated from Complete Spatial Randomness, and the theoretical value of the L function is known. If you want the middle curve to be determined by the sample averages instead, set use.theo = FALSE in the call to envelope.
May I also point out that the bands you get from envelope are not confidence intervals. A confidence interval would be centred around the estimated L function of the data point pattern A. The bands you get from the envelope command are centred around the mean of the simulated curves. They are significance bands, and their interpretation is tied to a statistical significance test. This is also explained in the help file.
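For the original goal, a band around the average of the 24 curves, one direct approach (a sketch under my own assumptions, not from the help file) is to stack the L-function estimates and take pointwise means and standard deviations; the result is a descriptive spread band, not a formal confidence interval:
library(spatstat)
## assumption: pats holds the point patterns (A, B, ..., X in the question)
## and they share the same window, so the r grids of the L estimates match
pats <- list(A, B, C)
Ls   <- lapply(pats, Lest, correction = "Ripley")
r    <- Ls[[1]]$r                        # common distance grid
mat  <- sapply(Ls, function(f) f$iso)    # pointwise L(r), isotropic correction
m    <- rowMeans(mat)                    # average curve
s    <- apply(mat, 1, sd)                # pointwise standard deviation
plot(r, m - r, type = "l", ylab = "L(r) - r")
lines(r, m - r + 2 * s, lty = 2)         # mean +/- 2 SD band
lines(r, m - r - 2 * s, lty = 2)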

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this in R, I figured it would be best to post it here. My apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning each a continuous (non-integer) value in line with a kernel density estimate derived from empirical (discrete) data, with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather on the continuous ones, as it's really important that a value can be, say, 2.89 rather than 3 or 2. New values would be assigned based on the probabilities depicted in the plot: the most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish, i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens, the density estimate of our original data (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can make the grid for the density estimate finer and finer, but you'll still be limited to density values at the grid points used for the estimate (i.e., the values of dens$x). If you really need the density at arbitrary data values, you can create an approximation function: still create the density estimate, at whatever bandwidth and grid size is necessary to capture the structure of the data, and then build a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
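As an aside, a Gaussian kernel density estimate can also be sampled exactly, with no grid at all, by resampling the original data and adding Gaussian noise whose standard deviation equals the bandwidth; this is a standard identity for Gaussian KDEs, offered here as an alternative sketch:
set.seed(4396)
dat <- round(rnorm(1000, 100, 10))   # same fake data as above
bw  <- density(dat)$bw               # bandwidth chosen by density()
## each draw = a resampled data point jittered by the kernel, so the
## values are fully continuous and follow the Gaussian KDE exactly
kde.samp <- sample(dat, 250000, replace = TRUE) + rnorm(250000, 0, bw)
plot(density(dat), lwd = 2)
lines(density(kde.samp), col = "red", lwd = 2)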

Convergence of R density() function to a delta function

I'm a bit puzzled by the behavior of the R density() function in an edge case...
Suppose I add more and more points with x=0 into a simulated data set. What I expect is that the density estimate will very quickly converge (I'm being deliberately vague about what that means...) to a delta function at x=0. In practice, the fit certainly gets narrower, but very slowly, as shown by this sequence of plots:
plot(density(c(0,0)), xlim=c(-2,2))
plot(density(c(0,0,0,0)), xlim=c(-2,2))
plot(density(c(rep(0,10000))), xlim=c(-2,2))
plot(density(c(rep(0,10000000))), xlim=c(-2,2))
But if you add a tiny bit of noise to the simulated data, the behavior is much better:
plot(density(0.0000001*rnorm(10000000) + c(rep(0,10000000))), xlim=c(-2,2))
Just let sleeping dogs lie? Or am I missing something about the usage of density()?
Per ?bw.nrd0, the default bandwidth selector for density:
bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman's ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31)) unless the quartiles coincide when a positive result will be guaranteed.
When your data are constant, the quartiles coincide, so the last clause, guaranteeing a positive result, kicks in. This means that the chosen bandwidth is not a continuous function of the spread of the data at zero spread.
To illustrate:
> bw.nrd0(rep(0, 1e6))
[1] 0.05678616
> bw.nrd0(rnorm(1e6, s=1e-6))
[1] 5.672872e-08
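If you simply want a narrower estimate for degenerate data, one workaround (my suggestion, not part of the answer above) is to bypass the rule-of-thumb selector and supply an explicit bandwidth:
## with an explicit bandwidth, the estimate is as narrow as you ask for
plot(density(rep(0, 1e4), bw = 1e-3), xlim = c(-2, 2))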
Actually (...tail between legs...) I now realize that my entire question was misguided. Being fairly new to R, I had instantly assumed that density() tries to fit Gaussians of different widths to the data points, optimizing both the number of Gaussians and their individual widths. But in fact it is doing something much simpler. It just smears out each data point, and adds up the smears to give a smoothed estimate of the data. density() is just a simple smoothing algorithm. So, yes indeed, RTFM :)

How to cluster curves with k-means?

I want to cluster some curves which contain daily click rates.
The dataset is click rate data in time series.
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure the similarity of two curves when using k-means.
Is there any paper for this purpose or some library?
For similarity, you could use any kind of time-series distance. Many of these perform alignment, even for sequences of different lengths.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It actually does not use distance for assignment, but the least sum of squares (which happens to be squared Euclidean distance), i.e. variance.
The mean must be consistent with this objective. It's not hard to see that the mean also minimizes the sum of squares. This guarantees convergence of k-means: in each single step (both assignment and mean update), the objective is reduced, thus it must converge after a finite number of steps (as there are only a finite number of discrete assignments).
But what is the mean of multiple time series of different length?
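If your curves all live on the same time grid, so each one is a plain fixed-length vector, ordinary k-means is well-defined; a small sketch with made-up data:
set.seed(1)
## assumption: each row is one curve of 24 hourly click rates
curves <- rbind(matrix(rnorm(50 * 24, mean = 0.2, sd = 0.05), nrow = 50),
                matrix(rnorm(50 * 24, mean = 0.5, sd = 0.05), nrow = 50))
km <- kmeans(curves, centers = 2, nstart = 10)
table(km$cluster)                   # the two groups of 50 separate cleanly
matplot(t(km$centers), type = "l",  # cluster mean curves
        xlab = "hour", ylab = "click rate")
For series of different lengths or with misalignment, pair a time-series distance such as DTW with a method that accepts arbitrary distances, e.g. k-medoids (PAM) or hierarchical clustering, rather than k-means.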

Scaling of covariance matrices

For the question "Ellipse around the data in MATLAB", in the answer given by Amro, he says the following:
"If you want the ellipse to represent a specific level of standard deviation, the correct way of doing it is by scaling the covariance matrix"
and the code to scale it was given as
STD = 2; %# 2 standard deviations
conf = 2*normcdf(STD)-1; %# covers around 95% of population
scale = chi2inv(conf,2); %# inverse chi-squared with dof=#dimensions
Cov = cov(X0) * scale;
[V D] = eig(Cov);
I don't understand the first 3 lines of the above code snippet. How is the scale calculated by chi2inv(conf,2), and what is the rationale behind multiplying it with the covariance matrix?
Additional Question:
I also found that if I scale it with 1.5 STD (i.e. ≈86.6% coverage), the ellipse covers all of the points in almost all cases, since my point sets are clumped together. On the other hand, if I scale it with 3 STD (i.e. ≈99.7% coverage), the ellipse is far too big. How can I choose an STD that just tightly covers the clumped points?
Here is an example:
The inner ellipse corresponds to 1.5 STD and the outer to 2.5 STD. Why does 1.5 STD tightly cover the clumped white points? Is there an approach or reason for choosing it?
The objective of displaying an ellipse around the data points is to show a confidence region, or in other words, "how much of the data is within a certain standard deviation away from the mean".
In the above code, he has chosen to display an ellipse that covers roughly 95% of the data points. For a normal distribution, ~68% of the data lies within 1 s.d. of the mean, ~95% within 2 s.d., and ~99.7% within 3 s.d. (you can verify this by calculating the area under the curve). Hence the value STD = 2; you'll find that conf is approximately 0.954.
The distance of a data point from the centroid of the data is something like (xi^2 + yi^2)^0.5, ignoring coefficients. Sums of squares of independent standard Gaussian random variables follow a chi-square distribution, and hence, to get the radius covering the corresponding fraction of the data, he uses the inverse chi-square function with 2 degrees of freedom, as there are two variables.
Lastly, the rationale behind multiplying the scaling constant follows from the fact that for a square matrix A with eigenvalues a1,...,an, the eigenvalues of a matrix kA, where k is a scalar is simply ka1,...,kan. The eigenvalues give the corresponding lengths of the major/minor axis of the ellipse, and so scaling the ellipse or the eigenvalues to the 95%tile is equivalent to multiplying the covariance matrix with the scaling factor.
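A quick numeric check of the scaling constant, written in R to match the rest of this page (pnorm and qchisq are the R counterparts of MATLAB's normcdf and chi2inv):
STD   <- 2
conf  <- 2 * pnorm(STD) - 1     # ~0.9545, the Gaussian 2-s.d. coverage
scale <- qchisq(conf, df = 2)   # ~6.18, squared Mahalanobis radius for the ellipse
qchisq(0.95, df = 2)            # ~5.99, the exact 95th percentile, for comparison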
EDIT
Cheng, although you might already know this, I suggest that you also read this answer to a question on randomness. Consider a Gaussian random variable with zero mean, unit variance. The PDF of a collection of such random variables looks like this
Now, if I were to take two such collections of random variables, square them separately and add them to form a single collection of a new random variable, its distribution looks like this
This is the chi-square distribution with 2 degrees of freedom (since we added two collections).
The equation of the ellipse in the above code can be written as x^2/a^2 +y^2/b^2=k, where x,y are the two random variables, a and b are the major/minor axes, and k is some scaling constant that we need to figure out. As you can see, the above can be interpreted as squaring and adding two collections of Gaussian random variables, and we just saw above what its distribution looks like. So, we can say that k is a random variable that is chi-square distributed with 2 degrees of freedom.
Now all that needs to be done is to find a value for k such that the desired fraction of the data is within it. Just like the 1 s.d., 2 s.d., 3 s.d. coverages we're familiar with for Gaussians, the value covering ~95.45% (the Gaussian 2 s.d. coverage) for a chi-square distribution with 2 degrees of freedom is around 6.18. This is what Amro obtains from the chi2inv function. Note that the exact 95th percentile would be chi2inv(0.95, 2) ≈ 5.99; he works with conf = 2*normcdf(STD)-1 instead because talking in terms of n s.d. away from the mean is intuitive.
Just to illustrate, here's a PDF of the chi-square distribution above, with ~95.45% of the area below some x shaded in red. This x is ~6.18.
Hope this helped.
