Clustering : two different values for the within sum of square - r

I'm doing a clustering analysis and I have two issues :
I found two different values for the within sum of square with these 2 methods :
1/ First method founded here : http://www.statmethods.net/advstats/cluster.html
set.seed(180)
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 1:8)
wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
wss
[1] 2244832.0 1707497.8 1514193.9 1131349.7 990028.8 698772.0 683106.4 522783.8
2/ Second method
set.seed(180)
fit <- kmeans(mydata, 5)
fit$tot.withinss
[1] 857443.8
As you can see 990 028 !=857 443 even though I used "set.seed"
Is there a mistake in the formula of the Statmethods website ?
Lastly, sometimes the wss raise with the number of cluster. Is it ok or it's impossible ?

You use set.seed but you also do a lot with the random number generator before getting to kmeans(data, 5) in your first example. You are likely getting a different clustering solution. If you just look at sum(fit$withinss) it should match fit$tot.withinss for a given clustering solution. There is some randomness involved though so if you want the same thing you need to make sure you set the seed properly.

By setting the initial centers you effectively provide a manual seed.
k-means only uses the random seed during initial generation of the centers.
So in fact, you provide two different initializations, and as such you shouldn't be surprised to get different results. K-means only finds local optima.

Related

what if the FD steps varied w.r.t output/input

I am using the finite difference scheme to find gradients.
Lets say i have 2 outputs (y1,y2) and 1 input (x) in a single component. And in advance I know that the sensitivity of y1 with respect to x is not same as the sensitivity of y2 to x. And thus i could potentially have two different steps for those as in ;
self.declare_partials(of=y1, wrt=x, method='fd',step=0.01, form='central')
self.declare_partials(of=y2, wrt=x, method='fd',step=0.05, form='central')
There is nothing that stops me (algorithmically) but it is not clear what would openmdao gradient calculation exactly do in this case?
does it exchange information from the case where the steps are different by looking at the steps ratios or simply treating them independently and therefore doubling computational time ?
I just tested this, and it does the finite difference twice with the two different step sizes, and only saves the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, as the reason for using different stepsizes to resolve individual outputs is because you don't trust the accuracy of the outputs at the smaller (or large) stepsize.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference and 2 function calls for central difference.
However in this case, you have asked for two different step sizes for two different outputs, both with central difference. So here, you'll end up with 4 function calls to compute all the derivatives. dy1_dx will be computed using the step size of .01 and dy2_dx will be computed with a step size of .05.
There is no crosstalk between the two different FD calls, and you do end up with more function calls than you would have if you just specified a single step size via:
self.declare_partials(of='*', wrt=x, method='fd',step=0.05, form='central')
If the cost is something you can bear, and you get improved accuracy, then you could use this method to get different step sizes for different outputs.

Equation of rbfKernel in kernlab is different from the standard?

I have observed that kernlab uses rbfkernel as,
rbf(x,y) = exp(-sigma * euclideanNorm(x-y)^2)
but according to this wiki link, the rbf kernel should be of the form
rbf(x,y) = exp(-euclideanNorm(x-y)^2/(2*sigma^2))
which is also more intuitive since two close samples with a large kernel sigma value will lead to a higher similarity matching.
I am not sure what e1071 svm uses (native code libsvm?)
I hope someone can enlighten me on why there is a difference ? I caught this because I was initially using e1071 but switched to ksvm but saw inconsistent results for the two.
A small example for comparison
set.seed(123)
x <- rnorm(3)
y <- rnorm(3)
sigma <- 100
rbf <- rbfdot(sigma=sigma)
rbf(x, y)
exp( -sum((x-y)^2)/(2*sigma^2) )
I would expect the kernel value to be close to 1 (since x,y come from sigma=1, while kernel sigma=100). This is observed only in the second case.
I came across that discrepancy too and I wound up digging into the source to figure out if there was a typo in the documentation or what was going on exactly since sigma in the context of Gaussians traditionally goes as the standard deviation in the denominator right?
Here's the relevant source
**kernlab\R\kernels.R**
## Define the kernel objects,
## functions with an additional slot for the kernel parameter list.
## kernel functions take two vector arguments and return a scalar (dot product)
rbfdot<- function(sigma=1)
{
rval <- function(x,y=NULL)
{
if(!is(x,"vector")) stop("x must be a vector")
if(!is(y,"vector")&&!is.null(y)) stop("y must a vector")
if (is(x,"vector") && is.null(y)){
return(1)
}
if (is(x,"vector") && is(y,"vector")){
if (!length(x)==length(y))
stop("number of dimension must be the same on both data points")
return(exp(sigma*(2*crossprod(x,y) - crossprod(x) - crossprod(y))))
# sigma/2 or sigma ??
}
}
return(new("rbfkernel",.Data=rval,kpar=list(sigma=sigma)))
}
You can observe from their comment on sigma/2 or sigma ?? that they may perhaps be a bit confused about the convention to adopt, the presence of /2 would be consistent with the standard deviation form /(2*sigma), but I had to speculate about this discovery.
Now another corroborating piece of evidence is in the help page for ? rbfdot which reads...
sigma The inverse kernel width used by the Gaussian the Laplacian,
the Bessel and the ANOVA kernel
And that is consistent with the form they use with sigma in the numerator, since in the denominator it would scale proportionately with the width of the Gaussian right. So it indeed looks like they settled on the convention that is described in the Wikipedia article as the gamma form, where they say
An equivalent, but simpler, definition involves a parameter gamma =
-1/(2*sigma^2)
So the difference just seems to be a matter of adopting different but equivalent conventions. One motivator for the particular convention (which someone may confirm in a comment) may arise from issues of code reuse and consistency, where as you see the parameter is used by three other kernel forms that may have their parameters more traditionally set in the numerator. I'm not sure on that point however since I've never used those alternate kernels and am unfamiliar with each.

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following error: Quick-TRANSfer stage steps exceeded maximum (= 31834400), and although one can view the code at http://svn.r-project.org/R/trunk/src/library/stats/R/kmeans.R - I am unsure as to what is going wrong. I assume my problem has to do with the size of my dataset, but I would be grateful if someone could clarify once and for all what I can do to mitigate the issue.
I just had the same issue.
See the documentation of kmeans in R via ?kmeans:
The Hartigan-Wong algorithm
generally does a better job than either of those, but trying
several random starts (‘nstart’> 1) is often recommended. In rare
cases, when some of the points (rows of ‘x’) are extremely close,
the algorithm may not converge in the “Quick-Transfer” stage,
signalling a warning (and returning ‘ifault = 4’). Slight
rounding of the data may be advisable in that case.
In these cases, you may need to switch to the Lloyd or MacQueen algorithms.
The nasty thing about R here is that it continues with a warning that may go unnoticed. For my benchmark purposes, I consider this to be a failed run, and thus I use:
if (kms$ifault==4) { stop("Failed in Quick-Transfer"); }
Depending on your use case, you may want to do something like
if (kms$ifault==4) { kms = kmeans(X, kms$centers, algorithm="MacQueen"); }
instead, to continue with a different algorithm.
If you are benchmarking K-means, note that R uses iter.max=10 per default. It may take much more than 10 iterations to converge.
Had the same problem, seems to have something to do with available memory.
Running Garbage Collection before the function worked for me:
gc()
or reference:
Increasing (or decreasing) the memory available to R processes
#jlhoward's comment:
Try
kmeans(dataset, algorithm="Lloyd", ..)
I got the same error message, but in my case it helped to increase the number of iterations iter.max. That contradicts the theory of memory overload.

What's the lowest number R will present before rounding to 0?

I'm doing some statistical analysis with R software (bootstrapped Kolmogorov-Smirnov tests) of very large data sets, meaning that my p values are all incredibly small. I've Bonferroni corrected for the large number of tests that I've performed meaning that my alpha value is also very small in order to reject the null hypothesis.
The problem is, R presents me with p values of 0 in some cases where the p value is presumably so small that it cannot be presented (these are usually for the very large sample sizes). While I can happily reject the null hypothesis for these tests, the data is for publication, so I'll need to write p < ..... but I don't know what the lowest reportable values in R are?
I'm using the ks.boot function in case that matters.
Any help would be much appreciated!
.Machine$double.xmin gives you the smallest non-zero normalized floating-point number. On most systems that's 2.225074e-308. However, I don't believe this is a sensible limit.
Instead I suggest that in Matching::ks.boot you change the line
ks.boot.pval <- bbcount/nboots to
ks.boot.pval <- log(bbcount)-log(nboots) and work on the log-scale.
Edit:
You can use trace to modify the function.
Step 1: Look at the function body, to find out where to add additional code.
as.list(body(ks.boot))
You'll see that element 17 is ks.boot.pval <- bbcount/nboots, so we need to add the modified code directly after that.
Step 2: trace the function.
trace (ks.boot, quote(ks.boot.pval <- log(bbcount)-log(nboots)), at=18)
Step 3: Now you can use ks.boot and it will return the logarithm of the bootstrap p-value as ks.boot.pvalue. Note that you cannot use summary.ks.boot since it calls format.pval, which will not show you negative values.
Step 4: Use untrace(ks.boot) to remove the modifications.
I don't know whether ks.boot has methods in the packages Rmpfr or gmp but if it does, or you feel like rolling your own code, you can work with arbitrary precision and arbitrary size numbers.

Sample uniformly at random from an n-dimensional unit simplex

Sampling uniformly at random from an n-dimensional unit simplex is the fancy way to say that you want n random numbers such that
they are all non-negative,
they sum to one, and
every possible vector of n non-negative numbers that sum to one are equally likely.
In the n=2 case you want to sample uniformly from the segment of the line x+y=1 (ie, y=1-x) that is in the positive quadrant.
In the n=3 case you're sampling from the triangle-shaped part of the plane x+y+z=1 that is in the positive octant of R3:
(Image from http://en.wikipedia.org/wiki/Simplex.)
Note that picking n uniform random numbers and then normalizing them so they sum to one does not work. You end up with a bias towards less extreme numbers.
Similarly, picking n-1 uniform random numbers and then taking the nth to be one minus the sum of them also introduces bias.
Wikipedia gives two algorithms to do this correctly: http://en.wikipedia.org/wiki/Simplex#Random_sampling
(Though the second one currently claims to only be correct in practice, not in theory. I'm hoping to clean that up or clarify it when I understand this better. I initially stuck in a "WARNING: such-and-such paper claims the following is wrong" on that Wikipedia page and someone else turned it into the "works only in practice" caveat.)
Finally, the question:
What do you consider the best implementation of simplex sampling in Mathematica (preferably with empirical confirmation that it's correct)?
Related questions
Generating a probability distribution
java random percentages
This code can work:
samples[n_] := Differences[Join[{0}, Sort[RandomReal[Range[0, 1], n - 1]], {1}]]
Basically you just choose n - 1 places on the interval [0,1] to split it up then take the size of each of the pieces using Differences.
A quick run of Timing on this shows that it's a little faster than Janus's first answer.
After a little digging around, I found this page which gives a nice implementation of the Dirichlet Distribution. From there it seems like it would be pretty simple to follow Wikipedia's method 1. This seems like the best way to do it.
As a test:
In[14]:= RandomReal[DirichletDistribution[{1,1}],WorkingPrecision->25]
Out[14]= {0.8428995243540368880268079,0.1571004756459631119731921}
In[15]:= Total[%]
Out[15]= 1.000000000000000000000000
A plot of 100 samples:
alt text http://www.public.iastate.edu/~zdavkeos/simplex-sample.png
I'm with zdav: the Dirichlet distribution seems to be the easiest way ahead, and the algorithm for sampling the Dirichlet distribution which zdav refers to is also presented on the Wikipedia page on the Dirichlet distribution.
Implementationwise, it is a bit of an overhead to do the full Dirichlet distribution first, as all you really need is n random Gamma[1,1] samples. Compare below
Simple implementation
SimplexSample[n_, opts:OptionsPattern[RandomReal]] :=
(#/Total[#])& # RandomReal[GammaDistribution[1,1],n,opts]
Full Dirichlet implementation
DirichletDistribution/:Random`DistributionVector[
DirichletDistribution[alpha_?(VectorQ[#,Positive]&)],n_Integer,prec_?Positive]:=
Block[{gammas}, gammas =
Map[RandomReal[GammaDistribution[#,1],n,WorkingPrecision->prec]&,alpha];
Transpose[gammas]/Total[gammas]]
SimplexSample2[n_, opts:OptionsPattern[RandomReal]] :=
(#/Total[#])& # RandomReal[DirichletDistribution[ConstantArray[1,{n}]],opts]
Timing
Timing[Table[SimplexSample[10,WorkingPrecision-> 20],{10000}];]
Timing[Table[SimplexSample2[10,WorkingPrecision-> 20],{10000}];]
Out[159]= {1.30249,Null}
Out[160]= {3.52216,Null}
So the full Dirichlet is a factor of 3 slower. If you need m>1 samplepoints at a time, you could probably win further by doing (#/Total[#]&)/#RandomReal[GammaDistribution[1,1],{m,n}].
Here's a nice concise implementation of the second algorithm from Wikipedia:
SimplexSample[n_] := Rest## - Most## &[Sort#Join[{0,1}, RandomReal[{0,1}, n-1]]]
That's adapted from here: http://www.mofeel.net/1164-comp-soft-sys-math-mathematica/14968.aspx
(Originally it had Union instead of Sort#Join -- the latter is slightly faster.)
(See comments for some evidence that this is correct!)
I have created an algorithm for uniform random generation over a simplex. You can find the details in the paper in the following link:
http://www.tandfonline.com/doi/abs/10.1080/03610918.2010.551012#.U5q7inJdVNY
Briefly speaking, you can use following recursion formulas to find the random points over the n-dimensional simplex:
x1=1-R11/n-1
xk=(1-Σi=1kxi)(1-Rk1/n-k), k=2, ..., n-1
xn=1-Σi=1n-1xi
Where R_i's are random number between 0 and 1.
Now I am trying to make an algorithm to generate random uniform samples from constrained simplex.that is intersection between a simplex and a convex body.
Old question, and I'm late to the party, but this method is much faster than the accepted answer if implemented efficiently.
In Mathematica code:
#/Total[#,{2}]&#Log#RandomReal[{0,1},{n,d}]
In plain English, you generate n rows * d columns of randoms uniformly distributed between 0 and 1. Then take the Log of everything. Then normalize each row, dividing each element in the row by the row total. Now you have n samples uniformly distributed over the (d-1) dimensional simplex.
If found this method here: https://mathematica.stackexchange.com/questions/33652/uniformly-distributed-n-dimensional-probability-vectors-over-a-simplex
I'll admit, I'm not sure why it works, but it passes every statistical test I can think of. If anyone has a proof of why this method works, I'd love to see it!

Resources