Cluster analysis in R: How can I get deterministic results from pvclust?

pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.

This applies not only to cluster analysis: whenever randomness is involved, you can fix the seed of the random number generator so that you always get the same results.
Try:
set.seed(seed=123)
# your code here
The seed can be any integer, or something that can be converted to integer. And that's all.
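To make that concrete, here is a minimal sketch using a small bootstrap clustering helper as a stand-in for pvclust. boot_cluster is a hypothetical function and mtcars is only example data. It assumes all randomness comes from R's own generator in a single process, which is what set.seed() controls.

```r
# Stand-in for a bootstrap-based clustering step (hypothetical helper):
# resample the rows at random, then cluster the columns.
boot_cluster <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)   # the only random part
  hc <- hclust(dist(t(data[idx, ])), method = "average")
  cutree(hc, k = 3)
}

dat <- scale(mtcars)

set.seed(123)
run1 <- boot_cluster(dat)

set.seed(123)                # same seed again before the second run
run2 <- boot_cluster(dat)

identical(run1, run2)        # TRUE
```

The same pattern, set.seed() immediately before the call, should carry over to a sequential pvclust() run. A parallel run may need its own seeding mechanism, so check the package documentation for that case.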

I've only used k-means. There I had to set the number of 'runs' or iterations to a higher value than the default to get the same clusters on consecutive runs.
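A rough sketch of what that could look like with base R's kmeans(), assuming the 'runs' correspond to its nstart argument (number of random starts) and the iterations to iter.max. The iris data is only an example.

```r
x <- as.matrix(iris[, 1:4])

# One random start: the result can depend on the initial centres that get picked.
set.seed(1)
single <- kmeans(x, centers = 3, nstart = 1)

# Many random starts: kmeans() keeps the solution with the lowest
# total within-cluster sum of squares.
set.seed(1)
multi_a <- kmeans(x, centers = 3, nstart = 25, iter.max = 50)
set.seed(1)
multi_b <- kmeans(x, centers = 3, nstart = 25, iter.max = 50)

identical(multi_a$cluster, multi_b$cluster)   # TRUE: same seed, same result
c(single = single$tot.withinss, multi = multi_a$tot.withinss)
```

More starts make it more likely that consecutive runs land on the same solution, but only a fixed seed guarantees it.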

Related

Why should we use set.seed() before apply knn() in R?

When I read An Introduction To Statistical Learning, I am puzzled by the following passage:
We set a random seed before we apply knn() because if several
observations are tied as nearest neighbors, then R will randomly break
the tie. Therefore, a seed must be set in order to ensure
reproducibility of results.
Could anyone please tell me why is the result of KNN random?
The reason is that knn() draws random numbers whenever it has to break a tie between equally near neighbors, so the result can differ from run to run. If you call set.seed() before knn(), the sequence of random numbers is fixed, the ties are broken the same way every time, and the results do not change.
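A small sketch of the tie situation, assuming knn() from the class package (the one used in the book). The data is made up so that every test point lies exactly halfway between the two training points.

```r
library(class)  # provides knn()

# Two training points at -1 and 1, one class each (made-up data).
train <- matrix(c(-1, 1), ncol = 1)
cl <- factor(c("a", "b"))

# Every test point sits at 0, so both training points tie as nearest neighbour.
test <- matrix(rep(0, 20), ncol = 1)

set.seed(42)
pred1 <- knn(train, test, cl, k = 1)

set.seed(42)
pred2 <- knn(train, test, cl, k = 1)

identical(pred1, pred2)   # TRUE: same seed, ties broken the same way
table(pred1)              # how the ties were split between "a" and "b"
```

Without the set.seed() calls, the split between the two classes can change from one run to the next.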

Effect of setting seeds on an algorithm

I am writing R code in which I use set.seed() throughout the program to generate the data, then use that data in a function, plot the function, and finally use optim to find the minimum. The issue is that the graph of the function changes if I change the seed value, and sometimes it is not even concave but looks exponential.
I am not able to understand why this is happening and how I can fix it. If anyone can provide me with any reference to read in this subject or any suggestions as to what can be done, that will be great.
Thanks in advance
set.seed() configures the random number generator to start from that seed. This may be a bit more complicated, depending on the precise implementation, but the effects are always the same: The sequence of numbers will be identical.
This is useful in a number of applications where you want some randomness, but you want to get the same result if you re-run the code. Say for example you need to randomly sample your data, but since you are debugging, it's useful if you get the same sample so that the bugs don't disappear on you.
Also if you want other people to replicate the results, you simply pick some random number as the seed and tell them that you used that seed. Anything in the algorithm based on random numbers will behave the same because you are both using the same sequence of numbers.
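A short illustration of that behaviour in base R. The seed value and the use of mtcars are arbitrary choices for the example.

```r
set.seed(2024)
a <- runif(3)

set.seed(2024)          # reset to the same seed
b <- runif(3)
identical(a, b)         # TRUE: the sequence restarts from the same point

c3 <- runif(3)          # no reset: the generator simply continues
identical(b, c3)        # FALSE

# Reproducible random sample of rows, e.g. while debugging
set.seed(2024)
rows_debug <- sample(nrow(mtcars), 5)
set.seed(2024)
rows_again <- sample(nrow(mtcars), 5)
identical(rows_debug, rows_again)   # TRUE
```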
For your graph problem you need to share some code so that people understand what you are doing. It's very hard to guess what went wrong. At first glance it seems that your algorithm is very strongly influenced by the random numbers (usually not a good sign).
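One way to check how strongly a result depends on the seed is to repeat the whole pipeline for several seeds and compare the optima. The sketch below is only an assumed stand-in for the asker's setup: fit_once is a hypothetical helper, the data is simulated from a normal distribution, and the objective is a simple sum of squares.

```r
# Hypothetical pipeline: simulate data, build an objective from it, minimise it.
fit_once <- function(seed) {
  set.seed(seed)
  x <- rnorm(50, mean = 2)            # simulated data; depends on the seed
  f <- function(m) sum((x - m)^2)     # objective built from that data
  optim(par = 0, fn = f, method = "BFGS")$par
}

seeds <- 1:10
estimates <- sapply(seeds, fit_once)
round(range(estimates), 2)            # spread of the minima across seeds
```

If the minima (or the shape of the plotted function) vary wildly across seeds, the problem is more likely in the model or the sample size than in the seed itself.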
Put simply, if you set a seed and then draw a random number, that number will always be the same. If you do not set a seed, the number will be different every time you draw one. The seed lets you replicate your experiment.

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. It is driving me crazy because I cannot solve it with my background alone. I'm a biologist.
Briefly, I ran LASSO using the R library "penalized". In particular I used the opt1D function with around 500 simulations on a (numerical) data.frame with around 30 columns, which are the biomarkers (gene expression) I want to test, and 3000 rows, which are people, of whom around 50 are tumours and all the others are normals.
Unfortunately, with L1 regularization all coefficients, really all of them, in all 500 simulations are 0. If I check the L2 matrix of coefficients, they are close to 0. My point is that I cannot believe that none of my biomarkers is able to distinguish between normals and tumours.
I don't know if what I have done is all I can to check for the discriminatory potential of my molecules. Is there something else I can do to understand why are they all 0 and also is there something else I can do to verify that really they are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note I would first run PCA/PCoA and see whether or not your genes separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also if you have relatively little experience with R I would suggest using a linear modeling package such as Limma since it has excellent documentation and many examples that are easy to follow.
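A rough sketch of both suggestions in base R, using simulated data rather than real expression values. The gene names g1 to g5, the sample size, and the effect size are all made up for illustration, and glm() and prcomp() are used here as plain stand-ins for an unpenalized fit and a PCA.

```r
set.seed(1)
n <- 300
genes <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
names(genes) <- paste0("g", 1:5)

# Simulated class label: only g1 carries signal (an assumption of this example).
tumour <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * genes$g1))
dat <- cbind(genes, tumour = tumour)

# 1. Unpenalized logistic regression as a baseline before any regularization
fit <- glm(tumour ~ ., data = dat, family = binomial)
round(coef(summary(fit)), 3)

# 2. PCA on the biomarkers, then compare the scores between the two classes
pca <- prcomp(genes, center = TRUE, scale. = TRUE)
scores <- data.frame(pca$x[, 1:2], tumour = factor(tumour))
aggregate(cbind(PC1, PC2) ~ tumour, data = scores, FUN = mean)
```

If neither the unpenalized coefficients nor the PCA scores show any difference between the classes, all-zero LASSO coefficients are less surprising.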

R `optim` returns different results if run in parallel

Is there any possible explanation for multiple optim instances with set starting values to return different results if run in parallel or one after another on a single core?
Basically, I do a rolling forecast with refitting of the model each time, so I can easily parallelize over the rolling windows, but the results are different if I do not parallelize...
Sadly, I don't have a simple reproducible example. I know that if I link to different BLAS then the results differ, so is there anything like different numerical precision / set of libraries used, that might cause this?

R - 'princomp' can only be used with more units than variables

I am using R software (R commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying kmeans cluster and plot on a graph.
"'princomp' can only be used with more units than variables"
I then created a test doc of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my cluster. When I view my data set after performing kmeans on it I can see the extra results column which shows which clusters they belong to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, been wrecking my head for a week now.
Thanks guys.
The problem is that you have more variables than sample points and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than variables.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking you can look at the problems like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
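A minimal sketch of the difference, using a random 10 x 11 matrix to mirror the test document with one extra column. The data and the choice of two clusters are made up for the example.

```r
set.seed(1)
x <- matrix(rnorm(10 * 11), nrow = 10, ncol = 11)   # 10 units, 11 variables

# princomp() refuses data with more variables than units
msg <- tryCatch(
  { princomp(x); "no error" },
  error = function(e) conditionMessage(e)
)
msg

# prcomp() handles the same matrix
pca <- prcomp(x)
dim(pca$x)   # 10 rows, 10 columns

# Cluster on the original data, then plot the clusters on the first two PCs
km <- kmeans(x, centers = 2, nstart = 10)
# plot(pca$x[, 1:2], col = km$cluster, pch = 19)
```

Uncomment the plot() line to draw the clusters. This does by hand what a princomp-based cluster plot would otherwise do.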
