I am using R (via R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I get the following error when trying to run k-means clustering and plot the result:
"'princomp' can only be used with more units than variables"
I then created a test data set of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after performing k-means on it, I can see the extra results column showing which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, I've been racking my brain over this for a week now.
Thanks guys.
The problem is that you have more variables than sample points, so the principal component analysis used for the plot is failing.
The help file for princomp explains this (see ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than variables.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or have at most a single 1. And this is optimal, so PCA will do it! But it is not very helpful.
You can use prcomp instead of princomp.
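A minimal sketch of the difference, on a made-up matrix shaped like the 10-row, 11-column test case above:

set.seed(1)
m <- matrix(rnorm(10 * 11), nrow = 10, ncol = 11)

# princomp(m)   # fails: "'princomp' can only be used with more units than variables"
p <- prcomp(m)  # works; with centering, at most 9 non-trivial components remain

plot(p$x[, 1], p$x[, 2], xlab = "PC1", ylab = "PC2")  # plot the first two PCs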
I am working on finding outliers using Mahalanobis distance in R. I have a dataset with 30 rows and 24 columns, which I feed into the mahalanobis function from the stats package. I want to find the distance of each row vector from the rest of the rows. The results look good until I move the same input data and the same code to another machine and rerun it, which gives different results from those seen on machine 1. Is this expected behaviour, or am I missing something? Please advise.
Code I used:
m_dist <- mahalanobis(data[, 2:25], colMeans(data[, 2:25]), cov(data[, 2:25]), tol = 1e-20)
Then I used a boxplot on m_dist to identify the outliers. The result on the first machine doesn't match the one on the second. I even used set.seed(1007) on both machines just to check, but the results are still different.
I found another thread which discusses a similar result difference in Python, but it doesn't help me in any way...
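For reference, here is a self-contained version of that computation on simulated data, since the original dataset isn't shown; the column layout (an ID column followed by 24 numeric columns) is an assumption based on the data[, 2:25] indexing:

set.seed(1007)
data <- data.frame(id = 1:30, matrix(rnorm(30 * 24), nrow = 30))

x      <- data[, 2:25]
# tol = 1e-20 lets solve() invert cov(x) even though it is ill-conditioned
# here (30 rows barely exceed 24 dimensions); near singularity, tiny
# floating-point differences between machines' BLAS/LAPACK builds can show up
m_dist <- mahalanobis(x, colMeans(x), cov(x), tol = 1e-20)

# Flag outliers as the points beyond the boxplot whiskers
boxplot(m_dist)
outliers <- which(m_dist %in% boxplot.stats(m_dist)$out)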
I am trying to run a clustering on a very large data set. It contains only strings for values. I have removed the NAs and replaced them with a dummy value. My k-means in R keeps failing due to NA coercion. How would the community cluster this data? I am showing 10 rows of a dummy example below. In this situation, let's call the data frame cluster_data.
Any help would be greatly appreciated. I am trying to see if any of the columns cause the data to break earlier than others, to try to understand a possible structure. I thought clustering with k-means was the best approach, but I do not see how to do it with strings. I have converted them to factors in R and still have issues. Any example code is greatly appreciated.
Question: how do you run k-means clustering on strings?
Answer: you can't run k-means cluster analysis on categorical data. You need data that a distance function can make sense of.
K-means is designed for continuous variables, where least-squares and the mean make sense to be used as centers.
For other data types, it is better to use other algorithms, such as PAM, HAC, DBSCAN, OPTICS, ... A sketch of the PAM approach follows.
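As an illustration of the PAM suggestion, here is a minimal sketch using the cluster package with a Gower dissimilarity; the data frame and its columns are invented stand-ins for the string columns described above:

library(cluster)

cluster_data <- data.frame(
  color = factor(c("red", "blue", "red", "green", "blue", "green")),
  size  = factor(c("S", "M", "L", "S", "M", "L"))
)

d   <- daisy(cluster_data, metric = "gower")  # handles factors, unlike dist()
fit <- pam(d, k = 2)                          # k-medoids on the dissimilarity matrix
fit$clustering                                # cluster label for each row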
Could you please help me add a zooming option to a wordcloud?
Please find a reproducible example here:
http://shiny.rstudio.com/gallery/word-cloud.html
I tried to incorporate rbokeh and plotly but couldn't find a wordcloud-equivalent render function.
Additionally, I found ECharts2Shiny on GitHub:
https://github.com/XD-DENG/ECharts2Shiny/tree/8ac690a8039abc2334ec06f394ba97498b518e81
But incorporating ECharts this way is also not convenient for real zooming.
Thanks in advance,
Abi
Normalisation is required only if the predictors are not meant to be comparable on their original scales. There's no rule that says you must normalise.
PCA is a statistical method that gives you a new linear transformation of the data. By itself, it loses nothing; all it does is give you new principal components.
You lose information only if you choose a subset of those principal components.
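A small illustration of the normalisation point, using R's built-in USArrests data (my choice of example, not the asker's data): the variables sit on very different scales, so scaling changes which directions PCA finds.

# USArrests mixes rates per 100,000 with a percentage, so the raw
# variances are not comparable
p_raw    <- prcomp(USArrests, scale. = FALSE)
p_scaled <- prcomp(USArrests, scale. = TRUE)

p_raw$rotation[, 1]     # PC1 is dominated by Assault (largest raw variance)
p_scaled$rotation[, 1]  # after scaling, the weight is spread more evenly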
Usually PCA includes centering the data as a preprocessing step.
PCA only re-expresses the data in its own axis system, given by the eigenvectors of the data's covariance matrix.
If you use all the axes, you lose no information.
Yet usually we want to apply dimensionality reduction, i.e., intuitively, to describe the data with fewer coordinates.
This means projecting the data onto a subspace spanned by only some of the eigenvectors of the data.
If one chooses wisely, one can end up with a significant reduction in the number of dimensions with negligible loss of information.
The way to do so is to choose the eigenvectors whose eigenvalues sum to most of the data's power (variance); see the sketch below.
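A sketch of that selection rule in R, again on built-in data as an assumed stand-in: keep enough components to cover, say, 90% of the variance.

p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

var_explained <- p$sdev^2 / sum(p$sdev^2)     # eigenvalue share per component
cumsum(var_explained)                         # cumulative "power" captured
k <- which(cumsum(var_explained) >= 0.90)[1]  # smallest k reaching 90%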
PCA itself is invertible, so lossless.
But:
It is common to drop some components, which will cause a loss of information.
Numerical issues may cause a loss in precision.
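A quick numerical check of both points, as a sketch on simulated data: reconstructing from all components recovers the data up to floating-point error, while dropping components does not.

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
p <- prcomp(X)  # centers by default

# All 5 components: lossless up to numerical precision
X_full <- p$x %*% t(p$rotation)
X_full <- sweep(X_full, 2, p$center, "+")
max(abs(X_full - X))  # on the order of 1e-15

# Only the first 2 components: real information is lost
X_approx <- p$x[, 1:2] %*% t(p$rotation[, 1:2])
X_approx <- sweep(X_approx, 2, p$center, "+")
max(abs(X_approx - X))  # much larger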
pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.
Not only for cluster analysis: whenever randomness is involved, you can fix the random number generator so that you always get the same results.
Try:
set.seed(123)
# your code here
The seed can be any integer, or anything that can be converted to an integer. And that's all.
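Applied to pvclust, whose bootstrap resampling is the source of the randomness, a minimal sketch on built-in data (the method and nboot settings here are just for illustration):

library(pvclust)

set.seed(123)
fit1 <- pvclust(USArrests, method.dist = "euclidean",
                method.hclust = "ward.D2", nboot = 100)

set.seed(123)  # reset before the second run
fit2 <- pvclust(USArrests, method.dist = "euclidean",
                method.hclust = "ward.D2", nboot = 100)

# Same seed, same bootstrap samples, same p-values
# (parallel runs may need their own seed handling)
identical(fit1$edges, fit2$edges)  # TRUE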
I've only used k-means. There I had to set the number of 'runs' or iterations to a higher value than the default to get the same clusters on consecutive runs.
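For k-means specifically, that 'runs' parameter is nstart in stats::kmeans; a short sketch:

set.seed(123)
# nstart = 25 runs k-means from 25 random starts and keeps the best
# solution, which makes consecutive runs far more likely to agree
fit <- kmeans(scale(USArrests), centers = 3, nstart = 25)
fit$cluster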
I have a large (10000 X 5001) table representing 10000 samples and 5001 different features of these samples. One of these features represents an output variable of each sample. In other words, I have 5000 input variables and one output variable for each sample.
I know that most of these inputs are irrelevant. Therefore, what I would like to do is determine the subset of input variables that predicts the output variable best. What is the best/simplest way to go about doing this in R?
You might want to check out Weka. In the Explorer, load the data and then go to the Select attributes tab. There you will find several options for getting the most informative attributes/features in your dataset.
You may want Principal Component Analysis (stats::prcomp) or Linear Discriminant Analysis (MASS::lda).
See this document by Avril Coghlan:
http://little-book-of-r-for-multivariate-analysis.readthedocs.org/en/latest/
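A hedged sketch of both suggestions on simulated data (the real 10000 x 5001 table isn't available here, so the sizes, names, and the assumption of a categorical output are all made up):

library(MASS)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("V", 1:p)))
y <- factor(X[, 1] + X[, 2] + rnorm(n) > 0)  # output driven by V1 and V2 only

# PCA: unsupervised; summarises the inputs without looking at y
pc <- prcomp(X, scale. = TRUE)
summary(pc)$importance[, 1:5]

# LDA: supervised; large absolute coefficients hint at informative inputs
fit <- lda(X, grouping = y)
head(sort(abs(fit$scaling[, 1]), decreasing = TRUE))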
Rather than taking 'random' suggestions, why not go to the CRAN Task View for Cluster Analysis & Finite Mixture Models?