I am trying to run a k-means cluster analysis in R using only a subset of the variables from my data source. I have created the subset (as a data frame) because I am only interested in using these variables for segmentation; the rest of the variables will be used to describe the segments.
After the k-means clustering is done, how can I connect the clustering results back to my original dataset, which includes the descriptive variables as well?
Please let me know if I could provide any clarifications on my questions. Many thanks in advance.
Cheers,
AC
k-means gives you a cluster label for each point (the cluster component of the object returned by kmeans()).
As long as you did not reorder or drop rows when building the subset, those labels refer to the same rows as your original data, so you can attach them as a new column.
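A minimal sketch of the round trip, assuming a hypothetical data frame full_data and hypothetical numeric segmentation columns var1-var3 (adapt the names to your data):

    # Hypothetical names: full_data is the complete dataset,
    # seg_vars are the (numeric) segmentation variables.
    seg_vars <- c("var1", "var2", "var3")
    seg_data <- full_data[, seg_vars]

    set.seed(123)                      # k-means depends on random starting centers
    km <- kmeans(scale(seg_data), centers = 4, nstart = 25)

    # kmeans() keeps the row order of its input, so the labels line up
    # with the rows of the original, unsubsetted data frame.
    full_data$cluster <- km$cluster

    # Describe the segments with the variables not used for clustering
    # (assuming they are numeric; use table() for categorical ones).
    desc_vars <- setdiff(names(full_data), c(seg_vars, "cluster"))
    aggregate(full_data[desc_vars], by = list(cluster = full_data$cluster), FUN = mean)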
I am trying to run stamppFst() and stamppConvert() with haplotype data. The data I have is a sequence of nucleotides stored as a DNAbin object. I have tried to find ways to turn it into a matrix, but what I have read goes way over my head, since this is the first time I have ever used R.
data
This is an example of one of the data sets I want to use.
I apologize if this is a very basic question. Thanks for any help!
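A minimal sketch of the conversion step only, assuming a hypothetical aligned FASTA file my_seqs.fasta and using the ape package (which also defines the DNAbin class):

    library(ape)

    dna <- read.dna("my_seqs.fasta", format = "fasta")   # DNAbin object
    dna <- as.matrix(dna)          # only works if all sequences have the same length
    seq_mat <- as.character(dna)   # plain character matrix: rows = samples, cols = sites

    dim(seq_mat)
    seq_mat[1:3, 1:10]

Note that stamppFst() and stamppConvert() in the StAMPP package are designed for SNP genotype / allele-frequency data (e.g. genlight objects), not raw nucleotide matrices, so the sequences will still need to be reduced to genotypes (for example, by extracting the variable sites) before StAMPP can work with them.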
A similar question was asked a few years ago (previous post) but stayed unanswered/unsolved, so I am trying my luck again.
I'm trying to reproduce in R cluster analyses done in SAS that used the Ward method together with the TRIM option. The TRIM option automatically omits points with low estimated probability densities (outliers); densities are estimated by the kth-nearest-neighbor or uniform-kernel method, and the trimming is applied during the clustering analysis itself.
My goal is to find the same clustering method, including this trimming option, in R, because I have to complement my dataset with new data. I want to be sure my cluster analyses in R are correct and closely follow what was done in SAS.
So far, since my dataset is composed of mixed variables, I have computed a Gower dissimilarity matrix. Then I tried different clustering approaches:
the usual hierarchical clustering (hclust with the Ward method, on the Gower dissimilarity from the 'cluster' package) => worked well, but I can't find any function to deal with outliers during the analysis.
partitioning clustering from 'TraMineRextras' (pam with the Ward method) => outliers can be removed, but only once the clusters are identified, so it gives different results from the SAS ones.
the density-based clustering algorithm from the 'dbscan' package => worked well (sensible number of clusters and identification of outliers), but the clustering does not use the Ward method, so I can't rely on it to reproduce the exact same analysis as in SAS.
Does anyone know how to proceed, or have ideas on how to reproduce the SAS trimming in R?
Many thanks!
I think you are looking for the agnes function in the cluster package. Documentation can be found here: https://cran.r-project.org/web/packages/cluster/cluster.pdf
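Building on that suggestion, here is a rough sketch of one way to approximate the SAS TRIM behaviour, assuming a hypothetical mixed-type data frame df. This is not the SAS implementation: it estimates density with the k-nearest-neighbor distance on the Gower dissimilarity, drops the sparsest points, and then runs Ward clustering (via agnes) on what remains:

    library(cluster)
    library(dbscan)

    d <- daisy(df, metric = "gower")        # Gower dissimilarity for mixed data

    # Density proxy: distance to the k-th nearest neighbor (larger = sparser region),
    # loosely analogous to the K= option in SAS PROC CLUSTER.
    k <- 10
    knn_d <- kNNdist(d, k = k)

    # Drop, say, the 5% least dense points (loosely analogous to TRIM=5).
    keep <- knn_d <= quantile(knn_d, 0.95)
    d_trim <- as.dist(as.matrix(d)[keep, keep])

    # Ward clustering on the trimmed dissimilarity.
    ag <- agnes(d_trim, diss = TRUE, method = "ward")
    clusters <- cutree(as.hclust(ag), k = 4)   # pick the number of clusters you need

The cut-offs (k and the 5% quantile) are tuning choices, not SAS defaults, so the results will not match SAS exactly; the point is only to remove low-density points before the Ward step rather than after it.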
I am trying to run a cluster analysis on a very large data set. It contains only strings as values. I have removed the NAs and replaced them with a dummy value. My k-means in R keeps failing due to NA coercion. How would the community cluster this data? I am showing 10 rows of a dummy example below. In this situation, let's call the data frame cluster_data.
Any help would be greatly appreciated. I am trying to see if any of the columns cause the data to break earlier than others, to try to understand a possible structure. I thought clustering with k-means was the best approach, but I do not see how to do it with strings. I have converted to factors in R and still have issues. Any example code is greatly appreciated.
Question: how do you run k-means clustering with strings?
Answer: You can't run k-means cluster analysis on categorical data. You need data that a distance function can make sense of.
K-means is designed for continuous variables, where least squares and the mean make sense as cluster centers.
For other data types, it is better to use other algorithms, such as PAM, HAC, DBSCAN, OPTICS, ...
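A minimal sketch of the PAM route for categorical data, assuming the data frame cluster_data from the question with its string columns converted to factors:

    library(cluster)

    cluster_data[] <- lapply(cluster_data, factor)   # strings -> factors

    d   <- daisy(cluster_data, metric = "gower")     # dissimilarity that handles factors
    fit <- pam(d, k = 4, diss = TRUE)                # k-medoids instead of k-means

    table(fit$clustering)                            # cluster sizes
    cluster_data[fit$id.med, ]                       # medoid rows as cluster "prototypes"

The number of clusters (k = 4 here) is arbitrary; silhouette widths (e.g. fit$silinfo) can help choose it.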
I have a pretty big data table (about 100,000 observations) that I'd like to use for clustering. Since some of the data is categorical, I've tried using the Gower distance and then hclust() with the "ward" method.
The data itself is very heterogeneous, which is why I'd like to sort of "pre-cluster" the data and then do the actual cluster analysis. Has anyone done this before and can point me in the right direction? I'm at a loss at the moment :(
With the mentioned methods, I don't really get useful clusters.
Thanks guys, I really appreciate every tip I can get.
Edit: I think I didn't really explain my problem well, so here's another attempt: let's say I have a dataset containing brands of cars and some of their features. Before clustering them by features, I would like to pre-cluster them by brand, so that all BMWs, for example, end up in the same cluster, and so on. Only after that would I like to cluster by features, so that I end up with a cluster of fast cars, etc.
Does anybody know how to do this in R?
This does not describe my actual dataset, but maybe my question is clearer now.
You should start with a sample first.
Once you get good results on the sample, try to reproduce them on a different sample. Once the results are stable, you can either try to scale the algorithm to the entire data set (maybe try doubling the sample size first), or you can train a classifier and predict the clusters of the remaining data. With most clustering algorithms, a 1-nearest-neighbor classifier will work very well.
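A rough sketch of the sample-then-assign idea, assuming a hypothetical mixed-type data frame big_df with about 100,000 rows; the gower package is used here for the 1-nearest-neighbor assignment (its Gower implementation may differ in small details from daisy()):

    library(cluster)
    library(gower)

    set.seed(1)
    idx       <- sample(nrow(big_df), 5000)      # work on a manageable sample first
    sample_df <- big_df[idx, ]

    d  <- daisy(sample_df, metric = "gower")
    hc <- hclust(d, method = "ward.D2")
    sample_clusters <- cutree(hc, k = 6)         # choose k by inspecting the dendrogram

    # Assign every remaining row to the cluster of its nearest sampled neighbor.
    rest_df <- big_df[-idx, ]
    nn <- gower_topn(rest_df, sample_df, n = 1)  # 1-NN in Gower distance
    rest_clusters <- sample_clusters[nn$index[1, ]]

    big_df$cluster <- NA_integer_
    big_df$cluster[idx]  <- sample_clusters
    big_df$cluster[-idx] <- rest_clusters

For the brand example in the edit above: if the pre-clustering is really just grouping by a known variable such as brand, then split(big_df, big_df$brand) followed by clustering within each group is all that is needed; no separate clustering step is required for the brands themselves.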
I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying k-means clustering and plotting it on a graph:
"'princomp' can only be used with more units than variables"
I then created a test set of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after performing k-means on it, I can see the extra results column which shows which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, I've been racking my brain over this for a week now.
Thanks guys.
The problem is that you have more variables than sample points, and the principal component analysis that is used to produce the plot is failing.
The help file for princomp explains (see ?princomp):
    ‘princomp’ only handles so-called R-mode PCA, that is feature extraction of
    variables. If a data matrix is supplied (possibly via a formula) it is
    required that there are at least as many units as variables. For Q-mode
    PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than variables.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or have at most a single 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
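A minimal sketch showing the difference, with stand-in random data of the same shape as in the question (200 rows x 800 columns):

    set.seed(42)
    x  <- matrix(rnorm(200 * 800), nrow = 200)   # stand-in data for illustration
    km <- kmeans(x, centers = 3, nstart = 10)

    # princomp(x) would fail here with the error from the question,
    # but prcomp() handles the n < p case.
    pc <- prcomp(x, scale. = TRUE)

    # Plot the first two principal components, colored by k-means cluster.
    plot(pc$x[, 1], pc$x[, 2],
         col = km$cluster, pch = 19,
         xlab = "PC1", ylab = "PC2",
         main = "k-means clusters on the first two PCs")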