I want to cluster my data in R (using kmeans or hclust). My data are ordinal: a Likert scale measuring the causes of cost escalation. I have 41 causes (the variables), each rated from 1 (no effect) to 5 (major effect), and about 160 observations (the respondents who rated the causes). How can I cluster the 41 causes based on the observations? Do I have to convert the scale to percentages or z-scores before clustering, or do anything else? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in the observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables (the columns) that way. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then standardizes them to z-scores.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. The dissimilarities it returns can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning around medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables), although that may well mess up the data depending on how you have stored them: t() turns a data frame into a matrix, so factor columns lose their type.
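As a rough illustration of the approach above, here is a minimal sketch on simulated data (the real spreadsheet is not loaded here; the object names likert and by_cause and the choice of k = 4 are assumptions made only for the example). It transposes the ratings so that each cause is a row, declares the ratings as ordered factors again so that daisy treats them as ordinal, and passes the dissimilarities to agnes and pam.

```r
library(cluster)  # daisy(), agnes(), pam()

set.seed(1)
# Simulated stand-in for the survey: 160 respondents x 41 causes, rated 1-5
likert <- as.data.frame(matrix(sample(1:5, 160 * 41, replace = TRUE), nrow = 160))
names(likert) <- paste0("cause", 1:41)

# Transpose so that the causes become the rows (the units being clustered)
by_cause <- as.data.frame(t(as.matrix(likert)))
# t() returns a plain integer matrix, so declare the ratings ordinal again
by_cause[] <- lapply(by_cause, factor, levels = 1:5, ordered = TRUE)

d  <- daisy(by_cause, metric = "gower")  # see ?daisy for the ordinal treatment
hc <- agnes(d, method = "average")       # hierarchical clustering of the causes
pm <- pam(d, k = 4)                      # k = 4 is an arbitrary choice

hc_groups  <- cutree(as.hclust(hc), k = 4)  # cluster label for each cause
pam_groups <- pm$clustering
```

With real data, the number of clusters would need to be chosen from the dendrogram or from something like the silhouette widths that pam reports.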
Converting the data to percentages is a form of normalization: it puts all the variables on a common scale, in the range 0 to 1.
If the data are not normalized, you run the risk of bias towards dimensions with large values.
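To make the point about large values concrete, here is a small sketch with made-up numbers (the data frame x and its columns a and b are hypothetical). It shows min-max normalization and z-score standardization in base R. Note that when every variable is already on the same 1 to 5 scale, as with the Likert items in the question, this concern is probably much weaker.

```r
# Hypothetical data: two variables measured on very different scales
x <- data.frame(a = c(2, 10, 35, 20), b = c(0.1, 0.5, 0.9, 0.3))

# Min-max normalization: each column rescaled to the range 0-1
minmax <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

# Z-score standardization: each column gets mean 0 and standard deviation 1
z <- scale(x)

# Euclidean distances on the raw data are dominated by column a;
# after normalization both columns contribute on a comparable scale
d_raw    <- dist(x)
d_minmax <- dist(minmax)
```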
I am trying to do CCA using a presence / absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the Vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another dataframe. The context of the study is that the environmental data is metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data, and are not errors or NA's. When running the CCA with the R Vegan Package so far I have had to delete these rows to get it to work, otherwise it returns the error
'Error in cca.default(d$X, d$Y, d$Z) :
all row sums must be >0 in the community data matrix' .
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read a paper that also uses the vegan package and has a similar research design: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy. The authors included plots with zero species by adding a 'zero species' variable, but they do not elaborate on how this is done.
I am new to coding so any help is very much appreciated,
Thanks in advance
Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
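For a quick check of what that line does, here is a minimal sketch on a toy community table (the species names sp1 to sp3 and the values are made up). After adding ZERO, every row sum is positive, which is the condition the cca error message complains about.

```r
# Toy presence/absence table; rows 2 and 4 have no species at all
comm <- data.frame(sp1 = c(1, 0, 0, 0, 1),
                   sp2 = c(0, 0, 1, 0, 1),
                   sp3 = c(0, 0, 0, 0, 1))

# Indicator column: 1 for empty sampling units, 0 otherwise
comm$ZERO <- as.numeric(rowSums(comm) == 0)

comm
rowSums(comm)  # no row sum is zero any more

# The constrained ordination would then be fitted as before, e.g.
# (env and metal are hypothetical names for the environmental data):
# ord <- vegan::cca(comm ~ metal, data = env)
```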
Personally, I would be worried about doing this. Correspondence Analysis is a compositional analysis, and adding a column (species) that never occurs with any other species (by definition) creates a data set with two disjunct blocks. In unconstrained CA this disjunct block shows up as a first eigenvalue of 1, which is the theoretical maximum in CA. This first eigenvector will separate the blocks: the ZERO species and the sampling units with the ZERO species at one extreme, and all other species and sampling units at the other extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA, which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what you are actually analysing if you do so. Even if the data set is not completely disjunct with CCA, the zero sampling units will probably be far separated from the other points, and all plotted at the same point.
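The eigenvalue of 1 in unconstrained CA can be checked numerically. The sketch below does correspondence analysis by hand in base R (an SVD of the standardized residuals) on a made-up table, so it does not rely on vegan; the table and object names are assumptions for illustration only.

```r
# Toy table: three sampling units with species, two without any
comm <- data.frame(sp1 = c(1, 0, 1, 0, 0),
                   sp2 = c(1, 1, 0, 0, 0),
                   sp3 = c(0, 1, 1, 0, 0))
comm$ZERO <- as.numeric(rowSums(comm) == 0)

# Correspondence analysis by hand
X  <- as.matrix(comm)
P  <- X / sum(X)                       # proportions
rw <- rowSums(P)                       # row weights
cw <- colSums(P)                       # column weights
S  <- diag(1 / sqrt(rw)) %*% (P - rw %o% cw) %*% diag(1 / sqrt(cw))
eig <- svd(S)$d^2                      # CA eigenvalues, largest first

round(eig, 6)  # the first eigenvalue is 1 (up to floating point error)
```

The first axis here does nothing except separate the two empty units and the ZERO column from everything else.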
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data, selecting Jaccard distances ("binary" in R) as the distance measure (suitable for species presence/absence data) and Ward's method ("ward.D2"). I used parallel = TRUE to reduce computation time. However, with nboot = 1000 my computer was not able to finish the computation in hours and I finally got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
library(pvclust)
tdata <- t(data)  # pvclust clusters columns, so records must be in columns
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be:
1) Is there anything I got wrong in the pvclust function itself?
2) Could my low nboot (due to a "weak" computer) be a reason for the non-significance of my results?
3) Are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboot = 1000. This worked in a reasonable computation time, but the output is still not very satisfying; see dendrogram 2 (obtained for a subset of 500 records with nboot = 1000).
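One cheap sanity check, sketched here on simulated data because the real data set is not available, is to run the distance and linkage step on its own in base R before paying for the bootstrap. This does not use pvclust and produces no p-values; it only shows whether the binary distance and ward.D2 linkage give a sensible tree. The matrix pa, its dimensions and k = 4 are all assumptions for the example.

```r
set.seed(42)
# Simulated presence/absence data: 300 records x 13 species (hypothetical)
pa <- matrix(rbinom(300 * 13, 1, 0.3), nrow = 300,
             dimnames = list(NULL, paste0("sp", 1:13)))
pa <- pa[rowSums(pa) > 0, , drop = FALSE]  # drop records with no species at all

d  <- dist(pa, method = "binary")    # binary (Jaccard-type) distance between records
hc <- hclust(d, method = "ward.D2")  # Ward's method
groups <- cutree(hc, k = 4)          # k = 4 chosen only for illustration

table(groups)
```

Records with identical species lists have distance 0, so with only 13 species and thousands of records many ties at the bottom of the tree are to be expected.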
I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ, based on the percentage of variance. That is, I want to find out a) whether there are any variables which help to draw certain observations apart from one another and b) if so, what percentage of the variance they explain.
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R (I call it data):
library(vegan)  # vegdist()
library(ape)    # pcoa()
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa result, so I ran:
data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a PCoA on their distance matrix, but there was no further explanation.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
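If the variables really are factors, one possible route is sketched below with cluster::daisy() for the Gower dissimilarity and base R's cmdscale() for the PCoA, on simulated data (the data frame dat and its contents are made up). The percentages here are each positive eigenvalue divided by the sum of the positive eigenvalues, which is one common convention and not necessarily the same number as Relative_eig from ape::pcoa().

```r
library(cluster)  # daisy()

set.seed(7)
# Hypothetical data: 132 observations, 10 categorical variables stored as factors
dat <- as.data.frame(replicate(10, sample(c("a", "b", "c"), 132, replace = TRUE)),
                     stringsAsFactors = TRUE)

d   <- daisy(dat, metric = "gower")    # Gower dissimilarity, accepts factors
fit <- cmdscale(d, k = 2, eig = TRUE)  # classical PCoA, keep two axes

eig <- fit$eig                         # one eigenvalue per observation
pos <- eig[eig > 0]                    # real axes only
pct <- 100 * pos / sum(pos)            # percentage of the positive eigenvalues

round(pct[1:2], 1)                     # axes 1 and 2
head(fit$points)                       # coordinates of the observations
```

The number of axes is tied to the number of observations and the dissimilarity used, not to the number of original variables, which is why more than 10 columns can appear.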
I have basketball player data that looks like the following:
Player  Weight  Height  Shots  School
A       NA      70      23     AB
B       130     62      10     AB
C       180     66      NA     BC
D       157     65      22     CD
and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources, I found that I can use kmeans for unsupervised clustering, but I don't know how to handle the NAs without losing a good amount of data. I also don't know how to handle the qualitative variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?
K-means cannot be used for categorical data. One workaround would be to use numeric data about the schools instead, such as enrollment numbers or local socioeconomic status (SES) data.
kmeans() in R cannot handle NAs, so you could either omit them (and you should check that the NAs are distributed fairly evenly among the other factors) or look into using clara() from the cluster package.
You have not asked anything specific about supervised learning, so I cannot address that part of the question.
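Here is a minimal sketch of the clara() route on simulated numeric player data (the data frame, the positions of the NAs and k = 3 are all made up for illustration). According to ?clara, missing values are allowed in the input, so no rows have to be dropped.

```r
library(cluster)  # clara()

set.seed(3)
# Hypothetical player data: 40 players, numeric variables only
players <- data.frame(Weight = rnorm(40, 160, 20),
                      Height = rnorm(40, 66, 3),
                      Shots  = rpois(40, 20))
players$Weight[c(2, 17)] <- NA  # a few missing values, one per affected row
players$Shots[c(5, 30)]  <- NA

fit <- clara(players, k = 3)    # k = 3 is an arbitrary choice

table(fit$clustering)           # all 40 players get a cluster label
sum(complete.cases(players))    # only this many rows would survive na.omit()
```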
The problem you are facing is known as missing data, and you have to decide how to deal with it before you start clustering. In most cases the samples with missing data (NAs here) are simply omitted. That happens in the data preparation and cleaning step of data mining. In R you can do it using the following code:
na.omit(yourdata)
It omits the records or samples (rows) that contain NAs.
But if you want to include them in the clustering process, you can replace each missing value with the average value of that feature across the other samples.
In your case, consider weight:
for player A you can set his weight to (130+180+157)/3.
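The same idea applied to every numeric column of the example table could look like the sketch below. It uses the mean over all players with an observed value, as in the weight example; the object names players and imputed are just for illustration.

```r
# The example table from the question
players <- data.frame(Player = c("A", "B", "C", "D"),
                      Weight = c(NA, 130, 180, 157),
                      Height = c(70, 62, 66, 65),
                      Shots  = c(23, 10, NA, 22),
                      School = c("AB", "AB", "BC", "CD"))

# Replace each NA with the mean of the observed values in that column
imputed <- players
for (v in c("Weight", "Height", "Shots")) {
  miss <- is.na(imputed[[v]])
  imputed[[v]][miss] <- mean(imputed[[v]], na.rm = TRUE)
}

imputed  # player A's weight is now (130 + 180 + 157) / 3
```

Keep in mind that mean imputation pulls the imputed players towards the centre of the data, which can affect the clusters when many values are missing.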
As for your other question: it seems you are a little confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define the class label of each sample. Then you build a model (a classifier) and train it to learn about each class of samples. After training, you can use the model to predict the label of a test sample: for example, you give it a player with the values (W=100, H=190, shots=55) and it gives you the predicted class label.
For unsupervised learning you just cluster the data to find groups of related samples. You do not need a class label for this, but you do have to choose the features on which the samples are clustered. For example, you can cluster players based only on their weight, or only on their height, or you can use height, weight and shots together. This is possible in R using the following code:
clus <- kmeans(na.omit(data$Weight), 5)  # cluster into 5 groups based on weight only
clus <- kmeans(na.omit(data[, c("Weight", "Height", "Shots")]), 5)  # cluster into 5 groups based on weight, height and shots
Note the use of na.omit here, which removes the rows that have NAs in any of their columns.
Let me know if this helps you.
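A runnable version of that call needs more rows than clusters, so the sketch below uses simulated data instead of the four-row table (the data, the NA positions and nstart = 20 are assumptions). It also standardizes the columns first, which is an extra step not in the code above, so that weight does not dominate simply because its numbers are larger.

```r
set.seed(10)
# Hypothetical player data: 60 players, numeric features only
players <- data.frame(Weight = rnorm(60, 160, 20),
                      Height = rnorm(60, 66, 3),
                      Shots  = rpois(60, 20))
players$Weight[c(4, 21)] <- NA

complete <- na.omit(players)   # drops the rows containing NAs
scaled   <- scale(complete)    # put the three features on a common scale
clus     <- kmeans(scaled, centers = 5, nstart = 20)

table(clus$cluster)            # cluster sizes
attr(complete, "na.action")    # which rows were dropped
```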
I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that have been normalized and batch effect corrected. The other group also has some quantitative clinical parameters that describe the same samples. However, these clinical variables have not been normalized or subjected to any kind of transformation(i.e. raw continuous values).
For example, one variable of these could have range of values from 2 to 35, whereas another from 0.1 to 0.9, etc.
My ultimate goal is to run hierarchical clustering on both groups simultaneously (merged in one matrix/data frame), in order to inspect which of these clinical variables cluster with specific genes. So:
1) Is an initial transformation in the group of the clinical variables necessary before merging with the genes and perform the clustering ? For example: log2 transformation, which has also been done to part of my gene expression data !!
2) Or, a row scaling (that is the total features in the input data) would take into account this discrepancy ?
3) For a similar analysis/approach, like constructing a correlation plot of the above total variables, would a simple scaling be sufficient?
Without having seen your gene expression data, I can only provide you some general suggestions based on your description, in the context of the 3 questions you asked:
1) You should definitely check the distribution of each group. In R, you may use one or more of the following functions to visualize the distribution:
hist(expression_data) ##histogram
plot(density(expression_data)) ##density plot; alternative to histogram
qqnorm(expression_data); qqline(expression_data) #QQ plot
My understanding is that one group of your expression data is log2 transformed, so that group should be approximately normally distributed (i.e. a bell-shaped histogram and a roughly straight line in the QQ plot). Whether to transform the group that has not yet been transformed depends on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, then you definitely need a transformation, as the t-test comes with a normality assumption. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, then why would you keep one transformed and the other not?
2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. If you decide to use the heatmap function in R, its scale argument ("row", "column" or "none") is an option for you.
3) I don't think there is a definitive answer to your third question. It depends on how many features you have and what analyses you will be doing downstream. As with question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, keep in mind that if you decide to fit a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.
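To make the scaling point a bit more tangible, here is a minimal sketch on simulated data (the gene and clinical variable names, the sample size and the average linkage are all assumptions). It standardizes every variable with scale() and then clusters the variables using 1 minus the Pearson correlation as the dissimilarity. One thing it illustrates: Pearson correlation is unchanged by this kind of per-variable scaling, so for a correlation-based analysis the scaling itself makes no difference, whereas Euclidean distances between samples do change.

```r
set.seed(5)
n <- 30  # hypothetical number of samples

# Simulated log2-scale expression for 4 genes and 2 raw clinical variables
genes <- matrix(rnorm(n * 4, mean = 8, sd = 1), nrow = n,
                dimnames = list(NULL, paste0("gene", 1:4)))
clin  <- cbind(clinA = runif(n, 2, 35), clinB = runif(n, 0.1, 0.9))

merged <- cbind(genes, clin)
scaled <- scale(merged)               # each variable: mean 0, sd 1

# Cluster the variables (columns) with a correlation-based dissimilarity
d  <- as.dist(1 - cor(scaled))
hc <- hclust(d, method = "average")
hc$labels[hc$order]                   # variables in dendrogram order

# Correlations are identical before and after scaling
all.equal(cor(merged), cor(scaled))

# Euclidean distances between samples are not
max(dist(merged)); max(dist(scaled))
```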