Unsupervised and Supervised Clustering with NAs and Qualitative Data in R

I have basketball player data that looks like the following:
Player  Weight  Height  Shots  School
A       NA      70      23     AB
B       130     62      10     AB
C       180     66      NA     BC
D       157     65      22     CD

and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources I found that I can use k-means for the unsupervised part, but I don't know how to handle the NAs without losing a good amount of data. I also don't know how to handle the qualitative variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?

K-means cannot be used on categorical data. One workaround would be to instead use numeric data about the schools, such as enrollment counts or local SES data.
kmeans() in R cannot handle NAs, so you could either omit them (after checking that the NAs are distributed fairly evenly across the other variables) or look into cluster::clara() from the cluster package, which tolerates missing values.
You have not asked anything specific about the supervised part, so I cannot address that portion of the question.
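A minimal sketch of both suggestions, assuming the question's data frame is called players (all names and k = 2 here are illustrative, not from the original post):
library(cluster)
# hypothetical data frame mirroring the question's table
players <- data.frame(
  Player = c("A", "B", "C", "D"),
  Weight = c(NA, 130, 180, 157),
  Height = c(70, 62, 66, 65),
  Shots  = c(23, 10, NA, 22),
  School = c("AB", "AB", "BC", "CD")
)
num_cols <- c("Weight", "Height", "Shots")
# option 1: drop incomplete rows, then k-means on the numeric columns only
km <- kmeans(na.omit(players[, num_cols]), centers = 2)
# option 2: clara() tolerates NAs, so no rows need to be dropped
cl <- clara(players[, num_cols], k = 2)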

The problem you are facing is known as missing data, and you have to decide how to handle it before starting the clustering. In most cases the samples with missing data (the NAs here) are simply omitted; this happens during the data preparation and cleaning step of the data mining process. In R you can do it with the following code:
na.omit(yourdata)
This omits the records or samples (rows) that contain NAs.
But if you want to include those samples in the clustering process, you can substitute the mean of that feature across the remaining samples for each missing value (mean imputation).
In your case, consider Weight: for player A you can set (130 + 180 + 157) / 3, i.e. about 155.7, as his weight.
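As a one-line sketch of that mean substitution, assuming the player data frame from the question is called data:
# replace each missing weight with the mean of the observed weights;
# for player A this gives (130 + 180 + 157) / 3, i.e. about 155.7
data$Weight[is.na(data$Weight)] <- mean(data$Weight, na.rm = TRUE)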
On your other question: it seems you are a little confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define a class label for the samples. You then build a model (a classifier) and train it to learn about each class of samples; after training you can use the model to predict the label of a test sample. For example, you give it a player with the values (W = 100, H = 190, Shots = 55) and it returns the predicted class label.
In unsupervised learning you just cluster the data to find groupings among the samples. For this you do not need a class label; you only define the features on which to cluster the samples. For example, you can cluster players based only on their weights, only on their heights, or on all of height, weight, and shots together. This is possible in R with the following code:
clus <- kmeans(na.omit(data$Weight), 5)  # cluster into 5 groups based on weight alone
clus <- kmeans(na.omit(data[, 2:4]), 5)  # cluster into 5 groups based on Weight, Height, and Shots
Note the use of na.omit() here, which removes any row that has an NA in one of the selected columns.
Let me know if this helps.

Related

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:

        1    2
SPSS  712  288
R     610  390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; the same applies to the 3- to 10-cluster solutions). "ward.D2" takes the squared distances into account, if I'm not mistaken, so I supplied the plain distance matrix here. However, I tried several (combinations of) distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when the same method is applied). Ultimately, I excluded duplicate cases (N = 29) from my data, guessing that those might have caused differences when being allocated (randomly) at some point. None of this resulted in matching outputs from R and SPSS.
I tried running the analysis with the agnes() function from the cluster package, which resulted in - again - different results compared to SPSS and even hclust() (But that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is tied distances being handled differently by the two algorithms. Tied distances might be present in the original full distance matrix, or might arise during the joining process. If one program searches the matrix, finds two or more distances tied at the minimum value at that step, and selects the first one, while another program selects the last one, or one or both pick at random from among the ties, different results can occur.
I'd suggest starting with a small example, with some randomness added to the values to make tied distances unlikely, and seeing whether the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling might be the issue.
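A hedged sketch of that check, reusing datA from the question's code; the subset size of 30 and the noise range are arbitrary choices:
set.seed(42)
# small random subset with tiny noise added, so tied distances become
# practically impossible
idx   <- sample(nrow(datA), 30)
noisy <- datA[idx, ] + matrix(runif(30 * ncol(datA), -1e-4, 1e-4), nrow = 30)
hc_test <- hclust(dist(noisy, method = "euclidean"), method = "ward.D2")
table(cutree(hc_test, k = 2))
# export the same subset and rerun the WARD/SEUCLID analysis in SPSS;
# matching cluster sizes would point to tie handling as the culprit
write.csv(noisy, "noisy_subset.csv", row.names = FALSE)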

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then ran pvclust selecting the Jaccard distance ("binary" in R) as the distance measure (suitable for species presence/absence data) and Ward's method ("ward.D2"). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000 my computer was not able to finish the computation within hours and I finally got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions are:
1) Is there anything I got wrong in the pvclust call itself?
2) Could my low nboot (due to a weak computer) be a reason for the non-significance of my results?
3) Are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried running the same code on a subset of 500 records with nboot = 1000. This finished in a reasonable computation time, but the output is still not very satisfying; see dendrogram 2 (the dendrogram obtained for a subset of 500 records with nboot = 1000).

Hierarchical clustering on continuous heterogeneous variables with different range/scales in R

I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that has been normalized and batch-effect corrected. The other group consists of quantitative clinical parameters describing the same samples. However, these clinical variables have not been normalized or subjected to any kind of transformation (i.e. they are raw continuous values).
For example, one variable of these could have range of values from 2 to 35, whereas another from 0.1 to 0.9, etc.
Thus, as my ultimate goal is to implement hierarchical clustering using both groups simultaneously (merged in a single matrix/data frame), in order to inspect which of these clinical variables cluster with specific genes, etc.:
1) Is an initial transformation of the clinical variables necessary before merging them with the genes and performing the clustering? For example, a log2 transformation, which has also been applied to part of my gene expression data?
2) Or would a row scaling (that is, across the total features in the input data) account for this discrepancy?
3) For a similar analysis/approach, like constructing a correlation plot of all the above variables, would a simple scaling be sufficient?
Without having seen your gene expression data, I can only provide you some general suggestions based on your description, in the context of the 3 questions you asked:
1) You should definitely check the distribution of each group. In R, you may use one or more of the following functions to visualize a distribution:
hist(expression_data)                             # histogram
plot(density(expression_data))                    # density plot; alternative to a histogram
qqnorm(expression_data); qqline(expression_data)  # QQ plot
Since my understanding is that one of your data groups is log2 transformed, that particular group should have an approximately normal distribution (i.e. a bell-curve shape in the histogram and a straight line in the QQ plot). Whether to transform the group that has not yet been transformed depends on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, then you definitely need a transformation, as a t-test carries a normality assumption. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, why would you ever keep one transformed and the other not?
2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. scale = "row" is an option for you if you decide to use the heatmap() function in R.
3) I don't think there is a definitive answer to your third question; it depends on how many features you have and what downstream analyses you will be doing. As with question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, keep in mind that if, say, you decide to fit a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.
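To make the scaling point concrete, a minimal sketch that merges two hypothetical data frames expr_data and clinical_data (same samples in the same row order), standardizes every variable, and then clusters the variables rather than the samples:
combined <- cbind(expr_data, clinical_data)  # hypothetical objects
scaled   <- scale(combined)                  # column-wise z-scores
# transpose so that the variables, not the samples, are clustered
hc <- hclust(dist(t(scaled)), method = "ward.D2")
plot(hc)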

Cluster ordinal data

I want to cluster my data (kmeans or hclust) in R. My data is ordinal: a Likert scale measuring the causes of cost escalation (I have 41 causes, the "variables"), scaled from 1 to 5, where 1 is no effect and 5 is a major effect (I have about 160 observations, the respondents who rank the causes). Any help on how to cluster the 41 causes based on the observations would be appreciated. Do I have to convert the scale to percentages or z-scores before clustering, or is there anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in the observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables (the columns) that way. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then z-score standardizes it.
It isn't clear to me whether you want to cluster the rows (the observations) in terms of similarity in the variables, or the variables (the columns) in terms of similarity of occurrence across observations.
Anyway, see the package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. The resulting dissimilarity can be used in functions such as agnes() (for hierarchical clustering) or pam() (for partitioning around medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables), although that may well mess up the data depending on how you have stored them.
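A minimal sketch of that workflow, assuming the 160 x 41 Likert responses live in a data frame called likert_df; the ordered-factor conversion and k = 4 are illustrative choices:
library(cluster)
# treat the 1-5 responses as ordered factors so daisy() applies its
# ordinal handling (Gower's coefficient)
ord <- as.data.frame(lapply(likert_df, function(x) factor(x, ordered = TRUE)))
d   <- daisy(ord, metric = "gower")
ag <- agnes(d, method = "ward")  # hierarchical clustering of the observations
pm <- pam(d, k = 4)              # or partitioning around medoids
# to cluster the 41 variables (columns) instead, transpose first
hc_vars <- hclust(dist(t(scale(likert_df))), method = "ward.D2")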
Converting the data to percentages is a form of normalization, which puts all the variables in the range 0 to 1.
If the data are not normalized, you run the risk of biasing the clustering towards dimensions with large values.
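A quick sketch of that normalization, again assuming a numeric data frame likert_df:
# min-max normalization: rescale every column to the 0-1 range
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
norm_df   <- as.data.frame(lapply(likert_df, rescale01))
# z-score standardization is the common alternative
z_mat <- scale(likert_df)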

Cluster Variables Against Single Outcome Variable - ClustOfVar

I have a dataset with variables representing both scores and traits (mix of qualitative and quantitative on both counts). I want to cluster the traits (not the individual observations) according to each of the scores. So, I want to form clusters of traits (trait_1 through trait_15) that are similar on the basis of score_1, then repeat for scores 2 and 3. Example of the data structure below.
I am thinking that I can use the ClustOfVar package to form these clusters, which I would understand if I were just trying to cluster all of the variables into like groups. However, I don't know how to cluster them on the basis of one of the other variables.
If anyone has suggestions, I'd appreciate it. Thanks in advance.
Score_1 Score_2 Score_3 Trait_1 Trait_2 Trait_3 … Trait_15
n1
n2
n3
…
n100000
You may want to look into subspace clustering algorithms.
They usually allow overlapping clusters, so you may end up with quite a number of clusters.
You would cluster on the traits only, then check in a second phase whether the clusters you find correspond to your known scores.
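The cluster-then-check idea can also be sketched with the ClustOfVar package the question mentions (not a subspace method, just the same two-phase workflow); df, the column names, and k = 4 are assumptions:
library(ClustOfVar)
# phase 1: cluster the 15 traits only (assumed quantitative here;
# qualitative traits would go in the X.quali argument instead)
traits <- df[, paste0("Trait_", 1:15)]
tree   <- hclustvar(X.quanti = as.matrix(traits))
part   <- cutreevar(tree, k = 4)
# phase 2: check whether the trait clusters relate to the scores,
# e.g. by correlating each cluster's synthetic variable with them
cor(part$scores, df[, c("Score_1", "Score_2", "Score_3")])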
