Hierarchical clustering and k-means in R

I want to run a hierarchical cluster analysis. I am aware of the hclust() function but not how to use this in practice; I'm stuck with supplying the data to the function and processing the output.
The main issue is that I would like to cluster based on a given measurement.
I would also like to compare the hierarchical clustering with that produced by kmeans(). Again I am not sure how to call this function or use/manipulate the output from it.
My data are similar to:
df <- structure(list(id = c(111, 111, 111, 112, 112, 112),
                     se = c(1, 2, 3, 1, 2, 3),
                     t1 = c(1, 2, 1, 1, 1, 3),
                     t2 = c(1, 2, 2, 1, 1, 4),
                     t3 = c(1, 0, 0, 0, 2, 1),
                     t4 = c(2, 5, 7, 7, 1, 2),
                     t5 = c(1, 0, 1, 1, 1, 1),
                     t6 = c(1, 1, 1, 1, 1, 1),
                     t7 = c(1, 1, 1, 1, 1, 1),
                     t8 = c(0, 0, 0, 0, 0, 0)),
                row.names = c(NA, 6L), class = "data.frame")
I would like to run the hierarchical cluster analysis to identify the optimum number of clusters.
How can I run clustering based on a predefined measurement - in this case for example to cluster measurement number 2?

For hierarchical clustering there is one essential element you have to define: the linkage method, i.e., how the distance between groups of data points is computed. Clustering is an exploratory technique, so you have to decide on the number of clusters based on how the data points are distributed. I will show how to do this in the code below. We will compare three linkage methods using your data df and the function hclust():
The first method is average linkage, which uses the mean of all pairwise distances between the points of two clusters. We will omit the first variable because it is an id:
#Method 1
hc.average <- hclust(dist(df[,-1]),method='average')
The second method is complete linkage, which uses the largest of all pairwise distances between the points of two clusters:
#Method 2
hc.complete<- hclust(dist(df[,-1]),method='complete')
The third method is single linkage, which uses the smallest of all pairwise distances between the points of two clusters:
#Method 3
hc.single <- hclust(dist(df[,-1]),method='single')
With all three models fitted we can analyze the groupings.
The number of clusters can be chosen from the height of the hierarchical tree: at the largest height everything merges into a single cluster containing the whole dataset, so it is standard to cut at an intermediate height.
With the average method, cutting at a height of 3 produces four groups, while a height of around 4.5 produces two groups:
plot(hc.average, xlab='')
Output: (dendrogram plot)
With the complete method the results are similar, but the height scale has changed.
plot(hc.complete, xlab='')
Output: (dendrogram plot)
Finally, the single method produces a different grouping scheme. There are three groups, and even with an intermediate choice of height you will always end up with that number of clusters:
plot(hc.single, xlab='')
Output: (dendrogram plot)
You can use whichever model you prefer to assign clusters to your data with the cutree() function, where you pass the model object and the number of clusters. One way to judge clustering performance is to check how homogeneous the groups are; that depends on the researcher's criteria. Here is how to add the cluster labels to your data. I will use the last model and three groups:
#Add cluster
df$Cluster <- cutree(hc.single,k = 3)
Output:
id se t1 t2 t3 t4 t5 t6 t7 t8 Cluster
1 111 1 1 1 1 2 1 1 1 0 1
2 111 2 2 2 0 5 0 1 1 0 2
3 111 3 1 2 0 7 1 1 1 0 2
4 112 1 1 1 0 7 1 1 1 0 2
5 112 2 1 1 2 1 1 1 1 0 1
6 112 3 3 4 1 2 1 1 1 0 3
The function cutree() also has an argument called h, where you can set the cutting height discussed above instead of the number of clusters k.
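For example, a minimal sketch using the average-linkage model from above (the height of 3 matches the dendrogram discussion earlier):
#Cut the average-linkage tree at height 3 instead of requesting k clusters;
#per the dendrogram above this should yield four groups
cutree(hc.average, h = 3)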
Regarding your question about using a particular measurement to define the clusters, you could scale your data while excluding the desired variable; that variable then keeps its original scale and so has a stronger influence on the clustering results.
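Since you also asked about kmeans(), here is a minimal sketch of one way to compare it with the hierarchical result; using 3 centers (to match cutree(hc.single, k = 3)) and a fixed seed are my choices for illustration, not part of the answer above:
#k-means on the same columns used for hclust(): drop id and the added Cluster column
set.seed(42)  #kmeans() starts from random centers
km <- kmeans(df[, -c(1, which(names(df) == "Cluster"))], centers = 3, nstart = 25)
#Cross-tabulate the two labelings to see how well they agree
table(hierarchical = df$Cluster, kmeans = km$cluster)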

Related

Does this problem have an "exact" solution?

I am working with R.
Suppose you have the following data:
#generate data
set.seed(123)
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,10)
c1 = rnorm(1000,5,1)
train_data = data.frame(a1,b1,c1)
#view data
head(train_data)
a1 b1 c1
1 94.39524 90.04201 4.488396
2 97.69823 89.60045 5.236938
3 115.58708 99.82020 4.458411
4 100.70508 98.67825 6.219228
5 101.29288 74.50657 5.174136
6 117.15065 110.40573 4.384732
We can visualize the data as follows:
#visualize data
par(mfrow=c(2,2))
plot(train_data$a1, train_data$b1, col = train_data$c1, main = "plot of a1 vs b1, points colored by c1")
hist(train_data$a1)
hist(train_data$b1)
hist(train_data$c1)
Here is the Problem :
From the data, only take variables "a1" and "b1": using only 2 "logical conditions", split this data into 3 regions (e.g. Region 1 WHERE 0 < a1 < 20 AND 0 < b1 < 25)
In each region, you want the "average value of c1" within that region to be as small as possible - but each region must have at least some minimum number of data points, e.g. 100 data points (to prevent trivial solutions)
Goal : Is it possible to determine the "boundaries" of these 3 regions that minimizes :
the mean value of "c1" for region 1
the mean value of "c1" for region 2
the mean value of "c1" for region 3
the average "mean value of c1 for all 3 regions" (i.e. c_avg = (region1_c1_avg + region2_c1_avg + region3_c1_avg) / 3)
In the end, for a given combination, you would find the following, e.g. (made up numbers):
Region 1: WHERE 0 < a1 < 20 AND 0 < b1 < 25; region1_c1_avg = 4
Region 2: WHERE 20 < a1 < 50 AND 25 < b1 < 60; region2_c1_avg = 2.9
Region 3: WHERE a1 > 50 AND b1 > 60; region3_c1_avg = 1.9
c_avg = (4 + 2.9 + 1.9) / 3 = 2.93
And hope that (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) are minimized
My Question:
Does this kind of problem have an "exact solution"? The only thing I can think of is performing a "random search" that considers many different definitions of (Region 1, Region 2 and Region 3) and compares the corresponding values of (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg), until a minimum value is found. Is this an application of linear programming or multi-objective optimization (e.g. genetic algorithm)? Has anyone worked on something like this before?
I have done a lot of research and haven't found a similar problem like this. I decided to formulate this problem as a "multi-objective constrained optimization problem", and figured out how to implement algorithms like "random search" and "genetic algorithm".
Thanks
Note 1: In the context of multi-objective optimization, for a given set of definitions of (Region1, Region2 and Region3): to collectively compare whether a set of values for (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) are satisfactory, the concept of "Pareto Optimality" (https://en.wikipedia.org/wiki/Multi-objective_optimization#Visualization_of_the_Pareto_front) is often used to make comparisons between different sets of {(Region1, Region2 and Region3) and (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg)}
Note 2: Ultimately, these 3 regions can be defined by any set of 4 numbers. If each of these 4 numbers can be between 0 and 100, in 0.1 increments (e.g. 12, 12.1, 12.2, 12.3, etc.), this means that there exist 1000^4 = 1e12 possible solutions (roughly 1 trillion) to compare. There are simply far too many solutions to verify and compare individually. I am thinking that a mathematically based search/optimization approach could be used to search strategically for an optimal solution.
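For reference, the kind of random search I have in mind would look something like this in R (the 5,000 iterations and the nested-box structure of the regions are just placeholders, not a claim about the best formulation):
#Random-search sketch: draw candidate cutoffs on a1 and b1, score the regions, keep the best
set.seed(123)
a1 <- rnorm(1000, 100, 10); b1 <- rnorm(1000, 100, 10); c1 <- rnorm(1000, 5, 1)
train_data <- data.frame(a1, b1, c1)
best <- NULL
for (i in 1:5000) {
  ca <- sort(runif(2, min(a1), max(a1)))   #two cutoffs on a1
  cb <- sort(runif(2, min(b1), max(b1)))   #two cutoffs on b1
  r1 <- train_data$a1 <  ca[1] & train_data$b1 <  cb[1]
  r2 <- train_data$a1 >= ca[1] & train_data$a1 < ca[2] &
        train_data$b1 >= cb[1] & train_data$b1 < cb[2]
  r3 <- train_data$a1 >= ca[2] & train_data$b1 >= cb[2]
  if (min(sum(r1), sum(r2), sum(r3)) < 100) next   #minimum of 100 points per region
  means <- c(mean(c1[r1]), mean(c1[r2]), mean(c1[r3]))
  score <- mean(means)   #the c_avg objective
  if (is.null(best) || score < best$score) {
    best <- list(a_cuts = ca, b_cuts = cb, region_means = means, score = score)
  }
}
best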

Stata twoway graph of means with confidence intervals

Using
clear
score group test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
I want to scatter plot mean score by group for each test (same graph) with confidence intervals (the real data has thousands of observations). The resulting graph would have two sets of two dots. One set of dots for test==a (group==0 vs group==1) and one set of dots for test==b (group==0 vs group==1).
My current approach works but it is laborious. I compute all of the needed statistics using egen: the mean, number of observations, standard deviations...for each group by test. I then collapse the data and plot.
There has to be another way, no?
I assumed that Stata would be able to take as its input the score group and test variables and then compute and present this pretty standard graph.
After spending a lot of time on Google, I had to ask.
Although there are user-written programs, I lean towards statsby as a basic approach here. Discussion is accessible in this paper.
This example takes your data example (almost executable code). Some choices depend on the large confidence intervals implied. Note that if your version of Stata is not up-to-date, the syntax of ci will be different. (Just omit means.)
clear
input score group str1 test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
save cj12 , replace
* test A
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "A"
gen test = "A"
save cj12results, replace
* test B
use cj12
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "B"
gen test = "B"
append using cj12results
* graph; show sample sizes too, but where to show them is empirical
set scheme s1color
gen where = -20
scatter mean group, ms(O) mcolor(blue) || ///
rcap ub lb group, lcolor(blue) ///
by(test, note("95% confidence intervals") legend(off)) ///
subtitle(, fcolor(ltblue*0.2)) ///
ytitle(score) xla(0 1) xsc(r(-0.25 1.25)) yla(-10(10)10, ang(h)) || ///
scatter where group, ms(none) mla(N) mlabpos(12) mlabsize(*1.5)
We can't compare this with your code or your graph, because you show neither.

Congruence among different values within samples

I'd like to test the congruence among different scores within each sampled site. These scores were calculated with five different methods of measuring species diversity (http://en.wikipedia.org/wiki/Diversity_index). For instance, if the value of index "a" is high, should the values of indices b, c, d, and e be high as well? In this way, I'd like to calculate that congruence within each sampled site.
Could you suggest any method to test this congruence? I've tried to calculate the coefficient of variation within each site, but it does not make sense to me because the indices vary on different scales. I provide an example of the dataset below.
Thank you in advance.
Sample data
df <- data.frame(a=rnorm(11, 5, 2),
b=rnorm(11, 1, 1),
c=rnorm(11, 2, 1),
d=rnorm(11, 0, 1),
e=rnorm(11, 3, 2))
rownames(df) <- paste("site", 1:11, sep="")
df
A classification tree would automatically optimize your congruence index. The rpart package in R offers the Gini index and the Information index (I think that is the same as the entropy index). You would need to stack your data (using the reshape2 package here). In this example I assumed you were trying to classify species by the numeric observation and the site location.
Also, if you have a more statistics-inspired question with a bit of R, feel free to try https://stats.stackexchange.com/
require(rpart)
require(reshape2)
df$site = rownames(df)
stackDF = melt(df, variable.name="species", value.name="observation")
str(stackDF)
classTree <- rpart(species ~ observation + site,data=stackDF, parms=list(split="gini"))
# classTree <- rpart(species ~ site + observation,data=stackDF, parms=list(split="information"))
printcp(classTree)
table(actual=stackDF$species, predicted=predict(classTree,type="class"))
plot(classTree,compress=T,uniform=T,branch=0.4,margin=0.1)
text(classTree)
Roland makes a good suggestion to use principal components. You can use pck = princomp(stackDF[,-which(colnames(stackDF)=="species"),drop=F]) and then change the formula in your tree to be stackDF$species ~ pck +.... You can check the cross-validation with printcp and prune the tree with prune.
> table(actual=stackDF$species, predicted=predict(classTree,type="class"))
predicted
actual a b c d e
a 10 1 0 0 0
b 0 6 3 2 0
c 0 1 9 1 0
d 0 3 0 8 0
e 9 0 0 2 0
Of course none of the classifications in the example make sense because they are random.
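If you want to quantify the congruence of the indices more directly, a minimal sketch of the principal-components idea applied to the original wide df might look like this (using cor = TRUE and reading congruence off the first component are my assumptions, not something Roland specified):
#PCA on the five index columns; cor = TRUE works on the correlation matrix,
#so the indices' different scales (the issue raised in the question) do not dominate
pc <- princomp(df[, c("a", "b", "c", "d", "e")], cor = TRUE)
summary(pc)       #if the indices are congruent, Comp.1 explains most of the variance
pc$loadings[, 1]  #congruent indices load with the same sign on Comp.1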

Cutting dendrogram into n trees with minimum cluster size in R

I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned in order to aggregate them into 100-member groups, which can be very time consuming.
I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit to designate the number of groupings, but according to the documentation this limits the maximum number to 4. For the exercise below, all I'm looking to do is to get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).
Here's my example, using the Orange dataset.
library(dynamicTreeCut)
library(reshape2)
##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)
####casting the data to make a correlation matrix, and then running
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))
###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)
dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As you can see, it only designates two clusters. For my exercise, I want to shy away from using an explicit height term to cut the trees, as I want a k number of trees instead.
1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details.
2- Plot the dendrogram before starting the cutting step. See ?hclust for more details.
3- Use the hybrid adaptive tree cut method in the dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).
For your example,
1- Change "euclidean" and/or "complete" methods as appropriate,
orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")
2- Plot the dendrogram,
plot(orangeClust)
3- Use the hybrid tree cut method and tune shape parameters,
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As a guide for tuning the shape parameters, the default values are
deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4
As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.
For example, to get more (and smaller) clusters:
maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0
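As a side note (not part of the tuning above): if you only needed exactly k groups and could drop the minimum-size requirement, plain cutree() already does that, and it is a useful baseline to compare against:
#Baseline sketch: cut into exactly 5 groups; note this does NOT enforce a minimum cluster size
plainCut <- cutree(orangeClust, k = 5)
table(plainCut)   #inspect the resulting group sizes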
Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This leads you to the important step of cluster validation and interpretation.
Good Luck!

Generating a random graph in R

I would like to generate a random graph in R using any of the packages.
The desired output would be a two-column matrix, with the first column listing agents and the second column their connections, of the following form:
1 3
1 4
1 6
1 7
2 2
2 5
3 9
3 11
3 32
3 43
3 2
4 5
I would like to be able to specify the average degree and minimum and maximum number of contacts.
What is the easiest way of doing this?
Since you don't specify the need for anything other than just a graph, we can do this very simply:
actor <- sample(1:4, 10, replace=TRUE)
receiver <- sample(3:43, 10, replace=TRUE)
graph <- cbind(actor,receiver)
If you want something more specific, have a look at igraph, for instance:
library(igraph)
graph <- erdos.renyi.game(21, 0.3, type=c("gnp", "gnm"),
directed = FALSE, loops = FALSE)
# here the 0.3 is the probability of ties and 21 is the number of nodes
# this is a one mode network
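To get the two-column agent/connection matrix described in the question from the igraph object, and to check the degree properties you mention, something like this should work (my addition, using igraph's as_edgelist() and degree() helpers):
# convert the igraph object to a two-column edge list
edge_matrix <- as_edgelist(graph)
head(edge_matrix)
# check the degree properties mentioned in the question
mean(degree(graph))    # average degree
range(degree(graph))   # minimum and maximum number of contacts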
Alternatively, the package bipartite focuses specifically on two-mode networks:
library(bipartite)
web <- genweb(N1 = 5, N2 = 10, dens = 2)
web2edges(web,return=TRUE)
# here N1 is the number of nodes in set 1 and N2 the number of nodes in set 2
# and dens the average number of ties per node
There are many things to take into account, for instance whether you want to constrain the degree distribution, the probability of ties between agents, etc.
