I am trying to extract the Validation Measures from an R clustering validation object created using clValid.
I create the object and print the full summary with the following:
library(clValid)
x <- clValid(iris[, -5], nClust=2:10,
clMethods=c('hierarchical'), validation='internal')
summary(x)
The output of this is:
Clustering Methods:
hierarchical
Cluster sizes:
2 3 4 5 6 7 8 9 10
Validation Measures:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 0.0000 4.4770 8.9929 15.4893 18.4183 24.8464 29.8425 36.8567 39.5607
Dunn 0.3389 0.1378 0.1540 0.1540 0.1668 0.1624 0.1624 0.1915 0.1915
Silhouette 0.6867 0.5542 0.4720 0.4307 0.3420 0.3707 0.3659 0.3167 0.3083
Optimal Scores:
Score Method Clusters
Connectivity 0.0000 hierarchical 2
Dunn 0.3389 hierarchical 2
Silhouette 0.6867 hierarchical 2
Required output
I am trying to get the Validation Measures as a dataframe like this:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 0.0000 4.4770 8.9929 15.4893 18.4183 24.8464 29.8425 36.8567 39.5607
Dunn 0.3389 0.1378 0.1540 0.1540 0.1668 0.1624 0.1624 0.1915 0.1915
Silhouette 0.6867 0.5542 0.4720 0.4307 0.3420 0.3707 0.3659 0.3167 0.3083
Attempt
When I use:
names(summary(x))
attributes(summary(x))
these both give
NULL
I can get the Optimal Scores using optimalScores(x); however, the analogous call validationMeasures(x) does not work.
Question
Is there a way to extract the Validation Measures as a data.frame from this summary object?
First of all, you should always try str() on the object:
str(x)
Formal class 'clValid' [package "clValid"] with 14 slots
..@ clusterObjs:List of 1
.. ..$ hierarchical:List of 7
.. .. ..$ merge : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -5 -20 -30 -58 ...
.. .. ..$ height : num [1:149] 0 0.1 0.1 0.1 0.1 ...
.. .. ..$ order : int [1:150] 42 15 16 33 34 37 21 32 44 24 ...
.. .. ..$ labels : NULL
.. .. ..$ method : chr "average"
.. .. ..$ call : language hclust(d = Dist, method = method)
.. .. ..$ dist.method: chr "euclidean"
.. .. ..- attr(*, "class")= chr "hclust"
..@ measures : num [1:3, 1:9, 1] 0 0.339 0.687 4.477 0.138 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
..@ measNames : chr [1:3] "Connectivity" "Dunn" "Silhouette"
..@ clMethods : chr "hierarchical"
..@ labels : chr [1:150] "1" "2" "3" "4" ...
..@ nClust : num [1:9] 2 3 4 5 6 7 8 9 10
..@ validation : chr "internal"
..@ metric : chr "euclidean"
..@ method : chr "average"
..@ neighbSize : num 10
..@ annotation : NULL
..@ GOcategory : chr "all"
..@ goTermFreq : num 0.05
..@ call : language clValid(obj = iris[, -5], nClust = 2:10, clMethods = c("hierarchical"), validation = "internal")
So we can see that this package uses and returns S4 objects, and that one of the slots, measures, seems to be the one you want.
x@measures[,,"hierarchical"]
2 3 4 5 6 7
Connectivity 0.0000000 4.4769841 8.9928571 15.4892857 18.4182540 24.8464286
Dunn 0.3389087 0.1378257 0.1540416 0.1540416 0.1668323 0.1624158
Silhouette 0.6867351 0.5541609 0.4719936 0.4306700 0.3419904 0.3707424
8 9 10
Connectivity 29.8424603 36.8567460 39.5607143
Dunn 0.1624158 0.1914854 0.1914854
Silhouette 0.3658753 0.3166807 0.3082851
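Since the question asks for a data.frame, the matrix above can be wrapped in as.data.frame(). The following is a minimal sketch, not a definitive recipe: `m` is a hand-built stand-in (an assumption, so the example runs without clValid) that copies the first three columns of the output above; with the real object you would use x@measures in its place.

```r
# Stand-in for x@measures: a 3 x 3 x 1 array with the same dimnames layout
# (values copied from the first three columns of the output above).
m <- array(
  c(0.0000000, 0.3389087, 0.6867351,
    4.4769841, 0.1378257, 0.5541609,
    8.9928571, 0.1540416, 0.4719936),
  dim = c(3, 3, 1),
  dimnames = list(c("Connectivity", "Dunn", "Silhouette"),
                  c("2", "3", "4"),
                  "hierarchical")
)

# Selecting one method drops the third dimension and leaves a matrix,
# which as.data.frame() converts column by column.
measures_df <- as.data.frame(m[, , "hierarchical"])
print(measures_df)
```

If several clustering methods were requested, the same subsetting could be repeated for each name in the third dimension.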
Related
I have a large dataset that contains more than 300,000 rows/observations and 22 variables. I used the CLARA method for the clustering and plotted the results using fviz_cluster. Using the silhouette method, I got 10 as my number of clusters, and I then passed that number to the CLARA algorithm.
clara.res <- clara(df, 10, samples = 50,trace = 1,sampsize = 1000, pamLike = TRUE)
str(clara.res)
List of 10
$ sample : chr [1:1000] "100046" "100303" "10052" "100727" ...
$ medoids : num [1:10, 1:22] 0.925 0.125 0.701 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "193751" "137853" "229261" "257462" ...
.. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
$ i.med : int [1:10] 104171 42062 143627 174961 300065 13836 192832 207079 185241 228575
$ clustering: Named int [1:302251] 1 1 1 2 3 4 5 3 3 3 ...
..- attr(*, "names")= chr [1:302251] "1" "10" "100" "1000" ...
$ objective : num 0.37
$ clusinfo : num [1:10, 1:4] 71811 40181 46271 10155 31309 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:4] "size" "max_diss" "av_diss" "isolation"
$ diss : 'dissimilarity' num [1:499500] 1.392 2.192 0.937 2.157 1.643 ...
..- attr(*, "Size")= int 1000
..- attr(*, "Metric")= chr "euclidean"
..- attr(*, "Labels")= chr [1:1000] "100046" "100303" "10052" "100727" ...
$ call : language clara(x = df, k = 10, samples = 50, sampsize = 1000, trace = 1, pamLike = TRUE)
$ silinfo :List of 3
..$ widths : num [1:1000, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1000] "83395" "181310" "34452" "42991" ...
.. .. ..$ : chr [1:3] "cluster" "neighbor" "sil_width"
..$ clus.avg.widths: num [1:10] 0.645 0.408 0.487 0.513 0.839 ...
..$ avg.width : num 0.612
$ data : num [1:302251, 1:22] 1 1 1 0.366 0.35 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:302251] "1" "10" "100" "1000" ...
.. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
- attr(*, "class")= chr [1:2] "clara" "partition"
For the plot:
fviz_cluster(clara.res,
palette = c(
"#004c6d",
"#00a1c1",
"#ffc334",
"#78ab63",
"#00ffff",
"#00cfe3",
"#6efa75",
"#cc0089",
"#ff9509",
"#ffb6de"
), # color palette
ellipse.type = "t",geom = "point",show.clust.cent = TRUE,repel = TRUE,pointsize = 0.5,
ggtheme = theme_classic()
)+ xlim(-7, 3) + ylim (-5, 4) + labs(title = "Plot of clusters")
The result:
I reckoned that this cluster plot is based on PCA and have been trying to figure out which variables in my original data were chosen as Dim1 and Dim2, or what these x- and y-axes represent. Can somebody help me find out what Dim1 and Dim2 are, and how to get the eigenvalues/variance of all the dimensions, without running PCA separately?
I saw there are some other functions/packages for PCA, such as get_eigenvalue in factoextra and FactoMineR, but it seems they would require me to run the PCA algorithm from the beginning. How can I integrate them directly with my CLARA results?
Also, my Dim1 accounts for only 12.3% and Dim2 for 8.8%. Does that mean these dimensions are not representative enough? Considering that I would have 22 dimensions in total (from my 22 variables), I think it's alright, no? I am not sure how these percentages of Dim1 and Dim2 affect my cluster results. I was thinking of making a scree plot from my CLARA results, but I can't figure that out either.
I'd appreciate any insights.
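One possible way to look at the dimensions, sketched under assumptions: as far as I understand, fviz_cluster projects data with more than two variables onto principal components computed by prcomp() on the (by default standardized) data, so Dim1 and Dim2 would be linear combinations of all variables rather than two chosen ones. That behaviour is worth checking against the factoextra source. The example below uses the iris measurements as a hypothetical stand-in for clara.res$data so that it runs on its own.

```r
# Hypothetical stand-in for clara.res$data (the real data has 22 columns).
dat <- as.matrix(iris[, -5])

# PCA on standardized data; this is an assumption about what the plot does.
pca <- prcomp(scale(dat))

# Share of total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained, 1))

# Loadings: how much each original variable contributes to Dim1 and Dim2.
loadings <- pca$rotation[, 1:2]
print(loadings)
```

Plotting var_explained as a bar chart would give the scree plot mentioned in the question.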
I'm definitely a noob, though I have used R for various small tasks for several years.
For the life of me, I cannot figure out how to get the results from the "Desc" function into something I can work with. When I save x <- Desc(mydata), class(x) shows up as "Desc". In RStudio it is listed under Values and says "List of 1". Then when I click on x it says ":List of 25" in the first line. There is a list of data in this object, but I cannot figure out how to grab any of it.
Clearly I have a severe misunderstanding of the R data structures, but I have been searching for the past 90 minutes to no avail, so I figured I would reach out.
In short, I just want to pull certain aspects (N, mean, UB, LB, median) of the descriptive statistics provided from the Desc results for multiple datasets and build a little table that I can then work with.
Thanks for the help.
Say you have a dataframe, x, where:
x <- data.frame(i=c(1,2,3),j=c(4,5,6))
You could set:
desc.x <- Desc(x)
And access the info on any given column like:
desc.x$i
desc.x$i$mean
desc.x$j$sd
The same goes for any other stats Desc comes up with. The $ is the key here: it is how you access the named fields of the list that Desc returns.
Edit: If you pass a single column (as the asker does), or simply a vector, to Desc, you get back a one-item list. The same principle applies, but the syntax is slightly different. Now you would use:
desc.x <- Desc(df$my.col)
desc.x[[1]]$mean
In the future, the way to attack this is to either look in the environment window in RStudio and play around trying to figure out how to access the fields, check the source code on github or elsewhere, or (best first choice) use str(desc.x), which gives us:
> str(desc.x)
List of 1
$ :List of 25
..$ xname : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i"
..$ label : NULL
..$ class : chr "numeric"
..$ classlabel: chr "numeric"
..$ length : int 3
..$ n : int 3
..$ NAs : int 0
..$ main : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i (numeric)"
..$ unique : int 3
..$ 0s : int 0
..$ mean : num 2
..$ meanSE : num 0.577
..$ quant : Named num [1:9] 1 1.1 1.2 1.5 2 2.5 2.8 2.9 3
.. ..- attr(*, "names")= chr [1:9] "min" ".05" ".10" ".25" ...
..$ range : num 2
..$ sd : num 1
..$ vcoef : num 0.5
..$ mad : num 1.48
..$ IQR : num 1
..$ skew : num 0
..$ kurt : num -2.33
..$ small :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 1 2 3
.. ..$ freq: num [1:3] 1 1 1
..$ large :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 3 2 1
.. ..$ freq: num [1:3] 1 1 1
..$ freq :Classes ‘Freq’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ level : Factor w/ 3 levels "1","2","3": 1 2 3
.. ..$ freq : int [1:3] 1 1 1
.. ..$ perc : num [1:3] 0.333 0.333 0.333
.. ..$ cumfreq: int [1:3] 1 2 3
.. ..$ cumperc: num [1:3] 0.333 0.667 1
..$ maxrows : num 12
..$ x : num [1:3] 1 2 3
- attr(*, "class")= chr "Desc"
"List of 1" means you access it with desc.x[[1]], and the fields below that are reached with $. When you see something like num [1:3], that means it's an atomic vector, so you access the first member like var$field$numbers[1].
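To build the small table the question asks for, the same [[1]] and $ access can be repeated over several results. This is only a sketch: fake_desc() is a hypothetical helper that imitates the shape of Desc(vector) using a few of the field names visible in the str() output above (n, mean, sd, meanSE), so the example runs without DescTools; with real results you would put the Desc() outputs in the list instead.

```r
# Hypothetical stand-in shaped like Desc(v): a one-item list holding named fields.
fake_desc <- function(v) {
  list(list(n = length(v),
            mean = mean(v),
            sd = sd(v),
            meanSE = sd(v) / sqrt(length(v))))
}

# One result per dataset (stand-ins for Desc() called on each dataset).
results <- list(a = fake_desc(c(1, 2, 3)),
                b = fake_desc(c(4, 5, 6, 7)))

# Pull the wanted fields out of each result and stack them into one data.frame.
stats_table <- do.call(rbind, lapply(results, function(d) {
  s <- d[[1]]  # step into the single list element first
  data.frame(n = s$n, mean = s$mean, sd = s$sd, meanSE = s$meanSE)
}))
print(stats_table)
```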
For some reason, when I try to use the plot() function to visualise the output of the RFsimulate() function in the RandomFields package, the output is always an empty plot.
I am just using the example code included in the help file:
## first let us look at the list of implemented models
RFgetModelNames(type="positive definite", domain="single variable",
iso="isotropic")
## our choice is the exponential model;
## the model includes nugget effect and the mean:
model <- RMexp(var=5, scale=10) + # with variance 5 and scale 10
RMnugget(var=1) + # nugget
RMtrend(mean=0.5) # and mean
## define the locations:
from <- 0
to <- 20
x.seq <- seq(from, to, length=200)
y.seq <- seq(from, to, length=200)
simu <- RFsimulate(model=model, x=x.seq, y=y.seq)
str(simu)
Which gives:
Formal class 'RFspatialGridDataFrame' [package ""] with 5 slots
..@ .RFparams :List of 5
.. ..$ n : num 1
.. ..$ vdim : int 1
.. ..$ T : num(0)
.. ..$ coordunits: NULL
.. ..$ varunits : NULL
..@ data :'data.frame': 441 obs. of 1 variable:
.. ..$ variable1: num [1:441] 4.511 2.653 3.951 0.771 2.718 ...
..@ grid :Formal class 'GridTopology' [package "sp"] with 3 slots
.. .. ..@ cellcentre.offset: Named num [1:2] 0 0
.. .. .. ..- attr(*, "names")= chr [1:2] "coords.x1" "coords.x2"
.. .. ..@ cellsize : Named num [1:2] 1 1
.. .. .. ..- attr(*, "names")= chr [1:2] "coords.x1" "coords.x2"
.. .. ..@ cells.dim : int [1:2] 21 21
..@ bbox : num [1:2, 1:2] -0.5 -0.5 20.5 20.5
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
.. .. ..$ : chr [1:2] "min" "max"
..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
.. .. ..@ projargs: chr NA
... so data has been simulated, but when I call
plot(simu)
I end up with something like this:
[image: empty plot]
Can anyone tell me what is going on here?
I would coerce the object back to an sp SpatialGridDataFrame and plot that, as RandomFields creates a wrapper around this S4 class:
sgdf = sp::SpatialGridDataFrame(simu@grid, simu@data, simu@proj4string)
sp::plot(sgdf)
Also, you can coerce to matrix and plot using the standard graphics library:
graphics::image(as.matrix(simu))
The strange thing is that converting it to a SpatialGridDataFrame requires a flip and transpose before plotting:
graphics::image(t(apply(as.matrix(sgdf), 1, rev)))
Apparently, the two representations are a bit inconsistent internally. The simplest solution is to convert simu to a raster and plot that:
r = raster::raster(simu)
raster::plot(r)
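For readers unsure what the flip-and-transpose idiom t(apply(m, 1, rev)) from the answer does, here is a small base R illustration on a toy matrix (the matrix `m` is made up for the example). apply() returns the reversed rows as columns, and t() turns them back into rows, so the net effect is that the column order is reversed.

```r
# Toy matrix: rows are (1, 3, 5) and (2, 4, 6).
m <- matrix(1:6, nrow = 2)

# Reverse each row; apply() returns the results as columns, so transpose back.
flipped <- t(apply(m, 1, rev))
print(flipped)

# The same result written as plain column indexing.
print(m[, ncol(m):1])
```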
I would like to extract the p-values from the Anderson-Darling test (ad.test from package kSamples). The test result is a list of 12 containing a 2x3 matrix. The p value is part of the 2x3 matrix and is present in element 7.
When using the following code:
lapply(AD_result, "[[", 7)
I get the following subset of AD test results (the first 2 of a total of 50 are shown):
[[1]]
AD T.AD asympt. P-value
version 1: 1.72 0.94536 0.13169
version 2: 1.51 0.66740 0.17461
[[2]]
AD T.AD asympt. P-value
version 1: 12.299 14.624 6.9248e-07
version 2: 11.900 14.144 1.1146e-06
My question is how to extract only the p-value (e.g. from version 1) and put these 50 results into a vector.
The output from str(AD_result) is:
List of 55
$ :List of 12
..$ test.name : chr "Anderson-Darling"
..$ k : int 2
..$ ns : int [1:2] 103 2905
..$ N : int 3008
..$ n.ties : int 2873
..$ sig : num 0.762
..$ ad : num [1:2, 1:3] 1.72 1.51 0.945 0.667 0.132 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "version 1:" "version 2:"
.. .. ..$ : chr [1:3] "AD" "T.AD" " asympt. P-value"
..$ warning : logi FALSE
..$ null.dist1: NULL
..$ null.dist2: NULL
..$ method : chr "asymptotic"
..$ Nsim : num 1
..- attr(*, "class")= chr "kSamples"
You could try:
unlist(lapply(AD_result, function(x) x$ad[,3]))
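Note that x$ad[,3] selects the whole third column, so it returns the p-values of both versions for every test. If only version 1 is wanted, indexing row 1 as well should give one value per test. A minimal sketch follows; make_result() and AD_demo are hypothetical stand-ins that mimic only the ad matrix shown in the str() output, so the example runs without kSamples.

```r
# Hypothetical stand-in for one ad.test result: only the 2 x 3 `ad` matrix.
make_result <- function(p1, p2) {
  list(ad = matrix(c(1.72, 1.51, 0.945, 0.667, p1, p2), nrow = 2,
                   dimnames = list(c("version 1:", "version 2:"),
                                   c("AD", "T.AD", " asympt. P-value"))))
}

# Two fake results using the p-values printed in the question.
AD_demo <- list(make_result(0.13169, 0.17461),
                make_result(6.9248e-07, 1.1146e-06))

# Row 1 = version 1, column 3 = asymptotic p-value; vapply returns a numeric vector.
p_v1 <- vapply(AD_demo, function(x) x$ad[1, 3], numeric(1))
print(p_v1)
```

vapply() is used here instead of unlist(lapply()) so that the result is guaranteed to be one number per list element.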
I'm working on validating the goodness of hierarchical clustering using clValid. Below is my code. The clustering always results in one noisy cluster that contains 70% of the elements, so I recursively cluster the elements in the noisy cluster.
intern <- clValid(primaryDataSource, 2:10,clMethods = c("Hierarchical"),
validation="internal", maxitems = 2200)
summary(intern)
Output of summary(intern):
Clustering Methods:
hierarchical
Cluster sizes:
2 3 4 5 6 7 8 9 10
Validation Measures:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 3.8738 3.8738 8.2563 10.9452 16.0286 18.6452 20.6452 22.6452 24.6452
Dunn 4.0949 0.8810 0.6569 0.8694 0.8808 1.0416 1.0230 1.0262 1.3724
Silhouette 0.9592 0.9879 0.9785 0.9751 0.9727 0.9729 0.9727 0.9726 0.9725
Optimal Scores:
Score Method Clusters
Connectivity 3.8738 hierarchical 2
Dunn 4.0949 hierarchical 2
Silhouette 0.9879 hierarchical 3
At each iteration I have to execute the clValid() and select the number of clusters which would give me the highest Silhouette value (in the above example it's 3). I'm trying to automate the recursive clustering approach. Hence I'm looking to pick the number of clusters which would have the highest Silhouette value. Can you please help me in extracting that piece of information? Thank you.
P.S: I tried converting the results into a data frame or a table. However it didn't work.
Update: After using str()
> str(intern)
Formal class 'clValid' [package "clValid"] with 14 slots
..@ clusterObjs:List of 1
.. ..$ hierarchical:List of 7
.. .. ..$ merge : int [1:2173, 1:2] -1673 -714 -1121 -1688 -1876 -1123 -1689 -1228 -429 -535 ...
.. .. ..$ height : num [1:2173] 0 0.001 0.001 0.001 0.001 ...
.. .. ..$ order : int [1:2174] 2165 2166 1950 1951 1954 1955 1577 1565 1564 1576 ...
.. .. ..$ labels : chr [1:2174] "out_M_aacald_c_boundary" "out_M_12ppd_DASH_R_e_boundary" "out_M_12ppd_DASH_S_e_boundary" "in_M_14glucan_e_boundary" ...
.. .. ..$ method : chr "average"
.. .. ..$ call : language hclust(d = Dist, method = method)
.. .. ..$ dist.method: chr "euclidean"
.. .. ..- attr(*, "class")= chr "hclust"
..@ measures : num [1:3, 1:9, 1] 3.874 4.095 0.959 3.874 0.881 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
..@ measNames : chr [1:3] "Connectivity" "Dunn" "Silhouette"
..@ clMethods : chr "hierarchical"
..@ labels : chr [1:2174] "out_M_aacald_c_boundary" "out_M_12ppd_DASH_R_e_boundary" "out_M_12ppd_DASH_S_e_boundary" "in_M_14glucan_e_boundary" ...
..@ nClust : num [1:9] 2 3 4 5 6 7 8 9 10
..@ validation : chr "internal"
..@ metric : chr "euclidean"
..@ method : chr "average"
..@ neighbSize : num 10
..@ annotation : NULL
..@ GOcategory : chr "all"
..@ goTermFreq : num 0.05
..@ call : language clValid(obj = primaryDataSource, nClust = 2:10, clMethods = c("Hierarchical"), validation = "internal", maxitems = 2200)
I guess the important section is
@ measures : num [1:3, 1:9, 1] 3.874 4.095 0.959 3.874 0.881 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
When I executed intern@measures I got the result below.
2 3 4 5 6 7 8 9
Connectivity 3.8738095 3.8738095 8.2563492 10.9452381 16.0285714 18.6452381 20.6452381 22.645238
Dunn 4.0948837 0.8810494 0.6568857 0.8694067 0.8808228 1.0415614 1.0230197 1.026192
Silhouette 0.9591803 0.9879153 0.9784684 0.9751393 0.9727454 0.9728736 0.9727153 0.972622
10
Connectivity 24.6452381
Dunn 1.3724494
Silhouette 0.9725379
I'm able to get the max and access individual items based on the index. I want to get the maximum value for Silhouette.
intern@measures[1]
max(intern@measures)
Some additional explanation: when str() shows @ signs, it indicates that the object you are inspecting is an S4 object with attributes (slots). I am not familiar with clValid, but a quick look at the source code shows that clValid is an S4 class.
You can access those using object@attribute. These attributes can hold anything.
Looking at the print function for clValid it seems that you can access the measures using the convenience function measures(object). Looking at the remaining source code for clValid there are utility functions that may be of use for you. Check optimalScores().
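To pick the number of clusters with the highest Silhouette value, the measures array can be indexed by name and passed to which.max(). The sketch below is illustrative only: `m` is a hand-built stand-in (an assumption, so the example runs without clValid) holding the first three columns printed in the question; with the real object you would use intern@measures or measures(intern) instead.

```r
# Stand-in for intern@measures: first three columns of the output in the question.
m <- array(
  c(3.8738095, 4.0948837, 0.9591803,
    3.8738095, 0.8810494, 0.9879153,
    8.2563492, 0.6568857, 0.9784684),
  dim = c(3, 3, 1),
  dimnames = list(c("Connectivity", "Dunn", "Silhouette"),
                  c("2", "3", "4"),
                  "hierarchical")
)

# Named vector of Silhouette scores; the names are the cluster counts.
sil <- m["Silhouette", , "hierarchical"]

# The name of the largest score is the best number of clusters.
best_k <- as.integer(names(sil)[which.max(sil)])
best_sil <- max(sil)
print(best_k)
print(best_sil)
```

Inside a recursive clustering loop, best_k could then be fed to the next clustering step.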