R: Extract values from a summary() in a clValid object

I'm working on validating the goodness of hierarchical clustering using clValid. Below is my code. The clustering always results in one noisy cluster that contains 70% of the elements, so I recursively cluster the elements in the noisy cluster.
intern <- clValid(primaryDataSource, 2:10,clMethods = c("Hierarchical"),
validation="internal", maxitems = 2200)
summary(intern)
Output of summary(intern):
Clustering Methods:
hierarchical
Cluster sizes:
2 3 4 5 6 7 8 9 10
Validation Measures:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 3.8738 3.8738 8.2563 10.9452 16.0286 18.6452 20.6452 22.6452 24.6452
Dunn 4.0949 0.8810 0.6569 0.8694 0.8808 1.0416 1.0230 1.0262 1.3724
Silhouette 0.9592 0.9879 0.9785 0.9751 0.9727 0.9729 0.9727 0.9726 0.9725
Optimal Scores:
Score Method Clusters
Connectivity 3.8738 hierarchical 2
Dunn 4.0949 hierarchical 2
Silhouette 0.9879 hierarchical 3
At each iteration I have to execute the clValid() and select the number of clusters which would give me the highest Silhouette value (in the above example it's 3). I'm trying to automate the recursive clustering approach. Hence I'm looking to pick the number of clusters which would have the highest Silhouette value. Can you please help me in extracting that piece of information? Thank you.
P.S: I tried converting the results into a data frame or a table. However it didn't work.
Update: After using str()
> str(intern)
Formal class 'clValid' [package "clValid"] with 14 slots
..@ clusterObjs:List of 1
.. ..$ hierarchical:List of 7
.. .. ..$ merge : int [1:2173, 1:2] -1673 -714 -1121 -1688 -1876 -1123 -1689 -1228 -429 -535 ...
.. .. ..$ height : num [1:2173] 0 0.001 0.001 0.001 0.001 ...
.. .. ..$ order : int [1:2174] 2165 2166 1950 1951 1954 1955 1577 1565 1564 1576 ...
.. .. ..$ labels : chr [1:2174] "out_M_aacald_c_boundary" "out_M_12ppd_DASH_R_e_boundary" "out_M_12ppd_DASH_S_e_boundary" "in_M_14glucan_e_boundary" ...
.. .. ..$ method : chr "average"
.. .. ..$ call : language hclust(d = Dist, method = method)
.. .. ..$ dist.method: chr "euclidean"
.. .. ..- attr(*, "class")= chr "hclust"
..@ measures : num [1:3, 1:9, 1] 3.874 4.095 0.959 3.874 0.881 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
..@ measNames : chr [1:3] "Connectivity" "Dunn" "Silhouette"
..@ clMethods : chr "hierarchical"
..@ labels : chr [1:2174] "out_M_aacald_c_boundary" "out_M_12ppd_DASH_R_e_boundary" "out_M_12ppd_DASH_S_e_boundary" "in_M_14glucan_e_boundary" ...
..@ nClust : num [1:9] 2 3 4 5 6 7 8 9 10
..@ validation : chr "internal"
..@ metric : chr "euclidean"
..@ method : chr "average"
..@ neighbSize : num 10
..@ annotation : NULL
..@ GOcategory : chr "all"
..@ goTermFreq : num 0.05
..@ call : language clValid(obj = primaryDataSource, nClust = 2:10, clMethods = c("Hierarchical"), validation = "internal", maxitems = 2200)
I guess the important section is
..@ measures : num [1:3, 1:9, 1] 3.874 4.095 0.959 3.874 0.881 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
When I executed intern@measures, I got the result below.
2 3 4 5 6 7 8 9
Connectivity 3.8738095 3.8738095 8.2563492 10.9452381 16.0285714 18.6452381 20.6452381 22.645238
Dunn 4.0948837 0.8810494 0.6568857 0.8694067 0.8808228 1.0415614 1.0230197 1.026192
Silhouette 0.9591803 0.9879153 0.9784684 0.9751393 0.9727454 0.9728736 0.9727153 0.972622
10
Connectivity 24.6452381
Dunn 1.3724494
Silhouette 0.9725379
I'm able to get the overall maximum and to access individual items by index, but I want the maximum value for Silhouette only.
intern@measures[1]
max(intern@measures)

Some additional explanation: when str() shows @ signs, the object you are inspecting is an S4 object, and each @ entry is one of its slots. I am not familiar with clValid, but a quick look at the source code shows that clValid is an S4 class.
You can access slots using object@slot. A slot can hold any kind of value.
Looking at the print function for clValid, it seems that you can access the measures using the convenience function measures(object). The remaining source code for clValid contains other utility functions that may be useful to you; check optimalScores().
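As a minimal sketch of the extraction itself, the snippet below rebuilds the measures array by hand from the rounded values printed by summary(intern) in the question, so it runs without clValid installed. The names m, meas, sil, best_k and best_sil are made up for illustration; with the real object you would presumably start from intern@measures (or measures(intern), assuming that accessor returns the same array) instead of the mock.

```r
# Mock of the `measures` slot: measures x cluster counts x methods.
# Values are the rounded ones printed by summary(intern) in the question.
m <- rbind(
  Connectivity = c(3.8738, 3.8738, 8.2563, 10.9452, 16.0286,
                   18.6452, 20.6452, 22.6452, 24.6452),
  Dunn         = c(4.0949, 0.8810, 0.6569, 0.8694, 0.8808,
                   1.0416, 1.0230, 1.0262, 1.3724),
  Silhouette   = c(0.9592, 0.9879, 0.9785, 0.9751, 0.9727,
                   0.9729, 0.9727, 0.9726, 0.9725)
)
meas <- array(m, dim = c(3, 9, 1),
              dimnames = list(rownames(m), as.character(2:10), "hierarchical"))

# Pick the Silhouette row for one method; the result is a vector named by k.
sil <- meas["Silhouette", , "hierarchical"]

# Number of clusters with the highest Silhouette, and the score itself.
best_k   <- as.integer(names(which.max(sil)))
best_sil <- max(sil)
```

Because the vector keeps the cluster counts as names, which.max() gives the position and names() turns it back into the number of clusters, which is the value needed to drive the next round of recursive clustering.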

Related

R convert cluster summary object to dataframe

I am trying to extract the Validation Measures from an R clustering validation object created using clValid.
When I create the object and print the full summary, I use the following
library(clValid)
x <- clValid(iris[, -5], nClust=2:10,
clMethods=c('hierarchical'), validation='internal')
summary(x)
The output of this is:
Clustering Methods:
hierarchical
Cluster sizes:
2 3 4 5 6 7 8 9 10
Validation Measures:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 0.0000 4.4770 8.9929 15.4893 18.4183 24.8464 29.8425 36.8567 39.5607
Dunn 0.3389 0.1378 0.1540 0.1540 0.1668 0.1624 0.1624 0.1915 0.1915
Silhouette 0.6867 0.5542 0.4720 0.4307 0.3420 0.3707 0.3659 0.3167 0.3083
Optimal Scores:
Score Method Clusters
Connectivity 0.0000 hierarchical 2
Dunn 0.3389 hierarchical 2
Silhouette 0.6867 hierarchical 2
Required output
I am trying to get the Validation Measures as a dataframe like this:
2 3 4 5 6 7 8 9 10
hierarchical Connectivity 0.0000 4.4770 8.9929 15.4893 18.4183 24.8464 29.8425 36.8567 39.5607
Dunn 0.3389 0.1378 0.1540 0.1540 0.1668 0.1624 0.1624 0.1915 0.1915
Silhouette 0.6867 0.5542 0.4720 0.4307 0.3420 0.3707 0.3659 0.3167 0.3083
Attempt
When I use:
names(summary(x))
attributes(summary(x))
these both give
NULL
I can get the Optimal Scores using optimalScores(x), however, this does not work with validationMeasures(x).
Question
Is there a way to extract the Validation Measures as a data.frame from this summary object?
First of all, you should always try
str(x)
Formal class 'clValid' [package "clValid"] with 14 slots
..@ clusterObjs:List of 1
.. ..$ hierarchical:List of 7
.. .. ..$ merge : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -5 -20 -30 -58 ...
.. .. ..$ height : num [1:149] 0 0.1 0.1 0.1 0.1 ...
.. .. ..$ order : int [1:150] 42 15 16 33 34 37 21 32 44 24 ...
.. .. ..$ labels : NULL
.. .. ..$ method : chr "average"
.. .. ..$ call : language hclust(d = Dist, method = method)
.. .. ..$ dist.method: chr "euclidean"
.. .. ..- attr(*, "class")= chr "hclust"
..@ measures : num [1:3, 1:9, 1] 0 0.339 0.687 4.477 0.138 ...
.. ..- attr(*, "dimnames")=List of 3
.. .. ..$ : chr [1:3] "Connectivity" "Dunn" "Silhouette"
.. .. ..$ : chr [1:9] "2" "3" "4" "5" ...
.. .. ..$ : chr "hierarchical"
..@ measNames : chr [1:3] "Connectivity" "Dunn" "Silhouette"
..@ clMethods : chr "hierarchical"
..@ labels : chr [1:150] "1" "2" "3" "4" ...
..@ nClust : num [1:9] 2 3 4 5 6 7 8 9 10
..@ validation : chr "internal"
..@ metric : chr "euclidean"
..@ method : chr "average"
..@ neighbSize : num 10
..@ annotation : NULL
..@ GOcategory : chr "all"
..@ goTermFreq : num 0.05
..@ call : language clValid(obj = iris[, -5], nClust = 2:10, clMethods = c("hierarchical"), validation = "internal")
So we can see that this package uses and returns S4 objects, and that one of the slots, measures, seems to be the one you want.
x@measures[,,"hierarchical"]
2 3 4 5 6 7
Connectivity 0.0000000 4.4769841 8.9928571 15.4892857 18.4182540 24.8464286
Dunn 0.3389087 0.1378257 0.1540416 0.1540416 0.1668323 0.1624158
Silhouette 0.6867351 0.5541609 0.4719936 0.4306700 0.3419904 0.3707424
8 9 10
Connectivity 29.8424603 36.8567460 39.5607143
Dunn 0.1624158 0.1914854 0.1914854
Silhouette 0.3658753 0.3166807 0.3082851
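The slot gives a matrix rather than the data.frame that was asked for, so one more step is needed. The sketch below uses a hand-built copy of the array (rounded values taken from the summary output in the question, so it runs without clValid); the names m, meas, df and long are illustrative only. With the real object, meas would presumably be x@measures.

```r
# Mock of x@measures built from the rounded values shown by summary(x).
m <- rbind(
  Connectivity = c(0.0000, 4.4770, 8.9929, 15.4893, 18.4183,
                   24.8464, 29.8425, 36.8567, 39.5607),
  Dunn         = c(0.3389, 0.1378, 0.1540, 0.1540, 0.1668,
                   0.1624, 0.1624, 0.1915, 0.1915),
  Silhouette   = c(0.6867, 0.5542, 0.4720, 0.4307, 0.3420,
                   0.3707, 0.3659, 0.3167, 0.3083)
)
meas <- array(m, dim = c(3, 9, 1),
              dimnames = list(rownames(m), as.character(2:10), "hierarchical"))

# Wide format: one row per measure, one column per number of clusters.
df <- as.data.frame(meas[, , "hierarchical"])

# Long format: one row per (measure, clusters, method) combination.
long <- as.data.frame(as.table(meas), responseName = "score")
```

The wide form matches the layout requested in the question; the long form may be easier to filter or plot when several clustering methods are compared.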

Plotting realisations of random Gaussian fields using RandomFields package results in blank graph. Why?

For some reason, when I try to use the plot() function to visualise the output of the RFsimulate() function in the RandomFields package, the output is always an empty plot.
I am just using the example code included in the help file:
## first let us look at the list of implemented models
RFgetModelNames(type="positive definite", domain="single variable",
iso="isotropic")
## our choice is the exponential model;
## the model includes nugget effect and the mean:
model <- RMexp(var=5, scale=10) + # with variance 5 and scale 10
RMnugget(var=1) + # nugget
RMtrend(mean=0.5) # and mean
## define the locations:
from <- 0
to <- 20
x.seq <- seq(from, to, length=200)
y.seq <- seq(from, to, length=200)
simu <- RFsimulate(model=model, x=x.seq, y=y.seq)
str(simu)
Which gives:
Formal class 'RFspatialGridDataFrame' [package ""] with 5 slots
..@ .RFparams :List of 5
.. ..$ n : num 1
.. ..$ vdim : int 1
.. ..$ T : num(0)
.. ..$ coordunits: NULL
.. ..$ varunits : NULL
..@ data :'data.frame': 441 obs. of 1 variable:
.. ..$ variable1: num [1:441] 4.511 2.653 3.951 0.771 2.718 ...
..@ grid :Formal class 'GridTopology' [package "sp"] with 3 slots
.. .. ..@ cellcentre.offset: Named num [1:2] 0 0
.. .. .. ..- attr(*, "names")= chr [1:2] "coords.x1" "coords.x2"
.. .. ..@ cellsize : Named num [1:2] 1 1
.. .. .. ..- attr(*, "names")= chr [1:2] "coords.x1" "coords.x2"
.. .. ..@ cells.dim : int [1:2] 21 21
..@ bbox : num [1:2, 1:2] -0.5 -0.5 20.5 20.5
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
.. .. ..$ : chr [1:2] "min" "max"
..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
.. .. ..@ projargs: chr NA
... so data has been simulated, but when I call
plot(simu)
I end up with something like this:
[Image: empty plot]
Can anyone tell me what is going on here?
I would coerce the object back to an sp SpatialGridDataFrame and plot that, as RandomFields creates a wrapper around this S4 class:
sgdf = sp::SpatialGridDataFrame(simu@grid, simu@data, simu@proj4string)
sp::plot(sgdf)
Also, you can coerce to matrix and plot using the standard graphics library:
graphics::image(as.matrix(simu))
The strange thing is that converting it to a SpatialGridDataFrame requires a flip and transpose before plotting:
graphics::image(t(apply(as.matrix(sgdf), 1, rev)))
Apparently, they are a bit internally inconsistent. The simplest solution is to convert simu to raster and plot:
r = raster::raster(simu)
raster::plot(r)
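To see what the t(apply(..., 1, rev)) expression above does on its own, here is a small base R sketch on a made-up matrix (m is purely illustrative and not RandomFields output). On a plain matrix, the expression ends up reversing the order of the columns, which is what changes the orientation of the picture drawn by image().

```r
# Small illustrative matrix: rows are (1, 3, 5) and (2, 4, 6).
m <- matrix(1:6, nrow = 2)

# apply(m, 1, rev) reverses each row but returns the results as columns,
# so t() is needed to get back a matrix with the original shape.
flipped <- t(apply(m, 1, rev))

# The net effect is the same as reversing the column order directly.
same <- identical(flipped, m[, ncol(m):1])
```

If only the orientation matters, indexing with ncol(m):1 is arguably easier to read than the apply/rev/t combination.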

no method for coercing this S4 class to a vector for utilization of mclust

I'm trying to use the mclust method on an .FCS file (a flow cytometry file format), which I read into R as a flowFrame object.
install.packages("openCyto") # since the old version segfaulted my R session
library( openCyto )
library( flowCore)
library( mclust)
trial1=read.FCS("export_Alcina TregMAIT_AV 10-1974 P1_CD4.fcs")
a=as.matrix(trial1)
Editor's note: some of these are Bioconductor packages, and you should install them according to the help pages for that environment.
However, mclust does not accept the .fcs data as it is, so I tried to convert it to a matrix with the function as.matrix, and I get this error:
Error in as.vector(data) :
no method for coercing this S4 class to a vector
I've found similar questions where they explain you have to add importMethodsFrom(S4Vectors,as.matrix) into the NAMESPACE of mclust, which I did. I also did importMethodsFrom(BiocGenerics,as.vector) in the NAMESPACE of mclust. However, I'm still not able to use mclust.
P.S. Any advice or reading would be appreciated!
If anyone knows other clustering methods based on a GMM that could accept the .FCS format without conversion, I'd be very happy.
I've edited your question to show what you should have done originally and also didn't do later (instead of including code in a comment, you should have edited the question, as was specifically suggested). My response is based on the first example in flowCore::read.FCS (since you also did not include a pointer to the dataset you were loading from disk), so rather than "trial1" I will be referring to the "samp" object I get from running that code.
The "samp" object returns this from class and str:
> class(samp)
[1] "flowFrame"
attr(,"package")
[1] "flowCore"
str(samp)
Formal class 'flowFrame' [package "flowCore"] with 3 slots
..@ exprs : num [1:10000, 1:8] 382 628 1023 373 1023 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : Named chr [1:8] "FSC-H" "SSC-H" "FL1-H" "FL2-H" ...
.. .. .. ..- attr(*, "names")= chr [1:8] "$P1N" "$P2N" "$P3N" "$P4N" ...
.. ..- attr(*, "ranges")= num [1:8] 1023 1023 10000 10000 10000 ...
..@ parameters :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr [1:5] "Name of Parameter" "Description of Parameter" "Range of Parameter" "Minimum Parameter Value after Transforamtion" ...
.. .. ..@ data :'data.frame': 8 obs. of 5 variables:
.. .. .. ..$ name :Class 'AsIs' Named chr [1:8] "FSC-H" "SSC-H" "FL1-H" "FL2-H" ...
.. .. .. .. .. ..- attr(*, "names")= chr [1:8] "$P1N" "$P2N" "$P3N" "$P4N" ...
.. .. .. ..$ desc :Class 'AsIs' Named chr [1:8] "FSC-H" "SSC-H" NA NA ...
.. .. .. .. .. ..- attr(*, "names")= chr [1:8] "$P1S" "$P2S" "$P3S" "$P4S" ...
.. .. .. ..$ range : num [1:8] 1024 1024 1024 1024 1024 ...
.. .. .. ..$ minRange: num [1:8] 0 0 1 1 1 0 1 0
.. .. .. ..$ maxRange: num [1:8] 1023 1023 10000 10000 10000 ...
.. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ description:List of 164
.. ..$ FCSversion : chr "2"
.. ..$ $BYTEORD : chr "4,3,2,1"
.. ..$ $DATATYPE : chr "F"
#----- output truncated -----------
So "samp" is not a rectangular object in any sense, but rather an S4 object with several slots that carry a lot of associated information. My guess is that you want the information in the @exprs slot, which is a matrix.
A further difficulty is that there is no function named mclust in the mclust package, although looking at ?mclust we do see an example demonstrating the use of an Mclust function. R is unforgiving in its insistence on correct capitalization of function names.
Mclust(exprs(samp)[1:100,])
#-----------
'Mclust' model object:
best model: ellipsoidal, equal orientation (VVE) with 4 components
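The same failure and fix can be reproduced without flowCore. The sketch below defines a hypothetical S4 class, MockFrame, that only imitates the shape of a flowFrame (a matrix slot plus a metadata list); the class name, the slot values and the variable names are invented for illustration and are not part of flowCore.

```r
library(methods)

# Hypothetical stand-in for a flowFrame: an S4 class with a matrix slot.
setClass("MockFrame", representation(exprs = "matrix", description = "list"))

# Made-up measurements, just enough to have a 4 x 2 numeric matrix.
obj <- new("MockFrame",
           exprs = matrix(c(382, 628, 1023, 373, 140, 201, 95, 310), ncol = 2,
                          dimnames = list(NULL, c("FSC-H", "SSC-H"))),
           description = list(FCSversion = "2"))

# No as.matrix()/as.vector() method exists for this class, so coercion fails;
# the error is captured instead of stopping the script.
err <- tryCatch(as.matrix(obj), error = function(e) e)

# Reading the slot directly returns the ordinary numeric matrix.
m <- obj@exprs
```

For a real flowFrame the accessor exprs(samp) used above is preferable to reading the slot with @, since accessors are the interface the package authors intend users to rely on.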

Extract nested list elements using bracketed numbers and names

After running a repeated measures ANOVA and naming the output
RM_test <- ezANOVA(data=test_data, dv=var_test, wid = .(subject),
within = .(water_year), type = 3)
I looked at the internal structure of the named object using str(RM_test) and received the following:
List of 3
$ ANOVA :List of 3
..$ ANOVA :'data.frame': 1 obs. of 7 variables:
.. ..$ Effect: chr "water_year"
.. ..$ DFn : num 2
.. ..$ DFd : num 22
.. ..$ F : num 26.8
.. ..$ p : num 1.26e-06
.. ..$ p<.05 : chr "*"
.. ..$ ges : num 0.531
..$ Mauchly's Test for Sphericity:'data.frame': 1 obs. of 4 variables:
.. ..$ Effect: chr "water_year"
.. ..$ W : num 0.875
.. ..$ p : num 0.512
.. ..$ p<.05 : chr ""
..$ Sphericity Corrections :'data.frame': 1 obs. of 7 variables:
.. ..$ Effect : chr "water_year"
.. ..$ GGe : num 0.889
.. ..$ p[GG] : num 4.26e-06
.. ..$ p[GG]<.05: chr "*"
.. ..$ HFe : num 1.05
.. ..$ p[HF] : num 1.26e-06
.. ..$ p[HF]<.05: chr "*"
$ Mauchly's Test for Sphericity:'data.frame': 1 obs. of 4 variables:
..$ Effect: chr "wtr_yr"
..$ W : num 0.875
..$ p : num 0.512
..$ p<.05 : chr ""
$ Sphericity Corrections :'data.frame': 1 obs. of 7 variables:
..$ Effect : chr "wtr_yr"
..$ GGe : num 0.889
..$ p[GG] : num 4.26e-06
..$ p[GG]<.05: chr "*"
..$ HFe : num 1.05
..$ p[HF] : num 1.26e-06
..$ p[HF]<.05: chr "*"
I was able to extract the fourth variable F from the first data frame using RM_test[[1]][[4]][1] but cannot figure out how to extract the third variable p[GG] from the data frame Sphericity Corrections. This data frame appears twice so extracting either one would be fine.
Suggestions on how to do this using bracketed numbers and names would be appreciated.
The problem seems to be that you do not know how to extract list elements. As you said, there are two Sphericity Corrections data frames, so I will show how to get the p[GG] value for both.
using bracketed number
For the first one, we do RM_test[[1]][[3]][[3]]. You can do it step by step to understand it:
x1 <- RM_test[[1]]; str(x1)
x2 <- x1[[3]]; str(x2)
x3 <- x2[[3]]; str(x3)
For the second one, do RM_test[[3]][[3]].
using bracketed name
Instead of using numbers for indexing, we can use names. For the first, do
RM_test[["ANOVA"]][["Sphericity Corrections"]][["p[GG]"]]
For the second, do
RM_test[["Sphericity Corrections"]][["p[GG]"]]
using $
For the first one, do
RM_test$ANOVA$"Sphericity Corrections"$"p[GG]"
For the second one, do
RM_test$"Sphericity Corrections"$"p[GG]"
Note the use of quotes "" where a name contains spaces or special characters.
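The three styles can be checked against each other on a small stand-in object. The list below is a hand-built mock that copies only the names and nesting from the str() output in the question (it is not produced by ezANOVA), and the variable names sph, mauchly, by_number, by_name and by_dollar are illustrative.

```r
# Mock data frames with the same column names as in the str() output.
sph <- data.frame(Effect = "water_year", GGe = 0.889, `p[GG]` = 4.26e-06,
                  check.names = FALSE, stringsAsFactors = FALSE)
mauchly <- data.frame(Effect = "water_year", W = 0.875, p = 0.512,
                      stringsAsFactors = FALSE)

# Mock of the nested list: the inner ANOVA list repeats the two tables.
RM_test <- list(
  ANOVA = list(
    ANOVA = data.frame(Effect = "water_year", DFn = 2, DFd = 22, F = 26.8),
    "Mauchly's Test for Sphericity" = mauchly,
    "Sphericity Corrections" = sph
  ),
  "Mauchly's Test for Sphericity" = mauchly,
  "Sphericity Corrections" = sph
)

# Three equivalent ways to reach p[GG] in the nested copy.
by_number <- RM_test[[1]][[3]][[3]]
by_name   <- RM_test[["ANOVA"]][["Sphericity Corrections"]][["p[GG]"]]
by_dollar <- RM_test$ANOVA$"Sphericity Corrections"$"p[GG]"

# The top-level copy is one level shallower.
top_level <- RM_test[["Sphericity Corrections"]][["p[GG]"]]
```

Indexing by name tends to be more robust than indexing by position, because it keeps working if the order of the list elements or columns changes.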

Extract knot values out of gam with spline [duplicate]

I am running a GAM across many samples and am extracting coefficients/t-values/r-squared from the results in the way shown below. For background, I am using a natural spline, so the regular lm() works fine here and perhaps that is why this method works fine.
tvalsm93exf=ldply(fitsm93exf, function(x) as.data.frame(t(coef(summary(x))[,'t value', drop=FALSE])))
r2m93exf=ldply(fitsm93exf, function(x) as.data.frame(t(summary(x))[,'r.squared', drop=FALSE]))
I would also like to extract the knot locations for each sample set (df=4 and no intercept, so three internal knots plus the boundaries). I have tried several variations of the commands above, but haven't been able to index into this. The regular way to do this is shown below, so I was attempting to put it into the form above. But I am not certain whether the summary function contains these values, or whether there is another result I should be using instead.
attr(terms(fits),"predvars")
http://www.inside-r.org/r-doc/splines/ns
Note: This question is related to the question below, if that helps, though its solution did not help me solve my problem:
Extract estimates of GAM
The knots are fixed at the time that the ns function is called in the examples on the help page you linked to, so you could have extracted the knots without going into the model object. But ... you have not provided the code for the GAM model creation, so we can only speculate about what you might have done. Just because the word "spline" is used in both the ?ns help page and in the GAM documentation does not mean they are the same. The model in the other page you linked to had two "smooth" terms constructed with the s function.
.... + s(time,bs="cr",k=200) + s(tmpd,bs="cr")
The result of that gam call had a list node named "smooth" and the first one looked like this when viewed with str():
str(ap1$smooth)
List of 2
$ :List of 22
..$ term : chr "time"
..$ bs.dim : num 200
..$ fixed : logi FALSE
..$ dim : int 1
..$ p.order : logi NA
..$ by : chr "NA"
..$ label : chr "s(time)"
..$ xt : NULL
..$ id : NULL
..$ sp : Named num -1
.. ..- attr(*, "names")= chr "s(time)"
..$ S :List of 1
.. ..$ : num [1:199, 1:199] 5.6 -5.475 2.609 -0.577 0.275 ...
..$ rank : num 198
..$ null.space.dim: num 1
..$ df : num 199
..$ xp : Named num [1:200] -2556 -2527 -2502 -2476 -2451 ...
.. ..- attr(*, "names")= chr [1:200] "0.0000000%" "0.5025126%" "1.0050251%" "1.5075377%" ...
..$ F : num [1:40000] 0 0 0 0 0 0 0 0 0 0 ...
..$ plot.me : logi TRUE
..$ side.constrain: logi TRUE
..$ S.scale : num 9.56e-05
..$ vn : chr "time"
..$ first.para : num 5
..$ last.para : num 203
..- attr(*, "class")= chr [1:2] "cr.smooth" "mgcv.smooth"
..- attr(*, "qrc")=List of 4
.. ..$ qr : num [1:200, 1] -0.0709 0.0817 0.0709 0.0688 0.0724 ...
.. ..$ rank : int 1
.. ..$ qraux: num 1.03
.. ..$ pivot: int 1
.. ..- attr(*, "class")= chr "qr"
..- attr(*, "nCons")= int 1
So the smooth was evaluated at each of 200 points and a polynomial function was fit to the data on that grid. If you forced the knots to be at three interior locations, then they will just be at the extremes and at evenly spaced locations between the extremes.
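For the ns() case mentioned at the start of this answer, the knots can be read straight from the basis object. This is a minimal sketch on simulated data (x is made up and stands in for one sample set); it uses only the splines package that ships with R and says nothing about knots of an mgcv smooth.

```r
library(splines)

# Simulated predictor standing in for one sample set.
set.seed(1)
x <- runif(100)

# Natural spline basis with df = 4 and no intercept.
basis <- ns(x, df = 4)

# The knot locations are stored as attributes of the basis matrix.
interior <- attr(basis, "knots")           # three interior knots
boundary <- attr(basis, "Boundary.knots")  # the two boundary knots
```

With df = 4 and no intercept, ns() places the three interior knots at the 25%, 50% and 75% quantiles of x and the boundary knots at range(x), so the same call could be wrapped in a function and applied to each sample set.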
