R kohonen - Is the input data scaled and centred automatically?

I have been following an online example for R Kohonen self-organising maps (SOM) which suggested that the data should be centred and scaled before computing the SOM.
However, I've noticed that the object created seems to have attributes for centre and scale, so am I applying a redundant step by centring and scaling first? Example script below:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
str(som1)
The output from the last line of the script is:
List of 13
$ data :List of 1
..$ : num [1:150, 1:4] -0.898 -1.139 -1.381 -1.501 -1.018 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
.. ..- attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
.. ..- attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
$ unit.classif : num [1:150] 3 5 5 5 4 2 4 4 6 5 ...
$ distances : num [1:150] 0.0426 0.0663 0.0768 0.0744 0.1346 ...
$ grid :List of 6
..$ pts : num [1:36, 1:2] 1.5 2.5 3.5 4.5 5.5 6.5 1 2 3 4 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "x" "y"
..$ xdim : num 6
..$ ydim : num 6
..$ topo : chr "hexagonal"
..$ neighbourhood.fct: Factor w/ 2 levels "bubble","gaussian": 1
..$ toroidal : logi FALSE
..- attr(*, "class")= chr "somgrid"
$ codes :List of 1
..$ : num [1:36, 1:4] -0.376 -0.683 -0.734 -1.158 -1.231 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:36] "V1" "V2" "V3" "V4" ...
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
$ changes : num [1:500, 1] 0.0445 0.0413 0.0347 0.0373 0.0337 ...
$ alpha : num [1:2] 0.05 0.01
$ radius : Named num [1:2] 3.61 0
..- attr(*, "names")= chr [1:2] "66.66667%" ""
$ user.weights : num 1
$ distance.weights: num 1
$ whatmap : int 1
$ maxNA.fraction : int 0
$ dist.fcts : chr "sumofsquares"
- attr(*, "class")= chr "kohonen"
Notice that lines 7 and 10 of the output refer to centre and scale. I would appreciate an explanation of what is happening here.

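Your scaling step is not redundant: the som source code does no scaling of its own. The attributes you see in lines 7 and 10 come from the training dataset, because scale() attaches "scaled:center" and "scaled:scale" to the matrix it returns and keep.data=TRUE stores that matrix unchanged.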
To check this, just run and compare results of this chunk of code:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
#compare train datasets
str(dt)
str(as.matrix(iris[, 1:4]))
# Prepare SOM
set.seed(590507)
som1 <- kohonen::som(dt,
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#without scaling
som2 <- kohonen::som(as.matrix(iris[, 1:4]),
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#compare results of som function
str(som1)
str(som2)
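The origin of those attributes can also be checked without fitting a SOM at all. The following is a minimal sketch in base R (the variable names x and dt_attrs are only illustrative): it compares the attributes of the raw matrix with those of the scaled one.

```r
# Raw measurements as a plain numeric matrix
x <- as.matrix(iris[, 1:4])

# scale() centres and scales each column and records what it did
dt <- scale(x, center = TRUE)

# The raw matrix carries no scaling attributes; the scaled one does
names(attributes(x))
dt_attrs <- names(attributes(dt))
dt_attrs

# The stored centre is just the column means of the raw data
attr(dt, "scaled:center")
colMeans(x)
```

If the scaled matrix already carries these attributes before it is passed to som, seeing them again inside som1$data only means the input was stored as given.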

Related

Interpreting the PCA axis Dim1 and Dim2 from CLARA plot results directly

I had a large dataset that contains more than 300,000 rows/observations and 22 variables. I used the CLARA method for the clustering and plotted the results using fviz_cluster. Using the silhouette method, I got 10 as my number of clusters and from there I applied it to my CLARA algorithm.
clara.res <- clara(df, 10, samples = 50,trace = 1,sampsize = 1000, pamLike = TRUE)
str(clara.res)
List of 10
$ sample : chr [1:1000] "100046" "100303" "10052" "100727" ...
$ medoids : num [1:10, 1:22] 0.925 0.125 0.701 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "193751" "137853" "229261" "257462" ...
.. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
$ i.med : int [1:10] 104171 42062 143627 174961 300065 13836 192832 207079 185241 228575
$ clustering: Named int [1:302251] 1 1 1 2 3 4 5 3 3 3 ...
..- attr(*, "names")= chr [1:302251] "1" "10" "100" "1000" ...
$ objective : num 0.37
$ clusinfo : num [1:10, 1:4] 71811 40181 46271 10155 31309 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:4] "size" "max_diss" "av_diss" "isolation"
$ diss : 'dissimilarity' num [1:499500] 1.392 2.192 0.937 2.157 1.643 ...
..- attr(*, "Size")= int 1000
..- attr(*, "Metric")= chr "euclidean"
..- attr(*, "Labels")= chr [1:1000] "100046" "100303" "10052" "100727" ...
$ call : language clara(x = df, k = 10, samples = 50, sampsize = 1000, trace = 1, pamLike = TRUE)
$ silinfo :List of 3
..$ widths : num [1:1000, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1000] "83395" "181310" "34452" "42991" ...
.. .. ..$ : chr [1:3] "cluster" "neighbor" "sil_width"
..$ clus.avg.widths: num [1:10] 0.645 0.408 0.487 0.513 0.839 ...
..$ avg.width : num 0.612
$ data : num [1:302251, 1:22] 1 1 1 0.366 0.35 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:302251] "1" "10" "100" "1000" ...
.. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
- attr(*, "class")= chr [1:2] "clara" "partition"
For the plot:
fviz_cluster(clara.res,
palette = c(
"#004c6d",
"#00a1c1",
"#ffc334",
"#78ab63",
"#00ffff",
"#00cfe3",
"#6efa75",
"#cc0089",
"#ff9509",
"#ffb6de"
), # color palette
ellipse.type = "t",geom = "point",show.clust.cent = TRUE,repel = TRUE,pointsize = 0.5,
ggtheme = theme_classic()
)+ xlim(-7, 3) + ylim (-5, 4) + labs(title = "Plot of clusters")
The result:
I reckon that this cluster plot is based on PCA, and I have been trying to figure out which variables in my original data contribute to Dim1 and Dim2, or what these x and y axes represent. Can somebody help me find out what Dim1 and Dim2 are, and get the eigenvalues/variance of all the dimensions, without running PCA separately?
I saw there are other functions/packages for PCA, such as get_eigenvalue in factoextra and FactoMineR, but it seems they would require me to run the PCA algorithm from the beginning. How can I integrate it directly with my CLARA results?
Also, my Dim1 only accounts for 12.3% and Dim2 for 8.8%. Does that mean these dimensions are not representative enough? Considering that I would have 22 dimensions in total (from my 22 variables), I think it's alright, no? I am not sure how these percentages for Dim1 and Dim2 affect my cluster results. I was thinking of making a scree plot from my CLARA results, but I can't figure that out either.
I'd appreciate any insights.

How to plot PCA using hellinger transformation in ggplot?

I'm trying to ggplot using Hellinger Transformation on my dataset. It works fine for a regular prcomp function but not Hellingers. How can I plot the data from Hellinger transformed data using ggplot?
library(ggfortify)
library(vegan)
df <- iris[1:4]
pca_res <- prcomp(df, scale. = TRUE)
autoplot(pca_res, data = iris, colour =
'Species',
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3)
##Hellinger Transformation
df.hell <- decostand(df, method = "hellinger")
df.hell <- rda(df.hell)
ggplot2::autoplot(df.hell)
autoplot(df.hell, data = iris, colour =
'Species',
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3)
Error: Objects of type rda/cca not supported by autoplot.
Error: Objects of type rda/cca not supported by autoplot.
Edit 1: Even if the first plot can be manually computed in ggplot2, what about the rest of the plots like loading, or ellipses etc? base plot allows for overlay when using Hellingers but doesn't seem like ggplot2 would directly allow for it.
prcomp returns an object of class prcomp, which can be plotted with autoplot. As the error message says, rda function returns an object of class "rda" "cca", which cannot be plotted using autoplot. Therefore, you must extract the bits you need manually:
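# Site scores are stored in df.hell$CA$u; data.frame() names the columns PC.PC1, PC.PC2, ...
pc_df <- data.frame(PC = df.hell$CA$u, species = iris$Species)
ggplot(pc_df, aes(x = PC.PC1, y = PC.PC2)) +
geom_point(aes(colour = species))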
You can find the relevant parts of the object by doing str(df.hell):
List of 10
$ colsum : Named num [1:4] 0.037 0.0746 0.086 0.0854
..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
$ tot.chi : num 0.0216
$ Ybar : num [1:150, 1:4] 0.0042 0.00511 0.0042 0.00359 0.00363 ...
..- attr(*, "scaled:center")= Named num [1:4] 0.656 0.479 0.498 0.267
.. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
..- attr(*, "METHOD")= chr "PCA"
$ method : chr "rda"
$ call : language rda(X = df.hell)
$ pCCA : NULL
$ CCA : NULL
$ CA :List of 7
..$ eig : Named num [1:4] 0.0208691 0.0005348 0.0001951 0.0000205
.. ..- attr(*, "names")= chr [1:4] "PC1" "PC2" "PC3" "PC4"
..$ poseig : NULL
..$ u : num [1:150, 1:4] -0.122 -0.11 -0.119 -0.106 -0.123 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
..$ v : num [1:4, 1:4] -0.241 -0.508 0.589 0.58 0.375 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
..$ rank : int 4
..$ tot.chi: num 0.0216
..$ Xbar : num [1:150, 1:4] 0.0042 0.00511 0.0042 0.00359 0.00363 ...
.. ..- attr(*, "scaled:center")= Named num [1:4] 0.656 0.479 0.498 0.267
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
.. ..- attr(*, "METHOD")= chr "PCA"
$ inertia : chr "variance"
$ regularization: chr "this is a vegan::rda result object"
- attr(*, "class")= chr [1:2] "rda" "cca"
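An alternative that keeps autoplot usable, sketched here under two assumptions: that the Hellinger transformation is the square root of each row's proportions, and that an unconstrained rda on the transformed data is an ordinary centred PCA (the str output above reports METHOD "PCA"). The names hell and pca_hell are only illustrative.

```r
df <- iris[1:4]

# Hellinger transformation by hand: square root of row-wise proportions
hell <- sqrt(as.matrix(df) / rowSums(df))

# Centred, unscaled PCA; the result has class "prcomp"
pca_hell <- prcomp(hell)

# Proportion of variance carried by each component
round(pca_hell$sdev^2 / sum(pca_hell$sdev^2), 3)

# With ggfortify attached, the usual call should then apply:
# autoplot(pca_hell, data = iris, colour = "Species",
#          loadings = TRUE, loadings.label = TRUE)
```

The scores may be scaled differently from df.hell$CA$u, so the axes can differ by a constant factor or a sign flip even when the ordination is the same.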

Produce a graphic tree diagram showing the structure of an R object

In R, str() is handy for showing the structure of an object, such as the list of lists returned by lm() and other modelling functions, but it gives way too much output. I'm looking for some tool to create a simple tree diagram showing only the names of the list elements and their structure.
e.g., for this example,
data(Prestige, package="car")
out <- lm(prestige ~ income+education+women, data=Prestige)
str(out, max.level=2)
#> List of 12
#> $ coefficients : Named num [1:4] -6.79433 0.00131 4.18664 -0.00891
#> ..- attr(*, "names")= chr [1:4] "(Intercept)" "income" "education" "women"
#> $ residuals : Named num [1:102] 4.58 -9.39 4.69 4.22 8.15 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ effects : Named num [1:102] -472.99 -123.61 -92.61 -2.3 6.83 ...
#> ..- attr(*, "names")= chr [1:102] "(Intercept)" "income" "education" "women" ...
#> $ rank : int 4
#> $ fitted.values: Named num [1:102] 64.2 78.5 58.7 52.6 65.3 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ assign : int [1:4] 0 1 2 3
#> $ qr :List of 5
#> ..$ qr : num [1:102, 1:4] -10.1 0.099 0.099 0.099 0.099 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
#> ..$ qraux: num [1:4] 1.1 1.44 1.06 1.06
#> ..$ pivot: int [1:4] 1 2 3 4
#> ..$ tol : num 1e-07
#> ..$ rank : int 4
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 98
...
I would like to get something like this:
This is similar to what I get from tree for file folders in my file system:
C:\Dropbox\Documents\images>tree
Folder PATH listing
Volume serial number is 2250-8E6F
C:.
+---cartoons
+---chevaliers
+---icons
+---milestones
+---minard
+---minard-besancon
The result could be either in graphic characters, as in tree or an actual graphic as shown above. Is anything like this available?
A simple approach to getting this from the str output would be something like...
a <- capture.output(str(out, max.level=2))
a <- trimws(gsub("\\:.*", "", a[grepl("\\$", a)]))
cat(a, sep="\n")
$ coefficients
$ residuals
$ effects
$ rank
$ fitted.values
$ assign
$ qr
..$ qr
..$ qraux
..$ pivot
..$ tol
..$ rank
$ df.residual
$ xlevels
$ call
$ terms
$ model
..$ prestige
..$ income
..$ education
..$ women
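The same listing can be produced without parsing the str output, by walking the object directly. This is only a sketch: name_tree is a hypothetical helper written for this example, not a function from any package, and it uses mtcars rather than the Prestige data so that it runs in base R.

```r
# Recursively collect the names of list elements, indented like `tree`
name_tree <- function(x, max_level = 2, depth = 0) {
  # Stop at non-lists, unnamed lists, or when the depth limit is reached
  if (!is.list(x) || is.null(names(x)) || depth >= max_level) {
    return(character(0))
  }
  out <- character(0)
  for (i in seq_along(x)) {
    out <- c(out,
             paste0(strrep("    ", depth), "+---", names(x)[i]),
             name_tree(x[[i]], max_level, depth + 1))
  }
  out
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
tree_lines <- name_tree(fit, max_level = 2)
cat(tree_lines, sep = "\n")
```

Raising max_level descends further into nested lists; attributes are deliberately ignored, which is what keeps the listing short.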

How to access principal components rownames in prcomp?

I'm probably not explaining myself very well here. How can I access the column of names in prcomp following its use as shown below? I would like to use this as a list for subsequent plots.
prcomp(USArrests)
Standard deviations:
[1] 83.732400 14.212402 6.489426 2.482790
Rotation:
PC1 PC2 PC3 PC4
Murder 0.04170432 -0.04482166 0.07989066 -0.99492173
Assault 0.99522128 -0.05876003 -0.06756974 0.03893830
UrbanPop 0.04633575 0.97685748 -0.20054629 -0.05816914
Rape 0.07515550 0.20071807 0.97408059 0.07232502
I would like to extract the list "Murder, Assault, UrbanPop, Rape".
It is always helpful to use str:
res <- prcomp(USArrests)
str(res)
# List of 5
# $ sdev : num [1:4] 83.73 14.21 6.49 2.48
# $ rotation: num [1:4, 1:4] 0.0417 0.9952 0.0463 0.0752 -0.0448 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
# .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
# $ center : Named num [1:4] 7.79 170.76 65.54 21.23
# ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
# $ scale : logi FALSE
# $ x : num [1:50, 1:4] 64.8 92.8 124.1 18.3 107.4 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
# .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
# - attr(*, "class")= chr "prcomp"
Then we can do:
rownames(res$rotation)
#[1] "Murder" "Assault" "UrbanPop" "Rape"
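As a small follow-up sketch, the extracted names can drive later per-variable work. The names vars and var_means are only illustrative, and the mean stands in for whatever plot or summary is needed.

```r
res <- prcomp(USArrests)

# Variable names, in the order prcomp used them
vars <- rownames(res$rotation)

# Use the names to index the original data, one variable at a time
var_means <- sapply(vars, function(v) mean(USArrests[[v]]))
var_means
```

The observation labels are available the same way through rownames(res$x).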

extract the correlation matrix for the factors in the psych package's fa.poly function

I'm working from caracal's great example conducting a factor analysis on dichotomous data and I'm now struggling to extract the factors from the object produced by the psych package's fa.poly function.
Can anyone help me extract the factors from the fa.poly object (and look at the correlation)?
Please see caracal's example for the working example.
In this example you create an object with:
faPCdirect <- fa.poly(XdiNum, nfactors=2, rotate="varimax") # polychoric FA
so somewhere in faPCdirect there is what you want. I recommend using str() to inspect the structure of faPCdirect
> str(faPCdirect)
List of 5
$ fa :List of 34
..$ residual : num [1:6, 1:6] 4.79e-01 7.78e-02 -2.97e-0...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:6] "X1" "X2" "X3" "X4" ...
.. .. ..$ : chr [1:6] "X1" "X2" "X3" "X4" ...
..$ dof : num 4
..$ fit
...skip stuff....
..$ BIC : num 4.11
..$ r.scores : num [1:2, 1:2] 1 0.0508 0.0508 1
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "MR2" "MR1"
.. .. ..$ : chr [1:2] "MR2" "MR1"
..$ R2 : Named num [1:2] 0.709 0.989
.. ..- attr(*, "names")= chr [1:2] "MR2" "MR1"
..$ valid : num [1:2] 0.819 0.987
..$ score.cor : num [1:2, 1:2] 1 0.212 0.212 1
So this says that the object is a list of five whose first element, fa, contains an element called score.cor, which is a 2x2 matrix. I think what you want is the off-diagonal entry.
> faPCdirect$fa$score.cor
[,1] [,2]
[1,] 1.0000000 0.2117457
[2,] 0.2117457 1.0000000
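To pull out just that off-diagonal value, something like the following should work. It is a sketch on a stand-in matrix, score_cor, rebuilt from the numbers printed above rather than taken from a real fa.poly fit.

```r
# Stand-in for faPCdirect$fa$score.cor, using the values printed above
score_cor <- matrix(c(1, 0.2117457,
                      0.2117457, 1), nrow = 2)

# upper.tri() selects the entries above the diagonal
off_diag <- score_cor[upper.tri(score_cor)]
off_diag
```

With more than two factors the same call returns every pairwise correlation above the diagonal, in column order.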
