I have been following an online example of R Kohonen self-organising maps (SOM) which suggested that the data should be centred and scaled before computing the SOM.
However, I've noticed that the object created seems to have attributes for centre and scale, in which case am I applying a redundant step by centring and scaling first? Example script below:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4], center = TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
            somgrid(6, 6, "hexagonal"),
            rlen = 500,
            keep.data = TRUE)
str(som1)
The output from the last line of the script is:
List of 13
$ data :List of 1
..$ : num [1:150, 1:4] -0.898 -1.139 -1.381 -1.501 -1.018 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
.. ..- attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
.. ..- attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
$ unit.classif : num [1:150] 3 5 5 5 4 2 4 4 6 5 ...
$ distances : num [1:150] 0.0426 0.0663 0.0768 0.0744 0.1346 ...
$ grid :List of 6
..$ pts : num [1:36, 1:2] 1.5 2.5 3.5 4.5 5.5 6.5 1 2 3 4 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "x" "y"
..$ xdim : num 6
..$ ydim : num 6
..$ topo : chr "hexagonal"
..$ neighbourhood.fct: Factor w/ 2 levels "bubble","gaussian": 1
..$ toroidal : logi FALSE
..- attr(*, "class")= chr "somgrid"
$ codes :List of 1
..$ : num [1:36, 1:4] -0.376 -0.683 -0.734 -1.158 -1.231 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:36] "V1" "V2" "V3" "V4" ...
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
$ changes : num [1:500, 1] 0.0445 0.0413 0.0347 0.0373 0.0337 ...
$ alpha : num [1:2] 0.05 0.01
$ radius : Named num [1:2] 3.61 0
..- attr(*, "names")= chr [1:2] "66.66667%" ""
$ user.weights : num 1
$ distance.weights: num 1
$ whatmap : int 1
$ maxNA.fraction : int 0
$ dist.fcts : chr "sumofsquares"
- attr(*, "class")= chr "kohonen"
Notice that in lines 7 and 10 of the output there are references to centre and scale. I would appreciate an explanation of what is happening here.
Your scaling step is not redundant, because the som() source code does no scaling of its own; the attributes you see in lines 7 and 10 are simply attributes of the training dataset, added by scale().
To check this, run the following chunk of code and compare the results:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4], center = TRUE)
#compare train datasets
str(dt)
str(as.matrix(iris[, 1:4]))
# Prepare SOM
set.seed(590507)
som1 <- kohonen::som(dt,
                     kohonen::somgrid(6, 6, "hexagonal"),
                     rlen = 500,
                     keep.data = TRUE)
# without scaling
som2 <- kohonen::som(as.matrix(iris[, 1:4]),
                     kohonen::somgrid(6, 6, "hexagonal"),
                     rlen = 500,
                     keep.data = TRUE)
#compare results of som function
str(som1)
str(som2)
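To see why those carried-over attributes are still useful: a minimal base-R sketch (no kohonen needed) showing that `scaled:center` and `scaled:scale` let you map scaled values, such as codebook vectors, back to the original units. The inversion formula `x = z * scale + center` is assumed from how scale() works.

```r
data(iris)
dt <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

# The attributes scale() stores on its result:
ctr <- attr(dt, "scaled:center")
scl <- attr(dt, "scaled:scale")

# Invert column-wise: x = z * scale + center
orig <- sweep(sweep(dt, 2, scl, "*"), 2, ctr, "+")

all.equal(as.matrix(iris[, 1:4]), orig, check.attributes = FALSE)  # TRUE
```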
I would like to extract the p-values from the Anderson-Darling test (ad.test from the kSamples package). Each test result is a list of 12 containing a 2x3 matrix; the p-value is part of that 2x3 matrix, which is element 7 of the list.
When using the following code:
lapply(AD_result, "[[", 7)
I get the following subset of AD test results (first 2 of a total of 50 shown):
[[1]]
AD T.AD asympt. P-value
version 1: 1.72 0.94536 0.13169
version 2: 1.51 0.66740 0.17461
[[2]]
AD T.AD asympt. P-value
version 1: 12.299 14.624 6.9248e-07
version 2: 11.900 14.144 1.1146e-06
My question is how to extract only the p-value (e.g. from version 1) and put these 50 results into a vector.
The output from str(AD_result) is:
List of 55
$ :List of 12
..$ test.name : chr "Anderson-Darling"
..$ k : int 2
..$ ns : int [1:2] 103 2905
..$ N : int 3008
..$ n.ties : int 2873
..$ sig : num 0.762
..$ ad : num [1:2, 1:3] 1.72 1.51 0.945 0.667 0.132 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "version 1:" "version 2:"
.. .. ..$ : chr [1:3] "AD" "T.AD" " asympt. P-value"
..$ warning : logi FALSE
..$ null.dist1: NULL
..$ null.dist2: NULL
..$ method : chr "asymptotic"
..$ Nsim : num 1
..- attr(*, "class")= chr "kSamples"
You could try:
unlist(lapply(AD_result, function(x) x$ad[,3]))
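If you only want the "version 1" p-values, index the row as well as the column. A self-contained sketch with mocked-up results (the matrices are hypothetical stand-ins that mimic the `$ad` structure shown in the question):

```r
# Build two fake ad.test-like results with the same $ad matrix layout
mk <- function(p1, p2) {
  list(ad = matrix(c(1.7, 1.5, 0.9, 0.7, p1, p2), nrow = 2,
                   dimnames = list(c("version 1:", "version 2:"),
                                   c("AD", "T.AD", " asympt. P-value"))))
}
AD_result <- list(mk(0.13169, 0.17461), mk(6.9248e-07, 1.1146e-06))

# p-values for version 1 only, as a numeric vector
p.v1 <- sapply(AD_result, function(x) x$ad["version 1:", 3])
p.v1  # c(0.13169, 6.9248e-07)
```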
Perhaps it is just me, but I have always found str unsatisfactory. It is frequently too verbose, yet on many occasions not very informative.
I actually really like the description of the function (?str):
Compactly display the internal structure of an R object
and this bit in particular
Ideally, only one line for each ‘basic’ structure is displayed.
Only that, in many cases, the default str implementation simply does not do justice to that description.
OK, let's say it works reasonably well for data frames.
library(ggplot2)
str(mpg)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
$ manufacturer: chr "audi" "audi" "audi" "audi" ...
$ model : chr "a4" "a4" "a4" "a4" ...
$ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr "f" "f" "f" "f" ...
$ cty : int 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr "p" "p" "p" "p" ...
$ class : chr "compact" "compact" "compact" "compact" ...
Yet, for a data.frame it's not as informative as I would like. In addition to the class, it would be very useful if it showed, for example, the number of NA values and the number of unique values.
But for other objects, it quickly becomes unmanageable. For example:
gp <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
str(gp)
List of 9
$ data :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
..$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
..$ model : chr [1:234] "a4" "a4" "a4" "a4" ...
..$ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
..$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
..$ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
..$ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
..$ drv : chr [1:234] "f" "f" "f" "f" ...
..$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
..$ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
..$ fl : chr [1:234] "p" "p" "p" "p" ...
..$ class : chr [1:234] "compact" "compact" "compact" "compact" ...
$ layers :List of 1
..$ :Classes 'LayerInstance', 'Layer', 'ggproto' <ggproto object: Class LayerInstance, Layer>
aes_params: list
compute_aesthetics: function
compute_geom_1: function
compute_geom_2: function
compute_position: function
compute_statistic: function
data: waiver
draw_geom: function
geom: <ggproto object: Class GeomPoint, Geom>
aesthetics: function
default_aes: uneval
draw_group: function
draw_key: function
draw_layer: function
draw_panel: function
extra_params: na.rm
handle_na: function
non_missing_aes: size shape
parameters: function
required_aes: x y
setup_data: function
use_defaults: function
super: <ggproto object: Class Geom>
geom_params: list
inherit.aes: TRUE
layer_data: function
map_statistic: function
mapping: NULL
position: <ggproto object: Class PositionIdentity, Position>
compute_layer: function
compute_panel: function
required_aes:
setup_data: function
setup_params: function
super: <ggproto object: Class Position>
print: function
show.legend: NA
stat: <ggproto object: Class StatIdentity, Stat>
compute_group: function
compute_layer: function
compute_panel: function
default_aes: uneval
extra_params: na.rm
non_missing_aes:
parameters: function
required_aes:
retransform: TRUE
setup_data: function
setup_params: function
super: <ggproto object: Class Stat>
stat_params: list
subset: NULL
super: <ggproto object: Class Layer>
$ scales :Classes 'ScalesList', 'ggproto' <ggproto object: Class ScalesList>
add: function
clone: function
find: function
get_scales: function
has_scale: function
input: function
n: function
non_position_scales: function
scales: list
super: <ggproto object: Class ScalesList>
$ mapping :List of 2
..$ x: symbol displ
..$ y: symbol hwy
$ theme : list()
$ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto' <ggproto object: Class CoordCartesian, Coord>
aspect: function
distance: function
expand: TRUE
is_linear: function
labels: function
limits: list
range: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
train: function
transform: function
super: <ggproto object: Class CoordCartesian, Coord>
$ facet :List of 1
..$ shrink: logi TRUE
..- attr(*, "class")= chr [1:2] "null" "facet"
$ plot_env :<environment: R_GlobalEnv>
$ labels :List of 2
..$ x: chr "displ"
..$ y: chr "hwy"
- attr(*, "class")= chr [1:2] "gg" "ggplot"
Whaaattttt??? What happened to "Compactly display"? That's not compact!
And it can get worse, downright scary, for example for S4 objects. If you want, try this:
library(rworldmap)
newmap <- getMap(resolution = "coarse")
str(newmap)
I do not post the output here because it is far too long; it does not even fit in the console buffer!
How can you possibly understand the internal structure of an object with such a non-compact display? There are just too many details and you easily get lost. Or at least I do.
Well, all right. Before someone tells me "hey, check out ?str and tweak its arguments": that's what I did. Of course it can get better, but I am still somewhat disappointed with str.
The best solution I've got is to create a function that does this (compact_str is just my own name for it):
compact_str <- function(obj) {
  if (isS4(obj)) {
    str(obj, max.level = 2, give.attr = FALSE, give.head = FALSE)
  } else {
    str(obj, max.level = 1, give.attr = FALSE, give.head = FALSE)
  }
}
This compactly displays the top-level structures of the object. The output for the sp object above (an S4 object) becomes much more insightful:
Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
..# data :'data.frame': 243 obs. of 49 variables:
..# polygons :List of 243
.. .. [list output truncated]
..# plotOrder :7 135 28 167 31 23 9 66 84 5 ...
..# bbox :-180 -90 180 83.6
..# proj4string:Formal class 'CRS' [package "sp"] with 1 slot
So now you can see there are 5 top-level structures, and you can investigate each of them further individually.
Similarly, for the ggplot object above you now see:
List of 9
$ data :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
$ layers :List of 1
$ scales :Classes 'ScalesList', 'ggproto'
$ mapping :List of 2
$ theme : list()
$ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto'
$ facet :List of 1
$ plot_env :
$ labels :List of 2
Although this is much better, I still feel it could be more insightful. So, perhaps someone has felt the same way and has created a nice function that is more informative while still displaying the information compactly. Anyone?
In such situations I use glimpse from the tibble package, which is less verbose and gives a brief description of the data structure.
library(tibble)
glimpse(gp)
There is the lobstr package by Hadley Wickham. Besides several other more or less helpful functions, it includes lobstr::tree(), which tries to be more predictable, compact, and overall more helpful than str().
An important difference between the two is that str() is an S3 generic whereas lobstr::tree() is not. That means package developers can and do include their own methods for str(), which can substantially improve its usefulness, but it also means that str() output can be very inconsistent.
For comparison, here is the structure of a simple lm() displayed with both functions. lobstr::tree() also prints colorized output, which further improves legibility, but you obviously can't see the colors here on SO. Note in particular the much more concise and useful rendering of the formula and the data frame items:
m <- lm(mpg~cyl, mtcars)
lobstr::tree(m)
#> S3<lm>
#> ├─coefficients<dbl [2]>: 37.8845764854614, -2.87579013906448
#> ├─residuals<dbl [32]>: 0.370164348925359, 0.370164348925418, -3.58141592920354, 0.770164348925411, 3.82174462705436, -2.52983565107459, -0.578255372945636, -1.98141592920354, -3.58141592920354, -1.42983565107459, ...
#> ├─effects<dbl [32]>: -113.649737406208, -28.5956806590543, -3.70425398161014, 0.709596949580206, 3.82344788077055, -2.59040305041979, -0.576552119229446, -2.10425398161014, -3.70425398161014, -1.49040305041979, ...
#> ├─rank: 2
#> ├─fitted.values<dbl [32]>: 20.6298356510746, 20.6298356510746, 26.3814159292035, 20.6298356510746, 14.8782553729456, 20.6298356510746, 14.8782553729456, 26.3814159292035, 26.3814159292035, 20.6298356510746, ...
#> ├─assign<int [2]>: 0, 1
#> ├─qr: S3<qr>
#> │ ├─qr<dbl [64]>: -5.65685424949238, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, 0.176776695296637, ...
#> │ ├─qraux<dbl [2]>: 1.17677669529664, 1.01602374277435
#> │ ├─pivot<int [2]>: 1, 2
#> │ ├─tol: 1e-07
#> │ └─rank: 2
#> ├─df.residual: 30
#> ├─xlevels: <list>
#> ├─call: <language> lm(formula = mpg ~ cyl, data = mtcars)
#> ├─terms: S3<terms/formula> mpg ~ cyl
#> └─model: S3<data.frame>
#> ├─mpg<dbl [32]>: 21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, ...
#> └─cyl<dbl [32]>: 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, ...
str(m)
#> List of 12
#> $ coefficients : Named num [1:2] 37.88 -2.88
#> ..- attr(*, "names")= chr [1:2] "(Intercept)" "cyl"
#> $ residuals : Named num [1:32] 0.37 0.37 -3.58 0.77 3.82 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ effects : Named num [1:32] -113.65 -28.6 -3.7 0.71 3.82 ...
#> ..- attr(*, "names")= chr [1:32] "(Intercept)" "cyl" "" "" ...
#> $ rank : int 2
#> $ fitted.values: Named num [1:32] 20.6 20.6 26.4 20.6 14.9 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ assign : int [1:2] 0 1
#> $ qr :List of 5
#> ..$ qr : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> .. .. ..$ : chr [1:2] "(Intercept)" "cyl"
#> .. ..- attr(*, "assign")= int [1:2] 0 1
#> ..$ qraux: num [1:2] 1.18 1.02
#> ..$ pivot: int [1:2] 1 2
#> ..$ tol : num 1e-07
#> ..$ rank : int 2
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 30
#> $ xlevels : Named list()
#> $ call : language lm(formula = mpg ~ cyl, data = mtcars)
#> $ terms :Classes 'terms', 'formula' language mpg ~ cyl
#> .. ..- attr(*, "variables")= language list(mpg, cyl)
#> .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:2] "mpg" "cyl"
#> .. .. .. ..$ : chr "cyl"
#> .. ..- attr(*, "term.labels")= chr "cyl"
#> .. ..- attr(*, "order")= int 1
#> .. ..- attr(*, "intercept")= int 1
#> .. ..- attr(*, "response")= int 1
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. ..- attr(*, "predvars")= language list(mpg, cyl)
#> .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. ..- attr(*, "names")= chr [1:2] "mpg" "cyl"
#> $ model :'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ cyl
#> .. .. ..- attr(*, "variables")= language list(mpg, cyl)
#> .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : chr [1:2] "mpg" "cyl"
#> .. .. .. .. ..$ : chr "cyl"
#> .. .. ..- attr(*, "term.labels")= chr "cyl"
#> .. .. ..- attr(*, "order")= int 1
#> .. .. ..- attr(*, "intercept")= int 1
#> .. .. ..- attr(*, "response")= int 1
#> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. .. ..- attr(*, "predvars")= language list(mpg, cyl)
#> .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "cyl"
#> - attr(*, "class")= chr "lm"
Created on 2022-11-23 with reprex v2.0.2
I am trying to use the PenalizedLDA package to run a penalized linear discriminant analysis in order to select the "most meaningful" variables. I have searched here and on other sites for help in accessing the output from the penalized model, to no avail.
My data comprise 400 variables and 44 groups. The code I used and the results I got so far:
yy.m<-as.matrix(yy) #Factors/groups
xx.m<-as.matrix(xx) #Variables
cv.out<-PenalizedLDA.cv(xx.m,yy.m,type="standard")
## apply the penalty
out <- PenalizedLDA(xx.m,yy.m,lambda=cv.out$bestlambda,K=cv.out$bestK)
To get the structure of the output from the analysis:
> str(out)
List of 10
$ discrim: num [1:401, 1:4] -0.0234 -0.0219 -0.0189 -0.0143 -0.0102 ...
$ xproj : num [1:100, 1:4] -8.31 -14.68 -11.07 -13.46 -26.2 ...
$ K : int 4
$ crits :List of 4
..$ : num [1:4] 2827 2827 2827 2827
..$ : num [1:4] 914 914 914 914
..$ : num [1:4] 162 162 162 162
..$ : num [1:4] 48.6 48.6 48.6 48.6
$ type : chr "standard"
$ lambda : num 0
$ lambda2: NULL
$ wcsd.x : Named num [1:401] 0.0379 0.0335 0.0292 0.0261 0.0217 ...
..- attr(*, "names")= chr [1:401] "R400" "R405" "R410" "R415" ...
$ x : num [1:100, 1:401] 0.147 0.144 0.145 0.141 0.129 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:401] "R400" "R405" "R410" "R415" ...
$ y : num [1:100, 1] 2 2 2 2 2 1 1 1 1 1 ...
- attr(*, "class")= chr "penlda"
I am interested in obtaining a list or matrix of the top 20 variables for feature selection, most likely based on the coefficients of the linear discriminants.
I realise I would have to sort the coefficients in descending order and match the variable names to them. So the output I would expect is something like this imaginary example:
V1 V2
R400 0.34
R1535 0.22...
Can anyone provide any pointers (not necessarily the R code)? Thanks in advance.
Your out$K is 4, and that means you have 4 discriminant vectors. If you want the top 20 variables according to, say, the 2nd vector, try this:
# get a data frame of variable names and coefficients
var.coef <- data.frame(colnames(xx.m), out$discrim[, 2])
# sort the 2nd column (the coefficients) in decreasing order, and keep only the top 20
var.coef.top <- var.coef[order(var.coef[, 2], decreasing = TRUE)[1:20], ]
var.coef.top is what you want.
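One caveat worth noting: `decreasing = TRUE` ranks by signed value, so large negative coefficients land at the bottom. If magnitude is what matters for feature selection, sort by absolute value instead. A self-contained sketch with a made-up coefficient matrix (names and sizes are invented for illustration):

```r
set.seed(42)
# hypothetical stand-in for out$discrim: 10 variables x 4 discriminant vectors
discrim <- matrix(rnorm(10 * 4), nrow = 10,
                  dimnames = list(paste0("R", seq(400, 445, by = 5)), NULL))

var.coef <- data.frame(name = rownames(discrim), coef = discrim[, 2])
# top 5 variables of the 2nd vector, ranked by |coefficient|
var.coef.top <- var.coef[order(abs(var.coef$coef), decreasing = TRUE)[1:5], ]
```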
I'm working from caracal's great example of conducting a factor analysis on dichotomous data, and I'm now struggling to extract the factors from the object produced by the psych package's fa.poly function.
Can anyone help me extract the factors from the fa.poly object (and look at the correlation)?
Please see caracal's example for the working example.
In this example you create an object with:
faPCdirect <- fa.poly(XdiNum, nfactors=2, rotate="varimax") # polychoric FA
so somewhere in faPCdirect there is what you want. I recommend using str() to inspect the structure of faPCdirect:
> str(faPCdirect)
List of 5
$ fa :List of 34
..$ residual : num [1:6, 1:6] 4.79e-01 7.78e-02 -2.97e-0...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:6] "X1" "X2" "X3" "X4" ...
.. .. ..$ : chr [1:6] "X1" "X2" "X3" "X4" ...
..$ dof : num 4
..$ fit
...skip stuff....
..$ BIC : num 4.11
..$ r.scores : num [1:2, 1:2] 1 0.0508 0.0508 1
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "MR2" "MR1"
.. .. ..$ : chr [1:2] "MR2" "MR1"
..$ R2 : Named num [1:2] 0.709 0.989
.. ..- attr(*, "names")= chr [1:2] "MR2" "MR1"
..$ valid : num [1:2] 0.819 0.987
..$ score.cor : num [1:2, 1:2] 1 0.212 0.212 1
So this says that the object is a list of five; its first element is called fa, and that contains an element called score.cor, which is a 2x2 matrix. I think what you want is the off-diagonal.
> faPCdirect$fa$score.cor
[,1] [,2]
[1,] 1.0000000 0.2117457
[2,] 0.2117457 1.0000000
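To pull just that off-diagonal correlation out programmatically (rather than reading it off the printed matrix), a base-R sketch, with the values copied from the output above; on the real object this indexing would apply to faPCdirect$fa$score.cor:

```r
score.cor <- matrix(c(1, 0.2117457, 0.2117457, 1), nrow = 2)

r  <- score.cor[1, 2]                  # single off-diagonal entry
r2 <- score.cor[upper.tri(score.cor)]  # equivalent for a symmetric 2x2 matrix
```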