I am working with a dataset of 10000 data points and 100 variables in R. Unfortunately the variables I have do not describe the data in a good way. I carried out a PCA analysis using prcomp() and the first 3 PCs seem to account for a most of the variability of the data. As far as I understand, a principal component is a combination of different variables; therefore it has a certain value corresponding to each data point and can be considered as a new variable. Would I be able to add these principal components as 3 new variables to my data? I would need them for further analysis.
A reproducible dataset:
set.seed(144)
x <- data.frame(matrix(rnorm(2^10*12), ncol=12))
y <- prcomp(formula = ~., data=x, center = TRUE, scale = TRUE, na.action = na.omit)
PC scores are stored in the element x of prcomp() result.
str(y)
List of 6
$ sdev : num [1:12] 1.08 1.06 1.05 1.04 1.03 ...
$ rotation: num [1:12, 1:12] -0.0175 -0.1312 0.3284 -0.4134 0.2341 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "X1" "X2" "X3" "X4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:12] 0.02741 -0.01692 -0.03228 -0.03303 0.00122 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ scale : Named num [1:12] 0.998 1.057 1.019 1.007 0.993 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ x : num [1:1024, 1:12] 1.023 -1.213 0.167 -0.118 -0.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1024] "1" "2" "3" "4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = ~., data = x, na.action = na.omit, center = TRUE, scale = TRUE)
- attr(*, "class")= chr "prcomp"
You can get them with y$x and then chose those columns you need.
x.new<-cbind(x,y$x[,1:3])
str(x.new)
'data.frame': 1024 obs. of 15 variables:
$ X1 : num 1.14 2.38 0.684 1.785 0.313 ...
$ X2 : num -0.689 0.446 -0.72 -3.511 0.36 ...
$ X3 : num 0.722 0.816 0.295 -0.48 0.566 ...
$ X4 : num 1.629 0.738 0.85 1.057 0.116 ...
$ X5 : num -0.737 -0.827 0.65 -0.496 -1.045 ...
$ X6 : num 0.347 0.056 -0.606 1.077 0.257 ...
$ X7 : num -0.773 1.042 2.149 -0.599 0.516 ...
$ X8 : num 2.05511 0.4772 0.18614 0.02585 0.00619 ...
$ X9 : num -0.0462 1.3784 -0.2489 0.1625 0.6137 ...
$ X10: num -0.709 0.755 0.463 -0.594 -1.228 ...
$ X11: num -1.233 -0.376 -2.646 1.094 0.207 ...
$ X12: num -0.44 -2.049 0.315 0.157 2.245 ...
$ PC1: num 1.023 -1.213 0.167 -0.118 -0.186 ...
$ PC2: num 1.2408 0.6077 1.1885 3.0789 0.0797 ...
$ PC3: num -0.776 -1.41 0.977 -1.343 0.987 ...
Didzis Elferts's response only works if your data, x, has no NAs. Here's how you can add the components if your data does have NAs.
library(tidyverse)
components <- y$x %>% rownames_to_column("id")
x <- x %>% rownames_to_column("id") %>% left_join(components, by = "id")
Related
res.pca = prcomp(y, scale = TRUE)
summ=summary(res.pca)
summ
Gives me the output Desired Output
I want to change this Summary in to a Data Frame,
I've Tried to use the do.call(cbind, lapply(res.pca, summary)) but it gives me the summary of Min/Max but not the one I desire.
Please See That I dont want to extract values from column names, I seek a general solution That I can use.
What you are looking for is in the "element" importance of summary(res.pca):
Example taken from Principal Components Analysis - how to get the contribution (%) of each parameter to a Prin.Comp.?:
a <- rnorm(10, 50, 20)
b <- seq(10, 100, 10)
c <- seq(88, 10, -8)
d <- rep(seq(3, 16, 3), 2)
e <- rnorm(10, 61, 27)
my_table <- data.frame(a, b, c, d, e)
res.pca <- prcomp(my_table, scale = TRUE)
summary(res.pca)$importance
# PC1 PC2 PC3 PC4 PC5
#Standard deviation 1.7882 0.9038 0.8417 0.52622 9.037e-17
#Proportion of Variance 0.6395 0.1634 0.1417 0.05538 0.000e+00
#Cumulative Proportion 0.6395 0.8029 0.9446 1.00000 1.000e+00
class(summary(res.pca)$importance)
#[1] "matrix"
N.B.:
When you want to "study" an object, it can be convenient to use str on it. Here, you can do str(summary(pca) to see where the information are and hence where you can get what you want:
str(summary(res.pca))
List of 6
$ sdev : num [1:5] 1.79 9.04e-01 8.42e-01 5.26e-01 9.04e-17
$ rotation : num [1:5, 1:5] 0.278 0.512 -0.512 0.414 -0.476 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "a" "b" "c" "d" ...
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:5] 34.9 55 52 9 77.8
..- attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
$ scale : Named num [1:5] 22.4 30.28 24.22 4.47 26.11
..- attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
$ x : num [1:10, 1:5] -2.962 -1.403 -1.653 -0.537 1.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
$ importance: num [1:3, 1:5] 1.788 0.64 0.64 0.904 0.163 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Standard deviation" "Proportion of Variance" "Cumulative Proportion"
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "summary.prcomp"
I have been following an online example for R Kohonen self-organising maps (SOM) which suggested that the data should be centred and scaled before computing the SOM.
However, I've noticed the object created seems to have attributes for centre and scale, in which case am I really applying a redundant step by centring and scaling first? Example script below
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
str(som1)
The output from the last line of the script is:
List of 13
$ data :List of 1
..$ : num [1:150, 1:4] -0.898 -1.139 -1.381 -1.501 -1.018 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
.. ..- attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
.. ..- attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
$ unit.classif : num [1:150] 3 5 5 5 4 2 4 4 6 5 ...
$ distances : num [1:150] 0.0426 0.0663 0.0768 0.0744 0.1346 ...
$ grid :List of 6
..$ pts : num [1:36, 1:2] 1.5 2.5 3.5 4.5 5.5 6.5 1 2 3 4 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "x" "y"
..$ xdim : num 6
..$ ydim : num 6
..$ topo : chr "hexagonal"
..$ neighbourhood.fct: Factor w/ 2 levels "bubble","gaussian": 1
..$ toroidal : logi FALSE
..- attr(*, "class")= chr "somgrid"
$ codes :List of 1
..$ : num [1:36, 1:4] -0.376 -0.683 -0.734 -1.158 -1.231 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:36] "V1" "V2" "V3" "V4" ...
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
$ changes : num [1:500, 1] 0.0445 0.0413 0.0347 0.0373 0.0337 ...
$ alpha : num [1:2] 0.05 0.01
$ radius : Named num [1:2] 3.61 0
..- attr(*, "names")= chr [1:2] "66.66667%" ""
$ user.weights : num 1
$ distance.weights: num 1
$ whatmap : int 1
$ maxNA.fraction : int 0
$ dist.fcts : chr "sumofsquares"
- attr(*, "class")= chr "kohonen"
Note notice that in lines 7 and 10 of the output there are references to centre and scale. I would appreciate an explanation as to the process here.
Your step with scaling is not redundant because in source code there are no scaling, and attributes, that you see in 7 and 10 are attributes from train dataset.
To check this, just run and compare results of this chunk of code:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
#compare train datasets
str(dt)
str(as.matrix(iris[, 1:4]))
# Prepare SOM
set.seed(590507)
som1 <- kohonen::som(dt,
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#without scaling
som2 <- kohonen::som(as.matrix(iris[, 1:4]),
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#compare results of som function
str(som1)
str(som2)
In R, str() is handy for showing the structure of an object, such as the list of lists returned by lm() and other modelling functions, but it gives way too much output. I'm looking for some tool to create a simple tree diagram showing only the names of the list elements and their structure.
e.g., for this example,
data(Prestige, package="car")
out <- lm(prestige ~ income+education+women, data=Prestige)
str(out, max.level=2)
#> List of 12
#> $ coefficients : Named num [1:4] -6.79433 0.00131 4.18664 -0.00891
#> ..- attr(*, "names")= chr [1:4] "(Intercept)" "income" "education" "women"
#> $ residuals : Named num [1:102] 4.58 -9.39 4.69 4.22 8.15 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ effects : Named num [1:102] -472.99 -123.61 -92.61 -2.3 6.83 ...
#> ..- attr(*, "names")= chr [1:102] "(Intercept)" "income" "education" "women" ...
#> $ rank : int 4
#> $ fitted.values: Named num [1:102] 64.2 78.5 58.7 52.6 65.3 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ assign : int [1:4] 0 1 2 3
#> $ qr :List of 5
#> ..$ qr : num [1:102, 1:4] -10.1 0.099 0.099 0.099 0.099 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
#> ..$ qraux: num [1:4] 1.1 1.44 1.06 1.06
#> ..$ pivot: int [1:4] 1 2 3 4
#> ..$ tol : num 1e-07
#> ..$ rank : int 4
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 98
...
I would like to get something like this:
This is similar to what I get from tree for file folders in my file system:
C:\Dropbox\Documents\images>tree
Folder PATH listing
Volume serial number is 2250-8E6F
C:.
+---cartoons
+---chevaliers
+---icons
+---milestones
+---minard
+---minard-besancon
The result could be either in graphic characters, as in tree or an actual graphic as shown above. Is anything like this available?
A simple approach to getting this from the str output would be something like...
a <- capture.output(str(out, max.level=2))
a <- trimws(gsub("\\:.*", "", a[grepl("\\$", a)]))
cat(a, sep="\n")
$ coefficients
$ residuals
$ effects
$ rank
$ fitted.values
$ assign
$ qr
..$ qr
..$ qraux
..$ pivot
..$ tol
..$ rank
$ df.residual
$ xlevels
$ call
$ terms
$ model
..$ prestige
..$ income
..$ education
..$ women
This question is related to my earlier question found here: https://stackoverflow.com/questions/33089532/r-accounting-for-a-factor-with-this-logistic-regression-function-replace-lappl
I realize that I didn't do a good job at asking the first question, so here is a more simple analog with actual data:
My data looks something like this:
#data look like this, but with a variable number of "y" columms
wk<-rep(1:50,2)
X<-rnorm(1:100,1)
y1<-rnorm(1:100,1)
y2<-rnorm(1:100,1)
df1<-as.data.frame(cbind(wk,X,y1,y2))
df1$hyst<-ifelse(df1$wk>=5 & df1$wk<32, "R", "F")
Y<-df1[, -which(colnames(df1) %in% c("wk"))] #this step makes more sense with my actual data since I have a bunch of columns to remove
l1<-length(Y)-1
lst1<-lapply(2:l1,function(x){colnames(Y[x])})
dflst<-c("Y",'Y[Y$hyst=="R",]','Y[Y$hyst=="F",]')
I want to run a model over all Y columns for the full data set (all data) and for two subsets, when the factor hyst=="R" and when hyst=="F".
To do this, I have nested two lapply functions, which sort of works, but I think it essentially doubles my results and is causing me all sorts of list headaches.
Here is the nested lapply code:
lms <- lapply(dflst, function(z){
lapply(lst1, function(y) {
form <- paste0(y, " ~ X")
lm(form, data=eval(parse(text=z)))
})
})
How can I replace or modify the nested lapply function to obtain a model run for each Y column for each data set( all, "R", and "F")?
Construct your DF list like
DFlst <- c(list(full=Y), split(Y, Y$hyst))
str(DFlst)
List of 3
$ full:'data.frame': 100 obs. of 4 variables:
..$ X : num [1:100] 1.792 3.192 0.367 1.632 1.388 ...
..$ y1 : num [1:100] 3.354 1.189 1.99 0.639 0.1 ...
..$ y2 : num [1:100] 0.864 2.415 0.437 1.069 1.368 ...
..$ hyst: chr [1:100] "F" "F" "F" "F" ...
$ F :'data.frame': 46 obs. of 4 variables:
..$ X : num [1:46] 1.792 3.192 0.367 1.632 0.707 ...
..$ y1 : num [1:46] 3.354 1.189 1.99 0.639 0.894 ...
..$ y2 : num [1:46] 0.864 2.415 0.437 1.069 1.213 ...
..$ hyst: chr [1:46] "F" "F" "F" "F" ...
$ R :'data.frame': 54 obs. of 4 variables:
..$ X : num [1:54] 1.388 2.296 0.409 1.494 0.943 ...
..$ y1 : num [1:54] 0.1002 0.6425 -0.0918 1.199 0.8767 ...
..$ y2 : num [1:54] 1.368 1.122 0.402 -0.237 1.518 ...
..$ hyst: chr [1:54] "R" "R" "R" "R" ...
Do some regressions:
res <- lapply(DFlst, function(DF) {
cols = grep("^y[0-9]+$",names(DF),value=TRUE)
lapply(setNames(cols,cols),
function(y) lm(paste(y,"~X"), data=DF))
})
str(res, list.len=2, give.attr=FALSE)
List of 3
$ full:List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.903 0.111
.. ..$ residuals : Named num [1:100] 2.2509 -0.0698 1.046 -0.4464 -0.9578 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.423 -0.166
.. ..$ residuals : Named num [1:100] -0.2623 1.5213 -0.9253 -0.0837 0.1751 ...
.. .. [list output truncated]
$ F :List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.9289 0.0769
.. ..$ residuals : Named num [1:46] 2.2871 0.0146 1.0332 -0.4157 -0.0889 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.4177 -0.0789
.. ..$ residuals : Named num [1:46] -0.413 1.25 -0.952 -0.22 -0.149 ...
.. .. [list output truncated]
[list output truncated]
I am trying to overlay envfit arrows on to an NMDS chart like this one (which is when I replace missing values with fake numbers):
However, with our actual data, it doesnt give the arrows but labels each point individually, like this:
Any suggestions would be appreciated.
Code:
# Make MDS
x.mds <- metaMDS(x_matrix, trace = FALSE)
# Extract point co-ordinates for putting into ggplot
NMDS <- data.frame(MDS1 = x.mds$points[,1], MDS2 = x.mds$points[,2])
p <- ggplot(NMDS, aes(MDS1, MDS2))
p + geom_point()
#environmental variables
ef <- envfit(x.mds ~ pH + Ammonia + DO, x.env)
ef <- envfit(x.mds ~ pH + Ammonia + DO, x.env, na.rm = TRUE) ##ALTERNATIVE
plot(ef)
Data:
Sample Region pH Ammonia Nitrate BOD DO
15 N 7.618 0.042 0.845 1 NA
34 N 7.911 0.04 7.41 8 5.62
42 SE 7.75 NA 3.82 1 21.629
........
> ef
***VECTORS
NMDS1 NMDS2 r2 Pr(>r)
pH 0.50849 -0.86107 0.0565 0.719
Ammonia 0.99050 -0.13751 0.0998 0.504
DO -0.88859 -0.45871 0.1640 0.319
P values based on 999 permutations.
1 observation deleted due to missingness
> str(ef)
List of 3
$ vectors :List of 4
..$ arrows : num [1:3, 1:2] 0.508 0.991 -0.889 -0.861 -0.138 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:3] "pH" "Ammonia" "DO"
.. .. ..$ : chr [1:2] "NMDS1" "NMDS2"
.. ..- attr(*, "decostand")= chr "normalize"
..$ r : Named num [1:3] 0.0565 0.0998 0.164
.. ..- attr(*, "names")= chr [1:3] "pH" "Ammonia" "DO"
..$ permutations: num 999
..$ pvals : num [1:3] 0.719 0.504 0.319
..- attr(*, "class")= chr "vectorfit"
$ factors : NULL
$ na.action:Class 'omit' int 17
- attr(*, "class")= chr "envfit"