Weights in Principal Component Analysis (PCA) using psych::principal - r

I am computing a Principal Component Analysis with this matrix as input, using the function psych::principal. Each column of the input data contains the monthly correlations between crop yields and a climatic variable for one of 30 regions, so what I want to obtain with the PCA is to reduce the information and find similarity patterns in the response across regions.
pc <- principal(dat, nfactors = 9, residuals = FALSE, rotate = "varimax", n.obs = NA, covar = TRUE, scores = TRUE, missing = FALSE, impute = "median", oblique.scores = TRUE, method = "regression")
The matrix has dimensions 10*30, and the first message I get is:
The determinant of the smoothed correlation was zero. This means the objective function is not defined. Chi square is based upon observed residuals. The determinant of the smoothed correlation was zero. This means the objective function is not defined for the null model either. The Chi square is thus based upon observed correlations.

Warning messages:
1: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
2: In principal(dat, nfactors = 3, residuals = F, rotate = "none", : The matrix is not positive semi-definite, scores found from Structure loadings
Nonetheless, the function seems to work; the main problem appears when you check pc$weights and realize that it is equal to pc$loadings.
When the number of columns is less than or equal to the number of rows, the results are coherent; however, that is not the case here.
I need to obtain the weights so that I can express the score values on the same scale as the input data (correlation values).
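For reference, a minimal sketch of the rank problem (my own illustration with a random stand-in matrix of the same 10 x 30 shape, not the real data), assuming psych needs to invert the correlation/covariance matrix to turn loadings into score weights:
set.seed(1)
dat <- matrix(rnorm(10 * 30), nrow = 10, ncol = 30)  # stand-in for the 10 x 30 input
R <- cor(dat)
qr(R)$rank  # at most 9: far short of the 30 needed for solve(R) to exist
det(R)      # effectively zero, matching the "determinant ... was zero" message above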
I would really appreciate any help.
Thank you.

Related

Calculate GWESP for a matrix with fixed decay parameter

I'm wondering whether there are any pre-programmed R functions that can calculate geometrically weighted edgewise shared partners (GWESP, per Hunter (2007)) for a given adjacency matrix with a fixed decay parameter (alpha) and then return the estimated values in matrix form as well.
I've looked at the xergm and igraph packages but could not find one; they only estimate GWESP while fitting network models, which is not what I want to do here. I just need a function that calculates GWESP for a given adjacency matrix (with a fixed decay parameter) and returns the estimated GWESP value for each dyad of the adjacency matrix, also in matrix form.
For example
# For a given adjacency matrix (adjm)
adjm <- matrix(sample(0:1, 100, replace = TRUE, prob = c(0.6, 0.4)), ncol = 10)
# Apply some function that calculates GWESP for each dyad of adjm (with alpha
# fixed at some value) and returns a matrix of the same dimensions filled with
# the estimated GWESP values
somefunction(adjm, alpha = somevalue)
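For what it's worth, a minimal sketch of one possible per-dyad calculation (my own illustration, assuming an undirected binary network and the e^alpha * (1 - (1 - e^-alpha)^ESP) weighting from Hunter (2007)). The function name gwesp_matrix is made up, and this is not a drop-in replacement for the ergm/xergm change-statistic machinery:
# Hypothetical helper: per-dyad geometrically weighted shared-partner values
gwesp_matrix <- function(adjm, alpha = 0.5) {
  A <- 1 * (adjm | t(adjm))  # symmetrise and coerce to 0/1
  diag(A) <- 0
  esp <- (A %*% A) * A       # shared-partner counts, kept only for existing edges
  exp(alpha) * (1 - (1 - exp(-alpha))^esp) * A
}

set.seed(42)
adjm <- matrix(sample(0:1, 100, replace = TRUE, prob = c(0.6, 0.4)), ncol = 10)
gwesp_matrix(adjm, alpha = 0.25)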

Applying PCA to a covariance matrix

I am having some difficulty understanding some steps in a procedure. In short, they take coordinate data, find the covariance matrix, apply PCA, and then obtain the standard deviations as the square roots of the eigenvalues. I am trying to reproduce this process, but I am stuck on the steps.
The Steps Taken
The data set consists of one matrix, R, that contains coordinate pairs (x(i), y(i)) with i = 1, ..., N, where N is the total number of instances recorded. We applied PCA to the covariance matrix of the R input data set, and the following variables were obtained:
a) the principal components of the new coordinate system, the eigenvectors u and v, and
b) the eigenvalues (λ1 and λ2) corresponding to the total variability explained by each principal component.
With these variables, a graphical representation was created for each item. Two orthogonal segments were centred on the mean of the coordinate data. The segments’ directions were driven by the eigenvectors of the PCA, and the length of each segment was defined as one standard deviation (σ1 and σ2) around the mean, which was calculated by extracting the square root of each eigenvalue, λ1 and λ2.
My Steps
# Reproducible data
set.seed(1)
x <- rnorm(10, 50, 4)
y <- rnorm(10, 50, 7)
# Note: my real data are not distributed exactly like this
df <- data.frame(x, y)  # this is my R matrix
covar.df <- cov(df, use = "all.obs", method = "pearson")  # this is my covariance matrix
pca.results <- prcomp(covar.df)  # this applies PCA to the covariance matrix
pca.results$sdev  # these are the standard deviations of the principal components,
                  # which is what I believe I am looking for
This is where I am stuck, because I am not sure whether the sdev output from prcomp() is what I am after, or whether I should scale my data first. The variables are all on the same scale, so I do not see the issue with it.
My second question is: how do I extract the standard deviation in the x and y directions?
You don't apply prcomp() to the covariance matrix; you apply it to the data itself.
result <- prcomp(df)
If by scaling you mean normalising or standardising, that happens before you call prcomp(). For more information, see this introductory link on the procedure: pca on R. It can walk you through the basics. To get the sdev, use summary() on the result object:
summary(result)
result$sdev
You don't apply prcomp() to the covariance matrix. scale = TRUE bases the PCA on the correlation matrix, and scale = FALSE bases it on the covariance matrix:
df.cor = prcomp(df, scale=TRUE)
df.cov = prcomp(df, scale=FALSE)
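As a further illustration of the graphical step described in the question (my own sketch, not part of either answer): the two orthogonal segments are centred on the mean, directed along the eigenvectors in pca$rotation, and scaled by one standard deviation, i.e. pca$sdev (the square roots of the eigenvalues).
set.seed(1)
x <- rnorm(10, 50, 4)
y <- rnorm(10, 50, 7)
df <- data.frame(x, y)

pca <- prcomp(df, center = TRUE, scale. = FALSE)  # PCA on the covariance matrix
ctr <- colMeans(df)                               # centre of the coordinate data

plot(df, asp = 1, pch = 19)
for (k in 1:2) {
  v <- pca$rotation[, k] * pca$sdev[k]            # eigenvector scaled by sigma_k
  segments(ctr[1] - v[1], ctr[2] - v[2],
           ctr[1] + v[1], ctr[2] + v[2], col = k + 1, lwd = 2)
}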

How is xgboost cover calculated?

Could someone explain how the Cover column in the xgboost R package is calculated in the xgb.model.dt.tree function?
In the documentation it says that Cover "is a metric to measure the number of observations affected by the split".
When you run the following code, given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.
data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
# agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of this number (1628.25)?
Also, Cover for node 1 of tree 1 is 788.852; how is this number calculated?
Any help would be much appreciated. Thanks.
Cover is defined in xgboost as:
the sum of second order gradient of training data classified to the
leaf, if it is square loss, this simply corresponds to the number of
instances in that branch. Deeper in the tree a node is, lower this
metric will be
https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd
Not particularly well documented....
In order to calculate the cover, we need to know the predictions at that point in the tree and the second derivative of the loss function with respect to the prediction.
Lucky for us, the prediction for every data point (6513 of them) at node 0-0 in your example is 0.5. This is a global default setting whereby your first prediction at t = 0 is 0.5.
base_score [ default=0.5 ] the initial prediction score of all
instances, global bias
http://xgboost.readthedocs.org/en/latest/parameter.html
The gradient of binary logistic loss (your objective function) is p - y, where p is your prediction and y is the true label.
The Hessian (which we need for this) is therefore p * (1 - p). Note that the Hessian can be determined without y, the true labels.
So, bringing it home:
6513 * 0.5 * (1 - 0.5) = 1628.25
In the second tree, the predictions at that point are no longer all 0.5, so let's get the predictions after one tree:
p <- predict(bst, newdata = train$data, ntree = 1)
head(p)
[1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077
sum(p * (1 - p))  # sum of the Hessians in that node (the root node holds all the data)
[1] 788.8521
Note that for linear (squared-error) regression the Hessian is always 1, so the cover simply indicates how many training examples fall in that leaf.
The big takeaway is that cover is defined by the Hessian of the objective function. There is plenty of information out there on deriving the gradient and Hessian of the binary logistic function.
These slides are helpful for seeing why Hessians are used as weights, and they also explain how xgboost splits differently from standard trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
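A small check of the arithmetic above (my own snippet, not from the original answer), reproducing the root-node Cover of tree 0 directly from the Hessian p * (1 - p) with the default base_score of 0.5:
library(xgboost)
data(agaricus.train, package = "xgboost")
n <- nrow(agaricus.train$data)  # 6513 observations
p0 <- 0.5                       # base_score: the initial prediction for every row
n * p0 * (1 - p0)               # 1628.25, the Cover reported for node 0 of tree 0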

How do you select the rank-k approximation for SVDImpute (package: imputation) in R?

I have a matrix with nominal values from 1-5, with some missing values. I would like to use SVDImpute (from the "imputation" package) in R to fill in the missing values, but I am unsure of what number to use for k (rank-k approximation) in the function.
The help page description of the imputation is:
Imputation using the SVD. First fill missing values using the mean of the column. Then, compute a low, rank-k approximation of x. Fill the missing values again from the rank-k approximation. Recompute the rank-k approximation with the imputed values and fill again, repeating num.iters times.
To me, this sounds like the column means are calculated as part of the function; is this correct? If so, how was the value of k = 3 chosen for the example?
x <- matrix(rnorm(100), 10, 10)
x.missing <- x > 1
x[x.missing] <- NA
SVDImpute(x, 3)
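The help text above only describes the algorithm, so the k = 3 in the example is presumably just an illustrative choice. A common heuristic (my own sketch, not taken from the imputation package) is to mean-impute once, look at the singular values of the filled matrix, and pick k at the elbow of the scree or at some target share of variance:
set.seed(1)
x <- matrix(rnorm(100), 10, 10)
x[x > 1] <- NA

# Mean-impute once, matching the first step of the help-text description
x.filled <- apply(x, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

d <- svd(x.filled)$d
plot(d, type = "b", ylab = "singular value")  # look for an elbow in the scree plot
cumsum(d^2) / sum(d^2)                        # cumulative variance captured by rank k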
Any help is greatly appreciated.

standardization of data in R

I am doing some PCA analysis for large spreadsheets, and I'm picking my PCs according to the loadings.
As far as I have read, since my data have different units, standardization is a must before performing the PCA.
Does the function prcomp() inherently perform standardization?
I was reading the prcomp() help file and saw this under the arguments of prcomp():
scale. a logical value indicating whether the variables should be scaled to have
unit variance before the analysis takes place. The default is FALSE for
consistency with S, but in general scaling is advisable. Alternatively, a
vector of length equal the number of columns of x can be supplied. The
value is passed to scale.
Does "scaling variables to have unit variance" mean standardization?
I am currently using this command:
prcomp(formula = ~ ., data = file, center = TRUE, scale = TRUE, na.action = na.omit)
Is this enough, or should I do a separate standardization step?
Thanks,
Yes, scale = TRUE will result in all variables being scaled to have unit variance (i.e. a variance of 1, and hence a standard deviation of 1). This is the common definition of "standardise", but there are other ways to do it etc. center = TRUE mean-centres the data, i.e. the mean of a variable is subtracted from each observation of that variable.
When you do this (scale = TRUE, center = TRUE) instead of the PCA being on the covariance matrix of your data set, it is on the correlation matrix. Hence the PCA finds axes that explain the correlations between variables rather than their covariances.
If by standardization you mean that each column is divided by its standard deviation and the mean of each column is subtracted, then using scale = TRUE and center = TRUE is what you want.
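A quick check with made-up data (my own illustration, not part of the answers above) that scale. = TRUE in prcomp() is exactly this divide-by-standard-deviation standardization:
set.seed(1)
file <- data.frame(a = rnorm(20, 100, 10), b = rnorm(20, 5, 0.5), c = rnorm(20))
p1 <- prcomp(file, center = TRUE, scale. = TRUE)
p2 <- prcomp(scale(file, center = TRUE, scale = TRUE))  # standardise by hand first
all.equal(p1$sdev, p2$sdev)                             # TRUE: the two agree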
