k-means analysis: how to convert data into numeric? - r

I want to perform a k-means analysis in R. For that I need numeric data. I tried the following
unlist(pca)
as.numeric(pca)
lapply(pca,as.numeric(pca))
pca is just "normal" principal component analysis output, shown in a plot (with the fviz_pca_ind() function).
By the way, when I try to run the k-means analysis it gives me "'list' object cannot be coerced to type 'double'". That is why I thought of turning everything into numeric.
How can I convert the PCA data to numeric?
Thank you ;)

You're almost correct
lapply(pca,as.numeric)
as.numeric is a function and therefore an object in its own right. You need to pass it to lapply() as that object, i.e. write as.numeric rather than as.numeric(pca); the latter calls the function on the whole list straight away, which is exactly what raises the coercion error.
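A quick toy example (with a made-up list, not your actual pca object) shows the difference:
lst <- list(a = c("1", "2"), b = c("3", "4"))
lapply(lst, as.numeric)          # works: converts each element
# lapply(lst, as.numeric(lst))   # fails: as.numeric(lst) is evaluated first,
#                                # raising the very coercion error you saw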

Most PCA functions return a list, so you should show which package or function you used to perform the PCA; then we can see what is in that list.
For example, prcomp returns a list containing the eigenvectors / loadings ($rotation) and the principal components ($x). I suppose you are trying to run k-means on the principal components, which you can do like this:
# perform PCA (scale. = TRUE standardises the variables first)
pca <- prcomp(USArrests, scale. = TRUE)
# the principal components are in pca$x
# run k-means on them (set a seed so the clustering is reproducible)
set.seed(1)
kmeans_clus <- kmeans(pca$x, 3)
## plot
# define colors
COLS <- c("#65587f", "#f18867", "#e85f99")
plot(pca$x[, 1:2], col = COLS[kmeans_clus$cluster], pch = 20)
legend("topright", fill = COLS, legend = 1:3, horiz = TRUE)

Related

Why are polychoric correlation coefficients in matrices calculated by different R packages slightly different for the same data?

I calculated polychoric correlation matrices for the same data frame (20 ordinal variables, 190 missing values) in R, using three different packages, and the coefficients for the same variables are slightly different from each other.
I used the lavCor function from "lavaan" (I did list the ordinal variables when calling the function), the polychoric function from "psych" (1.9.1) (took the rhos), and the cor_auto function from "qgraph" (which is supposed to automatically calculate polychoric correlations for ordinal data). I am confused because I thought they were supposed to give exactly the same results. I read the package documentation but could not find anything that helped me understand why. Could anyone let me know why this happens? I am sure I am missing some tiny difference between them, but I cannot figure it out.
PS: I guess this could have happened because the psych package adjusts for missing values (I have 190) using a correction for continuity, but I still do not understand why qgraph yields different results from lavaan, given that qgraph says it uses lavaan's lavCor function to calculate polychoric correlations.
Thanks!!
library(qgraph)  # cor_auto
library(psych)   # polychoric
library(lavaan)  # lavCor

depanx <- data[1:20]
cor.depanx <- cor_auto(depanx)
polychor <- polychoric(depanx)
polymat <- polychor$rho
lav <- lavCor(depanx, ordered = c("unh","enj","trd","rst","noG","cry","cnc","htd","bdp","lnl","lov",
                                  "cmp","wrg","pst","sch","dss","hlt","bad","ftr","oth"))
# as a result, the matrices "cor.depanx", "polymat", and "lav" all differ from each other
Nice question! I do not know what the "data" object in your example is, but I recreate the two scenarios that most probably caused the discrepancy between the cor_auto and lavCor results. In summary, first you must set the "ordinalLevelMax" argument of cor_auto based on your data, and second you need to synchronize the "missing" argument across the two functions. A detailed explanation is in the code snippet below:
library(qgraph)
set.seed(1)  # make the simulated data reproducible

# Scenario 1: five six-level variables; cor_auto and lavCor agree.
depanx <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                     stringsAsFactors = FALSE)
colnames(depanx) <- LETTERS[1:5]
lav <- lavaan::lavCor(depanx, ordered = colnames(depanx))
cor.depanx <- cor_auto(depanx)
all(lav == cor.depanx) # TRUE

# The first cor_auto argument you need to pay attention to is "ordinalLevelMax".
# It is set to 7 by default, so any variable with more than 7 levels is passed
# to lavCor as plain numeric rather than ordinal.
# Scenario 2: the same dataset, but with 8-level variables. lavCor treats them
# all as ordinal, since we label them as such via its "ordered" argument, and
# therefore uses polychoric correlations. Because "ordinalLevelMax" is still at
# its default of 7, cor_auto detects none of them as ordinal and does not pass
# them to lavCor as ordinal variables, so lavCor computes Pearson correlations
# between all of them.
depanx2 <- data.frame(lapply(1:5, function(x) sample(1:8, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx2) <- LETTERS[1:5]
lav2 <- lavaan::lavCor(depanx2, ordered = colnames(depanx2))
cor.depanx2 <- cor_auto(depanx2)
all(lav2 == cor.depanx2) # FALSE

# The next argument you must synchronize between lavCor and cor_auto is
# "missing", which defaults to "listwise" in lavCor and "pairwise" in cor_auto.
# Scenario 3: set rows 10:20 of the fifth variable to NA without synchronizing
# that argument.
depanx3 <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx3) <- LETTERS[1:5]
depanx3[10:20, 5] <- NA
lav3 <- lavaan::lavCor(depanx3, ordered = colnames(depanx3))
cor.depanx3 <- cor_auto(depanx3)
all(lav3 == cor.depanx3) # FALSE
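To make the third scenario agree, synchronize the missing-data treatment; a minimal sketch, assuming cor_auto's documented "missing" argument is passed through to lavCor:
cor.depanx3b <- cor_auto(depanx3, missing = "listwise")  # match lavCor's default
all(lav3 == cor.depanx3b) # expected TRUE once both use listwise deletion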

package "fdapace" (R) - create a functional plot of the first principal component

My question is about functional principal component analysis in R.
I am working with a multi-dimensional time series (figure omitted).
My goal is to reduce the dimensions by applying functional PCA and then plot the first principal component (target plot also omitted).
I have already used the FPCA function of the fdapace package on the dataset. Unfortunately, I don't understand how to interpret the resulting matrix of FPCA estimates (xiEst).
As I understand it, the values of the principal components are stored in the columns of this matrix.
Unfortunately, the number of columns doesn't match the number of time intervals of my multi-dimensional time series.
I don't know how the values in the matrix correspond to the values of the original data, nor how to plot the first principal component as a dimensional reduction of the original data.
If you need some code to reproduce the situation you can use the medfly dataset of the package:
library(fdapace)
data(medfly25)
Flies <- MakeFPCAInputs(medfly25$ID, medfly25$Days, medfly25$nEggs)
fpcaObjFlies <- FPCA(Flies$Ly, Flies$Lt)
when I plot the first principal component via
plot(fpcaObjFlies$xiEst[,1], type = "o")
the graph doesn't really fit my expectations (plot omitted). I would have expected a graph with 25 observations, similar to the graphs of the medfly dataset.
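A minimal sketch, assuming fdapace's documented return values (xiEst holds one score per subject and per component, phi holds the eigenfunctions evaluated on workGrid, and fitted() reconstructs trajectories), of how the pieces fit together:
# xiEst is n x K (one score per fly and per component), not a curve over time,
# which is why its size does not match the time grid.
# The first principal component as a function of time is the first eigenfunction:
plot(fpcaObjFlies$workGrid, fpcaObjFlies$phi[, 1], type = "l",
     xlab = "Days", ylab = "First eigenfunction")
# Rank-1 reconstruction of each fly's trajectory (mean + score * eigenfunction):
fit1 <- fitted(fpcaObjFlies, K = 1)
matplot(fpcaObjFlies$workGrid, t(fit1), type = "l", lty = 1,
        xlab = "Days", ylab = "Reconstructed nEggs")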

Accessing and Interpreting Principal Components

I have performed PCA in R (using the prcomp() function) on returns data of 5 variables.
This is my code:
pca1 = prcomp(~df.var1+df.var2+df.var3+df.var4+df.var5, data = ccy)
I would like to move on to the interpretation stage. The "rotation" matrix in the "pca1" object comprises, to my understanding, the coefficients assigned to each of the original 5 variables in the equation describing each principal component (PC). This link suggests calculating the correlations between the PCs and each of the variables. Is the "x" object within the "pca1" object (accessed using
pcs = pca1$x
) a matrix of values for the PCs? If I calculated correlations between these values and the original variables, would that represent the correlation between the PCs and the variables? Is there perhaps a "built-in" method for this?
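A minimal sketch of that correlation check, using USArrests as a stand-in for the ccy returns data (which is not available here):
pca1 <- prcomp(USArrests, scale. = TRUE)
scores <- pca1$x                  # yes: one row per observation, one column per PC
# correlations between each original variable and each PC:
cor(USArrests, scores)
# for a scaled PCA this equals the rotation matrix scaled by the component sds:
sweep(pca1$rotation, 2, pca1$sdev, `*`)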

How to downproject with PCA in R?

How can I down-project with PCA in R? When I use the princomp function on my data, it creates as many principal components as there are dimensions in the original data. But how can I down-project, let's say, if I have 10-dimensional data and I want to down-project to 2 dimensions?
If you mean doing PCA and keeping just a few of the components (dimensions), then one way is to use principal() in the psych package, with the nfactors argument controlling how many components to keep.
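A minimal sketch along those lines, on toy 10-dimensional data (note that principal() works from the correlation matrix by default, so the base-R prcomp analogue standardizes the variables; the two projections can differ in scaling and sign):
library(psych)
set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)         # toy 10-dimensional data
fit <- principal(X, nfactors = 2, rotate = "none")
scores2d <- fit$scores                          # 100 x 2 down-projected data
# base-R analogue:
scores2d_b <- prcomp(X, scale. = TRUE)$x[, 1:2]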

how to cut the dendrogram with VARCLUS in R (package Hmisc)

I want to perform variable clustering using the varclus() function from Hmisc package.
However, I do not know how to put the clusters of variables into a table if I cut the dendrogram into 10 clusters of variables.
I used to use
groups <- cutree(hclust(d), k=10)
to cut dendrograms of individuals but it doesn't work for variables.
Expanding on @Anatoliy's comment, you indeed use the same cutree() function
as before, because the clustering in varclus() is actually done by the hclust() function.
When you use varclus() you create an object of class varclus that contains an hclust object, which can be accessed via $hclust.
Example:
library(Hmisc)
x <- varclus(d)                 # d: the data you are clustering, as before
x_hclust <- x$hclust            # retrieve the hclust object
groups <- cutree(x_hclust, 10)  # cut into 10 clusters of variables
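To get the clusters into a table-like form (a base-R suggestion; cutree() names the result after your variables):
table(groups)                  # size of each of the 10 clusters
split(names(groups), groups)   # which variables fall in which cluster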
