What is the t-SNE initial PCA step doing? (R)

Looking at the parameters to the Rtsne function:
https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf
There is a parameter called "pca" defined as "logical; Whether an initial PCA step should be performed (default: TRUE)"
Let's say you have a 10-dimensional feature set and you run t-SNE. I was thinking you would scale the 10-D matrix and then pass it to Rtsne().
What does the PCA indicated by the pca parameter do?
Would it take the 10-D matrix and run PCA on that? If so, would it pass all 10 dimensions in the PCA space on to t-SNE?
Is there any info anywhere else about what this initial PCA step is?
Thank you.

The original t-SNE paper used PCA to reduce the dimensionality of the MNIST data prior to running t-SNE. Rtsne does the same: with pca = TRUE (the default), the input is first projected onto its top initial_dims principal components (default 50) before the embedding is computed, so a 10-dimensional input passes through essentially unreduced.
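A minimal sketch of the call, assuming the Rtsne package and using the iris measurements as a stand-in for a scaled feature matrix:
library(Rtsne)
# scale the features first, as in the question
X <- scale(as.matrix(iris[, 1:4]))
X <- unique(X)  # Rtsne rejects duplicate rows by default
# pca = TRUE (the default) first projects X onto its top initial_dims
# principal components; with fewer columns than initial_dims this is
# just a rotation, not a reduction
set.seed(42)
fit <- Rtsne(X, pca = TRUE, initial_dims = 50, perplexity = 30)
plot(fit$Y, pch = 20, main = "t-SNE embedding")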

Related

k-means analysis: how to convert data into numeric?

I want to perform a k-means analysis in R. For that I need numeric data. I tried the following:
unlist(pca)
as.numeric(pca)
lapply(pca,as.numeric(pca))
pca is just "normal" principal component analysis data, shown in a plot (with the fviz_pca_ind() function).
By the way, when I try to run the k-means analysis, it gives me "list object cannot be coerced to type double". That is why I thought of turning everything into numeric.
How to convert the pca-data into numeric?
Thank you ;)
You're almost correct:
lapply(pca, as.numeric)
as.numeric is a function, and functions in R are objects. You need to pass the function itself to lapply(), not the result of calling it, so drop the (pca) after as.numeric.
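A minimal base-R illustration of the difference:
# the function object itself is passed; lapply() then calls it on each element
lapply(list(a = "1", b = "2"), as.numeric)
# $a
# [1] 1
#
# $b
# [1] 2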
Most PCA functions return a list, so you should say which package or function you used to perform the PCA; then we can see what's in the list.
For example, prcomp returns a list containing the eigenvectors/loadings ($rotation) and the principal components ($x). I suppose you are trying to run k-means on the principal components, which you can do like this:
# perform PCA, scaling the variables to unit variance
pca <- prcomp(USArrests, scale. = TRUE)
# the principal component scores are in pca$x; run k-means on them
kmeans_clus <- kmeans(pca$x, 3)
## plot
# define one color per cluster
COLS <- c("#65587f", "#f18867", "#e85f99")
plot(pca$x[, 1:2], col = COLS[kmeans_clus$cluster], pch = 20)
legend("topright", fill = COLS, legend = 1:3, horiz = TRUE)

package "fdapace" (R) - create a functional plot of the first principal component

My question is about functional principal component analysis in R.
I am working with a multi-dimensional time series looking something like this:
My goal is to reduce the dimensions by applying functional PCA and then plot the first principal component like this:
I have already used the FPCA function of the fdapace package on the dataset. Unfortunately, I don't understand how to interpret the resulting matrix of the FPCA estimates (xiEst).
In my understanding, the values of the principal components are stored in the columns of the matrix.
Unfortunately, the number of columns doesn't match the number of time intervals of my multi-dimensional time series.
I don't know how the values in the matrix correspond to the values of the original data, or how to plot the first principal component as a dimensional reduction of the original data.
If you need some code to reproduce the situation you can use the medfly dataset of the package:
library(fdapace)
data(medfly25)
Flies <- MakeFPCAInputs(medfly25$ID, medfly25$Days, medfly25$nEggs)
fpcaObjFlies <- FPCA(Flies$Ly, Flies$Lt)
when I plot the first principal component via
plot(fpcaObjFlies$xiEst[,1], type = "o")
the graph doesn't really fit my expectations:
I would have expected a graph with 25 observations similar to the graphs of the medfly dataset.
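One reading, assuming the standard fdapace output fields (workGrid, mu, phi, xiEst): xiEst holds one FPC score per fly and per component, so its first column has one value per subject rather than a curve over time, and its number of columns is the number of retained components, not the number of time intervals. The curve you are after is the first eigenfunction, stored in $phi and evaluated on $workGrid:
library(fdapace)
data(medfly25)
Flies <- MakeFPCAInputs(medfly25$ID, medfly25$Days, medfly25$nEggs)
fpcaObjFlies <- FPCA(Flies$Ly, Flies$Lt)
# first eigenfunction over the working time grid
plot(fpcaObjFlies$workGrid, fpcaObjFlies$phi[, 1], type = "l",
     xlab = "Days", ylab = "First eigenfunction")
# one fly's one-component reconstruction: mean curve plus its first
# score times the first eigenfunction
lines(fpcaObjFlies$workGrid,
      fpcaObjFlies$mu + fpcaObjFlies$xiEst[1, 1] * fpcaObjFlies$phi[, 1],
      lty = 2)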

Predict values in Multidimensional Scaling (MDS) in R

I'm trying to use multidimensional scaling (MDS) in R.
Can I predict values for a test set based on the results I obtain from my training set?
I'm looking for something similar to what I've done with PCA, for example:
prin_comp <- prcomp(pca.train, scale. = FALSE)
test.data <- predict(prin_comp, newdata = pca.test)
Thank you,
Ittai
You can use MDS as the first step of a three-step process:
1. Generate the MDS coordinates.
2. Apply a traditional clustering algorithm to the generated coordinates, e.g. k-means with kmeans(x, K), where you supply K, the number of clusters. Note that you will probably want to compute some metrics on the generated clusters, by cross-validation, to check that they provide good labels for your existing data.
3. Use the k-means clusters to find the nearest centroid/cluster for each new data point (see the sketch below).
Then you have a decision to make (as the modeler): do you apply the mode of the chosen cluster as the label for your new data? That is the simplest solution, but there can be other approaches.
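A minimal sketch of the three steps, assuming classical MDS via cmdscale() and toy data. Placing a genuinely new point in an existing MDS configuration is nontrivial, so as a simple stand-in the sketch gives each new point the cluster of its nearest training observation in the original feature space:
set.seed(1)
train <- as.matrix(iris[1:100, 1:4])
new   <- as.matrix(iris[101:110, 1:4])
# 1. generate the MDS coordinates of the training data
mds <- cmdscale(dist(train), k = 2)
# 2. cluster the coordinates
km <- kmeans(mds, centers = 3)
# 3. label each new point via its nearest training observation
nearest_train <- apply(new, 1, function(z)
  which.min(colSums((t(train) - z)^2)))
new_cluster <- km$cluster[nearest_train]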
In addition to what you wrote, couldn't I use a predict function based on the coefficients from the training model, and use the test data to predict new MDS values?

Silhouette plot in R

I have a set of data containing:
item, associated cluster, silhouette coefficient. I can augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I am having trouble with this because the examples I came across use a built-in clustering function such as kmeans and plot its result. I want to bypass this step and produce the plot for my own clustering algorithm, but I am coming up short on the correct arguments to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the data set and passing it to the plot function with various arguments, based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
The silhouette function in the cluster package can do the plots for you. It just needs a vector of cluster memberships (produced by whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used to produce the clusters). For example:
library(cluster)
library(vegan)
data(varespec)
dis <- vegdist(varespec)
res <- pam(dis, 3)  # or whatever your choice of clustering algorithm is
sil <- silhouette(res$clustering, dis)  # or use your own cluster vector
windows()  # RStudio sometimes does not display silhouette plots correctly
plot(sil)
EDIT: for k-means (which uses squared Euclidean distance)
library(vegan)
library(cluster)
data(varespec)
dis <- dist(varespec)^2
res <- kmeans(varespec, 3)
sil <- silhouette(res$cluster, dis)
windows()
plot(sil)
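If, as in the question, the cluster labels and silhouette widths are already computed by your own algorithm, one workaround is to assemble the object by hand. This is only a sketch, relying on the documented structure of cluster's silhouette objects (an n x 3 matrix with columns cluster, neighbor and sil_width); the neighbor column below is a placeholder you would fill with each point's second-closest cluster:
library(cluster)
clus   <- c(1, 1, 2, 2, 3, 3)               # your cluster labels
widths <- c(0.8, 0.6, 0.7, 0.2, 0.9, 0.5)   # your silhouette coefficients
m <- cbind(cluster   = clus,
           neighbor  = ifelse(clus == 1, 2, 1),  # placeholder neighbors
           sil_width = widths)
class(m) <- "silhouette"
plot(m)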

Is it impossible to do PCA on data where the # of variables is bigger than the # of individuals?

I am a new user of R and I am trying to do PCA on my data set. The dimension of the data is 20x10000, i.e. the # of features is 10000 and the # of individuals is 20. It seems that prcomp() cannot handle the data exactly, because the dimensions of the calculated eigenvectors and new data are 20x20 and 10000x20 instead of 10000x10000 and 20x10000. I also tried the FactoMineR library, but the results looked like it loses some dimensions, too. Is there any way of doing PCA on data like this? :(
By reading the manual, it looks like no components are omitted by default, but check the tol argument. The problem is with negative eigenvalues that may be there (and often are, numerically) when you have fewer cases than variables. (I think with 10000 variables and 20 cases you will always have many zero, and numerically often slightly negative, eigenvalues.) See a simplified version of PCA I sometimes use, which computes "PC loadings" the way they're usually used in psychology.
PCA <- function(X, cut = NULL, USE = "complete.obs") {
  if (is.null(cut)) cut <- ncol(X)
  # eigendecomposition of the correlation matrix
  E <- eigen(cor(X, use = USE))
  vec <- E$vectors
  val <- E$values
  # loadings: eigenvectors scaled by the square roots of the eigenvalues
  P <- sweep(vec, 2, sqrt(val), "*")[, 1:cut]
  P
}
The "loadings" are, basically, eigenvectors multiplied by the square root of eigenvalues -- but there's a problem here if you have negative eigenvalues. Something similar may happen with prcomp.
If you just want to reconstruct your data matrix exactly (for whatever reason), you can easily use svd or eigen directly. /My example used correlation matrix but the logic is not confined to this case./
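A hedged sketch of the direct svd() route for the many-more-variables-than-cases situation, on simulated data; it never forms the 10000 x 10000 correlation matrix:
set.seed(1)
X  <- matrix(rnorm(20 * 1000), nrow = 20)     # 20 cases, 1000 variables
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
s <- svd(Xc)                   # Xc = U D V'
scores   <- s$u %*% diag(s$d)  # principal component scores, 20 x 20
loadings <- s$v                # directions/eigenvectors, 1000 x 20
# only min(n, p) = 20 components can exist, which is why prcomp()
# returns 20 of them rather than 10000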
