Trouble reproducing principal component plots from a paper?

I was initially trying to reproduce the PCA plots shown in this paper (Figure 1).
The paper uses PCA to visualize protein structure conformations in a lower dimension, as per reference 16 (Figure 1, B and C). Each point in the PC plots represents a protein structure in a lower-dimensional space. But I have some doubts now as I try to reproduce these plots. So I looked at this link, which is an R library called bio3d from the authors of reference 16. Each PDB file contains {X, Y, Z} coordinate positions for its atoms. After aligning the regions among the proteins, you take these data for PCA. I am trying to reproduce the results from the bio3d example page, but using MATLAB (since I am not familiar with R). However, I am unable to get the plot shown in Figure 9 of the bio3d link.
Can someone help me reproduce these figures? I have uploaded my MATLAB script and the 6 structures prepared as on the web page here. The script only loads the data, although I have made some attempts of my own.
UPDATE 1: In short, my question is:
Can someone advise me how to prepare the covariance matrix from the 6 structures with their coordinates for this particular problem, so that I can do PCA on it?
UPDATE 2: I initially shared non-aligned PDB structure files on Google Drive by mistake. I have now uploaded the correct, aligned files.

Quoting from the question:
After aligning the regions among the proteins, you take these data for PCA. (Emphasis added.)
You do not seem to have aligned the regions among the proteins first.
This application of PCA to protein structures starts with a set of similar proteins whose 3-dimensional structures have been determined, perhaps under different conditions of biological interest. For example, the proteins may have been bound to specific small molecules that regulate their structure and function. The idea is that most of the structure of these proteins will agree closely under these different conditions, while the portions of the proteins that are most important for function will be different. Those most important portions of the proteins thus may show variance in 3-dimensional positions among the set of structures, and clusters in principal components (as in part C of the first figure in this question) illustrate which particular combinations of proteins and experimental conditions are similar to each other in terms of these differences in 3-dimensional structure.
The {X,Y,Z} coordinates of the atoms in the proteins, however, may have different systematic orientations in space among the set of protein structures, as the coordinate system in any one case is based on details of the x-ray crystallography or other methods used to determine the structures. So the first step is to rotate the individual protein structures so that all protein structures align as closely as possible to start. Then variances are calculated around those closely aligned (after rotation) 3-dimensional structures. Otherwise, most of the variance in {X,Y,Z} space will represent the differences in systematic orientation among the crystallography sessions.
As with all R packages, bio3d has publicly available source code. The pdbfit() function includes two important pre-processing steps before PCA. It accounts for gaps in the structures with the gap.inspect() function, and then rotates the protein structures in three dimensions for best overall alignment with the fit.xyz() function. Only then does it proceed to PCA.
You certainly could try to reproduce those pre-processing functionalities in MATLAB, but in this case it might be simplest to learn enough R to take advantage of what is already provided in this extensive package.
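For what it's worth, here is a minimal sketch of that workflow in R with bio3d, roughly following the package's example pages. The folder name is hypothetical (point it at your 6 prepared PDB files), and pdbaln() may require an external alignment program such as MUSCLE to be installed:
library(bio3d)
# Hypothetical folder holding the 6 prepared PDB files from the question
files <- list.files("structures", pattern = "\\.pdb$", full.names = TRUE)
pdbs <- pdbaln(files)    # sequence alignment of the structures
xyz  <- pdbfit(pdbs)     # handles gaps (gap.inspect) and superposes the coordinates (fit.xyz)
pc   <- pca.xyz(xyz)     # PCA on the superposed coordinates
plot(pc)                 # conformer plots and scree plot, as on the bio3d example page
Regarding UPDATE 1: the covariance matrix is simply the covariance of the superposed {X, Y, Z} coordinates across the structures (3N columns for N aligned atoms); pca.xyz() builds and diagonalises it internally, and getting it right by hand depends on the structures having been fitted first.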

Related

Creation of correlated marks. E.g. point sizes varying with inter-point distances

I recently dabbled a bit in point pattern analysis and wonder if there is any standard practice for creating mark correlation structures that vary with the inter-point distance of the point locations. Clearly, I understand how to simulate independent marks, as is frequently shown, e.g.:
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
More generally speaking, assume we have a fair number of points, say n = 100, with coordinates x and y in an arbitrary observation window (e.g. a rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable. Also, we can examine every pairwise distance between the points. Is there a way to introduce a correlation structure between the marks (of pairs of points) which depends on the inter-point distance between the point locations?
Furthermore, I am aware of the existence of mark analysing techniques like
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to trust my colleagues (non-scientists). Besides, I looked at several references given in the documentation of the spatstat functions; in particular, I had a look at "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation are explained as deviations from 1 (independence of marks) of the normalised mark correlation function.
I think the best bet is to use a random field model conditional on your locations. Unfortunately the package RandomFields is not on CRAN at the moment, but hopefully it will return soon. I think it may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is:
1. Use spatstat to generate the random locations you like.
2. Extract the coordinates (either coords() or as.data.frame.ppp()).
3. Define a model in RandomFields (e.g. RMexp()).
4. Simulate the model on the given coordinates (RFsimulate()).
5. Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
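A minimal sketch of those steps, assuming RandomFields can still be installed (e.g. from the CRAN archive); the model and parameter values are arbitrary illustrations, and the accessor for the simulated values may differ slightly between RandomFields versions:
library(spatstat)
library(RandomFields)   # may need to be installed from the CRAN archive
set.seed(0907)
X  <- rpoispp(100)                              # 1. random locations in the unit square
co <- coords(X)                                 # 2. extract the coordinates
model <- RMexp(var = 25, scale = 0.1)           # 3. exponential covariance model (illustrative values)
sim <- RFsimulate(model, x = co$x, y = co$y)    # 4. simulate the Gaussian random field at those locations
marks(X) <- 30 + sim@data[, 1]                  # 5. attach the simulated values as marks (mean about 30)
plot(X)
plot(markcorr(X, correction = "best"))          # the induced distance-dependent mark correlation
Because the marks are values of a spatially correlated random field, nearby points receive similar marks, which is exactly the distance-dependent mark correlation structure asked about; the range of the dependence is controlled by the scale parameter of RMexp().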

Why is k-means clustering ignoring a significant patch of data?

I'm working with a set of coordinates, and I want to understand dynamically (I have many sets that need to go through this process) how many distinct groups there are within the data. My approach was to apply k-means to see whether it would find the centroids, and then I could go from there.
When plotting some data with 6 visually distinct clusters, the k-means algorithm consistently ignores two significant clusters while putting many centroids into another.
See image below:
Red points are the coordinate data and blue points are the centroids that k-means has provided. In this specific case I've gone for 15 centroids (arbitrarily), but it still doesn't recognise those patches of data on the right-hand side, instead placing a single midpoint between them while putting 8 centroids in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.
Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?
You could try a self-organizing map (SOM):
This is a clustering algorithm based on neural networks that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction.
The algorithm is also attractive for clustering because it does not require you to fix the number of clusters a priori (in k-means you need to choose k; with a SOM you only choose the size of the map grid). In your case it may reveal the grouping structure automatically, and you can actually visualize it.
There is a very nice Python package called somoclu that implements this algorithm and provides an easy way to visualize the result. Alternatively, you can go with R: there are blog posts with tutorials and a CRAN package for SOM.
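If you stay in R, a minimal sketch using the kohonen package (my assumption for the CRAN package meant above; the map grid size is still a choice you have to make, and the simulated blobs below merely stand in for your coordinates):
library(kohonen)
set.seed(1)
# six blobs of 2-D points, standing in for the coordinate data
cx <- rep(c(0, 6, 12), times = 2); cy <- rep(c(0, 6), each = 3)
xy <- cbind(rnorm(360, rep(cx, each = 60)), rnorm(360, rep(cy, each = 60)))
som_model <- som(scale(xy), grid = somgrid(xdim = 6, ydim = 6, topo = "hexagonal"))
plot(som_model, type = "mapping")   # which points land on which map units
plot(som_model, type = "codes")     # the codebook vectors of the map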
K-means uses a random initialization and can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the smallest within-cluster sum of squares, i.e., the best of the local minima found.
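In R this is exactly what the nstart argument of kmeans() does: it repeats the algorithm from several random starts and keeps the solution with the lowest total within-cluster sum of squares. A small sketch, again with simulated blobs standing in for the coordinates:
set.seed(2)
# six well-separated blobs of points (illustrative data)
cx <- rep(c(0, 6, 12), times = 2); cy <- rep(c(0, 6), each = 3)
xy <- cbind(rnorm(360, rep(cx, each = 60)), rnorm(360, rep(cy, each = 60)))
fit <- kmeans(xy, centers = 6, nstart = 50, iter.max = 100)   # 50 random restarts
plot(xy, col = fit$cluster, pch = 19)
points(fit$centers, pch = 8, cex = 2)                         # centroids of the best restart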

Genetics example with huge dimensional data

I am looking for a real example of a huge-dimensional contingency table, where, say, the number of rows and columns is in the thousands or millions, and the two random variables are ordinal (not nominal).
Is there any problem like that with sparse data? Say we need to test the independence of two ordinal random variables X and Y in a contingency table where X or Y or both have dimension 1,000 (or 1,000,000) and the table contains many cells with no observations.
I think that there may be some example in biology but I have no knowledge of it. Could anyone suggest one?
I'm not sure I fully understand your question (these statistical terms are somewhat unfamiliar to me); however, one example of the data you seek might be transcriptomic data. The term "transcriptomic data" refers to measurements of the amount of RNA present in the cells of an organism. The axes of datasets like this are usually Gene (the gene which coded for that particular strand of RNA) by cell (the type of cell in the body from which the measurement was taken, e.g. heart, lung) by time (the point in time at which the cell was measured).
Unfortunately, the cell axis is not an ordinal axis but a nominal one. The other two axes are definitely ordinal. I suppose this is also a 3-dimensional tensor rather than a 2-dimensional matrix.
There are about 20,000 genes, and as our sequencing technology improves, the time axis can obviously grow very large.
This kind of data is typically very sparse. Not only do cells tend not to "express" [1] every gene, but we also suspect that sometimes the amount of RNA is too low to measure reliably! This leads to interesting statistical problems in which one needs to model both sparsity and low measurement counts.
The Wikipedia page on RNA-Seq is an OK introduction. Moreover, if you're interested in the fusion of biology, math and computer science, you might find the lectures at Models, Inference and Algorithms interesting; in particular you might like Kharchenko's talk "From one to millions of cells: computational challenges in single-cell analysis"!
[1] the expression "express a gene" means that the cell actually transcribes the gene into the corresponding RNA rather than ignoring it
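As a toy illustration of the sparsity just described, here is a simulated gene-by-cell count matrix in R using the Matrix package (the dimensions and the fraction of non-zero entries are made-up, purely illustrative numbers):
library(Matrix)
set.seed(1)
n_genes <- 20000
n_cells <- 5000
nnz <- round(0.05 * n_genes * n_cells)   # pretend only ~5% of entries are non-zero
counts <- sparseMatrix(
  i = sample.int(n_genes, nnz, replace = TRUE),
  j = sample.int(n_cells, nnz, replace = TRUE),
  x = rpois(nnz, lambda = 2) + 1,        # small counts, reflecting low measurement depth
  dims = c(n_genes, n_cells)
)
1 - nnzero(counts) / prod(dim(counts))   # fraction of cells of the table with no observations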

How to add zoom option for wordcloud in Shiny (with reproducible example)

Could you please help me add a zoom option for a word cloud in Shiny?
A reproducible example can be found here:
http://shiny.rstudio.com/gallery/word-cloud.html
I tried to incorporate rbokeh and plotly but couldn't find a word-cloud equivalent render function.
Additionally, I found ECharts on GitHub:
https://github.com/XD-DENG/ECharts2Shiny/tree/8ac690a8039abc2334ec06f394ba97498b518e81
But incorporating ECharts is also not convenient for proper zooming.
Thanks in advance,
Abi
Normalisation is required only if the predictors are not comparable on their original scales. There's no rule that says you must normalise.
PCA is a statistical method that gives you a new linear transformation. By itself, it loses nothing. All it does is give you new principal components.
You lose information only if you choose a subset of those principal components.
Usually PCA includes centering the data as a pre-processing step.
PCA only re-expresses the data in its own axis system (the eigenvectors of the covariance matrix).
If you use all the axes, you lose no information.
Yet usually we want to apply dimensionality reduction, intuitively keeping fewer coordinates for the data.
This means projecting the data onto a subspace spanned by only some of the eigenvectors of the data.
If one chooses the number of vectors wisely, one can achieve a significant reduction in the number of dimensions with negligible loss of information.
The way to do so is to choose the eigenvectors whose eigenvalues sum to most of the total variance.
PCA itself is invertible, so lossless.
But:
It is common to drop some components, which will cause a loss of information.
Numerical issues may cause a loss in precision.
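Both points are easy to check numerically. A small sketch in R with toy data, using prcomp for the PCA: keeping all components reconstructs the data exactly (up to floating-point error), while dropping components loses exactly the variance carried by the discarded eigenvalues.
set.seed(1)
X <- matrix(rnorm(200 * 10), 200, 10) %*% matrix(rnorm(100), 10, 10)   # correlated toy data
p <- prcomp(X, center = TRUE, scale. = FALSE)
summary(p)$importance["Cumulative Proportion", ]    # variance retained by the first k components
# reconstruct the data from the first k principal components
reconstruct <- function(p, k) {
  p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE]) +
    matrix(p$center, nrow(p$x), length(p$center), byrow = TRUE)
}
max(abs(reconstruct(p, 10) - X))   # all components: zero up to numerical precision
max(abs(reconstruct(p, 3)  - X))   # only 3 components: information has been lost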

Point pattern similarity and comparison

I recently started to work with a huge dataset provided by a medical emergency service. I have about 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them and they confirm serious clustering.
I also have a population point dataset - one point for every citizen - that is clustered similarly to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out whether they are similarly distributed. I want to know if there are places where there are more incidents relative to the population. In other words, I want to use the population dataset to explain the intensity and then figure out whether the incident dataset corresponds to that intensity. The assumption is that incidents should appear randomly with respect to the population.
I want to get a plot of the region showing where there are more or fewer incidents than expected if the incidents were happening randomly to people.
How would you do it with R?
Should I use Kest or Kinhom to calculate the K function? I read the descriptions, but still don't understand the basic difference between them.
I tried using Kcross, but as far as I figured out, one of the two datasets used should be CSR - completely spatially random.
I also found Kcross.inhom; should I use that one for my data?
How can I get a plot (image) of incident deviations with respect to the population?
I hope I have asked this clearly.
Thank you for taking the time to read my question, and even more thanks if you can answer any of it.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R, so I have a preference for using these (and I genuinely believe they are the best tools for your problem).
Conceptual issue: How big is your study region, and does it make sense to treat the points as distributed anywhere in the region, or are they confined to the road network?
For now I will assume they can be treated as distributed anywhere.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents with the population density as the intensity, using ppm. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More information on density.ppp and ppm is in chapters 6 and 9 of the book, respectively, and of course in the spatstat help files.
If you use summary statistics like the K/L/G/F/J-functions, you should always use the inhom versions to take the population density into account. This is covered in chapter 7 of the book.
It could also be interesting to look at the relative risk (relrisk) if you combine all your points into a marked point pattern with two types (background and incidents). See chapter 14 of the book.
Unfortunately, only chapters 3, 7 and 9 of the book are available as free-to-download sample chapters, but I hope you have access to it at your library or the option of buying it.
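A rough sketch of these suggestions in spatstat; here incid and pop are hypothetical ppp objects for the incident points and the population points in the same window, and bandwidths and model choices are left at their defaults purely for illustration:
library(spatstat)
# incid, pop: hypothetical ppp objects (incidents; one point per citizen)
popden <- density(pop)           # smoothed population intensity (an "im" object)
logpop <- log(popden)            # in practice, guard against zero density near the window edge
# Poisson null model: incident intensity proportional to the population density
fit <- ppm(incid ~ offset(logpop))
fit
# inhomogeneous K-function using the fitted, population-based intensity
plot(Kinhom(incid, lambda = predict(fit)))
# relative risk: a two-type pattern of incidents vs. background population
both <- superimpose(incident = incid, background = pop)
plot(relrisk(both))              # map of where incidents are over- or under-represented
The relrisk surface is essentially the plot asked for in the question: it shows where incidents occur more or less often than expected given the local population.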
