Genetics example with huge dimensional data - biometrics

I am looking for a real example of huge-dimensional contingency tables, where, say, the number of rows and columns is in the thousands or millions, and the two random variables are ordinal (not nominal).
Is there any problem like that with sparse data? Say, we need to test the independence of two ordinal random variables X and Y in a contingency table where X or Y or both have dimension 1000 (or 1,000,000), and the table contains many cells with no observations.
I think there may be some example in biology, but I have no knowledge of it. Could anyone suggest one?

I'm not sure I fully understand your question (these statistical terms are somewhat unfamiliar to me); however, one example of the data you seek might be transcriptomic data. The term "transcriptomic data" refers to measurements of the amount of RNA present in the cells of an organism. The axes of datasets like this are usually Gene (the gene which coded for that particular strand of RNA) by cell (the type of cell in the body from which the measurement was taken, e.g. heart, lung) by time (the point in time at which the cell was measured).
Unfortunately, the cell axis is not an ordinal axis but a nominal one. The other two axes are definitely ordinal. I suppose this is also a 3-dimensional tensor rather than a 2-dimensional matrix.
There are about 20,000 genes, and as our sequencing technology improves, the time axis can obviously grow very large.
This kind of data is typically very sparse. Not only do cells tend not to "express" [1] every gene, but we also suspect that sometimes the amount of RNA is too low to reliably measure. This leads to interesting statistical problems in which one needs to model both sparsity and low measurement counts!
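If it helps to picture the shape of such data, here is a toy sketch in R (made-up numbers, not real transcriptomic data) of a sparse gene-by-cell count matrix:

```r
# Toy sketch: a sparse gene-by-cell count matrix with made-up numbers,
# illustrating the kind of sparsity seen in transcriptomic data.
library(Matrix)
set.seed(1)
n_genes <- 20000
n_cells <- 1000
counts <- rsparsematrix(n_genes, n_cells, density = 0.05,
                        rand.x = function(n) rpois(n, lambda = 2))
1 - nnzero(counts) / prod(dim(counts))   # fraction of zero entries, roughly 95% here
```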
The Wikipedia page on RNA-Seq is a decent introduction. Moreover, if you're interested in the fusion of biology, math, and computer science, you might find the lectures at Models, Inference and Algorithms interesting; in particular, you might like Kharchenko's talk "From one to millions of cells: computational challenges in single-cell analysis"!
[1] The phrase "to express a gene" means that the cell actually transcribes the gene into the corresponding RNA rather than ignoring it.

Related

Is testing a collaborative filtering technique on randomly generated user-item rating matrices meaningful?

I know that some datasets are available for running collaborative filtering algorithms such as user-based or item-based filtering. However, I need to test an algorithm on many datasets to show that my proposed methodology performs better. I generated random user-item rating matrices with values from 1 to 5 and treat the generated matrices as ground truth. Then I remove some of the ratings from a matrix and use my algorithm to predict the missing ratings. Finally, I use RMSE to compare the ground-truth matrix with the matrix my algorithm outputs. Does this methodology seem meaningful or not?
No, not really. If every entry is uniformly random in {1, ..., 5}, the best possible estimator (in RMSE terms) is simply to predict 3 for every entry.
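A quick simulation (just a sketch with made-up uniform ratings) makes this concrete:

```r
# With i.i.d. uniform ratings on 1..5, the constant prediction 3 already
# minimizes RMSE, so no algorithm can do better on average.
set.seed(1)
truth <- sample(1:5, 1e5, replace = TRUE)
rmse <- function(pred) sqrt(mean((truth - pred)^2))
sapply(1:5, rmse)   # minimum at 3, around sqrt(2) ~ 1.41
```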
You are missing non-uniform, real-world distributions. Every recommendation system is built on assumptions, or it cannot beat random guessing. (Keep in mind that this is not only about the distribution of the ratings but also about which items get rated; there is a lot of theoretical research examining different assumptions, e.g. uniform versus non-uniform sampling, mostly in convex matrix factorization with the nuclear norm versus the max norm and co.)
Better to pick the available datasets and, if needed, sub-sample them without destroying every kind of correlation, e.g. by filtering on some attribute: all ratings for movies released in or before 1990 versus all ratings for movies released after 1990. Yes, this will shift the underlying distributions, but it sounds like that is what you want. If not, you can always sub-sample uniformly, but that is more useful for generalization evaluation (small versus big datasets).

Trouble reproducing principal component plots from a paper?

I was initially trying to reproduce PCA plots shown in this paper (Figure 1).
The paper uses PCA to visualize protein structure conformations in a lower dimension, as per reference 16 (Figure 1, B and C). Each point in the PC plots represents a protein structure in a lower-dimensional space. But I have some doubts now that I am trying to reproduce these plots. So I looked at this link, which is an R library called bio3d from the authors of reference 16. Each PDB file contains {X, Y, Z} coordinate positions for its atoms. After aligning the regions among the proteins, you take these data for PCA. I am trying to reproduce the results from the bio3d example page, but in MATLAB (since I am not familiar with R). However, I am unable to get the plot shown in Figure 9 of the bio3d link.
Can someone help me reproduce these figures? I have uploaded here my MATLAB script and the 6 structures prepared as in the web page. The script only loads the data, although I have made some attempts of my own.
UPDATE 1: In short, my question is:
Can someone advise me how to prepare the covariance matrix from the 6 structures and their coordinates for this particular problem, so that I can do PCA on it?
UPDATE 2: I initially shared non-aligned PDB structure files in the Google Drive by mistake. I have now uploaded the correct files.
Quoting from the question:
After aligning the regions among proteins you take these data for PCA. (Emphasis added).
You do not seem to have aligned the regions among the proteins first.
This application of PCA to protein structures starts with a set of similar proteins whose 3-dimensional structures have been determined, perhaps under different conditions of biological interest. For example, the proteins may have been bound to specific small molecules that regulate their structure and function. The idea is that most of the structure of these proteins will agree closely under these different conditions, while the portions of the proteins that are most important for function will be different. Those most important portions of the proteins thus may show variance in 3-dimensional positions among the set of structures, and clusters in principal components (as in part C of the first figure in this question) illustrate which particular combinations of proteins and experimental conditions are similar to each other in terms of these differences in 3-dimensional structure.
The {X,Y,Z} coordinates of the atoms in the proteins, however, may have different systematic orientations in space among the set of protein structures, as the coordinate system in any one case is based on details of the x-ray crystallography or other methods used to determine the structures. So the first step is to rotate the individual protein structures so that all protein structures align as closely as possible to start. Then variances are calculated around those closely aligned (after rotation) 3-dimensional structures. Otherwise, most of the variance in {X,Y,Z} space will represent the differences in systematic orientation among the crystallography sessions.
As with all R packages, bio3d has publicly available source code. The pdbfit() function includes two important pre-processing steps before PCA: it accounts for gaps in the structures with the gap.inspect() function, and then it rotates the protein structures in 3 dimensions for best overall alignment with the fit.xyz() function. Only then does the analysis proceed to PCA.
You certainly could try to reproduce those pre-processing functionalities in MATLAB, but in this case it might be simplest to learn enough R to take advantage of what is already provided in this extensive package.
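For reference, a rough sketch of that workflow in R, along the lines of the bio3d example page (the file names are hypothetical placeholders for your 6 structures, and pdbaln() may require an external alignment program such as MUSCLE):

```r
library(bio3d)

# Placeholder file names for the 6 structures
files <- c("struct1.pdb", "struct2.pdb", "struct3.pdb",
           "struct4.pdb", "struct5.pdb", "struct6.pdb")

pdbs <- pdbaln(files)             # align the structures' sequences
gaps <- gap.inspect(pdbs$xyz)     # find coordinate columns with no gaps
xyz  <- pdbfit(pdbs)              # rotate/translate the structures onto each other

pc <- pca.xyz(xyz[, gaps$f.inds]) # PCA on the superposed, gap-free coordinates
plot(pc)                          # conformer plots along the principal components
```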

Document similarity and self-plagiarism

I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author X has 100 texts, so I end up with a 100 x 100 similarity matrix; author Y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors: in other words, which author copies himself more? If I take the average of the columns (or rows) and then average that vector of means, I arrive at a single number, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then some distribution over the within-author document similarities is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution with a mean. To capture the spread, I would also report the standard deviation of the similarities.
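As a sketch, assuming texts_x is a character vector holding one author's documents (the name is a placeholder), you could compute such a summary like this; note that in recent quanteda versions textstat_simil() lives in the companion quanteda.textstats package:

```r
library(quanteda)
library(quanteda.textstats)

self_similarity <- function(texts) {
  dfmat <- dfm(tokens(texts))                                # document-feature matrix
  sim <- as.matrix(textstat_simil(dfmat, method = "cosine")) # pairwise cosine similarities
  vals <- sim[upper.tri(sim)]                                # keep each pair once, drop the diagonal
  c(mean = mean(vals), sd = sd(vals))                        # summary to compare across authors
}

self_similarity(texts_x)
```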
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for the sort of text-reuse analysis you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this says nothing in substantive terms about their semantic relationship or about one being a "re-use" of the other. An argument could be made that LD based on words might show re-use, but that is not how most tools, e.g. http://turnitin.com, implement plagiarism detection.

Which cluster methodology should I use for a multidimensional dataset?

I am trying to create clusters of countries from a rather heterogeneous dataset (the data I have on countries ranges from median age to disposable income, including education levels).
How should I approach this problem?
I read some interesting papers on clustering, using K-means for instance, but it seems those algorithms are mostly used when there are only two variables, not 30 as in my case, and when the variables are comparable (it might be tough to cluster countries with such diversity in the data).
Should I normalise some of the data? Should I just focus on fewer indicators to avoid this multidimensional issue? Use spectral clustering first?
Thanks a lot for the support!
Create a "similarity metric". Probably just a weight to all your measurements, but you might build in some corrections for population size and so on. Then you can only have low hundreds of countries, so most brute force methods will work. Hierarchical clustering would be my first point of call, and that will tell you if the data is inherently clustered.
If all the data are quantitative, you can normalise each variable to [0, 1] (lowest country is 0, highest is 1), then take eigenvectors and plot the first two axes in eigenspace. That will give another visual fix on clusters.
If it's not clustered, however, it's better to admit that.
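A minimal sketch of the above in base R, assuming X is a data frame of your ~30 numeric indicators with one row per country (the name X is a placeholder):

```r
# Rescale each column to [0, 1]
X01 <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# Hierarchical clustering on Euclidean distances
hc <- hclust(dist(X01), method = "ward.D2")
plot(hc)   # the dendrogram shows whether the data are inherently clustered

# "Take eigenvectors": PCA, then plot the first two axes in eigenspace
pc <- prcomp(X01)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")
```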

Point pattern similarity and comparison

I recently started to work with a huge dataset provided by a medical emergency service. I have circa 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time, and I am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them, and they confirm serious clustering.
I also have a population point dataset, one point for every citizen, that is clustered similarly to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out whether they are similarly distributed. I want to know if there are places where there are more incidents relative to the population. In other words, I want to use the population dataset to explain the intensity and then figure out whether the incident dataset corresponds to that intensity. The assumption is that incidents should occur randomly with respect to the population.
I want to get a plot of the region showing where there are more or fewer incidents than expected if the incidents were happening randomly to people.
How would you do this in R?
Should I use Kest or Kinhom to calculate the K function? I read the descriptions, but I still don't understand the basic difference between them.
I tried using Kcross, but as far as I understand, one of the two datasets should then be CSR (completely spatially random). I also found Kcross.inhom; should I use that one for my data?
How can I get a plot (image) of incident deviations relative to the population?
I hope I asked clearly.
Thank you for taking the time to read my question, and even more thanks if you can answer any of it.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R [1], so I have a preference for using these (and I genuinely believe they are the best tools for your problem).
Conceptual issue: how big is your study region, and does it make sense to treat the points as distributed anywhere in the region, or are they confined to the road network?
For now I will assume they can be distributed anywhere.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents with the population density as the intensity, using ppm. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More information on density.ppp and ppm is in chapters 6 and 9 of [1], respectively, and of course in the spatstat help files.
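As a sketch (assuming incidents and population are ppp objects on the same observation window; the object names are placeholders):

```r
library(spatstat)

lambda_pop <- density(population)   # kernel estimate of population intensity (a pixel image)

# Null model: incident intensity proportional to the population density
fit <- ppm(incidents ~ offset(log(lambda_pop)))
fit
diagnose.ppm(fit)   # residual diagnostics: where does the model over- or under-predict?
```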
If you use summary statistics like the K/L/G/F/J-functions, you should always use the inhomogeneous ("inhom") versions to take the population density into account. This is covered in chapter 7 of [1].
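For example, continuing the sketch above:

```r
# Inhomogeneous K-function for the incidents, using the intensity of the null model,
# so clustering beyond the population effect shows up as values above the Poisson curve.
Ki <- Kinhom(incidents, lambda = predict(fit))
plot(Ki)
```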
It could also be interesting to look at the relative risk (relrisk) if you combine all your points into a single marked point pattern with two types (background and incidents). See chapter 14 of [1].
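A sketch of that, again with placeholder object names:

```r
# Build a marked pattern with two types; superimpose() uses the argument names as marks.
combined <- superimpose(background = population, incident = incidents)
rr <- relrisk(combined)   # spatially varying probability of the types (see ?relrisk for conventions)
plot(rr)
```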
Unfortunately, only chapters 3, 7 and 9 of [1] are available as free sample chapters, but I hope you have access to it at your library or have the option of buying it.
