Complex clustering of RNA-seq data in R

Even though I'm working with RNA-seq data, my question is more of a statistics and machine learning nature.
So, what I have is expression data from wild types (controls) and mutants over four different conditions, where a condition is a combination of two cell types and time points. Basically, my expression matrix looks like this:
time1_loc1_control time1_loc1_mutant time1_loc2_control time1_loc2_mutant ...
gene1
gene2
...
Expression values are initially in counts-per-million, but I have tried using different transformations before clustering attempts.
What I am trying to achieve is to cluster the genes based on the direction of the change between mutant and control (upregulated or downregulated) and absolute values of expression.
So far I have been able to roughly group the genes based solely on the direction of the change, but I would also need to retain the information about absolute expression values. Are there any methods that could help with this?
One other idea was to split the dataset by conditions (as the conditions are independent) and cluster the genes separately, which would mean each gene is clustered four times. Is there any way to determine which genes are consistently clustered together across conditions?
Thank you for any input on this. I realize it's a bit vague but I am not extremely experienced with this.
Kind regards,
Z
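For what it's worth, here is a rough base-R sketch of one way to combine direction of change and expression level in a single feature space, and of the split-by-condition idea with a co-membership check. The matrix layout and the use of log-scale values are assumptions, and the data here is simulated:

```r
set.seed(1)
# Assumed layout: genes x samples, log2-scale expression, with paired
# <condition>_control / <condition>_mutant columns (simulated here)
conds <- c("time1_loc1", "time1_loc2", "time2_loc1", "time2_loc2")
samples <- as.vector(t(outer(conds, c("control", "mutant"), paste, sep = "_")))
expr <- matrix(abs(rnorm(200 * 8, mean = 6, sd = 2)), nrow = 200,
               dimnames = list(paste0("gene", 1:200), samples))

# Direction/size of change: mutant minus control per condition (log fold change)
lfc <- sapply(conds, function(cn)
  expr[, paste0(cn, "_mutant")] - expr[, paste0(cn, "_control")])
# Keep the absolute expression level as an extra feature
level <- rowMeans(expr)
# Scale both parts so neither dominates, then cluster jointly
feat <- cbind(scale(lfc), level = scale(level))
km <- kmeans(feat, centers = 6, nstart = 20)

# Alternative: cluster each condition separately, then tabulate the label
# patterns to see which genes land together in all four clusterings
percond <- sapply(conds, function(cn)
  kmeans(scale(cbind(lfc[, cn], level)), centers = 3, nstart = 20)$cluster)
pattern <- apply(percond, 1, paste, collapse = "-")
head(sort(table(pattern), decreasing = TRUE))
```

Genes sharing the same label pattern were co-clustered in every condition; the number of clusters per condition (3 here) is arbitrary and worth tuning.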

Related

Get the name and values of the second member of a cluster formed at stage 1 (containing two members only) of heatmaply, given the name of one of the members

I have a data frame S = [rows x cols] containing samples such that the rownames give the names of the samples and the colnames give their features.
Into this data frame I have inserted one test sample, t1. Now I want to extract the name and values of the particular sample s that is the most similar, or "twin", of t1.
For this purpose I have used heatmaply(), which plots a hierarchical clustering. Observing this plot, I can see a cluster formed at stage 1 (of the iterative hierarchical clustering process) which contains only two members: my test sample t1 and its near-twin.
Now I want to extract only that twin sample s from the stage-1 cluster, and nothing else. Please guide me in this regard.
I know a little about the hclust and dist functions. The problem with dist is that it provides too much information, and I can't think of any way to extract the twin of my test sample t1 from the dist matrix. I also know a little about cutree(). As far as I understand, it returns the clusters to which the members belong, depending on the value of the argument k; when k changes, the membership of the clusters changes. I am wondering if I can exploit cutree to get the stage-1 clusters (containing two members) and find the member that is similar to my test sample t1.
The components of the hclust object, particularly merge and order, also interest me. Maybe someone can guide me on how to use them to get the twin.
I am sorry for the long post. I tried to explain as clearly and concisely as possible, and to show what I have tried; your experience in solving this problem is highly appreciated.
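A hedged base-R sketch of both routes, with simulated data (the names S, t1 and s3 are made up). The relevant fact about hclust is that negative entries in its merge component denote singleton observations, so the first merge step involving t1 reveals its partner:

```r
set.seed(1)
# Simulated stand-in for S: sample names in rownames, features in colnames
S <- matrix(rnorm(10 * 4), nrow = 10,
            dimnames = list(c(paste0("s", 1:9), "t1"), paste0("f", 1:4)))
S["s3", ] <- S["t1", ] + rnorm(4, sd = 0.01)   # plant a near-twin of t1

# Simplest route: no clustering needed, just take the smallest distance to t1
d <- as.matrix(dist(S))
twin <- names(which.min(d["t1", setdiff(rownames(S), "t1")]))
twin

# Via hclust's merge matrix: negative entries are singleton leaves
hc <- hclust(dist(S))
i <- which(rownames(S) == "t1")
step <- which(apply(hc$merge, 1, function(m) any(m == -i)))[1]
partner <- hc$merge[step, hc$merge[step, ] != -i]
if (partner < 0) rownames(S)[-partner] else "t1 first joins an existing cluster"
```

Note that the stage-1 partner in the dendrogram can be a whole cluster rather than a single sample; the plain distance route avoids that ambiguity entirely.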

How to do K-means clustering on a dataset full of string variables in R

Right now I have a dataset that is full of string variables, but I want to do a clustering project on it. After I apply as.factor() to all the variables, NbClust() still does not work. What am I supposed to do?
K-means typically uses Euclidean distances (see e.g. https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric) so you can't directly "cluster on words".
If you want to cluster observations based on words, you have to generate numbers first (e.g. k-means for text clustering). For example, if you were trying to cluster customer profiles for segmentation, you could count the words representing interests in each profile, have one column per interest holding the number of times that word or n-gram appears, and then cluster on that matrix of numbers. Or, when clustering documents, generate a term-document matrix (or document-term matrix, or term-term co-occurrence matrix, as in k-means clustering on a term-term co-occurrence matrix) and use those numbers for clustering.
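As a hedged sketch of that count-then-cluster idea in base R (the profiles here are made up), turn short text fields into a term-count matrix and run k-means on the numbers:

```r
# Toy "customer profiles": free-text interest fields
profiles <- c(a = "hiking cooking hiking", b = "cooking baking",
              c = "hiking climbing", d = "baking cooking baking")
toks <- strsplit(profiles, " ")
terms <- sort(unique(unlist(toks)))
# One row per profile, one column per term, entries = word counts
tcm <- t(sapply(toks, function(tk) table(factor(tk, levels = terms))))
tcm
km <- kmeans(tcm, centers = 2, nstart = 10)
split(names(profiles), km$cluster)
```

For real text, a text-mining package would handle tokenisation, stemming and weighting (e.g. tf-idf) better, but the principle is the same: cluster the numeric matrix, not the strings.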
Don't use k-means on such data.
You can't get meaningful statistical analysis just by "trial and error". Because there are many ways to get a result that looks okayish but that is totally unfounded.
Before you use any of these approaches, you need to understand what it does. In the case of k-means, it minimizes least squares, which only makes sense on continuous variables. The variables also need to behave linearly, and if you have several of them, they need to be on the same scale.
It's not a black box method. If you use it badly, you just get garbage out.

Which cluster methodology should I use for a multidimensional dataset?

I am trying to create clusters of countries from a quite heterogeneous dataset (the data I have on countries ranges from median age to disposable income, including education levels).
How should I approach this problem?
I read some interesting papers on clustering, using K-means for instance, but it seems those algorithms are mostly used when there are only a couple of variables, not 30 as in my case, and when the variables are comparable (it might be tough to cluster countries with such diversity in the data).
Should I normalise some of the data? Should I just focus on fewer indicators to avoid this multidimensional issue? Use spectral clustering first?
Thanks a lot for the support!
Create a "similarity metric". Probably just a weighting of all your measurements, though you might build in some corrections for population size and so on. There are only a couple of hundred countries at most, so most brute-force methods will work. Hierarchical clustering would be my first port of call, and it will tell you whether the data is inherently clustered.
If all the data is quantitative, you can normalise each variable to 0 to 1 (lowest country is 0, highest is 1), then take the eigenvectors. Plotting the first two axes in eigenspace will give another visual fix on clusters.
If it's not clustered, however, it's better to admit that.
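As an illustration of the normalise-then-eigenvector suggestion, a hedged sketch with made-up data (30 indicators for 100 hypothetical countries; the eigenvector step is done via prcomp):

```r
set.seed(42)
X <- matrix(rnorm(100 * 30), nrow = 100)   # 100 countries, 30 indicators

# Min-max normalisation: lowest country is 0, highest is 1, per indicator
rng <- apply(X, 2, range)
X01 <- sweep(sweep(X, 2, rng[1, ]), 2, rng[2, ] - rng[1, ], "/")

# Eigenvectors (principal components) and a plot of the first two axes
pc <- prcomp(X01)
plot(pc$x[, 1:2], xlab = "PC1", ylab = "PC2")

# Hierarchical clustering as a first port of call
hc <- hclust(dist(X01))
plot(hc)
```

If the dendrogram shows no long branches separating groups and the PC plot is one diffuse cloud, that is evidence the data is not inherently clustered.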

Clustering genes based on function

We would like to use either hierarchical or k-means clustering to cluster the genes in our dataset based on their function. We have the GO ID for each gene, and now we would like to cluster them into groups based on function, preferably hierarchically: from the bottom (where each function is unique) to upper levels (where we have more generalized groups of functions). We are programming in R.
Thanks in advance for your help!
Usually one either performs a differential expression analysis between two conditions, or clusters genes based on expression across conditions or time points. After that, it is possible to look for overrepresentation of GO terms in differentially expressed gene sets or in clusters.
You may be interested in GeneMania (http://www.genemania.org/) - you can enter a list of genes that will be presented in a network (with lots of options for customisation and expansion). This tool will again provide you with GO terms that are enriched in the network. A second tool of interest is Gorilla (http://cbl-gorilla.cs.technion.ac.il/) - this will show the GO hierarchy itself, with GO terms lighting up if they are enriched.
k-means isn't a good idea for this kind of data.
Instead, look at algorithms specialized for this data, in particular biclustering algorithms.

Determining the optimal number of clusters with the daisy function and Gower similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me to modify the existing code to accept my distance measurements?
3) Or is there another, better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset with 2 columns (numeric, nominal)
into a new one with 4 columns (numeric, numeric representing reptile, numeric representing mammal, numeric representing bird); an instance (23.4, "mammal") would be mapped to (23.4, 0, 1, 0).
Using this mapping you can work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
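That 0/1 mapping can be done in base R with model.matrix (the numbers here are made up):

```r
# Mixed dataset: one numeric column, one nominal column
df <- data.frame(size = c(23.4, 10.1, 7.7),
                 class = factor(c("mammal", "reptile", "bird")))

# One 0/1 indicator column per level; "- 1" drops the intercept so all
# three levels get their own column
X <- cbind(size = df$size, model.matrix(~ class - 1, df))
X   # the "mammal" instance becomes (23.4, 0, 1, 0)

# Standardize so no column dominates the distances
Xs <- scale(X)
```

On the standardized matrix, ordinary Euclidean distances and k-means are then at least mechanically applicable, though Gower via daisy often remains the cleaner choice for mixed data.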
About 2)
daisy returns an element of type dissimilarity, you can use it in other clustering algorithms from the cluster package (maybe you don't have to implement more stuff). For example the function pam can get the object returned by daisy directly.
About 3)
Clusters are really subjective, and most clustering algorithms depend on the initial conditions, so "significant clusters" is a term some people would not be comfortable using. pam could be useful in your case because clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?); pam builds clusters centered on actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
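A minimal sketch of the daisy-to-pam route, with made-up trait data (the column names are hypothetical; cluster is a recommended package that ships with R):

```r
library(cluster)   # provides daisy() and pam()
set.seed(7)
# Made-up mixed trait data for 250 species
traits <- data.frame(mass    = rexp(250, 0.1),
                     clutch  = rpois(250, 4),
                     habitat = factor(sample(c("forest", "grass", "wetland"),
                                             250, replace = TRUE)))

d <- daisy(traits, metric = "gower")   # a "dissimilarity" object
fit <- pam(d, k = 4)                   # pam() accepts the daisy output directly
table(fit$clustering)

# Average silhouette width as one heuristic for choosing k
sapply(2:8, function(k) pam(d, k = k)$silinfo$avg.width)
```

Picking the k with the largest average silhouette width is a common heuristic, though it is a guide rather than a significance test.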
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree and then removes the longest edge.
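A small base-R sketch of that idea on simulated 2D points (Prim's algorithm builds the minimum spanning tree; removing its longest edge splits the data into two connected components):

```r
set.seed(3)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))   # two well-separated blobs
D <- as.matrix(dist(pts))
n <- nrow(D)

# Prim's algorithm: grow the minimum spanning tree from point 1
in_tree <- 1
edges <- NULL
while (length(in_tree) < n) {
  out <- setdiff(1:n, in_tree)
  sub <- D[in_tree, out, drop = FALSE]
  ij <- which(sub == min(sub), arr.ind = TRUE)[1, ]
  edges <- rbind(edges, c(in_tree[ij[1]], out[ij[2]], min(sub)))
  in_tree <- c(in_tree, out[ij[2]])
}

# Zahn's step: drop the longest MST edge, then label the two components
keep <- edges[-which.max(edges[, 3]), 1:2, drop = FALSE]
lab <- 1:n
repeat {                                # propagate minimum label along edges
  new <- lab
  for (k in seq_len(nrow(keep))) new[keep[k, ]] <- min(new[keep[k, ]])
  if (identical(new, lab)) break
  lab <- new
}
table(lab)
```

Removing the single longest edge always yields exactly two clusters; repeating the cut on the remaining subtrees (or cutting all "inconsistently long" edges) generalises this to more clusters.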
