How to create adjacency matrix for gene-gene interactions from RNA-Seq (circlize input) - r

I'm profiling tumor microenvironment and I want to show interactions between subpopulations that I found. I have a list of receptors and ligands for example, and I want to show that population A expresses ligand 1 and population C expresses receptor 1 so there's likely an interaction between these two populations through the expression of ligand-receptor 1.
I have been trying to use circlize to visualize these interactions by making a chordDiagram, but it requires an adjacency matrix as input and I do not understand how to create the matrix. The adjacency matrix is supposed to show the strength of the relationship between any two genes in my matrix. I have 6 unique populations of cells that can express any of the 485 ligands/receptors that I am interested in, and the goal is to show interactions between these populations through the ligands and receptors.
I found a tool to use in RStudio called BUS, which provides gene.similarity: "Calculate adjacency matrix for gene-gene interaction."
Maybe I am just using BUS incorrectly but it says: For gene expression data with M genes and N experiments, the adjacency matrix is in size of MxM. An adjacency matrix in size of MxM with rows and columns both standing for genes. Element in row i and column j indicates the similarity between gene i and gene j.
So, I made a matrix where each column is a subpopulation and each row is a ligand/receptor I want to show interactions with. The cells have expression values and it looks like this:
> head(Test)
A B C D E F
Adam10 440.755990 669.875468 748.7313995 702.991422 1872.033343 2515.074366
Adam17 369.813134 292.625603 363.0301707 434.905968 1183.152694 1375.424034
Agt 12.676036 28.269671 9.2428034 19.920561 121.587010 168.116735
Angpt1 22.807415 42.350205 25.5464603 16.010813 194.620550 99.383567
Angpt2 92.492760 186.167844 819.3679836 852.666499 669.642441 1608.748788
Angpt4 3.327743 0.693985 0.8292746 1.112826 5.463647 5.826927
Where A-F are my populations. Then I pass this matrix to BUS:
res<-gene.similarity(Test,measure="corr",net.trim="none")
Warning message:
In cor(mat) : the standard deviation is zero
But the output file which is supposed to be my adjacency matrix is full of NA's:
Adam10 Adam17
Adam10 1 NA
Adam17 NA 1
I thought maybe my matrix was too complex, so I compared only 2 cell populations with my ligands/receptors, but I get the exact same output.
I was expecting to get something like:
A:Adam10 A:Adam17
C:Adam10 6 1
E:Adam17 2 10
But, even if the res object gave me numbers instead of NA it does not maintain the identity of the population when making relationships amongst genes so it still would not produce my expected output.
I do not have to use BUS to make the matrix, so I don't necessarily need help troubleshooting that code, I just need SOME way to make an adjacency matrix.
I've never used circlize or Circos before so I apologize if my question is stupid.

Seems like you need to transform your matrix a little.
You can create a new matrix of size (nrow(Test) x ncol(Test)) x (nrow(Test) x ncol(Test)). In the example you gave, the new matrix will be 36x36, and the colnames and rownames will be the same: A_Adam10, A_Adam17,..., A_Angpt4, B_Adam10,..., F_Angpt4.
With the help of a loop, you can load the similarity of each pair into the new matrix, and then you can plot the matrix. It's a little complicated and the loop takes a while to run, but it's intuitive.
You're welcome to check my github repo, since I had a similar problem not too long ago and posted detailed code there. I hope this helps.
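As a rough sketch of the loop described above (this is not the code from the repo), the following builds the expanded matrix for the first three genes from head(Test). The score used here, the product of the two expression values, is only a placeholder assumption, since the right "similarity" depends on your data; in practice you would probably fill in only known ligand-receptor pairs and leave the rest at 0.

```r
# Expression values copied from head(Test) in the question (rounded, first 3 genes)
Test <- matrix(
  c(440.76, 669.88, 748.73, 702.99, 1872.03, 2515.07,
    369.81, 292.63, 363.03, 434.91, 1183.15, 1375.42,
     12.68,  28.27,   9.24,  19.92,  121.59,  168.12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Adam10", "Adam17", "Agt"), LETTERS[1:6])
)

# Labels such as "A_Adam10", in the same (column-major) order as as.vector(Test)
labels <- as.vector(outer(rownames(Test), colnames(Test),
                          function(gene, pop) paste(pop, gene, sep = "_")))
expr <- as.vector(Test)
names(expr) <- labels

# Square matrix with identical row and column names
adj <- matrix(0, nrow = length(labels), ncol = length(labels),
              dimnames = list(labels, labels))

for (i in labels) {
  for (j in labels) {
    # Placeholder score (assumption): product of the two expression values
    adj[i, j] <- expr[i] * expr[j]
  }
}

# With circlize installed, the matrix could then be passed on:
# circlize::chordDiagram(adj)
```

With all 6 genes from head(Test) the result would be the 36x36 matrix mentioned above.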

Related

What is a good substitute for averaging vectors generated from Word2vec

My dataset is in the following format, where for each disease I am generating a vector using word2vec. (2D vectors are shown as an example, but in practice the vectors are 100D.)
Disease Vectors
disease a, disease c [[ 0.2520773 , 0.433798],[0.38915345, 0.5541569]]
disease b [0.12321666, 0.64195603]
disease c, disease b [[0.38915345, 0.5541569],[0.12321666, 0.64195603]]
disease c [0.38915345, 0.5541569]
From here I generate a 1D array for each disease or disease combination by averaging its vectors. The problem with averaging word vectors is that a combination of two or more diseases can end up with the same average vector as a completely unrelated disease, so the two get matched even though they have nothing in common. This makes averaging look flawed. My understanding is that such collisions become less likely as the dimension of the vectors increases.
So, a couple of questions:
Is there a better way than averaging the output from word2vec vectors to generate a 1D array?
These generated vectors will be treated as features for a classifier model that I am trying to build for each disease/disease combination. If I generate a 100D feature vector from word2vec, shall I use something like PCA on it to reduce the dimension, or shall I just treat the 100D feature vector as 100 features for my classifier?
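To make the averaging step and the collision concern concrete, here is a small sketch in R. The first part uses the 2D vectors from the table above; the vectors u, v and w in the second part are hypothetical and chosen only to show how two different inputs can share one average.

```r
# Vectors taken from the table above
disease_a <- c(0.2520773, 0.433798)
disease_c <- c(0.38915345, 0.5541569)

# Average of the combination "disease a, disease c", one value per dimension
avg <- colMeans(rbind(disease_a, disease_c))

# Hypothetical vectors (assumption) illustrating a collision:
# the average of u and v is exactly the single vector w
u <- c(0, 1)
v <- c(1, 0)
w <- c(0.5, 0.5)
collision <- all(colMeans(rbind(u, v)) == w)
```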

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the ranks for a particular cell. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.
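For a list with more than two matrices (e.g. 99 randomized matrices plus the observed one), a possible generalization is to stack the list into a 3D array and rank each cell across the stack. This is only a sketch and assumes the observed matrix is the last element of the list. Note that rank() ranks in ascending order, so here the largest value gets the highest rank, and ties get an averaged rank such as 1.5.

```r
# Example list from the question; the observed matrix is assumed to be last
ls <- list(matrix(10:18, 3, 3), matrix(18:10, 3, 3))

# Stack the matrices into a 3 x 3 x length(ls) array
arr <- simplify2array(ls)

# For every cell, rank the values across the stack and keep the rank
# of the last one (the observed matrix)
obs_rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
```

The result has the same shape as the input matrices, so obs_rank[i, j] is the rank of the observed value for dyad (i, j).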

prcomp( .. ,retx=TRUE), do I get the new data to train over?

I am having some issues in interpreting the results from prcomp().
Say I have a centered and scaled data.table called dat, with N columns and M rows. Every column represents a feature and every row a record. I also have an M-dimensional vector of outcomes Y.
I wanted to know what the PCA of this system says. So I just executed:
dat.pca=prcomp(dat,retx=TRUE)
By the elbow method I decided to retain 5 PCA modes, accounting for 90% of the variance. Then, I got the following data.table:
dat.pcadata=as.data.table(dat.pca$x)
dat.pcadata has M rows and N columns, and each column corresponds to a PCA mode.
My question is: do I understand correctly if I say that now my system should be trained to forecast the outcomes Y using the first 5 columns of dat.pcadata as features?
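The workflow described above might look like the following sketch. Everything here is an assumption for illustration: the data are simulated, a plain data.frame stands in for the data.table, and lm() stands in for whatever model is actually trained. The point it shows is that the first 5 columns of dat.pca$x are the scores on the first 5 components, and that new records have to be projected with the same PCA (via predict()) before they can be passed to the model.

```r
set.seed(1)

# Simulated stand-in for the centered and scaled data (M = 200 rows, N = 8 columns)
dat <- as.data.frame(scale(matrix(rnorm(200 * 8), nrow = 200, ncol = 8)))
Y <- rnorm(200)

dat.pca <- prcomp(dat, retx = TRUE)

# Keep the scores on the first k principal components as features
k <- 5
scores <- dat.pca$x[, 1:k]
train <- data.frame(Y = Y, scores)

# Placeholder model (assumption): ordinary linear regression on the scores
fit <- lm(Y ~ ., data = train)

# New records are projected onto the same components before prediction
newdat <- dat[1:3, ]
new_scores <- predict(dat.pca, newdata = newdat)[, 1:k]
pred <- predict(fit, newdata = as.data.frame(new_scores))
```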

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable name and third column contains the score between both. Total number of variables is 250,000 (A,B,C....). And the score is a float [0,1]. The file is approximately 50 GB. And the pairs of A,B where scores are 1, have been removed as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix. You may then need 1 TB of RAM: 2*8*250000*250000 bytes is a lot.
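The arithmetic behind that estimate, assuming 8-byte doubles and two full copies of a dense 250,000 x 250,000 matrix:

```r
n <- 250000            # number of variables
bytes_per_double <- 8  # one score stored as a double
copies <- 2            # assumption: two full copies of the matrix

total_bytes <- copies * bytes_per_double * n^2
total_tb <- total_bytes / 1e12  # decimal terabytes
```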
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.

Multiple Matrix Operations in R with loop based on matrix name

I'm a novice R user, who's learning to use this coding language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I managed to get the following:
kn.matrices <- with(patents, table(Class, year, firm))
kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
for (k in 1:37)
  kn.firms = assign(paste("firm", k, sep = ''), kn.matrices[, , k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk with k going from 1 until 37.
I would like to perform many operations in each of the firm-specific matrices (e.g. compare the numbers in app.yr 't' with the average of the 3 previous years across all rows) so I am looking for a way that allows me to loop the operations for every matrix named firm1,firm2,firm3...,firm37 and that generates new matrices with consistent naming, e.g. firm1.3yearcomparison
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following comments I'm trying to add a minimal reproducible example
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.
It looks like you already have a 3-dimensional array with all your data. You can basically view this as your 38 matrices all piled one on top of the other. You don't want to split this into 38 matrices and use loops. Instead, you can use R's apply function and array indexing. Have a look at the help topic on the apply() family and it should show you how to do what you want. Here are a few basic examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])
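For the 3-year comparison mentioned in the question, one possible sketch along the same lines is shown below. The data are simulated and the helper compare3 is a hypothetical name; it subtracts the mean of the three previous years from each year's count, for every row. Instead of creating variables such as firm1.3yearcomparison, the results are kept in a named list with one entry per firm.

```r
set.seed(42)

# Simulated stand-in for the patents data (assumption); factors fix the table layout
patents <- data.frame(
  Class = factor(sample(c(1900, 2000, 3000), 60, replace = TRUE)),
  year  = factor(sample(1989:1993, 60, replace = TRUE), levels = 1989:1993),
  firm  = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
kn.matrices <- with(patents, table(Class, year, firm))

# Hypothetical helper: for each year t (from the 4th column on), the count in t
# minus the mean of the three previous years, computed for every row
compare3 <- function(m) {
  yrs <- 4:ncol(m)
  out <- sapply(yrs, function(t) {
    m[, t] - rowMeans(m[, (t - 3):(t - 1), drop = FALSE])
  })
  colnames(out) <- colnames(m)[yrs]
  out
}

# One comparison matrix per firm, stored in a named list
firms <- dimnames(kn.matrices)$firm
comparison <- lapply(setNames(firms, firms),
                     function(f) compare3(kn.matrices[, , f]))
```

comparison[["a"]] then holds the Class x year comparison matrix for firm "a".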

Resources