Clustering multivalue nominal attributes with different measures - r

I have to apply a clustering algorithm to a dataset whose elements are made up of attributes of different natures:
A1 -> multivalued, nominal values
A2 -> multivalued, nominal values
A3 -> multivalued, nominal values
A4 -> single nominal value
The domain of each attribute is potentially huge, e.g. as large as a dictionary of words.
I've just found the Jaccard measure, which would work well on each attribute's value set, except for the first one.
Consider the following example:
E1: [
A1 (ab,bb),
A2 (Mark, Rose, Bet),
A3 (rock, pop, soul),
A4 (France)
],
E2: [
A1 (ab,bb,cc,ca),
A2 (Mark, Peter, Bet, Louise),
A3 (pop, disco),
A4 (Spain),
]
While all the other attributes must be treated as sets of atomic values, the first attribute is composed of values that need to be compared with a string similarity measure, such as Levenshtein distance.
What is the best approach? The first attribute generally has a small cardinality, about 10 values, but I have no a priori guarantee of that. All the other attributes have a huge cardinality.
I'm new to clustering; I'm just trying to figure out how to do it in the best way :)
I have found that R should be a good tool to implement this kind of clustering (consider a dataset of millions of elements).
Any suggestions?
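For concreteness, here is a minimal sketch of the kind of combined dissimilarity I have in mind, in base R (adist() gives Levenshtein distances; the best-match averaging for A1 and the equal weights are just assumptions, not requirements):

# Combined dissimilarity for two elements, using the example data above.
jaccard_dist <- function(x, y) {
  1 - length(intersect(x, y)) / length(union(x, y))
}

# A1: normalized Levenshtein distances (base R adist), averaged over the
# best match for each value in both directions.
a1_dist <- function(x, y) {
  d <- adist(x, y) / outer(nchar(x), nchar(y), pmax)
  mean(c(apply(d, 1, min), apply(d, 2, min)))
}

elem_dist <- function(e1, e2, w = c(1, 1, 1, 1)) {
  d <- c(a1_dist(e1$A1, e2$A1),
         jaccard_dist(e1$A2, e2$A2),
         jaccard_dist(e1$A3, e2$A3),
         jaccard_dist(e1$A4, e2$A4))
  sum(w * d) / sum(w)   # weighted average of the per-attribute dissimilarities
}

E1 <- list(A1 = c("ab", "bb"),
           A2 = c("Mark", "Rose", "Bet"),
           A3 = c("rock", "pop", "soul"),
           A4 = "France")
E2 <- list(A1 = c("ab", "bb", "cc", "ca"),
           A2 = c("Mark", "Peter", "Bet", "Louise"),
           A3 = c("pop", "disco"),
           A4 = "Spain")

elem_dist(E1, E2)

A dissimilarity like this could be fed as a precomputed distance matrix to hclust() or cluster::pam() on a sample; with millions of elements a full pairwise matrix is not feasible, so some sampling or prototype-based strategy would be needed on top.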

Related

What is a good substitute for averaging vectors generated from Word2vec

My dataset is in the following format, where for each disease I am generating a 2-D vector using word2vec. (I'm showing 2-D vectors in the example, but in practice the vectors are 100-D.)
Disease Vectors
disease a, disease c [[ 0.2520773 , 0.433798],[0.38915345, 0.5541569]]
disease b [0.12321666, 0.64195603]
disease c, disease b [[0.38915345, 0.5541569],[0.12321666, 0.64195603]]
disease c [0.38915345, 0.5541569]
From here I generate a 1-D array for each disease/disease combination by taking the average of the vectors. The issue with averaging word vectors is that a combination of two or more diseases can end up with the same average vector as a totally different, irrelevant disease, and the average vectors then get matched. This makes the concept of averaging vectors flawed. To counter this, my understanding is that as the dimension of the vectors increases, such collisions should become even less likely.
So, a couple of questions in all:
Is there a better way than averaging the word2vec output vectors to generate a 1-D array?
These generated vectors will be treated as features for a classifier model that I am trying to build for each disease/disease combination. So if I generate a 100-D feature vector from word2vec, should I apply something like PCA to reduce the dimension, or should I just treat the 100-D feature vector as 100 features for my classifier?
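For concreteness, the averaging step I'm describing looks roughly like this in R (a sketch with the 2-D example vectors; the 100-D case works the same way):

# Named list of word2vec vectors (2-D here purely for illustration).
vecs <- list("disease a" = c(0.2520773, 0.433798),
             "disease b" = c(0.12321666, 0.64195603),
             "disease c" = c(0.38915345, 0.5541569))

# 1-D feature vector for a disease/disease combination: element-wise mean.
combo_vector <- function(diseases) {
  colMeans(do.call(rbind, vecs[diseases]))
}

combo_vector(c("disease a", "disease c"))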

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that takes your ls and produces a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.
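To get the whole matrix of ranks at once across all matrices in the list (a sketch, assuming as in your question that the observed matrix is the last element of ls):

arr <- simplify2array(ls)        # 3 x 3 x length(ls) array
# For each cell, rank the values across the list and keep the rank of the
# observed (last) matrix; ties get averaged ranks (e.g. 1.5), as above.
obs_rank <- apply(arr, c(1, 2), function(x) rank(x)[length(x)])
obs_rank                         # same dimensions as the matrices in ls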

How to find index where two matrices are different in Julia

My problem is:
I have two large arrays: A is of rank 5 (namely, a tensor), which I reshape into a matrix B (N x M) of rank 2. At some point my problem involved normalizing them, so I was doing:
1) A*norm_scalar;
2) reshaping A to get B.
This gives a different result than doing:
1) reshaping A to get B
2) B*norm_scalar;
Both should give the same result, as I am only multiplying by a scalar. My theory is that this has to do with rounding precision. If so, which is the recommended way to proceed?
To check, I computed B with both methods, namely B1 and B2, and compared them.
I have tried:
julia> isequal(B1,B2)
false
So, yes, they are different.
I know that find(B1 .== B2) will give me the indices where B1 and B2 are equal. Now: is there a command that gives me the indices where B1 and B2 are different? That would help a great deal!
find(B1 .!= B2) should do what you want. (On Julia 0.7 and later, find has been replaced by findall, so use findall(B1 .!= B2).) If the differences turn out to be only floating-point rounding, an approximate comparison such as isapprox(B1, B2) may be what you actually need.

How to create adjacency matrix for gene-gene interactions from RNA-Seq (circlize input)

I'm profiling tumor microenvironment and I want to show interactions between subpopulations that I found. I have a list of receptors and ligands for example, and I want to show that population A expresses ligand 1 and population C expresses receptor 1 so there's likely an interaction between these two populations through the expression of ligand-receptor 1.
I have been trying to use circlize to visualize these interactions by making a chordDiagram, but it requires an adjacency matrix as input and I do not understand how to create the matrix. The adjacency matrix is supposed to show the strength of the relationship between any two genes in my matrix. I have 6 unique populations of cells that can express any of the 485 ligands/receptors that I am interested in, and the goal is to show interactions between these populations through the ligands and receptors.
I found a tool to use in RStudio called BUS, whose gene.similarity function calculates an adjacency matrix for gene-gene interaction.
Maybe I am just using BUS incorrectly, but its documentation says: for gene expression data with M genes and N experiments, the adjacency matrix is of size M x M, with rows and columns both standing for genes; the element in row i and column j indicates the similarity between gene i and gene j.
So, I made a matrix where each column is a subpopulation and each row is a ligand/receptor I want to show interactions with. The cells have expression values and it looks like this:
> head(Test)
A B C D E F
Adam10 440.755990 669.875468 748.7313995 702.991422 1872.033343 2515.074366
Adam17 369.813134 292.625603 363.0301707 434.905968 1183.152694 1375.424034
Agt 12.676036 28.269671 9.2428034 19.920561 121.587010 168.116735
Angpt1 22.807415 42.350205 25.5464603 16.010813 194.620550 99.383567
Angpt2 92.492760 186.167844 819.3679836 852.666499 669.642441 1608.748788
Angpt4 3.327743 0.693985 0.8292746 1.112826 5.463647 5.826927
Where A-F are my populations. Then I pass this matrix to BUS:
res<-gene.similarity(Test,measure="corr",net.trim="none")
Warning message:
In cor(mat) : the standard deviation is zero
But the output, which is supposed to be my adjacency matrix, is full of NAs:
Adam10 Adam17
Adam10 1 NA
Adam17 NA 1
I thought maybe my matrix was too complex, so I compared only 2 cell populations with my ligands/receptors, but I get the exact same output.
I was expecting to get something like:
A:Adam10 A:Adam17
C:Adam10 6 1
E:Adam17 2 10
But even if the res object gave me numbers instead of NAs, it does not maintain the identity of the population when making relationships among genes, so it still would not produce my expected output.
I do not have to use BUS to make the matrix, so I don't necessarily need help troubleshooting that code; I just need SOME way to make an adjacency matrix.
I've never used circlize or Circos before so I apologize if my question is stupid.
It seems like you need to transform your matrix a little.
You can create a new matrix of size (nrow(Test) * ncol(Test)) x (nrow(Test) * ncol(Test)), so in the example you gave the new matrix will be 36x36, and the colnames and rownames will be the same: A_Adam10, A_Adam17, ..., A_Angpt4, B_Adam10, ..., F_Angpt4.
With the help of a loop, you can load the similarity of each pair into the new matrix, and then you can plot it. It's a little complicated, and the loop takes a while to run, but it's intuitive.
You're welcome to check my GitHub repo, since I had a similar problem not too long ago and posted detailed code there. I hope this helps.
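A rough sketch of that idea, using the Test matrix from the question (the pmin() "interaction strength" is only a placeholder for whatever similarity you actually want, and with 485 ligands/receptors you would almost certainly filter the matrix before plotting):

library(circlize)

# Combined population_gene labels: A_Adam10, A_Adam17, ..., F_Angpt4
labs <- paste(rep(colnames(Test), each = nrow(Test)), rownames(Test), sep = "_")

# Expression values stacked in the same order (column by column).
vals <- as.vector(as.matrix(Test))

# Placeholder weight: smaller of the two expression values for each pair.
adj <- outer(vals, vals, pmin)
dimnames(adj) <- list(labs, labs)

chordDiagram(adj)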

Optimal order and scaling of matrices

I have two matrices A1 and A2
(for example, A1 = [0.4472 -0.8944; -0.8944 0.4472], A2 = [-0.5558 0.9101; 0.8313 0.41420]),
and I want to check whether the columns of A2 are optimally ordered and scaled
(its columns are least-squares estimates of the columns of A),
and if not, to make them so.
Any help?
Thanks
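To make the question concrete, here is a rough sketch in R of the kind of check I have in mind (assuming A1 is the reference matrix; the brute-force search over column orderings is only for illustration):

A1 <- matrix(c(0.4472, -0.8944, -0.8944, 0.4472), 2, 2)
A2 <- matrix(c(-0.5558, 0.8313, 0.9101, 0.41420), 2, 2)

perms <- function(v) {   # all permutations of a vector, base R
  if (length(v) <= 1) return(list(v))
  do.call(c, lapply(seq_along(v), function(i)
    lapply(perms(v[-i]), function(p) c(v[i], p))))
}

best <- NULL
for (p in perms(seq_len(ncol(A2)))) {
  B <- A2[, p, drop = FALSE]
  s <- colSums(A1 * B) / colSums(B * B)    # least-squares scale per column
  r <- sum((A1 - sweep(B, 2, s, "*"))^2)   # total squared residual
  if (is.null(best) || r < best$resid) best <- list(perm = p, scale = s, resid = r)
}

A2_fixed <- sweep(A2[, best$perm, drop = FALSE], 2, best$scale, "*")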
