Analysing the similarity/dissimilarity between binary codes of equal length in R

I have been trying to analyse some data consisting of 32-bit binary codes. Each code has been converted into its corresponding decimal value, and I have grouped the data into several categories based on some similarity. Now, is there a way I can compute a measure of similarity/dissimilarity within each category?
The concepts of standard deviation, variance, etc. do not really make sense here, but I am sure the Hamming distance can be used to compare the similarity between two strings.
However, how do I compute a sensible metric using Hamming distance for a group containing 10,000 such binary codes (stored as decimals)? How should I proceed in general, and how can I do this in R, since the rest of my calculations are in R?
Any help would be deeply appreciated.
Kindly find below a sample of my data:
ID(Group5) BinCode
1 2621440
2 3670018
3 3670018
4 3670018
5 3670018
6 2621440
7 3670018
8 2621442
9 3670018
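A minimal sketch in base R, assuming the codes fit in 32-bit integers (the sample values above do): XOR the decimal values directly, count the differing bits, then summarise the pairwise distances within the group.

codes <- c(2621440, 3670018, 3670018, 3670018, 3670018,
           2621440, 3670018, 2621442, 3670018)  # the sample above

# Hamming distance between two decimal-encoded 32-bit codes:
# XOR sets exactly the bits where a and b differ, then count those bits.
hamming <- function(a, b) {
  sum(bitwAnd(bitwShiftR(bitwXor(a, b), 0:31), 1L))
}

D <- outer(codes, codes, Vectorize(hamming))  # pairwise distance matrix
mean(D[upper.tri(D)])                         # one within-group summary

For 10,000 codes the full pairwise matrix has about 5x10^7 entries, so a cheaper within-group summary, such as the mean distance to the group's most frequent code, may be preferable.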

Related

Finding if there is a relationship between numbers

I have a challenge. This may be a little tricky or even impossible, but I wanted to check whether anyone has any thoughts on it.
PS: This question is general and not related only to R; you could say it is general mathematics.
I have some data:
df
ColA ColB ColC
6 9 27
1 4 32
4 8 40
If you observe closely, there is some relationship between these columns. For example, (ColC/ColB)+ColA gives the number 9 in every row:
df
ColA ColB ColC ColD
6 9 27 9
1 4 32 9
4 8 40 9
However, this data was constructed; I made sure there is some relation.
But in general, if we take arbitrary numbers, is there a way to find whether any relationship holds between them? It need not be (ColC/ColB)+ColA; it could be anything.
Say we have 5 columns of numeric data. I need to find a mathematical operation between them such that a common number exists.
This is more of a mathematics (algebra) question.
Can anyone let me know whether this is even possible?
For some types of relationships this is doable. But when such a method fails to find a relationship, it typically just means there could be a relationship of a kind not covered by your approach.
One common tool for finding relationships is linear algebra, and linear dependencies in particular. Write your data in a matrix like you did, and consider a linear equation
a*ColA + b*ColB + c*ColC = 0
Use standard techniques such as Gaussian elimination to find coefficients a, b, c which satisfy this equation but are not all zero. You can probably find a library that computes the kernel (null space) of a matrix, which you can use for that. Then you know whether one of the columns can be expressed as a linear combination of the other two.
This is a very limited class of relationships, and it doesn't cover your example yet. But you can improve it by including more columns: include a column of ones to allow for a constant term in your formula, and include all pairwise products.
x + a*ColA + b*ColB + c*ColC + ab*ColA*ColB + ac*ColA*ColC + bc*ColB*ColC + aa*ColA^2 + bb*ColB^2 + cc*ColC^2 = 0
Now for your data this could tell you that there is a solution of the form
b=-9 c=1 ab=1 x=a=ac=bc=aa=bb=cc=0
-9*ColB + ColC + ColA*ColB = 0
which is equivalent to the relationship you described in your question.
But also observe that you are now using 3 data points to determine 10 variables, so this relationship is far from the only one.
In general you want at least as many data points as you have variables in your equation, i.e. at least as many rows as columns in your extended matrix. Only then can you say that a relationship between them is indeed a property of the underlying data and not merely an artifact of having too much flexibility and too little data.
In R you might want to look into linear models for determining coefficients in the presence of imprecise data. You can also use powers in model formulas to include all interactions between columns, i.e. the higher-degree terms I included above. A sketch of the null-space approach follows.
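A minimal base-R sketch of this idea on the question's data, with svd() standing in for Gaussian elimination when computing the kernel:

# Extended data matrix: constant term, the three columns, all pairwise
# products, and the squares (10 columns in total).
A <- c(6, 1, 4); B <- c(9, 4, 8); C <- c(27, 32, 40)
M <- cbind(1, A, B, C, A*B, A*C, B*C, A^2, B^2, C^2)

s <- svd(M, nu = 0, nv = ncol(M))        # full set of right singular vectors
rank <- sum(s$d > 1e-8)                  # numerical rank (here at most 3)
null_basis <- s$v[, (rank + 1):ncol(M)]  # vectors v with M %*% v = 0

# Each column of null_basis encodes one relationship among the columns;
# max(abs(M %*% null_basis)) is ~0 up to floating-point error.

With only 3 rows the null space here is 7-dimensional, which illustrates the warning above: most of these "relationships" are artifacts of having too few data points.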

What is a good substitute for averaging vectors generated from Word2vec

My dataset is in the following format, where for each disease I generate a 2-D vector using word2vec. (I show 2-D vectors as an example, but in practice the vectors are 100-D.)
Disease Vectors
disease a, disease c [[0.2520773, 0.433798], [0.38915345, 0.5541569]]
disease b [0.12321666, 0.64195603]
disease c, disease b [[0.38915345, 0.5541569],[0.12321666, 0.64195603]]
disease c [0.38915345, 0.5541569]
From here I generate a 1-D array for each disease/disease combination by taking the average of the vectors. The issue with averaging word vectors is that a combination of 2 or more diseases can end up with the same average vector as a totally different, irrelevant disease, and the average vectors then get matched; this makes averaging flawed. My understanding is that as the dimension of the vectors increases, such collisions become less likely.
So, a couple of questions in all:
Is there a better way than averaging the word2vec output vectors to generate a 1-D array?
These generated vectors will be used as features for a classifier model that I am building for each disease/disease combination. If I generate a 100-D feature vector from word2vec, should I apply something like PCA to reduce the dimension, or should I just treat the 100-D vector as 100 features for my classifier?
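For concreteness, a minimal R sketch of the averaging step described above, using the 2-D vectors from the table; "features" in the commented PCA line is a hypothetical matrix with one row per disease/disease combination.

# Average the per-disease vectors for the "disease a, disease c" row:
vecs <- rbind(c(0.2520773,  0.433798),   # disease a
              c(0.38915345, 0.5541569))  # disease c
avg <- colMeans(vecs)                    # the 1-D feature the question uses

# PCA option from the second question (base R), on a hypothetical matrix:
# pcs <- prcomp(features, scale. = TRUE)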

How to create adjacency matrix for gene-gene interactions from RNA-Seq (circlize input)

I'm profiling the tumor microenvironment and I want to show interactions between the subpopulations that I found. For example, I have a list of receptors and ligands, and I want to show that population A expresses ligand 1 and population C expresses receptor 1, so there is likely an interaction between these two populations through ligand-receptor pair 1.
I have been trying to use circlize to visualize these interactions by making a chordDiagram, but it requires an adjacency matrix as input and I do not understand how to create the matrix. The adjacency matrix is supposed to show the strength of the relationship between any two genes in my matrix. I have 6 unique populations of cells that can express any of the 485 ligands/receptors that I am interested in, and the goal is to show interactions between these populations through the ligands and receptors.
I found an R tool called BUS with the function gene.similarity: calculate adjacency matrix for gene-gene interaction.
Maybe I am just using BUS incorrectly, but its documentation says: for gene expression data with M genes and N experiments, the adjacency matrix is of size MxM, with rows and columns both standing for genes; the element in row i and column j indicates the similarity between gene i and gene j.
So, I made a matrix where each column is a subpopulation and each row is a ligand/receptor I want to show interactions with. The cells have expression values and it looks like this:
> head(Test)
A B C D E F
Adam10 440.755990 669.875468 748.7313995 702.991422 1872.033343 2515.074366
Adam17 369.813134 292.625603 363.0301707 434.905968 1183.152694 1375.424034
Agt 12.676036 28.269671 9.2428034 19.920561 121.587010 168.116735
Angpt1 22.807415 42.350205 25.5464603 16.010813 194.620550 99.383567
Angpt2 92.492760 186.167844 819.3679836 852.666499 669.642441 1608.748788
Angpt4 3.327743 0.693985 0.8292746 1.112826 5.463647 5.826927
Where A-F are my populations. Then I pass this matrix to BUS:
res <- gene.similarity(Test, measure = "corr", net.trim = "none")
Warning message:
In cor(mat) : the standard deviation is zero
But the output, which is supposed to be my adjacency matrix, is full of NAs:
Adam10 Adam17
Adam10 1 NA
Adam17 NA 1
I thought maybe my matrix was too complex, so I compared only 2 cell populations with my ligands/receptors, but I get the exact same output.
I was expecting to get something like:
A:Adam10 A:Adam17
C:Adam10 6 1
E:Adam17 2 10
But even if the res object gave me numbers instead of NAs, it does not keep the identity of the population when relating genes, so it still would not produce my expected output.
I do not have to use BUS to make the matrix, so I don't necessarily need help troubleshooting that code, I just need SOME way to make an adjacency matrix.
I've never used circlize or Circos before so I apologize if my question is stupid.
It seems like you need to transform your matrix a little.
You can create a new matrix of size (nrow(Test) * ncol(Test)) x (nrow(Test) * ncol(Test)), so in the example you gave the new matrix will be 36x36, and the colnames and rownames will be the same: A_Adam10, A_Adam17, ..., A_Angpt4, B_Adam10, ..., F_Angpt4.
With the help of a loop, you can load the similarity of each pair into the new matrix, and then you can plot it; see the sketch below. It's a little complicated and the loop takes a while to run, but it's intuitive.
You're welcome to check my GitHub repo, since I had a similar problem not too long ago and posted detailed code there. I hope this helps.
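A rough sketch of that transformation, assuming Test is the gene-by-population matrix from the question; the product of the two expression values below is only a placeholder for whatever ligand-receptor score you actually use.

library(circlize)

# One label per population/gene combination: A_Adam10, ..., F_Angpt4.
labels <- as.vector(outer(colnames(Test), rownames(Test), paste, sep = "_"))
adj <- matrix(0, length(labels), length(labels),
              dimnames = list(labels, labels))

for (i in labels) {
  for (j in labels) {
    pop_i  <- sub("_.*$", "", i); gene_i <- sub("^[^_]*_", "", i)
    pop_j  <- sub("_.*$", "", j); gene_j <- sub("^[^_]*_", "", j)
    # Placeholder similarity: product of the two expression values.
    adj[i, j] <- Test[gene_i, pop_i] * Test[gene_j, pop_j]
  }
}

chordDiagram(adj)  # circlize accepts the adjacency matrix directly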

Correlation in R: treating lower values as better than higher values

I am trying to calculate the correlation between a vector of investment returns and a matching vector with a number from 1 to 5 rating the quality of the company. It looks something like this (let's call this data returnrank):
company returns rank
at&t 0.09034 2
verizon 0.23341 1
sprint 0.03021 3
How can I make it so that when I calculate cor(returnrank$returns, returnrank$rank) it treats lower values in the rank column as better and higher values as worse?
(That is, if a stock has high returns and a low rank such as 1, I want to see a high positive correlation, because I am treating 1 as better than 5.)
You probably just want:
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank)
It may be better to just graph the data, since the relationship is unlikely to be linear given the nature of ranks.
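A quick check with the question's sample data; negating the rank gives the same correlation, since correlation is unchanged by positive linear rescaling.

returnrank <- data.frame(
  company = c("at&t", "verizon", "sprint"),
  returns = c(0.09034, 0.23341, 0.03021),
  rank    = c(2, 1, 3)
)
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank)
cor(returnrank$returns, -returnrank$rank)  # identical result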

Interpreting the results of R Mclust package

I'm using the R package mclust to estimate the number of clusters in my data and get this result:
Clustering table:
2 7 8 9
205693 4465 2418 91
Warning messages:
1: In map(z) : no assignment to 1,3,4,5,6
2: In map(z) : no assignment to 1,3,4,5,6
It selects 9 clusters as the best, but there is no assignment to 5 of them.
So does this mean I want to use 9 or 5 clusters?
If the answer can be found somewhere online, a link would be appreciated. Thanks in advance.
Most likely, the method just did not work at all on your data...
You may try other seeds, because when you "lose" clusters (i.e. they become empty) it usually means your seeds were not chosen well. And your cluster 9 is pretty much gone, too.
However, if your data is actually generated by a mixture of Gaussians, it's hard to find such a bad starting point... so most likely, all of your results are bad, because the data does not satisfy your assumptions.
Judging from your cluster sizes, I'd say you have 1 cluster and a lot of noise...
Have you visualized and validated the results?
Don't blindly follow some number. Validate.
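A hedged sketch of that validation step using the mclust API, assuming X is the data matrix the question passed to Mclust (it is not shown):

library(mclust)

fit <- Mclust(X, G = 1:9)            # fit models with 1 to 9 components
summary(fit)                         # BIC-selected model and cluster table
plot(fit, what = "BIC")              # compare candidate cluster counts
plot(fit, what = "classification")   # eyeball the actual assignment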
