How to calculate Jaccard similarity between two rows in a data frame - r

I have an Excel file with records of students, including 14 attributes (shown below). I want to calculate the similarity between each pair of students.
First, I convert each row into a character vector, then I build a document-term matrix and calculate the distance between each pair of documents. Finally I subtract the distance from 1, but I get the wrong similarity.
library(readxl)   # for read_excel()
library(tm)       # for Corpus(), VectorSource(), DocumentTermMatrix()
library(proxy)    # for dist(..., method = "jaccard"); base stats::dist() does not accept method = "jaccard"
F360 <- read_excel("C:/Users/DreamWorld/F360.xlsx")
mydf=data.frame(F360$nursery,F360$higher,F360$internet,F360$romantic,stringsAsFactors = FALSE)
td1=as.character(mydf[1,])
td2=as.character(mydf[2,])
d1=paste(td1[1],td1[2],td1[3],td1[4],sep = " ")
d2=paste(td2[1],td2[2],td2[3],td2[4],sep = " ")
myvector=c(d1,d2)
mycorpus=Corpus(VectorSource(myvector))
dtm=as.matrix(DocumentTermMatrix(mycorpus))
jdist=as.matrix(dist(dtm,method = "jaccard"))
jsim=1-jdist
I'm expecting a similarity value for each pair of rows in the data frame.

Recently, I found that the function sum() will give me the number of common attributes:
com = sum(td1 == td2)
The next thing is to get the number of elements in each vector, which is obviously 4 here:
len = length(td1)
Finally, we can find the Jaccard similarity, which is the intersection over the union:
sim = com/len
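If you need this for every pair of students rather than just the first two rows, here is a minimal sketch (assuming mydf is the character data frame built above; it uses the same match-count similarity as the two-row version):
n <- nrow(mydf)
sim.mat <- matrix(1, n, n)   # similarity of a row with itself is 1
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    # proportion of attributes on which students i and j agree
    s <- sum(mydf[i, ] == mydf[j, ]) / ncol(mydf)
    sim.mat[i, j] <- s
    sim.mat[j, i] <- s
  }
}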

Related

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that using sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.
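If you want to do the same thing for the full list of permuted matrices rather than just two, a minimal sketch (assuming, as in the question, that the observed matrix is the last element of ls) is:
arr <- simplify2array(ls)   # rows x cols x length(ls) array
# cell-by-cell rank of the last (observed) matrix among all matrices in the list
obs.rank <- apply(arr, c(1, 2), function(x) rank(x)[length(x)])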

Find column with values closest to vector

I have a vector containing times in milliseconds, looking like this:
vector <- c(667753, 671396, 675356, 679286, 683413, 687890, 691742,
695651, 700100, 704552, 708832, 713117, 717082, 720872, 725002, 729490,
733824, 738233, 742239, 746092, 750003, 754236, 867342, 870889, 873704,
876617, 879626, 882595, 885690, 888602, 891789, 894717, 897547, 900797,
903615, 906646, 909624, 912613, 915645, 918566, 921792, 924625, 927538,
930721, 933542)
Now I want to look into a large data frame with a lot of time columns and find the single column whose time values are closest (row-wise) to my vector's time values.
The data frame has the same number of rows as my vector has elements. So let's say my vector has 240 elements; then every column in the larger data frame consists of 240 rows.
Any idea how to do this?
You can calculate the euclidean distance from your vector and each column of the dataframe and then check which column has the smallest distance:
which.min(sapply(1:ncol(dataFrame), function(i) sqrt(sum((t(v)-dataFrame[,i])^2))))
The above returns the index of the column with the lowest distance.
Here dataFrame is the data frame containing the columns of different times (so we compare each column to the vector v), and v is the vector.
The following is just the square root of the sum of squared differences (the Euclidean distance):
sqrt(sum((t(v)-dataFrame[,i])^2))
You can also use the sum of absolute differences (the Manhattan distance) as a distance measure:
sum(abs(t(v)-dataFrame[,i]))
EDIT
As Evan Friedland pointed out, you can actually just use:
which.min(colSums(abs(v-dataFrame)))
or
which.min(sqrt(colSums((v-dataFrame)^2)))
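As a quick sanity check, here is a tiny made-up example (the data frame, its column names and its values are purely illustrative):
set.seed(1)
v <- c(10, 20, 30, 40)
dataFrame <- data.frame(a = v + rnorm(4, sd = 50),
                        b = v + rnorm(4, sd = 5),
                        c = v + rnorm(4, sd = 0.5))
which.min(colSums(abs(v - dataFrame)))   # picks column "c", which tracks v most closely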

Iterating a vector over a list in R

I am dealing with a computational feature-extraction problem on RNA data, and I found myself unable to deal with this question:
I have n sequences (say two, for example) from which I obtained an iterated statistic i times (a kind of Monte Carlo iteration for analyzing the distribution of the obtained statistics compared with the original).
Example:
Say we iterate 10 times
n <- 10
I got a vector of 20 values with all the iterations, but this vector corresponds to two different sequences, so I must divide it into two equal parts (the iterations are ordered 1:10 - 1:10 for each sequence).
MFEit <- c(10, 12, 34, 32, 12 .....) ## vector of length 20
MFEit.split <- split(MFEit, ceiling(seq_along(MFEit)/n))
This generates a list of two items, each with 10 values, named $`1` and $`2`.
On the other hand I have a vector of two values which are the original statistics, each corresponding to each original sequence
MFE <- c(25, 15)
What I want to know is how many values of the first item in the list MFEit.split are less than or equal to the first value of MFE, then how many values of the second item in MFEit.split are less than or equal to the second value of MFE, and so on, given that I may have more than two values or items.
I know how to do it one by one, say:
R <- length(subset(MFEit.split$`1`, MFEit.split$`1`<=MFE[1]))
R <- length(subset(MFEit.split$`2`, MFEit.split$`2`<=MFE[2]))
But how do I put this into a loop so that I get each comparison iteratively, no matter how many MFE values or list items I have?
The desired output would be a vector called R, with n values corresponding to each comparison.
Any help?...
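One way to express this without an explicit loop (a minimal sketch, assuming MFEit.split and MFE as defined above) is to walk over the list and the vector in parallel with mapply:
# for each sequence, count how many iterated statistics are <= the original statistic
R <- mapply(function(iterated, original) sum(iterated <= original), MFEit.split, MFE)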

Similarity between bags of words

I have three bags of words:
BoW1 = [word11, word12, word13]
BoW2 = [word21, word22, word23]
BoW3 = [word31, word32, word33]
BoW1 contains synonyms, and BoW2 also contains synonyms. Both BoW1 and BoW2 are fixed. BoW3 contains the words of a document, so it is a multiset.
I want to search BoW3 to see if it contains any word of BoW1 or BoW2. Then I would like to calculate the similarity between BoW1 + BoW2 and BoW3, i.e. treating BoW1 and BoW2 together. I am not interested in the similarity between BoW1 and BoW2; for the calculation I can assume that they are one set. However, in my case BoW1 contains more significant words than BoW2.
What do you think is the best and most accurate way to calculate such a similarity? I thought of using term frequency as in the information retrieval field, but I am not sure whether repetition is important in my case.
You are probably wanting the cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Compute the dot product between each bag of words vector. If you're using Python, your code will look something like:
# Make sure each BoW is a map from word -> frequency
BoW1 = {word11: 1, word12: 5, word13: 3}
BoW2 = ...
BoW3 = ...
# Normalise the frequencies
BoW1_total = sum([freq for freq in BoW1.values()])
BoW1 = {word : freq / BoW1_total for word, freq in BoW1.items()}
BoW2_total = ...
...
# Compute the dot product
similarity = 0
for word in set(BoW1.keys()).intersection(BoW2.keys()):
    similarity += BoW1[word] * BoW2[word]
... # continue for each pair you want to work out the similarities
Of course, organise the code better than this ^ (write functions for all the things you need to do multiple times, etc) but this should give you the rough idea.

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between each pair of rows (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all the dissimilarity values and select the 10 largest. But that produces a 100 * 100 matrix, so a more efficient method may be needed when there are a large number of rows. This is just my thought, and it may not be right, so I need help.
[update]
After looking through the literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats until there is a sufficient number of cluster centers.
Here, in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
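For completeness, here is a sketch of the maximum dissimilarity method quoted in the question (a greedy farthest-point selection on Euclidean distances; this reads "maximally distant from both current points" as the largest minimum distance to the rows chosen so far, which is one common interpretation):
d <- as.matrix(dist(mat))            # 100 x 100 Euclidean distance matrix
selected <- sample(nrow(mat), 1)     # random first record, as in the quoted definition
while (length(selected) < 10) {
  remaining <- setdiff(seq_len(nrow(mat)), selected)
  # pick the row whose smallest distance to the already-selected rows is largest
  next.id <- remaining[which.max(apply(d[remaining, selected, drop = FALSE], 1, min))]
  selected <- c(selected, next.id)
}
most.dissimilar <- mat[selected, ]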
