Cluster groups based on pairwise distances - r

I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:
m <- matrix(c(0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0,
              1, 1, 1, 0, 1, 1,
              1, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0), ncol = 6, byrow = TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
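Now I want to put letters in the same cluster whenever the distance between them is 0, either directly or through a chain of other letters (A and E both have distance 0 to B, so all three belong together). For the example above, I should get three clusters consisting of: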
(A,B,E)
(C,F)
(D)
I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:
clustersizes = c(3,2,1)
I assume this is possible with the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but how should I use it if I know neither the number of clusters nor the cutoff height in advance?
This is what I tried:
h <- hclust(dist(m),method="single")
plot(h)
Thanks!

Welcome to SO.
There are several ways to handle this but an easy choice is to use the igraph package.
First we convert your distance matrix m into an adjacency matrix, in which 1 marks a connection between two nodes and 0 means no connection. Since a distance of 0 should count as a connection, we subtract your matrix from 1:
mm <- 1 - m
diag(mm) <- 0 # We don't allow loops
This gives
> mm
A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0
Then we just need to feed it to igraph to compute communities
library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))
which produces
IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
$`1`
[1] "A" "B" "E"
$`2`
[1] "C" "F"
$`3`
[1] "D"
Now if you save that result you can get the community sizes right away
res <- fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)
which yields
Community sizes
1 2 3
3 2 1
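On the cutree part of the question, here is a base-R sketch as an alternative to the igraph route. It assumes m really holds the pairwise distances, so it uses as.dist(m) rather than dist(m), which would compute new Euclidean distances between the rows. With single linkage, cutting the tree at any height between 0 and the smallest non-zero distance (0.5 is an arbitrary choice here) groups the letters that are linked by a chain of zero distances, so the number of clusters does not have to be known in advance. It may also be worth keeping in mind that community detection optimises modularity and could split a connected group in larger graphs, whereas this cut depends only on the zero distances.

```r
# Distance matrix from the question
m <- matrix(c(0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0,
              1, 1, 1, 0, 1, 1,
              1, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0), ncol = 6, byrow = TRUE,
            dimnames = list(LETTERS[1:6], LETTERS[1:6]))

# Treat m itself as the distances (as.dist), not as raw data (dist)
h <- hclust(as.dist(m), method = "single")

# Cut just above 0: only chains of zero distances stay together
groups <- cutree(h, h = 0.5)

# Members of each cluster and the cluster sizes
members <- split(names(groups), groups)
clustersizes <- as.vector(table(groups))
clustersizes
# [1] 3 2 1
```
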

Related

assign vector elements to matrix

I wish to assign vector elements to a matrix. In my example I have a population of 10 kinds of fruit available for purchase, in my.fruit. A sample of five pieces of fruit is purchased in the sequence shown in sequence.of.purchased.item. I want to create a matrix containing the sequence in which the fruit was purchased, as shown in desired.result.
Here is a somewhat long description of how desired.result is constructed. The vector my.fruit is essentially the row names of the matrix desired.result. The first plum purchased was the first piece of fruit purchased. The sequence number of 1 is placed in the first column of the fifth row, the row representing plums. The second plum purchased was the third piece of fruit purchased. The sequence number of 3 is placed in the second column of the plum row. The first apple purchased was the second piece of fruit purchased. So, the number 2 is placed in the first column of the first row, the row representing apples. The fourth piece of purchased fruit was an orange. So, a 4 is placed in the first column of the second row, the row representing oranges.
The tenth row represents olives, but no olives were purchased. Similarly, no cherries, peaches, apricots, pears, grapefruit or figs were purchased. So, their rows are all zero in desired.result.
my.fruit <- c('apple', 'orange', 'cherry', 'peach', 'plum',
'apricot', 'pear', 'grapefruit', 'fig', 'olive')
sequence.of.purchased.item <- c('plum', 'apple', 'plum', 'orange', 'plum')
desired.result <- matrix(c(
2, 0, 0,
4, 0, 0,
0, 0, 0,
0, 0, 0,
1, 3, 5,
0, 0, 0,
0, 0, 0,
0, 0, 0,
0, 0, 0,
0, 0, 0), ncol = 3, byrow = TRUE)
I can obtain the row and column indices for each piece of purchased fruit using:
purchase.order.row <- match(sequence.of.purchased.item, my.fruit)
purchase.order.row
#[1] 5 1 5 2 5
purchase.order.col <- sapply(1:length(sequence.of.purchased.item),
function(i) {sum(sequence.of.purchased.item[i] == sequence.of.purchased.item[1:i])})
purchase.order.col
#[1] 1 1 2 1 3
Here I attempt to assign the sequence number of each purchased fruit to an output matrix using sapply:
my.output <- matrix(0, ncol = 3, nrow = 10)
sapply(1:5, function(x) my.output[purchase.order.row[x],
purchase.order.col[x]] = x)
However, the sapply statement is not returning the desired output.
my.output
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[4,] 0 0 0
[5,] 0 0 0
[6,] 0 0 0
[7,] 0 0 0
[8,] 0 0 0
[9,] 0 0 0
[10,] 0 0 0
This will probably be much easier:
my.output <- matrix(0, ncol = 3, nrow = 10)
i <- cbind(purchase.order.row,purchase.order.col)
my.output[i] <- 1:5
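Alternatively, use <<- instead of = inside your sapply function. A plain = (or <-) inside the function only modifies a local copy of my.output that is discarded when the function returns, whereas <<- assigns to the my.output defined outside the function.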
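Both suggestions can be tried side by side. The following is a minimal sketch, not the only way to do it: the shortened names (purchased, rows, cols, out, out2) are made up for this example, and ave() is used here as one possible way to count repeated purchases instead of the sapply from the question.

```r
my.fruit <- c('apple', 'orange', 'cherry', 'peach', 'plum',
              'apricot', 'pear', 'grapefruit', 'fig', 'olive')
purchased <- c('plum', 'apple', 'plum', 'orange', 'plum')

# Row = position of the fruit in my.fruit,
# column = running count of purchases of that fruit
rows <- match(purchased, my.fruit)
cols <- ave(seq_along(purchased), purchased, FUN = seq_along)

# Variant 1: index the matrix with a two-column (row, col) matrix
out <- matrix(0, nrow = length(my.fruit), ncol = max(cols))
out[cbind(rows, cols)] <- seq_along(purchased)

# Variant 2: keep the sapply loop but assign with <<- so that the
# matrix defined outside the function is modified
out2 <- matrix(0, nrow = length(my.fruit), ncol = max(cols))
invisible(sapply(seq_along(purchased),
                 function(k) out2[rows[k], cols[k]] <<- k))

identical(out, out2)
```

The matrix-index variant avoids the loop entirely, which is usually the more idiomatic choice in R.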

Transform categorical attribute vector into similarity matrix

I need to transform a categorical attribute vector into a "same attribute matrix" using R.
For example I have a vector which reports gender of N people (male = 1, female = 0). I need to convert this vector into a NxN matrix named A (with people names on rows and columns), where each cell Aij has the value of 1 if two persons (i and j) have the same gender and 0 otherwise.
Here is an example with 3 persons, first male, second female, third male, which produce this vector:
c(1, 0, 1)
I want to transform it into this matrix:
A = matrix( c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow=3, ncol=3, byrow = TRUE)
As lmo said in a comment, it's impossible to know the structure of your dataset, so what follows is just an example for you to see how it could be done.
First, make up some data.
set.seed(3488) # make the results reproducible
x <- LETTERS[1:5]
y <- sample(0:1, 5, TRUE)
df <- data.frame(x, y)
Now tabulate it according to your needs
A <- outer(df$y, df$y, function(a, b) as.integer(a == b))
dimnames(A) <- list(df$x, df$x)
A
# A B C D E
#A 1 1 1 0 0
#B 1 1 1 0 0
#C 1 1 1 0 0
#D 0 0 0 1 1
#E 0 0 0 1 1
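Applied to the three-person vector from the question, the same outer() idea looks roughly like this. The names Adam, Beth and Carl are invented for the illustration, since the question does not give any.

```r
# Gender vector from the question (male = 1, female = 0);
# the person names are hypothetical
gender <- c(1, 0, 1)
people <- c("Adam", "Beth", "Carl")

# Compare every pair of entries; multiplying by 1L turns TRUE/FALSE into 1/0
A <- outer(gender, gender, "==") * 1L
dimnames(A) <- list(people, people)
A
```
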

How to use a positional vector to create a binary string?

I have a vector c(2, 5) and I want to turn this into a vector of n elements where positions 2 and 5 are equal to 1 and any remaining positions are equal to zero.
i.e. If I want to create a vector of length 6, I would want to use vector c(2, 5) to generate the following vector:
c(0, 1, 0, 0, 1, 0)
How about
x <- c(2, 5)
n <- 6
replace(integer(n), x, 1L)
# [1] 0 1 0 0 1 0
And another option is
as.integer(1:n %in% x)
# [1] 0 1 0 0 1 0
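If this is needed in several places, the idea can be wrapped in a small helper. This is only a sketch: positions_to_binary is a made-up name, and the bounds check is an assumption about what should happen with positions outside 1..n.

```r
# Hypothetical helper: 0/1 vector of length n with ones at the given positions
positions_to_binary <- function(pos, n) {
  stopifnot(all(pos >= 1), all(pos <= n))  # assumed: out-of-range is an error
  out <- integer(n)
  out[pos] <- 1L
  out
}

v <- positions_to_binary(c(2, 5), 6)
v
# [1] 0 1 0 0 1 0

# Collapse to a single character string if a literal binary string is wanted
s <- paste(v, collapse = "")
s
# [1] "010010"
```
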

make data frame with binaries to sum to 1

I have a data frame with only zeros and ones, e.g.
df <- data.frame(v1 = rbinom(100, 1, 0.5),
v2 = rbinom(100, 1, 0.2),
v3 = rbinom(100, 1, 0.4))
Now I want to modify this data set so that each row sums to 1.
So this
1 0 0
1 1 0
0 0 1
1 1 1
0 0 0
should become this:
1 0 0
0.5 0.5 0
0 0 1
0.33 0.33 0.33
0 0 0
edit: rows with all zeros should be left as is
As already pointed out by @lmo, the data.frame (or matrix) can be modified with
df <- df / rowSums(df)
In the case of rows containing only zeros this will lead to rows containing only NaN. Since these rows should be kept as they were, the easiest way is probably to correct for this afterwards with
df[is.na(df)] <- 0
Here is a quick method:
# create matrix
temp <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1), ncol=3, byrow=T)
temp / rowSums(temp)
This exploits the fact that matrices are stored column-wise, so the element-by-element division recycles rowSums down each column and every element is divided by the sum of its own row.
If all elements in a row are zero and you don't want NaN (the result of 0 / 0), an alternative to @RHertel's method is the following:
# save rowSum:
mySums <- rowSums(temp)
temp / ifelse(mySums != 0, mySums, 1)
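To check the zero-row handling, here is the same division run on the five example rows from the question, including the all-zero row. The variable names (rows5, sums5, scaled) are chosen for this sketch only.

```r
# The five example rows from the question, last one all zeros
rows5 <- matrix(c(1, 0, 0,
                  1, 1, 0,
                  0, 0, 1,
                  1, 1, 1,
                  0, 0, 0), ncol = 3, byrow = TRUE)

sums5 <- rowSums(rows5)

# Divide by the row sum, but by 1 where the sum is 0 so the row stays all zeros
scaled <- rows5 / ifelse(sums5 != 0, sums5, 1)

rowSums(scaled)  # 1 for every non-zero row, 0 for the all-zero row
anyNA(scaled)    # FALSE: no NaN was produced
```
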

How to utilize recursive functions to help rank matrix rows - R

I would like some advice as to how best solve this puzzle. I have got some of the way to solving it using manually written long-hand code. I feel as if I need to utilize recursive functions, but I am still not very good at using them. I hope this question is not too long, I'm trying to be as succinct as possible whilst giving enough information. Sorry if it's too long - though hopefully somebody finds it of interest.
I have a matrix mat1
# A B C D E F G
# A 0 2 1 1 0 1 1
# B 0 0 0 1 2 2 1
# C 1 2 0 0 0 2 1
# D 1 1 2 0 1 2 1
# E 2 0 2 1 0 2 1
# F 1 0 0 0 0 0 1
# G 1 1 1 1 1 1 0
This represents the results of contests between individuals in rows and columns. Numbers refer to how often the individual in the row 'won' against the individual in the column.
I wish to rank individuals A-G from 1-7 using the following criteria:
number of wins against all others (most wins should be ranked 1, least wins 7, 2nd most wins 2, etc.)
if number of wins are tied, then ranks should be based on the number of wins obtained when considering contests only between those individuals with the same number of wins.
if individuals still have a tied number of wins, then ranks should be applied randomly.
I realize that this is not a very good ranking system, but that's not the issue here. According to the above scheme, ranks should be the following:
1 - D or E - D & E have joint highest overall wins (8), and equal wins also in contests between them.
2 - E or D - pick randomly D or E for rank 1 and rank 2
3 - A or C - tied with A,B,C,G for overall 6 wins, both have 4 wins in contests with ABCG
4 - C or A - considering contests between C&A both have 1 win, so randomly pick for rank3 and rank4
5 - G - tied with A,B,C,G for overall 6 wins, has 3 wins in contests between A,B,C,G
6 - B - tied with A,B,C,G for overall 6 wins, but only has 1 win in contests between A,B,C,G
7 - F - has the fewest wins of all in the overall win matrix
What I have tried:
storeresults <- vector("list") #use this to store results of the following
Step 1: Use winsfun function (see below) to identify number of wins of each individual & whether wins are unique (as noted by dupes column):
w1 <- winsfun(mat1)
storeresults[[1]] <- w1 #store results
w1
Only "F" has a unique number of wins and so can be ranked (7th) in the first instance:
# wins ranks dupes
#A 6 4.5 TRUE
#B 6 4.5 TRUE
#C 6 4.5 TRUE
#D 8 1.5 TRUE
#E 8 1.5 TRUE
#F 2 7.0 FALSE
#G 6 4.5 TRUE
Step 2: For individuals with non-unique wins (i.e. duplicated ranks) subset them into matrices considering only contests against others with the same number of wins, and determine new ranks if possible.
allSame(w1[,3]) #FALSE - this says that not all wins/ranks are unique so need to subset
s2 <- subsetties(w1) #this just splits the data into groups by number of wins (see below)
w2 <- lapply(s2, winsfun, m=mat1)
storeresults[[2]] <- w2 # store results
w2
As can be seen, those individuals with 8 wins (the most of anyone) from Step 1 ("D" and "E") each have one win versus each other. They cannot be teased apart, so will be ranked 1 and 2 randomly. Those individuals with 6 wins (A, B, C, G) have different numbers of wins when only considering contests between each other. "B" and "G" can be ranked 6th overall and 5th overall respectively. We need to reconsider "A" and "C" in contests against only each other:
$`6`
wins ranks dupes
A 4 1.5 TRUE
B 1 4.0 FALSE
C 4 1.5 TRUE
G 3 3.0 FALSE
$`8`
wins ranks dupes
D 1 1.5 TRUE
E 1 1.5 TRUE
Step 3: Repeat Step 2 where required
allSame(w2[[1]][,3]) #FALSE - need to subset again as not everyone has same number of wins
allSame(w2[[2]][,3]) #TRUE - no more action required
s3 <- subsetties(w2[[1]])
w3 <- winsfun(s3[[1]], m=mat1)
storeresults[[3]] <- w3 #store results
w3
When considering "A" and "C" together, they have one win each, so should now be ranked randomly in 3rd and 4th place. They cannot be teased apart.
wins ranks dupes
A 1 1.5 TRUE
C 1 1.5 TRUE
allSame(w3[,3]) #TRUE - no more action required - both have same number of wins
Step 4 Processing Stored Results
storeresults
# I can manually work out ranks from this, but have yet to work out how to do it in R
Below are the functions used in the above:
Function to calculate wins and ranks of subsetted matrices
winsfun <- function(m, out = NULL) {
  # If 'out' is given, only consider contests between the individuals in its rows
  if (!is.null(out)) {
    m <- m[rownames(out), rownames(out)]
  }
  wins <- rowSums(m)                  # number of wins per individual
  ranks <- rank(-wins)                # most wins gets rank 1, ties get averaged ranks
  dupes <- duplicated(wins) | duplicated(wins, fromLast = TRUE)  # TRUE if tied
  data.frame(wins, ranks, dupes)
}
Function to subset those rows with duplicated ranks
subsetties <- function(df){
df1 <- df[df[,3]==T,]
df1.sp <- split(df1, df1$wins)
return(df1.sp)
}
Function to test if all elements of vector are identical
allSame <- function(x) length(unique(x)) == 1
Code to recreate above matrix:
structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
c("A", "B", "C", "D", "E", "F", "G"), c("A", "B", "C", "D",
"E", "F", "G")))
I hope this question is clear. I am trying to work out how to perform this algorithm iteratively. I am not too sure how to achieve this, but hopefully by writing this out long-hand and providing the functions I have been using, it may be obvious to somebody. One extra thing is that it's best to have the proposed solution be generally applicable (i.e. to matrices of different sizes).
calc_gain<-function(mat=mat1){
if(nrow(mat)==1) {
return(row.names(mat))
} else {
classement<-sort(rowSums(mat),decreasing=T)
diffgains<-diff(classement)
if (all(diffgains!=0)){
return(names(classement))
} else {
if (all(diffgains==0)){
return(sample(names(classement)))
} else {
parex<-split(classement,factor(classement,levels=unique(classement)))
class_parex<-lapply(parex,function(vect){calc_gain(mat[names(vect),names(vect),drop=F])})
return(unlist(class_parex))
}
}
}
}
Here is what the function does:
if there is only one element, it returns its name (that "player" is the only one there is)
else, it calculates the scores.
If there are no ties, it returns the "players" in order from first to last
else, if all "players" have the same score, it returns them in random order
else, it splits the ordered list according to the scores and applies the function (that is the recursive part) to each subset of "players" with tied scores.
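As a usage sketch, the function can be run on the matrix from the question and the returned order turned into ranks. The function is repeated (with TRUE/FALSE spelled out) so that the block runs on its own; set.seed(1) is an arbitrary choice to make the random tie-breaks repeatable, and the names ord and ranks are made up for this example.

```r
calc_gain <- function(mat) {
  if (nrow(mat) == 1) return(row.names(mat))
  classement <- sort(rowSums(mat), decreasing = TRUE)
  diffgains <- diff(classement)
  if (all(diffgains != 0)) return(names(classement))          # no ties
  if (all(diffgains == 0)) return(sample(names(classement)))  # all tied
  # Some ties: recurse into each group of equal scores
  parex <- split(classement, factor(classement, levels = unique(classement)))
  unlist(lapply(parex, function(vect) {
    calc_gain(mat[names(vect), names(vect), drop = FALSE])
  }))
}

# Matrix from the question
mat1 <- structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
  2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
  2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
  c("A", "B", "C", "D", "E", "F", "G"),
  c("A", "B", "C", "D", "E", "F", "G")))

set.seed(1)                          # arbitrary seed for the random tie-breaks
ord <- unname(calc_gain(mat1))       # individuals from rank 1 to rank 7
ranks <- setNames(seq_along(ord), ord)
ranks[order(names(ranks))]           # rank of each individual, A to G
```

Whatever the seed, D and E should share ranks 1 and 2, A and C ranks 3 and 4, followed by G, B and F, which matches the ranking described in the question.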
Here's a start (m is your matrix mat1):
Step0:
> split(rownames(m), -rowSums( m ) )
$`-8`
[1] "D" "E"
$`-6`
[1] "A" "B" "C" "G"
$`-2`
[1] "F"
Step1:
m <- m[ order( -rowSums(m) ), ] # order by overall wins
A B C D E F G
D 1 1 2 0 1 2 1
E 2 0 2 1 0 2 1
A 0 2 1 1 0 1 1
B 0 0 0 1 2 2 1
C 1 2 0 0 0 2 1
G 1 1 1 1 1 1 0
F 1 0 0 0 0 0 1
> rowSums( m )
D E A B C G F
8 8 6 6 6 6 2
Step2: Order within the group that has 6 wins
> mred <- m[c("A","B","C","G"), c("A","B","C","G") ]
> mred
A B C G
A 0 2 1 1
B 0 0 0 1
C 1 2 0 1
G 1 1 1 0
> rowSums(mred)
A B C G
4 1 4 3
> rownames(mred)[order(-rowSums(mred))]
[1] "A" "C" "G" "B"
