How to utilize recursive functions to help rank matrix rows - R

I would like some advice as to how best solve this puzzle. I have got some of the way to solving it using manually written long-hand code. I feel as if I need to utilize recursive functions, but I am still not very good at using them. I hope this question is not too long, I'm trying to be as succinct as possible whilst giving enough information. Sorry if it's too long - though hopefully somebody finds it of interest.
I have a matrix mat1
# A B C D E F G
# A 0 2 1 1 0 1 1
# B 0 0 0 1 2 2 1
# C 1 2 0 0 0 2 1
# D 1 1 2 0 1 2 1
# E 2 0 2 1 0 2 1
# F 1 0 0 0 0 0 1
# G 1 1 1 1 1 1 0
This represents the results of contests between individuals in rows and columns. Numbers refer to how often the individual in the row 'won' against the individual in the column.
I wish to rank individuals A-G from 1-7 using the following criteria:
number of wins against all others (most wins should be ranked 1, least wins 7, 2nd most wins 2, etc.)
if number of wins are tied, then ranks should be based on the number of wins obtained when considering contests only between those individuals with the same number of wins.
if individuals still have a tied number of wins, then ranks should be applied randomly.
I realize that this is not a very good ranking system, but that's not the issue here. According to the above scheme, ranks should be the following:
1 - D or E - D & E have joint highest overall wins (8), and equal wins also in contests between them.
2 - E or D - pick randomly D or E for rank 1 and rank 2
3 - A or C - tied with A,B,C,G for overall 6 wins, both have 4 wins in contests with ABCG
4 - C or A - considering contests between C&A both have 1 win, so randomly pick for rank3 and rank4
5 - G - tied with A,B,C,G for overall 6 wins, has 3 wins in contests between A,B,C,G
6 - B - tied with A,B,C,G for overall 6 wins, but only has 1 win in contests between A,B,C,G
7 - F - has the fewest wins of all in the overall win matrix
What I have tried:
storeresults <- vector("list") #use this to store results of the following
Step 1: Use winsfun function (see below) to identify number of wins of each individual & whether wins are unique (as noted by dupes column):
w1 <- winsfun(mat1)
storeresults[[1]] <- w1 #store results
w1
Only "F" has a unique number of wins and so can be ranked (7th) in the first instance:
# wins ranks dupes
#A 6 4.5 TRUE
#B 6 4.5 TRUE
#C 6 4.5 TRUE
#D 8 1.5 TRUE
#E 8 1.5 TRUE
#F 2 7.0 FALSE
#G 6 4.5 TRUE
Step 2: For individuals with non-unique wins (i.e. duplicated ranks) subset them into matrices considering only contests against others with the same number of wins, and determine new ranks if possible.
allSame(w1[,3]) #FALSE - this says that not all wins/ranks are unique so need to subset
s2 <- subsetties(w1) #this just splits the data into groups by number of wins (see below)
w2 <- lapply(s2, winsfun, m=mat1)
storeresults[[2]] <- w2 # store results
w2
As can be seen, those individuals with 8 wins (the most of anyone) from Step1 ("D" and "E") each have one win versus each other. They cannot be teased apart, so will be ranked 1 and 2 randomly. Those individuals with 6 wins (A, B, C, G) have different numbers of wins when only considering contests between each other. "B" and "G" can be ranked 6th overall and 5th overall respectively. We need to reconsider "A" and "C" in contests against only each other:
$`6`
wins ranks dupes
A 4 1.5 TRUE
B 1 4.0 FALSE
C 4 1.5 TRUE
G 3 3.0 FALSE
$`8`
wins ranks dupes
D 1 1.5 TRUE
E 1 1.5 TRUE
Step 3: Repeat Step 2 where required
allSame(w2[[1]][,3]) #FALSE - need to subset again as not everyone has same number of wins
allSame(w2[[2]][,3]) #TRUE - no more action required
s3 <- subsetties(w2[[1]])
w3 <- winsfun(s3[[1]], m=mat1)
storeresults[[3]] <- w3 #store results
w3
When considering "A" and "C" together, they have one win each, so should now be ranked randomly in 3rd and 4th place. They cannot be teased apart.
wins ranks dupes
A 1 1.5 TRUE
C 1 1.5 TRUE
allSame(w3[,3]) #TRUE - no more action required - both have same number of wins
Step 4 Processing Stored Results
storeresults
# I can manually work out ranks from this, but have yet to work out how to do it in R
Below are the functions used in the above:
Function to calculate wins and ranks of subsetted matrices
winsfun <- function(m, out = NULL) {
  # if 'out' is supplied, keep only the contests between the individuals in its rows
  if (!is.null(out)) {
    m <- m[rownames(out), rownames(out), drop = FALSE]
  }
  wins <- rowSums(m)          # total wins per individual
  ranks <- rank(-wins)        # rank 1 = most wins; ties get averaged ranks
  dupes <- duplicated(wins) | duplicated(wins, fromLast = TRUE)  # TRUE if tied
  data.frame(wins, ranks, dupes)
}
Function to subset those rows with duplicated ranks
subsetties <- function(df){
df1 <- df[df[,3]==T,]
df1.sp <- split(df1, df1$wins)
return(df1.sp)
}
Function to test if all elements of vector are identical
allSame <- function(x) length(unique(x)) == 1
Code to recreate above matrix:
mat1 <- structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
c("A", "B", "C", "D", "E", "F", "G"), c("A", "B", "C", "D",
"E", "F", "G")))
I hope this question is clear. I am trying to work out how to perform this algorithm iteratively. I am not too sure how to achieve this, but hopefully by writing this out long-hand and providing the functions I have been using, it may be obvious to somebody. One extra thing is that it's best to have the proposed solution be generally applicable (i.e. to matrices of different sizes).

calc_gain<-function(mat=mat1){
if(nrow(mat)==1) {
return(row.names(mat))
} else {
classement<-sort(rowSums(mat),decreasing=T)
diffgains<-diff(classement)
if (all(diffgains!=0)){
return(names(classement))
} else {
if (all(diffgains==0)){
return(sample(names(classement)))
} else {
parex<-split(classement,factor(classement,levels=unique(classement)))
class_parex<-lapply(parex,function(vect){calc_gain(mat[names(vect),names(vect),drop=F])})
return(unlist(class_parex))
}
}
}
}
Here is what the function does:
If there is only one element, it returns its name (it is the only "player" there is).
Otherwise, it calculates the scores.
If there is no tie, it returns the "players" in order from first to last.
If all "players" have the same score, it returns them in a random order.
Otherwise, it splits the ordered list by score and applies the function (this is the recursive part) to each subset of "players" with tied scores.
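As a usage sketch, the order returned by calc_gain can be turned into ranks 1 to 7. The function is repeated here only so the block runs on its own; the names ord and ranks, and the call to set.seed (used purely to make the random tie-breaks repeatable), are assumptions for illustration.

```r
# Contest matrix from the question (rows = winners, columns = losers)
mat1 <- structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
  2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
  2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
  c("A", "B", "C", "D", "E", "F", "G"),
  c("A", "B", "C", "D", "E", "F", "G")))

# calc_gain() as given in the answer, repeated so this block is self-contained
calc_gain <- function(mat = mat1) {
  if (nrow(mat) == 1) return(row.names(mat))
  classement <- sort(rowSums(mat), decreasing = TRUE)
  diffgains <- diff(classement)
  if (all(diffgains != 0)) return(names(classement))          # no ties
  if (all(diffgains == 0)) return(sample(names(classement)))  # complete tie
  # partial ties: recurse into each group of equal scores
  parex <- split(classement, factor(classement, levels = unique(classement)))
  unlist(lapply(parex, function(vect) {
    calc_gain(mat[names(vect), names(vect), drop = FALSE])
  }))
}

set.seed(42)                          # assumption: only for repeatable tie-breaks
ord <- unname(calc_gain(mat1))        # names in rank order, best first
ranks <- setNames(seq_along(ord), ord)[rownames(mat1)]  # rank of each individual
ranks
```

D and E share ranks 1 and 2, A and C share ranks 3 and 4, and G, B and F always get ranks 5, 6 and 7.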

Here's a start:
Step0:
> split(rownames(m), -rowSums( m ) )
$`-8`
[1] "D" "E"
$`-6`
[1] "A" "B" "C" "G"
$`-2`
[1] "F"
Step1:
m <- m[ order( -rowSums(m) ), ] # order within overall wins
A B C D E F G
D 1 1 2 0 1 2 1
E 2 0 2 1 0 2 1
A 0 2 1 1 0 1 1
B 0 0 0 1 2 2 1
C 1 2 0 0 0 2 1
G 1 1 1 1 1 1 0
F 1 0 0 0 0 0 1
> rowSums( m )
D E A B C G F
8 8 6 6 6 6 2
Step2: Order within group that has 6 overall wins
> mred <- m[c("A","B","C","G"), c("A","B","C","G") ]
> mred
A B C G
A 0 2 1 1
B 0 0 0 1
C 1 2 0 1
G 1 1 1 0
> rowSums(mred)
A B C G
4 1 4 3
> rownames(mred)[order(-rowSums(mred))]
[1] "A" "C" "G" "B"
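The manual steps above could be automated with a plain loop instead of recursion. The following is only a sketch under that assumption: rank_iter, groups and final are hypothetical names, and complete ties are broken with sample() as the question asks.

```r
m <- structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
  2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
  2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
  c("A", "B", "C", "D", "E", "F", "G"),
  c("A", "B", "C", "D", "E", "F", "G")))

rank_iter <- function(m) {
  groups <- list(rownames(m))  # groups of names, always kept in rank order
  final  <- FALSE              # final[i]: is groups[[i]] fully resolved?
  while (!all(final)) {
    i <- which(!final)[1]      # first group that still needs work
    g <- groups[[i]]
    if (length(g) == 1) {      # a single individual needs no tie-break
      final[i] <- TRUE
      next
    }
    wins  <- rowSums(m[g, g, drop = FALSE])     # wins within this group only
    parts <- unname(split(names(wins), -wins))  # sub-groups, most wins first
    if (length(parts) == 1) {
      groups[[i]] <- sample(g)                  # complete tie: random order
      final[i] <- TRUE
    } else {
      # replace the group by its sub-groups, which are re-checked later
      groups <- append(groups, parts, after = i)[-i]
      final  <- append(final, rep(FALSE, length(parts)), after = i)[-i]
    }
  }
  unlist(groups)
}

set.seed(1)            # assumption: only for repeatable tie-breaks
ord <- rank_iter(m)    # names in rank order, best first
ord
```

Every split produces strictly smaller groups, so the loop terminates for a matrix of any size.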

Related

Cluster groups based on pairwise distances

I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:
m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
Now I want to put every letter in the same cluster if the distance to any other letter is 0. For the example above, I should get three clusters consisting of:
(A,B,E)
(C,F)
(D)
I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:
clustersizes = c(3,2,1)
I assume it is possible by using the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but if I don't know the number of clusters before and also not the cutoff for the height, how should I do it?
This is what I tried:
h <- hclust(dist(m),method="single")
plot(h)
Thanks!
Welcome to SO.
There are several ways to handle this but an easy choice is to use the igraph package.
First we convert your matrix m to an adjacency matrix. In an adjacency matrix a 1 marks a connection between two nodes and 0 means no connection, so we subtract your matrix from 1 to get it
mm <- 1 - m
diag(mm) <- 0 # We don't allow loops
This gives
> mm
A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0
Then we just need to feed it to igraph to compute communities
library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))
which produces
IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
$`1`
[1] "A" "B" "E"
$`2`
[1] "C" "F"
$`3`
[1] "D"
Now if you save that result you can get the community sizes right away
res <- fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)
which yields
Community sizes
1 2 3
3 2 1
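For comparison, here is a sketch that stays with hclust from base R, as the question attempted. It assumes that m already holds the distances, so as.dist(m) is used rather than dist(m), and that cutting the single-linkage tree at any height between 0 and 1 (0.5 here) is acceptable, since that joins exactly the letters linked by chains of zero distances. The names h and cl are illustrative.

```r
m <- matrix(c(0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0,
              1, 1, 1, 0, 1, 1,
              1, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0), ncol = 6, byrow = TRUE)
colnames(m) <- rownames(m) <- c("A", "B", "C", "D", "E", "F")

# m is already a distance matrix, so convert it instead of recomputing distances
h <- hclust(as.dist(m), method = "single")

# cut between the two distance levels: only zero-distance links stay joined
cl <- cutree(h, h = 0.5)

split(names(cl), cl)   # members of each cluster
clustersizes <- as.vector(sort(table(cl), decreasing = TRUE))
clustersizes           # 3 2 1
```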

R help - change the maximum value of each row in a certain condition

I am a novice in R. I have a dataframe with columns 1:n. Excluding column 1 and n, I want to change the maximum value of each row if the row has a specific value in a different column AND set the remaining values (excluding column 1 and n) to zero. I have about 300,000 cases and 40 columns in my real data, however, the example below illustrates what I am trying to achieve:
A <- c(1,1,5,5,10)
B <- rnorm(1:5)
C <- rnorm(1:5)
D <- rnorm(1:5)
E <- c(10,15,100,100,100)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
Here, if column A of a row is 1, I want to replace the maximum value among columns B, C and D of that row with the value of column E, and set the other values in columns B, C and D to 0.
So, the result should be like this:
A B C D E
1 1 0 0 10 10
2 1 0 15 0 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
I tried to do this for two days. Thanks.
Try this out and see what happens :)
df <- read.table(text = "A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100", stringsAsFactors = FALSE)
# find the max in columns B,C,D
z <- apply(df[df$A == 1, 2:4], 1, max)
# substitute the maximum value of each row for columns B,C,D where A == 1
# with the value of column E. Assign 0 to the others
y <- ifelse(df[df$A == 1, 2:4] == z, df$E[df$A == 1], 0)
# Change the values in your dataframe
df[df$A == 1, 2:4] <- y
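With around 300,000 rows, a variant that avoids apply may be worth trying. This is a sketch only: it assumes the columns to modify are all columns except the first and last, and it uses max.col with ties.method = "first" to pick one column when a row has several equal maxima. The names inner, rows, vals, top and new are illustrative.

```r
df <- data.frame(
  A = c(1, 1, 5, 5, 10),
  B = c(0.74286670, -0.03352498, -0.17689629, 0.48329153, 0.13117277),
  C = c(0.3222136, 0.5262685, -0.8949740, 1.1574834, -0.2068736),
  D = c(0.9381296, 0.1225731, -1.4376567, -1.1116581, 0.4841806),
  E = c(10, 15, 100, 100, 100)
)

inner <- 2:(ncol(df) - 1)          # every column except the first and the last
rows  <- which(df$A == 1)          # rows that meet the condition
vals  <- as.matrix(df[rows, inner])

# column position of the maximum in each selected row
top <- max.col(vals, ties.method = "first")

# start from zeros, then place E at the position of each row's maximum
new <- matrix(0, nrow = length(rows), ncol = length(inner))
new[cbind(seq_along(rows), top)] <- df$E[rows]

df[rows, inner] <- new
df
```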

How to apply a formula one row at a time in R - row 2's values from calculated values of row 1

I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For eg,
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to create columns 'c' and 'd' as follows. Column 'c' has its R1 value fixed as 5, but from R2 onwards the value of 'c' is calculated as c (from previous row) - b (from previous row). Column 'd' has its R1 value fixed as 10, but from R2 onwards 'd' is calculated as 'c' (from the current row) - d (from previous row).
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
df <- within(df, {
v1 <- shift(c)
v2 <- shift(d)
c <- v1-shift(b)
d <- c-v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Thank you for your help.
df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in (2:nrow(df))) {
df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
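For this simplified example the loop can also be written without explicit indexing. The sketch below assumes only the two rules stated in the question (c[i] = c[i-1] - b[i-1] and d[i] = c[i] - d[i-1]); d_prev and c_i are illustrative names. It may not carry over directly to the real 18-column data.

```r
df <- data.frame(a = 1:10, b = 21:30)
n <- nrow(df)

# c starts at 5 and loses the previous row's b each step,
# so it equals 5 minus the running sum of the earlier b values
df$c <- 5 - c(0, cumsum(df$b[-n]))

# d starts at 10; each new d is the current c minus the previous d.
# Reduce() carries the previous d along and keeps every intermediate value.
df$d <- Reduce(function(d_prev, c_i) c_i - d_prev,
               df$c[-1], init = 10, accumulate = TRUE)

df[1:3, ]
#   a  b   c   d
# 1 1 21   5  10
# 2 2 22 -16 -26
# 3 3 23 -38 -12
```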
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define function to multiply the rows of the dataframes
sumProd <- function(vect1, vect2) {
return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original dataframe
for (i in (2:nrow(df))) {
for(e in 1:nrow(equations)) {
df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
sumProd(equations[e, paste0(c('a','b','c','d'),'_previous')], df[i - 1, c('a','b','c','d')])
}
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
It might not be the most elegant way to do it with a for loop but it works. Your column c sounds like a simple sequence to me.
This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5,5-(dim(df)[1]-1))
# Use for loop to calculate d
for(i in 2:nrow(df))
{
df$d[i] <- df$c[i] - df$d[i-1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10

How to create a running total, according to factor, in R?

I am wishing to create a running scoreline margin for sport data. For example, consider my data as follows:
df <- data.frame(Club = c("O", "H", "H", "O", "H", "O", "O"),
TimeOfScore = c("1:30", "2:06", "7:09", "9:09", "11:08", "14:32", "19:11"),
Points = c(1, 3, 1, 2, 2, 3, 3))
In the above, df$Club == "O" marks a score by the opposition, whilst df$Club == "H" marks a score by the home team. The column df$TimeOfScore represents when the score occurs. I wish to know a running scoreline of how many points the home team is ahead of or behind the opposition.
My anticipated output would be:
df$Margin <- c(-1, 2, 3, 1, 3, 0, -3)
This output is the home team's margin: positive when the home team is ahead of the opposition, negative when it is behind. For example, the opposition team score 1 point at 1:30 (1 minute, 30 seconds) into the game. The corresponding margin at that point in time is -1, or the home team down by one point. In the next occurrence, the home team score 3 points and are then 2 points ahead in the margin.
How would I please go about doing this?
df$Margin = with(df, cumsum(ifelse(Club == "H", Points, -Points)))
# df
# Club Points Margin
# 1 O 1 -1
# 2 H 3 2
# 3 H 1 3
# 4 O 2 1
# 5 H 2 3
# 6 O 3 0
# 7 O 3 -3
In words
You can test if the Club is "H" or "O", which will give a TRUE or FALSE.
You can then use the fact that T == 1 and F == 0 to add 1 to it.
Then you use this result to subset the vector c(-1, 1), then multiply this value by the points.
Then find the cumulative sum, and there's your answer.
In code
df$Margin <- cumsum(c(-1, 1)[(df$Club == "H")+1] * df$Points)
df
# Club Points Margin
# 1 O 1 -1
# 2 H 3 2
# 3 H 1 3
# 4 O 2 1
# 5 H 2 3
# 6 O 3 0
# 7 O 3 -3

Sorting data frame by character string

I have a data frame and need to sort its columns by a character string.
I tried it like this:
# character string
a <- c("B", "E", "A", "D", "C")
# data frame
data <- data.frame(A = c(0, 0, 1), B = c(1, 1, 1), C = c(1, 0, 1), D = c(0, 0, 1), E = c(0, 1, 1))
data
# A B C D E
# 1 0 1 1 0 0
# 2 0 1 0 0 1
# 3 1 1 1 1 1
# sorting
data.sorted <- data[, order(a)]
# order of characters in data
colnames(data.sorted)
# [1] "C" "A" "E" "D" "B"
However, the order of columns in the sorted data frame is not the same as the characters in the original character string.
Is there any way, how to sort it?
The function order(a) returns the position in the vector a that each ranked value lies in. So, since "A" (ranked first) lies in the third position of a, order(a)[1] is equal to 3. Similarly, since "C" (ranked third) lies in the fifth position of a, order(a)[3] equals 5.
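A short sketch with the data from the question shows the difference between indexing by order(a) and indexing by a itself:

```r
a <- c("B", "E", "A", "D", "C")
data <- data.frame(A = c(0, 0, 1), B = c(1, 1, 1), C = c(1, 0, 1),
                   D = c(0, 0, 1), E = c(0, 1, 1))

order(a)      # 3 1 5 4 2: positions in 'a' of "A", "B", "C", "D", "E"
a[order(a)]   # "A" "B" "C" "D" "E"

# order(a) picks columns by position, which is not the order wanted here
colnames(data[, order(a)])   # "C" "A" "E" "D" "B"

# indexing by the names themselves gives the requested column order
colnames(data[, a])          # "B" "E" "A" "D" "C"
```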
Luckily your solution is actually even simpler, thanks to the way R works with brackets. If you ask to see just the column named "B" you'll get:
> data[, "B", drop=FALSE]
B
1 1
2 1
3 1
Or if you want two specific columns
> data[, c("B", "E")]
B E
1 1 0
2 1 1
3 1 1
And finally, more generally, if you have a whole vector by which you want to order your columns, then you can do that, too:
> data.sorted <- data[, a]
> data.sorted
B E A D C
1 1 0 0 0 1
2 1 1 0 0 0
3 1 1 1 1 1
> all(colnames(data.sorted)==a)
[1] TRUE
string[] str = { "H", "G", "F", "D", "S","A" };
Array.Sort(str);
for (int i = 0; i < str.Length; i++)
{
Console.WriteLine(str[i]);
}
Console.ReadLine();
